Cluster resource goes comatose with NSS error 20892

  • 7005375
  • 22-Feb-2010
  • 08-Nov-2012

Environment

Novell Open Enterprise Server 2 (OES 2) Linux Support Pack 1
Novell Open Enterprise Server 2 (OES 2) Linux Support Pack 2

Servers are running Novell Cluster Services (NCS)

Situation

When migrating a resource to another node, the resource goes comatose.

When loading a resource, the resource goes comatose.

The end of /var/opt/novell/log/ncs/resource.load.out shows the following error:
Thu Nov 12 21:56:58 CET 2009
++ IP_ADDR=/etc/ha.d/resource.d/IPaddr2
++ FILE_SYSTEM=/etc/ha.d/resource.d/Filesystem
++ OCF_DIR=/usr/lib/ocf/resource.d/heartbeat
++ PATH=/bin:/sbin:/usr/bin:/usr/sbin:/opt/novell/afptcpd/bin/:/opt/novell/bin
+ exit_on_error nss /poolact=G013
+ nss /poolact=G013
Error 20892
Pool state was not changed successfully
+ rc=28
+ '[' '!' 28 -eq 0 ']'
+ exit 28
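
The last lines of the trace suggest how the load script's exit_on_error wrapper behaves: it runs the command, stores the return code, and exits with that code when it is not 0. The following is a minimal sketch reconstructed from the trace only, not the shipped NCS function library. The ignore_error function is likewise inferred from the unload trace under Additional Information, where each wrapped command is followed by "return 0".

```shell
# Sketch inferred from the xtrace output; the real NCS definitions may differ.

# Run a command and abort the whole script with its return code on failure.
exit_on_error() {
    "$@"
    rc=$?
    if [ ! $rc -eq 0 ]; then
        exit $rc
    fi
}

# Run a command and always report success to the caller.
ignore_error() {
    "$@"
    return 0
}
```

Under this reading, a failing "nss /poolact" stops the load script immediately (here with return code 28), while a failing command in the unload script is passed over without stopping it.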

Resolution

This is fixed in the novell-ncpserv package dated January 29, 2010 or later.
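
One possible way to check an installed package against that date is sketched below. It assumes the date refers to the package build time, which this article does not state, so treat the result as a hint rather than proof. The helper name build_is_fixed and the variable FIX_EPOCH are hypothetical.

```shell
# 29 January 2010 00:00 UTC, in seconds since the epoch.
FIX_EPOCH=1264723200

# Hypothetical helper: succeeds when the given build time (epoch seconds)
# is on or after the date above.
build_is_fixed() {
    [ "$1" -ge "$FIX_EPOCH" ]
}

# Example use on a server (assumes rpm reports the build time as epoch seconds):
#   build_is_fixed "$(rpm -q --queryformat '%{BUILDTIME}' novell-ncpserv)" \
#       && echo "novell-ncpserv is new enough" \
#       || echo "novell-ncpserv predates the fix"
```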

Additional Information

The problem occurs because ncpcon is unable to dismount the volume. In /var/opt/novell/log/ncs/resource.unload.out, the unload looks like this:
CRM: Wed Feb 17 16:29:18 2010
++ IP_ADDR=/etc/ha.d/resource.d/IPaddr2
++ FILE_SYSTEM=/etc/ha.d/resource.d/Filesystem
++ OCF_DIR=/usr/lib/ocf/resource.d/heartbeat
++ PATH=/bin:/sbin:/usr/bin:/usr/sbin:/opt/novell/afptcpd/bin/:/opt/novell/bin
+ ignore_error cluster_afp.sh del BCC1-G013-V 10.11.51.163
+ cluster_afp.sh del BCC1-G013-V 10.11.51.163
+ return 0
+ ignore_error ncpcon unbind --ncpservername=BCC1-G013-V --ipaddress=10.11.51.163
+ ncpcon unbind --ncpservername=BCC1-G013-V --ipaddress=10.11.51.163
... Executing " unbind"

... completed OK [elapsed time = 2 Seconds 126 msecs 263 usecs]
+ return 0
+ ignore_error ncpcon dismount G013
+ ncpcon dismount G013
... Executing " dismount G013"

Because the volume is not dismounted, the NSS pool is not deactivated, and the next attempt to activate the pool fails with NSS error 20892. The root cause is that the ncpcon bind command defined in the load script sometimes ignores the provided device ID. This is visible in the output of: ncpcon volume volumename
The ID displayed should match the one in the load script, and the status should show "cluster resource". When either is incorrect, every unload of the resource will fail.
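
The status part of this check could be scripted roughly as follows. This is a sketch that only searches the command output for the text "cluster resource". The exact output layout of ncpcon is not shown in this article, so the ID still has to be compared with the load script by hand. The helper name shows_cluster_resource is hypothetical.

```shell
# Hypothetical helper: reads the output of "ncpcon volume volumename" on
# stdin and succeeds only if the text "cluster resource" appears in it
# (case-insensitive match).
shows_cluster_resource() {
    grep -qi 'cluster resource'
}

# Example use on a cluster node (replace volumename with the real volume):
#   ncpcon volume volumename | shows_cluster_resource \
#       || echo "volume is not flagged as a cluster resource"
```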