Cluster resources go comatose because ncpcon bind statement fails

  • 7008963
  • 08-Jul-2011
  • 27-Apr-2012

Environment

Novell Cluster Services
Novell Open Enterprise Server 2 (OES 2) Linux Support Pack 2
Novell Open Enterprise Server 2 (OES 2) Linux Support Pack 3

Situation

Cluster resources will go comatose because the ncpcon bind command fails to execute.

The /var/opt/novell/log/ncs/resourcename.load.out file will contain an error similar to this:
+    ncpcon bind --ncpservername=RESOURCE --ipaddress=10.10.10.10
... Executing " bind"
... FAILED completion [elapsed time = 20 Seconds 142 msecs 359 usecs]
+ rc=1
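
To check many resources for this symptom, the load output can be scanned for an ncpcon bind command that is followed by a FAILED completion line. The following is a minimal sketch that assumes the log looks like the excerpt above; the function name find_failed_binds is hypothetical and not part of any Novell tool.

```python
import re

# Matches a traced shell line such as "+    ncpcon bind --ncpservername=...".
BIND_RE = re.compile(r"^\+\s*(ncpcon bind\b.*)$")


def find_failed_binds(load_out_text):
    """Return the ncpcon bind commands that were followed by a FAILED completion."""
    failures = []
    pending = None  # most recent bind command still waiting for a result
    for raw in load_out_text.splitlines():
        line = raw.strip()
        match = BIND_RE.match(line)
        if match:
            pending = match.group(1)
        elif line.startswith("+"):
            # Another traced command started, so the earlier bind did not fail.
            pending = None
        elif pending and "FAILED completion" in line:
            failures.append(pending)
            pending = None
    return failures
```

Passing the contents of a resource's load.out file to find_failed_binds returns a list of the bind commands that failed, or an empty list if none did.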

If the log level for /var/opt/novell/log/ncpserv.log is set to debug, you may also see the following errors in that log:
AdvertiseVirtualServer: AdvertiseThruSLP retry count=1
AdvertiseVirtualServer: AdvertiseThruSLP retry count=2
AdvertiseVirtualServer: AdvertiseThruSLP retry count=3
AdvertiseVirtualServer: AdvertiseThruSLP retry count=4
AdvertiseVirtualServer: AdvertiseThruSLP retry count=5
AdvertiseVirtualServer: AdvertiseThruSLP retry failed rc=-255
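
The retry sequence above can also be detected programmatically. This sketch assumes the debug messages have exactly the wording shown in the excerpt; summarize_slp_retries is a hypothetical helper name.

```python
import re

RETRY_RE = re.compile(r"AdvertiseVirtualServer: AdvertiseThruSLP retry count=(\d+)")
FAILED_RE = re.compile(r"AdvertiseVirtualServer: AdvertiseThruSLP retry failed rc=(-?\d+)")


def summarize_slp_retries(log_text):
    """Return (highest retry count seen, list of final failure return codes)."""
    max_retry = 0
    failure_codes = []
    for line in log_text.splitlines():
        retry = RETRY_RE.search(line)
        if retry:
            max_retry = max(max_retry, int(retry.group(1)))
            continue
        failed = FAILED_RE.search(line)
        if failed:
            failure_codes.append(int(failed.group(1)))
    return max_retry, failure_codes
```

A non-empty list of failure codes indicates that the SLP advertisement gave up after its retries, which matches the symptom described in this document.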




Resolution

This issue is fixed in OpenSLP version 1.2.0-22.36.4 or later, which is available in the SLES patch channel.
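
To confirm whether a server already has the fix, the installed version-release string can be compared with 1.2.0-22.36.4. The sketch below is a simplified numeric comparison, not the full RPM version-comparison algorithm, and it assumes purely numeric, dot-separated version and release fields. The input string could come from a query such as rpm -q --queryformat '%{VERSION}-%{RELEASE}' openslp, where the package name openslp is an assumption; has_fix and version_key are hypothetical names.

```python
FIXED_VERSION = "1.2.0-22.36.4"  # first fixed OpenSLP version per this document


def _to_ints(text):
    """Convert '22.36.4' into (22, 36, 4); an empty string becomes ()."""
    return tuple(int(part) for part in text.split(".")) if text else ()


def version_key(version_release):
    """Split 'version-release' into two integer tuples for comparison."""
    version, _, release = version_release.partition("-")
    return _to_ints(version), _to_ints(release)


def has_fix(installed):
    """Return True if the installed version-release is at or above the fix."""
    return version_key(installed) >= version_key(FIXED_VERSION)
```

A version string with non-numeric parts raises ValueError in this sketch, so such values need to be checked by hand.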

Additional Information

A problem was discovered in the way slpd closed sockets. It caused ndsd to reuse a socket that was no longer valid, so attempts to advertise the resource through SLP failed. Because SLP is required for the ncpcon bind and unbind commands, a failed SLP advertisement causes ncpcon to retry the bind 5 times and then time out, which makes the resource go comatose.