NSS Cluster Volumes hang when they get full on OES Linux.

  • 7008628
  • 23-May-2011
  • 27-Apr-2012

Environment

Novell Linux
Novell Cluster Services 1.8
Novell Open Enterprise Server 1 (OES 1) Support Pack 2 Linux

Situation

NSS cluster volumes hang when they get full on OES Linux.
The server hosting the full volume hangs and responds only to pings.
All resources on the hanging server go comatose.
Once the server is in this state, the only way to get the resources back online is to reboot the problem server.

Resolution

This problem has been reported to engineering.
Workaround 1 is to disable salvage on the volume. Keep in mind that you will not be able to recover any deleted files if you do this.
Workaround 2 is to set quotas so that the volume never fills up. In my testing, I allowed it to fill up to 75% of the volume. This workaround allows you to keep salvage enabled.
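
As a complement to Workaround 2, the 75% limit can also be watched from a script. The following is a minimal sketch, not a Novell tool, and it does not set NSS quotas itself. It assumes Python 3.3 or later is available on the host that runs it, and that the figures returned for the mount point reflect the volume's real usage (salvaged files may be counted differently by NSS). The names FILL_LIMIT, fill_fraction, over_limit and check_mount are hypothetical.

```python
import shutil

# Limit taken from the workaround above (75% of the volume).
FILL_LIMIT = 0.75


def fill_fraction(total_bytes, used_bytes):
    """Return the used share of a volume as a value between 0 and 1."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    return used_bytes / total_bytes


def over_limit(total_bytes, used_bytes, limit=FILL_LIMIT):
    """Return True when usage has reached or passed the limit."""
    return fill_fraction(total_bytes, used_bytes) >= limit


def check_mount(path, limit=FILL_LIMIT):
    """Check a mounted volume; path is whatever mount point the caller passes in."""
    usage = shutil.disk_usage(path)  # named tuple: total, used, free (bytes)
    return over_limit(usage.total, usage.used, limit)
```

Calling check_mount with the mount point of the NSS volume returns True once the volume is at or above the limit, which could be used to raise an alert before the volume fills.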

Additional Information

Test scenario to duplicate the problem:

I created a 2 GB pool, MYPOOL, and cluster-enabled it.
I filled it with a 300 MB directory called TEST1 and then copied TEST1 to
TEST2, TEST3, TEST4, TEST5, and TEST6.
The first time, the copy stopped and reported that the volume was full, and
all was okay on the server.
I then deleted one of the directories, TEST3, and copied TEST1 to TEST3 again.
The server then hung (no switching screens, no SSH, no keyboard input) part way
through the copy.
The cluster resources on the hanging server all went comatose, but they
still responded to pings on their resource IP addresses.
If you look at the other server in the cluster, you see that it issued a poison
pill to the hanging server, but the hanging server never reboots.
All of the resources located on the hung server stay in a comatose state
until the hanging server is rebooted and the IP addresses are released.
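
The copy, delete, and re-copy steps above can be sketched as a small script. This is only an illustration of the sequence, assuming Python 3.6 or later and a directory named TEST1 already present under the path passed in. The function names copy_dir and run_scenario are hypothetical. Because the last step is the one that hung the server in the test, run it only against a test cluster.

```python
import shutil
from pathlib import Path


def copy_dir(src, dst):
    """Copy a directory tree; return True on success, False if the copy failed."""
    try:
        shutil.copytree(src, dst)
        return True
    except OSError as exc:  # shutil.Error is a subclass of OSError
        print(f"copy to {dst} stopped: {exc}")
        return False


def run_scenario(root, seed="TEST1",
                 copies=("TEST2", "TEST3", "TEST4", "TEST5", "TEST6"),
                 recopy="TEST3"):
    """Repeat the test steps under root and report which copies completed."""
    root = Path(root)
    results = {}
    # Step 1: copy the seed directory until the volume reports that it is full.
    for name in copies:
        results[name] = copy_dir(root / seed, root / name)
    # Step 2: delete one directory, then copy the seed to that name again.
    # In the test described above, the hang occurred during this second copy.
    shutil.rmtree(root / recopy, ignore_errors=True)
    results[recopy + " (recopy)"] = copy_dir(root / seed, root / recopy)
    return results
```

On a volume with enough free space, every entry in the returned dictionary is True; on a nearly full volume, the failed copies show up as False.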

Salvage was turned on for this test.

After the reboot, the server came up fine with about 30 MB of free space on
MYPOOL.

Fixed in OES2 and later.


Formerly known as TID# 10100732