How To Recover From a Failed Cluster Pool

  • 7000862
  • 03-Mar-2009
  • 27-Apr-2012

Environment

Novell Cluster Services 1.8.3
Novell Cluster Services 1.8.4
Novell NetWare 6.5
Novell Open Enterprise Server (NetWare based)

Situation

When a cluster-enabled NSS pool fails and must be re-created, a series of steps is required to bring the new pool back into the cluster under the same resource name.

Resolution

1.  From within the cluster container, delete the cluster resource object.  This should automatically delete the associated virtual server object and cluster volume object from the cluster's parent container.
2.  From the cluster's parent container, verify that the virtual server object and cluster volume object(s) have been deleted.  Then delete the volume object(s) and pool object that belong to the failed pool.  After these objects have been deleted, it is important to allow eDirectory time to replicate these changes to all of the servers in the replica ring.  This should not take long unless a problem with one or more of the replica holders is preventing replication.
3.  If you want to create the new resource with the old name, you will need to restart the entire cluster.  This is necessary because the old resource name is held in a cache that is shared by all nodes in the cluster.  This cache is not released until all of the clustering modules are unloaded on all nodes in the cluster.
4.  Start the cluster modules on one of the cluster nodes.
5.  With the cluster modules loaded, create the new pool with the original name, supplying the original IP address, virtual server name, and other settings of the associated cluster resource.  Specify that you want to bring the resource online upon pool creation.
6.  Create the volume(s) with the original names.
7.  Bring up the other nodes in the cluster, and test the migration of the newly created resource.
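
The reasoning behind step 3 can be illustrated with a small model.  The sketch below is not Novell code and does not talk to a cluster; it is a simplified simulation, and the function name, node names, and event format are all hypothetical.  It assumes only what step 3 states: the cached resource name is released once no node has the clustering modules loaded.  Under that assumption, restarting nodes one at a time would not free the old name, while taking every node down first would.

```python
def cache_released(initial_nodes, events):
    """Return True if at some point no node had cluster modules loaded.

    initial_nodes: names of nodes that start with cluster modules loaded.
    events: ordered (node, action) pairs, where action is "load" or "unload".
    This is an illustrative model only, not an interface to Cluster Services.
    """
    loaded = set(initial_nodes)
    if not loaded:
        return True
    for node, action in events:
        if action == "unload":
            loaded.discard(node)
        elif action == "load":
            loaded.add(node)
        else:
            raise ValueError(f"unknown action: {action!r}")
        # The shared cache is only released once every node is unloaded.
        if not loaded:
            return True
    return False


nodes = ["node1", "node2", "node3"]  # hypothetical node names

# Nodes restarted one at a time: some node always has the modules loaded.
rolling_restart = [
    ("node1", "unload"), ("node1", "load"),
    ("node2", "unload"), ("node2", "load"),
    ("node3", "unload"), ("node3", "load"),
]

# Entire cluster taken down, then one node started (steps 3 and 4).
full_restart = [
    ("node1", "unload"), ("node2", "unload"), ("node3", "unload"),
    ("node1", "load"),
]

print(cache_released(nodes, rolling_restart))  # False
print(cache_released(nodes, full_restart))     # True
```

In this model, only the second sequence reaches a point where the old resource name could be reused, which matches the order of steps 3 and 4 above.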

Additional Information

There are other scenarios that may require a modification of the above steps.  For example, if the pool and volumes are healthy, but the volume resource object is accidentally deleted, you will need to perform steps 2 and 3, re-create the pool and volume eDirectory objects (see KB 10099908 for an example of how to do this using ConsoleOne), perform step 4, and finally create the cluster volume resource.
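
To make the ordering of the modified procedure easier to follow, the sketch below lays it out as a checklist.  It is only an illustration: the step summaries are paraphrased from the Resolution section, and the names MAIN_STEPS and deleted_resource_object_plan are hypothetical.

```python
# Paraphrased summaries of the numbered steps in the Resolution section.
MAIN_STEPS = {
    1: "Delete the cluster resource object from the cluster container",
    2: "Verify and delete leftover objects, then wait for eDirectory replication",
    3: "Restart the entire cluster (unload cluster modules on all nodes)",
    4: "Start the cluster modules on one cluster node",
    5: "Create the new pool with the original name and resource settings",
    6: "Create the volume(s) with the original names",
    7: "Bring up the other nodes and test migration of the resource",
}


def deleted_resource_object_plan():
    """Ordered checklist for a healthy pool whose resource object was deleted."""
    return [
        MAIN_STEPS[2],
        MAIN_STEPS[3],
        "Re-create the pool and volume eDirectory objects (see KB 10099908)",
        MAIN_STEPS[4],
        "Create the cluster volume resource",
    ]


for number, step in enumerate(deleted_resource_object_plan(), start=1):
    print(f"{number}. {step}")
```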