Re-added node won't join the cluster after rolling upgrade procedure

  • Document ID: 7015092
  • Creation Date: 21-May-2014
  • Modified Date: 21-May-2014

Environment

OES 11 SP2 Novell Cluster Services (NCS) cluster

Situation

The customer upgraded an OES 2 SP3 NCS cluster to OES 11 SP2 following the documentation section "Adding OES 11x Nodes to an OES2 SP3 cluster (Rolling Cluster Upgrade)". For each node, this involved removing the node from the cluster and from eDirectory, installing OES 11 SP2 on the same hardware with the same name and IP address, and then attempting to add the node back into the cluster.
Symptom:
The node is not able to join the cluster.
The /var/log/messages file on the node attempting to join reports the error: "Join retry, some other node acquired the cluster lock".
The other nodes in the cluster eventually issue a poison pill to the node.
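You can confirm this symptom by searching the log on the joining node. A minimal check, assuming the default syslog location quoted above:

    # On the node that fails to join, look for the cluster lock error
    grep -i "acquired the cluster lock" /var/log/messages

    # Or watch the log live while attempting the join
    tail -f /var/log/messages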

Resolution

The running cluster nodes retain residual information about which node ID this node name and IP address were previously assigned to. To fix this, modify the "NCS:GIPC Node Number" attribute of the new Cluster Server object in eDirectory so that it matches the node's previous node ID. The previous node ID can be found with the "cluster stats display" command on the master node, or by viewing the /var/opt/novell/ncs/gipc.conf file. Rebooting all of the nodes would also clear this stale state, but that is often not an acceptable option for the customer.
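The following sketch shows both lookups, plus one possible way to change the attribute from the command line. The ldapmodify example is an illustration only: the LDAP attribute name nCSGIPCNodeNumber, the server address, and the DNs are assumptions you must adapt to your tree (the eDirectory attribute name is "NCS:GIPC Node Number"); editing the attribute through iManager works equally well.

    # On the master node: display current membership and node numbers
    cluster stats display

    # Or inspect the GIPC configuration file directly
    cat /var/opt/novell/ncs/gipc.conf

    # Illustration only: set the attribute over LDAP. The attribute name,
    # server, and DNs below are placeholders; adjust them for your tree.
    ldapmodify -x -H ldaps://ldap.example.com -D "cn=admin,o=org" -W <<'EOF'
    dn: cn=node3,cn=cluster1,o=org
    changetype: modify
    replace: nCSGIPCNodeNumber
    nCSGIPCNodeNumber: 3
    EOF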

After modifying this attribute, execute the "ncs-configd.py -init" command on each node so that the updated information is synchronized down to the cluster nodes; the node should then be able to join. Because of the previous join attempts, you will likely see a duplicate of this node in iManager, which may interfere with the migration of resources to this node. You can clear this by having the master node leave the cluster, which forces the master IP address resource to migrate to another node.
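Sketched as commands, with the caveat that the full path to ncs-configd.py is an assumption (the NCS scripts typically live under /opt/novell/ncs/bin on OES):

    # On each cluster node: push the updated eDirectory data down to NCS
    /opt/novell/ncs/bin/ncs-configd.py -init

    # On the rebuilt node: attempt the join again
    cluster join

    # If a duplicate node entry lingers in iManager, force the master IP
    # address resource to move by having the current master leave
    cluster leave     # run on the master node
    cluster join      # have it rejoin afterwards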

To prevent this problem from occurring in the first place, adjust the procedure as follows:

1) Before starting the installation, note the node number currently assigned to the cluster node that is to be rebuilt.

2) During re-installation of the cluster node server, make sure to deselect the box labeled "Start Cluster Service now" in step 2d of the installation instructions for "Adding OES 11x Nodes to an OES2 SP3 cluster (Rolling Cluster Upgrade)".

3) After completing step 2 of the rolling cluster upgrade instructions, a Cluster Server object exists for the node, but no join has yet been attempted, so there is no duplicate node in memory. At this point, modify the "NCS:GIPC Node Number" attribute of the new Cluster Server object back to the original node ID (see the command sketch after these steps), and then continue with the installation as outlined in the documentation.
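For steps 1 and 3, a minimal sketch using the commands already shown above:

    # Step 1: before removing the node, record its current node number
    cluster stats display                 # run on the master node
    cat /var/opt/novell/ncs/gipc.conf     # or read it from the GIPC config

After the re-installation creates the Cluster Server object, and before the first join attempt, set "NCS:GIPC Node Number" back to the recorded value using the same ldapmodify or iManager approach shown in the Resolution section, then continue with the documented procedure.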

Cause

The installation script for adding a node to an existing cluster searches for the first available empty node ID, or slot, in which to insert the new node. Because this cluster had been through several upgrades, node ID "0" was empty, so the installation used that ID, unaware that this node name and IP address had formerly been assigned node ID "3".