BCC: Using Virtual NICs in Cluster Scripts may cause delays in processing these scripts.

  • 3735490
  • 11-Oct-2007
  • 07-Jun-2013

Environment

Products:
Novell NetWare 6.5 Support Pack 6
Novell Business Continuity Clustering 1.1

Situation

Purpose:
Raise awareness of a BCC-specific performance issue that may be encountered when Virtual NICs are used in the Cluster Resource Load/Unload scripts.

Symptom:
The customer encountered the issue after configuring Virtual NICs (VNICs) in the Cluster Resource scripts, as opposed to the 'Secondary IP Addresses' usually seen there.
When a Cluster Resource Unload script that referenced a VNIC was run on either side of the cluster, for example to fail over resources from Cluster 1 to Cluster 2, the script took a very long time to process and finish.

Changes:
No changes were made; the problem has been present since implementation.

Resolution

Corrective Steps:

Modified the default Cluster Resource Unload script:
unbind ip VNIC01 address=
CLUSTER CVSBIND DEL BCC1-BAV-DATA01-VS
NUDP DEL BCC1-BAV-DATA01-VS
nss /pooldeactivate=BAV_DATA01 /overridetype=question

to reflect the following, in which the unbind command runs after CLUSTER CVSBIND DEL and NUDP DEL:
CLUSTER CVSBIND DEL BCC1-BAV-DATA01-VS
NUDP DEL BCC1-BAV-DATA01-VS
unbind ip VNIC01 address=
nss /pooldeactivate=BAV_DATA01 /overridetype=question
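
The only difference between the two scripts above is the position of the unbind line. As a minimal sketch, assuming the script text is available as a plain string, the ordering could be checked as follows. The function name unbind_follows_nudp_del is hypothetical and is not part of any Novell tool.

```python
# Hypothetical helper: checks that every 'unbind ip' line in an unload
# script appears after the last 'NUDP DEL' line, as in the modified script.
def unbind_follows_nudp_del(script: str) -> bool:
    # Normalise: drop blank lines, ignore case and surrounding whitespace.
    lines = [ln.strip().lower() for ln in script.splitlines() if ln.strip()]
    nudp = [i for i, ln in enumerate(lines) if ln.startswith("nudp del")]
    unbind = [i for i, ln in enumerate(lines) if ln.startswith("unbind ip")]
    if not nudp or not unbind:
        return True  # nothing to compare
    return min(unbind) > max(nudp)


# The two scripts from this document, copied verbatim.
ORIGINAL = """unbind ip VNIC01 address=
CLUSTER CVSBIND DEL BCC1-BAV-DATA01-VS
NUDP DEL BCC1-BAV-DATA01-VS
nss /pooldeactivate=BAV_DATA01 /overridetype=question"""

MODIFIED = """CLUSTER CVSBIND DEL BCC1-BAV-DATA01-VS
NUDP DEL BCC1-BAV-DATA01-VS
unbind ip VNIC01 address=
nss /pooldeactivate=BAV_DATA01 /overridetype=question"""

if __name__ == "__main__":
    print("original ordering ok:", unbind_follows_nudp_del(ORIGINAL))
    print("modified ordering ok:", unbind_follows_nudp_del(MODIFIED))
```

The check only inspects line order; it does not validate the commands themselves.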

Additional Information

Troubleshooting Steps:
To speed things up, we first tried the 'nudp odel' command described in KB 10086057 (see the snippet below explaining this command). This command has been used in 'normal' cluster environments and also performed better for the customer. However, we preferred to clean up the connections gracefully and therefore chose to modify the Unload script instead.

snip/
Changes were made to NUDP to allow for a more graceful client disconnect. The new way waits until all NCP connections have terminated which can vary in time depending on what the connections are doing (file copy, etc.). Using the ODEL parameter forces the virtual server to be deleted after a fixed amount of time.

The difference between NUDP DEL and NUDP ODEL is that NUDP DEL will wait to be notified by each connection that it has fully closed, whereas NUDP ODEL will simply notify the connections that they need to close and then proceed with the rest of the volume migration. The NUDP DEL command is a little bit cleaner with respect to client connections, allowing the reconnect to happen faster and more efficiently.

The sequence of events during a volume migration should be that the clients are notified that the connection is lost and they simply do a reconnect. File handles are not reconnected since they do not exist on the new server. Any pending file operations would error out and the application should handle that. In the case of a simple file copy, the user should get an error message that the file copy failed so they can restart it.

ODEL says finish no matter what the state of the TCP connection is. There will be no hang because ODEL doesn't wait for the TCP connection to clean up properly - so when ODEL completes, the TCP connection will probably still be in the closeWait state as shown in tcpcon. But ODEL also deletes the secondary IP address, so if the client hadn't finished its handshake it will never be able to. This usually results in the client eventually timing out and resetting the connection sometime later.
/snip
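
The waiting behaviour described in the snippet above can be illustrated with a toy model. This is only a sketch of the description, not of the actual NUDP implementation. The function names, the close times, and the timeout value are all hypothetical.

```python
# Toy model of the two behaviours described in the snippet.
# close_times: seconds each client connection needs before it reports
# that it has fully closed (hypothetical values).

def del_wait(close_times):
    """DEL-style: wait until every connection has reported closed."""
    return max(close_times, default=0.0)


def odel_wait(close_times, fixed_timeout):
    """ODEL-style: proceed after a fixed time, whatever the connection state.

    Returns the time waited and the number of connections that had not
    finished closing by then.
    """
    still_open = [t for t in close_times if t > fixed_timeout]
    return fixed_timeout, len(still_open)


if __name__ == "__main__":
    # One connection is busy (e.g. a file copy) and closes slowly.
    times = [0.5, 2.0, 45.0]
    print("DEL waits:", del_wait(times))
    print("ODEL waits, still open:", odel_wait(times, fixed_timeout=5.0))
```

In this model a single slow connection determines the DEL wait, while the ODEL wait stays fixed and leaves that connection unfinished.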

Change Log

7 Jun 2013 - Removed 'Reported to Engineering' status.