Cluster resource application becomes non-reachable/non-responsive to clients

  • 7001812
  • 06-Nov-2008
  • 27-Apr-2012

Environment

ALL

Situation

Cluster resource stops servicing client requests
 
From client workstations:
Cluster resource secondary IP address can be "pinged"
Cannot telnet to the port being listened on by the application resource "owning" the secondary IP address
Can telnet, ssh, rdesktop, rconj to the primary IP address(es) registered on the host cluster node
 
From server console:
Cluster resource secondary IP address can be "pinged"
Linux: netstat -na|grep LISTEN|grep <port number> shows that the port is open and available for conversations
NetWare: TCPCON, Protocols, TCP... shows the port as being open and available for conversations
*CAN* telnet to the port being listened on by the application "owning" the secondary IP address
*CAN* telnet, ssh, rdesktop, rconj to the primary IP address(es) registered on the host cluster node
*CAN* run, on the physical server itself, the application that would normally run from a client workstation, and it works.
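The telnet-based port checks above can also be scripted so the same test is run identically from the client workstation and from the server console. A minimal sketch using only the Python standard library (the host and port values shown are placeholders, not taken from this article):

```python
import socket

def port_reachable(host, port, timeout=3.0):
    """Attempt a TCP connection, mirroring 'telnet <host> <port>'.

    Returns True if the three-way handshake completes, False on
    timeout or refusal. In the symptom described here, this returns
    False from client workstations but True from the server console.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical example: the cluster resource's secondary IP and the
# application's listening port would go here.
# port_reachable("10.0.0.50", 1677)
```

Running this from both vantage points and comparing results confirms whether the failure is network-path related rather than an application fault.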
 
Traceroute from workstations shows a path that goes to your data center routers and switches as expected.
 
A sniffer trace of the switch port that the server is plugged into shows NO traffic destined for the server on the ports that the client application would be using. Further, a trace taken on the server (Linux: tcpdump -s 0 -w <mycapture.cap>; NetWare: pktscan) shows no traffic destined for the application's port, but DOES show other traffic destined for the server.
 
After you first clear your workstation's ARP table and then ping the secondary IP address, the ARP table shows a MAC address that differs from the MAC address of the server.
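The ARP comparison above can be automated by parsing the workstation's ARP output and checking the MAC recorded for the secondary IP against the server's real NIC address. A minimal sketch, assuming Linux-style `arp -a` output (the regex and sample format are assumptions; exact output varies by platform):

```python
import re

def mac_for_ip(arp_output, ip):
    """Extract the MAC associated with `ip` from `arp -a`-style text,
    e.g. 'server1 (10.0.0.50) at 00:11:22:33:44:55 [ether] on eth0'."""
    pattern = re.compile(
        r"\(" + re.escape(ip) + r"\)\s+at\s+([0-9a-fA-F:]{17})"
    )
    m = pattern.search(arp_output)
    return m.group(1).lower() if m else None

def is_imposter(arp_output, ip, server_mac):
    """True when the ARP table maps `ip` to a MAC other than the
    cluster node's known MAC -- the signature of a duplicate IP."""
    seen = mac_for_ip(arp_output, ip)
    return seen is not None and seen != server_mac.lower()
```

Feeding this the output of `arp -a` (taken right after clearing the table and pinging the secondary address) flags the duplicate host without manual MAC comparison.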
 
Sniffer traces taken at the client show the target IP address and port as the destination; however, the responding host reports the port as invalid, and the trace shows a MAC address different from the target server's.

Resolution

In this scenario, another device on the network assumed the secondary IP address reserved for the cluster node resource.  Every time the server with the duplicate IP address asserted itself onto the network, the routers would dutifully change their ARP table entries.  But because the "imposter" wasn't running the application that would normally be listening on the designated ports, any client attempting the conversation would time out.
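One way to catch this duplicate-IP "tug of war" in progress is to poll the ARP entry for the secondary address over time and flag any point at which the MAC changes. A minimal sketch of the change-detection step (how the samples are collected, e.g. by periodically parsing `arp -a`, is left as an assumption):

```python
def detect_flap(samples):
    """Given successive (timestamp, mac) samples for one IP address,
    return the timestamps at which the MAC changed -- each change
    marks a moment one of the two hosts reasserted the address."""
    changes = []
    last = None
    for ts, mac in samples:
        mac = mac.lower()
        if last is not None and mac != last:
            changes.append(ts)
        last = mac
    return changes
```

An empty result means the address was stable over the sampling window; any entries pinpoint when the imposter (or the legitimate cluster node) took the address back, which helps correlate the outage with the offending device's activity.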

Additional Information

The issue originally reported was that a GroupWise resource still showed as "running" when viewed through iManager or a Cluster Status command, yet no client workstations could connect.  An observation shared during troubleshooting was that when the resource was moved from one node to another, connectivity was instantly reestablished for the client workstations.  The timing of the event appeared sporadic and didn't follow any pattern at all.
 
The battle between the two servers trying to assert ownership of the IP address went something like this...  The secondary IP address was registered by the cluster resource and the ARP table was updated; hours later, another server that was having "connectivity issues" was restarted, and the cluster resource became non-responsive...  The resource was moved between nodes, the ARP table was updated, hours later...  repeat.