Access Manager "There exists a configured cluster member that is not active" error

  • 7000836
  • 03-Jul-2008
  • 26-Apr-2012

Environment

Novell Access Management 3 Linux Access Gateway
Novell Access Management 3 Linux Novell Identity Server
Novell Access Manager 3 Support Pack 3

Situation

A load balancer is set up to balance the traffic between four Novell Identity (IDP) Servers.
The 4 IDP servers are all part of the same cluster configuration - 2 of these IDP servers
lie in one subnet, whilst the other two sit on another. A firewall separates the
two subnets.

As soon as the 4 IDP servers are brought up, LAN traces show that all 4 devices start to
communicate just fine on TCP 7801. However, within seconds the Administration Console shows
the IDP healthcheck reporting the following warning message:

There exists a configured cluster member that is not active!
Expected cluster members 124.33.32.99 124.33.32.100 124.33.32.35
124.33.32.36.
Activecluster members: 124.33.32.35.
(Required Action) Start the member that is not active in the cluster
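
When several nodes are involved, it can help to extract the inactive members from the warning text rather than compare the two lists by eye. The following is a minimal sketch only: the function name inactive_members is hypothetical, and it assumes the message keeps the "Expected cluster members" and "Activecluster members:" labels shown above.

```python
import re

# Matches dotted-quad IPv4 addresses; a trailing full stop is not captured.
IP_RE = re.compile(r"(?:\d{1,3}\.){3}\d{1,3}")


def inactive_members(message):
    """Return expected cluster members that are not listed as active.

    Assumes the health message layout quoted in this document.
    """
    # Tolerate both "Activecluster members:" and "Active cluster members:".
    parts = re.split(r"Active\s*cluster members:", message, maxsplit=1)
    if len(parts) != 2:
        raise ValueError("message does not contain an active members list")
    expected_text, active_text = parts
    # Ignore anything that precedes the expected members label.
    expected_text = expected_text.split("Expected cluster members", 1)[-1]
    expected = IP_RE.findall(expected_text)
    active = set(IP_RE.findall(active_text))
    return [ip for ip in expected if ip not in active]


if __name__ == "__main__":
    warning = (
        "There exists a configured cluster member that is not active!\n"
        "Expected cluster members 124.33.32.99 124.33.32.100 124.33.32.35\n"
        "124.33.32.36.\n"
        "Activecluster members: 124.33.32.35.\n"
        "(Required Action) Start the member that is not active in the cluster\n"
    )
    for ip in inactive_members(warning):
        print(ip)
```

A node that reports only itself as active, as in the message above, is a hint that it cannot complete discovery with any of its peers.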

When we remove from the cluster two devices that sit on the same subnet, the cluster
health check reports a green, working state.

Another symptom that becomes visible when this warning occurs is that new TCP listeners get spawned
for the cluster communication. A netstat -ntp output shows that the devices will no longer just listen
on TCP 7801 for the cluster traffic but also on 7802 and 7803 (depending on the number of nodes in the
cluster).

This problem will also be visible on Linux Access Gateways that are clustered together, as they also
use the jgroups clustering protocol.

Resolution

Make sure that the firewall allows all TCP traffic on ports 7801 to [7801+n-1], where n is the number of devices in the cluster.

If the firewall configuration currently in place only allows 7801 through, make sure that requests through the firewall destined for 7802 to [7802+n-2] are not silently discarded, but rejected with a TCP RST. For example, assuming that iptables is used as the firewall, the following syntax needs to be applied:

"iptables -A INPUT -p tcp --dport 7802:7804 -j REJECT --reject-with tcp-reset"

and not just

"iptables -A INPUT -p tcp --dport 7802:7804 -j REJECT"

jgroups requires that unicasts from all cluster members destined for ports 7802 to [7802+n-2] are answered with a TCP RST. If they are not, the device sending the TCP SYN requests to these higher ports spawns additional listeners, so that it no longer listens on TCP 7801 alone but also on 7802, 7803, etc., depending on the number of devices in the cluster.
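
The iptables example above covers a four-node cluster (7802:7804). As a rough sketch of how the range scales with cluster size, the helper below derives the ports and the matching rule from n. The function names cluster_ports and reject_rule are hypothetical; the base port 7801 and the iptables options are the ones given in this document.

```python
BASE_PORT = 7801  # first cluster port, as stated in this document


def cluster_ports(n, base=BASE_PORT):
    """Return the ports base .. base+n-1 used by an n-node cluster."""
    if n < 1:
        raise ValueError("a cluster needs at least one node")
    return list(range(base, base + n))


def reject_rule(n, base=BASE_PORT):
    """Build the iptables rule that answers the higher ports with a TCP RST.

    Returns None for a single node, where there are no higher ports.
    """
    higher = cluster_ports(n, base)[1:]  # base+1 .. base+n-1
    if not higher:
        return None
    if len(higher) == 1:
        spec = str(higher[0])
    else:
        spec = f"{higher[0]}:{higher[-1]}"
    return (
        f"iptables -A INPUT -p tcp --dport {spec} "
        "-j REJECT --reject-with tcp-reset"
    )


if __name__ == "__main__":
    print(cluster_ports(4))
    print(reject_rule(4))
```

The sketch only prints the rule. Whether INPUT is the right chain depends on where the firewall sits, so review the generated rule before applying it.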

Additional Information

Although LAN traces show the devices communicating on TCP 7801, the local catalina.out
logs show that the dmessagebus PING requests never get answered by the other devices.

When all devices are on the same subnet, everything works as expected.

The issue is that, even though the devices communicate on 7801, the
cluster clients always try to establish TCP connections on 7802 and 7803 as well.
Under normal circumstances, no listener exists on the remote host for these TCP
ports and the connections get reset.

When going through the firewall, the requests are dropped silently, i.e. we never
get a TCP RST back. The end result is that the PING exchanges never register
the other devices and the following error will be reported:

#######
There exists a configured cluster member that is not active!
Expected cluster members 24.33.32.99 24.33.32.100 24.33.32.35
24.33.32.36.
Activecluster members: 24.33.32.35.
(Required Action) Start the member that is not active in the cluster
#######
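
To check from one cluster member how another member's ports respond across the firewall, a simple TCP connect test distinguishes an actively refused connection from a silently dropped one. This is a minimal sketch, not a Novell tool: the function name probe, the 3 second timeout, and the result labels are assumptions. A "refused" result does not by itself prove that a TCP RST was sent, because some stacks report an ICMP rejection the same way; a packet capture is needed to confirm that.

```python
import socket
import sys


def probe(host, port, timeout=3.0):
    """Classify how host:port answers a TCP connection attempt.

    Returns "open" (a listener accepted), "refused" (actively rejected,
    typically by a TCP RST) or "dropped" (no answer within the timeout).
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect((host, port))
    except ConnectionRefusedError:
        return "refused"
    except socket.timeout:
        return "dropped"
    except OSError as exc:
        # For example "no route to host"; reported as-is.
        return f"error: {exc}"
    else:
        return "open"
    finally:
        sock.close()


if __name__ == "__main__":
    # Usage: python probe.py <peer-ip> <number-of-cluster-nodes>
    peer, nodes = sys.argv[1], int(sys.argv[2])
    for port in range(7801, 7801 + nodes):
        print(port, probe(peer, port))
```

For a healthy path through the firewall, 7801 would be expected to show "open" and the higher ports "refused". A "dropped" result on the higher ports matches the failure described above.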