JGroups clustered nodes not synchronising when multiple nodes removed from cluster at the same time

Document ID:7014557
Creation Date:13-Feb-2014
Modified Date:01-Jul-2015
- Micro Focus Products:
  Access Manager (NAM)

Environment

NetIQ Access Manager 3.2

NetIQ Access Manager 3.2 Identity Server

NetIQ Access Manager 3.2 Access Gateway

Access Manager setup in clustered environment

Situation

A large NetIQ Access Manager setup exists with 7 nodes in an Access Gateway (AG) cluster. Using the statistics output option under Identity Server (IDP) logging, each node in the cluster sees all other nodes, and proxying between the nodes works seamlessly.

As part of a disaster recovery test, the administrator took 5 nodes out of the AG cluster at the same time and sent requests to the two remaining nodes. As long as a user's subsequent requests stayed on the same AG, everything worked fine. If, however, the load balancer fronting the two remaining AGs bounced the user session between them, the user would loop between resources without ever getting access to the application.

For example, assume AG1-AG7 are up and running and the load balancer is configured so that sso.novell.nl points to agw1 and www.novell.nl points to agw7; the other AGs are disabled for these links.

Test 1: All AGs running.

Go to link https://www.novell.nl/mwp2/faces/confidential/aanmelden.jspx
Login with user: piet and password
client redirect: https://logon.novell.nl/nidp/rsg/rsglogin.jsp
client redirect: https://sso.novell.nl/LAGBroker?c=MC/secure/name/password/uri&%22https://www.novell.nl/mwp2/faces/secure/gotoDashboard
client redirect: https://www.novell.nl/mwp2/faces/secure/gotoDashboard
Dashboard, user logged in: https://www.novell.nl/mwp2/faces/secure/dashboard.jspx?sc=0
Everything works fine, no problems

Test 2: Only AG1 and AG7 are up; the remaining 5 AGs have just had network connectivity removed

Go to link https://www.novell.nl/mwp2/faces/confidential/aanmelden.jspx
Login with valid username and password
client redirect: https://logon.novell.nl/nidp/rsg/rsglogin.jsp
client redirect: https://sso.novell.nl/LAGBroker?c=MC/secure/name/password/uri&%22https://www.novell.nl/mwp2/faces/secure/gotoDashboard
client redirect: https://www.novell.nl/mwp2/faces/secure/gotoDashboard
client redirect: https://sso.novell.nl/LAGBroker?c=MC/secure/name/password/uri&%22https://www.novell.nl/mwp2/faces/secure/gotoDashboard
client redirect: https://www.novell.nl/mwp2/faces/secure/gotoDashboard
client redirect: https://sso.novell.nl/LAGBroker?c=MC/secure/name/password/uri&%22https://www.novell.nl/mwp2/faces/secure/gotoDashboard
etc.
User experiences looping without ever accessing the application
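
The looping in Test 2 can be spotted mechanically from a captured redirect chain. The following is a minimal Python sketch, not part of Access Manager: the function name find_redirect_loop and the threshold max_visits=2 are assumptions chosen for illustration, and the URLs are the ones listed in Test 2 above.

```python
from collections import Counter

# Redirect chain copied from Test 2 in this article.
BROKER = ("https://sso.novell.nl/LAGBroker?c=MC/secure/name/password/uri"
          "&%22https://www.novell.nl/mwp2/faces/secure/gotoDashboard")
DASHBOARD = "https://www.novell.nl/mwp2/faces/secure/gotoDashboard"
TEST2_CHAIN = [
    "https://logon.novell.nl/nidp/rsg/rsglogin.jsp",
    BROKER, DASHBOARD, BROKER, DASHBOARD, BROKER,
]


def find_redirect_loop(chain, max_visits=2):
    """Return the first URL visited more than max_visits times, else None.

    max_visits is a hypothetical threshold, not an Access Manager setting.
    """
    visits = Counter()
    for url in chain:
        visits[url] += 1
        if visits[url] > max_visits:
            return url
    return None


looping_url = find_redirect_loop(TEST2_CHAIN)
print("loop detected at:", looping_url)
```

Applied to the Test 1 chain, where each URL appears only once, the same function returns None.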

Resolution

Increase the JGroups timeouts for the MERGE2 and FD protocols by modifying /opt/novell/nids/lib/webapp/WEB-INF/web.xml on the AG server:

<context-param>
<param-name>JGroupsConfiguration</param-name>
<param-value>TCP(start_port=[nidp:ClusterPort];end_port=[nidp:ClusterPort][nidp:IfExternalAddress];external_addr=[nidp:ExternalAddress][nidp:EndIf]):TCPPING(initial_hosts=[nidp:ClusterMembers];port_range=1;timeout=3500;num_initial_members=2;up_thread=true;down_thread=true):MERGE2(min_interval=10000;max_interval=30000):FD_SOCK([nidp:IfExternalAddress]bind_addr=[nidp:ExternalAddress][nidp:EndIf]):FD(shun=true;timeout=5000;max_tries=5;up_thread=true;down_thread=true):VERIFY_SUSPECT(timeout=2000;down_thread=false;up_thread=false):pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):pbcast.STATE_TRANSFER():pbcast.GMS(merge_timeout=10000;join_timeout=5000;join_retry_timeout=2000;shun=true;print_local_addr=[nidp:DebugOn];down_thread=true;up_thread=true)</param-value>
</context-param>
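
To see what the failure-detection settings in the value above add up to, the following minimal Python sketch parses an excerpt of the stack string and multiplies the FD timeout by max_tries. It assumes that FD suspects a member after roughly timeout x max_tries milliseconds without a heartbeat; the helper names parse_stack and fd_detection_window_ms are hypothetical and not part of Access Manager or JGroups.

```python
import re

# Excerpt of the JGroupsConfiguration value shown above (three protocols only).
STACK = (
    "MERGE2(min_interval=10000;max_interval=30000):"
    "FD(shun=true;timeout=5000;max_tries=5;up_thread=true;down_thread=true):"
    "VERIFY_SUSPECT(timeout=2000;down_thread=false;up_thread=false)"
)


def parse_stack(stack):
    """Return {protocol_name: {param: value}} for a JGroups-style stack string."""
    protocols = {}
    # Each protocol looks like NAME(key=value;key=value); names may contain dots.
    for name, body in re.findall(r"([\w.]+)\(([^()]*)\)", stack):
        params = {}
        for item in body.split(";"):
            if "=" in item:
                key, value = item.split("=", 1)
                params[key.strip()] = value.strip()
        protocols[name] = params
    return protocols


def fd_detection_window_ms(protocols):
    """Approximate time before FD suspects a silent member (assumption:
    timeout multiplied by max_tries)."""
    fd = protocols["FD"]
    return int(fd["timeout"]) * int(fd["max_tries"])


protocols = parse_stack(STACK)
print(protocols["MERGE2"])                # merge interval bounds, in ms
print(fd_detection_window_ms(protocols))  # 25000
```

With the values in this article, the sketch reports a window of 25000 ms for FD, on top of which VERIFY_SUSPECT adds its own timeout before a member is excluded.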

Cause

Tracking the requests in the browser's HTTP headers, one can see that the load balancer sent the request to an AG (AG7) other than the one that handled the original request (AG1). To validate the user session, AG7, which is processing the latest request, must locate the user session on another AG, so it sends out a JGroups request to identify which AG owns the user session.

In our case, we can see the following in the catalina log file of AG7. Note that we send the JGroups request and, after 15 seconds (the default timeout for failure), we see these messages:

- sending are-you-alive msg to 172.26.0.199:7801 (own address=172.26.0.45:7801)
- sending are-you-alive msg to 172.26.0.199:7801 (own address=172.26.0.45:7801)
- sending are-you-alive msg to 172.26.0.199:7801 (own address=172.26.0.45:7801)
- heartbeat missing from 172.26.0.199:7801 (number=1)
- heartbeat missing from 172.26.0.199:7801 (number=1)
- heartbeat missing from 172.26.0.199:7801 (number=1)
<amLogEntry> 2014-02-03T12:49:11Z DEBUG NIDS Application:
Method: DMessageBus.A
Thread: ajp-bio-127.0.0.1-9009-exec-21
DMessageBus Message Response: Elapsed Millis: 15002, Count: 1
Response #0: from member 172.26.0.199.
Was Received: false
Was Suspected: false
</amLogEntry>
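
When many such entries have to be reviewed, the relevant fields can be pulled out with a few regular expressions. The following Python sketch assumes the amLogEntry layout shown above; the function name summarize_response is hypothetical and this is only a log-reading aid, not an Access Manager tool.

```python
import re

# The DMessageBus entry from the AG7 catalina log shown above.
LOG_ENTRY = """<amLogEntry> 2014-02-03T12:49:11Z DEBUG NIDS Application:
Method: DMessageBus.A
Thread: ajp-bio-127.0.0.1-9009-exec-21
DMessageBus Message Response: Elapsed Millis: 15002, Count: 1
Response #0: from member 172.26.0.199.
Was Received: false
Was Suspected: false
</amLogEntry>"""


def summarize_response(entry):
    """Extract elapsed time, peer address and response flags from one entry.

    Returns None if the entry does not match the layout assumed here.
    """
    elapsed = re.search(r"Elapsed Millis:\s*(\d+)", entry)
    member = re.search(r"from member\s+(\d+(?:\.\d+){3})", entry)
    received = re.search(r"Was Received:\s*(true|false)", entry)
    suspected = re.search(r"Was Suspected:\s*(true|false)", entry)
    if not (elapsed and member and received and suspected):
        return None
    return {
        "elapsed_ms": int(elapsed.group(1)),
        "member": member.group(1),
        "received": received.group(1) == "true",
        "suspected": suspected.group(1) == "true",
    }


summary = summarize_response(LOG_ENTRY)
print(summary)
```

An entry where received and suspected are both false, as here, means the request waited the full elapsed time without an answer while the peer was still considered a cluster member.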

With JGroups debug logging enabled on this server, we can see that keepalives are sent but no responses are ever received:

1848014 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD - sending are-you-alive msg to 172.26.0.199:7801 (own address=172.26.0.45:7801)
1848014 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD - heartbeat missing from 172.26.0.199:7801 (number=0)
1848014 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD - heartbeat missing from 172.26.0.199:7801 (number=0)
1848014 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD - heartbeat missing from 172.26.0.199:7801 (number=0)

Looking at both AG1 and AG7, we see that the merge does not complete successfully: one of the JGroups instances was not reset, but the other does NOT show the initialisation.

AG1 shows the following merge operation:

1577271 [MERGE2.FindSubgroups thread (channel=cn=SCC13BA39BD9D7F9B8D,cn=cluster,cn=nids,ou=accessManagerContainer,o=novellNIDPMessageBus)] DEBUG org.jgroups.protocols.MERGE2 - initial_mbrs=[[own_addr=172.26.0.45:7801, coord_addr=172.26.0.199:7801, is_server=true], [own_addr=172.26.0.199:7801, coord_addr=172.26.0.199:7801, is_server=true]]
1578806 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD - sending are-you-alive msg to 172.26.0.45:7801 (own address=172.26.0.199:7801)

AG7 shows no merge operations at all.
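
When comparing MERGE2 output across nodes, it can help to list which coordinator each discovered member reports. The following Python sketch assumes the initial_mbrs format seen in the AG1 log line above; the function name coordinators_by_member is hypothetical, and the sketch is only a reading aid that draws no conclusion by itself.

```python
import re

# initial_mbrs portion of the MERGE2 debug line from AG1 shown above.
MERGE2_LINE = (
    "initial_mbrs=[[own_addr=172.26.0.45:7801, coord_addr=172.26.0.199:7801, "
    "is_server=true], [own_addr=172.26.0.199:7801, coord_addr=172.26.0.199:7801, "
    "is_server=true]]"
)


def coordinators_by_member(line):
    """Map each member's own address to the coordinator address it reports."""
    pairs = re.findall(r"own_addr=([\d.:]+), coord_addr=([\d.:]+)", line)
    return dict(pairs)


views = coordinators_by_member(MERGE2_LINE)
distinct_coordinators = set(views.values())

for member, coord in views.items():
    print(member, "reports coordinator", coord)
print("distinct coordinators:", sorted(distinct_coordinators))
```

Running the same extraction against the corresponding log on each node gives a quick side-by-side view of what every node believes the cluster looks like.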