JGroups clustered nodes not synchronising when multiple nodes removed from cluster at the same time

  • 7014557
  • 13-Feb-2014
  • 01-Jul-2015

Environment

NetIQ Access Manager 3.2
NetIQ Access Manager 3.2 Identity Server
NetIQ Access Manager 3.2 Access Gateway
Access Manager setup in clustered environment

Situation

A large NetIQ Access Manager setup exists with 7 nodes in an Access Gateway (AG) cluster. Using the statistics output option under Identity Server (IDP) logging, each node in the cluster sees all other nodes, and all proxying between the nodes works seamlessly.

As part of a disaster recovery test, the administrator took 5 nodes out of the AG cluster at the same time and sent requests to the two remaining nodes. As long as these subsequent requests remained persistent to the same AG, everything worked fine. If, however, the load balancer fronting the two remaining AGs bounced the user session between them, users would loop between resources without ever getting access to the application.

For example, assume AG1-AG7 are up and running, and the load balancer is configured so that sso.novell.nl points to AG1 and www.novell.nl points to AG7; the other AGs are disabled for these links.


Test 1: All AGs running.
  1. Go to link  https://www.novell.nl/mwp2/faces/confidential/aanmelden.jspx
  2. Login with user: piet and password
  3. client redirect: https://logon.novell.nl/nidp/rsg/rsglogin.jsp
  4. client redirect: https://sso.novell.nl/LAGBroker?c=MC/secure/name/password/uri&%22https://www.novell.nl/mwp2/faces/secure/gotoDashboard
  5. client redirect: https://www.novell.nl/mwp2/faces/secure/gotoDashboard
  6. Dashboard, user logged in:  https://www.novell.nl/mwp2/faces/secure/dashboard.jspx?sc=0
  7. Everything works fine, no problems
Test 2: Only AG1 and AG7 are up; the remaining 5 AGs have just had network connectivity removed
  1. Go to link  https://www.novell.nl/mwp2/faces/confidential/aanmelden.jspx
  2. Login with valid username and password
  3. client redirect: https://logon.novell.nl/nidp/rsg/rsglogin.jsp
  4. client redirect: https://sso.novell.nl/LAGBroker?c=MC/secure/name/password/uri&%22https://www.novell.nl/mwp2/faces/secure/gotoDashboard
  5. client redirect: https://www.novell.nl/mwp2/faces/secure/gotoDashboard
  6. client redirect: https://sso.novell.nl/LAGBroker?c=MC/secure/name/password/uri&%22https://www.novell.nl/mwp2/faces/secure/gotoDashboard
  7. client redirect: https://www.novell.nl/mwp2/faces/secure/gotoDashboard
  8. client redirect: https://sso.novell.nl/LAGBroker?c=MC/secure/name/password/uri&%22https://www.novell.nl/mwp2/faces/secure/gotoDashboard
  9. etc.
  10. User experiences looping without ever accessing the application
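
The looping pattern in Test 2 can be spotted mechanically from a captured redirect trace. The following is a minimal sketch, not part of Access Manager: the function name find_redirect_loop is hypothetical, and it assumes the trace is available as an ordered list of URLs (for example, copied from a browser HTTP header trace). It only looks for a two-hop cycle such as LAGBroker -> gotoDashboard -> LAGBroker -> gotoDashboard.

```python
from urllib.parse import urlsplit


def find_redirect_loop(urls):
    """Return the two (host, path) hops that alternate, or None if no loop is seen.

    Hypothetical helper for illustration only.
    """
    # Reduce each URL to (host, path); query strings are ignored for comparison.
    hops = [(urlsplit(u).hostname, urlsplit(u).path) for u in urls]
    # Look for the pattern A -> B -> A -> B anywhere in the trace.
    for i in range(len(hops) - 3):
        a, b = hops[i], hops[i + 1]
        if a != b and hops[i + 2] == a and hops[i + 3] == b:
            return a, b
    return None


BROKER = ("https://sso.novell.nl/LAGBroker?c=MC/secure/name/password/uri"
          "&%22https://www.novell.nl/mwp2/faces/secure/gotoDashboard")
TARGET = "https://www.novell.nl/mwp2/faces/secure/gotoDashboard"
LOGIN = "https://logon.novell.nl/nidp/rsg/rsglogin.jsp"

# Redirects from Test 1 (no loop) and Test 2 (loop), as listed above.
test1 = [LOGIN, BROKER, TARGET,
         "https://www.novell.nl/mwp2/faces/secure/dashboard.jspx?sc=0"]
test2 = [LOGIN, BROKER, TARGET, BROKER, TARGET, BROKER]

print(find_redirect_loop(test1))  # None
print(find_redirect_loop(test2))
```

For the Test 2 trace, the returned pair names the sso.novell.nl broker URL and the www.novell.nl target, that is, the two virtual hosts served by different AGs.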

Resolution

Increase the JGroups timeouts for the MERGE and FD operations by modifying /opt/novell/nids/lib/webapp/WEB-INF/web.xml on the AG server:
 
 <context-param>
  <param-name>JGroupsConfiguration</param-name>
  <param-value>TCP(start_port=[nidp:ClusterPort];end_port=[nidp:ClusterPort][nidp:IfExternalAddress];external_addr=[nidp:ExternalAddress][nidp:EndIf]):TCPPING(initial_hosts=[nidp:ClusterMembers];port_range=1;timeout=3500;num_initial_members=2;up_thread=true;down_thread=true):MERGE2(min_interval=10000;max_interval=30000):FD_SOCK([nidp:IfExternalAddress]bind_addr=[nidp:ExternalAddress][nidp:EndIf]):FD(shun=true;timeout=5000;max_tries=5;up_thread=true;down_thread=true):VERIFY_SUSPECT(timeout=2000;down_thread=false;up_thread=false):pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):pbcast.STATE_TRANSFER():pbcast.GMS(merge_timeout=10000;join_timeout=5000;join_retry_timeout=2000;shun=true;print_local_addr=[nidp:DebugOn];down_thread=true;up_thread=true)</param-value>
 </context-param>
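
To check which MERGE2 and FD values are actually in effect after editing web.xml, the param-value string can be split into its protocol layers. This is a minimal sketch under stated assumptions: parse_stack is a hypothetical helper, not a JGroups or Access Manager API, and the sample string is only a fragment of the configuration above. The product timeout x max_tries is used here as a rough estimate of how long FD takes to suspect an unresponsive member; treat that as an assumption about FD behaviour rather than a documented guarantee.

```python
def parse_stack(stack):
    """Split a JGroups protocol-stack string into {protocol: {param: value}}.

    Hypothetical helper for illustration only.
    """
    protocols, depth, start = {}, 0, 0
    # Split on ':' only outside parentheses and brackets, because placeholders
    # such as [nidp:ClusterPort] contain colons themselves.
    for i, ch in enumerate(stack + ":"):
        if ch in "([":
            depth += 1
        elif ch in ")]":
            depth -= 1
        elif ch == ":" and depth == 0:
            name, _, body = stack[start:i].partition("(")
            pairs = (p.split("=", 1) for p in body.rstrip(")").split(";") if "=" in p)
            protocols[name] = dict(pairs)
            start = i + 1
    return protocols


# Fragment of the JGroupsConfiguration value shown above.
fragment = (
    "MERGE2(min_interval=10000;max_interval=30000):"
    "FD(shun=true;timeout=5000;max_tries=5;up_thread=true;down_thread=true):"
    "VERIFY_SUSPECT(timeout=2000;down_thread=false;up_thread=false)"
)

stack = parse_stack(fragment)
fd = stack["FD"]
# Rough estimate (assumption): time before FD suspects a silent member.
fd_window_ms = int(fd["timeout"]) * int(fd["max_tries"])
print(stack["MERGE2"])   # {'min_interval': '10000', 'max_interval': '30000'}
print(fd_window_ms)      # 25000
```

The same function can be run against the full param-value string; placeholder-prefixed keys such as the one inside FD_SOCK are kept verbatim rather than interpreted.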

Cause

Tracing the requests through the browser's HTTP headers, one can see that the load balancer sent the request to an AG (AG7) other than the one that handled the original request (AG1). To validate the user session, AG7, which is processing the latest request, must locate the user session on another AG, so it sends out a jgroups request to identify which AG owns the user session.

In our case, the catalina log file of AG7 shows the following. Note that the jgroups request is sent and, after 15 seconds (the default timeout for failure), these messages appear:

 


- sending are-you-alive msg to 172.26.0.199:7801 (own address=172.26.0.45:7801)
- sending are-you-alive msg to 172.26.0.199:7801 (own address=172.26.0.45:7801)
- sending are-you-alive msg to 172.26.0.199:7801 (own address=172.26.0.45:7801)
- heartbeat missing from 172.26.0.199:7801 (number=1)
- heartbeat missing from 172.26.0.199:7801 (number=1)
- heartbeat missing from 172.26.0.199:7801 (number=1)
<amLogEntry> 2014-02-03T12:49:11Z DEBUG NIDS Application:
Method: DMessageBus.A
Thread: ajp-bio-127.0.0.1-9009-exec-21
DMessageBus Message Response: Elapsed Millis: 15002, Count: 1
  Response #0: from member 172.26.0.199.
     Was Received: false
     Was Suspected: false
 </amLogEntry>

 

 

With jgroups debug logging enabled on this server, the log shows that keep-alives are sent but no responses are ever received:

 

1848014 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD  - sending are-you-alive msg to 172.26.0.199:7801 (own address=172.26.0.45:7801)
1848014 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD  - heartbeat missing from 172.26.0.199:7801 (number=0)
1848014 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD  - heartbeat missing from 172.26.0.199:7801 (number=0)
1848014 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD  - heartbeat missing from 172.26.0.199:7801 (number=0)
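
On a busy server these FD lines are easier to read as per-peer totals. The sketch below is illustrative only: summarise_fd and the FD_LINE pattern are hypothetical names, and the pattern assumes the two message formats shown in the excerpt above ("sending are-you-alive msg to" and "heartbeat missing from", each followed by an IPv4 address and port).

```python
import re
from collections import Counter

# Matches the two FD debug messages seen in the excerpt above.
FD_LINE = re.compile(
    r"(?P<kind>sending are-you-alive msg to|heartbeat missing from) "
    r"(?P<peer>\d{1,3}(?:\.\d{1,3}){3}:\d+)"
)


def summarise_fd(lines):
    """Count FD probes and missed heartbeats per peer address.

    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for line in lines:
        m = FD_LINE.search(line)
        if m:
            kind = "probe" if m.group("kind").startswith("sending") else "missed"
            counts[(m.group("peer"), kind)] += 1
    return counts


# The four debug lines from the excerpt above.
sample = [
    "1848014 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD  - sending are-you-alive msg to 172.26.0.199:7801 (own address=172.26.0.45:7801)",
    "1848014 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD  - heartbeat missing from 172.26.0.199:7801 (number=0)",
    "1848014 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD  - heartbeat missing from 172.26.0.199:7801 (number=0)",
    "1848014 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD  - heartbeat missing from 172.26.0.199:7801 (number=0)",
]

summary = summarise_fd(sample)
for (peer, kind), n in sorted(summary.items()):
    print(peer, kind, n)
```

A peer with probes and missed heartbeats but no sign of recovery, as with 172.26.0.199:7801 here, matches the situation described in this section.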

 

Looking at both AG1 and AG7, we see that the merge does not complete successfully: one of the jgroups was not reset, but the other does NOT show the initialisation.

 

AG1 shows the following merge operation:

 

1577271 [MERGE2.FindSubgroups thread (channel=cn=SCC13BA39BD9D7F9B8D,cn=cluster,cn=nids,ou=accessManagerContainer,o=novellNIDPMessageBus)] DEBUG org.jgroups.protocols.MERGE2  - initial_mbrs=[[own_addr=172.26.0.45:7801, coord_addr=172.26.0.199:7801, is_server=true], [own_addr=172.26.0.199:7801, coord_addr=172.26.0.199:7801, is_server=true]]
1578806 [TimeScheduler.Thread] DEBUG org.jgroups.protocols.FD  - sending are-you-alive msg to 172.26.0.45:7801 (own address=172.26.0.199:7801)

 

AG7 has no merge operations at all.