Admin Console shows cluster devices stuck in a pending state for more than two minutes after applying changes to IDP, LAG and SSLVPN clusters

  • 7005944
  • 11-May-2010
  • 26-Apr-2012

Environment

Novell Access Manager 3.1 Linux Access Gateway
Novell Access Manager 3.1 Access Administration
Novell Access Manager 3.1 SSLVPN Server
Novell Access Manager 3.1 Linux Novell Identity Server
Novell Access Manager 3.1 Windows Novell Identity Server
Novell Access Manager 3.1 Support Pack 1 applied

Situation

When multiple nodes exist in the same cluster, and we push a change out to all nodes in the cluster (update ALL for cluster), some nodes get stuck in a pending state for several minutes. This does not happen when each node in the cluster is updated individually.

With SSLVPN cluster nodes, the problem was more visible and occurred with about 30% of applies; the frequency is lower with IDP or LAG clusters. The catalina.out log snippet below shows the problem, where it takes 127 seconds to initialise the JGroups message bus (DMessageBus):
>  NIDPMeEntity.commonInitialize(): Complete! Config Name: sv_ext
> Initializing system's DMessageBus!
> 16 Dec 2009 10:21:36,479 INFO ConnectionTable - server socket created on 164.216.27.221:7801
> 16 Dec 2009 10:21:36,482 INFO ConnectionTable - created socket to 164.216.27.220:7801
> 16 Dec 2009 10:21:40,242 INFO ConnectionTable - exception is java.io.EOFException
> 16 Dec 2009 10:21:43,027 INFO ConnectionTable - input_cookie is bela
> 16 Dec 2009 10:21:46,531 INFO ConnectionTable - exception is java.io.EOFException
> 16 Dec 2009 10:21:53,986 INFO ConnectionTable - input_cookie is bela
> 16 Dec 2009 10:22:39,990 WARN GMS - join(164.216.27.221:7801) sent to 164.216.27.220:7801 timed out, retrying
> Initialized system's DMessageBus in 127066 milliseconds!
This shows that one node tried to connect to another node in the cluster at 164.216.27.220:7801
but failed; the join was eventually retried and succeeded. The administrator updated both
cluster nodes at the same time. When you do this, the 7801 listener on one of the clustering
neighbours may be down due to its reinitialisation, and you see the problem.

Resolution

Add the following servlet context parameter to the web.xml file on all cluster machines (/opt/novell/nesp/lib/webapp/WEB-INF/web.xml on the LAG, /var/opt/novell/tomcat5/webapps/sslvpn/WEB-INF/web.xml on the SSLVPN server, and /opt/novell/nids/lib/webapp/WEB-INF/web.xml on the IDP server):
 
 <context-param>
  <param-name>JGroupsConfiguration</param-name>
  <param-value>TCP(start_port=[nidp:ClusterPort];end_port=[nidp:ClusterPort][nidp:IfExternalAddress];external_addr=[nidp:ExternalAddress][nidp:EndIf]):TCPPING(initial_hosts=[nidp:ClusterMembers];port_range=1;timeout=3500;num_initial_members=2;up_thread=true;down_thread=true):MERGE2(min_interval=5000;max_interval=10000):FD_SOCK([nidp:IfExternalAddress]bind_addr=[nidp:ExternalAddress][nidp:EndIf]):FD(shun=true;timeout=2500;max_tries=5;up_thread=true;down_thread=true):VERIFY_SUSPECT(timeout=2000;down_thread=false;up_thread=false):pbcast.NAKACK(down_thread=true;up_thread=true;gc_lag=100;retransmit_timeout=3000):pbcast.STABLE(desired_avg_gossip=20000;down_thread=false;up_thread=false):pbcast.STATE_TRANSFER():pbcast.GMS(merge_timeout=10000;join_timeout=5000;join_retry_timeout=2000;shun=true;print_local_addr=[nidp:DebugOn];down_thread=true;up_thread=true)</param-value>
 </context-param>
 
After copying, make sure that all of the text inside the <param-value> tags ends up on a single line in the web.xml file.
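One quick way to check this (a sketch using standard shell tools; the IDP path from above is shown, substitute the path for your component) is to confirm that the first and last protocols of the stack appear on the same line:
 
 # Prints 1 if the whole protocol stack sits on one line; 0 means the
 # value wrapped during the copy and must be rejoined into a single line.
 grep -c 'TCP(start_port=.*pbcast\.GMS(' /opt/novell/nids/lib/webapp/WEB-INF/web.xml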

Additional Information

The only difference between this configuration and the one we use by default is that the join
timeout is lowered from 60 seconds to 5 seconds and the join retry timeout from 60 seconds to
2 seconds.
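
Concretely, only the pbcast.GMS entry in the stack changes. For reference (the 60-second stanza below is reconstructed from the description of the defaults above, not copied from a shipping file):
 
 Default:  pbcast.GMS(merge_timeout=10000;join_timeout=60000;join_retry_timeout=60000;shun=true;print_local_addr=[nidp:DebugOn];down_thread=true;up_thread=true)
 Modified: pbcast.GMS(merge_timeout=10000;join_timeout=5000;join_retry_timeout=2000;shun=true;print_local_addr=[nidp:DebugOn];down_thread=true;up_thread=true)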

The issue was that when both boxes were coming up at the same time, one would
send a JOIN_REQ message that never reached the other box. That request would
wait for 60 seconds and then enter the join retry timeout period, another 60
seconds, before the subsequent join would work; those two 60-second waits
account for the roughly 127 seconds seen in the log above. Lowering the join
timeouts simply allowed the join to complete closer to the actual time that the
application and network states were finally initialised well enough.