We have four LAGs clustered behind an L4 servicing our users.
I received reports that users attempting to access one of our domain-based resources were getting browser errors saying it was unable to redirect. I went to the device manager screen and found that the health of all four LAGs was reporting yellow. Drilling down into them, LAG1, LAG2 and LAG4 all indicated that LAG3 was up but not an active member of the cluster. LAG3 reported that LAG4 was up but not an active member of the cluster. I then went to each LAG separately and ran netcat 0 2300 to look at the current activity on the proxy console. LAG1, LAG2 and LAG3 each showed approximately 500 HTTP requests under the This minute: statistic, roughly ten times the normal rate. LAG4 showed only the normal 40-50 requests. I then looked at the extended HTTP logs under /var/amlog/http-reverse/~domain/extended and found the same requests being sent over and over again to the other LAGs.
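For anyone who wants to spot this kind of repetition without reading the logs by eye, here is a minimal Python sketch that counts identical requests. It assumes each log line contains a quoted request such as "GET /path HTTP/1.1"; the real layout of the extended log may differ, and the function name, threshold and sample lines are made up for illustration.

```python
import re
from collections import Counter

# Assumption: each log line contains a quoted request like
# "GET /path HTTP/1.1". Adjust the pattern to the real log layout.
REQUEST_RE = re.compile(r'"([A-Z]+) (\S+) HTTP/[\d.]+"')


def repeated_requests(lines, threshold=5):
    """Return {(method, url): count} for requests seen at least `threshold` times."""
    counts = Counter()
    for line in lines:
        match = REQUEST_RE.search(line)
        if match:
            counts[match.groups()] += 1
    return {req: n for req, n in counts.items() if n >= threshold}


if __name__ == "__main__":
    # Hypothetical sample lines, not taken from a real log.
    sample = ['10.0.0.5 - - "GET /portal/login HTTP/1.1" 302'] * 6
    sample.append('10.0.0.9 - - "GET /index.html HTTP/1.1" 200')
    for (method, url), n in repeated_requests(sample).items():
        print(f"{n:5d}  {method} {url}")
```

To use it against a real file, pass an open file object as `lines` and tune `threshold` to whatever counts as abnormal for your traffic.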
After I contacted engineering about this, they traced it to a problem with JGroups communication. JGroups on LAG3 had stopped receiving any inbound communication. When LAG3 needed a session that a user held on another LAG, it would send a multicast to find it; the other LAG would respond, but LAG3 was unable to receive that response. LAG3 would return an error, the user would retry, the L4's sticky bit would send them back to LAG3, the user would claim to already have a session, LAG3 would multicast to find it, the other LAG would respond, and the loop continued. Restarting Tomcat with /etc/init.d/novell-tomcat4 restart fixed everything. Communication resumed immediately and everybody continued working.
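The loop is easier to see as a toy model. The Python sketch below is only an illustration of the behaviour described above, not JGroups or Access Gateway code; the function and parameter names are invented. A user is pinned to one LAG by the L4, and a LAG that cannot receive replies never learns where the session lives.

```python
def simulate(sticky_lag, session_owner, deaf_lags, max_retries=5):
    """Model one user's retries while the L4 pins them to `sticky_lag`.

    `deaf_lags` is the set of LAGs that can send multicasts but cannot
    receive the replies. Returns one outcome per attempt.
    """
    outcomes = []
    for _ in range(max_retries):
        if sticky_lag == session_owner:
            # The session is local, so no cluster lookup is needed.
            outcomes.append("ok")
            break
        # sticky_lag multicasts a lookup and the owner replies.
        if sticky_lag in deaf_lags:
            # The reply never arrives; the user gets an error and retries,
            # and the sticky L4 sends them to the same LAG again.
            outcomes.append("error")
            continue
        outcomes.append("ok")
        break
    return outcomes


if __name__ == "__main__":
    # User's session lives on LAG1, but the L4 pins them to the deaf LAG3.
    print(simulate("LAG3", "LAG1", {"LAG3"}, max_retries=4))
    # Same user pinned to a healthy LAG2 succeeds on the first try.
    print(simulate("LAG2", "LAG1", {"LAG3"}))
```

Every retry produces another multicast and another request, which fits the inflated request counts seen on the other LAGs.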