Random application failures and incomplete pages or files when going through Linux Access Gateway

  • 7006138
  • 27-May-2010
  • 26-Apr-2012

Environment

Novell Access Manager 3.1 Linux Access Gateway

Situation

An Access Manager 3 setup has a Novell Identity Server on the same host as the Admin Console, and a Linux Access Gateway (LAG) that proxies requests to back-end Web servers. Authentication and single sign-on worked fine in this environment. However, some users started reporting random failures:

- not being able to access certain large PDF files through the LAG
- not being able to cleanly download large files through the LAG
- not being able to access applications that send large amounts of data back to the browser

Forcing HTTP 1.0 to the origin server and disabling persistent connections to the browser and to the Web server did not resolve the issue. The problem never occurred when going directly to the Web server.

Resolution

Run "ethtool -K eth0 tso off" to disable TCP segmentation offload (TSO) on the NIC. The NIC driver that shipped with SLES 9 SP3 (the base OS for the Linux Access Gateway) appears to mishandle TCP segments when offload is enabled. As soon as TSO was disabled, everything started working perfectly.
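Before and after applying the workaround, it can help to confirm the current TSO state. The following is a minimal sketch, not part of the original resolution: `tso_state` is a hypothetical helper name, and it only parses the text printed by "ethtool -k" (lowercase k shows offload settings, uppercase K changes them). The interface name eth0 is taken from this article and may differ on other hosts.

```shell
#!/bin/sh
# Hypothetical helper: reads "ethtool -k <iface>" output on stdin and prints
# the TCP segmentation offload state: "on", "off", or "unknown".
# Matches both the older label ("tcp segmentation offload") and the
# newer hyphenated label ("tcp-segmentation-offload").
tso_state() {
    awk -F': *' '
        /tcp[- ]segmentation[- ]offload/ {
            split($2, words, " ")   # drop suffixes such as "[fixed]"
            print words[1]
            found = 1
            exit
        }
        END { if (!found) print "unknown" }
    '
}

# Example usage on the gateway (run as root; commented out here):
#   ethtool -k eth0 | tso_state
#   [ "$(ethtool -k eth0 | tso_state)" = "on" ] && ethtool -K eth0 tso off
```

A setting changed with "ethtool -K" does not survive a reboot or a driver reload, so the command would need to be reapplied at boot for the workaround to stay in effect.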

The network driver was bnx2. The NIC is a Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X. The hardware is an HP ProLiant DL380 G5 Dual Core X5160 3GHzx1 1x4MB L2 2GB. The issue has also been seen with multiple newer LAN drivers that all offer on-board TCP checksumming.

From /var/log/messages:
May 4 11:10:25 PIOJPPSNACMSU3 kernel: bnx2: module not supported by Novell, setting U taint flag.
May 4 11:10:25 PIOJPPSNACMSU3 kernel: Broadcom NetXtreme II Gigabit Ethernet Driver bnx2 v1.3.29 (October 6, 2005)
May 4 11:10:25 PIOJPPSNACMSU3 kernel: eth0: Broadcom NetXtreme II BCM5708 1000Base-T (B2) PCI-X 64-bit 133MHz found at mem f8000000


The LAG shipping with 3.1 Support Pack 2 will be based on SLES 11. SLES 11 will not have the same issue, so the above workaround will not be required. The problem occurs only with the combination of the newer NICs and the older SLES 9 SP3 kernel.

Additional Information

Traces indicated that not all of the data requested from the Web server was being sent back to the browser. The response from the Access Gateway to the browser was missing TCP data, and the tcpdump capture showed TCP checksum errors when viewed in Wireshark. It is not certain whether these were real errors or a decode artifact: with checksum offload enabled, a trace taken on the gateway itself sees outgoing packets before the NIC fills in the checksum.

Running "ethtool -S eth0" to dump the statistics, we could see transmit and receive error counters incrementing regularly on the NIC connecting the proxy to the browser. Suspecting the NIC, we replaced it and could no longer reproduce the issue.
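The counter check described above can be sketched as a small filter over the "ethtool -S" output. This is an illustration only: `nonzero_errors` is a hypothetical helper name, and because counter names vary by driver, the filter simply assumes that any counter whose name contains "err", "drop" or "discard" is of interest.

```shell
#!/bin/sh
# Hypothetical helper: reads "ethtool -S <iface>" output on stdin and prints
# every error-like counter with a non-zero value as "name=value".
# Assumption: counters of interest contain "err", "drop" or "discard"
# in their name; actual names depend on the NIC driver.
nonzero_errors() {
    awk -F': *' '
        tolower($1) ~ /err|drop|discard/ && $2 + 0 > 0 {
            name = $1
            sub(/^[ \t]+/, "", name)   # strip the leading indentation
            print name "=" $2
        }
    '
}

# Example usage on the gateway (commented out here); run it twice a few
# seconds apart and compare the values to see which counters increment:
#   ethtool -S eth0 | nonzero_errors
```

Counters that keep growing between two runs while users are downloading large files point at the interface, driver or offload settings rather than at the application.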