Page Fault Abend in LIBNSS|LB_memcpy+15

  • 3614515
  • 24-Mar-2008
  • 07-Jun-2013

Environment

Novell NetWare 6.5
VMWare ESX Server 3.5
Citrix Metaframe
XenApp Presentation Server


Situation

Novell has received reports from a number of customers who have encountered this problem in the following environments:
- Physical servers running Novell NetWare 6.5
- VMWare ESX Server 3.5 running virtualized instances of Novell NetWare 6.5, with the end-user desktop presented via Citrix Metaframe or XenApp Presentation Server.


At irregular intervals the NetWare servers abend with a Page Fault and a stack trace similar to the one below:


P00# sw
Current EIP: 85B7DF35 LIBNSS.NLM|LB_memcpy+15
8002BD4C 86158023 COMN.NSS|COMN_Write+B23
8002BDF8 85E12134 NSS.NLM|MSG_Call+B4
8002BE10 86009C3C NWSA.NSS|ZH_WriteFile64+EC
8002BEE4 86009D05 NWSA.NSS|ZH_WriteFile+55
8002BF10 8BB91F4E NCPIP.NLM|ProcessNCPRequest+8E
8002BF38 00369912 SERVER.NLM|StartWorkToDo+23
8002BF50 0022EB3B SERVER.NLM|kWorkerThread+DF
8002BF68 002285D8 SERVER.NLM|TcoNewSystemThreadEntryPoint+40

When users work on MS Office Access/Excel files that reside in their home directories, the server abends, mostly during file write operations. The root cause is corrupted NCP packets traversing the wire, in which the offset for the bytes to be written is out of range. When the server detects the corruption, it abends in order to maintain data integrity.

Further troubleshooting has shown that on at least one occasion a Windows 2000 Server that was part of the Citrix farm had a faulty network card, which caused the NCP packet corruption.

Another customer reported that the TCP Segment Offload setting was enabled on a network card in one of the Citrix servers; after disabling it, the abend problems were resolved.

Still another customer had changed the "Minimum NCP TCP receive window to advertise" setting on the server from the default value of 4096 to the maximum value of 16,384. This exposed a TCP windowing defect in which the TCP receive window on a connection would go to 0. If the request on the NCP connection was an NCP write and the server advertised a small window (for example, 31 bytes), the workstation would send exactly that number of bytes. If this data was not enough for a complete NCP header, it could cause an invalid value for the NCP write. The server would abend to avoid data corruption.
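
The failure mode described above (a receiver acting on fewer bytes than a complete header, or on an offset that is out of range) can be illustrated with a minimal sketch. The header layout below is hypothetical and is not the real NCP wire format; the names HEADER_FORMAT, parse_write_request and file_size_limit are assumptions made purely for illustration. It shows only the general idea of rejecting a short or out-of-range write request before touching any data.

```python
import struct

# Hypothetical layout for illustration only; this is NOT the NCP wire format.
# 4-byte big-endian file offset followed by a 2-byte big-endian byte count.
HEADER_FORMAT = ">IH"
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 6 bytes, no padding with ">"


def parse_write_request(segment: bytes, file_size_limit: int) -> tuple:
    """Return (offset, count), or raise ValueError if the request is unsafe."""
    # A segment shorter than the header cannot be parsed reliably; treating
    # whatever bytes arrived as a full header is what yields invalid values.
    if len(segment) < HEADER_SIZE:
        raise ValueError("incomplete header: %d of %d bytes"
                         % (len(segment), HEADER_SIZE))

    offset, count = struct.unpack_from(HEADER_FORMAT, segment)

    # Reject a write whose range falls outside the permitted area.
    if offset + count > file_size_limit:
        raise ValueError("write range out of bounds")

    return offset, count
```

In this sketch a caller would wait for more data after an "incomplete header" error and drop the request after an "out of bounds" error; how the server actually handles these cases is not described in this document beyond the abend.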

Resolution

The resolution for this problem is:
- Verify that the "Minimum NCP TCP receive window to advertise" setting is not larger than 4096 (the default).
- If it is at the default, identify where in the network the NCP packet corruption is coming from.

Troubleshooting methods include (but are not limited to):
- Analyzing LAN traces
- Checking server NIC driver statistics for suspicious numbers of retransmits and/or other errors
- Checking networking components for abnormal numbers of packet (re-)transmits or excessive failures
- Checking for firmware updates for your infrastructure/networking components that address communication issues
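
When comparing counters collected from several NICs or network components, a simple ratio check can help narrow down where to look first. The following is a minimal sketch under stated assumptions: the counters have already been gathered by hand into a dictionary, the function names are hypothetical, and the 1% threshold is an arbitrary illustrative value, not a Novell recommendation.

```python
def retransmit_ratio(sent: int, retransmitted: int) -> float:
    """Fraction of sent packets that had to be retransmitted."""
    if sent <= 0:
        return 0.0
    return retransmitted / sent


def flag_suspect_interfaces(stats: dict, threshold: float = 0.01) -> list:
    """Return the names of interfaces whose retransmit ratio exceeds threshold.

    stats maps an interface name to a (packets_sent, packets_retransmitted)
    tuple. The default threshold of 1% is an assumption for illustration.
    """
    return [
        name
        for name, (sent, retransmitted) in stats.items()
        if retransmit_ratio(sent, retransmitted) > threshold
    ]
```

For example, given hand-collected counters such as {"citrix1": (1000, 5), "citrix2": (1000, 50)}, only "citrix2" would be flagged, which suggests starting the investigation at that machine's network card, driver settings, and cabling.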

Additional Information

Another very similar instance was found in which corrupted NCP packets abended the server in the same fashion, with the same abends. In that case the problem appeared after a Cisco WAAS (Wide Area Application Services) device was installed in the customer's infrastructure. A Cisco WAAS device is an application acceleration and WAN optimization solution intended to improve the performance of any TCP/IP-based application operating in a WAN environment. A machine was found with an older software version that was causing problems with packet aggregation.

Cisco has released a software update that registered users can download from the Wide Area Application Services (WAAS) area.
The customer confirmed that after installing software version v4.0.15 and re-enabling full optimization, the reported problem no longer occurred.