NFS client mount from exported OES NSS file system appears to hang

  • 7023677
  • 25-Jan-2019
  • 25-Jan-2019

Environment

Open Enterprise Server 2018 (OES 2018) Linux

Situation

After OES 2018 (on SLES 12 SP2) has been running for a day or so, the NFS Server daemon stops responding to NFS clients.  Client machine processes attempting to use those NFS mounts may appear to hang or stall.  Other processes waiting on those operations, or waiting on related resources, may also stall.  This can be temporarily corrected at the OES NFS Server machine with "systemctl restart nfsserver".  On the one system where this problem was seen, it was reported that the OES 2018 system had previously run OES 11 and had performed the same kind of NFS Server functions without problem.
 
When the system is in this state, "ps" shows the nfsd threads in "D" (uninterruptible sleep) state, most of them waiting in "iterate_dir":

server-90:/# ps -eo ppid,pid,user,stat,pcpu,comm,wchan:32
PPID   PID USER     STAT %CPU COMMAND         WCHAN
<snip>
    2  7140 root     D     0.0 nfsd            iterate_dir
    2  7141 root     D     0.0 nfsd            iterate_dir
    2  7142 root     D     0.0 nfsd            iterate_dir
    2  7143 root     D     0.0 nfsd            -
    2  7144 root     D     0.0 nfsd            iterate_dir
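
The check above can be scripted.  The following is only a sketch: the function name count_blocked_nfsd is hypothetical, and it assumes the same "ps" column names used in the example above.  It prints the number of nfsd threads in "D" state, followed by how many of those are waiting in iterate_dir.

```shell
# Count nfsd threads in "D" state and how many of them wait in iterate_dir.
# Reads "STAT COMMAND WCHAN" rows on stdin, so saved ps output can be checked too.
# The ps header row is ignored because its second field is "COMMAND", not "nfsd".
count_blocked_nfsd() {
    awk '$2 == "nfsd" && $1 ~ /^D/ { d++; if ($3 == "iterate_dir") it++ }
         END { printf "%d %d\n", d + 0, it + 0 }'
}

# Live usage on the OES NFS Server:
#   ps -eo stat,comm,wchan:32 | count_blocked_nfsd
```

A result where both numbers are non-zero and stay that way across repeated runs would be consistent with the state described here.  A single sample is not conclusive, because a thread can be in "D" state briefly during normal I/O.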

 
Also, network packet analysis shows that nfs request packets are being successfully delivered to the OES NFS Server.  The OES TCP layer is ACK-ing (acknowledging) those packets, yet the NFS Server daemon on the OES machine never sends an actual answer to the requests.  In other words, the TCP and network layers are working, but the NFS Server daemon or (more likely) the NSS file system may be malfunctioning.

Resolution

A full analysis of this case has not been possible, and therefore no code level fix is available.
 
A configuration workaround was found: add the option "nordirplus" to the nfs client mount options.  This prevents the nfs client mount from using the nfs operation "readdirplus", so it relies on the more basic "readdir" operation instead.  See "man nfs" for more details on this option.
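
As an illustration of the workaround, the sketch below shows how the option might be supplied and how existing mounts could be checked.  The server name, export path and mount point are hypothetical placeholders, and the function name nfs_mounts_without_nordirplus is invented for this example.  It assumes that /proc/mounts lists "nordirplus" among the options of a mount that uses it, and that NFS version 3 mounts appear there with the file system type "nfs".

```shell
# Hypothetical names: server-90:/media/nss/VOL1 is the export, /mnt/vol1 the mount point.
# One-off mount on the nfs client (as root):
#   mount -t nfs -o nordirplus server-90:/media/nss/VOL1 /mnt/vol1
# Equivalent /etc/fstab entry:
#   server-90:/media/nss/VOL1  /mnt/vol1  nfs  nordirplus  0 0

# Print the mount points of nfs mounts whose option list lacks nordirplus.
# Reads /proc/mounts-style lines on stdin: device mountpoint fstype options ...
# Only fstype "nfs" is examined; mounts listed as "nfs4" are skipped.
nfs_mounts_without_nordirplus() {
    awk '$3 == "nfs" && ("," $4 ",") !~ /,nordirplus,/ { print $2 }'
}

# Usage on the nfs client:
#   nfs_mounts_without_nordirplus < /proc/mounts
```

An existing mount normally has to be unmounted and mounted again before a changed option takes effect.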

Cause

It is suspected that contention can occur within the OES NSS file system during nfs client readdirplus operations when multiple processes are involved, resulting in a deadlock on a blocked mutex.  The problem is suspected to be within NSS file system code, not nfsd code.

Additional Information

While the "nordirplus" option can eliminate the symptom, this is only a workaround.  If this issue is encountered, a kdump of the OES NFS Server machine, taken while it is in this state, may help OES development investigate the problem further.
 
NOTE, however, that almost all cases of an nfs client mount appearing to hang will be due to TCP or network related communication issues, and will not match this particular case.  The key factors for a match are expected to be:
 
- The OES machine acting as the NFS Server has nfsd processes in "D" state, showing 'iterate_dir' (see ps command example above)
 
- The nfs client mount option "nordirplus" eliminates the symptoms being seen.
 
Without those factors, an nfs client hang is more likely a network issue.  In that case, a kdump is unlikely to be helpful.  Instead, network communication should be analyzed for problems.