Why are there so many LDAP queries being generated by a single server?

  • 7016619
  • 19-Jun-2015
  • 19-Jun-2015

Environment

Novell Open Enterprise Server 11 (OES 11) Linux
Novell Open Enterprise Server 2 (OES 2) Linux
Symantec AntiVirus 1.0.14-13

Situation

A single server was displaying very high utilization.  Upon investigation, it was determined that:
  • This server held a replica of the entire tree
  • Over 30 OES servers used this server as their LUM preferred-server.
ndstrace (with +tags +time +ldap) showed that once a minute, every server queried the preferred-server (set in /etc/nam.conf), looking for a specific uidNumber.  The set of queries would include:
  • query for Unix Workstation object (found)
  • query for each LUM-enabled group associated to the workstation (found)
  • sub-tree query from the base-name in nam.conf for a particular uidNumber (not found)
  • do the following for each LUM-enabled group associated to the Unix Workstation object
    • base query the group for members
    • base query each member to see if they have the specific uidNumber (not found)

The more LUM-enabled groups and users associated to the Unix Workstation, the greater the volume of queries per server.
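
To get a rough feel for how the volume grows, the query pattern listed above can be counted with a short sketch. It assumes one LDAP query per bullet item, and the function name and the group sizes are hypothetical; real counts depend on the actual tree and should be confirmed with ndstrace.

```python
def queries_per_cycle(group_member_counts):
    """Estimate LDAP queries one server issues per failed uidNumber lookup.

    group_member_counts: number of members in each LUM-enabled group
    associated with the Unix Workstation object (hypothetical input).
    Assumes one query per step in the pattern described above.
    """
    groups = len(group_member_counts)
    workstation_lookup = 1                    # Unix Workstation object
    group_lookups = groups                    # one per LUM-enabled group
    subtree_search = 1                        # sub-tree search for the uidNumber
    member_list_reads = groups                # base query of each group for members
    member_checks = sum(group_member_counts)  # base query of each member
    return (workstation_lookup + group_lookups + subtree_search
            + member_list_reads + member_checks)


if __name__ == "__main__":
    # Hypothetical example: 5 LUM-enabled groups with 20 members each.
    per_server = queries_per_cycle([20, 20, 20, 20, 20])
    print(per_server)        # queries per server per minute
    print(per_server * 30)   # the same pattern repeated by 30 servers
```

With these made-up numbers the estimate is 112 queries per server per minute, or 3360 per minute across 30 servers, all of which land on the one preferred-server.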

Note: for a useful ndstrace log, all LDAP information needs to be enabled.  To quickly set this on a server, run
      ldapconfig set "LDAP Screen Level = all"
When prompted for credentials, use ndsd format, for example admin.novell.

Resolution

A local user had a uid number greater than 65535 (the largest value an unsigned 16-bit integer can hold).  Decreasing this user's uid number to 65533 or lower resolved the issue.  Ensure that any files owned by the old uid number are changed to the new uid number; chown is a good tool for this.
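
As an illustration of the re-owning step, the following is a minimal sketch that walks a directory tree and reassigns files from the old uid number to the new one. The function names and the dry_run switch are hypothetical, it does not examine the starting directory itself, and actually changing ownership requires root. It is only one way to do what chown does and is not part of the original fix.

```python
import os


def paths_owned_by(root, uid):
    """Yield paths below root whose owner is uid (symlinks are not followed)."""
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            if os.lstat(path).st_uid == uid:
                yield path


def reassign_owner(root, old_uid, new_uid, dry_run=True):
    """Return the paths owned by old_uid; change their owner unless dry_run."""
    changed = []
    for path in paths_owned_by(root, old_uid):
        if not dry_run:
            # A gid of -1 leaves the group ownership unchanged.
            os.lchown(path, new_uid, -1)
        changed.append(path)
    return changed
```

Calling reassign_owner with dry_run=True first lists what would change, which makes it easy to review the affected paths before running it again with dry_run=False.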

Cause

Real-time virus scanning (rtvscand) was active on the server.  The *real* uidNumber associated with a user in /etc/passwd was 80000, which is 0x13880 in hex (17 bits long).  rtvscand was discarding everything above the lowest 16 bits and searching for 0x3880 in hex (or 14464 in decimal).
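
The arithmetic can be checked with a short sketch. It only reproduces the effect of keeping the lowest 16 bits; how rtvscand does this internally is not known, and the function name is hypothetical.

```python
MASK_16 = 0xFFFF  # keeps only the lowest 16 bits


def truncate_to_16_bits(uid):
    """Return the value left after discarding everything above bit 15."""
    return uid & MASK_16


if __name__ == "__main__":
    real_uid = 80000
    truncated = truncate_to_16_bits(real_uid)
    print(hex(real_uid), real_uid.bit_length())  # 0x13880 17
    print(hex(truncated), truncated)             # 0x3880 14464
```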

Additional Information

The only way to tell which process was making the call for the bad uidNumber was by trial and error.  The first step was to take the wrong uidNumber (14464) and add 65536 to it repeatedly until the result matched a uidNumber in use (compare against the output of getent passwd).  Here, 14464 + 65536 = 80000.
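
The same search can be done in one pass by checking which uid numbers in use have the wrong uidNumber as their lowest 16 bits. This is a sketch rather than the procedure actually used. The function name is hypothetical, and it reads accounts through Python's pwd module, which should roughly correspond to getent passwd on most systems.

```python
import pwd


def matching_uids(truncated_uid, uids, bits=16):
    """Return uids above the 16-bit range whose low bits equal truncated_uid."""
    mask = (1 << bits) - 1
    return sorted(u for u in uids if u > mask and (u & mask) == truncated_uid)


if __name__ == "__main__":
    # 14464 is the uidNumber seen in the LDAP queries.
    in_use = [entry.pw_uid for entry in pwd.getpwall()]
    for uid in matching_uids(14464, in_use):
        print(uid, pwd.getpwuid(uid).pw_name)
```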

Backtracking, we found that a service had created the user with that uid number.  Once we knew the service, we identified two processes that might look up this uidNumber: the service and rtvscand.  Stopping the service stopped the queries.  However, stopping rtvscand while letting the original service run also stopped the queries.

Running ltrace on each process showed that rtvscand was making the call for the improper uidNumber.