Servers may become non-responsive and/or see elevated ndsd CPU utilization.

  • 7011961
  • 15-Mar-2013
  • 24-Jul-2013

Environment

Novell Open Enterprise Server 2 (OES 2)
Novell Open Enterprise Server 11 (OES 11)
Novell Storage Services (NSS) on Linux
NetIQ eDirectory (on Linux)

Situation

Some eDirectory servers (OES or non-OES) sporadically become non-responsive (aka "lock up"). A packet trace and/or ndstrace shows many name resolve requests and -603 errors when seeking the uidNumber attribute.

Resolution

An update to ncpserver is available for OES 2 (SP3) and OES 11. It enables configuration of:
  • whether LUM UID information is sought, and
  • if so, how frequently it is sought

This behavior is configured via two set parameters in ncpcon:

  1. UID_UPDATE_ENABLED=[0-2, default=1]
    0 = uid update is off
    1 = uid update is done periodically
    2 = uid update is triggered for immediate update and then set off
    e.g. ncpcon set UID_UPDATE_ENABLED=1

  2. UID_UPDATE_PERIOD=[# of hours, 0.5 or greater, default=0.5]
    Note: this requires UID_UPDATE_ENABLED=1

See "man ncpcon" or the Novell documentation for further information.

NOTE: The default settings above keep the code working as it always has, so they *WILL* need to be modified to correct this problem. If trustee assignments for LUM-enabled users are updated infrequently, these are good starting values for the parameters:
ncpcon set UID_UPDATE_ENABLED=1
ncpcon set UID_UPDATE_PERIOD=24

If you find that trustee assignments (aka Rights) aren't available in a timely manner, then decrease the UID_UPDATE_PERIOD parameter.

Cause

As indicated above, traces show many name resolve requests for objects and their uidNumber attribute, and these correlate to NSS trustee assignments. Each request requires an ndsd thread on the eDirectory server that services it (in order to perform the search for the attribute). If a server does not hold a replica of an object (i.e. an ExRef object/server), it forwards the request to a replica-holding server. This uses an ndsd thread on both of the servers involved.

The target/replica-holding servers process these requests as quickly as they can. However, if too many requests are received, the server can run out of available ndsd threads (default tuning=128, max=512). At this point, the requesting server receives a “server busy” response (seen in a LAN trace) and retries the requests after a slight delay. Any additional requests are queued on the requesting server. The number of queued requests is displayed in the "ncpcon threads" output under the Async section → Number of Queued Requests.
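
The thread exhaustion described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not how ndsd actually schedules work: the function name dispatch is hypothetical, and every request is assumed to hold exactly one thread for the whole burst. Only the 128-thread default and 512-thread maximum come from the text.

```python
def dispatch(incoming, free_threads=128):
    """Split a burst of requests into served and "server busy" counts.

    Toy model (assumption): each request occupies exactly one ndsd thread
    for the duration of the burst, so anything beyond the free thread
    count is answered with "server busy" and must be retried later.
    """
    served = min(incoming, free_threads)
    busy = incoming - served
    return served, busy


if __name__ == "__main__":
    for burst in (100, 128, 1000):
        served, busy = dispatch(burst)
        print(f"burst={burst}: served={served}, busy/retried={busy}")
```

In this model, raising the thread count toward the maximum only moves the point at which "server busy" responses begin; reducing the number of uidNumber lookups removes the burst itself.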

Additional Information

Further info on new settings:
The UID_UPDATE_PERIOD setting is re-read every time the current UID_UPDATE_PERIOD expires. So if you change it from the default of 0.5 to 24, the new value of 24 only takes effect *after* the 0.5-hour period has expired and the update is triggered.
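
The timing of a period change can be worked through with a small sketch. It assumes the timer starts at hour 0 with the initial period already in force, and that the setting is re-read only at each expiry; the function name update_times and its parameters are hypothetical and exist only for illustration.

```python
def update_times(initial_period, new_period, change_at, horizon):
    """Return the hours at which UID updates fire, up to `horizon`.

    Assumption: the period in force was read at hour 0, and the configured
    value is re-read only when the current period expires.
    """
    times = []
    next_fire = initial_period
    while next_fire <= horizon:
        times.append(next_fire)
        # Re-read the setting now, at expiry, and not before.
        period = new_period if change_at <= next_fire else initial_period
        next_fire += period
    return times


if __name__ == "__main__":
    # Changing 0.5 -> 24 shortly after start: one more update fires at 0.5 h.
    print(update_times(0.5, 24, change_at=0.1, horizon=50))
    # Changing 24 -> 0.5 after one hour: nothing fires until hour 24.
    print(update_times(24, 0.5, change_at=1, horizon=25))
```

Under these assumptions, lowering the period from a large value is the slower case: the shorter period is not picked up until the long period currently running has expired.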

Background Information:
This situation is most frequently seen on ExRef servers that do not hold replicas of any objects.  As such, these servers need to get eDirectory/ndsd information from a replica holder. Some telltale signs of this issue are:

  • LAN traces show thousands of resolve name requests

  • ndstrace/dstrace logs with +RSLV +AREQ on the target server show a resolve name for objects, followed by a search for the uidNumber attribute. A large number of these result in a -603 (NO_SUCH_ATTRIBUTE) because the user is not LUM enabled. In some cases, the proportion of -603 returns can exceed 99%.

  • The uidNumber requests are generated from servers with NSS volumes, where NSS trustee assignments are made either:

    • directly to a user
    • to a group that a user is a member of
    • to one eDirectory object, after which another user/group is made security equal to the original object
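
To judge whether a trace matches the -603 telltale sign above, the share of NO_SUCH_ATTRIBUTE results can be computed once the result codes of the uidNumber lookups have been collected. The sketch below assumes the codes are already available as a list of integers; it does not parse ndstrace output, and the function name is hypothetical.

```python
NO_SUCH_ATTRIBUTE = -603  # eDirectory error code named in the trace output


def no_such_attribute_ratio(result_codes):
    """Return the fraction of lookups that ended in -603.

    `result_codes` is assumed to be a list of integer result codes gathered
    by hand or by a site-specific script; 0 stands for success here.
    """
    if not result_codes:
        return 0.0
    misses = sum(1 for code in result_codes if code == NO_SUCH_ATTRIBUTE)
    return misses / len(result_codes)


if __name__ == "__main__":
    # Illustrative sample only: 199 misses and 1 successful lookup.
    sample = [NO_SUCH_ATTRIBUTE] * 199 + [0]
    print(f"{no_such_attribute_ratio(sample):.1%} of lookups returned -603")
```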