Environment
Novell Open Enterprise Server 2 (OES 2) Linux
Novell Cluster Services (NCS)
Novell Storage Services (NSS)
Situation
In Open Enterprise Server 2 (Linux) environments, issues have been reported where, after a server reboot, NSS sometimes took several minutes to initialize properly.
Several symptoms were encountered:
- NSS pools cannot be activated, and volumes cannot be mounted.
- Clustering does not function due to the underlying NSS problems.
With ndpapp debugging [*1] enabled, the following can be found in /var/log/messages.
Server names and time stamps have been removed from the excerpt below:
stat of /dev/ndp at time tv_sec:1253868578 tv_usec:623707 errno=2 failed
ndpapp[9496]: stat of /dev/ndp at time tv_sec:1253868578 tv_usec:727708 errno=2 failed
ndpapp[9496]: stat of /dev/ndp at time tv_sec:1253868578 tv_usec:831713 errno=2 failed
ndpapp[9496]: stat of /dev/ndp at time tv_sec:1253868578 tv_usec:935719 errno=2 failed
ndpapp[9496]: stat of /dev/ndp at time tv_sec:1253868579 tv_usec:39721 errno=2 failed
ndpapp[9496]: stat: /dev/ndp: 9: Bad file descriptor
The "Bad file descriptor" error shown above indicates that, after a 10 minute time-out, the ndp module has killed itself and is not available in user space.
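The retry pattern in the excerpt can be read off programmatically. The following is a minimal sketch, not a Novell tool: the function names are hypothetical, and the regular expression assumes the line format shown in the excerpt. It converts each tv_sec/tv_usec pair into a single timestamp, computes the gap between consecutive stat attempts (roughly 104 ms for the lines above), and looks up the symbolic name for errno 2, which on Linux is ENOENT ("No such file or directory"), meaning the /dev/ndp node did not exist yet at that moment.

```python
import errno
import re

# Matches the retry lines in the excerpt; the format is assumed from the log above.
STAT_LINE = re.compile(r"tv_sec:(\d+) tv_usec:(\d+) errno=(\d+) failed")


def parse_stat_failures(log_text):
    """Return a list of (timestamp_in_microseconds, errno) for each retry line."""
    entries = []
    for line in log_text.splitlines():
        match = STAT_LINE.search(line)
        if match:
            sec, usec, err = (int(group) for group in match.groups())
            entries.append((sec * 1_000_000 + usec, err))
    return entries


def retry_intervals_ms(entries):
    """Return the gaps between consecutive attempts, in milliseconds."""
    stamps = [stamp for stamp, _ in entries]
    return [(later - earlier) / 1000 for earlier, later in zip(stamps, stamps[1:])]


EXCERPT = """\
stat of /dev/ndp at time tv_sec:1253868578 tv_usec:623707 errno=2 failed
ndpapp[9496]: stat of /dev/ndp at time tv_sec:1253868578 tv_usec:727708 errno=2 failed
ndpapp[9496]: stat of /dev/ndp at time tv_sec:1253868578 tv_usec:831713 errno=2 failed
ndpapp[9496]: stat of /dev/ndp at time tv_sec:1253868578 tv_usec:935719 errno=2 failed
ndpapp[9496]: stat of /dev/ndp at time tv_sec:1253868579 tv_usec:39721 errno=2 failed
ndpapp[9496]: stat: /dev/ndp: 9: Bad file descriptor
"""

if __name__ == "__main__":
    entries = parse_stat_failures(EXCERPT)
    print("gaps in ms:", retry_intervals_ms(entries))
    print("errno name:", errno.errorcode[entries[0][1]])
```

The final "Bad file descriptor" line does not match the pattern and is skipped on purpose, since it carries no timestamp.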
As described in section 29.8 of the NSS File System and Administration Guide:
NSS requires the NDP user space module (ndpapp) to be loaded and running when NSS starts. If ndpapp is not running, modules in NSS that attempt eDirectory operations fail and prevent NSS from loading.
In some environments, when the NDP module (ndpmod) attempts to register the /dev/ndp device, the kernel routine misc_register() registers the device inside the kernel, but does not make it available in user space until about 17 seconds later. Because of the delay, the NDP user space module kills itself for about 10 seconds. NSS cannot start until ndpapp reloads itself.
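To see how long the device node actually takes to appear on a given server, one option is to poll for it and measure the wait. The sketch below is only an illustration of that idea and is not how ndpapp itself is implemented. The helper wait_for_path, its default time-out, and its polling interval are all hypothetical, and the script assumes a Python 3 interpreter is available on the host, which may not be the case on older distributions.

```python
import os
import time


def wait_for_path(path, timeout_s=30.0, interval_s=0.1):
    """Poll until `path` exists; return seconds waited, or None on time-out."""
    start = time.monotonic()
    while True:
        if os.path.exists(path):
            return time.monotonic() - start
        if time.monotonic() - start >= timeout_s:
            return None
        time.sleep(interval_s)


if __name__ == "__main__":
    # /dev/ndp is the device named in the text; the 30 s limit is an arbitrary choice.
    waited = wait_for_path("/dev/ndp")
    if waited is None:
        print("/dev/ndp did not appear within the time-out")
    else:
        print("/dev/ndp appeared after %.1f s" % waited)
```

Run shortly after the ndp kernel module loads, the reported wait gives a rough figure to compare against the delay described above.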
Resolution
It was determined that the default maximum number of child processes that udev can fork at one time, as specified in /etc/sysconfig/udev, was not sufficient for the server to accommodate all resources at startup in this environment.
As the 'root' user, modifying the following defaults as shown below resolved the issue:
UDEVD_MAX_CHILDS=1024 (Default is 64)
UDEVD_MAX_CHILDS_RUNNING=1024 (Default is 16)
Please be aware that setting these parameters to inappropriate or unrealistic values may have an adverse effect.
Therefore, to calculate the value that best suits your environment, please use the following formula:
(128 + (125 * GB RAM)) [*2]
The value calculated this way can safely be used to configure your server.
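As a worked example of the formula, the short sketch below evaluates it for a few memory sizes. The function name udev_max_childs and the sample sizes are illustrative assumptions. The text does not state how fractional gigabytes should be rounded, so whole numbers are used here.

```python
def udev_max_childs(ram_gb):
    """Apply the formula 128 + (125 * GB RAM) from the text above."""
    return 128 + 125 * ram_gb


if __name__ == "__main__":
    # Sample memory sizes chosen for illustration only.
    for ram_gb in (2, 4, 8):
        print("%d GB RAM -> %d" % (ram_gb, udev_max_childs(ram_gb)))
```

For a server with 4 GB of RAM, for instance, the formula gives 128 + 500 = 628.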
The server must be rebooted for the changes to take effect.
If there are still problems with NSS loading after modifying the udev parameters, it may help to modify the /etc/sysconfig/boot file and set RUN_PARALLEL="no".
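Before and after editing, it can be useful to confirm which values are actually set. The following is a minimal sketch that parses sysconfig-style KEY=VALUE text and reports which of the two udev limits fall below a target value. The helper names are hypothetical, and the sample text is a made-up excerpt built from the defaults quoted above, not a copy of a real /etc/sysconfig/udev file. The target of 628 is simply the formula's result for 4 GB of RAM.

```python
import shlex


def read_sysconfig(text):
    """Parse KEY=VALUE lines from sysconfig-style text into a dict of strings."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, raw = line.partition("=")
        # shlex strips surrounding quotes and any trailing comment.
        parts = shlex.split(raw, comments=True)
        settings[key.strip()] = parts[0] if parts else ""
    return settings


def below_target(settings, target):
    """Return the udev limit keys that are missing or lower than `target`."""
    keys = ("UDEVD_MAX_CHILDS", "UDEVD_MAX_CHILDS_RUNNING")
    return [key for key in keys if int(settings.get(key) or 0) < target]


# Hypothetical excerpt using the default values quoted in this document.
UDEV_SAMPLE = """\
# udev limits (illustrative excerpt)
UDEVD_MAX_CHILDS=64
UDEVD_MAX_CHILDS_RUNNING=16
"""

if __name__ == "__main__":
    settings = read_sysconfig(UDEV_SAMPLE)
    print("below target:", below_target(settings, 628))
    print(read_sysconfig('RUN_PARALLEL="no"'))
```

To check a live system, the same functions could be given the contents of /etc/sysconfig/udev and /etc/sysconfig/boot read with open(), assuming those files follow the plain KEY=VALUE layout.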
Additional Information
[*1] ndpapp debugging
For instructions on how to enable and disable ndpapp and ndpmod debugging, please refer to section 30.6.4, "Logging Communication between NSS and eDirectory, NICI, or Linux User Management", of the OES 2 NSS File System Administration Guide.
[*2] udev defaults
For SLE10, the udev startup defaults are specified in the /etc/sysconfig/udev file.