GroupWise cluster resource goes comatose on OES 2018 SP1 target node

  • 7024514
  • 30-Mar-2020
  • 16-Jul-2021

Environment

Open Enterprise Server 2018 SP1 (OES 2018 SP1) Linux
GroupWise 18.2

Situation

When migrating the cluster resource of a GroupWise Post Office to an OES 2018 SP1 cluster node that is already running several other cluster-enabled GroupWise post offices (gwpoa) and a GroupWise domain (gwmta), the resource may go comatose on the target node with errors like the following in /var/opt/novell/log/<RESOURCE>.load.out:

/opt/novell/groupwise/admin/gwadminutil: fork: retry: No child processes
[0.003s][warning][os,thread] Failed to start thread - pthread_create failed (EAGAIN) for attributes: stacksize: 1024k, guardsize: 4k, detached.
Error occurred during initialization of VM
Could not create ConcurrentMarkThread
/opt/novell/ncs/lib/ncsfuncs: fork: retry: No child processes
/opt/novell/ncs/lib/ncsfuncs: fork: retry: Resource temporarily unavailable
/etc/init.d/grpwise: fork: retry: No child processes

/etc/init.d/grpwise: fork: retry: Resource temporarily unavailable

You may also see the following messages from "ncs-resourced" in /var/log/messages of the target node:


# There is insufficient memory for the Java Runtime Environment to continue.
# Cannot create worker GC thread. Out of system resources.
# An error report file with more information is saved as:
# //hs_err_pid16052.log
+ local rc=1


the "fork: retry: No child processes" and "fork: retry: Resource temporarily unavailable" are being logged repetitively.

Resolution

1) Increase the value of the systemd parameter "DefaultTasksMax" in /etc/systemd/system.conf from the default of 512 to the value of "max user processes" (see the output of 'ulimit -a | grep "max user processes"' or 'ulimit -u'), using a plain-text editor such as "vi".

I recommend making a backup of /etc/systemd/system.conf before you edit it, for example:

# cp --preserve=all /etc/systemd/system.conf /root

Before the change the parameter configuration line reads:

"
#DefaultTasksMax=512
"

After the change, it would read as follows (assuming the command "ulimit -u" returns 30321):

"
DefaultTasksMax=30321
"

Please note that I removed the hash/comment character ("#") in front of the configuration parameter and changed the value from "512" to the value returned by the command "ulimit -u". The returned "max user processes" value depends on the server's capacity.
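
If you prefer a non-interactive edit, a sed one-liner such as the one below could be used instead of "vi". This is only a sketch; it assumes the commented default line is present exactly as shown above, and the file should still be backed up first:

# sed -i "s/^#DefaultTasksMax=512$/DefaultTasksMax=$(ulimit -u)/" /etc/systemd/system.conf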

2) Once you have saved the change, run the following command for it to take effect:

# systemctl daemon-reload
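
To verify that systemd has picked up the new value, you can query the corresponding manager property; with the example above it should report "DefaultTasksMax=30321":

# systemctl show --property=DefaultTasksMax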

We also advise an extra step, because the above command does not always update the GroupWise scripts. Therefore, also reboot all cluster nodes to make sure the change takes effect:

cluster offline <RESOURCE_NAME>

then reboot the server / cluster node, and after that:

cluster online <RESOURCE_NAME> <NODE_NAME>

Also make sure that the resources come online on the node where the change was made.
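
To confirm this, the cluster command line can be used on that node, for example:

# cluster status

Check that <RESOURCE_NAME> is reported as running on the node where DefaultTasksMax was increased.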

This is a workaround for GroupWise 18.2 systems. In GroupWise 18.3.x these steps should not be necessary, because native support for systemd was added in the 18.3 code.

Cause

The default value of the systemd parameter "DefaultTasksMax" is 512. The messages "fork: retry: No child processes" and "fork: retry: Resource temporarily unavailable" occur when this value, or the value of "TasksMax" in the [Service] section of a service configuration file, is too low for a particular service.

The GroupWise agents do not have their own service configuration file and are therefore limited by the global systemd parameter DefaultTasksMax.
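
For comparison, a service that ships with its own systemd unit can be given an individual limit via "TasksMax" in the [Service] section of a drop-in file, for example /etc/systemd/system/example.service.d/override.conf (the unit name here is purely hypothetical and only illustrates the setting mentioned above); the GroupWise 18.2 agents, started through /etc/init.d/grpwise, have no such unit:

"
[Service]
TasksMax=30321
"

A 'systemctl daemon-reload' would again be required after creating such a drop-in file.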