StoreOnce backups fail after media agent host OS upgrade to SLES 12.2

  • KM03157394
  • 08-May-2018
  • 08-May-2018

Summary

After media agent hosts were updated to SLES 12.2, the following error was observed in multiple user environments:

[Major] From: BMA@hostname "gateway_name [GW 65005:15:3711687428713648309]" Time: 25.10.2017 06:29:32
[90:51] \\storeonce\store_name\1c0112ac_59f01328_100d_d92b
Cannot write to device (StoreOnce error: An unspecified (internal) error occurred)

Error

The following errors were observed in multiple user environments after media agent hosts were updated to SLES 12.2:
 
[Major] From: BMA@hostname "gateway_name [GW 65005:15:3711687428713648309]" Time: 25.10.2017 06:29:32
[90:51] \\storeonce\store_name\1c0112ac_59f01328_100d_d92b
Cannot write to device (StoreOnce error: An unspecified (internal) error occurred)
 
The corresponding Catalyst client logs show the error as:
 
2017-10-25 04:29:32.545409 (local 06:29) : DEBUG : 31453_4160554816 : 1 : PHBW : BMA : osCltPln_WriteToHighBWWritePipeline : Ln 1999 : WRITEBYTESHIGHBW: DataBufferOffset: 1071997, dataToWriteToCurrentBuffer: 1025155 dataToWriteToNextBuffer: 23421
2017-10-25 04:29:32.545707 (local 06:29) : ERROR : 31453_4160554816 : 1 : PHBW : BMA : osCltThr_CreateThread : Ln 350 : Unable to create thread, error : 11.
2017-10-25 04:29:32.545728 (local 06:29) : ERROR : 31453_4160554816 : 1 : PHBW : BMA : osCltThr_CreateThread : Ln 352 : GOTO ReturnStatus = -1000 (OSCLT_ERR_INTERNAL_ERROR).
 

Cause

The Catalyst client fails to create a new thread. Error 11 in the log is EAGAIN on Linux, which means that a resource limit was reached.
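To confirm that a given set of Catalyst client logs shows the same failure, the numeric code in the "Unable to create thread" line can be translated to its symbolic name. The following is a minimal sketch, assuming the log line format shown in the Error section; the function name and the regular expression are illustrative and are not part of any product tooling.

```python
import errno
import re

# Pattern for the log line shown in the Error section (assumed format).
THREAD_ERROR = re.compile(r"Unable to create thread, error : (\d+)\.")


def thread_creation_errors(log_lines):
    """Return the symbolic errno name for each failed thread creation."""
    names = []
    for line in log_lines:
        match = THREAD_ERROR.search(line)
        if match:
            code = int(match.group(1))
            # errno.errorcode maps numbers to names for the platform
            # the script runs on, so run this on a Linux host.
            names.append(errno.errorcode.get(code, "UNKNOWN"))
    return names


sample = [
    "2017-10-25 04:29:32.545707 (local 06:29) : ERROR : 31453_4160554816 : 1 : "
    "PHBW : BMA : osCltThr_CreateThread : Ln 350 : Unable to create thread, error : 11.",
]
print(thread_creation_errors(sample))
```

On a Linux host, code 11 resolves to EAGAIN, which matches the cause described here.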
 
An explanation for this behavior is given in the SLES 12 SP2 release notes at https://www.suse.com/releasenotes/x86_64/SUSE-SLES/12-SP2/:
 
"The version of systemd shipped in SLES 12 SP2 uses the PIDs cgroup controller. This provides some per-service fork() bomb protection, leading to a safer system.
However, under certain circumstances you may notice regressions. The limits have already been raised above the upstream default values to avoid this but the risk remains.
If you notice regressions, you can change a number of TasksMax settings.
To control the default TasksMax= setting for services and scopes running on the system, use the system.conf setting DefaultTasksMax=. This setting defaults to 512, which means services that are not explicitly configured otherwise will only be able to create 512 processes or threads at maximum.
 
For thread- or process-heavy services, you may need to set a higher TasksMax value. In such cases, set TasksMax directly in the specific unit files. Either choose a numeric value or even infinity."
 
Because DP processes belong to the xinetd cgroup, they were affected by the system default limit of 512, which can be checked in the /sys/fs/cgroup/pids/system.slice/xinetd.service/pids.max file. To check whether this limit is causing the error in DP sessions, monitor the /sys/fs/cgroup/pids/system.slice/xinetd.service/pids.events file.
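The check described above can be scripted. The following is a minimal sketch, assuming the cgroup v1 pids controller layout at the path given above, where pids.max holds either a number or the word max, pids.current holds the current task count, and pids.events holds a line of the form "max N". The function name and the returned dictionary keys are illustrative.

```python
from pathlib import Path


def read_pids_status(cgroup_dir):
    """Read the limit, current usage and rejected-fork count of a pids cgroup."""
    base = Path(cgroup_dir)

    # pids.max contains a number, or the word "max" when there is no limit.
    raw_limit = (base / "pids.max").read_text().strip()
    limit = None if raw_limit == "max" else int(raw_limit)

    current = int((base / "pids.current").read_text().strip())

    # pids.events may be missing on kernels without the events patch.
    rejected = 0
    events = base / "pids.events"
    if events.exists():
        for line in events.read_text().splitlines():
            parts = line.split()
            if len(parts) == 2 and parts[0] == "max":
                rejected = int(parts[1])

    return {"limit": limit, "current": current, "rejected_forks": rejected}


XINETD_CGROUP = "/sys/fs/cgroup/pids/system.slice/xinetd.service"
if Path(XINETD_CGROUP, "pids.max").exists():
    print(read_pids_status(XINETD_CGROUP))
```

A rejected_forks value that grows between two runs while backups are failing suggests that the pids limit is the cause. A limit of None means that no limit is applied.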
 
From https://patchwork.kernel.org/patch/9189215/:
 
"This patch adds more visibility into the pids controller when the controller rejects a fork request. Whenever fork fails because the limit on the number of pids in the cgroup is reached, the controller will log this and also notify the newly added cgroups events file. The `max` key in the events file represents the number of times fork failed because of the pids controller. This change also adds an atomic boolean to prevent logging too much (e.g. a forkbomb). The message is logged once per cgroup until the next time the pids limit changes."
 
 

 

Fix

To avoid this problem, either increase the global limit (DefaultTasksMax=) in the /etc/systemd/system.conf file or, alternatively, increase only the xinetd limit by adding the following line to the [Service] section of the corresponding unit file, /usr/lib/systemd/system/xinetd.service:
 
TasksMax=infinity
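 
After editing the unit file, it can be useful to confirm that the setting sits in the [Service] section, because systemd reads TasksMax= only from there. The following is a minimal sketch of such a check. The function name is illustrative, the example unit text is hypothetical rather than the unit file shipped with SLES, and the parser is simplified (it does not handle line continuations).

```python
def tasks_max_in_service_section(unit_text):
    """Return the last TasksMax= value found in [Service], or None."""
    section = None
    value = None
    for raw in unit_text.splitlines():
        line = raw.strip()
        # Skip blank lines and comments.
        if not line or line.startswith(("#", ";")):
            continue
        # Track the section that the following keys belong to.
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1]
            continue
        key, sep, val = line.partition("=")
        if sep and section == "Service" and key.strip() == "TasksMax":
            value = val.strip()
    return value


# Hypothetical unit file content, for illustration only.
example_unit = """\
[Unit]
Description=Example xinetd unit

[Service]
ExecStart=/usr/sbin/xinetd -stayalive -dontfork
TasksMax=infinity
"""
print(tasks_max_in_service_section(example_unit))  # prints "infinity"
```

The new limit normally applies only after systemd has reloaded its unit definitions (systemctl daemon-reload) and xinetd has been restarted. The pids.max file of the xinetd cgroup should then read max instead of 512.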