Environment
Novell Open Enterprise Server 11 (OES 11) Linux Support Pack 2
Novell Cluster Services
Situation
On a large 12-node OES 11 SP2 NCS cluster with an SBD partition of 1 GB in size, containing a large number of cluster event log entries, the customer was searching for a specific series of events using the "cat /admin/Novell/Cluster/EventLog.xml" command.
Whilst quickly scrolling through a number of pages in the cluster event log, in order to get to the correct date and time for a particular event, the node crashed with the following error:
The ncs_check_fire message in the trace below would normally indicate a CPU clock jump triggering the node to crash, but the actual difference shown was negligible (0 seconds even, when converted to seconds), so this was determined not to be the reason for the crash.

[1811450.978550] CLUSTER-<FATAL>-<6121>: SbdReadRecord: I/O failed, mmStatus = FFFFFFFFFFFFFFFC.
[1811464.869551] ncs_check_fire: woke up on cpu 19/19 (current=4748673728, last=4748670065), gipc_tsc=4748670065, tsc_diff=3663(14652/14000 msecs).
[1811464.869559] Kernel panic - not syncing: ncs_timer is halting the machine.
[1811464.869561]
[1811464.869574] Pid: 0, comm: kworker/0:1 Tainted: GF N 3.0.101-0.29-default #1
[1811464.869583] Call Trace:
[1811464.869606] [<ffffffff81004935>] dump_trace+0x75/0x310
[1811464.869624] [<ffffffff8145e083>] dump_stack+0x69/0x6f
[1811464.869635] [<ffffffff8145e11c>] panic+0x93/0x201
[1811464.869653] [<ffffffffa04c30cb>] ncs_check_fire+0xcb/0x150 [ncs_timer]
[1811464.869681] [<ffffffff8106f4db>] call_timer_fn+0x6b/0x120
[1811464.869694] [<ffffffff810708f3>] run_timer_softirq+0x173/0x240
[1811464.869710] [<ffffffff8106770f>] __do_softirq+0x11f/0x260
[1811464.869724] [<ffffffff81469fdc>] call_softirq+0x1c/0x30
[1811464.869740] [<ffffffff81004435>] do_softirq+0x65/0xa0
[1811464.869750] [<ffffffff810674d5>] irq_exit+0xc5/0xe0
[1811464.869761] [<ffffffff81026588>] smp_apic_timer_interrupt+0x68/0xa0
[1811464.869773] [<ffffffff81469773>] apic_timer_interrupt+0x13/0x20
[1811464.869789] [<ffffffff8137a732>] poll_idle+0x32/0x70
[1811464.869801] [<ffffffff8137a9cb>] cpuidle_idle_call+0x11b/0x280
[1811464.869812] [<ffffffff81002126>] cpu_idle+0x66/0xb0
Resolution
A new sbdutil switch has been introduced going forward to display event log details, so that the event log details are read directly from the file.
This 'sbdutil -r' command will print cluster event log information, by default from oldest to newest.
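Assuming the new switch is available on the node, reading and filtering the log might look like the sketch below; the timestamp pattern used for filtering is an assumption for illustration only, as the actual entry format may differ.

```shell
# Print the cluster event log directly, oldest entries first,
# using the new 'sbdutil -r' switch (avoids reading via adminfs).
sbdutil -r

# Page through the output, or narrow it to a particular time window.
# The date string below is illustrative only, not a documented format.
sbdutil -r | less
sbdutil -r | grep '21 May 2014'
```

Reading the log this way avoids the costly per-entry adminfs path described in the Cause section below.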
Cause
The problem is a side effect of allowing NCS to create larger SBD partition sizes than it was initially designed for.
Accessing the SBD log through adminfs is a 'costly' operation, as each entry passes through the kernel multiple times.
When there are many log entries, continuously reading through the log can hog the CPU, starving the heartbeat process so that the NCS timer casts the node out of the cluster, which is what happened here.
The SbdReadRecord I/O error appears to have occurred only _after_ the node was already cast out, and was not the root cause of the initial crash.