Environment
Novell Cluster Services
Novell Open Enterprise Server 11 (OES 11) Linux Support Pack 2
Situation
Customer has a Novell Open Enterprise Server cluster consisting of 12 nodes, and observed a hard lockup on one of them.
The crash that was observed had the following characteristics:
[1377559.286998] CLUSTER-<FATAL>-<6122>: SbdInitRecordInfo: I/O failed, mmStatus = FFFFFFFFFFFFFFFC.

and

[1377581.413811] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[1377581.413824] Pid: 0, comm: swapper Tainted: GF 3.0.101-0.29-default #1
[1377581.413830] Call Trace:
[1377581.413846] [<ffffffff81004935>] dump_trace+0x75/0x310
[1377581.413858] [<ffffffff8145e083>] dump_stack+0x69/0x6f
[1377581.413866] [<ffffffff8145e11c>] panic+0x93/0x201
[1377581.413875] [<ffffffff810c63e4>] watchdog_overflow_callback+0xb4/0xc0
[1377581.413884] [<ffffffff810f2b9a>] __perf_event_overflow+0xaa/0x230
[1377581.413895] [<ffffffff81018210>] intel_pmu_handle_irq+0x1a0/0x330
[1377581.413905] [<ffffffff81462af1>] perf_event_nmi_handler+0x31/0xa0
[1377581.413914] [<ffffffff81464c47>] notifier_call_chain+0x37/0x70
[1377581.413922] [<ffffffff81464c8d>] __atomic_notifier_call_chain+0xd/0x20
[1377581.413929] [<ffffffff81464cdd>] notify_die+0x2d/0x40
[1377581.413936] [<ffffffff81462137>] default_do_nmi+0x37/0x200
[1377581.413943] [<ffffffff81462368>] do_nmi+0x68/0x80
[1377581.413950] [<ffffffff814618b1>] restart_nmi+0x1e/0x2e
Resolution
As the "CLUSTER-<FATAL>-<6122>: SbdInitRecordInfo: I/O failed, mmStatus = FFFFFFFFFFFFFFFC." error logged shortly before the hard lockup shows, there was already a problem accessing the SBD partition.
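The logged mmStatus value is most readily understood as a signed 64-bit error code: interpreted as two's complement, FFFFFFFFFFFFFFFC is simply -4. A minimal sketch of that interpretation (the assumption that the status field is a signed two's-complement code is ours, not something stated in the log):

```python
# Interpret the logged mmStatus as a signed 64-bit integer.
# Assumption: the Media Manager status field is a two's-complement
# error code; the constant below is copied from the log line.
MM_STATUS = 0xFFFFFFFFFFFFFFFC

def to_signed64(value: int) -> int:
    """Return the signed interpretation of an unsigned 64-bit value."""
    return value - (1 << 64) if value & (1 << 63) else value

print(to_signed64(MM_STATUS))  # prints -4
```

A small negative value like -4 is consistent with a generic I/O failure status rather than random memory corruption, which supports the conclusion that SBD access itself was failing.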
As described in TID 7015908, a crash in a similar code area was resolved earlier; however, that fix had not yet been applied on this node. It was decided to apply this solution first.
Cause
The root cause of this particular crash is suspected to lie in a number of customized scripts that were run to gather system statistics, each of which was individually able to lock the cluster.
Additional Information
The full backtrace from the core dump:
PID: 0 TASK: ffffffff81a11020 CPU: 0 COMMAND: "swapper"
#0 [ffff88026fc07b20] machine_kexec at ffffffff8102bf7e
#1 [ffff88026fc07b70] crash_kexec at ffffffff810abd4a
#2 [ffff88026fc07c40] panic at ffffffff8145e153
#3 [ffff88026fc07cc0] watchdog_overflow_callback at ffffffff810c63e4
#4 [ffff88026fc07cd0] __perf_event_overflow at ffffffff810f2b9a
#5 [ffff88026fc07d50] intel_pmu_handle_irq at ffffffff81018210
#6 [ffff88026fc07e40] perf_event_nmi_handler at ffffffff81462af1
#7 [ffff88026fc07e50] notifier_call_chain at ffffffff81464c47
#8 [ffff88026fc07e80] __atomic_notifier_call_chain at ffffffff81464c8d
#9 [ffff88026fc07e90] notify_die at ffffffff81464cdd
#10 [ffff88026fc07ec0] default_do_nmi at ffffffff81462137
#11 [ffff88026fc07ee0] do_nmi at ffffffff81462368
#12 [ffff88026fc07ef0] restart_nmi at ffffffff814618b1
[exception RIP: _raw_spin_lock_irqsave+32]
RIP: ffffffff81460ed0 RSP: ffff88026fc03dd8 RFLAGS: 00000016
RAX: 0000000000000282 RBX: ffff88000a969a10 RCX: 0000000000009a30
RDX: 0000000000000a96 RSI: 0000000000000282 RDI: ffff88000a969a10
RBP: ffff88024e623380 R8: 0000000000000000 R9: 0000000114908903
R10: 0000000114908903 R11: ffffffffa0455180 R12: ffff880245ef0000
R13: 0000000000000000 R14: ffff880245ef01e8 R15: ffff880253915e48
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#13 [ffff88026fc03dd8] _raw_spin_lock_irqsave at ffffffff81460ed0
#14 [ffff88026fc03dd8] up at ffffffff810881f6
#15 [ffff88026fc03df8] MM_ObjectBlockingIOCallback at ffffffffa04551a7 [sbdlib]
#16 [ffff88026fc03e18] retry_callback at ffffffffa046a57e [nwraid]
#17 [ffff88026fc03e48] blk_update_request at ffffffff81226747
#18 [ffff88026fc03e98] dm_done at ffffffffa005cb43 [dm_mod]
#19 [ffff88026fc03ed8] blk_done_softirq at ffffffff8122dfb5
#20 [ffff88026fc03ef8] __do_softirq at ffffffff8106770f
#21 [ffff88026fc03f68] call_softirq at ffffffff81469fdc
#22 [ffff88026fc03f80] do_softirq at ffffffff81004435
#23 [ffff88026fc03fb0] call_function_single_interrupt at ffffffff81469c33
--- <IRQ stack> ---
#24 [ffffffff81a01e28] call_function_single_interrupt at ffffffff81469c33
[exception RIP: tick_nohz_stop_sched_tick+53]
RIP: ffffffff81094a55 RSP: ffffffff81a01ed8 RFLAGS: 00000286
RAX: ffff88026fc0cc80 RBX: ffffffff81a14980 RCX: 000000000000eec9
RDX: ffffffff81a00010 RSI: 0000000000000206 RDI: 0000000000000001
RBP: ffffffff81bd3bc0 R8: ffffffff81a00000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: ffffffff81469c2e
R13: ffffffff81a11020 R14: 0000000000011800 R15: ffffffff81a01fd8
ORIG_RAX: ffffffffffffff04 CS: 0010 SS: 0018
#25 [ffffffff81a01f30] cpu_idle at ffffffff810020f2