Novell NCS cluster node hard lockup in _raw_spin_lock_irqsave().

  • 7015910
  • 19-Nov-2014
  • 04-Dec-2014

Environment

Novell Cluster Services
Novell Open Enterprise Server 11 (OES 11) Linux Support Pack 2

Situation

Customer has Novell Open Enterprise Server cluster consisting of 12 Nodes, and observed a hard lock-up on this one node.

The crash that was observed had the following characteristics :

[1377559.286998] CLUSTER-<FATAL>-<6122>: SbdInitRecordInfo: I/O failed, mmStatus = FFFFFFFFFFFFFFFC.
[1377581.413811] Kernel panic - not syncing: Watchdog detected hard LOCKUP on cpu 0
[1377581.413824] Pid: 0, comm: swapper Tainted: GF             3.0.101-0.29-default #1
[1377581.413830] Call Trace:
[1377581.413846]  [<ffffffff81004935>] dump_trace+0x75/0x310
[1377581.413858]  [<ffffffff8145e083>] dump_stack+0x69/0x6f
[1377581.413866]  [<ffffffff8145e11c>] panic+0x93/0x201
[1377581.413875]  [<ffffffff810c63e4>] watchdog_overflow_callback+0xb4/0xc0
[1377581.413884]  [<ffffffff810f2b9a>] __perf_event_overflow+0xaa/0x230
[1377581.413895]  [<ffffffff81018210>] intel_pmu_handle_irq+0x1a0/0x330
[1377581.413905]  [<ffffffff81462af1>] perf_event_nmi_handler+0x31/0xa0
[1377581.413914]  [<ffffffff81464c47>] notifier_call_chain+0x37/0x70
[1377581.413922]  [<ffffffff81464c8d>] __atomic_notifier_call_chain+0xd/0x20
[1377581.413929]  [<ffffffff81464cdd>] notify_die+0x2d/0x40
[1377581.413936]  [<ffffffff81462137>] default_do_nmi+0x37/0x200
[1377581.413943]  [<ffffffff81462368>] do_nmi+0x68/0x80
[1377581.413950]  [<ffffffff814618b1>] restart_nmi+0x1e/0x2e
and

Resolution

As per the "CLUSTER-<FATAL>-<6122>: SbdInitRecordInfo: I/O failed, mmStatus = FFFFFFFFFFFFFFFC." error that were logged shortly prior the hard-lock, there already was a problem accessing the SBD partition.

As written in TID 7015908
, a crash in similar code area was resolved here earlier, however this code was not yet applied on this node. It was decided that this solution to be applied first.

Cause

The suspect root cause leading up to this particular crash is suspected to be found within in a number of customized scripts that were run gathering system statistics, and all individually able to lock the cluster.

Additional Information

The full back trace of the core :

PID: 0      TASK: ffffffff81a11020  CPU: 0   COMMAND: "swapper"
 #0 [ffff88026fc07b20] machine_kexec at ffffffff8102bf7e
 #1 [ffff88026fc07b70] crash_kexec at ffffffff810abd4a
 #2 [ffff88026fc07c40] panic at ffffffff8145e153
 #3 [ffff88026fc07cc0] watchdog_overflow_callback at ffffffff810c63e4
 #4 [ffff88026fc07cd0] __perf_event_overflow at ffffffff810f2b9a
 #5 [ffff88026fc07d50] intel_pmu_handle_irq at ffffffff81018210
 #6 [ffff88026fc07e40] perf_event_nmi_handler at ffffffff81462af1
 #7 [ffff88026fc07e50] notifier_call_chain at ffffffff81464c47
 #8 [ffff88026fc07e80] __atomic_notifier_call_chain at ffffffff81464c8d
 #9 [ffff88026fc07e90] notify_die at ffffffff81464cdd
#10 [ffff88026fc07ec0] default_do_nmi at ffffffff81462137
#11 [ffff88026fc07ee0] do_nmi at ffffffff81462368
#12 [ffff88026fc07ef0] restart_nmi at ffffffff814618b1
    [exception RIP: _raw_spin_lock_irqsave+32]
    RIP: ffffffff81460ed0  RSP: ffff88026fc03dd8  RFLAGS: 00000016
    RAX: 0000000000000282  RBX: ffff88000a969a10  RCX: 0000000000009a30
    RDX: 0000000000000a96  RSI: 0000000000000282  RDI: ffff88000a969a10
    RBP: ffff88024e623380   R8: 0000000000000000   R9: 0000000114908903
    R10: 0000000114908903  R11: ffffffffa0455180  R12: ffff880245ef0000
    R13: 0000000000000000  R14: ffff880245ef01e8  R15: ffff880253915e48
    ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
--- <NMI exception stack> ---
#13 [ffff88026fc03dd8] _raw_spin_lock_irqsave at ffffffff81460ed0
#14 [ffff88026fc03dd8] up at ffffffff810881f6
#15 [ffff88026fc03df8] MM_ObjectBlockingIOCallback at ffffffffa04551a7 [sbdlib]
#16 [ffff88026fc03e18] retry_callback at ffffffffa046a57e [nwraid]
#17 [ffff88026fc03e48] blk_update_request at ffffffff81226747
#18 [ffff88026fc03e98] dm_done at ffffffffa005cb43 [dm_mod]
#19 [ffff88026fc03ed8] blk_done_softirq at ffffffff8122dfb5
#20 [ffff88026fc03ef8] __do_softirq at ffffffff8106770f
#21 [ffff88026fc03f68] call_softirq at ffffffff81469fdc
#22 [ffff88026fc03f80] do_softirq at ffffffff81004435
#23 [ffff88026fc03fb0] call_function_single_interrupt at ffffffff81469c33
--- <IRQ stack> ---
#24 [ffffffff81a01e28] call_function_single_interrupt at ffffffff81469c33
    [exception RIP: tick_nohz_stop_sched_tick+53]
    RIP: ffffffff81094a55  RSP: ffffffff81a01ed8  RFLAGS: 00000286
    RAX: ffff88026fc0cc80  RBX: ffffffff81a14980  RCX: 000000000000eec9
    RDX: ffffffff81a00010  RSI: 0000000000000206  RDI: 0000000000000001
    RBP: ffffffff81bd3bc0   R8: ffffffff81a00000   R9: 0000000000000000
    R10: 0000000000000000  R11: 0000000000000000  R12: ffffffff81469c2e
    R13: ffffffff81a11020  R14: 0000000000011800  R15: ffffffff81a01fd8
    ORIG_RAX: ffffffffffffff04  CS: 0010  SS: 0018
#25 [ffffffff81a01f30] cpu_idle at ffffffff810020f2