Environment
Novell Open Enterprise Server 11 (OES 11) Linux
Novell Open Enterprise Server 11 (OES 11) Linux Support Pack 1
Novell Cluster Services 1.8.4
Situation
This problem was found after a number of cluster resources had been created with a wrong name.
To correct this, the resource was deleted and the same cluster resource was re-created under a proper name, reusing the same cluster resource IP address. Afterwards, all cluster nodes other than the node running the Master_IP_Address resource crashed with the kernel panic listed below.
CRM debugging was then enabled by uncommenting the line that reads "# echo -n "TRACE CRM ON" > /proc/ncs/cluster" in the file '/opt/novell/ncs/bin/ldncs' and restarting Novell Cluster Services on all the nodes in the cluster. When the test was repeated, the following kernel panic was observed:
"Kernel panic - not syncing: NWCLSTR_Abend: CRM:CRMSetResource: conflicting resource name"
Excerpt from the core log file:
[106021.635444] NWCLSTR_Abend: CRM:CRMSetResource: conflicting resource name
[106021.635445]
[106021.635447] Kernel panic - not syncing: NWCLSTR_Abend: CRM:CRMSetResource: conflicting resource name
[106021.635449]
[106021.635449]
[106021.635460] Pid: 15613, comm: NCS_CRM_RECEIVE Tainted: G X 2.6.32.49-0.3-default #1
[106021.635462] Call Trace:
[106021.635478] [<ffffffff810061dc>] dump_trace+0x6c/0x2d0
[106021.635485] [<ffffffff8139aee6>] dump_stack+0x69/0x73
[106021.635489] [<ffffffff8139af68>] panic+0x78/0x19b
[106021.635504] [<ffffffffa096f88e>] NWCLSTR_Abend+0xae/0xb0 [clstrlib]
[106021.635526] [<ffffffffa0a3f33d>] CRMSetResource+0x3bd/0x550 [crm]
[106021.635534] [<ffffffffa0a3c21a>] CRMViaReceiveThread+0x13a/0x200 [crm]
[106021.635541] [<ffffffff810654b6>] kthread+0x96/0xa0
[106021.635545] [<ffffffff81003fba>] child_rip+0xa/0x20
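The debugging step described above (uncommenting the "TRACE CRM ON" line in ldncs) can also be scripted. The following is a minimal sketch, not a vendor-supplied procedure: the function name enable_crm_trace is hypothetical, it assumes GNU sed, and it assumes the commented line appears in the file exactly as quoted in this article.

```shell
#!/bin/sh
# Hypothetical helper: uncomment the CRM trace line in an ldncs-style script.
# Usage (as root): enable_crm_trace /opt/novell/ncs/bin/ldncs
enable_crm_trace() {
    file="$1"
    # Keep a backup copy next to the original before editing in place.
    cp -p "$file" "$file.bak" || return 1
    # Remove the leading "#" (and any spaces after it) from the
    # TRACE CRM ON line only; all other lines are left untouched.
    sed -i 's|^\([[:space:]]*\)#[[:space:]]*\(echo -n "TRACE CRM ON" > /proc/ncs/cluster\)|\1\2|' "$file"
}
# Novell Cluster Services must then be restarted on every node, for example
# with "rcnovell-ncs restart" (assumption: the usual OES init wrapper).
```

The backup file makes it easy to turn tracing off again once the core has been captured, by restoring the original ldncs and restarting Novell Cluster Services.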
Resolution
The problem has been identified in the Novell Cluster Services modules.
A solution is currently scheduled to be released in a future support pack.
Additional Information
Some other crashes (kernel panics) that were also observed during cluster testing have been identified as having the same root cause and are resolved by the same code change.
The following kernel panic has also been observed:
CRM core in "QueueResourceEventMesg: called back"
Excerpt from the core log file:
[69349.864641] 8b NCS_CRM_RES_T04 - CrmRMEGroupCheck: node 0, CLUS_POOL17_125_SERVER, failOver=00000011.
[69349.864641] 35 38 a1 00 00 85 f6 74 cc 48 63 eb 48 8b 14 ed a0 c9 bf a0 <8b> b2 68 01 00 00 89 f0 41 23 84 24 68 01 00 00 a9 00 00 0f 00
[69349.864641] RIP [<ffffffffa0bf234f>] CrmRMEGroupCheck+0x9f/0x1f0 [crm]
[69349.864641] RSP <ffff880361e3be30>
[69349.864641] CR2: 0000000000000168
The following kernel panic has also been observed:
CSS core in "CSS:CSS_InitCheckpoint: invalid app received"
Excerpt from the core log file:
[ 393.824639] Report taken: startTime= 1326352583, endTime= 1326352824
[ 393.824643] node=4, name=blr2-182-131, heartbeat=1, tolerance=8
[ 393.824644]
[ 393.824648]
[ 393.824649] node 4, name blr2-182-131, maxTicks=0.262sec at 1326352587, minTicks=8.000sec, lastTickDate 1326352824
[ 393.824651]
[ 393.824663]
[ 393.824664] node 31, name blr2-182-150, maxTicks=0.262sec at 1326352587, minTicks=8.000sec, lastTickDate 1326352824
[ 393.824666]
[ 393.824674]
[ 393.824675] SBD STATS: maxTicks=0.262sec at 1326352587, minTicks=8.000sec, lastTickDate 1326352824
[ 393.824676]
[ 393.824685] NWCLSTR_Abend: CSS:CSS_InitCheckpoint: invalid app received.
[ 393.824686]
[ 393.824689] Kernel panic - not syncing: NWCLSTR_Abend: CSS:CSS_InitCheckpoint: invalid app received.
[ 393.824690]
[ 393.824691]
[ 393.824701] Pid: 26205, comm: NCS_CSS_RECEIVE Tainted: G B X 2.6.32.49-0.3-default #1
[ 393.824704] Call Trace:
[ 393.824720] [<ffffffff810061dc>] dump_trace+0x6c/0x2d0
[ 393.824727] [<ffffffff8139aee6>] dump_stack+0x69/0x73
[ 393.824732] [<ffffffff8139af68>] panic+0x78/0x19b
[ 393.824746] [<ffffffffa095f88e>] NWCLSTR_Abend+0xae/0xb0 [clstrlib]
[ 393.824768] [<ffffffffa0580f8d>] CSSProcessDesc+0x91d/0x1930 [css]
[ 393.824775] [<ffffffffa0583cf6>] CSSViaReceiveThread+0x76/0x210 [css]
[ 393.824782] [<ffffffff810654b6>] kthread+0x96/0xa0
[ 393.824787] [<ffffffff81003fba>] child_rip+0xa/0x20
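To check whether a crash on another node matches one of the signatures quoted in this article, a saved kernel log can be searched for them. This is only a sketch: the function name find_ncs_panics is hypothetical, the patterns are copied from the excerpts above, and a match suggests the same symptom but does not by itself prove the same root cause.

```shell
#!/bin/sh
# Hypothetical helper: print (with line numbers) every line of a saved
# kernel log that matches a panic signature quoted in this article.
# Usage: find_ncs_panics /path/to/saved/kernel.log
find_ncs_panics() {
    grep -nE 'NWCLSTR_Abend: (CRM:CRMSetResource: conflicting resource name|CSS:CSS_InitCheckpoint: invalid app received)|CrmRMEGroupCheck\+0x' "$1"
}
```

The function returns a non-zero exit status when no signature is found, so it can be used directly in a shell conditional.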