Master IP Address Resource goes comatose on failover

  • 7007086
  • 21-Oct-2010
  • 27-Apr-2012

Environment

Novell Cluster Services
Novell eDirectory
Novell NetWare
Novell Open Enterprise Server 2 (OES 2) Linux
Novell Open Enterprise Server (NetWare 6.5)
Slow LDAP
Slow ldapsearch
Slow ndsrepair -T
Master IP Address Resource
Master_IP_Address_Resource

Situation

The Master_IP_Address_Resource would intermittently go comatose on the new node after being failed over.

Troubleshooting showed that other tools that rely on LDAP, such as ndsrepair and ldapsearch, were also failing or running slowly.

Resolution

To exclude or include NCP over specific interfaces on NetWare eDirectory servers:
  • SET NCP Exclude IP Addresses = xxx.xxx.xxx.xxx
or
  • SET NCP Include IP Addresses = yyy.yyy.yyy.yyy
To include NCP over specific interfaces on Linux/Unix/Solaris eDirectory servers:
  • Edit /etc/opt/novell/eDirectory/conf/nds.conf
    • n4u.server.interfaces=xxx.xxx.xxx.xxx
To exclude NCP over specific interfaces on Windows eDirectory servers:
eDirectory may need to be restarted for this to take effect.
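As a quick check on a Linux node, the sketch below reads nds.conf-style text and reports what is set for n4u.server.interfaces. It is only an illustration: the helper name configured_interfaces is hypothetical, and the parsing assumes simple key=value lines with comma-separated values, which may not match every eDirectory version.

```python
def configured_interfaces(conf_text):
    """Return the values set for n4u.server.interfaces in nds.conf-style text.

    Assumption: the file uses plain key=value lines and separates
    multiple interface values with commas.
    """
    for line in conf_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, value = line.partition("=")
        if sep and key.strip() == "n4u.server.interfaces":
            return [v.strip() for v in value.split(",") if v.strip()]
    return []  # parameter not present


# Placeholder content; on a real server read
# /etc/opt/novell/eDirectory/conf/nds.conf instead.
sample = "other.setting=value\nn4u.server.interfaces=xxx.xxx.xxx.xxx\n"
print(configured_interfaces(sample))
```

An empty list means the parameter is not set in the text that was checked.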

Additional Information

The NetWare eDirectory servers each had two network interfaces, one of which was bound but unused. The addresses of the unused interfaces were being returned to the cluster nodes in the referral list. When a cluster node tried to follow a referral to one of these addresses, the connection went unanswered and took 30 seconds to time out.
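One way to find which referral addresses go unanswered is to attempt a TCP connection to each one on the NCP port. The following is a minimal sketch, not a Novell tool: the function names probe and unreachable and the 5-second default timeout are assumptions chosen for illustration, and port 524 is taken from the referral entries in the trace.

```python
import socket

NCP_PORT = 524  # port shown in the referral entries, e.g. tcp:10.50.238.83:524


def probe(address, port=NCP_PORT, timeout=5.0):
    """Return True if a TCP connection to address:port succeeds within timeout."""
    try:
        with socket.create_connection((address, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False


def unreachable(addresses, port=NCP_PORT, timeout=5.0):
    """Return the subset of addresses that did not accept a connection."""
    return [a for a in addresses if not probe(a, port, timeout)]
```

Passing the addresses from a referral list to unreachable() from a cluster node gives the candidates to exclude from NCP on the eDirectory servers.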

Troubleshooting

ndstrace was enabled on the cluster nodes with the following switches:
  • ndstrace +tags +srch +time +areq +ldap +rslv
The trace showed that the LDAP DoBind and DoSearch requests were failing. For example:
1155705152 LDAP: [2010/10/20 16:43:47.536] (10.50.32.206:51542)(0x0001:0x60) DoBind on connection 0xe20f680
1155705152 LDAP: [2010/10/20 16:43:47.536] (10.50.32.206:51542)(0x0001:0x60) Bind name:cn=install,ou=adm,O=Novell, version:3, authentication:simple
1155705152 RSLV: [2010/10/20 16:43:47.536] Connect to tcp:10.50.32.206:524 succeeded
1155705152 RSLV: [2010/10/20 16:43:47.536] Begin-> DCResolveWithConstraint context = 0dac0004
1155705152 RSLV: [2010/10/20 16:43:47.536] Begin using RN cache \CN=install\OU=adm\O=Novell\test_Tree\
1155705152 RSLV: [2010/10/20 16:43:47.536] RootID is .Novell.test_Tree.
1155705152 RSLV: [2010/10/20 16:43:47.536] End using RN cache tag 4, succeeded
1155705152 RSLV: [2010/10/20 16:43:47.536] Starting to walk from initial connection
1155705152 RSLV: [2010/10/20 16:43:47.536] Resolving v3, \CN=install\OU=adm\O=Novell\test_Tree\
1155705152 AREQ: [2010/10/20 16:43:47.536] Calling DSAResolveName conn:5 for client .[Public].
1155705152 RSLV: [2010/10/20 16:43:47.536] Resolving \CN=install\OU=adm\O=Novell\test_Tree\, flags 00014004.
1155705152 RSLV: [2010/10/20 16:43:47.536] Responding with referrals.
1155705152 RSLV: [2010/10/20 16:43:47.536] Starting to process 6 received addresses:
1155705152 RSLV: [2010/10/20 16:43:47.536]       -> tcp:10.50.238.83:524 1350
1155705152 RSLV: [2010/10/20 16:43:47.536]       -> tcp:168.84.21.114:524 1350
1155705152 RSLV: [2010/10/20 16:43:47.536]       -> tcp:10.50.16.142:524 1350
1155705152 RSLV: [2010/10/20 16:43:47.536]       -> tcp:10.50.238.84:524 1350
1155705152 RSLV: [2010/10/20 16:43:47.536]       -> tcp:10.50.32.211:524 1350
1155705152 RSLV: [2010/10/20 16:43:47.536]       -> tcp:10.50.16.169:524 1350
1155705152 RSLV: [2010/10/20 16:43:47.536] (1)Trying to connect. tries = 1
The connection attempt to the first address in the referral list, 10.50.238.83, took 30 seconds to time out:
1155705152 RSLV: [2010/10/20 16:44:17.535] Connect to tcp:10.50.238.83:524 failed, connection timed out (-748)
1155705152 RSLV: [2010/10/20 16:44:17.535] TryConnection returning -748
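The delay can be measured directly from the trace timestamps. The sketch below assumes the bracketed timestamp format shown in the lines above and matches on the "Trying to connect" and "Connect to ... failed" text. The helper names stamp and connect_delay are hypothetical and not part of ndstrace.

```python
import re
from datetime import datetime

# Matches the bracketed timestamp in a trace line, e.g. [2010/10/20 16:43:47.536]
STAMP = re.compile(r"\[(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}\.\d{3})\]")


def stamp(line):
    """Return the timestamp of a trace line, or None if it has none."""
    m = STAMP.search(line)
    return datetime.strptime(m.group(1), "%Y/%m/%d %H:%M:%S.%f") if m else None


def connect_delay(lines):
    """Seconds from the first 'Trying to connect' line to the next failed connect."""
    start = None
    for line in lines:
        t = stamp(line)
        if t is None:
            continue
        if start is None and "Trying to connect" in line:
            start = t
        elif start is not None and "Connect to" in line and "failed" in line:
            return (t - start).total_seconds()
    return None  # no matching pair of lines found


trace = [
    "1155705152 RSLV: [2010/10/20 16:43:47.536] (1)Trying to connect. tries = 1",
    "1155705152 RSLV: [2010/10/20 16:44:17.535] Connect to tcp:10.50.238.83:524 failed, connection timed out (-748)",
]
print(connect_delay(trace))  # 29.999
```

A delay of roughly 30 seconds on the first referral address matches the timeout described under Additional Information.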