OES11 installation fails during eDirectory configuration stage with error -626 (DSERR_ALL_REFERRALS_FAILED)

  • 7010690
  • 27-Aug-2012
  • 18-Jun-2013

Environment

Novell Open Enterprise Server 11 (OES 11) Linux
NetIQ eDirectory

Situation

An attempt to add a server to an existing tree during OES installation keeps failing with error -626 (ALL_REFERRALS_FAILED).
 
Errors/Symptoms:
 

All referrals failed.

ERROR -626: Setup for NDS installation failed.

Please make certain that you have provided the complete server and admin contexts

Resolution

The issue was resolved by ensuring that the server holding the master replica of the partition into which the new server was being added had an entry for the new server in its /etc/hosts file.
 
 

Cause

There was no entry in DNS for the new server.  When the installation prompted for an existing server in the tree, a main replica server was specified.  That server held the master replica of the tree root, but only a read/write replica of the partition where the new server was being added.  The master of root did have an entry for the new server in its /etc/hosts file; however, the install appears to contact the master of the target partition, and that server knew nothing about the new server.
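One way to check the condition described above is to confirm that the new server's name appears in the hosts file of the server holding the master replica. The following is a minimal sketch in Python, assuming the usual hosts-file layout (an address followed by one or more names, with `#` starting a comment). The function names and the host name `newoes11` are hypothetical placeholders, not anything from the original case.

```python
def hosts_entries(hosts_text):
    """Parse hosts-file text into a list of (address, [names]) tuples."""
    entries = []
    for line in hosts_text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        fields = line.split()
        # A usable entry needs an address plus at least one name.
        if len(fields) >= 2:
            entries.append((fields[0], fields[1:]))
    return entries


def has_host(hosts_text, name):
    """Return True if `name` (case-insensitive) is listed in the hosts text."""
    wanted = name.lower()
    for _address, names in hosts_entries(hosts_text):
        if wanted in (n.lower() for n in names):
            return True
    return False
```

To use it, read `/etc/hosts` on the partition master and pass its contents together with the new server's short or fully qualified name, for example `has_host(open("/etc/hosts").read(), "newoes11")`. This only inspects the hosts file; it does not tell you whether DNS would resolve the name.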

Additional Information

In order to troubleshoot this issue, we needed to enable ndstrace on the server being installed. 

ndstrace cannot run until the ndsd process is running, and ndsd is only started partway through the installation and configuration process, which normally makes it difficult to capture all of the relevant information from the start.  The easiest way to accomplish this is the ndsautotrace utility, which is part of the novell-NDSServ package and is located at /opt/novell/eDirectory/bin/ndsautotrace.  See TID 7012638 - What is ndsautotrace and how do I use it?


In the ndstrace output we could see the server walking the tree to reach the master of the partition where the server was being added.  Because there was no DNS entry for the server being installed and no entry for it in that master's /etc/hosts file, the result was a -626.

 

172.16.151.235 is master of root

192.168.1.200 is Master of o=test  (DSFW server, needs to be the master)

 

3263694592 RSLV: [2012/08/20 18:40:19.758] DEBUG: Connect to tcp:172.16.151.235:524 succeeded

3263694592 RSLV: [2012/08/20 18:40:19.775] DEBUG: Begin-> DCResolveWithConstraint context = 3cd30001

3263694592 RSLV: [2012/08/20 18:40:19.775] DEBUG: Starting to walk from initial connection

3263694592 RSLV: [2012/08/20 18:40:19.775] DEBUG: Resolving \test_TREE\o=test\cn=admin

3263694592 RSLV: [2012/08/20 18:40:19.793] DEBUG: ------> tag = 6

3263694592 RSLV: [2012/08/20 18:40:19.793] DEBUG: ------> id = 0000802B

3263694592 RSLV: [2012/08/20 18:40:19.793] DEBUG: End---> DCResolveWithConstraint err = 0

3263694592 RSLV: [2012/08/20 18:40:19.987] DEBUG: Connect to tcp:172.16.151.235:524 succeeded

3263694592 RSLV: [2012/08/20 18:40:20.23] DEBUG: Begin-> DCResolveWithConstraint context = 3cd30002

3263694592 RSLV: [2012/08/20 18:40:20.23] DEBUG: Starting to walk from initial connection

3263694592 RSLV: [2012/08/20 18:40:20.23] DEBUG: Resolving v2, non-text

3263694592 RSLV: [2012/08/20 18:40:20.42] DEBUG: ------> tag = 6

3263694592 RSLV: [2012/08/20 18:40:20.42] DEBUG: ------> id = 00008014

3263694592 RSLV: [2012/08/20 18:40:20.42] DEBUG: End---> DCResolveWithConstraint err = 0

3263694592 RSLV: [2012/08/20 18:40:20.227] DEBUG: Connect to tcp:172.16.151.235:524 succeeded

3263694592 RSLV: [2012/08/20 18:40:20.579] DEBUG: Begin-> DCResolveWithConstraint context = 3cd30001

3263694592 RSLV: [2012/08/20 18:40:20.579] DEBUG: Starting to walk from initial connection

3263694592 RSLV: [2012/08/20 18:40:20.579] DEBUG: Resolving \test_TREE\o=test\ou=testCloud

3263694592 RSLV: [2012/08/20 18:40:20.597] DEBUG: ------> tag = 6

3263694592 RSLV: [2012/08/20 18:40:20.597] DEBUG: ------> id = 0000B6A0

3263694592 RSLV: [2012/08/20 18:40:20.597] DEBUG: End---> DCResolveWithConstraint err = 0

3263694592 RSLV: [2012/08/20 18:40:20.597] DEBUG: Begin-> DCResolveWithConstraint context = 3cd30001

3263694592 RSLV: [2012/08/20 18:40:20.597] DEBUG: Starting to walk from initial connection

3263694592 RSLV: [2012/08/20 18:40:20.597] DEBUG: Resolving \test_TREE\o=test\ou=testCloud

3263694592 RSLV: [2012/08/20 18:40:20.621] DEBUG: ------> tag = 6

3263694592 RSLV: [2012/08/20 18:40:20.621] DEBUG: ------> id = FFFFFFFFFFFFFFFF

3263694592 RSLV: [2012/08/20 18:40:20.621] DEBUG: ------> refCount = 1

3263694592 RSLV: [2012/08/20 18:40:20.621] DEBUG: (3)Trying to connect. tries = 1

3263694592 RSLV: [2012/08/20 18:40:23.625] DEBUG: Connect to tcp:192.168.1.200:524 failed, 113 (0x71)

3263694592 RSLV: [2012/08/20 18:40:23.625] DEBUG: ------> TryConnection() err = 113

3263694592 RSLV: [2012/08/20 18:40:23.625] DEBUG: ------> DCNameToIDWithPack() err = 0

3263694592 RSLV: [2012/08/20 18:40:23.625] DEBUG: End---> DCResolveWithConstraint err = -626
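In a longer trace, the failing referral can be hard to spot by eye. The sketch below, written in Python, scans RSLV lines like the ones above and reports failed connection attempts and non-zero DCResolveWithConstraint results. It assumes the log lines have exactly the wording shown in this trace; the function name `summarize_trace` is hypothetical and is not part of ndstrace or eDirectory.

```python
import re

# Matches e.g. "Connect to tcp:192.168.1.200:524 failed, 113 (0x71)"
CONNECT_RE = re.compile(
    r"Connect to tcp:(?P<addr>[\d.]+):(?P<port>\d+) "
    r"(?P<result>succeeded|failed)(?:, (?P<errno>\d+))?"
)
# Matches e.g. "End---> DCResolveWithConstraint err = -626"
END_RE = re.compile(r"End---> DCResolveWithConstraint err = (?P<err>-?\d+)")


def summarize_trace(lines):
    """Return (failed_connects, resolve_errors) found in the trace lines.

    failed_connects is a list of (address, port, errno) tuples;
    errno is None when the line carries no error number.
    resolve_errors is a list of non-zero resolve results, e.g. [-626].
    """
    failed_connects = []
    resolve_errors = []
    for line in lines:
        m = CONNECT_RE.search(line)
        if m and m.group("result") == "failed":
            errno_text = m.group("errno")
            errno_value = int(errno_text) if errno_text else None
            failed_connects.append((m.group("addr"), int(m.group("port")), errno_value))
        m = END_RE.search(line)
        if m and int(m.group("err")) != 0:
            resolve_errors.append(int(m.group("err")))
    return failed_connects, resolve_errors
```

Run against the trace above, this singles out the failed connection to 192.168.1.200 on port 524 with error 113, followed by the -626 result. On Linux, errno 113 normally corresponds to EHOSTUNREACH ("No route to host").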

 

 

The first question for Novell engineering to investigate is why the server gets redirected to the other server when the master of root holds a complete copy of the tree.
 
This TID will be updated once there is confirmation of whether the install truly needs to go to the master rather than to a server holding a read/write replica.  If it is determined that there is no such need, a bug will be entered so the behavior can be resolved with a code change.  In the meantime, there is a fairly simple workaround.

 

In this case, we did not want to change the master of o=test (I’ve changed the name here for obvious reasons) because that server was a DSfW server.  Instead, we made the child container a partition and made the master of root the master of this new partition as well.  Once this was done, the server installed without any issue.  As mentioned in the resolution, we could also have added an entry in the DSfW server’s hosts file for this server a