Removing a Crashed Server from the NDS Tree

  • 3338221
  • 01-Mar-2007
  • 27-Apr-2012

Environment

Novell NetWare 4.10
Novell NetWare 4.11
Novell intraNetWare 4.11
Novell NetWare 4.2
Novell NetWare 5.0
Novell NetWare 5.1
Novell NetWare 6
Novell NetWare 6.5
Novell Directory Services
Novell intraNetWare for Small Business
Novell NetWare for Small Business
Formerly TID 2908056

Situation

Removing a Crashed Server from the NDS Tree

Resolution

Recovering From a System Crash

When a server's hardware has crashed or failed, or a server has been taken out of a Directory Services tree without DS first being properly removed from it, several steps are needed to ensure that the remaining servers can synchronize correctly and that, if necessary, the server can be replaced and reinserted into the Directory Services tree.

WARNING: Deleting the server object for a failed server causes loss of the server references for that server unless proper steps have been taken first. If a failed server will be replaced, follow TID #10013535, Crashed Server - Saving server references. The DSMAINT -PSE procedure retains the links to home directories, directory map objects, and NDS-aware printing that are otherwise lost when the server object is simply deleted. (This applies to 4.x servers; NW5 currently does not have this functionality.)

1. Verify that time is synchronized on the network
If time is not synchronized, changes cannot properly be made to the directory service tree. See TID #10011516 - Timesync Definitions Reference for time synchronization help.

Load DSREPAIR | Time Synchronization

This will report whether time is synchronized across all available servers. If time is not synchronized, determine why. Questions to ask: is there a Single Time or a Reference Time server available and working properly? Are you using configured sources and if so, is the source server up and running?

2. Verify a Master replica for each ring contained on the crashed server
If a server goes down permanently or is replaced without removing DS, the replicas it contained will have incorrect replica ring information. Each server in each of the replica rings will still think the server should be contacted with updates whenever they occur. Also, if the server that has been removed or has had a hardware crash contained a master replica of any partition, another server with a read-write replica must be selected to become the new master replica.

If a non-troubled server contained all the master replicas, this process is rather simple. If it is not known what replicas the suspect server had, this process can be quite complicated, as each replica in the tree would have to be queried to determine if the suspect server were part of the replica ring.

Verify that a Master replica exists for each Partition:
Load DSREPAIR | Advanced Options | Replica and Partition Operations | (select each replica one at a time) | View Replica Ring

In the replica ring for each replica that was held on the suspect server, verify that a master replica exists on a good server. If not, escape back one screen to Replica Options, Partition (this partition) and choose the option "Designate this Server as the New Master Replica". (If this server doesn't contain a read-write replica of that partition, or if you want another server to be the master, do this step on that server.)

WARNING: DO NOT designate a Subordinate Reference replica as the new Master replica unless no R/W or Read Only replica exists of that partition. Doing so will cause all of your partition objects to go unknown and you will have to recreate them manually.
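
The selection rule in this step can be summarized in a short sketch. The Python below is illustrative only: the Replica record, the type labels, and the pick_new_master helper are hypothetical names, not part of DSREPAIR or any Novell tool, and the ring contents would have to be copied by hand from the View Replica Ring screen. Preferring a read-write replica over a read-only one is an assumption based on the text above, which names a read-write replica as the normal candidate.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical labels for the replica types named in this article.
MASTER = "master"
READ_WRITE = "read-write"
READ_ONLY = "read-only"
SUBREF = "subordinate-reference"


@dataclass
class Replica:
    server: str  # server name as shown in the replica ring
    rtype: str   # one of the labels above


def pick_new_master(ring: List[Replica], crashed: str) -> Optional[Replica]:
    """Return the replica to designate as the new master.

    Returns None when a master already exists on a surviving server,
    so nothing has to change for this partition.
    """
    survivors = [r for r in ring if r.server != crashed]
    if not survivors:
        raise ValueError("no surviving replica in this ring")

    # A master on a good server means the ring needs no new master.
    if any(r.rtype == MASTER for r in survivors):
        return None

    # Assumed preference order: read-write first, then read-only.
    # A subordinate reference is a last resort only; per the WARNING
    # above, promoting one makes the partition's objects go unknown.
    for wanted in (READ_WRITE, READ_ONLY, SUBREF):
        for r in survivors:
            if r.rtype == wanted:
                return r

    raise ValueError("ring contains an unrecognized replica type")
```

A caller would check whether the returned replica has the subordinate-reference type and, if so, stop and reconsider before designating it in DSREPAIR.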

Once it is verified that a master replica exists for each partition:

3. Clean up the Tree. (server objects)
When a master replica is present for each partition, run SYS:\PUBLIC\WIN32\NDSMGR32.EXE | View | Set Context (set the context to [Root] or to a context sufficient to see the problem server) | View | Partition and Server View. Click on the problem server (the servers are listed in the lower-left part of the screen). An error will appear (usually a -321 error). Click OK. At this point the server will be highlighted. Click Object | Delete.

This will remove the server from the tables on each server in the tree that contain server names and IPX addresses, as well as remote NDS. It will also remove the server from the replica rings and then come back to delete the server object.

NOTE: When removing a NW 5.x server from the tree, all objects in the tree associated with that server (LDAP, some backup programs, etc.) will need to be deleted (with NWAdmin/ConsoleOne) and then reinstalled when and if the server is put back into the tree.

When you are on NetWare 6 you must use ConsoleOne instead of NDS Manager. Also clean up the replica rings: Load DSREPAIR -a | Advanced Options | Replica and Partition Operations | (select the replica the crashed server contained) | View Replica Ring | (select the crashed server) | Remove this Server from the Replica Ring. Do this for every partition of which the crashed server held a replica.


NOTE: You may need to bring the server DOWN before you can delete the server object. Also, don't worry if the server object will not delete. When you re-install DS back onto the server it will prompt you to replace the existing NCP server object. Make sure you install the server into the SAME CONTEXT that it existed in before.

Allow some time for the system to delete the server object before carrying out the following step.
Check whether the server object has been deleted (Servers known to the database).
Run Check External References (DSREPAIR | Advanced Options) on the server holding the master replica of the partition from which the server was deleted. If obituaries for the deleted server remain, wait until they have finished processing.
Only after enough time has passed (30 minutes to 1 hour) should you continue:

4. Verify that each replica ring is consistent and valid:
Load DSREPAIR -a | Advanced Options | Replica and Partition Operations | (select each replica one at a time) | View Replica Ring. If the suspect server exists in the replica ring, select it and press Enter. A screen entitled Replica Options: Server will appear. Select the option entitled "Remove this Server from the Replica Ring". This removes the suspect/crashed server from the replica ring for this partition, and the change will synchronize out to the other servers in the replica ring.

This step needs to be completed for each replica that the suspect/crashed server contained.
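
As a bookkeeping aid for this step, the following minimal sketch lists the partitions whose replica ring still names the crashed server. It is plain Python over data typed in by hand from the View Replica Ring screens; the rings_needing_cleanup helper and the sample partition and server names are hypothetical, and the case-insensitive name comparison is an assumption of the sketch.

```python
from typing import Dict, List


def rings_needing_cleanup(rings: Dict[str, List[str]], crashed: str) -> List[str]:
    """Return the partitions whose replica ring still lists the crashed server.

    `rings` maps a partition name to the server names shown in its ring.
    Names are compared without regard to case (an assumption of this sketch).
    """
    target = crashed.upper()
    return sorted(
        partition
        for partition, servers in rings.items()
        if target in (name.upper() for name in servers)
    )


# Hypothetical example data, as it might be noted down from DSREPAIR.
rings = {
    "[Root]": ["FS1", "FS2", "FS9"],
    "OU=SALES.O=ACME": ["FS2", "fs9"],
    "OU=ENG.O=ACME": ["FS1", "FS2"],
}
remaining = rings_needing_cleanup(rings, "FS9")
```

Each partition in the returned list still needs the "Remove this Server from the Replica Ring" option applied; an empty list means this step is complete.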

5. Clean up the Tree (volume objects)
When the server object is deleted, the volume objects corresponding to it will either be removed as well or go unknown; an unknown object is shown with a yellow ? beside it in NetWare Administrator (NWADMIN) or with (unknown) beside its name in DOS. Using NETADMIN.EXE or NWADMIN, delete the volume objects corresponding to the suspect/crashed server. These need to be removed before the server is reinstalled so that the volumes can be put in the tree correctly. Delete any other object relating to the crashed server (licensing objects, LDAP objects, backup software references, etc.). Wait for the obituaries for the volume objects and other deleted objects to purge before reinstalling the server into the tree.

6. Summary and Miscellaneous.
At this point all references to the suspect server should be gone. There should not be any place in the tree where you can find the server or its volumes. It is important here to make sure that all servers are synchronizing and communicating properly. Make sure, for example, that the server that went down wasn't a router between two segments. At the server console, enter SET DSTRACE=ON, SET DSTRACE=+S, and SET DSTRACE=*H, and make sure that everything completes correctly. You should see "SYNC: End sync of partition All processed = YES." for each partition in the tree. See KB 2909026 for more information on DSTRACE.

Also Load DSREPAIR | Report Synchronization Status. Make sure it reports that everything has synchronized within the last half hour for DS versions 6.x, or within the last hour for DS versions 7.x and 8.x, and that there are no errors.
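
The acceptance rule for the synchronization report can be written down as a small check. This is a sketch only: the function names are hypothetical, the inputs are values read off the DSREPAIR report by hand, and DS versions other than 6.x, 7.x, and 8.x are rejected because this article gives no threshold for them.

```python
def max_sync_age_minutes(ds_major_version: int) -> int:
    """Return the longest acceptable time since the last sync, in minutes."""
    if ds_major_version == 6:
        return 30  # half an hour for DS 6.x
    if ds_major_version in (7, 8):
        return 60  # one hour for DS 7.x and 8.x
    raise ValueError("no threshold given for DS version %d.x" % ds_major_version)


def sync_is_acceptable(ds_major_version: int,
                       minutes_since_sync: float,
                       error_count: int) -> bool:
    """Return True if the report shows a recent, error-free synchronization."""
    if error_count != 0:
        return False
    return minutes_since_sync <= max_sync_age_minutes(ds_major_version)
```

For example, a DS 6.x server that last synchronized 45 minutes ago fails the check, while a DS 8.x server with the same reading passes.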

7. Reinstall the server into the tree using Load NWCONFIG | Directory Options | Install Directory Services onto this server.

8. After the server is back into the tree, Load DSREPAIR | Advanced Options | Check Volume objects and trustees to clear out any invalid trustee assignments.

9. Realize that any print queues referencing this server's volume objects will need to be recreated. Users' home directory assignments may also need to be reassigned (User object | Environment tab | Home Directory field), and trustee assignments (users' file rights) will have to be restored from your SMS-compliant backup solution. If you don't have an SMS backup, you will need to reassign file rights manually (assign rights through containers and groups; it will go much faster). The reason is that all of these assignments pointed to a specific volume object ID. When the volume object was deleted and recreated (through the INSTALL process), the ID number changed, so any object referencing the old ID number will not function properly.

10. If this server was a NW5x server, you will also have to re-install licensing and certificate server. If this server was the organizational CA for the tree, you will need to re-install certificate server on all other NW5x servers in the tree.


Additional Information

When you are on NetWare 6 you must use ConsoleOne instead of NDS Manager.

Formerly known as TID# 10010922