Environment
Situation
A cluster resource is initializing on a cluster node. One of the steps in the cluster resource start script starts NetWare FTP (NWFTPD.NLM). However, the NetWare FTP service unloads immediately, and the Logger Screen displays a message about finding "unload this instance = yes" in the FTP configuration file.
Resolution
Additional Information
1. Take the cluster resource associated with this instance of NWFTPD offline.
2. Edit the cluster resource STOP script. Move the applicable NWFTPD -U command to the beginning of the script.
3. Immediately after the stop script's NWFTPD -U command, insert a short delay. In Novell testing, a 3-second delay was sufficient, for example:
DELAY 3
However, some environments may need a longer delay, and it may be worth adding extra time as a safety margin. Experimentation may be necessary to determine the shortest safe delay for a particular environment.
4. View the cluster START script. Make note of the exact syntax of the NWFTPD -C command (it will be used in later steps).
7. At the currently active node, ensure that the right NWFTPD instance is running by manually executing the NWFTPD -C command with the exact syntax identified in step #4. If it was not already running, it should launch successfully. If it was already running, the Logger Screen should show a message indicating that NWFTPD could not bind and was possibly already loaded. This is fine.
8. Now test migrating the resource, and/or putting the resource offline / online. It should be working smoothly now.
"NWFTPD -U configuration_file" unloads a particular instance of FTP in a roundabout way. A new instance of NWFTPD is loaded. The only task this instance performs is to edit the specified configuration file, and set UNLOAD_THIS_INSTANCE=YES. Once that quick task (modifying the file) is done, the cluster stop script can move on to the next command. However, the fact that the file has been modified does not mean that the subsequent effect of that modification has been completed.
The main instance of NWFTPD.NLM, which is actually using that configuration file (and handling FTP client requests), will see that change, set the setting back to NO, and shut itself down. Herein lies the potential complication. When the cluster resource is being shut down, a race condition develops: the cluster script is executing to deactivate the resource while the active NWFTPD instance is executing its code to change the file back to normal and unload. If the cluster script finishes before the NWFTPD reaction finishes, the YES setting will still be present in the file. When the resource attempts to come up on the next node, NWFTPD will load there, see the YES setting, and react by unloading itself.
One solution is to make sure that NWFTPD has time to set "NO" and unload itself before the cluster resource finishes being deactivated. The delay suggested above should accomplish that.
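The timing argument can be illustrated with a small model. The following Python sketch is not NWFTPD code and does not run on NetWare; it only restates the race described above. The names `simulate_offline` and `OfflineResult`, and the 1.5-second reaction time, are hypothetical values chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class OfflineResult:
    flag: str            # value left in the configuration file
    next_start_ok: bool  # whether NWFTPD would stay loaded on the next node


def simulate_offline(script_delay_s: float, reaction_s: float) -> OfflineResult:
    """Model one offline of the cluster resource.

    script_delay_s: pause placed after NWFTPD -U in the stop script.
    reaction_s: hypothetical time the main NWFTPD instance needs to
                notice the YES setting, write NO back, and unload.
    """
    flag = "YES"  # the short-lived -U instance has just written YES
    if reaction_s <= script_delay_s:
        flag = "NO"  # the main instance finished before deactivation
    # On the next node, NWFTPD unloads itself if it still finds YES.
    return OfflineResult(flag=flag, next_start_ok=(flag == "NO"))


if __name__ == "__main__":
    # Assumed reaction time of 1.5 seconds, for illustration only.
    for delay in (0, 3):
        print(delay, simulate_offline(delay, reaction_s=1.5))
```

With no delay, the model leaves YES in the file and the next start fails. With a delay longer than the reaction time, the file is back to NO before the resource is deactivated. The real reaction time is not documented here, which is why the appropriate delay has to be found by testing.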
However, even if a sufficient delay is added, some subtle factors could prevent that simple change from correcting the situation right away, or in every possible scenario. For that reason, the updated code in the "Resolution" section above is the preferred solution. Examples of the subtle exceptions:
a. Changes to the cluster scripts don't take effect right away. It is not enough to simply migrate the resource. The resource must be fully taken offline and online again, for the scripts to be reread.
b. Even after adding an adequate delay to the script and taking the resource offline, an UNLOAD_THIS_INSTANCE = YES setting could still be left over in the configuration file from the final offline of the resource (which ran the script as it previously existed, without the delay). So even after making this correction, the problem will often occur at least one more time.
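One way to spot the leftover setting described in item b is to inspect a copy of the configuration file from a workstation. The sketch below is a minimal example under stated assumptions: the helper name `has_leftover_unload_flag` is hypothetical, and it assumes the setting appears on its own line as `UNLOAD_THIS_INSTANCE = YES`, with optional spacing and in any letter case, as written in this document.

```python
import re

# Matches a line consisting only of the setting with the value YES.
# Spacing around "=" and letter case are allowed to vary (assumption).
_FLAG = re.compile(r"^\s*UNLOAD_THIS_INSTANCE\s*=\s*YES\s*$", re.IGNORECASE)


def has_leftover_unload_flag(config_text: str) -> bool:
    """Return True if any line still sets UNLOAD_THIS_INSTANCE to YES."""
    return any(_FLAG.match(line) for line in config_text.splitlines())


if __name__ == "__main__":
    # SOME_OTHER_SETTING is a placeholder, not a real NWFTPD option.
    sample = "SOME_OTHER_SETTING = 1\nUNLOAD_THIS_INSTANCE = YES\n"
    print(has_leftover_unload_flag(sample))
```

If the check reports a leftover YES, running the NWFTPD -C command manually as described in step 7 gives the loaded instance a chance to reset the setting.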