Clustered NetWare FTP service complains of "unload this instance" setting, then unloads

  • 7003911
  • 17-Jul-2009
  • 26-Apr-2012

Environment

Novell NetWare 6.5 Support Pack 6
Novell NetWare 6.5 Support Pack 7
Novell NetWare 6.5 Support Pack 8

Situation

A cluster resource is initializing on a cluster node. One of the steps in the cluster resource start script starts NetWare FTP (NWFTPD.NLM). However, the NetWare FTP service immediately unloads, and the Logger Screen shows a message about finding "unload this instance = yes" in the FTP configuration file.

Resolution

The problem exists because of the way NWFTPD unloads itself during the cluster resource STOP script.  Timing issues can cause the NWFTPD unload procedure to be incomplete when the resource goes offline.  The result is that NWFTPD unloads itself after coming up on the new node.
 
An adjustment has been made to the way NWFTPD unloads itself, to compensate for this.  The new NWFTPD.NLM is available in NWFTPD15.ZIP, at:
 
If for some reason that link does not work, go to https://download.novell.com/patch/finder/.  Select "NetWare" as the product, then do a keyword search on the download name, NWFTPD15.ZIP.
 
Follow the readme instructions carefully in order to fully correct the issue.
 
See the "Additional Information" section below for a configurational workaround, if that is preferred.

Additional Information

Configurational Workaround (does not require the code update mentioned above.)
 
This can typically be resolved with some easy adjustments to the Cluster Resource *STOP* script. However, the exact order of operations needed to put this into effect can be tricky. Here are the steps, in the order to use them:

1. Offline the cluster resource which is associated with this instance of NWFTPD.

2. Edit the cluster resource STOP script. Move the applicable NWFTPD -U command to the beginning of the script.

3. Immediately after the stop script's NWFTPD -U command, insert a small delay. In Novell testing, a 3 second delay was sufficient, i.e.:

DELAY 3

However, a longer delay may be needed in some cases, and an extra buffer of time may be desirable. Experimentation may be necessary to determine the lowest safe delay for a particular environment.

4. View the cluster START script. Make note of the exact syntax of the NWFTPD -C command (it will be used in later steps).

5. Bring the cluster resource online. Another failure is typically expected at this point. Continue with these steps.
 
6.  Edit the configuration file on the shared resource.  This would be the configuration file specified by the NWFTPD -C command noted in step #4.  Scroll to the bottom of the file and make sure UNLOAD_THIS_INSTANCE=NO.  Save / exit.

7. At the currently-active node, ensure that the right NWFTPD instance is running by manually executing the NWFTPD -C command with the exact syntax which was identified in step #4. If it was not already running, it should launch successfully. If it was already running, then the logger screen should give a message indicating that NWFTPD could not bind, and was possibly already loaded. This is fine.

8. Now test migrating the resource, and/or putting the resource offline / online. It should be working smoothly now.
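
As an optional aid for step 6, the check of the configuration file can be sketched in Python, run from any workstation that can read the file on the shared volume. This is only an illustration: the function names are hypothetical, and the line format (one NAME = VALUE setting per line, name not case-sensitive, optional spaces around "=") is an assumption inferred from the examples in this document rather than from NWFTPD documentation.

```python
import re

# Assumed format: "UNLOAD_THIS_INSTANCE = YES|NO" at the start of a line,
# case-insensitive, with optional spaces around "=". Lines that begin with
# any other character (for example a comment marker) are ignored.
_SETTING = re.compile(r"^\s*UNLOAD_THIS_INSTANCE\s*=\s*(\w+)", re.IGNORECASE)


def unload_flag(config_text):
    """Return the last UNLOAD_THIS_INSTANCE value found, upper-cased, or None."""
    value = None
    for line in config_text.splitlines():
        match = _SETTING.match(line)
        if match:
            value = match.group(1).upper()  # a later line overrides an earlier one
    return value


def is_primed_to_unload(config_text):
    """True if an NWFTPD instance loading this file would find YES."""
    return unload_flag(config_text) == "YES"
```

Passing the text of the configuration file noted in step 4 to is_primed_to_unload() indicates whether the setting still needs to be changed to NO by hand before the next test.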

 
Below is a detailed explanation, which can help in troubleshooting, or can help to understand the reasons behind the steps above:

"NWFTPD -U configuration_file" unloads a particular instance of FTP in a roundabout way. A new instance of NWFTPD is loaded. The only task this instance performs is to edit the specified configuration file, and set UNLOAD_THIS_INSTANCE=YES. Once that quick task (modifying the file) is done, the cluster stop script can move on to the next command. However, the fact that the file has been modified does not mean that the subsequent effect of that modification has been completed.

The main instance of NWFTPD.NLM which is truly using that configuration file (and handling FTP client requests) will see that change; set the setting back to NO; and shut itself down. Herein lies the potential complication. In a cluster-shutdown situation, a 'race condition' develops.  In this "race," the cluster script is executing to deactivate the resource, while the active NWFTPD instance is executing its code to change the file back to normal and unload. If the cluster script finishes before the NWFTPD reaction finishes, then the YES setting will still be present in the file.  When the resource attempts to come up on the next node, NWFTPD will load there, see the YES setting, and react by unloading itself.

One solution is to make sure that NWFTPD has time to set "NO" and unload itself before the cluster resource finishes being deactivated. The delay suggested above should accomplish that.
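
The race and the effect of the delay can be illustrated with a small Python model. It is a simplified sketch, not a description of NWFTPD internals: the function names and the reaction_seconds parameter are hypothetical, and the model assumes the worst case in which the resource is deactivated as soon as the delay ends.

```python
def flag_left_in_file(reaction_seconds, delay_seconds):
    """Model one resource offline; return the flag left in the configuration file.

    reaction_seconds: hypothetical time the running NWFTPD needs to notice YES,
                      write NO back, and unload.
    delay_seconds:    the DELAY placed after NWFTPD -U in the stop script
                      (0 models a stop script with no delay).
    """
    flag = "YES"  # NWFTPD -U writes YES and returns immediately
    if reaction_seconds <= delay_seconds:
        flag = "NO"  # the running instance finished before deactivation
    return flag


def next_start_succeeds(reaction_seconds, delay_seconds):
    """NWFTPD stays loaded on the next node only if it finds NO."""
    return flag_left_in_file(reaction_seconds, delay_seconds) == "NO"
```

For example, with an assumed reaction time of 2 seconds, next_start_succeeds(2, 0) is False while next_start_succeeds(2, 3) is True. If the reaction takes 5 seconds, a 3 second delay is not enough, which is why experimentation may be needed in a given environment.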

HOWEVER, even if a sufficient delay is added, there could be some subtle factors which would prevent that simple change from correcting the situation right away, or in every possible scenario.  So the updated code in the "Resolution" section above is the preferred solution.  Examples of the subtle exceptions:

a.  Changes to the cluster scripts don't take effect right away. It is not enough to simply migrate the resource. The resource must be fully taken offline and online again, for the scripts to be reread.

b.  Even after adding an adequate delay to the script and off-lining the resource, there could still exist an UNLOAD_THIS_INSTANCE = YES setting left over in the configuration file, from the final offline of the resource (which will have executed the script as it previously existed -- without the delay). So even after making this correction, the problem will often occur at least one more time.

c.  Even when it is thought that this problem has been solved, imagine a situation where the cluster resource is active on a node, but (for whatever reason) NWFTPD has already been unloaded; the configuration file has UNLOAD_THIS_INSTANCE = NO; and the proper delay is in an effective stop script. It would seem that migrating the resource at this point should succeed. Unfortunately, it does not. Upon migrating the resource, the stop script executes NWFTPD -U, setting UNLOAD_THIS_INSTANCE = YES once again. But since there is no NWFTPD.NLM active to react to this and set it back to "NO", the "YES" once again gets preserved until NWFTPD attempts to load on the next node. Then the problem occurs again.
 
So even if things are "perfectly" configured, if the expected instance of NWFTPD is not already running correctly when someone offlines or migrates a resource, FTP is primed to fail upon the next start.
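
The scenario in (c) can be traced with a toy state model in Python. Everything here is a hypothetical illustration of the behavior described in this document: the class, its attributes, and its methods are invented for the sketch. In particular, the document does not state whether an instance that finds YES at load time resets the setting, so the model leaves the file untouched in that case.

```python
class FtpResourceModel:
    """Toy model of one clustered NWFTPD instance and its configuration file."""

    def __init__(self):
        self.flag = "NO"       # UNLOAD_THIS_INSTANCE in the shared configuration file
        self.running = False   # whether the main NWFTPD instance is loaded

    def online(self):
        """Start script: NWFTPD -C loads, but unloads at once if it finds YES."""
        # Assumption: the flag is not modified here; step 6 sets it to NO by hand.
        self.running = self.flag != "YES"
        return self.running

    def offline(self, delay_is_sufficient=True):
        """Stop script: NWFTPD -U writes YES; only a running instance resets it."""
        self.flag = "YES"
        if self.running and delay_is_sufficient:
            self.flag = "NO"   # the running instance reacted in time
        self.running = False


model = FtpResourceModel()
model.online()          # healthy start, NWFTPD is loaded
model.running = False   # NWFTPD gets unloaded for whatever reason
model.offline()         # stop script still runs NWFTPD -U
restarted = model.online()
```

After this sequence model.flag is "YES" and restarted is False, even though the stop script contains a sufficient delay. Running offline() while the instance is still loaded leaves "NO" in the file, and the next online() succeeds.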