Environment
Novell Cluster Services 1.8.4
Novell Open Enterprise Server 2 (OES 2) Linux
Situation
By default, multipathing (MPIO) queues I/O if one or more of the paths from the HBA to the SAN are lost. Queueing I/O to the disk is not desirable in a cluster; the resource should instead fail over to a cluster node where the paths are good.
The same issue appears when you disable the ports to the SAN to test a SAN failure. Resources will not fail over until MPIO times out.
Resolution
The HBAs must be set to "failed" I/O mode if the customer wants NCS to fail over storage resources automatically when a disk path goes down.
In the /etc/multipath.conf file, set one of the following in the device section:
no_path_retry "0"
or
no_path_retry fail
Additional Information
Here is an example of a working configuration with MPIO, QLogic HBAs, and an EMC CX300 SAN.
The /etc/modprobe.conf.local has the following:
options qla2xxx qlport_down_retry=1
With kernel version 2.6.16.60-0.42.5 or later, the qla2xxx driver has changed. In that case the /etc/modprobe.conf.local has the following:
options qla2xxx qlport_down_retry=2
After making changes to /etc/modprobe.conf.local, don't forget to run mkinitrd.
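As a quick sanity check after editing the file, the option line can be parsed to confirm the value that was actually written. The following is a minimal sketch, not part of the original procedure; the function name parse_module_options and the sample text are hypothetical, and it only reads text in the "options <module> key=value" form shown above.

```python
def parse_module_options(conf_text, module):
    """Collect key=value settings from 'options <module> ...' lines.

    Hypothetical helper for illustration; it ignores '#' comments and
    any line that is not an 'options' line for the requested module.
    """
    opts = {}
    for raw in conf_text.splitlines():
        # Drop anything after a '#' and split the rest on whitespace.
        fields = raw.split("#", 1)[0].split()
        if len(fields) >= 2 and fields[0] == "options" and fields[1] == module:
            for token in fields[2:]:
                key, sep, value = token.partition("=")
                if sep:  # keep only tokens of the form key=value
                    opts[key] = value
    return opts


# Sample content mirroring the line shown above (assumed, not read from disk).
modprobe_sample = "options qla2xxx qlport_down_retry=2\n"
print(parse_module_options(modprobe_sample, "qla2xxx"))
```

To check a real system, the contents of /etc/modprobe.conf.local would be read and passed in as conf_text.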
The /etc/multipath.conf file has the following:
defaults {
    polling_interval 1
    # no_path_retry 0
    user_friendly_names yes
    features 0
}
devices {
    device {
        vendor "DGC"
        product ".*"
        product_blacklist "LUNZ"
        path_grouping_policy "group_by_prio"
        path_checker "emc_clariion"
        features "0"
        hardware_handler "1 emc"
        prio "emc"
        failback "immediate"
        no_path_retry "0" # Set MPIO to failed I/O mode; any non-zero value sets queued I/O mode
    }
}
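To confirm that no active no_path_retry line leaves a device in queued I/O mode, the file can be scanned for that setting. The sketch below is only an illustration under stated assumptions: the function names and the sample text are hypothetical, it treats "0" and "fail" as failed I/O mode as described in the Resolution, and it labels "queue" or any non-zero count as queueing.

```python
import re


def classify_no_path_retry(value):
    """Map a no_path_retry value to 'fail', 'queue', or 'unknown'."""
    v = value.strip().strip('"').lower()
    if v == "fail":
        return "fail"
    if v == "queue":
        return "queue"
    if v.isdigit():
        # 0 fails I/O immediately; any non-zero count queues I/O first.
        return "fail" if int(v) == 0 else "queue"
    return "unknown"


def find_no_path_retry(conf_text):
    """Return (line_number, value, mode) for each active no_path_retry line."""
    found = []
    for lineno, raw in enumerate(conf_text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments
        m = re.match(r'no_path_retry\s+("?)([^"\s]+)\1\s*$', line)
        if m:
            found.append((lineno, m.group(2), classify_no_path_retry(m.group(2))))
    return found


# Shortened sample modeled on the configuration above (assumed text).
multipath_sample = '''
defaults {
    polling_interval 1
    # no_path_retry 0
}
devices {
    device {
        vendor "DGC"
        no_path_retry "0"  # failed I/O mode
    }
}
'''
for entry in find_no_path_retry(multipath_sample):
    print(entry)
```

Any entry reported with the mode "queue" would mean that I/O is held when paths are lost, which delays failover of the cluster resource.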
One other note on the QLogic HBA BIOS:
The defaults for the Port Down Retry and Link Down Retry values are 45 seconds, so once a fault occurs it takes about 50 seconds before I/O resumes on the remaining HBAs. Change these values in the HBA BIOS to 5.
Port Down Retry=5
Link Down Retry=5
This reduces the delay from about 50 seconds to about 10 seconds. During this time it appears that all I/O is blocked and will not fail until the Retry time has expired. The Watchdog timeout does not start until the Retry time has expired.