Slow performance copying to HP MSA1500 SAN causes poison pill, split brain, cluster abend

  • Document ID: 3830617
  • Creation Date: 03-Dec-2007
  • Modified Date: 27-Apr-2012

Environment

Novell NetWare 6.5 Support Pack 6
HP MSA 1500 SAN
QLogic Host Bus Adapters

Situation

Customer had a 6-node cluster (all nodes running NetWare 6.5 SP6)
connected to two SANs: an HP EVA 3000 and an HP MSA 1500.
The problems were on the HP MSA 1500. This SAN hosted 9 LUNs: 4 for NetWare and 5 for Windows and Linux. The problem occurred on all servers.
The specific error condition on the MSA SAN is that the "Average Command Latency" rises to around 200 ms, where it normally runs between 5 and 30 ms. When the latency is that high, a NetWare server performing large writes to the MSA 1500 shows the "concurrent disk requests" counter in monitor.nlm climbing past 1000. If this continues for some time, cluster nodes split-brain because the heartbeat cannot be written to the SBD partition in a timely manner, and the node that loses the split-brain resolution is given a poison pill and abends. End users notice when the concurrent disk requests get very high, because the server becomes laggy even though the data being read is on a different, much faster LUN.
The problem occurred on large file copies to the HP MSA 1500 SAN; large reads from the SAN worked fine, and the HP EVA 3000 SAN had no problems at all. The result was a severe loss of performance.
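
To illustrate why a latency jump of that size pushes the concurrent-disk-requests counter past 1000 and starves the SBD heartbeat, the following Python sketch applies Little's law (requests in flight ≈ I/O rate × average latency). The I/O rate and the 8-second cluster tolerance used here are illustrative assumptions, not values taken from this report.

    # Illustrative sketch only: the I/O rate and the tolerance are assumptions.
    def concurrent_disk_requests(io_per_second: float, latency_ms: float) -> float:
        """Little's law: requests in flight ~= arrival rate * average latency."""
        return io_per_second * (latency_ms / 1000.0)

    def node_split_brains(heartbeat_write_delay_s: float, tolerance_s: float = 8.0) -> bool:
        """If the SBD heartbeat write sits behind the write backlog for longer
        than the cluster tolerance, the other nodes treat this node as dead."""
        return heartbeat_write_delay_s > tolerance_s

    if __name__ == "__main__":
        for latency_ms in (20, 200):   # typical vs. degraded MSA latency
            inflight = concurrent_disk_requests(io_per_second=6000, latency_ms=latency_ms)
            print(f"{latency_ms:>3} ms -> ~{inflight:,.0f} concurrent disk requests")
        # A heartbeat write queued behind thousands of slow writes can wait
        # tens of seconds, well past the tolerance window.
        print("split brain:", node_split_brains(heartbeat_write_delay_s=30.0))

With these assumed numbers the same workload keeps roughly 120 requests in flight at 20 ms but balloons to roughly 1,200 at 200 ms, which is in line with the counter behavior described above.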

Resolution

1) The problems were discovered after the move to production. Testing did not show the problem, but a review of the testing methodology showed it never covered a case that would have triggered it. The problem surfaced after a cluster volume was moved to this device and additional storage was added to the MSA SAN. Adding the storage required creating several drive arrays and LUNs, and also required expanding an existing drive array. It was the expansion of the drive array that triggered the condition that allowed this problem to occur.

2) This SAN is in production, and production processes have caused the problems described above. As it happens, another drive-array expansion is about to begin, and we are trying to build a methodology to work around the problem for the two weeks it will take to expand the array; a sketch of one possible approach follows.
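
One possible shape for such a workaround is shown below. This is purely a hypothetical sketch: the TID does not document the final methodology, read_average_latency_ms() is a placeholder for whatever tool actually reports the MSA's Average Command Latency, and the thresholds are guesses.

    # Hypothetical workaround loop for the expansion window: pause bulk copy
    # jobs while the MSA's average command latency is high, resume them once
    # it recovers. All names and thresholds here are illustrative.
    import time

    LATENCY_PAUSE_MS = 100.0   # stop feeding large writes above this
    LATENCY_RESUME_MS = 40.0   # resume once latency is back near normal

    def read_average_latency_ms() -> float:
        """Placeholder: query the MSA's Average Command Latency counter."""
        raise NotImplementedError("replace with the site's monitoring hook")

    def throttle_copies(pause_job, resume_job, poll_s: float = 30.0) -> None:
        """Pause bulk copies while latency is high; resume when it recovers."""
        paused = False
        while True:
            latency = read_average_latency_ms()
            if not paused and latency > LATENCY_PAUSE_MS:
                pause_job()      # e.g. stop the scheduled file-copy batch
                paused = True
            elif paused and latency < LATENCY_RESUME_MS:
                resume_job()
                paused = False
            time.sleep(poll_s)

The pause and resume thresholds are deliberately far apart so the copy jobs do not flap on and off while the array expansion keeps latency hovering near a single cutoff.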