Storage node shows BLUE in the NMS and cannot be pinged or reached over SSH

  • KM02796535
  • 19-May-2017

Summary

LDR - Storage was in an Error state because of a TOUT (timeout) on several volumes

Error

dc-s3:/var/local/log # grep TOUT bycast-err.log
Sep 10 01:18:29 dc-s3 ADE: |12849472 20207 000060 SVMR EVHR 2016-09-10T01:18:29.171954| ERROR    0947 SVMR: Health check on volume 0 has failed with reason 'TOUT'
Sep 10 01:19:33 dc-s3 ADE: |12849472 17585 000060 SVMR EVHR 2016-09-10T01:19:33.191810| ERROR    0947 SVMR: Health check on volume 4 has failed with reason 'TOUT'
Sep 10 01:20:37 dc-s3 ADE: |12849472 17585 000060 SVMR EVHR 2016-09-10T01:20:37.212291| ERROR    0947 SVMR: Health check on volume 3 has failed with reason 'TOUT'
Sep 10 01:21:41 dc-s3 ADE: |12849472 28400 000060 SVMR EVHR 2016-09-10T01:21:41.232761| ERROR    0947 SVMR: Health check on volume 2 has failed with reason 'TOUT'
dc-s3:/var/local/log #

Cause

LDR - Storage was in an Error state because of a TOUT (timeout) on several volumes.

The LDR Health Check Timeout was set to the default of 40 seconds (this can be seen in the NMS: storage node - LDR - Storage - Configuration - Health Check Timeout).
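To gauge how widespread the timeouts are, the TOUT entries can be tallied per volume. A minimal sketch using generic grep/sed/uniq; it builds a small sample file so it runs anywhere — on the node itself, point the grep at /var/local/log/bycast-err.log as in the Error section above:

```shell
# Sample log lines (taken from the Error section of this article) so the
# pipeline can be exercised without access to the storage node.
cat > /tmp/bycast-err.sample <<'EOF'
Sep 10 01:18:29 dc-s3 ADE: |12849472 20207 000060 SVMR EVHR 2016-09-10T01:18:29.171954| ERROR    0947 SVMR: Health check on volume 0 has failed with reason 'TOUT'
Sep 10 01:19:33 dc-s3 ADE: |12849472 17585 000060 SVMR EVHR 2016-09-10T01:19:33.191810| ERROR    0947 SVMR: Health check on volume 4 has failed with reason 'TOUT'
Sep 10 01:20:37 dc-s3 ADE: |12849472 17585 000060 SVMR EVHR 2016-09-10T01:20:37.212291| ERROR    0947 SVMR: Health check on volume 3 has failed with reason 'TOUT'
EOF

# Extract the volume number from each TOUT line and count occurrences
# per volume (one output line per volume: "<count> <volume>").
grep "reason 'TOUT'" /tmp/bycast-err.sample \
  | sed "s/.*volume \([0-9]*\).*/\1/" \
  | sort | uniq -c
```

If many distinct volumes time out at once (as here), that points at a node-wide cause such as the health-check timeout rather than a single failing disk.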

Fix

 - servermanager had been up for 400+ days, and the LDR Health Check Timeout was at the default 40 seconds (see Cause above)
 - restarted servermanager
 - all services came up OK
 - set the LDR Health Check Timeout to 300 seconds (in the NMS: storage node - LDR - Storage - Configuration - Health Check Timeout)
 - dc-s3 is back online
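The "up for 400+ days" observation can be checked from the shell. A minimal sketch using generic Linux ps; the restart command in the final comment is an assumption — confirm the exact Server Manager commands against the documentation for your StorageGRID release:

```shell
# Print how many whole days a process has been running, from ps etime
# (format: [[dd-]hh:]mm:ss). Prints 0 if the process is under a day old.
proc_age_days() {
  local etime
  etime=$(ps -o etime= -p "$1" | tr -d ' ')
  case "$etime" in
    *-*) echo "${etime%%-*}" ;;   # has a "dd-" prefix: print the day count
    *)   echo 0 ;;                # under one day
  esac
}

# Example: age of the current shell (on the node, use the servermanager PID).
echo "process age: $(proc_age_days $$) days"

# If servermanager has been up for hundreds of days, restart it, e.g.:
#   service servermanager restart   # assumed command; varies by release
```

A very long-running servermanager combined with the short 40-second health-check timeout matches the Cause section; restarting it and raising the timeout is what resolved this case.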