Summary
Error
Sep 10 01:18:29 dc-s3 ADE: |12849472 20207 000060 SVMR EVHR 2016-09-10T01:18:29.171954| ERROR 0947 SVMR: Health check on volume 0 has failed with reason 'TOUT'
Sep 10 01:19:33 dc-s3 ADE: |12849472 17585 000060 SVMR EVHR 2016-09-10T01:19:33.191810| ERROR 0947 SVMR: Health check on volume 4 has failed with reason 'TOUT'
Sep 10 01:20:37 dc-s3 ADE: |12849472 17585 000060 SVMR EVHR 2016-09-10T01:20:37.212291| ERROR 0947 SVMR: Health check on volume 3 has failed with reason 'TOUT'
Sep 10 01:21:41 dc-s3 ADE: |12849472 28400 000060 SVMR EVHR 2016-09-10T01:21:41.232761| ERROR 0947 SVMR: Health check on volume 2 has failed with reason 'TOUT'
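To see at a glance which volumes are affected, the failure lines can be tallied per volume. The following is a minimal sketch, assuming the log lines keep the format shown in the excerpt above; the function name count_health_check_failures and the regular expression are illustrative and not part of any product tooling.

```python
import re
from collections import Counter

# Matches the SVMR health-check failure message as it appears in the excerpt above.
HEALTH_CHECK_RE = re.compile(
    r"Health check on volume (\d+) has failed with reason '(\w+)'"
)


def count_health_check_failures(lines):
    """Return a Counter keyed by (volume, reason) for every failure line found."""
    failures = Counter()
    for line in lines:
        match = HEALTH_CHECK_RE.search(line)
        if match:
            failures[(int(match.group(1)), match.group(2))] += 1
    return failures


# The four lines from the excerpt, used here as sample input.
sample = [
    "Sep 10 01:18:29 dc-s3 ADE: |12849472 20207 000060 SVMR EVHR 2016-09-10T01:18:29.171954| ERROR 0947 SVMR: Health check on volume 0 has failed with reason 'TOUT'",
    "Sep 10 01:19:33 dc-s3 ADE: |12849472 17585 000060 SVMR EVHR 2016-09-10T01:19:33.191810| ERROR 0947 SVMR: Health check on volume 4 has failed with reason 'TOUT'",
    "Sep 10 01:20:37 dc-s3 ADE: |12849472 17585 000060 SVMR EVHR 2016-09-10T01:20:37.212291| ERROR 0947 SVMR: Health check on volume 3 has failed with reason 'TOUT'",
    "Sep 10 01:21:41 dc-s3 ADE: |12849472 28400 000060 SVMR EVHR 2016-09-10T01:21:41.232761| ERROR 0947 SVMR: Health check on volume 2 has failed with reason 'TOUT'",
]

counts = count_health_check_failures(sample)
for (volume, reason), n in sorted(counts.items()):
    print(f"volume {volume}: {n} x {reason}")
```

To run it against a real log, pass an open file object instead of the sample list; the path to that file depends on the node and is not assumed here.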
Cause
LDR - Storage was in an Error state because health checks timed out (reason 'TOUT' = timeout) on several volumes.
The LDR Health Check Timeout was set to the default of 40 seconds (this can be seen in the NMS - storage node - LDR - Storage - configuration - Health Check Timeout).
Fix
- servermanager had been up for 400+ days
- LDR Health Check Timeout was still at the default of 40 seconds
- restarted servermanager
- all services came up OK
- set the LDR Health Check Timeout to 300 seconds (in the NMS - storage node - LDR - Storage - configuration - Health Check Timeout)
- dc-s3 is back online