HAGC made up of gn1s-a-1 and gn1-a-1 showing failover in the NMS

  • KM02796536
  • 19-May-2017
  • 19-May-2017

Summary

- gn1s-a-1 - active primary at the time of failover
- gn1-a-1 - supplementary primary at the time of failover

Error

ha.log
------
May 14 15:54:18 gn1s-a-1 heartbeat: [11063]: info: killing /usr/lib/heartbeat/crmd process group 11091 with signal 15
May 14 15:54:18 gn1s-a-1 crmd: [11091]: info: crm_shutdown: Requesting shutdown
May 14 15:54:18 gn1s-a-1 crmd: [11091]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
May 14 15:54:18 gn1s-a-1 crmd: [11091]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
May 14 15:54:18 gn1s-a-1 crmd: [11091]: info: do_shutdown_req: Sending shutdown request to DC: gn1s-a-1
May 14 15:54:18 gn1s-a-1 crmd: [11091]: info: do_shutdown_req: Processing shutdown locally
May 14 15:54:18 gn1s-a-1 crmd: [11091]: info: handle_shutdown_request: Creating shutdown request for gn1s-a-1

Cause

- A failover can have several causes. In this case, the management processor on gateway node gn1s-a-1 shut the system down because of overheating, and that shutdown triggered the failover.


bycast.log
----------
May 14 15:53:15 gn1s-a-1 hpasmxld[3917]: WARNING: System Overheating (Zone 4, Location Ambient, Temperature 41C)
May 14 15:53:15 gn1s-a-1 hpasmxld[3917]: A System Reboot has been requested by the management processor in 60 seconds.
:
May 14 15:54:15 gn1s-a-1 hpasmxld[3917]: A System Reboot has been initiated by the management processor.
May 14 15:54:15 gn1s-a-1 shutdown[7741]: shutting down for system reboot


Three seconds after the reboot is initiated, heartbeat starts shutting down the cluster services on gn1s-a-1.


ha.log
------
May 14 15:54:18 gn1s-a-1 heartbeat: [11063]: info: killing /usr/lib/heartbeat/crmd process group 11091 with signal 15
May 14 15:54:18 gn1s-a-1 crmd: [11091]: info: crm_shutdown: Requesting shutdown
May 14 15:54:18 gn1s-a-1 crmd: [11091]: info: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE [ input=I_SHUTDOWN cause=C_SHUTDOWN origin=crm_shutdown ]
May 14 15:54:18 gn1s-a-1 crmd: [11091]: info: do_state_transition: All 2 cluster nodes are eligible to run resources.
May 14 15:54:18 gn1s-a-1 crmd: [11091]: info: do_shutdown_req: Sending shutdown request to DC: gn1s-a-1
May 14 15:54:18 gn1s-a-1 crmd: [11091]: info: do_shutdown_req: Processing shutdown locally
May 14 15:54:18 gn1s-a-1 crmd: [11091]: info: handle_shutdown_request: Creating shutdown request for gn1s-a-1

 

- The customer checked with the data center team, who confirmed a cooling outage on Sunday.
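
The correlation above (an overheating warning in bycast.log, followed about a minute later by a crmd shutdown request in ha.log) can also be checked with a script. The following is a minimal sketch, not a supported tool. The function name find_overheat_shutdowns, the 5-minute window, and the year 2017 (syslog lines carry no year; it is taken from the article date) are assumptions made for illustration. The patterns are based only on the log lines quoted in this article, and month parsing assumes an English locale.

```python
import re
from datetime import datetime, timedelta

# Syslog-style prefix seen in both excerpts: "May 14 15:53:15 gn1s-a-1 <rest>"
LINE_RE = re.compile(r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(\S+)\s+(.*)$")
# Patterns taken from the quoted log lines; other releases may word them differently.
OVERHEAT_RE = re.compile(r"hpasmxld\[\d+\]: WARNING: System Overheating \((.*)\)")
SHUTDOWN_RE = re.compile(r"crmd: \[\d+\]: info: crm_shutdown: Requesting shutdown")


def parse(line, year=2017):
    """Split a log line into (timestamp, host, message), or return None."""
    m = LINE_RE.match(line.strip())
    if not m:
        return None
    stamp, host, rest = m.groups()
    # The year is an assumption: syslog timestamps do not include one.
    ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
    return ts, host, rest


def find_overheat_shutdowns(bycast_lines, ha_lines, window=timedelta(minutes=5)):
    """Pair each overheating warning with the first crmd shutdown request
    logged on the same host within `window` after the warning."""
    warnings = []
    for line in bycast_lines:
        parsed = parse(line)
        if parsed:
            ts, host, rest = parsed
            m = OVERHEAT_RE.search(rest)
            if m:
                warnings.append((ts, host, m.group(1)))

    shutdowns = []
    for line in ha_lines:
        parsed = parse(line)
        if parsed and SHUTDOWN_RE.search(parsed[2]):
            shutdowns.append((parsed[0], parsed[1]))

    pairs = []
    for w_ts, w_host, detail in warnings:
        for s_ts, s_host in shutdowns:
            if s_host == w_host and timedelta(0) <= s_ts - w_ts <= window:
                pairs.append((w_host, detail, w_ts, s_ts))
                break
    return pairs


if __name__ == "__main__":
    # Sample lines copied from the excerpts in this article.
    bycast = [
        "May 14 15:53:15 gn1s-a-1 hpasmxld[3917]: WARNING: System Overheating "
        "(Zone 4, Location Ambient, Temperature 41C)",
    ]
    ha = [
        "May 14 15:54:18 gn1s-a-1 crmd: [11091]: info: crm_shutdown: Requesting shutdown",
    ]
    for host, detail, warned, stopped in find_overheat_shutdowns(bycast, ha):
        delay = int((stopped - warned).total_seconds())
        print(f"{host}: overheating ({detail}), crmd shutdown {delay}s later")
```

A match only shows that the two events are close in time on the same node. Confirming the root cause still requires reading the surrounding log entries, as done above.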

Fix

- No workaround or fix was needed. The FSG services failed over from gn1s-a-1 to gn1-a-1.

- gn1-a-1 is now the active primary and gn1s-a-1 is the supplementary primary.

- Reset the failover count from the NMS.