My Hadoop-based cluster is unhealthy, how can I fix it?

  • KM03712733
  • 25-Sep-2020
  • 15-Oct-2020

Summary

This KB article addresses a scenario in which your Hadoop-based cluster is unhealthy: services are not running smoothly, actions are not executed, and so on. The checklist in this article can help you find the cause of an unhealthy cluster.

Question

My cluster is unhealthy, how can I fix it?

Answer

The checklist below can help you find the cause of an unhealthy cluster. For your convenience, it includes Linux commands that apply to CentOS/RHEL 7.x:

  • Time - Is the time reported by the date command synchronized across all of your cluster machines? Your cluster is most likely configured with the Network Time Protocol daemon (ntpd). If the time on a machine is not as expected, running sudo service ntpd status can help you understand why.
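
The time check above can be sketched as a small helper that measures how far apart the node clocks are. This is a minimal sketch, not a definitive tool: the function name max_skew and the NODES variable are hypothetical, and it assumes you can ssh to every node without a password prompt.

```shell
#!/bin/sh
# Read epoch timestamps (one per line) on stdin and print the spread
# between the largest and the smallest value, in seconds.
max_skew() {
    awk 'NR == 1 { min = $1 + 0; max = $1 + 0 }
         { t = $1 + 0; if (t < min) min = t; if (t > max) max = t }
         END { if (NR > 0) print max - min }'
}

# Hypothetical usage (NODES is a placeholder for your cluster FQDNs):
#   for node in $NODES; do ssh "$node" date +%s; done | max_skew
```

Because the nodes are queried one after another, the result includes the ssh round-trip time, so a spread of a second or two does not necessarily mean the clocks disagree.
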
  • DNS - Using a DNS server is the recommended configuration (the alternative of relying on the /etc/hosts file is not recommended and is discussed in the next item). With DNS, nslookup and reverse nslookup should work flawlessly between nodes. If they don't, this can be a major issue that needs to be solved. You may want to check communication from your cluster machines to the cluster's DNS server. Consult your IT administrator if the issue persists.
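
The forward and reverse lookup check can be sketched as follows. This sketch swaps nslookup for dig (both ship in the bind-utils package) because dig's +short output is easier to script; the function names are hypothetical, and the lookup itself needs network access to your DNS server.

```shell
#!/bin/sh
# Normalize a host name for comparison: lowercase, strip one trailing dot.
normalize_name() {
    printf '%s\n' "$1" | tr '[:upper:]' '[:lower:]' | sed 's/\.$//'
}

# Succeed if two host names are equal after normalization.
names_match() {
    [ "$(normalize_name "$1")" = "$(normalize_name "$2")" ]
}

# Forward lookup of an FQDN, then reverse lookup of the returned IPv4 address.
# Variables are global because POSIX sh has no portable "local".
check_round_trip() {
    fqdn=$1
    ip=$(dig +short "$fqdn" A | grep -E '^[0-9]+(\.[0-9]+){3}$' | head -n 1)
    [ -n "$ip" ] || { echo "forward lookup failed: $fqdn"; return 1; }
    back=$(dig +short -x "$ip" | head -n 1)
    names_match "$fqdn" "$back" || { echo "reverse mismatch: $fqdn -> $ip -> $back"; return 1; }
    echo "ok: $fqdn <-> $ip"
}

# Hypothetical usage: check_round_trip node1.example.com
```
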
  • Alternative to DNS - In some deployments the alternative of relying on the /etc/hosts file is used. If you are not sure which is used in your deployment, check /etc/nsswitch.conf and verify that the configured line is hosts:      files dns myhostname - this means name lookups first check for entries in /etc/hosts and then query the DNS server. In this case nslookup and reverse nslookup are not expected to work between nodes, because nslookup queries the DNS server directly and does not read /etc/hosts. However, pinging the FQDNs of cluster nodes should work. If this is not the case, you may want to check your /etc/hosts file configuration.
    • /etc/hosts should be in the following format:

IP address     hostname - hostname should be lowercase only
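
The two file checks above (lookup order in /etc/nsswitch.conf and the /etc/hosts format) can be sketched with awk. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the format check only looks for a missing hostname and for uppercase characters, not for invalid IP addresses.

```shell
#!/bin/sh
# Print the lookup sources configured on the "hosts:" line of an
# nsswitch.conf-style file, e.g. "files dns myhostname".
hosts_lookup_order() {
    awk '$1 == "hosts:" {
             for (i = 2; i <= NF; i++) printf "%s%s", $i, (i < NF ? " " : "\n")
         }' "$1"
}

# Report hosts-file lines that have no hostname or an uppercase hostname.
# Returns non-zero if any problem was found.
check_hosts_file() {
    awk '
        { sub(/#.*/, "") }      # drop comments; awk re-splits the fields
        NF == 0 { next }        # skip blank lines
        NF < 2 { printf "line %d: missing hostname\n", NR; bad = 1; next }
        {
            for (i = 2; i <= NF; i++)
                if ($i != tolower($i)) {
                    printf "line %d: uppercase in hostname %s\n", NR, $i
                    bad = 1
                }
        }
        END { exit bad + 0 }
    ' "$1"
}

# Typical usage on a cluster node:
#   hosts_lookup_order /etc/nsswitch.conf
#   check_hosts_file /etc/hosts
```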

  • Network - An unhealthy network or incorrect network settings can impact your cluster significantly.
    • Your cluster nodes may be configured with DHCP or a fixed IP address. If an interface also has an IPv6 address, this can affect your cluster's health, and you should disable IPv6.
    • Is there any latency when pinging between the nodes? Are any pings being lost?
    • Might ports be blocked by firewall rules? Is the machine you are trying to access listening on the desired port?
      • netcat can be a useful command to detect an open port on a remote machine
        •  sudo nc -vz <remote_machine_fqdn> <remote_machine_port>
      • nmap and telnet can also be useful to detect an open port on a remote machine
      • netstat can be a useful command to check whether the local machine is listening on the desired port: sudo netstat -tunlp | grep <local_port>
      • ss is an alternative to netstat
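
The port checks above can be wrapped in two small helpers. This is a sketch under stated assumptions: the function names are hypothetical, the 3-second timeout is an arbitrary choice, and is_listening expects the column layout of ss -tln or netstat -tln, where the local address is the fourth field.

```shell
#!/bin/sh
# Succeed if a TCP connection to HOST PORT can be opened within 3 seconds.
remote_port_open() {
    nc -z -w 3 "$1" "$2" >/dev/null 2>&1
}

# Read "ss -tln" or "netstat -tln" output on stdin and succeed if some
# local address (4th column) ends in ":PORT".
is_listening() {
    awk -v port="$1" '$4 ~ (":" port "$") { found = 1 } END { exit !found }'
}

# Hypothetical usage:
#   remote_port_open node2.example.com 8080 && echo "port reachable"
#   ss -tln | is_listening 8080 && echo "listening locally"
```

Note that ss -tunl adds a leading Netid column, which shifts the local address to the fifth field, so this sketch should be fed TCP-only output.
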
  • Security - The firewall and SELinux are two security components running on cluster nodes.
    • sudo systemctl status firewalld should indicate that the firewall is stopped and disabled. That said, it can be enabled provided that there are rules allowing Interset UEBA traffic.
    • SELinux - getenforce should show Permissive or Disabled (note that during installation it will be temporarily disabled).
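
The two security checks above can be combined into a short report. This is a minimal sketch; the function names are hypothetical, and it uses systemctl is-active and systemctl is-enabled instead of parsing the longer systemctl status output.

```shell
#!/bin/sh
# Succeed if the given SELinux mode is one the article accepts.
selinux_mode_ok() {
    case "$1" in
        Permissive|Disabled) return 0 ;;
        *) return 1 ;;
    esac
}

# Print the firewalld and SELinux state of the local machine.
report_security() {
    fw_active=$(systemctl is-active firewalld 2>/dev/null)
    fw_enabled=$(systemctl is-enabled firewalld 2>/dev/null)
    echo "firewalld: active=${fw_active:-unknown} enabled=${fw_enabled:-unknown}"
    mode=$(getenforce 2>/dev/null)
    if selinux_mode_ok "$mode"; then
        echo "SELinux: $mode (ok)"
    else
        echo "SELinux: ${mode:-unknown} (expected Permissive or Disabled)"
    fi
}

# Usage on a cluster node: report_security
```

If firewalld is left enabled, sudo firewall-cmd --list-all shows the rules of the default zone, which you may want to compare against the ports your deployment needs.
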
  • OS - The Linux OS that is running on your cluster nodes can impact the health of your cluster as well.
    • Are all nodes on the same CentOS/RHEL release? Is it compatible with the release stated in the release notes? Check by executing cat /etc/centos-release or cat /etc/redhat-release.
    • Are there any unnecessary RPM packages installed on your cluster nodes? Our recommendation is to run Interset UEBA over a minimal OS installation.
    • Ulimit settings - ulimit -a should show an open files limit of 65536 or higher
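
The release and ulimit checks above can be sketched as two helpers. The function names and the NODES variable are hypothetical; ulimit -n prints only the open files limit, which is the value this sketch compares against 65536.

```shell
#!/bin/sh
# Succeed if the given open-files limit is "unlimited" or at least 65536.
nofile_ok() {
    case "$1" in
        unlimited) return 0 ;;
        ''|*[!0-9]*) return 1 ;;   # empty or not a plain number
    esac
    [ "$1" -ge 65536 ]
}

# Read release strings (one per node) on stdin and print how many
# distinct values there are; 1 means all nodes report the same release.
count_distinct_releases() {
    sort -u | wc -l | tr -d ' '
}

# Hypothetical usage (NODES is a placeholder for your cluster FQDNs):
#   nofile_ok "$(ulimit -n)" || echo "open files limit is too low"
#   for node in $NODES; do ssh "$node" cat /etc/redhat-release; done | count_distinct_releases
```
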
  • Resources - Whether you are using VMs or bare-metal machines, you may want to look at the allocated resources, as they may not be sufficient for your cluster's machines to run properly. The commands to use are:

df -h to check disk space, free -gh to check RAM, and top -i to check CPU utilization.

In the case of VMs, the host they run on may be low on resources due to activity outside of your cluster. You may want to check your virtual machine management software (e.g. vCenter, Hyper-V) to see what other VMs are deployed on the same host as your VMs.
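
The disk space check can be sketched as a filter over df output. This is a minimal sketch: the function name is hypothetical, the 90 percent threshold in the usage line is an arbitrary assumption, and it relies on df -P, which prints the stable POSIX column layout (capacity in the fifth column, mount point in the sixth).

```shell
#!/bin/sh
# Read "df -P" output on stdin and print the mount point and usage of
# every filesystem whose usage is at or above the given percentage.
full_filesystems() {
    awk -v limit="$1" 'NR > 1 {
        use = $5
        sub(/%/, "", use)
        if (use + 0 >= limit + 0) print $6, $5
    }'
}

# Usage on a cluster node: df -P | full_filesystems 90
```

Mount points that contain spaces are not handled by this sketch.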

  • Miscellaneous - When your cluster used to work flawlessly and now doesn't, you want to find out what has changed. Changes to the OS of the cluster's machines (new software or hardware installed?), changes to hardware inside or outside the machines (servers, routers/switches), a recent power outage, and rewiring of the cluster are just a few of the changes in your cluster's environment that may impact its health.