Linux system hangs or is unstable

Document ID:3301593
Creation Date:29-Oct-2007
Modified Date:16-Jul-2012
- Micro Focus Products:
  Open Enterprise Server
- SUSE Products:
  SUSE Linux Enterprise Desktop
  SUSE Linux Enterprise Server

Environment

SUSE Linux Enterprise Server 11
SUSE Linux Enterprise Server 10
SUSE Linux Enterprise Server 9
SUSE Linux Enterprise Server 8
Novell Open Enterprise Server 11 (OES 11) Linux
Novell Open Enterprise Server 2 (OES 2) Linux
Novell Open Enterprise Server 1 (OES 1) Linux

Situation

Symptoms
System hangs
System is unstable

Resolution

Follow a structured troubleshooting process covering the following areas discussed in detail below:

Problem characterization
Hardware layer
BIOS / firmware layer
Storage layer
Software layer

Additional Information

Introduction

Because system hangs have so many potential causes, they are among the most difficult problems to troubleshoot, and effective troubleshooting requires a systematic approach. This document describes such an approach in general terms.

Problem characterization

First of all, establish a detailed characterization of the problem that answers, at a minimum, the following questions:

What is meant by a hang or instability? Is the system no longer providing a particular service (reliably), has the system as a whole become completely inaccessible (both via network and via console), or is it still responsive to some forms of connection (e.g. SSH, VNC or ping) or commands?
For a hang, is it a single occurrence or has the hang occurred multiple times?
For a recurring hang, is there a pattern to the hangs? E.g. can the hang be triggered by a particular sequence of operations, or does it always occur around a particular time of day, after a particular period of system uptime, or when particular cron jobs are executed?
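
To look for a time-of-day pattern in a recurring hang, it can help to tabulate the recorded hang times by hour. The following is a minimal sketch only: the helper name hangs_per_hour is hypothetical, it assumes the hang times have been noted by hand as seconds since the epoch (one per line), and it relies on GNU date.

```shell
# Hypothetical helper: reads one hang time per line on stdin (seconds since
# the epoch, as noted by the administrator) and prints "<count> <hour>" for
# each hour of the day in which at least one hang was recorded.
hangs_per_hour() {
    while read -r epoch; do
        [ -n "$epoch" ] || continue      # ignore blank lines
        date -d "@$epoch" +%H            # GNU date: convert epoch to hour (00-23)
    done | sort | uniq -c | awk '{ print $1, $2 }'
}
```

For example, "hangs_per_hour < hang-times.txt" (hang-times.txt being an assumed file name) prints one line per hour. A single hour that dominates the output suggests checking cron jobs or backups scheduled around that time.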

Hardware layer

System hangs or instabilities can be caused by hardware that is defective or improperly configured. Unfortunately, this happens more than most people realize, for two main reasons:

A ground rule with hardware is "Cheap, reliable, fast. Pick any two". Hardware that is cheap and reliable is not fast; hardware that is fast and cheap is not reliable; hardware that is reliable and fast is not cheap.
Proper hardware configuration is difficult. Most hardware has many settings which can be tweaked, but knowing when and what to tweak can be something of a black art.

Use diagnostics software

Fortunately, reputable hardware vendors offer diagnostics software that can and should be used to detect hardware problems. If hardware problems are incorrectly disregarded as a problem source, much time will be wasted on analysing the software level.

Aside from vendor hardware diagnostics software, for x86 and x86_64 systems there are very thorough diagnostic tools for the memory subsystem: Memtest86 and Memtest86+. These tools are often better at identifying memory subsystem issues than vendor hardware diagnostics software. A version of them is included on the boot CD of Novell's Linux products and these tools can also be obtained from the www.memtest86.org and www.memtest86.com web sites.

Consult vendor configuration guides

As for hardware configuration, some vendors (e.g. IBM) provide detailed configuration guides for Novell SUSE Linux products on specific hardware models on their support sites. When available, this type of guide should be followed, preferably from the initial installation onwards. Even when such a guide has not been followed during initial installation, it should be consulted later on to check the system configuration and bring it in line with the hardware vendor's recommendations.

Consult certification documentation

Additionally, for Novell YES CERTIFIED configurations, consult the certification bulletin. Where applicable, the certification bulletins contain configuration details such as Linux kernel parameters.

Address power supply issues

In some regions or at some locations, power from the regular electrical grid may be too variable in voltage, frequency or current for hardware to operate reliably. In such locations, appropriate electrical hardware like surge protectors, voltage regulators, uninterruptible power supplies and/or generators should be used to provide reliable power for computer systems operation.

Isolate components

In some cases, stability issues and hangs are caused by specific extension cards. Remove all non-essential extension cards and test the system. Then put the cards back one by one, testing the system after every added card.

Best practice: "burn in" testing

In light of these considerations, it is considered best practice for hardware that is to be used for production services to undergo thorough "burn in" testing covering diagnostics and stress and load testing prior to being put into production use.

BIOS / firmware layer

On PC-based systems, the BIOS (Basic Input/Output System) is responsible for the initial setup of the system and devices up to the point where a boot loader can be started to boot the system. On other architectures, the term "BIOS" is not used, but equivalent embedded software exists, e.g. "Open Firmware" or "Extensible Firmware Interface".

The BIOS and its equivalents on non-PC architectures may also be involved in power management, hardware monitoring and hotplugging of extension cards.

A BIOS, like any other software, may contain general programming defects (bugs) and may not fully follow or support relevant standards such as ACPI. Vendors regularly release updated BIOS versions to correct such defects. Given the central role of the BIOS, it is important to track such version updates and to ensure the most recent non-development version of the BIOS is installed.

Most reputable vendors provide a search interface on their support sites that makes it easy to find the current BIOS revision for a particular hardware model as well as update instructions.

Other Firmware

With modern hardware, many components, for instance NICs, HBAs and storage controllers, include embedded software or firmware of their own. This firmware should be brought up to date as well.
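
Before comparing against the vendor's current releases, the installed firmware versions have to be read from the running system. As a minimal sketch, assuming the ethtool utility is installed, the hypothetical helper below extracts the firmware version from the output of "ethtool -i" for a network interface.

```shell
# Hypothetical helper: reads the output of "ethtool -i <interface>" on stdin
# and prints only the firmware version string reported by the driver.
nic_firmware_version() {
    awk -F': ' '$1 == "firmware-version" { print $2 }'
}
```

A typical call would be "ethtool -i eth0 | nic_firmware_version", where eth0 is only an example interface name. On systems that provide dmidecode, running "dmidecode -s bios-version" and "dmidecode -s bios-release-date" as root reports the corresponding BIOS details.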

Storage layer

Ensure that your storage is consistent by performing filesystem checks (and recovery) on all storage areas, including the root filesystem. To check the root filesystem, use the rescue environment from the service pack or installation CDs or DVDs.
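
As a starting point for deciding which devices to check, the entries in /etc/fstab that carry a non-zero fsck pass number can be listed. This is a sketch under assumptions: the helper name list_fsck_candidates is hypothetical, and filesystems with a pass number of 0, or ones not listed in /etc/fstab at all, have to be added to the list by hand.

```shell
# Hypothetical helper: reads fstab-formatted lines on stdin and prints
# "<device> <mount point> <type>" for every entry whose sixth field
# (the fsck pass number) is greater than zero.
list_fsck_candidates() {
    awk '
        /^[ \t]*(#|$)/ { next }          # skip comments and blank lines
        $6 + 0 > 0     { print $1, $2, $3 }
    '
}
```

Run it as "list_fsck_candidates < /etc/fstab" on the installed system, then check each listed device with fsck while that filesystem is not mounted.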

Software layer

Check for corrupted data

Even when the filesystems check out cleanly, data contained in them may be corrupted, including code and data vital to proper operation of the operating system. The package management system stores checksums of data under its control. Run

rpm -Vva

to verify the contents of your file system against those checksums.
Check the output of this command for signs of changes in files that are not configuration files, like binaries and libraries.
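
Because the verbose output lists every file, including those that verified cleanly, filtering it makes the relevant entries easier to spot. The following is a minimal sketch; the helper name filter_rpm_verify is hypothetical. It assumes the usual verify output layout: an attribute string (all dots when nothing differs), an optional file type marker such as "c" for configuration files, and the file name.

```shell
# Hypothetical helper: reads "rpm -V" style output on stdin and prints only
# entries that failed verification and are not marked as configuration files.
filter_rpm_verify() {
    awk '
        $1 ~ /^\.+$/ { next }    # attribute string is all dots: file is unchanged
        $2 == "c"    { next }    # "c" marker: configuration file
        { print }
    '
}
```

Used as "rpm -Vva | filter_rpm_verify", this leaves changed or missing binaries, libraries and other non-configuration files, which are the entries that deserve a closer look.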

Keep the software installation up to date

Novell actively maintains released products for long periods of time. This maintenance includes, in particular, fixes for software defects as well as drivers for newer hardware models. Use the tools supplied by Novell, in particular the SPident tool, the Novell Customer Center and the online update facilities of your product, to check whether your software installation is up to date and to bring it up to date if it isn't.

Check recent updates

Unfortunately, updated packages can occasionally introduce new defects. You can use the package management system of your Novell SUSE Linux product to determine what updates have been installed recently, e.g. through

rpm -qa --last

This may help isolate which software update introduced a defect. When an updated package breaks a previously functioning system, please inform Novell Technical Services through a service request or a bug report.
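
To narrow the list to packages installed after a given point in time, for example shortly before the first hang, the install time can be queried in numeric form and filtered. This is a sketch only: the helper name installed_since is hypothetical, and it expects its input in the "<install-epoch> <package>" form produced by the query format shown below.

```shell
# Hypothetical helper: reads "<install-epoch> <package>" lines on stdin and
# prints those installed at or after the cutoff given as the first argument
# (seconds since the epoch), newest first.
installed_since() {
    awk -v cutoff="$1" '$1 + 0 >= cutoff + 0' | sort -rn
}
```

For example, the following lists everything installed during the last seven days (the date syntax assumes GNU date):

rpm -qa --queryformat '%{INSTALLTIME} %{NAME}-%{VERSION}-%{RELEASE}\n' | installed_since "$(date -d '7 days ago' +%s)"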

Support from Novell Technical Services

Basic information

When opening a service request with Novell Technical Services for a server hang or instability issue, the following information may be vital to an efficient resolution process:

A detailed characterization of the problem (as discussed above)
A description of changes made to the system and its configuration during troubleshooting prior to the opening of a service request.
A configuration report for the affected system, created using the tool from TID 10100285 - Config Report For Linux. This tool should be run with the "-v" argument to include additional package management information. Attach this report to your service request as soon as your service request has been opened.

Crash dumps

During the handling of your service request, you may be asked to provide a system crash dump for analysis, which may require substantial setup (e.g. of a serial console and/or second server to receive dumps). You can prepare for this by consulting the relevant TIDs for details:

TID 3374462 - Configure kernel core dump capture for SLE10 products.
TID 3044267 - HOWTO: Configure lkcd to capture a kernel core dump for SLES9 and OES/Linux.

Change Log

2012 Jul 16 - jrecord - Added SLES11 to the environment
