How to troubleshoot and collect diagnostic information from an abended, unresponsive or crashed server

  • 3193476
  • 21-Feb-2008
  • 27-Apr-2012

Environment

Novell NetWare
Novell Open Enterprise Server (NetWare based)

Situation

Troubleshooting
  • Abend
  • Hang
  • Crash
  • Freeze / Frozen
  • Dead
  • Outage
  • Unresponsive
  • Broken

Resolution

Objectives: What are you trying to achieve?
  • You are trying to establish what happened in the run-up to the failure
  • You are trying to gather as much information as you can at failure-time before the server is rebooted and the evidence is destroyed; treat a crashed server like a crime scene
  • Do not lose sight of these objectives by becoming overly focussed on this procedure; e.g. do not delay a reboot for three hours while you drive to site just to see if the hard disk lights are flashing
Before you start
  • Don't panic – you'll miss important information
  • Be thorough, systematic and methodical
What is the problem?
  • Describe what you see in plain language; e.g. "the server is displaying the Monitor screen, it is static, nothing is updating, the hard disk lights are not flashing, the Num Lock key doesn't light/unlight when pressed, can't swap round screens"
Clearly define events as you understand them
  • Do not use vague language; e.g. "we have a server outage"
  • Do explain ambiguous terms; e.g. a "hang" means different things to different people – explain what it means to you (See Reflex Actions, below)
  • Clearly state what is fact, what is an assumption and what you don't know; if you're not sure about something, say so
  • In the absence of, and in addition to, facts get anecdotal evidence but be sure to say that it is so; e.g. "Users think the server has been getting slower for weeks"
  • Establish Scope and Scale; e.g. "it looks like only some users are affected, not all of them" and "this only affects users accessing files on Server ABCD999"
  • Talk to the help desk and other support teams to see if there have been any other issues that might be related; e.g. "NetWare backups are slow but so are Windows backups"
  • Include anything that may be relevant; e.g. "The building got struck by lightning last night"
Reflex actions: Things you should always do if a server "hangs" (or is off the air, slow, etc)
  • PING the server from various places (same segment, different segment, etc); do multiple PINGs as some may get through and some may not
  • At the console, do any screens continue to update? Is there any activity? e.g. Monitor, logger, timesync, ZENworks, etc
  • At the console, do the Num Lock and Caps Lock lights toggle when the keys are pressed?
  • At the console, can you swap screens? i.e. <ALT><ESC>
  • At the console, can you bring up the screen menu? i.e. <CTRL><ESC>
  • At the console, can you bring up the screen title? i.e. <ALT>
  • Is there a console prompt? Can you type anything at it? e.g. Servername: HELP
  • If the server appears totally dead check whether hard drive or network adapter lights are flashing
  • Note: sometimes there is a delay between pressing a key and something happening
  • If you can swap screens check what LOAD MONITOR -> Kernel -> Busiest Threads says; does a particular thread stay at the top of the list? Do the same threads cycle at the top? Make a note of them
  • What is the Server Utilisation?
  • Are there any errors on the console, logger, or any other screens?
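The repeated PING check above can be scripted from a workstation so that partial packet loss is recorded rather than guessed at. The following is a minimal sketch, not a supported tool: the function names (ping_once, ping_summary) are hypothetical, it relies on the workstation's own ping command, and the flags shown assume Windows or Linux style ping (other platforms may differ). The exit code of ping is taken as the success test, which is an assumption worth confirming by eye on your platform.

```python
import platform
import subprocess
import time
from typing import Callable


def ping_once(host: str, timeout_s: int = 2) -> bool:
    """Send one echo request with the operating system's ping command.

    Assumption: Windows uses -n/-w (milliseconds), Linux uses -c/-W (seconds).
    """
    if platform.system() == "Windows":
        cmd = ["ping", "-n", "1", "-w", str(timeout_s * 1000), host]
    else:
        cmd = ["ping", "-c", "1", "-W", str(timeout_s), host]
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s + 5,
        )
    except (OSError, subprocess.TimeoutExpired):
        # ping missing or stuck: count it as no reply
        return False
    return result.returncode == 0


def ping_summary(
    host: str,
    attempts: int = 10,
    pause_s: float = 1.0,
    probe: Callable[[str], bool] = ping_once,
) -> dict:
    """Ping a host several times and count how many attempts got a reply."""
    replied = 0
    for attempt in range(attempts):
        if probe(host):
            replied += 1
        if attempt < attempts - 1:
            time.sleep(pause_s)  # spread the attempts out over time
    return {
        "host": host,
        "sent": attempts,
        "replied": replied,
        "lost": attempts - replied,
    }
```

Run something like ping_summary("ABCD999") from a workstation on the same segment and again from a different segment, and note both results along with the time they were taken.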
If the server has lost the prompt but is otherwise OK
  • When convenient, try to DOWN it using the Emergency Console (<CTRL><ALT><ESC>) or via NoRM (Manage Server -> Down/Restart)
  • If these don't work, try to DOWN the server from the debugger (<LSHIFT><RSHIFT><ALT><ESC>) -> Q
Files to collect after a problem
  • SYS$LOG.ERR
  • CONSOLE.LOG
  • LOGGER.TXT
  • ABEND.LOG
  • CONFIG Report
  • Do not edit any log files as the history can be useful; they are plain text, so they should compress by over 90% when zipped
  • Where possible, get screenshots from a workstation by remote controlling the server (RConJ, RDB, RConsole, etc)
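Once the log files have been copied to a workstation, they can be bundled unedited into a single ZIP for sending on. The following is a minimal sketch under stated assumptions: the function name bundle_logs and the directory layout are hypothetical, and it expects the files to already sit in one local folder. The CONFIG report is not in the default list because its file name depends on how it was saved; pass its name in via the wanted parameter.

```python
import zipfile
from pathlib import Path

# Log file names taken from the list above
WANTED = ("SYS$LOG.ERR", "CONSOLE.LOG", "LOGGER.TXT", "ABEND.LOG")


def bundle_logs(source_dir, archive_path, wanted=WANTED):
    """Zip the wanted files from source_dir without modifying them.

    Returns the names that were not found, so gaps can be reported.
    """
    source = Path(source_dir)
    # Match names case-insensitively, as copies may differ in case
    present = {p.name.upper(): p for p in source.iterdir() if p.is_file()}
    missing = []
    with zipfile.ZipFile(archive_path, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for name in wanted:
            path = present.get(name.upper())
            if path is None:
                missing.append(name)
            else:
                # Files are only read, never edited, so the history is preserved
                zf.write(path, arcname=name)
    return missing
```

The returned list is worth including in the problem description so that it is clear which files were unavailable rather than forgotten.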
If you suspect memory problems
  • LOAD SEG will show how the server is using its memory
  • Screenshots (see above) to get from SEG: LOAD SEG <Take Screenshot>, <F10> <Take Screenshot>, <F7> <Take Screenshot>
  • SEG needs to be loaded while the server is having problems in order to collect statistics in SEGx.CSV and show how the server was behaving; i.e. have SEG already loaded on problem servers
  • Pressing slash '/' brings up a menu so you can write SEGSTATS from the Info option
  • SEG.NLM is available from CoolSolutions (https://www.novell.com/coolsolutions/tools/14445.html)
Taking a Coredump
  • There is little or no value in analysing anything other than the first abend because abend <2> and later could merely be symptoms of the initial Abend rather than problems in their own right
  • Take the coredump when the abend occurs; do not allow the server to keep running (abend recovery) and then take the coredump manually (via <LSHIFT><RSHIFT><ALT><ESC>) at a later date
  • This procedure assumes that the server has been primed to dump to the local hard disk, to a dump volume or across the network
  • A coredump can be taken when the server has abended, by answering YES to Copy Diagnostic Image at the prompt, or at any time (e.g. during High Utilisation) from the debugger by typing .c (dot c)
  • Follow the prompts to take the dump: YES to take dump without cache and NO to compress the dump (as this takes longer and can be unreliable)
  • See KB 10099907 - General Information About Taking A Coredump (https://www.novell.com/support/search.do?searchString=10099907)