
Here's the deal:

I came to work only to find out that one server isn't responding at all. The machine is turned on, but the screen doesn't show anything at all and it doesn't respond to keyboard input (I don't have the SysRq keys enabled).

The server needed to be up and running as fast as possible, so I did a hard reset, and it's all working fine now.

Now my boss wants to know what happened and why.

So how do I start debugging what went wrong before the reboot? Which logs should I pay special attention to, and are there any neat tricks you might know for debugging a random server freeze? (It doesn't happen often; this is the first time I've seen it.)

Thanks for any useful guidelines and suggestions.

PS: I'm running Ubuntu Server 12.04.

zidarsk8

3 Answers


Since it's probably a hardware fault, I'd look at some hardware diagnostics.

If you have a hardware RAID controller, I'd find out if you can read its log (if it's 3ware, use tw_cli). And whether you have hardware or software RAID, you can look at the SMART parameters of the disks (if the disks are connected to a RAID controller, you may need special options to access them; see the smartctl manpage).
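For example, a minimal sketch (the controller and disk numbers and the device names are assumptions that depend on your setup):

tw_cli /c0 show alarms                  # read the 3ware controller's event log
smartctl -a -d 3ware,0 /dev/twa0        # SMART data for the first disk behind a 3ware controller
smartctl -a -d megaraid,0 /dev/sda      # the equivalent for an LSI MegaRAID controller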

If you do:

smartctl -a /dev/sdX

I always primarily look at:

  • Reallocated sector count. It's especially bad when it's increasing over time, and I don't fully trust a disk that has any reallocated sectors.
  • The SMART error log. It's tricky to read at first, but the main thing is to see whether there are events and at what time (expressed in disk age in hours) they occurred. You can see the current disk age as one of the SMART parameters. If an error is recent, it may be related.
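If you want to pull those fields out quickly, something like this should work (a sketch; exact attribute names vary a little between drive vendors):

smartctl -a /dev/sdX | grep -E 'Reallocated_Sector|Power_On_Hours'
smartctl -l error /dev/sdX              # show only the SMART error log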

Also, keep an eye on dmesg and syslog to see if you get errors over time. For example, disk errors often show up as ata exceptions long before they become a fatal problem. We have a central logging server (using rsyslog) that notifies me about ata exceptions. A quick example of how to set that up:

/etc/rsyslog.d/60-smtp.conf:

$ModLoad ommail
$ActionMailSMTPServer localhost
$ActionMailFrom noreply@example.com

/etc/rsyslog.d/70-mail-ata-errors.conf (the .conf extension matters; rsyslog only includes *.conf files by default):

$ActionMailTo you@example.com
$template mailSubjectATA,"ATA error on %hostname%"
$template mailBodyATA,"You have ATA errors. Mostly it's the disk and you get these errors before a possible mdraid setup kicks the drive.\r\nBEWARE: ata1.00 is first ata, first disk. Ata1.01 is first ata, second disk. Use the ata-to-device-names.sh script to identify devices.\r\n msg='%msg%'"
$ActionMailSubject mailSubjectATA
$ActionExecOnlyOnceEveryInterval 3600
:msg, regex, "ata.*exception" :ommail:;mailBodyATA

See here for the ata-to-device-names.sh script.
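After creating both files, restart rsyslog and, if you like, inject a fake message to test the mail action (the logger line below is a hand-written test message, not a real kernel error):

sudo service rsyslog restart
logger -p kern.err "ata1.00: exception Emask 0x10 SAct 0x0 SErr 0x4010000"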

Another thing you can do is a memtest. Ubuntu installation DVDs/CDs have one in the boot menu, and I believe any Ubuntu server has one in its regular boot menu as well. Let it make at least one pass, more if possible.
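If you can't take the machine offline for a full memtest86+ run, the userspace memtester package is a partial alternative; it can only test memory the kernel is willing to hand it, so it's less thorough (the amount below is just an example):

sudo apt-get install memtester
sudo memtester 1024M 1                  # test 1 GiB of RAM, one pass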

Do you have ECC RAM BTW? ECC RAM is important for long term stability and data integrity.
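You can check whether ECC is present and active with dmidecode (requires root; the exact wording of the output varies by BIOS):

sudo dmidecode -t memory | grep -i 'error correction'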

Halfgaar
  • Thank you for this. I don't have ECC, and the hard drives seem to be fine. syslog has some kernel messages and stack traces that I'll look into right now, but so far it all looks like something IPv6 related, which is funny: `ICMPv6: RA: ndisc_router_discovery failed to add default route` – zidarsk8 Jul 09 '14 at 18:36
  • If your mainboard supports ECC, I would install it. As for those IPv6 errors: they are not fatal. – Halfgaar Jul 09 '14 at 20:03

/var/log/syslog is a good place to start. Find the first log messages after the reboot. They will say something about syslog starting and what kernel version you are running.

Then scroll up and find the last line that was logged before the system crashed. Scroll up further to see if you can find any log messages from the kernel itself.

Go through other logs in /var/log to see if you can find any lines with a time stamp between the last log line from before the crash and the first from after.
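One way to locate that boundary (a sketch; the kernel logs a "Linux version" line at every boot, and the sed line numbers below are hypothetical):

grep -n 'Linux version' /var/log/syslog
sed -n '4230,4245p' /var/log/syslog     # inspect the lines just before the last boot marker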

It is highly probable that all of this effort will only narrow down the time of the crash, but not tell you anything about why the server crashed. In particular, if it is a hardware fault, it can be difficult to get proper log messages.

There may be configuration changes which can be made to help get more information in case the problem happens again. Enabling the SysRq key is one option. It may also be worthwhile to turn off screen blanking (I assume you avoid wasting power by not having the monitor turned on while you are not using it). Moreover, logging across the network to another server may help, in particular if the root cause is disk/filesystem related.
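A sketch of those three changes on Ubuntu 12.04 (the log host name is an assumption):

# 1. Enable the SysRq key across reboots
echo 'kernel.sysrq = 1' | sudo tee /etc/sysctl.d/60-sysrq.conf
sudo sysctl -p /etc/sysctl.d/60-sysrq.conf

# 2. Disable console blanking: add consoleblank=0 to
#    GRUB_CMDLINE_LINUX in /etc/default/grub, then run:
sudo update-grub

# 3. Forward all log messages to another host over UDP; put this
#    line in /etc/rsyslog.d/90-remote.conf and restart rsyslog:
*.* @logserver.example.com:514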

kasperd

Part of me wants to say that Linux shouldn't just crash... Modern operating systems under normal usage patterns should be fairly stable. When I do see server instability, it's almost always a hardware or driver interaction. I would recommend looking very closely at the server, its conditions, and related components (RAM, storage, etc.).

If you're using hardware that doesn't or can't provide insights into the hardware health (like a desktop-class machine), there's little chance you'll see much of anything reflected in the Linux-level logging.
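Conversely, if the box is server-class hardware with a BMC, the IPMI system event log often records hardware faults (overheating, power events, ECC errors) that never make it into the OS logs. A sketch using ipmitool (assumes the kernel IPMI modules are available on the machine):

sudo apt-get install ipmitool
sudo ipmitool sel list                  # dump the hardware system event log
sudo ipmitool sdr                       # current sensor readings: temperatures, voltages, fans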

ewwhite