19

In a new Xeon 55XX server with 4x SSD in RAID 10 running Debian 6, I have experienced two random shutdowns within two weeks of the server being built. Looking at the bandwidth logs before the shutdowns does not indicate anything unusual. The server load is usually very low (about 1) and it is colocated far away. There seems to have been no power outage while the server went down.

I know I should look in /var/log, but I am not sure which logs to investigate or what to look for, so I'd appreciate your hints.

alfish

7 Answers

11

First, I must ask: "shutdowns"? Do you mean that the machine reboots, or does it actually halt? If it halts, it is either misconfigured (perhaps in the BIOS) or something is actively shutting down the machine (e.g. init 0).

If not, your primary candidates would be /var/log/syslog and /var/log/kern.log, as your problem sounds like a kernel panic or a software-triggered hardware fault. Of course, if the server runs some service (e.g. Apache), its logs may give you a clue too.

Often, in situations like this, there are log entries generated, but because the machine is having difficulties, it won't manage to write the entries to disk. If the box is colocated, chances are that it is connected to a serial console by the colo partner. That is where I would look if I did not find anything suspicious in the above logs.

If the machine is not connected to a serial console and there is nothing in the log, you may want to consider sending syslog to a different box via network. Perhaps the network interface survives a bit longer, and the log messages can be read on the syslog server. Have a look at rsyslog or syslog-ng.
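
A minimal sketch of such forwarding, assuming rsyslog on both machines and a log host reachable at 192.0.2.10 (a placeholder address):

# On the failing server, e.g. in /etc/rsyslog.d/remote.conf:
*.* @192.0.2.10:514     # a single @ means UDP; use @@ for TCP

# On the receiving box, enable the UDP listener in /etc/rsyslog.conf:
$ModLoad imudp
$UDPServerRun 514

UDP is lossy, but it arguably has the best chance of escaping a dying machine, since there is no connection state to tear down.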

UPDATE:

I agree with @Johann below. The most likely cause of the halt is the processor temperature watchdog. Try checking/plotting the temperature in the box via lm-sensors or smartctl (usually the easiest). I find that collectd is unparalleled at keeping track of a large number of variables over time; it can do IPMI, lm-sensors and hddtemp. Also, some BIOSes log temperature halt events.
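
For a quick one-off reading, a sketch using the stock Debian tools (the drive path /dev/sda is an assumption; adjust to your array members):

apt-get install lm-sensors smartmontools
sensors-detect                        # probe for sensor chips and load the right modules
sensors                               # CPU/board temperatures and fan speeds
smartctl -A /dev/sda | grep -i temp   # drive temperature from the SMART attributes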

Bittrance
  • The machine went off, and returned to life just after I asked the support to manually start it. – alfish May 08 '12 at 09:37
  • If temperature is the issue, install munin to track temperature-data over time to spot trends. – pkhamre May 08 '12 at 10:11
  • +1 to temperature issues. Had the same thing on one of my servers in a datacenter - turns out they forgot to connect one of the CPU fans when they built the system. – Grant May 08 '12 at 13:25
10

First, you want to check /var/log/syslog. If you are not sure what to look for, you can start by looking for the words error, panic and warning.

grep -i error /var/log/syslog

If you have system graphs available (e.g. Munin), check them and look for abnormal patterns. If you do not have Munin installed, it might be a good idea to install it (apt-get install munin munin-node).

You should also check root's mail for any interesting messages that might be related to your system crash.
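
If no mail reader is installed, root's mailbox can usually be read straight from the spool (the path below is the common Debian default; your MTA may deliver elsewhere):

less /var/mail/root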

Other log files you should check are the application error logs, e.g. /var/log/apache2/error.log or similar. They might contain information leading you to the problem.

pkhamre
6

In my experience, an "unexpected halt" is almost always caused by overheating. Check your temperatures and fan speeds via lm_sensors and make sure that they are good.

Recently we saw the same pattern: a server halted about an hour after support manually started it. After that hour the CPU temperature hit the threshold configured in the BIOS (IIRC 60 or 70°C) and the system halted. All of this trouble was caused by a broken CPU fan. After replacing the fan, everything returned to normal.
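
If the box has a BMC, its hardware event log often records such thermal trips even when nothing reaches the OS logs. A sketch using ipmitool (assumes a BMC is present and the IPMI kernel drivers load):

apt-get install ipmitool
modprobe ipmi_devintf ipmi_si    # expose the BMC to the OS
ipmitool sel list                # system event log: look for temperature events
ipmitool sensor | grep -i temp   # current readings and their thresholds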

ercpe
2

There are a number of log files in the /var/log directory (and its subdirectories), including

/var/log/boot

and

/var/log/boot.log

Start with the files above.
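
Since file modification times survive the reboot, a quick way to triage is to list which logs were written last before the crash (a sketch):

ls -lt /var/log | head -20    # most recently modified logs first
tail -n 100 /var/log/syslog   # then read the end of anything touched just before the shutdown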

asdmin
Naveen
  • And look for "what"? – Pierre.Vriens Jun 15 '16 at 06:36
  • That depends on the type of failure that occurred. In most cases the root cause is a kernel crash, a power failure or an overheat-induced CPU shutdown, which means there's nobody left to write an entry to the log files and flush it to disk, so there will be no messages there at all. – asdmin Jun 16 '16 at 06:39
2

You can find out whether the system knew it was going down with the following commands:

sudo last -1x reboot
sudo last -1x shutdown

If there is no info, then it could be a loss of power or something else external.

If you do have info, search the logs around the reboot/shutdown time.
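
A sketch of what that looks like in practice (the output line and timestamp below are made up; substitute whatever last prints for your event):

sudo last -1x reboot
# reboot  system boot  2.6.32-5-amd64  Tue May  8 09:40
grep 'May  8 09:' /var/log/syslog | tail   # messages leading up to that moment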

  • Not sure why this is downvoted, but IMO this is the best advice for finding out whether the system was properly shut down or not. – psv Aug 11 '20 at 13:24
1

There are two ways of checking what triggered the shutdown. First, check the out-of-band management console for any issue in the hardware; I would suggest configuring SNMP and receiving emails, or adding the traps to monitoring software, so you get an alert.

Then, through the operating system, you can check either /var/log/messages (Red Hat based distros) or /var/log/syslog (Debian based distros).
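
A minimal sketch of the trap side, assuming Net-SNMP's snmpd on the server and a trap receiver on a monitoring host at 192.0.2.20 (a placeholder address):

# in /etc/snmp/snmpd.conf:
trap2sink 192.0.2.20 public   # send SNMPv2 traps to the monitoring host
# then restart the daemon:
/etc/init.d/snmpd restart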

etcshad0vv
0

The disk subsystem is complicated enough to be affected when a problem occurs, which is why you'll hardly get anything in your log files.

Try logging over the serial console. This needs some cabling and another system to pick up the lines, but you have a better chance of actually catching the problem.
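
A sketch for a GRUB 2 / Debian 6 box, sending kernel and boot-loader output to the first serial port (ttyS0 and 115200 baud are assumptions; match your cabling and BIOS settings):

# in /etc/default/grub:
GRUB_CMDLINE_LINUX="console=tty0 console=ttyS0,115200n8"
GRUB_TERMINAL=serial
GRUB_SERIAL_COMMAND="serial --speed=115200 --unit=0 --word=8 --parity=no --stop=1"
# then apply:
update-grub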

Of course if your node has a built-in management system similar to Oracle's ALOM/ILOM, you can also check for possible problems and log files there.

asdmin