My Ubuntu Server 11.10 machine went down in the middle of the night a few days ago for no apparent reason. Now I want to know what the problem was.

Here is part of the syslog, of which I can't understand a single word. Can anyone help me pinpoint the problem?

The server was down between 23:17:01 and 07:41:43, when we restarted its hardware.

Jul 15 22:55:02 my-webserver CRON[4879]: (CRON) info (No MTA installed, discarding output)
Jul 15 23:00:01 my-webserver CRON[5576]: (munin) CMD (/usr/bin/munin-cron)
Jul 15 23:00:01 my-webserver CRON[5578]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jul 15 23:00:01 my-webserver CRON[5577]: (munin) CMD (if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi)
Jul 15 23:00:02 my-webserver CRON[5575]: (CRON) error (grandchild #5576 failed with exit status 1)
Jul 15 23:00:02 my-webserver CRON[5575]: (CRON) info (No MTA installed, discarding output)
Jul 15 23:05:01 my-webserver CRON[6229]: (munin) CMD (if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi)
Jul 15 23:05:01 my-webserver CRON[6230]: (munin) CMD (/usr/bin/munin-cron)
Jul 15 23:05:01 my-webserver CRON[6231]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jul 15 23:05:01 my-webserver CRON[6226]: (CRON) error (grandchild #6229 failed with exit status 1)
Jul 15 23:05:01 my-webserver CRON[6226]: (CRON) info (No MTA installed, discarding output)
Jul 15 23:09:01 my-webserver CRON[6838]: (root) CMD (  [ -x /usr/lib/php5/maxlifetime ] && [ -d /var/lib/php5 ] && find /var/lib/php5/ -depth -mindepth 1 -maxdepth 1 -type f -cmin +$(/usr/lib/php5/maxlifetime) ! -execdir fuser -s {} 2>/dev/null \; -delete)
Jul 15 23:10:01 my-webserver CRON[8404]: (munin) CMD (if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi)
Jul 15 23:10:01 my-webserver CRON[8405]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jul 15 23:10:01 my-webserver CRON[8407]: (munin) CMD (/usr/bin/munin-cron)
Jul 15 23:10:01 my-webserver CRON[8401]: (CRON) error (grandchild #8404 failed with exit status 1)
Jul 15 23:10:01 my-webserver CRON[8401]: (CRON) info (No MTA installed, discarding output)
Jul 15 23:15:01 my-webserver CRON[9036]: (munin) CMD (if [ -x /usr/bin/munin-cron ]; then /usr/bin/munin-cron; fi)
Jul 15 23:15:01 my-webserver CRON[9035]: (munin) CMD (/usr/bin/munin-cron)
Jul 15 23:15:01 my-webserver CRON[9041]: (root) CMD (if [ -x /etc/munin/plugins/apt_all ]; then /etc/munin/plugins/apt_all update 7200 12 >/dev/null; elif [ -x /etc/munin/plugins/apt ]; then /etc/munin/plugins/apt update 7200 12 >/dev/null; fi)
Jul 15 23:15:01 my-webserver CRON[9034]: (CRON) error (grandchild #9035 failed with exit status 1)
Jul 15 23:15:01 my-webserver CRON[9034]: (CRON) info (No MTA installed, discarding output)
Jul 15 23:17:01 my-webserver CRON[9544]: (root) CMD (   cd / && run-parts --report /etc/cron.hourly)
Jul 16 07:41:43 my-webserver kernel: imklog 5.8.1, log source = /proc/kmsg started.
Jul 16 07:41:43 my-webserver rsyslogd: [origin software="rsyslogd" swVersion="5.8.1" x-pid="783" x-info="http://www.rsyslog.com"] start
Jul 16 07:41:43 my-webserver rsyslogd: rsyslogd's groupid changed to 103
Jul 16 07:41:43 my-webserver rsyslogd: rsyslogd's userid changed to 101
Jul 16 07:41:43 my-webserver rsyslogd-2039: Could no open output pipe '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
Jul 16 07:41:43 my-webserver kernel: [    0.000000] Initializing cgroup subsys cpuset
Jul 16 07:41:43 my-webserver kernel: [    0.000000] Initializing cgroup subsys cpu
Jul 16 07:41:43 my-webserver kernel: [    0.000000] Linux version 3.0.0-12-server (buildd@crested) (gcc version 4.6.1 (Ubuntu/Linaro 4.6.1-9ubuntu3) ) #20-Ubuntu SMP Fri Oct 7 16:36:30 UTC 2011 (Ubuntu 3.0.0-12.20-server 3.0.4)
Farid Rn
  • No offense here, but if you "can't understand a single word" of your server's syslog, perhaps you ought to hire someone who does understand it. – EEAA Jul 19 '12 at 15:07
  • The cleaning crew needed to power their vacuum cleaner at 23:17? We actually can't rule that out, given the logs. – cjc Jul 19 '12 at 15:27
  • Basically, whatever it was, it wasn't logged. It could be a kernel-level error that caused the system to lock up, it could have been a power outage, it could have been someone pushing the big red button. The errors reported by cron are interesting, but since you apparently didn't have a mailserver set up to get the error messages, it's hard to say if they're related at all. Your best bet at figuring it out is to see what happened at 7:41:43 to make it come back online. Was it off? Was it on but locked up? If so, was there anything printed on the console? – DerfK Jul 19 '12 at 16:17
  • Where does the server reside? Hosting Provider? Inhouse? – HTTP500 Jul 19 '12 at 16:22
  • @DerfK I wasn't aware the server was down until 7:41:43, when one of our employees called and said the site was not loading, so I called someone in our server room to restart the server and everything went back to normal. I'm not 100% sure, but I think no one pushed that button. There are about 100 servers in our server room, and this was the only one that had problems that night. – Farid Rn Jul 19 '12 at 16:42
  • @HTTP500 I'm a software developer on this network. Because we use PHP and MySQL on this server, it's the only one running Linux; our other servers are all Windows-based. And because we don't have Linux experts, I'm in charge of this server's performance. – Farid Rn Jul 19 '12 at 16:44
  • @faridv, You didn't really answer my question. But your other comment "our server room" seems to suggest inhouse. If it was Hosting Provider I was going to suggest that you follow-up with them. – HTTP500 Jul 19 '12 at 16:50
  • You have 100 servers in your server room and no alerting/monitoring?! – HTTP500 Jul 19 '12 at 16:51
  • @HTTP500 It's a TV channel's server room. I don't know if they are using any alerting/monitoring tool, but I'm sure this server is not on the list. Now my manager wants a report about what happened that night and I have nothing to say! :( – Farid Rn Jul 19 '12 at 16:56
  • Take a look at some of the other logs: `/var/log/kern.log` should have kernel level messages, `/var/log/dmesg` should have messages from this bootup that might include whether the drives had to be checked due to unclean shutdown. Honestly, though, the evidence of whatever it was probably went away with the powercycle, especially if nobody remembers/noticed whether it was powered down before restarting it. – DerfK Jul 19 '12 at 17:14
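
One way to act on that last suggestion is to search the kernel log for signs that the filesystem journal was replayed at boot, which usually points to an unclean shutdown. The following is only a rough sketch under stated assumptions: the names `find_unclean_hints` and `scan_file` are hypothetical, and the pattern list is an assumption based on messages that ext3/ext4 typically print during journal recovery. Other filesystems or kernel versions may word things differently, so an empty result proves nothing.

```python
import re

# Assumed patterns: messages ext3/ext4 typically log when the journal is
# replayed after an unclean shutdown. Adjust them for your filesystem.
UNCLEAN_HINTS = re.compile(r"recovery (required|complete)|orphan inode", re.IGNORECASE)


def find_unclean_hints(lines):
    """Return the log lines that match any of the assumed hint patterns."""
    return [line.rstrip("\n") for line in lines if UNCLEAN_HINTS.search(line)]


def scan_file(path):
    """Read a log file and return the matching lines."""
    # errors="replace" keeps the scan going if the log contains odd bytes.
    with open(path, errors="replace") as log:
        return find_unclean_hints(log)
```

Calling `scan_file("/var/log/kern.log")` (or pointing it at `/var/log/dmesg`) and printing the result would show whether the boot at 07:41:43 had to recover the journal. That would suggest the machine lost power or locked up rather than shutting down cleanly, but it would not say which.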

3 Answers

No. But I can tell you what you should do now.

  1. Set up monitoring. Get Nagios or Zabbix or something similar. If you only have one server, install it there but be aware it won't be able to alert you if the whole server goes down, only if certain services go down.
  2. Set up more monitoring. Get an external third-party service like Pingdom or HostTracker. These sorts of services often have free or very cheap options if that is a problem.
  3. Set up remote access. Something like a KVM or a serial console.
  4. Set up performance monitoring. This is covered by software like Zabbix (again), Munin or Cacti. (Technically Nagios can do this but I don't like it for this functionality.) What you get out of this is graphs showing what your server was doing and what it was running out of just before it stopped responding.

At the very least, with the monitoring and alerting in place, your downtime will be reduced to minutes rather than hours. With the remote access and graphing you might just get enough data to figure out what happened.
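
As a stopgap until a real monitoring service is in place, the external check from point 2 can be approximated with a few lines of Python run from a different machine (for example from cron every few minutes). This is a minimal sketch, not a replacement for Pingdom or Nagios: the function name `is_up` and its `opener` parameter are hypothetical, and what to do when it returns `False` (send mail, an SMS, and so on) is left open.

```python
import urllib.request


def is_up(url, timeout=10, opener=urllib.request.urlopen):
    """Return True if the URL answers with an HTTP status below 400."""
    try:
        with opener(url, timeout=timeout) as response:
            return response.status < 400
    except OSError:
        # urlopen signals connection failures, timeouts and HTTP 4xx/5xx
        # responses with exceptions derived from OSError.
        return False
```

A call such as `is_up("http://your-site.example/")` (a placeholder address) has to run somewhere other than the server being watched, otherwise it goes down together with the thing it is supposed to report on.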

Ladadadada
  • I think this is the most useful answer yet, don't forget about observium – Lucas Kauffman Jul 19 '12 at 17:17
  • +1 for KVM or serial console - when the system fails, it will NOT try to write to the logs on the hard drive since it risks corrupting files. It may however print error messages to the console, which you won't understand either, but at least it gives you something you can google for. – Grant Jul 19 '12 at 17:40

I see two possibilities:

  1. Your location suffered a power failure around 23:17 and power was restored around 07:41.

  2. Someone who is at your company overnight decided to unplug the computer.

Michael Hampton
  • Nice ideas but I'm looking for a third possibility! – Farid Rn Jul 19 '12 at 16:46
  • A third possibility is someone needed to borrow the power outlet for a server test... With the information given it's almost impossible to pinpoint the exact cause. – Red Tux Jul 19 '12 at 19:23

There's nothing in that log to indicate why it rebooted. At Jul 15 23:17:01 it was running, at Jul 16 07:41:43 it was restarted.

You'll need to look into resource utilization logs, application logs, network logs, etc.
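
When going through those other logs, it can help to locate the same silent window in each of them automatically. The sketch below is a hedged illustration: `parse_stamp` and `find_gaps` are hypothetical names, the 30-minute threshold is arbitrary, and the year has to be passed in because traditional syslog timestamps omit it. It also assumes English month abbreviations, as in the log above.

```python
from datetime import datetime, timedelta


def parse_stamp(line, year):
    """Parse the leading 'Jul 15 23:17:01' timestamp of a syslog line."""
    # Traditional syslog timestamps carry no year, so the caller supplies it.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def find_gaps(lines, year, threshold=timedelta(minutes=30)):
    """Yield (line_before, line_after, gap) for each silence longer than threshold."""
    previous_line, previous_time = None, None
    for line in lines:
        line = line.rstrip("\n")
        try:
            current_time = parse_stamp(line, year)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if previous_time is not None and current_time - previous_time > threshold:
            yield previous_line, line, current_time - previous_time
        previous_line, previous_time = line, current_time
```

Run over the syslog excerpt in the question (assuming the year is 2012), this reports a single gap between the 23:17:01 cron line and the 07:41:43 rsyslog start line. A gap with the same boundaries in the Apache or MySQL logs would confirm that the whole machine stopped, not just logging.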

Coops
  • I'm really confused. My manager wants to know what the problem was and I really can't point it out. I looked through the MySQL, Apache and dmesg logs and there's nothing there. I just know that my server went down at 23:17:01 and I want to know why! – Farid Rn Jul 19 '12 at 15:49