My server randomly hangs and becomes unresponsive without logging anything (dmesg, syslog, kern.log, boot.log, and messages). I cannot predict when it is going to happen. Sometimes the server runs fine for months and then it suddenly starts happening again. In the last week it happened more than 8 times. This has been going on for more than a year.
The kernel log is always the same:
Jan 24 03:20:34 voyager dnsmasq-dhcp[4476]: DHCPREQUEST(br100) 192.168.145.3 fa:16:3e:4e:e0:d5
Jan 24 03:20:34 voyager dnsmasq-dhcp[4476]: DHCPACK(br100) 192.168.145.3 fa:16:3e:4e:e0:d5 viaapp
Jan 24 03:20:37 voyager dnsmasq-dhcp[4476]: DHCPREQUEST(br100) 192.168.145.9 fa:16:3e:62:09:86
Jan 24 03:20:37 voyager dnsmasq-dhcp[4476]: DHCPACK(br100) 192.168.145.9 fa:16:3e:62:09:86 web-sistemas
Jan 24 03:20:38 voyager dnsmasq-dhcp[4476]: DHCPREQUEST(br100) 192.168.145.16 fa:16:3e:79:dd:f8
Jan 24 03Jan 24 03:22:47 voyager kernel: imklog 5.8.6, log source = /proc/kmsg started.
Jan 24 03:22:47 voyager rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="2040" x-info="http://www.rsyslog.com"] start
Jan 24 03:22:47 voyager rsyslogd: rsyslogd's groupid changed to 103
Jan 24 03:22:47 voyager rsyslogd: rsyslogd's userid changed to 101
Jan 24 03:22:47 voyager rsyslogd-2039: Could not open output pipe '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
Jan 24 03:22:47 voyager kernel: [ 0.000000] Initializing cgroup subsys cpuset
Jan 24 03:22:47 voyager kernel: [ 0.000000] Initializing cgroup subsys cpu
Jan 24 03:22:47 voyager kernel: [ 0.000000] Linux version 3.2.0-60-generic (buildd@toyol) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #91-Ubuntu SMP Wed Feb 19 03:54:44 UTC 2014 (Ubuntu 3.2.0-60.91-generic 3.2.55)
Jan 24 03:22:47 voyager kernel: [ 0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.2.0-60-generic root=UUID=c8dba39e-4d36-4528-9432-d610fce72407 ro crashkernel=384M-2G:64M,2G-:128M console=tty1 console=ttyS0,115200n8
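To put a time on each hang, the log can be scanned for the rsyslog startup marker and paired with the last complete timestamp written before it. This is only a rough sketch: the function name `find_restarts` is made up for illustration, and it assumes the traditional syslog timestamp format shown in the excerpt above.

```python
import re

# Timestamp at the very start of a complete syslog line, e.g. "Jan 24 03:20:38".
TS = re.compile(r"^([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) ")

# The imklog startup line marks a fresh boot. search() is used because the
# line may be glued to a truncated line that was cut off by the hang.
RESTART = re.compile(
    r"([A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) \S+ kernel: imklog .* started"
)


def find_restarts(lines):
    """Return (last_timestamp_before_restart, restart_timestamp) pairs."""
    restarts = []
    last_ts = None
    for line in lines:
        restart = RESTART.search(line)
        if restart:
            restarts.append((last_ts, restart.group(1)))
        complete = TS.match(line)
        if complete:
            last_ts = complete.group(1)
    return restarts


if __name__ == "__main__":
    # The path is an assumption; adjust it to wherever syslog is written.
    with open("/var/log/syslog", errors="replace") as log:
        for before, after in find_restarts(log):
            print(f"last entry {before} -> logging restarted {after}")
```

The pairs give an upper bound on when each hang began, which can then be compared against the IPMI event log timestamps.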
The server has an Intel S5500BC motherboard, an Intel Xeon E5630 CPU, 32GB RAM, and 4x Seagate Barracuda 2TB 7200 RPM ST2000DM001 disks. I'm using Ubuntu 12.04.2 LTS with kernel 3.2.0-60-generic, and the hard disks are part of a software RAID 10 using md. I'm also running some virtual machines using kvm and libvirt.
In the beginning I thought it was related to I/O usage. I stressed the CPU, I/O, HDD I/O, and memory allocation using many tools, including dd, stress, and some scripts I developed in bash/python. I've never been able to replicate the problem.
All hard disks pass the short and long smartctl self-tests, and no errors are reported.
I've also installed linux-crashdump, but it does not capture anything either. I ran a script every two seconds to collect the sensors output, and the temperatures looked fine: below 55 degrees Celsius.
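A sampling script along these lines could look like the sketch below. It is not the script I actually ran; it assumes the `sensors` command from lm-sensors is on the PATH, and the helper names and the log path are made up for illustration. The point of the `os.fsync` call is to push each sample to disk immediately, so the last reading before a hang has a better chance of surviving (if the storage stack itself is what hangs, even that may not help).

```python
import os
import subprocess
import time


def append_durably(path, text):
    """Append text and force it to disk before returning."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(text)
        f.flush()              # Python buffer -> kernel page cache
        os.fsync(f.fileno())   # kernel page cache -> disk


def sample(command=("sensors",)):
    """Run the sensor command and return its output with a timestamp header."""
    result = subprocess.run(list(command), capture_output=True, text=True, check=False)
    stamp = time.strftime("%Y-%m-%d %H:%M:%S")
    return f"--- {stamp} ---\n{result.stdout}"


def run(path="sensors.log", interval=2.0):
    """Sample forever, one reading every `interval` seconds."""
    while True:
        append_durably(path, sample())
        time.sleep(interval)


if __name__ == "__main__":
    run()
```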
I've already replaced the motherboard, RAM, and hard disks, but the problem remains. So I guess it is not hardware related and, for some reason, the OS cannot write the logs. I also tested the RAM using memtest and it completed four passes without errors.
The only thing I noticed is that, when running a stress test with stress, I got the following log message: [28189.472043] INFO: task kvm:5058 blocked for more than 120 seconds.
I've enabled IPMI, and it still responds when the server hangs. I used it to collect sensor readings and the event log. Because the event log always contains records about the power unit, I have already replaced the power supply three times. IPMI saves me a lot of downtime, since I use it to reboot the server. The server is connected to a UPS that has 3 more servers connected to it. None of the other servers has any problem.
bc3 | 01/22/2015 | 22:47:41 | Power Unit Pwr Unit Status | Power off/down | Asserted
bc4 | 01/22/2015 | 22:47:41 | Power Unit Pwr Unit Status | Failure detected | Asserted
bc5 | 01/22/2015 | 22:47:46 | Power Unit Pwr Unit Status | Power off/down | Deasserted
bc6 | 01/22/2015 | 22:47:46 | Power Unit Pwr Unit Status | Failure detected | Deasserted
bc7 | 01/22/2015 | 22:47:49 | Fan System Fan 3 | Lower Non-critical going low | Deasserted | Reading 0
bc8 | 01/22/2015 | 22:47:49 | Fan System Fan 3 | Lower Critical going low | Deasserted | Reading 0
bc9 | 01/22/2015 | 22:47:56 | Fan System Fan 3 | Lower Non-critical going low | Asserted | Reading 0 < Threshold 374 RPM
bca | 01/22/2015 | 22:47:56 | Fan System Fan 3 | Lower Critical going low | Asserted | Reading 0 < Threshold 330 RPM
bcb | 01/22/2015 | 22:48:01 | System Event BIOS Evt Sensor | Timestamp Clock Sync | Asserted
bcc | 01/22/2015 | 22:48:02 | System Event BIOS Evt Sensor | Timestamp Clock Sync | Asserted
bcd | 01/22/2015 | 22:48:43 | System Event BIOS Evt Sensor | OEM System boot event | Asserted
bce | 01/22/2015 | 22:48:51 | Critical Interrupt PCIe Cor Sensor | | Asserted
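To line these records up against the syslog gaps, the event log can be parsed and filtered for the power unit failures. This is a minimal sketch that assumes the pipe-separated layout of the excerpt above (hex record id, date, time, sensor, event, state, optional detail); `SelEntry`, `parse_sel_line`, and `power_failures` are names made up for illustration.

```python
from datetime import datetime
from typing import NamedTuple


class SelEntry(NamedTuple):
    record_id: int
    when: datetime
    sensor: str
    event: str
    state: str
    detail: str


def parse_sel_line(line):
    """Parse one pipe-separated event log line; return None if it does not fit."""
    parts = [p.strip() for p in line.split("|")]
    if len(parts) < 6:
        return None
    try:
        record_id = int(parts[0], 16)  # record ids are hexadecimal (bc3, bca, ...)
        when = datetime.strptime(f"{parts[1]} {parts[2]}", "%m/%d/%Y %H:%M:%S")
    except ValueError:
        return None
    detail = parts[6] if len(parts) > 6 else ""
    return SelEntry(record_id, when, parts[3], parts[4], parts[5], detail)


def power_failures(lines):
    """Return the entries where a power unit failure was asserted."""
    entries = (parse_sel_line(line) for line in lines)
    return [
        e for e in entries
        if e is not None
        and e.sensor.startswith("Power Unit")
        and e.event == "Failure detected"
        and e.state == "Asserted"
    ]
```

Feeding it the saved event log gives one timestamp per asserted power failure, which can be compared with the time of each hang.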
Sometimes the server reboots instead of hanging, but most of the time it hangs and I have to reboot it myself.
One more detail: the server sometimes hangs during boot, before GRUB is even loaded.
Do you have any suggestions about what is happening, or what I can do to investigate this problem further?