
My server randomly hangs and becomes unresponsive without writing anything to the logs (dmesg, syslog, kern.log, boot.log, and messages). I cannot predict when it will happen. Sometimes the server runs fine for months, and then the hangs suddenly start again. In the last week it happened more than 8 times. This has been going on for more than a year.

The kernel log is always the same:

Jan 24 03:20:34 voyager dnsmasq-dhcp[4476]: DHCPREQUEST(br100) 192.168.145.3 fa:16:3e:4e:e0:d5
Jan 24 03:20:34 voyager dnsmasq-dhcp[4476]: DHCPACK(br100) 192.168.145.3 fa:16:3e:4e:e0:d5 viaapp
Jan 24 03:20:37 voyager dnsmasq-dhcp[4476]: DHCPREQUEST(br100) 192.168.145.9 fa:16:3e:62:09:86
Jan 24 03:20:37 voyager dnsmasq-dhcp[4476]: DHCPACK(br100) 192.168.145.9 fa:16:3e:62:09:86 web-sistemas
Jan 24 03:20:38 voyager dnsmasq-dhcp[4476]: DHCPREQUEST(br100) 192.168.145.16 fa:16:3e:79:dd:f8
Jan 24 03Jan 24 03:22:47 voyager kernel: imklog 5.8.6, log source = /proc/kmsg started.
Jan 24 03:22:47 voyager rsyslogd: [origin software="rsyslogd" swVersion="5.8.6" x-pid="2040" x-info="http://www.rsyslog.com"] start
Jan 24 03:22:47 voyager rsyslogd: rsyslogd's groupid changed to 103
Jan 24 03:22:47 voyager rsyslogd: rsyslogd's userid changed to 101
Jan 24 03:22:47 voyager rsyslogd-2039: Could not open output pipe '/dev/xconsole' [try http://www.rsyslog.com/e/2039 ]
Jan 24 03:22:47 voyager kernel: [    0.000000] Initializing cgroup subsys cpuset
Jan 24 03:22:47 voyager kernel: [    0.000000] Initializing cgroup subsys cpu
Jan 24 03:22:47 voyager kernel: [    0.000000] Linux version 3.2.0-60-generic (buildd@toyol) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #91-Ubuntu SMP Wed Feb 19 03:54:44 UTC 2014 (Ubuntu 3.2.0-60.91-generic 3.2.55)
Jan 24 03:22:47 voyager kernel: [    0.000000] Command line: BOOT_IMAGE=/vmlinuz-3.2.0-60-generic root=UUID=c8dba39e-4d36-4528-9432-d610fce72407 ro crashkernel=384M-2G:64M,2G-:128M console=tty1 console=ttyS0,115200n8
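Since the log simply stops mid-line and resumes at the next boot, one way to pin down when each hang happened is to look for gaps between consecutive syslog timestamps. The following is a minimal Python sketch, not something I actually ran: the function names, the 60-second threshold, and the `year` argument (syslog timestamps carry no year) are assumptions for illustration.

```python
from datetime import datetime


def parse_stamp(line, year):
    """Return the syslog timestamp at the start of a line, or None if malformed."""
    try:
        # Classic syslog timestamps are 15 characters long: "Jan 24 03:20:34".
        return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
    except ValueError:
        # Truncated lines such as "Jan 24 03Jan 24 03:22:47 ..." end up here.
        return None


def find_gaps(lines, year, min_gap_seconds=60):
    """List (last_before, first_after, seconds) for every silent period."""
    gaps = []
    previous = None
    for line in lines:
        stamp = parse_stamp(line, year)
        if stamp is None:
            continue
        if previous is not None:
            delta = (stamp - previous).total_seconds()
            if delta >= min_gap_seconds:
                gaps.append((previous, stamp, delta))
        previous = stamp
    return gaps
```

On a busy server with frequent DHCP traffic like the one above, a gap of a minute or more followed by boot messages marks a hang or reset; a quieter machine would need a larger threshold.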

The server has an Intel S5500BC motherboard, an Intel Xeon E5630 CPU, 32 GB of RAM, and 4x Seagate Barracuda 2 TB 7200 RPM ST2000DM001 disks. I'm running Ubuntu 12.04.2 LTS with kernel 3.2.0-60-generic, and the hard disks are part of a software RAID 10 using md. I'm also running some virtual machines using KVM and libvirt.

At first I thought it was related to I/O usage. I stressed the CPU, I/O, HDD I/O, and memory allocation using many tools, including dd, stress, and some scripts I wrote in bash/python. I have never been able to reproduce the problem.

All hard disks pass both the short and the long smartctl self-tests, and no errors are reported.

I've also installed linux-crashdump, but it does not capture anything either. I ran a script every two seconds to collect the sensors output, and the temperature looked fine: below 55 °C.
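For reference, a sensor logger of this kind can be sketched roughly as follows. This is a minimal illustration rather than my exact script: it assumes the lm-sensors `sensors` command is on the PATH, and the function names and log path are hypothetical. The important detail is the `fsync` after every sample, so that the last reading before a hang actually reaches the disk instead of staying in the page cache.

```python
import os
import subprocess
import time
from datetime import datetime


def append_sample(path, text):
    """Append one timestamped sample and force it to disk."""
    with open(path, "a") as log:
        log.write(f"--- {datetime.now().isoformat()} ---\n{text}\n")
        log.flush()
        os.fsync(log.fileno())  # survive a hang right after this write


def read_sensors():
    """Return the output of the lm-sensors `sensors` command (assumed installed)."""
    result = subprocess.run(["sensors"], capture_output=True, text=True)
    return result.stdout


def run(path, interval=2.0, samples=None, reader=read_sensors):
    """Log a sample every `interval` seconds; run forever if samples is None."""
    taken = 0
    while samples is None or taken < samples:
        append_sample(path, reader())
        taken += 1
        time.sleep(interval)
```

Calling `run("/var/tmp/sensors.log")` (a hypothetical path) logs until the process is killed or the machine hangs; the last complete sample in the file then shows the readings just before the hang.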

I've already replaced the motherboard, RAM, and hard disks, but the problem remains. So I guess it is not hardware related and, for some reason, the OS cannot write the logs. I also tested the RAM with memtest, and it completed four passes without errors.

The only thing I noticed is that, while running a stress test with stress, I got the following log message: [28189.472043] INFO: task kvm:5058 blocked for more than 120 seconds.

I've enabled IPMI, and it still responds when the server hangs. I used it to collect sensor readings and the event log. Since the event log always contains records about the power unit, I have already replaced the power supply three times. IPMI saves me a lot of downtime, since I use it to reboot the server. The server is connected to a UPS that powers three other servers as well. None of the other servers has any problem.

 bc3 | 01/22/2015 | 22:47:41 | Power Unit Pwr Unit Status | Power off/down | Asserted
 bc4 | 01/22/2015 | 22:47:41 | Power Unit Pwr Unit Status | Failure detected | Asserted
 bc5 | 01/22/2015 | 22:47:46 | Power Unit Pwr Unit Status | Power off/down | Deasserted
 bc6 | 01/22/2015 | 22:47:46 | Power Unit Pwr Unit Status | Failure detected | Deasserted
 bc7 | 01/22/2015 | 22:47:49 | Fan System Fan 3 | Lower Non-critical going low  |     Deasserted | Reading 0
 bc8 | 01/22/2015 | 22:47:49 | Fan System Fan 3 | Lower Critical going low  | Deasserted | Reading 0
 bc9 | 01/22/2015 | 22:47:56 | Fan System Fan 3 | Lower Non-critical going low  | Asserted | Reading 0 < Threshold 374 RPM
 bca | 01/22/2015 | 22:47:56 | Fan System Fan 3 | Lower Critical going low  | Asserted | Reading 0 < Threshold 330 RPM
 bcb | 01/22/2015 | 22:48:01 | System Event BIOS Evt Sensor | Timestamp Clock Sync |   Asserted
 bcc | 01/22/2015 | 22:48:02 | System Event BIOS Evt Sensor | Timestamp Clock Sync | Asserted
 bcd | 01/22/2015 | 22:48:43 | System Event BIOS Evt Sensor | OEM System boot event | Asserted
 bce | 01/22/2015 | 22:48:51 | Critical Interrupt PCIe Cor Sensor |  | Asserted
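To compare these events against the hang times, the event log can be filtered for asserted power unit failures. Below is a minimal Python sketch that assumes the pipe-separated layout shown in the excerpt above (ID, date, time, sensor, event, state); the function names are hypothetical, and other BMCs or tools may print the log differently.

```python
def parse_sel_line(line):
    """Split one pipe-separated event log line into named fields."""
    fields = [field.strip() for field in line.split("|")]
    if len(fields) < 6:
        return None  # not an event line in the assumed layout
    return {
        "id": fields[0],
        "date": fields[1],
        "time": fields[2],
        "sensor": fields[3],
        "event": fields[4],
        "state": fields[5],
    }


def power_failures(lines):
    """Return the events where a power unit failure was asserted."""
    events = (parse_sel_line(line) for line in lines)
    return [
        event
        for event in events
        if event
        and event["sensor"].startswith("Power Unit")
        and event["event"] == "Failure detected"
        and event["state"] == "Asserted"
    ]
```

Applied to the excerpt above, this keeps only record bc4, whose timestamp can then be matched against the point where syslog goes silent.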

Sometimes the server reboots instead of hanging. But most of the time it hangs and I have to reboot it myself.

One more piece of information: the server sometimes hangs during boot, even before GRUB loads.

Do you have any suggestions about what is happening, or about what I can do to investigate this problem further?

msbrogli
  • In your place first I would run a memtest, and in case of negative result I would buy a new mainboard. – peterh Jan 24 '15 at 15:12
  • I had lots of problems with KVM on 10.04. You might consider upgrading. – Christopher Perrin Jan 24 '15 at 15:13
  • @PeterHorvath I already did. I forgot to mention it, but I have now added it to the description. Thanks for your suggestion. – msbrogli Jan 24 '15 at 15:25
  • @PeterHorvath I already replaced the mainboard. Only the CPU is the same. – msbrogli Jan 24 '15 at 15:26
  • @ChristopherPerrin I'm sorry. I mistyped the Ubuntu release in the title. Actually, it is 12.04.4 LTS. It's very strange, because one of my other servers has a very similar setup but with 10.04.4 LTS, and it works just fine. Thanks! – msbrogli Jan 24 '15 at 15:27
  • I just found [this article about C-States](http://www.ingmarverheij.com/damn-you-c-states-unexpected-xenserver-reboot/). I disabled C3 and C6. Hope it will solve the problem. – msbrogli Jan 24 '15 at 20:20
  • All Ubuntu versions are pretty bad with KVM, guess there's no QA done. This, however, looks like a hardware issue. – dyasny Jan 24 '15 at 20:31
  • I had a [somewhat similar issue](http://serverfault.com/q/290627) a while ago. Replaced pretty much all hardware, tried a different OS. I'm still not 100% sure what the actual issue was, but I still suspect it was a faulty SATA cable. If I had to go through that again, I'd probably start with replacing the cables as they are cheapest. Good luck! – ssc Jan 26 '15 at 19:33
  • @ssc Excellent idea! I'll do it tomorrow and report the results here. Hope it will solve the problem! Thanks! :) – msbrogli Jan 27 '15 at 04:08

1 Answer


Just to give some feedback on this issue: I also changed the SATA cables, and the problem persists. After running memtest for more than 24 hours, the error counter started to increase.

Now I'm trying to figure out which memory module is bad.

--

The bad memory module has been replaced, so now let's see whether the problem is solved. I hope so, but I'm not very confident, since I had already replaced the memory modules before.

--

The server suddenly restarted yesterday afternoon. There were no power outages and no other device was restarted. We are still trying to figure out where the problem is.

msbrogli