
We are running a KVM node that crashes at irregular intervals and shows very strange behaviour. Interestingly, we already had this problem with another node, which crashed every 1-2 weeks. As we could not find a hardware issue, we began to migrate the VMs to a new node. About one week after we had migrated 50% of the VMs, the new node crashed, while the "old" one has been running fine ever since (uptime 3 weeks; we have not seen an uptime that long for months).

When a node crashes, we sometimes see these strange things on the Supermicro IPMI:

[Two screenshots of the IPMI console; images not available]

We also saw:

  • "No signal", as if the server had been powered off (of course it had not, and the IPMI main page never showed it as powered off)
  • The normal login screen or other normal output from the server, but frozen

What we never saw was a kernel panic, or any messages at all in the logs before the crash: there is complete silence until the lights suddenly go out.

As the problem "moved" from one server to another (a brand-new machine), there are only a few options left in my opinion:

  • A specific VM is causing the issue
  • Kernel bug
  • Hardware issue regarding our setup

More information about the machines:

  • CentOS 7 with latest kernel (3.10.0-514.2.2.el7.x86_64)
  • Supermicro Case with redundant power supplies
  • Supermicro X10DRi / X10DRWi with latest BIOS version
  • Intel Xeon E5-2630 v3 / v4
  • 512 GB DDR4 ECC RAM (Samsung Server RAM)
  • 145 VMs running (RAM and CPU far away from being saturated, also thanks to KSM)
  • Software RAID-10 with 8 / 16 SSDs

Has anyone seen this behaviour, or can anyone say something about the strange "messages" on the console? I have never seen anything like this and do not even know how to describe it for a Google search. At the moment we have no good idea of what to do next, as it could be anything.

Thanks in advance!

  • It could be one particular VM causing the crash by triggering a kernel bug. Kernel 3.10 is positively ancient (that's the problem with CentOS) and newer kernels would probably fare better. Anyway, you should probably bisect the bug: migrate half of the VMs to the other node, wait and see which node crashes, and repeat until you can attribute the crash to one particular VM or group of VMs. – wazoox Dec 23 '16 at 16:08
  • @wazoox This is not the place to be spreading FUD. Anyway, about this issue I would suggest removing `rhgb quiet` and adding `consoleblank=0` to the kernel command line, because you're seeing artifacts of console blanking when you want to see something useful when it crashes. – Michael Hampton Dec 23 '16 at 17:34
  • @MichaelHampton what FUD are you talking about? Anyway, additionally you could set up a serial console to be able to review an eventual kernel panic message more thoroughly. As for console blanking this probably won't apply if you have gdm running -- you'd better turn off X11 to catch kernel messages. – wazoox Dec 23 '16 at 22:53
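
The command-line change suggested in the comments above (drop `rhgb quiet`, add `consoleblank=0`) can be sketched as a small string transformation. This is only an illustration: the function name `debug_cmdline` and the sample command line are made up here, and the script does not modify the boot configuration. On CentOS 7 the persistent value normally lives in `GRUB_CMDLINE_LINUX` in `/etc/default/grub` and is applied with `grub2-mkconfig`, but check that against your own setup.

```python
def debug_cmdline(cmdline):
    """Return a kernel command line suited to catching crash output.

    Removes 'rhgb' and 'quiet' so boot and kernel messages stay visible,
    and sets consoleblank=0 so the text console is never blanked.
    """
    drop = {"rhgb", "quiet"}
    args = [
        arg
        for arg in cmdline.split()
        # Drop the quieting flags and any existing consoleblank setting.
        if arg not in drop and not arg.startswith("consoleblank=")
    ]
    args.append("consoleblank=0")
    return " ".join(args)


# Hypothetical sample; on a live system the current value is in /proc/cmdline.
sample = "root=/dev/sda1 ro crashkernel=auto rhgb quiet"
print(debug_cmdline(sample))
```

Feeding it the contents of `/proc/cmdline` shows what the edited line would look like before touching the GRUB configuration.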

2 Answers


This might be a CPU bug. Intel published an erratum about this problem and also provides a microcode update for the E5 v3/v4 CPUs (datecode 20170707). CentOS 7.4 already ships a newer microcode version, 0xb000021 (in CentOS 7.3 it was 0xb00001e). It may help to update the microcode or upgrade to 7.4. I also had a lot of trouble with these system freezes. I replaced the mainboard (X10DRi), RAM, CPU and power supply without success. I can't say for sure whether this is the solution, because I have not had enough uptime since I updated the microcode. Supermicro still does not provide an updated BIOS with the current Intel microcode. You may be able to get an unofficial prerelease for the X10DRi from your distributor.
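
To see which microcode revision a node is actually running, the `microcode` field in `/proc/cpuinfo` can be compared with the revision mentioned above. The following is a minimal sketch, not a vetted tool: the helper names are made up for illustration, the threshold 0xb000021 is simply the value quoted in this answer, and revision numbers are specific to a CPU model, so the comparison only makes sense for the CPU that value belongs to.

```python
import re

# Revision quoted in the answer for CentOS 7.4; an assumption for other CPU models.
MIN_REVISION = 0xB000021


def microcode_revisions(cpuinfo_text):
    """Return the set of microcode revisions found in /proc/cpuinfo-style text."""
    found = re.findall(
        r"^microcode\s*:\s*(0x[0-9a-fA-F]+)", cpuinfo_text, re.MULTILINE
    )
    return {int(value, 16) for value in found}


def needs_update(cpuinfo_text, minimum=MIN_REVISION):
    """True if any logical CPU reports a revision below the given minimum."""
    return any(rev < minimum for rev in microcode_revisions(cpuinfo_text))


# Sample text in the /proc/cpuinfo format, using the CentOS 7.3 value from the answer.
sample = "processor\t: 0\nmicrocode\t: 0xb00001e\nprocessor\t: 1\nmicrocode\t: 0xb00001e\n"
print(sorted(hex(rev) for rev in microcode_revisions(sample)))
print(needs_update(sample))
```

On a real node, pass in `open("/proc/cpuinfo").read()` instead of the sample string.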

Bernhard

A short update on this: after upgrading to the newest LTS kernel (4.4.39), the server is stable. Uptime is 19 days now, so I think we got it. Although we do not really know the root cause, we think the CentOS 7 kernel (3.10) might be too old for some very modern hardware. As we cannot provide a helpful error message (a kernel panic, in the best case), we decided not to report this to the CentOS developers.

  • Were you using any form of virtualized MMU (i.e. VT-d and the like)? If so, it could be a misdirected guest memory write (caused by a software *or firmware* bug). – shodanshok Aug 01 '17 at 23:05