How to fix "BUG: soft lockup - CPU#0 stuck for 17163091968s"?

Question

UPDATE: I updated the title of the message, because I've recently seen more of these problems with this exact time amount of 17163091968s. This should help people investigating the symptoms to find this page. See my (self-)accepted answer below.

I have a bunch of 64-bit Ubuntu 10.04 LTS VM's in a VMware vSphere datacenter. VMware tools is installed (vSphere Client says "OK").

I've seen some of the VM's hang a few times with the following error in syslog. When checking the situation from vSphere, the console was black, and the "Reboot guest" command didn't do anything, so I had to power cycle the VM.

Dec  1 11:44:15 s0 kernel: [18446744060.007150] BUG: soft lockup - CPU#0 stuck for 17163091988s! [jed:26674]
Dec  1 11:44:15 s0 kernel: [18446744060.026854] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs xt_tcpudp iptable_filter ip_tables x_tables acpiphp fbcon tileblit font bitblit softcursor ppdev vga16fb psmouse parport_pc shpchp vgastate i2c_piix4 lp parport serio_raw intel_agp floppy mptspi mptscsih vmw_pvscsi e1000 mptbase
Dec  1 11:44:15 s0 kernel: [18446744060.026899] CPU 0:
Dec  1 11:44:15 s0 kernel: [18446744060.026900] Modules linked in: btrfs zlib_deflate crc32c libcrc32c ufs qnx4 hfsplus hfs minix ntfs vfat msdos fat jfs xfs exportfs reiserfs xt_tcpudp iptable_filter ip_tables x_tables acpiphp fbcon tileblit font bitblit softcursor ppdev vga16fb psmouse parport_pc shpchp vgastate i2c_piix4 lp parport serio_raw intel_agp floppy mptspi mptscsih vmw_pvscsi e1000 mptbase
Dec  1 11:44:15 s0 kernel: [18446744060.026920] Pid: 26674, comm: jed Not tainted 2.6.32-30-server #59-Ubuntu VMware Virtual Platform
Dec  1 11:44:15 s0 kernel: [18446744060.026922] RIP: 0033:[<00007f92e03d2ce6>]  [<00007f92e03d2ce6>] 0x7f92e03d2ce6
Dec  1 11:44:15 s0 kernel: [18446744060.026930] RSP: 002b:00007fff6069b770  EFLAGS: 00000202
Dec  1 11:44:15 s0 kernel: [18446744060.026932] RAX: 00007f92e27e7e10 RBX: 00007f92e06d5e40 RCX: 0000000000020000
Dec  1 11:44:15 s0 kernel: [18446744060.026933] RDX: 00007f92e27e7e10 RSI: 0000000000020209 RDI: 0000000000000002
Dec  1 11:44:15 s0 kernel: [18446744060.026934] RBP: ffffffff81013cae R08: 0000000000000001 R09: 0000000000000000
Dec  1 11:44:15 s0 kernel: [18446744060.026935] R10: 00007f92e06d6398 R11: 0000000000000870 R12: 00000000000000c0
Dec  1 11:44:15 s0 kernel: [18446744060.026937] R13: 00007f92e299dca0 R14: 0000000000000020 R15: 00007f92e06d5e40
Dec  1 11:44:15 s0 kernel: [18446744060.026939] FS:  00007f92e105b700(0000) GS:ffff880009c00000(0000) knlGS:0000000000000000
Dec  1 11:44:15 s0 kernel: [18446744060.026940] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Dec  1 11:44:15 s0 kernel: [18446744060.026941] CR2: 00007ff12ea15000 CR3: 0000000267067000 CR4: 00000000000006f0
Dec  1 11:44:15 s0 kernel: [18446744060.026968] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Dec  1 11:44:15 s0 kernel: [18446744060.026989] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Dec  1 11:44:15 s0 kernel: [18446744060.026991] Call Trace:

(There is no trace - that's the last line.)

I don't seem to have the other errors anymore, but I'm quite sure the process mentioned above (jed) was different in the other dumps.

What could cause this problem?
How to prevent this from happening?

Some extra info:

The value 17163091988 is a bit (pun intended) suspicious - it's 1111111111000000000000000000010100 in binary. Maybe the error was trying to say 20 seconds (10100 in binary)?
I'm not sure whether the problem persists with the latest 10.04 kernel (2.6.32-35).
I've also seen task ... blocked for more than 120 seconds problems - maybe they could be related?
the vSphere client shows no alerts or migration tasks for the VM.

maybe something wrong timekeeping? You can play with `clocksource`. Also CPUs' C-states is a good guess. — SaveTheRbtz, Dec 02 '11 at 00:26

score 12 · Accepted Answer · answered Dec 09 '11 at 17:40

Thanks to all commenters. I think I found the answer. There seems to be a timekeeping bug in at least Ubuntu's kernel version 2.6.32-30-server. The bug sometimes (?) kills machines when they reach an uptime of about 200..210 days. Actually the halt does not happen immediately after the limit is reached, but is triggered by some operation (in my case: apt-get install ...).

NB: 200 days is about 2^32 times 1/250 second, and 250 is the default value for CONFIG_HZ.

For now, I haven't found data on whether the problem has been fixed in more recent kernels. I do know that it doesn't seem to affect an older kernel (2.6.32-26-server). From all this information I presume that if it's not fixed yet, it can be avoided by:

boot the machines every 190 days (a good idea for kernel upgrades anyway)
adjust CONFIG_HZ to 100 and thus make it every 497 days. However, this might have quite unexpected side effects, especially in virtual environments. And it doesn't solve the problem.

Here's a bug report for Ubuntu.

Out of curiosity: Are you using NTP or time synchronization via vmware? It sounds like constant time shift or something like that.. there should be logged entries of time shifting in syslog. — pauska, Dec 09 '11 at 17:49
I just seen something that looks related in debian, 2.6.32-5-amd64 kernel showing two soft lock ups which is performing "oddly" — James, Apr 17 '12 at 21:46

score 6 · Answer 2 · answered Feb 15 '12 at 15:26

This is actually a kernel bug that got fixed by the following kernel commit :

http://git.kernel.org/?p=linux/kernel/git/tip/tip.git;a=commit;h=4cecf6d401a01d054afc1e5f605bcbfe553cb9b9

You can search LKML for the following title (cannot post more than 2 links) : [stable] 2.6.32.21 - uptime related crashes?

And this is the LP# bug that brings the kernel fix :

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/902317

Upgrading to the latest kernel in lucid-updates should fix this issue for good.

HTH

score 2 · Answer 3 · answered Dec 01 '11 at 12:15

Could it be that the virtualisation host has some power-saving features ("Green IT") enabled that could send unused cores into a low-power/sleep mode, causing interesting disruptions in the VMs using that core? I've heard this used to be a problem mainly in HyperV environments but it may be something to look into.

score 1 · Answer 4 · answered Sep 23 '15 at 00:26

In case someone else finds this, a kernel upgrade fixed a similar problem for me. I had a JBOD that was attached to the system through a SAS3 controller throwing these CPU Softlock errors on boot.

I had Ubuntu 14.04.2 kernel version 3.16.0-30, and doing a "apt -y upgrade" ended me up at kernel 3.16.0-49, and that solved the problem.

How to fix "BUG: soft lockup - CPU#0 stuck for 17163091968s"?

4 Answers4