3

Been getting these runaway conditions on a lightly loaded vps (debian squeeze kernel 2.6.38-bpo.2-amd64, 2GB, SSD) - top load is nothing , then jumps to 30.0 & everything grinds to a halt. I'm guessing the Xen Hypervisor kicks in and limits things - vendor control panel shows cpu @ 175%. A few minutes will pass, cpu & load will fall back to nominal levels and things return to normal.

Here is a copy of the latest kern.log

Oct 11 20:10:34 stage kernel: [348092.046302] Clocksource tsc unstable (delta = -8590343613 ns)
Oct 11 20:16:55 stage kernel: [348311.490907] INFO: task bounce:23869 blocked for more than 120 seconds.
Oct 11 20:16:56 stage kernel: [348311.490918] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 20:16:56 stage kernel: [348311.490924] bounce          D ffff880003085100     0 23869   1153 0x00000000
Oct 11 20:16:56 stage kernel: [348311.490933]  ffff880003085100 0000000000000286 0000000000000000 ffff88007aaf3600
Oct 11 20:16:56 stage kernel: [348311.490943]  0000000000013700 ffff88007c16bfd8 ffff88007c16bfd8 0000000000013700
Oct 11 20:16:56 stage kernel: [348311.490953]  ffff880003085100 ffff88007c16a010 ffff8800032bc098 ffffffff8103b96e
Oct 11 20:16:56 stage kernel: [348311.490963] Call Trace:
Oct 11 20:16:56 stage kernel: [348311.490973]  [<ffffffff8103b96e>] ? __wake_up+0x35/0x46
Oct 11 20:16:56 stage kernel: [348311.490982]  [<ffffffff81326e50>] ? _raw_spin_lock_irqsave+0x11/0x2f
Oct 11 20:16:56 stage kernel: [348311.490993]  [<ffffffffa00249ec>] ? log_wait_commit+0xc0/0x111 [jbd]
Oct 11 20:16:56 stage kernel: [348311.491000]  [<ffffffff8106033d>] ? autoremove_wake_function+0x0/0x2a
Oct 11 20:16:56 stage kernel: [348311.491006]  [<ffffffff81326ea0>] ? _raw_spin_unlock_irqrestore+0x10/0x11
Oct 11 20:16:56 stage kernel: [348311.491015]  [<ffffffffa00377f6>] ? ext3_sync_file+0xbe/0xec [ext3]
Oct 11 20:16:56 stage kernel: [348311.491024]  [<ffffffff81117733>] ? vfs_fsync_range+0x4c/0x73
Oct 11 20:16:56 stage kernel: [348311.491030]  [<ffffffff811177db>] ? do_fsync+0x27/0x3c
Oct 11 20:16:56 stage kernel: [348311.491035]  [<ffffffff8111780d>] ? sys_fsync+0xb/0xf
Oct 11 20:16:56 stage kernel: [348311.491041]  [<ffffffff81009973>] ? sysret_check+0x17/0x5a
Oct 11 20:16:56 stage kernel: [348311.491047]  [<ffffffff81009952>] ? system_call_fastpath+0x16/0x1b
Oct 11 20:16:56 stage kernel: [348311.491053] INFO: task bounce:23873 blocked for more than 120 seconds.
Oct 11 20:16:56 stage kernel: [348311.491057] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 20:16:56 stage kernel: [348311.491063] bounce          D ffff880003083600     0 23873   1153 0x00000000
Oct 11 20:16:56 stage kernel: [348311.491071]  ffff880003083600 0000000000000282 0000000000000000 ffff88007aaf3600
Oct 11 20:16:56 stage kernel: [348311.491080]  0000000000013700 ffff88006d859fd8 ffff88006d859fd8 0000000000013700
Oct 11 20:16:56 stage kernel: [348311.491089]  ffff880003083600 ffff88006d858010 ffff88006d859eb0 0000000000000000
Oct 11 20:16:56 stage kernel: [348311.491098] Call Trace:
Oct 11 20:16:56 stage kernel: [348311.491104]  [<ffffffffa00249ec>] ? log_wait_commit+0xc0/0x111 [jbd]
Oct 11 20:16:56 stage kernel: [348311.491111]  [<ffffffff8106033d>] ? autoremove_wake_function+0x0/0x2a
Oct 11 20:16:56 stage kernel: [348311.491118]  [<ffffffffa00377f6>] ? ext3_sync_file+0xbe/0xec [ext3]
Oct 11 20:16:56 stage kernel: [348311.491125]  [<ffffffff81117733>] ? vfs_fsync_range+0x4c/0x73
Oct 11 20:16:56 stage kernel: [348311.491130]  [<ffffffff811177db>] ? do_fsync+0x27/0x3c
Oct 11 20:16:56 stage kernel: [348311.491136]  [<ffffffff8111780d>] ? sys_fsync+0xb/0xf
Oct 11 20:16:56 stage kernel: [348311.491141]  [<ffffffff81009973>] ? sysret_check+0x17/0x5a
Oct 11 20:16:56 stage kernel: [348311.491146]  [<ffffffff81009952>] ? system_call_fastpath+0x16/0x1b
Oct 11 20:16:56 stage kernel: [348311.491152] INFO: task bounce:23875 blocked for more than 120 seconds.
Oct 11 20:16:56 stage kernel: [348311.491157] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 20:16:56 stage kernel: [348311.491162] bounce          D ffff880003085e80     0 23875   1153 0x00000000
Oct 11 20:16:56 stage kernel: [348311.491170]  ffff880003085e80 0000000000000286 0000000000000000 ffff88007aaf3600
Oct 11 20:16:56 stage kernel: [348311.491179]  0000000000013700 ffff88004eecdfd8 ffff88004eecdfd8 0000000000013700
Oct 11 20:16:56 stage kernel: [348311.491188]  ffff880003085e80 ffff88004eecc010 ffff880003085e80 0000000000000000
Oct 11 20:16:56 stage kernel: [348311.491198] Call Trace:
Oct 11 20:16:56 stage kernel: [348311.491203]  [<ffffffffa00249ec>] ? log_wait_commit+0xc0/0x111 [jbd]
Oct 11 20:16:56 stage kernel: [348311.491210]  [<ffffffff8106033d>] ? autoremove_wake_function+0x0/0x2a
Oct 11 20:16:56 stage kernel: [348311.491217]  [<ffffffff8100679f>] ? xen_restore_fl_direct_end+0x0/0x1
Oct 11 20:16:56 stage kernel: [348311.491223]  [<ffffffff81326ea0>] ? _raw_spin_unlock_irqrestore+0x10/0x11
Oct 11 20:16:56 stage kernel: [348311.491230]  [<ffffffffa00377f6>] ? ext3_sync_file+0xbe/0xec [ext3]
Oct 11 20:16:56 stage kernel: [348311.491237]  [<ffffffff81117733>] ? vfs_fsync_range+0x4c/0x73
Oct 11 20:16:56 stage kernel: [348311.491243]  [<ffffffff811177db>] ? do_fsync+0x27/0x3c
Oct 11 20:16:56 stage kernel: [348311.491248]  [<ffffffff8111780d>] ? sys_fsync+0xb/0xf
Oct 11 20:16:56 stage kernel: [348311.491253]  [<ffffffff81009973>] ? sysret_check+0x17/0x5a
Oct 11 20:16:56 stage kernel: [348311.491258]  [<ffffffff81009952>] ? system_call_fastpath+0x16/0x1b
Oct 11 20:16:56 stage kernel: [348431.612291] INFO: task bounce:23869 blocked for more than 120 seconds.
Oct 11 20:16:56 stage kernel: [348431.612305] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 20:16:56 stage kernel: [348431.612314] bounce          D ffff880003085100     0 23869   1153 0x00000000
Oct 11 20:16:56 stage kernel: [348431.612327]  ffff880003085100 0000000000000286 0000000000000000 ffff88007aaf3600
Oct 11 20:16:56 stage kernel: [348431.612343]  0000000000013700 ffff88007c16bfd8 ffff88007c16bfd8 0000000000013700
Oct 11 20:16:56 stage kernel: [348431.612357]  ffff880003085100 ffff88007c16a010 ffff8800032bc098 ffffffff8103b96e
Oct 11 20:16:56 stage kernel: [348431.612372] Call Trace:
Oct 11 20:16:56 stage kernel: [348431.612386]  [<ffffffff8103b96e>] ? __wake_up+0x35/0x46
Oct 11 20:16:56 stage kernel: [348431.612397]  [<ffffffff81326e50>] ? _raw_spin_lock_irqsave+0x11/0x2f
Oct 11 20:16:56 stage kernel: [348431.612413]  [<ffffffffa00249ec>] ? log_wait_commit+0xc0/0x111 [jbd]
Oct 11 20:16:56 stage kernel: [348431.612423]  [<ffffffff8106033d>] ? autoremove_wake_function+0x0/0x2a
Oct 11 20:16:56 stage kernel: [348431.612432]  [<ffffffff81326ea0>] ? _raw_spin_unlock_irqrestore+0x10/0x11
Oct 11 20:16:56 stage kernel: [348431.612445]  [<ffffffffa00377f6>] ? ext3_sync_file+0xbe/0xec [ext3]
Oct 11 20:16:56 stage kernel: [348431.612461]  [<ffffffff81117733>] ? vfs_fsync_range+0x4c/0x73
Oct 11 20:16:56 stage kernel: [348431.612470]  [<ffffffff811177db>] ? do_fsync+0x27/0x3c
Oct 11 20:16:56 stage kernel: [348431.612478]  [<ffffffff8111780d>] ? sys_fsync+0xb/0xf
Oct 11 20:16:56 stage kernel: [348431.612487]  [<ffffffff81009973>] ? sysret_check+0x17/0x5a
Oct 11 20:16:56 stage kernel: [348431.612495]  [<ffffffff81009952>] ? system_call_fastpath+0x16/0x1b
Oct 11 20:16:56 stage kernel: [348431.612504] INFO: task bounce:23873 blocked for more than 120 seconds.
Oct 11 20:16:56 stage kernel: [348431.612511] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 20:16:56 stage kernel: [348431.612520] bounce          D ffff880003083600     0 23873   1153 0x00000000
Oct 11 20:16:56 stage kernel: [348431.612531]  ffff880003083600 0000000000000282 0000000000000000 ffff88007aaf3600
Oct 11 20:16:56 stage kernel: [348431.612545]  0000000000013700 ffff88006d859fd8 ffff88006d859fd8 0000000000013700
Oct 11 20:16:56 stage kernel: [348431.612559]  ffff880003083600 ffff88006d858010 ffff88006d859eb0 0000000000000000
Oct 11 20:16:56 stage kernel: [348431.612573] Call Trace:
Oct 11 20:16:56 stage kernel: [348431.612582]  [<ffffffffa00249ec>] ? log_wait_commit+0xc0/0x111 [jbd]
Oct 11 20:16:56 stage kernel: [348431.612591]  [<ffffffff8106033d>] ? autoremove_wake_function+0x0/0x2a
Oct 11 20:16:56 stage kernel: [348431.612603]  [<ffffffffa00377f6>] ? ext3_sync_file+0xbe/0xec [ext3]
Oct 11 20:16:56 stage kernel: [348431.612612]  [<ffffffff81117733>] ? vfs_fsync_range+0x4c/0x73
Oct 11 20:16:56 stage kernel: [348431.612621]  [<ffffffff811177db>] ? do_fsync+0x27/0x3c
Oct 11 20:16:56 stage kernel: [348431.612629]  [<ffffffff8111780d>] ? sys_fsync+0xb/0xf
Oct 11 20:16:56 stage kernel: [348431.612637]  [<ffffffff81009973>] ? sysret_check+0x17/0x5a
Oct 11 20:16:56 stage kernel: [348431.612645]  [<ffffffff81009952>] ? system_call_fastpath+0x16/0x1b
Oct 11 20:16:56 stage kernel: [348431.612653] INFO: task bounce:23875 blocked for more than 120 seconds.
Oct 11 20:16:56 stage kernel: [348431.612660] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 20:16:56 stage kernel: [348431.612669] bounce          D ffff880003085e80     0 23875   1153 0x00000000
Oct 11 20:16:56 stage kernel: [348431.612680]  ffff880003085e80 0000000000000286 0000000000000000 ffff88007aaf3600
Oct 11 20:16:56 stage kernel: [348431.612694]  0000000000013700 ffff88004eecdfd8 ffff88004eecdfd8 0000000000013700
Oct 11 20:16:56 stage kernel: [348431.612708]  ffff880003085e80 ffff88004eecc010 ffff880003085e80 0000000000000000
Oct 11 20:16:56 stage kernel: [348431.612721] Call Trace:
Oct 11 20:16:56 stage kernel: [348431.612730]  [<ffffffffa00249ec>] ? log_wait_commit+0xc0/0x111 [jbd]
Oct 11 20:16:56 stage kernel: [348431.612739]  [<ffffffff8106033d>] ? autoremove_wake_function+0x0/0x2a
Oct 11 20:16:56 stage kernel: [348431.612750]  [<ffffffff8100679f>] ? xen_restore_fl_direct_end+0x0/0x1
Oct 11 20:16:56 stage kernel: [348431.612759]  [<ffffffff81326ea0>] ? _raw_spin_unlock_irqrestore+0x10/0x11
Oct 11 20:16:56 stage kernel: [348431.612770]  [<ffffffffa00377f6>] ? ext3_sync_file+0xbe/0xec [ext3]
Oct 11 20:16:56 stage kernel: [348431.612780]  [<ffffffff81117733>] ? vfs_fsync_range+0x4c/0x73
Oct 11 20:16:56 stage kernel: [348431.612788]  [<ffffffff811177db>] ? do_fsync+0x27/0x3c
Oct 11 20:16:56 stage kernel: [348431.612797]  [<ffffffff8111780d>] ? sys_fsync+0xb/0xf
Oct 11 20:16:56 stage kernel: [348431.612804]  [<ffffffff81009973>] ? sysret_check+0x17/0x5a
Oct 11 20:16:56 stage kernel: [348431.612812]  [<ffffffff81009952>] ? system_call_fastpath+0x16/0x1b
Oct 11 20:16:56 stage kernel: [348431.612820] INFO: task bounce:23876 blocked for more than 120 seconds.
Oct 11 20:16:56 stage kernel: [348431.612827] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 20:16:56 stage kernel: [348431.612836] bounce          D ffff880003081440     0 23876   1153 0x00000000
Oct 11 20:16:56 stage kernel: [348431.612847]  ffff880003081440 0000000000000282 0000000000000000 ffff88007abea1c0
Oct 11 20:16:56 stage kernel: [348431.612861]  0000000000013700 ffff88006d80ffd8 ffff88006d80ffd8 0000000000013700
Oct 11 20:16:56 stage kernel: [348431.612874]  ffff880003081440 ffff88006d80e010 ffff88006d80feb0 0000000000000000
Oct 11 20:16:56 stage kernel: [348431.612888] Call Trace:
Oct 11 20:16:56 stage kernel: [348431.612897]  [<ffffffffa00249ec>] ? log_wait_commit+0xc0/0x111 [jbd]
Oct 11 20:16:56 stage kernel: [348431.612906]  [<ffffffff8106033d>] ? autoremove_wake_function+0x0/0x2a
Oct 11 20:16:56 stage kernel: [348431.612917]  [<ffffffffa00377f6>] ? ext3_sync_file+0xbe/0xec [ext3]
Oct 11 20:16:56 stage kernel: [348431.612926]  [<ffffffff81117733>] ? vfs_fsync_range+0x4c/0x73
Oct 11 20:16:56 stage kernel: [348431.612935]  [<ffffffff811177db>] ? do_fsync+0x27/0x3c
Oct 11 20:16:56 stage kernel: [348431.612943]  [<ffffffff8111780d>] ? sys_fsync+0xb/0xf
Oct 11 20:16:56 stage kernel: [348431.612951]  [<ffffffff81009973>] ? sysret_check+0x17/0x5a
Oct 11 20:16:56 stage kernel: [348431.612959]  [<ffffffff81009952>] ? system_call_fastpath+0x16/0x1b
Oct 11 20:16:56 stage kernel: [348431.612968] INFO: task postdrop:23904 blocked for more than 120 seconds.
Oct 11 20:16:56 stage kernel: [348431.612976] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 20:16:56 stage kernel: [348431.612984] postdrop        D ffff88007b586540     0 23904  23903 0x00000000
Oct 11 20:16:56 stage kernel: [348431.612995]  ffff88007b586540 0000000000000282 0000000000000000 ffff88007abe9440
Oct 11 20:16:56 stage kernel: [348431.613009]  0000000000013700 ffff88004e94ffd8 ffff88004e94ffd8 0000000000013700
Oct 11 20:16:56 stage kernel: [348431.613023]  ffff88007b586540 ffff88004e94e010 ffff88007abe9440 0000000000000000
Oct 11 20:16:56 stage kernel: [348431.613036] Call Trace:
Oct 11 20:16:56 stage kernel: [348431.613045]  [<ffffffffa00249ec>] ? log_wait_commit+0xc0/0x111 [jbd]
Oct 11 20:16:56 stage kernel: [348431.613054]  [<ffffffff8106033d>] ? autoremove_wake_function+0x0/0x2a
Oct 11 20:16:56 stage kernel: [348431.613065]  [<ffffffffa00377f6>] ? ext3_sync_file+0xbe/0xec [ext3]
Oct 11 20:16:56 stage kernel: [348431.613075]  [<ffffffff81117733>] ? vfs_fsync_range+0x4c/0x73
Oct 11 20:16:56 stage kernel: [348431.613083]  [<ffffffff811177db>] ? do_fsync+0x27/0x3c
Oct 11 20:16:56 stage kernel: [348431.613091]  [<ffffffff8111780d>] ? sys_fsync+0xb/0xf
Oct 11 20:16:56 stage kernel: [348431.613099]  [<ffffffff81009973>] ? sysret_check+0x17/0x5a
Oct 11 20:16:56 stage kernel: [348431.613107]  [<ffffffff81009952>] ? system_call_fastpath+0x16/0x1b
Oct 11 20:21:02 stage kernel: [348552.062201] INFO: task bounce:23869 blocked for more than 120 seconds.
Oct 11 20:21:02 stage kernel: [348552.062216] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 20:21:02 stage kernel: [348552.062225] bounce          D ffff880003085100     0 23869   1153 0x00000000
Oct 11 20:21:02 stage kernel: [348552.062238]  ffff880003085100 0000000000000286 0000000000000000 ffff88007aaf3600
Oct 11 20:21:02 stage kernel: [348552.062253]  0000000000013700 ffff88007c16bfd8 ffff88007c16bfd8 0000000000013700
Oct 11 20:21:02 stage kernel: [348552.062268]  ffff880003085100 ffff88007c16a010 ffff8800032bc098 ffffffff8103b96e
Oct 11 20:21:02 stage kernel: [348552.062282] Call Trace:
Oct 11 20:21:02 stage kernel: [348552.062301]  [<ffffffff8103b96e>] ? __wake_up+0x35/0x46
Oct 11 20:21:02 stage kernel: [348552.062313]  [<ffffffff81326e50>] ? _raw_spin_lock_irqsave+0x11/0x2f
Oct 11 20:21:02 stage kernel: [348552.062329]  [<ffffffffa00249ec>] ? log_wait_commit+0xc0/0x111 [jbd]
Oct 11 20:21:02 stage kernel: [348552.062339]  [<ffffffff8106033d>] ? autoremove_wake_function+0x0/0x2a
Oct 11 20:21:02 stage kernel: [348552.062349]  [<ffffffff81326ea0>] ? _raw_spin_unlock_irqrestore+0x10/0x11
Oct 11 20:21:02 stage kernel: [348552.062361]  [<ffffffffa00377f6>] ? ext3_sync_file+0xbe/0xec [ext3]
Oct 11 20:21:02 stage kernel: [348552.062373]  [<ffffffff81117733>] ? vfs_fsync_range+0x4c/0x73
Oct 11 20:21:02 stage kernel: [348552.062382]  [<ffffffff811177db>] ? do_fsync+0x27/0x3c
Oct 11 20:21:02 stage kernel: [348552.062390]  [<ffffffff8111780d>] ? sys_fsync+0xb/0xf
Oct 11 20:21:02 stage kernel: [348552.062399]  [<ffffffff81009973>] ? sysret_check+0x17/0x5a
Oct 11 20:21:02 stage kernel: [348552.062407]  [<ffffffff81009952>] ? system_call_fastpath+0x16/0x1b
Oct 11 20:21:02 stage kernel: [348552.062416] INFO: task bounce:23873 blocked for more than 120 seconds.
Oct 11 20:21:02 stage kernel: [348552.062423] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Oct 11 20:21:02 stage kernel: [348552.062431] bounce          D ffff880003083600     0 23873   1153 0x00000000
Oct 11 20:21:02 stage kernel: [348552.062442]  ffff880003083600 0000000000000282 0000000000000000 ffff88007aaf3600
Oct 11 20:21:02 stage kernel: [348552.062456]  0000000000013700 ffff88006d859fd8 ffff88006d859fd8 0000000000013700
Oct 11 20:21:02 stage kernel: [348552.062470]  ffff880003083600 ffff88006d858010 ffff88006d859eb0 0000000000000000
Oct 11 20:21:02 stage kernel: [348552.062484] Call Trace:
Oct 11 20:21:02 stage kernel: [348552.062493]  [<ffffffffa00249ec>] ? log_wait_commit+0xc0/0x111 [jbd]
Oct 11 20:21:02 stage kernel: [348552.062502]  [<ffffffff8106033d>] ? autoremove_wake_function+0x0/0x2a
Oct 11 20:21:02 stage kernel: [348552.062513]  [<ffffffffa00377f6>] ? ext3_sync_file+0xbe/0xec [ext3]
Oct 11 20:21:02 stage kernel: [348552.062522]  [<ffffffff81117733>] ? vfs_fsync_range+0x4c/0x73
Oct 11 20:21:02 stage kernel: [348552.062531]  [<ffffffff811177db>] ? do_fsync+0x27/0x3c
Oct 11 20:21:02 stage kernel: [348552.062539]  [<ffffffff8111780d>] ? sys_fsync+0xb/0xf
Oct 11 20:21:02 stage kernel: [348552.062547]  [<ffffffff81009973>] ? sysret_check+0x17/0x5a
Oct 11 20:21:02 stage kernel: [348552.062554]  [<ffffffff81009952>] ? system_call_fastpath+0x16/0x1b

Any Ideas on where to look?

[Edit] Adding another soft lockup that occurred a few days earlier - not related to bounce:

Oct  7 06:26:39 stage kernel: [5729599.258122] BUG: soft lockup - CPU#1 stuck for 80s! [events/1:16]
Oct  7 06:26:39 stage kernel: [5729599.258122] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp ipt_LOG iptable_filter ip_tables x_tables loop snd_pcm snd_timer evdev snd soundcore snd_page_alloc pcspkr ext3 jbd mbcache xen_netfront xen_blkfront
Oct  7 06:26:39 stage kernel: [5729599.258122] CPU 1:
Oct  7 06:26:39 stage kernel: [5729599.258122] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp ipt_LOG iptable_filter ip_tables x_tables loop snd_pcm snd_timer evdev snd soundcore snd_page_alloc pcspkr ext3 jbd mbcache xen_netfront xen_blkfront
Oct  7 06:26:39 stage kernel: [5729599.258122] Pid: 16, comm: events/1 Not tainted 2.6.32-5-amd64 #1 
Oct  7 06:26:39 stage kernel: [5729599.258122] RIP: e030:[<ffffffff8100922a>]  [<ffffffff8100922a>] hypercall_page+0x22a/0x1001
Oct  7 06:26:39 stage kernel: [5729599.258122] RSP: e02b:ffff88007ffb1cc8  EFLAGS: 00000246
Oct  7 06:26:39 stage kernel: [5729599.258122] RAX: 0000000000040001 RBX: ffff880003519780 RCX: ffffffff8100922a
Oct  7 06:26:39 stage kernel: [5729599.258122] RDX: 00000000000116e0 RSI: 0000000000000000 RDI: 0000000000000000
Oct  7 06:26:39 stage kernel: [5729599.258122] RBP: ffff880001a77100 R08: ffff88007ffb0000 R09: ffffffff8100e22f
Oct  7 06:26:54 stage kernel: [5729599.258122] R10: 0000000000000000 R11: 0000000000000246 R12: ffff88007ffa8000
Oct  7 06:26:54 stage kernel: [5729599.258122] R13: ffff8800008e3880 R14: 0000000000000000 R15: 0000000000000000
Oct  7 06:26:54 stage kernel: [5729599.258122] FS:  00007f1e47300720(0000) GS:ffff880003504000(0000) knlGS:0000000000000000
Oct  7 06:26:54 stage kernel: [5729599.258122] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct  7 06:26:54 stage kernel: [5729599.258122] CR2: 000000000065e000 CR3: 0000000001001000 CR4: 0000000000002660
Oct  7 06:26:54 stage kernel: [5729599.258122] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct  7 06:26:54 stage kernel: [5729599.258122] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct  7 06:26:54 stage kernel: [5729599.258122] Call Trace:
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff8100dc35>] ? xen_force_evtchn_callback+0x9/0xa
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff8100e242>] ? check_events+0x12/0x20
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff8100e1e9>] ? xen_irq_enable_direct_end+0x0/0x7
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff8104827b>] ? finish_task_switch+0x44/0xaf
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff812fb155>] ? thread_return+0x4e/0xe0
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff8100e242>] ? check_events+0x12/0x20
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff8106182b>] ? worker_thread+0xcc/0x21d
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff810c8060>] ? vmstat_update+0x0/0x39
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff81064f1a>] ? autoremove_wake_function+0x0/0x2e
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff8106175f>] ? worker_thread+0x0/0x21d
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff81064c4d>] ? kthread+0x79/0x81
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff81010d61>] ? int_ret_from_sys_call+0x7/0x1b
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff8101151d>] ? retint_restore_args+0x5/0x6
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
Oct  7 06:26:54 stage kernel: [5729599.258122]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
Oct  7 06:26:56 stage /USR/SBIN/CRON[12186]: (root) CMD (/root/phptest.sh)
Oct  7 06:27:03 stage kernel: [5729625.855236] BUG: soft lockup - CPU#0 stuck for 65s! [watchdog/0:5]
Oct  7 06:27:03 stage kernel: [5729625.855236] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp ipt_LOG iptable_filter ip_tables x_tables loop snd_pcm snd_timer evdev snd soundcore snd_page_alloc pcspkr ext3 jbd mbcache xen_netfront xen_blkfront
Oct  7 06:27:03 stage kernel: [5729625.855236] CPU 0:
Oct  7 06:27:03 stage kernel: [5729625.855236] Modules linked in: nf_conntrack_ipv4 nf_defrag_ipv4 xt_state nf_conntrack xt_tcpudp ipt_LOG iptable_filter ip_tables x_tables loop snd_pcm snd_timer evdev snd soundcore snd_page_alloc pcspkr ext3 jbd mbcache xen_netfront xen_blkfront
Oct  7 06:27:03 stage kernel: [5729625.855236] Pid: 5, comm: watchdog/0 Not tainted 2.6.32-5-amd64 #1 
Oct  7 06:27:03 stage kernel: [5729625.855236] RIP: e030:[<ffffffff8100922a>]  [<ffffffff8100922a>] hypercall_page+0x22a/0x1001
Oct  7 06:27:03 stage kernel: [5729625.855236] RSP: e02b:ffff88007ff6fd48  EFLAGS: 00000246
Oct  7 06:27:03 stage kernel: [5729625.855236] RAX: 0000000000040001 RBX: ffff8800034fb780 RCX: ffffffff8100922a
Oct  7 06:27:03 stage kernel: [5729625.855236] RDX: 00000000000116e0 RSI: 0000000000000000 RDI: 0000000000000000
Oct  7 06:27:03 stage kernel: [5729625.855236] RBP: ffff88007e1c4000 R08: ffff88007ff6e000 R09: ffffffff8100e22f
Oct  7 06:27:05 stage kernel: [5729625.855236] R10: ffff880002d23100 R11: 0000000000000246 R12: ffff88007ff53170
Oct  7 06:27:05 stage kernel: [5729625.855236] R13: ffff8800230b9530 R14: 0000000000000000 R15: 0000000000000000
Oct  7 06:27:05 stage kernel: [5729625.855236] FS:  00007f1e47300720(0000) GS:ffff8800034e6000(0000) knlGS:0000000000000000
Oct  7 06:27:05 stage kernel: [5729625.855236] CS:  e033 DS: 0000 ES: 0000 CR0: 000000008005003b
Oct  7 06:27:05 stage kernel: [5729625.855236] CR2: 00007f3d34099120 CR3: 0000000001001000 CR4: 0000000000002660
Oct  7 06:27:05 stage kernel: [5729625.855236] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
Oct  7 06:27:05 stage kernel: [5729625.855236] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Oct  7 06:27:05 stage kernel: [5729625.855236] Call Trace:
Oct  7 06:27:05 stage kernel: [5729625.855236]  [<ffffffff8100dc35>] ? xen_force_evtchn_callback+0x9/0xa
Oct  7 06:27:05 stage kernel: [5729625.855236]  [<ffffffff8100e242>] ? check_events+0x12/0x20
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff8100e1e9>] ? xen_irq_enable_direct_end+0x0/0x7
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff8104827b>] ? finish_task_switch+0x44/0xaf
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff812fb155>] ? thread_return+0x4e/0xe0
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff8104914d>] ? switched_to_rt+0x0/0x58
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff810693bf>] ? cpu_clock+0x28/0x30
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff8109428f>] ? watchdog+0x0/0x75
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff810942d3>] ? watchdog+0x44/0x75
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff8109428f>] ? watchdog+0x0/0x75
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff81064c4d>] ? kthread+0x79/0x81
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff81011baa>] ? child_rip+0xa/0x20
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff81010d61>] ? int_ret_from_sys_call+0x7/0x1b
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff8101151d>] ? retint_restore_args+0x5/0x6
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff8100e22f>] ? xen_restore_fl_direct_end+0x0/0x1
Oct  7 06:27:06 stage kernel: [5729625.855236]  [<ffffffff81011ba0>] ? child_rip+0x0/0x20
jimg
  • 262
  • 4
  • 14

1 Answers1

2

Your kernel is not crashing.

Are you running postfix? This is the only (common) program I'm aware of that has an executable called 'bounce'.

It looks like bounce is waiting for a read/write/flock to complete, indeed, looking at the stack, it appears to have gone to sleep at ext3_sync_file()

Is the filesystem mounted with the sync option? Although changing this to async won't solve the problem it will give performance improvements after you've sorted the underlying fault.

I'd start by checking that the disk is not being spun down/disabled, then try running an fsck on all partitions on the disk, then if it's clean, checking for contention on the files bounce is accessing (i.e. if it is Postfix, are there multiple instances running?, or you running backups while this occurs?). If that doesn't show up anything try moving the datafiles to a different physical disk.

(and do check your kernel patches are up to date - there seem to be reports of a similar problem - though not I've not seen it with this particular version)

symcbean
  • 19,931
  • 1
  • 29
  • 49
  • yes- we are running postfix: with a large job running every evening.. but this "slowdown" occurred outside that window. – jimg Oct 12 '11 at 14:20
  • On the advice of the hosting company, I did update the kernel via Debian backport - linux-image-2.6.38-bpo.2-amd64. This error still appeared. – jimg Oct 12 '11 at 14:44
  • Does the second log extract provide any additional information? – jimg Oct 12 '11 at 21:59
  • 1
    More confused - AFAIK iptables doesn't explicitly use disk I/O - unless its paged out? There's a rather glaring problem with the time stamp counter - you might want to check /sys/devices/system/clocksource/clocksource0/current_clocksource contains 'xen' – symcbean Oct 13 '11 at 10:36