Why are the kernel and the watchdogs consuming all the CPU?


I run a bunch of heavy user space processes on a powerful Xeon machine (32 virtual processors according to /proc/cpuinfo). The load average is normally around 30 and the machine feels responsive, but later on a few more user space processes are forked, and they allocate lots of memory and perform CPU-intensive calculations. At that point the load increases to about 60-150 and the machine is on its knees.

But when that occurs the CPU does not seem to be (mainly) consumed by my user space processes anymore. See output from top below.


What can cause the watchdogs to consume so much CPU?

Is it possible to guess why 93.1% of the CPU is consumed by the system, instead of my user space processes?

top - 13:58:49 up 44 days,  6:32, 11 users,  load average: 137.97, 64.80, 30.74
Tasks: 403 total,  48 running, 355 sleeping,   0 stopped,   0 zombie
Cpu(s):  6.4%us, 93.1%sy,  0.0%ni,  0.5%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:    258441M total,   257793M used,      647M free,     3243M buffers
Swap:    16383M total,        0M used,    16383M free,   239114M cached

53921 me        20   0 1380m  88m  10m S    333  0.0   0:59.21 java
   40 root      RT   0     0    0    0 S     97  0.0  34:43.65 watchdog/8
   68 root      RT   0     0    0    0 S     93  0.0  27:57.92 watchdog/15
   52 root      RT   0     0    0    0 S     92  0.0  27:49.83 watchdog/11
   88 root      RT   0     0    0    0 S     82  0.0  37:38.22 watchdog/20
54041 me        20   0 1317m  82m  10m S     78  0.0   1:00.55 java
   24 root      RT   0     0    0    0 S     67  0.0  30:50.30 watchdog/4
 3460 root      20   0     0    0    0 S     55  0.0   4:44.01 afs_rxevent
  128 root      RT   0     0    0    0 S     53  0.0  38:39.29 watchdog/30
45245 root      20   0     0    0    0 R     53  0.0   1:25.40 kworker/2:0
  124 root      RT   0     0    0    0 S     50  0.0  36:14.85 watchdog/29
42623 root      20   0     0    0    0 R     49  0.0   3:24.94 kworker/1:0
55884 foo       20   0 34640  20m 7796 R     49  0.0   0:05.64 program1
53312 me        20   0 1388m 191m  10m S     48  0.1   1:25.89 java
44111 root      20   0     0    0    0 R     47  0.0   5:12.84 kworker/24:0
   86 root      20   0     0    0    0 R     43  0.0  26:00.48 kworker/20:0
55968 foo       20   0 34660  20m 7800 R     38  0.0   0:03.16 program1
55562 foo       20   0  193m  14m 5264 S     38  0.0   0:02.45 program2
   26 root      20   0     0    0    0 R     37  0.0  35:39.29 kworker/5:0
  344 root      20   0     0    0    0 R     33  0.0  32:38.50 kworker/29:1

> uname -a
Linux machine5 3.0.13-0.27-default #1 SMP Wed Feb 15 13:33:49 UTC 2012 (d73692b) x86_64 x86_64 x86_64 GNU/Linux


Posted 2013-09-18T18:28:19.490

Reputation: 141

The process in question (java) is being run by you. Why is it not userspace? – terdon – 2013-09-18T20:33:03.503



When your application issues a system call such as read(2), the time spent in that call is accounted as %sy, not as user time. It is odd that no swap space is in use at all; have you disabled swapping? The behavior you describe is consistent with kernel memory fragmentation, which makes the kernel work harder and harder to satisfy the dynamic memory allocations it relies on quite heavily.
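To see the user/system split without top, here is a minimal Python sketch that compares two samples of the aggregate `cpu` line in /proc/stat. The function names (`parse_cpu_line`, `cpu_shares`, `sample_live`) are made up for illustration, and it assumes the usual field order on that line (user, nice, system, idle, iowait, irq, softirq, steal, ...).

```python
import time


def parse_cpu_line(line):
    """Return the numeric fields of the aggregate 'cpu' line from /proc/stat."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return [int(x) for x in parts[1:]]


def cpu_shares(before, after):
    """Return (user %, system %) of the CPU time elapsed between two samples."""
    b = parse_cpu_line(before)
    a = parse_cpu_line(after)
    delta = [y - x for x, y in zip(b, a)]
    # Sum only the first eight fields (user .. steal); guest time, when
    # present, is already included in the user and nice counters.
    total = sum(delta[:8])
    if total == 0:
        return 0.0, 0.0
    user, system = delta[0], delta[2]
    return 100.0 * user / total, 100.0 * system / total


def sample_live(interval=1.0, path="/proc/stat"):
    """Take two live samples `interval` seconds apart (Linux only)."""
    with open(path) as f:
        first = f.readline()
    time.sleep(interval)
    with open(path) as f:
        second = f.readline()
    return cpu_shares(first, second)
```

On a Linux machine, calling `sample_live()` returns the two percentages over a one-second window. It only tells you that the time is going to the kernel, not why.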

Don't rely on top(1) output; when this happens, look in /proc/meminfo for the truth. If Committed_AS approaches or exceeds MemTotal, you have exceeded the available memory in the system and need to begin swapping. If Committed_AS exceeds CommitLimit, you have exhausted both physical memory and swap space. Before you get to that point, however, performance falls off a cliff, as you have described.
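The comparison above can be sketched in a few lines of Python. This is only an illustration: `parse_meminfo`, `commit_status`, and the 0.9 "approaching" threshold are assumptions of mine, not anything the kernel defines. The field names MemTotal, CommitLimit, and Committed_AS are the ones /proc/meminfo reports.

```python
def parse_meminfo(text):
    """Parse /proc/meminfo content into a dict of field name -> integer value."""
    fields = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        tokens = rest.split()
        if tokens:
            # Most values are in kB; a few (e.g. HugePages_Total) are plain counts.
            fields[name.strip()] = int(tokens[0])
    return fields


def commit_status(fields, near=0.9):
    """Classify Committed_AS against CommitLimit and MemTotal.

    `near` is an arbitrary illustrative threshold for "approaches MemTotal".
    """
    committed = fields["Committed_AS"]
    if committed > fields["CommitLimit"]:
        return "over CommitLimit"
    if committed >= near * fields["MemTotal"]:
        return "near or over MemTotal"
    return "ok"


def check_live(path="/proc/meminfo"):
    """Read the live file and classify it (Linux only)."""
    with open(path) as f:
        return commit_status(parse_meminfo(f.read()))
```

Run `check_live()` while the machine is struggling, not afterwards, since Committed_AS drops again as soon as the extra processes exit.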

Oldest Software Guy

Posted 2013-09-18T18:28:19.490

Reputation: 221