
I have a very high load on my machine and don't know what is responsible or how to find out.

The machine runs a JBoss app server and MySQL. Here is the top output from that user at peak time:

top - 16:23:01 up 101 days,  6:50,  1 user,  load average: 23.42, 21.53, 24.73
Tasks:   9 total,   1 running,   8 sleeping,   0 stopped,   0 zombie
Cpu(s): 17.2%us,  1.6%sy,  0.0%ni, 80.4%id,  0.1%wa,  0.1%hi,  0.7%si,  0.0%st
Mem:  16440784k total, 16263720k used,   177064k free,   151916k buffers
Swap: 16780872k total,    30428k used, 16750444k free,  8963648k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
27344 b         40   0 16.0g 6.5g  14m S  169 41.7   1184:09 java
 6047 b         40   0 11484 1232 1228 S    0  0.0   0:00.01 mysqld_safe
 6192 b         40   0  604m 182m 4696 S    0  1.1  93:30.40 mysqld
 7948 b         40   0 84036 1968 1176 S    0  0.0   0:00.07 sshd
 7949 b         40   0 14004 2900 1608 S    0  0.0   0:00.03 bash
 7975 b         40   0  8604 1044  840 S    0  0.0   0:00.44 top

The CPU usage of the java process is normal. The peaks only show up when I deploy a certain web application. Could the resulting network traffic boost the load in such a way that I don't see it in top?

4 Answers


Load average is actually quite complicated, but my understanding is that it is basically the number of things waiting in the run queue. So my guess is that you may have things waiting on IO. Here is a nice stolen snippet to see what is waiting:

ps -eo stat,pid,user,command | egrep "^STAT|^D|^R"

D : Uninterruptible sleep (usually IO)
R : Running or runnable (on run queue)

As pointed out, iostat also works well to see whether it is likely the disk.
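
If the spikes are short-lived, a throwaway loop around that snippet can catch the D/R processes while they happen. A minimal bash sketch (the one-second interval is just an assumption):

while true; do
    date
    ps -eo stat,pid,user,command | egrep "^STAT|^D|^R"
    sleep 1
done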

Kyle Brandt

Hard to say from a single top snapshot. More info required.

Assuming, as you say, the CPU usage is normal, you appear to have spare CPU and you are not out of memory, so the next thing I'd look at is IO.

Is the IOWait (%wa) always low, or is this snapshot atypical from the IOWait perspective?

vmstat 1 will show your memory and IO over time.

iostat -x 1 will also show which disks/partitions are being written to.
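
If the peaks are hard to catch live, a rough collection sketch like this (bash assumed; the file names under /tmp are just placeholders) captures both over a peak so you can read them back afterwards:

vmstat 1 > /tmp/vmstat-$(date +%F-%H%M).log &
iostat -x 1 > /tmp/iostat-$(date +%F-%H%M).log &
# ...let them run across a peak period, then stop the collectors:
kill %1 %2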

On hosts where the web app and the database live on the same box, one thing I have seen on more than one occasion is that the web app's logs and the database's data directory end up on the same disk/partition/filesystem, which can cause contention. A number of distros I have seen put the MySQL data in /var/lib/mysql, Tomcat webapps in /var/lib/tomcat/webapps, and of course the logs in /var/log/tomcat.

I.e. your web app is taking lots of hits and trying to log those hits to the partition, while at the same time the DB is trying to read data from that same partition.

I generally find utilisation, await time, and service time the most useful stats from iostat if I suspect contention.

The quick and dirty way to find out is simply to move the Tomcat log location to a different partition/disk, if possible.
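
One low-effort way to do that without touching the application config is to move the log directory and leave a symlink behind. A rough sketch, assuming the logs live in /var/log/tomcat and the spare disk is mounted at /data (both paths, and the service name, are assumptions and vary by distro):

service tomcat stop              # service name varies (tomcat, tomcat6, ...)
mv /var/log/tomcat /data/tomcat-logs
ln -s /data/tomcat-logs /var/log/tomcat
service tomcat start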

Jason Tan

The usual answer in such cases: start gathering some stats with Munin or Cacti, because right now you are pretty blind. Things to plot:

  • IO statistics - disk reads/writes
  • memory consumption, reads and writes from swap
  • number of processes and number of threads [can it be that java for some reason spawns tons of them in this specific scenario?]
  • number of open TCP sockets, open file descriptors [possibly...]
  • load average
  • CPU usage with the usual nice/iowait/user/softirq breakdown
  • for Tomcat you can also get [probably] quite good Java stats - heap size, size of PermGen/Survivor/Tenured, number of hits/sec
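
Until the graphing is in place, you can take a one-off snapshot of several of these numbers from the shell. A minimal sketch, assuming bash and using JAVA_PID as a placeholder for the java process ID (27344 in the top output above):

JAVA_PID=27344                      # placeholder: the java PID from top
cat /proc/loadavg                   # load average
free -m                             # memory and swap usage
ps -eLf | wc -l                     # total number of threads on the box
ls /proc/$JAVA_PID/fd | wc -l       # open file descriptors of the java process
ss -s                               # summary of open sockets
jstat -gcutil $JAVA_PID 1000        # JVM heap/GC stats, if a JDK is installed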
pQd

In our case this was caused by the underlying Ubuntu server having had do-release-upgrade run on it without being rebooted afterwards. Looking at the VM dumps, it was the VM itself, not the software on top of it, that was doing something weird with the OS libraries. Rebooting the OS fixed the issue.

Zds