2

I have a server that runs a few hundred processes simultaneously, most of them idle. It is a sort of web crawler, and it sleeps between requests for various reasons.

So as a result, my load average is usually something like: 21.64, 27.05, 29.16

That's very, very high, right? But everything runs smoothly!

And my CPU consumption is something like (mpstat 60 1 output):

11:07:06 AM  CPU    %usr   %nice    %sys %iowait    %irq   %soft  %steal  %guest  %gnice   %idle
11:08:06 AM  all   34.82    0.00    4.16   10.70    0.00    0.31    0.00    0.00    0.00   50.01
Average:     all   34.82    0.00    4.16   10.70    0.00    0.31    0.00    0.00    0.00   50.01

So, since I'm not even running at 100% CPU usage, I feel like I have no reason to be worried, or am I missing something? There is a slight delay when nginx serves requests, but that's expected given the large number of queued requests. However, I read somewhere that a load average higher than 1 is cause for alarm, and I honestly don't see why that is.

So please advise.

Thanks

AL-Kateb

3 Answers

4

Only worry if it actually corresponds to a slow application.

A bit more precisely, load average relates to the number of processes running or waiting to run. This can be a lot more than 1 and still perform just fine. A load average of 21 on a host with 24 cores will still have idle CPU, even with all of those processes running at 100%. The advice that 1 is a lot may come from people who have not seen large or busy hosts.
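For example, a quick check like the following (assuming a Linux host with the usual coreutils/procps tools) compares the current load averages with the number of CPU threads:

# number of logical CPUs (cores/threads) available to the scheduler
nproc

# 1-, 5- and 15-minute load averages plus runnable/total task counts
cat /proc/loadavg

As a rough guide, as long as the first loadavg figure stays comfortably below the nproc value, the run queue itself is not saturated.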

iowait is a delay for the application, but (in modern storage systems) the CPU is effectively free to do other things.

Monitor your application's response time. Correlate that with your other monitoring to see what actually indicates things are slow.
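If you have no application-level monitoring yet, even a crude probe like the one below can show whether response time tracks the load. The URL is just a placeholder for whatever endpoint your nginx actually serves:

# one request, print total time in seconds
curl -s -o /dev/null -w '%{time_total}\n' http://localhost/

# sample every 10 seconds with a timestamp, suitable for a quick log
while true; do
    printf '%s %s\n' "$(date -Is)" "$(curl -s -o /dev/null -w '%{time_total}' http://localhost/)"
    sleep 10
done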

John Mahowald
2

A load average of 1 refers to a single core/thread. So a rule of thumb is that a load average equal to your number of cores/threads is OK; more than that will most likely lead to queued processes and slow things down.

iowait, for example, is also accounted for in the load average, and one process doing heavy IO can push the load average over 1 without using a second core/thread.
While this heavy-IO process will likely have poor response times, a second process can still be very responsive at a high load, depending on which resources that process is accessing.
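A rough way to see whether iowait is what is inflating your load average (assuming the standard procps tools) is to look for processes in uninterruptible sleep, state D, since those count towards the load without using any CPU:

# list processes currently in uninterruptible sleep (usually waiting on IO)
ps -eo state,pid,comm | awk '$1 == "D"'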

Thomas
0

You should collect more information to get a better picture. Your post also lacks details such as what kind of server this is, which Linux distribution it runs, and how many CPUs/cores you have. You can run mpstat -P ALL to get per-CPU information. Do you have enough memory? Disk? How is the file system set up?
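Something along these lines (assuming the sysstat package is installed for mpstat) covers those basic checks:

# per-CPU utilisation, one 5-second sample
mpstat -P ALL 5 1

# memory and swap usage
free -h

# file system usage
df -h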

You can probably identify which operations in nginx are causing the load by looking at the output of lsof | grep nginx.

Do you have any alerting/monitoring in place? That way you can be notified when the load gets high. Do you log server load (via sar)? Can you identify any trends over the course of a day or week? What processes are running?
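If sysstat's data collection is enabled on your distribution, sar can show the load history, for example:

# run queue length and load averages, 10 samples 60 seconds apart
sar -q 60 10

# the same figures from today's collected history
sar -q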

I also see iowait numbers around 10 in your mpstat output, which means your system is spending time waiting for I/O operations. You should therefore check your disk/filesystem settings and optimise them if needed.
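A quick way to check whether the disks themselves are the bottleneck (again assuming sysstat is installed) is extended iostat output; high %util and await values point at a saturated or slow device:

# extended per-device statistics, three 5-second samples
iostat -x 5 3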

Basically, a high load is not necessarily bad -- it could simply mean your server and services are being used. Or it could mean something bad is about to happen. Either way, you should build a better understanding of the system's behaviour rather than simply saying everything runs smoothly. So gather more data, monitor, read, research, and observe over a few days; that should give you more insight.

Hope this helps.

Tux_DEV_NULL
  • What I posted above is the average of all CPUs. I checked and the load is almost the same on all of them, so the system is utilizing all CPUs and cores. Plus, I know EXACTLY what is causing the high load: as I said, it is the crawler that I'm running, which sleeps a couple of seconds between requests. My guess is the iowait is caused by network IO; I have 2 SSD disks in RAID 0 and my processes do not use the disk that much. I also have 79 GB of free RAM, so there's that as well. – AL-Kateb Aug 07 '17 at 10:59
  • It has been going on like that for a couple of months now and there is no problem, but what I would like to know is how much more I can push this server, because these numbers are not very clear to me. A load of 25 is considered high since I have 12 cores, yet CPU usage is around 30%. Does this mean I can still push it, or should the high load average concern me and should I start trying to tune things? I am running CentOS 7.3 on a server with an Intel(R) Xeon(R) CPU E5-1650 v3 and 96 GB of RAM. – AL-Kateb Aug 07 '17 at 11:04
  • 1
    The load could be due to your crawler waiting for responses from the sites it is crawling. It is hard to tell how far you can push it, since it is hard to guess the behaviour of the overall system. You need to find the limit yourself via the scientific method. – Tero Kilkanen Aug 07 '17 at 11:22