2

I use `htop` to monitor my web server. It has been quite loaded recently, and the load average shows something like this:

Load average: 3.10 2.56 1.63

I searched the web about these numbers and I found an article about it: http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

The article says that if I have 2 CPUs, a load of 2.0 means 100% CPU utilization.

And my VPS has two CPUs, so what does 3.1 mean? How could it exceed 100% CPU utilization?

Based on these numbers, should I be worried about the load now? Performance seems totally fine, and since this is a managed VPS, the hosting company hasn't sent me any warning about it.

During the daytime, the load average always shows high numbers like these... here is another snapshot taken while writing this.

Load average: 3.03 2.77 1.97

Load average: 0.41 1.29 1.60 <---- 5 more minutes later

So I am wondering: how much room is left for this site to grow with the current configuration? What proactive actions should I take in advance?

I don't want to wait until the server bursts.

Thanks.

Joe Huang
  • In addition to the duplicate question linked above, please read the other `load average` questions here on Server Fault, as well as [the Wikipedia article on load average](http://en.wikipedia.org/wiki/Load_(computing)#Unix-style_load_calculation). Once you *understand* load average you will be able to interpret it and use it as a tool for capacity planning. – voretaq7 Nov 05 '13 at 18:19

2 Answers

7

3.1 means that on average there are 3.1 processes either using the CPU, waiting for it, or waiting for I/O to complete. It's not a measure of CPU utilization but of load.

The load average is just one piece of information. It doesn't really tell you much by itself. Have a competent server administrator analyze the server's behavior to assess how well it's handling the load it's being given.
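To make the arithmetic concrete, here is a small Python sketch (not from the thread; `load_per_cpu` is a hypothetical helper) of the usual first-pass reading of these numbers: divide each load average by the CPU count.

```python
import os

def load_per_cpu(load_avgs=None, cpus=None):
    """Divide each load average by the CPU count.

    Around 1.0 per CPU means the processors are, on average, fully
    occupied; above 1.0 means processes are queueing for CPU or I/O.
    """
    if load_avgs is None:
        load_avgs = os.getloadavg()   # (1, 5, 15 minute averages), Unix only
    if cpus is None:
        cpus = os.cpu_count() or 1
    return tuple(round(avg / cpus, 2) for avg in load_avgs)

# The snapshot from the question, on a 2-CPU VPS:
print(load_per_cpu((3.10, 2.56, 1.63), cpus=2))
# The 1-minute figure works out to 1.55 runnable/waiting processes per CPU.
```

Per-CPU load is only a rough gauge, for the reasons above: processes waiting on I/O inflate it even while the CPUs sit idle.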

David Schwartz
  • hm... interesting... because the link says something different about what the load average indicates. – Joe Huang Nov 04 '13 at 09:58
  • 2
    The link provided in the question is a good introduction but because it's just an introduction it doesn't include what David is saying here which is that processes can end up stuck in the queue even when the CPU is idle and that even a box with a high load average can still be working fine. The link also mentioned that the 15 minute average is the best place to look, not the 1 minute average. A metric like *front-end web request latency* is better for determining when you actually have a *problem*. – Ladadadada Nov 04 '13 at 10:45
  • In fact, that post is simplified to the point where it's more likely to mislead than to educate. It would be mostly correct if all processes were CPU bound and nobody ever waited for I/O. – David Schwartz Nov 07 '13 at 08:03
4

The best proactive action you can take is to install a monitoring/graphing tool like Cacti, Zabbix, Nagios, Munin or Observium. (There are other choices available.)

Track load average, CPU utilisation, I/O stats, memory usage, HTTP requests per second and anything else you can think of. With the graphs, you will often be able to predict and prevent downtime before it happens.

Most tools also provide alerts on thresholds such as "Less than 5% disk space remaining" which can very quickly let you home in on the source of the downtime.
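As an illustrative sketch (this is not any particular monitoring tool's API, and `check_threshold` is a made-up helper), a threshold alert boils down to comparing a sampled metric against a limit:

```python
def check_threshold(name, value, limit, lower_is_bad=True):
    """Return an alert string when a metric crosses its limit, else None.

    lower_is_bad=True fires when the value drops BELOW the limit
    (e.g. free disk space); False fires when it rises ABOVE it.
    """
    breached = value < limit if lower_is_bad else value > limit
    if breached:
        return f"ALERT: {name} = {value} (limit: {limit})"
    return None

# Fires: only 3.2% disk space free, under the 5% limit.
print(check_threshold("free disk %", 3.2, 5.0))
# Quiet: a 15-minute load of 1.97 is still under a limit of 2.0.
print(check_threshold("15-min load", 1.97, 2.0, lower_is_bad=False))
```

A real tool samples metrics like these on a schedule and stores the history for graphing, which is what makes trend-based capacity planning possible.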

These tools will be less effective if you only have one box.

Ladadadada
  • It's a managed VPS, so I actually don't have the permission to install these tools. – Joe Huang Nov 04 '13 at 09:59
  • 1
    @JoeHuang If the system is *properly* managed it likely already has monitoring tools installed - you should speak to your provider about getting access to the monitoring information and trend statistics. – voretaq7 Nov 05 '13 at 18:23