2

Basically I have a server failing under load. It's an editorial news site that sees irregular traffic spikes, and I'm tearing my hair out trying to stabilise the LAMP configuration.

Current Time: Wednesday, 14-Dec-2011 15:13:06 SAST
Restart Time: Wednesday, 14-Dec-2011 14:08:44 SAST
Parent Server Generation: 0
Server uptime: 1 hour 4 minutes 21 seconds
Total accesses: 52825 - Total Traffic: 530.2 MB
CPU Usage: u281.32 s20.44 cu0 cs0 - 7.82% CPU load
13.7 requests/sec - 140.6 kB/second - 10.3 kB/request
19 requests currently being processed, 13 idle workers

Am I crazy, or should my dedicated server be making easy work of this load?

  • Intel i7
  • 8GB DDR3
  • Software RAID 1
  • CentOS 6

Load averages are typically around 3, but twice today the load climbed to 30+, dropped its clients, and then stabilised back to around 2.

"top" reveals little of interest, with mysql sitting at 11% CPU.

In your opinion, could this be a hardware issue? In one such episode of bad load, it looked like the RAID was clogging up behind an unresponsive ATA interface.

How many req/s would you say is fair for a box this size?

Isotope
  • 21
  • 1
  • First things first: gather performance metrics. Use dstat, iostat, vmstat, free, and the apache status page output to collect data on performance during different times of the day. Depending on the (in)sanity of your web requests and SQL queries, 13 requests per second could be nothing special, or a server-killer. – adaptr Dec 14 '11 at 14:40
  • Do you know what kind of harddrives that soft raid-1 consists of? I'm pretty confident that this is a I/O issue.. top should show you that. – pauska Dec 14 '11 at 15:00
  • 8 GB DDR3? That is not divisible by 3, so you are not getting the best possible RAM performance. The banks used should be 3, 6, 9 or 12. Ask your hardware vendor for details... – Nils Dec 14 '11 at 20:26

4 Answers

2

The "load average" number is not actually load; it is the number of threads in the "running" or "runnable" state. Those threads can be waiting for something to happen, such as paging operations or I/O, which is bad performance-wise: I/O is typically a shared resource, and if a number of threads are already waiting on it, chances are good that even more will join the wait queue.

In a setup with a running MySQL server, I have seen similar figures due to lock contention on a popular table during longish update operations. You can check by issuing the SHOW PROCESSLIST command on your MySQL server (phpMyAdmin even exposes this as a function). The quick-and-dirty solution in that case was enabling low-priority-updates in the MySQL configuration.
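A quick check along those lines, assuming a MySQL client connection with sufficient privileges (the SET GLOBAL form is the runtime equivalent of the my.cnf setting):

```sql
-- Show every connection; many threads stuck in a "Locked" state on the
-- same table point to lock contention.
SHOW FULL PROCESSLIST;

-- Quick-and-dirty mitigation: make UPDATEs yield to pending SELECTs
-- (equivalent to low-priority-updates in my.cnf; affects MyISAM-style
-- table locks only).
SET GLOBAL low_priority_updates = 1;
```

If the mitigation helps, persist it in my.cnf so it survives a restart.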

the-wabbit
  • 40,319
  • 13
  • 105
  • 169
2

You need to get more detailed metrics to pinpoint the problem.

I usually review

  • disk io
  • ram usage
  • swap usage
  • network usage
  • connections/sec in apache
  • queries/sec in the database
  • firewall issues
  • network stack (e.g. time-wait, open connections)
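Several of the items above can be sampled with nothing but /proc, which is handy when a spike hits before you have installed anything. A minimal sketch (dstat, iostat, and vmstat give much richer numbers):

```shell
#!/bin/sh
# One-shot metric snapshot from /proc; run it from cron each minute
# during spikes and compare the samples afterwards.
echo "time:    $(date '+%H:%M:%S')"
echo "loadavg: $(cat /proc/loadavg)"                  # 1/5/15 min averages + run queue
grep -E '^(MemFree|SwapFree|Cached):' /proc/meminfo   # free RAM, swap, page cache
grep -E '^(sockets|TCP):' /proc/net/sockstat          # open sockets, time-wait count
```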

From there, I work through the logs for Apache, MySQL, and the system.

Finally, I turn to any application-specific issues.

Some tools:

  • Munin or Cacti (or other tool to give detailed stats)
  • Sysstat and bundled tools (iostat, vmstat, etc)
  • Extended Status in Apache
  • Log slow queries in MySQL
  • Cache reporting for any opcode caches, memcache etc
  • http://www.webpagetest.org/ for frontend checks
  • For app issues, some of my clients have had success with New Relic
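For the Apache and MySQL items, the relevant settings look roughly like this (paths are the CentOS 6 defaults; directive names are for Apache 2.2 and MySQL 5.1+, so adjust for your versions):

```apacheconf
# /etc/httpd/conf/httpd.conf — richer mod_status output on /server-status
ExtendedStatus On
<Location /server-status>
    SetHandler server-status
    Order deny,allow
    Deny from all
    Allow from 127.0.0.1
</Location>
```

```ini
# /etc/my.cnf — log queries slower than 2 seconds
[mysqld]
slow_query_log      = 1
slow_query_log_file = /var/log/mysql-slow.log
long_query_time     = 2
```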

With a good toolkit and a systematic approach, you can usually begin to unravel the problem.

Some useful tests:

  • Access static content (img or css)
  • Access a phpinfo or hello world page
  • Access a php page with a simple database connection and close
  • Access a php page with a DB connection, select, close
  • Access a php page with a DB connection write and close
  • Access your web application

By timing each of these tests you can begin to see where the latency is introduced. I have seen highly loaded servers return static content very quickly, with a very low Time To First Byte; that suggests the problem is at the application layer. Continue working through the application stack until you find the slowdown.

While tedious, this process has served me well and once you get used to it, you can blow through it very quickly.

jeffatrackaid
  • 4,112
  • 18
  • 22
1

Does this happen on a regular basis? I.e., do you know roughly when it will happen each day?

Cron jobs running at that time?

What processes (top or htop should show it) are running?

What disk subsystem are you running? RAID type? Type of controller? (on different channels...?)

Server load isn't just CPU use. It can be a network overload or drive system overload.

Are you checking your disks to see if there's an issue on the drives? One possibly failing?

You'll need to narrow down what exactly is going on: is the database choking, are you actually getting a high number of hits to the website, what does your traffic look like, are there messages in the logs, is the server running a batch job of some kind that is heavy on disk I/O? Any of these things can cause a spike in server "load", so you'll need to narrow down where and what is going wonky. If it's happening at nearly the same time of day each time, check cron schedules and anything that might be doing housekeeping on the server, including backups or disk dumps.
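A quick sweep for scheduled jobs and disk trouble along those lines (paths are CentOS defaults; dmesg may need root, and deeper checks like smartctl require the smartmontools package):

```shell
#!/bin/sh
# Correlate spike times with scheduled jobs, then glance at RAID/disk health.
grep -v '^#' /etc/crontab 2>/dev/null            # system-wide cron schedule
ls /etc/cron.hourly /etc/cron.daily 2>/dev/null  # periodic housekeeping jobs
cat /proc/mdstat 2>/dev/null                     # software RAID: [UU] good, [U_] degraded
dmesg 2>/dev/null | grep -iE 'ata[0-9]|i/o error' | tail -n 5   # ATA resets, I/O errors
```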

If it correlates with something else (maybe updating a particular type of news story), check your bandwidth usage. Check your logs to see whether you're under some kind of scan or probe from malicious users.

Bart Silverstrim
  • 31,092
  • 9
  • 65
  • 87
1

Scaling, for the impatient or just lazy:

  • Cache DB results (memcached) and static stuff (varnish, nginx);
  • Separate asset serving from app serving (images, js, css, serve that from a different host);
  • Separate DB from app;
  • Load balance the app access across multiple servers;
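As a sketch of the second bullet, a minimal nginx vhost for a separate asset host (the hostname and paths are placeholders, not from the question):

```nginx
# Hypothetical asset host: nginx serves img/js/css so Apache only runs PHP.
server {
    listen 80;
    server_name assets.example.com;   # placeholder hostname
    root /var/www/assets;             # placeholder path

    location / {
        expires 7d;                   # let browsers and caches keep static files
        add_header Cache-Control "public";
    }
}
```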

Of course, before doing any of that you should check your server like Bart said and make sure it is already doing everything it can. I mean, if there's room to improve your current design, do that first, but even in that situation caching will help a lot.

coredump
  • 12,573
  • 2
  • 34
  • 53