
I own a website which has been running on a VPS since last week. From Monday until Saturday everything goes smoothly: the website gets around 4,500 unique visitors a day, and the load average and response time are fine.

On Sundays, the website gets around 11,000 unique visitors, because we offer unique and exclusive content on that day. The content is stored in a MySQL database, which runs on a different VPS and uses the InnoDB engine. This is where things go wrong: because of the increase in visitors, the load average rises to extreme levels, to the point where the website becomes unreachable.

Here is the top output:

 This is an automated message notifying you that the 5 minute load average on your system is 238.37.
 This has exceeded the 10 threshold.

 One Minute      - 237.31
 Five Minutes    - 238.37
 Fifteen Minutes - 231.1

 top - 16:41:12 up 5 days, 18:51,  1 user,  load average: 238.68, 238.62, 231.25
 Tasks: 517 total, 246 running, 271 sleeping,   0 stopped,   0 zombie
 Cpu(s):  1.8%us,  0.3%sy,  0.0%ni, 97.6%id,  0.0%wa,  0.0%hi,  0.1%si,  0.2%st
 Mem:   3922920k total,  3542968k used,   379952k free,     2736k buffers
 Swap:  1048564k total,   105316k used,   943248k free,   142772k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND    
 14395 apache    20   0  313m  13m 4044 R  2.8  0.4   0:09.81 /usr/sbin/httpd -k start -DSSL
 13405 apache    20   0  314m  15m 4432 R  2.3  0.4   0:17.87 /usr/sbin/httpd -k start -DSSL
 15865 apache    20   0  312m  13m 4176 R  2.3  0.4   0:01.28 /usr/sbin/httpd -k start -DSSL
 15930 apache    20   0  310m  11m 4060 R  2.3  0.3   0:00.88 /usr/sbin/httpd -k start -DSSL
 15978 apache    20   0  310m  11m 4048 R  2.3  0.3   0:01.08 /usr/sbin/httpd -k start -DSSL
 16041 apache    20   0  309m  10m 4052 R  2.1  0.3   0:00.58 /usr/sbin/httpd -k start -DSSL
 16082 apache    20   0  211m 4192 2276 R  1.9  0.1   0:00.09 /usr/sbin/httpd -k start -DSSL
 14298 apache    20   0  310m  11m 4044 R  0.6  0.3   0:09.56 /usr/sbin/httpd -k start -DSSL
 14457 apache    20   0  311m  11m 4068 R  0.6  0.3   0:10.18 /usr/sbin/httpd -k start -DSSL
 14486 apache    20   0  310m  11m 4464 R  0.6  0.3   0:06.13 /usr/sbin/httpd -k start -DSSL
 15287 apache    20   0  313m  14m 4048 R  0.6  0.4   0:05.21 /usr/sbin/httpd -k start -DSSL
 15363 apache    20   0  310m  11m 4064 R  0.6  0.3   0:04.13 /usr/sbin/httpd -k start -DSSL
 15400 apache    20   0  313m  13m 4048 R  0.6  0.4   0:04.09 /usr/sbin/httpd -k start -DSSL
 15404 apache    20   0  310m  11m 4056 R  0.6  0.3   0:04.22 /usr/sbin/httpd -k start -DSSL
 15649 apache    20   0  313m  14m 4432 R  0.6  0.4   0:02.88 /usr/sbin/httpd -k start -DSSL
 15675 apache    20   0  310m  10m 4044 S  0.6  0.3   0:02.22 /usr/sbin/httpd -k start -DSSL
 15692 apache    20   0  310m  11m 4084 R  0.6  0.3   0:01.46 /usr/sbin/httpd -k start -DSSL
 15702 apache    20   0  311m  12m 4044 R  0.6  0.3   0:01.85 /usr/sbin/httpd -k start -DSSL
 15719 apache    20   0  310m  10m 4048 R  0.6  0.3   0:02.32 /usr/sbin/httpd -k start -DSSL
 15781 apache    20   0  318m  18m 4044 R  0.6  0.5   0:01.91 /usr/sbin/httpd -k start -DSSL
 15788 apache    20   0  312m  13m 4048 R  0.6  0.4   0:02.13 /usr/sbin/httpd -k start -DSSL
 15823 apache    20   0  310m  11m 4060 R  0.6  0.3   0:02.04 /usr/sbin/httpd -k start -DSSL
 15837 apache    20   0  311m  12m 4052 R  0.6  0.3   0:01.64 /usr/sbin/httpd -k start -DSSL

On Sundays, the website has to run a fairly large query with a couple of LEFT JOINs across different tables.

The website runs on a VPS with 2 x 2.4 GHz processors and 4 GB of RAM. The database runs on an SSD VPS with 2 x 2.4 GHz processors and 2 GB of RAM.

On that particular Sunday, I also got this message in the server's error log:

 [Sun Nov 24 15:03:34 2013] [error] server reached MaxClients setting, consider raising the MaxClients setting

The website is built with the PHP CodeIgniter framework and worked fine for the first 8 weeks on shared hosting (with the same code). After those weeks the problem started, which is why I decided to move to a VPS. But the problem seems to be continuing.

I have absolutely no clue where things are going wrong, so any help would be highly appreciated.

Tomzie

2 Answers

0

MaxClients is an Apache server directive. There are a couple of related directives, but they all, in different ways, set a limit on Apache's processing of requests.

The motive for this is to keep Apache from consuming so many resources that it threatens overall system stability. Therefore, if you increase MaxClients and other similar directives, you need to keep an eye on system resources such as RAM.

Read more here: MaxClients Directive

To begin with, though, it is not clear exactly where the problem is, even if you have spotted some symptoms (obviously, or you wouldn't be posting). It is possible that Apache is showing you a problem which actually lies elsewhere.

But in your output you are focusing on Apache, and that part is fairly straightforward, so let's start there.

As you can read at the link, MaxClients defines the maximum number of requests httpd will serve simultaneously. When that fills up, requests are queued in the ListenBacklog. When that is full, clients are rejected.

  • Either MaxClients filled up because the number of simultaneous requests was simply that high. Then you need to provision for that, which is relatively easy: scale up and/or out in the Apache layer until you're on the safe side, or until the database layer maxes out (a rough sizing sketch follows this list).

  • Or MaxClients filled up because requests could not be served fast enough due to the underlying layers, so they piled up in Apache until no more were accepted. That is not as easy to solve, as it first raises a multitude of questions. You focus straight away on the database layer, probably for good reason, even if you haven't made that completely clear (yet).
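
For the first case, the sizing is mostly arithmetic: divide the RAM you can spare for Apache by the average size of an httpd child. Here is a rough sketch, assuming Apache 2.2 with the prefork MPM; the numbers are illustrative only, based on the roughly 15 MB resident size per child visible in your top output, and should not be copied verbatim:

 # Illustrative prefork sizing only -- measure your own child sizes before changing anything.
 # Rule of thumb: MaxClients ~= (RAM available to Apache) / (average httpd resident size).
 # With roughly 3 GB left for Apache and ~15 MB per child, that caps out around 200.
 <IfModule prefork.c>
     StartServers          10
     MinSpareServers       10
     MaxSpareServers       20
     ServerLimit          200
     MaxClients           200
     MaxRequestsPerChild 4000
 </IfModule>
 # Connections beyond MaxClients wait here before being refused (also limited by the kernel):
 ListenBacklog 511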

It would also be in your interest to see whether clients were rejected, how many, and so on. If you log response status codes you could parse for 503, I seem to recall, but you should double-check me on that. This is just to learn about your delivery and to set a baseline to compare against the next time it happens.
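
As a sketch, assuming the stock combined log format and an access log at /var/log/httpd/access_log (adjust the path and the field position to your own setup), counting those can be as simple as:

 # Requests answered with 503 (Service Unavailable), total and per hour:
 awk '$9 == 503' /var/log/httpd/access_log | wc -l
 awk '$9 == 503 {print substr($4, 2, 14)}' /var/log/httpd/access_log | sort | uniq -c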

Here are some questions which come to mind. I realize that finding the answers is easier said than done unless you have tools which give you deep insight. We use Dynatrace (Java-oriented), which is very expensive but can answer such questions in a matter of minutes, down to the component level in the code, even for a sysadmin. That makes pinpointing causality, whether it lies in the code, in the infrastructure, or in a combination, a swift procedure. I know there are other tools on the market, and possibly open source alternatives, which do this; I just don't have experience working with those. One can of course debug and application-log the same things, it just consumes a lot more working time.

Did the requests take longer to serve during the build-up to the error you described? My Apache logs show me how long it took to serve each request. That is not part of the default log format, but it only takes a single directive to add.
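
A sketch of that directive, assuming Apache 2.x: %D appends the time taken to serve each request, in microseconds, to an otherwise standard combined format.

 LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %D" combined_timed
 CustomLog logs/access_log combined_timed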

If they did take longer, was it due to different request patterns or simply to greater numbers?

In terms of Apache server load, do you serve a lot of static content, or is it mainly dynamic? Have you baselined static delivery and dynamic delivery separately somehow? The question is relevant because static content can be offloaded both to separate delivery servers and to a RAM cache (Varnish is mentioned in the other answer). In some deliveries the savings are significant, in others not.

Are there many requests for (essentially) the same dynamic content, or is it recalculated uniquely per request? In other words, is it possible to inject an early caching layer to catch certain dynamic content?
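
If the answer is "mostly the same content for everyone", a reverse-proxy cache such as Varnish in front of Apache is the usual route, but even mod_cache inside Apache can take pressure off. A minimal sketch, assuming Apache 2.2 with mod_cache and mod_disk_cache loaded, and an application that sends sensible Cache-Control headers (if it doesn't, fix that first):

 <IfModule mod_cache.c>
     <IfModule mod_disk_cache.c>
         CacheEnable disk /
         CacheRoot /var/cache/httpd
         CacheDefaultExpire 300
     </IfModule>
 </IfModule>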

Looking deeper, is the code heavy in its processing? Can one baseline that using requests which mainly trigger logic but relatively few database calls? Perhaps it is the code which needs to be optimized or scaled up/out?

Are the database calls optimized or wasteful, both in number and in how the queries are formulated? What were the response times per call during the heavy load? Did response times increase? Did the number of calls reach high volumes?
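
One low-effort way to get those per-call numbers, assuming MySQL 5.1 or later and shell access on the database VPS (the user, password prompt and log path below are placeholders):

 # Log every query slower than one second; lower long_query_time to catch more.
 mysql -u root -p -e "SET GLOBAL slow_query_log = ON; SET GLOBAL long_query_time = 1;"
 # After the next Sunday peak, summarize the slow log (mysqldumpslow ships with MySQL):
 mysqldumpslow -s t /var/lib/mysql/$(hostname)-slow.log | head -20

Running EXPLAIN on the big Sunday query itself will also show quickly whether the LEFT JOINs are using indexes or doing full table scans.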

You see where this heads: closer and closer to the database, at each layer looking for opportunities for optimizing, caching, distributing (scaling out) or adding raw power (scaling up).

Maybe you already have all of these answers and are totally right in zeroing in on InnoDB; I can't tell from the info given! All I can provide are methodical questions, which may or may not be relevant and may or may not ring bells :-)

But getting Apache out of the equation quickly seems like a good idea (as it may just be suffering collateral damage). If you can indeed verify that Apache is not the root cause, focus on the application and then on how it uses the database.

Basic stats such as network I/O, RAM consumption, CPU consumption, disk I/O, page faults and so on for the machines involved can also be helpful to study.
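
A handful of standard tools cover most of that, assuming a typical Linux VPS (install the sysstat package if iostat/sar are missing):

 vmstat 5          # run queue, swapping, CPU breakdown
 iostat -x 5       # per-device disk utilisation and average wait times
 sar -n DEV 5      # network throughput per interface
 free -m           # RAM, buffers/cache and swap at a glance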

ErikE
  • So you're saying that the MaxClients limit is also causing the extreme load average? – Tomzie Nov 30 '13 at 14:38
  • No, without having scrutinized your numbers, I would say the requests themselves are responsible for that. But the MaxClients directive (along with the others) is there to avoid the server getting overwhelmed; it does this by refusing further requests. If your server is pushed to the limit, consider adding resources as necessary and increasing MaxClients and the other settings (scale up), or adding servers with a load balancer in front (scale out). If you scale out, you could even lower MaxClients if you deem it prudent, in order to lower the pressure on each server. – ErikE Nov 30 '13 at 14:45
  • Okay. Do you think switching from InnoDB to MyISAM would affect the memory usage? – Tomzie Nov 30 '13 at 14:54
  • It's somewhat like balancing on a rope: there is no fixed formula, only careful observation and tuning over time. – ErikE Nov 30 '13 at 14:55
  • Weren't the DBs on separate servers, whilst Apache was getting the hammering? (Sorry, on an iPhone with kids all around, will read more thoroughly tonight.) – ErikE Nov 30 '13 at 14:57
  • That's correct. The database is running on a different VPS, and the web application server is the one giving the load average warnings. Does that mean that the MySQL database can't be the problem? – Tomzie Nov 30 '13 at 15:07
0

The answer to your question is to leverage memory caching as much as possible (memcached, Varnish, etc.), and then use nginx, which you can scale horizontally, with a PHP-FPM pool behind it that is sized appropriately for your load and fully meshed with the upstream nginx boxes.
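
A minimal sketch of the nginx-to-PHP-FPM piece; the paths, port and names are examples only, not tuned values:

 upstream php_backend {
     server 127.0.0.1:9000;     # a PHP-FPM pool; add more servers here to scale out
 }
 server {
     listen 80;
     root /var/www/site;
     index index.php;
     location ~ \.php$ {
         include fastcgi_params;
         fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
         fastcgi_pass php_backend;
     }
 }

Sizing pm.max_children in the FPM pool against available RAM is the same arithmetic as sizing Apache's MaxClients above.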

Once you get to a certain level of traffic, it's not so much about throwing hardware at the problem as about leveraging caching and having individual tiers that can be upgraded and scaled independently.

You can't have a highly available site on a single VPS unless it's static HTML, and even then Varnish is ideal.

Get a pair of HAProxy front-end load balancers distributing to Varnish, which pulls from nginx / PHP / memcached / Redis / MySQL (or PostgreSQL). That's it in a nutshell :P

nandoP