13

I am looking to understand some server performance problems I am seeing with a (for us) heavily loaded web server. The environment is as follows:

  • Debian Lenny (all stable packages + patched to security updates)
  • Apache 2.2.9
  • PHP 5.2.6
  • Amazon EC2 large instance

The behavior we're seeing is that the site typically feels responsive, but with a slight delay before it begins handling a request -- sometimes a fraction of a second, sometimes 2-3 seconds during our peak usage times. The actual load on the server is reported as very high -- often 10.xx or 20.xx as reported by top. Further, running other things on the server during these times (even vi) is very slow, so the load is definitely up there. Oddly enough, Apache remains very responsive other than that initial delay.

We have Apache configured as follows, using prefork:

StartServers          5
MinSpareServers       5
MaxSpareServers      10
MaxClients          150
MaxRequestsPerChild   0

And KeepAlive as:

KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 5

Looking at the server-status page, even at these times of heavy load we are rarely hitting the client cap, usually serving 80-100 requests, many of them in the keepalive state. That tells me to rule out the initial request slowness as "waiting for a handler," but I may be wrong.

Amazon's CloudWatch monitoring tells me that even when our OS is reporting a load of > 15, our instance CPU utilization is between 75-80%.

Example output from top:

top - 15:47:06 up 31 days,  1:38,  8 users,  load average: 11.46, 7.10, 6.56
Tasks: 221 total,  28 running, 193 sleeping,   0 stopped,   0 zombie
Cpu(s): 66.9%us, 22.1%sy,  0.0%ni,  2.6%id,  3.1%wa,  0.0%hi,  0.7%si,  4.5%st
Mem:   7871900k total,  7850624k used,    21276k free,    68728k buffers
Swap:        0k total,        0k used,        0k free,  3750664k cached

The majority of the processes look like:

24720 www-data  15   0  202m  26m 4412 S    9  0.3   0:02.97 apache2                                                                       
24530 www-data  15   0  212m  35m 4544 S    7  0.5   0:03.05 apache2                                                                       
24846 www-data  15   0  209m  33m 4420 S    7  0.4   0:01.03 apache2                                                                       
24083 www-data  15   0  211m  35m 4484 S    7  0.5   0:07.14 apache2                                                                       
24615 www-data  15   0  212m  35m 4404 S    7  0.5   0:02.89 apache2            

Example output from vmstat at the same time as the above:

procs -----------memory---------- ---swap-- -----io---- -system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa
 8  0      0 215084  68908 3774864    0    0   154   228    5    7 32 12 42  9
 6 21      0 198948  68936 3775740    0    0   676  2363 4022 1047 56 16  9 15
23  0      0 169460  68936 3776356    0    0   432  1372 3762  835 76 21  0  0
23  1      0 140412  68936 3776648    0    0   280     0 3157  827 70 25  0  0
20  1      0 115892  68936 3776792    0    0   188     8 2802  532 68 24  0  0
 6  1      0 133368  68936 3777780    0    0   752    71 3501  878 67 29  0  1
 0  1      0 146656  68944 3778064    0    0   308  2052 3312  850 38 17 19 24
 2  0      0 202104  68952 3778140    0    0    28    90 2617  700 44 13 33  5
 9  0      0 188960  68956 3778200    0    0     8     0 2226  475 59 17  6  2
 3  0      0 166364  68956 3778252    0    0     0    21 2288  386 65 19  1  0

And finally, output from Apache's server-status:

Server uptime: 31 days 2 hours 18 minutes 31 seconds
Total accesses: 60102946 - Total Traffic: 974.5 GB
CPU Usage: u209.62 s75.19 cu0 cs0 - .0106% CPU load
22.4 requests/sec - 380.3 kB/second - 17.0 kB/request
107 requests currently being processed, 6 idle workers

C.KKKW..KWWKKWKW.KKKCKK..KKK.KKKK.KK._WK.K.K.KKKKK.K.R.KK..C.C.K
K.C.K..WK_K..KKW_CK.WK..W.KKKWKCKCKW.W_KKKKK.KKWKKKW._KKK.CKK...
KK_KWKKKWKCKCWKK.KKKCK..........................................
................................................................

From my limited experience I draw the following conclusions/questions:

  • We may be allowing far too many KeepAlive requests

  • I do see some time spent waiting for I/O in the vmstat output, although not consistently and not much (I think?), so I am not sure whether this is a big concern; I am less experienced with vmstat

  • Also in vmstat, some iterations show a number of processes waiting to run, which is what I am attributing the initial page-load delay on our web server to, possibly erroneously

  • We serve a mixture of static content (75% or higher) and script content, and the script content is often fairly processor intensive, so finding the right balance between the two is important; long term we want to move the static content elsewhere to optimize both servers, but our software is not ready for that today

I am happy to provide additional information if anybody has any ideas. One other note: this is a high-availability production installation, so I am wary of making tweak after tweak, which is why I haven't played with things like the KeepAlive value myself yet.

futureal

5 Answers

7

I'll start by admitting that I don't know much about running stuff in clouds -- but based on my experience elsewhere, I'd say that this webserver config reflects a fairly low volume of traffic. That the runqueue is so large suggests that there just isn't enough CPU available to deal with it. What else is in the runqueue?

We may be allowing far too many KeepAlive requests

No -- keepalive still improves performance; modern browsers are very smart about knowing when to pipeline and when to run requests in parallel. A timeout of 5 seconds is still rather high, though, and you've got a LOT of servers sitting in it -- unless you've got HUGE latency problems I'd recommend cranking this down to 2-3. That should shorten the runqueue a little.
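
As a concrete illustration of that change (keeping the other values from the question), the KeepAlive block would end up looking something like:

KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 2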

If you've not already got mod_deflate installed on the web server, I'd recommend you do so, and add ob_gzhandler() to your PHP scripts. You can do this as an auto-prepend:

if(!ob_start("ob_gzhandler")) ob_start();

(Yes, compression uses more CPU -- but you should save CPU overall by getting servers out of the runqueue faster and by handling fewer TCP packets -- and as a bonus, your site is also faster.)
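
One way to wire that up, as a sketch (the file path is only an example): put the ob_start() line above into a small PHP file and point auto_prepend_file at it, either in php.ini:

auto_prepend_file = /var/www/prepend.php

or per vhost with mod_php:

php_value auto_prepend_file /var/www/prepend.php

(On Debian, mod_deflate itself can be enabled with a2enmod deflate and an Apache reload.)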

I'd recommend setting an upper limit on MaxRequestsPerChild -- say something like 500. This just allows some turnover of processes in case you've got a memory leak somewhere. Your httpd processes look to be HUGE -- make sure you've removed any Apache modules you don't need, and make sure you're serving up static content with good caching headers.
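
For example, the suggested limit would be set as:

MaxRequestsPerChild 500

and a quick way to see which modules are currently loaded (so you can trim the unused ones) is:

apache2ctl -M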

If you're still seeing problems, then the problem is probably within the PHP code (if you switch to using fastCGI, this should be evident without any major performance penalty).

update

If the static content doesn't vary much across pages, then it might also be worth experimenting with:

if (count($_COOKIE)) {
    header('Connection: close');
}

on the PHP scripts too.

symcbean
  • Among a variety of good answers I am marking this as the accepted one because you clearly stated that this was a CPU-bound problem (largely due to the poor application we are running) and that certainly was the case. I redeployed everything on 2xlarge EC2 instances (up from large) and most of the problems went away, although many of the other performance characteristics are still there. We only have the single app running on these servers, and it is just ugly. – futureal Feb 21 '11 at 21:40
4

You should consider installing an asynchronous reverse proxy, because the number of processes in the W state is quite high too. Your Apache processes seem to spend a lot of time sending content to slow clients over the network, blocked on that. Nginx or lighttpd as a frontend to your Apache server can reduce the number of processes in the W state dramatically. And yes, you should limit the number of keepalive requests; it is probably worth trying to turn keepalive off entirely.
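
A minimal sketch of such a frontend, assuming Apache is moved to listen on 127.0.0.1:8080 and nginx takes over port 80 (ports and addresses here are only examples):

server {
    listen 80;

    location / {
        # hand the request off to the backend Apache; nginx buffers the
        # response and feeds the slow client itself, freeing the Apache worker
        proxy_pass http://127.0.0.1:8080;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
    }
}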

BTW, 107 Apache processes is far too many for 22 requests/sec; I was able to serve 100-120 requests/sec using only 5 Apache processes. The next step is probably to profile your application.

Alex
  • Yea, definitely agreed that the application is a large part of the problem. It was outsourced and has since been subject to a lot of patches and whatnot that have just made it worse, and a redesign effort is ongoing. I did tonight try turning off KeepAlive to no real effect, and my next step is to try that reverse proxy, probably with nginx based on all I've since read. – futureal Feb 04 '11 at 04:30
  • To follow up, I have begun experimenting with the reverse proxy and will probably deploy it in production in the near future. Thank you (and the others who suggested it) for the idea, it is not something I'd ever tinkered with before but I think it will make an impact until we can do a full-fledged redesign. – futureal Feb 21 '11 at 21:41
1

You have two rows in your vmstat output that show your CPU wait time is fairly high, and around those you do a fair number of writes (io - bo) and a lot of context switching. I would look at what's writing blocks and how to eliminate that wait; I think the most improvement could be found in your disk I/O. Check syslog and set it to write asynchronously. Make sure your controller's write cache is working (check it -- you might have a bad battery).
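
On Debian's stock syslog, asynchronous writes are requested by prefixing the log file path with a minus sign in /etc/syslog.conf; the facility and path below are only illustrative:

mail.*    -/var/log/mail.log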

Keepalive isn't causing your performance problem; it saves you time on connection setup if you're not running a cache in front. You might bump MaxSpareServers a bit so that in a crunch you're not waiting on forks.

beans
  • I am not familiar enough with syslog to know how to set it for asynchronous writes under Apache, although I will certainly search and seek that out. I did make some changes tonight related to KeepAlive and MaxSpareServers to no real effect, I do agree about leaving more spares up, I had missed that. One (poor) quality of our application is that it writes heavily to user session files (yes, files) which is where I am beginning to think we are suffering. I have the option of moving session management to the database, which I am likely to try next. – futureal Feb 04 '11 at 04:34
  • Yes, I would agree that your session writes are the source of the problem. You can lose the session disk writes if you're using PHP sessions -- install memcache, and set PHP's session.save_handler to memcache and session.save_path to tcp://127.0.0.1:11211 (or wherever you set up memcache). Apache's logging is async by default, but sometimes web apps use syslog, or syslog can be chatty and end up doing a sync for every line. It doesn't sound like that would be the problem in your case, after all. You can prefix file entry lines with '-' in syslog.conf to omit syncing. – beans Feb 04 '11 at 14:40
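
A minimal sketch of the php.ini changes beans describes, assuming the pecl memcache extension is installed and memcached is listening on its default port:

session.save_handler = memcache
session.save_path = "tcp://127.0.0.1:11211"
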
0

You should consider turning keepalive off as a first try.

With 107 requests being processed, I would keep MaxSpareServers higher than what you have set.

IMHO, in the long term nginx as a reverse proxy for static content should be taken into consideration.

evcz
0

First suggestion: disable keepalives. I've only ever needed it when I could identify a specific situation where it improved performance; in general, requests/sec decreased with KeepAlive enabled.

Second suggestion: Set a MaxRequestsPerChild. I echo symcbean here; it will help with process rollover in the case of a memory leak. 500 is a good starting point.

Third suggestion: Increase MaxClients. A ballpark calculation for this is (physical memory - memory used by non-httpd processes) / size of each httpd process. Depending on how httpd was compiled, this number maxes out at 255. I use 250 for my public servers to deal with Google/Yahoo/MS crawling the systems.
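
As a rough worked example using the numbers from the question's top output (ballpark only, since resident sizes include shared memory): ~7.8 GB of RAM, roughly 1 GB set aside for the OS and other processes, and apache2 processes at ~35 MB resident each gives

(7800 MB - 1000 MB) / 35 MB per process ≈ 194

so there is some headroom above the current MaxClients of 150, but not enough to go all the way to 250 without freeing memory elsewhere.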

Fourth suggestion: Increase MaxSpareServers to something like 4-5x MinSpareServers.
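
With the MinSpareServers 5 from the question, that rule of thumb works out to something like:

MinSpareServers       5
MaxSpareServers      25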

If those suggestions don't do it, I would look at load balancing with a reverse proxy, or memcache for the DB.

Paul S