My setup: I have 3 nearly identical webserver machines serving the same heavily loaded dynamic website, with simple load balancing over DNS. The service has been running for over two years with the same Apache config: apache2, php5, Ubuntu 8.04, Linux 2.6.24-29-server.

My problem: For about two weeks I have been experiencing problems with this setup. Nearly every day there is one short period of about 5 minutes in which the website is unreachable. I'm still able to log in to the servers over SSH. If I run htop, I see the machine simply doing nothing: there are about 1000 Apache processes running, but no CPU activity.

I've used Apache's mod_status to debug this situation. The process scoreboard looks like this:

_C.___K_______________________R._______.__K_K____K___C_______.__
_______C__________.___________________________________.________C
_.____K__________K___K_WK_____._K_____________________________._
W______K__________K________.____________________._______C_______
_C_.__K__K____.._.._____________________________________C_______
_R___________K___.______C________.C_________.______._____C______
____________KKC____K_____K__WC_________________C_____.__.____.__
_____________________C_________K______.____C______._____________
_.___C____.___.___________________________.K______.____K________
W__.___________________C.__.____K________K_______R_._.__._______
__C__C_.__________C__C_______._____W______________C_.___C_______
____.______C_____________C________.____C____________.________._K
__.__________.K_____________K_________._____C____.K__________KW_
__K.W________R_________._______.___W___________.____.__K_____W__
W___.___..________W____K

Scoreboard Key:
"_" Waiting for Connection, "S" Starting up, "R" Reading Request,
"W" Sending Reply, "K" Keepalive (read), "D" DNS Lookup,
"C" Closing connection, "L" Logging, "G" Gracefully finishing,
"I" Idle cleanup of worker, "." Open slot with no current process

So most of the processes are just waiting for a connection. After about 5 minutes the situation returns to normal: there are far fewer processes on every machine, most workers have the "." status (meaning they are open slots ready to process a request) and, of course, the website is reachable again!

So I'm trying to find something in the logs, but there is simply nothing: the Apache access log is silent for about 4 minutes, and so is the error log. I also cannot find anything suspicious in the other system logs.

The situation is the same on all 3 webservers (all of them show this load peak and unresponsive condition at the same time), so I don't think it is hardware related. I suspect it might be related to some network (TCP) issue.

Any ideas?

EDIT: some more information that I have just discovered:

It has just happened again and I was able to verify that I'm also not able to connect locally when this problem occurs.

I gathered some connection statistics with the following command right after it happened: netstat -an | awk '/tcp/ {print $6}' | sort | uniq -c

    109 CLOSE_WAIT
   2652 ESTABLISHED
      2 FIN_WAIT1
     11 LAST_ACK
     12 LISTEN
     91 SYN_RECV
      1 SYN_SENT
     16 TIME_WAIT

If I execute the same command some time later, I have something like this:

      4 CLOSING
    108 ESTABLISHED
     18 FIN_WAIT1
    182 FIN_WAIT2
     37 LAST_ACK
     12 LISTEN
     50 SYN_RECV
  11276 TIME_WAIT

So in the normal situation I have only 100-200 connections from clients being handled by Apache at any given moment. During this "crash" I have a lot more connections. What is the best way to analyse this?
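
To catch the next occurrence, I plan to record these counts over time with a small loop built around the same command (a rough sketch; the 10-second interval and the /var/tmp path are arbitrary choices):

#!/bin/sh
# Append a timestamped snapshot of the TCP connection states every 10 seconds,
# so the next stall can be lined up against the Apache access log afterwards.
while true; do
    {
        date
        netstat -an | awk '/tcp/ {print $6}' | sort | uniq -c
        echo
    } >> /var/tmp/tcp-states.log
    sleep 10
done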

EDIT2: the important lines in apache2.conf are:

KeepAlive On
MaxKeepAliveRequests 20
KeepAliveTimeout 1
<IfModule mpm_prefork_module>
ServerLimit           920
StartServers          30
MinSpareServers       80
MaxSpareServers      120
MaxClients          920
MaxRequestsPerChild   700
</IfModule>

It is an apache2 prefork setup with mod_php.

The server has 8 GB of RAM and a 4 GB swap partition.
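
As a rough sanity check of whether MaxClients 920 can even fit into 8 GB, I am estimating the resident size of the workers like this (only a sketch: the process name apache2 is assumed, and averaging RSS over-counts memory that the workers share):

# Average resident set size (KB) of the Apache processes, and a rough
# upper bound if all 920 configured workers were busy at the same time.
ps -C apache2 -o rss= | awk '{ sum += $1; n++ } END { printf "avg %.0f KB, est. total for 920 workers: %.1f GB\n", sum/n, sum/n*920/1024/1024 }'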

  • Does the web site show the same symptoms when you run a wget or curl from the local host or between servers (if they're on the same network)? – Alex Forbes Jan 31 '12 at 14:16
  • Maybe a traffic dump (`tcpdump`) will help you get to the root of the problem... btw what's your memory usage and firewall policies? – drcelus Jan 31 '12 at 14:34
  • @al4 The last time this happened I was able to connect to the server-status page from the local host, while I was unable to connect to the webpage from outside. I'm not quite sure, as it could also have been a random thing while some of the workers became available. I will test this more the next time the problem occurs. What would be your suggestion if I could confirm a difference between outside and local connections? – Jeff Jan 31 '12 at 14:57
  • If you can confirm that it works locally but not from outside, it strengthens the case for the network being the issue - meaning you should test with tcpdump and Wireshark at both ends to see what's getting through, rather than strace'ing the Apache processes. I'd also test from a host on the same LAN if possible. And check dmesg to see if there are any messages that could be related, but it sounds like you've already done that. – Alex Forbes Jan 31 '12 at 16:30
  • It has just happened again, and I was able to verify that I'm also not able to connect locally when this problem occurs. I have also made some connection statistics with netstat: see question text. – Jeff Feb 01 '12 at 14:16
  • Did you resolve this issue? – pauska Jul 26 '13 at 06:30
  • I am experiencing a similar issue as well, any ideas? – GomoX Feb 10 '14 at 21:05
  • Show your apache MPM settings, and keepalive settings. It is probably a bad combination of these. EDIT: I just saw you mentioned php. If this is mod_php you're using, that machine better have 64GB of memory, or you will never sustain 2500 connections. – adaptr Feb 01 '12 at 14:17
  • I have added the Apache config to the question. I don't know how it could affect the accepting of connections, but the MaxClients setting is really set too high. I have checked this: my Apache workers usually take 15-30 MB of RAM. Do you think the higher setting could have such an effect? – Jeff Feb 01 '12 at 16:55
  • I have reduced ServerLimit and MaxClients to 500. Since then the problem has occurred twice more: for about 4 minutes I have no activity in my Apache log on all three servers and the website is not reachable! So the ServerLimit was not the point... – Jeff Feb 03 '12 at 13:49

3 Answers

You should enable the ExtendedStatus option of mod_status (http://httpd.apache.org/docs/2.2/mod/mod_status.html#extendedstatus) in order to monitor the hosts and requests currently being processed. I suspect there is a script or page which takes too long to release its connection, causing connections to stack up.
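
On a Debian/Ubuntu layout that could look roughly like this (the conf.d file name is only an example, and /server-status has to be reachable from the host you query):

# Enable per-request details in mod_status (Apache 2.2, global context).
echo "ExtendedStatus On" | sudo tee /etc/apache2/conf.d/extended-status.conf
sudo apache2ctl configtest && sudo apache2ctl graceful

# During the next stall, snapshot the full scoreboard including client, vhost and URL:
curl -s http://localhost/server-status -o /tmp/server-status-$(date +%s).html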

berthni

First: check your max open files limit on the process. An active socket connection counts as an open file. cat /proc/###/limits is a good way to check the effective value for a running process. You can get a list of open files with lsof -p ###, where ### is your web server's process ID. You can compare lsof -p ### | wc -l against that limit to see how close you are getting. You should also see messages in Apache's error log if you are hitting the limit.
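
Put together, a quick check might look like this (run it as root; apache2 as the process name is an assumption, since on Ubuntu the binary is apache2 rather than httpd):

# For each running Apache process, compare the open-file count
# against its effective soft limit from /proc/<pid>/limits.
for pid in $(pgrep apache2); do
    limit=$(awk '/Max open files/ {print $4}' /proc/$pid/limits)
    open=$(ls /proc/$pid/fd 2>/dev/null | wc -l)
    echo "pid $pid: $open open files (soft limit $limit)"
done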

You need a file handle for each socket connection, and also for each cgi script or data file reference. For 920 MaxClients, you should configure at least 4,000 files for the httpd process. You can increase the number of files by adding a file in /etc/security/limits.d/ with the following contents. Ensure the user name matches what you are using for your web server.

apache soft nofile 10000
apache hard nofile 10000

Second: if port exhaustion is your problem, you can adjust some of the IP settings in /etc/sysctl.conf (starting with net.ipv4.tcp_fin_timeout). This is usually a problem only with lots of very small connections. A large number of TIME_WAIT sockets is one indicator, but it points to port exhaustion only when accompanied by kernel messages in syslog about "possible SYN flooding" and "Sending cookies". You should also ensure your server is behind a firewall that can thwart malicious SYN attacks.
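
A quick way to check for those symptoms and to see the current values (the commented-out tuning lines are illustrative, not recommendations):

# Look for the kernel's SYN-flood warning and inspect the relevant TCP settings.
dmesg | grep -i "syn flood"
sysctl net.ipv4.ip_local_port_range net.ipv4.tcp_fin_timeout net.ipv4.tcp_max_syn_backlog

# Example entries for /etc/sysctl.conf, applied with "sysctl -p":
# net.ipv4.tcp_fin_timeout = 30
# net.ipv4.tcp_max_syn_backlog = 2048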

Also, bear in mind that in the prefork MPM, each process will have PHP in its memory space (what's its memory limit setting?). You may like to try changing to the worker MPM, which may require a slightly different PHP module.
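
A quick way to confirm which MPM and PHP module are actually in use (these commands assume the Debian/Ubuntu apache2 wrapper scripts):

apache2 -V | grep -i mpm        # shows the compiled-in MPM (prefork/worker)
apache2ctl -M | grep -i php     # shows whether mod_php is loaded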

It is also worth remembering to trim your Apache config of extraneous modules.
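
On Debian/Ubuntu that would be along these lines (the module named here is only an example; keep whatever your site actually uses):

apache2ctl -M                  # list the modules currently loaded
sudo a2dismod autoindex        # disable a module you do not need
sudo apache2ctl graceful       # reload with the trimmed module set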

In my experience, such incidents are triggered by something like a search engine crawler, an ARP conflict, or a change in traffic levels in some related part of the network.

You may well find 'sar' useful... not the most friendly, but certainly useful.

Possibly it is also I/O related. sar can tell you (if you configure it to record disk activity) what the average I/O wait time is. You can also look at the I/O wait time in top (which is a percentage; read up on what it actually means). This can be significant if you are using a SAN or a virtual environment.
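
For example (sar comes from the sysstat package; on Debian/Ubuntu, data collection has to be enabled first, e.g. in /etc/default/sysstat):

sar -u 5 12    # CPU utilisation, including %iowait, 12 samples 5 seconds apart
sar -d 5 12    # per-device I/O statistics
sar -q 5 12    # run-queue length and load averages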

Cameron Kerr