I have two Dell R410 web servers (2x quad-core Xeon E5520 with 8 GB RAM) running Debian 5 stable. Their patching had been neglected for a while, so we recently did a patching run to bring everything up to date, necessitated by a new version of the application they run, which requires PHP 5.3.6. The kernel wasn't updated because it came from the Debian backports repository (the installed version is 2.6.30-bpo.1-amd64).

Since the patching, users have complained that the web site is slow. The majority of requests are served instantly, but now and again it'll get "stuck" on a request. There doesn't seem to be any discernible pattern in the requests that get stuck.

These servers are behind a load balancer. They were updated at the same time, and both started exhibiting this issue at the time of the patching run. They were not rebooted then, but have been since, with no effect.

I set up a script on the servers themselves to run time curl localhost:80/alive in a loop; that URL serves a simple index.html file containing only "OK". Strangely, these requests still get delayed with the same frequency and duration as requests for actual PHP content. The common times are 3s, 9s, 25s and 45s, and some are over 3 minutes. 45 seconds is a common response time, but of course browsers give up well before this, so it's effectively no response.
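A loop of this kind might look like the sketch below. It is not the script actually used: probe and is_slow are hypothetical helper names, and the 1-second threshold is an assumption chosen only to make the stuck requests stand out.

```shell
#!/bin/sh
# Hypothetical helper: print a timestamp and curl's total time (seconds)
# for one request to the given URL.
probe() {
    url=$1
    # -s: silent, -o /dev/null: discard the body, -w: print the total time
    t=$(curl -s -o /dev/null -w '%{time_total}' "$url")
    echo "$(date '+%H:%M:%S') $t"
}

# Hypothetical helper: exit status 0 when a timing in seconds
# exceeds the given limit, non-zero otherwise.
is_slow() {
    awk -v t="$1" -v limit="$2" 'BEGIN { exit !(t + 0 > limit + 0) }'
}

# Usage (runs until interrupted):
#   while :; do
#       line=$(probe http://localhost:80/alive)
#       is_slow "${line#* }" 1 && echo "SLOW $line"
#       sleep 1
#   done
```

Logging a timestamp next to each timing makes it easier to line the slow requests up with other logs afterwards.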

The Apache prefork MPM config is as follows:

<IfModule mpm_prefork_module>
    StartServers        50
    MinSpareServers     10
    MaxSpareServers     150
    ServerLimit         500
    MaxClients          500
    MaxRequestsPerChild   5000
</IfModule>

It seems sensible to me for a server with 8 GB of RAM. In practice the worker count seldom goes over 170, so we're not hitting that limit, and there is plenty of free memory. Load averages are low; they hover around 0.5-1.5.

The kernel is an old backport, so I tried updating it to the latest backport for lenny (2.6.32-bpo.5-amd64), but it panicked on boot and I had to get our host to restart it with the old one. I'd like to explore other options before we try updating their BIOSes and reinstalling them with Debian 6.

Apache seems a likely culprit, so the next step is to update to the latest Apache backport. That said, the patching run only took it from 2.2.9-10+lenny4 to 2.2.9-10+lenny9, a fairly minor bump, so I wasn't expecting any significant changes.

PHP 5.3.6 is installed from dotdeb; the previous version was a custom-compiled 5.3.0. In addition, my boss has just informed me that requests over HTTPS do not get delayed, but I have not confirmed this myself.

# apache2 -V
Server version: Apache/2.2.9 (Debian)
Server built:   Dec 11 2010 21:34:00
Server's Module Magic Number: 20051115:15
Server loaded:  APR 1.2.12, APR-Util 1.2.12
Compiled using: APR 1.2.12, APR-Util 1.2.12
Architecture:   64-bit
Server MPM:     Prefork
  threaded:     no
    forked:     yes (variable process count)
Server compiled with....
 -D APACHE_MPM_DIR="server/mpm/prefork"
 -D APR_HAS_SENDFILE
 -D APR_HAS_MMAP
 -D APR_HAVE_IPV6 (IPv4-mapped addresses enabled)
 -D APR_USE_SYSVSEM_SERIALIZE
 -D APR_USE_PTHREAD_SERIALIZE
 -D SINGLE_LISTEN_UNSERIALIZED_ACCEPT
 -D APR_HAS_OTHER_CHILD
 -D AP_HAVE_RELIABLE_PIPED_LOGS
 -D DYNAMIC_MODULE_LIMIT=128
 -D HTTPD_ROOT=""
 -D SUEXEC_BIN="/usr/lib/apache2/suexec"
 -D DEFAULT_PIDLOG="/var/run/apache2.pid"
 -D DEFAULT_SCOREBOARD="logs/apache_runtime_status"
 -D DEFAULT_LOCKFILE="/var/run/apache2/accept.lock"
 -D DEFAULT_ERRORLOG="logs/error_log"
 -D AP_TYPES_CONFIG_FILE="/etc/apache2/mime.types"
 -D SERVER_CONFIG_FILE="/etc/apache2/apache2.conf"

# apache2ctl -t -D DUMP_MODULES
Loaded Modules:
 core_module (static)
 log_config_module (static)
 logio_module (static)
 mpm_prefork_module (static)
 http_module (static)
 so_module (static)
 alias_module (shared)
 auth_basic_module (shared)
 authn_file_module (shared)
 authz_default_module (shared)
 authz_groupfile_module (shared)
 authz_host_module (shared)
 authz_user_module (shared)
 autoindex_module (shared)
 cgi_module (shared)
 deflate_module (shared)
 dir_module (shared)
 env_module (shared)
 geoip_module (shared)
 mime_module (shared)
 negotiation_module (shared)
 php5_module (shared)
 rewrite_module (shared)
 setenvif_module (shared)
 ssl_module (shared)
 status_module (shared)
Syntax OK

Any assistance greatly appreciated!

Alex Forbes

1 Answer

Been banging my head against the wall on this for a week now, and my boss has fixed it.

Once we looked at Apache's response times in the logs, we saw that it was responding quickly: the delays were happening before the request even reached Apache. So he looked at the TCP stack settings, comparing them to another server running Red Hat 5.6.
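One way to read the delay values from the question is as TCP SYN retransmissions: if the first SYN is dropped, the client waits and retries with a timeout that doubles each time. The sketch below only does that arithmetic. The 3-second initial timeout is an assumption (it was the conventional value on kernels of this vintage), and syn_backoff is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: print the cumulative wait (seconds) after each retry,
# given an initial timeout and a number of retries, doubling the timeout
# after every attempt.
syn_backoff() {
    rto=$1
    retries=$2
    total=0
    out=""
    i=0
    while [ "$i" -lt "$retries" ]; do
        total=$((total + rto))
        out="$out${out:+ }$total"   # append with a separating space
        rto=$((rto * 2))
        i=$((i + 1))
    done
    echo "$out"
}

syn_backoff 3 6   # prints: 3 9 21 45 93 189
```

Under that assumption the series gives 3 s, 9 s, 21 s and 45 s, with 189 s being just over 3 minutes, which is close to the timings observed and consistent with connections being dropped before Apache ever sees them.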

To cut a long story short, enabling TCP SYN cookies (net.ipv4.tcp_syncookies=1 in /etc/sysctl.conf) has fixed the problem. This setting is designed to protect against SYN floods: when the SYN queue is full, the kernel answers with a cookie instead of dropping the SYN, so the client no longer has to wait and retransmit. It's possible we're getting flooded accidentally (or deliberately).
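For reference, a minimal sketch of checking the setting from a shell. syncookies_status is a hypothetical helper, and its optional path argument exists only so it can be tried against a copy of the file; the sysctl commands in the comments are the usual way to apply the change.

```shell
#!/bin/sh
# Hypothetical helper: report whether SYN cookies are on by reading the
# kernel's procfs entry (0 means disabled, any other value means enabled).
syncookies_status() {
    file=${1:-/proc/sys/net/ipv4/tcp_syncookies}
    case $(cat "$file") in
        0) echo "disabled" ;;
        *) echo "enabled" ;;
    esac
}

# To apply the setting immediately, without a reboot (as root):
#   sysctl -w net.ipv4.tcp_syncookies=1
# To reload everything in /etc/sysctl.conf after editing it:
#   sysctl -p
```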

More info is in this link, the symptoms described are exactly what we were seeing: http://baheyeldin.com/technology/linux/detecting-and-preventing-syn-flood-attacks-web-servers-running-linux.html

I was looking at netstat -alnt and the vast majority of connections were in state TIME_WAIT, not SYN_RECV (maybe the -l option doesn't show half-open connections).
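A quick way to see the split between states is to tally the state column. This is a sketch that assumes net-tools style output, where the state is the sixth column of each tcp line; count_states is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: tally TCP connection states from `netstat -ant`-style
# output on stdin. Header lines are skipped because they do not start with
# "tcp"; the result is sorted with the most common state first.
count_states() {
    awk '$1 ~ /^tcp/ { n[$6]++ } END { for (s in n) print n[s], s }' | sort -rn
}

# Usage:
#   netstat -ant | count_states
```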

However we are now seeing this in dmesg frequently:

possible SYN flooding on port 80. Sending cookies.

I shall do some more digging.

Alex Forbes
  • That message means your SYN queue gets repeatedly full. That may be due to legitimate requests or not. If the requests are legitimate, you can increase the size of the queue with tcp_max_syn_backlog. See http://www.linuxinsight.com/proc_sys_net_ipv4_tcp_syncookies.html and http://www.redhat.com/archives/rhl-devel-list/2005-January/msg00447.html – Vinko Vrsalovic Jul 26 '11 at 09:41
  • Thanks! Yes, it could well be legitimate traffic so I've set net.ipv4.tcp_max_syn_backlog = 4096. Still seeing possible SYN flood messages though, one would think quadrupling it would be plenty for legitimate traffic. – Alex Forbes Jul 26 '11 at 10:13
  • I would personally try a big number, like 16K or 64K, to make sure it's not an attack (in an attack virtually every limit would get surpassed). If it's not an attack, I would gradually reduce the queue size until a good small value is found. – Vinko Vrsalovic Jul 26 '11 at 13:03
  • I've raised it to 65536 but am still getting possible SYN flooding on port 80. I've been watching the number of connections in SYN_RECV state (in a terminal running `watch --interval=5 'netstat -tuna |grep "SYN_RECV"|wc -l'`) and it never goes higher than about 240. Yet I have a Red Hat server which hovers around 512 (the limit on that server is the default of 1024). Do you know of any other settings which might impact the maximum size of the backlog? – Alex Forbes Jul 26 '11 at 13:26
  • Not sure, you should probably now open a followup question on how to tune / debug SYN queue issues – Vinko Vrsalovic Jul 26 '11 at 13:38
  • I love you!! I really love you! – Tony Aug 20 '14 at 20:32