
I have an Ubuntu 10.10 server with plenty of RAM, bandwidth and CPU. I'm seeing a strange, repeatable pattern in the distribution of latencies when serving static files from both Apache and nginx. Because the problem is common to both HTTP servers, I'm wondering if I have misconfigured or poorly tuned Ubuntu's networking or cache parameters.

ab -n 1000 -c 4 http://apache-host/static-file.jpg:

Percentage of the requests served within a certain time (ms)
  50%      5
  66%   3007
  75%   3009
  80%   3011
  90%   9021
  95%   9032
  98%  21068
  99%  45105
 100%  45105 (longest request)

ab -n 1000 -c 4 http://nginx-host/static-file.jpg:

Percentage of the requests served within a certain time (ms)
  50%     19
  66%     19
  75%   3011
  80%   3017
  90%   9021
  95%  12026
  98%  12028
  99%  18063
 100%  18063 (longest request)

The results consistently follow this kind of pattern: 50% or more of requests are served as expected, while the remainder fall into discrete bands, with the slowest a few orders of magnitude slower.

Apache is 2.x with mod_php installed; nginx is 1.0.x with Passenger installed (but neither app server should be in the critical path for a static file). Load average was around 1 when each test was run (the server has 12 physical cores), with 5 GB free RAM and 7 GB cached swap. Tests were run from localhost.

Here are the configuration changes I have made from Ubuntu server 10.10 defaults:

/etc/sysctl.conf:
    net.core.rmem_default = 65536
    net.core.wmem_default = 65536
    net.core.rmem_max = 16777216
    net.core.wmem_max = 16777216
    net.ipv4.tcp_rmem = 4096 87380 16777216
    net.ipv4.tcp_wmem = 4096 65536 16777216
    net.ipv4.tcp_mem = 16777216 16777216 16777216
    net.ipv4.tcp_window_scaling = 1
    net.ipv4.route.flush = 1
    net.ipv4.tcp_no_metrics_save = 1
    net.ipv4.tcp_moderate_rcvbuf = 1
    net.core.somaxconn = 8192 

/etc/security/limits.conf:
    * hard nofile 65535
    * soft nofile 65535
    root hard nofile 65535
    root soft nofile 65535

other config:
    ifconfig eth0 txqueuelen 1000
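
For completeness, here is roughly how I've been double-checking that these values are actually in effect (assuming `eth0` is the interface under test, and a fresh login shell for the `nofile` check):

    # confirm the sysctl values took effect
    sysctl net.core.somaxconn net.ipv4.tcp_rmem net.ipv4.tcp_wmem

    # open-file limit as seen by a fresh login shell
    ulimit -n

    # current transmit queue length on eth0
    cat /sys/class/net/eth0/tx_queue_len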

Please let me know if this kind of problem rings any bells, or if more information about the config would be helpful. Thanks for your time.

Update: Here's what I'm seeing after increasing net.netfilter.nf_conntrack_max as suggested below:

Percentage of the requests served within a certain time (ms)
  50%      2
  66%      2
  75%      2
  80%      2
  90%      3
  95%      3
  98%      3
  99%      3
 100%      5 (longest request)
slowernet
  • Did you check the error logs on both apache and nginx, as well as `dmesg`? – Kyle Brandt Oct 25 '11 at 20:04
  • Also, consider that you may be hitting a limit on the `ab` side of things... – Kyle Brandt Oct 25 '11 at 20:08
  • Also also, when you say localhost, are you hitting `http://localhost`? Could there be a DNS bottleneck? How big is this static file? Even your 50% doesn't add up to me; it should be 0-1 ms for localhost. – Kyle Brandt Oct 25 '11 at 20:12
  • Wow. `dmesg` told the tale: `nf_conntrack: table full, dropping packet.` Did `sudo sysctl -w net.netfilter.nf_conntrack_max=131072` and the problem is gone: 100% of requests in 6ms. Thank you, @KyleBrandt! – slowernet Oct 25 '11 at 21:14

1 Answer


Going off your comment that it was the nf_conntrack table-full problem, you can either increase the size of the conntrack table:

sysctl -w net.netfilter.nf_conntrack_max=131072
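
A quick way to see how close you are to that limit, and to make the new value stick across reboots (assuming the conntrack module is loaded so these entries exist), is something like:

    # current number of tracked connections vs. the ceiling
    cat /proc/sys/net/netfilter/nf_conntrack_count
    cat /proc/sys/net/netfilter/nf_conntrack_max

    # persist the new ceiling across reboots
    echo 'net.netfilter.nf_conntrack_max = 131072' >> /etc/sysctl.conf
    sysctl -p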

Or if you are already behind a firewall you can just exempt HTTP traffic from connection tracking:

# iptables -L -t raw
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination         
NOTRACK    tcp  --  anywhere             anywhere            tcp dpt:www 
NOTRACK    tcp  --  anywhere             anywhere            tcp spt:www 

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination         
NOTRACK    tcp  --  anywhere             anywhere            tcp spt:www 
NOTRACK    tcp  --  anywhere             anywhere            tcp dpt:www
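
For reference, rules along those lines can be added with something like the following (a sketch assuming HTTP is on port 80; adjust the port to match your setup):

    # skip connection tracking for inbound and outbound HTTP
    iptables -t raw -A PREROUTING -p tcp --dport 80 -j NOTRACK
    iptables -t raw -A PREROUTING -p tcp --sport 80 -j NOTRACK
    iptables -t raw -A OUTPUT -p tcp --sport 80 -j NOTRACK
    iptables -t raw -A OUTPUT -p tcp --dport 80 -j NOTRACK
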
Kyle Brandt
  • @EliotShepard: One other thing when benchmarking is to watch how many sockets are sitting open with `netstat`. You were probably starting a new test before the old ones cleared up. If you use a load balancer, you will pay 2x in terms of connections as well. You might want to look into some `proc` socket recycle / reuse options as well if you have a lot of traffic in production. – Kyle Brandt Oct 26 '11 at 14:21