
(edit: partially understood and worked around, see comment)

I have a setup with nginx acting as a reverse proxy in front of a CherryPy app server. I'm using ab to compare performance going through nginx vs. not, and noticing that the former case has much worse worst-case performance:

$ ab -n 200 -c 10 'http://localhost/noop'
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests


Server Software:        nginx
Server Hostname:        localhost
Server Port:            80

Document Path:          /noop
Document Length:        0 bytes

Concurrency Level:      10
Time taken for tests:   3.145 seconds
Complete requests:      200
Failed requests:        0
Write errors:           0
Total transferred:      29600 bytes
HTML transferred:       0 bytes
Requests per second:    63.60 [#/sec] (mean)
Time per request:       157.243 [ms] (mean)
Time per request:       15.724 [ms] (mean, across all concurrent requests)
Transfer rate:          9.19 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   0.1      0       1
Processing:     5   48 211.7     31    3007
Waiting:        5   48 211.7     31    3007
Total:          5   48 211.7     31    3007

Percentage of the requests served within a certain time (ms)
  50%     31
  66%     36
  75%     39
  80%     41
  90%     46
  95%     51
  98%     77
  99%    252
 100%   3007 (longest request)
$ ab -n 200 -c 10 'http://localhost:8080/noop'
This is ApacheBench, Version 2.3 <$Revision: 655654 $>
Copyright 1996 Adam Twiss, Zeus Technology Ltd, http://www.zeustech.net/
Licensed to The Apache Software Foundation, http://www.apache.org/

Benchmarking localhost (be patient)
Completed 100 requests
Completed 200 requests
Finished 200 requests


Server Software:        CherryPy/3.2.0
Server Hostname:        localhost
Server Port:            8080

Document Path:          /noop
Document Length:        0 bytes

Concurrency Level:      10
Time taken for tests:   0.564 seconds
Complete requests:      200
Failed requests:        0
Write errors:           0
Total transferred:      27600 bytes
HTML transferred:       0 bytes
Requests per second:    354.58 [#/sec] (mean)
Time per request:       28.202 [ms] (mean)
Time per request:       2.820 [ms] (mean, across all concurrent requests)
Transfer rate:          47.79 [Kbytes/sec] received

Connection Times (ms)
              min  mean[+/-sd] median   max
Connect:        0    0   1.7      0      11
Processing:     6   26  23.5     24     248
Waiting:        3   25  23.6     23     248
Total:          6   26  23.4     24     248

Percentage of the requests served within a certain time (ms)
  50%     24
  66%     27
  75%     29
  80%     31
  90%     34
  95%     40
  98%     51
  99%    234
 100%    248 (longest request)

What could be causing this? The only thing I can think of is that nginx is sending requests to the backend in a different order than they arrived, but that seems implausible.

The machine is an EC2 c1.medium instance with 2 cores, CherryPy is using a thread pool with 10 threads, and nginx has worker_connections = 1024.
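The nginx side is essentially a plain proxy to the CherryPy port, along these lines (a simplified sketch rather than the exact config; the server name and paths are illustrative):

    # nginx.conf (simplified sketch)
    worker_processes  2;

    events {
        worker_connections  1024;
    }

    http {
        server {
            listen 80;
            server_name localhost;

            location / {
                # forward everything to the CherryPy server on port 8080
                proxy_pass http://127.0.0.1:8080;
            }
        }
    }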

UPDATE: Two more confusing findings:

  • At a given concurrency, sending more requests improves performance. With a concurrency of 40 and 40 requests, I get a median time of 3s and max 10.5s; with a concurrency of 40 and 200 requests, I get a median of 38ms (!) and max 7.5s. In fact, the total time is less for 200 requests! (6.5s vs. 7.5s for 40). This is all repeatable.
  • Monitoring both of the nginx worker processes with strace greatly improves their performance, taking e.g. the median time from 3s to 77ms, without otherwise noticeably changing their behavior. (I tested with a nontrivial API call and confirmed that strace doesn't change the response, and that all of these performance observations still hold.) This is also repeatable.
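For concreteness, "monitoring with strace" means attaching to the two worker PIDs, roughly like this (the PIDs and output file are examples):

    # attach strace to both nginx worker processes (example PIDs)
    sudo strace -tt -f -p 1234 -p 1235 -o /tmp/nginx-workers.strace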
npt
  • Is it repeatable with [`wrk`](https://github.com/wg/wrk/) or `httperf`? `ab` is slow and buggy. – VBart Jul 27 '12 at 01:52
  • Thanks for the recommendations. This appears to result from TCP-level problems; tshark on lo shows retransmissions of GET requests, increasing the request queue size (`listen(2)` parameter) in CherryPy changes the behavior, and using a UNIX domain socket between nginx and CherryPy eliminates the problem (I've gone with this solution). – npt Jul 28 '12 at 03:45
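A sketch of that final setup on the CherryPy side; `server.socket_file` and `server.socket_queue_size` are the config keys I understand to control the socket path and the listen(2) backlog in CherryPy 3.2, and the values here are arbitrary examples:

    import cherrypy

    class Root(object):
        @cherrypy.expose
        def noop(self):
            return ''

    cherrypy.config.update({
        # listen on a unix domain socket instead of 127.0.0.1:8080
        'server.socket_file': '/tmp/cherrypy.sock',   # example path
        # larger listen(2) backlog so bursts of new connections aren't dropped
        'server.socket_queue_size': 128,              # example value
        'server.thread_pool': 10,
    })

    cherrypy.quickstart(Root())

On the nginx side, proxy_pass can then point at the socket via an upstream block containing `server unix:/tmp/cherrypy.sock;`.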

2 Answers


The 3 seconds worst case in your first ab run looks like a packet loss. It's probably the result of insufficient buffers/resources somewhere. Some possible causes, in no particular order (quick checks for each are sketched after the list):

  • Too small a listen queue on the backend, resulting in occasional listen queue overflows (Linux is usually configured to just drop the SYN packet in this case, making it indistinguishable from packet loss; see netstat -s | grep listen to find out if this is the problem).
  • A stateful firewall on localhost approaching its limit on the number of states and dropping some random SYN packets as a result.
  • The system running out of sockets/local ports due to sockets in the TIME_WAIT state; see this question if you are using Linux.
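Some quick checks for each of these on Linux (the backend port and the conntrack paths are examples; the conntrack files only exist when the module is loaded):

    # 1. listen queue overflows: dropped SYNs show up as overflow counters
    netstat -s | grep -i listen

    # 2. stateful firewall: how full the connection-tracking table is
    cat /proc/sys/net/netfilter/nf_conntrack_count /proc/sys/net/netfilter/nf_conntrack_max

    # 3. local ports tied up in TIME_WAIT towards the backend (port 8080 here)
    netstat -ant | grep ':8080' | grep -c TIME_WAIT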

You will have to examine your OS carefully to find the cause and configure it accordingly. You may also want to follow a network subsystem tuning guide for your OS. Note that EC2 might be a special case here, as there have been reports of very limited network performance on EC2 instances.

From nginx's point of view, any solution would be more or less wrong (as the problem isn't in nginx, but rather in the OS, which can't cope with the load and drops packets). Nevertheless, you may try some tricks to reduce the load on the OS's network subsystem:

  • Configure keepalive connections to the backend.
  • Configure the backend to listen on a unix domain socket (if it supports that), and configure nginx to proxy requests to it.
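A sketch of both tricks on the nginx side (the upstream name and socket path are made up; upstream keepalive needs nginx 1.1.4+ and HTTP/1.1 towards the backend):

    upstream cherrypy_backend {                 # name is arbitrary
        server unix:/tmp/cherrypy.sock;         # or 127.0.0.1:8080
        keepalive 16;                           # idle connections kept open per worker
    }

    server {
        listen 80;

        location / {
            proxy_pass http://cherrypy_backend;
            proxy_http_version 1.1;             # needed for backend keepalive
            proxy_set_header Connection "";     # don't forward "Connection: close"
        }
    }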
Maxim Dounin

By default nginx uses HTTP/1.0 for backend connections and does not keep them alive (see Maxim's answer for enabling backend keepalive), so it opens a fresh backend connection for each request, which adds some latency. You should probably also run more worker processes: twice the number of CPU cores, with a minimum of 5. If you have more than 10 concurrent requests, you might need more threads in CherryPy as well.
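A sketch of the corresponding setting (the number is just what the advice above implies for a 2-core machine; treat it as a starting point):

    # nginx.conf: 2 cores -> 2*2 = 4, bumped to the suggested minimum of 5
    worker_processes  5;

On the CherryPy side, the thread pool size is the `server.thread_pool` config value (10 in the question), which would need to grow along with the expected concurrency.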

Allan Jude