
I am running an nginx server that acts as a proxy to an upstream unix socket, like this:

upstream app_server {
        server unix:/tmp/app.sock fail_timeout=0;
}

server {
        listen ###.###.###.###;
        server_name whatever.server;
        root /web/root;

        try_files $uri @app;
        location @app {
                proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
                proxy_set_header X-Forwarded-Proto $scheme;
                proxy_set_header Host $http_host;
                proxy_redirect off;
                proxy_pass http://app_server;
        }
}

Some app server processes, in turn, pull requests off /tmp/app.sock as they become available. The particular app server in use here is Unicorn, but I don't think that's relevant to this question.

The issue is, it just seems that past a certain amount of load, nginx can't get requests through the socket at a fast enough rate. It doesn't matter how many app server processes I set up.

I'm getting a flood of these messages in the nginx error log:

connect() to unix:/tmp/app.sock failed (11: Resource temporarily unavailable) while connecting to upstream

Many requests result in status code 502, and those that don't take a long time to complete. The nginx write queue stat hovers around 1000.
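For reference, a rough way to watch the failure rate and the per-connection queues on the socket while this is happening (assuming the default error log location and the socket path from the config above):

grep -c "Resource temporarily unavailable" /var/log/nginx/error.log   # count failed connects
ss -x | grep /tmp/app.sock    # Recv-Q / Send-Q for each connection on the socket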

Anyway, I feel like I'm missing something obvious here, because this particular configuration of nginx and app server is pretty common, especially with Unicorn (it's the recommended method, in fact). Are there any Linux kernel options that need to be set, or something in nginx? Any ideas about how to increase the throughput to the upstream socket? Something that I'm clearly doing wrong?

Additional information on the environment:

$ uname -a
Linux servername 2.6.35-32-server #67-Ubuntu SMP Mon Mar 5 21:13:25 UTC 2012 x86_64 GNU/Linux

$ ruby -v
ruby 1.9.3p194 (2012-04-20 revision 35410) [x86_64-linux]

$ unicorn -v
unicorn v4.3.1

$ nginx -V
nginx version: nginx/1.2.1
built by gcc 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5)
TLS SNI support enabled

Current kernel tweaks:

net.core.rmem_default = 65536
net.core.wmem_default = 65536
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
net.ipv4.tcp_mem = 16777216 16777216 16777216
net.ipv4.tcp_window_scaling = 1
net.ipv4.route.flush = 1
net.ipv4.tcp_no_metrics_save = 1
net.ipv4.tcp_moderate_rcvbuf = 1
net.core.somaxconn = 8192
net.netfilter.nf_conntrack_max = 524288

Ulimit settings for the nginx user:

core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 20
file size               (blocks, -f) unlimited
pending signals                 (-i) 16382
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 65535
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) unlimited
max user processes              (-u) unlimited
virtual memory          (kbytes, -v) unlimited
file locks                      (-x) unlimited
Ben Lee

5 Answers


It sounds like the bottleneck is the app powering the socket rather than Nginx itself. We see this a lot with PHP when it's used over a socket versus a TCP/IP connection; in our case, though, PHP bottlenecks much earlier than Nginx ever would.

Have you checked the connection-tracking and socket-backlog limits in sysctl.conf? In particular (quick checks are sketched after this list):

  • net.core.somaxconn
  • net.core.netdev_max_backlog
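A minimal way to inspect and, if needed, raise these, assuming you persist them via /etc/sysctl.conf (the values below are placeholders, not recommendations):

sysctl net.core.somaxconn net.core.netdev_max_backlog   # show current values
sudo sysctl -w net.core.somaxconn=8192                  # example value only
sudo sysctl -w net.core.netdev_max_backlog=4096         # example value only
# add the same key=value lines to /etc/sysctl.conf and run `sudo sysctl -p` to persist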
Ben Lessani
    I figured out the problem. See the answer I posted. It actually *was* the app bottlenecking, not the socket, just as you posit. I had ruled this out earlier due to a mis-diagnosis, but turns out the problem was throughput to another server. Figured this out just a couple hours ago. I'm going to award you the bounty, since you pretty much nailed the source of the problem even despite the mis-diagnosis I put in the question; however, going to give the checkmark to my answer, because my answer describes the exact circumstances so might help someone in the future with a similar issue. – Ben Lee Jun 20 '12 at 20:23
  • Got a new server moved to a location to provide adequate throughput, completely rebuilt the system, and still have the same problem. So it turns out my problem is unresolved after all... =( I still think it's app-specific, but can't think of anything. This new server is set up exactly like another server where it's working fine. Yes, somaxconn and netdev_max_backlog are set up correctly. – Ben Lee Jun 28 '12 at 20:05
  • Your issue isn't nginx, it is more than capable - but that's not to say you might not have a rogue setting. Sockets are particularly sensitive under high load when the limits aren't configured correctly. Can you try your app with tcp/ip instead? – Ben Lessani Jun 28 '12 at 20:09
  • Same problem, with even worse magnitude, using tcp/ip (the write queue climbs even faster). I have nginx / unicorn / kernel all set up exactly the same (as far as I can tell) on a different machine, and that other machine is not exhibiting this problem. (I can switch DNS between the two machines to get live load testing, and have DNS on a 60-sec TTL.) – Ben Lee Jun 28 '12 at 20:30
  • Throughput between each machine and a db machine is the same now, and latency between the new machine and the db machine is about 30% more than between the old machine and the db machine. But 30% more than a tenth of a millisecond is not the problem. – Ben Lee Jun 28 '12 at 20:37
  • You've not accidentally capped yourself with an unaccounted-for ulimit setting? – Ben Lessani Jun 28 '12 at 20:45
  • Nope, ulimit settings on both machines are the same (specifically open files is 65535, everything else looks fine too). – Ben Lee Jun 28 '12 at 20:56
  • I added my ulimit settings to the end of the question. – Ben Lee Jun 28 '12 at 20:59
  • @BenLee Did you figure this one out? I may be facing similar problem. – tarkeshwar Mar 04 '13 at 11:54
  • @tarkeshwar, no, never figured it out. Eventually ended up going with different hardware and a somewhat different server stack instead of solving the problem. – Ben Lee Mar 04 '13 at 16:19

tl;dr

  1. Make sure the Unicorn backlog is large (and use a socket, it's faster than TCP): listen("/var/www/unicorn.sock", backlog: 1024)
  2. Optimise NGINX performance settings, for example worker_connections 10000;

Discussion

We had the same problem: a Rails app served by Unicorn behind an NGINX reverse proxy.

We were getting lines like these in the NGINX error log:

2019/01/29 15:54:37 [error] 3999#3999: *846 connect() to unix:/../unicorn.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: xx.xx.xx.xx, request: "GET / HTTP/1.1"

Reading the other answers, we also figured that maybe Unicorn was to blame, so we increased its backlog, but this did not resolve the problem. Monitoring server processes made it obvious that Unicorn was not getting any requests to work on, so NGINX appeared to be the bottleneck.

Searching for NGINX settings to tweak in nginx.conf, this performance tuning article pointed out several settings that could impact how many parallel requests NGINX can process, especially:

user www-data;
worker_processes auto;
pid /run/nginx.pid;
worker_rlimit_nofile 400000; # important

events {    
  worker_connections 10000; # important
  use epoll; # important
  multi_accept on; # important
}

http {
  sendfile on;
  tcp_nopush on;
  tcp_nodelay on;
  keepalive_timeout 65;
  types_hash_max_size 2048;
  keepalive_requests 100000; # important
  server_names_hash_bucket_size 256;
  include /etc/nginx/mime.types;
  default_type application/octet-stream;
  ssl_protocols TLSv1 TLSv1.1 TLSv1.2;
  ssl_prefer_server_ciphers on;
  access_log /var/log/nginx/access.log;
  error_log /var/log/nginx/error.log;
  gzip on;
  gzip_disable "msie6";
  include /etc/nginx/conf.d/*.conf;
  include /etc/nginx/sites-enabled/*;
}
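Once edited, a rough way to apply the changes and confirm that the raised file-descriptor limit actually reached the worker processes (standard paths and process names assumed, so treat this as a sketch):

sudo nginx -t          # validate the configuration first
sudo nginx -s reload   # reload workers without dropping connections
# check that worker_rlimit_nofile took effect for one worker:
grep 'Max open files' /proc/$(pgrep -f 'nginx: worker' | head -n1)/limits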
Epigene

You might try looking at unix_dgram_qlen (see the proc docs), although this may compound the problem by letting even more pile up in the queue. You'll have to look (netstat -x ...).
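For what it's worth, a minimal way to check that value and watch the per-socket queues, using the socket path from the question (note that this sysctl governs datagram sockets, so it may not apply to nginx's stream socket at all):

sysctl net.unix.max_dgram_qlen    # same value as /proc/sys/net/unix/max_dgram_qlen
netstat -x | grep app.sock        # or: ss -x | grep app.sock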

jmw

I solved it by increasing the backlog number in config/unicorn.rb. I used to have a backlog of 64:

 listen "/path/tmp/sockets/manager_rails.sock", backlog: 64

and I was getting this error:

 2014/11/11 15:24:09 [error] 12113#0: *400 connect() to unix:/path/tmp/sockets/manager_rails.sock failed (11: Resource temporarily unavailable) while connecting to upstream, client: 192.168.101.39, server: , request: "GET /welcome HTTP/1.0", upstream: "http://unix:/path/tmp/sockets/manager_rails.sock:/welcome", host: "192.168.101.93:3000"

Now that I've increased it to 1024, I don't get the error:

 listen "/path/tmp/sockets/manager_rails.sock", backlog: 1024
Adrian

The default backlog value in the Unicorn config is 1024.

http://unicorn.bogomips.org/Unicorn/Configurator.html

listen "/path/to/.unicorn.sock", :backlog => 1024

1024 clients is the unix domain socket limit.

Falcon Momot