I work for a rather busy internet site that often gets very large spikes of traffic. During these spikes hundreds of pages per second are requested, and this produces random 502 gateway errors.

We run Nginx (1.0.10) and PHP-FPM on a machine with 4x SAS 15k drives (RAID 10), a 16-core CPU and 24GB of DDR3 RAM. We also use the latest XCache version. The DB is located on another machine, but that machine's load is very low and it has no issues.

Under normal load everything runs perfectly: the system load is below 1, and PHP-FPM's status report never really shows more than 10 active processes at one time. There is always about 10GB of RAM still available. Under normal load the machine handles about 100 pageviews per second.

The problem arises when huge spikes of traffic arrive and hundreds of pageviews per second are requested from the machine. FPM's status report then shows up to 50 active processes, which is still far below the maximum of 300 that we have configured (pm.max_children). During these spikes Nginx's status page reports up to 5000 active connections instead of the normal average of 1000.
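To see how the 502s line up with the spikes, one option is to count them per minute from the Nginx access log. This is a minimal sketch, assuming the default "combined" log format (status code in field 9, timestamp in field 4); the function name and the log path in the usage comment are illustrative and not taken from this setup.

```shell
# Count 502 responses per minute in an nginx access log.
# Assumes the default "combined" log format, where field 4 looks like
# "[07/Jan/2012:17:00:01" and field 9 is the HTTP status code.
errors_per_minute() {
    awk '$9 == 502 { n[substr($4, 2, 17)]++ }
         END { for (m in n) print m, n[m] }' "$1" | sort
}

# Usage (hypothetical path):
#   errors_per_minute /var/log/nginx/access.log
```

Each output line is a minute followed by the number of 502 responses in that minute, which can be compared against the FPM and Nginx status figures from the same period.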

OS Info: CentOS release 5.7 (Final)

CPU: Intel(R) Xeon(R) CPU E5620 @ 2.40GHz (16 cores)


PHP-FPM configuration:

daemonize = yes
listen = /tmp/fpm.sock
pm = static
pm.max_children = 300
pm.max_requests = 1000

I have not set up rlimit_files, because as far as I know it uses the system default if you don't.

fastcgi_params (only the values added to the standard file):

fastcgi_connect_timeout 60;
fastcgi_send_timeout 180;
fastcgi_read_timeout 180;
fastcgi_buffer_size 128k;
fastcgi_buffers 4 256k;
fastcgi_busy_buffers_size 256k;
fastcgi_temp_file_write_size 256k;
fastcgi_intercept_errors on;

fastcgi_pass            unix:/tmp/fpm.sock;


Nginx configuration:

worker_processes        8;
worker_connections      16384;
sendfile                on;
tcp_nopush              on;
keepalive_timeout       4;

Nginx connects to FPM via Unix Socket.


sysctl settings:

net.ipv4.ip_forward = 0
net.ipv4.conf.default.rp_filter = 1
net.ipv4.conf.default.accept_source_route = 0
kernel.sysrq = 1
kernel.core_uses_pid = 1
net.ipv4.tcp_syncookies = 1
kernel.msgmnb = 65536
kernel.msgmax = 65536
kernel.shmmax = 68719476736
kernel.shmall = 4294967296
net.ipv4.conf.all.send_redirects = 0
net.ipv4.conf.default.send_redirects = 0
net.ipv4.tcp_max_syn_backlog = 2048
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.conf.all.accept_source_route = 0
net.ipv4.conf.all.accept_redirects = 0
net.ipv4.conf.all.secure_redirects = 0
net.ipv4.conf.all.log_martians = 1
net.ipv4.conf.default.accept_redirects = 0
net.ipv4.conf.default.secure_redirects = 0
net.ipv4.icmp_echo_ignore_broadcasts = 1
net.ipv4.icmp_ignore_bogus_error_responses = 1
net.ipv4.conf.default.rp_filter = 1
net.ipv4.tcp_timestamps = 0
net.ipv4.ip_conntrack_max = 100000


limits.conf:

* soft nofile 65536
* hard nofile 65536

These are the results for the following commands:

ulimit -n

ulimit -Sn

ulimit -Hn

cat /proc/sys/fs/file-max

Question: If PHP-FPM is not running out of connections, the load is still low, and there is plenty of RAM available, what bottleneck could be causing these random 502 gateway errors during high traffic?

Note: by default this machine's ulimits were 1024. Since I changed them to 65536 I have not fully rebooted the machine, as it's a production machine and a reboot would mean too much downtime.
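Limits set in limits.conf apply to new login sessions; a daemon that is already running keeps the limits it started with until it is restarted. The sketch below reads the limit a running process actually has. It assumes the kernel exposes /proc/<pid>/limits, and the function name is illustrative.

```shell
# Print the soft and hard "Max open files" limits of a running process.
# Assumes a Linux kernel that provides /proc/<pid>/limits.
open_files_limit() {
    awk '/^Max open files/ { print $4, $5 }' "/proc/$1/limits"
}

# Usage: check every running php-fpm process (pgrep matches by name).
#   for pid in $(pgrep php-fpm); do open_files_limit "$pid"; done
```

If the values printed for the Nginx workers or PHP-FPM children are still 1024, the new limit has not reached them. The rlimit_files setting in PHP-FPM and worker_rlimit_nofile in Nginx set the limit per daemon.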


2 Answers


Sporadic 502 errors from load balancers, such as HAProxy and nginx, are usually caused by something getting cut off mid stream between the LB and the Web Server.

Try running one of your web servers, or a test copy of it, through GDB and see whether a segmentation fault occurs while you generate test traffic (use ab, JMeter or similar to simulate the traffic).

I had to solve a very similar problem recently. I'd ruled out resources etc. as the cause, since I had pretty comprehensive monitoring to help me there. In the end I found that the 502 error came from the web server behind the load balancer returning invalid (in this case empty) HTTP responses to the LB.

I took one of the web servers, stopped the web server process, started it again via gdb and then browsed the site. Eventually, after some clicking around, I saw a segmentation fault happen, and this caused a 502 error to be visible. I took the backtrace from GDB and submitted it to the PHP team as a bug, but the only fix for me was to switch distribution to work around the PHP bug that was there.

The segfault was causing the web server to send invalid content to the LB, and the LB was displaying a 502 error because as far as it's concerned the web server has disappeared "mid flow".

I know this doesn't directly answer your question, but it's a place to start looking. Assuming you do see a segfault, you can get the stack trace from GDB and then hopefully work backwards to find which function is causing it.
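A cheaper first check, before restarting anything under GDB, is to look for crashed children in the PHP-FPM log. The following is a rough sketch rather than the exact procedure described above: it assumes PHP-FPM logs crashed children with the text "exited on signal 11", and that gdb is installed and allowed to attach to the process (typically as root). Function names and the log path are illustrative.

```shell
# Count PHP-FPM children that were killed by SIGSEGV (signal 11),
# based on the "exited on signal 11" text in the PHP-FPM error log.
count_fpm_segfaults() {
    grep -c 'exited on signal 11' "$1"
}

# Attach gdb to a running process, let it continue until it stops
# on a signal (for example a segfault), then print the backtrace.
# -p, -batch and -ex are standard gdb command-line options.
backtrace_on_crash() {
    gdb -p "$1" -batch -ex continue -ex bt
}

# Usage (hypothetical log path and PID):
#   count_fpm_segfaults /var/log/php-fpm.log
#   backtrace_on_crash 12345
```

A count of zero during a spike that produced 502s would suggest the errors have a cause other than crashing PHP processes.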

  • Thank you for the extended answer, but in this setup we do not use a loadbalancer. – Mr.Boon Jan 07 '12 at 17:22
  • No I understand that, however the bad gateway concept is the same. nginx receives the connection, and passes it to php-fpm. I would expect the bad gateway will be communication failure between the two. – SimonJGreen Jan 07 '12 at 17:24
  • Still not sure if this is the way to go though. Under normal load there are never 502 errors; they only appear when it's really busy and it's hitting some kind of limit somewhere. The problem is of course that I am unsure what this limit is, and which value to adjust. – Mr.Boon Jan 07 '12 at 17:59

Official recommendation: worker_processes = number of CPU cores

set worker_processes 16;

Mark Henderson
Artur Zh
  • Thank you, I have changed that value. Any other recommendations? worker_rlimit_nofile for example? If so, what value. Also because Nginx connects to FPM via unix socket, could I be filling up the socket use too much? – Mr.Boon Jan 07 '12 at 18:14
  • Just got another traffic spike again, and with worker_processes on 16 it seems to make no difference. – Mr.Boon Jan 07 '12 at 21:03
  • What does the message log show at that moment? Can you try to use listen =; instead of listen = /tmp/fpm.sock; and test again? – Artur Zh Jan 08 '12 at 18:45
  • Hi, I actually tried that yesterday, and it ran very well for a long time; it seemed to be better than via Unix socket. But I also run (d)dos deflate to ban people who make more than a set amount of connections. After running php-fpm via for about 18 hours, all of a sudden my sites were offline, and I noticed that for some yet unknown reason, dos deflate had banned, when this IP was of course whitelisted. So I switched back to the Unix socket. Later on I found out that localhost wasn't whitelisted anymore, so I'm a bit freaked out, and clueless as to how this happened. – Mr.Boon Jan 09 '12 at 20:36
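On the log question raised in the comments: the Nginx error log records why each upstream request failed, which helps tell a crashed backend apart from a queue or limit being hit. Below is a minimal sketch that groups upstream failures by the "(errno: reason)" text Nginx writes; the function name and log path are illustrative.

```shell
# Summarise upstream failures in an nginx error log, grouped by the
# "(errno: reason)" part of each message, most frequent first.
upstream_error_summary() {
    grep 'upstream' "$1" \
        | grep -o '([0-9]*: [^)]*)' \
        | sort | uniq -c | sort -rn
}

# Usage (hypothetical path):
#   upstream_error_summary /var/log/nginx/error.log
```

An entry such as (11: Resource temporarily unavailable) on connect() to the Unix socket usually means the socket's listen queue was full. That queue is sized by listen.backlog in the PHP-FPM pool and capped by net.core.somaxconn, which is not among the sysctl values listed in the question. Entries such as (104: Connection reset by peer) would point more towards the backend process dying mid-request.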