We are running a website on AWS with a specific setup: an ELB splits the load across 2 x t2.medium instances running nginx. From there, the PHP traffic is split into two streams (frontend and API), each going to an internal ELB fronting our PHP servers. For the record, we have 2 frontend PHP servers (t2.medium) and 3 API PHP servers (m4.large), all running the same version of PHP-FPM on port 9000.
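Roughly, the topology looks like this:

                      public ELB
                          |
               2 x t2.medium (nginx)
               /                    \
      internal ELB              internal ELB
      (frontend)                   (API)
          |                          |
    2 x t2.medium              3 x m4.large
    (PHP-FPM :9000)           (PHP-FPM :9000)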
Everything worked great until a few days ago. For a reason yet to be determined, traffic to the PHP API servers simply dies, and only an nginx restart brings it back to life.
We suspect a long-running process is causing one of the PHP servers to become busy, and it all goes downhill from there. However, CPU usage is fairly constant on all the PHP servers right up to the point where they stop responding. PHP-FPM is still running, and the load on the nginx servers is still very low.
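To confirm or rule out the long-running-request theory, PHP-FPM's slow log could be enabled in the pool config; a minimal sketch, assuming the log path exists and is writable by the pool user (the 5s threshold is a guess):

; dump a stack trace for any request running longer than 5 seconds
request_slowlog_timeout = 5s
slowlog = /var/log/php-fpm/www-slow.log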
Clients receive 504 responses, and this is what I see in the nginx error log. Note that the failure happens at connect() time ("No route to host", errno 113), i.e. at the network level, rather than as a backend timeout:
2016/10/04 14:34:25 [error] 17661#0: *2309784 connect() failed (113: No route to host) while connecting to upstream, client: xxx.xxx.xxx.xxx, server: api.mywebsite.com, request: "GET /some/route HTTP/1.1", upstream: "fastcgi://internalip:9000", host: "api.mywebsite.com"
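Since the ELB behind route53-php can resolve to more than one IP, it may help to log which upstream address each request actually hits. A sketch using nginx's built-in upstream variables (the format and file names here are arbitrary; log_format belongs in the http block, and $upstream_connect_time needs nginx 1.9.1+):

log_format upstream_debug '$remote_addr [$time_local] "$request" $status '
                          'upstream=$upstream_addr connect=$upstream_connect_time '
                          'response=$upstream_response_time';
access_log /var/log/nginx/api-upstream.log upstream_debug;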
nginx.conf (relevant excerpt):

worker_processes 4;

events {
    worker_connections 19000;
}
nginx site conf (PHP location block):

location ~ \.php$ {
    try_files $uri =404;

    fastcgi_buffer_size 512k;
    fastcgi_buffers 16 256k;
    fastcgi_busy_buffers_size 1024k;

    include fastcgi_params;
    # DNS name (Route 53) pointing at the internal PHP ELB
    fastcgi_pass route53-php:9000;
    fastcgi_index index.php;
    fastcgi_param REQUEST_URI /api$request_uri;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}
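The error above is a connect() failure rather than a timeout, but for completeness: none of the fastcgi timeouts are set explicitly in this block, so nginx falls back to its defaults of 60s each. If the 504s were plain backend timeouts, these would be the relevant knobs, shown here with their default values:

fastcgi_connect_timeout 60s;
fastcgi_send_timeout    60s;
fastcgi_read_timeout    60s;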
www.conf (PHP-FPM pool):
listen = 9000
pm = dynamic
pm.max_children = 50
pm.start_servers = 25
pm.min_spare_servers = 25
pm.max_spare_servers = 25
pm.max_requests = 500
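Back-of-the-envelope: assuming something like 40 MB per PHP-FPM child (a guess; actual usage should be checked on the servers), pm.max_children = 50 means up to roughly 2 GB of PHP memory per box. That is comfortable on the m4.large API servers (8 GB RAM) but already half the RAM of a t2.medium frontend (4 GB).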
As the setup is far from trivial, I wonder whether the PHP location block is set up properly. It could also be the size of the servers used, but CPU usage is very low.