
We are running a website on AWS with a specific setup: an ELB splits the load across 2 x t2.medium instances running nginx. From there, PHP traffic is split into two streams (frontend and API), each going to an internal ELB fronting our PHP servers. For the record, we have 2 frontend PHP servers (t2.medium) and 3 API PHP servers (m4.large), all running the same version of PHP-FPM on port 9000.

All worked great until a few days ago. For some reason, yet to be determined, the traffic on the PHP API servers just dies and only an nginx restart brings it back to life.

We assume that we may have some long-running process causing one of the PHP servers to become busy, and it all goes downhill from there. However, the CPU usage is fairly constant on all the PHP servers right up to the point where they stop responding. PHP-FPM is still running and the load on the nginx servers is still very low. Clients receive 504 responses, and this is what I see in the nginx error log:

2016/10/04 14:34:25 [error] 17661#0: *2309784 connect() failed (113: No route to host) while connecting to upstream, client: xxx.xxx.xxx.xxx, server: api.mywebsite.com, request: "GET /some/route HTTP/1.1", upstream: "fastcgi://internalip:9000", host: "api.mywebsite.com"

nginx.conf

worker_processes 4;

events {
    worker_connections 19000;
}

nginx site conf

location ~ \.php$ {
    try_files $uri =404;

    fastcgi_buffer_size 512k;
    fastcgi_buffers 16 256k;
    fastcgi_busy_buffers_size 1024k;

    include fastcgi_params;

    fastcgi_pass route53-php:9000;
    fastcgi_index index.php;

    fastcgi_param REQUEST_URI /api$request_uri;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}

www.conf

listen = 9000
pm = dynamic
pm.max_children = 50
pm.start_servers = 25
pm.min_spare_servers = 25
pm.max_spare_servers = 25
pm.max_requests = 500

As the setup is far from trivial, I wonder if the PHP location block is set up properly. It could also be that the servers are undersized, but CPU usage is very low.

Alex
  • "No route to host while connecting to upstream" would seem to be a significant clue. When this happens I suggest you SSH into the web server and do diagnostics. Ping, curl the ELB, curl the instance directly, that sort of thing. Report back what you find. – Tim Oct 04 '16 at 18:26
  • Thanks for the comment - it happened again just now and I am starting to wonder whether the ELB timeout is a problem here. Nginx sees it as "dead" and ignores it - is nginx even capable of doing that on its own without an upstream block? – Alex Oct 06 '16 at 09:21
  • Actually it is showing an error in the logs I had not noticed before, `upstream prematurely closed connection while reading response header from upstream`, right before it goes downhill. Will investigate what's causing PHP to die, but I suspect there's a memory leak somewhere. – Alex Oct 06 '16 at 09:28

1 Answer


Right, so it turns out this is a common problem with nginx talking to an internal AWS ELB: nginx resolves the ELB's DNS name only once, when it starts, so when the ELB's IP addresses change it keeps connecting to a stale address. After a bit more googling, I found this question: Some nginx reverse proxy configs stops working once a day, and adding a resolver helped - I have not had any downtime for 3 days now.

It's also interesting to point out that every article I found talks about proxy_pass, but the same fix seems to work just fine with fastcgi_pass as well.
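For reference, here is roughly what the change looks like in the nginx site conf - a minimal sketch, not my exact production config. The resolver address is an assumption (the Amazon-provided VPC DNS usually sits at the VPC CIDR base + 2, e.g. 10.0.0.2; use whatever your VPC provides), and the $php_backend variable name is just illustrative. The key point is that putting the upstream host in a variable makes nginx re-resolve the ELB name at request time via the resolver, instead of caching the IP it looked up at startup:

# assumption: 10.0.0.2 is this VPC's DNS server (CIDR base + 2) - adjust for your VPC
resolver 10.0.0.2 valid=10s;

location ~ \.php$ {
    try_files $uri =404;

    include fastcgi_params;

    # using a variable forces nginx to resolve route53-php per request
    # through the resolver above, so ELB IP changes are picked up
    set $php_backend route53-php:9000;
    fastcgi_pass $php_backend;
    fastcgi_index index.php;

    fastcgi_param REQUEST_URI /api$request_uri;
    fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
}

Without the variable, nginx resolves route53-php only when the configuration is loaded, which is why a restart "fixed" things each time.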

Hopefully this will help someone in the same situation!

Alex