We've been running a couple of websites on Amazon's AWS infrastructure for about two years now. As of about two days ago, the webserver started going down once or twice a day, and the only error I can find is:
HTTP/1.1 503 Service Unavailable: Back-end server is at capacity
No alarms (CPU, disk I/O, DB connections) are being triggered in CloudWatch. I tried going to the site via the Elastic IP, skipping the ELB, and got this:
HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers. Retrying.
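For reference, that direct check was nothing more than an HTTP request to the instance itself, bypassing the ELB; something along these lines, with a placeholder Elastic IP:

# Hit the instance directly by its Elastic IP (placeholder address), skipping the ELB
wget -O /dev/null http://203.0.113.10/
curl -v -o /dev/null http://203.0.113.10/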
I don't see anything out of the ordinary in the Apache logs, and I've verified that they are being rotated properly. I have no problem accessing the machine via SSH while it's "down", and the process list shows 151 apache2 processes that look normal to me. Restarting Apache temporarily fixes the problem. This machine operates purely as a webserver behind an ELB. Any suggestions would be greatly appreciated.
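For what it's worth, 151 apache2 processes happens to match one parent plus what may be a full default pool of 150 workers on many distributions, so one thing worth checking while the instance is in its "down" state is whether Apache has exhausted its worker pool. A rough sketch, assuming a stock Debian/Ubuntu Apache layout (paths and directive names will vary):

# Count running Apache workers while the problem is occurring
pgrep -c apache2
# Compare that against the configured ceiling (MaxClients on Apache 2.2,
# MaxRequestWorkers on Apache 2.4)
grep -RiE 'MaxClients|MaxRequestWorkers' /etc/apache2/
# Look for "server reached MaxClients"-style warnings around the time of the outage
tail -n 100 /var/log/apache2/error.log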
CPU Utilization Average: 7.45%, Minimum: 0.00%, Maximum: 25.82%
Memory Utilization Average: 11.04%, Minimum: 8.76%, Maximum: 13.84%
Swap Utilization Average: N/A, Minimum: N/A, Maximum: N/A
Disk Space Utilization for /dev/xvda1 mounted on / Average: 62.18%, Minimum: 53.39%, Maximum: 65.49%
To clarify: I think the issue is with the individual EC2 instance and not the ELB; I just didn't want to rule the ELB out, even though I was unable to reach the instance via the Elastic IP. I suspect the ELB is simply passing along the result of hitting the actual EC2 instance.
Update: 2014-08-26
I should have updated this sooner, but the "fix" was to take a snapshot of the "bad" instance and launch a replacement from the resulting AMI. It hasn't gone down since. While I was still experiencing the issue, I did look at the health check and could reach the health check page (curl http://localhost/page.html) even while the load balancer was returning the capacity errors. I'm not convinced it was a health check issue, but since no one, including Amazon, has been able to provide a better answer, I'm marking it as the answer. Thank you.
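For anyone who runs into something similar: the ELB's own view of the instance, and the health check it is actually using, can be queried while the problem is happening, which would have confirmed or ruled out the health check theory much earlier. A minimal sketch using the classic ELB CLI, with a placeholder load balancer name:

# Ask the ELB what it currently thinks of its registered instances
aws elb describe-instance-health --load-balancer-name my-load-balancer
# Show the health check configuration the ELB is actually using
aws elb describe-load-balancers --load-balancer-names my-load-balancer \
    --query 'LoadBalancerDescriptions[0].HealthCheck'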
Update: 2015-05-06
I thought I'd come back here and say that I now firmly believe part of the issue was the health check settings. I don't want to rule out there being an issue with the AMI, because things definitely got better after the replacement AMI was launched, but I found out that our health checks were different for each load balancer, and the one having the most trouble had a very aggressive unhealthy threshold and response timeout. Our traffic tends to spike unpredictably, and I think the combination of the aggressive health check settings and the traffic spikes made for a perfect storm. While diagnosing the issue I was focused on the fact that I could reach the health check endpoint at that moment, but it's possible the health check had already failed because of latency, and we also had a high healthy threshold (for that particular ELB), so it would take a while for the instance to be seen as healthy again.
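As a concrete illustration of the kind of change involved (not our exact values), a classic ELB health check can be relaxed from the CLI so that a single slow response doesn't immediately mark the instance unhealthy; the load balancer name, target page, and numbers below are all placeholders:

# Loosen the health check: longer timeout, more failures tolerated before the
# instance is marked unhealthy, fewer successes required to mark it healthy again
aws elb configure-health-check --load-balancer-name my-load-balancer \
    --health-check Target=HTTP:80/page.html,Interval=30,Timeout=10,UnhealthyThreshold=5,HealthyThreshold=2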