
We've been running a couple of websites on Amazon's AWS infrastructure for about two years now. As of about two days ago, the web server started to go down once or twice a day, with the only error I can find being:

HTTP/1.1 503 Service Unavailable: Back-end server is at capacity

No alarms (CPU / disk I/O / DB connections) are being triggered by CloudWatch. I tried going to the site via the Elastic IP to bypass the ELB and got this:

HTTP request sent, awaiting response... Read error (Connection reset by peer) in headers. Retrying.
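The direct request was made along these lines (the IP address and hostname below are placeholders, not our real values):

# Request sent straight to the instance's Elastic IP, bypassing the ELB
# (placeholder IP and hostname)
wget -O /dev/null --header='Host: www.example.com' http://203.0.113.10/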

I don't see anything out of the ordinary in the Apache logs, and I verified that they are being rotated properly. I have no problem accessing the machine via SSH when it's "down", and looking at the process list I see 151 apache2 processes that appear normal to me. Restarting Apache temporarily fixes the problem. This machine operates solely as a web server behind an ELB. Any suggestions would be greatly appreciated.

CPU Utilization Average: 7.45%, Minimum: 0.00%, Maximum: 25.82%

Memory Utilization Average: 11.04%, Minimum: 8.76%, Maximum: 13.84%

Swap Utilization Average: N/A, Minimum: N/A, Maximum: N/A

Disk Space Utilization for /dev/xvda1 mounted on / Average: 62.18%, Minimum: 53.39%, Maximum: 65.49%

Let me clarify: I think the issue is with the individual EC2 instance and not the ELB. I just didn't want to rule the ELB out, even though I was unable to reach the Elastic IP directly. I suspect the ELB is just returning the result of hitting the actual EC2 instance.

Update 2014-08-26: I should have updated this sooner, but the "fix" was to take a snapshot of the "bad" instance and launch a replacement from the resulting AMI. It hasn't gone down since then. I did look at the health check while I was still experiencing issues, and I could get to the health check page (curl http://localhost/page.html) even when I was getting capacity errors from the load balancer. I'm not convinced it was a health check issue, but since no one, including Amazon, can provide a better answer, I'm marking it as the answer. Thank you.

Update 2015-05-06: I thought I'd come back here and say that I now firmly believe part of the issue was the health check settings. I don't want to rule out there being an issue with the AMI, because things definitely got better after the replacement AMI was launched, but I found out that our health checks were different for each load balancer, and that the one having the most trouble had a really aggressive unhealthy threshold and response timeout. Our traffic tends to spike unpredictably, and I think between the aggressive health check settings and the spikes in traffic it was a perfect storm. In diagnosing the issue I was focused on the fact that I could reach the health check endpoint at that moment, but it is possible the health check had failed because of latency, and we also had a high healthy threshold (for that particular ELB), so it would take a while for the instance to be seen as healthy again.
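For anyone checking their own setup: if you use the AWS CLI, the per-load-balancer health check settings can be inspected with something like the following (the load balancer name is a placeholder):

# Show the health check configuration (Target, Interval, Timeout, thresholds)
# for a classic ELB; "my-load-balancer" is a placeholder name
aws elb describe-load-balancers --load-balancer-names my-load-balancer \
    --query 'LoadBalancerDescriptions[].HealthCheck'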

JSP
  • I found more information about this at: https://meta.discourse.org/t/ssl-aws-elb-503-service-unavailable-back-end-server-is-at-capacity/29098 – Andre Mesquita Jul 13 '15 at 14:32

5 Answers

44

You will get a "Back-end server is at capacity" error when the ELB performs its health checks and receives a "page not found" (or other simple error) in response, due to a misconfiguration (typically with the NameVirtualHost setup).
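As a rough illustration of one way to avoid that (the ServerName and DocumentRoot below are placeholders), the first/default virtual host should be able to serve the health check target, since the ELB health checker arrives with no matching Host header:

# First (default) vhost catches requests with no matching Host header,
# which is how the ELB health checker arrives; the health check target
# (e.g. /page.html from the question) must return 200 from this DocumentRoot
NameVirtualHost *:80
<VirtualHost *:80>
    ServerName default.localdomain
    DocumentRoot /var/www/default
</VirtualHost>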

Try grepping the log directory for the "ELB-HealthChecker" user agent, e.g.:

grep ELB-HealthChecker /var/log/httpd/*

This will typically turn up a 4xx or 5xx error that is easily fixed. Blaming flooding, MaxClients, etc. gives the problem way too much credit.
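To summarize what the health checker has actually been receiving, something along these lines works, assuming the access logs live at /var/log/httpd/access_log* and use the common/combined format (where the status code is the ninth field):

# Count responses returned to the ELB health checker, grouped by HTTP status code
grep -h ELB-HealthChecker /var/log/httpd/access_log* | awk '{print $9}' | sort | uniq -c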

FYI Amazon: why not show the response returned by the health check request? Even a status code would help.

Charlie Dalsass
18

I just ran into this issue myself. The Amazon ELB will return this error if there are no healthy instances. Our sites were misconfigured, so the ELB health check was failing, which caused the ELB to take both servers out of rotation. With zero healthy instances, the ELB returned 503 Service Unavailable: Back-end server is at capacity.
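If you suspect this, you can confirm it directly: a classic ELB reports the health state of every registered instance, and one failing its health check shows up as OutOfService. A quick check with the AWS CLI (the load balancer name is a placeholder):

# List the health state of every instance registered with the ELB;
# "OutOfService" means the instance is not currently passing its health checks
aws elb describe-instance-health --load-balancer-name my-load-balancer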

6

[EDIT after understanding the question better] Not having any experience with the ELB, I still think this sounds suspiciously like the 503 error that can be thrown when Apache fronts a Tomcat instance and floods the connection.

The effect is that if Apache delivers more connection requests than the backend can process, the backend input queues fill up until no more connections can be accepted. When that happens, the corresponding output queues of Apache start filling up. When the queues are full, Apache throws a 503. It would follow that the same could happen when Apache is the backend and the frontend delivers requests at such a rate as to make the queues fill up.

The (hypothetical) solution is to size the input connectors of the backend and output connectors of the frontend. This turns into a balancing act between the anticipated flooding level and the available RAM of the computers involved.

So when this happens, check your MaxClients setting and monitor your busy workers in Apache (mod_status). Do the same, if possible, with whatever the ELB has that corresponds to Tomcat's connector backlog, maxThreads, etc. In short, look at everything concerning the input queues of Apache and the output queues of the ELB.
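If mod_status is enabled (and the default /server-status location is allowed from localhost), the busy and idle worker counts can be sampled right on the machine:

# Machine-readable worker stats from mod_status: BusyWorkers, IdleWorkers and the
# scoreboard; watch BusyWorkers approach MaxClients under load
curl -s 'http://localhost/server-status?auto'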

Although I fully understand it is not directly applicable, this link contains a sizing guide for the Apache connector. You would need to research the corresponding ELB queue technicalities, then do the math: http://www.cubrid.org/blog/dev-platform/maxclients-in-apache-and-its-effect-on-tomcat-during-full-gc/

As observed in the comments below, a spike in traffic is not the only way to overwhelm the Apache connector. If some requests are served more slowly than others, a higher proportion of those can also lead to the connector queues filling up. This was true in my case.

Also, when this happened to me, I was baffled that I had to restart the Apache service in order to stop being served 503s. Simply waiting out the connector flooding was not enough. I never figured that out, but one could speculate that Apache was serving from its cache, perhaps?

After increasing the number of workers and the corresponding prefork MaxClients settings (this was multi-threaded Apache on Windows, which has a couple of other directives for the queues, if I remember correctly), the 503 problem disappeared. I didn't actually do the math; I just tweaked the values up until I could observe a wide margin over the peak consumption of the queue resources, and left it at that.
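For a Linux prefork setup (so not the exact directives I was tuning on Windows), the relevant knobs look roughly like this; the numbers are illustrative only, not recommendations:

# Sketch of prefork MPM sizing in httpd.conf
# (on Apache 2.4 the MaxClients directive is called MaxRequestWorkers)
<IfModule mpm_prefork_module>
    StartServers          5
    MinSpareServers       5
    MaxSpareServers      10
    ServerLimit         256
    MaxClients          256
    MaxRequestsPerChild 4000
</IfModule>
# Size of the queue of connections waiting to be accepted
ListenBacklog 511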

Hope this was of some help.

ErikE
  • I just realized you wrote that Apache is your backend. Still, the workers, MaxClients, etc. would come into play, I guess; however, my answer is too far off and needs a complete rewrite. I may just delete it instead. Lesson learned: read the question properly. – ErikE Nov 21 '13 at 22:06
  • Thank you. For this to be the case, there would have to be a large spike in traffic? And once said traffic let up, shouldn't Apache be able to recover? – JSP Nov 21 '13 at 22:08
  • In theory, yes. However, when this happened to me I had to restart the service. This led me to look first in places that had nothing to do with what actually happened, but even after a proper diagnosis and cure I still haven't been able to understand why the service restart was necessary. I quietly suspected it was due to running Apache on Windows, as I found an unrelated bug report which apparently only surfaced with that combination. Very strange in any case. – ErikE Nov 21 '13 at 22:16
  • And yes, there was traffic overwhelming the connectors: not spiky (for us), but too much. It was rather that certain requests which were slower to serve happened to arrive in too great a number on occasion. After monitoring for a bit and upping the related values, the 503s disappeared, along with the need for subsequent restarts. – ErikE Nov 21 '13 at 22:25
4

You can up the values of the ELB health checker so that a single slow response won't pull a server out of the ELB. It's better for a few users to get "service unavailable" than to have the site down for everyone.

EDIT: We are able to get away without pre-warming the cache by upping the health check timeout to 25 seconds. After 1-2 minutes, the site is responsive as hell.
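For a classic ELB those settings can also be changed from the AWS CLI; something along these lines, where the load balancer name, the target path, and every value except the 25-second timeout mentioned above are placeholders:

# Relax the health check: 25 s response timeout, a check every 30 s, and two
# consecutive failures/successes to change an instance's state
aws elb configure-health-check --load-balancer-name my-load-balancer \
    --health-check Target=HTTP:80/page.html,Interval=30,Timeout=25,UnhealthyThreshold=2,HealthyThreshold=2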

EDIT: Just launch a bunch of on-demand instances, and once your monitoring tools show management just how fast you are, prepay for Amazon Reserved Instances. :P

EDIT: It is possible that a single backend instance registered with the ELB is not enough. Just launch a few more and register them with the ELB; that will help you narrow down your problem.
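Registering extra instances with a classic ELB is a one-liner if you go the CLI route (the instance ID and load balancer name are placeholders):

# Add another backend instance to the ELB's rotation
aws elb register-instances-with-load-balancer --load-balancer-name my-load-balancer \
    --instances i-0123456789abcdef0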

nandoP
0

It's a few years late, but hopefully this helps someone out.

I was seeing this error when the instance behind the ELB did not have a proper public IP assigned. I needed to manually allocate an Elastic IP and associate it with the instance, after which the ELB picked it up almost instantly.
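With the AWS CLI that amounts to something like the following (the IDs are placeholders, and the --domain vpc form applies to instances running in a VPC):

# Allocate a new Elastic IP (note the AllocationId in the output),
# then attach it to the instance
aws ec2 allocate-address --domain vpc
aws ec2 associate-address --instance-id i-0123456789abcdef0 --allocation-id eipalloc-0abc1234def567890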

Ben Randall