I am running an application on AWS Elastic Beanstalk. If an instance responds too often with an HTTP status code in the 5xx (server error) range, AWS marks the instance as unhealthy and removes it from the load balancer.

I understand this and agree that this is actually a good behavior. But unfortunately, this leads to problems with my application.

My application connects to several external APIs and aggregates their data. One of the external APIs, which is not under my control, is flaky and quite often responds with a 500 status code.

At the moment, if an API raises an error, my application just passes that error back to the user. This causes AWS to think my application had an error, so it terminates that instance and starts a new server. But actually, only one of the endpoints is causing a constant rate of 500 errors, whereas all other endpoints are still fine.

On the one hand, it is correct that an external server error causes my application to just return that error. On the other hand, this kind of external server error is not in my application and I could catch it. But even if I catch the error I cannot return anything useful to the user and therefore still need to return with an error code.

How can I handle this? I want to avoid server error status codes so that they do not trigger unhealthy instances, but I also do not want to use a client error status code, because there is nothing the user can do except retry.

What do you suggest? Or is there another option to fine-tune AWS Elastic Beanstalk's behavior?

spickermann
  • This is IMHO more a software design issue and therefore not quite on-topic on ServerFault. I *think* that a 500 error response in an endpoint should not necessarily result in your application's health check also generating a 500 response, but that really depends... Does your application gracefully deal with failure of that endpoint, or will the workflow it supports die a horrible death halfway through at the stage where it requires that failing API? Or does your application already gracefully deal with that flapping API? – HBruijn May 10 '19 at 13:34
  • Just for clarification: the health check route returns `ok` all the time. My application gets that 500 response and at the moment returns 500 on purpose. The fact that AWS looks not only at the response of the health check but at all traffic is causing this issue. I can return something else (like 4xx) to avoid this AWS behavior, but this is not really a client error. @HBruijn – spickermann May 10 '19 at 13:41
  • The health check itself is not a problem. The issue is that AWS Beanstalk monitors and analyzes all traffic. Once requests to that specific endpoint return 500 in 15% of all app requests, AWS disables the instance because it is considered unhealthy, although all other endpoints, including the health check, are still fine. – spickermann May 10 '19 at 13:45

1 Answer

The question is then mainly: when requests to that API fail, does your application workflow require that your clients/users
a) are notified of that,
b) take a follow-up action, and
c) is an HTTP error response the only way they can be notified?

If so, consider having your application return a 408 (Request Timeout) error response when the remote API generates a 500 internal server error. That is somewhat appropriate in that it allows the client to resubmit the same request at a later moment. (A "502 Bad Gateway" would be a better fit, if not for the restriction below.)
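As a rough illustration of that mapping, here is a minimal, framework-agnostic sketch. The function names, the `Retry-After` value, and the response body are assumptions made for this example and are not part of any AWS or framework API; adapt them to whatever your web framework expects.

```python
from http import HTTPStatus

# Hypothetical value: how long clients are asked to wait before retrying.
RETRY_AFTER_SECONDS = 30


def status_for_client(upstream_status: int) -> int:
    """Map the external API's status to the status our application returns."""
    if 500 <= upstream_status <= 599:
        # Translate upstream server errors to 408 so they are counted as
        # 4xx rather than 5xx responses of our own application.
        return int(HTTPStatus.REQUEST_TIMEOUT)
    return upstream_status


def build_response(upstream_status: int, upstream_body: str):
    """Return a (status, headers, body) tuple for the client."""
    status = status_for_client(upstream_status)
    if status == upstream_status:
        # Nothing was translated: pass the upstream result through.
        return status, {}, upstream_body
    headers = {"Retry-After": str(RETRY_AFTER_SECONDS)}
    return status, headers, "The upstream service failed; please retry later."
```

With this sketch, `build_response(500, "boom")` yields a 408 with a `Retry-After` header, while successful upstream responses pass through unchanged.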

Additionally, you can configure enhanced health rules in Elastic Beanstalk that instruct it to ignore 4xx errors when determining health. Unfortunately, at the time of writing, you can't do the same for 5xx or for more specific HTTP status codes.
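For reference, such a rule is supplied as a JSON document through the `ConfigDocument` option in the `aws:elasticbeanstalk:healthreporting:system` namespace. The sketch below only builds that option setting in Python. It assumes the rule name `ApplicationRequests4xx` as described in the Elastic Beanstalk enhanced health documentation, so verify it against the current docs before relying on it.

```python
import json

# Enhanced health rule document: stop counting the application's 4xx
# responses against the environment's health.
config_document = {
    "Rules": {
        "Environment": {
            "Application": {
                "ApplicationRequests4xx": {"Enabled": False},
            },
        },
    },
    "Version": 1,
}

# Option setting in the shape Elastic Beanstalk expects; the rule
# document itself is passed as a JSON string.
option_settings = [
    {
        "Namespace": "aws:elasticbeanstalk:healthreporting:system",
        "OptionName": "ConfigDocument",
        "Value": json.dumps(config_document),
    }
]
```

The resulting list could then be passed as `OptionSettings` to boto3's Elastic Beanstalk `update_environment` call, or the same document could be placed in an `.ebextensions` configuration file.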

HBruijn