ECS restarts due to health_check failure when multiple other requests are slow to return

Question

We noticed that our ECS Fargate backend services restart due to a health check response timeout:

(service our-site-com-stack-BackendApiServiceStack...) (port 8000) is unhealthy in (target-group arn:aws:elasticloadbalancing:us-east-1:1234:targetgroup/dev-d-ABC-ABC123/ABC123) due to (reason Request timed out).

We are trying to figure out how to conduct a health_check on our application for ECS that won't needlessly restart our services whenever the database gets busy (or other slow requests are pending).

We originally thought the situation might be similar to the one described here: https://cloudsoft.io/blog/consequences-of-bad-health-checks-in-aws-application-load-balancer. Basically, if our database was busy or slow, the health check request could time out.

However, we modified the health_check so that it does not hit our RDS Postgres database, and we even tried shutting the database off. We can reach the endpoint with the database off, but we can no longer reach it once we trigger as few as 7 requests that will time out (such as login requests with the database down) or a similar number of requests that are slow to return (with the database up).

In our AWS Application Stack, Route 53 is used to route traffic to our CloudFront distribution. CloudFront routes traffic for this endpoint to our Application Load Balancer for the Django application.

Our health check is part of our Django application and basically just returns a 200 response:

def health_check(request):
    response = JsonResponse({"message": "OK"})
    return response

Here's how our health check is set up in CDK:

        self.https_listener = self.alb.add_listener(
            "HTTPSListener",
            port=443,
            certificates=[scope.certificate],
            open=True,
        )

        scope.https_listener.add_targets(
            "BackendTarget",
            port=80,
            targets=[self.backend_service],
            priority=2,
            path_patterns=["*"],
            health_check=elbv2.HealthCheck(
                healthy_http_codes="200-299",
                path="/api/core/health-check/",
            ),
        )

The command that starts our production server is:

GEVENT_RESOLVER=ares gunicorn -t 1000 -k gevent -w 4 -b 0.0.0.0:8000 backend.wsgi

During an unrelated test, we were able to reproduce the same issue using Daphne:

daphne -b 0.0.0.0 -p 8000 backend.asgi:application

Answer

score 0 · answered Feb 16 '22 at 15:08

Although the health check request wasn't hitting the database, it was getting stuck behind other requests that were blocking (or not returning for other reasons). We had misunderstood how easily the gunicorn (or other server) event loop could get blocked, and we will be redesigning to make better use of gevent and a Celery results backend.
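To illustrate the effect, here is a minimal sketch that uses the standard library's asyncio as a stand-in for the server's event loop. It is not our actual gunicorn/gevent or Daphne setup, and the names `slow_request`, `health_check`, and `measure` are hypothetical. It assumes 7 slow requests are already queued when the health check arrives, and compares slow requests that block the loop with ones that yield while waiting.

```python
import asyncio
import time


async def health_check() -> dict:
    # Stand-in for the Django view: no database access, returns immediately.
    return {"message": "OK"}


async def slow_request(delay: float, blocking: bool) -> None:
    # Stand-in for a slow request, e.g. a login waiting on the database.
    if blocking:
        time.sleep(delay)           # never yields: the whole loop stalls
    else:
        await asyncio.sleep(delay)  # yields: other tasks can run meanwhile


async def _health_latency(blocking: bool, n_slow: int, delay: float) -> float:
    start = time.monotonic()
    slow = [asyncio.create_task(slow_request(delay, blocking)) for _ in range(n_slow)]
    # The health check arrives after the slow requests are already queued.
    await asyncio.create_task(health_check())
    latency = time.monotonic() - start
    await asyncio.gather(*slow)  # let the slow requests finish before exiting
    return latency


def measure(blocking: bool, n_slow: int = 7, delay: float = 0.05) -> float:
    """Return seconds until the health check answers while n_slow requests run."""
    return asyncio.run(_health_latency(blocking, n_slow, delay))


if __name__ == "__main__":
    print(f"blocking slow requests:    {measure(True):.3f}s")
    print(f"cooperative slow requests: {measure(False):.3f}s")
```

In the blocking case the health check waits roughly `n_slow * delay` even though it does no work itself, so with real request times in the seconds it could plausibly exceed the load balancer's health check timeout. The same principle should apply to gevent greenlets whenever a call does not yield cooperatively.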
