What could explain downtime when ALB connection count is high but request rates are low?

Question

I have a backend service on EC2s, and requests to the EC2s are routed through an ALB.

My backend service had a brief downtime during which its response latency shot up. This led to a massive buildup in connection count at the ALB.

However, when trying to re-deploy my service (on new nodes) I noticed that the nodes would go down after a few minutes of uptime (responses would get slower and slower until everything started to freeze). During this period, the number of requests to the service remained constant, but the connection count built up.

Is it possible that the EC2 instances were saturated with a high number of connections, and that this could cause new nodes to fail? Or is an ALB careful enough about how many connections it opens to the EC2 instances to avoid this?

score 0 · Answer 1 · answered Jun 07 '20 at 09:45

There can be a multitude of causes here. Firstly, are your EC2 instances registered properly as targets in your Application Load Balancer, and does the ALB report them as healthy? Remember that you have to manually add new instances to the ALB's target group when you create them, unless an Auto Scaling group attached to the target group registers them for you.
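One way to check registration and health from a script rather than the console is the ELBv2 `describe_target_health` call. The following is a minimal sketch, assuming boto3 and configured AWS credentials; the helper name `unhealthy_targets` and the placeholder ARN are made up for illustration.

```python
def unhealthy_targets(elbv2_client, target_group_arn):
    """Return (target_id, state, reason) for every target that is not healthy.

    elbv2_client is expected to be a boto3 "elbv2" client (or anything with
    a compatible describe_target_health method).
    """
    response = elbv2_client.describe_target_health(TargetGroupArn=target_group_arn)
    problems = []
    for description in response["TargetHealthDescriptions"]:
        health = description["TargetHealth"]
        if health["State"] != "healthy":
            # "Reason" is only present when the target is not healthy.
            problems.append(
                (description["Target"]["Id"], health["State"], health.get("Reason", ""))
            )
    return problems


# Example usage (hypothetical ARN, replace with your own target group):
#   import boto3
#   client = boto3.client("elbv2")
#   arn = "arn:aws:elasticloadbalancing:REGION:ACCOUNT_ID:targetgroup/NAME/ID"
#   for target_id, state, reason in unhealthy_targets(client, arn):
#       print(target_id, state, reason)
```

An empty result means every registered target is currently passing health checks. An instance missing from the response entirely is not registered in that target group at all.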

What size EC2 instances did you use, where is the traffic originating from, and what software is running on the instances? These are all important details that you should include in your question so that there is a starting point for troubleshooting. Are the instances using EBS or EFS storage? If EFS, slow read speeds on the file system could be the problem. You should also check the utilization of your EC2 instances in CloudWatch, as it seems that they are the bottleneck.
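The CloudWatch check can also be scripted. This is a rough sketch, assuming boto3 and configured credentials, that reads the standard `CPUUtilization` metric from the `AWS/EC2` namespace; the helper name `peak_cpu`, the 30-minute window, and the instance ID in the usage comment are arbitrary choices for illustration.

```python
from datetime import datetime, timedelta, timezone


def peak_cpu(cloudwatch_client, instance_id, minutes=30):
    """Return the highest CPUUtilization (percent) seen in the last `minutes`.

    cloudwatch_client is expected to be a boto3 "cloudwatch" client.
    Returns None when CloudWatch has no datapoints for the window.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(minutes=minutes)
    response = cloudwatch_client.get_metric_statistics(
        Namespace="AWS/EC2",
        MetricName="CPUUtilization",
        Dimensions=[{"Name": "InstanceId", "Value": instance_id}],
        StartTime=start,
        EndTime=end,
        Period=300,  # 5-minute buckets, matching basic EC2 monitoring
        Statistics=["Maximum"],
    )
    datapoints = response["Datapoints"]
    if not datapoints:
        return None
    return max(point["Maximum"] for point in datapoints)


# Example usage (hypothetical instance ID):
#   import boto3
#   print(peak_cpu(boto3.client("cloudwatch"), "i-0123456789abcdef0"))
```

CPU is only one candidate bottleneck. If it stays low while latency climbs, memory, disk I/O, or an exhausted thread or connection pool in the application are worth checking next.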

The ALB is not the root of the problem. It is simply reporting that the latency between forwarding a request to your backend and getting a response is high. This means your backend is taking its sweet time to give a response.

According to AWS support: As the ELB (when configured with HTTP listeners) acts as a proxy (request headers come in and get validated, and are then sent to the backend), the latency metric starts ticking as soon as the headers are sent to the backend and stops when the backend sends the first byte of the response.

Your EC2 instances are slow to respond and get overloaded for whatever reason; that's where your problem is. An ALB does not have the ability to actively monitor the CPU usage of your instances and balance requests accordingly. As long as an instance is healthy, it keeps receiving its share of the requests (an equal share under the default round-robin routing).
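To see why the connection count can climb while the request rate stays flat, Little's law is a useful back-of-the-envelope model: in steady state, the number of requests in flight is roughly the arrival rate multiplied by the average time each request takes. The sketch below uses made-up numbers (200 requests per second) and assumes each in-flight request holds one connection, which will not hold exactly for every setup.

```python
def requests_in_flight(requests_per_second, avg_latency_seconds):
    """Little's law: average in-flight requests = arrival rate * time in system."""
    return requests_per_second * avg_latency_seconds


# Illustrative numbers only: the request rate is held constant
# while the backend latency degrades.
RATE = 200  # requests per second (assumed)
for latency in (0.1, 1.0, 10.0):
    in_flight = requests_in_flight(RATE, latency)
    print(f"latency {latency:>5.1f}s -> about {in_flight:.0f} requests in flight")
```

With the rate fixed, a hundredfold increase in latency means roughly a hundredfold increase in open requests. A fresh node that receives its share of this backlog can be pushed past its limits within minutes, which is consistent with the behaviour described in the question.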
