I have a service structured like this: Internet -> Load Balancer -> Instances
What seems to be happening is the following scenario:
1. A large surge of user traffic arrives, more than the current instances can handle.
2. The load balancer's health checks start failing because the instances are overloaded.
3. ECS, acting on the failed load balancer health checks, terminates the failing instances and replaces them with new ones. The auto scaler also detects the high CPU usage across the cluster, so more instances are started than there were previously.
4. Now there are zero instances that can handle requests, and the service is completely down until the new instances are ready.
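A toy simulation makes the cascade easier to see. It is a minimal sketch under simplifying assumptions that are not part of the original setup: time advances in discrete ticks, each task serves a fixed `capacity_per_task` requests per tick, any overload makes every running task fail its health check at once, and the replacement fleet (sized to the observed load) starts serving `startup_ticks` ticks after the failure. The function and parameter names are hypothetical.

```python
import math
from typing import List


def simulate(load: int, tasks: int, capacity_per_task: int,
             startup_ticks: int, ticks: int) -> List[int]:
    """Return the requests actually served in each tick (toy model)."""
    running = tasks
    starting: List[List[int]] = []  # [ticks until ready, task count] per batch
    served = []
    for _ in range(ticks):
        # Batches whose startup has finished begin serving traffic.
        still_starting = []
        for ticks_left, count in starting:
            if ticks_left <= 1:
                running += count
            else:
                still_starting.append([ticks_left - 1, count])
        starting = still_starting

        capacity = running * capacity_per_task
        served.append(min(load, capacity))

        if running > 0 and load > capacity:
            # Overloaded tasks fail their health checks and are all replaced;
            # the autoscaler sizes the replacement fleet to the observed load.
            starting.append([startup_ticks, math.ceil(load / capacity_per_task)])
            running = 0
    return served


print(simulate(300, 2, 100, 3, 6))  # [200, 0, 0, 300, 300, 300]
```

With 300 requests per tick against two tasks that handle 100 each, the zeros are the outage window: the old tasks have been killed and the new ones are not ready yet. Left alone, the overloaded fleet would still have served 200 requests per tick during that window.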
Obviously this is a bad state of affairs. What should be done instead?
The tools in question are AWS ECS on Fargate and an AWS Application Load Balancer. Their configuration options seem to consist mostly of tuning numeric values (thresholds, intervals, timeouts) rather than changing the underlying behavior.
The health check is very simple: it loads a static file. It essentially measures whether the web server in each instance is alive or dead.
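A static-file check can be modelled as a small state machine. This sketch assumes a check fails when the response takes longer than a timeout, and that a target flips to unhealthy after `unhealthy_threshold` consecutive failures and back to healthy after `healthy_threshold` consecutive successes, mirroring the ALB target group settings `UnhealthyThresholdCount` and `HealthyThresholdCount`. The helper names are hypothetical. What it illustrates is that a busy but alive server, whose static file is merely slow to arrive, looks exactly like a dead one.

```python
from typing import List


def checks_from_latency(latencies_s: List[float], timeout_s: float) -> List[bool]:
    """A check passes only if the static file arrives before the timeout."""
    return [latency < timeout_s for latency in latencies_s]


def health_states(checks: List[bool], healthy_threshold: int,
                  unhealthy_threshold: int, start_healthy: bool = True) -> List[bool]:
    """Track the target's health state after each check result."""
    healthy = start_healthy
    streak = 0  # consecutive results that disagree with the current state
    states = []
    for ok in checks:
        if ok == healthy:
            streak = 0
        else:
            streak += 1
            threshold = healthy_threshold if ok else unhealthy_threshold
            if streak >= threshold:
                healthy = ok
                streak = 0
        states.append(healthy)
    return states


checks = checks_from_latency([0.01, 0.02, 6.0, 7.5, 0.03], timeout_s=5.0)
print(checks)  # [True, True, False, False, True]
print(health_states(checks, healthy_threshold=5, unhealthy_threshold=2))
# [True, True, True, False, False]
```

Two slow responses in a row are enough to mark the target unhealthy, even though the web server never stopped answering. In the scenario above, ECS then replaces the task before it has a chance to recover.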
The auto scaler is set up to measure CPU load and scale up within 30 seconds.
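CPU-based target tracking can be approximated as proportional scaling: choose the task count at which average CPU would land on the target. This is an assumed simplification, not AWS's exact algorithm, and `desired_tasks` is a hypothetical helper. It does expose one limitation of CPU as the scaling signal during a surge: utilisation saturates at 100%, so the metric cannot report how far demand exceeds capacity.

```python
import math


def desired_tasks(current_tasks: int, avg_cpu_percent: float,
                  target_cpu_percent: float, min_tasks: int = 1,
                  max_tasks: int = 100) -> int:
    """Assumed proportional rule: scale so average CPU would hit the target."""
    # CPU cannot read above 100%, so demand beyond that is invisible here.
    observed = min(avg_cpu_percent, 100.0)
    desired = math.ceil(current_tasks * observed / target_cpu_percent)
    # Clamp to the service's configured bounds.
    return max(min_tasks, min(max_tasks, desired))
```

Suppose 4 tasks receive three times the traffic they can handle, enough to keep 24 tasks at 50% CPU. The metric reads 100%, so `desired_tasks(4, 100, 50)` asks for only 8 tasks, and several scaling rounds are needed before capacity catches up with demand.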
I don't see how to improve the architecture using this set of tools. Is there a better way?
On a conceptual level, I think what I am looking for is called "Quality of Service" (QoS): a single user should not be able to disrupt availability for other users. It is similar to defending against a DDoS attack, except that the traffic is legitimate.
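One common way to get per-user QoS is to rate-limit each client separately, so that a heavy user exhausts only their own allowance. The sketch below is a per-user token bucket. The class name, the `user_id` key, and the rate and burst values are hypothetical, and the clock is injectable so the behaviour can be checked deterministically.

```python
import time
from typing import Callable, Dict, Tuple


class PerUserRateLimiter:
    """Token bucket per user: `rate_per_s` tokens refill per second, up to `burst`."""

    def __init__(self, rate_per_s: float, burst: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate_per_s
        self.burst = burst
        self.clock = clock
        # user_id -> (tokens remaining, time of last update)
        self.buckets: Dict[str, Tuple[float, float]] = {}

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        # New users start with a full bucket.
        tokens, last = self.buckets.get(user_id, (self.burst, now))
        # Refill in proportion to elapsed time, capped at the burst size.
        tokens = min(self.burst, tokens + (now - last) * self.rate)
        if tokens >= 1:
            self.buckets[user_id] = (tokens - 1, now)
            return True
        self.buckets[user_id] = (tokens, now)
        return False


# Deterministic demo with a fake clock.
fake_time = [0.0]
limiter = PerUserRateLimiter(rate_per_s=1, burst=2, clock=lambda: fake_time[0])
print([limiter.allow("alice") for _ in range(3)])  # [True, True, False]
print(limiter.allow("bob"))  # True: alice's burst does not affect bob
fake_time[0] = 1.0
print(limiter.allow("alice"))  # True: one token refilled after 1 second
```

A rejected request would typically be answered with HTTP 429 (Too Many Requests). Because this state lives in a single process, each task would enforce its own limit; a limit shared across all tasks would need shared state.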