
We have an Auto Scaling group that spawns worker servers, each of which runs Celery processes. We monitor the Celery queue length with CloudWatch and, depending on that length, launch or terminate servers in the group. The answer to this question describes how we do it: Is there a way to use length of a RabbitMQ queue used by Celery to start instance in an autoscale group?

Our termination policy is to kill the oldest server first. Scale-in happens when the queue length stays at zero for 300 consecutive seconds.

The normal setup has 3 servers that are always available. The Auto Scaling group adds servers only when the queue length exceeds a threshold, say 10 jobs in the queue for 30 consecutive seconds.

I have not set up any routing or priorities in my Celery config.

Here is the problem. When a scale-in occurs, I cannot be sure that the host being killed is idle, because all workers are treated equally. Tasks sometimes take 5-10 minutes, and I do not want a server to be killed while it is in the middle of executing a task.

I have not faced any problems so far, but I am afraid some of our customers might run into one because of this.

1 Answer


You can use a lifecycle hook to run custom actions while the instance is in the Terminating:Wait state.


Create a lifecycle hook as per the steps on this page, copied below. In this state a script or Lambda can hold the instance open until all jobs are done. The page I linked to has additional information on cooldown periods.

The Auto Scaling group responds to scale-out events by launching instances and scale-in events by terminating instances.

The lifecycle hook puts the instance into a wait state (Pending:Wait or Terminating:Wait). The instance is paused until either you continue or the timeout period ends.
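As a rough sketch of what registering such a hook for the terminating transition might look like with boto3: the group name, the hook name `drain-celery-worker`, and the 900-second heartbeat timeout are all made-up values for illustration, and the `autoscaling` argument is assumed to be a `boto3.client("autoscaling")` object passed in by the caller.

```python
def terminating_hook_params(group_name, hook_name="drain-celery-worker",
                            heartbeat_timeout=900):
    """Build keyword arguments for the Auto Scaling PutLifecycleHook call.

    hook_name and heartbeat_timeout are illustrative defaults, not values
    taken from the question.
    """
    return {
        "AutoScalingGroupName": group_name,
        "LifecycleHookName": hook_name,
        # Pause instances on scale-in, in the Terminating:Wait state.
        "LifecycleTransition": "autoscaling:EC2_INSTANCE_TERMINATING",
        # Seconds the instance waits before the timeout applies,
        # unless a heartbeat restarts the timer.
        "HeartbeatTimeout": heartbeat_timeout,
        # What happens if the timeout elapses with no completion call.
        "DefaultResult": "CONTINUE",
    }


def register_hook(autoscaling, group_name, **overrides):
    """Register the hook; `autoscaling` is assumed to be a boto3
    Auto Scaling client, e.g. boto3.client("autoscaling")."""
    params = terminating_hook_params(group_name, **overrides)
    autoscaling.put_lifecycle_hook(**params)
    return params
```

Something like `register_hook(boto3.client("autoscaling"), "celery-workers")` would then be run once, where `celery-workers` stands in for the real group name.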

You can perform a custom action using one or more of the following options:

Define a CloudWatch Events target to invoke a Lambda function when a lifecycle action occurs. The Lambda function is invoked when Amazon EC2 Auto Scaling submits an event for a lifecycle action to CloudWatch Events. The event contains information about the instance that is launching or terminating, and a token that you can use to control the lifecycle action.

Define a notification target for the lifecycle hook. Amazon EC2 Auto Scaling sends a message to the notification target. The message contains information about the instance that is launching or terminating, and a token that you can use to control the lifecycle action.

Create a script that runs on the instance as the instance starts. The script can control the lifecycle action using the ID of the instance on which it runs.
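For the first option, a Lambda function would mainly need to pull the instance ID and the token out of the event. The sketch below assumes the event has the usual shape of an Auto Scaling lifecycle action event, with the fields under `detail`; the function names are made up for illustration, and what the handler then does with the result (for example, telling the worker to stop taking new tasks) is left open.

```python
TERMINATING = "autoscaling:EC2_INSTANCE_TERMINATING"


def is_terminating(event):
    """True if the event is for an instance entering Terminating:Wait."""
    return event.get("detail", {}).get("LifecycleTransition") == TERMINATING


def parse_lifecycle_event(event):
    """Extract the fields needed to control the lifecycle action later."""
    detail = event["detail"]
    return {
        "AutoScalingGroupName": detail["AutoScalingGroupName"],
        "LifecycleHookName": detail["LifecycleHookName"],
        "LifecycleActionToken": detail["LifecycleActionToken"],
        "InstanceId": detail["EC2InstanceId"],
    }


def handler(event, context):
    """Hypothetical Lambda entry point: ignore launches, and return the
    parsed action for terminations so a later step can act on it."""
    if not is_terminating(event):
        return None
    return parse_lifecycle_event(event)
```

The returned dictionary uses the same key names that the heartbeat and completion API calls expect, so it can be passed to them as keyword arguments.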

By default, the instance remains in a wait state for one hour, and then the Auto Scaling group continues the launch or terminate process (Pending:Proceed or Terminating:Proceed). If you need more time, you can restart the timeout period by recording a heartbeat. If you finish before the timeout period ends, you can complete the lifecycle action, which continues the launch or termination process.
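Putting the heartbeat and completion calls together, a drain loop could look roughly like this. It is only a sketch: `is_busy` is a hypothetical callable you would supply (for Celery it might check whether `app.control.inspect(destination=[worker_name]).active()` reports any running tasks on that worker), `autoscaling` is assumed to be a boto3 Auto Scaling client, and the polling interval and upper bound are arbitrary.

```python
import time


def drain_then_complete(autoscaling, action, is_busy,
                        poll_seconds=30, max_wait_seconds=1800,
                        sleep=time.sleep):
    """Keep the instance in Terminating:Wait while is_busy() is true,
    then let the termination proceed.

    action: dict with AutoScalingGroupName, LifecycleHookName and either
            InstanceId or LifecycleActionToken.
    is_busy: hypothetical callable returning True while tasks are running.
    Returns the number of seconds spent waiting.
    """
    waited = 0
    while is_busy() and waited < max_wait_seconds:
        # Restart the hook's timeout so the instance is not terminated yet.
        autoscaling.record_lifecycle_action_heartbeat(**action)
        sleep(poll_seconds)
        waited += poll_seconds
    # Worker is idle (or we gave up waiting): continue the termination.
    autoscaling.complete_lifecycle_action(
        LifecycleActionResult="CONTINUE", **action
    )
    return waited
```

For this to help, the worker also has to stop picking up new tasks once the hook fires, otherwise it may never look idle. A warm shutdown of the Celery worker is one way to do that, since it lets running tasks finish first.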

Tim