I use Terraform to set up an ASG for my worker servers running Sidekiq. On deploy, when the AMI image_id changes, the instances need a long graceful shutdown (max. 30m) to finish job processing before they terminate.

My idea to accomplish that is to:

  1. Use the autoscaling module with an initial lifecycle hook on the autoscaling:EC2_INSTANCE_TERMINATING transition, which fires when the old ASG is removed and a new ASG is created on deploy.

  2. Send the transition state to an SQS queue.

  3. Set up an AWS Lambda function that receives the state messages from SQS and sends a remote command via SSM to start a graceful shutdown on the EC2 instance. The graceful shutdown script running on the instance sends SIGTERM to Sidekiq, waits until Sidekiq has finished, and then runs shutdown -h now to stop the instance.

  4. The Lambda exits immediately and does not wait for the shutdown script to finish, which is expected. It sends an SNS notification on success and on failure. Now, the problem is that Terraform waits until the ASG has actually been removed and hits the delete timeout, which is 10m by default. I would like Terraform to continue as soon as the ASG has been marked for removal, without waiting for the removal to finish.

I have force_delete=true and wait_for_capacity_timeout=0 options set up.
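For reference, a minimal sketch of what such a setup might look like. It uses the plain aws_autoscaling_group resource instead of the autoscaling module to keep it short, and every variable, resource name, and instance type below is a hypothetical placeholder and not taken from my actual configuration:

```hcl
# Hypothetical inputs; all names here are placeholders for illustration.
variable "ami_id" { type = string }
variable "subnet_ids" { type = list(string) }
variable "hook_queue_arn" { type = string } # SQS queue that receives lifecycle messages
variable "hook_role_arn" { type = string }  # IAM role that lets Auto Scaling publish to the queue

resource "aws_launch_template" "worker" {
  name_prefix   = "sidekiq-worker-"
  image_id      = var.ami_id
  instance_type = "t3.medium" # assumed instance type
}

resource "aws_autoscaling_group" "worker" {
  # Putting the AMI id in the name forces a new ASG whenever the AMI changes.
  name                = "sidekiq-worker-${var.ami_id}"
  vpc_zone_identifier = var.subnet_ids
  min_size            = 1
  max_size            = 2

  launch_template {
    id      = aws_launch_template.worker.id
    version = aws_launch_template.worker.latest_version
  }

  # The two options mentioned above.
  force_delete              = true
  wait_for_capacity_timeout = "0"

  # Hold terminating instances for up to 30 minutes (1800 seconds)
  # so the shutdown script can let Sidekiq finish its jobs.
  initial_lifecycle_hook {
    name                    = "graceful-shutdown"
    lifecycle_transition    = "autoscaling:EC2_INSTANCE_TERMINATING"
    heartbeat_timeout       = 1800
    default_result          = "CONTINUE"
    notification_target_arn = var.hook_queue_arn
    role_arn                = var.hook_role_arn
  }

  lifecycle {
    create_before_destroy = true
  }
}
```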

From docs:

force_delete - (Optional) Allows deleting the autoscaling group without waiting for all instances in the pool to terminate. You can force an autoscaling group to delete even if it's in the process of scaling a resource. Normally, Terraform drains all the instances before deleting the group. This bypasses that behavior and potentially leaves resources dangling.

Terraform exits after 10 minutes with the following error message:

Error: Error deleting autoscaling group: Auto Scaling Group still exists

Why isn't it working? Do you think it is a bug?

ahes

1 Answer

I think the issue is that the ASG won't delete itself until all the instances are removed, and it's waiting for the hooks to finish running.

Thoughts:

1) Unless you need the queueing mechanism of SQS for something, it's usually easier to trigger the Lambda via a CloudWatch (CW) event.

2) When sending the shutdown -h command, you may also want to have it send a command that completes the lifecycle hook, so that the ASG can finish the scaling activity without waiting for the hook to time out: https://docs.aws.amazon.com/cli/latest/reference/autoscaling/complete-lifecycle-action.html

3) The wait_for_capacity_timeout setting you are changing applies to launching instances. The docs describe it as "A maximum duration that Terraform should wait for ASG instances to be healthy before timing out", which implies it only covers launches: https://www.terraform.io/docs/providers/aws/r/autoscaling_group.html

4) Try changing the 'delete' timeout: https://www.terraform.io/docs/providers/aws/r/autoscaling_group.html#delete
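As a rough sketch of that last suggestion, the delete timeout is set in a timeouts block on the ASG resource. The 40m value is just an assumption that leaves some headroom over the 30-minute shutdown window, and the variables and resource name are hypothetical placeholders:

```hcl
# Hypothetical inputs; names are placeholders for illustration.
variable "launch_template_id" { type = string }
variable "subnet_ids" { type = list(string) }

resource "aws_autoscaling_group" "worker" {
  name                = "sidekiq-worker"
  vpc_zone_identifier = var.subnet_ids
  min_size            = 1
  max_size            = 2

  launch_template {
    id      = var.launch_template_id
    version = "$Latest"
  }

  # Give Terraform longer than the default 10m to see the group disappear.
  # 40m is an assumed value: the 30m job window plus some slack.
  timeouts {
    delete = "40m"
  }
}
```

This does not make Terraform skip the wait; it only keeps the apply from failing while the lifecycle hook is still holding the instances.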

Shahad