
From what I read here, ECS capacity providers should (generally) prevent tasks from failing immediately on resource limits by putting them in a "Provisioning" state and spinning up a new EC2 instance:

This means, for example, if you call the RunTask API and the tasks don’t get placed on an instance because of insufficient resources (meaning no active instances had sufficient memory, vCPUs, ports, ENIs, and/or GPUs to run the tasks), instead of failing immediately, the task will go into the provisioning state (note, however, that the transition to provisioning only happens if you have enabled managed scaling for the capacity provider; otherwise, tasks that can’t find capacity will fail immediately, as they did previously).

I've set up an ECS cluster with an autoscaling group and an ECS capacity provider in Terraform. The autoscaling group has min_size = 1 and it immediately spins up a single instance, so I'm confident my launch configuration is fine.

However, when I repeatedly call "RunTask" through the API (tasks with memory=128), tasks immediately fail to start with reason RESOURCE:MEMORY, and no new instances start up.
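For reference, this is roughly how I'm invoking it (a minimal boto3 sketch; the cluster and task definition names are placeholders):

import boto3

ecs = boto3.client("ecs")

# Placeholder names below; the task definition requests memory=128.
response = ecs.run_task(
    cluster="my-cluster",
    taskDefinition="my-task:1",
    count=1,
    launchType="EC2",
)

# Placement failures come back in the "failures" list rather than raising,
# e.g. [{"reason": "RESOURCE:MEMORY", ...}]
print(response["failures"])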

I can't figure out what I've misconfigured.


This was all set up in Terraform:

resource "aws_ecs_cluster" "ecs_cluster" {
  name = local.cluster_name


  setting {
    name  = "containerInsights"
    value = "enabled"
  }
  tags = var.tags
  capacity_providers = [aws_ecs_capacity_provider.capacity_provider.name]

  # I added this in an attempt to make it spin up a new instance
  default_capacity_provider_strategy {
    capacity_provider = aws_ecs_capacity_provider.capacity_provider.name
  }

}

resource "aws_ecs_capacity_provider" "capacity_provider" {
  name = "${var.tags.PlatformName}-stack-${var.tags.Environment}"

  auto_scaling_group_provider {
    auto_scaling_group_arn         = aws_autoscaling_group.autoscaling_group.arn
    managed_termination_protection = "DISABLED"

    managed_scaling {
      maximum_scaling_step_size = 4
      minimum_scaling_step_size = 1
      status                    = "ENABLED"
      target_capacity           = 100
    }
  }

  tags = var.tags
}

#Compute
resource "aws_autoscaling_group" "autoscaling_group" {
  name                      = "${var.tags.PlatformName}-${var.tags.Environment}"
  # If we're not using it, let's not pay for it
  min_size                  = "1"
  max_size                  = var.ecs_max_size
  launch_configuration      = aws_launch_configuration.launch_config.name
  health_check_grace_period = 60
  default_cooldown          = 30
  termination_policies      = ["OldestInstance"]
  vpc_zone_identifier       = local.subnets
  protect_from_scale_in     = false

  tag {
    key                 = "Name"
    value               = "${var.tags.PlatformName}-${var.tags.Environment}"
    propagate_at_launch = true
  }

  tag {
    key                 = "AmazonECSManaged"
    value               = ""
    propagate_at_launch = true
  }

  dynamic "tag" {
    for_each = var.tags
    content {
      key = tag.key
      propagate_at_launch = true
      value = tag.value
    }
  }

  enabled_metrics = [
    "GroupDesiredCapacity",
    "GroupInServiceInstances",
    "GroupMaxSize",
    "GroupMinSize",
    "GroupPendingInstances",
    "GroupStandbyInstances",
    "GroupTerminatingInstances",
    "GroupTotalInstances",
  ]
}

1 Answer


It looks like this is down to a mistake I made while executing "RunTask" through the API (documented here): I had specified launchType and not capacityProviderStrategy.

From the RunTask documentation:

When you use cluster auto scaling, you must specify capacityProviderStrategy and not launchType.

It seems the result of this is that tasks will start if there is capacity, but will fail immediately when there is insufficient capacity, without giving auto scaling a chance to respond.

I was able to get it to work simply by deleting launchType from the RunTask call, since default_capacity_provider_strategy was already set on the cluster.
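For anyone hitting the same thing, the working call looks roughly like the following (a boto3 sketch with placeholder names): either omit launchType entirely so the cluster's default strategy applies, or name the capacity provider explicitly.

import boto3

ecs = boto3.client("ecs")

# Placeholder cluster/task/provider names. Omitting capacityProviderStrategy
# (and launchType) would also work here, because the cluster has a
# default_capacity_provider_strategy configured.
response = ecs.run_task(
    cluster="my-cluster",
    taskDefinition="my-task:1",
    count=1,
    capacityProviderStrategy=[
        {"capacityProvider": "my-capacity-provider", "weight": 1},
    ],
)

# Tasks that can't be placed yet should now go to PROVISIONING instead of
# showing up in "failures" with RESOURCE:MEMORY.
print(response["tasks"], response["failures"])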
