
We have a number of services running in ECS, each configured to run at least two instances. For some of the services, I notice that one of the instances gets de-registered at irregular intervals. There are no errors in the logs, and the health check never fails. Why would ECS de-register a task that seems to be running perfectly fine? Is there a way to find out the reason?

This would make it much easier to decide what needs to be done to stabilize it.

Erik

1 Answer


There are a couple of ways to debug this:

  • Obviously logs are helpful in discovering why an instance became unhealthy. If you're using an ELB with a health check, you'll want to check your access logs to see if the health check endpoint returned an error response. You said that you didn't see anything in the logs, but I figured I would mention this for anyone who sees this answer in the future in case it helps in their case.
  • Check the Events tab on the page for a service that had an instance die - when tasks are registered or deregistered, ECS logs the event to the events list. However, you'll want to make sure to check soon after the event happens since the events list will only display the most recent events.
  • If you have the information page for a task open before the task dies, the container definition area may list information under the exit reason section. Similarly to the events page, the deregistered task will eventually get removed after a certain period of time, so it helps to check soon after the task gets removed.
  • If none of the above works, try creating a CloudWatch dashboard that plots the HTTPCode_ELB_5XX_Count metric for the ALB/ELB sitting in front of the service. Typically these are 504s indicating a timeout (enabling S3 access logging for the ELB will tell you for sure). If a task is dying due to timeouts during the health check, you might find an elevated rate of 5XX responses, so this may point you in the right direction. Do note, however, that such an event will also be logged to the events list for the service.
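
Since the events list and stopped tasks only stick around for a limited time, it can help to pull the same information programmatically and store it somewhere. Below is a rough sketch of that idea, assuming you use boto3 and pass in an ECS client created with `boto3.client("ecs")`. The function names and the cluster/service names are made up for illustration, and pagination is ignored, so treat it as a starting point rather than a finished tool.

```python
from typing import Any, Dict, List


def stopped_task_reasons(ecs: Any, cluster: str, service: str) -> List[Dict[str, Any]]:
    """Collect the stop reason and per-container exit details for the
    stopped tasks of a service that ECS still remembers.

    `ecs` is assumed to be a boto3 ECS client (or anything with the same
    list_tasks / describe_tasks methods).
    """
    arns = ecs.list_tasks(
        cluster=cluster, serviceName=service, desiredStatus="STOPPED"
    ).get("taskArns", [])
    if not arns:
        return []

    tasks = ecs.describe_tasks(cluster=cluster, tasks=arns).get("tasks", [])
    report = []
    for task in tasks:
        report.append(
            {
                "taskArn": task.get("taskArn"),
                "stoppedReason": task.get("stoppedReason"),
                "containers": [
                    {
                        "name": c.get("name"),
                        "exitCode": c.get("exitCode"),
                        "reason": c.get("reason"),
                    }
                    for c in task.get("containers", [])
                ],
            }
        )
    return report


def recent_service_events(
    ecs: Any, cluster: str, service: str, limit: int = 20
) -> List[str]:
    """Return the messages of the first `limit` entries in the service's
    events list, in the order the API returns them."""
    services = ecs.describe_services(cluster=cluster, services=[service]).get(
        "services", []
    )
    if not services:
        return []
    return [e.get("message", "") for e in services[0].get("events", [])[:limit]]


# Hypothetical usage (names are placeholders):
#   import boto3
#   ecs = boto3.client("ecs")
#   print(stopped_task_reasons(ecs, "my-cluster", "my-service"))
#   print(recent_service_events(ecs, "my-cluster", "my-service"))
```

Running something like this on a schedule and writing the output to your own logs means you don't have to be watching the console at the moment a task dies.
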
John Nicely
  • The exit reason indicated that the task gets killed because of memory usage. It's rather inconvenient that ECS removes the stopped task so quickly, since you need to be around when it dies to find out why. In this case it takes several days for memory usage to build up enough before the task gets killed. – Erik Sep 26 '18 at 07:48