
We have an AWS Batch system that processes geospatial imagery out of one S3 bucket and into another. It has an instance policy allowing it to access the buckets. The system kicks off quite a few parallel tasks, and most run only for a few minutes or tens of minutes. Some run considerably longer, but NONE will run for longer than six hours.

After six hours, the Python 3 script they are running throws a TypeError (not a permissions error, not an out-of-memory error, and not any kind of interrupt such as SIGKILL) and exits. The batch job then stops.

We would assume a bug in our script, except that when the exact same scripts, with the exact same inputs, are run on EC2 (or on a physical PC), they run to completion with no errors, even when they run longer than six hours.

We are wondering if there is some internal limit in AWS Batch. There are no long-lived AWS calls happening, the session tokens are renewing themselves, and as far as we can tell we are not hitting any account limits.
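One pattern worth checking, given that the failure is a TypeError rather than an explicit auth error: if any credential or token lookup in the script silently returns None once something expires server-side, the first string operation on that result raises a TypeError with no mention of permissions. A minimal sketch of that failure mode, using a hypothetical get_token() stand-in (not the actual AWS API):

```python
def get_token():
    """Stand-in for a lookup that silently returns None after expiry.

    Hypothetical: most credential caches raise on failure, but a custom
    wrapper that swallows a refresh error can return None instead.
    """
    return None


def build_auth_header():
    token = get_token()
    # Concatenating a str with None raises TypeError, which surfaces
    # in the job log looking like an ordinary script bug.
    return "Bearer " + token


try:
    build_auth_header()
except TypeError as exc:
    print(f"TypeError: {exc}")
```

If something like this is happening, the traceback line in the Batch job log should point at the exact expression that received the None.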

kauer
  • Does your script use the AWS API or need a connection to the internet? We had the same problem once because we forgot to set up a VPC endpoint. – Ilham Sulaksono Jun 02 '19 at 20:09

0 Answers