
This has happened a couple of times since we moved our cluster project from Google to AWS.

We have an EFS volume that's mounted on a load-balanced cluster in a Beanstalk project.

I will be in the middle of setting something up, either uploading a large ZIP file to that EFS volume (via an instance in the load-balanced cluster) or unzipping one from an SSH session on a cluster instance, when I suddenly find the instance ripped out from under me: the cluster has spun up two (or more) new instances and is shutting down the one I was accessing.

What is going on here? The instances are all t2.micro; are they inadequate for the sustained load and running out of burst capacity? Has anybody seen anything like this?

hbquikcomjamesl
  • How large is your EFS filesystem (in GB/TB)? Are you using provisioned throughput on EFS? How do the CloudWatch metrics on EFS look after these incidents occur? – Michael - sqlbot Feb 19 '19 at 01:56
  • You need to look at your CloudWatch metrics and give us additional info. Generally I wouldn't use t2/t3 instances in an auto scaling group, as they have variable performance; e.g., when you run out of CPU credits their performance drops. Why the instance is being deleted depends on your configuration, and AWS is very configurable. – Tim Feb 19 '19 at 02:46
  • Leading up to the time the proverbial rug was pulled out from under me, CloudWatch on the EFS shows first a downward spike in "Burst Credit Balance" (though the entire range shown on that graph's vertical axis is 2.31T to 2.31T [?!?]) and a spike in "Percent IO Limit Average" of 2.31%. Then, for about an hour and 45 minutes, apparently coinciding with my upload, "Data Write IO Bytes Average" shows 1.05M. Then "Client Connections Sum" spikes at 20, right around the time the new instances are spawned and the one I was using is shut down. – hbquikcomjamesl Feb 19 '19 at 17:39
  • The "Metered Size" is 399.3M. The Performance Mode is "General Purpose"; the Throughput Mode" is "Bursting." – hbquikcomjamesl Feb 19 '19 at 17:40

1 Answer


So you've got this t2.micro in an Auto Scaling Group (ASG), I assume?

And this ASG is configured to scale up/down based on average CPU load?

You overload it with some large ZIP file manipulation and run out of CPU credits; CloudWatch notices the average CPU load going above the threshold, and the ASG starts a new instance. As expected.

That brings the average CPU load back down, and the ASG terminates the longest-running instance (the one you're working on). Also as expected.
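You can confirm both halves of that story by looking at what the group is actually configured to do. A minimal boto3 sketch, assuming a hypothetical ASG name (Elastic Beanstalk generates the real one, so look it up in the EC2 console first); it prints the termination policy and the scaling policies with the alarms that trigger them:

```python
# Sketch: inspect the ASG's termination policy and its scaling policies/alarms.
# The ASG name below is a hypothetical placeholder.
import boto3

autoscaling = boto3.client("autoscaling")
asg_name = "my-beanstalk-asg"  # hypothetical; use your environment's actual ASG name

group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[asg_name]
)["AutoScalingGroups"][0]
print("Termination policies:", group["TerminationPolicies"])
print("Min/Desired/Max:", group["MinSize"], group["DesiredCapacity"], group["MaxSize"])

policies = autoscaling.describe_policies(AutoScalingGroupName=asg_name)["ScalingPolicies"]
for policy in policies:
    print(policy["PolicyName"], policy.get("AdjustmentType"), policy.get("ScalingAdjustment"))
    for alarm in policy.get("Alarms", []):
        print("  triggered by alarm:", alarm["AlarmName"])
```

The TerminationPolicies list shows which instance the group picks when it scales in; with the default policy that tends to be an older instance, which in a small group is usually the one you were logged into.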

  1. I guess your scale-up/scale-down thresholds are too close to each other (maybe you've got scale up when load > 60% and scale down when load < 50%). Configure a bigger gap, e.g. 60% / 30% (see the first sketch after this list).

  2. Don't overload T2/T3 instances. Use T2/T3 Unlimited, or use some other instance type like M4, M5 or C5 that doesn't rely on CPU credits and provides consistent performance (see the second sketch below).

  3. Treat instances in an ASG as immutable. You should never need to log in to an instance in an ASG; all of their configuration should be done automatically through Launch Configuration / user-data scripts, because you never know when they will start or stop (see the last sketch below).
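For point 1, widening the gap means editing the CloudWatch alarms that drive the two scaling policies. A minimal boto3 sketch, assuming hypothetical alarm names and policy ARNs (Beanstalk names the real ones after the environment); it re-creates the alarms with a 60% scale-up and 30% scale-down threshold:

```python
# Sketch: widen the gap between the scale-up and scale-down CPU alarms.
# Alarm names and policy ARNs below are hypothetical placeholders.
import boto3

cloudwatch = boto3.client("cloudwatch")
asg_name = "my-beanstalk-asg"                                                    # hypothetical
scale_up_policy_arn = "arn:aws:autoscaling:...:scalingPolicy:...:scale-up"       # placeholder
scale_down_policy_arn = "arn:aws:autoscaling:...:scalingPolicy:...:scale-down"   # placeholder

common = dict(
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": asg_name}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
)

# Scale up when average CPU stays above 60%.
cloudwatch.put_metric_alarm(
    AlarmName="my-asg-cpu-high",
    ComparisonOperator="GreaterThanThreshold",
    Threshold=60.0,
    AlarmActions=[scale_up_policy_arn],
    **common,
)

# Scale down only when average CPU drops below 30%.
cloudwatch.put_metric_alarm(
    AlarmName="my-asg-cpu-low",
    ComparisonOperator="LessThanThreshold",
    Threshold=30.0,
    AlarmActions=[scale_down_policy_arn],
    **common,
)
```

In a Beanstalk environment the same change is usually made through the aws:autoscaling:trigger UpperThreshold / LowerThreshold option settings rather than by editing the alarms directly.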
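For point 2, existing T2/T3 instances can be flipped to Unlimited without replacing them. A minimal boto3 sketch, again assuming the hypothetical ASG name from above:

```python
# Sketch: switch the running T2/T3 instances in the group to "unlimited" CPU credits.
# Instance IDs are looked up from the (hypothetical) ASG named above.
import boto3

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")

group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=["my-beanstalk-asg"]  # hypothetical name
)["AutoScalingGroups"][0]
instance_ids = [i["InstanceId"] for i in group["Instances"]]

ec2.modify_instance_credit_specification(
    InstanceCreditSpecifications=[
        {"InstanceId": iid, "CpuCredits": "unlimited"} for iid in instance_ids
    ]
)
```

Instances the ASG launches later will revert to whatever the launch template or account default specifies, so make the change persistent there as well.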
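For point 3, the manual work (mounting EFS, pulling in data) belongs in the instance's user data, so every new instance comes up ready without anyone logging in. A minimal sketch, assuming a hypothetical launch template name, AMI ID and EFS filesystem ID, and the amazon-efs-utils mount helper on Amazon Linux:

```python
# Sketch: bake the EFS mount into user data via a launch template,
# so no manual setup is needed on instances the ASG creates or destroys.
# Names, AMI ID and filesystem ID are hypothetical placeholders.
import base64
import boto3

ec2 = boto3.client("ec2")

user_data = """#!/bin/bash
yum install -y amazon-efs-utils
mkdir -p /mnt/efs
mount -t efs fs-12345678:/ /mnt/efs
echo 'fs-12345678:/ /mnt/efs efs defaults,_netdev 0 0' >> /etc/fstab
"""

ec2.create_launch_template(
    LaunchTemplateName="cluster-with-efs",           # hypothetical
    LaunchTemplateData={
        "ImageId": "ami-0123456789abcdef0",          # placeholder AMI
        "InstanceType": "t3.small",
        "CreditSpecification": {"CpuCredits": "unlimited"},
        "UserData": base64.b64encode(user_data.encode()).decode(),
    },
)
```

In a Beanstalk environment the equivalent is usually an .ebextensions config file that performs the mount, but the principle is the same: nothing gets set up by hand over SSH.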

Hope that helps :)

MLu
  • Thanks, @MLu. And yes, there is an ASG. Understand that I'm not the one who set any of this up, and I don't know (and can't seem to find) where to set the instance type or the "T2/T3 Unlimited" option on this thing. But I definitely agree that doing this through a cluster node, instead of through a separate maintenance instance, or by finding a way to go directly to the EFS volume, is not a good idea, and not something I want to keep doing long-term. – hbquikcomjamesl Feb 19 '19 at 17:59
  • I am, at this moment, uploading to the EFS volume through a maintenance instance I created by cloning a cluster instance with "Launch more like this" on the "Actions" menu. I checked our ASGs, and it doesn't show up in any of them. Outlook: hopeful. – hbquikcomjamesl Feb 19 '19 at 18:46