
I am building a render farm using SQS and autoscaling group(s).

I believe my use-case is one of the few where I actually want my group's capacity to match the size of the queue, up to a limit.

Right now, I'm using a target tracking policy that scales based on "BacklogPerInstance", which is just queue_size/group_capacity.

The problem with this approach is that in my case, I want BacklogPerInstance to be 0, which is an invalid target for the scaling policy. I have hacked it by using a target of 0.001, but it's not working very well.
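For context, the backlog metric is published with something along these lines (a rough sketch; the queue URL, group name, and namespace are placeholders, not my real values):

import boto3

sqs = boto3.client("sqs")
asg = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Placeholder names: substitute the real queue URL and ASG name.
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/render-queue"
ASG_NAME = "render-farm-asg"

def publish_backlog_per_instance():
    # Number of visible messages waiting in the queue.
    attrs = sqs.get_queue_attributes(
        QueueUrl=QUEUE_URL,
        AttributeNames=["ApproximateNumberOfMessages"],
    )
    backlog = int(attrs["Attributes"]["ApproximateNumberOfMessages"])

    # Current desired capacity of the group (avoid dividing by zero
    # when the group is scaled down to 0).
    groups = asg.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    capacity = groups["AutoScalingGroups"][0]["DesiredCapacity"] or 1

    # Publish queue_size / group_capacity as a custom CloudWatch metric.
    cloudwatch.put_metric_data(
        Namespace="RenderFarm",
        MetricData=[{
            "MetricName": "BacklogPerInstance",
            "Value": backlog / capacity,
            "Unit": "Count",
        }],
    )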

How do I write an autoscaling policy that keeps the group's capacity at the size of the queue?

Brennan
  • Is your goal to dequeue messages from the SQS queue as fast as the queue is filling up? – Shadi Oct 02 '19 at 14:40
  • Yes. The queue fills with frames that are to be rendered, and they should be cleared as fast as possible, using up to a maximum number of instances. – Brennan Oct 02 '19 at 16:38
  • So your dequeueing code on the EC2 instances is running at 100% CPU if the queueing is faster than the dequeueing. Otherwise, the CPU on some machines would drop to, say, 90%. It seems to me like you just want to scale up if CPU hits 100% and scale down if CPU drops below 90%. Am I missing anything? – Shadi Oct 03 '19 at 07:53
  • Huh, that makes a lot of sense; I overlooked it completely. The one problem I see with this approach is chronic over-scaling: if there are 20 jobs in the queue, it totally makes sense to have exactly 20 instances, and a high-CPU metric for scaling would overscale in that case. It's probably not a big deal, but not ideal. Going to test this out (a rough sketch of that kind of policy is below the comments). – Brennan Oct 03 '19 at 16:31
  • Another consideration here that I'm not certain how to handle: I'd like the ASG to sit at 0 capacity until jobs need to run, but with the predefined CPU metric the scaling alarm stays in an insufficient-data state indefinitely. Does it make sense to create a custom CPU metric, or maybe even a metric that combines CPU and backlog per instance? – Brennan Oct 03 '19 at 17:41
  • Found this answer which has an interesting approach too: https://stackoverflow.com/questions/42890315/is-it-possible-to-have-an-aws-ec2-scale-group-that-defaults-to-0-and-only-contai – Brennan Oct 03 '19 at 17:54
  • That last SO post is really good. Did you consider the Lambda path? If you can send an SNS message with every new SQS queue item, you could trigger a separate Lambda function to render every new frame (or group of frames). Then you don't need to worry about autoscaling at all. But that depends on whether or not your program architecture fits Lambda. – Shadi Oct 04 '19 at 02:19
  • Did you find a solution yet? – Shadi Oct 29 '19 at 15:38
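A minimal sketch of the CPU-based idea from the comments, implemented as a target-tracking policy on the predefined average-CPU metric; the group name and the 90% target are placeholders, and this is just one way the suggestion could be wired up, not something confirmed in the thread:

import boto3

asg = boto3.client("autoscaling")

# Scale on average CPU across the group: add instances while average CPU
# runs above the target, remove them when it falls below.
asg.put_scaling_policy(
    AutoScalingGroupName="render-farm-asg",   # placeholder name
    PolicyName="cpu-target-tracking",
    PolicyType="TargetTrackingScaling",
    TargetTrackingConfiguration={
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "ASGAverageCPUUtilization",
        },
        # Roughly "scale out near 100% CPU, scale in below 90%".
        "TargetValue": 90.0,
    },
)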

1 Answer


If, by 'up to a limit', you mean a limit of 5 or so, you could use a step scaling policy for this. Something along these lines:

SQS queue size | group capacity set by policy
0              | 1
1              | 2
2              | 3
3              | 4
4              | 5
5 and above    | 6
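A rough sketch of how that table could be wired up with boto3. The group, queue, policy, and alarm names are placeholders; the alarm threshold of 0 makes the step bounds line up directly with the queue size:

import boto3

asg = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Step scaling policy: each step sets the group to an exact capacity
# based on how far the queue size is above the alarm threshold (0).
policy = asg.put_scaling_policy(
    AutoScalingGroupName="render-farm-asg",       # placeholder
    PolicyName="queue-size-step-scaling",
    PolicyType="StepScaling",
    AdjustmentType="ExactCapacity",
    MetricAggregationType="Average",
    StepAdjustments=[
        {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 1, "ScalingAdjustment": 1},
        {"MetricIntervalLowerBound": 1, "MetricIntervalUpperBound": 2, "ScalingAdjustment": 2},
        {"MetricIntervalLowerBound": 2, "MetricIntervalUpperBound": 3, "ScalingAdjustment": 3},
        {"MetricIntervalLowerBound": 3, "MetricIntervalUpperBound": 4, "ScalingAdjustment": 4},
        {"MetricIntervalLowerBound": 4, "MetricIntervalUpperBound": 5, "ScalingAdjustment": 5},
        {"MetricIntervalLowerBound": 5, "ScalingAdjustment": 6},
    ],
)

# The policy is driven by an alarm on the queue depth; with a threshold
# of 0, the interval bounds above correspond to the queue size itself.
cloudwatch.put_metric_alarm(
    AlarmName="render-queue-depth",               # placeholder
    Namespace="AWS/SQS",
    MetricName="ApproximateNumberOfMessagesVisible",
    Dimensions=[{"Name": "QueueName", "Value": "render-queue"}],  # placeholder queue
    Statistic="Average",
    Period=60,
    EvaluationPeriods=1,
    Threshold=0,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[policy["PolicyARN"]],
)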
M. Glatki
  • This is not scalable enough for my problem, unfortunately. The farm should be able to have a configurable limit; I've been using up to 50 instances by setting the scaling group's “max” attribute. – Brennan Sep 30 '19 at 14:48