
We have a simple 3-node Kubernetes cluster on Azure Kubernetes Service (AKS). Most of the time that's more than sufficient infrastructure-wise, but occasionally we need to be able to scale out to 50 instances of a service for a couple of hours, then scale back again.

Combining AKS with Azure Container Instances (ACI) via the Virtual Kubelet seems like an ideal solution for this scenario.

From a cost management perspective, we'd prefer workloads to run on our own VMs when there's available capacity. We're already paying for them, so there's no point in paying for ACI instances on top.

Question 1

If we scale out via the Virtual Kubelet using ACI, how does Kubernetes choose which pods to remove when scaling back in later? Is its approach consistent with our cost management requirement - i.e. "genuine" nodes are preferred - or could it be made consistent?

Question 2

Scenario: we run two apps - "App1" and "App2" - in Kubernetes. App1 is the one that causes the need to scale out to 50 instances via ACI; App2 is running on the "genuine" nodes.

Say we then use Helm to update App2. When it restarts, I'm guessing Kubernetes could place it on an ACI instance, since we're scaled out at that point.

Later, the need for the 50 instances of App1 goes away and we scale back again.

App2 could still be running on ACI instances at this point, when we'd prefer it to be back on the genuine nodes.

Would Kubernetes manage this in line with our cost management requirements, or is some extra shepherding required?

  • It looks like taints and tolerations will play a role in scaling *out* (i.e. `PreferNoSchedule`), but it's still not clear how scaling *in* works. – Chris Wood Nov 04 '18 at 11:05
  • Do you know when this is supposed to happen? This looks like a good fit for a CronJob resource in that case. – 4c74356b41 Nov 05 '18 at 17:16
  • Unfortunately not - it's not a time-based thing but a queue-based thing. – Chris Wood Nov 08 '18 at 11:24
  • Yeah, I don't think you can change your placement constraints on scale (I mean, you have to change them before scaling takes place). – 4c74356b41 Nov 08 '18 at 11:25

1 Answer


Answering my own question - a `PreferNoSchedule` taint on the Virtual Kubelet node will deter pods from being scheduled on the more expensive ACI resources when scaling out, but the scheduler isn't involved when scaling back in.
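For illustration, here's roughly what that taint looks like on the virtual node object. This is a sketch: the node name (`virtual-node-aci-linux`) and the taint key/value follow common AKS Virtual Kubelet defaults, so check what your own virtual node actually registers with.

```yaml
apiVersion: v1
kind: Node
metadata:
  name: virtual-node-aci-linux   # assumed name of the ACI virtual node
spec:
  taints:
  # PreferNoSchedule is a "soft" taint: the scheduler avoids the node
  # while the genuine nodes have capacity, but will still place pods
  # here when they don't.
  - key: virtual-kubelet.io/provider
    value: azure
    effect: PreferNoSchedule
```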

The "victim" is chosen as a result of the following tests:

  1. Unassigned (unscheduled) pods are deleted before assigned ones.
  2. Pod phase: Pending pods are deleted before Unknown, and Unknown before Running.
  3. Not-ready pods are deleted before ready ones.
  4. Among ready pods, the ones that became ready more recently are deleted first.
  5. Pods with higher container restart counts are deleted first.
  6. More recently created pods are deleted before older ones.

See:

https://github.com/kubernetes/kubernetes/blob/886e04f1fffbb04faf8a9f9ee141143b2684ae68/pkg/controller/controller_utils.go#L726

In short, there's no real way of controlling the victims in a scale-down scenario without resorting to hacky approaches that invite race conditions.

In the end, we went a different way: two separate deployments of the same application, plus a taint on the Virtual Kubelet node (both manifests are sketched below the list):

  1. One deployment that has no toleration for the Virtual Kubelet taint, with a minimum replica count of 1 and a maximum of 5.

  2. Another deployment that has a toleration for the Virtual Kubelet taint and also a nodeSelector targeting it. The default replica count is 0, but we can obviously scale it up when required.
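As a rough sketch of those two manifests - the names, labels, and image are hypothetical, and the toleration key assumes the Virtual Kubelet taint shown earlier:

```yaml
# Deployment 1: for the genuine nodes. It has no toleration for the
# Virtual Kubelet taint; with a hard NoSchedule taint it can never land
# on ACI (with PreferNoSchedule it is only strongly discouraged).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1-vm                  # hypothetical name
spec:
  replicas: 1                    # our monitor scales this between 1 and 5
  selector:
    matchLabels:
      app: app1
      tier: vm
  template:
    metadata:
      labels:
        app: app1
        tier: vm
    spec:
      containers:
      - name: app1
        image: example.azurecr.io/app1:latest   # hypothetical image
---
# Deployment 2: burst capacity. Tolerates the Virtual Kubelet taint AND
# node-selects the virtual node, so it can only land on ACI.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1-aci                 # hypothetical name
spec:
  replicas: 0                    # scaled up on demand, back to 0 afterwards
  selector:
    matchLabels:
      app: app1
      tier: aci
  template:
    metadata:
      labels:
        app: app1
        tier: aci
    spec:
      nodeSelector:
        type: virtual-kubelet    # assumed label on the virtual node
      tolerations:
      - key: virtual-kubelet.io/provider
        operator: Exists         # tolerate the taint regardless of value/effect
      containers:
      - name: app1
        image: example.azurecr.io/app1:latest
```

With a hard NoSchedule taint on the virtual node, the two deployments have disjoint scheduling constraints, so scaling the ACI deployment back to 0 only ever removes ACI-hosted pods - exactly the victim control the stock scale-down logic doesn't give you.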

Outside of that, we have our own microservice that monitors the length of an Azure Service Bus queue and makes scaling decisions for the two deployments.
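The monitor itself just adjusts `spec.replicas` on the two deployments through the Kubernetes API - effectively `kubectl scale deployment app1-aci --replicas=50` when the queue backs up, and back to 0 once it drains (deployment name hypothetical, as above).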

Not quite as elegant as we'd hoped, but it does give us complete control over what lives where in our cluster.

Hope this helps somebody!
