
I am trying to run a machine learning job on GKE, and need to use a GPU.

I created a node pool with Tesla K80, as described in this walkthrough.

I set the minimum node size to 0, and hoped that the autoscaler would automatically determine how many nodes I needed based on my jobs:

gcloud container node-pools create [POOL_NAME] \
--accelerator type=nvidia-tesla-k80,count=1 --zone [COMPUTE_ZONE] \
--cluster [CLUSTER_NAME] --num-nodes 3 --min-nodes 0 --max-nodes 5 \
--enable-autoscaling

Initially, there are no jobs that require GPUs, so the cluster autoscaler correctly downsizes the node pool to 0.

However, when I create a job with the following specification,

resources:
  requests:
    nvidia.com/gpu: "1"
  limits:
    nvidia.com/gpu: "1"

the pod is stuck in Pending with Insufficient nvidia.com/gpu until I manually increase the node pool to at least 1 node.

Here is the full job configuration. (Please note that this configuration is partially auto-generated; I have also removed some environment variables that are not pertinent to the issue.)
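For illustration, here is a stripped-down sketch of what such a GPU-requesting Job can look like (this is not my actual configuration; the job name, container name, image, and command are placeholders):

kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-training-job            # placeholder name
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
      - name: trainer                    # placeholder container name
        image: nvidia/cuda:10.0-base     # placeholder image
        command: ["nvidia-smi"]          # placeholder command
        resources:
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
EOF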

Is this a current limitation of GPU node pools, or did I overlook something?

anna_hope
    Does it scale from 1 to 2 correctly? I'm curious if it's an issue with autoscaler + GPU nodes or an issue of 0 to 1 scaling. – Robert Bailey Apr 09 '19 at 17:03
  • @notnami, Can you share the full job configuration? – Nick_Kh Apr 10 '19 at 10:21
  • @RobertBailey As far as I can tell, the cluster also doesn't autoscale from 0 to 1. But I have only tested that cursorily – anna_hope Apr 11 '19 at 14:53
  • @mk_sta I added a more complete configuration (minus some environment variables that should not be related to the issue) – anna_hope Apr 11 '19 at 15:02
  • GPU autoscaling was added to the cluster autoscaler (see https://github.com/kubernetes/autoscaler/issues/392) so now I'm wondering what version of Kubernetes you are running in your cluster. – Robert Bailey Apr 11 '19 at 15:53
  • @RobertBailey 1.12.6-gke.10 which I believe is quite recent – anna_hope Apr 12 '19 at 15:10
  • @notnami did you solve this issue? I am experiencing the same. – Alessandro Apr 02 '20 at 18:38
  • @Alessandro yes, I solved this. As Maciek Pytel suggested, autoscaling wasn’t working because my Node Autoprovisioning limits prevented new nodes from being spun up. – anna_hope Apr 03 '20 at 13:24

1 Answer


The cluster autoscaler supports scaling GPU nodepools (including to and from 0).

One possible reason for this problem is that you have enabled Node Auto-Provisioning and set resource limits (via the UI or gcloud flags such as --max-cpu, --max-memory, etc.). Those limits apply to ALL autoscaling in the cluster, including nodepools you created manually with autoscaling enabled (see the note in the documentation: https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#resource_limits).

In particular, if you have enabled NAP and you want to autoscale nodepools with GPUs, you need to set resource limits for GPUs as described in https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#gpu_limits.
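For example, a command along these lines sets cluster-wide CPU, memory, and GPU limits for autoprovisioning (the limit values are placeholders to adjust for your cluster; depending on your gcloud version this may require the beta command group):

# Example values only; --max-cpu is in cores and --max-memory in GB.
gcloud container clusters update [CLUSTER_NAME] --zone [COMPUTE_ZONE] \
    --enable-autoprovisioning \
    --max-cpu 64 --max-memory 256 \
    --min-accelerator type=nvidia-tesla-k80,count=0 \
    --max-accelerator type=nvidia-tesla-k80,count=4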

Finally, autoprovisioning also supports GPUs, so (assuming you set the resource limits as described above) you don't actually need to create a nodepool for your GPU workload: NAP will create one for you automatically.

===

Also, for future reference: if the autoscaler fails to create nodes for some of your pods, you can try to debug it using autoscaler events:

  • On your pod (kubectl describe pod <your-pod>) there should be one of two events (it may take a minute for them to show up):
    • TriggeredScaleUp - this means the autoscaler decided to add a node for this pod.
    • NotTriggerScaleUp - the autoscaler spotted your pod, but doesn't think any nodepool can be scaled up to help it. In 1.12 and later the event contains a list of reasons why adding nodes to different nodepools wouldn't help the pod. This is usually the most useful event for debugging.
  • kubectl get events -n kube-system | grep cluster-autoscaler will give you events describing all autoscaler actions (scale-up, scale-down). If a scale-up was attempted but failed for whatever reason, there will also be events describing that. (Both commands are shown together below.)
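Put together, the two checks look like this (<your-pod> is a placeholder for the pending pod's name):

# Look for a TriggeredScaleUp or NotTriggerScaleUp entry in the Events section.
kubectl describe pod <your-pod>

# List all cluster-autoscaler events (scale-ups, scale-downs, failures).
kubectl get events -n kube-system | grep cluster-autoscaler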

Note that events are only kept in Kubernetes for 1 hour after they are created. You can see historical events in Stackdriver by going to the UI, navigating to Stackdriver->Logging->Logs, and choosing "GKE Cluster Operations" in the drop-down.

Finally, you can check the current status of the autoscaler by running kubectl get configmap cluster-autoscaler-status -o yaml -n kube-system.

  • Here is what I've been seeing (just tried again to confirm): '0/5 nodes are available: 3 Insufficient cpu, 5 Insufficient nvidia.com/gpu.' `kubectl get events -n kube-system | grep cluster-autoscaler` shows `No resources found`. There is nothing in the Stackdriver log. I also have node auto-provisioning turned on, so if I've read the docs right, there should be no reason for the pod to be stuck pending at all. But I don't know what else to try, short of resizing the node pool every time by hand. – anna_hope Apr 12 '19 at 20:32
  • One possible cause is if you're hitting resource limits defined for autoscaling. When you enable autoprovisioning you must set resource limits (either in UI, or via --max-cpu / --max-memory gcloud flags) - those limits apply to ALL autoscaling in the cluster, including nodepools you created manually with enabled autoscaling (see note in documentation: https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#resource_limits). – Maciek Pytel Apr 15 '19 at 09:53
  • Can you check if your limits are high enough to allow to create a new node in your nodepool? Especially the GPU limit (reference: https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#gpu_limits). – Maciek Pytel Apr 15 '19 at 09:55
  • thank you for pointing out that the autoprovisioning limits apply at the cluster level -- that wasn't clear to me at first. Now that I have increased those, node autoprovisioning appears to be working. I may be closer to figuring this out! – anna_hope Apr 15 '19 at 21:07
  • So when you say you may be closer to figuring it out does that mean that GPU autoscaling is now working as expected? Or do you still have a problem with it after adding a K80 limit? – Maciek Pytel Apr 16 '19 at 13:47
  • Yes, after testing it for a few days, I can confirm that the autoscaling started working once I increased the overall node-autoprovisioning limit, both for autoprovisioned node pools and ones I created manually. I was confused about node autoprovisioning limits and had thought that those limits applied to a single node pool, not the whole cluster. If you put the comment you left about node auto-provisioning two days ago into your answer, I can approve it! – anna_hope Apr 17 '19 at 18:07