I am trying to run a machine learning job on GKE, and need to use a GPU.
I created a node pool with a Tesla K80 GPU, as described in this walkthrough.
I set the minimum node count to 0, hoping that the autoscaler would automatically determine how many nodes I need based on my jobs:
gcloud container node-pools create [POOL_NAME] \
--accelerator type=nvidia-tesla-k80,count=1 --zone [COMPUTE_ZONE] \
--cluster [CLUSTER_NAME] --num-nodes 3 --min-nodes 0 --max-nodes 5 \
--enable-autoscaling
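For completeness, describing the pool afterwards should show the resulting autoscaling settings (enabled, minNodeCount, maxNodeCount), using the same placeholders as above:

gcloud container node-pools describe [POOL_NAME] \
    --cluster [CLUSTER_NAME] --zone [COMPUTE_ZONE]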
Initially, there are no jobs that require GPUs, so the cluster autoscaler correctly downsizes the node pool to 0.
However, when I create a job with the following resource specification:
resources:
  requests:
    nvidia.com/gpu: "1"
  limits:
    nvidia.com/gpu: "1"
Here is the full job configuration. (Please note that the configuration is partially auto-generated; I have also removed some environment variables that are not relevant to the issue.)
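For readers who prefer not to open the link: the Job I submit has roughly this shape (the metadata and image below are placeholders, not the real values from my auto-generated configuration):

apiVersion: batch/v1
kind: Job
metadata:
  name: gpu-job                  # placeholder name
spec:
  template:
    spec:
      containers:
      - name: trainer            # placeholder name
        image: "gcr.io/[PROJECT_ID]/trainer:latest"   # placeholder image
        resources:
          requests:
            nvidia.com/gpu: "1"
          limits:
            nvidia.com/gpu: "1"
      restartPolicy: Never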
The pod stays stuck in Pending with Insufficient nvidia.com/gpu until I manually resize the node pool to at least 1 node.
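For reference, that message appears under Events when describing the pending pod (pod name is a placeholder):

kubectl describe pod [POD_NAME]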
Is this a current limitation of GPU node pools, or did I overlook something?