GKE can't schedule newly created pods that demand GPU on newly added nodes with GPUs

Question

When I add a new node pool with GPUs, Google Kubernetes Engine does not schedule newly created pods that request GPUs onto the new nodes. I expected this to happen automatically, but apparently it does not for GPU resources: the new pods stay in the 'Pending' state forever. How can I fix this?

EDIT: Here is the deployment YAML file. I would prefer not to bind the deployment to a specific node:

    ---
    apiVersion: machinelearning.seldon.io/v1alpha2
    kind: SldDeployment
    metadata:
      labels:
        app: sld
      name: trs-sld
      namespace: trs
    spec:
      annotations:
        project_name: Trs
        deployment_version: v1.0
        seldon.io/rest-connect-retries: '5'
        seldon.io/grpc-connect-retries: '5'
        seldon.io/istio-retries: '10'
        seldon.io/istio-retries-timeout: '12'
      name: trs
      predictors:
      - componentSpecs:
        - spec:
            containers:
            - image: eu.gcr.io/trs-141513/trs-native:latest
              imagePullPolicy: Always
              name: classifier
              resources:
                limits:
                  nvidia.com/gpu: 2
              volumeMounts:
                - mountPath: /etc/google_storage/creds
                  name: service-account-creds
                  readOnly: true
            volumes:
              - name: service-account-creds
                secret:
                  secretName: service-account-creds
            terminationGracePeriodSeconds: 20
        graph:
          children: []
          name: classifier
          endpoint:
            type: REST
          type: MODEL
        name: model
        replicas: 1
        annotations:
          predictor_version: v1.0
    ---

Welcome to Server Fault. If you would like more information, you could try kubectl describe pod -n namespace podname — c4f4t0r, Jul 17 '20 at 08:22
Thank you. Yes, when I do that the description says 'insufficient gpus', so the newly added node's GPUs are not being used — Elras, Jul 17 '20 at 08:27
But are you using any nodeSelector to bind your deployment to the GPU node? Could you please show your YAML files? — c4f4t0r, Jul 17 '20 at 08:56
Oh no, I am not using any nodeSelector at all to bind the deployment to a GPU node. That could fix the problem, but I was wondering: can new pod scheduling be made automatic as we add more GPU nodes? — Elras, Jul 17 '20 at 09:19

score 1 · Accepted Answer · answered Jul 29 '20 at 10:38

It turns out that the GPU drivers must be installed on each newly added node before its GPUs can be scheduled. For example, for nodes running the Ubuntu image:

    kubectl apply -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/master/nvidia-driver-installer/ubuntu/daemonset-preloaded.yaml
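To check whether the installer took effect, one option is to look at what each node advertises to the scheduler and then re-inspect the pending pod. The following is a minimal sketch, assuming kubectl is already configured for the cluster. The function names are hypothetical helpers, not part of the original answer.

```shell
# Hypothetical helper: list every node with the number of GPUs it
# currently advertises as allocatable. A node whose GPU column is
# empty or <none> is not yet offering GPUs to the scheduler.
list_gpu_capacity() {
  kubectl get nodes \
    "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
}

# Hypothetical helper: show the scheduling events for one pod, as
# suggested in the comments. Arguments: <namespace> <pod-name>.
describe_pending_pod() {
  kubectl describe pod -n "$1" "$2"
}
```

Calling list_gpu_capacity before and after applying the DaemonSet should show whether the new nodes have started reporting nvidia.com/gpu. If a pod is still pending, describe_pending_pod trs followed by the pod name prints the scheduler's reason in the Events section.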
