Kubernetes: a job is moved to another pod

1

A long-running job (45 h) was moved to another pod, causing it to restart.

From the logs I can see that the job received a SIGTERM; it was then restarted in another pod, probably on another node as well.

The information retrieved from Google Cloud is not helping: neither the YAML nor the Events pages describe this event, apart from the pod creation.

The job's YAML shows creationTimestamp: 2019-06-15T10:39:25Z

The pod's YAML shows creationTimestamp: 2019-06-17T13:26:25Z

I mostly use the default configuration (1.12.6-gke.11) with several nodes, and the servers are not preemptible.

Is this a default behavior of k8s? If it is, how can I disable it?

should_be_working

Posted 2019-06-17T15:06:46.553

Reputation: 13

Are you using cluster autoscaling? Does the pod request adequate resources - i.e. was it evicted (showing status "Evicted"), or simply moved because of an issue with its node? Do you have node automatic upgrades enabled? Do you have a PodDisruptionBudget for the pod? – John – 2019-06-19T01:51:29.557

We are using autoscaling. There was no "Evicted" status, and if it was a node problem we didn't see it in the Google Cloud console. We have automatic upgrades enabled, and the pod has no PodDisruptionBudget. This is a recurring problem. – should_be_working – 2019-06-20T14:11:36.047

Answers

0

Since you've said that you're using cluster autoscaling, I'm going to assume that the pod is getting removed because the cluster is being scaled in. We saw a similar issue because we run video transcoding jobs on a node pool scaled to zero (which then scales out as jobs are added).

Looking into it, we found the cluster autoscaler documentation and then modified our jobs accordingly:

What types of pods can prevent CA from removing a node?

  • Pods with restrictive PodDisruptionBudget.

  • Kube-system pods that:

    • are not run on the node by default, *
    • don't have a pod disruption budget set or their PDB is too restrictive (since CA 0.6).

  • Pods that are not backed by a controller object (so not created by deployment, replica set, job, stateful set etc). *

  • Pods with local storage. *

  • Pods that cannot be moved elsewhere due to various constraints (lack of resources, non-matching node selectors or affinity, matching anti-affinity, etc)

  • Pods that have the following annotation set: "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
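
As a side note on the first item: a restrictive PodDisruptionBudget will also keep the autoscaler from draining the node a pod runs on, so it is another way to protect a long job. A minimal sketch, assuming the job's pods carry an app: long-job label (the name and label are placeholders):

    # Hypothetical PDB: with minAvailable: 1 and a single matching pod,
    # all voluntary evictions (including autoscaler scale-in) are blocked.
    apiVersion: policy/v1beta1   # the PDB API version on 1.12-era clusters
    kind: PodDisruptionBudget
    metadata:
      name: long-job-pdb
    spec:
      minAvailable: 1
      selector:
        matchLabels:
          app: long-job          # placeholder; must match the pod's labels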

It was the last one, the safe-to-evict annotation, that did the trick for us; a sketch of where it goes is below. I recommend using it as a starting point.
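
A minimal sketch of where the annotation goes; note that it belongs on the Job's pod template, not on the Job object itself (the job name, container, and image are placeholders):

    apiVersion: batch/v1
    kind: Job
    metadata:
      name: long-running-job                  # placeholder name
    spec:
      template:
        metadata:
          annotations:
            # Tells the cluster autoscaler never to evict this pod
            # when scaling the node pool in.
            "cluster-autoscaler.kubernetes.io/safe-to-evict": "false"
        spec:
          containers:
          - name: worker                      # placeholder container
            image: my-registry/worker:latest  # placeholder image
          restartPolicy: Never

Since the annotation lives on the pod template, every pod the Job creates carries it, and the autoscaler will keep the node around until the pod finishes.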

John

Posted 2019-06-17T15:06:46.553

Reputation: 136

OK, thanks. We'll try that and come back with results (this might take some days) – should_be_working – 2019-06-20T14:30:08.413

@should OK :) I hope it helps. – John – 2019-06-20T14:30:37.433

Yes, the annotation fixed the problem. Thanks – should_be_working – 2019-06-25T11:44:45.143