
I defined a test job:

apiVersion: batch/v1
kind: Job
metadata:
  name: testjob
spec:
  activeDeadlineSeconds: 100
  backoffLimit: 3
  template:
    spec:
      containers:
      - name: testjob
        image: bitnami/kubectl:1.20
        imagePullPolicy: IfNotPresent
        command:
        - /bin/sh
        - -c
        - echo "Test" && exit 1
      restartPolicy: Never

All of the pods failed "properly", but the job's DURATION counter won't stop.

$ kubectl get pods,jobs
NAME                                            READY   STATUS    RESTARTS   AGE
pod/testjob-s2cbf                               0/1     Error     0          3m15s
pod/testjob-nhfgn                               0/1     Error     0          3m14s
pod/testjob-8jw74                               0/1     Error     0          3m4s
pod/testjob-jh7hl                               0/1     Error     0          2m24s

NAME                COMPLETIONS   DURATION   AGE
job.batch/testjob   0/1           3m15s      3m15s
$ kubectl describe job testjob
Name:                     testjob
Namespace:                default
Selector:                 controller-uid=8a1f31c7-8d9d-4b4d-a687-e8e297509a71
Labels:                   controller-uid=8a1f31c7-8d9d-4b4d-a687-e8e297509a71
                          job-name=testjob
Annotations:              <none>
Parallelism:              1
Completions:              1
Start Time:               Wed, 17 Mar 2021 18:13:56 +0000
Active Deadline Seconds:  100s
Pods Statuses:            0 Running / 0 Succeeded / 4 Failed
Pod Template:
  Labels:  controller-uid=8a1f31c7-8d9d-4b4d-a687-e8e297509a71
           job-name=testjob
  Containers:
   testjob:
    Image:      bitnami/kubectl:1.20
    Port:       <none>
    Host Port:  <none>
    Command:
      /bin/sh
      -c
      echo "Test" && exit 1
    Environment:  <none>
    Mounts:       <none>
  Volumes:        <none>
Events:
  Type     Reason                Age    From            Message
  ----     ------                ----   ----            -------
  Normal   SuccessfulCreate      4m11s  job-controller  Created pod: testjob-s2cbf
  Normal   SuccessfulCreate      4m10s  job-controller  Created pod: testjob-nhfgn
  Normal   SuccessfulCreate      4m     job-controller  Created pod: testjob-8jw74
  Normal   SuccessfulCreate      3m20s  job-controller  Created pod: testjob-jh7hl
  Warning  BackoffLimitExceeded  2m     job-controller  Job has reached the specified backoff limit

However, if one of the pods finishes successfully (status: Completed), the duration counter stops as expected.

What is the problem here?

adroste
  • What do you want to achieve? You can clean up finished `Jobs` (either `Complete` or `Failed`) automatically using a TTL mechanism as described [here](https://kubernetes.io/docs/concepts/workloads/controllers/job/#ttl-mechanism-for-finished-jobs) (this TTL mechanism is alpha). – matt_j Mar 18 '21 at 16:38
  • I want to understand why the duration counter won't stop counting if a job failed (backoff limit or deadline reached). It seems that the job stays in a pending state even though all of the created pods failed. Is this expected behavior? Is this a bug? Or is it a configuration issue on my side? – adroste Mar 19 '21 at 17:07
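
The TTL mechanism matt_j mentions is configured per Job via `spec.ttlSecondsAfterFinished`. A minimal sketch, assuming the cluster has the (then-alpha) TTLAfterFinished feature enabled:

```
apiVersion: batch/v1
kind: Job
metadata:
  name: testjob
spec:
  # Delete this Job (and its pods) 60s after it finishes,
  # whether it ends up Complete or Failed.
  ttlSecondsAfterFinished: 60
  backoffLimit: 3
  template:
    spec:
      containers:
      - name: testjob
        image: bitnami/kubectl:1.20
        command: ["/bin/sh", "-c", "echo Test && exit 1"]
      restartPolicy: Never
```

Note this only cleans up the finished Job; it does not change how DURATION is reported while the Job exists.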

1 Answer


If a Job finishes successfully (type=Complete), its .status.completionTime is set to the time it finished. When a Job fails (type=Failed), .status.completionTime is never set, so DURATION keeps increasing (to be honest, I'm not sure whether that's a bug or not).


I've created a simple example to illustrate how it works.

I have two Jobs: testjob (type=Failed) and testjob-2 (type=Complete):

$ kubectl get jobs
NAME        COMPLETIONS   DURATION   AGE
testjob     0/1           3m15s      3m15s
testjob-2   1/1           1s         2m49s

We can display more information using the -o custom-columns= option:
NOTE: As you can see, .status.completionTime is not set for the failed Job.

$ kubectl get jobs testjob testjob-2 -o custom-columns=NAME:.metadata.name,TYPE:.status.conditions[].type,REASON:.status.conditions[].reason,COMPLETIONTIME:.status.completionTime
NAME        TYPE       REASON                 COMPLETIONTIME
testjob     Failed     BackoffLimitExceeded   <none>
testjob-2   Complete   <none>                 2021-03-23T15:51:33Z

Additionally, you can find helpful information on GitHub: API docs for job status.

matt_j
  • Thanks for the explanation. There seems to be some confusion between the domain names "Completion", "Duration", etc. They should either set CompletionTime also if a job fails, or introduce another timestamp like "FailedTime" and calculate the "Duration" via "StartTime - (CompletionTime || FailedTime || 0)" – adroste Mar 23 '21 at 19:58
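
The calculation the comment above proposes can be sketched in shell. The timestamps below are made up for illustration (in practice the end time would come from .status.completionTime or the Failed condition's lastTransitionTime); GNU `date` is assumed:

```shell
# Hypothetical example: compute a Job's duration from two RFC 3339 timestamps.
start="2021-03-17T18:13:56Z"   # .status.startTime
end="2021-03-17T18:15:56Z"     # completionTime, or a "FailedTime" fallback

# Convert both to epoch seconds and subtract (GNU date syntax).
duration=$(( $(date -u -d "$end" +%s) - $(date -u -d "$start" +%s) ))
echo "${duration}s"   # prints: 120s
```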