As part of my CI test logic, I have a script that creates a Kubernetes deployment file for each of a bunch of dedicated nodes, deletes any previous deployments on them, and then starts the new deployments. (Config appended at the bottom because it's probably not important.) Once I've run the tests with them, they're shut down ready for the next test run. The nodes only run my deployments, so I'm not bothering with declaring how much CPU/memory/whatever they need, and I don't have any readiness scripts because the containers work that stuff out amongst themselves, and I only need to talk to the status-monitoring service, once it has an IP address.

Usually they're ready and working within a minute or so (my script monitors the output of the following command until nothing reports 'false'), but every so often they don't start up within the time I'm allowing. I don't want to wait an indeterminate time if something goes wrong, because I need to collect the addresses and feed them to the downstream processes, to set my tests up with the deployments which DID complete. But if Kubernetes can't show me meaningful progress, or a diagnosis for why things are slow, I can't do much more than abort the incomplete deployments.

kubectl get pods -l pod-agent=$AGENT_NAME \
      -o 'jsonpath={range .items[*]}{..status.conditions[?(@.type=="Ready")].status}:{.status.podIP}:{.status.phase}:{.metadata.name} '
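
As a rough sketch of the "until nothing reports 'false'" check, the records printed by the command above can be filtered down to the pods that are not yet Ready. The function name not_ready_pods is hypothetical, and the sketch assumes each record has the Ready:podIP:phase:name shape shown in the sample output further down, with records separated by spaces or newlines.

```shell
# Hypothetical helper: reads "Ready:podIP:phase:name" records on stdin
# (space- or newline-separated) and prints the name of every pod whose
# Ready field is anything other than "True" (including empty).
not_ready_pods() {
  tr ' ' '\n' | awk -F: 'NF >= 4 && $1 != "True" { print $NF }'
}
```

Piping the output of the kubectl get pods command into not_ready_pods gives an empty result once every pod is Ready, so a polling loop can stop when the output is empty or when its own deadline passes.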

I theorised that one of the container images might not have been used on that host before, and that maybe it took so long to copy it to each host that the overall deployment exceeded my script's timeout, so I added this (ignore the '| cat |'; it's a workaround for an IntelliJ terminal bug)

kubectl describe pod $REPLY | cat | sed -n '/Events:/,$p; /emulator.*:/,/Ready:/p'

to give me a sense of what each pod is doing, each time the first command returns 'false', but I get what looks like inconsistent results: although the 'events' section claims that the containers are pulled and started, the structured output of the same command shows the containers as 'ContainerCreating':

     1  False::Pending:kubulator-mysh-automation11-dlan-666b96d788-6gfl7
    Container ID:   
    Image:          dockerio.dlan/auto/android-avd-10a29v8-emu29_0_11_kuber-snapshot-skin_name-540x1060-hw_lcd_density-240
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Container ID:   
    Image:          dockerio.dlan/auto/android-avd-10a29v8-emu29_0_11_kuber-snapshot-skin_name-540x1060-hw_lcd_density-240
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False

..more of the same, and then

  Type    Reason     Age   From                        Message
  ----    ------     ----  ----                        -------
  Normal  Scheduled  23s   default-scheduler           Successfully assigned auto/kubulator-mysh-automation11-dlan-666b96d788-6gfl7 to automation11.dlan
  Normal  Pulling    16s   kubelet, automation11.dlan  Pulling image "dockerio.dlan/auto/ticket-machine"
  Normal  Pulled     16s   kubelet, automation11.dlan  Successfully pulled image "dockerio.dlan/auto/ticket-machine"
  Normal  Created    16s   kubelet, automation11.dlan  Created container ticket-machine
  Normal  Started    16s   kubelet, automation11.dlan  Started container ticket-machine
  Normal  Pulling    16s   kubelet, automation11.dlan  Pulling image "dockerio.dlan/qa/cgi-bin-remote"
  Normal  Created    15s   kubelet, automation11.dlan  Created container cgi-adb-remote
  Normal  Pulled     15s   kubelet, automation11.dlan  Successfully pulled image "dockerio.dlan/qa/cgi-bin-remote"
  Normal  Started    15s   kubelet, automation11.dlan  Started container cgi-adb-remote
  Normal  Pulling    15s   kubelet, automation11.dlan  Pulling image "dockerio.dlan/auto/android-avd-10a29v8-emu29_0_11_kuber-snapshot-skin_name-540x1060-hw_lcd_density-240"
  Normal  Pulled     15s   kubelet, automation11.dlan  Successfully pulled image "dockerio.dlan/auto/android-avd-10a29v8-emu29_0_11_kuber-snapshot-skin_name-540x1060-hw_lcd_density-240"
  Normal  Created    15s   kubelet, automation11.dlan  Created container emulator-5554
  Normal  Started    15s   kubelet, automation11.dlan  Started container emulator-5554
  Normal  Pulled     15s   kubelet, automation11.dlan  Successfully pulled image "dockerio.dlan/auto/android-avd-10a29v8-emu29_0_11_kuber-snapshot-skin_name-540x1060-hw_lcd_density-240"
  Normal  Pulling    15s   kubelet, automation11.dlan  Pulling image "dockerio.dlan/auto/android-avd-10a29v8-emu29_0_11_kuber-snapshot-skin_name-540x1060-hw_lcd_density-240"
  Normal  Created    14s   kubelet, automation11.dlan  Created container emulator-5556
  Normal  Started    14s   kubelet, automation11.dlan  Started container emulator-5556
  Normal  Pulling    14s   kubelet, automation11.dlan  Pulling image "dockerio.dlan/auto/android-avd-10a29v8-emu29_0_11_kuber-snapshot-skin_name-540x1060-hw_lcd_density-240"

so the events claim the containers are started, but the structured data contradicts them. I'd use the events as authoritative, but they're rather bizarrely truncated to the 26 earliest(!) entries, despite the server not being set up with any event rate-limiting configuration.

At the end, I include the full description of one of the containers which the events claim has 'started', but I don't see any clues in the full output.

Once the deployment has started (i.e. the first line shows 'true'), all the containers abruptly show as 'Running'.

So my fundamental question is how can I determine the actual state of my deployment - apparently as represented by 'events' - to understand why and where it's stuck on those occasions when it fails, given that describe pod is apparently unreliable and/or incomplete?

Is there something beyond 'kubectl get pods' I can use to find the REAL state of play? (Preferably not something crufty like ssh-ing to the server and sniffing its raw logs.)


kubectl version
Client Version: version.Info{Major:"1", Minor:"16", GitVersion:"v1.16.3", GitCommit:"b3cbbae08ec52a7fc73d334838e18d17e8512749", GitTreeState:"clean", BuildDate:"2019-11-13T11:23:11Z", GoVersion:"go1.12.12", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"15", GitVersion:"v1.15.3", GitCommit:"2d3c76f9091b6bec110a5e63777c332469e0cba2", GitTreeState:"clean", BuildDate:"2019-08-19T11:05:50Z", GoVersion:"go1.12.9", Compiler:"gc", Platform:"linux/amd64"}

My deployment file:

apiVersion: v1
kind: Service
  name: kubulator-mysh-automation11-dlan
    run: kubulator-mysh-automation11-dlan
    pod-agent: mysh
  type: ClusterIP
  clusterIP: None
    - name: http
      protocol: TCP
      port: 8088
      targetPort: 8088
    - name: adb-remote
      protocol: TCP
      port: 8080
      targetPort: 8080
    - name: adb
      protocol: TCP
      port: 9100
      targetPort: 9100
    run: kubulator-mysh-automation11-dlan
    kubernetes.io/hostname: automation11.dlan
apiVersion: apps/v1
kind: Deployment
  name: kubulator-mysh-automation11-dlan
    pod-agent: mysh
      run: kubulator-mysh-automation11-dlan
      pod-agent: mysh
  replicas: 1
        run: kubulator-mysh-automation11-dlan
        pod-agent: mysh
        kubernetes.io/hostname: automation11.dlan
        - name: dev-kvm
            path: /dev/kvm
            type: CharDevice
        - name: logs
          emptyDir: {}
- name: ticket-machine
  image: dockerio.dlan/auto/ticket-machine
  args: ['--', '--count', '20']  # --adb /local/adb-....
  imagePullPolicy: Always
    - mountPath: /logs
      name: logs
    - containerPort: 8088
      value: "9100"
      value: host
- name: cgi-adb-remote
  image: dockerio.dlan/qa/cgi-bin-remote
  args: ['/root/git/CgiAdbRemote/CgiAdbRemote.pl', '-foreground', '-port=8080', "-adb=/root/adb-8aug-usbbus-maxemu-v39"]
  imagePullPolicy: Always
    - containerPort: 8080
      value: "tcp:localhost:9100"
      value: host
- name: emulator-5554
  image: dockerio.dlan/auto/android-avd-10a29v8-emu29_0_11_kuber-snapshot-skin_name-540x1060-hw_lcd_density-240
  imagePullPolicy: Always
    privileged: true
    - mountPath: /logs
      name: logs
    - mountPath: /dev/kvm
      name: dev-kvm
      value: v39
      value: '9100'
    - name: EMULATOR_PORT
      value: '5554'
      value: '2400'
      value: host
    - name: EMU_WINDOW
      value: '2'
- name: emulator-5556
  image: dockerio.dlan/auto/android-avd-10a29v8-emu29_0_11_kuber-snapshot-skin_name-540x1060-hw_lcd_density-240
... etc - several more of these emulator containers.

And the full 'describe' of a container declared as 'started' by events:

    Container ID:   
    Image:          dockerio.dlan/auto/android-avd-10a29v8-emu29_0_11_kuber-snapshot-skin_name-540x1060-hw_lcd_density-240
    Image ID:       
    Port:           <none>
    Host Port:      <none>
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
      ANDROID_ADB_VERSION:      v39
      EMULATOR_PORT:            5554
      EMULATOR_MAX_SECS:        2400
      ANDROID_ADB_SERVER:       host
      EMU_WINDOW:               2
      /dev/kvm from dev-kvm (rw)
      /logs from logs (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-2jrv5 (ro)

1 Answer


You can use kubectl wait to pause your test execution until the pods are in Ready status.

Keep in mind that if you don't use a readiness probe for your app, the pods being in a Ready status won't imply that your app is actually ready to take in traffic, which could make your tests flaky.
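
A minimal sketch of the kubectl wait suggestion, assuming the pods are selected by a label such as the pod-agent label used in the question. The function name wait_for_pods and the 120s default are hypothetical. On timeout it dumps the namespace's events in time order, which may or may not contain the missing detail.

```shell
# Hypothetical helper: block until every pod matching the label selector
# is Ready, or until the timeout expires.
#   $1 = label selector, e.g. pod-agent=mysh
#   $2 = timeout in kubectl duration syntax (assumed default: 120s)
wait_for_pods() {
  selector=$1
  timeout=${2:-120s}
  if kubectl wait --for=condition=Ready pod -l "$selector" --timeout="$timeout"; then
    return 0
  fi
  # Timed out or failed: print the namespace's events, oldest first, to stderr.
  kubectl get events --sort-by=.lastTimestamp >&2
  return 1
}
```

For example, wait_for_pods "pod-agent=$AGENT_NAME" 90s returns non-zero if any matching pod is still not Ready after 90 seconds, so the calling script can decide whether to abort that node. Note that kubectl wait also fails straight away if no pods match the selector yet.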

  • Thanks for this. My containers' logic already handles the emulators coming up before the main monitoring entrypoint is ready, and what to do if any fail. My interest is in talking to that entrypoint as soon as possible, and in understanding why it sometimes takes a long time to gain access, so I can avoid those conditions or, when they arise, make an automated decision whether to keep on waiting or to abort and only use the other nodes' containers. Currently it doesn't seem that Kubernetes is capable of reporting accurately on what it's doing. – Tim Baverstock Feb 03 '20 at 10:26