
I don't know where to look for hints.

We have installed gitlab-runner using a Helm chart in our development cluster. Most of the time this works, but in the last week or so we have experienced pods being stuck in the Pending state without any further logs. At some point, which I cannot pin down more precisely, all pods get scheduled on nodes, and then the next batch is stuck in Pending again.

We use GKE and have set up a node pool of preemptible nodes only for gitlab-runner pods. We run Kubernetes v1.15.4-gke.18.

We know there are several reasons for pods being stuck in Pending, but I would always expect some form of logs/indication when running `kubectl describe pod <GITLAB_RUNNER_POD>` or `kubectl get events`. The problem is, there are none. No events.

We have Stackdriver logging enabled, and I can see Kubernetes apiserver request logs under Kubernetes Cluster, but they don't contain anything meaningful to me.

Any ideas where to look?

  • It's strange that the '[describe pod](https://cloud.google.com/kubernetes-engine/docs/troubleshooting#workload_issues)' command is not giving any output. Usually, a pod stays in the Pending state if there are insufficient resources of one type or another preventing it from being scheduled. As explained in [the help center article](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-pod-replication-controller/#my-pod-stays-pending), it is possible that the CPU or memory in your cluster has been exhausted. – Digil Jan 06 '20 at 17:05
  • Furthermore, you could also use [this guide](https://kubernetes.io/docs/tasks/debug-application-cluster/debug-application/), which helps users debug applications that are deployed into Kubernetes and not behaving correctly. – Digil Jan 06 '20 at 17:05
  • Since you have Stackdriver logging enabled, check the node logs for the kubelet and Docker to see if there are any node-level issues with container creation or resource allocation. – Patrick W Feb 06 '20 at 05:42
  • Also check that there is sufficient capacity in the cluster; these pods may be stuck in Pending because they have not been properly scheduled to a node. – Patrick W Feb 06 '20 at 05:43
  • Which Helm version did you use, v2 or v3? Could you post the exact command you used to deploy it? – PjoterS Nov 02 '20 at 08:10
  • No idea, this was ten months ago. Sorry. – Moritz Schmitz v. Hülst Nov 08 '20 at 22:10
  • As there is no way to know now what exactly happened, I think it's worth mentioning that the logs from events (like in `$ kubectl describe ...`) are stored for **60 minutes**. After that you won't be able to get those events unless you have some logging facility like `StackDriver`. – Dawid Kruk Feb 01 '21 at 19:28

2 Answers


Posting this answer to give a more general idea of where to look for information on why a Pod is in the Pending state, as it is now impossible to tell for this specific setup.

Ways to check why a Pod is in the Pending state:

  • $ kubectl describe pod POD_NAME
  • $ kubectl get events -A
  • Inspecting the Cloud Logging (more on that below)

Assume the following situation, where a Pod is in the Pending state:

  • $ kubectl get pods
NAME                           READY   STATUS    RESTARTS   AGE
nginx-four-99d88fccb-rwzmp     0/1     Pending   0          2s
nginx-one-8584c66446-h92rm     1/1     Running   0          5d22h
nginx-three-5bcb988986-tmshp   1/1     Running   0          5d22h
nginx-two-6c9545d7d4-2zlmh     1/1     Running   0          5d22h

To get more information about its state, you can run:

  • $ kubectl describe pod POD_NAME

The Events part of the above output:

Events:
  Type     Reason            Age                 From               Message
  ----     ------            ----                ----               -------
  Warning  FailedScheduling  26s (x2 over 114s)  default-scheduler  0/1 nodes are available: 1 Insufficient cpu.

As you can see, there is information on why the Pod is in the Pending state (Insufficient cpu).
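
When the scheduler reports Insufficient cpu (or memory), you can cross-check how much room the nodes actually have. Below is a minimal sketch; `kubectl top nodes` assumes Metrics Server is available in the cluster (it is by default on GKE):

  • $ kubectl describe nodes | grep -A 7 "Allocated resources"
  • $ kubectl top nodes

The first command prints, per node, the sum of CPU and memory requests already allocated; the second shows actual usage.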

You can also run:

  • $ kubectl get events
LAST SEEN   TYPE      REASON              OBJECT                            MESSAGE
20s         Warning   FailedScheduling    pod/nginx-four-99d88fccb-rwzmp    0/1 nodes are available: 1 Insufficient cpu.
14m         Normal    SuccessfulCreate    replicaset/nginx-four-99d88fccb   Created pod: nginx-four-99d88fccb-rwzmp
14m         Normal    ScalingReplicaSet   deployment/nginx-four             Scaled up replica set nginx-four-99d88fccb to 1
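
If there are many entries, you can sort the events chronologically:

  • $ kubectl get events --sort-by=.metadata.creationTimestamp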

Disclaimer!

Kubernetes events are stored in etcd for 1 hour. If the message about the Pod's state does not repeat over time, it will be deleted after that hour.
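
This retention period is controlled by the kube-apiserver --event-ttl flag, whose default is one hour. On GKE the control plane is managed, so the flag cannot be changed there; it is shown only to illustrate where the 1 hour comes from:

kube-apiserver --event-ttl=1h0m0s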


Retrieving logs from Cloud Logging:

You can run the query below to get the Pods that were in the Pending state:

resource.type="k8s_cluster"
resource.labels.cluster_name="gke-serverfault"
protoPayload.response.status.phase="Pending"
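
The same query can also be run from the command line with gcloud (a sketch, reusing the example cluster name from the query above):

$ gcloud logging read \
    'resource.type="k8s_cluster"
     resource.labels.cluster_name="gke-serverfault"
     protoPayload.response.status.phase="Pending"' \
    --limit 10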

This query will not show the reason (like Insufficient cpu) why the Pod is in the Pending state. There is a feature request for this on Issuetracker.google.com, which you can follow to receive further updates.



Dawid Kruk

In my case, some worker nodes lost their connection to the master node, and the CoreDNS and flannel pods on the master failed to start.

My solution was to delete the worker nodes and rejoin them to the cluster, as sketched below.
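
For reference, a rough sketch of that delete-and-rejoin cycle, assuming a kubeadm-managed cluster (which the flannel/CoreDNS setup suggests); the node name, master address, token, and hash are placeholders:

# On the master node:
$ kubectl drain <node-name> --ignore-daemonsets
$ kubectl delete node <node-name>
$ kubeadm token create --print-join-command

# On the worker node, reset it and rejoin with the printed command:
$ kubeadm reset
$ kubeadm join <master-host>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>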

xiaojueguan