
I have a GKE k8s cluster (k8s 1.22) that consists of preemptible nodes only, which also host critical services like kube-dns. It's a dev cluster that can tolerate a few broken minutes a day. Every time a node hosting a kube-dns pod gets shut down, I run into DNS resolution problems that persist until I delete the failed pod (in 1.21, pods stay "Status: Failed" / "Reason: Shutdown" until manually deleted).
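Until there is a proper fix, I clear the dead pods by hand with something along these lines (just a sketch; the field selector matches any pod in the Failed phase, not only the shut-down kube-dns ones):

# remove pods left behind in Status: Failed / Reason: Shutdown (sketch)
kubectl -n kube-system delete pods --field-selector=status.phase=Failed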

While I do expect some problems on preemptible nodes while they are being recycled, I would expect this to self-repair within a few minutes. The underlying reason for the persistent problems seems to be that the failed pod does not get removed from the k8s Service / Endpoints. This is what I can see in the system:

Status of the pods via kubectl -n kube-system get po -l k8s-app=kube-dns

NAME                        READY   STATUS       RESTARTS   AGE
kube-dns-697dc8fc8b-47rxd   4/4     Terminated   0          43h
kube-dns-697dc8fc8b-mkfrp   4/4     Running      0          78m
kube-dns-697dc8fc8b-zfvn8   4/4     Running      0          19h

The IP of the failed pod is 192.168.144.2, and it is still listed as one of the endpoints of the service:

kubectl -n kube-system describe ep kube-dns shows this:

Name:         kube-dns
Namespace:    kube-system
Labels:       addonmanager.kubernetes.io/mode=Reconcile
              k8s-app=kube-dns
              kubernetes.io/cluster-service=true
              kubernetes.io/name=KubeDNS
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2022-02-21T10:15:54Z
Subsets:
  Addresses:          192.168.144.2,192.168.144.7,192.168.146.29
  NotReadyAddresses:  <none>
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    dns-tcp  53    TCP
    dns      53    UDP

Events:  <none>

I know others have worked around these issues by scheduling kube-dns onto other nodes, but I would rather make this self-healing, as node failures can still happen on non-preemptible nodes; they are just less likely.
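For completeness, that workaround boils down to keeping the DNS pods off preemptible nodes, roughly like the sketch below (GKE labels preemptible nodes with cloud.google.com/gke-preemptible=true; note that the kube-dns Deployment is addon-managed with mode Reconcile, so a manual patch like this may get reverted):

# sketch only: require non-preemptible nodes for kube-dns pods
kubectl -n kube-system patch deployment kube-dns -p '
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: cloud.google.com/gke-preemptible
                operator: DoesNotExist
'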

My questions:

  • Why is the failed pod still listed as one of the endpoints of the service, even hours after the initial node failure?
  • What can I do to mitigate the problem (besides adding some non-ephemeral nodes)?

It seems that the default kube-dns Deployment in GKE does not have a readiness probe attached to dnsmasq (port 53), which is the port targeted by the kube-dns Service, and having one could solve the issue - but I suspect it's not there for a reason that I don't yet understand.
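To illustrate, the kind of probe I have in mind would be a TCP check on port 53 on the dnsmasq container, roughly like this (a sketch only, assuming the container is named dnsmasq; the probe timings are arbitrary, and the addon manager would likely revert a manual patch anyway):

# sketch: add a TCP readiness probe on port 53 to the dnsmasq container
kubectl -n kube-system patch deployment kube-dns -p '
spec:
  template:
    spec:
      containers:
      - name: dnsmasq
        readinessProbe:
          tcpSocket:
            port: 53
          initialDelaySeconds: 5
          periodSeconds: 10
'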

EDIT: Apparently this does not happen on 1.21.6-gke.1500 (regular channel), but it does on 1.22.6-gke.1500 (rapid channel). I do not have a good explanation, but despite having had a few failed pods today, the kube-dns service only contains the working ones.

lena_punkt
  • Update: Looks like a k8s bug that will be fixed in a later 1.22 release: https://github.com/kubernetes/kubernetes/issues/108594 - I will update with an answer to my own question once I have verified this is working. Florian, if you can read this: if you make your now-deleted comment an answer to this post, I can accept it later on and you get the credit. – lena_punkt Apr 04 '22 at 06:25

2 Answers


Preemptible nodes are not recommended for running critical workloads such as kube-dns (1), so situations like this are to be expected.

You can try mitigating the issue by marking the pod as critical (2), using node auto-provisioning (3), or adding a PodDisruptionBudget (4).
There is more information on this topic in this doc (5).
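For example, a minimal PodDisruptionBudget for kube-dns could look like the sketch below (the name and minAvailable value are only examples, and keep in mind a PDB only covers voluntary disruptions):

# sketch: keep at least one kube-dns pod available during voluntary disruptions
kubectl apply -f - <<EOF
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: kube-dns-pdb
  namespace: kube-system
spec:
  minAvailable: 1
  selector:
    matchLabels:
      k8s-app: kube-dns
EOF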

Additionally, some suggestions have already been made to Google (6).

If none of these resolve your problem, you can report it via the Public Issue Tracker.

Sergiusz
  • Correct, adding a node pool with standard nodes will make this less likely - but those nodes can still fail, and I do not see why this could not happen in the same way, e.g. when an availability zone fails. That is the main reason why I asked initially. Human intervention would be necessary for that case as well, correct? – lena_punkt Feb 25 '22 at 13:32
  • I have never witnessed such a situation and did not find any reports of such behavior in the issue tracker. But if you encounter this problem on a non-preemptible node, then it should be reported to Google. – Sergiusz Feb 28 '22 at 08:02

It started happening on my env (preemptible nodes on GKE) as well, and it happens to all deployments, but kube-dns is the most critical one. I think it might be related to the revisionHistoryLimit parameter. The default value is 10, so up to 10 old ReplicaSets are kept around for some period of time. I've set it to 0 and expect the nodes to be replaced, let's see :)
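For reference, setting it on a deployment can be done with a patch along these lines (a sketch; replace the placeholders with your namespace and deployment name):

# sketch: stop keeping old ReplicaSets around after rollouts
kubectl -n <namespace> patch deployment <name> -p '{"spec":{"revisionHistoryLimit":0}}'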