
Following an upgrade to v1.19.7 with kubeadm, my pods are unable to reach the kube-dns service via the service's ClusterIP. When using a kube-dns pod IP address directly instead, DNS resolution works.

kube-dns pods are up and running:

$ kubectl get pods -n kube-system -l k8s-app=kube-dns
NAME                       READY   STATUS    RESTARTS   AGE
coredns-7674cdb774-2m58h   1/1     Running   0          33m
coredns-7674cdb774-x44b9   1/1     Running   0          33m

logs are clean:

$ kubectl logs coredns-7674cdb774-2m58h -n kube-system
.:53
[INFO] plugin/reload: Running configuration MD5 = 7442f38ca24670d4af368d447670ad91
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[INFO] 127.0.0.1:40705 - 31415 "HINFO IN 7224361654609676299.2243479664305694168. udp 57 false 512" NXDOMAIN qr,rd,ra 132 0.003954173s

kube-dns service is exposed:

$ kubectl get svc  -n kube-system
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   301d

endpoints are also configured:

$ kubectl describe endpoints kube-dns --namespace=kube-system
Name:         kube-dns
Namespace:    kube-system
Labels:       k8s-app=kube-dns
              kubernetes.io/cluster-service=true
              kubernetes.io/name=KubeDNS
Annotations:  endpoints.kubernetes.io/last-change-trigger-time: 2021-01-19T14:23:13Z
Subsets:
  Addresses:          10.44.0.1,10.47.0.2
  NotReadyAddresses:  <none>
  Ports:
    Name     Port  Protocol
    ----     ----  --------
    dns-tcp  53    TCP
    dns      53    UDP
    metrics  9153  TCP

Events:  <none>
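(for completeness, these endpoint addresses correspond to the CoreDNS pod IPs, which can be cross-checked with -o wide:)

$ kubectl get pods -n kube-system -l k8s-app=kube-dns -o wide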

here is my coredns ConfigMap:

$ kubectl describe cm -n kube-system coredns
Name:         coredns
Namespace:    kube-system
Labels:       <none>
Annotations:  <none>

Data
====
Corefile:
----
.:53 {
    log
    errors
    ready
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
       pods insecure
       fallthrough in-addr.arpa ip6.arpa
       ttl 30
    }
    prometheus :9153
    forward . /etc/resolv.conf
    cache 30
    loop
    reload
    loadbalance
}

Events:  <none>
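(as a side note, the ready plugin in this Corefile serves on port 8181 by default, so CoreDNS readiness can also be probed directly against a pod IP, bypassing the Service entirely; 10.44.0.1 below is one of the endpoint addresses from above:)

$ kubectl exec -i -t dnsutils -- nc -vz 10.44.0.1 8181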

on the worker nodes, kube-proxy is running:

$ kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=ccqserv202
NAME               READY   STATUS    RESTARTS   AGE    IP              NODE         NOMINATED NODE   READINESS GATES
kube-proxy-8r65s   1/1     Running   0          78m    10.158.37.202   ccqserv202   <none>           <none>
weave-net-kvnzg    2/2     Running   0          6h3m   10.158.37.202   ccqserv202   <none>           <none>

networking between pods is working, as I am able to communicate between pods running on separate nodes (here, dnsutils runs on node ccqserv202, while 10.44.0.1 is the pod IP address from coredns-7674cdb774-x44b9, running on node ccqserv223).

$ kubectl exec -i -t dnsutils -- ping 10.44.0.1
PING 10.44.0.1 (10.44.0.1): 56 data bytes
64 bytes from 10.44.0.1: seq=0 ttl=64 time=2.101 ms
64 bytes from 10.44.0.1: seq=1 ttl=64 time=1.184 ms
64 bytes from 10.44.0.1: seq=2 ttl=64 time=1.107 ms
^C
--- 10.44.0.1 ping statistics ---
3 packets transmitted, 3 packets received, 0% packet loss
round-trip min/avg/max = 1.107/1.464/2.101 ms

I am using "ipvs" as kube-proxy mode (although I can confirm the exact same behavior happens when using "iptables" or "userspace" modes).

Here is my ipvsadm -Ln on node ccqserv202:

$ ipvsadm -Ln 
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.46.128.0:30040 rr
TCP  10.96.0.1:443 rr
  -> 10.158.37.223:6443           Masq    1      0          0         
  -> 10.158.37.224:6443           Masq    1      0          0         
  -> 10.158.37.225:6443           Masq    1      1          0         
TCP  10.96.0.10:53 rr
TCP  10.96.0.10:9153 rr
TCP  10.97.147.126:2746 rr
TCP  10.100.162.140:9000 rr
TCP  10.101.126.110:5432 rr
TCP  10.109.184.125:4040 rr
TCP  10.110.163.112:9090 rr
TCP  10.110.215.252:8443 rr
TCP  10.158.37.202:30040 rr
TCP  127.0.0.1:30040 rr
TCP  134.158.237.2:30040 rr
UDP  10.96.0.10:53 rr

as you can see, there are no realservers configured under the 10.96.0.10 virtual addresses, but there are under 10.96.0.1 (which corresponds to the kubernetes API service).
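for comparison, based on the endpoints above, I would expect the 10.96.0.10 entries to look roughly like this (hypothetical output, weights and counters made up):

UDP  10.96.0.10:53 rr
  -> 10.44.0.1:53                 Masq    1      0          0
  -> 10.47.0.2:53                 Masq    1      0          0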

I am able to open a connection to 10.96.0.1 on port 443:

$ kubectl exec -i -t dnsutils -- nc -vz 10.96.0.1 443
10.96.0.1 (10.96.0.1:443) open

I am able to open a connection to 10.44.0.1 on port 53:

$ kubectl exec -i -t dnsutils -- nc -vz 10.44.0.1 53
10.44.0.1 (10.44.0.1:53) open

it even resolves!

$ kubectl exec -i -t dnsutils -- nslookup kubernetes.default 10.44.0.1
Server:     10.44.0.1
Address:    10.44.0.1#53

Name:   kubernetes.default.svc.cluster.local
Address: 10.96.0.1

but this does not work when I use the kube-dns ClusterIP 10.96.0.10:

$ kubectl exec -i -t dnsutils -- nc -vz 10.96.0.10 53
command terminated with exit code 1
$ kubectl exec -i -t dnsutils -- nslookup kubernetes.default 10.96.0.10
;; connection timed out; no servers could be reached

here is the dnsutils resolv.conf file:

$ kubectl exec -i -t dnsutils -- cat /etc/resolv.conf
search default.svc.cluster.local svc.cluster.local cluster.local xxxxx.fr
nameserver 10.96.0.10
options ndots:5

finally, when I try manually adding a realserver to ipvs on the node,

$ ipvsadm -a -u 10.96.0.10:53 -r 10.44.0.1:53 -m

kube-proxy detects it and immediately removes it:

I0119 16:17:27.062890       1 proxier.go:2076] Using graceful delete to delete: 10.96.0.10:53/UDP/10.44.0.1:53
I0119 16:17:27.062906       1 graceful_termination.go:159] Trying to delete rs: 10.96.0.10:53/UDP/10.44.0.1:53
I0119 16:17:27.062974       1 graceful_termination.go:173] Deleting rs: 10.96.0.10:53/UDP/10.44.0.1:53
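(for anyone wanting the full picture, the kube-proxy logs can be pulled via the k8s-app=kube-proxy label that kubeadm applies, to look for sync errors around the Service or Endpoints objects, e.g.:)

$ kubectl logs -n kube-system -l k8s-app=kube-proxy --tail=200 | grep -iE 'error|fail'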

also, we can see with tcpdump that DNS requests from dnsutils to 10.96.0.10 are NOT rewritten to 10.44.0.1 or 10.47.0.2, as they should be with ipvs:

    10.46.128.8.53140 > 10.96.0.10.domain: [bad udp cksum 0x94f7 -> 0x12c4!] 4628+ A? kubernetes.default.default.svc.cluster.local. (62)
16:27:56.950950 IP (tos 0x0, ttl 64, id 45349, offset 0, flags [none], proto UDP (17), length 90)
    10.46.128.8.53140 > 10.96.0.10.domain: [bad udp cksum 0x94f7 -> 0x12c4!] 4628+ A? kubernetes.default.default.svc.cluster.local. (62)
16:27:56.951321 IP (tos 0x0, ttl 64, id 59811, offset 0, flags [DF], proto UDP (17), length 70)

a tcpdump on the kube-dns pods' side shows that these requests never arrive.
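(for anyone trying to reproduce, a capture along these lines on the worker node should show whether traffic to the ClusterIP gets rewritten to a pod IP:)

$ tcpdump -i any -nn 'udp port 53'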

I've spent a full day trying to understand what is happening and how to fix it, and I am now running out of ideas. Any help would be very much welcome.

The debugging steps at https://kubernetes.io/docs/tasks/administer-cluster/dns-debugging-resolution/ unfortunately did not help.

Thank you!

tl;dr: DNS resolution in the Kubernetes cluster does not work when using the kube-dns service ClusterIP, although I am able to resolve when using the kube-dns pods' IP addresses directly. I suspect something is wrong with my kube-proxy configuration, but I can't find what.

sqw
  • Hi sqw, welcome to S.F. What CNI provider are you using, and have you checked that its Pods are in good shape? Actually, related to that: is kube-dns the only Service IP that is messed up? – mdaniel Jan 19 '21 at 17:03
  • Thanks @mdaniel, I am using weave as the CNI. All weave pods are healthy, and the logs look good. I actually hadn't tried before, but I just did: other services are also broken when I try to access them via their ClusterIP, whereas direct pod access is OK. – sqw Jan 19 '21 at 19:43
  • Hi @sqw, have you tried using another CNI? – Jakub Jan 20 '21 at 14:08
  • Hi @Jakub, I haven't tried another CNI, as I hoped to understand what was wrong with this configuration; but in the end I redeployed the whole cluster from scratch, and the issue is now gone... I'd still be interested to know whether other people have had similar issues with similar configurations, though. – sqw Jan 20 '21 at 17:40

0 Answers