
This is an odd one.

I've set up my K8s cluster with 1 master and 1 worker. It uses Calico as the CNI, and everything looks to be working as expected (I'm able to deploy pods, services, etc.). I can reach my pods/services by IP, but when I try to reach them by DNS name, e.g. myservice.default.svc, they are not reachable. So I started digging into and troubleshooting DNS resolution, and I've finally come to the conclusion that my kube-dns pods themselves are not reachable.
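
For reference, this is roughly how I've been testing name resolution (busybox:1.28 only because its nslookup is well behaved; myservice is just a placeholder for one of my own services):

# throwaway busybox shell for DNS testing
kubectl --kubeconfig mycluster run dnstest -it --rm --image=busybox:1.28 --restart=Never -- sh
/ # nslookup myservice.default.svc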

Here's a bit of information:

DNS pods running:

kubectl --kubeconfig mycluster get pods --namespace=kube-system -l k8s-app=kube-dns
NAME                      READY   STATUS    RESTARTS   AGE
coredns-f9fd979d6-jsqp9   1/1     Running   0          20h
coredns-f9fd979d6-tppbt   1/1     Running   0          20h

DNS Service running:

kubectl --kubeconfig cluster get svc --namespace=kube-system
NAME       TYPE        CLUSTER-IP   EXTERNAL-IP   PORT(S)                  AGE
kube-dns   ClusterIP   10.96.0.10   <none>        53/UDP,53/TCP,9153/TCP   21h

DNS Endpoints exposed:

kubectl --kubeconfig cluster get endpoints kube-dns --namespace=kube-system
NAME       ENDPOINTS                                                 AGE
kube-dns   10.45.83.1:53,10.45.83.2:53,10.45.83.1:9153 + 3 more...   21h

From a busybox pod, I'm able to access other services - for example a database:

/ # ping 10.36.12.13
PING 10.36.12.13 (10.36.12.13): 56 data bytes
64 bytes from 10.36.12.13: seq=0 ttl=63 time=0.213 ms
64 bytes from 10.36.12.13: seq=1 ttl=63 time=0.091 ms

# telnet 10.36.12.13 3306
Connected to 10.36.12.13

/etc/resolv.conf looks to be set up as expected:

cat /etc/resolv.conf
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local

However, if I try a DNS lookup, it hangs with "no servers could be reached" errors:

nslookup backend.default.svc.cluster.local
;; connection timed out; no servers could be reached
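
I also tried pointing the lookup straight at one of the endpoint pod IPs listed above, to take the ClusterIP / kube-proxy layer out of the picture:

# query a CoreDNS pod directly instead of the kube-dns ClusterIP
nslookup backend.default.svc.cluster.local 10.45.83.1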

If I try to telnet or ping the CoreDNS pods directly, it fails:

telnet 10.45.83.1 53
^C

ping 10.45.83.1
PING 10.45.83.1 (10.45.83.1): 56 data bytes
^C
--- 10.45.83.1 ping statistics ---
2 packets transmitted, 0 packets received, 100% packet loss
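
In case pod placement is relevant, this is how I can pull which node each of these pods is scheduled on; I can add that output too if it helps:

# show the NODE column for the CoreDNS pods and for my test pods
kubectl --kubeconfig mycluster get pods --namespace=kube-system -l k8s-app=kube-dns -o wide
kubectl --kubeconfig mycluster get pods -o wide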

The logs on both DNS pods look fine:

kubectl --kubeconfig cluster logs --namespace=kube-system -l k8s-app=kube-dns
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
.:53
[INFO] plugin/reload: Running configuration MD5 = db32ca3650231d74073ff4cf814959a7
CoreDNS-1.7.0
linux/amd64, go1.14.4, f59c03d
[INFO] Reloading
[INFO] plugin/health: Going into lameduck mode for 5s
[INFO] plugin/reload: Running configuration MD5 = 3d3f6363f05ccd60e0f885f0eca6c5ff
[INFO] Reloading complete
[INFO] 127.0.0.1:40656 - 48819 "HINFO IN 1796540929503221175.488499616278261636. udp 56 false 512" NXDOMAIN qr,rd,ra 131 0.015421704s

Any ideas on what to check would be appreciated. I'd be happy to add any further info.

Dan V
  • Your (successful) ping shows a subnet that isn't the one in which your Pods live; I would guess, based on what you're describing, that your CNI isn't happy, but without more info it's hard to say what exactly is going on – mdaniel Oct 07 '20 at 16:06
  • As mdaniel said, it looks like a networking/CNI issue. What does `kubectl get nodes -o jsonpath='{.items[].spec.podCIDR}'` show for your pods' network range? Can you see routes for that network on both hosts (`ip ro sh`)? Does the config in `/etc/cni/net.d/` match up to the cidr/range in `/etc/kubernetes/manifests/kube-controller-manager.yaml`? – Matt Oct 08 '20 at 00:31

1 Answer


So, the root cause was that my pods couldn't reach pods on another host. Both DNS pods ended up being scheduled on host1, and everything worked from host1, but from host2 (which couldn't see anything on host1) everything broke, since nothing there could resolve any DNS queries.
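
In hindsight, the quickest way to confirm this (a rough sketch; mypod-on-host2 stands in for any pod scheduled on host2, and 10.45.83.1 is one of the CoreDNS pod IPs living on host1) is a cross-node ping:

# from a pod on host2, try to reach a pod IP hosted on host1 - it never got a reply
kubectl --kubeconfig mycluster exec -it mypod-on-host2 -- ping -c 2 10.45.83.1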

This got resolved by changing the CNI to Weave instead of Calico. I had been troubleshooting Calico for well over a week before I gave up; the pods simply couldn't get from one node to the other. I checked the BGP setup, confirmed the required network ports were open and working, etc., but calicoctl node status kept reporting that the peer connection wasn't established. At this point I don't know what was causing it. The one thing I noticed is that a weird virtual interface with a very odd CIDR got created every time I installed Calico, and that CIDR didn't match any of my networking needs. I decided it wasn't worth the effort, since I have no hard requirement for Calico.
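
For reference, this is roughly what the check and the switch looked like (the Weave manifest URL is the one Weave documented at the time; adjust it for your own cluster before relying on it):

# BGP peering status on each node - the peer kept showing as not Established
sudo calicoctl node status

# after removing the Calico manifests, install Weave Net as the CNI
kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"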

Thanks to everyone who checked!

Dan V