
A while ago, a strange problem appeared in our Kubernetes cluster. Our network contains Windows servers (web server, mail server, etc.) and a Kubernetes cluster running Rancher v2.6.0.

The cluster communicates with the Windows servers via HTTP requests, and via SMTP/IMAP to send and read emails. For a while now, random HTTP requests have been failing with the error message "no route to host". It seems to be limited to connections within the network; requests to third-party APIs are not affected. The error does not always occur: many requests go through without any issues, and some fail. I implemented a retry policy that repeats the same request a few seconds later. Sometimes it works on the first retry, sometimes on the second, and sometimes not at all.

I tried to google for a solution but I couldn't come up with anything, especially since only a percentage of all requests are affected.

Our sysadmin, who maintains the network and the Windows servers, cannot identify any issues or even see the failing requests. So my guess is that the requests never leave the cluster, if that makes sense.

Unfortunately, the Kubernetes cluster used to be maintained by a colleague who is no longer available. I'd be very grateful for suggestions on where to start looking for a solution.

mboldt
  • First of all, you need to describe exactly what network infrastructure you have. What exactly is working? Have you tried checking the network logs? – Mikołaj Głodziak Feb 03 '22 at 11:39

2 Answers

0

I would get inside one of the pods:

kubectl exec -it <podname> -n <namespace> -- bash

and run:

for i in {0..100}; do curl -sS -o /dev/null -w '%{http_code}\n' http://<windows.server>; done

Check whether there are any errors. If there are, I would run the same test from the node itself (ssh to the node and run the for loop), preferably the same node that the test pod is running on.

If you still see errors or timeouts, I would ssh to a host outside the cluster and run the same test, to narrow down whether the problem occurs only from pods, from the cluster nodes as well, or lies with the Windows server itself.
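Since the failures are intermittent, it can help to tally the results instead of reading the raw output. The following is only a rough sketch of the loop above, assuming bash, curl, sort and uniq are available wherever you run it. The function names `classify_exit` and `probe` are made up for this example; the exit codes 6, 7 and 28 are curl's documented codes for "could not resolve host", "failed to connect" and "operation timed out". A "no route to host" failure should show up under `connect`.

```shell
#!/usr/bin/env bash

# Map a curl exit code to a coarse failure category.
classify_exit() {
  case "$1" in
    0)  echo ok ;;
    6)  echo dns ;;       # could not resolve host
    7)  echo connect ;;   # failed to connect (includes "No route to host")
    28) echo timeout ;;   # --max-time exceeded
    *)  echo other ;;
  esac
}

# Request a URL N times (default 100) and print a count per category.
probe() {
  local url=$1 count=${2:-100} i
  for ((i = 0; i < count; i++)); do
    curl -sS -o /dev/null --max-time 5 "$url" 2>/dev/null
    classify_exit "$?"
  done | sort | uniq -c
}
```

Run `probe http://<windows.server> 100` from the pod, from the node and from a host outside the cluster, then compare the counts. If `dns` failures dominate, the problem is more likely name resolution than routing.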

0

I had similar errors, and they seemed to be linked to DNS query failures. CoreDNS was forwarding non-Kubernetes requests to a pfSense DNS server, and those lookups failed randomly, causing "no route to host". Check the CoreDNS logs.
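To see whether name resolution is flaky from inside a pod, something like the sketch below could be used. It assumes bash and `getent` exist in the container image, and `dns_probe` is just a name chosen for this example. The `kubectl logs` line in the comment assumes CoreDNS runs in `kube-system` with the label `k8s-app=kube-dns`, which is common but may differ in your cluster.

```shell
#!/usr/bin/env bash

# Resolve a hostname N times (default 100) and report how many lookups failed.
dns_probe() {
  local host=$1 count=${2:-100} i failed=0
  for ((i = 0; i < count; i++)); do
    getent hosts "$host" >/dev/null 2>&1 || failed=$((failed + 1))
  done
  echo "$failed/$count lookups failed"
}

# CoreDNS logs, assuming the usual namespace and label:
#   kubectl logs -n kube-system -l k8s-app=kube-dns --tail=200
```

Run `dns_probe <windows.server> 100` inside the pod. If some lookups fail, compare the timestamps with errors in the CoreDNS logs.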

Dave M
Akos