
[Question rewritten with details of findings.]

I am running a Google Container Engine cluster with about 100 containers which perform about 100,000 API calls a day. Some of the pods started failing DNS resolution roughly 50% of the time. I dug into this and it only happens for pods on nodes that are running kube-dns. I also noticed that this only happens just before a node in the cluster gets shut down for being out of memory.
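
For reference, this is roughly how I have been checking which nodes kube-dns and the worker pods land on (nothing here is specific to my setup; `-o wide` adds the node each pod is scheduled on):

```bash
# See which nodes the kube-dns pods are running on
kubectl get pods --namespace=kube-system -o wide | grep kube-dns

# Compare against where the failing worker pods are scheduled
kubectl get pods -o wide
```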

The background Resque jobs connect to Google APIs and then upload data to S3. When I see failed jobs, they fail with "Temporary failure in name resolution." This happens for both "accounts.google.com" and "s3.amazonaws.com".

When I log into the server and try to resolve these (or other) hosts with host, nslookup, or dig, it works just fine. When I connect to the Rails console and run the same code that's failing in the queues, I can't get a failure to happen. However, as I said, these background failures are intermittent (about 50% of the time for the workers running on nodes that run kube-dns).

So far, my interim fix has been to delete the failing pods and let Kubernetes reschedule them, repeating until Kubernetes scheduled them onto a node that is not running kube-dns.
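
Roughly what that looks like (the pod name is a placeholder for whichever worker pod is failing):

```bash
# Delete a failing worker pod; its controller recreates it on some node
kubectl delete pod <failing-worker-pod>

# Check which node the replacement landed on; repeat if it's a kube-dns node
kubectl get pods -o wide
```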

Incidentally, removing the failing node did not resolve this. It just caused Kubernetes to move everything to other nodes, which moved the problem along with them.

jwadsack
  • Failing other options, I deleted the VM in the cluster. GKE re-created it, but in doing so relocated the pods to one of the other running nodes, which immediately began exhibiting the DNS resolution problem. So now I'm thinking it's a Kubernetes problem. I noticed that kube-dns is running in two pods, both on the same node, and that is the node that was exhibiting the problem. – jwadsack Sep 09 '16 at 22:59
  • Could the DNS issues be caused by excessive logging in kube-dns? https://github.com/kubernetes/kubernetes/issues/28515 – jwadsack Sep 15 '16 at 22:42
  • I noticed in the logs that it appears to be trying to resolve _local_ addresses for things that should not be local. From the kube-dns logs: `Received DNS Request:accounts.google.com.default.svc.cluster.local., exact:false` – jwadsack Oct 03 '16 at 18:21
  • Also, we are having this problem again and it's only happening for containers that are on the nodes where kube-dns is running (in this case it's running two pods on two different nodes). – jwadsack Oct 03 '16 at 18:21
  • I think that line from the `kubedns` logs may be a red herring. I think that's just that it tries to check the locals first before trying remotes. I was able to resolve this again, temporarily, by deleting the affected pods until k8s scheduled them to nodes that are not running kube-dns. – jwadsack Oct 03 '16 at 19:02
  • I discovered that this happens just before some node (not necessarily the nodes running kube-dns) hits a SystemOOM error. So I think this may be related to OOM issues and decided to upgrade the cluster to 1.4 which addresses a number of stability issues under OOM conditions. – jwadsack Oct 06 '16 at 18:21

2 Answers


The problem for me was indeed kube-dns being scheduled to nodes with high memory pressure, which caused it to die constantly. What I did was create a node pool exclusively for the kube-system services. I edited the deployments via kubectl and set a nodeSelector on them so they are always scheduled to the exclusive node pool and don't compete with my services for resources.
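
Roughly what that looked like, as a sketch (the pool and cluster names are just examples; GKE applies a `cloud.google.com/gke-nodepool` label to each node in a pool, which is what the nodeSelector matches on here):

```bash
# Create a node pool reserved for kube-system workloads (names are examples)
gcloud container node-pools create system-pool --cluster=my-cluster --num-nodes=1

# Pin kube-dns (and, similarly, other kube-system deployments) to that pool
kubectl patch deployment kube-dns --namespace=kube-system --patch \
  '{"spec":{"template":{"spec":{"nodeSelector":{"cloud.google.com/gke-nodepool":"system-pool"}}}}}'
```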

Mauricio

I solved this by upgrading to Kubernetes 1.4.

The 1.4 release included several fixes to keep Kubernetes from crashing under out-of-memory conditions. I think this helped reduce the likelihood of hitting this issue, although I'm not convinced that the core issue was fixed (unless the issue was that one of the kube-dns instances had crashed or become unresponsive because the Kubernetes system was unstable when a node hit OOM).
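
For anyone doing the same, the upgrade itself was just the standard GKE flow, something like the following (cluster name and zone are placeholders):

```bash
# Upgrade the master first, then the node pools (name/zone are placeholders)
gcloud container clusters upgrade my-cluster --zone=us-central1-b --master
gcloud container clusters upgrade my-cluster --zone=us-central1-b
```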

jwadsack
  • Not sure if it's the same issue but we're also experiencing this in GKE on 1.4.0. – Trevor Hartman Dec 01 '16 at 18:21
  • @TrevorHartman agreed that this issue is still happening, although we see it less than we used to. We have had a couple of instances where the DNS inside the cluster just seems to fail for several minutes and then recover, but these have been very rare. We also added memory requests and limits to our containers (sketch below) to help the Kubernetes scheduler work better and to ensure that the containers didn't crash Kubernetes. – jwadsack Dec 01 '16 at 20:13
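
A sketch of the requests/limits change mentioned in the last comment (the deployment name and values are illustrative, not our actual configuration):

```bash
# Set memory requests/limits on a worker deployment (name/values are examples)
kubectl set resources deployment/resque-worker \
  --requests=memory=256Mi --limits=memory=512Mi
```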