[Question rewritten with details of findings.]
I am running a Google Container Engine cluster with about 100 containers which perform about 100,000 API calls a day. Some of the pods started getting 50% failures in DNS resolution. I dug into this, and it only happens for pods on nodes that are running kube-dns. I also noticed that this only happens just before a node in the cluster gets shut down for being out of memory.
The background Resque jobs are connecting to Google APIs and then uploading data to S3. When I see failed jobs, they fail with "Temporary failure in name resolution." This happens for both "accounts.google.com" and "s3.amazonaws.com".
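Since the failures are transient, one stopgap inside the jobs themselves is to retry on the resolution error rather than letting the job fail outright. This is a minimal sketch, not the original job code; the method name and parameters are hypothetical:

```ruby
require "resolv"

# Hypothetical helper: retry a block a few times when it fails with a
# transient DNS error ("Temporary failure in name resolution" surfaces
# as SocketError in Ruby; Resolv raises Resolv::ResolvError).
def with_dns_retry(attempts: 3, delay: 0)
  tries = 0
  begin
    yield
  rescue SocketError, Resolv::ResolvError
    tries += 1
    raise if tries >= attempts  # give up after the last attempt
    sleep delay
    retry
  end
end
```

A job body could then wrap its API call, e.g. `with_dns_retry(delay: 1) { upload_to_s3 }`. This only masks the symptom, of course; it does not fix whatever kube-dns is doing on those nodes.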
When I log into the server and try to connect to these (or other) hosts with host, nslookup, or dig, it seems to work just fine. When I connect to the Rails console and run the same code that's failing in the queues, I can't get a failure to happen. However, as I said, these background failures seem to be intermittent (about 50% of the time for the workers running on nodes running kube-dns).
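One way to quantify the intermittency instead of spot-checking from the console is to hammer the resolver in a loop and measure the failure rate. A minimal sketch, assuming nothing beyond Ruby's stdlib Resolv; the method name is illustrative:

```ruby
require "resolv"

# Resolve `host` `samples` times and return the fraction of lookups
# that failed, so an intermittent ~50% failure shows up as ~0.5.
# The resolver is injectable (anything responding to #getaddress).
def dns_failure_rate(host, samples: 20, resolver: Resolv::DNS.new)
  failures = 0
  samples.times do
    begin
      resolver.getaddress(host)
    rescue Resolv::ResolvError, Errno::ECONNREFUSED
      failures += 1
    end
  end
  failures.to_f / samples
end
```

Running this for "accounts.google.com" from a pod on a kube-dns node versus one on another node would make the 50% pattern concrete and comparable.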
So far, my interim fix has been to delete the pods that were failing, let Kubernetes reschedule them, and keep doing this until Kubernetes scheduled them onto a node not running kube-dns.
Incidentally, removing the failing node did not resolve this. It just caused Kubernetes to move everything to other nodes, which moved the problem along with it.