
I have a problem where the DNS entry for an external domain broke. The nature of the problem at the time is unknown.

That domain was queried from a Kubernetes cluster pod in Google Kubernetes Engine while the entry was broken. The problem persists (the incident happened over 2 months ago) when querying that domain from the cluster.

The cluster's DNS resolver uses metadata.google.internal for DNS resolution, and from the cluster these dig queries behave as follows:

```
dig problematic.external.domain @169.254.169.254
# does not resolve and takes over 2 seconds
dig problematic.external.domain @1.1.1.1
# resolves correctly in under 200 ms
```

Creating a new VM in the same project and zone resolves the problematic domain correctly. This affects only the active cluster's metadata-server DNS resolver.

Is there a way to flush DNS caches, or do you have any other suggestions?

In general, I'm trying to avoid editing in-cluster DNS settings and would prefer some other means to fix it.

Edit, more info: NodeLocal DNSCache is already active in the cluster, and per its documentation, https://cloud.google.com/kubernetes-engine/docs/how-to/nodelocal-dns-cache, the problem is the metadata DNS server. This excerpt from the benefits list:

> DNS queries for external URLs (URLs that don't refer to cluster resources) are forwarded directly to the local Cloud DNS metadata server, bypassing kube-dns.

That metadata server is the IP 169.254.169.254.
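To separate the NodeLocal DNSCache hop from the metadata server itself, a diagnostic sketch from inside a pod might look like the following. The node-local listen address 169.254.20.10 is the usual default on GKE, but treat it as an assumption for your cluster:

```shell
# Query via the NodeLocal DNSCache listen address (assumed to be 169.254.20.10).
dig +time=2 +tries=1 problematic.external.domain @169.254.20.10

# Query the Cloud DNS metadata server directly, bypassing the node-local cache.
dig +time=2 +tries=1 problematic.external.domain @169.254.169.254

# Control query against a public resolver.
dig +time=2 +tries=1 problematic.external.domain @1.1.1.1
```

If the first two queries fail identically while the third succeeds, the fault is upstream of the node-local cache, which matches the behavior described above.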

Manwe
  • Has your issue been resolved? If yes, can you accept the solution that was provided? – Fariya Rahmat Nov 10 '21 at 07:45
  • It has not been solved. – Manwe Nov 11 '21 at 09:00
  • @Manwe Try the following [tool](https://developers.google.com/speed/public-dns/cache) from GCP and read the FAQs there, in case the TTL (time-to-live) hasn't expired and you have already tried the methods in the link. If you just need the K8s cluster up and running, you can disable NodeLocal DNSCache temporarily; see the [warning](https://cloud.google.com/kubernetes-engine/docs/how-to/nodelocal-dns-cache#enable). – Abhijith Chitrapu Nov 13 '21 at 16:53

2 Answers


Although there is no specific way to flush the Cloud DNS metadata server's cache, each record has a TTL, and GCE DNS generally respects it: entries expire after that time and the cache is invalidated.

Nevertheless, if the problem is with a node-level cache, it should be fixed by cordoning the affected GKE node using the `kubectl cordon $NODENAME` command and moving the workloads onto another node.
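A sketch of that approach, assuming you have identified the affected node (`$NODENAME` is a placeholder):

```shell
# Mark the affected node unschedulable so no new pods land on it.
kubectl cordon "$NODENAME"

# Evict existing workloads so they reschedule onto healthy nodes.
# --ignore-daemonsets is needed because node-local-dns runs as a DaemonSet.
kubectl drain "$NODENAME" --ignore-daemonsets --delete-emptydir-data
```

Once the workloads are running elsewhere, the node can be deleted or uncordoned later with `kubectl uncordon "$NODENAME"`.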

Furthermore, you can bypass GCE DNS by specifying a stub DNS configuration; check out this link for details.
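As an illustration only, a stub-domain entry in the kube-dns ConfigMap might look like the sketch below. The zone name `external.domain` and the upstream `1.1.1.1` are placeholders, and note that this is exactly the kind of in-cluster DNS change the question author prefers to avoid:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  # Forward queries for this zone to an alternate upstream resolver
  # instead of the metadata server. "external.domain" is a placeholder.
  stubDomains: |
    {"external.domain": ["1.1.1.1"]}
```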

Anant Swaraj
  • As I specified, I'm trying to avoid in-cluster configuration for DNS. The problem is with the Google metadata DNS server, so cordoning will not help (the DNS server is NOT on the cluster). And yes, normally the caches clear, etc., but this DNS server is borked and does not clear the cache for that specific domain. – Manwe Aug 17 '21 at 05:45
  • There's no way to directly interact with the metadata server. If it were just an issue with node-local-dns, `kubectl -n kube-system rollout restart daemonsets node-local-dns` would help. Usually the quickest way to deal with such issues is to move the workloads off that node and onto a new one, and prevent new ones from starting on that node. – Anant Swaraj Aug 18 '21 at 10:49

The NodeLocal DNSCache add-on can help resolve the mentioned domains in your case, as it forwards DNS queries for external URLs directly to the local Cloud DNS metadata server, bypassing kube-dns. Since your Compute Engine VM can resolve the DNS name in question (using local Cloud DNS), your cluster should be able to do so as well.

Refer to this documentation for detailed instructions on how to configure NodeLocal DNSCache on a GKE cluster.

Anant Swaraj
  • I'm already using NodeLocal DNSCache. The problem is in the Google metadata-server DNS implementation. `DNS queries for external URLs (URLs that don't refer to cluster resources) are forwarded directly to the local Cloud DNS metadata server, bypassing kube-dns.` That Cloud DNS metadata server is the problem. – Manwe Aug 13 '21 at 11:18