Update 1

I changed the nameserver in /etc/resolv.conf to 8.8.8.8 and that fixes the problem, but it breaks internal name resolution. So what does this mean? dig querying the original NS gets all 7 records, while apps querying that same NS fail but work if the NS is 8.8.8.8.

Original

I have hit a very strange behaviour with DNS. DNS resolution for one particular domain fails for all apps (curl, wget, python) on pods in an EKS cluster, while dig and nslookup work perfectly. Other names, such as google.com, internal domains, AWS domains and other external domains, resolve fine. The failing domain is areocrapi.cognitiveservices.azure.com. The contents of /etc/resolv.conf, along with tcpdump and dig results, are below.

Platform info

$ uname -a
Linux areocr-98da57763-vrefw 4.14.203-156.332.amzn2.x86_64 #1 SMP Fri Oct 30 19:19:33 UTC 2020 x86_64 GNU/Linux

Kubernetes Version: 1.18

Docker image: debian:stable-slim

/etc/resolv.conf

nameserver 172.20.0.10
search default.svc.cluster.local svc.cluster.local cluster.local ap-south-1.compute.internal
options ndots:1
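
For reference, a minimal sketch of how a glibc-style stub resolver is generally understood to order its lookups given the search list and ndots value above. The function name `candidate_names` is hypothetical, and the model is simplified: it ignores retries, stopping at the first successful answer, and other resolver options.

```python
SEARCH = [
    "default.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "ap-south-1.compute.internal",
]


def candidate_names(name, search=SEARCH, ndots=1):
    """Return the names a stub resolver would try, in order (simplified model)."""
    # A trailing dot marks the name as fully qualified: no search list is used.
    if name.endswith("."):
        return [name]
    absolute = name + "."
    searched = [f"{name}.{domain}." for domain in search]
    # With at least `ndots` dots, the name is tried as-is before the search list.
    if name.count(".") >= ndots:
        return [absolute] + searched
    # Otherwise the search domains are tried first and the bare name last.
    return searched + [absolute]
```

With ndots:1, areocrapi.cognitiveservices.azure.com contains three dots, so under this model it is queried as-is first, which matches the first queries seen in the tcpdump output.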

Dig result

; <<>> DiG 9.11.5-P4-5.1+deb10u2-Debian <<>> areocrapi.cognitiveservices.azure.com
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 12136
;; flags: qr rd ra; QUERY: 1, ANSWER: 7, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
; COOKIE: 542f9837d8a4f664 (echoed)
;; QUESTION SECTION:
;areocrapi.cognitiveservices.azure.com. IN A

;; ANSWER SECTION:
areocrapi.cognitiveservices.azure.com. 10 IN CNAME centralindia.api.cognitive.microsoft.com.
centralindia.api.cognitive.microsoft.com. 10 IN CNAME cognitiveincprod.trafficmanager.net.
cognitiveincprod.trafficmanager.net. 10 IN CNAME cognitiveincprod.azure-api.net.
cognitiveincprod.azure-api.net. 10 IN CNAME apimgmttmzgruajuggsac8wjzmihugs4vsxibebwe3uiy0mylw.trafficmanager.net.
apimgmttmzgruajuggsac8wjzmihugs4vsxibebwe3uiy0mylw.trafficmanager.net. 10 IN CNAME cognitiveincprod-centralindia-01.regional.azure-api.net.
cognitiveincprod-centralindia-01.regional.azure-api.net. 10 IN CNAME apimgmthsgqucefft6mdmwwycilgdkzacce38eazszdtwrksob.cloudapp.net.
apimgmthsgqucefft6mdmwwycilgdkzacce38eazszdtwrksob.cloudapp.net. 10 IN A 104.211.88.173

;; Query time: 205 msec
;; SERVER: 172.20.0.10#53(172.20.0.10)
;; WHEN: Thu Jan 28 18:50:33 UTC 2021
;; MSG SIZE  rcvd: 799

The answer is a chain of 6 CNAMEs plus 1 A record. When I instead resolve centralindia.api.cognitive.microsoft.com, which is one CNAME further down the chain, the curl command works.

$ curl https://centralindia.api.cognitive.microsoft.com/
{"error":{"code":"404","message": "Resource not found"}}

Result of curl/wget/python (fail)

$ curl https://areocrapi.cognitiveservices.azure.com/
curl: (6) Could not resolve host: areocrapi.cognitiveservices.azure.com

$ wget -4 https://areocrapi.cognitiveservices.azure.com/
--2021-01-28 18:55:23--  https://areocrapi.cognitiveservices.azure.com/
Resolving areocrapi.cognitiveservices.azure.com (areocrapi.cognitiveservices.azure.com)... failed: Name or service not known.
wget: unable to resolve host address ‘areocrapi.cognitiveservices.azure.com’

$ python3
Python 3.8.6 (default, Nov 18 2020, 13:49:49)
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import socket
>>> socket.gethostbyname('areocrapi.cognitiveservices.azure.com')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
socket.gaierror: [Errno -2] Name or service not known
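
To compare the two names side by side through the same libc path that curl, wget and python use, a small helper along these lines may be handy. It is only a sketch: `try_resolve` is a hypothetical name, and the `resolver` parameter exists just so the function can be exercised without a network.

```python
import socket


def try_resolve(host, resolver=socket.getaddrinfo):
    """Resolve host via getaddrinfo and report (ok, detail) instead of raising."""
    try:
        infos = resolver(host, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror as exc:
        # For example: -2, "Name or service not known"
        return False, f"{exc.errno}: {exc.strerror}"
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    addresses = sorted({info[4][0] for info in infos})
    return True, ", ".join(addresses)
```

Calling `try_resolve("areocrapi.cognitiveservices.azure.com")` and `try_resolve("centralindia.api.cognitive.microsoft.com")` from the same pod should show the first failing and the second succeeding, if the behaviour is the same as with curl.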

tcpdump for full domain when curl command is issued (fail)

curl https://areocrapi.cognitiveservices.azure.com/

IP (tos 0x0, ttl 255, id 52638, offset 0, flags [DF], proto UDP (17), length 83)
    10.2.73.136.43389 > 172.20.0.10.53: 36572+ A? areocrapi.cognitiveservices.azure.com. (55)
IP (tos 0x0, ttl 255, id 52639, offset 0, flags [DF], proto UDP (17), length 83)
    10.2.73.136.43389 > 172.20.0.10.53: 41700+ AAAA? areocrapi.cognitiveservices.azure.com. (55)
IP (tos 0x0, ttl 253, id 1066, offset 0, flags [DF], proto UDP (17), length 345)
    172.20.0.10.53 > 10.2.73.136.43389: 41700 5/0/0 areocrapi.cognitiveservices.azure.com. CNAME centralindia.api.cognitive.microsoft.com., centralindia.api.cognitive.microsoft.com. CNAME cognitiveincprod.trafficmanager.net., cognitiveincprod.trafficmanager.net. CNAME cognitiveincprod.azure-api.net., cognitiveincprod.azure-api.net. CNAME apimgmttmzgruajuggsac8wjzmihugs4vsxibebwe3uiy0mylw.trafficmanager.net., apimgmttmzgruajuggsac8wjzmihugs4vsxibebwe3uiy0mylw.trafficmanager.net. CNAME cognitiveincprod-centralindia-01.regional.azure-api.net. (317)
IP (tos 0x0, ttl 253, id 1067, offset 0, flags [DF], proto UDP (17), length 345)
    172.20.0.10.53 > 10.2.73.136.43389: 36572 5/0/0 areocrapi.cognitiveservices.azure.com. CNAME centralindia.api.cognitive.microsoft.com., centralindia.api.cognitive.microsoft.com. CNAME cognitiveincprod.trafficmanager.net., cognitiveincprod.trafficmanager.net. CNAME cognitiveincprod.azure-api.net., cognitiveincprod.azure-api.net. CNAME apimgmttmzgruajuggsac8wjzmihugs4vsxibebwe3uiy0mylw.trafficmanager.net., apimgmttmzgruajuggsac8wjzmihugs4vsxibebwe3uiy0mylw.trafficmanager.net. CNAME cognitiveincprod-centralindia-01.regional.azure-api.net. (317)
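
One difference visible in the captures: the queries sent by curl are 55 bytes, which is exactly a DNS header plus one question, so they carry no EDNS OPT record, whereas dig advertised `udp: 4096` and received a 799-byte answer. Whether that explains the missing records is not established here. The following is a minimal sketch, using only the Python standard library, for sending the same kind of plain query and checking the truncation (TC) flag and answer count in the reply. The function names are hypothetical.

```python
import socket
import struct


def build_query(name, qtype=1, query_id=0x1234):
    """Build a plain DNS query (no EDNS OPT record); qtype 1 = A, 28 = AAAA."""
    # Header: id, flags (only RD set), QDCOUNT=1, ANCOUNT=NSCOUNT=ARCOUNT=0.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", qtype, 1)  # QCLASS 1 = IN


def parse_header(response):
    """Extract the fields of interest from the 12-byte DNS header."""
    query_id, flags, _qd, answers, _ns, _ar = struct.unpack("!HHHHHH", response[:12])
    return {
        "id": query_id,
        "truncated": bool(flags & 0x0200),  # TC bit
        "rcode": flags & 0x000F,
        "answers": answers,
    }


def ask(server, name, qtype=1, timeout=3.0):
    """Send one UDP query to server:53 and return the parsed reply header."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name, qtype), (server, 53))
        response, _ = sock.recvfrom(4096)
    return parse_header(response)
```

Running `ask("172.20.0.10", "areocrapi.cognitiveservices.azure.com")` and the same call against 8.8.8.8 from the pod would show whether the two servers return different answer counts or set the TC flag for a query without EDNS.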

tcpdump for 1 CNAME removed when curl command is issued (success)

curl https://centralindia.api.cognitive.microsoft.com/

IP (tos 0x0, ttl 255, id 24853, offset 0, flags [DF], proto UDP (17), length 86)
    10.2.73.136.59453 > 172.20.0.10.53: 7813+ A? centralindia.api.cognitive.microsoft.com. (58)
IP (tos 0x0, ttl 255, id 24854, offset 0, flags [DF], proto UDP (17), length 86)
    10.2.73.136.59453 > 172.20.0.10.53: 24205+ AAAA? centralindia.api.cognitive.microsoft.com. (58)
IP (tos 0x0, ttl 253, id 60238, offset 0, flags [DF], proto UDP (17), length 387)
    172.20.0.10.53 > 10.2.73.136.59453: 7813 6/0/0 centralindia.api.cognitive.microsoft.com. CNAME cognitiveincprod.trafficmanager.net., cognitiveincprod.trafficmanager.net. CNAME cognitiveincprod.azure-api.net., cognitiveincprod.azure-api.net. CNAME apimgmttmzgruajuggsac8wjzmihugs4vsxibebwe3uiy0mylw.trafficmanager.net., apimgmttmzgruajuggsac8wjzmihugs4vsxibebwe3uiy0mylw.trafficmanager.net. CNAME cognitiveincprod-centralindia-01.regional.azure-api.net., cognitiveincprod-centralindia-01.regional.azure-api.net. CNAME apimgmthsgqucefft6mdmwwycilgdkzacce38eazszdtwrksob.cloudapp.net., apimgmthsgqucefft6mdmwwycilgdkzacce38eazszdtwrksob.cloudapp.net. A 104.211.88.173 (359)
IP (tos 0x0, ttl 253, id 60241, offset 0, flags [DF], proto UDP (17), length 371)
    172.20.0.10.53 > 10.2.73.136.59453: 24205 5/0/0 centralindia.api.cognitive.microsoft.com. CNAME cognitiveincprod.trafficmanager.net., cognitiveincprod.trafficmanager.net. CNAME cognitiveincprod.azure-api.net., cognitiveincprod.azure-api.net. CNAME apimgmttmzgruajuggsac8wjzmihugs4vsxibebwe3uiy0mylw.trafficmanager.net., apimgmttmzgruajuggsac8wjzmihugs4vsxibebwe3uiy0mylw.trafficmanager.net. CNAME cognitiveincprod-centralindia-01.regional.azure-api.net., cognitiveincprod-centralindia-01.regional.azure-api.net. CNAME apimgmthsgqucefft6mdmwwycilgdkzacce38eazszdtwrksob.cloudapp.net. (343)
  • Is there a name service caching server in use, such as `nscd`? – mpez0 Jan 28 '21 at 19:43
  • AFAIK, there is none. I have tried replacing CoreDNS pods, replacing application pods in an attempt to remove cache. The only thing that I have not done till now is replacing the nodes. – Ashwini Dhekane Jan 28 '21 at 19:48
  • I have updated the question. Changing the NS in /etc/resolv.conf fixes the issue, but this is not something I want to do. – Ashwini Dhekane Jan 28 '21 at 19:54
  • RFC 1034 discourages this sort of chaining, but does require that resolvers handle it. No limit is specified. Only loops are prohibited. – Michael Hampton Jan 28 '21 at 20:43
  • If just changing the recursive nameserver to query solves your problem, then it is probably the first nameserver misbehaving, and not some local problem. Compare answers using `dig @` for the same query towards both nameservers. – Patrick Mevzek Jan 28 '21 at 22:52
