
I have a really weird problem with my DNS. My domain name (strugee.net) is unresolvable from some networks, and resolvable from others.

For example, on my home network (same network the server's on):

% dig strugee.net

; <<>> DiG 9.10.3-P4 <<>> strugee.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 10086
;; flags: qr rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;strugee.net.           IN  A

;; ANSWER SECTION:
strugee.net.        1800    IN  A   216.160.72.225

;; Query time: 186 msec
;; SERVER: 205.171.3.65#53(205.171.3.65)
;; WHEN: Sat Apr 16 15:42:36 PDT 2016
;; MSG SIZE  rcvd: 56

However, if I log in to a server I have on Digital Ocean, the domain fails to resolve:

% dig strugee.net      

; <<>> DiG 9.9.5-9+deb8u3-Debian <<>> strugee.net
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 58551
;; flags: qr rd ra; QUERY: 1, ANSWER: 0, AUTHORITY: 0, ADDITIONAL: 1

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 512
;; QUESTION SECTION:
;strugee.net.           IN  A

;; Query time: 110 msec
;; SERVER: 2001:4860:4860::8844#53(2001:4860:4860::8844)
;; WHEN: Sat Apr 16 18:44:25 EDT 2016
;; MSG SIZE  rcvd: 40

But, going directly to the authoritative nameservers works just fine:

% dig @dns1.registrar-servers.com strugee.net   

; <<>> DiG 9.9.5-9+deb8u3-Debian <<>> @dns1.registrar-servers.com strugee.net
; (1 server found)
;; global options: +cmd
;; Got answer:
;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 30856
;; flags: qr aa rd; QUERY: 1, ANSWER: 1, AUTHORITY: 5, ADDITIONAL: 1
;; WARNING: recursion requested but not available

;; OPT PSEUDOSECTION:
; EDNS: version: 0, flags:; udp: 4096
;; QUESTION SECTION:
;strugee.net.           IN  A

;; ANSWER SECTION:
strugee.net.        1800    IN  A   216.160.72.225

;; AUTHORITY SECTION:
strugee.net.        1800    IN  NS  dns3.registrar-servers.com.
strugee.net.        1800    IN  NS  dns4.registrar-servers.com.
strugee.net.        1800    IN  NS  dns2.registrar-servers.com.
strugee.net.        1800    IN  NS  dns1.registrar-servers.com.
strugee.net.        1800    IN  NS  dns5.registrar-servers.com.

;; Query time: 3 msec
;; SERVER: 216.87.155.33#53(216.87.155.33)
;; WHEN: Sat Apr 16 18:46:36 EDT 2016
;; MSG SIZE  rcvd: 172

It's pretty clear that there's a problem with some large network somewhere that's failing to resolve my domain, but I can't seem to figure out where. I skimmed the dig manpage for options that might help, but didn't find anything particularly useful.

I use Namecheap as both my domain registrar and my DNS host. I have the DNSSEC option turned on. I haven't made any changes to my DNS settings recently.

How can I debug this problem and find the offending nameserver?

strugee
  • Thank you for providing the name of the domain. Problems like this are extremely hard to troubleshoot by us on Serverfault without that information. – Andrew B Apr 16 '16 at 23:43
  • @AndrewB oh, I know. You're welcome, trust me :) – strugee Apr 17 '16 at 02:39
  • @AndrewB's answer makes sense and seems correct to me. Before I read it, though, I noticed your failed query used an IPv6 nameserver, while the successful ones used IPv4. Often (obviously not in this case) this hints at a bad IPv6 configuration, and it can be helpful to explicitly use the numeric IPv4 or IPv6 addresses of the nameservers instead of aliases. – Guntram Blohm Apr 17 '16 at 04:26
  • @Guntram So long as we keep in mind that we got a reply *from* the nameserver, which means that we have connectivity *to* the DNS server at least. Just want to make sure people don't walk away from that with the wrong impression...`SERVFAIL` may indicate an upstream problem, but it still indicates a reply packet. – Andrew B Apr 17 '16 at 04:59
  • @GuntramBlohm You are onto something. `strugee.net` has five NS records, but only `A` glue records and no `AAAA` glue records. What's worse is that those five `A` glue records point to only two different IP addresses. That seems like quite a brittle setup. Even if it is not the root cause of the problem at hand, it is something to watch out for. – kasperd Apr 17 '16 at 09:44
  • @AndrewB What that means is that it is not a misconfiguration of the IPv6 path between client and recursor. The misconfiguration could theoretically have been that the provider had configured some IPv4-only recursors and some IPv6-only recursors instead of configuring the recursors as dual stack like they should. – kasperd Apr 17 '16 at 09:46
  • @kasperd I understood it implied a problem described by [BCP 91](https://tools.ietf.org/html/bcp91), that's why I didn't disagree. Please take the last sentence of the prior comment at face value. I also noticed the possible [BCP 16](https://tools.ietf.org/html/bcp16) issues, but I'm withholding commentary until it's more convenient for me to analyze how they're routed across my country. We're agreed that repeating IPs are pointless. – Andrew B Apr 17 '16 at 17:02
  • @AndrewB BCP 91 more or less agrees with what I said. BCP 91 regards any IPv6-only recursor as a configuration mistake, and my comment does as well. I do however consider BCP 91 in its current form in need of an update. In 2004, when it was written, it was sensible to still consider IPv4-only DNS servers acceptable. But given that IPv4 addresses ran out years ago, an update to stop considering IPv4-only acceptable is long overdue, as that is a necessary step towards allowing IPv6-only deployments. – kasperd Apr 17 '16 at 17:19

2 Answers


How can I debug this problem and find the offending nameserver?

daxd5 offered some good starting advice, but the only real answer here is that you need to know how to think like a recursive DNS server. Because numerous misconfigurations at the authoritative layer can produce an inconsistent SERVFAIL, you need either a DNS professional or online validation tools.

I'm not trying to cop out of helping you; I just want to make sure you understand that there is no conclusive answer to that question.


In your particular case, I noticed that strugee.net appears to be a zone signed with DNSSEC. This is evident from the presence of the DS and RRSIG records in the referral chain:

# dig +trace +additional strugee.net
<snip>
strugee.net.            172800  IN      NS      dns2.registrar-servers.com.
strugee.net.            172800  IN      NS      dns1.registrar-servers.com.
strugee.net.            172800  IN      NS      dns3.registrar-servers.com.
strugee.net.            172800  IN      NS      dns4.registrar-servers.com.
strugee.net.            172800  IN      NS      dns5.registrar-servers.com.
strugee.net.            86400   IN      DS      16517 8 1 B08CDBF73B89CCEB2FD3280087D880F062A454C2
strugee.net.            86400   IN      RRSIG   DS 8 2 86400 20160423051619 20160416040619 50762 net. w76PbsjxgmKAIzJmklqKN2rofq1e+TfzorN+LBQVO4+1Qs9Gadu1OrPf XXgt/AmelameSMkEOQTVqzriGSB21azTjY/lLXBa553C7fSgNNaEXVaZ xyQ1W/K5OALXzkDLmjcljyEt4GLfcA+M3VsQyuWI4tJOng184rGuVvJO RuI=
dns2.registrar-servers.com. 172800 IN   A       216.87.152.33
dns1.registrar-servers.com. 172800 IN   A       216.87.155.33
dns3.registrar-servers.com. 172800 IN   A       216.87.155.33
dns4.registrar-servers.com. 172800 IN   A       216.87.152.33
dns5.registrar-servers.com. 172800 IN   A       216.87.155.33
;; Received 435 bytes from 192.41.162.30#53(l.gtld-servers.net) in 30 ms

Before we go any further, we need to check whether or not the signing is valid. DNSViz is a tool frequently used for this purpose, and it confirms that there are indeed problems. The angry red in the picture is suggesting that you have a problem, but rather than mousing over everything we can just expand Notices on the left sidebar:

RRSIG strugee.net/A alg 8, id 10636: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
RRSIG strugee.net/DNSKEY alg 8, id 16517: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
RRSIG strugee.net/DNSKEY alg 8, id 16517: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
RRSIG strugee.net/MX alg 8, id 10636: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
RRSIG strugee.net/NS alg 8, id 10636: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
RRSIG strugee.net/SOA alg 8, id 10636: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
RRSIG strugee.net/TXT alg 8, id 10636: The Signature Expiration field of the RRSIG RR (2016-04-14 00:00:00+00:00) is 2 days in the past.
net to strugee.net: No valid RRSIGs made by a key corresponding to a DS RR were found covering the DNSKEY RRset, resulting in no secure entry point (SEP) into the zone. (216.87.152.33, 216.87.155.33, UDP_0_EDNS0_32768_4096)
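
The same expiration dates can be spot-checked from the command line by reading the RRSIG records directly. The sketch below is only an illustration and is not part of the DNSViz output: `rrsig_expiry` is a hypothetical helper name, and it assumes dig's default one-line record format, in which the signature expiration is the ninth whitespace-separated field (YYYYMMDDHHMMSS, UTC).

```shell
# Hypothetical helper: read DNS records in dig's one-line format on stdin
# and print "<covered type> <expiration> <state>" for every RRSIG.
# $1 = current UTC time as YYYYMMDDHHMMSS
rrsig_expiry() {
  awk -v now="$1" '$4 == "RRSIG" {
    # Field 5 is the covered type, field 9 the signature expiration.
    state = ($9 < now) ? "EXPIRED" : "ok"
    print $5, $9, state
  }'
}

# Usage (needs network access, so it is left commented out):
# dig +dnssec +noall +answer @dns1.registrar-servers.com strugee.net A \
#   | rrsig_expiry "$(date -u +%Y%m%d%H%M%S)"
```

Querying an authoritative server directly matters here, because a validating recursor would answer SERVFAIL instead of returning the expired signatures.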

The problem is clear: the signatures on your zone have expired and the zone needs to be re-signed. You are seeing inconsistent results because not all recursive servers have DNSSEC validation enabled. The ones that validate are rejecting your domain with SERVFAIL, and for the ones that do not it is business as usual.
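
A rough way to tell whether a given resolver's SERVFAIL comes from validation is to repeat the query with the Checking Disabled bit set (dig's `+cdflag` option), which asks the resolver to skip DNSSEC validation. The following is a minimal sketch under that assumption: `dig_status` and `diagnose` are hypothetical helper names, and the messages they print are only a heuristic, not a definitive diagnosis.

```shell
# Hypothetical helper: extract the response code (NOERROR, SERVFAIL, ...)
# from the header line of dig output on stdin.
dig_status() {
  sed -n 's/.*status: \([A-Z]*\),.*/\1/p'
}

# Hypothetical helper: compare the status of a normal query ($1) with the
# status of the same query sent with +cdflag ($2).
diagnose() {
  if [ "$1" = "SERVFAIL" ] && [ "$2" = "NOERROR" ]; then
    echo "likely DNSSEC validation failure"
  elif [ "$1" = "SERVFAIL" ]; then
    echo "SERVFAIL not explained by validation alone"
  else
    echo "resolver answered: $1"
  fi
}

# Usage (needs network access, so it is left commented out):
# normal=$(dig @2001:4860:4860::8844 strugee.net | dig_status)
# nocheck=$(dig +cdflag @2001:4860:4860::8844 strugee.net | dig_status)
# diagnose "$normal" "$nocheck"
```

If the plain query fails but the `+cdflag` query succeeds, the resolver can reach the authoritative servers and is most likely rejecting the answer during validation.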


Edit: Comcast's DNS infrastructure is known to implement DNSSEC validation, and as one of their customers I can confirm that I'm seeing a SERVFAIL as well.

$ dig @75.75.75.75 strugee.net | grep status
;; ->>HEADER<<- opcode: QUERY, status: SERVFAIL, id: 2011
Andrew B
  • Whoops, I had `stugee.net` in the dig output, which is obviously a typo. The DNSSEC part of this analysis was done against the correct name. – Andrew B Apr 16 '16 at 23:56

While you are indeed seeing that the authoritative name servers are responding correctly, you need to follow the entire chain of DNS resolution. That is, walk down the whole DNS hierarchy, starting from the root servers.

$ dig net NS
;; ANSWER SECTION:
net.            172800  IN  NS  c.gtld-servers.net.
net.            172800  IN  NS  f.gtld-servers.net.
net.            172800  IN  NS  k.gtld-servers.net.
;; snipped extra servers given
$ dig @c.gtld-servers.net strugee.net NS
;; AUTHORITY SECTION:
strugee.net.        172800  IN  NS  dns2.registrar-servers.com.
strugee.net.        172800  IN  NS  dns1.registrar-servers.com.
;; snipped extra servers again

This basically checks that the public DNS servers are working, and it is the same thing that your DNS resolver should be doing. So you should get the same answers as above on your Digital Ocean server unless something's wrong with its DNS resolver:

$ dig net NS
$ dig strugee.net NS
$ dig strugee.net

If the first two queries fail, it's the DNS on Digital Ocean's side that is failing. Check your /etc/resolv.conf and try querying the secondary DNS server. If the secondary one works, just switch the order of the resolvers and try again.
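
The resolv.conf check can be scripted so that each configured resolver is queried in turn. This is a minimal sketch: `list_resolvers` is a hypothetical helper name, and it assumes the usual resolv.conf layout where each resolver appears on a line of the form `nameserver <address>`.

```shell
# Hypothetical helper: print the resolver addresses listed in a
# resolv.conf-style file, one per line.
# $1 = path to the file (defaults to /etc/resolv.conf)
list_resolvers() {
  awk '$1 == "nameserver" { print $2 }' "${1:-/etc/resolv.conf}"
}

# Usage (needs network access, so it is left commented out):
# for r in $(list_resolvers); do
#   printf '%s: ' "$r"
#   dig "@$r" strugee.net | grep -o 'status: [A-Z]*'
# done
```

A resolver that reports a different status from the others is the one worth investigating first.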

daxd5