We have two DNS servers listed in our NS records. Last night, one of them went down. As expected, some DNS servers stopped resolving our hostnames. I assumed this would be temporary and that resolution would resume once the TTL of our NS records (1 hour) expired.

More than an hour later, I was still getting DNS timeouts from desktops that were using Earthlink, Verizon and OpenDNS servers. I tested to see if the other DNS server was answering:

dig @ns2.example.com www.example.com +short

This worked.

My questions:

  1. Does anyone have an answer as to why other DNS servers were not hitting our other DNS server even after the TTL expired?
  2. Do DNS servers prefer a domain's main DNS server (from the SOA record)?
  3. Is there any algorithm used to pick a nameserver from the available NS records? I'm assuming this is implementation specific but perhaps there are some standards that apply here.
Belmin Fernandez

1 Answer

This is an unfortunate irritation. Multiple DNS servers are supposed to increase reliability, but in practice they frequently have the reverse effect.

The problem is that the client only waits so long for a response, and the name server it asks waits about that same amount of time before giving up on an authoritative server. Say you have two DNS servers, A and B. Say A is working and B has failed. This happens:

  1. Client connects to name server Z and asks it for the information. Z chooses B and sends a query.

  2. The client times out because name server Z did not respond.

  3. Client tries name server Y. Y chooses B and sends a query.

  4. Name server Z times out and tries A. It gets the right answer, but the client isn't waiting any more.

  5. The client times out because name server Y did not respond.

  6. The client gives up, having both its name servers fail to respond.

  7. Name server Y times out and tries A. It gets the right answer, but the client isn't waiting any more.
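
The timeline above can be sketched as a small timing model. This is only an illustration, not how any particular resolver behaves: the function name `first_answer_ms`, the 5000 ms timeouts and the 50 ms round trip are all assumed values chosen to make the race visible.

```python
def first_answer_ms(first_choices, client_timeout_ms=5000,
                    resolver_timeout_ms=5000, rtt_ms=50, dead="B"):
    """Return when (in ms) the client first gets an answer, or None if it gives up.

    first_choices[i] is the authoritative server that the client's i-th
    name server (Z, Y, ...) happens to try first. All numbers are assumptions.
    """
    for i, choice in enumerate(first_choices):
        # The client moves to its next name server after each timeout.
        asked_at = i * client_timeout_ms
        if choice == dead:
            # The name server waits out its own timeout, then tries the live server.
            answered_at = asked_at + resolver_timeout_ms + rtt_ms
        else:
            answered_at = asked_at + rtt_ms
        # The reply only counts if the client is still waiting on this name server.
        if answered_at <= asked_at + client_timeout_ms:
            return answered_at
    return None


print(first_answer_ms(["B", "B"]))                            # both pick the dead server
print(first_answer_ms(["B", "A"]))                            # second name server gets lucky
print(first_answer_ms(["B", "B"], resolver_timeout_ms=2000))  # name servers give up on B sooner
```

With equal timeouts, the first call returns `None`: each name server gets its answer 50 ms after the client has stopped listening. Shortening the name server's timeout below the client's lets the same unlucky choice still succeed.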

And there's no good solution. The longer you wait to see if a nameserver replies, the longer you need to wait because the name server you are waiting for itself waits longer. Arguably, the problem was that Y and Z didn't give up on B fast enough.

Essentially, if any of your name servers are out, some clients will, through sheer bad luck, time out because they tried only the bad ones.

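On the bright side, if you have two nameservers and one fails, about 75% of lookups in the scenario above will still get an answer (the client only loses when both of its name servers happen to pick the dead one), instead of 0%.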
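
The 75% figure can be checked by enumerating the cases. This sketch assumes each of the client's name servers picks its first authoritative server uniformly at random and independently, which real resolvers need not do; `success_fraction` is a name made up for the illustration.

```python
from fractions import Fraction
from itertools import product


def success_fraction(servers=("A", "B"), dead=("B",), resolvers=2):
    """Fraction of equally likely outcomes in which at least one of the
    client's name servers tries a live authoritative server first.

    Assumes uniform, independent choices (an assumption, not a standard).
    """
    outcomes = list(product(servers, repeat=resolvers))
    ok = sum(1 for picks in outcomes if any(p not in dead for p in picks))
    return Fraction(ok, len(outcomes))


print(success_fraction())  # 3/4
```

Of the four combinations (AA, AB, BA, BB), only BB leaves the client with no timely answer.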

David Schwartz
  • I see what you mean. Eek. So the client's nameserver (`Z`) will not cache which nameserver it last used that worked? – Belmin Fernandez Oct 31 '11 at 03:12
  • Some name servers do that and sometimes that helps. It often depends on the precise way in which the nameserver failed. You have to remember that this is all on top of UDP, so the failure to get a reply (even after a retransmission or two) doesn't prove there's anything wrong with the nameserver. – David Schwartz Oct 31 '11 at 03:16
  • I read in my copy of DNS and BIND (Paul Albitz and Cricket Liu, O'Reilly, p. 278) that BIND 8.2.3 servers choose the server that responds quickest from their list of forwarders, which means that if a server in the list fails, it is pretty much automatically dropped. BIND 9 doesn't yet implement this; it queries the forwarders in list order. Does anybody know if this has changed? – Jaydee Oct 31 '11 at 15:46
  • Just to clarify, for those less versed in DNS setups (it took me a while to understand this) the DNS name servers Z and Y in this example are most likely recursive name servers based in the client's network, e.g. the DNS servers an ISP provides to its customers via DHCP. And the problem arises when these servers have a longer timeout value than the client DNS resolver (e.g. device operating system.) – Jordan Rieger Jun 01 '18 at 23:04
  • @Jaydee the picking by RTT is in Bind9, see https://gitlab.isc.org/isc-projects/bind9/blob/master/lib/dns/resolver.c#L3290 and https://gitlab.isc.org/isc-projects/bind9/blob/master/lib/dns/include/dns/adb.h#L218 – Patrick Mevzek Apr 08 '19 at 21:19
  • David, I am seeing this problem reported only for AIX clients. Note that all clients have name servers listed in their resolv.conf files, and both name servers are Windows AD DNS servers. Reducing the timeout and attempts values in /etc/resolv.conf improves the situation. – Biman Roy Jan 20 '20 at 23:16