
This is a Canonical Question about DNS geo-redundancy.

It's common knowledge that geo-redundant DNS servers located at separate physical locations are highly desirable when providing resilient web services. This is covered in depth by BCP 16, but some of the most frequently mentioned reasons include the following (a rough self-check sketch follows the list):

  • Protection against datacenter disasters. Earthquakes happen. Fires happen in racks and take out nearby servers and network equipment. Multiple DNS servers won't do you much good if physical problems at the datacenter knock out both DNS servers at once, even if they're not in the same row.

  • Protection against upstream peer problems. Multiple DNS servers won't prevent problems if a shared upstream network peer takes a dirt nap. Whether the upstream problem completely takes you offline, or simply isolates all of your DNS servers from a fraction of your userbase, the end result is that people can't access your domain even if the services themselves are located in a completely different datacenter.
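As a quick aside, here is the rough self-check mentioned above: it lists a zone's NS records and the networks their addresses fall into, on the theory that nameservers sharing a network very likely share a datacenter or upstream. This is only a heuristic sketch, assuming the third-party `dnspython` library is installed; `example.com` is a placeholder for your own zone.

```python
# Heuristic sketch: map each NS of a zone to the /24 (IPv4) or /48 (IPv6)
# networks its addresses live in. Nameservers sharing a network probably
# share a location and an upstream, and therefore a fate.
# Requires the third-party dnspython package; "example.com" is a placeholder.
import ipaddress
import dns.resolver

def nameserver_networks(zone):
    results = {}
    for ns in dns.resolver.resolve(zone, "NS"):
        host = str(ns.target)
        nets = []
        for rdtype in ("A", "AAAA"):
            try:
                for rdata in dns.resolver.resolve(host, rdtype):
                    ip = ipaddress.ip_address(str(rdata))
                    prefix = 24 if ip.version == 4 else 48
                    nets.append(str(ipaddress.ip_network(f"{ip}/{prefix}", strict=False)))
            except (dns.resolver.NoAnswer, dns.resolver.NXDOMAIN):
                continue
        results[host] = nets
    return results

if __name__ == "__main__":
    for host, nets in nameserver_networks("example.com").items():
        print(host, nets)
```

If every nameserver lands in the same network, a single routing or datacenter problem can take all of them out at once.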

That's all well and good, but are redundant DNS servers really necessary if I'm running all of my services off of the same IP address? I can't see how having a second DNS server would provide me any benefit if no one can get to anything provided by my domain anyway.

I understand that this is considered a best practice, but this really seems pointless!

Andrew B
  • Semi-related: [Should we host our own nameservers?](http://serverfault.com/questions/23744/should-we-host-our-own-nameservers) – Andrew B Sep 16 '16 at 22:21

2 Answers


Note: Content in dispute, refer to comments for both answers. Errors have been found and this Q&A is in need of an overhaul.

I'm removing the accept from this answer for the time being until the state of this canonical Q&A is properly addressed. (Deleting this answer would also delete the attached comments, which isn't the way to go IMO; it will probably be turned into a community wiki answer after extensive editing.)


I could quote RFCs here and use technical terms, but this is a concept that gets missed by a lot of people on both ends of the knowledge spectrum, so I'm going to try to answer this for the broader audience.

I understand that this is considered a best practice, but this really seems pointless!

It may seem pointless...but it's actually not!

Recursive servers are very good at remembering when remote servers do not respond to a query, particularly when they retry and still never see a reply. Many implement negative caching of these communication failures and will temporarily put unresponsive nameservers in the penalty box for a period of no more than five minutes. Eventually this "penalty" period expires and they resume communication; if the query fails again, the nameserver goes right back into the box, otherwise it's back to business as usual.

This is where we run into the single-nameserver problem (a toy simulation of the effect follows this list):

  • The penalty period is, by nature of the implementation, always going to be greater than or equal to the duration of the network problem. In almost all cases it will be greater, by up to an additional five minutes.
  • If your single DNS server goes into the penalty box, resolution of your domain on that recursive server is going to be completely dead for the full duration.
  • Brief routing interruptions happen on the internet more often than most people realize. TCP retransmissions and similar safeguards higher up the stack do a good job of hiding this from the user, but it's somewhat unavoidable. The internet routes around this damage for the most part thanks to safeguards built into the various standards that make up the network stack...and that includes the ones built into DNS; having geo-redundant DNS servers is part of that.
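To make the effect concrete, here is the toy simulation mentioned above. It is a deliberately simplified sketch, not any real resolver's code; the 300-second hold-down is simply the ceiling RFC 2308 permits for caching a dead-server indication, and (as the comments below point out) real implementations may use far shorter values or none at all.

```python
# Toy model (not a real resolver): a recursive server that puts an
# unresponsive nameserver in a "penalty box" for HOLDDOWN seconds.
# With one NS, every lookup during the hold-down fails outright;
# with two NS on independent networks, lookups keep succeeding.
HOLDDOWN = 300  # RFC 2308 ceiling for caching a dead-server indication

class ToyRecursive:
    def __init__(self):
        self.penalty_until = {}  # nameserver -> time when it leaves the box

    def lookup(self, nameservers, reachable, now):
        """Try each NS in turn; return True if any usable NS answers."""
        for ns in nameservers:
            if self.penalty_until.get(ns, 0) > now:
                continue                              # still in the box, skip it
            if reachable(ns, now):
                return True                           # normal answer
            self.penalty_until[ns] = now + HOLDDOWN   # mark it dead for a while
        return False                                  # SERVFAIL to the client

# A 30-second routing blip that only affects ns1:
blip = lambda ns, t: not (ns == "ns1" and 100 <= t < 130)

single, dual = ToyRecursive(), ToyRecursive()
for t in (50, 110, 140, 350, 500):
    print(t, "single NS:", single.lookup(["ns1"], blip, t),
             "two NS:", dual.lookup(["ns1", "ns2"], blip, t))
# Single NS: the failure at t=110 keeps lookups dead until t=410, long after
# the 30-second blip ended. Two NS: every lookup succeeds.
```

The point of the sketch is only the asymmetry: with a single nameserver a brief blip gets stretched out to the full hold-down period, while a second nameserver absorbs it entirely.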

Long story short, if you go with a single DNS server (and that includes listing the same IP address multiple times across your NS records), this is going to happen. It's also going to happen a lot more often than you realize, but the problem will be so sporadic that the odds of the failure 1) being reported to you, 2) being reproduced, and 3) being tied to this specific problem are extremely close to zero. They pretty much were zero if you came into this Q&A not knowing how this process worked, but thankfully that shouldn't be the case now!

Should this bother you? It's not really my place to say. Some people won't care about this five-minute interruption problem at all, and I'm not here to convince you that you should. What I am here to convince you of is that you do in fact sacrifice something by running only a single DNS server, in every scenario.

Andrew B
  • Some systems are also hopelessly dependent on DNS lookups not failing. It's a common point of failure whose lack of redundancy causes a lot of trouble. – artifex Aug 01 '15 at 06:31
    Mail, being cached, is a classic example of how you can shoot yourself in the foot with DNS down at the same time as the rest of your infrastructure. With redundant DNS, when your site is down, mail just queues up on senders' servers, and is delivered after recovery. With single DNS, inbound mail sent while you're down will often be permanently rejected by senders' servers with *non-existent domain* or similar errors. Outbound mail sent from peripheral (still-up) systems may *also* fail, because sender domain currently doesn't resolve. – MadHatter Aug 01 '15 at 06:36
    Also, a domain name is usually not only web - it's email, too. If you're using an email service provider for your domain, they may not be down even though your webserver is, and if you've got redundant DNS you'll still be able to get emails. – Jenny D Aug 01 '15 at 06:45
  • Is the 5m just the retry period of a single server? Won't this multiply with many servers in the chain, and won't the client cache bad queries, too? – Nils Aug 02 '15 at 07:03
  • @Nils Can you reword that slightly? I'm having trouble determining whether you mean many servers in a recursive cluster, or many authoritative servers. The 5m negative caching interval is per server, so you have to be getting a lot of requests to get a single record negative cached on an entire recursive cluster - making the failures even more sporadic. – Andrew B Aug 04 '15 at 13:43
  • I'm thinking of a scenario with secondary DNS servers and a client contacting the secondaries after a failure of the primary authoritative server, in an unlucky TTL constellation (secondary timeout while the primary is still unreachable). – Nils Aug 05 '15 at 06:17
  • @Nils The secondaries won't have a problem because they can survive without the primary until the `expire` value in the SOA is reached. Negative caching is more of a concern for recursive DNS servers trying to query authoritative servers. – Andrew B Aug 05 '15 at 07:57
  • @Nils, the `5m` from the referenced RFC is the absolute maximum that a failure could be cached for without violating the RFC; it's highly doubtful anyone would cache failures for 5 minutes -- otherwise, each outage at the client will result in an extra 5 minutes of the DNS being out. – cnst Jan 06 '17 at 19:13
  • @cnst BIND 9.11's servfail cache timeout defaults to 1 second, with a maximum configurable value of 30 seconds. – Alnitak Jan 06 '17 at 23:29
  • @Alnitak, many thanks; feel free to comment here (http://serverfault.com/questions/479367/how-long-a-dns-timeout-is-cached-for) and also provide a point of reference, if at all possible. BTW, it should also be taken in context: the attempt to resolve the name will itself likely take several seconds to time out, so a 1s cache is basically shorter than what it takes to click that refresh button! – cnst Jan 06 '17 at 23:34

OP asks:

That's all well and good, but are redundant DNS servers really necessary if I'm running all of my services off of the same IP address? I can't see how having a second DNS server would provide me any benefit if no one can get to anything provided by my domain anyway.

Great question!

The best answer is provided by Professor Daniel J. Bernstein (PhD, Berkeley), who is not only a world-renowned researcher, scientist and cryptologist, but has also written a very popular and well-received DNS suite known as djbdns (last released 2001-02-11, still popular to this day).

http://cr.yp.to/djbdns/third-party.html (2003-01-11)

Costs and benefits of third-party DNS service

Pay attention to this short and succinct part:

Erroneous arguments for third-party DNS service

The second tactic is to claim that widespread DNS clients will do something Particularly Evil when they are unable to reach all DNS servers. The problem with this argument is that the claim is false. Any such client is clearly buggy, and will be unable to survive in the marketplace: consider what happens if the client's routers briefly go down, or if the client's network is temporarily flooded.

As such, the original answer for this question couldn't be more wrong.

Yes, short temporary network outages lasting a few seconds do happen every now and then. No, a failure to resolve a name during such an outage would not be cached for any number of minutes (otherwise, even having the best setup of highly available authoritative nameservers in the world wouldn't help).

Any software that liberally implements the conservative guideline of up to 5 minutes from the 1998-03 RFC (RFC 2308) for caching failures is simply broken by design, and having an extra geo-redundant server won't make a dent.

In fact, as per [How long a DNS timeout is cached for?](http://serverfault.com/questions/479367/how-long-a-dns-timeout-is-cached-for), in BIND the SERVFAIL condition was traditionally NOT cached at all prior to 2014, and, since 2015, is cached by default for only 1 second, less than what it'd take an average user to reach a resolver timeout and hit that Refresh button again.

(And even before we get to the above point of whether or not a failed resolution attempt should be cached, it takes more than a couple of dropped packets for even the first SERVFAIL to occur in the first place.)

Moreover, the BIND developers have even implemented a ceiling for the feature, of only 30s, which, even as a ceiling (i.e., the maximum value that the feature will ever accept), is already 10 times lower than the 5-minute (300s) suggestion from the RFC, ensuring that even the most well-intentioned admins (of the eyeball users' resolvers) won't be able to shoot their own users in the foot.
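If you'd rather observe this behaviour than take anyone's word for it, here is a rough sketch of an experiment, assuming the third-party `dnspython` library; the resolver address and the deliberately broken zone name below are placeholders you would substitute with your own recursive resolver and a zone whose nameservers are unreachable.

```python
# Rough experiment: ask a recursive resolver the same question twice and
# time the answers. If SERVFAIL were negatively cached for minutes, the
# second query would come back SERVFAIL near-instantly; with short (or no)
# caching, the resolver retries the authoritative servers and the second
# attempt takes roughly as long as the first.
# 192.0.2.53 and "broken.example.net." are placeholders.
import time
import dns.exception
import dns.message
import dns.query
import dns.rcode

RESOLVER = "192.0.2.53"        # placeholder: your recursive resolver's IP
QNAME = "broken.example.net."  # placeholder: a zone with unreachable nameservers

for attempt in range(2):
    query = dns.message.make_query(QNAME, "A")
    start = time.monotonic()
    try:
        response = dns.query.udp(query, RESOLVER, timeout=10)
        rcode = dns.rcode.to_text(response.rcode())
    except dns.exception.Timeout:
        rcode = "TIMEOUT"
    print(f"attempt {attempt + 1}: {rcode} after {time.monotonic() - start:.2f}s")
    time.sleep(2)
```

With the default BIND 9.11 behaviour described above, both attempts should take comparably long, which is exactly the point: the failure is not being held against you for minutes on end.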


In addition, there are many reasons why you may not want to use a third-party DNS service -- read through the whole djbdns/third-party.html for all the details -- and renting a tiny extra server just for DNS, to administer by yourself, is hardly warranted when no need other than BCP 16 exists for such an endeavour.

In my personal "anecdotal" experience of owning and setting up domain names since at least 2002, I can tell you with all certainty and honesty that I have, in total, had significant downtime of my various domains due to the professionally run third-party DNS servers of my registrars and hosting providers. One provider at a time, and over the years, they all had their incidents: they were unavailable and brought my domains down unnecessarily, at the exact time when my own IP address (where the HTTP and SMTP for a given domain were hosted) was fully reachable otherwise. Do note that these outages happened with multiple independent, respected and professionally run providers, are by no means isolated incidents, do happen on a yearly basis, and, as a third-party service, are entirely outside of your control; it just so happens that few people ever talk about it long-term.


In short:

Geo-redundant DNS is NOT at all necessary for small sites.

If you're running all of your services off of the same IP address, adding a second DNS server is most likely to add another point of failure, and is detrimental to the continued availability of your domain. The "wisdom" of always having to do it in every imaginable situation is a very popular myth indeed; BUSTED.

Of course, the advice would be totally different should some of the domain's services, be that web (HTTP/HTTPS), mail (SMTP/IMAP) or voice/text (SIP/XMPP), already be provided by third parties, in which case eliminating your own IP as a single point of failure would indeed be a very wise approach, and geo-redundancy would indeed be very useful.

Likewise, if you have a particularly popular site with millions of visitors, and knowingly require the additional flexibility and protections of geo-redundant DNS as per BCP 16, then… you probably aren't using a single server/site for web/mail/voice/text anyway, so this question and answer obviously don't apply. Good luck!

cnst
    While I'm more than happy to invite established professionals to review both answers, I'm getting more than just a little vibe of theatrics out of this verbiage. As such, while I'll accept whatever opinions are rendered by parties whose opinions I trust far more than yours or mine, I'm choosing to recuse myself from participating further in this comment thread. – Andrew B Jan 06 '17 at 19:56
  • I'm not sure what your comment is meant to say. You answered your own question with an argument that's simply invalid as per the point illustrated in my answer, quoted directly from DJB. Your answer is incorrect; as such, you're doing a disservice to the community by upholding a myth. If you'd like to rescind your answer, and accept mine, I'm happy to accept constructive criticisms/edits on it. – cnst Jan 06 '17 at 20:05
    Read as: I'm asking established professionals to read both and comment publicly. Action will follow. Undated DJB articles are not authoritative. – Andrew B Jan 06 '17 at 20:23
  • @AndrewB, the quoted paragraph from the "undated" DJB article is sufficient on its own, it's easy to see why it would be true, and it directly contradicts the main takeaway of your answer. BTW, his article is not "undated", either; it's dated 2003-01-11, which is some 5 years past the RFC at stake (for context, [the last djbdns release, 1.05, was on 2001-02-11](http://marc.info/?l=djbdns&m=98193301224443&w=2)). Your answer doesn't even cite where those "many" implementations come from, other than noting that the condition itself is described in the RFC circa 1998-03. – cnst Jan 06 '17 at 21:26
    Good software will recognise a SERVFAIL response (generated by a recursive server if none of the authoritative servers can be reached) and handle it appropriately, i.e. by queuing SMTP mail. Unfortunately not all software is good. There's a certain professor whose dogmatic approach to implementing protocols has been known to cause significant interoperability problems... – Alnitak Jan 06 '17 at 23:54
    The current state of the industry and what is in the wild is far more relevant than anything from 2003, let alone 2001. This is why relevant third party opinions were of more value than judging the matter by a dated editorial, albeit one that could have potentially survived the test of time. Alnitak pointed out that my memory of how BIND handled this case was in error, and my reinforcing that memory with verbiage from RFC 2308 was indeed fallacious. Retraction will follow this week as I find time. – Andrew B Jan 07 '17 at 12:51
  • @Alnitak, I won't argue whether or not DJB is a "jerk", it's definitely a losing proposition; however, I don't agree that a community losing face in an argument is a correct way of manifesting "significant interoperability problems". – cnst Jan 08 '17 at 03:56
  • @AndrewB, I have since discovered that prior to 2014, BIND, apparently, didn't cache SERVFAIL at all http://serverfault.com/a/824875/110020, so, it is all the more puzzling that so many people keep spreading the very same hypothetical rumours about DNS that were debunked back in 2003 (if not earlier). – cnst Jan 08 '17 at 03:59
  • Not puzzling at all, in my experience. Short of reading source code or extensively testing individual products, technical people are going to fall back on the reference documentation. Even if a feature is optional, if it appears operationally useful they will assume at least one major product implements it. Sometimes people will even mistakenly attribute past problems they've troubleshot to it, as happened in my case. Everyone else is just going to parrot what sounds smart to them. – Andrew B Jan 08 '17 at 08:30
  • Your view appears inconsistent: you first refer to a doc approved by a committee in 1998 as the premise of your whole answer (in 2015, no less); you then dismiss as outdated a live document written by an acclaimed expert and sceptic that just so happens to have never needed updating since 2003; yet in the end you go back to the static 1998 doc and claim it's reasonable to make up facts based on a document that didn't even require things to be the way you claim they are; it merely defined 5m as an absolute ceiling, not any sort of default. – cnst Jan 09 '17 at 16:17
  • I explained why it's unsurprising. I didn't say it was a good thing that was unsurprising, or that it justified any of the behaviors. Beyond that, you seem to remain puzzled at why someone was inclined to brush off a dated commentary on an old standard. The difference is that one defines a standard, and the other doesn't. The other is a commentary that is a snapshot in time. The snapshot in time perspective is, again, completely irrelevant to what is in production *today*. Nor is DJB someone who is considered to be actively engaged in the DNS community these days, more of an outside observer. – Andrew B Jan 09 '17 at 19:19
    Side note: I relented on not engaging you for the sake of acknowledging factual error on my part, but it seems we're back into the territory of borderline belligerence. I apologize for spreading misinformation and have acknowledged the error, but I have no further wish to engage you. (nor shall I, as you appear to have a history of that here) – Andrew B Jan 09 '17 at 19:34
  • "a very popular and well-received DNS suite" That is VERY subjective. And if it was so popular, why almost noone uses it in practice? – Patrick Mevzek Mar 28 '22 at 00:15