4

A domain can have multiple name servers registered at its domain registrar. The name servers are picked at random, not, as you might expect, primary first, secondary second, and so on.

Knowing that, does this mean that when one name server is down, there is a 50% chance that the visitors whose queries hit the offline name server will never reach your site, while the other 50% can browse to your site just fine, thereby affecting the site's availability?

Lastly, why would clients not, by default, query the next name server in the list when one is down?

The same applies to IPv4 versus IPv6. If one of the name servers supports only IPv6 and not IPv4, and a user without IPv6 connectivity happens to query that specific name server, the site will be unreachable, I suppose.

Additionally, I'm asking explicitly about the way the authoritative server is picked and about the handling of a failure when the picked authoritative server is unavailable, due either to downtime or to an IPv4/IPv6 incompatibility between client and server.

Bob Ortiz
  • Can you elaborate more on why you would believe recursive servers do not attempt to cycle through the remaining nameservers? They do not simply give up if the first authoritative is unreachable or misconfigured (i.e. returning `SERVFAIL`). They *do* stop if they successfully reach a server that replies with a statement that the requested record does not exist (`NXDOMAIN`, `NOERROR` + 0 answers), which is not an error condition. – Andrew B Jun 21 '17 at 18:17
  • @AndrewB My question is not specifically about recursive DNS it might be as well iterative. My question is easy, end-users (Windows, Linux, Mac...) connect to domains of which the name servers are registered at the domain registrar. If I understand right, the order of which server is used might be random and not logically primary first and if it fails secondary and so on. Would you like me to elaborate more? If so, what specifically? – Bob Ortiz Jun 21 '17 at 18:22
  • That statement is actually incorrect as worded, and may be where the confusion lies. End user machines are stub resolvers (sometimes called dumb resolvers) which rely on a recursive server to function. The recursive server talks to the authoritative servers defined at the registrar level. Stub resolvers are unable to follow referrals, never talk to authoritative servers, and are wholly dependent on recursive servers. – Andrew B Jun 21 '17 at 18:24
  • @AndrewB great, but I still have no clue about the scenario user > registrar > nameserver 1,2,3? > ip. The reason why I ask is because I want to be sure about the behaviour of the software on the userside. What if the primary server is down? Will it try secondary? Or will it randomly pick the primary or secondary server which means the second nameserver also always have to be online. You seem like some guy with knowledge about DNS, would you like to describe this in an answer? – Bob Ortiz Jun 21 '17 at 18:28
  • I'd like to provide an answer, but I'm trying to work my way back from the initial assumption: why are you under the impression that recursive servers do not operate in the fashion that you've described? That will help me ensure that I'm answering the *right* question. Recursive servers *do* cycle through the authoritative servers defined at the registrar, starting with a randomly chosen one, and continue on to the next if the first is unresponsive or misconfigured. – Andrew B Jun 21 '17 at 18:30
  • @AndrewB I did a small check on the terminology, correct me if I'm wrong. User > recursive (for example 8.8.8.8) > authoritative (as described in domain registrar) > host ip. My question is explicitly about the authoritative servers in this case. I don't care about how and what the user configures regarding their recursive dns settings. Assuming that settings work accordingly. I want to know indeed how "recursive servers" question the "authoritative servers". Is it random order? And what if the random picked one is secondary but that does only support ipv6? I think you get my question. – Bob Ortiz Jun 21 '17 at 18:37
  • @AndrewB I clarified the question slightly. – Bob Ortiz Jun 21 '17 at 18:40
  • Let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/60851/discussion-between-evander-consus-and-andrew-b). – Bob Ortiz Jun 21 '17 at 18:45
  • I suppose the "50%" is based on the idea of having two nameservers specifically, even though you acknowledge that there could very well be more? – Håkan Lindqvist Jun 21 '17 at 18:52
  • @HåkanLindqvist exactly. The 50% is just an example of a case in which there are indeed only two. – Bob Ortiz Jun 21 '17 at 18:53

3 Answers

6

Lastly, why would clients not by default question the next name server in the list when one is down?

That is exactly what recursive servers do when talking to authoritative servers. RFC 1035 §7.2 describes the overall process if you're interested, but the following excerpts are the most immediately relevant:

The key algorithm uses the state information of the request to select the next name server address to query, and also computes a timeout which will cause the next action should a response not arrive. The next action will usually be a transmission to some other server, but may be a temporary error to the client.

[snip]

  • If a resolver gets a server error or other bizarre response from a name server, it should remove it from SLIST, and may wish to schedule an immediate transmission to the next candidate server address.

There are a few other factors considered in the selection of the authoritative server, such as the observed response time based on prior communication history. It's there in the RFC if you're interested.
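As a rough sketch of the loop the RFC describes (the `query_fn` callback and the `Reply` shape here are hypothetical placeholders, not any real resolver's API), the SLIST behavior might look like:

```python
import random

def resolve(qname, slist, query_fn, timeout=2.0):
    """Sketch of the RFC 1035 §7.2 loop: query candidate authoritative
    servers one at a time, dropping any that time out or return a
    server error, until one gives a usable answer."""
    candidates = list(slist)
    random.shuffle(candidates)          # start from a randomly chosen server
    while candidates:
        server = candidates.pop(0)      # remove from SLIST once tried
        try:
            reply = query_fn(server, qname, timeout)
        except TimeoutError:
            continue                    # unresponsive: transmit to next candidate
        if reply.rcode == "SERVFAIL":
            continue                    # server error: try the next candidate
        # NXDOMAIN / NOERROR are authoritative answers, not errors: stop here.
        return reply
    raise RuntimeError("all authoritative servers failed")
```

Real implementations additionally weight candidates by measured round-trip time instead of shuffling blindly, but the stop conditions are the ones shown in the comments.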

The key to ensuring that you are not impacted by nameserver unreachability is covered by BCP 16. In particular, Section 3.1 states:

Secondary servers must be placed at both topologically and geographically dispersed locations on the Internet, to minimise the likelihood of a single failure disabling all of them.

That is, secondary servers should be at geographically distant locations, so it is unlikely that events like power loss, etc, will disrupt all of them simultaneously. They should also be connected to the net via quite diverse paths. This means that the failure of any one link, or of routing within some segment of the network (such as a service provider) will not make all of the servers unreachable.

This is to account for the fact that the resiliency of your domain is severely impacted by single points of failure on the network, or on the physical site. The ideal state is to have multiple authoritative nameservers that are not impacted by any change in network or physical state experienced by the others.

Andrew B
4

I would say that the answer to the overall sentiment of the question is "no".

First off, the client machine traditionally only has a stub resolver, blindly sending all queries (with "recursion desired" set) to some configured nameserver address (resolv.conf).

It's really in the next step, when that nameserver processes the recursion request and makes iterative queries until it reaches the authority, that your question applies.

And while there is some degree of implementation-specific behavior, the resolver is absolutely expected to work its way through the authoritative nameservers until it finds one that is responsive.
The caveat is rather that there is some overall timeout, so there is a risk that it cannot finish in time.
That said, it's also common to keep tabs on which servers are working and which aren't, increasing the chances that successive queries will succeed in a timely fashion; and of course, queries for already-cached data won't require any communication with the authoritative servers.

All in all, no, you should not expect 50% chance of user-visible error if there are two nameservers and one is down. More likely the first lookup in a completely cold-cache scenario will just be slightly slow.
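A back-of-the-envelope simulation (purely illustrative, not modeling any real resolver) of the two-servers-one-down case bears this out: every lookup succeeds, and roughly half are merely delayed by one timeout.

```python
import random

def lookup(servers_up):
    """One cold-cache lookup: pick the servers in random order and
    retry on timeout. Returns (succeeded, timeouts_incurred)."""
    timeouts = 0
    for i in random.sample(range(len(servers_up)), k=len(servers_up)):
        if servers_up[i]:
            return True, timeouts
        timeouts += 1   # dead server: wait out the timeout, try the next
    return False, timeouts

random.seed(1)
results = [lookup([True, False]) for _ in range(10_000)]
print(all(ok for ok, _ in results))                # True: no lookup ultimately fails
print(sum(t for _, t in results) / len(results))   # ~0.5: about half are slowed once
```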

Håkan Lindqvist
  • Yeah, it's important to note the cold cache scenario. The initial query may time out depending on how many bad servers a recursive system has to cycle through before finding a valid response. Usually it will be ready to respond on the retry. It's possible to get extremely unlucky and have all of your retries land on systems in a cluster that also have cold cache for that name, but statistically unlikely. (and moreso for popular names) – Andrew B Jun 21 '17 at 19:15
0

Saying that there is a 50% chance that the visitors who get to query the offline name server will never reach your site is not accurate. The manual of the Linux resolver (`man resolv.conf`), under the section that describes the `nameserver` option, says:

If there are multiple servers, the resolver library queries them in the order listed. If no nameserver entries are present, the default is to use the name server on the local machine. The algorithm used is to try a name server, and if the query times out, try the next, until out of name servers, then repeat trying all the name servers until a maximum number of retries are made.

So they will be tried in the order specified in the config file. That said, it does not necessarily mean that all resolvers behave the same way.
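Sketched as code (the `query_fn` callback is a hypothetical placeholder, and the retry constant mirrors the man page's "maximum number of retries" only in spirit, not glibc's actual implementation):

```python
def stub_resolve(qname, nameservers, query_fn, max_retries=2):
    """Sketch of the resolv.conf behavior quoted above: try the name
    servers strictly in the order listed, and after a full pass of
    timeouts start over, up to a maximum number of retries."""
    for _ in range(max_retries):
        for ns in nameservers:          # in the listed order, not random
            try:
                return query_fn(ns, qname)
            except TimeoutError:
                continue                # this server timed out: try the next
    raise OSError("no name server answered")
```

Note the contrast with the recursive-to-authoritative case: here the order is deterministic and comes straight from the config file.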

Khaled
  • I also would expect that. But it's described differently in many sources. For example: "regarding the NS records returned: it is perfectly allowed to randomise the order in which those records are returned, so the order may differ each time you request it" - https://serverfault.com/a/355418/293454 – Bob Ortiz Jun 21 '17 at 13:34
  • The context of the question is interactions between recursive servers and authoritative servers. Documentation regarding the behavior of the Linux stub resolver strikes me as being wholly unrelated. – Andrew B Jun 21 '17 at 18:02
  • @AndrewB is correct. This answer is not addressing the issue that the question is about. – Barmar Jun 21 '17 at 22:10