22

We have a small datacenter with about a hundred hosts pointing to three internal DNS servers (BIND 9). Our problem comes when one of the internal DNS servers becomes unavailable. At that point all the clients that point to that server start performing very slowly.

The problem seems to be that the stock Linux resolver doesn't really have the concept of "failing over" to a different DNS server. You can adjust the timeout and number of retries it uses (and set rotate so it works through the list), but no matter what settings we use, our services perform much more slowly if a primary DNS server becomes unavailable. At the moment this is one of the largest sources of service disruptions for us.

My ideal answer would be something like "RTFM: tweak /etc/resolv.conf like this...", but if that's an option I haven't seen it.

I was wondering how other folks handled this issue?

I can see 3 possible types of solutions:

  • Use linux-ha/Pacemaker and failover IPs (so the DNS IP VIPs are "always" available). Alas, we don't have a good fencing infrastructure, and without fencing pacemaker doesn't work very well (in my experience Pacemaker lowers availability without fencing).

  • Run a local DNS server on each node, and have resolv.conf point to localhost. This would work, but it would give us a lot more services to monitor and manage.

  • Run a local cache on each node. Folks seem to consider nscd "broken", but dnrd seems to have the right feature set: it marks DNS servers as up or down, and won't use 'down' DNS servers.

Anycast seems to work only at the IP routing level, and depends on route updates to handle server failure. Multicast seemed like it would be a perfect answer, but BIND does not support broadcast or multicast, and the docs I could find suggest that multicast DNS is aimed at service discovery and auto-configuration rather than regular DNS resolution.

Am I missing an obvious solution?

Neil Katin
    I suggest that in addition to finding the solution you're asking for (which I can't help you with) you should be working on the real root problem and fix the reliability issues with the DNS server. – John Gardeniers Jan 04 '11 at 21:41
  • The root problem is: why do these DNS servers go down so often to make you bother about this? Consider replicating your DNS with specialized services like [BuddyNS](http://www.buddyns.com). Your latency will dip dramatically and uptime won't make you bother about /etc/resolv.conf tweaks anymore. – michele Nov 12 '12 at 13:00

8 Answers

15

A couple of options. Both will distribute the DNS load across your DNS servers.

  • Try using options rotate in resolv.conf. This will minimize the impact of the primary server being down; lookups that happen to hit a down server will still be slow, but the load is spread.
  • Use a different nameserver order on different clients. This allows some clients to keep running normally if the primary DNS server is down, and spreads the impact of an out-of-service DNS server around.

These options can be combined with options timeout:1 attempts:5. Increase attempts if you decrease timeout, so that slow external servers can still be resolved.
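
Putting those together, a resolv.conf might look something like this (the server addresses are placeholders for your three internal servers):

    # /etc/resolv.conf
    nameserver 192.0.2.1
    nameserver 192.0.2.2
    nameserver 192.0.2.3
    options rotate timeout:1 attempts:5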

Depending on your router configuration you may be able to configure your DNS servers to take over the primary DNS server's IP address when it is down. This can be combined with the above techniques.
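
For example, if VRRP is an option in your environment, keepalived on the DNS servers themselves can do that takeover. This is only a sketch; the address, interface name, and health check are placeholders:

    # /etc/keepalived/keepalived.conf (sketch)
    vrrp_script chk_named {
        script "/usr/bin/dig +time=1 +tries=1 @127.0.0.1 localhost"
        interval 2
        fall 2
    }

    vrrp_instance DNS_VIP {
        state BACKUP
        interface eth0
        virtual_router_id 53
        priority 100              # give the normally-primary server the highest priority
        advert_int 1
        virtual_ipaddress {
            192.0.2.10/24         # the shared DNS service address clients point at
        }
        track_script {
            chk_named
        }
    }

Whichever server holds the highest priority and a passing health check owns the address; if named stops answering, the check fails and the address moves.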

NOTE: I have run for years without unscheduled DNS outages. As others have noted, I would work on solving the issues that cause the DNS servers to fail. The above steps also help with misconfigured setups that specify unreachable name servers.

BillThor
  • 27,354
  • 3
  • 35
  • 69
4

Check out "man resolv.conf". You can add a timeout option to the resolv.conf. The default is 5, but adding the following to resolv.conf should bring it down to 1 second:

options timeout:1

Niall Donegan
  • 3,859
  • 19
  • 17
  • After rereading your second paragraph, I've tried the above on a CentOS and a Debian VPS. After bringing down the primary DNS server, the resolver performed exactly as expected. Running a tcpdump, I could even see the resolver trying the first server, and then trying the next. What behaviour are you seeing? – Niall Donegan Jan 04 '11 at 21:15
  • 2
    There are two big use-cases for resolving: short-lived processes (like command line tools) and long-lived processes, and the same resolver configuration has to work for both. For short-lived (single lookup) processes, setting a short timeout will fail over quickly. But if you are looking up an external address that doesn't resolve in that time, you will get a "name not found", since the resolver will abandon that query if it doesn't come back in a second. (out of room; more in the next comment) – Neil Katin Jan 04 '11 at 22:12
  • Long-lived processes will retry each lookup, time out, and then move to the next server. But the resolver doesn't seem to cache the "deadness" of the server. – Neil Katin Jan 04 '11 at 22:16
3

Clustering software such as heartbeat or pacemaker/corosync is your friend here. As an example, we've set up pacemaker/corosync as follows (a sketch of the resource configuration follows the list):

  • Pair up every server with another one
  • Per pair, have two DNS VIPs, usually one on each server
  • Should either BIND or the server fail, the VIP moves to the other server within milliseconds
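
A minimal sketch of how such a pair could be expressed with the pcs tooling; node names and addresses below are placeholders:

    # one VIP per server in the pair
    pcs resource create dns-vip-a ocf:heartbeat:IPaddr2 ip=192.0.2.11 cidr_netmask=24 op monitor interval=10s
    pcs resource create dns-vip-b ocf:heartbeat:IPaddr2 ip=192.0.2.12 cidr_netmask=24 op monitor interval=10s
    # prefer one VIP on each node while both are healthy
    pcs constraint location dns-vip-a prefers node-a=100
    pcs constraint location dns-vip-b prefers node-b=100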

Production hours are 24x7, but we strongly believe that it should be possible for every server to fail without impacting customers. options rotate is merely a workaround; I wouldn't do that.

Dennis Kaarsemaker
  • 18,793
  • 2
  • 43
  • 69
3

Run a local dns server on each node, and have resolv.conf point to localhost. This would work, but it would give us a lot more services to monitor and manage.

FWIW, this is the only workable solution that I have found for this problem. You do need to restrict the server to only listen on localhost, but it has completely eliminated users noticing DNS outages in our environment.
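
A sketch of what the localhost-only restriction and forwarding could look like in named.conf, with placeholder addresses for the existing internal servers:

    // named.conf options for a localhost-only caching forwarder (sketch)
    options {
        listen-on { 127.0.0.1; };
        listen-on-v6 { ::1; };
        allow-query { localhost; };
        forwarders { 192.0.2.1; 192.0.2.2; 192.0.2.3; };
        forward first;    // fall back to full recursion if the forwarders are unreachable
    };

with resolv.conf on each node reduced to nameserver 127.0.0.1.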

One interesting side effect is that if the localhost server goes down for some reason, the standard resolver libraries seem to handle the failover to the next server much faster than in the standard case.

We have been doing this for about 3 years now and I've not seen a single issue that can be related to the failure/outage of a dns server running on localhost.

2

If a nameserver is going down for maintenance, it is normal procedure to reduce the timeouts in the SOA for that domain ahead of time, so that changes (like removing NS records before the maintenance and putting them back afterwards) propagate quickly. Note that this is a server-side approach; changing resolvers is a client-side approach, and unless you can talk to each and every one of your clients and get them to make that adjustment on their machines, it might not be the right one. Granted, you did say it's only a hundred clients, all in a data center using internal DNS servers, but do you really want to change the config on a hundred clients when you can just change the zone?

I'd tell you which values in the SOA to adjust, but I was surfing the web to find out that exact info when I ran across this question.

  • 3
    This answer pertains to authoritative DNS only. The question was concerning recursive DNS lookups made by client software. – Andrew B Jul 08 '14 at 20:44
1

Perhaps you can put your DNS servers behind a load balancer? Apparently LVS can balance UDP. Obviously make your LB highly available so it's not a single point of failure.
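
A rough sketch of the LVS side with ipvsadm, using placeholder addresses (a real setup would also need TCP port 53 and health checking, e.g. via ldirectord or keepalived):

    # virtual DNS service address, UDP, round-robin scheduling
    ipvsadm -A -u 192.0.2.53:53 -s rr
    # the real DNS servers behind it, NAT/masquerade forwarding
    ipvsadm -a -u 192.0.2.53:53 -r 192.0.2.1:53 -m
    ipvsadm -a -u 192.0.2.53:53 -r 192.0.2.2:53 -m
    ipvsadm -a -u 192.0.2.53:53 -r 192.0.2.3:53 -m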

rxvt
  • 21
  • 4
0

A more network-centric solution would be to use two DNS servers with the same (dedicated) IP and anycast routing. (I haven't noticed this answer in this thread so far, but that's what is used here.)

As long as both are up, the nearest server is used. If one goes down, traffic for that IP will be routed to the other node until it comes up again. This especially makes sense if you have two or more locations or data centers.
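
A rough sketch of the host side, assuming your routing daemon (bird, quagga, etc.) redistributes connected loopback addresses, and with the anycast address as a placeholder:

    # assign the shared anycast service address to the loopback
    ip addr add 192.0.2.53/32 dev lo

    # simple health check: withdraw the address (and thus the route) if named stops answering
    if ! dig +time=1 +tries=1 @192.0.2.53 localhost > /dev/null; then
        ip addr del 192.0.2.53/32 dev lo
    fi

Run the health check periodically (cron, a systemd timer, or your monitoring system) on each anycast node.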

Axel Beckert
  • 398
  • 2
  • 17
-2

I know this might sound trite, but how about building a more stable, resilient DNS infrastructure as a permanent solution to the problem?

joeqwerty
  • 108,377
  • 6
  • 80
  • 171
  • We have a fairly resilient DNS infrastructure. But 2 or 3 times a year we have an outage because a DNS server goes down (or is restarted, or has an OS upgrade, or whatever). – Neil Katin Jan 04 '11 at 22:17
  • 1
    Well... restarts and upgrades should be scheduled for non-production hours. As for the rest, it seems like you're making a pretty big deal out of something that happens a few times a year. Is the additional infrastructure, time, money, and management overhead worth it for a problem that occurs so seemingly infrequently? – joeqwerty Jan 04 '11 at 23:31
  • 9
    What happens when your production hours are 24x7? DNS should fail over to the second/third/x server AND cache the failure of the other server for a period. The default 5-second timeout is enough to bring services down, depending on the load. – Ryaner Apr 29 '11 at 20:30