Linux keeps retrying failed DNS server

Question

Whenever one of the servers in /etc/resolv.conf is unreachable, Linux/glibc/whatever isn't smart enough not to retry it for a while. This results in a lot of services becoming unavailable, because a lot of them do reverse lookups on all incoming connections (like SSH), which will hang for the time-out of the first DNS server query.

How can I make my Ubuntu boxes smart about the DNS servers they use? I could hack together a bash script that runs every minute and inserts a REJECT rule into iptables for the servers that don't respond to dig queries, but I'd rather not do it that way...
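
For anyone who does want the probing half of that idea, here is a rough Python sketch rather than bash. It only reports which nameservers answer a plain UDP query; it does not touch iptables. The function names, the test name example.com, and the 1-second probe timeout are my own assumptions, not anything from the resolver library.

```python
import socket
import struct

QUERY_ID = 0x1234  # arbitrary ID, used only to match the reply to our query


def parse_nameservers(resolv_conf_text):
    """Return the nameserver addresses listed in resolv.conf-style text."""
    servers = []
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "nameserver":
            servers.append(fields[1])
    return servers


def build_query(name, query_id=QUERY_ID):
    """Build a minimal DNS query packet for an A record (RFC 1035 layout)."""
    # Header: ID, flags (recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # QNAME: each label prefixed by its length, terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def responds(server, name="example.com", timeout=1.0):
    """Return True if the server sends back a reply with our ID in time."""
    family = socket.AF_INET6 if ":" in server else socket.AF_INET
    with socket.socket(family, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(build_query(name), (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable-host errors
            return False
    return reply[:2] == struct.pack("!H", QUERY_ID)


def report(path="/etc/resolv.conf"):
    """Print one line per configured nameserver: up or DOWN."""
    with open(path) as f:
        for server in parse_nameservers(f.read()):
            print(server, "up" if responds(server) else "DOWN")
```

Calling report() prints the status of each configured server. Any reply with the matching ID counts as "up" here, including error responses, since the point is only to spot servers that are not answering at all.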

I'm told that Windows does this properly, BTW.

Edit: I worked around it a little bit by putting this in /etc/resolv.conf (or /etc/resolvconf/resolv.conf.d/base):

options timeout:2 rotate

Still not perfect, but more workable.
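
To double-check what an options line like that actually sets, a small Python sketch along these lines can help. It is not how glibc parses the file, just an approximation under the defaults and caps documented in resolv.conf(5) (timeout 5, capped at 30; attempts 2, capped at 5). It only understands timeout, attempts and rotate, so other valid options such as ndots will show up in the "ignored" list as well. The function name is made up for illustration.

```python
def parse_resolver_options(resolv_conf_text):
    """Return (settings, ignored) from the 'options' lines of resolv.conf text."""
    # Defaults and caps as documented in resolv.conf(5).
    settings = {"timeout": 5, "attempts": 2, "rotate": False}
    caps = {"timeout": 30, "attempts": 5}
    ignored = []  # words this sketch does not recognise
    for line in resolv_conf_text.splitlines():
        fields = line.split()
        if not fields or fields[0] != "options":
            continue
        for word in fields[1:]:
            key, sep, value = word.partition(":")
            if word == "rotate":
                settings["rotate"] = True
            elif key in caps and sep and value.isdigit():
                # The value must be attached with a colon, e.g. timeout:2.
                settings[key] = min(int(value), caps[key])
            else:
                ignored.append(word)
    return settings, ignored
```

For example, parse_resolver_options("options timeout:2 rotate") gives a timeout of 2 with rotate enabled, while a line written as "options timeout 2" (space instead of colon) leaves the timeout at the default of 5 and lists both words as ignored. As far as I know the real resolver also skips words it does not recognise without complaining, so that kind of typo is easy to miss.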

I predict that you would get no benefit from REJECTing traffic to dead servers, as many applications wouldn't check the cause of getting no answer and would thus continue with stupid behaviour. – Karma Fusebox, Dec 28 '12 at 15:59
Wouldn't the next server be tried when a reject on the first occurred? – Halfgaar, Dec 28 '12 at 16:16
My comment is just a prediction, as I don't know for sure. The resolv.conf is just a file with addresses, and it is up to the libraries used by the application's code how the actual lookup is handled. Let's say you run a script written in FOOLANG; then the FOOLANG-Core will have to be smart enough to read more than one resolv.conf entry and decide "hey, REJECT, I'll try the next one". If you run applications with not-so-smart libs, you're out of luck. – Karma Fusebox, Dec 28 '12 at 16:27

score 5 · Accepted Answer · answered Dec 28 '12 at 16:21

Why are the DNS servers becoming unavailable? That's the issue we should focus on fixing...

You should omit the rotate directive if you want to have a deterministic retry order. rotate basically gives you round-robin lookups, which can have undesirable results in your situation.

My DNS /etc/resolv.conf tends to look like:

search blah.net client.blah.net
options timeout:1
nameserver 172.16.2.14
nameserver 172.16.2.18

Short of that, you do have the option of using a caching DNS service on your local machine, or even enabling the Name Service Cache Daemon (nscd). That will help buffer the delays that come with unreliable DNS resolvers.

answered Dec 28 '12 at 16:21

ewwhite

All our servers use our upstream provider's DNS servers. They can go down; nothing I can do about that. And without the rotate, all lookups are delayed by 1 second if the primary fails. When rotating, half of them are... – Halfgaar Dec 28 '12 at 16:25
Change the lookup order or enable `nscd` caching. – ewwhite Dec 28 '12 at 16:29

score 3 · Answer 2 · edited Apr 13 '17 at 12:14

Ugh. I've come across this same problem in my systems. When the primary DNS server goes offline, the entire system becomes incredibly slow at best.

In fact, I asked a similar question some time ago: "DNS/resolv.conf settings for a Primary DNS Server failure?" There were some really good answers there that you might find useful.

I wound up just editing /etc/resolv.conf with lower timeout values (options timeout:1), largely because it was the easiest workaround rather than the most effective one. This change means the servers spend less time waiting for dead resolvers. Lookups take 2 seconds rather than 10. This is still terrible if you're trying to do anything that isn't a batch job, but at least it resulted in very few service failures.
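
If you want to see what a given timeout setting does on your own box, a quick way is to time a lookup with the primary up and again with it down. A minimal Python sketch follows; the function name is made up, and the resolver argument exists only so the timing logic can be exercised without a network.

```python
import socket
import time


def time_lookup(host, resolver=socket.getaddrinfo):
    """Return (seconds_elapsed, succeeded) for one name lookup."""
    start = time.monotonic()
    try:
        resolver(host, None)  # goes through the system resolver by default
        ok = True
    except OSError:  # socket.gaierror is a subclass of OSError
        ok = False
    return time.monotonic() - start, ok
```

Something like time_lookup("www.example.com") returns the elapsed seconds and whether the lookup succeeded. Bear in mind that a local cache (nscd, dnsmasq and so on) may answer repeat lookups immediately, so use a name that has not been looked up recently when comparing settings.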

I actually just found a similar solution, but I added this: options timeout:2 rotate. The rotate also helps, but it's still not perfect. – Halfgaar, Dec 28 '12 at 16:13
