5

Whenever one of the servers in /etc/resolv.conf is unreachable, Linux/glibc/whatever isn't smart enough not to retry it for a while. This results in a lot of services becoming unavailable, because a lot of them do reverse lookups on all incoming connections (like SSH), which will hang for the time-out of the first DNS server query.

How can I make my Ubuntu boxes be smart about the DNS servers they use? I could hack together a bash script that runs every minute and inserts a REJECT rule into iptables for the servers that don't respond to dig queries, but I'd rather not do it that way...
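For illustration, the "does this server still respond to dig queries" check could be sketched in Python without shelling out to dig. This is only a minimal sketch under stated assumptions: the function names, the example.com probe name and the 1-second timeout are made up for the example, it speaks IPv4 over UDP only, and it merely reports whether a server answers (it does not touch iptables).

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a minimal DNS query packet asking for the A record of `name`."""
    # Header: ID, flags (recursion desired), 1 question, 0 other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    # Question name: length-prefixed labels terminated by a zero byte.
    qname = b"".join(
        bytes([len(label)]) + label.encode("ascii")
        for label in name.rstrip(".").split(".")
    ) + b"\x00"
    # QTYPE=1 (A), QCLASS=1 (IN).
    return header + qname + struct.pack("!HH", 1, 1)


def is_reply_for(query: bytes, reply: bytes) -> bool:
    """True if `reply` carries the query's ID and has the response bit set."""
    return len(reply) >= 12 and reply[:2] == query[:2] and bool(reply[2] & 0x80)


def server_responds(server: str, name: str = "example.com",
                    port: int = 53, timeout: float = 1.0) -> bool:
    """Send one UDP query to `server` and wait up to `timeout` seconds."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, port))
            reply, _ = sock.recvfrom(512)
        except OSError:  # covers timeouts and unreachable-network errors
            return False
    return is_reply_for(query, reply)
```

Calling `server_responds("172.16.2.14")` from cron would give a yes/no answer per server; what to do with that answer is a separate decision.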

I'm told that Windows does this properly, BTW.

Edit: I worked around it a little bit by putting this in /etc/resolv.conf (or /etc/resolvconf/resolv.conf.d/base):

options timeout:2 rotate

Still not perfect, but more workable.
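To see why this helps but isn't perfect, here is a rough back-of-the-envelope model in Python. It is a simplification, not glibc's actual algorithm: it assumes every dead server that gets tried costs one full timeout, that the first live server answers instantly, and that rotate spreads the starting server evenly over the list. The function names are invented for the sketch.

```python
def added_delay(alive, timeout, start=0):
    """Seconds spent waiting on dead servers before a live one is reached.

    `alive` is a list of booleans in resolv.conf order; `start` is the
    index of the server that is tried first.
    """
    n = len(alive)
    waited = 0.0
    for i in range(n):
        if alive[(start + i) % n]:
            return waited
        waited += timeout  # a dead server costs one full timeout
    return waited  # no server answered during this single pass


def average_delay_with_rotate(alive, timeout):
    """Average added delay when each server is equally likely to go first."""
    n = len(alive)
    return sum(added_delay(alive, timeout, s) for s in range(n)) / n


# Two servers, the first one dead, timeout:2
print(added_delay([False, True], 2))                # 2.0 on every lookup
print(average_delay_with_rotate([False, True], 2))  # 1.0 on average
```

Under these assumptions rotate halves the average penalty with two servers, but half of the lookups still eat the full timeout.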

Halfgaar
  • I predict that you would get no benefit from REJECTing traffic to dead servers, as many applications wouldn't check why they got no answer and would thus continue with stupid behaviour. – Karma Fusebox Dec 28 '12 at 15:59
  • Wouldn't the next server be tried when a reject on the first occurred? – Halfgaar Dec 28 '12 at 16:16
  • My comment is just a prediction, as I don't know for sure. The resolv.conf is just a file with addresses, and it is up to the libraries the application uses how the actual lookup is handled. Let's say you run a script written in FOOLANG; then the FOOLANG core will have to be smart enough to read more than one resolv.conf entry and decide "hey, REJECT, I'll try the next one". If you run applications with not-so-smart libs, you're out of luck. – Karma Fusebox Dec 28 '12 at 16:27

2 Answers

5

Why are the DNS servers becoming unavailable? That's the issue we should focus on fixing...

You should omit the rotate directive if you want to have a deterministic retry order. rotate basically gives you round-robin lookups, which can have undesirable results in your situation.

My /etc/resolv.conf tends to look like:

search blah.net client.blah.net
options timeout:1
nameserver 172.16.2.14
nameserver 172.16.2.18

Short of that, you do have the option of running a caching DNS service on your local machine, or even enabling the Name Service Cache Daemon (nscd). That will help buffer the delays that come with unreliable DNS resolvers.
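As a rough illustration of why a cache buffers those delays, here is a tiny in-process sketch in Python. It is not how nscd or a local caching resolver is implemented; the class name, the fixed 300-second TTL and the choice to cache failed lookups are all assumptions made for the example (real caches honor the TTLs in the DNS answers).

```python
import socket
import time


class CachingReverseResolver:
    """Remember reverse-lookup results so repeat lookups skip the resolver."""

    def __init__(self, lookup=socket.gethostbyaddr, ttl=300.0,
                 clock=time.monotonic):
        self._lookup = lookup    # function doing the real (possibly slow) lookup
        self._ttl = ttl          # seconds an entry stays valid (assumed fixed)
        self._clock = clock
        self._cache = {}         # ip -> (expires_at, hostname or None)

    def resolve(self, ip):
        hit = self._cache.get(ip)
        if hit is not None and hit[0] > self._clock():
            return hit[1]        # served from cache, no DNS traffic
        try:
            hostname = self._lookup(ip)[0]
        except OSError:          # socket.herror / gaierror are OSError subclasses
            hostname = None      # remember failures too, so they are not retried
        self._cache[ip] = (self._clock() + self._ttl, hostname)
        return hostname
```

Only the first lookup for an address pays the resolver timeout; later ones within the TTL return immediately, which is the effect a system-wide cache gives every process.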

ewwhite
  • All our servers use our upstream provider's DNS servers. They can go down; nothing I can do about that. And without the rotate, all lookups are delayed by 1 second if the primary fails. When rotating, half of them are... – Halfgaar Dec 28 '12 at 16:25
  • Change the lookup order or enable `nscd` caching. – ewwhite Dec 28 '12 at 16:29
3

Ugh. I've come across this same problem in my systems. When the primary DNS server goes offline, the entire system becomes incredibly slow at best.

In fact, I asked a similar question quite some time ago: "DNS/resolv.conf settings for a Primary DNS Server failure?" There were some really good answers there that you might find useful.

I wound up just editing /etc/resolv.conf with lower timeout values (options timeout:1), largely because it was the easiest workaround rather than the most effective one. This change means the servers spend less time waiting for dead resolvers: lookups take 2 seconds rather than 10. That is still terrible if you're trying to do anything that isn't a batch job, but at least it resulted in very few service failures.
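Since resolv.conf(5) writes valued options as name:value, it can be worth checking that an edit like this actually sets the timeout. Below is a small Python sketch for that. It is a simplified reader, not glibc's parser: the function names are invented, only the nameserver and options keywords are handled, and the fallback of 5 seconds is the default documented in the man page.

```python
def parse_resolv_conf(text):
    """Return (nameservers, options) from resolv.conf-style text."""
    nameservers = []
    options = {}
    for raw in text.splitlines():
        fields = raw.split()
        if not fields or fields[0][0] in "#;":
            continue  # blank line or comment
        keyword, args = fields[0], fields[1:]
        if keyword == "nameserver" and args:
            nameservers.append(args[0])
        elif keyword == "options":
            for opt in args:
                name, sep, value = opt.partition(":")
                # "timeout:1" -> {"timeout": 1}; bare "rotate" -> {"rotate": True}
                options[name] = int(value) if sep and value.isdigit() else True
    return nameservers, options


def effective_timeout(options, default=5):
    """Timeout in seconds, or `default` if no valid timeout:N was given."""
    value = options.get("timeout")
    if isinstance(value, int) and not isinstance(value, bool):
        return value
    return default


with_colon = parse_resolv_conf("options timeout:1 rotate\n")[1]
print(effective_timeout(with_colon))  # 1
```

In this sketch an option written without the colon (for example "timeout 1") carries no value, so `effective_timeout` falls back to the default.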

Christopher Karel
  • I actually just found a similar solution, but I added this: "options timeout:2 rotate". The rotate also helps, but it's still not perfect. – Halfgaar Dec 28 '12 at 16:13