I'm trying to solve a DNS problem I've been having with my web app.

It makes multiple requests to a fixed set of external domains. I can't put the domains in a hosts file for obvious reasons: CloudFront, load balancing and other IP changes.

Despite setting timeouts and handling stale outbound connections, I've found that simulating DNS failures reproduces the failures I'm seeing in my web app.

Therefore I think I should implement a local DNS cache. I've chosen PowerDNS Recursor to handle my outbound requests. It will deal with 500-1000 requests per second, all to the same 8 or so domains.

What I'm hoping to achieve is fewer DNS failures: communication errors, slow DNS responses or failed DNS responses. Believe it or not, we were using Google's DNS before, and occasionally it would fail to respond, which would crash our app and, at peak times, leave threads hanging and consuming resources.

So have I got the right idea in running a local recursive DNS server?

I'm thinking of running the local resolver alongside Google's in my resolv.conf, with rotate turned on along with other configuration.

What I'm not sure about is how PowerDNS actually resolves queries. I've set no forward zones, but it still answers a dig within 30 ms and serves all subsequent results from cache.

Can you pick holes in my logic, and is this a good solution to my DNS reliability problem?

Thanks

B p
  • Why powerDNS rather than DNSMasq or Bind? It seems strange that you've picked a product then are asking if the strategy makes sense (yes a local nameserver is probably a good idea). – symcbean Jun 13 '13 at 09:30

2 Answers

Your plan sounds fine, but check the TTLs on the RRs for the small collection of domains you need to resolve. If they have really short TTLs (say, less than about 20 seconds), your local cache won't help that much because it will still have to query upstream once per TTL interval. Very short TTLs are sometimes seen with load balancers and such. The longer the TTLs, the more effective your cache will be.
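To put rough numbers on the TTL effect, here is a small back-of-the-envelope sketch in Python. It assumes a simplified model in which a warm cache re-fetches each record exactly once per TTL interval; real resolvers also chase NS records, CNAME chains and so on, so treat the output as an estimate only. The function names and the lookup rate of 100 per second per name are made up for illustration.

```python
def refreshes_per_hour(ttl_seconds):
    """Upstream re-queries per hour for one cached record.

    Simplified model: the record is fetched again once every TTL interval.
    Each refresh is one more chance to hit a slow or failed upstream answer.
    """
    return 3600.0 / ttl_seconds


def cache_hit_ratio(lookups_per_second, ttl_seconds):
    """Fraction of application lookups answered from cache for one name.

    Simplified model: one lookup per TTL interval goes upstream,
    every other lookup in that interval is a cache hit.
    """
    lookups_per_ttl = lookups_per_second * ttl_seconds
    if lookups_per_ttl <= 1:
        return 0.0
    return 1.0 - 1.0 / lookups_per_ttl


if __name__ == "__main__":
    # Hypothetical per-name lookup rate; adjust to your own traffic.
    rate = 100
    for ttl in (20, 120, 300, 3600):
        print(ttl, refreshes_per_hour(ttl), cache_hit_ratio(rate, ttl))
```

Under this model the hit ratio is high even for short TTLs, so the main cost of a short TTL is the number of times per hour the cache is exposed to an upstream problem.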

You could also consider software that's dedicated to being a recursive resolver, such as Unbound. Unbound has a nice feature whereby it can requery domains from its cache that are just about to expire, to prevent them from expiring. That can help resiliency a bit.

There's no good reason to list both your new local cache and an external server in /etc/resolv.conf. If you have nameserver 127.0.0.1 in that file, that's all you really need.
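If you want to verify this on a fleet of machines, a minimal Python sketch like the following could check that resolv.conf points only at the local cache. The helper names are hypothetical, and the parser only understands nameserver and options lines, which is enough for this check.

```python
def parse_resolv_conf(text):
    """Return (nameservers, options) from resolv.conf-style text.

    Only 'nameserver' and 'options' lines are interpreted; comment lines
    (starting with '#' or ';') and all other keywords are ignored.
    """
    nameservers = []
    options = []
    for line in text.splitlines():
        if not line.strip() or line[0] in "#;":
            continue
        fields = line.split()
        if fields[0] == "nameserver" and len(fields) >= 2:
            nameservers.append(fields[1])
        elif fields[0] == "options":
            options.extend(fields[1:])
    return nameservers, options


def uses_only_local_resolver(text):
    """True if at least one nameserver is listed and all are loopback."""
    nameservers, _ = parse_resolv_conf(text)
    return bool(nameservers) and all(
        ns in ("127.0.0.1", "::1") for ns in nameservers
    )
```

Calling uses_only_local_resolver(open("/etc/resolv.conf").read()) on a host would then return False if an external server such as 8.8.8.8 is still listed next to 127.0.0.1.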

Celada
  • Ah yes, TTL. You are right: I've just checked and the shortest is 2 mins, then 2 more are 5 mins and a couple are 1 hour. This would certainly affect resilience, I think. I will check out Unbound, thank you for the suggestion. Any ideas where PowerDNS is getting its results, as I've not configured any forward zones yet? – B p Jun 13 '13 at 00:55
  • I think that's fine. Even if your local caching server has to discard and requery those records every 2min/5min/1h, your local caching still has the potential to help a lot. – Celada Jun 13 '13 at 00:57
  • I don't think I understand your question about where PowerDNS is getting its results from. It's getting them (ultimately) from the authoritative nameservers for those domains... – Celada Jun 13 '13 at 00:59
  • Sorry, I misunderstood how PowerDNS Recursor operates. I thought I had to configure forward zones, which would be checked if no cache hit was found. So it just goes directly to whichever authoritative nameserver is hosting the domain, i.e. Route 53 etc. I understand. Thanks Celada. – B p Jun 13 '13 at 01:03
I just wanted to add that you may want to put

options single-request

into your resolv.conf file if you do not do a local DNS server.

On newer systems (RHEL 6), if you have IPv6 enabled, the resolver will try to look up both A and AAAA records in parallel. I've seen some DNS servers treat this as abuse (too many requests/second) and start dropping DNS queries.

You did not mention your web server, but some web servers (Nginx) have internal DNS caching. They've been known to cache bad results, so you may want to disable any internal caches during debugging.
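While debugging, it can also help to time lookups from the application host itself, bypassing any web server cache. Here is a minimal Python sketch, assuming the system resolver is reached through socket.getaddrinfo; the function name timed_lookup and the 2 second default are my own choices, and the resolver argument exists only so that a fake can be swapped in for testing.

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def timed_lookup(host, timeout=2.0, resolver=socket.getaddrinfo):
    """Resolve host in a worker thread and report (status, seconds).

    status is "ok", "timeout" (no answer within `timeout` seconds) or
    "error" (the resolver raised, e.g. socket.gaierror for NXDOMAIN).
    The caller is never blocked for much longer than `timeout`, although
    the worker thread itself keeps running until the resolver returns.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    start = time.monotonic()
    future = pool.submit(resolver, host, None)
    try:
        future.result(timeout=timeout)
        status = "ok"
    except FutureTimeout:
        status = "timeout"
    except OSError:
        # socket.gaierror is a subclass of OSError
        status = "error"
    finally:
        # Do not wait for a hung lookup before returning to the caller.
        pool.shutdown(wait=False)
    return status, time.monotonic() - start
```

Calling timed_lookup("example.com") in a loop and logging anything that is not "ok", or that takes more than a few hundred milliseconds, should show whether slow or failed resolution lines up with the hangs you see in the app.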

jeffatrackaid