1

We're operating multiple redundant servers across the world for latency reasons. Currently if one site goes down, our only way to let another site take over that region is through DNS.

We would like to automate this process, for example by replacing/modifying the zone files if a site is detected as having failed through a monitoring tool.

My Google skills only turned up companies offering this as a service, but we'd prefer our own solution. For monitoring we currently use Nagios, our nameserver is Bind.

Is there any tool/method out there to accomplish this?

Mantriur

2 Answers

5

Of course there is, that's what those services are doing as well. :-)

It depends a bit on how you're currently redirecting/distributing your users globally. Let's assume the net result is that some users are directed from www.example.com to www.eu.example.com, while others end up at www.oc.example.com or www.am.example.com.

You could use your monitoring solution so that when www.am.example.com becomes unresponsive, not only is a normal alert triggered, but also an update so that www.am.example.com points to www.eu.example.com instead.
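On the Nagios side that typically means attaching an event handler to the service check. A minimal sketch, where the command name, script path and the generic-service template are assumptions for illustration (the handler script itself is sketched further below):

define service {
    use                  generic-service        ; assumes a stock service template exists
    host_name            www.am.example.com
    service_description  HTTP
    check_command        check_http
    event_handler        failover-am-to-eu      ; fires on service state changes
}

define command {
    command_name  failover-am-to-eu
    command_line  /usr/local/nagios/libexec/eventhandlers/failover-am-to-eu.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$
}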

A clean way to do that is with Dynamic Update, which is a method for adding, replacing or deleting records on a master server by sending it a special form of DNS message. The format and meaning of those messages is specified in RFC 2136.

Dynamic update is enabled by including an allow-update or an update-policy clause in the zone statement. For more info check the Bind Administrator Reference Manual.

The cleanest approach is probably to use both IP-based access controls and a TSIG key (a combined example follows below the zone config).

Create the key-pair:

dnssec-keygen -a HMAC-MD5 -b 512 -n USER nagios.example.com.

This should result in two files, one with the private key, Knagios.example.com.NNNN.private, and a second with the key record, Knagios.example.com.NNNN.key (for HMAC keys both contain the same shared secret).
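The base64 secret you need for the secret statement below sits on the Key: line of the .private file; assuming the placeholder file name from above, something like this prints it:

# print just the base64 key material from the private key file
awk '/^Key:/ {print $2}' Knagios.example.com.NNNN.private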

Update your Bind config:

key nagios.example.com. {
    algorithm HMAC-MD5;
    secret "<base64 key material from Knagios.example.com.NNNN.key>";
};

zone "am.example.com"
{
    type master;
    file "/etc/bind/zone/am.example.com";
    allow-update { key nagios.example.com.; };
    ...
};
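If you also want the IP-based restriction mentioned earlier, the BIND ARM describes a nested address-match-list pattern that makes allow-update require both a permitted source address and the key. A sketch, with 192.0.2.10 standing in for your Nagios host (double-check the semantics against the ARM for your BIND version):

zone "am.example.com"
{
    type master;
    file "/etc/bind/zone/am.example.com";
    // reject anything not coming from 192.0.2.10, then additionally require the key
    allow-update { !{ !192.0.2.10; any; }; key nagios.example.com.; };
};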

Then, when an alert is raised, have a script run the following using the Bind nsupdate utility:

cat <<EOF | /usr/bin/nsupdate -k Knagios.example.com.NNNN.private -v
server ns1.example.com
zone am.example.com
update delete www.am.example.com. A
update add www.am.example.com. 60 A <ip-address-of-www.eu.example.com>
send
EOF
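To tie that into Nagios, the nsupdate call typically goes into a small event-handler script (matching the hypothetical failover-am-to-eu command above) that only acts once the service reaches a hard CRITICAL state; the paths, key file location and target IP are placeholders:

#!/bin/bash
# /usr/local/nagios/libexec/eventhandlers/failover-am-to-eu.sh
# Called by Nagios as: failover-am-to-eu.sh $SERVICESTATE$ $SERVICESTATETYPE$ $SERVICEATTEMPT$

STATE="$1"      # OK, WARNING, CRITICAL or UNKNOWN
STATETYPE="$2"  # SOFT or HARD

# Only act on a confirmed (hard) failure, not on every soft retry
if [ "$STATE" = "CRITICAL" ] && [ "$STATETYPE" = "HARD" ]; then
    /usr/bin/nsupdate -k /etc/nagios/Knagios.example.com.NNNN.private -v <<EOF
server ns1.example.com
zone am.example.com
update delete www.am.example.com. A
update add www.am.example.com. 60 A <ip-address-of-www.eu.example.com>
send
EOF
fi

exit 0

Afterwards a quick dig @ns1.example.com www.am.example.com A tells you whether the record was actually replaced.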

I'm not sure whether you're allowed to use dynamic updates for anything besides A records, though.

HBruijn
  • In the meantime I found out that it's easy to invoke scripts on failure with Nagios. Your answer takes the horror out of having to modify zone files in a bash script. ;-) – Mantriur Feb 15 '15 at 19:15
  • I don't want to imagine the horror stories that can turn into. But keep in mind TomTom's valid remark that updating even short-lived DNS records is not guaranteed to be a successful failover strategy. – HBruijn Feb 15 '15 at 20:17
  • @Mantriur I'd strongly recommend Anycast if it's a viable option. – Andrew B Feb 16 '15 at 15:50
1

None. Your approach is broken.

You seem to be under the delusion that you can change the DNS like that. It does not work that way. Even if you set the TTL low, some providers will ignore it and your old value will still be used. You effectively have no control over DNS expiration beyond "within a day or two".

Any high availability based on DNS changes is thus fundamentally flawed.

TomTom
  • I'm very well aware of this and of course there is on-site redundancy. But since we are colocating with different ISPs, also for redundancy reasons, this is our last line of defense when a disgruntled DC employee pees on the border routers. – Mantriur Feb 15 '15 at 17:58
  • Btw, TTL being ignored is really much less of a factor than people on the internet make you think. I've done that many times when moving webservers, preparing the move with a 5-minute TTL. After about an hour 95% of clients have usually caught up. Of course, there are always one or two (out of thousands) that don't get it even after a week. :) – Mantriur Feb 15 '15 at 18:20
  • This doesn't really answer the OP's question. Also, the low-TTL issue is more of a rumor than fact, as no concrete evidence of anyone actively ignoring low TTLs seems to exist – Jon Skarpeteig Apr 28 '16 at 09:20
  • @JonSkarpeteig: People have even asked here how to configure BIND to do exactly this. http://serverfault.com/questions/113954/how-can-i-override-ttl-of-an-internet-address. Also, AOL was positively known for doing it, although I hope they learned a bit. – Sven Apr 28 '16 at 09:36
  • I haven't had the caching headaches in this day and age that I had in the 90s. I'm sure at that time DNS queries were a bandwidth and processing concern at the ISP level and above, and it made sense to force a TTL of several hours or more for this mostly static data. Nowadays there's really no reason to take that decision away from the name's authority. – Mantriur May 05 '16 at 19:39
  • Unless you are connected by modem (many lower-tech countries have many people on that, for example large parts of the USA) or older mobile phone tech. Will you rule out those unfortunate enough to live in large parts of the USA? – TomTom May 06 '16 at 05:28
  • @TomTom The bandwidth of your connection rarely correlates with your ISP's DNS caching policy. The handful of low bandwidth customers left don't enjoy a custom tailored network environment. – Mantriur Jun 15 '16 at 22:55