Transparent geographical DR website failover

Question

We've already got web servers that are load-balanced. Even though outages shouldn't happen, they do, for a variety of reasons (central switch failure, misconfigured ISP routers, backbone failures, DoS attacks on shared infrastructure). I want to put a second set of servers in a completely different geographical location with entirely different connections. I can sync the SQL servers with a number of different techniques, so that's not a problem. What I don't know how to do is transparently redirect existing user web sessions to the backup servers when the primary goes down or becomes unreachable.

AFAIK, the three most common ways of dealing with this are:

DNS load balancing, which uses a very low TTL to intelligently resolve DNS requests to server IPs in the best environment.
Intelligent redirection, which uses a third site to authoritatively redirect users to well-known but secondary DNS names like na1.mysite.com and eu.mysite.com.
An intelligent, minimal proxy server, hosted in the cloud somewhere, which relays requests to the different sites.

But in the case of a site failure, the first would either leave users unable to reach the server until the TTL expires and clients requery DNS and resolve to the DR site, or (with a very low TTL) cause excessive extra DNS requests. The second method still leaves us with a potential single point of failure (although I could see multiple A records being used to duplicate the master "login" role between environments) and still doesn't redirect users when the site they're currently using goes down. And the third isn't redundant at all if the cloud goes down (as they all have from time to time).

From what I know about networking, isn't there a way that I can give 2 different servers in 2 geographically separated environments the same overlapping IP address and let IP packet routing take over and route traffic to the server accepting requests? Is this only feasible with IPv6? What is it called and why don't DR site failovers currently use such a technique? Update: This is called anycast. How do I make this happen? And is it worth the trouble?

To clarify: this question is specific to HTTP server traffic only with service interruption allowed for up to 60 seconds. Users should not need to close their browser, go back to the login page, or refresh anything. Mobile users cannot accept an extra DNS query for every page request.
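To put rough numbers on the DNS option against the 60-second budget, here is a minimal sketch. It assumes resolvers and browsers honor the TTL exactly and that the authoritative record is switched the moment the failure is detected; neither is guaranteed in practice. The function names and the 20-second detection delay are illustrative only.

```python
def worst_case_dns_outage(ttl_seconds, detection_seconds):
    """Upper bound on the outage seen by a client that cached the old
    record just before the DNS record was switched to the DR site."""
    return detection_seconds + ttl_seconds


def max_ttl_for_budget(budget_seconds, detection_seconds):
    """Largest TTL that still fits the allowed interruption."""
    return max(0, budget_seconds - detection_seconds)


def lookups_per_hour(ttl_seconds):
    """Re-resolutions per hour for one continuously active client."""
    return 3600 / ttl_seconds


# Hypothetical figures: 60 s allowed interruption, 20 s to detect the failure.
ttl = max_ttl_for_budget(60, 20)
print(ttl, worst_case_dns_outage(ttl, 20), lookups_per_hour(ttl))
```

Under these assumptions the TTL has to shrink as detection gets slower, and every reduction in TTL raises the number of lookups each client makes, which is the trade-off described above.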

"even though outages shouldn't happen, they do" Decent summary of the life of a sysadmin — Smudge, Mar 05 '13 at 19:41
@TomO'Connor, It's happened twice in as many years to affect the entire data center. — Eric Falsken, Mar 05 '13 at 19:43

score 2 · Answer 1 · edited Apr 13 '17 at 12:14


I've been here before.

A few times.

Here's some of my past questions.

The general TL;DR is that DNS isn't a solution, for many reasons, some of which you've identified and others of which are covered in the answers to the questions linked above.

The only real way to do geographic resilience is with BGP: subdivide a /23 into two /24s, have those advertised by your upstreams, and then handle DNS for each site individually from there.
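As a rough illustration of the split only (not of the BGP configuration itself), Python's standard `ipaddress` module can carve a /23 into its two /24s. The block used here is a placeholder picked for the example, not a real allocation, and `split_for_sites` is a hypothetical helper name.

```python
import ipaddress

# Placeholder block for illustration; substitute your own allocation.
block = ipaddress.ip_network("198.51.100.0/23")


def split_for_sites(network, new_prefix=24):
    """Return the subnets of `network` at the given prefix length."""
    return list(network.subnets(new_prefix=new_prefix))


# One /24 per site, each to be advertised by that site's upstreams.
primary, dr = split_for_sites(block)
print("primary site:", primary)
print("DR site:     ", dr)
```

Each site then has its own prefix to advertise, which is the starting point for the per-site DNS setup mentioned above.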

Then you get the irritating problem of synchronisation between them, but that's another story.

"I can sync the SQL servers with a number of different techniques, so that's not a problem."

Well, it's not a problem you've had yet.

If you use intelligent redirection, either by changing the hostname or by proxying the request, then you've got yet another problem: where do you put the proxy so that it's not a SPOF?

Otherwise, you'd have N geographically separate sites, but one single point of failure (the proxy/redirect engine).

I suppose, in theory you could use MPLS instead to make your locations appear to be on the same L2 network, although I'm uncertain how this would actually help improve resilience to failure.


answered Mar 05 '13 at 19:47

Tom O'Connor


Have you ever tried to anycast a single IP to multiple networks? Can it be used for DR redundancy or only for route optimization? – Eric Falsken Mar 05 '13 at 19:57

I haven't ever tried it. Sounds like the kind of thing that might rip apart the fabric of the internets. – Tom O'Connor Mar 05 '13 at 20:06
What about a CDN-like model? I found this SAAS product ([CloudLeverage](http://cloudleverage.com/global-load-balancing/load-balancing-101/)) that uses traffic-relay and IPv6 multicasting to send incoming requests to different datacenters. To better support our dynamic content application, it would add a little bit of latency, but supports direct-server-response without extra DNS requests. – Eric Falsken Mar 05 '13 at 20:11
Also, haven't tried it. At the time I last looked at this kind of solution, the SaaS model was in its pre-infancy, and CloudLeverage's type of product wasn't available. My advice: suck it and see. – Tom O'Connor Mar 06 '13 at 00:01
My fear with a solution like that, however is "what if CloudLeverage goes down / ceases to exist / relies on EC2 that fails"? Then the control is effectively out of your hands. – Tom O'Connor Mar 06 '13 at 00:04
I'm waiting for a call from their sales people. But that is exactly the kind of question I'll be asking them. (do they have a single cloud dependency) – Eric Falsken Mar 06 '13 at 19:58

score 0 · Answer 2 · edited Apr 13 '17 at 12:14

DNS by itself doesn't provide automatic failover capability. But combined with the browser's client-side retry, it does offer a free (in terms of network investment) and low-latency (~1s) solution. See the references below for more details.

http://blog.engelke.com/2011/06/07/web-resilience-with-round-robin-dns/
Multiple data centers and HTTP traffic: DNS Round Robin is the ONLY way to assure instant fail-over?
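The retry behavior can be approximated in a few lines. This is only a sketch of the idea (resolve every address for the name, then try each in turn with a short timeout), not what any particular browser actually does. The function names, the 1-second timeout, and the injectable `connect` parameter are assumptions made for the example.

```python
import socket


def resolve_all(hostname, port=80):
    """Return every distinct address the resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, port, type=socket.SOCK_STREAM)
    addresses = []
    for _family, _type, _proto, _canon, sockaddr in infos:
        if sockaddr[0] not in addresses:
            addresses.append(sockaddr[0])
    return addresses


def connect_with_retry(addresses, port=80, timeout=1.0,
                       connect=socket.create_connection):
    """Try each address in order and return (address, connection)
    for the first one that accepts; raise if none do."""
    last_error = None
    for addr in addresses:
        try:
            return addr, connect((addr, port), timeout)
        except OSError as exc:
            last_error = exc  # this site looks down, try the next record
    raise ConnectionError(f"no address reachable: {last_error}")
```

With one A record per site, a client behaving like this would spend roughly one timeout on the dead site before reaching the live one, e.g. `connect_with_retry(resolve_all("www.example.com"))`.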
