
It seems like you're always dependent on some hosting provider being available. Even if your servers are geo-redundant across data centers, you still have a DNS record that points to some IP address, and it will be resolved by some DNS server that can disappear at any second. Is there a solution for this? I've seen people suggest DNS load balancing with some mechanism for detecting downtime and doing failover. Which DNS providers offer this? And does it still rely on one of their data centers not being down?

Assuming everything behind our first line of contact (the load-balancer proxy) is already geo-redundant, is there really a feasible way to take care of that last step?

Assaf Lavie

2 Answers


Actually, there can be several DNS servers serving a given domain. Take a look at stackoverflow.com:

$ nslookup -type=ns stackoverflow.com
Server:     192.168.0.1
Address:    192.168.0.1#53

Non-authoritative answer:
stackoverflow.com   nameserver = ns3.serverfault.com.
stackoverflow.com   nameserver = ns1.serverfault.com.
stackoverflow.com   nameserver = ns2.serverfault.com.

Authoritative answers can be found from:

$

The domain names under stackoverflow.com can be resolved by any of three name servers, so even if one or two of them go down, the names can still be resolved.
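
To see that redundancy from the client side, here is a minimal sketch (assuming the third-party dnspython 2.x library, installed via pip install dnspython) that looks up a domain's NS records and then queries each name server directly; as long as any one of them answers, the name still resolves:

    # Query each of a domain's name servers directly; resolution
    # succeeds as long as at least one of them is reachable.
    # Assumes the third-party dnspython 2.x package.
    import dns.exception
    import dns.resolver

    domain = "stackoverflow.com"

    # Find the authoritative name servers for the domain.
    for ns in dns.resolver.resolve(domain, "NS"):
        ns_name = str(ns.target)
        try:
            # Resolve the name server's own address, then ask it directly.
            ns_ip = dns.resolver.resolve(ns_name, "A")[0].address
            r = dns.resolver.Resolver(configure=False)
            r.nameservers = [ns_ip]
            r.timeout = r.lifetime = 2  # give each server two seconds
            answer = r.resolve(domain, "A")
            print(f"{ns_name} ({ns_ip}) -> {answer[0].address}")
        except dns.exception.DNSException as exc:
            # A dead name server is simply skipped; the others still work.
            print(f"{ns_name}: no answer ({exc})")

Ordinary stub resolvers and ISP caches do effectively the same thing: when one listed name server fails to respond, they retry the next one.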

Raymond Tau
  • doesn't this rely on the browser to "do the right thing" and correctly try alternative DNS servers in case the IP it has already resolved is suddenly unavailable? – Assaf Lavie May 22 '11 at 05:54
  • @Assaf Yes, this relies on OS, or the ISP's DNS servers of users to "do the right thing". – Raymond Tau May 22 '11 at 06:11
  • But what happens if the DNS gave the browser the IP address of a machine in a data center that goes down a moment later? Why would the browser retry different DNS servers? Won't it just assume the host is unreachable? – Assaf Lavie May 22 '11 at 16:38
  • I think it would see the host as unreachable in this case, until the cached IP address from the DNS lookup times out, which is determined by the TTL value in the response. You can see the TTL in dig output: ;; ANSWER SECTION: stackoverflow.com. 1115 IN A 64.34.119.12 (here the TTL is 1115 seconds; see the TTL sketch after these comments). – Raymond Tau May 22 '11 at 16:59
  • So unless I have a really tiny TTL, this means the user experiences downtime until it expires. Seems like a sub-optimal solution... it basically eliminates the DNS cache altogether. – Assaf Lavie May 22 '11 at 17:11
  • You are correct that using DNS for load balancing (via round-robin) or for failover is not an optimal solution. It is a cheap solution, however, if you can't afford to do it properly. Even if you keep the TTL low enough to make manual (or automatic) changes to point to a different IP/provider, some DNS resolvers will just ignore it and keep the information cached longer than you specify anyway. The bottom line is that if you're experiencing downtime often enough to need to jump through these hoops, you're likely better off paying a little more for a better hosting provider with better uptime. – Justin Scott May 23 '11 at 22:16
  • Not experiencing downtime at all. Just need to prepare for all sorts of scenarios (like what happened at AWS a few weeks ago). Thanks. – Assaf Lavie May 24 '11 at 04:26
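
Following up on the TTL discussion above, here is a short sketch (again assuming the third-party dnspython 2.x library) that reads the TTL a resolver hands back; this is how long a client may keep using a cached, possibly dead, address:

    # Inspect the TTL on an A record: until this many seconds pass,
    # a stub resolver or browser may keep using the cached address
    # even if the host behind it has gone down.
    # Assumes the third-party dnspython 2.x package.
    import dns.resolver

    answer = dns.resolver.resolve("stackoverflow.com", "A")
    print("addresses:", [rr.address for rr in answer])
    print("ttl (seconds):", answer.rrset.ttl)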

The RFCs that make recommendations for DNS servers (e.g., RFC 2182) suggest using at least three name servers placed in logically and geographically diverse locations to avoid exactly that problem. The IP addresses published for those servers can also be set up with IP anycast, so servers at a variety of locations can share the same IP address. Routing around failures is then pretty much automatic: if one location tied to that IP goes down, traffic is simply directed to another. The root DNS servers and many of the major TLDs are set up this way to resist failure and be resilient against DDoS attacks. It is how services such as OpenDNS achieve close to 100% uptime even while serving billions of queries.

Companies have spent millions of dollars on redundant infrastructure to reduce downtime, but failures can still happen, often in unexpected ways related to the human factors involved rather than the technological factors.
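
To make the failover pattern from the comments concrete, here is a rough sketch of "detect downtime, repoint the record". Everything provider-specific is a placeholder: the API endpoint, token, and record URL are hypothetical, since each DNS provider exposes its own update API, and the IPs come from the documentation range:

    # Sketch of DNS-based failover: health-check the primary address and,
    # if it stops answering, repoint the A record at the standby.
    # The provider API below is hypothetical; substitute your provider's
    # real update endpoint and credentials.
    import json
    import socket
    import time
    import urllib.request

    PRIMARY = "192.0.2.10"   # placeholder IPs (documentation range)
    STANDBY = "192.0.2.20"
    API_URL = "https://dns.example/api/records/www"  # hypothetical endpoint
    API_TOKEN = "..."        # hypothetical credential

    def is_up(ip, port=80, timeout=3):
        # TCP health check: can we open a connection to the host?
        try:
            with socket.create_connection((ip, port), timeout=timeout):
                return True
        except OSError:
            return False

    def point_record_at(ip):
        # Repoint the A record via the (hypothetical) provider API.
        # A low TTL on the record is what makes the switch take effect quickly.
        body = json.dumps({"type": "A", "content": ip, "ttl": 60}).encode()
        req = urllib.request.Request(API_URL, data=body, method="PUT")
        req.add_header("Authorization", f"Bearer {API_TOKEN}")
        req.add_header("Content-Type", "application/json")
        urllib.request.urlopen(req)

    while True:
        if not is_up(PRIMARY):
            point_record_at(STANDBY)
        time.sleep(30)  # poll interval

As noted in the comments above, even a 60-second TTL is not a guarantee: some resolvers cache longer than you ask them to.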

Justin Scott
  • What happens if after the DNS query yields an IP address of a proxy server this server goes away? Will a browser know to perform a DNS query again? Or will it just see an unreachable host? – Assaf Lavie May 22 '11 at 16:37
  • That depends on how that IP address has been deployed. If it's a single IP on a single box and that box fails, then it would be unreachable. If it's shared among multiple boxes and one of them fails, then it would still be reachable. (For example, I have a pair of Linux-based firewalls that share an IP; if one fails, the other will see that and take over automatically. This all depends on the equipment you have deployed.) If it's deployed to multiple locations using IP anycast, then the failure of any one location can still be survived. A client-side fallback sketch follows below. – Justin Scott May 23 '11 at 22:10
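
On the "will the browser retry" question above: a client can iterate over every address a name resolves to and fall back to the next one when a connection fails, which is roughly what browsers do when a domain publishes several A records. A minimal sketch using only the Python standard library:

    # Try every address a hostname resolves to until one accepts a TCP
    # connection; roughly what browsers do when a domain publishes
    # several A records and one of the hosts is down.
    import socket

    def connect_with_fallback(host, port, timeout=3):
        last_err = None
        # getaddrinfo returns every address the resolver knows for the name.
        for family, socktype, proto, _, sockaddr in socket.getaddrinfo(
                host, port, type=socket.SOCK_STREAM):
            try:
                sock = socket.socket(family, socktype, proto)
                sock.settimeout(timeout)
                sock.connect(sockaddr)
                return sock  # first reachable address wins
            except OSError as exc:
                last_err = exc  # dead address: move on to the next one
        raise last_err or OSError(f"no addresses for {host}")

    sock = connect_with_fallback("stackoverflow.com", 80)
    print("connected to", sock.getpeername())
    sock.close()

If the domain publishes only a single A record and that box dies, this fallback has nothing else to try, which is exactly the single-IP case described in the comment above.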