I know how to scale my software, but how to prevent downtime because of network outages?

Question

We are running rather large LAMP sites which scale well software-wise. We use redundant load balancers in front of a bunch of webservers, which talk to MySQL via a proxy in a master-slave-slave-slave setup.

We are using a very large US provider. They are not very cheap but not the most expensive either.

Last week there was a very large DDoS on their network and our cluster was affected; we lost network connectivity for a bit, resulting in downtime.

What is the standard procedure to use 2 providers (for instance, one in the EU and one in the US)? I know how to handle the software side (replication and so on).

I'm wondering how traffic gets sent to the EU network when the US one is down; is DNS the only choice for that? And if yes, how do I set that up? Switching DNS when the server is down seems too slow except when TTL = 0, which means we would be using DNS as a failover system. I understand (from Server Fault, for instance) that this is not the preferred way of working.

So what is the preferred method of solving this with near 100% uptime (our cluster has that already, but the network doesn't)? Dropping something like 1000 requests would be fine, but more is bad and should never happen.

Doug Luxem · Accepted Answer · 2009-09-11T21:16:37.890

Assuming I understand your question correctly, you want your customers to fail over to a secondary data center if the primary is down for whatever reason. One product that can handle this is the BIG-IP Global Traffic Manager from F5 Networks. Essentially, it updates your DNS immediately when an outage is detected, so that clients start being redirected to the secondary network.
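To make the idea concrete, here is a minimal sketch of the decision such a device makes when it answers a DNS query. It is not how GTM is implemented internally; the `DataCenter` class, the `pick_answer` function, the 30-second TTL, and the documentation-range IP addresses are all assumptions made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class DataCenter:
    name: str
    address: str   # public IP handed out in DNS answers (hypothetical)
    healthy: bool  # result of the most recent health check


def pick_answer(primary, secondary, ttl_seconds=30):
    """Return (address, ttl) for the next DNS answer.

    Prefer the primary while it is healthy, otherwise fall back to the
    secondary. The TTL is kept low so resolvers re-ask soon after a switch.
    """
    if primary.healthy:
        return primary.address, ttl_seconds
    if secondary.healthy:
        return secondary.address, ttl_seconds
    # Both look down: keep answering with the primary rather than nothing.
    return primary.address, ttl_seconds


us = DataCenter("us", "192.0.2.10", healthy=True)
eu = DataCenter("eu", "198.51.100.10", healthy=True)
print(pick_answer(us, eu))   # US address while the US site is healthy
us.healthy = False
print(pick_answer(us, eu))   # EU address once the US health check fails
```

The switch only reaches a client once its cached record expires, which is why the TTL matters so much for this approach.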

Another option may be to use something like Anycast to broadcast the routes to your data centers.

To add on to this question: we do operate in multiple data centers, and in the end decided that the best route was for an engineer to manually move DNS pointers to the alternate colocation site, depending on the reason for the outage. The worst-case scenario is that we may be down for 1 hour if one data center is completely offline. However, that is weighed against the impact on the customer when we do have to switch data centers (recent activity will not be available in the alternate location).

One final option is to not rely on your data center provider to give you IP connectivity and bandwidth. Instead, talk to a global IP provider like Global Crossing or Level 3 and let them handle routing your inbound traffic to either data center. The risk is that you are working with a single provider, but the benefit is that they can be much more flexible in their routing options (you can use MPLS on their network for your back-end replication, and also use the same connection for public IP connectivity).

You understood correctly. Do you have any idea how that Traffic Manager works technically? Because it seems nice, but that's the sales pitch. How would it work technically? — CharlesS, Sep 11 '09 at 18:29
You can essentially look at GTM as an intelligent DNS server. You set rules on the device to determine when and how to switch DNS results between locations (or geographically load balance between locations). The hitch here is utilizing a low TTL on DNS. — Doug Luxem, Sep 11 '09 at 21:10
Akamai also have a DNS-based loadbalancing product confusingly called GTM. It works, but I haven't tried it in anger (i.e. between two or more datacentres). The nice thing with this product is all your loadbalancing logic is up on Akamai's "cloud", so you don't have to fork out for 4 F5 boxes. Of course, you'll have to put something in that can handle your load though, so you might end up spending a big chunk anyway! — , Feb 05 '10 at 14:47

score 2 · Answer 2 · 2009-09-11T19:49:05.490

Essentially, there are 2 technology choices available for this (that I'm aware of):

As OP pointed out, fail over to another DC by updating DNS so that records point to addresses in the operational datacenter.
IP Anycast, i.e. DNS publishes a single IP address that is announced from both datacenters, so that customers' routers choose the nearest datacenter. Note that if a datacenter fails, the customers 'nearest' to this DC will still have a short outage, until the BGP routes have readjusted.

Because switching DNS when the server is down seems too slow except when TTL = 0

You can set TTL to zero, but don't expect all networks to obey your setting. In practice, around 10 minutes is the lowest value for TTL. And of course this implies that DNS based fail-over will take between 0 and 10 minutes for each customer, depending on your TTL in their cache.
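A rough back-of-the-envelope calculation shows how this compares with the 1000-request budget from the question. The sketch below assumes that cached records expire uniformly over the TTL window, so an affected client waits TTL/2 on average; the request rate of 100 per second and the 5-second detection time are made-up numbers, not figures from the question.

```python
def expected_dropped_requests(rate_per_s, ttl_s, detection_s=0.0):
    """Estimate requests lost during a DNS-based fail-over.

    Assumes cached records expire uniformly over the TTL window, so the
    average client keeps the stale address for ttl_s / 2 after the switch.
    """
    return rate_per_s * (detection_s + ttl_s / 2.0)


def max_ttl_for_budget(budget, rate_per_s, detection_s=0.0):
    """Largest TTL (in seconds) that keeps the estimate within the budget."""
    return max(0.0, 2.0 * (budget / rate_per_s - detection_s))


# Hypothetical site serving 100 requests/s with a 10-minute TTL.
print(expected_dropped_requests(rate_per_s=100, ttl_s=600))

# TTL needed to stay under 1000 dropped requests with 5 s outage detection.
print(max_ttl_for_budget(budget=1000, rate_per_s=100, detection_s=5))
```

Under these assumptions a 10-minute TTL costs about 30000 requests, and staying under 1000 would need a TTL of roughly 10 seconds, which many resolvers will not honor.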

Dropping like 1000 requests would be fine, but more is bad and should never happen.

To the very best of my knowledge, that is simply outside of what is technically possible today. Even the very biggest sites use DNS or anycast based technology, and try hard to keep their datacenters near 100% uptime because there is no way to get instant fail-over at global Internet level.

Inside of a LAN you can use something like VMware vMotion to switch over really fast, but that's on your own, end-to-end controlled LAN.

My take is that global load balancing is impractical except for the very biggest sites with lots of technical expertise:

Many load balancing appliances have geo-distribution as a feature bulletpoint, but if the entire DC is down, so is your load balancer? (Edit: I just re-read DLux's answer, and I think I understand this now... You get two load balancers, put one in each DC. They set up a heartbeat between them. When the LB in the live DC notices that its colleague in the dead DC has fallen off the net, then the live LB updates DNS to facilitate fail-over.)
Using Anycast is something I would personally not attempt -- the technology exists and is operational, but what if there is a weird, rare routing problem? Troubleshooting network issues is hard enough as it is, "optimizing" on BGP should be left to true experts.
So what's left is DNS based failover, preferably using a globally replicated DNS provider that provides split horizon DNS as a service. That will work, and deployment is fairly straightforward. It does however not meet the OP's goal of near-instant failover.
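The heartbeat arrangement between the two load balancers mentioned above can be sketched roughly as follows. This is an illustration only, not how any particular appliance works; `PeerMonitor` and the threshold of 3 missed heartbeats are hypothetical choices.

```python
class PeerMonitor:
    """Tracks heartbeats from the load balancer in the other datacenter."""

    def __init__(self, max_missed=3):
        self.max_missed = max_missed  # missed intervals tolerated before fail-over
        self.missed = 0
        self.peer_alive = True

    def record(self, heartbeat_received):
        """Record one heartbeat interval.

        Returns True exactly once, at the moment the peer is declared dead
        and the DNS update should be triggered.
        """
        if heartbeat_received:
            self.missed = 0
            self.peer_alive = True
            return False
        self.missed += 1
        if self.peer_alive and self.missed >= self.max_missed:
            self.peer_alive = False
            return True
        return False


monitor = PeerMonitor(max_missed=3)
for beat in [True, False, False, False, False, True]:
    if monitor.record(beat):
        print("peer declared dead: update DNS to point at this datacenter")
```

Requiring several consecutive misses avoids flapping on a single lost packet, at the cost of a slower reaction. Note that a missed heartbeat cannot by itself distinguish a dead datacenter from a broken link between the two sites.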

Disclaimer, I would like input / corrections from a true expert who has fielded globally redundant systems many times before... :-)

Thank you for this great answer. I hoped there would be some tried-and-tested technology for this available. I'm hoping for the 'true expert with a lot of experience' answering here for a bit more. — CharlesS, Sep 11 '09 at 20:26

score 0 · Answer 3 · answered Sep 11 '09 at 19:09

You might also look into a content distribution network (e.g., Akamai). Offloading static content and caching dynamic content to the CDN can significantly reduce the load on your cluster.

Akamai in particular is really expensive, but there are other, cheaper alternatives.
