5

One of the major projects I have lined up for 2010 is attempting to mitigate some of the single points of failure (SPOFs) in a network I currently manage. We currently have a single datacentre rack containing a couple of dozen servers.
Inside the rack we're redundant and resilient: each server has 2 disks and can withstand one failing.
Our data storage servers have 3+ disks and can stand one failure. We're quick to repair/replace broken hardware, too. Each server has at least one replicated partner, and we can stand to lose 1 or 2 out of each cluster (i.e. web, database, storage).

The internet connectivity is provided by 2x 100Mbit Ethernet feeds to our main transit provider, connecting into a pair of Cisco ASA 5500 firewalls in a high-availability failover pair. This is not the problem.

As I see it, the two big SPOFs are as follows:

1) Our internet comes from a single transit provider. If their network goes down, we drop off the internet. As we're in a carrier neutral datacentre, it's fairly easy to get a second IP transit in.

2) If something happens to the power in our datacentre, then we're also gone.

Ideally, I'd like servers in 2 datacentres, each with diverse routes over multiple IP transit providers, announcing our address space via BGP.

In the second datacentre, I'd be speccing 2x Cisco 28xx-series routers, 2x ASA 5500 firewalls, a pair of 48-port Catalyst switches, and a dozen or so Dell servers, roughly matching the primary location.

Management claim that there's massive expense involved in this approach, and that the BGP route in particular is excessively expensive. While they seem happy to have a second location, BGP seems to be off the table.

The last quotation for multihoming ran close to £80k. (Perhaps they were asking for quotes for gold plated Ciscos!)

Instead, management feel this would be better tackled with a DNS-based solution, where our routing is controlled by an uptime monitoring service (like Pingdom) that changes our DNS records (with a 1s TTL) to point to the alternative location in the case of a server failure.
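
Just to be concrete about what's being proposed, the mechanics boil down to something like the sketch below: poll the primary site, and rewrite a low-TTL A record at the DNS provider when it stops answering. Everything in it is illustrative; the health-check URL, the provider API endpoint and the IP addresses are placeholders, not Pingdom's (or anyone else's) real interface.

```python
# Rough sketch of the DNS-failover idea under discussion.
# The URLs and addresses below are placeholders, not a real provider API.
import requests

PRIMARY_HEALTH_URL = "https://www.example.com/healthcheck"      # hypothetical
DNS_API_URL = "https://dns.example-provider.net/api/records"    # hypothetical
PRIMARY_IP = "192.0.2.10"        # documentation-range addresses, stand-ins only
SECONDARY_IP = "198.51.100.10"


def primary_is_up(timeout=5):
    """Return True if the primary site answers its health check."""
    try:
        return requests.get(PRIMARY_HEALTH_URL, timeout=timeout).status_code == 200
    except requests.RequestException:
        return False


def point_dns_at(ip):
    """Rewrite the www A record via the (hypothetical) provider API, 1s TTL."""
    requests.put(
        DNS_API_URL,
        json={"name": "www.example.com", "type": "A", "ttl": 1, "content": ip},
        timeout=10,
    )


if __name__ == "__main__":
    point_dns_at(PRIMARY_IP if primary_is_up() else SECONDARY_IP)
```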

Huge numbers of companies use BGP for a reason; this DNS solution just isn't going to cut it, especially given that so many ISPs disregard short TTLs and replace them with longer ones.
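
You can sanity-check the TTL behaviour yourself: publish the 1s record, then ask the caching resolvers your users actually sit behind what TTL they hand back. A value that stays high across repeated queries suggests the resolver is imposing its own floor. A rough check using dnspython; the record name and resolver address are placeholders:

```python
# Ask a specific caching resolver what TTL it reports for a record.
# Requires dnspython; the name and resolver IP below are placeholders.
import dns.resolver

resolver = dns.resolver.Resolver(configure=False)
resolver.nameservers = ["8.8.8.8"]   # swap in the resolver you want to test

answer = resolver.resolve("www.example.com", "A")
print(f"resolver-reported TTL: {answer.rrset.ttl}s")
for rr in answer:
    print("A record:", rr.address)
```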

Questions:

1) Can anyone recommend a good carrier-neutral datacentre in either Western Europe (Amsterdam, etc.) or the Eastern USA (DC, VA, NY, etc.)?

2) Has anyone made this DNS solution work properly, or is it total madness?

3) Am I the only one who thinks an £80k quote for multihoming (in one location) is absolutely excessive?

4) Does anyone have a good way I can persuade management that BGP is the only realistic solution?

Apologies for the length :o)

– Tom O'Connor

4 Answers


Well, you're right - DNS is definitely not the answer. Take that from someone who has run multi-homed ISP networks, and now does DNS for a living.

What was the £80k quote for - just BGP and an additional transit feed, or the necessary Cisco routers too? The 2800s you're currently listing probably aren't capable of holding a full routing table - there are currently over 200k routes in the global BGP4 table, and that takes a lot of router memory.
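
For a sense of scale, here's a back-of-envelope sum; the bytes-per-prefix figure is an assumed round number for illustration only, not a Cisco specification, and real usage also depends on path attributes, peer count and the platform's FIB structures:

```python
# Back-of-envelope BGP RIB memory estimate -- illustrative assumptions only.
routes = 200_000         # rough size of the global IPv4 table at the time
bytes_per_prefix = 250   # assumed round number per prefix incl. attributes
transit_feeds = 2        # one Adj-RIB-In copy per upstream

rib_mb = routes * bytes_per_prefix * transit_feeds / 2**20
print(f"rough RIB estimate: {rib_mb:.0f} MB, before FIB and IOS overhead")
```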

It's a couple of years since I was doing this for real, but actually getting BGP from transit suppliers shouldn't be expensive - indeed the larger scale suppliers expect to offer BGP as part of the service, particularly if you're taking 100+ Mbps.

Also, where's the current main data center? You don't necessarily need massive diversity - my network originally had two DCs in London, one in the City and one in Docklands, about 10km apart. That's far enough to rule out almost any natural disaster.

If you have both sites in London, there are a number of companies that offer cheap Ethernet links between the many data centers in the city. One of the best regarded is Datahop - they do 1 Gbps links between sites for about £4k per annum.

Similarly, for the backup site, if you only want the second transit link to be used in emergencies, I've seen stupidly low prices from the likes of Cogent at around £5 per Mbps pcm. I wouldn't use them as a primary, but as a transit of last resort they're worth considering.

– Alnitak
  1. My previous employer is in one of Equinix's NJ facilities. They seemed happy enough with it when I was working for them. Beyond that, sorry, I don't really do much in that part of the world.

  2. DNS failover sucks. As you identify, there are enough providers out there who ignore DNS TTLs that DNS failover will cause management heartburn the first time it happens.

  3. Yes, that is an outrageous amount of money for BGP multihoming.

  4. Without knowing the psychology of your management, I can't suggest much specific. Find a non-stupid quote for BGP, and remind them what it really costs to have a completely redundant facility -- it's a lot more than they apparently think, especially once you throw in the need to do regular failover tests to make sure everything's still working properly.

Also, do some sensible analysis of failure scenarios and probabilities, and what it actually costs if one of those happens. It could turn out that having a few hours of downtime every few years due to a power outage costs a lot less than a redundant facility. Management (or the techies) often go on a "redundancy spree" that doesn't make any economic sense.
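
A toy expected-cost comparison is usually enough to frame that conversation; every figure below is a placeholder to be replaced with your own revenue, outage-probability and build/run numbers:

```python
# Toy comparison: tolerate rare outages vs. pay for a second facility.
# All figures are placeholders -- substitute your own.
outage_prob_per_year = 0.2         # e.g. one serious power incident per 5 years
outage_duration_hours = 6
cost_per_hour_down = 2_000         # lost revenue + staff time (GBP)
second_site_annual_cost = 60_000   # colo, hardware amortisation, links, testing

expected_outage_cost = outage_prob_per_year * outage_duration_hours * cost_per_hour_down
print(f"expected annual outage cost: £{expected_outage_cost:,.0f}")
print(f"second facility annual cost: £{second_site_annual_cost:,.0f}")
```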

Finally, remember that most outages are actually human-instigated, and failover sites and all that extra complexity are likely to increase the chances of those, not reduce them.

– womble

Just a few quick thoughts;

  • Split your kit across two racks, each powered by different spurs from the same phase of the PDU.
  • Put UPSs into each rack if the PDU doesn't have one.
  • Consider Global Load-Balancing over BGP; it's how we do our active-active multi-site stuff.
  • Consider Telehouse (telehouse.net); they have facilities in Western Europe and the Eastern USA, and they're carrier neutral and highly regarded.
– Chopper3
  • I was under the impression that most colo / rack hosting providers don't approve of you running your own UPS; perhaps that was misinformation. I've had good experience of Telehouse in the past too. – Tom O'Connor Nov 20 '09 at 17:38
  • No experience of colo to be honest – Chopper3 Nov 20 '09 at 20:59
  • A UPS in the DC proper can be in violation of fire codes; the pre-incident plan for those sorts of facilities typically lists where the master shutoffs are to disconnect the DC floor from the central UPSes and generators, and the assumption is that the equipment floors are unpowered once those have been triggered, leaving the firefighters free to work on the DC floor safely. Water + local UPS == unhappy firemen. – womble Nov 26 '09 at 13:19
  • After 20+ years of infra design I've been lucky enough to have never worked in a DC without central UPSs, so thanks for the info - genuinely interesting :) – Chopper3 Nov 26 '09 at 14:28

Simple and good solution: our medium-sized e-commerce site uses Zoneedit DNS for failover, and AlertFox for transaction testing. If you exclude the 1-3 minute hiccups during switchover, our uptime this year was 100%. Cost: $20/year(?) for Zoneedit and $199/month for AlertFox PRO3. Plus two dedicated servers.