One of the major projects I have lined up for 2010 is attempting to mitigate some of the Single Points of Failure (SPOFs) in a network I currently manage.
We currently have a single datacentre rack containing a couple of dozen servers.
Inside the rack, we're redundant and resilient: each server has 2 disks and can withstand one failing.
Our data storage servers have 3+ disks, and can stand one failure. We're quick to repair/replace broken hardware, too.
Each server has at least one replicated partner, and we can stand to lose 1 or 2 out of each cluster (i.e. web, database, storage).
The internet connectivity is provided by two 100 Mbit feeds over Ethernet to our main transit provider, connecting into a pair of Cisco ASA 5500 firewalls in a high availability failover pair. This is not the problem.
As I see it, the two big SPOFs are as follows:
1) Our internet comes from a single transit provider. If their network goes down, we drop off the internet. As we're in a carrier neutral datacentre, it's fairly easy to get a second IP transit in.
2) If something happens to the power in our datacentre, then we're also gone.
Ideally, I'd like servers in 2 datacentres both using diverse routes over multiple IP transit providers, announcing via BGP.
In the second datacentre, I'd be speccing 2x Cisco 28xx series routers, 2x ASA 5500 firewalls, a pair of Catalyst 48 port switches, and a dozen or so Dell servers, roughly to match the primary location.
The management claim that there's massive expense involved with this approach, and the BGP route is excessively expensive. While they seem to be happy to have a second location, BGP seems to be off the table.
The last quotation for multihoming ran close to £80k. (Perhaps they were asking for quotes for gold plated Ciscos!)
Instead, the management feel that this would be better tackled with a DNS-based solution, where our routing is controlled by an uptime monitoring service (like Pingdom) that changes our DNS records (with a 1s TTL) to point to the alternative location in the case of a server failure.
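To make the proposal concrete, here's a rough sketch of the decision logic I imagine such a monitoring service running. It isn't Pingdom's actual API or behaviour; the class name, the "3 failed checks in a row" threshold and the IPs (taken from the documentation ranges) are all my own assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailoverState:
    """Hypothetical model of a DNS failover decision; not any vendor's API."""
    primary_ip: str
    secondary_ip: str
    failures_needed: int = 3       # assumed threshold before failing over
    consecutive_failures: int = 0  # failed checks seen in a row so far

    def record_check(self, primary_up: bool) -> str:
        """Record one health check and return the IP the A record should hold."""
        if primary_up:
            # Any successful check resets the counter and fails back.
            self.consecutive_failures = 0
        else:
            self.consecutive_failures += 1
        if self.consecutive_failures >= self.failures_needed:
            return self.secondary_ip
        return self.primary_ip


# Example IPs from the documentation ranges, not our real addresses.
state = FailoverState("192.0.2.10", "198.51.100.10")
for up in (True, False, False, False):
    target = state.record_check(up)
print(target)  # after three failed checks in a row: 198.51.100.10
```

Note that everything here happens before any DNS record changes, so the detection delay comes on top of whatever caching the resolvers do.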
A huge number of companies use BGP for a reason. This DNS solution just isn't going to cut it, especially given that so many ISPs disregard short TTLs and replace them with longer ones.
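As a back-of-the-envelope check on why the 1s TTL doesn't buy much, here's a small estimate of the worst-case time before clients reach the second site. The 60s check interval, the 3-failure threshold and the 300s resolver floor are illustrative guesses on my part, not measured values, and the function name is made up.

```python
def worst_case_failover_seconds(check_interval_s: int,
                                failures_needed: int,
                                record_ttl_s: int,
                                resolver_min_ttl_s: int = 0) -> int:
    """Rough upper bound: detection time plus the longest a resolver may cache."""
    # Time for the monitor to see enough failed checks in a row.
    detection = check_interval_s * failures_needed
    # A resolver that enforces a minimum TTL ignores our shorter one.
    effective_ttl = max(record_ttl_s, resolver_min_ttl_s)
    return detection + effective_ttl


# Well-behaved resolver that honours the 1s TTL (assumed check settings).
print(worst_case_failover_seconds(60, 3, 1))        # 181
# Resolver that clamps TTLs to an assumed 300s floor.
print(worst_case_failover_seconds(60, 3, 1, 300))   # 480
```

So even in the best case most of the outage is detection time, and with a resolver that rewrites TTLs the record's own TTL hardly matters.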
Questions:
1) Can anyone recommend a good carrier neutral datacentre in either Western Europe (Amsterdam, etc), or Eastern USA (DC, VA, NY, etc)?
2) Has anyone made this DNS solution work properly, or is it total madness?
3) Am I the only one thinking that an £80k quote for multihoming (in 1 location) seems absolutely excessive?
4) Does anyone have a good way I can persuade the management that BGP is the only realistic solution?
Apologies for the length. :o)