Essentially, there are two technology choices available for this (that I'm aware of):
- As the OP pointed out, fail over to another DC by updating DNS so that the records point to addresses in the operational datacenter.
- IP anycast: DNS publishes a single IP address that is announced from both datacenters, so each customer's routers choose the nearest datacenter. Note that if a datacenter fails, the customers 'nearest' to that DC will still see a short outage until the BGP routes have re-converged.
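To make the anycast behaviour a bit more concrete, here is a toy model in Python. It is only a sketch: real BGP best-path selection looks at many attributes, whereas this assumes a single made-up "hops" metric per datacenter, and the names `nearest_datacenter`, `dc-east` and `dc-west` are invented for illustration.

```python
from typing import Dict, Optional, Set


def nearest_datacenter(hops: Dict[str, int], announcing: Set[str]) -> Optional[str]:
    """Pick the datacenter with the fewest hops among those still announcing the anycast IP.

    Simplification: real routers run full BGP best-path selection; here a
    single hop count stands in for "nearest".
    """
    candidates = {dc: h for dc, h in hops.items() if dc in announcing}
    if not candidates:
        return None  # nobody announces the address: the service is unreachable
    return min(candidates, key=candidates.get)


# Hypothetical customer that is 2 hops from dc-east and 5 hops from dc-west.
customer_hops = {"dc-east": 2, "dc-west": 5}

before = nearest_datacenter(customer_hops, {"dc-east", "dc-west"})
# dc-east fails and its route is withdrawn; traffic moves once routing re-converges.
after = nearest_datacenter(customer_hops, {"dc-west"})
print(before, after)
```

The short outage mentioned above is the time between the failure and the moment the customer's routers actually see the withdrawn route, which this model does not simulate.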
> Because switching DNS when the server is down seems too slow except when TTL = 0
You can set the TTL to zero, but don't expect all networks to obey your setting. In practice, around 10 minutes is the lowest TTL you can rely on. This means DNS-based fail-over will take between 0 and 10 minutes for each customer, depending on how much of the TTL is left in their resolver's cache.
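The "between 0 and 10 minutes" window is just arithmetic on the TTL. A minimal sketch, assuming resolvers honour the TTL exactly and adding a hypothetical `detection_seconds` for the time it takes you to notice the outage and push the DNS change (both function names are made up for this example):

```python
from typing import Tuple


def remaining_cache_seconds(ttl_seconds: int, cache_age_seconds: int) -> int:
    """Seconds a resolver keeps serving the old record after the DNS update."""
    if not 0 <= cache_age_seconds <= ttl_seconds:
        raise ValueError("cache age must be between 0 and the TTL")
    return ttl_seconds - cache_age_seconds


def failover_window(ttl_seconds: int, detection_seconds: int = 0) -> Tuple[int, int]:
    """Best and worst case delay (seconds) before a customer reaches the new DC."""
    best = detection_seconds                 # cached record expired right at the update
    worst = detection_seconds + ttl_seconds  # record was cached just before the update
    return best, worst


TTL = 600  # 10 minutes, the practical floor mentioned above

print(failover_window(TTL))
print(failover_window(TTL, detection_seconds=30))
print(remaining_cache_seconds(TTL, cache_age_seconds=450))
```

Resolvers that ignore or clamp the TTL would push the worst case beyond what this computes.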
> Dropping like 1000 requests would be fine, but more is bad and should never happen.
To the very best of my knowledge, that is simply outside of what is technically possible today. Even the very biggest sites use DNS- or anycast-based technology, and try hard to keep their datacenters near 100% uptime, because there is no way to get instant fail-over at the global Internet level.
Inside a LAN you can use something like VMware vMotion to switch over really fast, but that's on your own, end-to-end controlled LAN.
My take is that global load balancing is impractical except for the very biggest sites with lots of technical expertise:
- Many load balancing appliances list geo-distribution as a feature bullet point, but if the entire DC is down, isn't your load balancer down too? (Edit: I just re-read DLux's answer, and I think I understand this now. You get two load balancers and put one in each DC, with a heartbeat between them. When the LB in the live DC notices that its colleague in the dead DC has fallen off the net, the live LB updates DNS to trigger the fail-over.)
- Using anycast is something I would personally not attempt. The technology exists and is operational, but what if there is a weird, rare routing problem? Troubleshooting network issues is hard enough as it is; "optimizing" on BGP should be left to true experts.
- So what's left is DNS-based failover, preferably using a globally replicated DNS provider that offers split-horizon DNS as a service. That will work, and deployment is fairly straightforward. It does not, however, meet the OP's goal of near-instant failover.
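The heartbeat scheme from the load balancer bullet can be sketched roughly like this in Python. This is not how any particular appliance does it. `PeerMonitor`, `remove_dead_dc`, the addresses and the `published` dictionary (standing in for a real DNS provider API) are all hypothetical, and the sketch assumes that a silent peer really means a dead DC rather than a broken link between the two sites.

```python
import time
from typing import Callable


class PeerMonitor:
    """Runs on the LB in one DC and watches heartbeats from the LB in the other DC."""

    def __init__(self, timeout_seconds: float, on_peer_dead: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.timeout_seconds = timeout_seconds
        self.on_peer_dead = on_peer_dead
        self.clock = clock              # injectable so the logic can be tested
        self.last_seen = clock()
        self.failed_over = False        # fail-back is deliberately left out

    def heartbeat_received(self) -> None:
        """Call whenever a heartbeat from the peer LB arrives."""
        self.last_seen = self.clock()

    def check(self) -> bool:
        """Trigger the fail-over once if the peer has been silent for too long."""
        silent_for = self.clock() - self.last_seen
        if silent_for > self.timeout_seconds and not self.failed_over:
            self.failed_over = True
            self.on_peer_dead()
        return self.failed_over


# Stand-in for the records held at the DNS provider (documentation addresses).
published = {"www.example.com": ["192.0.2.10", "198.51.100.10"]}


def remove_dead_dc() -> None:
    # A real setup would call the DNS provider's API here (assumption).
    published["www.example.com"] = ["192.0.2.10"]


now = [0.0]  # fake clock so the example runs instantly
monitor = PeerMonitor(timeout_seconds=3.0, on_peer_dead=remove_dead_dc,
                      clock=lambda: now[0])

now[0] = 2.0
monitor.heartbeat_received()   # peer is alive at t=2
now[0] = 4.0
monitor.check()                # silent for 2s, still within the timeout
now[0] = 6.0
monitor.check()                # silent for 4s, DNS gets updated
print(published)
```

Even with a fast heartbeat, customers only move over once their cached DNS records expire, so this does not get around the TTL limit.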
Disclaimer, I would like input / corrections from a true expert who has fielded globally redundant systems many times before... :-)