I am building a web application where uptime is key. I understand that 100% uptime is not realistic but I would like to achieve five nines. I'm unsure as to the most prudent way to accomplish this.

My preliminary plan was to have the web app running in two geographically separate data centres. The "main" data centre would contain the master server, which would replicate to the otherwise unused "slave" server elsewhere. If downtime occurred at the main data centre, DNS failover would move traffic to the "slave" server. There are challenges with this technique, including some users being unable to access the site for a while because of stale cached DNS records.

However, I have read a lot of opinions stating that DNS failover is not a great solution and that you should keep everything in a single data centre and focus on redundancy there. The issue I see with that is that even the good data centres seem to have the odd network issue, which can cause enough downtime to blow apart the expectation of five nines.

Should I go with the DNS failover option? Are there better options?

  • 1
    That's 5 minutes of downtime per year. DNS records are commonly cached for a week, and sometimes inconsiderate DNS cache operators keep them for a month or more. You can see the problem. – Chris S Jan 11 '13 at 19:52
  • 4
    If you had a valid business case for five nines, you probably wouldn't be asking it here. Your boss might say "we need five nines" but when he/she understands the cost of five nines, that requirement will likely be loosened. – northben Jan 11 '13 at 21:00

4 Answers


My rule of thumb for clients is: two nines you get for free (i.e., without spending anything specifically on high availability). Every extra nine increases total cost by up to an order of magnitude.

That is to say, you can have 99% uptime by just putting your application on a half-decent server on your corporate internet connection. To improve on that, you can colocate. You can colocate with load-balancing and fast failover. You can colocate with load-balancing, fast failover, and a cold spare DR site. You can colocate with load-balancing, a hot spare site, PI address space, run your own ASN and have BGP peering arrangements in place to ensure that your address space is always globally routable. You can investigate high-availability hardware, where everything including memory and CPUs can be quiesced and hot-swapped. If your application supports it, you can run fully distributed hosting, or outsource to highly available content delivery networks. You will also need five times as many staff to manage all this 24x365, including holiday and sickness cover, and to run the frequent live DR tests you will need in order to have confidence in all of this.

You can do lots of clever stuff. But it all costs, and most of it costs a very large amount of money.

So my sincere advice is: work out what it'd cost you to host your app on a single server in the corporate office. If your employer isn't willing to spend up to a thousand times as much as that, forget five-nines; it's not realistic.
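To put rough numbers on that rule of thumb, here is a minimal Python sketch. It assumes the rule exactly as stated above (two nines for free, each extra nine costing up to ten times more); the function names are made up for illustration, and the cost figure is an upper bound from the rule, not a measurement.

```python
MINUTES_PER_YEAR = 365.25 * 24 * 60  # average year, including leap years


def availability(nines: int) -> float:
    """Availability as a fraction for a given number of nines, e.g. 3 -> 0.999."""
    return 1 - 10 ** -nines


def downtime_minutes_per_year(nines: int) -> float:
    """Yearly downtime budget implied by that number of nines."""
    return MINUTES_PER_YEAR * 10 ** -nines


def max_cost_multiplier(nines: int, free_nines: int = 2) -> int:
    """Upper bound on cost relative to the 'free' two-nines baseline.

    Assumption: each nine beyond the free ones costs up to 10x more,
    as in the rule of thumb above.
    """
    return 10 ** max(0, nines - free_nines)


if __name__ == "__main__":
    for n in range(2, 6):
        print(
            f"{n} nines: {downtime_minutes_per_year(n):8.1f} min/year, "
            f"up to {max_cost_multiplier(n)}x baseline cost"
        )
```

Five nines works out to a little over five minutes of downtime per year, at up to a thousand times the cost of the single-server baseline.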


If five nines were easy, Twitter, Facebook, Gmail, Azure and Amazon would probably already be there. They definitely have the money and the most valid business cases for it. Instead, I would recommend you aim for hosting with a cloud provider that has expertise in providing reliable infrastructure, so that they can worry about this while you develop your product.


For five nines, you're looking at a lot more involvement than just one failover solution. You need HA within one datacenter plus a hot (or at least warm) standby datacenter that's geographically far but topologically near your primary data center. And that's just the start...
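A back-of-the-envelope way to see why a standby site alone is not enough is to combine availabilities. This is only a sketch: it assumes failures are independent and failover is instantaneous, both of which are optimistic, and the numbers are hypothetical placeholders rather than figures for any real data center.

```python
def parallel_availability(*availabilities: float) -> float:
    """Availability of redundant components.

    The group is down only if every member is down at the same time.
    Assumes independent failures and instantaneous failover.
    """
    p_all_down = 1.0
    for a in availabilities:
        p_all_down *= 1.0 - a
    return 1.0 - p_all_down


def serial_availability(*availabilities: float) -> float:
    """Availability of a chain in which every component must be up."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


if __name__ == "__main__":
    site = 0.999       # hypothetical availability of one data center
    failover = 0.9999  # hypothetical availability of the failover mechanism

    two_sites = parallel_availability(site, site)
    print(f"two sites in parallel:    {two_sites:.6f}")
    print(f"including failover layer: {serial_availability(two_sites, failover):.6f}")
```

On paper, two three-nines sites give six nines, but anything that sits in series with them (the failover mechanism, DNS, shared dependencies) caps the result at its own availability or lower.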


I imagine there is a boss-wants-powerpoint-compatible-selling-points thing here, but getting five nines or really close to it is possible - though you have to be careful about defining exactly what it is that needs to have the five nines uptime.

I am writing an application that collects data from IoT (also boss / powerpoint compliant) devices and presents the collected data to end users, does data mining and so on using MongoDB and such.

We actually have a perceived uptime of at least 99.9% at this point. How? Well, our uptime is defined as availability of the user front end application. That part is run on GAE while the other parts (like MongoDB) are run on our own servers. Communication is via REST and a lot of crypto. GAE has 99.45% uptime right now - but actually, for the parts we are using, it is higher - we have yet to log any kind of outage.

MongoDB, on the other hand, is at times a little flaky - not much - but getting 98-99% uptime is the best we can do just now. On top of MongoDB, we have an engine that generates JSONified chunks of data - those are generated on request but also periodically. Caching those is quite helpful in maintaining the perceived uptime of the entire system. End users do not know whether some device delivered data to the backend just now - or an hour ago. Thus, cached data seems just as fresh as 'actual' fresh data.
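The caching idea can be sketched roughly like this in Python. This is not our actual code; the class name, the `fetch` callback and the TTL value are all hypothetical, and a real setup would also need cache size limits and some monitoring of how stale the served data gets.

```python
import time
from typing import Any, Callable, Dict, Tuple


class StaleOnErrorCache:
    """Serve the last good value whenever the backend call fails."""

    def __init__(
        self,
        fetch: Callable[[str], Any],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch    # hypothetical backend call, e.g. a database query
        self._ttl = ttl_seconds
        self._clock = clock    # injectable so the behaviour can be tested
        self._store: Dict[str, Tuple[float, Any]] = {}

    def get(self, key: str) -> Any:
        now = self._clock()
        cached = self._store.get(key)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # still fresh: no backend call needed
        try:
            value = self._fetch(key)
        except Exception:
            if cached is not None:
                return cached[1]  # backend is down: fall back to stale data
            raise  # nothing cached yet, so the failure is visible to the caller
        self._store[key] = (now, value)
        return value
```

As long as a chunk has been generated at least once, a backend outage shows up to the end user as slightly older data instead of an error.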

So - getting really high uptime is certainly possible if you are good at isolating the bits that actually need to have high uptime. Getting the whole stack to five nines uptime is HARD and really expensive as others have pointed out. But you can probably do with less and still make your boss happy.
