319

We received an interesting "requirement" from a client today.

They want 100% uptime with off-site failover on a web application. From our web application's viewpoint, this isn't an issue. It was designed to be able to scale out across multiple database servers, etc.

However, from a networking standpoint, I just can't seem to figure out how to make it work.

In a nutshell, the application will live on servers within the client's network. It is accessed by both internal and external people. They want us to maintain an off-site copy of the system that in the event of a serious failure at their premises would immediately pick up and take over.

Now we know there is absolutely no way to resolve it for internal people (carrier pigeon?), but they want the external users to not even notice.

Quite frankly, I haven't the foggiest idea of how this might be possible. It seems that if they lose Internet connectivity then we would have to do a DNS change to forward traffic to the external machines... Which, of course, takes time.

Ideas?

UPDATE

I had a discussion with the client today and they clarified the issue.

They stuck by the 100% number, saying the application should stay active even in the event of a flood. However, that requirement only kicks in if we host it for them. They said they would handle the uptime requirement if the application lives entirely on their servers. You can guess my response.

gWaldo
  • 11,887
  • 8
  • 41
  • 68
NotMe
  • 3,772
  • 7
  • 30
  • 43
  • 50
    Don't underestimate the huge downtime caused by hacking; look at Sony and the PlayStation Network. You can guarantee they had the same 100% uptime idea and the money/hardware to back it up. Make clear to the client that 100% uptime is an infeasible expectation; even Google techs would be hesitant to mutter "100% uptime". A hint, by the way, is to look into dynamic DNS with records cached for only 60 seconds, though that assumes the OS and local DNS servers honour the TTL. – Silverfire Sep 29 '11 at 00:39
  • 184
    I would personally **RUN** from this client as fast as possible. I suspect this won't be the last crazy idea they may have (from a technology standpoint). – GregD Sep 29 '11 at 00:53
  • 138
    I wish I could downvote your client. – joeqwerty Sep 29 '11 at 02:03
  • 5
    Haha. Too bad there wasn't a client of shame wall.. – GregD Sep 29 '11 at 02:09
  • 82
    If you figure out 100% uptime let me know. I'll create a business with it and sell it to Google. It's impossible to guarantee 100%. Even companies like Microsoft, Amazon or Google won't go that high, because they know it's impossible. The best I've seen is 99.999%, and even that is a stretch (about 5 minutes of downtime in a year). The best you could probably do is 99.99% reliably. – Matt Sep 29 '11 at 04:50
  • 39
    Just make up an insanely high price tag to put on their insane request. That will probably bring them back to their senses. Either that, or it will send them off looking for someone willing to lie to them. – Nate C-K Sep 29 '11 at 05:28
  • 4
    Agree with the client on 99.99% and clearly state that, even though this implies some possible obligations for your party (fees), it also means the cost estimate for them will be 50% higher than it would be without guaranteed uptime. Put all of this into a money perspective, and they'll give up. Source: direct experience. Talk money and people will understand. – gd1 Sep 29 '11 at 07:14
  • 3
    Instead of saying it can't be done, just quote them a crazy price. I guarantee you if you tell them that it will cost $100 billion to have 100% uptime they won't take you up on it. – Kyle Cronin Sep 29 '11 at 07:21
  • 6
    No problem. Just say you can guarantee an uptime of 100% (rounded to zero decimal places). – Nick Pierpoint Sep 29 '11 at 10:11
  • 9
    It doesn't have to be an insanely high figure. Just realistic. Draw them a graph of the costs of high availability, and the costs of downtime, and talk about the point where they meet. If downtime costs you $millions per hour then 5 nines is achievable, since you have $millions to spend on HA. If it costs $100/hour then two nines is probably all that you should consider. – Colin Pickard Sep 29 '11 at 10:21
  • 6
    How much are you charging your client? Super-high availability computing is usually the territory of HP NonStop, IBM zSeries, etc. always with multiple, synchronized datacenters. Those companies charge big bucks for their services, so unless you are charging 7-figures for this, don't even consider agreeing to something like 99.999%, let alone 100%. – Martin Sep 29 '11 at 10:44
  • 19
    Don't worry, use the Microsoft solution: 100% uptime guaranteed* (*money back for the period when it's down). – MSalters Sep 29 '11 at 10:51
  • 7
    100% uptime? Sure I do 100% uptime. I'll carve your website in slabs of marble and mail it to every person on Earth, twice. – jimworm Sep 29 '11 at 10:57
  • 4
    I can guarantee 100% uptime. First step. Transfer every single pound|dollar|coin to my account. Second Step. Wait. This will take a long time. – Tom O'Connor Sep 29 '11 at 15:02
  • 3
    **Unobtainium** – GregD Sep 29 '11 at 20:26
  • Was your response "It's been a pleasure not working with you"? – Justin ᚅᚔᚈᚄᚒᚔ Sep 30 '11 at 03:07
  • 16
    As a reminder: Youtube went down today – Nemo Sep 30 '11 at 03:57
  • @Justin: no. Because they were willing to (1) host it completely and (2) take full responsibility, this issue went away from my perspective. After more details came out, it only mattered if I was hosting the primary site for them. So my response was "Sounds great." – NotMe Sep 30 '11 at 14:38
  • 3
    @ChrisLively: Dodged a bullet then :) – Justin ᚅᚔᚈᚄᚒᚔ Sep 30 '11 at 14:40
  • 1
    @ChrisLively thank god for that then right? Even if they want a failover, which is perfectly fine, it's not going to be 100%. There will be that delay where it does switch over, or it may not even switch over and you have to manually do it. While you had no downtime as far as having the data and hardware there, you still have that availability issue, which you just can't avoid. So, the users may not even notice, they may try to hit the app and it's down for a few minutes or whatever, who cares. – Matt Sep 30 '11 at 15:54
  • 3
    @ChrisLively For the love of all that is holy, make sure there are provisions in the contract to make sure when they inevitably experience downtime, they don't try to blame it on your code and take legal action. Make that as ironclad as possible so you can just take their money and then be able to sleep at night. – MattC Sep 30 '11 at 16:30
  • 2
    Some services which advertise 100% uptime offer a penalty clause in the small print like "2 days free for every day the service is down". So the 100% is for marketing with some money reserved for paybacks. No engineer would offer 100% uptime and some would find another job if they found their services had been advertised as such with the expectation of 100% uptime. Also, maintenance windows should be arranged beforehand as part of the SLA and when planned these don't count against the 100%. – Stuart Woodward Oct 01 '11 at 09:47
  • 1
    Just tell them that of course you're going to give them 110%. That should completely satisfy them :) – Geoffrey McGrath Mar 29 '12 at 02:25
  • What I have seen in the past is, a milestone payment is reached once a mission critical system is able to achieve 99.9% uptime over 3 months. As soon as it hits lower than 99.9% the count starts over again. Downtime includes time it takes to patch the software with fixes to improve reliability. Any more than that and you're just giving them the right to swindle you for free concessions when you can't meet the cut. Without a reasonable limit you're basically giving them unlimited scope or a good reason to screw you for free money in court. – Evan Plaice Mar 29 '12 at 06:33
  • Tell them that your providers (hardware, network, kernel developers...) don't give you 100% uptime, so you can't guarantee 100% uptime. End of discussion. If they still ask, inform them that no provider offers a 100% SLA. Guaranteeing 100% means you would also have to audit all the hardware (network included), and for that you would need to go down to the chip level. 99.x you can get with balancers and DNS tricks, but even then there is going to be a short window where some information couldn't be sent. So 100% means the machines, network and power supply never fail. – Guillermo May 28 '13 at 11:53
  • tell them you can't give them 5 9s, but you can give them 9 5s – warren Jun 17 '13 at 18:05
  • 1
    I can (almost) guarantee that if you host with me you'll get 100% uptime for the next 5 minutes! – hookenz Jul 12 '13 at 02:19
  • Make sure you know what their definition of "uptime" is. You might just need to host your servers in a really tall datacenter, or turn the server rack sideways. Or just setup a tarpit on the server network(s) so that pings always get responses. Or force clients to install a kernel patch that presents a failing login page if accessing one of your domains on ports 80 or 443 fails. That way, as far as the client's concerned, your site's up; they're just locked out. And you can explain "we just have very strict security policies in place". That'll shut them up. – Parthian Shot Jul 17 '14 at 18:14

27 Answers

370

Here is Wikipedia's handy chart of the pursuit of nines:

[Table from Wikipedia's "High availability" article: the downtime allowed per year, month, and week for each number of nines]

Interestingly, only 3 of the top 20 websites were able to achieve the mythical five nines (99.999% uptime) in 2007. They were Yahoo, AOL, and Comcast. In the first 4 months of 2008, some of the most popular social networks didn't even come close to that.

From the chart, it should be evident how ridiculous the pursuit of 100% uptime is...
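
For reference, the downtime budgets behind that chart are easy to recompute yourself. A quick sketch (plain Python, no external libraries):

```python
# Allowed downtime for a given availability target, over a year and a month.
SECONDS = {"year": 365.25 * 24 * 3600, "month": 30 * 24 * 3600}

def downtime(availability_pct, period="year"):
    """Seconds of downtime allowed at a given availability over a period."""
    return (1 - availability_pct / 100.0) * SECONDS[period]

for target in (99.0, 99.9, 99.99, 99.999, 100.0):
    print(f"{target:>7}% -> {downtime(target) / 60:8.2f} min/year, "
          f"{downtime(target, 'month') / 60:7.2f} min/month")
```

At 100% the budget is exactly zero seconds, forever, which is the whole problem.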

Skyhawk
  • 14,149
  • 3
  • 52
  • 95
GregD
  • 8,713
  • 1
  • 23
  • 35
  • 64
    Pingdom also isn't checking every second. On top of that, the ones that did meet five nines likely still had localized disruptions that Pingdom might not have detected, or glitches that made some services unavailable while still responding to pings. – ceejayoz Sep 29 '11 at 01:16
  • 8
    Which in and of itself makes the five nines dubious... – GregD Sep 29 '11 at 01:22
  • 5
    Precisely. And they've got $billions to work with! – ceejayoz Sep 29 '11 at 01:32
  • 1
    I seem to recall someone once wrote a nice summary of the potential costs associated with pursuing each of the nines...or maybe I'm imagining that. – GregD Sep 29 '11 at 03:57
  • I'm amazed: marketing and basic math meet so rarely these days that it almost feels wrong when someone makes it happen. +1 – ZJR Sep 29 '11 at 05:26
  • 43
    Sorry to disturb the chat going on, but the OP's question was how to go about striving towards the goal of 100% uptime on a technical level, not conceptually. I'm sure he knows it's not always possible because of natural occurrences that happen to hardware and the environment. Could we help him with that? – David d C e Freitas Sep 29 '11 at 10:11
  • So, you have agreed with your client that you want some seriously high uptime (not 100%) and you still have an issue with app to appserver communications. I thought about this quite a few times and it seems to me you could do something in order to make your app-appserver link redundant, i.e. have the app talking to both servers at all times, and add a mechanism that will trash the unused duplicate at app level, with a clean replication at the appserver level. – Morg. Sep 29 '11 at 10:22
  • 1
    On the 100% uptime, you can offer "some" of it if you like, it's possible for example to offer 100% uptime on the application but not on the service, i.e. guaranteeing that there will be an appserver to answer calls at all times - which by itself still means the service could be interupted due to DNS, ISP, etc. failures. You can guarantee with a few hundred servers in a few hundred locations that at least one will be alive at any given time - either way if it's not, your client won't be able to notice anyway ;) – Morg. Sep 29 '11 at 10:25
  • And .. he doesn't want me to edit, so here are the details : i.e. have the app talking to both servers at all times, and add a mechanism that will trash the unused duplicate (like don't listen to second if first was ok, send every message to both appservers) at app level, with a clean replication at the appserver level (i.e. avoid duplicates @ replication time through simple checks based on app. arch - like transaction id's given by the app or whatever.). – Morg. Sep 29 '11 at 10:28
  • Great info. It was hard to pick a winner here. Thank you. – NotMe Sep 29 '11 at 19:07
  • 1
    @DavidFreitas I know what the OP was asking. I posted what I did to put what his clients were asking for, into some type of context that was visually, easy to see. I was hoping that off the bat, he'd see that 100% uptime is not possible and work back from there to a more feasible/affordable uptime goal.... – GregD Sep 29 '11 at 20:19
  • 5
    To the OP: I have seen SLAs that guaranteed uptime in the context of "outside of normal maintenance". The normal maintenance of course being scheduled downtime per month for updates, patches, etc., that usually occur on their least busy day of the month during the least busy times of the month (usually in the middle of the night). They must have some type of metrics for how busy their business is at different times. You **could** offer better uptime (4 nines) for them **only** during those times. – GregD Sep 29 '11 at 20:23
  • @GregD: The uptime requirement is for a particular 6 month period out of the year (seasonal). It wasn't important the other 6 months; which gives us plenty of time to push updates. – NotMe Sep 29 '11 at 22:37
  • @Miles Erickson Thanks for the edit. I knew there was something I'd forgotten to go back and add.. – GregD Sep 30 '11 at 02:09
  • @DavidFreitas: You're right. I've added [someone's answer](http://serverfault.com/questions/316637/100-uptime-for-web-application/317285#317285) from Hacker News which to me seems much more mature. – Jungle Hunter Sep 30 '11 at 15:51
  • I read each line as 100% (rounded to zero decimal places) - http://serverfault.com/a/316757/155 – Nick Pierpoint Jan 06 '12 at 15:39
  • Interesting, Six Sigma in regards to downtime. – Chad Harrison Aug 30 '12 at 15:42
  • True figures, but you didn't even address the question: how to configure the network to make it fail over automatically. – hookenz Sep 01 '13 at 22:03
  • @Matt I did answer the question by pointing out how ridiculous this pursuit of 100% is in the context of total downtime in a year, month and week. I also indicated to the OP that IF the client still insists on 100% uptime given the context in my answer, that I would RUN from that client because that expectation cannot be met. OP also updated his question AFTER my answer, with some clarification from the client. – GregD Oct 20 '13 at 13:53
191

Ask them to define 100%, how it will be measured, and over what time period. They probably mean as close to 100% as they can afford. Give them the costings.

To elaborate: I've been in discussions with clients over the years about supposedly ludicrous requirements. In every case they were actually just using language that wasn't precise enough.

Quite often they frame things in ways that appear absolute, like 100%, but in actual fact, on deeper investigation, they are reasonable enough to do the cost/benefit analysis required once they are presented with the costings for risk mitigation. Asking them how they will measure availability is a crucial question. If they don't know, then you are in the position of having to suggest that this needs to be defined first.

I would ask the client to define what would happen in terms of business impact/costs if the site went down in the following circumstances:

  • At their busiest hours for x hours
  • At their least busy hours for x hours

And also how they will measure this.
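
To turn their answers into numbers, here is a rough sketch of the expected cost of downtime at each availability level (every dollar figure and fraction below is a placeholder to be replaced with the client's own estimates):

```python
# Expected annual cost of downtime, from the client's own business-impact estimates.
# All inputs are hypothetical placeholders.
cost_per_hour_busy = 50_000    # $ lost per hour of downtime during peak hours
cost_per_hour_quiet = 1_000    # $ lost per hour of downtime off-peak
busy_fraction = 0.25           # share of the year that counts as "busy"
hours_per_year = 365.25 * 24

def expected_downtime_cost(availability_pct):
    down_hours = (1 - availability_pct / 100.0) * hours_per_year
    # Assume outages fall evenly across busy and quiet hours.
    blended_cost = (busy_fraction * cost_per_hour_busy
                    + (1 - busy_fraction) * cost_per_hour_quiet)
    return down_hours * blended_cost

for target in (99.0, 99.9, 99.99, 99.999):
    print(f"{target:>7}% -> ${expected_downtime_cost(target):>12,.0f} per year at risk")
```

Setting that figure against the price of achieving each additional nine (as several comments above suggest) usually moves the conversation from "100%" to a number everyone can live with.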

In this way you can work with them to determine the right level of '100%'. I suspect that by asking these kinds of questions they will be better able to prioritise their other requirements. For example, they may want to pay for a certain level of SLA and compromise on other functionality in order to achieve it.

Preet Sangha
  • 2,727
  • 1
  • 24
  • 25
  • 21
    Agreed. They may just mean "very high" uptime (upper 90s?) with a pretty solid failover strategy. If not, then an explanation of the cost scale involved would hopefully persuade them... – Martin Dow Sep 29 '11 at 10:55
  • 33
    +1 for not jumping to conclusions, and instead just asking the client to explain what they have in mind. – sleske Sep 29 '11 at 14:10
  • 4
    I echo the "not jumping to conclusions" statement...if the customer means 100% uptime (minus scheduled maintenance) then it *may* be more of a reasonable requirement. – Tim Reddy Sep 29 '11 at 18:26
  • 1
    Regarding business impact, we actually know and understand their business completely and the costs involved for the site going down are not financial. More along the lines of the natives showing up with pitchforks, potential hangings, etc. ;) Just imagine 40,000 people showing up at your front door screaming. That's what they want to avoid with a passion. – NotMe Sep 29 '11 at 22:43
  • 7
    @ChrisLively All the more reason to have a mature understanding of risk then. The dominant paradigm for safety engineering is [probabilistic risk assessment](http://en.wikipedia.org/wiki/Probabilistic_risk_assessment). There are systems that could kill (not just annoy) thousands of people and they still have a low, hopefully well understood, but non-zero probability of failure. – poolie Sep 29 '11 at 23:41
138

Your clients are crazy. 100% uptime is impossible no matter how much money you spend on it. Plain and simple: impossible. Look at Google, Amazon, etc. They have nearly endless amounts of money to throw at their infrastructure, and yet they still manage to have downtime. You need to deliver that message to them and, if they continue to insist, push them toward reasonable demands. If they don't recognize that some amount of downtime is inevitable, then ditch 'em.

That said, you seem to have a handle on the mechanics of scaling/distributing the application itself. The networking portion will need to involve redundant uplinks to different ISPs, getting an ASN and IP allocation, and getting neck-deep in BGP and real routing gear so that your IP address space can move between ISPs if need be.

This is, quite obviously, a very terse answer. You haven't had experience with applications requiring this degree of uptime, so you really need to get a professional involved if you want to get anywhere close to the mythical 100% uptime.

EEAA
  • 108,414
  • 18
  • 172
  • 242
  • 7
    Agreed. Totally. Crazy. – jdw Sep 29 '11 at 00:49
  • Well, they used to say going faster than light is impossible because it would need infinite energy... :) – TC1 Sep 29 '11 at 08:25
  • 2
    they used to ?? – Sirex Sep 29 '11 at 09:30
  • 2
    @Sirex Referring to the recent experiment @ CERN where neutrinos have been found to travel faster than light. Results yet to be confirmed by independent scientists though. – TC1 Sep 29 '11 at 10:32
  • 9
    @TC1 I'll bet you [$200](http://xkcd.com/955/) that doesn't pan out. – dpatchery Sep 29 '11 at 13:47
  • 2
    +1 for mentioning BGP and real routing gear. The first thing that came to mind when I read the question was "anycast." And even the best distributed anycast-hosted services (such as Google) still have downtime, even if locally rather than globally. – fluffy Sep 29 '11 at 17:39
  • Yes. They are crazy. However, from a meeting today it looks like they are going to take on the whole crazy problem themselves. Great info about BGP. Thanks, – NotMe Sep 29 '11 at 19:09
  • 1
    @dpatchery: Of course that won't pan out. Because they'll find that the neutrinos actually passed through the Cardiff Rift where all bets are off. – NotMe Sep 29 '11 at 22:44
  • @ChrisLively Thanks, you just made me realize I missed the entire 2011 season. There goes my weekend. – dpatchery Sep 30 '11 at 11:40
  • Their craziness doesn't matter. You need to build an infrastructure that will win the bid, figure out what you can achieve (99.9%/99.99% uptime) and build in enough overhead to cover your SLA penalties. Or negotiate a monitoring methodology that is in your favor. – duffbeer703 Sep 30 '11 at 12:27
  • @duffbeer703 I beg to differ. Their craziness certainly *does* matter. The fact that they're requesting "100% uptime" shines a very bright light on their lack of understanding how things actually work and what's possible. Their lack of understanding and unreasonable expectations will certainly not stop with their uptime expectations. – EEAA Sep 30 '11 at 12:55
  • 5
    @ErikA A request for 100% uptime is indicative of ignorance of technical characteristics of systems. That's ok, because the customer's job is doing whatever they do. Your job is to engineer IT systems. Difficult customers like this can be nightmares, but they can also become your best customers. – duffbeer703 Sep 30 '11 at 13:04
  • @TC1 , I bet someone is poorer by $200 by now! – Vaibhav Garg Mar 23 '13 at 09:22
  • @dpatchery, see above! – Vaibhav Garg Mar 23 '13 at 09:23
  • I think it's more a matter of perception than reality. I'm willing to bet that this customer perceives that Google is up 100% of the time. Of course, Google could go down for an hour every night for maintenance at 3am, but they would never know because they're not using it at that time. – Ernie Jun 09 '15 at 17:30
54

Well, that's definitely an interesting one. I'm not sure I would want to get myself contractually obligated to 100% uptime, but if I had to I think it would look something like this:

Start with the public IP on a load balancer completely out of the network and build at least two of them so that one can fail over to the other. A program like Heartbeat can help with the automatic failover of those.

Varnish is primarily known as a caching solution, but it does some very decent load balancing as well. Perhaps that would be a good choice to handle the load balancing. It can be set up with 1 to n backends, optionally grouped in directors, which will load balance either randomly or round-robin. Varnish can be made smart enough to check the health of every backend and drop unhealthy backends out of the loop until they come back online. The backends do not have to be on the same network.

I'm kind of in love with the Elastic IPs in Amazon EC2 these days so I would probably build my load balancers in EC2 in different regions or at least in different availability zones in the same region. That would give you the option of manually (god forbid) spinning up a new load balancer if you had to and moving the existing A record IP to the new box.
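
As a minimal sketch of that manual step, scripted with boto3 (the health-check URL, allocation ID and instance ID are placeholders, and this assumes the standby lives in the same region, since Elastic IPs don't cross regions):

```python
import urllib.request
import boto3

HEALTH_URL = "http://203.0.113.10/healthz"        # hypothetical check against the active LB
EIP_ALLOCATION_ID = "eipalloc-0123456789abcdef0"  # placeholder Elastic IP allocation
STANDBY_INSTANCE_ID = "i-0fedcba9876543210"       # placeholder standby load balancer

def primary_is_healthy(url, timeout=5):
    try:
        return urllib.request.urlopen(url, timeout=timeout).status == 200
    except Exception:
        return False

if not primary_is_healthy(HEALTH_URL):
    ec2 = boto3.client("ec2", region_name="us-east-1")
    # Re-point the public Elastic IP at the standby box; the A record never changes.
    ec2.associate_address(
        AllocationId=EIP_ALLOCATION_ID,
        InstanceId=STANDBY_INSTANCE_ID,
        AllowReassociation=True,
    )
```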

Varnish cannot terminate SSL, though, so if that is a concern you may want to look at something like Nginx instead.

You could have most of your backends in your client's network and one or more outside their network. I believe, but am not 100% sure, that you can prioritize the backends so that your client's machines would receive priority until such time as all of them became unhealthy.

That's where I would start if I had this task and undoubtedly refine it as I go along.

However, as @ErikA states, it's the Internet and there are always going to be parts of the network that are outside your control. You'll want to make sure your legal agreement only ties you to things that are under your control.

jdw
  • 3,735
  • 1
  • 17
  • 20
  • 2
    For a while I was thinking about Amazon and MS for a cloud deployment but both of them have had major outages over the past couple of months. SSL is critical. – NotMe Sep 29 '11 at 01:57
  • 3
    If you were going to use Amazon, you would definitely want to spread your machines out around the 5 availability zones. It's pretty unlikely that all their zones would go out at the same time. – jdw Sep 29 '11 at 12:10
  • 11
    +1 for actually addressing the OP's main question. – Phil Sep 29 '11 at 14:15
  • you will always have a point of failure, jdw, as long as there's a non-distributed thing in the chain (in your case heartbeat, unless of course you have multiple instances of that running on remote machines all monitoring each other as well as your servers, which any of them may or may not see because of network trouble along the routing). Which brings us to "downtime". The servers may be up and running and still unavailable to the client without heartbeat ever detecting it if the failure is not in the routing path. – jwenting Sep 30 '11 at 09:21
  • Agreed. As EVERYONE else has pointed out, there's no such thing as 100% uptime. All you can do is try and what I described is how I would start trying. – jdw Sep 30 '11 at 10:13
  • While I personally love Amazon's offerings... they do not contractually guarantee 100% uptime, so as a downstream customer of theirs, it seems unwise to promise better than the provider is, and has a record of providing. – SplinterReality Oct 07 '11 at 09:23
30

No problem - slightly revised contract wording though:

... guarantee an uptime of 100% (rounded to zero decimal places).

Nick Pierpoint
  • 639
  • 1
  • 8
  • 14
  • 2
    +1 for noting that 100% is not 100.0% or 100.000%, etc. The decimal digits matter; they indicate precision ;) – Danubian Sailor Sep 29 '11 at 13:13
  • 4
    By some conventions, "100%" has only one significant figure, such that all numbers between one-half and one would round to "100%"; 50% would round to 100%. – Thomas Levine Sep 29 '11 at 23:42
  • 1
    Depending on the standard for counting, some will say that 50% has two meaningful digits where 100% has three; 50.5 and 100 are therefore just as precise. Others will count digits after the decimal point, in which case 50.5 and 100.4 are just as accurate. If nothing else is stated, I would assume that 100% is 99.5% and up, 100.0% is 99.95% and up, etc. – Tillebeck Oct 18 '11 at 08:28
26

If Facebook and Amazon can't do it, then you can't. It's as simple as that.

Paperjam
  • 139
  • 2
  • 8
Mike
  • 21,910
  • 7
  • 55
  • 79
  • 17
    he could be smarter than all their people combined, who knows :p – Matt Sep 29 '11 at 05:40
  • 3
    100% uptime doesn't have to be so literal people - it means: 100% available during the time that it's needed. For example, bank systems should always be available, and they do quite well. Just because they go down for maintenance for 1 second once a year doesn't mean they failed at their 100% uptime goal. – David d C e Freitas Sep 29 '11 at 10:14
  • 13
    @DavidFreitas - I think in contracts it's usually pretty literal... – UpTheCreek Sep 29 '11 at 11:20
  • @UpTheCreek that means that the parties in the contract have to be able to measure the uptime to determine if it's 100% up, thereby needing another checker system that is up 100% of the time to check it... hmm. – David d C e Freitas Sep 29 '11 at 11:48
  • 2
    @Matt just because Facebook/Amazon can't do it doesn't mean a smaller site can't do it. A lot of large websites face much harder problems to overcome than a smaller site. – Xorlev Sep 30 '11 at 13:54
  • @xorlev sorry smaller sites have more to overcome.. they don't have the money the larger sites do. When you are small you are too dependent on 3rd parties that don't offer a 100% uptime SLA – Mike Sep 30 '11 at 13:55
  • @Mike I disagree. With a smaller site you could run a few instances in multiple datacenters and be pretty well insulated from failure on a budget. We run a couple boxes on Amazon West and East, when West failed a month or so back we switched the DNS to be East and had zero-downtime (other than for clients who'd already connected.) The moment you have a massive site you have huge databases to keep in sync over multiple geographic locations which makes a somewhat easy master/slave system into a MUCH harder distributed system. See PNUTS for how Yahoo solved that problem. – Xorlev Sep 30 '11 at 14:00
  • 1
    so what you are saying is you didn't have 100% uptime since you had some clients that had errors.. plus dns isn't an instant switch since you have ISPs that ignore short TTLs – Mike Sep 30 '11 at 14:12
  • @DavidFreitas - There does exist a checking system that has 100% uptime. It's called "customers". The client may stipulate that they consider the provider in breach of contract if they receive X many customer complaints. – Chris Wenham Sep 30 '11 at 14:39
  • @Xorlev I never said he couldn't, hence my comment. While it would be a huge stretch, I really did mean he could be smarter than everyone, or at least intelligent enough to find a way to get 100%. But then again he probably wouldn't be working for a small company then and would be making millions if not billions by now. But to actually get 100% is basically impossible. – Matt Sep 30 '11 at 15:46
  • @DavidFreitas You're right when you say that uptime != availability. You could have 100% uptime, which just means the hardware is up and running, but the services may come down; therefore you can't have 100% availability. It's like running an SQL server: the server may still be running, but SQL might crash and you just need to restart the services, so at that point SQL is not available. I thought about it more after I made my comment, because uptime is not availability, although I assumed that's what the OP meant. – Matt Sep 30 '11 at 15:50
25

To add oconnore's answer from Hacker News

I don't understand what the issue is. The client wants you to plan for disaster, and they aren't math oriented, so asking for 100% probability sounds reasonable. The engineer, as engineers are prone to do, remembered his first day of prob&stat 101, without considering that the client might not. When they say this, they aren't thinking about nuclear winter, they are thinking about Fred dumping his coffee on the office server, a disk crashing, or an ISP going down. Furthermore, you can accomplish this. With geographically distinct, independent, self monitoring servers, you will basically have no downtime. With 3 servers operating at an independent(1) three 9 reliability, with good failover modes, your expected downtime is under a second per year(2). Even if this happens all at once, you are still within a reasonable SLA for web connections, and therefore the downtime practically does not exist. The client still has to deal with doomsday scenarios, but Godzilla excluded, he will have a service that is "always" up.

(1) A server in LA is reasonably independent from the server in Boston, but yes, I understand that there is some intersection involving nuclear war, Chinese hackers crashing the power grid, etc. I don't think your client will be upset by this.

(2) DNS failover may add a few seconds. You are still in a scenario where the client has to retry a request once a year, which is, again, within a reasonable SLA, and not typically considered in the same vein as "downtime". With an application that automatically reroutes to an available node on failure, this can be unnoticeable.
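
A quick back-of-the-envelope check of footnote (2), assuming the failures really are independent:

```python
# Expected downtime per year for N independent replicas, each at "three nines".
per_node_availability = 0.999
replicas = 3
seconds_per_year = 365.25 * 24 * 3600

# The service is only down when every replica is down at once.
p_all_down = (1 - per_node_availability) ** replicas        # 1e-9
expected_downtime = p_all_down * seconds_per_year           # about 0.03 seconds per year

print(f"P(all down) = {p_all_down:.1e}, expected downtime = {expected_downtime:.3f} s/year")
```

The independence assumption is doing all of the work there, which is exactly what footnote (1) is hedging about.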

Jungle Hunter
  • 535
  • 1
  • 5
  • 14
  • 6
    The problem is that they're saying it in contract-ese. Meaning that if a disaster *does* occur and you need more than ten seconds to take the site back online via backups they'd have standing to sue. – Shadur Oct 02 '11 at 14:16
  • @Shadur: If they *really* want it, then you must *really* charge them. Spread the servers geographically far and wide, hopefully there will not be disaster everywhere. – Jungle Hunter Oct 03 '11 at 02:49
  • 3
    I've seen a site that offered 100% uptime guarantees or your money back. The trick was they charged a boatload and partitioned into months. So some months go unpaid and you schedule everything around that, and cover the loss with the months that work out okay. – jldugger Oct 03 '11 at 16:38
17

You are being asked for something impossible.

Review the other answers here, sit down with your client, and explain WHY it's impossible, and gauge their response.

If they still insist on 100% uptime, politely inform them that it cannot be done and decline the contract. You will never meet their demand, and if the contract doesn't totally suck you'll get skewered with penalties.

voretaq7
  • 79,345
  • 17
  • 128
  • 213
  • 2
    100% needs to be defined, i.e. 100% available except when doing maintenance or upgrades, and that time will be limited to quiet hours for a few hours a month at most. It all _depends_ on what the purpose and usage of the web app is in this case... – David d C e Freitas Sep 29 '11 at 10:18
  • 1
    and define "downtime". Can't even in theory guarantee they'll be able to access a server in Omaha from their offices in Fairbanks unless you control the entire network in between (though you could give assurances about the server being up and running). – jwenting Sep 30 '11 at 09:23
  • The definitions are, IMHO, irrelevant if they ask for "100% uptime": even if you negotiate scheduled maintenance and build in N+N redundancy, if one minor glitch causes an unscheduled reboot or service blink you've blown your SLA. **DEFINITELY** relevant if you're negotiating a 3, 4 or 5 nines SLA though. – voretaq7 Sep 30 '11 at 14:38
  • Depends on the terms of the SLA though, doesn't it? If you get paid $100K per month and every minute of downtime carries a $1K penalty, that might be entirely doable (if you have other contracts to amortize the cost of 24/7 on-site sysadmins). – Michael Borgwardt Sep 30 '11 at 23:49
  • @MichaelBorgwardt there are definitely ways to "make it work" from a pure numbers standpoint, but I'd still decline because of potential for bad PR ($_CLIENT goes on Twitter and tells the world 'we're down because $_PROVIDER is incompetent and can't meet their SLA!'). Personally I'd rather have 10 smaller, more reasonable clients pay me $10k a month :-) – voretaq7 Oct 01 '11 at 04:06
13

Price accordingly, and then stipulate in the contract that any downtime past the SLA will be refunded at the rate they are paying.

The ISP at my last job did that. We had the choice of a "regular" DSL line at 99.9% uptime for $40/mo, or a bonded trio of T1s at 99.99% uptime for $1100/mo. There were frequent outages of 10+ hours per month, which brought their uptime well below the $40/mo DSL, yet we were only refunded around $15 or so, because that's what the rate per hour * hours ended up at. They made out like bandits from the deal.
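
For what it's worth, the arithmetic of that kind of pro-rata credit is easy to lay out for a client; a quick sketch using roughly the numbers above:

```python
# Pro-rata SLA credit: refund the downtime at the rate being paid, nothing more.
monthly_fee = 1100.0          # $ per month for the bonded T1s
hours_per_month = 30 * 24
outage_hours = 10             # one bad month

hourly_rate = monthly_fee / hours_per_month                 # about $1.53/hour
credit = outage_hours * hourly_rate                         # about $15

actual_uptime = 100 * (1 - outage_hours / hours_per_month)  # about 98.6%
print(f"Uptime {actual_uptime:.2f}%, credit ${credit:.2f} on a ${monthly_fee:.0f} bill")
```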

If you bill $450,000 a month for 100% uptime, and you only hit 99.999%, you'll need to refund them $324. I'm willing to bet the infrastructure costs to hit 99.999% are in the neighborhood of $45,000 a month assuming fully distributed colos, multiple tier 1 uplinks, fancypants hardware, etc.

Bryan Boettcher
  • 301
  • 2
  • 9
  • 3
    If you see anybody promising 100% uptime then this is exactly what they are doing. There's a difference between promising 100% uptime and delivering it. It would be a good idea to explain this to the client if they try to quote a competitor's SLA to you. – sjbotha Sep 30 '11 at 13:30
10

If professionals question whether 99.999 percent availability is ever a practical or financially viable possibility, then 99.9999% availability is even less possible or practical, let alone 100%.

You will not meet a 100% availability goal for an extended period of time. You may get away with it for a week or a year, but then something will happen and you will be held responsible. The fallout can range from a damaged reputation (you promised, you didn't deliver) to bankruptcy from contractual fines.

Paweł Brodacki
  • 6,451
  • 19
  • 23
10

There are two types of people who ask for 100% uptime:

  1. People with absolutely no knowledge about computers, computer systems, or the Internet.*
  2. Ones who are intentionally making an ass of themselves, either to test your ability to say No (Google "the Orange Juice Test"), or trying to gain some kind of contract SLA leverage in order to get out of paying you later.

My advice, having suffered both of these types of clients on many occasions, is to not take this client. Let them drive someone else insane.

*This same person might have no embarrassment inquiring about Faster-than-Light travel, Perpetual Motion, Cold Fusion, etc.

Irving
  • 146
  • 3
8

I would communicate with the client to establish with them what exactly 100% uptime means. It is possible they don't really see a distinction between 99% uptime and 100% uptime. To most people (ie. not server admins) those two numbers are the same.

jhocking
  • 181
  • 4
6

100% uptime?

Here's what you need:

Multiple (and redundant) DNS servers pointing to multiple sites all over the world, with proper SLAs with each ISP.

Make sure the DNS servers are set up properly, with TTLs low enough that failover actually takes effect.

A T
  • 397
  • 1
  • 4
  • 15
  • 1
    Yes, DNS is a good start - e.g. `nslookup google.com` returns 6 different IP's for redundancy in case some of them don't work. Also check out RobTex.com a great site to look at the configurations of certain domains e.g. http://www.robtex.com/dns/google.com.html#records – David d C e Freitas Sep 29 '11 at 10:21
6

This is easy. The Amazon EC2 SLA clearly states:

“Annual Uptime Percentage” is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of “Region Unavailable.”

http://aws.amazon.com/ec2-sla/

Just define 'uptime' to be relative to the entire bundle of service you can actually keep operational 100% of the time, and you should have no problems.

Also, it's worth pointing out that the entire point of an SLA is to define what your obligations are and what happens if you can't meet them. It doesn't matter if the client asks for 3 nines or 5 nines or a million nines - the question is what they get when/if you can't deliver. The obvious answer is to provide a line item for 100% uptime at 5x the price you want to charge, and then they get a 4x refund if you miss that target. You might score!

fields
  • 690
  • 1
  • 10
  • 21
5

DNS changes only take time if they are configured to take time. You can set the TTL on a record to one second - your only issue would be to ensure that you provide a timely response to DNS queries, and that the DNS servers can cope with that level of queries.

This is exactly how GTM works on F5 BIG-IP: the DNS TTL is set to 30 seconds by default, and if one member of the cluster needs to take over, the DNS is updated and the new IP is picked up almost immediately. That's a maximum of 30 seconds of outage, but that is the edge case; the average would be 15 seconds.
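
If you go down the low-TTL route, it's worth checking what TTL the resolvers your users sit behind actually hand back for the record. A small sketch with dnspython (the hostname and resolver IPs are just examples):

```python
import dns.resolver  # pip install dnspython

NAME = "www.example.com"                                     # placeholder record to test
RESOLVERS = {"Google": "8.8.8.8", "Cloudflare": "1.1.1.1"}   # resolvers to compare

for label, ip in RESOLVERS.items():
    resolver = dns.resolver.Resolver()
    resolver.nameservers = [ip]
    answer = resolver.resolve(NAME, "A")
    addresses = ", ".join(rr.address for rr in answer)
    print(f"{label:<10} TTL={answer.rrset.ttl:>5}  {addresses}")
```

Re-querying over a few minutes shows whether the TTL counts down as expected or gets clamped to something much larger, which is exactly the behaviour the comments below argue about.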

Paul
  • 1,228
  • 12
  • 24
  • 10
    It's been my experience that some DNS servers will disregard a TTL that they consider to be obnoxiously low (in spite of the RFC). Anything less than 5 minutes becomes somewhat unreliable in the global scale. – jdw Sep 29 '11 at 00:59
  • 2
    They are not running DNS if they do not respect the TTL and so can be ignored - they have made their own decision to "break the internet" for their users. Many failover solutions rely on TTL being respected. – Paul Sep 29 '11 at 01:02
  • 13
    @Paul ignoring reality isn't an acceptable practice, no matter how much it pisses everyone off. – MDMarra Sep 29 '11 at 01:11
  • 5
    I'm with jdw on this. I've seen numerous DNS servers completely ignore TTL, even a 1 hr setting and default back to something like 24 hours or so. – NotMe Sep 29 '11 at 01:56
  • I guess my point is that if you find a DNS server that is broken like this, then report that it needs to be fixed. If you have a choice regarding DNS servers, then don't use ones that are broken. If a site owner states "we are not interested in DNS based failover so will ignore low TTLs and accept we will get an outage" then that is an internal policy that does not need to impact your architecture. – Paul Sep 29 '11 at 03:18
  • it's been my experience that amazon's ec2 dns servers ignore small ttl times not to mention comcast – Mike Sep 29 '11 at 03:35
  • Of course - but it doesn't negate my point or invalidate my comment to the point that it should be demoted. Businesses can choose to use these DNS servers for general resolution and accept that they will suffer an outage for services that use DNS TTL failover, or they can choose not to. F5 have a huge market share and use DNS for GTM. They are in wide use. The businesses deploying F5 as a GTM solution do so with acceptance that those users with broken DNS will suffer an outage proportional to their level of brokenness. – Paul Sep 29 '11 at 05:40
  • 6
    @Paul - the OP doesn't have control over every ISP's DNS resolvers on the planet. Ergo, they don't get the choice to say "if you're going to use our website, do not use Comcast/Roadrunner/whomever as your ISP because they will ignore our TTL settings". It's something that is simply out of their control and is therefore too fragile to be considered a solution for this problem IMHO. The solution has to include some way to be able to internally force the IPs around without relying on other bits of the network that may not be cooperative. – jdw Sep 29 '11 at 10:12
  • 1
    I really don't think even the absurd request for "100% uptime" includes ensuring that the uptime exists for all parties on the planet. My internet connection can be disabled for various reasons and I will not be able to access the service. Including my DNS being misconfigured. The OPs architecture need not include installing a second internet connection at my office in order to achieve uptime in my situation should it? Or broadband connections for dial up? It absolutely should ignore anything that isn't working correctly and aim for uptime with those connections that work as they should. – Paul Sep 29 '11 at 12:47
  • 3
    That's kind of like not having a UPS because the power 'should just work'. It's not a forward thinking way to architect a system. If you know that there is a fragile part of the system, for whatever reason, you should try to account for it. – jdw Sep 29 '11 at 16:39
  • 2
    @Paul: Client: "The service is down." OP: "No, it's up." Client: "I can't access it, but I can get to Google.". From the client's perspective, their connection is "working as it should", but maybe their ISP's DNS disregards low TTL. Just because there's a technical reason as to why their connection is not up to standard doesn't mean that they're going to regard it as "working properly." – Adam Robinson Sep 29 '11 at 19:21
  • I see. So there are two opposing viewpoints here. One is "you should architect your solution and SLA based on your availability to the internet" and the other is "you should architect your solution based on your availability to the internet", plus stipulate in your SLA that you will account for any issues that potential visitors may have with their internet connection, including incorrectly configured DNS servers, and that your solution will remain up for all customers regardless of the misconfigurations that their ISPs may include in the service they delvier now and in the future". – Paul Sep 29 '11 at 23:32
  • "if you're going to use our website, do not use Comcast/Roadrunner/whomever as your ISP because they will ignore our TTL settings". even worse. You don't even have to use the ISP yourself, all that's needed is for the routing table to include a router whose DNS isn't updated yet for the connection to fail. – jwenting Sep 30 '11 at 09:29
  • A DNS server that isn't updated "yet"? Seriously? A DNS server that is serving records past expiry is broken. Customers are entitled to choose whatever ISP they wish to use. Some provide a better service than others and you get what you pay for. Picking those that provide the least sensible DNS service on the planet and suggesting the OP needs to cater for them is silly. If they have issue with the DNS service their ISP provides, they can simply switch ISP or DNS servers. Or should we also work around ISPs that decide that updating BGP is too much trouble? – Paul Sep 30 '11 at 14:50
  • @jwenting Not to nitpick but DNS lookups only happen at the endpoints and have nothing to do with in-flight routing, which is based purely on IP address. If my upstream's upstream has a faulty DNS that resolves with the wrong TTL that won't affect the connection as long as my own DNS resolves it correctly. – fluffy Sep 30 '11 at 17:49
  • 1
    "_"if you're going to use our website, do not use Comcast/Roadrunner/whomever as your ISP because they will ignore our TTL settings"_" Since when ISP like Comcast/Roadrunner/whomever force their clients to use their DNS resolvers? If some ISP proposes a broken DNS service, just tell people to avoid the DNS service. And of course if some ISP starts to filter DNS request to force their customers to use their own DNS resolver, then it's the customers responsibility to put pressure on the ISP to stop doing that, or else to switch ISP for breach of contract. – curiousguy Oct 02 '11 at 07:59
5

You know this is impossible.

No doubt the client is focused on seeing "100%", so the best you can do is promise 100%, except for [all reasonable causes that aren't your fault].

Marcin
  • 154
  • 4
4

While I doubt 100% is possible, you may want to consider Azure (or something with a similar SLA) as a possibility. Here's what goes on:

Your servers are virtual machines. If there's ever a hardware issue on one server, your virtual machine is moved to a new machine. The load balancer takes care of the redirection, so the customer should not see any downtime (though I'm not sure how your session state would be affected).

That said, even with this fail-over, the difference between 99.999 and 100 borders on insanity.

You'll have to have full control over the following factors.
- Human factors, both internal and external, both malice and impotence. An example of this is somebody pushing something to production code that brings down a server. Even worse, what about sabotage?
- Business issues. What if your provider goes out of business or forgets to pay their electric bills, or simply decides to stop supporting your infrastructure without sufficient warning?
- Nature. What if unrelated tornadoes simultaneously hit enough data centers to overwhelm backup capacity?
- A completely bug free environment. Are you sure there isn't an edge case with some third party or core system control that hasn't manifested itself but still could do so in the future?
- Even if you have full control over the above factors, are you sure the software/person monitoring this won't present you with false negatives when checking if your system is up?

JSWork
  • 151
  • 4
  • 2
    Azure and EC2 have both recently had near complete and total failures. I believe Azure was recently taken down simply due to a bad config entry on a DNS server. Either way, thanks for the info. – NotMe Sep 29 '11 at 19:11
  • and if your load balancer (which does the switching) goes down unnoticed (its monitor could also be down unnoticed, ad infinitum) when the node goes down, you're still screwed. – jwenting Sep 30 '11 at 09:26
  • 1
    I think you meant 'incompetence.' 'Impotence' shouldn't have a great deal of impact on the IT staff's ability to do their jobs. – mfinni Sep 30 '11 at 12:58
4

Honestly, 100% is completely insane without at least a waiver in the terms for things like a hacking attack. Your best bet is to do what Google and Amazon do and use a geo-distributed hosting solution where your site and DB are replicated across multiple servers in multiple geographic locations. This will guarantee it in anything but a major disaster such as the internet backbone being cut to a region (which does happen from time to time) or something nearly apocalyptic.

I would put in a clause for just such cases (DDOS, internet backbone cutting, apocalyptic terrorist attack or a big war, etc).

Other than that look into Amazon S3 or Rackspace cloud services. Essentially the cloud setup will not just offer the redundancy in each location but also the scalability and the geo-distribution of traffic along with the ability to redirect around failed geo-areas. Though my understanding is that the geo-distribution costs more money.

Patrick
  • 190
  • 10
3

I just wanted to add another voice to the "it can (theoretically) be done" party.

I wouldn't take on a contract that had this specified no matter how much they paid me, but as a research problem, it has some rather interesting solutions. I'm not familiar enough with networking to outline the steps, but I imagine a combination of network-related configurations + electrical/hardware wiring failovers + software failovers would, possibly, in some configuration or the other work to actually pull it off.

There's almost always a single point of failure somewhere in any configuration, but if you work hard enough, you can push that point of failure to be something that can be repaired "live" (i.e. root dns goes down, but the values are still cached everywhere else so you have time to fix it).

Again, not saying it's feasible.. I just didn't like how not a single answer addressed the fact that it isn't "way out there" - it's just not something they actually want if they think it through.

Mahmoud Al-Qudsi
  • 509
  • 1
  • 6
  • 21
3

Re-think your methodology of measuring availability then work with your customer to set meaningful targets.

If you are running a large website, uptime is not a useful metric at all. If you drop queries for 10 minutes when your customers need you most (at traffic peak), it could be more damaging to the business than an hour-long outage at 3 AM on a Sunday.

Sometimes large web companies measure availability, or reliability, using the following metrics:

  1. percentage of queries that are answered successfully, without a server-side error (HTTP 500s).
  2. percentage of queries that are answered below a certain target latency.
  3. dropped queries should count against your stats (see below).

Availability should not be measured using sample probes, which is all that an external entity such as Pingdom or Pingability is able to report. Don't rely solely on that. If you want to do it right, every single query should count. Measure your availability by looking at your actual, perceived success.

The most efficient way is to collect logs or stats from your load-balancer and calculate the availability based on the metrics above.

The percentage of dropped queries should also count against your stats. It can be counted in the same bucket as server-side errors. If there are problems with the network or with other infrastructure such as DNS or the load balancers, you can use simple math to estimate how many queries you lost. If you expected X queries for that day of the week but you got X-1000, you probably dropped 1000 queries. Plot your traffic as queries-per-minute (or per-second) graphs. If gaps appear, you dropped queries. Use basic geometry to measure the area of those gaps, which gives you the total number of dropped queries.
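
As a rough sketch of that calculation (the tuple format and the traffic forecast are assumptions; adapt them to whatever your load balancer actually logs):

```python
# Per-request availability from load-balancer logs, plus an allowance for dropped queries.
# `log_entries` is an iterable of (status_code, latency_seconds) tuples parsed from the
# LB logs; `expected` is the traffic forecast for the same window.

def availability(log_entries, expected, latency_slo=1.0):
    total = good = 0
    for status, latency in log_entries:
        total += 1
        if status < 500 and latency <= latency_slo:
            good += 1
    dropped = max(expected - total, 0)   # queries that never even reached the load balancer
    return good / (total + dropped) if (total + dropped) else 1.0

sample = [(200, 0.12), (200, 0.30), (503, 0.05), (200, 2.40)]
print(f"{availability(sample, expected=5):.2%}")   # 2 good out of 5 expected -> 40.00%
```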

Discuss this methodology with your customer and explain its benefits. Set a base-line by measuring their current availability. It will become clear to them that 100% is an impossible target.

Then you can sign a contract based on improvements over that baseline. Say, if they are currently experiencing 95% availability, you could promise to cut the downtime to roughly a third by getting to 98.5%.

Note: there are disadvantages to this way of measuring availability. First, collecting logs, processing and generating the reports yourself may not be trivial, unless you use existing tools to do it. Second, application bugs may hurt your availability. If the application is low quality, it will serve more errors. The solution to this is to only consider the 500s created by the load-balancer instead of those coming from the application.

Things may get a bit complicated this way, but it's one step beyond measuring just your server uptime.

Yves Junqueira
  • 671
  • 3
  • 7
3

While some people here noted that 100% is insane or impossible, they somehow missed the real point. They argued that the reason is that even the best companies/services cannot achieve it.

Well, it's a lot simpler than that. It's mathematically impossible.

Everything has a probability. There could be a simultaneous earthquake at all the locations where you store your servers, destroying all of them. Admittedly it's a ridiculously small probability, but it's not 0. All your internet providers could face a simultaneous terrorist/cyber attack. Again, not very probable, but not zero either. Whatever you provide, there is some non-zero-probability scenario which brings the whole service down. Because of this, your uptime cannot be 100% either.

Karoly Horvath
  • 334
  • 1
  • 4
  • 14
2

Go grab a book on manufacturing quality control using statistical sampling. A general discussion in such a book, the concepts of which any manager would have been exposed to in a general statistics course in college, dictates that the cost of going from 1 exception in a thousand, to 1 in ten thousand, to 1 in a million, to 1 in a billion rises exponentially. Essentially, the ability to hit 100% uptime would cost an almost unlimited amount of money, kind of like the amount of fuel required to push an object to the speed of light.

From a performance engineering perspective I would reject the requirement as both untestable and unreasonable; this expression is more of a desire than a true requirement. Given the dependencies which exist outside of any application, such as networking, name resolution, routing, and defects propagated from underlying architectural components or development tools, it is a practical impossibility for anyone to guarantee 100% uptime.

James Pulley
  • 456
  • 2
  • 6
1

I don't think the customer is actually asking for 100% uptime, or even 99.999% uptime. If you look at what they're describing, they're talking about picking up where they left off if a meteor takes out their on-site datacenter.

If the requirement is that external people not even notice, how strict does that have to be? Would having an Ajax request retry and show a spinner for 30 seconds to the end user be acceptable?

Those are the kinds of things the customer cares about. If the customer was actually thinking of precise SLAs, then they would know enough to express it as 99.99 or 99.999.

Kevin Peterson
  • 205
  • 1
  • 6
  • If the customer thinks they want "100% uptime" and that's what ends up in the contract verbiage, you might get held to it if it ends up in court. Best to talk it out and help the customer understand what they really want instead of assuming you know what they're thinking. – Chris S Sep 30 '11 at 19:35
  • Oh I agree this needs to be cleared up before it gets into a contract. I'm just saying this needs to be approached as the client isn't communicating what they actually want, as opposed to the client is asking for something ridiculous. – Kevin Peterson Sep 30 '11 at 20:43
1

My 2 cents: I was responsible for a very popular web site for a Fortune 5 company that would take out ads during the Super Bowl. I had to deal with huge spikes in traffic, and the way I solved it was to use a service like Akamai. I do not work for Akamai, but I found their service extremely good. They have their own, smarter DNS system that knows when a particular node/host is either under heavy load or down, and can route traffic accordingly.

The neat thing about their service was that I didn't really have to do anything very complicated in order to replicate content from servers in my own data center to their data centers. Additionally, I know from working with them that they made heavy use of Apache HTTP servers.

While not 100% uptime, you may consider such options for dispersing content around the world. As I understood things, Akamai also had the ability to localize traffic meaning if I was in Michigan, I got content from a Michigan/Chicago server and if I was in California, I supposedly got the content from a server based in California.

Kilo
  • 1,554
  • 13
  • 21
  • -1 because this is a practical answer but not useful at all. All questions in this site could be answered by "hire someone else to do it", but that is not why we are here. – Yves Junqueira Oct 02 '11 at 17:33
  • I beg to differ. "Not useful at all?" It was most certainly useful for me and contrary to your "hire someone else to do it" comment, I suppose with your reasoning the guy should trench his own fiber optic cable and design his own switches rather than buy them too? Are you serious, Yves? You sound like someone who has not spent much time in the IT field. – Kilo Oct 03 '11 at 00:33
0

Instead of off-site failover, just run the application from two locations simultaneously, internal and external, and synchronise the two databases. Then if the internal site goes down, the internal people will still be able to work and external people will still be able to use the application. When the internal site comes back online, synchronise the changes. You can have two DNS entries for one domain name, or even a network router doing round robin.
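
For what it's worth, the reason two A records helps external users at all is that clients receive both addresses and most will try the next one when the first connection fails outright. A rough sketch of that client-side behaviour (the hostname is a placeholder):

```python
import socket

def connect_any(host, port=443, timeout=5):
    """Try every address the round-robin DNS hands back until one accepts a connection."""
    last_error = None
    for *_, sockaddr in socket.getaddrinfo(host, port, type=socket.SOCK_STREAM):
        try:
            return socket.create_connection(sockaddr[:2], timeout=timeout)
        except OSError as exc:
            last_error = exc           # this address is down, try the next A record
    raise last_error or OSError("no addresses returned")

# conn = connect_any("app.example.com")   # placeholder hostname
```

This only masks the failure when the dead site refuses connections quickly; if it silently drops packets, users sit through the timeout first.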

Peter Mortensen
  • 2,319
  • 5
  • 23
  • 24
Christian
  • 746
  • 3
  • 13
  • 30
0

For externally hosted sites, the closest you'll get to 100% uptime is hosting your site on Google's App Engine and using its high replication datastore (HRD), which automatically replicates your data across at least three data centers in real time. Likewise, the App Engine front-end servers are auto scaled/replicated for you.

However, even with all of Google's resources and the most sophisticated platform in the world, the App Engine SLA uptime guarantee is only "99.95% of the time in any calendar month."

espeed
  • 159
  • 5
0

Simple and direct: Anycast

http://en.wikipedia.org/wiki/Anycast

This is what Cloudflare, Google and other big companies use to do redundant, low-latency, cross-continental failover/balancing.

But also keep in mind that it's impossible to have 100% uptime, and that the cost of going from 99.999% to 99.9999% is MUCH bigger.

Leon Waldman
  • 132
  • 3