
The industry standard when it comes to redundancy is quite high, to say the least. To illustrate my point, here is my current setup (I'm running a financial service).

Each server has a RAID array in case something goes wrong on one hard drive

... and in case something goes wrong on the server, it's mirrored by a spare, identical server

... and both servers can't go down at the same time, because I've got redundant power, redundant network connectivity, etc.

... and my hosting center itself has dual electricity connections to two different energy providers, redundant network connectivity, and redundant toilets in case the two security guards (sorry, four) need to use them at the same time

... and in case something goes wrong anyway (a nuclear nuke? can't think of anything else), I've got another identical hosting facility in another country with the exact same setup.


  • Cost of reputational damage if down = very high
  • Probability of a hardware failure with my setup: <<1%
  • Probability of a hardware failure with a less paranoid setup: <<1% as well
  • Probability of a software failure in our application code: >>1% (if your software is never down because of bugs, then I suggest you double-check that your reporting/monitoring system isn't down. Even SQL Server - which is arguably developed and tested by clever people with a strong methodology - is sometimes down)

In other words, I feel like I could host a cheap laptop in my mother's flat, and the human/software problems would still be my higher risk.

Of course, there are other things to take into consideration, such as:

  • scalability
  • data security
  • clients' expectations that you meet the industry standard

But still, hosting two servers in two different data centers (without extra spare servers or doubled network equipment, apart from what my hosting facility provides) would give me the scalability and the physical security I need.

I feel like we're reaching a point where redundancy is just a communication tool. Honestly, what's the difference between 99.999% uptime and 99.9999% uptime when you know you'll be down 1% of the time because of software bugs?
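
(For scale: 99.999% allows roughly five minutes of downtime per year and 99.9999% roughly thirty seconds, while 1% is about 3.7 days per year.)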

How far do you push your redundancy craziness?

Brann

11 Answers


When the cost of the redundancy is higher than the cost of being down while whatever is broken is being replaced, it's too much redundancy.

mrdenny
  • Probably. But I think it becomes too much redundancy well before that threshold, don't you? – Brann Aug 14 '09 at 11:47
  • It all depends on your business model and how much risk the business is willing to accept. In my company's case, everything within our data center is fully redundant (passive SQL Servers, multiple web servers, redundant network switches with everything dual-homed). However, we don't have a redundant site, as we can't justify the cost to mitigate the risk of the Internet hub in Los Angeles going offline (probably not that likely to happen). Visa, however, probably feels that multiple colos are worth every penny, because they are willing to pay for it. There's no right answer here. – mrdenny Aug 14 '09 at 22:30

It's all about risk management. Even with 2x everything, you can still get downtime due to unforeseen problems.

For example, my hosting provider has dual, redundant connections to the upstream internet. So the day that one of their cables was cut through by some building contractors, their upstream provider took the other one down for maintenance. And not only that: because all the phones were SIP, no one could phone in to say there was no connectivity, and they didn't realise there was a problem for ages.

Now that was a one-in-a-million cock-up, and it could have been prevented by adding more layers of redundancy or management oversight... but the chance of it happening was so slim you'd never think there would be a problem, so it wouldn't have been worth the cost of preventing it.

Another example: we implemented SQL Server mirroring at an ambulance 999 control room; mirrored DBs should have meant there would be no problem... except that we found a bug in SQL Server that froze the main DB and prevented it from failing over to the mirror. So, although we did what we could to ensure continuous uptime, we still had to transfer to manual call-taking while the DB issue was resolved. In this case, we had the best solution we could reasonably implement, plus a fallback plan in case that 'best solution' failed. Trying to ensure a total 100% uptime guarantee for the 'best solution' simply would not have been cost-effective, and probably still wouldn't have given us that 100% guarantee anyway.

Again, another story: we have a Europe-wide network of replicated Active Directory servers, with fallback in case of failure in any country. So when a certain admin accidentally deleted a few too many records, the solution was to stop the server and let people authenticate against the next country along. Only the replication got there first, and the deleted records started disappearing from the other servers too... it took a week, with Microsoft expert help, to get things fully resolved.

So - it's all down to risk/cost. You decide how much risk you're willing to take, and cost it. It quickly gets to a point where reducing risk further costs too much; at that point you should find alternative strategies to cope with the downtime when it happens.

gbjbaanb
  • wow, it's like I heard the last cry of a billion bits, then the only thing remaining was silence. – hayalci Aug 14 '09 at 11:06

You're doing what I do - I don't think it's crazy at all.

Chopper3

... and in case something goes wrong anyway (a nuclear nuke? can't think of anything else), I've got another identical hosting facility in another country with the exact same setup.

As the others have noted: This is simply a business case. The level of redundancy required is dictated directly by the requirements and expectations of your clients/users. If they pay for and expect uptime in the region of five-9s then you need to provide that. If they don't, then you should address that as a business strategy.

However, if I try to guesstimate the probability of another problem (software or human), I think it's several orders of magnitude higher than that.

Simple answer: This has to be addressed by procedure. Not by physical redundancy.

If human error is causing you downtime, then you need to strengthen the error checking performed whenever humans intervene. This probably means that all platform amendments are ticketed as change requests and signed off by a second person. Or that those change requests contain more detail about the tasks to be undertaken and no deviation is permitted. Or that staff simply require more training on how to work with care in production environments.

If software error is causing you downtime, then perhaps you need to strengthen your staging procedure. Ensure that you have a good staging environment - it may well be entirely virtualised to reduce the hardware requirements, but it should still match your production environments as closely as possible. Any software change should be tested in the staging environment for a specified period of time before it is rolled out for general deployment.

Dan Carley
  • Even an extremely well-organized team cannot produce bug-free software (look at Microsoft for an example). On the other hand, it's pretty easy to achieve very high hardware uptime. I'm just arguing there's no sense in trying to move from 99.99 to 99.9999 uptime when the probability that you'll have downtime caused by bugs is way higher than that! – Brann Aug 14 '09 at 11:01
  • It is entirely possible to produce bug-free code; you just have to have enough time and money to formally prove your code. In most cases that's inappropriate, but some people do demand formal proof that the code will do what is in the requirements. – Martin M. Jun 20 '11 at 23:02

Every design and architecture should be requirements-driven. Good systems engineering calls for defining the constraints of the design and implementing a solution that meets them. If you have an SLA with your customers that calls for 0.99999 availability, then your solution of N+N redundancy should account for all the LRUs (line replaceable units) that could fail. RAID, power supplies, and COOP (continuity of operations) planning should all account for that. In addition, your SLA with vendors should be of the four-hour-response-time type, or you should account for a large number of spares onsite.

Availability (Ao from here on out) is that study. If you are doing all these things just because they seem like the right thing to do, then you are wasting your time and your customers' money. If pressed, everyone would desire five 9s, but few can afford it. Have an honest discussion about the availability of the data and the system from the perspective of cost.
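
(For what it's worth, operational availability is usually computed as Ao = uptime / (uptime + downtime) over an agreed measurement period; that number, not gut feeling, is what should be traded against cost.)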

The questions and answers posed thus far do not take the requirements into account. The discussion so far assumes that N+N redundancy in hardware and policies is the key. Rather, I would say let the requirements from your customers and your SLA drive the design. Maybe your mom's flat and your old laptop will suffice.

We geeks sometimes go looking for a problem just so we can implement a cool solution.

oneguynick
  • in the banking industry, requirements are insanely high. For example, most banks expect you to physically decommission your old backups at military-approved specialized centers, while a DoD secure wipe would destroy the data just as effectively. And the list goes on. – Brann Aug 14 '09 at 11:06

You are right about the "hardware" part of the setup. Providing HA through geo-redundancy makes it very unlikely that your services will go down because of failed hardware.

In other words, I feel like I could host a cheap laptop in my mother's flat, and the human/software problems would still be my higher risk.

I totally disagree. You are missing the crucial point of testing and release management. There are also strategies that ensure a software problem will never take your service down for all customers at once.

Some companies even go so far as to avoid relying on a single brand of web server, because they fear that a bug in Apache could be triggered everywhere at once; hence they deploy several different web server products.

As far as testing goes: there has to be a certain level of trust. Even with a system where you have complete access to all the sources, you can't possibly have the resources to test everything (or, if testing wouldn't be enough, to formally prove its correctness).

The point is that you should have tests before your software goes into production. That means something like:

  • regression tests (everything that worked in the old release still works in the new release)
  • unit tests
  • behaviour tests (have some users try out the promised features to check that they work as expected, or better yet, automate that process)
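
To make that concrete, here is a minimal sketch of what an automated regression/behaviour test could look like (pytest style; round_price and its rounding rule are made up for illustration, not taken from any real code base):

    # Minimal sketch of automated regression/behaviour tests (pytest style).
    # round_price() stands in for whatever the release under test ships;
    # the function name and rounding rule are purely illustrative.
    from decimal import Decimal, ROUND_HALF_UP

    def round_price(value: str) -> Decimal:
        # The behaviour we promised customers: round to cents, halves go away from zero.
        return Decimal(value).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

    def test_rounding_unchanged_between_releases():
        # Regression: what worked in the old release must still work in the new one.
        assert round_price("2.005") == Decimal("2.01")

    def test_negative_prices_round_consistently():
        # Behaviour: a promised property, checked automatically rather than by hand.
        assert round_price("-2.005") == Decimal("-2.01")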

As far as release management goes: if you don't want an unknown bug in the new release to cause downtime, just don't release the new version everywhere. Only expose a small fraction of customers to the new release. If it works out fine, migrate more customers (something like 5%, 20%, 50%, 100%). Note that you could have a rolling cycle here, like:

  • the first 5% have version 5
  • the 6-20% have version 4
  • the 21-50% have version 3
  • the 51-100% have version 2

So you don't have enormous amounts of time between your release cycles if your policy is to let each version run for two weeks in every deployment batch.
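
A minimal sketch of how such a percentage-based rollout could be wired up, assuming customers are mapped to buckets by a stable hash of their ID (the table and function names are illustrative, not from any particular product):

    # Minimal sketch of a percentage-based rolling release.
    # ROLLOUT and version_for_customer() are illustrative assumptions.
    import hashlib

    # Cumulative thresholds: buckets 0-4 get version 5, 5-19 get version 4,
    # 20-49 get version 3, the rest get version 2.
    ROLLOUT = [
        (5, "version-5"),
        (20, "version-4"),
        (50, "version-3"),
        (100, "version-2"),
    ]

    def version_for_customer(customer_id: str) -> str:
        # A stable hash keeps each customer in the same bucket across releases.
        digest = hashlib.sha256(customer_id.encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        for threshold, version in ROLLOUT:
            if bucket < threshold:
                return version
        return ROLLOUT[-1][1]

    print(version_for_customer("customer-42"))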

I've found the problem isn't actually building such a system but selling it to management, because it costs a lot of time and money (at least when starting out); once the process is established, I find it even cheaper. Having rolling releases also makes for a perfect fallback: the software (say version 5 from the last example is completely broken) just has to have a mechanism to work with both old and new data (see the sketch after this list), which then again means:

  • repackage version 4 as version 6
  • deploy version 6 to the first 5% batch
  • make sure version 5 is purged everywhere
  • make sure version 5 will never ever again get deployed anywhere
  • start developing version 7
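
Here is a minimal sketch of what "a mechanism to work with old and new data" could mean in practice; the schema_version field and the record layout are assumptions for illustration only:

    # Minimal sketch: every record carries a schema_version, and readers in
    # any release accept both layouts, so rolling back (repackaging v4 as v6)
    # never strands data written by the broken release.
    def read_account(record: dict) -> dict:
        version = record.get("schema_version", 1)
        if version == 1:
            # Old layout: a single "name" field.
            first, _, last = record["name"].partition(" ")
            return {"first_name": first, "last_name": last}
        # New layout: already-split fields.
        return {"first_name": record["first_name"],
                "last_name": record["last_name"]}

    print(read_account({"schema_version": 1, "name": "Ada Lovelace"}))
    print(read_account({"schema_version": 2,
                        "first_name": "Ada", "last_name": "Lovelace"}))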

How far do you push your redundancy craziness?

As far as management is willing to pay - if they find it worth it and consider the cost of an outage to be much higher than the cost of (some chosen level of) high availability.

Martin M.

How much does your reputation cost? If your software fails, at least you did your best to protect your customers' data by providing the best hardware/cluster redundancy. If you've reached that point, then it's time to put more budget into your change/QA management.

allruiz

If you've got the appropriate budget and your hosting is important to you (as it would be for a financial institution), you should keep going. I've noticed you don't talk about your backups... maybe some improvements can be made there? I've never seen a setup so awesome that I felt it didn't need additional work (sometimes that's just fixing procedures).

rodjek

I'd do the calculation with these inputs:

  1. cost of failure: how much an hour of outage costs
  2. probability of outage: estimate the risk; what's the probability of an outage in a given time frame?
  3. recovery time: how much time you need to get it back up and running (in hours)

Then you can calculate the financial risk:

potential_outage_cost = hourly_outage_cost * recovery_time * outage_probability

Then simply weigh the redundancy cost against this figure.
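
A minimal worked example of that calculation (every figure below is made up; plug in your own numbers):

    # Minimal sketch of the risk calculation above; all values are invented.
    hourly_outage_cost = 10_000.0    # cost of one hour of downtime
    recovery_time_hours = 4.0        # time to get the service back up
    outage_probability = 0.05        # chance of this outage in the period considered

    potential_outage_cost = hourly_outage_cost * recovery_time_hours * outage_probability

    redundancy_cost = 1_500.0        # cost of the extra redundancy for the same period
    print(f"expected outage cost: {potential_outage_cost:.2f}")
    if redundancy_cost < potential_outage_cost:
        print("the redundancy pays for itself")
    else:
        print("the redundancy costs more than the risk it removes")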

I hope I don't have to remind you that there are several types of outages, like:

  • failing disk (very probable, very fatal, but redundancy is cheap)
  • failing power supply
  • failing server
  • failing network connection
  • failing uplink ...

In any case, do the risk analysis first, as it gives you the baseline.

slovon

... and in case something goes wrong anyway (a nuclear nuke? can't think of anything else),

A fire at the data centre could shut it down (didn't that happen to one shared DC last year?), however much redundancy exists inside the centre.

Two DCs can help, but even then single events could take them both out. For example, in tornado alley in the US, two DCs close enough together for dark fibre could easily be hit by tornadoes from the same supercell system. This risk can be mitigated by careful relative geographical positioning (start by checking historical storm tracks), but not completely eliminated.

I've got another identical hosting facility in another country with the exact same setup.

And as others have said, it's all about the cost of an outage versus the cost of redundancy, and many of the costs of an outage are intangible (loss of customer trust).

Richard

Just be glad that you have the budget to do things right.

At the same time, your procedures for software updates could probably use some work now.

Ernie