8

We have several hosts for which we keep an identical hot spare host, which is patched and updated so it stays very close to the same software and config. In case of failure the network cable is switched and the DHCP server is updated with the new MAC address. That is the best case, as there is usually a bit more that needs modification.
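
To give an idea of the DHCP side of the failover, it is basically just swapping the MAC address in the host declaration and reloading the daemon; a rough sketch, assuming ISC dhcpd (the hostname, MAC, and IP below are made up):

    # /etc/dhcp/dhcpd.conf -- host entry for the service (example values)
    host appserver {
        hardware ethernet 00:1a:2b:3c:4d:5e;   # on failover, replace with the spare's MAC
        fixed-address 192.168.10.20;
    }

    # after editing, reload the DHCP daemon (service name varies by distro)
    systemctl restart isc-dhcp-server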

I feel it is a waste of electricity to keep a hot spare host and a waste of time to maintain it, and since config modifications are needed anyway in case of failover, I'd like to ask the following:

Are hot spare hosts old school, and are there better ways now?

Instead of having a hot spare host, would it make sense to make it a cold spare: take its hard drives, put them in the primary host, and change the RAID from 1 to 1+1? In case of failure all I would have to do is change the network cables, update the DHCP server, take the hard drives, insert them in the cold spare, and power it on. The benefit, as I see it, is that the 2x2 disks are always in sync, so there is only one host to maintain and no config changes are needed when failing over.

Is that a good idea?

Jasmine Lognnes
  • Are these physical "hosts" with actual services or VM hosts with a bunch of guests? – Nathan C Jul 09 '14 at 13:47
  • With VMware FT and Hyper-V Replica available as virtualization options (as well as plain old HA) I find the idea of having a dedicated hot spare for a single purpose host to be a little out of step. – joeqwerty Jul 09 '14 at 15:07

5 Answers

11

Yes, it's a bit old school. Modern hardware just doesn't fail that often. Focus either on making your applications more highly-available (not always possible), or on the items needed to make your individual hosts more resilient...

For hosts:

  • Buy better hardware.
  • Ensure you have support contracts.
  • REGISTER your servers' support contracts (spare parts are stocked locally based on registration data!)
  • Use redundant power supplies, (hardware?) RAID, redundant fans.
  • If the server is not capable of accommodating the above redundant features, keep a spare chassis or components on hand to be able to self-repair in the event of failure.

In order of decreasing failure frequency, I see: disks, RAM, power supplies, fans... and sometimes a system board or CPU. But those last two are where your support contract should kick in.
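
If you'd rather catch those failures early than keep whole spare machines, basic health monitoring covers most of it; a minimal sketch, assuming a Linux host with smartmontools, md RAID, and ipmitool (device names are examples):

    # disk health (run per physical disk; /dev/sda is an example)
    smartctl -H /dev/sda

    # software RAID state, if you use md
    cat /proc/mdstat

    # fan and PSU sensor readings via the BMC
    ipmitool sdr type Fan
    ipmitool sdr type "Power Supply"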

ewwhite
  • Moving parts die first - thankfully disks are in RAID, otherwise they'd be my most frequent failure. – Sobrique Jul 09 '14 at 14:04
  • +1 just for "REGISTER your servers' support contracts". Even in my limited experience it is more common than you would think that I call support during a SHTF situation at a new site and support has no idea the particular piece of hardware exists and has a contract attached to it. –  Jul 09 '14 at 15:14
  • The servers in question are all IBM, and now probably 5 years old. So far we have only had one mainboard and one CPU failure. – Jasmine Lognnes Jul 09 '14 at 15:23
  • IBM and HP are solid. Dell sometimes. If Supermicro, I'd recommend keeping **TWO** spares per server ;) – ewwhite Jul 09 '14 at 15:25
  • RAM is a second most commonly failing component for you? That makes me curious as to what sort of failure you see there. Wrong color bits? :) (Disks I can easily imagine being the #1, particularly in a large storage setup.) – user Jul 11 '14 at 12:19
  • On my HP servers, the early ECC thresholds are exceeded and [trigger an alert](http://i.stack.imgur.com/PyXxn.png). The RAM is usually replaced before there is an impact to applications. I see it about 10 times a year across a few hundred servers. – ewwhite Jul 11 '14 at 12:24
  • @ewwhite Right. So, wrong color bits. :) – user Jul 11 '14 at 12:26
9

It's rather inefficient - not least because of the dependency on manual intervention to make the switch.

I have worked at places that run a hot DR site - literally identical servers to the primary, ready to go instantly. However, the DR switchover is an automated process - we're not talking cabling, a bit of fiddling and a manual switch, but a process where pressing the button flips everything from one site to the other.

This approach is sickeningly expensive, but that's a business decision - acceptable risk vs. the money needed to deliver on the objective. As a rule, there's an exponential curve on recovery time objective - the nearer to zero it gets, the more it costs.

But that's what your question is really about: what is your recovery time objective, and what is the most effective way of achieving it? Waiting for a server to boot will take a few minutes. How long does it take someone to do the adjustment and 'recovery tasks' when it goes pop at 4am?

And how long is an acceptable outage?

I would suggest that if you're doing 'hot recovery' you want to think clustering. You can be fairly cheap on clustering with good use of VMWare - 'failing over' to a VM - even from a physical - means you're not running redundant hardware. (Well, N+1 rather than 2N).

If your RTO is long enough, then switch the box off. You may find that the RTO is sufficient that a cold rebuild from backup is ok.

Sobrique
  • +1 just for the recovery time curve; I always tell clients that they get 99% uptime for the cost of the kit and setup, but each extra 9 they decide they need will increase the cost by somewhere between two and ten times. – MadHatter Jul 09 '14 at 14:07
  • Downtime during the night is not good, but accepted by the CEO. During working hours, 30 minutes every 6 months is probably okay. Failing over to a VM is an interesting idea. Can it be done with KVM? Will I still need to maintain the VM with patches and config changes, or can that be automated? – Jasmine Lognnes Jul 09 '14 at 15:20
  • VM is a virtual machine, nothing to do with a KVM (Keyboard/Video/Mouse) switch. And yes, you'd need to keep the OS instance up to date, and check it all works normally. But you should be able to use the same update mechanism as you do on the primary device. – Sobrique Jul 09 '14 at 15:31
  • Although seriously - how often has your server fallen over? I mean completely, for reasons hardware related? Most 'server grade' pieces of hardware run N+1 resiliency. – Sobrique Jul 09 '14 at 15:33
  • @sobrique in this context KVM likely stands for kernel-based virtual machine - http://www.linux-kvm.org/ – Grant Jul 09 '14 at 16:29
  • Mea culpa. Been fiddling down in datacentres with KVM switches far too long :) – Sobrique Jul 10 '14 at 12:45
6

Sobrique explains how the manual intervention makes your proposed solution sub-optimal, and ewwhite talks about the probability of failure of various components. Both of those IMO make very good points and should be strongly considered.

There is however one issue that nobody seems to have commented on at all so far, which surprises me a little. You propose to:

make [the current hot spare host] a cold spare, take the hard drives and put them in the primary host and change the RAID from 1 to 1+1.

This doesn't protect you against anything the OS does on disk.

It only really protects you against disk failure, and moving from mirrors (RAID 1) to mirrors of mirrors (RAID 1+1) merely further reduces the impact of a failure mode you were already protected against. You could get the same result by increasing the number of disks in each mirror set (going from a 2-disk RAID 1 to a 4-disk RAID 1, for example), quite likely improving read performance during ordinary operations along the way.
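
For what it's worth, with Linux md RAID that change doesn't even require moving disks between machines; a rough sketch, assuming the existing mirror is /dev/md0 and the two extra disks are /dev/sdc1 and /dev/sdd1 (names are made up):

    # add the two spare disks to the existing 2-way mirror
    mdadm /dev/md0 --add /dev/sdc1 /dev/sdd1

    # turn it into a 4-way mirror; the new members resync automatically
    mdadm --grow /dev/md0 --raid-devices=4

    # watch the resync progress
    cat /proc/mdstat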

Well then, let's look at some ways this could fail.

  • Let's say you are installing system updates, and something causes the process to fail half-way; maybe there's a power and UPS failure, or maybe you have a freak accident and hit a crippling kernel bug (Linux is pretty reliable these days, but there's still the risk).
  • Maybe an update introduces a problem that you didn't catch during testing (you do test system updates, right?), requiring a failover to the secondary system while you fix the primary.
  • Maybe a bug in the file system code causes spurious, invalid writes to disk.
  • Maybe a fat-fingered (or even malicious) administrator does rm -rf ../* or rm -rf /* instead of rm -rf ./*.
  • Maybe a bug in your own software causes it to massively corrupt the database contents.
  • Maybe a virus manages to sneak in.

Maybe, maybe, maybe... (and I'm sure there are plenty more ways your proposed approach could fail.) However, in the end this boils down to your "the two sets are always in sync" "advantage". Sometimes you don't want them to be perfectly in sync.

Depending on what exactly has happened, that's when you want either a hot or cold standby ready to be switched on and over to, or proper backups. Either way, RAID mirrors of mirrors (or RAID mirrors) don't help you if the failure mode involves much of anything aside from hardware storage device failure (disk crash). Something like ZFS' raidzN can likely do a little better in some regards but not at all better in others.
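
To illustrate the difference, a filesystem snapshot (ZFS here, but LVM or hypervisor snapshots work similarly) deliberately gives you a copy that is *not* in sync, which is exactly what helps in the scenarios above; a rough sketch, assuming a ZFS dataset called tank/data:

    # take a point-in-time copy before risky work (updates, schema changes, ...)
    zfs snapshot tank/data@pre-update

    # if the update, the fat-fingered rm, or the buggy release trashes things:
    zfs rollback tank/data@pre-update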

To me, this would make your proposed approach a no-go from the beginning if the intent is any sort of disaster failover.

user
  • That's what backups and configuration management are for, no? – ewwhite Jul 11 '14 at 11:41
  • @ewwhite Absolutely, but it should be a lot easier *if needed* to switch over to a secondary host that has a (presumably known good) configuration (software and settings) already, than to break a RAID mirror, physically move the disks, make any necessary configuration changes (network cabling, DNS, IP settings, ...), and then have to fix whatever went wrong requiring you to switch over in the first place before your standby host even does you any good. At that point you might just as well fix it in place. (Or particularly if you're in the position of running VMs revert to a relevant snapshot.) – user Jul 11 '14 at 11:59
  • Oh, definitely. If I have replication solutions, there's also an RPO/RTO consideration and offset (10-15 minutes) to cover the above scenarios. – ewwhite Jul 11 '14 at 12:01
  • @ewwhite I'm not arguing your point (and actually upvoted your answer), just adding another way that I saw nobody mentioning how the OP's proposed solution could (would) fail to produce the most likely desired result, which is failure recovery. Was actually surprised to find my answer accepted. – user Jul 11 '14 at 12:07
  • *Sandra works in mysterious ways...* – ewwhite Jul 11 '14 at 12:11
5

The fact that it is old school doesn't necessarily make the use of a hot spare a bad idea.

Your main concern should be the rationale: what are the risks you run, and how does running a hot spare mitigate them? Because in my perception your hot spare only addresses hardware failure, which, although not uncommon, is neither the only operational risk you run nor the most likely one. The second concern is whether alternative strategies provide more risk reduction or significant savings.

Running a hot spare with multiple manual fail-over steps will take a long time and is likely to go wrong, but I've also seen automated failover with HA cluster suites turn into major cluster f*cks.

Another thing is that a hot or cold standby in the same location doesn't provide business continuity in case of a local disaster.

HBruijn
2

The concept of having a hot or even cold spare depends on how the application(s) are built in the first place.

What I mean is that if the application has been built in such a way that the data and service load is spread across multiple machines, then the concept of any single machine taking the system down goes away. In that situation you don't need a hot spare. Instead you need enough excess capacity to handle an individual machine/component dying.

For example, a standard web application generally requires a web server and a database server. For the web servers, just load balance 2 or more. If one dies, no biggie. The database is usually more difficult as it has to be architected to be multi-master with all data sync'd across the participating machines. So instead of a single DB server you end up with 2 (or more) that are both servicing your data needs. Large service providers such as Google, Amazon, Facebook, etc have gone this route. There is more upfront cost in development time, but it pays dividends if you need to scale out.
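
To illustrate the web tier part, a minimal load-balancer config with two backends keeps serving if either one dies; a rough sketch using nginx, with made-up hostnames:

    # /etc/nginx/conf.d/app.conf -- hostnames and ports are examples
    upstream app_backend {
        server web1.example.com:8080;
        server web2.example.com:8080;   # traffic shifts here automatically if web1 stops responding
    }

    server {
        listen 80;
        location / {
            proxy_pass http://app_backend;
        }
    }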

Now, if your application isn't structured in such a way, or it's simply prohibitive to retrofit the app, then yes, you will likely want a hot spare.

NotMe