3

Given the MTTF T of an individual drive (say, 100,000 hours) and the average time r it takes the operator to replace a failed drive and the array controller to rebuild the array (say, 10 hours), how long will it take, on average, for a second drive to fail while the first failed drive is still being replaced and rebuilt, thus dooming the entire N-drive RAID5?

In my own calculations I keep coming up with results of many centuries -- even for large values of N and r -- which means using "hot spares" to reduce the recovery time is a waste... Yet so many people choose to dedicate a slot in a RAID enclosure to a hot spare (instead of using it for extra capacity) that it baffles me...
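
For reference, here is the sort of back-of-the-envelope calculation I keep doing -- a rough sketch only, assuming independent drive failures and the standard first-order approximation MTTF(array) ~ T^2 / (N * (N-1) * r):

```python
# Rough RAID5 array MTTF, assuming independent drive failures with
# constant rate 1/T and a recovery window of r hours after each failure.
# First-order approximation: MTTF_array ~ T^2 / (N * (N - 1) * r)

def raid5_array_mttf_hours(N, T, r):
    """Expected hours until a second drive fails inside the recovery window."""
    return T ** 2 / (N * (N - 1) * r)

T = 100_000  # per-drive MTTF, hours
r = 10       # replace-and-rebuild time, hours
for N in (5, 10, 20):
    mttf = raid5_array_mttf_hours(N, T, r)
    print(f"N={N:2d}: {mttf:.3g} hours (~{mttf / 8766:,.0f} years)")
```

Even at N = 20 this comes out to roughly three centuries, which is why the hot-spare practice puzzles me.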

Mikhail T.
  • I think you're missing two details: 1) It can take upwards of 50 hours to rebuild a large RAID 5 array. 2) If *any* remaining drive fails in that time, you're dead. – David Schwartz Aug 13 '13 at 18:23
  • David, I know very well about point 2 -- if you read my question carefully, a second drive failing during the recovery _is_ how I define the death of the entire array. But I'm asking about a formula (or, at least, a number). If the rebuild component of the recovery time _r_ is, indeed, as big as 30 hours, then using hot spares makes even less sense -- an operator can put in a cold spare in 4-6 hours tops (and usually much faster). – Mikhail T. Aug 13 '13 at 18:26
  • Read this: http://www.smbitjournal.com/2012/07/hot-spare-or-a-hot-mess/ – TheCleaner Aug 13 '13 at 21:07

1 Answer

5

Let's try a 10-drive RAID5 array with a 3% AFR and a two-day rebuild time, and do some rough calculations:

A 3% AFR across 10 drives means we have roughly a 30% chance of at least one drive failing in a given year.

If we assume a two-day rebuild time, the chance that one of the nine remaining drives fails during the rebuild is about 0.15% (0.03 * 9 * 2 / 365). That gives us roughly a 0.045% chance per year (0.30 * 0.0015) of a catastrophic failure with service interruption.
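
The same rough arithmetic written out -- a sketch only, treating failures as independent and the small probabilities as additive rates:

```python
# Back-of-the-envelope RAID5 loss probability, independent failures assumed.
afr = 0.03        # per-drive annual failure rate
n = 10            # drives in the array
rebuild_days = 2  # time to rebuild onto the replacement drive

p_first = n * afr                              # ~0.30 (exact: 1 - 0.97**10 ~ 0.26)
p_second = (n - 1) * afr * rebuild_days / 365  # ~0.0015: another drive dies mid-rebuild
p_loss_per_year = p_first * p_second           # ~0.00044

print(f"P(array loss per year) ~ {p_loss_per_year:.5f}")
print(f"i.e. roughly one array loss per {1 / p_loss_per_year:,.0f} years")
```

That works out to the order of one array loss per couple of thousand years, which lines up with the "many centuries" the question mentions.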

I agree that a hot spare is not the right solution to this problem. It only trims the recovery window a little: it saves the wait for a manual swap, not the rebuild itself.

David Schwartz
  • Thanks, David. Could you reword the answer to calculate the MTTF, though? And, preferably, use the variables (_N_, _r_, and _T_) so I can pick your answer... – Mikhail T. Aug 13 '13 at 18:34
  • 3% AFR over 10 drives is actually 26.3%; though using 30% isn't terribly off. – Chris S Aug 13 '13 at 18:40
  • @ChrisS Yeah, these are rough calculations to get the order of magnitude. – David Schwartz Aug 13 '13 at 18:49
  • You're also assuming that the chance of the 2nd drive failure is unrelated to the chance of the first one. This is often not the case in real life. If thermal stress or another localized environmental issue affected the first drive to fail, it will increase the chances of another drive in the same chassis failing. – mfinni Aug 13 '13 at 19:07
  • @mfinni Modeling external factors like that is beyond what we can reasonably expect for free on Server Fault :-) I do agree that failures are rarely independent though: at my last job we had a string of drive failures because our vendor got a bad shipment. The first 3 failures in that batch of systems led us to monitor them all closely, and sure enough most systems in that batch lost at least one drive within 6 months. – voretaq7 Aug 13 '13 at 21:05
  • I'm not saying you should do the math - I'm pointing out that there's an important assumption in your model that often varies from the real world in a bad way – mfinni Aug 13 '13 at 21:13
  • Mfinni, yes, there are correlated failures, but having a hot spare will not help you much against those either: you are likely to have not just two, but several (or even _all_) drives fail due to the same environmental problem that killed the first one... – Mikhail T. Aug 13 '13 at 21:28