
I'm new to this, so I'm trying to understand how to calculate Disk Array Availability (DAA).

What I have understood so far is that availability is measured on a scale of 0 to 1, and that failure rate = 1 / MTBF; that is all I have managed to understand since this morning.

Assume I have a 1 TB disk, the time period is 5 years, and the disk's MTBF is 1.6 million hours (taken from Dell's website for a 15K RPM disk). I need to know the availability of this disk over 5 years. How do I calculate it, and what is the formula? I keep seeing MTTR, MTTDL, and various other MTTs and getting confused.

Another point of confusion: is disk array availability associated only with RAID?

Could someone explain in simple English how to calculate DAA?

Appreciate the assistance.

Currently I'm using this as a reference: http://www.ecs.umass.edu/ece/koren/architecture/Raid/reliability.html (it has only the formulas, with minimal or no explanation).

If anyone knows of any other good reference that explains this in simple English, please share it.

Thank You

Huud Rych
  • Also, the reference mentions double disk failure. Does "double disk" mean any 2 disks? Once we find the availability of 1 disk, would multiplying that for 2 disks not give us the double disk failure? – Huud Rych May 14 '19 at 19:49

3 Answers


Obviously, DAA is associated only with RAID arrays, since RAID stands for a Redundant Array of Independent Disks.

As for MTBF, here is some information from Hitachi:

"MTBF target is based on a sample population and is estimated by statistical measurements and acceleration algorithms under median operating conditions. MTBF ratings are not intended to predict an individual drive’s reliability. MTBF does not constitute a warranty."

For HDDs it is better to use the AFR, or Annualized Failure Rate (https://en.wikipedia.org/wiki/Annualized_failure_rate).

WD has stopped using MTBF/MTTF specifications altogether, precisely because the figure is so unclear and so widely misunderstood.

You can't calculate a real-world HDD lifetime, because many factors affect reliability, such as:

1) Temperature

2) Power-on/off cycles

3) Intensive writes/reads

4) or even issues with the manufacturer's software or hardware

batistuta09

MTBF is just a statistic; it will not help you with what you're trying to predict. In my experience with various disks from various manufacturers over the course of 20 years, enterprise-grade equipment generally lasts far longer than you will ever want to keep it in a typical environment. Yes, you will always have that ~10% of everything that fails, but that is what RAID and backups are for.

That said, consumer-grade equipment in enterprise environments tends to fail right about when you think it would (meaning shortly after the warranty is up). But if you're running WD Black/Gold disks or Seagate Enterprise disks, etc., you're going to get rid of them because they're uselessly small or slow long before they cease to spin up. SSDs have the added advantage of telling you how much life they have left, so there's that.

thelanranger

A mean time between failures of 1.6 million hours is about 182 years, meaning that if you run 182 such drives for a year, you should expect roughly one failure. The annualized failure rate is the inverse of this: failures per hour, scaled up to a year.
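To make that concrete, here is a small Python sketch of the conversion, assuming a constant (exponential) failure rate; the only input is the 1.6 million hour MTBF from the question:

```python
# Minimal sketch: MTBF -> AFR and survival probability, assuming a constant
# failure rate (exponential model). This is an estimate, not a prediction
# for any individual drive.
import math

MTBF_HOURS = 1.6e6      # vendor MTBF from the question
HOURS_PER_YEAR = 8760

mtbf_years = MTBF_HOURS / HOURS_PER_YEAR                    # ~182.6 years
afr = 1 - math.exp(-HOURS_PER_YEAR / MTBF_HOURS)            # ~0.55% per year
p_survive_5y = math.exp(-5 * HOURS_PER_YEAR / MTBF_HOURS)   # ~97.3%

print(f"MTBF in years:                   {mtbf_years:.1f}")
print(f"Annualized failure rate (AFR):   {afr:.2%}")
print(f"P(single disk survives 5 years): {p_survive_5y:.2%}")
```

That ~97% five-year survival figure is about all you can read off an MTBF number for a single disk; the exponential model says nothing about wear-out near end of life.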

Vendors may be overstating MTTF. Maybe this has something to do with consumer versus enterprise disks, but you might as well not take risks with your data.

Mean time to repair (MTTR) is the typical amount of time for a full repair, including drive replacement and rebuild. It varies a lot, from days (if someone has to notice the failure and then swap the drive) to nearly zero with a hot spare that is already an array member.

Putting it together: data loss occurs when the number of failures exceeds the redundancy of the array, for example a second failure while the array is degraded. The failure modes, and therefore the formulas, depend on the RAID level.

For RAID 5, data loss means a second failure on any remaining drive while the array is degraded. The mean time to a first failure in an N-drive array is MTTF / N. The second failure must land within the degraded window, the chance of which is roughly MTTR / ( MTTF / (N - 1) ). Multiply the first-failure rate by that chance and you get the rate of double failure; its inverse is the mean time to data loss, MTTDL = MTTF^2 / ( N × (N - 1) × MTTR ).
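As a rough Python sketch of that double-failure arithmetic (the drive count and rebuild time below are made-up example values, not anything given in the question):

```python
# Sketch of the RAID 5 double-failure estimate, assuming independent drives
# with a constant failure rate. N and MTTR_HOURS are example values.
import math

MTTF_HOURS = 1.6e6         # per-drive MTTF/MTBF (from the question)
N = 8                      # drives in the RAID 5 set (example value)
MTTR_HOURS = 24            # time to replace and rebuild (example value)
MISSION_HOURS = 5 * 8760   # the 5-year window from the question

first_failure_rate = N / MTTF_HOURS                            # failures/hour, any drive
p_second_during_rebuild = MTTR_HOURS / (MTTF_HOURS / (N - 1))  # second loss in the window
mttdl_hours = 1 / (first_failure_rate * p_second_during_rebuild)
p_data_loss_5y = 1 - math.exp(-MISSION_HOURS / mttdl_hours)

print(f"MTTDL: {mttdl_hours / 8760:,.0f} years")
print(f"P(data loss within 5 years): {p_data_loss_5y:.4%}")
```

With these example numbers the MTTDL comes out in the hundreds of thousands of years, which is part of why the unrecoverable read errors discussed next can matter just as much.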

That was full drive failure; non-recoverable (aka unrecoverable) read errors (UREs) can also be significant. The Seagate ST8000DM002 that Backblaze likes is 8 TB in size and is rated at one read error per 10^14 bits. (Backblaze measured a 0.94% AFR for it.) Meaning, a full read of the drive covers 6.4 × 10^13 bits, so at spec you would expect about 0.64 errors per full read, roughly a 47% chance of hitting at least one bad sector. Drives may do better than this spec in practice, particularly when they are not very old. A URE may not matter if the array has redundancy and can correct it, or if the array only returns one bad sector that the file system wasn't using anyway, or if it hits an unimportant file. It is far more problematic if it causes the array to fail entirely.
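And a quick sketch of that read-error arithmetic, assuming UREs are independent and the drive exactly meets its 1-in-10^14 spec:

```python
# Sketch of the URE estimate: expected errors in one full read of an 8 TB
# drive rated at one unrecoverable read error per 10^14 bits.
import math

DRIVE_BYTES = 8e12                     # 8 TB (decimal) drive from the example above
BITS_PER_FULL_READ = DRIVE_BYTES * 8   # 6.4e13 bits
URE_RATE = 1e-14                       # spec: one error per 10^14 bits read

expected_errors = BITS_PER_FULL_READ * URE_RATE      # ~0.64 per full read
p_at_least_one = 1 - math.exp(-expected_errors)      # ~47%

print(f"Expected UREs per full read: {expected_errors:.2f}")
print(f"P(at least one URE):         {p_at_least_one:.0%}")
```

This is the same calculation a RAID 5 rebuild has to survive: reading every remaining drive end to end while the array has no redundancy left.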


Practically, always have a backup, external to the array, at the frequency required by your recovery point objective. Array redundancy is to reduce recovery time from drive failures, and will not protect you from all data loss scenarios.

John Mahowald