Mean Time to Failure (MTTF): When disk manufacturers post this, how should you interpret their numbers?


Mean Time to Failure (MTTF) is usually given in terms of hours, and by doing some calculations, it seems that a disk should fail only after a good number of years have gone by.

It seems that disks need repair more often than that. Does anyone know why this is so?

I figured that there is something fishy about this metric. Am I interpreting something wrong here?

First off:

MTTF = Mean Time To Failure
MTTR = Mean Time To Repair
MTBF = Mean Time Between Failures = MTTF + MTTR

MTBF is often more or less equal to MTTF, since repair may take an hour, and MTTF may be tens of thousands of hours. But also MTBF is often not applicable, since defective products don't get repaired, but simply replaced, because repair costs more than replacing.

Calculating MTTF is a complex statistical exercise that involves the odds of failure of each and every individual part. And it's not a linear thing, as people sometimes presume. If you have an MTTF of 1000 000 hours, that doesn't mean that in 1000 devices there will be one failing after 1000 hours, or that you will get a failure in 1000 000 devices after 1 hour.
Many electronic devices follow the "bathtub curve", where there are many failures early on, then a long time with hardly any failures, and near the end of life the number of failures rises again. In hard disks there are also some mechanical parts which have a more linear failure curve; this slowly ramps up from day 1.
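
To make the shape of the curve concrete, here is a minimal sketch that models each region with a Weibull hazard function. The Weibull model, the `SCALE_HOURS` value and the three shape values are assumptions picked for illustration, not figures from any manufacturer.

```python
def weibull_hazard(t_hours: float, shape: float, scale_hours: float) -> float:
    """Instantaneous failure rate h(t) = (k / s) * (t / s) ** (k - 1)."""
    return (shape / scale_hours) * (t_hours / scale_hours) ** (shape - 1.0)


# Hypothetical parameters, chosen only to make the three regimes visible.
SCALE_HOURS = 50_000.0
REGIMES = {
    "early failures (shape < 1)": 0.5,   # failure rate falls with age
    "random failures (shape = 1)": 1.0,  # failure rate is constant
    "wear-out (shape > 1)": 3.0,         # failure rate rises with age
}


def hazard_table(times=(100.0, 10_000.0, 50_000.0)):
    """Failure rate per hour for each regime at the given ages."""
    return {
        name: [weibull_hazard(t, shape, SCALE_HOURS) for t in times]
        for name, shape in REGIMES.items()
    }


if __name__ == "__main__":
    for name, rates in hazard_table().items():
        print(name, ["%.2e" % r for r in rates])
```

A real bathtub curve is roughly the sum of a falling, a flat and a rising term like these; a single MTTF figure cannot tell you which of them dominates at a given age.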

If the manufacturer says, for instance, 1000 000 hours MTTF (that's most often POH, or Power-On Hours), it means that on average the drive should last > 100 years. Some drives will last longer, some will fail earlier. So despite the 1000 000 hours it's perfectly possible to have a failure after 1000 hours. I once had a drive fail within a week, and then you have to think back to the bathtub curve. The replacement drive has been spinning happily for >50k hours.
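
The arithmetic behind those numbers can be sketched as follows. The constant failure rate (exponential lifetime) is an assumption made here only to keep the formula simple; it ignores the bathtub curve entirely, and the function names are made up for this example.

```python
import math

HOURS_PER_YEAR = 24 * 365  # power-on hours per year for a drive that never spins down


def mttf_in_years(mttf_hours: float) -> float:
    """Convert an MTTF quoted in power-on hours to years of continuous use."""
    return mttf_hours / HOURS_PER_YEAR


def survival_probability(t_hours: float, mttf_hours: float) -> float:
    """P(drive still works after t hours), ASSUMING a constant failure rate."""
    return math.exp(-t_hours / mttf_hours)


def annual_failure_probability(mttf_hours: float) -> float:
    """P(drive fails within one year of continuous use), same assumption."""
    return 1.0 - survival_probability(HOURS_PER_YEAR, mttf_hours)


if __name__ == "__main__":
    mttf = 1_000_000
    print("MTTF in years: %.1f" % mttf_in_years(mttf))
    print("fails within a year: %.2f%%" % (100 * annual_failure_probability(mttf)))
    print("fails within 1000 h: %.3f%%" % (100 * (1 - survival_probability(1000, mttf))))
```

Under that assumption a 1000 000 hour MTTF works out to roughly 114 years, yet a little under 1% of such drives would still be expected to fail in any given year, and about 0.1% within the first 1000 hours.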


A few things worth noting: early failures are often called burn-in failures. Manufacturers that have much lower early failure rates often run devices through a burn-in phase themselves. Also, pure electronics do not exhibit a wear-out period, only a burn-in period. – Kortuk – 2011-10-25T10:26:21.690

Note that when you are calculating the MTTF (or MTBF), you're usually using just a single distribution to model the failures. Therefore the calculation is based on either the "infant mortality", "normal life", or "end of life wear-out" distribution. The only thing which distinguishes these three distributions is the Weibull shape parameter, if you're using Weibull as your basic distribution. The only case in which the failures would come out of the "normal life" distribution is when time has no effect on the failure rate, and therefore the distribution is exponential. – None – 2011-10-25T13:49:24.883

MTTF is primarily useful as an indication of what sort of life you should expect from the device or widget. It cannot be, for obvious reasons, an exact prediction of the date of failure of the device. It's only an estimate based on the statistical analysis of the available data and should be considered only as such. Useful for budgeting (how long should I amortize or depreciate the costs here) and planning (how long can we expect the widget to perform before we have to get the next one). – music2myear – 2011-10-25T18:20:41.303

First off, what exactly is a "disk failure"? – Kaitlyn Mcmordie – 2011-10-26T00:16:35.690

@Kaitlyn - I guess you're referring to bad sectors. I'd say a disk failure is when you can't read from or write to the drive any longer. Usually a mechanical error, like a head crash. This usually happens when you still have plenty of good sectors left. – stevenvh – 2011-10-26T06:55:57.523


If a piece of equipment has an MTBF of 1,000,000 hours' usage, that doesn't mean that any piece of equipment can be expected to last 1,000,000 hours. Rather, it means, roughly, that if 1,000,000 pieces of equipment which are within their rated service lifetime are each operated for one hour, or 100,000 pieces operated for ten hours (but still within rated lifetime), or 60,000,000 for one minute, etc., there will be roughly one failure in the lot. Note that rated service lifetime is entirely orthogonal to MTBF. Consider the following two types of widgets:

  1. Every widget, regardless of age, has a 0.1% chance of failing every hour.
  2. Out of every billion widgets, all but one will operate for precisely 61 minutes and then die; that one will die after 30 minutes; the widgets have a specified service lifetime of 60 minutes.

The first type of widget would have an average lifetime of about 1,000 hours, and also have an MTBF of about 1,000 hours. The second would have an average lifetime of 61 minutes, but an MTBF of 1,000,000,000 hours within its service lifetime. While it may seem odd to say the second device has an MTBF that's almost a billion times as long as its expected lifetime, the MTBF is hardly a meaningless figure.
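
The "device-hours per failure" reading can be checked with a few lines of arithmetic. This is only a sketch; the function names are made up for the example, and the formula only makes sense while every device is within its rated service lifetime.

```python
def expected_failures(devices: int, hours_each: float, mtbf_hours: float) -> float:
    """Expected failures in a fleet: total device-hours divided by MTBF."""
    return devices * hours_each / mtbf_hours


def observed_mtbf(total_device_hours: float, failures: int) -> float:
    """MTBF estimated from a fleet: device-hours accumulated per failure."""
    return total_device_hours / failures


MTBF_HOURS = 1_000_000
# (number of devices, hours each device is operated), as in the text above
FLEETS = [(1_000_000, 1.0), (100_000, 10.0), (60_000_000, 1.0 / 60.0)]

if __name__ == "__main__":
    for devices, hours in FLEETS:
        print(devices, "devices:", round(expected_failures(devices, hours, MTBF_HOURS), 6))
    # Widget type 2: a billion widgets run for their 1-hour service life, one fails.
    print("type 2 MTBF in hours:", observed_mtbf(1_000_000_000 * 1.0, 1))
```

Each of the three fleets accumulates the same 1,000,000 device-hours, so each comes out at about one expected failure.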

Suppose one is going to conduct an experiment that requires that 1,000,000 devices all work perfectly for an hour, after which they will all be scrapped. If any device fails, the entire experiment will be ruined. Which would be more useful--a device which will last an average of 1,000 hours but has an MTBF of only 1,000 hours, or a device which would last at most 61 minutes, but would have only a one in a billion chance of failing to meet that mark?
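
To put numbers on that question, here is a rough sketch that assumes the devices fail independently of one another (an assumption added for this example). The per-device probabilities are the ones from the two widget descriptions: 0.1% per hour for the first type, one in a billion for the second.

```python
import math


def all_survive_probability(devices: int, per_device_failure_probability: float) -> float:
    """P(no device fails), ASSUMING independent failures.

    Computed as exp(n * log(1 - p)) so that very small p is handled accurately.
    """
    return math.exp(devices * math.log1p(-per_device_failure_probability))


DEVICES = 1_000_000

# Type 1: every widget has a 0.1% chance of failing during the hour.
type1 = all_survive_probability(DEVICES, 0.001)
# Type 2: one widget in a billion dies before the hour is up.
type2 = all_survive_probability(DEVICES, 1e-9)

if __name__ == "__main__":
    print("type 1, experiment succeeds:", type1)
    print("type 2, experiment succeeds: %.4f" % type2)
```

With the first type the experiment is practically certain to be ruined (the success probability is too small to represent as a float), while with the second type it succeeds about 99.9% of the time.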


So, bottom line is that we shouldn't see the MTBF of 10^6 hours as the "mean lifetime" of any particular disk, but rather as a measure concerning the lifetimes of multiple disks? – Kaitlyn Mcmordie – 2011-10-26T03:53:07.460

@Kaitlyn Mcmordie: The term "lifetime" isn't really applicable; death doesn't imply failure, nor vice versa. The maker of a storage device may specify procedures that should be followed to avoid data loss; such procedures may include moving all the data from any device which gives a "failure imminent" indication to a new device (after the data is copied, the old device would be considered "dead"). If no data loss occurs from such an event, it's not a failure. Data loss which occurs from any device, however, even a seemingly-healthy one, is a failure. Nothing to do with lifetime. – supercat – 2011-10-26T15:23:50.697


Adding to stevenvh's answer: Well-known disk manufacturers all do a burn-in run of new devices, as do manufacturers of electronic components. In hard disks, there's not only an overall MTBF and MTTF but also individual failure statistics for the blocks of the disks. In other words: some parts of the spinning "platter" in the disk may fail, while the majority still reads/writes OK. The so-called "bad sectors" can be detected and then mapped out by the firmware inside the drive.

All drives today contain additional sectors in reserve which can then be used in place of the defective sectors. This is simply a precaution by the manufacturer: if they didn't do this, they couldn't sell the disk at the proclaimed capacity. If they build in an additional x % of hidden sectors as a reserve, they increase the cost by some < x % but achieve a much higher overall production yield.

The disks today keep a count of bad sectors which can also be read out with appropriate software. This and other disk health parameters (e.g. temperature) are called SMART values.

Now, once the manufacturer has done the burn-in test of the drive, and some of the sectors have failed or nearly failed and have been remapped by the drive's internal firmware, the "Bad Sector Count" SMART parameter is set to 0. Then the drive is delivered to customers.

Usually, after the burn-in process, the start of the bathtub curve that has already been mentioned is no longer seen by the customer. We are lucky, and only see an increase in failure likelihood over time.

So if you look at the MTTF that is quoted by the manufacturer, for any failure modeling you might want to do, you can disregard the start of the bathtub curve.


Thank you. Btw, do you have any idea what the term "server fault" is supposed to mean? – Kaitlyn Mcmordie – 2011-10-26T04:00:53.110

The obvious meaning is an error encountered by a computer that provides services to others. And I believe that is the time where you're supposed to ask questions on ;-) Couldn't find anything about it in the FAQ – cfi – 2011-10-26T08:13:26.573


You should interpret this as marketing. They don't actually know the exact MTBF (mean time between failures), so they use various tricks to estimate it, and they show higher numbers for 'enterprise' drives to justify their cost.

In reality, it is profitable for HDD manufacturers to have their HDDs fail soon after warranty is over.

As a conspiracy theory, I believe the mass failure of the Seagate 7200.11 was a mistake in implementing 'programmed death' that caused disks to fail before the warranty was over, so they had to 'fix' that with a firmware update.


Re the conspiracy theory: like Carl Sagan says, "Extraordinary claims require extraordinary evidence". Do you have any evidence for your claims? – Joris Groosman – 2015-10-29T10:59:23.180

I don't buy this conspiracy argument. – None – 2011-10-25T08:43:50.713

@Federico Russo: Why? You think it's just a usual developer's error, causing HDDs to lock in a non-recoverable state after a certain number of hours? – BarsMonster – 2011-10-25T12:36:11.133

-1: Statistical analysis is used to determine MTBF numbers, and it's known to a certain statistic - they're not just using "various tricks". You'll need some significant sources to back up your assertions that enterprise drives are just higher numbers, that HDD manufacturers have their drives fail after warranty is over, and that Seagate implements any kind of 'programmed death' in their drives. – Kevin Vermeer – 2011-10-25T14:00:48.897

It is in the best interest of drive manufacturers to show higher MTTF than their competition. +1 – tyblu – 2011-10-25T14:42:00.940

What exactly is a disk failure? What counts for one? – Kaitlyn Mcmordie – 2011-10-26T00:16:58.857