37

The Mean Time Between Failures, or MTBF, for this SSD is listed as 1,500,000 hours.

That is a lot of hours. 1,500,000 hours is roughly 170 years. Since this particular SSD was obviously invented long after the Civil War, how do they know what the MTBF is?

A couple of options that make sense to me:

  • Newegg just has a typo
  • The definition of mean time between failures is not what I think it is
  • They are using some type of statistical extrapolation to estimate what the MTBF would be

Question:

How is the Mean Time Between Failures (MTBF) obtained for SSDs/HDDs?

OSE

5 Answers

37

Drive manufacturers specify the reliability of their products in terms of two related metrics: the annualized failure rate (AFR), which is the percentage of disk drives in a population that fail in a test scaled to a per year estimation; and the mean time to failure (MTTF).

The AFR of a new product is typically estimated based on accelerated life and stress tests or based on field data from earlier products. The MTTF is estimated as the number of power on hours per year divided by the AFR. A common assumption for drives in servers is that they are powered on 100% of the time.

http://www.cs.cmu.edu/~bianca/fast/

MTTF of 1.5 million hours sounds somewhat plausible.

That would roughly correspond to a test with 1,000 drives running for 6 months in which 3 drives fail.
The AFR would be (2 * 3 failures) / (1,000 drives) = 0.6% annually (the factor 2 scales the 6-month test to a full year), and the MTTF = 1 yr / 0.6% ≈ 1,461,000 hours, or roughly 167 years.

A different way to look at that number: if you have 167 drives and leave them running for a year, the manufacturer claims that on average you'll see one of them fail.
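
A minimal sketch of that arithmetic in Python, using the hypothetical test above (1,000 drives, 6 months, 3 failures; illustrative numbers, not real vendor data) and assuming drives are powered on 100% of the time:

    # Back-of-the-envelope AFR/MTTF arithmetic for the hypothetical test above.
    HOURS_PER_YEAR = 8766            # 365.25 days * 24 h

    drives_tested = 1000             # assumed sample size
    test_months   = 6                # assumed test duration
    failures      = 3                # assumed failures observed

    # Annualized failure rate: scale the observed failures to a full year.
    afr = failures * (12 / test_months) / drives_tested      # 0.006 -> 0.6%

    # MTTF assuming 100% power-on time.
    mttf_hours = HOURS_PER_YEAR / afr                         # ~1.46 million hours
    mttf_years = mttf_hours / HOURS_PER_YEAR                  # ~167 years

    # Fleet view: ~167 drives running for a year give ~1 expected failure.
    expected_failures = 167 * afr                             # ~1.0

    print(f"AFR  = {afr:.2%}")
    print(f"MTTF = {mttf_hours:,.0f} h (~{mttf_years:.0f} years)")
    print(f"Expected failures per year in a 167-drive fleet: {expected_failures:.2f}")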

But I expect that is simply the constant "random" mechanical/electronic failure rate.

Assuming that failure rates follow the bathtub curve, as mentioned in the comments, the manufacturer's marketing team can massage the reliability numbers a bit: for instance by not counting DOAs (dead on arrival: units that passed quality control but fail when the end user installs them), or by stretching the DOA definition to also exclude the early failure spike. And because testing isn't run for long enough, you won't see age effects either.

I think the warranty period is a better indication of how long a manufacturer really expects an SSD to last!
That definitely won't be measured in decades or centuries...


Related to the MTBF is the write endurance: the NAND cells in an SSD only support a finite number of write cycles. A common metric for this is the total write capacity, usually in TB. In addition to other performance requirements, that is one big limiter.

To allow a more convenient comparison between different makes and differently sized drives, the write endurance is often converted to a daily write capacity expressed as a fraction of the disk capacity.

Assuming that a drive is rated to live as long as it's under warranty:
a 100 GB SSD may have a 3-year warranty and a write capacity of 50 TB:

        50 TB
---------------------  ≈ 0.46 drive writes per day
3 * 365 days * 100 GB
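
The same conversion in Python, using the illustrative 100 GB / 3-year / 50 TB figures from above:

    # Convert a rated total write capacity (TBW) into drive writes per day (DWPD).
    capacity_gb    = 100          # drive capacity in GB (example figure)
    warranty_years = 3            # warranty period (example figure)
    tbw_tb         = 50           # rated total write capacity in TB (example figure)

    days = warranty_years * 365
    dwpd = tbw_tb * 1000 / (days * capacity_gb)      # ~0.46 drive writes per day

    # And back again: DWPD * capacity * days reproduces the TBW rating.
    tbw_check_tb = dwpd * capacity_gb * days / 1000  # ~50 TB

    print(f"DWPD = {dwpd:.2f}, TBW check = {tbw_check_tb:.0f} TB")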

The higher that number, the more suited the disk is for write-intensive IO.
At the moment (end of 2014) value server-line SSDs come in at 0.3-0.8 drive writes per day, mid-range is increasing steadily from 1 to 5, and high-end seems to skyrocket with write endurance levels of up to 25 times the drive capacity per day for 3-5 years.

Some real-world tests show that the vendor claims can sometimes be massively exceeded, but driving equipment way past the vendor limits isn't usually an enterprise consideration... Instead, buy drives correctly spec'd for your purposes.

HBruijn
  • Note that the conversion from AFR to MTTF assumes a constant AFR. This is emphatically not true for things with moving parts (e.g. hard drives), and may not be true for SSDs. – Mark Nov 04 '14 at 05:57
  • Definitely true. IIRC there's an early failure spike, then a period of low failures, and then a steady increase in the AFR with age. Add changing environmental factors and the real-world numbers become much higher. As @Chris S mentioned, the warranty period may be a better metric with useful real-world impact. – HBruijn Nov 04 '14 at 08:21
  • Good sobering view that a 1,500,000-hour MTBF really means "if I have 1000 SSDs like this one, 3 are likely to fail within 6 months (some even earlier than that)..." +1 (and as the tests run over a short period, expect the lifespan of those drives not to exceed the warranty by much... the "MTBF" probably drops a lot once your drive reaches N years old) – Olivier Dulac Nov 04 '14 at 15:07
  • @HBruijn Thanks for your informative answer. The phenomenon you're referring to (early failure spike, period of low failures, then steady increase in failures) is described by the [bathtub curve](http://en.wikipedia.org/wiki/Bathtub_curve). – OSE Nov 04 '14 at 17:30
19

Unfortunately the MTBF isn't what most people think...

  • It is not how long an individual drive will last.

    Manufacturers expect their drives to last as long as the warranty; after that it really isn't their problem. Older electromagnetic platter hard drives will seize up after 10 or so years. Integrated circuits last an extremely long time, but other components (notably capacitors) wear out after a somewhat predictable number of cycles.

  • It is how many of these drives you would need to expect 1 drive to fail every hour.

    As others have pointed out, manufacturers do various testing over a reasonable period of time and determine a failure rate. There's a fair amount of variance in these sorts of tests, and marketing often has "input" on what the final number should be. Regardless, they make a best-effort guess at how many drives would be needed to average one failure per hour.

    For situations with fewer drives you can infer a statistical probability of failure from the MTBF (see the sketch after this list), but keep in mind that failures in well-designed products should follow a "bathtub" curve - that is, higher failure rates when devices are initially put into service and after their warranty period has expired, with lower failure rates in between.
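
A rough sketch of reading an MTBF number both ways, assuming (unrealistically, per the bathtub curve) a constant failure rate; the 1,500,000-hour figure is the one from the question:

    import math

    mtbf_hours     = 1_500_000   # vendor-quoted MTBF from the question
    hours_per_year = 8766

    # Fleet view: number of drives needed to expect roughly one failure per hour.
    fleet_for_one_failure_per_hour = mtbf_hours          # ~1.5 million drives

    # Single-drive view: probability of failing within a year under an
    # exponential (constant failure rate) model.
    p_fail_one_year = 1 - math.exp(-hours_per_year / mtbf_hours)   # ~0.58%

    print(f"Drives needed for ~1 failure per hour: {fleet_for_one_failure_per_hour:,}")
    print(f"Per-drive chance of failing within a year: {p_fail_one_year:.2%}")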

Chris S
2

They come from a statistical evaluation based on a small sample size and a short amount of time. There's really no universally agreed upon method or process so it's really just silly 'marketing'.

This article may explain it a bit more. And Wikipedia has some formulas which might be what you're looking for?

Essentially, for nearly everything (including general household machines such as a dishwasher), several products are run for X amount of time. The number of failures that occur during this period is used to calculate the MTBF.

It's of course not feasible to run products through an entire lifecycle - e.g. SSDs, which will last a long time. They are mostly limited by the amount of writes rather than by mechanical failure (which is what the MTBF covers).
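
As a minimal sketch of that kind of estimate (all numbers made up for illustration), the usual arithmetic is total accumulated device-hours divided by the failures observed:

    # MTBF estimate from a time-limited test: device-hours / failures.
    units      = 500        # assumed number of units on the test bench
    test_hours = 1000       # assumed hours each unit runs
    failures   = 1          # assumed failures seen during the test

    device_hours = units * test_hours        # 500,000 device-hours
    mtbf_hours   = device_hours / failures   # 500,000 h estimated MTBF

    print(f"Estimated MTBF: {mtbf_hours:,.0f} hours")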

bhavicp
2

The bad news about MTBF is that the common evaluation methodology assumes an evenly distributed write load across all NAND cells. But cells are grouped into clusters, and when a single cell fails the whole cluster is marked as dead and replaced with a new one from the reserve. The reserve is usually about 20% of the SSD's capacity. When the reserve is exhausted, the whole SSD is marked as dead.

In real life an SSD contains persistent data as well as volatile data. Imagine that 90% of the SSD is filled with static data and the remaining 10% is under heavy write load. The SSD controller spreads the load among the available free clusters, so that 10% exhausts its lifespan 10 times faster than you estimated. Those clusters get replaced from the reserve again and again until it runs out.

In a really bad case, where the persistent-to-volatile ratio is 30:1 or greater - for example a pile of photos plus a relatively small database for a popular website - your SSD can die within a year.
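
A toy model of that scenario (static data never relocated, all writes hitting the non-static fraction plus the ~20% reserve); the capacity, P/E limit, and write rate below are invented purely for illustration, and the exact speed-up factor depends on those assumptions:

    # Wear concentration when static data is never moved by the controller.
    user_capacity_gb = 100       # visible drive capacity (assumed)
    reserve_gb       = 20        # ~20% spare area, as mentioned above
    pe_cycles        = 3000      # assumed program/erase limit per cell
    static_fraction  = 0.90      # 90% of user capacity never changes
    host_writes_gb_d = 100       # assumed daily write volume

    total_flash_gb   = user_capacity_gb + reserve_gb
    writable_pool_gb = user_capacity_gb * (1 - static_fraction) + reserve_gb

    # Endurance if wear were spread over all flash vs. only the writable pool.
    ideal_tbw_gb        = total_flash_gb   * pe_cycles
    concentrated_tbw_gb = writable_pool_gb * pe_cycles

    print(f"Ideal lifetime:        {ideal_tbw_gb / host_writes_gb_d / 365:.1f} years")
    print(f"Concentrated lifetime: {concentrated_tbw_gb / host_writes_gb_d / 365:.1f} years")
    print(f"Wear-out speed-up:     {ideal_tbw_gb / concentrated_tbw_gb:.1f}x")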

One of my customers was very impressed with SSD specifications and insisted on equipping his DBMS server with a pair of them. Over the next 12 months we replaced both of them twice.

But according to the marketing materials, the lifespan of an SSD is 170 years. Sure.

Kondybas
1

MTBF is not a relevant measure of SSD endurance, since an SSD is not sensitive to time itself the way an ordinary spinning HDD is, but to the number of re-writes its cells can sustain. A more relevant measure for an SSD is Drive Writes Per Day (DWPD). For example, for some enterprise-class 3.2 TB SSDs the endurance rating is 3 DWPD for 5 years.

Sometimes SSD vendors state endurance in terms of (Total) Terabytes Written (TBW) or "write cycles", which can easily be translated to DWPD and vice versa, knowing the rated time period and the capacity of the given SSD drive.

For the given example with a 3.2 TB SSD drive:
TBW = DriveSize * DWPD * 365 * Years
TBW = 3.2 TB * 3 * 365 * 5 = 17,520 TB over 5 years

Expressed as full-drive write cycles:
WriteCycles = DWPD * 365 * Years
WriteCycles = 3 * 365 * 5 = 5,475 full-drive writes over those 5 years

What is important to notice is that this is the worst case, where the drive is kept at 100% write utilization the whole time. With, say, 80 MB/s of sustainable write throughput, that level of utilization is very likely not even possible.
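
The same arithmetic as a short Python sketch, using the example figures from this answer (86,400 is seconds per day, with decimal MB/TB conversion):

    # DWPD <-> TBW <-> write-cycles arithmetic, plus a sanity check of how much
    # data 100% utilization at 80 MB/s could actually push per day.
    capacity_tb   = 3.2      # drive capacity
    dwpd          = 3        # rated drive writes per day
    years         = 5        # rated endurance period
    throughput_mb = 80       # sustained write throughput (MB/s)

    days         = years * 365
    tbw          = capacity_tb * dwpd * days    # 17,520 TB over 5 years
    write_cycles = dwpd * days                  # 5,475 full-drive writes

    max_tb_per_day   = throughput_mb * 86_400 / 1_000_000   # ~6.9 TB/day possible
    rated_tb_per_day = capacity_tb * dwpd                    # 9.6 TB/day assumed by the rating

    print(f"TBW = {tbw:,.0f} TB, write cycles = {write_cycles:,.0f}")
    print(f"Writable per day at 80 MB/s: {max_tb_per_day:.1f} TB "
          f"(the rating allows up to {rated_tb_per_day:.1f} TB/day)")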

BBK