
I have several storage arrays where a significant number of the drives have been powered on for between 25,000 and 30,000 hours (2.8 to 3.4 years). These drives have no other issues or errors.

What I want to know: is there a point where drive age alone is a significant enough factor to replace a drive, even if the drive is working fine and has no errors?

(I'm curious to see if people tend to run drives until they fail or start throwing errors, or if anyone takes a proactive approach at replacement using Power On Hours as a metric.)
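For reference, here is roughly how I'm collecting the Power On Hours figures. This is just a minimal sketch, assuming smartmontools 7.x (for JSON output), root privileges, and plain SATA drives; the device names are placeholders for my actual disks.

```python
import json
import subprocess

def power_on_hours(device):
    """Return the raw Power_On_Hours (SMART attribute 9) for a drive, or None."""
    # smartctl needs root; --json requires smartmontools 7.0 or later.
    result = subprocess.run(
        ["smartctl", "--json", "-A", device],
        capture_output=True, text=True,
    )
    data = json.loads(result.stdout)
    for attr in data.get("ata_smart_attributes", {}).get("table", []):
        if attr["id"] == 9:
            # Some drives pack extra data into this raw value (or report
            # minutes), so sanity-check it against the vendor's documentation.
            return attr["raw"]["value"]
    return None

# Placeholder device list for illustration
for dev in ["/dev/sda", "/dev/sdb", "/dev/sdc"]:
    hours = power_on_hours(dev)
    if hours is not None:
        print(f"{dev}: {hours} hours ({hours / 8766:.1f} years)")
    else:
        print(f"{dev}: Power_On_Hours not reported")
```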

Drive manufacturers generally quote MTBF on enterprise drives at 1,000,000 to 1,500,000 hours, but these numbers don't really mean much in the real world.
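For what it's worth, even taken at face value those MTBF figures only imply an annualized failure rate well under 1%. A quick back-of-the-envelope conversion (my own arithmetic, using the constant failure rate assumption that MTBF quotes are based on):

```python
import math

# Annualized failure rate (AFR) implied by a quoted MTBF, under the usual
# vendor assumption of a constant (exponential) failure rate.
HOURS_PER_YEAR = 8766  # 365.25 days * 24 hours

for mtbf_hours in (1_000_000, 1_500_000):
    afr = 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)
    print(f"MTBF {mtbf_hours:,} h -> implied AFR {afr:.2%}")

# MTBF 1,000,000 h -> implied AFR 0.87%
# MTBF 1,500,000 h -> implied AFR 0.58%
```

That is part of why the quoted numbers feel disconnected from real-world experience.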

I did locate this study completed in 2007:

Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?

http://www.cs.cmu.edu/~bianca/fast07.pdf

The study suggests a "sweet spot" between 1 year and 5-7 years of age where you can expect fewer failures; failure rates before and after that window tended to be considerably higher.

  • @ewwhite I think that's probably close enough for me to consider mine a duplicate. The general consensus seems to be to wait for drives to fail and to mitigate the impact with technology (e.g. RAID, hot spares, backups, etc.). – jlehtinen Jan 24 '14 at 18:03

3 Answers


No.

You replace drives when they fail (or when you get a predictive failure warning, such as from SMART), not just because they've reached a certain age. I've seen drives last over 15 years and drives fail in under an hour, so age alone isn't a good indicator of impending failure.
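If it helps, the predictive failure check is easy to automate. Here is a minimal sketch, assuming smartmontools is installed and you have root; note that smartctl -H only reports the drive's own overall health self-assessment, and a monitoring daemon like smartd is the more complete way to get these warnings.

```python
import subprocess

def smart_health_ok(device):
    """Return True if the drive's SMART overall-health self-assessment passes.

    smartctl's exit status is a bitmask; bit 3 (value 8) is set when the
    drive reports a failing health status, i.e. replace it now.
    """
    result = subprocess.run(["smartctl", "-H", device], capture_output=True)
    return not (result.returncode & 0b1000)

# Hypothetical device name for illustration
if not smart_health_ok("/dev/sda"):
    print("/dev/sda: SMART health check failed; replace the drive")
```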

HopelessN00b

I've seen servers that are 10-15 years old still running on their original drives, performing the same function they always have. I've also seen servers less than a year old suffer a catastrophic drive failure.

I don't have a strong opinion on whether running a drive until it shows signs of failure is good or poor practice, so my answer would be "it depends": on your backups, the value of the data and tools, the size of the drive, the intensity of activity, whether it is in a mirrored array, and whether you can afford the downtime to replace it, which may cost more than the hardware itself.

Based on those factors and other variables specific to your site and application, it needs to be a decision your team makes, not something driven by a rote age value.

Edit: If the data or uptime is important, put a backup and disaster recovery strategy in place (with practice runs), use redundant servers, and build mirrored arrays from drives from different batches. That way it is extremely unlikely that everything fails at the same time, and you can fail over to working hardware while the bad drive is replaced, without losing data.

Danny Staple

I have never met (or heard of) anyone who replaces drives just because they are "too old" while keeping the storage/server in production.

Dusan Bajic