19

I was wondering if it is a good idea to replace a hard drive in a (fairly) system-critical database server after a certain number of years of use, before it dies.

For example, I was thinking of replacing a hard drive after 3 years of use. Since I have many hard drives across servers, I could stagger which hard drives are replaced.

Is this a good idea, or do people just wait for the failure?

Mark Henderson
Garfonzo

3 Answers

33

Google did a study on disk drives and found very little correlation between disk age and failure rate. The same study found that SMART data is a poor predictor of impending failure; many drives fail without ever showing SMART warnings.

My local observations (>500 servers) are similar: I have had new disks fail quickly while old ones still chug along.

My general rule is that if we see disk issues (SMART or system errors), we replace the drive immediately. If not, the drives get cycled out when the server does.

Google study: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf
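For illustration only (not part of the original answer): a minimal sketch of that "replace on errors" rule, assuming smartmontools is installed and the script has privileges to query the drives. The device list and the simple "PASSED" check are placeholders, not a production monitoring setup.

```python
# Minimal sketch (assumptions are mine): poll SMART health with smartctl and
# flag drives that look like they should be swapped out. Requires smartmontools
# and sufficient privileges to query the devices.
import subprocess

DEVICES = ["/dev/sda", "/dev/sdb"]  # hypothetical device list; adjust to your servers

def smart_health_ok(device: str) -> bool:
    """Return True if `smartctl -H` reports an overall-health PASSED for the device."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True,
    )
    # Output format varies by drive type; "PASSED" covers the common ATA case.
    return "PASSED" in result.stdout

for dev in DEVICES:
    if smart_health_ok(dev):
        print(f"{dev}: OK")
    else:
        print(f"{dev}: SMART health check did not pass - schedule a replacement")
```

In practice you would feed this from your monitoring system (and also watch kernel/system error logs) rather than a hard-coded list.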

jeffatrackaid
  • 4,112
  • 18
  • 22
  • This was generally what I was thinking, but wanted to see what others did. Thanks – Garfonzo Dec 19 '11 at 19:37
  • I concur. We're seeing much higher failure rates with newer 2.5" SAS drives than with 10 year old servers running 3.5" 9GB SCSI drives! – James O'Gorman Dec 19 '11 at 20:25
  • @JamesO'Gorman Manufacturing processes change...makes me wonder what has been done to new drives as part of some engineering "trade-off". – Avery Payne Dec 19 '11 at 21:19
  • Microsoft Technet also has an article on Fault Tolerance that touches briefly on hard drive / mechanical component failure (http://technet.microsoft.com/en-us/library/bb742464.aspx) - They talk a little bit about the "bathtub curve" that mechanical component failures tend to follow. – voretaq7 Dec 19 '11 at 22:06
  • @AveryPayne Re new drives, note that 2.5" drives have ***MUCH*** tighter tolerances - as a result, what used to be "acceptable" mechanical slop on a 3.5" drive can lead to a catastrophic failure on a 2.5" drive. See also the TechNet article I linked about the bathtub curve - mechanical components suffer from high infant mortality in general, and then are relatively stable until they finally die of "old age". The 2.5" drives are still in "infant mortality" territory - in my experience, for at least the first year of operation. – voretaq7 Dec 19 '11 at 22:09
  • My desktop usage follows the same pattern: 100% of my failures have been on drives less than a year old, while drives over six years old get retired only for lack of usefulness. – gcb Dec 19 '11 at 22:33
  • The major issue I've seen is vibration. I know of a dedicated server provider that used to put towers on metal racks. The racks bowed in the middle, so the servers vibrated significantly and touched each other, and we saw a drive failure rate roughly 10x higher than expected. The facility had trash cans full of failed disks. When the same facility moved to proper rack mounts, the failure rate declined. – jeffatrackaid Dec 20 '11 at 19:14
13

No.

One of the biggest problems with replacing a hard drive on an active production server is that doing so will trigger a rebuild. Especially if you are using RAID5, and especially if you are using large drives, forcing a rebuild creates a very significant risk of an unrecoverable failure. The risk of losing the array during a rebuild is far greater than the risk involved in leaving a 3-year-old drive in place.

Taking an extreme example, if you successively replace every disk in a 6-disk RAID5 array comprised of 2TB disks, your theoretical risk of an unrecoverable read error during one of the rebuilds is in the neighborhood of 58% (according to my napkin math; please do your own and compare notes). In other words: your "preventive" disk replacement is, in effect, nothing less than an act of sabotage.
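To make the napkin math concrete, here is one rough way to run the numbers, assuming a URE rate of 1 in 10^14 bits read (a common consumer-drive spec; enterprise drives are often rated an order of magnitude better). The exact result shifts with the assumed URE rate and TB-vs-TiB accounting, so don't expect it to land exactly on 58%. The point is that forcing six full-array reads piles up a lot of chances for a latent error to surface.

```python
# Rough napkin-math sketch (assumptions are mine, not the answer's exact figures):
# probability of hitting at least one unrecoverable read error (URE) while
# rebuilding a 6-disk RAID5 of 2 TB drives, then across six successive rebuilds.

URE_RATE = 1e-14            # assumed URE probability per bit read (consumer-class spec)
DISK_BYTES = 2e12           # 2 TB per drive
SURVIVING_DISKS = 5         # drives read in full during one rebuild of a 6-disk RAID5
REBUILDS = 6                # replacing every disk in turn means six rebuilds

bits_per_rebuild = SURVIVING_DISKS * DISK_BYTES * 8
p_one_rebuild = 1 - (1 - URE_RATE) ** bits_per_rebuild
p_all_rebuilds = 1 - (1 - p_one_rebuild) ** REBUILDS

print(f"P(URE during a single rebuild): ~{p_one_rebuild:.0%}")
print(f"P(URE across {REBUILDS} rebuilds):      ~{p_all_rebuilds:.0%}")
```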

The only time when I would consider refreshing drives in an old server would be in the course of "refurbishing" it, e.g. after having been decommissioned from one task and before putting it back into service with a new role. Even at that point, capacity and performance requirements would be far more important than the age of the drives.

Skyhawk
  • +1 for triggering rebuild – gregmac Dec 19 '11 at 20:29
  • Can you please explain why the risk is 58%? If the disks are patrolled regularly, why would a rebuild stress them more? – Mircea Vutcovici Dec 19 '11 at 20:41
  • @MirceaVutcovici because in a RAID-5 arrangement, all of the drives will be constantly active during the rebuild vs. the occasional random seek here or there. In other words, the "load" on all of the drives goes way up, and in doing so, your risk of triggering a 2nd failed drive goes up as well. – Avery Payne Dec 19 '11 at 21:21
  • @Avery Payne I know that you stress the disks more during a rebuild. I am trying to understand why a rebuild would stress the disks more than a consistency check. – Mircea Vutcovici Dec 19 '11 at 22:15
  • @MirceaVutcovici The exact figure (and how to do the math) is debatable, but the bottom line is you have to read 10 terabytes of data *six times*, without the benefit of a parity disk to correct any read errors, in order to perform the six rebuilds. The probability of reading 60 terabytes of data, with no errors at all, is not in your favor. – Skyhawk Dec 19 '11 at 23:25
3

I haven't seen it done. We keep servers under warranty until they are taken out of production (5 years). Standard RAID 5 lets you survive a single disk failure, so we just keep a couple of drives on hand to start a rebuild right away; on critical servers, we include a hot spare or go RAID 10.

If you've noticed several drives failing recently in a server, you may have a backplane problem. It could also be new vibration or dust from nearby construction.

Paul Ackerman
  • This is not entirely true. If a large number of your disks are from the same lot, you run a much higher risk of simultaneous failure when you add the stress of a rebuild. As noted in another answer, larger RAID5 arrays carry an increasing probability of a URE during a rebuild, which takes the array below the RAID5 validity threshold. – Magellan Jun 25 '14 at 18:55