16

A few years ago I was told to avoid S.M.A.R.T. like the plague. The reasoning was that the stress the testing puts on the drive will actually cause it to fail.

Is this still the case? If not, what is a reasonable frequency to run tests? If I should still be avoiding it, what is a better way to monitor the health of my hard drives?

sysadmin1138
Bob

3 Answers

30

While S.M.A.R.T. certainly doesn't predict all failures, I worked in a computer repair shop for several years, and many times a S.M.A.R.T. error message was the first indication that a failure was about to occur, allowing me to save the customer's data before the drive died.

The technology itself does not stress the drive; it just keeps track of a number of indicators (full list here: http://en.wikipedia.org/wiki/S.M.A.R.T.) that can signal an impending drive failure, such as:

  • Read Error Rate
  • Reallocated Sectors Count
  • Spin Retry Count
  • Uncorrectable Sector Count
  • Power on Hours
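
For a rough idea of what checking these indicators looks like in practice, here is a minimal Python sketch. It assumes the attribute table is in the layout printed by `smartctl -A` from smartmontools (a tool not mentioned above), and the sample text is made-up illustrative data, not output from a real drive. Attribute names vary by vendor, so treat the `WATCHED` set as a hypothetical starting point.

```python
# Illustrative sample in the layout of `smartctl -A`; the numbers are invented.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   100   100   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   098   098   010    Pre-fail  Always       -       12
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21345
 10 Spin_Retry_Count        0x0033   100   100   051    Pre-fail  Always       -       0
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
"""

# Assumption: these counters should stay at zero on a healthy drive.
WATCHED = {"Reallocated_Sector_Ct", "Spin_Retry_Count", "Offline_Uncorrectable"}


def parse_attributes(text):
    """Return {attribute_name: {"value", "thresh", "raw"}} from the table text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split(None, 9)
        # Skip the header and anything that is not a full attribute row.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        try:
            # Some raw values carry trailing text; keep only the first token.
            raw = int(fields[9].split()[0])
        except ValueError:
            continue
        attrs[fields[1]] = {
            "value": int(fields[3]),
            "thresh": int(fields[5]),
            "raw": raw,
        }
    return attrs


def warnings(attrs):
    """Names of watched attributes whose raw counter is above zero."""
    return sorted(n for n, a in attrs.items() if n in WATCHED and a["raw"] > 0)


if __name__ == "__main__":
    print(warnings(parse_attributes(SAMPLE)))
```

Reading the table this way is passive: it only looks at counters the drive already maintains.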

The performance hit for S.M.A.R.T. is negligible, the monitoring is passive so it doesn't stress drives, and it can warn you that you are about to lose all the pictures of your kids (or your MP3 collection, or whatever is important on your hard drive).

In short, leave it on.

Sean Earp
  • I think the original poster was told about the danger of *active* surface tests that can be done by SMART (running "long" tests manually). But as you say, by default SMART is passive and should always be enabled. Personally, I also run active tests once a month on my drives. – Colas Nahaboo May 27 '09 at 15:59
3

Besides passively logging performance counters and events, SMART provides an interface to initiate several types of self-tests performed by the drive and get their results later.

Some of these tests scan the entire platter surface while the drive stays online and keeps responding to host requests, so heavy I/O during a test will cause a lot of head thrashing.

I guess the latter is the source of the grave misconception you've been told. SMART is nice.
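
As a rough illustration of that self-test interface, here is a small Python sketch that builds the command lines for starting a test and reading the results afterwards. It assumes smartmontools is installed and that `smartctl` is on the PATH; the helper names and the `/dev/sda` device path are hypothetical. Nothing is executed unless you call `run()` yourself, which usually needs root.

```python
import subprocess


def selftest_command(device, kind="short"):
    """Build the smartctl command that starts a drive self-test."""
    if kind not in ("short", "long"):
        raise ValueError("kind must be 'short' or 'long'")
    return ["smartctl", "-t", kind, device]


def selftest_log_command(device):
    """Build the smartctl command that prints the self-test log."""
    return ["smartctl", "-l", "selftest", device]


def run(cmd):
    """Run a command and return its standard output as text."""
    result = subprocess.run(cmd, capture_output=True, text=True, check=False)
    return result.stdout


if __name__ == "__main__":
    # Only prints the commands; pass them to run() to actually execute them.
    print(" ".join(selftest_command("/dev/sda", "long")))
    print(" ".join(selftest_log_command("/dev/sda")))
```

The drive performs the test on its own, so the first command returns right away and the log is read later. Running the long test when the machine is otherwise idle avoids the head thrashing described above.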

NekojiruSou
-6

A while ago Google did a study (PDF), "Failure Trends in a Large Disk Drive Population". They have tons of drives in use, and the study showed:

Our analysis identifies several parameters from the drive’s self monitoring facility (SMART) that correlate highly with failures. Despite this high correlation, we conclude that models based on SMART parameters alone are unlikely to be useful for predicting individual drive failures. Surprisingly, we found that temperature and activity levels were much less correlated with drive failures than previously reported.

So temperature is a much bigger factor than stress on the drive. Plus, with all the error correction that happens in these new drives ALL the time, so much more stress is added that you don't have control over. If you are looking for a tool to provide maintenance (or recovery) on your drives, I would recommend SpinRite. It's by Steve Gibson and it's an amazing product.

Bernie Perez
  • 3
    Doesn't your reference say that temperature is *not* correlated to drive failure? ...but you consider it a "much bigger factor"? Please clarify what you mean. – Michael Haren May 28 '09 at 16:47
  • Am I reading that wrong? It says temperature isn't strongly correlated with drive failure right? Neither is activity according to that paper. – MrChrister May 28 '09 at 16:50
  • 1
    "So temperature is a much bigger factor then stress on the drive." -- Where do you conclude that from? The paragraph you quoted said temperature was not correlated as highly. – Joe Phillips May 28 '09 at 16:51
  • 3
    -1 for failing not only to read the abstract of the paper correctly, but also to take the time to read the whole thing. Drive failure is generally down to either a manufacturing defect (as shown in Fig 3: high usage in the first 3 months weeds out component errors, and after that failure due to usage begins to converge) or degradation over time. Degradation over time is what SMART will pick up. Fig 5 shows that temperature is not a big factor at all. Indeed, the report suggests cooler drives have more of a chance of failure than those running hot, certainly inside the first 3 years. – Ian May 28 '09 at 17:10