Why do different manufacturers report different S.M.A.R.T. values?

First of all, I think everyone knows that hard drives fail a lot more than the manufacturers would like to admit. Google did a study that indicates that certain raw data attributes that the S.M.A.R.T status of hard drives reports can have a strong correlation with the future failure of the drive.

> We find, for example, that after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors. First errors in reallocations, offline reallocations, and probational counts are also strongly correlated to higher failure probabilities. Despite those strong correlations, we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever.

Seagate seems to be trying to obscure this information about its drives by claiming that only its software can accurately determine the status of a drive. Its software, by the way, will not tell you the raw data values for the S.M.A.R.T attributes. Western Digital has made no such claim to my knowledge, but its status reporting tool does not appear to report raw data values either.

I've been using HD Tune and smartctl from smartmontools to gather the raw data values for each attribute. I've found that I am indeed comparing apples to oranges when it comes to certain attributes. For example, most Seagate drives report many millions of read errors, while Western Digital drives show 0 read errors 99% of the time. Seagate drives also report many millions of seek errors, while Western Digital drives always seem to report 0.

Q: How do I normalize this data? Is Seagate producing millions of errors while Western Digital is producing none? Wikipedia's article on S.M.A.R.T. says that manufacturers have different ways of reporting this data.

Here is my hypothesis:

I think I found a way to normalize (is that the right term?) the data.

Seagate drives have an additional attribute that Western Digital drives do not have (Hardware ECC Recovered). When you subtract the Read error count from the ECC Recovered count, you'll probably end up with 0. This seems to be equivalent to Western Digital's reported "Read Error" count. This would mean that Western Digital only reports read errors that it cannot correct, while Seagate counts up all read errors and tells you how many of those it was able to fix.

I had a Seagate drive where the Read error count was less than the ECC Recovered count and I noticed that many of my files were becoming corrupt. This is how I came up with my hypothesis. The millions of seek errors that Seagate produces are still a mystery to me.

Please confirm or correct my hypothesis if you have additional information.

Here is the SMART status of my Western Digital drive, just so you can see what I'm talking about:

james@ubuntu:~$ sudo smartctl -a /dev/sda
smartctl version 5.38 [x86_64-unknown-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF INFORMATION SECTION ===
Device Model:     WDC WD1001FALS-00E3A0
Serial Number:    WD-WCATR0258512
Firmware Version: 05.01D05
User Capacity:    1,000,204,886,016 bytes
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Thu Jun 10 19:52:28 2010 PDT
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   179   175   021    Pre-fail  Always       -       4033
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       270
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1468
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       262
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       46
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       223
194 Temperature_Celsius     0x0022   105   102   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
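
When checking many drives, it can help to turn the attribute table that `smartctl -a` prints into a dictionary of raw values. The following is a minimal Python sketch, assuming the ten-column layout shown in the output above. `parse_smart_attributes` is a hypothetical helper name, not part of smartmontools, and raw fields that are not plain integers are kept as strings.

```python
import re

# Matches one row of the attribute table printed by `smartctl -a`, e.g.
#   5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+(?P<flag>0x[0-9a-fA-F]+)\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"(?P<type>\S+)\s+(?P<updated>\S+)\s+(?P<when_failed>\S+)\s+(?P<raw>.+?)\s*$"
)


def parse_smart_attributes(text):
    """Return {attribute_id: {...}} for every attribute row found in text."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header lines, blank lines, other sections
        raw = m.group("raw")
        attrs[int(m.group("id"))] = {
            "name": m.group("name"),
            "value": int(m.group("value")),
            "worst": int(m.group("worst")),
            "thresh": int(m.group("thresh")),
            # Keep the raw field as text unless it is a plain integer
            "raw": int(raw) if raw.isdigit() else raw,
        }
    return attrs


# Three rows copied from the smartctl output above
sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       1468
"""

attrs = parse_smart_attributes(sample)
print(attrs[9]["name"], attrs[9]["raw"])
```

The same function could be fed the full text captured from `sudo smartctl -a /dev/sda`. It only collects the numbers; it does not make raw values from different vendors comparable.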

Edit: Here is the Seagate drive that I was talking about that was causing data corruption. This data is from HDTune.

HD Tune: ST3250623A Health

ID                               Current  Worst    Threshold Data      Status
(01) Raw Read Error Rate         45       38       6        77882492   Ok       
(03) Spin Up Time                99       98       0        0          Ok       
(04) Start/Stop Count            100      100      20       640        Ok       
(05) Reallocated Sector Count    100      100      36       0          Ok       
(07) Seek Error Rate             85       60       30       359872048  Ok       
(09) Power On Hours Count        94       94       0        6028       Ok       
(0A) Spin Retry Count            100      100      97       0          Ok       
(0C) Power Cycle Count           100      100      20       689        Ok       
(C2) Temperature                 25       55       0        25         Ok       
(C3) Hardware ECC Recovered      50       47       0        201555081  Ok       
(C5) Current Pending Sector      100      100      0        0          Ok       
(C6) Offline Uncorrectable       100      100      0        0          Ok       
(C7) Ultra DMA CRC Error Count   200      199      0        1          Ok       
(C8) Write Error Rate            100      253      0        0          Ok       
(CA) TA Counter Increased        100      253      0        0          Ok       

Power On Time         : 6028
Health Status         : Ok

The fact that the Hardware ECC Recovered value is larger than the Raw Read Error Rate is counterintuitive, in my opinion.

This is what I've found to be a "normal" Seagate drive, where the ECC Recovered value matches the Raw Read Error Rate:

HD Tune: ST380011A Health

ID                               Current  Worst    Threshold Data      Status
(01) Raw Read Error Rate         62       46       6        79986164   Ok       
(03) Spin Up Time                98       98       0        0          Ok       
(04) Start/Stop Count            100      100      20       6          Ok       
(05) Reallocated Sector Count    100      100      36       0          Ok       
(07) Seek Error Rate             83       60       30       210309663  Ok       
(09) Power On Hours Count        93       93       0        6516       Ok       
(0A) Spin Retry Count            100      100      97       0          Ok       
(0C) Power Cycle Count           99       99       20       1325       Ok       
(C2) Temperature                 25       52       0        25         Ok       
(C3) Hardware ECC Recovered      62       46       0        79986164   Ok       
(C5) Current Pending Sector      100      100      0        0          Ok       
(C6) Offline Uncorrectable       100      100      0        0          Ok       
(C7) Ultra DMA CRC Error Count   200      188      0        18         Ok       
(C8) Write Error Rate            100      253      0        0          Ok       
(CA) TA Counter Increased        100      253      0        0          Ok       

Power On Time         : 6516
Health Status         : Ok
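
The comparison behind the hypothesis can be written out explicitly. This is only a sketch of the arithmetic, using the raw values from the two HD Tune reports above. `ecc_minus_read_errors` is a hypothetical helper, and treating the raw fields as plain decimal counters is itself an assumption that may not hold for these drives.

```python
# Raw values copied from the two HD Tune reports above.
drives = {
    "ST3250623A": {"raw_read_error_rate": 77882492, "hardware_ecc_recovered": 201555081},
    "ST380011A": {"raw_read_error_rate": 79986164, "hardware_ecc_recovered": 79986164},
}


def ecc_minus_read_errors(attrs):
    """Hardware ECC Recovered minus Raw Read Error Rate (both raw values)."""
    return attrs["hardware_ecc_recovered"] - attrs["raw_read_error_rate"]


for model, attrs in drives.items():
    diff = ecc_minus_read_errors(attrs)
    status = "matches" if diff == 0 else "differs"
    print(f"{model}: ECC - read errors = {diff} ({status})")
```

The nonzero difference on the ST3250623A is the mismatch described in the question. Whether that difference means anything is exactly what is being asked.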

EDIT:

I want to clarify that I know that Google generally considers S.M.A.R.T useless. I know that everyone should back up their data. I am, however, in the business of fixing other people's computers. Most people do not have backups or RAID. It is not cost effective for corporations to troubleshoot hard drives, so they just run them on a RAID until they die. I find it useful in my line of work to check the SMART status of the hard drive. It takes about 30 seconds. If I am lucky enough for a bad drive to show a hint of failure such as scan errors or reallocated sectors, I know to get the drive the heck out of there. If no such hint exists, I'll probably spend many hours troubleshooting slowness and data corruption until I finally find that the hard drive is bad.

I'm just trying to fine tune this process.

James T

Posted 2010-06-11T02:55:47.040

There is SMART-based information in the administration menu under (I believe) disk management. It may have additional abilities over smartctl, but I haven't used it in a while and don't have it in front of me. – Jarvin – 2010-06-14T20:54:16.297

@Dan Hi Dan, I'm not sure what Windows tool you are talking about. Can you clarify? – James T – 2010-06-16T19:18:09.030

The problem with SMART is that it is a bit of a misnomer; there is no actual intelligence in it, only a few equations (probably not even heuristics). All it can do is monitor itself and report the numbers, that’s all. For example, I have a drive that had a poorly connected power cord, which caused it to turn on and off very quickly several times (making a “click-of-death” sound). I reseated the connector, so it works smoothly now, but due to the temporary (fixable) failure that one time, it has now permanently recorded a RRER event in the SMART data, making it look like it’s failing. – Synetech – 2012-01-01T00:03:27.883

Answers

14

It does appear that different manufacturers use SMART values for sometimes radically different things, as you can see here:

> My hard disk(s) in ReadyNAS is reporting high SMART Raw Read Error Rate, Seek Error Rate, and Hardware ECC Recovered. What should I do?
>
> Seagate uses these SMART fields for internal counts, so this is a known issue with Seagate disks. Look for abnormal counts in other fields, especially Reallocated Sector Ct and ATA Error Count.

So when it comes to your actual question ...

> If I am lucky enough for a bad drive to show a hint of failure such as scan errors or reallocated sectors, I know to get the drive the heck out of there. If no such hint exists, I'll probably spend many hours troubleshooting slowness and data corruption until I finally find that the hard drive is bad.

I'd say a good rule of thumb is, you can only expect SMART settings to be comparable within the same drive manufacturer, and maybe even the same drive model!

So when you're looking at diagnosing those SMART counts, keep that in mind... one manufacturer's "read error retry count" may mean something totally different than another manufacturer's. Sad but true. :(

Jeff Atwood

14

Okay, first of all I disagree with your premise.

> Google did a study that indicates that certain raw data attributes that the S.M.A.R.T status of hard drives reports can have a strong correlation with the future failure of the drive.

In fact they found the opposite:

> ...we find that failure prediction models based on SMART parameters alone are likely to be severely limited in their prediction accuracy, given that a large fraction of our failed drives have shown no SMART error signals whatsoever.

Secondly, SMART thresholds are not standardised. The firmware on the drive itself will flag an attribute as being "pre-failure", but the raw values are meaningless to the user. For example, Seagate says:

> Various attributes are being monitored and measured against certain threshold limits. If any one attribute exceeds a threshold then a general SMART Status test will change from Pass to Fail.
>
> The SMART values that might be read out by third-party SMART software are not based on how the values may be used within the Seagate hard drives. Seagate does not provide support for software programs that claim to read individual SMART attributes and thresholds. There may be some historical correctness on older drives, but new drives, no doubt, will have incorporated newer solutions, attributes and thresholds.

tl;dr Summary:

Raw SMART values are almost meaningless, as different manufacturers use them in different ways and have different thresholds. The drive firmware itself will tell you when it is in "pre-failure"... or it might not; SMART really isn't very reliable.

Do regular backups!

sml

Based on your comments it does not seem like you read my whole post. This is why I put in all the background information and quotes. You quoted Google but only a very select part of it. If you read the part just before your quote... it says that some attributes have a strong failure correlation.... such as reallocated sector counts. The manufacturers do not report their drives as being in a pre-failure state after one reallocated sector. This clearly indicates that you can get a better indication of the health of the drive by looking at the raw data. – James T – 2010-06-16T18:33:35.350

I'd also like to add that my Seagate drive was corrupting my data and the raw data values were noticeably different from what I've learned to expect from healthy drives. Clearly something is wrong with where the manufacturer sets the threshold. – James T – 2010-06-16T18:35:45.897

I think you need to re-read my post and link. Raw SMART values are not reliable indicators of anything. The Google report does not say that "some attributes have a strong failure correlation". What it does say is that despite the fact that "after their first scan error, drives are 39 times more likely to fail within 60 days than drives with no such errors", less than 15% of the failed drive population had any Scan Errors. Is it a reliable indicator if it is right 15% of the time? – sml – 2010-06-16T19:24:07.670

@scottl I'm not sure where you got your 15% from. I did not see that in the article. Even if only 15% of their drives had scan errors... they found that a drive with scan errors is 39 times more likely to fail in 60 days. This does not mean that your drive will not fail unless you have scan errors. This just means that if you do have a scan error... your hard drive's remaining life is probably short.

Have you ever taken statistics? I found it very useful. – James T – 2010-06-16T19:41:06.187
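
The two statistics being argued about here measure different things, and a little arithmetic shows that both can be true at once. The sketch below uses only the 39x figure quoted above. The `prevalence` values are made up for illustration and are not numbers from the Google paper, and `share_of_failures_with_signal` is a hypothetical helper name.

```python
def share_of_failures_with_signal(relative_risk, prevalence):
    """Fraction of failed drives that showed the signal beforehand.

    relative_risk: how many times more likely a drive with the signal is
                   to fail than a drive without it (39 in the quote).
    prevalence:    fraction of all drives that show the signal
                   (hypothetical; chosen only for illustration).
    """
    # Failures are weighted relative to the baseline failure rate,
    # which cancels out of the ratio.
    with_signal = prevalence * relative_risk
    without_signal = (1.0 - prevalence) * 1.0
    return with_signal / (with_signal + without_signal)


for prevalence in (0.001, 0.005, 0.02):
    share = share_of_failures_with_signal(39, prevalence)
    print(f"prevalence {prevalence:.1%}: {share:.1%} of failures had the signal")
```

When few drives ever show the signal, most failed drives never showed it, even though a drive that does show it is at much higher risk. A signal can be a strong warning when present and still miss most failures.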

smartmontools FAQ says: The raw SMART attributes (temperature, power-on lifetime, and so on) are stored in vendor-specific structures. Sometime these are strange. Hitachi disks (at least some of them) store power-on lifetime in minutes, rather than hours (see next question below). IBM disks (at least some of them) have three temperatures stored in the raw structure, not just one. And so on. – sml – 2010-06-16T19:41:24.360
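
The power-on lifetime case from the FAQ is one where normalization is straightforward once the unit is known. A minimal sketch, assuming you maintain your own record of which unit each drive reports; the table below and the name `power_on_hours` are hypothetical, and the unit for any given model would have to be verified.

```python
# How many raw units make up one hour.
# Hypothetical table: which drives report in which unit must be checked per model.
RAW_UNITS_PER_HOUR = {
    "hours": 1,
    "minutes": 60,
}


def power_on_hours(raw_value, unit="hours"):
    """Convert a raw power-on lifetime value to hours."""
    try:
        per_hour = RAW_UNITS_PER_HOUR[unit]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}") from None
    return raw_value / per_hour


print(power_on_hours(1468))               # a drive that reports hours
print(power_on_hours(88080, "minutes"))   # a drive that reports minutes
```

Read errors and seek errors are harder because the unit, or even the structure, of the raw field is not documented in the same way.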

@James: look at figure 14 for the 15% figure. Can you point out the number of scan errors for your Seagate drives in the data you posted? Do you really think that a healthy WD drive has had 0 Raw Read Errors, while a healthy Seagate drive has had 78 million? – sml – 2010-06-16T19:45:32.703

@scottl Thanks... that power-on lifetime information is the kind of information I'm looking for. Now please find that for read errors and seek errors. – James T – 2010-06-16T19:47:06.867

@James Read my Seagate link!: *Seagate uses the general SMART Status, pass or fail. The individual attributes and threshold values are proprietary and we do not offer a utility that will read out the values. If the values that you are seeing with a third party SMART utility are not displaying properly or seem to be false, please contact your software vendor for further explanation of the values.* – sml – 2010-06-16T19:50:25.020

There is no way of telling what raw SMART values mean - they change even between drive models of the same manufacturer!! – sml – 2010-06-16T19:52:11.103

@scottl You are not being very consistent and I am confused by the position you are taking. You said "Do you really think that a healthy WD drive has had 0 Raw Read Errors, while a healthy Seagate drive has had 78 million?". Are you saying that you think the Seagate drive is not healthy? The SMART status says it is in good health... :-) – James T – 2010-06-16T19:58:28.600

@scottl I know Seagate says that only their software can accurately report the status of their drives. They do, however, use those raw data values to determine the health of the drive. I'm just trying to figure out how they report their data vs. how other manufacturers report their data. I know it is "proprietary information" and you say that there is "no way" of figuring it out. I think people can figure it out. The raw data values are not useless. Google did an entire study keeping track of the raw data... why would they do that if the data was useless? – James T – 2010-06-16T20:06:44.497

No, I am saying that WD and Seagate obviously encode their raw data differently. If you can get the internal Seagate docs for your drive, or can reverse-engineer the firmware, then yes it might be possible to figure out what the numbers mean. Google didn't say they used Seagate drives, and they probably have the connections to get the "proprietary information" anyway. – sml – 2010-06-16T20:16:21.860

@scottl Ok, that is why I'm asking this here. Maybe someone has done that and would possibly share that information. – James T – 2010-06-16T20:19:54.617

You asked if your hypothesis was correct.. the answer is no. – sml – 2010-06-16T20:28:39.360

@scottl You have not provided any evidence to prove my hypothesis wrong. You are saying that you can't compare the individual data values with each other. I agree. I said so in my post. I have, however, also found that read errors and ECC recovered are related in some way on Seagate drives. It might be possible to combine ECC recovered with the read errors in order to compare it with the read errors on a WD drive.

Prove me wrong. – James T – 2010-06-16T20:38:07.987

@James Your hypothesis "..Seagate counts up all read errors and tells you how many of those it was able to fix." is obviously wrong, because Hardware ECC > Read errors on your failed drive. Even Wikipedia says of "Hardware ECC": "The raw value has different structure for different vendors and is often not meaningful as a decimal number." Therefore, your hypothesis is incorrect. – sml – 2010-06-17T16:34:40.140

@scottl I'll give you that. I believe you are right. Please read the edit to my post. It might clarify for you that even though most companies consider SMART useless... I still find it useful in some cases. I would like to know as much as I can about S.M.A.R.T so I can save myself time in the future. If a mismatch in ecc corrected and read errors means data is being lost... that is useful information for me. – James T – 2010-06-17T21:40:48.013

> The drive firmware itself will tell you when it is in "pre-failure"... or it might not, SMART really isn't very reliable. Indeed. One of my WD drives had a RRER trip when the power flicked off and on. It worked fine for a while, reporting SMART status as good to the mobo, then suddenly started reporting it as bad. I did some scans and tests and stuff, and it started reporting it as good again. It still shows the RRER as being triggered, but the overall status remains “good”. Confusing and unreliable indeed (sadly). – Synetech – 2011-12-31T23:51:20.710

4

I'm not exactly sure what the question is that you're asking. You seem to have the whole question and answer rolled up into one but...

Have you compared the hard drive metrics to those given by SeaTools?

It's Seagate's standard hardware diagnostic tool and AFAIK the most commonly used HDD diagnostic tool.

Don't be surprised if you find that the tools report unfavourable results about their competitors. The tools generally work with HDDs of all manufacturers, but that doesn't mean that they have to make their competitors look good while doing so.

Haven't you ever heard the joke, "99.99% of all statistics are true except, of course, this statistic"?

Evan Plaice

Yeah... it is a bit confusing. I basically put in all the background information that I am familiar with before the question and all my tests and conjectures after the question. Here is my question: "How do I normalize this data?". Basically... how do I make all the data attributes from one manufacturer mean the same thing as the data attributes from another manufacturer so I can accurately compare them? – James T – 2010-06-15T22:54:56.233

@James You can try to collect data from as many different sources as possible and figure out how each is interpreting the data differently from the others. They may all be reporting correct data; they may just be interpreting it in a different manner, like you pointed out. That's why I added the statistics quote... Just because the data is good doesn't mean the interpretation is. – Evan Plaice – 2010-06-15T23:16:03.343

Yup, that is what I have done. I've checked over 70 different hard drives, and the large differences in seek errors and read errors are what stuck out to me. I have a guess that for Seagate drives, read errors have some kind of relationship with Hardware ECC Recovered. I'm not exactly sure what that relationship is. I was hoping someone here could tell me. I was also hoping someone could tell me why Seagate drives have huge seek error counts while Western Digital always seems to have zero. – James T – 2010-06-15T23:23:03.950

@James Maybe somebody will come along with a better answer... My honest guess is, Western Digital probably doesn't follow the exact S.M.A.R.T spec. That's the problem with hardware standards: they're great selling points, but there are always a few manufacturers that will market all the benefits without following the full specification. – Evan Plaice – 2010-06-15T23:29:50.643

Yup, the deviation from the standard is what I figured and what the Wikipedia article suggests. I'd like to know how they differ so that I might be able to properly compare the two manufacturers (and possibly others). Thanks for the comments, Evan. Hopefully this clarifies the question for others too. – James T – 2010-06-15T23:37:49.687

2

In the physical reality of hard drive internals, all brands of hard drives larger than 100MB will have a lot of physical read errors. Most of those are safely corrected by ECC, some (hopefully very few) are wrongly corrected by ECC, and the rest (few, but more than the wrong corrections) are reported back to the computer as failed reads and should also make the drive automatically relocate the bad sector.

In addition to correcting raw read errors, ECC also corrects reads that the hardware thought were OK, but the returned bits were slightly wrong. Thus ECC corrected might be "raw read failed but fixed by ECC + raw read succeeded but was wrong and got fixed by ECC".

Thus two interpretations of the data seem possible:

A. non-Seagate drives do not include the ECC corrected read errors in the "raw read error count", only the unfixable errors.

B. Seagate considers it a read error if ECC finds something wrong with the data even if the low level circuit did not notice, others don't.

Normalization will be very different depending on which theory (A or B) is right.

Jakob Bohm

> should also make the drive automatically relocate the bad sector. Then what is the relation between the Uncorrectable Sector Count, Relocated Event Count, and Current Pending Sector Count fields? Wouldn’t it increase current, then either relocated or uncorrectable? Why would it be uncorrectable? If it tried to remap a bad sector and it failed (i.e., the spare sector is bad), then shouldn’t it try remapping to a different spare sector? It’s not a tire that only has one spare. – Synetech – 2011-12-31T23:54:58.500

100 MB? Do you mean 100 GB? – Peter Mortensen – 2013-12-07T17:07:30.347