Should I be concerned about a high SMART Hardware_ECC_Recovered value?

Question

I got this message in /var/log/messages:


Jun 25 06:29:27 server.ru smartd[4477]: Device: /dev/sda, SMART Usage Attribute: 195 Hardware_ECC_Recovered changed from 46 to 47

Output of smartctl -a /dev/sda:


smartctl version 5.38 [i686-redhat-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   110   088   006    Pre-fail  Always       -       28526210
  3 Spin_Up_Time            0x0003   093   093   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       24
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   087   060   030    Pre-fail  Always       -       471723621
  9 Power_On_Hours          0x0032   098   098   000    Old_age   Always       -       2520
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       41
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   068   052   045    Old_age   Always       -       32 (Lifetime Min/Max 31/35)
194 Temperature_Celsius     0x0022   032   048   000    Old_age   Always       -       32 (0 27 0 0)
195 Hardware_ECC_Recovered  0x001a   047   045   000    Old_age   Always       -       105036390
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0000   100   253   000    Old_age   Offline      -       0
202 TA_Increase_Count       0x0032   100   253   000    Old_age   Always       -       0

Does it mean that the disk is failing and I have to replace it? Where can I read about how to interpret S.M.A.R.T. test results?

Dave Cheney · Accepted Answer · 2009-06-27T04:40:48.100

16

According to Steve Gibson of SpinRite fame, SMART values have to be read as trends over time, not as instantaneous readings. That is, a value of 47 isn't necessarily bad if it has been 47 for months. However, if the value was 42 an hour ago and it's climbing rapidly, that means the drive is having difficulty accessing part of the data and may soon be unable to read the sector at all. Depending on how valuable the data on that drive is, you may wish to replace it.

edited Jun 27 '09 at 04:40

answered Jun 26 '09 at 06:56


+1 for a great answer, and to add to it: if you are really concerned, Lexsys, I would buy a copy of SpinRite and run it. Your system will need to be able to boot from a CD, but the OS is irrelevant. (Although to create the boot CD you will need Windows, or a Windows clone.) – Matt Jun 26 '09 at 17:25
SpinRite comes with a bootable (FreeDOS) .iso image, which you can burn with any current OS – Dave Cheney Jun 26 '09 at 17:31
http://www.grc.com/sr/spinrite.htm would be a better place to link to – Brad Gilbert Jun 27 '09 at 04:35
3

The example is wrong! See @CesarB's answer - for most values, increasing is good! So if it was 42 an hour ago, and now it's 47 - great. But not the other way around. – Volker Siegel Jul 23 '14 at 00:08

Robert Klemme · Answer 2 · 2016-08-14T11:23:24.220

12

A high value for this attribute is actually pretty good:

Hardware ECC Recovered S.M.A.R.T. parameter indicates time between ECC-corrected errors.

https://kb.acronis.com/content/9131

edited Aug 14 '16 at 11:23

answered Aug 05 '16 at 20:21


2

Added what I believe to be the central quote. – Robert Klemme Aug 14 '16 at 11:24
This makes sense now even if the linked resource goes away. Thanks Robert. – chicks Aug 14 '16 at 12:35

score 7 · Answer 3 · edited Apr 25 '21 at 12:07

First, lower values are worse for SMART, not higher values (notice how the threshold column is always lower than the current value). So, a value increasing is no cause for worry. (This rule does not apply to the raw values, however.)

SMART values tend to oscillate a bit (yours might be on the edge between 46 and 47, for instance, so even a small change could flip it to the other value).

Your smartctl -a output shows that the worst this value has ever been is 45, so oscillating slightly above that is normal.

For more information, take a look at Wikipedia: ATA S.M.A.R.T. attributes.

Please note that "lower is worse" applies only to the values in the three columns labeled "Value", "Thresh" and "Worst". It does not necessarily apply to the "Raw Value", as the values there are not normalised.
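To make the column comparison concrete, here is a minimal Python sketch that parses attribute rows like the ones in the question and compares the normalised VALUE against THRESH. The function names (parse_attributes, margin, at_or_below_threshold) are made up for illustration, and the regular expression only assumes the column layout shown in the question's output; other smartctl versions or drives may format rows differently. The "at or below the threshold" test follows the usual smartctl description of a failed attribute, and the raw column is deliberately left uninterpreted.

```python
import re

# Three rows copied from the smartctl -a output in the question.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x000f   110   088   006    Pre-fail  Always       -       28526210
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
195 Hardware_ECC_Recovered  0x001a   047   045   000    Old_age   Always       -       105036390
"""

# Assumed column layout: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>.*)$"
)


def parse_attributes(text):
    """Return one dict per attribute row found in the text."""
    rows = []
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            rows.append({
                "id": int(m["id"]),
                "name": m["name"],
                "value": int(m["value"]),    # normalised, lower is worse
                "worst": int(m["worst"]),    # lowest normalised value seen
                "thresh": int(m["thresh"]),  # failure threshold
                "raw": m["raw"].strip(),     # vendor-specific, not interpreted here
            })
    return rows


def margin(attr):
    """Distance between the normalised value and its threshold."""
    return attr["value"] - attr["thresh"]


def at_or_below_threshold(attrs):
    """Names of attributes whose normalised value has reached the threshold."""
    return [a["name"] for a in attrs if a["value"] <= a["thresh"]]


if __name__ == "__main__":
    for attr in parse_attributes(SAMPLE):
        print(attr["name"], "margin:", margin(attr))
    print("failing:", at_or_below_threshold(parse_attributes(SAMPLE)))
```

For the rows in the question, every normalised value sits above its threshold, so nothing is flagged.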

score 4 · Answer 4 · answered Jun 26 '09 at 17:28

Keep in mind that even the extensive study that Google conducted found that a large number of drive failures were not predicted by SMART errors. It's possible that what you see is perfectly normal, but as each manufacturer has different metrics for converting the raw values into the reported values, it is hard to say for sure whether your drive is experiencing a lot of errors or not. However, a raw number that large does strike me as odd.

I would recommend reading the whole drive (with dd, or by rsyncing to a new drive) and checking the SMART values as it goes along. If you see that raw number, or the reported values, change a lot, I'd start looking to replace the drive.
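As a rough illustration of checking the values "as it goes along", the Python sketch below compares two snapshots of the normalised values and lists attributes that dropped. The snapshot dicts, the function names and the min_drop cutoff of 5 points are all assumptions made for this example; no manufacturer publishes a single "changed a lot" number, so pick your own.

```python
def normalized_changes(before, after):
    """Return {name: delta} for attributes whose normalised VALUE changed.

    Both arguments map attribute names to normalised values (ints).
    Attributes missing from either snapshot are ignored.
    """
    changes = {}
    for name, new in after.items():
        old = before.get(name)
        if old is not None and new != old:
            changes[name] = new - old
    return changes


def worsening(changes, min_drop=5):
    """Names whose normalised value fell by at least min_drop points.

    min_drop is a hypothetical cutoff chosen for illustration only.
    Normalised values get worse as they go down, so only drops count.
    """
    return sorted(name for name, delta in changes.items() if delta <= -min_drop)


if __name__ == "__main__":
    # The change reported by smartd in the question: 46 -> 47.
    before = {"Hardware_ECC_Recovered": 46, "Reallocated_Sector_Ct": 100}
    after = {"Hardware_ECC_Recovered": 47, "Reallocated_Sector_Ct": 100}
    delta = normalized_changes(before, after)
    print(delta)
    print(worsening(delta))
```

With the 46 to 47 change from the log message, the delta is positive, so nothing is reported as worsening.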

Huh. It would be pretty cool to have ZFS track SMART attributes against its own usage patterns. — i336_, Oct 06 '19 at 16:05

score 3 · Answer 5 · answered Jun 26 '09 at 10:55

IIRC, Hardware ECC Recovered relates to error correction on disk reads, which isn't unusual for a disk; drives encode the data with error correction mechanisms for precisely this reason. Some controllers also support redundant information in disk sectors and add another layer of error correction.

As Dave Cheney states, the figures should be monitored over time. Radical changes in these statistics are an indication of a failing drive. Also, keep an eye on grown defect lists: if the grown defect list starts to grow or the SMART statistics start to change significantly, then you should prophylactically replace the drive.

1

lol, prophylactically – Dave Cheney Jun 26 '09 at 17:27

cstamas · Answer 6 · 2009-06-26T09:08:21.193

1

Nothing wrong with it.

You can always run

smartctl -t long /dev/yourdrive

Then, after a few hours, query the result:

smartctl -a /dev/yourdrive

just to be sure.

edited Jun 26 '09 at 09:08

answered Jun 26 '09 at 08:59

