29

We have a Linux server that has been in heavy use for 3 years. We're running a number of virtualized servers on it, some of which have not been well behaved, and for a significant time the server's I/O capacity was exceeded, leading to bad iowait. It has four 500 GB Seagate Barracuda SATA drives connected to a 3Com RAID controller. One drive holds the OS, and the other three are set up as RAID 5.

Now we have a debate as to the condition of the drives and whether they are actively failing.

Here's a portion of the SMART output for one of the four disks. They all show relatively similar statistics:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   118   099   006    Pre-fail  Always       -       169074425
  3 Spin_Up_Time            0x0003   095   092   000    Pre-fail  Always       -       0
  4 Start_Stop_Count        0x0032   100   100   020    Old_age   Always       -       26
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000f   077   060   030    Pre-fail  Always       -       200009354607
  9 Power_On_Hours          0x0032   069   069   000    Old_age   Always       -       27856
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       1
 12 Power_Cycle_Count       0x0032   100   100   020    Old_age   Always       -       26
184 Unknown_Attribute       0x0032   100   100   099    Old_age   Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
188 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       1
189 High_Fly_Writes         0x003a   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0022   071   060   045    Old_age   Always       -       29 (Lifetime Min/Max 26/37)
194 Temperature_Celsius     0x0022   029   040   000    Old_age   Always       -       29 (0 21 0 0)
195 Hardware_ECC_Recovered  0x001a   046   033   000    Old_age   Always       -       169074425
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

My interpretation of this is that we have not had any bad sectors or other indications that any of the drives are actively failing.

However, the high Raw_Read_Error_Rate and Seek_Error_Rate values are being pointed to as indications that the drives are dying.

gview
There is a good description here (too long to repost, please follow the link): https://lime-technology.com/wiki/Understanding_SMART_Reports In case the link goes down, some important quotes: "This is an indicator of the current rate of errors of the low level physical sector read operations. In normal operation, there are ALWAYS a small number of errors [...] there is NO issue with the drive." and "PLEASE completely ignore the RAW_VALUE number! Only Seagates report the raw value, which yes, does appear to be the number of raw read errors, but should be ignored, completely." – Konrad Gajewski Feb 12 '18 at 21:19

8 Answers

79

For Seagate disks (and possibly some old ones from WD too), the Seek_Error_Rate and Raw_Read_Error_Rate are 48-bit numbers, where the most significant 16 bits are an error count and the low 32 bits are a number of operations.

% python
>>> 200009354607 & 0xFFFFFFFF
2440858991
>>> (200009354607 & 0xFFFF00000000) >> 32
46

So your disk has performed 2440858991 seeks, of which 46 failed. My experience with Seagate drives is that they tend to fail when the number of errors goes over 1000. YMMV.
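The same split can be wrapped in a small Python helper (a sketch; decode_seagate_raw is my own name for it, and it assumes the 16-bit/32-bit layout described above):

# Sketch: decode a Seagate Seek_Error_Rate / Raw_Read_Error_Rate raw value,
# assuming the layout above: low 32 bits = operations, high bits = errors.
def decode_seagate_raw(raw_value):
    operations = raw_value & 0xFFFFFFFF   # low 32 bits: operations performed
    errors = raw_value >> 32              # remaining high bits: error count
    return operations, errors

# Seek_Error_Rate raw value from the question:
ops, errs = decode_seagate_raw(200009354607)
print(f"{ops} seeks, {errs} seek errors")   # 2440858991 seeks, 46 seek errors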

Dan Pritts
tsuna
15

The "seek error rate" and "raw read error rate" RAW_VALUES are virtually meaningless for anyone but Seagate's support. As others pointed out, raw values of parameters like "reallocated sector count" or entries in the drive's error log are more likely to indicate a higher probability of failure.

But you can take a look at the interpreted data in the VALUE, WORST and THRESH columns which are meant to be read as gauges:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH
  7 Seek_Error_Rate         0x000f   077   060   030

Meaning that your seek error rate is currently considered to be "77% good" and is reported as a problem by SMART when it reaches "30% good". It had been as low as "60% good" once, but has magically recovered since. Note that the interpreted values are calculated by the drive's SMART logic internally and the exact calculation may or may not be published by the manufacturer and typically cannot be tweaked by the user.
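A rough sketch of how those gauges could be checked programmatically across your drives (assuming smartctl is installed and the attribute table layout shown in the question; the script is mine, not part of smartmontools):

import subprocess
import sys

def failing_attributes(device):
    """Return attributes whose normalised VALUE is at or below THRESH."""
    output = subprocess.run(["smartctl", "-A", device],
                            capture_output=True, text=True).stdout
    problems = []
    in_table = False
    for line in output.splitlines():
        if line.startswith("ID#"):
            in_table = True
            continue
        fields = line.split()
        if not in_table or len(fields) < 6 or not fields[0].isdigit():
            continue
        name, value, thresh = fields[1], int(fields[3]), int(fields[5])
        if thresh > 0 and value <= thresh:   # THRESH 0 means no failure threshold defined
            problems.append((name, value, thresh))
    return problems

if __name__ == "__main__":
    for name, value, thresh in failing_attributes(sys.argv[1]):
        print(f"{name}: VALUE {value} is at or below THRESH {thresh}")

Run against the disk in the question, it would print nothing, since every VALUE is still comfortably above its THRESH.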

Personally, I consider a drive with error log entries to be "failing" and urge replacement as soon as they appear. But all in all, SMART data has turned out to be a rather weak indicator for failure prediction, as a research paper published by Google found.

the-wabbit
11

In my experience, Seagates have weird numbers for those two SMART attributes. When diagnosing a Seagate I tend to ignore them and look more closely at other fields like Reallocated Sector Count. Of course, when in doubt, replace the drive, but even brand new Seagates will show high numbers for those attributes.

hwilbanks
8

I realize this discussion is a bit old, but I want to add my 2 cents. I have found the SMART information to be quite a good indicator of pre-failure. When a SMART threshold gets tripped, replace the drive. That is what those thresholds are for.

The vast majority of the time you will start to see bad sectors first. That is a sure sign the drive is starting to fail. SMART has saved me many times. I use software RAID 1, and it's very helpful since you simply replace the failing drive and rebuild the array.

I also run short and long self-tests weekly:

smartctl -t short /dev/sda
smartctl -t long /dev/sda 
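The results of those tests land in the drive's self-test log; assuming the same device path, they can be reviewed afterwards with:

smartctl -l selftest /dev/sda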

Or add entries to /etc/smartd.conf and have smartd email you if there are errors (here -s schedules a weekly long self-test, -I 194 ignores the temperature attribute, and -m sets the address to notify):

/dev/sda -s L/../../3/22 -I 194 -m someemail@somedomain
/dev/sdb -s L/../../7/22 -I 194 -m someemail@somedomain

Make sure to install logwatch, redirect root's mail to a real email address, and check the daily emails from logwatch. Tripped smartd flags will show up there, but that's of no help if nobody is monitoring them regularly.

Fred Flint
3

Sorry to commit necromancy on this post, but in my experience the "Raw Read Error Rate" and "Hardware ECC Recovered" fields on a Seagate drive will quite literally go all over the place, incrementing constantly into the trillions, at which point they cycle back around to zero and start the process again. I have a Seagate ST9750420AS that has behaved that way since day one and still works great even after quite a few years and 3500+ hours of use.

I think those fields can be safely ignored if you're running one of these drives. Just make sure the two fields report the same number and stay in sync. If they're not... well, that might actually indicate a problem.

Ryan Gandy
2

Add these flags so that the attributes 1 & 7 (Raw_Read_Error_Rate & Seek_Error_Rate) are interpreted as consisting of a 24-bit error count and a 32-bit total count.

-v 1,raw24/raw32 -v 7,raw24/raw32

-v stands for --vendorattribute=

Specifying raw24/raw32 tells smartctl to interpret and display the raw value in that common format; see the man page.
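For example, assuming the drive sits at /dev/sda, the combined invocation would look something like:

smartctl -A -v 1,raw24/raw32 -v 7,raw24/raw32 /dev/sda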

As per the Seagate manual, the meaning of the 7 raw bytes of each attribute is as follows:

Attribute ID 1: Raw Error Rate  
Raw [3 – 0] = Number of sector reads  
Raw [6 - 4] = Number of read errors  

Attribute ID 7: Seek Error Rate  
Raw [3 – 0] = Number of seeks  
Raw [5 – 4] = Number of seek errors
2

To automate the calculations from the answer above, use the online JavaScript calculator:

https://yksi.ml/

This will tell you:

  • Total number of operations
  • Number of failed operations

The calculator is valid for Seagate's:

  • Seek Error Rate
  • Raw Read Error Rate
  • Hardware ECC Recovered

For further reading on the calculation of the normalised values (between 0 and 100), see this article.

Tom Hale
1

Yes, those fields look bad, but I no longer trust the info reported by SMART (my test machine has a drive that should have died a long time ago if you believe the data from smartctl). The fact is that you have reported high iowait and the drives are 3 years old. That should be enough reason to replace the drives.

migabi
For various reasons we need to maximize our investment in the hardware. The iowait had to do with the ridiculous load, as well as some configuration mistakes we made when setting up the box. – gview Sep 20 '11 at 22:57