Can read errors logged in S.M.A.R.T. be caused by the controller?

2

I have one little server at a remote location that gives me some headaches: it has now seemingly eaten the third HDD in a row in one specific slot.

The last replacement was done in May (a 3 TB WDC WD30PURX, if that matters; it lasted only 8 months), and after a while I noticed read errors again. I started wondering whether I am really that unlucky with this one slot, or whether there is an issue with the controller.

Normally I assumed that S.M.A.R.T. only reports what the drive itself experienced, but then I wondered: could the drive read its sectors fine but fail to transfer them to the controller, with that failure being logged as a read error?

What made me suspicious was that on the day I first discovered the S.M.A.R.T. alerts, the bad sectors all fell between LBA 3303035895 and 3330891687, which looked like a bad patch of surface. Running all kinds of tools over the HDD produced various errors in that area, but in the end each read request succeeded, and from then on the sector was "healed". That behaviour resembled sector reallocation to me, yet no reallocated sectors were recorded.

In total there were 4527 read errors across 4153 different sectors; now I cannot find a single bad one (I ran through the whole disk several times).

Then, after a few days, a full disk scan (via S.M.A.R.T. self-test and badblocks) revealed no errors at all, and the disk has been performing OK-ish since.
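When re-checking specific LBAs from the syslog, a full 3 TB rescan is not necessary; the byte offset of a reported sector can be computed directly and handed to dd. A minimal sketch, assuming 512-byte logical sectors (check `smartctl -i` for the drive's actual sector size; the device path is just an example):

```python
# Re-check a specific failed LBA from the syslog without rescanning the
# whole disk. Assumes 512-byte logical sectors (an assumption -- verify
# with `smartctl -i` before relying on the offsets).

SECTOR_SIZE = 512  # bytes per logical sector (assumed)

def lba_to_dd_command(lba: int, device: str = "/dev/sdd", count: int = 1) -> str:
    """Build a dd command that reads `count` sectors starting at `lba`,
    bypassing the page cache so the read actually hits the platter."""
    return (f"dd if={device} of=/dev/null bs={SECTOR_SIZE} "
            f"skip={lba} count={count} iflag=direct")

# The sector reported by blk_update_request in the syslog excerpt:
failed_lba = 3328725040
print(f"byte offset: {failed_lba * SECTOR_SIZE}")
print(lba_to_dd_command(failed_lba, count=256))
```

Running the printed command against the live device and watching the syslog for new UNC errors is a quick way to confirm whether a previously bad sector still fails.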

The errors appeared in the syslog like:

 [517871.828215] ata6.00: exception Emask 0x0 SAct 0x0 SErr 0x0 action 0x0
 [517871.828219] ata6.00: BMDMA stat 0x25
 [517871.828223] ata6.00: failed command: READ DMA EXT
 [517871.828229] ata6.00: cmd 25/00:00:00:4f:68/00:02:c6:00:00/e0 tag 0 dma 262144 in
 [517871.828229]          res 51/40:cf:30:50:68/40:00:c6:00:00/e0 Emask 0x9 (media error)
 [517871.828232] ata6.00: status: { DRDY ERR }
 [517871.828234] ata6.00: error: { UNC }
 [517871.840411] ata6.00: configured for UDMA/133
 [517871.840538] sd 5:0:0:0: [sdd] tag#0 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
 [517871.840543] sd 5:0:0:0: [sdd] tag#0 Sense Key : Medium Error [current] [descriptor]
 [517871.840547] sd 5:0:0:0: [sdd] tag#0 Add. Sense: Unrecovered read error - auto reallocate failed
 [517871.840551] sd 5:0:0:0: [sdd] tag#0 CDB: Read(16) 88 00 00 00 00 00 c6 68 4f 00 00 00 02 00 00 00
 [517871.840554] blk_update_request: I/O error, dev sdd, sector 3328725040
 [517871.840576] ata6: EH complete

and in S.M.A.R.T. like:

Error 4527 [14] occurred at disk power-on lifetime: 1282 hours (53 days + 10 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 01 00 00 00 c6 49 3c a0 e0 00  Error: UNC 256 sectors at LBA = 0xc6493ca0 = 3326688416

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  25 00 00 01 00 00 00 c6 49 3c 00 e0 08  5d+23:59:09.617  READ DMA EXT
  25 00 00 00 18 00 00 c6 49 38 e8 e0 08  5d+23:59:09.617  READ DMA EXT
  25 00 00 00 10 00 00 c5 9d e7 00 e0 08  5d+23:59:09.610  READ DMA EXT
  25 00 00 00 c0 00 00 c5 9d b5 00 e0 08  5d+23:59:09.581  READ DMA EXT
  35 00 00 00 18 00 00 c6 49 38 e8 e0 08  5d+23:59:09.581  WRITE DMA EXT

To me this looks at first like a surface error where reallocation failed. However, in such cases I am used to seeing some of the S.M.A.R.T. counters rise, specifically either Current_Pending_Sector or Reallocated_Sector_Ct. But no value is increasing:

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAGS    VALUE WORST THRESH FAIL RAW_VALUE
  1 Raw_Read_Error_Rate     POSR-K   165   145   051    -    36676
  3 Spin_Up_Time            POS--K   100   253   021    -    0
  4 Start_Stop_Count        -O--CK   100   100   000    -    3
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
  7 Seek_Error_Rate         -OSR-K   200   200   000    -    0
  9 Power_On_Hours          -O--CK   099   099   000    -    1402
 10 Spin_Retry_Count        -O--CK   100   253   000    -    0
 11 Calibration_Retry_Count -O--CK   100   253   000    -    0
 12 Power_Cycle_Count       -O--CK   100   100   000    -    2
192 Power-Off_Retract_Count -O--CK   200   200   000    -    0
193 Load_Cycle_Count        -O--CK   200   200   000    -    7
194 Temperature_Celsius     -O---K   119   119   000    -    31
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
200 Multi_Zone_Error_Rate   ---R--   001   001   000    -    102665
                            ||||||_ K auto-keep
                            |||||__ C event count
                            ||||___ R error rate
                            |||____ S speed/performance
                            ||_____ O updated online
                            |______ P prefailure warning
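Since the puzzle here is precisely that none of these counters move, a small watcher can make any future change explicit. A minimal sketch that parses `smartctl -A` table lines for the reallocation- and CRC-related attributes, assuming the column layout shown above (note that some drives append extra text to RAW_VALUE, e.g. temperature min/max, so this naive last-field parse is only safe for the plain counters watched here):

```python
# Extract the reallocation/pending/CRC counters from `smartctl -A` output
# so a cron job can alert when any of them rises. Attribute IDs are the
# standard ones also visible in the table above.

WATCHED_IDS = {5, 196, 197, 198, 199}  # reallocated / pending / CRC counters

def parse_smart_attributes(text: str) -> dict[int, int]:
    """Map attribute ID -> raw value for the watched attributes."""
    raw = {}
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least
        # ID, name, flags, value, worst, thresh, fail, raw columns.
        if len(fields) >= 7 and fields[0].isdigit():
            attr_id = int(fields[0])
            if attr_id in WATCHED_IDS:
                raw[attr_id] = int(fields[-1])  # RAW_VALUE is the last column
    return raw

# Sample rows copied from the smartctl output above:
sample = """\
  5 Reallocated_Sector_Ct   PO--CK   200   200   140    -    0
196 Reallocated_Event_Count -O--CK   200   200   000    -    0
197 Current_Pending_Sector  -O--CK   200   200   000    -    0
198 Offline_Uncorrectable   ----CK   100   253   000    -    0
199 UDMA_CRC_Error_Count    -O--CK   200   200   000    -    0
"""
print(parse_smart_attributes(sample))
```

Feeding it the output of `smartctl -A /dev/sdd` and comparing against the previous run would flag the first counter that actually increments.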

TL;DR

Is this just a case of "bad hard drive behaving badly", or is something wrong on the controller side of things? Or is S.M.A.R.T. itself broken? I have the feeling I am missing something obvious that would explain the discrepancy.


Note: I have the replacement drive on standby, and in a few days I will have the opportunity to physically visit the server, so until then no cable wiggling or the like.

PlasmaHH

Posted 2016-07-11T20:28:45.783

Reputation: 123

A very similar question is here.

– guest-vm – 2016-07-14T02:21:46.097

Question seems ambiguous. You meant a RAID controller, but before reading all your details, it looked like you were probably asking about a "hard drive controller". – TOOGAM – 2017-02-26T06:28:09.700

@TOOGAM: I am not quite sure how you got the idea that a RAID controller is involved; the thing in question is commonly referred to as a "SATA controller" around here, whatever its "official" name might be – PlasmaHH – 2017-02-26T22:34:24.213

Answers

2

The short answer, to your title question: yes, it is possible for the controller/computer to cause S.M.A.R.T. errors. The most common scenario is a noisy cable or bad/out-of-spec SATA/SAS signal drivers corrupting commands sent over the link to the drive. The drive CRC-checks each command, fails it, and logs the failure in S.M.A.R.T. as a command CRC error (attribute 199, UDMA_CRC_Error_Count).

The long and complicated answer, based on the body data: I don't think that is the case for you, because no CRC errors were logged. Keep in mind there are two "connections" from the computer to the hard drive: data and POWER. While not certain, if the failures really are slot-related and you are not just seeing CRC errors, the most likely culprit is the power going to that slot causing the drive to misbehave.

There is really not enough data here to answer your question definitively, but quite possibly the power going to that slot is having issues. When the power is not solid, all bets are off when reading from or writing to the disk.
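The distinction drawn above (link-level CRC corruption vs. genuine media failure vs. flaky power) can be sketched as a rough triage heuristic. This is only an illustration of the reasoning, not a diagnostic tool; the categories and the order of the checks are assumptions:

```python
def triage(crc_errors: int, uncorrectable_reads: int,
           pending_sectors: int, reallocated: int) -> str:
    """Rough triage of SMART evidence, mirroring the reasoning above.

    crc_errors:          raw value of attribute 199 (UDMA_CRC_Error_Count)
    uncorrectable_reads: UNC errors seen in the SMART error log / syslog
    pending_sectors:     raw value of attribute 197
    reallocated:         raw value of attribute 5
    """
    if crc_errors > 0:
        # Commands corrupted on the wire: suspect cable or controller.
        return "suspect cable/controller (link CRC errors)"
    if uncorrectable_reads > 0 and (pending_sectors > 0 or reallocated > 0):
        # Classic failing surface: the drive itself admits bad sectors.
        return "suspect drive media (surface defects)"
    if uncorrectable_reads > 0:
        # UNC errors but no pending/reallocated sectors, as in the
        # question: the drive misbehaves without leaving a media trail,
        # so flaky power to the slot is a plausible cause.
        return "suspect power/slot (UNC errors without media trail)"
    return "no evidence of a problem"

# The situation from the question: 4527 UNC errors, all counters at zero.
print(triage(crc_errors=0, uncorrectable_reads=4527,
             pending_sectors=0, reallocated=0))
```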

boxer4

Posted 2016-07-11T20:28:45.783

Reputation: 36

Thanks for the answer. As you might have guessed, I already replaced the HDD last year, and everything has been fine there since. The HDD I removed was really bad (I verified it afterwards on a different machine), with weird intermittent read errors. It seems to have just been a case of bad luck to have that many disks fail in a row – PlasmaHH – 2017-02-26T22:36:24.937

If there are power issues, those issues can damage the drives, and then the drives may remain damaged and continue to have problems even if they are in another machine. Definitely try replacing any SATA cables, since those are cheap. Then, if you get tired of replacing drives, consider replacing the slot. (Yes, I understand that likely means replacing more than just one slot, possibly a case.) Or consider just leaving that slot unused, if that's feasible. – TOOGAM – 2017-02-26T22:43:14.320