blk_update_request: critical medium error, but hdparm is successful

I apologize for the long post; the tl;dr version is that I got a couple of "critical medium error" messages in dmesg, but hdparm is able to read the affected sectors just fine. What gives?!

Read on for all the gory details.

I put a Dell H310 (EDIT: flashed to IT mode) in my home server today, using a SAS-to-SATA cable to connect my hard drives. The machine booted with no complications, but a short time later the following error appeared on the console: blk_update_request: critical medium error, dev sdc, sector 440819800. Immediately concerned, I logged in and checked dmesg, where I found the following panic-inducing lines:

[ 3868.082497] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 3868.082516] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 3868.082526] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 3868.082534] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 3868.082541] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 3868.082549] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 3868.082652] sd 2:0:2:0: [sdc] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 3868.082665] sd 2:0:2:0: [sdc] tag#3 Sense Key : Medium Error [current] 
[ 3868.082676] sd 2:0:2:0: [sdc] tag#3 Add. Sense: Unrecovered read error
[ 3868.082688] sd 2:0:2:0: [sdc] tag#3 CDB: Read(10) 28 00 1a 46 5b 00 00 05 80 00
[ 3868.082696] blk_update_request: critical medium error, dev sdc, sector 440819800
[ 3872.487468] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 3872.487484] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
[ 3872.487559] sd 2:0:2:0: [sdc] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
[ 3872.487571] sd 2:0:2:0: [sdc] tag#1 Sense Key : Medium Error [current] 
[ 3872.487590] sd 2:0:2:0: [sdc] tag#1 Add. Sense: Unrecovered read error
[ 3872.487601] sd 2:0:2:0: [sdc] tag#1 CDB: Read(10) 28 00 1a 46 60 58 00 00 08 00
[ 3872.487610] blk_update_request: critical medium error, dev sdc, sector 440819800

Knowing just enough to be dangerous (and assuming that "dev sdc" in the error message means /dev/sdc), I tried reading that sector with hdparm:

root@home:~# hdparm --read-sector 440819800 --direct /dev/sdc

/dev/sdc:
reading sector 440819800: succeeded

hdparm -a /dev/sdc showed me that readahead is on and set to 256 (sectors, I assume). Not wanting to pick through the output from 256+ consecutive calls to hdparm, I wrote a Little Script to read the 512 sectors on each side of the supposedly bad block:

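# Read the 512 sectors on each side of the sector named in the kernel error.
erroringsector=440819800
startfromsector=$((erroringsector - 512))
for x in $(seq 0 1024)
do
    currentsector=$((startfromsector + x))
    # Fold stderr into stdout so any error text is captured along with the exit code.
    status=$(hdparm --read-sector "${currentsector}" --direct /dev/sdc 2>&1)
    z=$?
    # Report the sector if hdparm exited non-zero or printed anything mentioning "error".
    if [ "$z" -ne 0 ] || echo "${status}" | grep -qi error
    then
        echo "ERROR reading sector ${currentsector}: ${status}"
    fi
done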

Not knowing the behavior of hdparm when it encounters an I/O error (the man pages are no help, or I missed the small print that would have helped), I tried to cover all the bases by folding stderr into stdout, checking the exit code, and checking for "error" in the output.

When I run the above Little Script, I get no output at all, which I think means that hdparm was able to read all of the sectors I told it to read, right?

I also manually checked the 50 or so sectors on either side of the troublesome sector, finding only successful reads.

smartctl -A /dev/sdc did not expose any especially worrisome data:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       12
  3 Spin_Up_Time            0x0003   163   163   021    Pre-fail  Always       -       4816
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       57
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15924
 10 Spin_Retry_Count        0x0012   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0012   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       55
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       37
193 Load_Cycle_Count        0x0032   113   113   000    Old_age   Always       -       262898
194 Temperature_Celsius     0x0022   105   090   000    Old_age   Always       -       42
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

Raw_Read_Error_Rate actually has a number in there (12), but apart from that I think the report shows a hard drive that is getting a bit long in the tooth yet is still alive and kicking. Please correct my inexperienced assessment with a minimum of flaming :-)
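For anyone who wants to repeat that check without eyeballing the table, here is a minimal sketch of a filter for the attributes I understand to be the usual suspects for bad sectors and cabling trouble (5, 196, 197, 198 and 199). It assumes the ten-column smartctl -A layout shown above, with the raw value in the last column; the function name check_smart_attrs is made up for illustration.

```shell
#!/bin/sh
# Read `smartctl -A` output on stdin and print any of the selected
# attributes whose RAW_VALUE (10th column) is non-zero.
# Assumption: the ten-column attribute table layout shown in this post.
check_smart_attrs() {
    awk '$1 ~ /^(5|196|197|198|199)$/ && $10 + 0 != 0 { print $1, $2, "raw=" $10 }'
}

# Usage (needs root and a real drive, so it is left as a comment):
#   smartctl -A /dev/sdc | check_smart_attrs
```

Fed the table above, it prints nothing, since all five of those raw values are 0.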

My further research into the SCSI sense messages has not been fruitful, probably because until today I knew nothing about them.
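The one part of the log I could decode mechanically is the CDB line. The sketch below assumes the standard READ(10) layout (opcode 0x28 in byte 0, big-endian starting LBA in bytes 2-5, big-endian transfer length in bytes 7-8); the function name decode_read10 is hypothetical, and I have not verified this against anything beyond the two log lines above.

```shell
#!/bin/sh
# Decode a SCSI READ(10) CDB as printed by the kernel, e.g.
#   28 00 1a 46 5b 00 00 05 80 00
# Arguments: the ten CDB bytes as separate two-digit hex values.
decode_read10() {
    if [ "$#" -ne 10 ] || [ "$1" != "28" ]; then
        echo "usage: decode_read10 28 xx xx xx xx xx xx xx xx xx" >&2
        return 1
    fi
    lba=$(( 0x$3$4$5$6 ))   # bytes 2-5: starting logical block address
    len=$(( 0x$8$9 ))       # bytes 7-8: number of logical blocks requested
    echo "start=$lba blocks=$len last=$(( lba + len - 1 ))"
}

# The two failed commands from the dmesg output above:
decode_read10 28 00 1a 46 5b 00 00 05 80 00
decode_read10 28 00 1a 46 60 58 00 00 08 00
```

This prints start=440818432 blocks=1408 last=440819839 for the first command and start=440819800 blocks=8 last=440819807 for the second. So the second failed read began exactly at the reported sector, while the first was a much larger request that began before the 440819288 to 440820312 range my script re-read.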

Yes, I checked (and re-seated) the HBA card and the cabling.

Bottom line, what does this all mean? Why the "critical medium error" message, but then complete success reading the sectors? More importantly, can I use this to justify upgrading to SSDs? ;-)

Peter

Posted 2019-03-05T05:54:05.643
