I apologize for the long post; the tl;dr version is that I got a couple of "critical medium error" messages in dmesg, but hdparm is able to read the affected sectors just fine. What gives?!
Read on for all the gory details.
I put a Dell H310 (EDIT: flashed to IT mode) in my home server today, with a SAS-SATA cable to connect my hard drives. A short time after booting (with no complications), I saw the following error appear on the console:

    blk_update_request: critical medium error, dev sdc, sector 440819800

Immediately concerned, I logged in and checked dmesg, to find the following panic-inducing lines:
    [ 3868.082497] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
    [ 3868.082516] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
    [ 3868.082526] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
    [ 3868.082534] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
    [ 3868.082541] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
    [ 3868.082549] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
    [ 3868.082652] sd 2:0:2:0: [sdc] tag#3 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [ 3868.082665] sd 2:0:2:0: [sdc] tag#3 Sense Key : Medium Error [current]
    [ 3868.082676] sd 2:0:2:0: [sdc] tag#3 Add. Sense: Unrecovered read error
    [ 3868.082688] sd 2:0:2:0: [sdc] tag#3 CDB: Read(10) 28 00 1a 46 5b 00 00 05 80 00
    [ 3868.082696] blk_update_request: critical medium error, dev sdc, sector 440819800
    [ 3872.487468] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
    [ 3872.487484] mpt2sas_cm0: log_info(0x31080000): originator(PL), code(0x08), sub_code(0x0000)
    [ 3872.487559] sd 2:0:2:0: [sdc] tag#1 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
    [ 3872.487571] sd 2:0:2:0: [sdc] tag#1 Sense Key : Medium Error [current]
    [ 3872.487590] sd 2:0:2:0: [sdc] tag#1 Add. Sense: Unrecovered read error
    [ 3872.487601] sd 2:0:2:0: [sdc] tag#1 CDB: Read(10) 28 00 1a 46 60 58 00 00 08 00
    [ 3872.487610] blk_update_request: critical medium error, dev sdc, sector 440819800
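As a side note, the two CDB lines in the dmesg output above can be decoded by hand to see which sectors each failed request actually covered. The following is only a minimal sketch: the function name decode_read10 is made up for illustration, and it assumes the standard READ(10) layout (opcode 0x28 in byte 0, a big-endian LBA in bytes 2-5, and a big-endian transfer length in bytes 7-8).

```shell
#!/bin/sh
# Decode a SCSI READ(10) CDB as printed by the kernel, for example:
#   28 00 1a 46 5b 00 00 05 80 00
# decode_read10 is a hypothetical helper name, not part of any tool.
decode_read10() {
    # Split the single argument on whitespace into $1..$10.
    set -- $1
    if [ "$#" -ne 10 ] || [ "$1" != "28" ]; then
        echo "not a READ(10) CDB" >&2
        return 1
    fi
    # Bytes 2-5 (here $3..$6) form the starting LBA, most significant first.
    lba=$(( 0x$3$4$5$6 ))
    # Bytes 7-8 (here $8 and $9) form the transfer length in logical blocks.
    len=$(( 0x$8$9 ))
    echo "lba=${lba} blocks=${len} last=$(( lba + len - 1 ))"
}

decode_read10 "28 00 1a 46 5b 00 00 05 80 00"
decode_read10 "28 00 1a 46 60 58 00 00 08 00"
```

Run against the two CDBs from the log, this prints lba=440818432 blocks=1408 last=440819839 and lba=440819800 blocks=8 last=440819807. So the reported sector 440819800 lies inside the first request and is the first block of the second, 8-block request.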
Knowing just enough to be dangerous (and assuming that "dev sdc" in the error message means /dev/sdc), I tried reading that sector with hdparm:
    root@home:~# hdparm --read-sector 440819800 --direct /dev/sdc
    /dev/sdc:
    reading sector 440819800: succeeded
hdparm -a /dev/sdc showed me that readahead is on, and is 256 (sectors, I assume). Not wanting to pick through the output from 256+ consecutive calls to hdparm, I wrote a Little Script to read the 512 sectors on each side of the supposedly bad block:
    erroringsector=440819800
    startfromsector=$((erroringsector - 512))
    # Read the 512 sectors before the reported sector, the sector itself, and the 512 after it.
    for x in $(seq 0 1024)
    do
        currentsector=$((startfromsector + x))
        status=$(hdparm --read-sector "${currentsector}" --direct /dev/sdc 2>&1)
        z=$?
        # Report a non-zero exit code or any mention of "error" in the output.
        if [ "$z" -ne 0 ] || echo "${status}" | grep -qi error
        then
            echo "ERROR reading sector ${currentsector}: ${status}"
        fi
    done
Not knowing the behavior of hdparm when it encounters an I/O error (the man pages are no help, or I missed the small print that would have helped), I tried to cover all the bases by folding stderr into stdout, checking the exit code, and checking for "error" in the output.
When I run the above Little Script, I get no output at all, which I think means that hdparm was able to read all of the sectors I told it to read, right?
I also manually checked the 50 or so sectors on either side of the troublesome sector, finding only successful reads.
smartctl -A /dev/sdc did not expose any especially worrisome data:
    ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
      1 Raw_Read_Error_Rate     0x000f   200   200   051    Pre-fail  Always       -       12
      3 Spin_Up_Time            0x0003   163   163   021    Pre-fail  Always       -       4816
      4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       57
      5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
      7 Seek_Error_Rate         0x000e   200   200   000    Old_age   Always       -       0
      9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15924
     10 Spin_Retry_Count        0x0012   100   253   000    Old_age   Always       -       0
     11 Calibration_Retry_Count 0x0012   100   253   000    Old_age   Always       -       0
     12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       55
    192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       37
    193 Load_Cycle_Count        0x0032   113   113   000    Old_age   Always       -       262898
    194 Temperature_Celsius     0x0022   105   090   000    Old_age   Always       -       42
    196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
    197 Current_Pending_Sector  0x0012   200   200   000    Old_age   Always       -       0
    198 Offline_Uncorrectable   0x0010   200   200   000    Old_age   Offline      -       0
    199 UDMA_CRC_Error_Count    0x003e   200   200   000    Old_age   Always       -       0
    200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0
Raw_Read_Error_Rate actually has a number in there, but otherwise I think that report shows a hard drive that is getting a bit long in the tooth, but is otherwise alive and kicking. Please correct my inexperienced assessment with a minimum of flaming :-)
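For anyone who wants to re-check the same thing over time, here is a minimal sketch that pulls out only the attributes I understand to be most commonly associated with bad sectors (IDs 5, 197 and 198). The function name smart_sector_attrs is made up for illustration, and it assumes the ten-column smartctl -A layout shown above, with RAW_VALUE in the last column.

```shell
#!/bin/sh
# Print NAME=RAW_VALUE for SMART attributes 5, 197 and 198.
# smart_sector_attrs is a hypothetical helper name; it reads
# "smartctl -A" style output on stdin and assumes RAW_VALUE is column 10.
smart_sector_attrs() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2 "=" $10 }'
}

# Intended use on real hardware (needs root):
#   smartctl -A /dev/sdc | smart_sector_attrs
```

Header and banner lines are skipped because their first field is never one of the three IDs.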
My further research into and analysis of the SCSI Sense messages have not been fruitful, probably because until today, I knew nothing about them.
Yes, I checked (and re-seated) the HBA card and the cabling.
Bottom line, what does this all mean? Why the "critical medium error" message, but then complete success reading the sectors? More importantly, can I use this to justify upgrading to SSDs? ;-)