
I have my custom NAS configured to spin down drives after 20 minutes of idle.

Just now I checked /proc/mdstat and noticed a drive was marked as failed, yet SMART shows the drive to be in very good health. I therefore suspect md-raid thought spin-up was taking too long and marked the drive as failed.

Re-adding and rebuilding doesn't seem to be a problem either.
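For reference, a minimal sketch of the commands involved (assuming /dev/sdc, the member shown in the log below):

cat /proc/mdstat                           # member shows up flagged (F)
smartctl -a /dev/sdc                       # SMART attributes all look fine
mdadm --manage /dev/md0 --re-add /dev/sdc  # re-add; rebuild runs through cleanly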

dmesg shows the following interesting lines, for which googling turns up very little:

[97144.228682] sd 0:0:2:0: attempting task abort! scmd(ffff97f7b14ce948)
[97144.228688] sd 0:0:2:0: [sdc] tag#0 CDB: opcode=0x12 12 00 00 00 24 00
[97144.228692] scsi target0:0:2: handle(0x000c), sas_address(0x5001438020b9ee12), phy(18)
[97144.228694] scsi target0:0:2: enclosure_logical_id(0x5001438020b9ee25), slot(49)
[97148.184253] sd 0:0:2:0: task abort: SUCCESS scmd(ffff97f7b14ce948)
[97148.235864] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
--- last message repeated a couple dozen times ---
[97148.490304] sd 0:0:2:0: [sdc] tag#16 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490308] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490310] sd 0:0:2:0: [sdc] tag#13 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490315] sd 0:0:2:0: [sdc] tag#13 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e af f0 00 00 00 10 00 00
[97148.490317] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490321] print_req_error: I/O error, dev sdc, sector 225357808
[97148.490326] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490331] sd 0:0:2:0: [sdc] tag#16 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e b0 18 00 00 00 20 00 00
[97148.490334] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490337] print_req_error: I/O error, dev sdc, sector 225357848
[97148.490341] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490354] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490358] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490366] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490370] sd 0:0:2:0: [sdc] tag#15 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490374] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490378] sd 0:0:2:0: [sdc] tag#15 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e ae 68 00 00 00 08 00 00
[97148.490380] print_req_error: I/O error, dev sdc, sector 225357416
[97148.490383] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490392] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490399] mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
[97148.490403] sd 0:0:2:0: [sdc] tag#14 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490407] sd 0:0:2:0: [sdc] tag#14 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e ad 90 00 00 00 30 00 00
[97148.490409] print_req_error: I/O error, dev sdc, sector 225357200
[97148.490435] sd 0:0:2:0: [sdc] tag#11 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490439] sd 0:0:2:0: [sdc] tag#11 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e ad c8 00 00 00 58 00 00
[97148.490441] print_req_error: I/O error, dev sdc, sector 225357256
[97148.490450] sd 0:0:2:0: [sdc] tag#10 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490454] sd 0:0:2:0: [sdc] tag#10 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e ad 00 00 00 00 50 00 00
[97148.490456] print_req_error: I/O error, dev sdc, sector 225357056
[97148.490464] sd 0:0:2:0: [sdc] tag#9 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490468] sd 0:0:2:0: [sdc] tag#9 CDB: opcode=0x35 35 00 00 00 00 00 00 00 00 00
[97148.490472] print_req_error: I/O error, dev sdc, sector 16
[97148.490474] md: super_written gets error=10
[97148.490477] md/raid:md0: Disk failure on sdc, disabling device.
               md/raid:md0: Operation continuing on 3 devices.
[97148.490496] sd 0:0:2:0: [sdc] tag#8 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490500] sd 0:0:2:0: [sdc] tag#8 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e b0 40 00 00 00 20 00 00
[97148.490502] print_req_error: I/O error, dev sdc, sector 225357888
[97148.490510] sd 0:0:2:0: [sdc] tag#7 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490514] sd 0:0:2:0: [sdc] tag#7 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e af b8 00 00 00 30 00 00
[97148.490516] print_req_error: I/O error, dev sdc, sector 225357752
[97148.490524] sd 0:0:2:0: [sdc] tag#6 UNKNOWN(0x2003) Result: hostbyte=0x0b driverbyte=0x00
[97148.490528] sd 0:0:2:0: [sdc] tag#6 CDB: opcode=0x88 88 00 00 00 00 00 0d 6e b0 00 00 00 00 08 00 00
[97148.490530] print_req_error: I/O error, dev sdc, sector 225357824

Is there a timeout value I can increase to make md-raid wait a couple of minutes for drives to come online?
Are there any other options for preventing this in the future (other than keeping my drives spinning 24/7, because I too want to sleep every now and then)?


Update 2017-10-07

Updating the controller firmware (it's a Dell Perc H310 cross-flashed to 9211-8i IT mode), updating the SAS expander firmware and increasing timeouts have greatly reduced the frequency of the above errors, but they still happen, and on some of these occasions md-raid still fails the drive.

I've decoded the SAS error code:

Value           31110101h
Type:           30000000h       SAS
Origin:         01000000h       PL
Code:           00110000h       PL_LOGINFO_CODE_RESET See Sub-Codes below (PL_LOGINFO_SUB_CODE)
Sub Code:       00000100h       PL_LOGINFO_SUB_CODE_OPEN_FAILURE
SubSub Code:    00000001h       PL_LOGINFO_SUB_CODE_OPEN_FAILURE_NO_DEST_TIMEOUT

For this I couldn't find anything more than a brief description online (in an LSI PDF from 2009):

Failed to open connection with error Open Reject (No Destination). Retried for 50 milliseconds.

After some further testing (provoking the problem with hdparm -y ... to spin drives down and hddtemp ... to spin them up with a simple command) I found the timeout to be slightly above 11 seconds, which is odd because the only timeout settings still left at a value of 10 are the generic I/O timeouts for "sequential", "removable", and "unknown" devices.
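As a sketch, the provocation looks roughly like this (assuming /dev/sdb; the -w flag makes hddtemp wake the drive, and any other command that touches the platters works just as well):

hdparm -y /dev/sdb        # STANDBY IMMEDIATE: spin the drive down right away
sleep 10                  # give it a moment to actually spin down
time hddtemp -w /dev/sdb  # reading the temperature forces a spin-up; `time`
                          # shows how long the command lives before the abort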


Update 2017-10-08

Here's the topology of my setup:

Dell Perc H310 (LSISAS2008: FWVersion(20.00.07.00), ChipRevision(0x03), BiosVersion(07.39.02.00)) (flashed to 9211-8i IT-mode)
    `- HP SAS Expander card (FW 2.10)
        |- Hitachi HDS72404 } md0
        |- Hitachi HDS72404 } md0
        |- HGST HDN724040AL } md0
        |- HGST HDN724040AL } md0
        |- ST8000AS0002-1NA (btrfs)
        |- ST8000AS0002-1NA (btrfs)
        `- ST8000AS0002-1NA (xfs)

The four Hitachi/HGST drives comprise the md-raid array; the Seagate drives are unrelated to md-raid but are also affected by the root problem (though btrfs doesn't seem to care as much).
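A tree like this can be pieced together from the SAS transport information the kernel exposes; a rough sketch (assuming lsscsi is installed):

lsscsi --transport            # lists each target with its SAS address
ls /sys/class/sas_expander/   # shows the expander sitting between HBA and drives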

Here's what I've done so far, after many hours of research and experimenting; none of it has helped much:

Run the following at boot, raising the per-device SCSI command and error-handler timeouts and letting the kernel manage drive start/stop:

# raise the SCSI command timer (default 30s) to 90s
for f in /sys/block/sd?/device/timeout; do
        echo 90 > "$f"
done

# raise the SCSI error-handler timeout (default 10s) to 90s as well
for f in /sys/block/sd?/device/eh_timeout; do
        echo 90 > "$f"
done

# have the kernel issue START STOP UNIT on suspend/resume itself
for f in /sys/class/scsi_disk/*/manage_start_stop; do
        echo 1 > "$f"
done

I've updated my HBA and expander firmware.

I've set any and all timeouts in the HBA BIOS config utility to 90 seconds.

Yet timeouts still happen quite predictably during hard drive wake-up (spin-up) from standby, after between 11 and 12 seconds. (I suspect a 10-second timeout, since that's the default for a lot of timeouts, plus some extra delay.)


Update 2017-10-10

I've now written a script that continually scans dmesg for dropped md member devices and automatically issues mdadm --manage /dev/md0 --re-add /dev/sdX for them. Thanks to the write-intent bitmap, recovery now takes a couple of seconds rather than a day. But this can't possibly be the proper solution to this problem.
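The idea, as a hedged sketch (not my exact script; it assumes journalctl -k carries the kernel log and that the member name can be parsed out of md's failure message):

#!/bin/bash
# follow the kernel log and re-add any member that md fails out of md0;
# with a write-intent bitmap the subsequent resync only takes seconds
journalctl -kf | while read -r line; do
    if [[ $line == *"md/raid:md0: Disk failure on "* ]]; then
        dev=${line#*"Disk failure on "}    # -> "sdc, disabling device."
        dev=${dev%%,*}                     # -> "sdc"
        mdadm --manage /dev/md0 --re-add "/dev/$dev"
    fi
done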

I've also written to Broadcom; maybe they can help.


Update 2017-10-11

I'm in the process of debugging my kernel for possible issues:

--drive put to standby with hdparm -y--
18:16:35 sd 0:0:1:0: [sdb] sd_open
18:16:35 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:35 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:35 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc94ea548
18:16:35 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e0 00
18:16:35 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:35 SCSI DEBUG: scsi_check_sense() continuing default behaviour past line 484 
18:16:35 sd 0:0:1:0: [sdb] tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:35 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 00 00 00 00 00 00 00 00 00 40 e0 00
18:16:35 sd 0:0:1:0: [sdb] tag#0 Sense Key : Recovered Error [current] [descriptor] 
18:16:35 sd 0:0:1:0: [sdb] tag#0 Add. Sense: ATA pass through information available
18:16:35 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:35 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
18:16:35 sd 0:0:1:0: [sdb] sd_release
18:16:35 sd 0:0:1:0: [sdb] sd_check_events
18:16:35 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:35 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bc866e148
18:16:35 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:35 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:35 SCSI DEBUG: scsi_check_sense()=>SUCCESS [nasty midlayer TURs] 
18:16:35 sd 0:0:1:0: tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:35 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:35 sd 0:0:1:0: tag#0 Sense Key : Unit Attention [current] 
18:16:35 sd 0:0:1:0: tag#0 Add. Sense: Power on, reset, or bus device reset occurred
18:16:35 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:35 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
18:16:35 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bc866e148
18:16:35 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:35 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:35 SCSI DEBUG: scsi_check_sense()=>SUCCESS [nasty midlayer TURs] 
18:16:35 sd 0:0:1:0: tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:35 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:35 sd 0:0:1:0: tag#0 Sense Key : Not Ready [current] 
18:16:35 sd 0:0:1:0: tag#0 Add. Sense: Logical unit not ready, initializing command required
18:16:35 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:35 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
--command executed on drive with hddtemp--
18:16:45 sd 0:0:1:0: [sdb] sd_open
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc8669548
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: Inquiry 12 00 00 00 24 00
18:16:45 sd 0:0:1:0: [sdb] tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: Inquiry 12 00 00 00 24 00
18:16:45 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:45 sd 0:0:1:0: Notifying upper driver of completion (result 0)
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc8669548
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
18:16:45 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:45 SCSI DEBUG: scsi_check_sense() continuing default behaviour past line 484 
18:16:45 sd 0:0:1:0: [sdb] tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
18:16:45 sd 0:0:1:0: [sdb] tag#0 Sense Key : Recovered Error [current] [descriptor] 
18:16:45 sd 0:0:1:0: [sdb] tag#0 Add. Sense: ATA pass through information available
18:16:45 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:45 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc8669548
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
18:16:45 SCSI DEBUG: scsi_check_sense() scsi_check_sense 442 
18:16:45 SCSI DEBUG: scsi_check_sense() continuing default behaviour past line 484 
18:16:45 sd 0:0:1:0: [sdb] tag#0 Done: SUCCESS Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 08 2e 00 00 00 00 00 00 00 00 00 00 00 ec 00
18:16:45 sd 0:0:1:0: [sdb] tag#0 Sense Key : Recovered Error [current] [descriptor] 
18:16:45 sd 0:0:1:0: [sdb] tag#0 Add. Sense: ATA pass through information available
18:16:45 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:45 sd 0:0:1:0: Notifying upper driver of completion (result 8000002)
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:45 sd 0:0:1:0: [sdb] tag#0 Send: scmd 0xffff989bc8669548
18:16:45 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
18:16:53 sd 0:0:1:0: [sdb] tag#0 Done: TIMEOUT_ERROR Result: hostbyte=DID_OK driverbyte=DRIVER_OK
18:16:53 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
18:16:53 sd 0:0:1:0: [sdb] tag#0 scsi host busy 1 failed 0
18:16:53 sd 0:0:1:0: [sdb] tag#0 abort scheduled
18:16:53 sd 0:0:1:0: [sdb] tag#0 aborting command
18:16:53 sd 0:0:1:0: attempting task abort! scmd(ffff989bc8669548)
18:16:53 sd 0:0:1:0: [sdb] tag#0 CDB: ATA command pass through(16) 85 06 20 00 d8 00 00 00 00 00 4f 00 c2 00 b0 00
18:16:53 scsi target0:0:1: handle(0x000a), sas_address(0x5001438020b9ee10), phy(16)
18:16:53 scsi target0:0:1: enclosure_logical_id(0x5001438020b9ee25), slot(51)
18:16:57 sd 0:0:1:0: task abort: SUCCESS scmd(ffff989bc8669548)
18:16:57 sd 0:0:1:0: [sdb] tag#0 finish aborted command
18:16:57 sd 0:0:1:0: Notifying upper driver of completion (result 30000)
18:16:57 sd 0:0:1:0: [sdb] sd_release
18:16:57 sd 0:0:1:0: [sdb] sd_check_events
18:16:57 sd 0:0:1:0: scsi_block_when_processing_errors: rtn: 1
18:16:57 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:57 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:57 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:57 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:57 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:57 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:57 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:57 sd 0:0:1:0: unblocking device at zero depth
18:16:57 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:57 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:58 sd 0:0:1:0: unblocking device at zero depth
18:16:58 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:58 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:58 sd 0:0:1:0: unblocking device at zero depth
18:16:58 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:58 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:58 sd 0:0:1:0: unblocking device at zero depth
18:16:58 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:58 sd 0:0:1:0: tag#0 Done: NEEDS_RETRY Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: tag#0 Inserting command ffff989bd1de9148 into mlqueue
18:16:58 sd 0:0:1:0: unblocking device at zero depth
18:16:58 sd 0:0:1:0: tag#0 Send: scmd 0xffff989bd1de9148
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 mpt2sas_cm0: log_info(0x31110101): originator(PL), code(0x11), sub_code(0x0101)
18:16:58 sd 0:0:1:0: tag#0 Done: SUCCESS Result: hostbyte=DID_SOFT_ERROR driverbyte=DRIVER_OK
18:16:58 sd 0:0:1:0: tag#0 CDB: Test Unit Ready 00 00 00 00 00 00
18:16:58 sd 0:0:1:0: tag#0 scsi host busy 1 failed 0
18:16:58 sd 0:0:1:0: Notifying upper driver of completion (result b0000)
18:16:58 sd 0:0:1:0: device_block, handle(0x000a)
18:16:59 sd 0:0:1:0: device_unblock and setting to running, handle(0x000a)
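
The SCSI DEBUG: scsi_check_sense() lines are from my own debug printk()s. For reference, a stock kernel can produce somewhat similar mid-layer tracing through its SCSI logging bitmask; a sketch, assuming CONFIG_SCSI_LOGGING is enabled:

# set all ten 3-bit SCSI log areas to their maximum of 7; extremely verbose,
# so only enable it around a short test window
sysctl -w dev.scsi.logging_level=1073741823
# ...reproduce the spin-up timeout, then switch it back off
sysctl -w dev.scsi.logging_level=0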

What I find especially worrying about this is

18:16:53 sd 0:0:1:0: [sdb] tag#0 Done: TIMEOUT_ERROR Result: hostbyte=DID_OK driverbyte=DRIVER_OK

immediately leading to

18:16:53 sd 0:0:1:0: [sdb] tag#0 abort scheduled
18:16:53 sd 0:0:1:0: [sdb] tag#0 aborting command

I'd like to know where that timeout is defined and how to change it.


Update 2017-10-13

By debugging I encountered the following timeouts in practice:

  • 7s
  • 15s
  • 20s
  • 90s (as set in /sys/block/sd?/device/timeout)
  • 180s (seems to be double the previous setting)

Additional timeouts are defined in the kernel sources:

./include/linux/blkdev.h:

#define BLK_DEFAULT_SG_TIMEOUT  (60 * HZ)
#define BLK_MIN_SG_TIMEOUT  (7 * HZ)

./include/scsi/scsi.h:

#define FORMAT_UNIT_TIMEOUT     (2 * 60 * 60 * HZ)
#define START_STOP_TIMEOUT      (60 * HZ)
#define MOVE_MEDIUM_TIMEOUT     (5 * 60 * HZ)
#define READ_ELEMENT_STATUS_TIMEOUT (5 * 60 * HZ)
#define READ_DEFECT_DATA_TIMEOUT    (60 * HZ )

These get applied in ./block/scsi_ioctl.c functions sg_scsi_ioctl(...) and blk_fill_sghdr_rq(...).

This explains where the short 7s timeout comes from: blk_fill_sghdr_rq(...) clamps any nonzero caller-supplied timeout shorter than BLK_MIN_SG_TIMEOUT up to that 7-second floor.

The 15s and 20s timeouts seem to come from sg_io_hdr->timeout in blk_fill_sghdr_rq(...); that field is filled in by whatever userspace program issues the SG_IO ioctl, so these values are presumably picked by the individual tools, though I haven't found out which tool sets what.
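Since the field is caller-controlled, the mechanism can be demonstrated from the shell; a sketch using sg_raw from sg3_utils, whose --timeout option (in seconds) ends up in sg_io_hdr->timeout:

# TEST UNIT READY (CDB 00 00 00 00 00 00) with a generous 60-second budget
# should survive an ~11s spin-up, whereas a command submitted with a tiny
# timeout is only clamped up to BLK_MIN_SG_TIMEOUT (7s) and gets aborted
# mid wake-up
sg_raw --timeout=60 /dev/sdb 00 00 00 00 00 00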

Cobra_Fast
  • There seems to be a somewhat related older thread on the Gentoo forums, without any real solution either: https://forums.gentoo.org/viewtopic-t-969756-start-0.html – Cobra_Fast Oct 08 '17 at 20:17
  • https://github.com/zfsonlinux/zfs/issues/4713 seems to be a relatively recent and active discussion by ZFS users encountering a very similar problem. These people refer to a possibly related Linux kernel commit: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/drivers/scsi/scsi_error.c?id=14216561e164671ce147458653b1fea06a4ada1e – Cobra_Fast Oct 10 '17 at 23:55
  • Increasing `#define BLK_MIN_SG_TIMEOUT (7 * HZ)` to higher values seems to fix the problem, but I'm not sure about side effects. It also hasn't been changed since 2007 so the problem is likely coming from elsewhere. – Cobra_Fast Oct 13 '17 at 20:57
  • You might want to read [my earlier answer on the famous google labs report](https://serverfault.com/a/223043/55514) on SMART. It boils down to: SMART is fairly reliable when it says your drive is going to fail; it's not reliable when it tells you everything's fine. This drive is on the way out, and should be replaced, SMART output notwithstanding. – MadHatter Feb 04 '19 at 08:50
  • @MadHatter *this* drive? I find it highly unlikely that suddenly *all* of my drives (a random drive affected every time, different manufacturers and production batches) are *completely* faulty (a random sector number every time) just because I connected them to a different kind of controller, and that the faults only manifest when they wake up from standby/sleep. Additionally, they work completely fine with the aforementioned kernel patch. I appreciate the thought, but I'm gonna keep my stance that this time around it's not a drive defect but a firmware or software problem. – Cobra_Fast Feb 04 '19 at 16:25
  • Entirely fair point. I admit to having focussed on your use of SMART data to declare the drive "*is in pristine health*"; I still strongly advise against sole reliance on SMART data for that purpose, but as long as you have other reasons to think the drives are OK, your analysis holds water. I wouldn't spin down the drives on my NAS, though; the Linux kernel will want to wake the drive every 30s to sync atime updates and increment logfiles, so it's fairly pointless. – MadHatter Feb 04 '19 at 19:42
  • @MadHatter atime only changes if a file was accessed, for which the drive usually needed to be up in the first place. If nothing happens on the drives, no atimes need to be updated. Logfiles all go on the OS SSD. It does work in practice and lets me sleep comfortably at night without spinning rust whirring away in the other corner. – Cobra_Fast Feb 04 '19 at 21:18
  • Apparently it doesn't work in practice, hence the question. But I take your point, though I still wouldn't do it myself. I hope this gets fixed one day! – MadHatter Feb 04 '19 at 22:06

1 Answer


Surely it's just that the drive really is faulty.

You are seeking a complex answer in timeouts and spin-up behaviour, when the reality is:

[97148.490321] print_req_error: I/O error, dev sdc, sector 225357808

The controller could not read or write a specific sector on the drive. The cache would normally accept a write while the spin-up proceeds.

Normally this is only ever seen on genuinely faulty drives, irrespective of what smartctl says.

Does swapping the drive for a different one make a difference?

David McNeill
  • You shouldn't be suggesting using a virtual device as a member of a RAID. – kasperd Feb 03 '19 at 00:04
  • I checked that, but it was always a different drive and a very different sector. SMART self-tests didn't find any broken sectors on any of the member drives either. In any case, raising the `BLK_MIN_SG_TIMEOUT` kernel constant turned out to be a viable workaround for the problem, even if it's inconvenient to have to compile my own kernel. I also haven't tested newer kernels without the patch (it's been over a year), so I don't know whether it's been fixed in the meantime. – Cobra_Fast Feb 03 '19 at 15:38