ZFS checksum errors, when do I replace the drive?

Question

I'm fairly new to ZFS and I have a simple mirrored storage pool setup with 8 drives. After a few weeks of running, one drive seemed to generate a lot of errors, so I replaced it.

A few more weeks go by and now I'm seeing small errors crop up all around the pool (see the zpool status output below). Should I be worried about this? How can I determine if the error indicates the drive needs to be replaced?

# zpool status
  pool: storage
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 22.5K in 1h18m with 0 errors on Sun Jul 10 03:18:42 2016
config:

        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            enc-a   ONLINE       0     0     2
            enc-b   ONLINE       0     0     0
          mirror-1  ONLINE       0     0     0
            enc-c   ONLINE       0     0     0
            enc-d   ONLINE       0     0     2
          mirror-2  ONLINE       0     0     0
            enc-e   ONLINE       0     0     2
            enc-f   ONLINE       0     0     1
          mirror-3  ONLINE       0     0     0
            enc-g   ONLINE       0     0     0
            enc-h   ONLINE       0     0     3

errors: No known data errors

ZFS helpfully tells me to "Determine if the device needs to be replaced..." but I'm not sure how to do that. I did read the referenced article which was helpful but not exactly conclusive.

I have looked at the SMART test results for the affected drives, and nothing jumped out at me (all tests completed without errors), but I can post the SMART data as well if it would be helpful.

Update: While preparing to reboot into Memtest86+, I noticed a lot of errors on the console. I normally SSH in, so I didn't see them before. I'm not sure which log I should have been checking, but the entire screen was filled with errors that look like this (not my exact error line, I just copied this from a different forum):

blk_update_request: I/O error, dev sda, sector 220473440

From some Googling, it seems like this error can be indicative of a bad drive, but it's hard for me to believe that they are all failing at once like this. Thoughts on where to go from here?

Update 2: I came across this ZOL issue that seems like it might be related to my problem. Like the OP there, I am using hdparm to spin down my drives, and I am seeing similar ZFS checksum errors and blk_update_request errors. My machine is still running Memtest, so I can't check my kernel or ZFS version at the moment, but this at least looks like a possibility. I also saw this similar question, which is kind of discouraging. Does anyone know of issues with ZFS and spinning down drives?

Update 3: Could a mismatched firmware and driver version on the LSI controller cause errors like this? It looks like I'm running a driver version of 20.100.00.00 and a firmware version of 17.00.01.00. Would it be worthwhile to try to flash updated firmware on the card?

# modinfo mpt2sas
filename:       /lib/modules/3.10.0-327.22.2.el7.x86_64/kernel/drivers/scsi/mpt2sas/mpt2sas.ko
version:        20.100.00.00
license:        GPL
description:    LSI MPT Fusion SAS 2.0 Device Driver
author:         Avago Technologies <MPT-FusionLinux.pdl@avagotech.com>
rhelversion:    7.2
srcversion:     FED1C003B865449804E59F5

# sas2flash -listall
LSI Corporation SAS2 Flash Utility
Version 20.00.00.00 (2014.09.18) 
Copyright (c) 2008-2014 LSI Corporation. All rights reserved 

    Adapter Selected is a LSI SAS: SAS2308_2(D1) 

Num   Ctlr            FW Ver        NVDATA        x86-BIOS         PCI Addr
----------------------------------------------------------------------------

0  SAS2308_2(D1)   17.00.01.00    11.00.00.05    07.33.00.00     00:04:00:00
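
For anyone who wants to script the comparison in Update 3, here is a minimal Python sketch. It only compares the leading field of the two version strings quoted above. Treating that leading field as the part that ought to match is an assumption on my part, and `major_version` is a hypothetical helper, not something provided by `modinfo` or `sas2flash`.

```python
def major_version(version: str) -> int:
    """Return the leading numeric field of a dotted version string."""
    return int(version.split(".")[0])


# Values copied from the `modinfo mpt2sas` and `sas2flash -listall` output above.
driver_version = "20.100.00.00"
firmware_version = "17.00.01.00"

# Assumption: only the leading field is compared; the other fields are ignored.
mismatch = major_version(driver_version) != major_version(firmware_version)

print(
    f"driver major={major_version(driver_version)}, "
    f"firmware major={major_version(firmware_version)}, "
    f"mismatch={mismatch}"
)
```

A mismatch reported this way is only a hint that the versions are worth looking at; it does not by itself show that the controller is at fault.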

Update 4: Caught some more errors in the dmesg output. I'm not sure what triggered these, but I noticed them after unmounting all of the drives in the array in preparation for updating the LSI controller's firmware. I'll wait a bit to see if the firmware update solved the problem, but here are the errors in the meantime. I'm not really sure what they mean.

[87181.144130] sd 0:0:2:0: [sdc] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87181.144142] sd 0:0:2:0: [sdc] CDB: Write(10) 2a 00 35 04 1c d1 00 00 01 00
[87181.144148] blk_update_request: I/O error, dev sdc, sector 889461969
[87181.144255] sd 0:0:3:0: [sdd] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87181.144259] sd 0:0:3:0: [sdd] CDB: Write(10) 2a 00 35 04 1c d1 00 00 01 00
[87181.144263] blk_update_request: I/O error, dev sdd, sector 889461969
[87181.144371] sd 0:0:4:0: [sde] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87181.144375] sd 0:0:4:0: [sde] CDB: Write(10) 2a 00 37 03 87 30 00 00 08 00
[87181.144379] blk_update_request: I/O error, dev sde, sector 922978096
[87181.144493] sd 0:0:5:0: [sdf] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87181.144500] sd 0:0:5:0: [sdf] CDB: Write(10) 2a 00 37 03 87 30 00 00 08 00
[87181.144505] blk_update_request: I/O error, dev sdf, sector 922978096
[87191.960052] sd 0:0:6:0: [sdg] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87191.960063] sd 0:0:6:0: [sdg] CDB: Write(10) 2a 00 36 04 18 5c 00 00 01 00
[87191.960068] blk_update_request: I/O error, dev sdg, sector 906238044
[87191.960158] sd 0:0:7:0: [sdh] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87191.960162] sd 0:0:7:0: [sdh] CDB: Write(10) 2a 00 36 04 18 5c 00 00 01 00
[87191.960179] blk_update_request: I/O error, dev sdh, sector 906238044
[87195.864565] sd 0:0:0:0: [sda] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87195.864578] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 37 03 7c 68 00 00 20 00
[87195.864584] blk_update_request: I/O error, dev sda, sector 922975336
[87198.770065] sd 0:0:1:0: [sdb] FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_OK
[87198.770078] sd 0:0:1:0: [sdb] CDB: Write(10) 2a 00 37 03 7c 88 00 00 20 00
[87198.770084] blk_update_request: I/O error, dev sdb, sector 922975368
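
One way to read the lines above: in a WRITE(10) command (opcode `2a`), bytes 2 to 5 of the CDB hold the big-endian logical block address and bytes 7 to 8 hold the transfer length. The Python sketch below decodes the CDB lines copied from this log and groups the devices by LBA. The helper names (`decode_write10`, `by_lba`) are made up for illustration. How `sdX` maps to the `enc-X` mirror members is not shown in the question, so reading the matching pairs as two sides of the same mirror is an assumption.

```python
import re
from collections import defaultdict

# CDB lines copied from the dmesg output above.
DMESG = """\
[87181.144142] sd 0:0:2:0: [sdc] CDB: Write(10) 2a 00 35 04 1c d1 00 00 01 00
[87181.144259] sd 0:0:3:0: [sdd] CDB: Write(10) 2a 00 35 04 1c d1 00 00 01 00
[87181.144375] sd 0:0:4:0: [sde] CDB: Write(10) 2a 00 37 03 87 30 00 00 08 00
[87181.144500] sd 0:0:5:0: [sdf] CDB: Write(10) 2a 00 37 03 87 30 00 00 08 00
[87191.960063] sd 0:0:6:0: [sdg] CDB: Write(10) 2a 00 36 04 18 5c 00 00 01 00
[87191.960162] sd 0:0:7:0: [sdh] CDB: Write(10) 2a 00 36 04 18 5c 00 00 01 00
[87195.864578] sd 0:0:0:0: [sda] CDB: Write(10) 2a 00 37 03 7c 68 00 00 20 00
[87198.770078] sd 0:0:1:0: [sdb] CDB: Write(10) 2a 00 37 03 7c 88 00 00 20 00
"""

CDB_RE = re.compile(r"\[(\w+)\] CDB: Write\(10\) ((?:[0-9a-f]{2} ?){10})")


def decode_write10(cdb_hex: str) -> tuple[int, int]:
    """Return (lba, block_count) from a 10-byte WRITE(10) CDB given as hex."""
    cdb = bytes.fromhex(cdb_hex)
    if len(cdb) != 10 or cdb[0] != 0x2A:
        raise ValueError("not a WRITE(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")    # bytes 2-5: logical block address
    count = int.from_bytes(cdb[7:9], "big")  # bytes 7-8: transfer length in blocks
    return lba, count


# Group device names by the LBA their failed write was aimed at.
by_lba = defaultdict(list)
for dev, cdb_hex in CDB_RE.findall(DMESG):
    lba, _count = decode_write10(cdb_hex)
    by_lba[lba].append(dev)

for lba, devs in by_lba.items():
    print(lba, devs)
```

The decoded LBAs agree with the `sector` values in the `blk_update_request` lines (for example, `35 04 1c d1` is 889461969). Three pairs of devices (sdc/sdd, sde/sdf, sdg/sdh) failed a write to the same LBA within a fraction of a millisecond of each other, which looks more like something shared between the drives than like independent media failures.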

Update 5: I updated the firmware for the LSI controller, but after clearing the ZFS errors and scrubbing, I'm seeing the same behavior (minor checksum errors on a few of the drives). The next step will be updating the firmware on the drives themselves.

Update 6: I replaced the PCI riser after reading in some forums that other people with the U-NAS NSC800 case have had issues with the provided riser. There was no effect on the checksum errors. I have been putting off the HDD firmware update because the process is such a pain, but I guess it's time to suck it up and make a bootable DOS flash drive.

Update 7: I updated the firmware on three of the Seagate drives. The other drives either didn't have a firmware update available or I wasn't able to get it (Western Digital told me there was no firmware update for my drive). No errors popped up after an initial scrub, but I'm going to give it at least a week or two before I say this solved the problem. It seems highly unlikely to me that the firmware in three drives could be affecting the entire pool like this.

Update 8: The checksum errors are back, just like before. I might look into a firmware update for the motherboard, but at this point I'm at a loss. It will be difficult/expensive to replace the remaining physical components (controller, backplane, cabling), and I'm just not 100% sure that it's not a problem with my setup (ZFS + Linux + LUKS + Spinning down idle drives). Any other ideas are welcome.

Update 9: Still trying to track this one down. I came across this question which had some similarities to my situation. So, I went ahead and rebuilt the zpool using ashift=12 to see if that would resolve the issue (no luck). Then, I bit the bullet and bought a new controller. I just installed a Supermicro AOC-SAS2LP-MV8 HBA card. I'll give it a week or two to see if this solves the problem.

Update 10: Just to close this out. It's been about 2 weeks since the new HBA card went in and, at the risk of jinxing it, I've had no checksum errors since. A huge thanks to everyone who helped me sort this one out.

Can you tell us more about the hardware? Having those errors on multiple drives seems to indicate a backplane/controller/cabling problem more than a disk issue. — ewwhite, Jul 11 '16 at 18:16
I hadn't thought of that. The drives are in a [U-NAS NSC-800 chassis](http://www.u-nas.com/xcart/product.php?productid=17617) that came with a built in SATA/SAS backplane. That is connected via 2 mini-sas connectors to an [LSI SAS 9207-8i](https://smile.amazon.com/gp/product/B0085FT2JC/ref=oh_aui_search_detailpage?ie=UTF8&psc=1) HBA. That is connected via a PCI riser that came with the chassis to a [Supermicro MBD-X10SDV-4C](http://www.supermicro.com/products/motherboard/xeon/d/X10SDV-4C-TLN2F.cfm). — Dominic P, Jul 11 '16 at 18:29
Is your RAM okay? I've had similar errors when a memory module was bad - no disk errors, but some (low) amount of checksum errors on all drives. — user121391, Jul 12 '16 at 07:33
Thanks for the tip. I'll run memtest86 tonight and see what I get. I haven't had any other issues that might point to a memory problem, but you never know. — Dominic P, Jul 12 '16 at 19:40
After 18 hours and 2 complete passes of memtest86+ with no errors I feel pretty confident that we aren't looking at faulty RAM here. I'm at a loss now for what could be causing the errors. — Dominic P, Jul 14 '16 at 04:15
Not sure if this helps you, but I seem to have far more in terms of "odd" behavior of SATA HDDs on a LSI 9211 than with SAS HDDs on the same controller and cabling, with ZoL. — user, Jul 15 '16 at 11:34
@MichaelKjörling, thanks that's definitely something to keep in mind for the future. — Dominic P, Jul 18 '16 at 17:12
That it's most likely the controller is blatantly obvious at this point. Actually it was blatantly obvious some time ago, with "Update 4". — Michael Hampton, Aug 11 '16 at 22:15
Thanks @Michael for the help. Could you elaborate a bit on why you are thinking it's the controller instead of the backplane or cabling? Does something in those error messages point to a controller issue? — Dominic P, Aug 15 '16 at 04:42
I just wanted to pop in and say thanks for doing the updates. I know of someone who was having similar issues and did a similar flow of tests (memtest, etc) but is dead set on thinking it's not the controller. Hopefully this will help convince them that it may be the issue. — hak8or, Dec 18 '17 at 02:38
You're welcome @hak8or. The people on this site have helped me so much, so it's nice to hear my example might help someone else. — Dominic P, Dec 19 '17 at 18:22

score 8 · Accepted Answer · answered Jul 14 '16 at 05:35

Having those errors across multiple drives seems to indicate a backplane/controller/cabling problem more than a disk or RAM issue.

ewwhite

Thanks for the help. I am not able to swap out all of those components at the moment. Do you have a suggestion on how I could narrow it down or what might be the most likely culprit? – Dominic P Jul 14 '16 at 17:07
Try firmware updates of all impacted components. Are these SATA disks? – ewwhite Jul 14 '16 at 17:08
Will do, thanks. I'll start with the firmware update on the controller because I have seen elsewhere that the firmware and driver versions should match (see update 3 on my question). Yes, they are all 1TB SATA disks, and I remember that `smartctl` said there was a firmware update available for some of the Seagate disks I'm using, so I'll update them as well. – Dominic P Jul 14 '16 at 17:13

score 7 · Answer 2 · answered Jul 11 '16 at 18:26

My general rule of thumb is that if the error count keeps rising unexpectedly, the disk needs to be replaced; if the count is static, some transient condition probably caused the errors, and the system is no longer reproducing it.

A few checksum errors don't necessarily indicate anything mechanically wrong with the drive (bit rot happens; ZFS just happens to detect it while other filesystems don't), but errors that accumulated over the course of an hour are a very different situation from errors that accumulated over the course of a year.
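
To make the "rising versus static" check concrete, here is a rough Python sketch that pulls the CKSUM column out of two saved `zpool status` outputs and reports which devices went up. `BEFORE` is a trimmed excerpt of the output in the question; `AFTER` is invented purely for illustration. The function names are hypothetical, and the parser assumes the counters are plain integers, which may not hold once counts get large enough to be abbreviated.

```python
# Trimmed excerpt of the `zpool status` output from the question.
BEFORE = """\
        NAME        STATE     READ WRITE CKSUM
        storage     ONLINE       0     0     0
          mirror-0  ONLINE       0     0     0
            enc-a   ONLINE       0     0     2
            enc-b   ONLINE       0     0     0
          mirror-3  ONLINE       0     0     0
            enc-h   ONLINE       0     0     3

errors: No known data errors
"""

# Hypothetical later snapshot, made up for this example.
AFTER = BEFORE.replace(
    "enc-h   ONLINE       0     0     3",
    "enc-h   ONLINE       0     0     7",
)


def cksum_counts(status_text: str) -> dict[str, int]:
    """Map each name in the config table to its CKSUM counter."""
    counts = {}
    seen_header = False
    for line in status_text.splitlines():
        fields = line.split()
        if fields == ["NAME", "STATE", "READ", "WRITE", "CKSUM"]:
            seen_header = True
        elif seen_header and len(fields) == 5 and fields[4].isdigit():
            counts[fields[0]] = int(fields[4])
    return counts


def rising(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Return the increase in CKSUM for every device whose count went up."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


print(rising(cksum_counts(BEFORE), cksum_counts(AFTER)))
```

Keep in mind that `zpool clear` resets the counters, so the two snapshots are only comparable if no clear happened in between.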
