
I'm running Ubuntu 14.04 with ZoL (ZFS on Linux) v0.6.5.4:

root@box ~# dmesg | egrep "SPL|ZFS"
[   34.430404] SPL: Loaded module v0.6.5.4-1~trusty
[   34.475743] ZFS: Loaded module v0.6.5.4-1~trusty, ZFS pool version 5000, ZFS filesystem version 5

ZFS is configured as raidz2 across 6x 2TB Seagate SpinPoint M9T 2.5" drives, with a read cache (L2ARC), deduplication, and compression enabled:

root@box ~# zpool status -v
  pool: bigpool
 state: ONLINE
config:

        NAME                                           STATE     READ WRITE CKSUM
        bigpool                                        ONLINE       0     0     0
          raidz2-0                                     ONLINE       0     0     0
            ata-ST2000LM003_HN-M201RAD_S37<redactedid> ONLINE       0     0     0
            ata-ST2000LM003_HN-M201RAD_S37<redactedid> ONLINE       0     0     0
            ata-ST2000LM003_HN-M201RAD_S37<redactedid> ONLINE       0     0     0
            ata-ST2000LM003_HN-M201RAD_S37<redactedid> ONLINE       0     0     0
            ata-ST2000LM003_HN-M201RAD_S37<redactedid> ONLINE       0     0     0
            ata-ST2000LM003_HN-M201RAD_S34<redactedid> ONLINE       0     0     0
        cache
          sda3                                         ONLINE       0     0     0
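
For reference, the dedup and compression settings can be confirmed directly with zfs get (a quick check against the pool name shown above; output omitted here):

root@box ~# zfs get dedup,compression bigpool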

Every few days, the box will lock up, and I'll get errors such as:

blk_update_request: I/O Error, dev sdh, sector 764218200
blk_update_request: I/O Error, dev sdf, sector 764218200
blk_update_request: I/O Error, dev sde, sector 764218200
blk_update_request: I/O Error, dev sdd, sector 764218200
blk_update_request: I/O Error, dev sdc, sector 764218432
blk_update_request: I/O Error, dev sdg, sector 764218200
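
The kernel reports devices by their sdX names while zpool status lists them by-id, so mapping the two confirms which raidz2 members are throwing errors (standard udev symlinks, nothing pool-specific assumed):

root@box ~# ls -l /dev/disk/by-id/ | grep ata-ST2000LM003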

smartctl shows that none of the disks are recording any SMART errors, and they're all fairly new. I also find it odd that they're all failing on the same sector (with the exception of sdc). I was able to grab a screenshot of the terminal (I can't SSH in once the errors start):

[screenshot: console errors]
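
For reference, the per-drive SMART check looked roughly like this (standard smartctl flags; the device list is taken from the errors above):

root@box ~# for d in sdc sdd sde sdf sdg sdh; do echo "=== /dev/$d ==="; smartctl -H -l error /dev/$d; done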

Could this be a failing controller, or a bug in ZFS?
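
In the meantime, one thing I can try between lockups is a scrub, which should surface checksum errors if ZFS itself is detecting corruption (standard commands; a scrub on a pool this size will take hours):

root@box ~# zpool scrub bigpool
root@box ~# zpool status -v bigpool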

dymk
  • Were you able to track down the cause of these errors? I'm in a [similar situation](http://serverfault.com/questions/789194/zfs-checksum-errors-when-do-i-replace-the-drive), and I'm having a hard time figuring out what the underlying issue is. – Dominic P Jul 13 '16 at 19:56

1 Answer


You have a controller, cabling, or backplane problem. Note how all of the drives are impacted at the same time...

I'd also caution against using deduplication on a setup like this unless it's totally necessary.
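
If dedup does stay enabled, it's worth checking how large the dedup table (DDT) has grown, since a DDT that no longer fits in ARC/L2ARC causes severe stalls under load (a quick check with zdb; histogram output varies by pool):

zdb -DD bigpool          # per-pool DDT statistics, in-core and on-disk sizes
zpool get dedupratio bigpool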

ewwhite
  • Thank you, but how did you arrive at this being a controller/cabling/backplane problem? Is it just the timing of the failure that indicates this? Is there something I can do to confirm that this is indeed the issue, and not something software based (causing all drives to fail at the same time)? – dymk Feb 20 '16 at 22:15
  • ZFS doesn't fail like that. I doubt you have a software issue. It's far more likely to be your hardware: specifically the controller or whatever is between your disks and the controller. – ewwhite Feb 20 '16 at 22:17
  • I've swapped out the backplane, RAID controller, cabling, and motherboard (the only parts kept from the previous server are the PSUs, RAM, CPUs, and disks), and I'm still getting the same errors: https://i.imgur.com/EzFU3D3.png Could this possibly be a CPU or RAM issue? – dymk Mar 19 '16 at 17:15
  • You have ext4 errors too. So this is probably still a hardware problem. – ewwhite Mar 19 '16 at 17:18
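
On the CPU/RAM question: one way to rule memory in or out is a dedicated burn-in, e.g. memtest86+ from boot, or memtester from userspace while the box is otherwise idle (the size and pass count below are only an example):

memtester 4G 3    # lock and test 4GB of RAM for 3 passes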