
ZFS is reporting some read errors, which would suggest this disk is failing, even though none of the conditions described in the ZFS-8000-9P document have occurred as far as we are aware. These disks are fairly new; the only issue we had recently was a full ZFS pool.

ZFS runs on top of an LSI MegaRAID 9271-8i, with every disk configured as a single-drive RAID 0 virtual drive. I am not very familiar with this RAID card, so I used a script that returns data derived from the megacli command-line tool. I added one drive below to show the setup; they are all set up the same (the system disks are different).

zpool status output

  pool: data
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        data        ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            br0c2   ONLINE       0     0     0
            br1c2   ONLINE       0     0     0
            br2c2   ONLINE       0     0     0
            br0c3   ONLINE       0     0     0
            br1c3   ONLINE       0     0     0
            br2c3   ONLINE       0     0     0
            r2c1    ONLINE       0     0     0
            r1c2    ONLINE       0     0     0
            r5c3    ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            r3c1    ONLINE       0     0     0
            r4c1    ONLINE       2     0     0
... cut raidz2-1 ...
errors: No known data errors

The output of the LSI script:

Virtual Drive: 32 (Target Id: 32)
Name                :
RAID Level          : Primary-0, Secondary-0, RAID Level Qualifier-0
Size                : 3.637 TB
Sector Size         : 512
Is VD emulated      : No
Parity Size         : 0
State               : Optimal
Strip Size          : 512 KB
Number Of Drives    : 1
Span Depth          : 1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Default Access Policy: Read/Write
Current Access Policy: Read/Write
Disk Cache Policy   : Disk's Default
Encryption Type     : None
PI type: No PI

Is VD Cached: No
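
For reference, the wrapper script basically shells out to MegaCli; queries like the following produce this kind of output (the MegaCli64 binary name and the /opt path are whatever the LSI package installs, so they may differ on your system):

# list all virtual drives on adapter 0 (the source of the dump above)
/opt/MegaRAID/MegaCli/MegaCli64 -LDInfo -Lall -a0

# list the physical drives behind them, including their error counters
/opt/MegaRAID/MegaCli/MegaCli64 -PDList -a0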

The script doesn't report any faulty disk, nor does the RAID controller mark the drive as faulty. I found some other topics on zpool errors that advised clearing the error and running a scrub. My questions: at what threshold should I run a scrub, and how long would it take (assuming this ZFS array will take a performance hit while the scrub runs)? Also, if this disk really is faulty, will hot-swapping it initiate a rebuild? All the disks are "Western Digital RE 4TB, SAS II, 32MB, 7200rpm, enterprise 24/7/365".

Is there a system that will automatically check for ZFS errors? This was just a routine manual check.

ZFS version: 0.6.4.1 (ZFS on Linux)

I know 2 read errors are not a lot, but I'd rather replace disks too early than too late.

SvennD
  • RAID-0 is usually slower than JBOD mode. Are you using an expander for the disks, and if so, what type/brand? – Jeroen Jun 15 '15 at 08:48

2 Answers


I'd do what ZFS tells you to do in this case. Please run a scrub.

I scrub my systems weekly on a schedule. I also use the zfswatcher daemon to monitor the health of Linux ZFS installs.
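
As an example, a weekly scrub can be driven by a plain cron entry; the pool name data is taken from the question, the day and time are arbitrary:

# /etc/cron.d/zfs-scrub -- scrub the pool every Sunday night
0 2 * * 0  root  /sbin/zpool scrub data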

Your ZFS array is probably untuned, so there are some values that can help improve scrubbing performance, but at this point, you should just run it.
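
For reference, on ZoL 0.6.x the scrub rate is governed by module parameters like the ones below; the values are only illustrative and these knobs were renamed or replaced in later releases, so treat this as a sketch:

# make scrub I/O more aggressive on ZoL 0.6.x (revert once the scrub is done)
echo 0   > /sys/module/zfs/parameters/zfs_scrub_delay
echo 256 > /sys/module/zfs/parameters/zfs_top_maxinflight
echo 0   > /sys/module/zfs/parameters/zfs_scan_idle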

And for the other question, your hot swap probably won't do what you expect it to... See rant below.


rant:

Having a bunch of RAID-0 virtual drives behind a hardware controller is a bad idea!

You have the worst of both worlds. Recoverability and error checking are limited. A failed disk is essentially a failed virtual drive, and there are hot-swap implications. Let's say you remove the disk(s) in question: you'd likely need to create a new virtual disk, or you may end up with different drive enumeration.

At a certain point, it's better to get a real HBA and run the disks as true passthrough devices (with no RAID metadata), or just run ZFS on top of vdevs protected by hardware arrays. E.g. run a RAID-6 on your controller and install ZFS on top. Or run multiple RAID-X groups and have ZFS mirror or stripe the resulting vdevs.
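
For instance, with two hardware RAID-6 groups exposed to the OS as /dev/sdx and /dev/sdy (hypothetical device names), the last option would look roughly like this:

# let the controller handle parity, let ZFS mirror (or stripe) the two virtual drives
zpool create data mirror /dev/sdx /dev/sdy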

ewwhite
  • Thanks for zfswatcher. I thought hardware RAID under ZFS wasn't a very good idea, and the alternative was RAID 0 on every disk :( now you say it's quite the opposite? – SvennD Jun 15 '15 at 13:31
  • Yes, that is correct. – ewwhite Jun 15 '15 at 13:34
  • Ah, I found out why: this RAID controller doesn't support [JBOD](http://serverfault.com/questions/335144/megaraid-jbod-substitute). Removing a RAID 0 disk is indeed a bit more difficult ... [source](https://calomel.org/megacli_lsi_commands.html) Using the RAID controller's cache seems like a good thing? Though we cannot use ZFS if we go with the 2x RAID 6 setup, and it has some nice features (like detecting bit rot) that we would like to have. – SvennD Jun 15 '15 at 13:39
  • While a bunch of RAID 0 drives is not perfect, it is not the worst of both worlds. ZFS is better at managing data failures and coordinating writes than the RAID controller. Using the RAID controller's redundancy features makes it more likely that you could have an unrecoverable data error, by hiding the real disks from ZFS. – janm Nov 19 '18 at 15:33

zpool scrub is the "system that will check for ZFS errors". It will take as long as it takes to read all the data stored in the pool (it goes in sequential txg order, so it can seek a lot, depending on how full the pool is and how the data was written). Once started, zpool status will show an estimate. A running scrub can be stopped.
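
Concretely, using the pool name from the question:

zpool scrub data       # start the scrub
zpool status data      # shows progress and an ETA once it is running
zpool scrub -s data    # stop the scrub again if the performance hit is too big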

If you want something to periodically check zpool status, the simplest way would be to run something like zpool status | grep -C 100 status periodically (say once every 6 hours) and email the output if there is any. You could probably find a plugin for your favourite monitoring system, like Nagios. Or it would be pretty straightforward to write yourself.
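
A minimal sketch of such a check, relying on zpool status -x (which prints "all pools are healthy" when nothing is wrong) and on cron mailing any output to its owner:

#!/bin/sh
# run from cron every 6 hours; it only produces output (and therefore mail)
# when at least one pool is not healthy
zpool status -x | grep -v 'all pools are healthy'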

Just hot-swapping the drive will not trigger a resilver. You will have to run zpool replace for that to happen.
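
Something along these lines, with the device path being an example of course:

# resilver onto the newly inserted disk, given here by a placeholder /dev/disk/by-id path
zpool replace data r4c1 /dev/disk/by-id/NEW-DISK-ID
# (if the replacement shows up under the same name, "zpool replace data r4c1" alone is enough)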

The read errors you are seeing may well be some kind of controller mishap. Even though it's enterprise hardware, these (HW RAID) controllers sometimes behave strangely, and the errors may, for example, be the result of a command taking too long because the controller is busy with something else. That's why I try to stay away from them unless necessary.

I'd go with checking the SMART data on the drive (see man smartctl) and scrubbing the pool. If both look OK, clear the errors and do not mess with your pool, because if the pool is nearly full, reading all the data during a resilver can actually trigger another error. Start panicking once you see errors on the same drive again ;).
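
Because the disks sit behind the MegaRAID controller, smartctl needs the megaraid device type to reach them; the device id and node below are examples:

# SMART data for the physical disk with controller device id 12
smartctl -a -d megaraid,12 /dev/sda

# if SMART and the scrub both look clean, reset the ZFS error counters
zpool clear data r4c1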

BTW, for best performance you should use 2^n + 2 drives in RAIDZ2 vdevs (i.e. 4, 6, 10 or 18 disks, so that the number of data drives is a power of two).

Fox
  • Would you scrub with 2 errors? We had 34 disks and we preferred space over speed. Thank you for your advice! – SvennD Jun 15 '15 at 10:04
  • Well, 2 errors on one drive do not make the whole pool unusable, even in the event of another drive failing. But it really depends on how you feel about your data. Scrubbing will tell you if there are any inconsistencies and will bring to light any still-undiscovered errors on this and other drives. If you care about your data very much, you probably should scrub regularly. If you can recover from any amount of errors (got backups?), you can skip it. – Fox Jun 15 '15 at 11:26
  • We do care about the data; we just don't want to break another disk by scrubbing when we can just replace a disk ... We have a backup, but considering it's about ~100TB, I don't want to push my luck. – SvennD Jun 15 '15 at 13:23