6

I am backing up data stored in a zpool consisting of a single raidz vdev with 2 hard disks. During this operation, I got checksum errors, and now the status looks as follows:

  pool: tmp_zpool
 state: ONLINE
status: One or more devices has experienced an error resulting in data
    corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
    entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: none requested
config:

    NAME                  STATE     READ WRITE CKSUM
    tmp_zpool             ONLINE       0     0     2
      raidz1-0            ONLINE       0     0     4
        tmp_cont_0        ONLINE       0     0     0
        tmp_cont_1        ONLINE       0     0     0

errors: Permanent errors have been detected in the following files:

        /some/file

What I find confusing is that the checksum errors appear at the vdev level, but not at the disk level. Perhaps I should note that one of the hard disks is internal and the other is external (this is a temporary situation). Could this be an issue with the hard drive controllers?

Is there anything I could try in order to get back the affected file? For example, clearing the error and importing the pool degraded with only one of the disks? I haven't even tried to read the file again to see what happens. (I'm not sure whether that would affect anything.)

Update: I gave up waiting for an explanation of what might go wrong if I cleared the errors and retried, so I went ahead and tried it. I first ran zpool clear, after which zpool status showed no errors. Then I tried to read the files with errors (two of them in the end), but the respective blocks were still reported as bad/unreadable. This time, zpool status no longer showed increasing checksum errors. Next, I tried taking one of the disks in the raidz1 vdev offline and repeating the process, but the results did not change. In total, I lost two 128K blocks out of 1.6T.
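
For reference, the sequence of commands was roughly the following (pool and device names as in the status output above):

    # clear the logged errors, then re-check the pool status
    zpool clear tmp_zpool
    zpool status -v tmp_zpool

    # force a read of an affected file
    cp /some/file /dev/null

    # take one disk offline, retry the read, then bring it back online
    zpool offline tmp_zpool tmp_cont_1
    cp /some/file /dev/null
    zpool online tmp_zpool tmp_cont_1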

Answer Status: Currently, I find there is no comprehensive answer to this question. If somebody wants to write one up or edit an existing one, please address the following:

  1. What could have caused this situation.
  2. What could be done about it.
  3. How it could have been prevented.

For 1, the theories and their problems seem to be:

  • Choice of raidz1 over raidz2. Problem: one needs a minimum of 4 disks for raidz2. While the need for redundancy is clear, it is not useful to repeatedly suggest that the cure for failing redundancy is more redundancy. It would be much more useful to understand how to best use the redundancy you have.

  • Choice of raidz1 over mirror. Problem: at first sight, the difference between these seems to be one of efficiency, not redundancy. This might be wrong, though. Here is why: ZFS saves a checksum with each block on each disk, yet neither disk reported individual checksum errors. This seems to suggest that for every bad block, the two disks contained different block payloads, each with a matching checksum, and ZFS was unable to tell which is correct. That in turn suggests there were two different checksum calculations, and that the payload somehow changed between them. This could be explained by RAM corruption, and maybe (this needs confirmation) with mirror instead of raidz1, only one checksum would have been needed. (A sketch of how each layout would be created follows after this list.)

  • RAM corruption during writing, not reading. As explained above, this seems plausible. Problem: why was this not detected as an error at write time? Can it be that ZFS doesn't verify what it writes? Or rather, that the block payloads written to the different disks are the same?
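
For comparison, here is roughly how each layout discussed above would be created (device paths are placeholders):

    # 2-disk mirror (what the comments suggest instead of a 2-disk raidz1)
    zpool create tmp_zpool mirror /dev/sdX /dev/sdY

    # 2-disk raidz1 (the layout I actually used)
    zpool create tmp_zpool raidz1 /dev/sdX /dev/sdY

    # 4-disk raidz2 (the layout the data is ultimately being moved to)
    zpool create tmp_zpool raidz2 /dev/sdW /dev/sdX /dev/sdY /dev/sdZ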

For 2:

  • Since the disks have no individual checksum errors, is there some low-level way in ZFS to gain access to the two different copies of such bad blocks? (A hedged zdb sketch follows below.)
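
The only tool I am aware of that might get at the raw on-disk data is zdb, though I have not verified that this works on a damaged raidz1; the dataset name, object number, and DVA below are placeholders:

    # the object number of a file should equal its inode number
    ls -i /some/file

    # dump that object's block pointers to find the DVAs of the bad block
    # (dataset name and object number are placeholders)
    zdb -ddddd tmp_zpool/some_dataset 12345

    # read a raw block directly, given as vdev:offset:size (hex values are placeholders)
    zdb -R tmp_zpool 0:4e2d10000:20000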

For 3:

  • Is it clear that mirror over raidz1 would have prevented this situation?

  • I assume a scrub of this zpool would have detected the problem. In my case, I was moving some data around, and I destroyed the source data before I actually read back this zpool, thinking that I had 2-disk redundancy. Would the moral here be to scrub a zpool before trusting its contents? Surely scrubbing is useful, but is it necessary? For instance, would a scrub be necessary with mirror instead of raidz1? (A sketch of the scrub commands follows below.)
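
For completeness, the scrub itself and the follow-up check would just be:

    # scrub the pool, then check for any reported errors
    zpool scrub tmp_zpool
    zpool status -v tmp_zpool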

Matei David
  • Hmm, RAID Z1 with *two* disks doesn't even make sense. Why weren't mirrors used? – ewwhite Aug 07 '15 at 19:32
  • I confess I didn't know better. I mean, I understand the need for redundancy, but not the difference between mirror and raidz1 over 2 devices. This is a temporary setup anyhow, part of moving the data to a raidz2 over 4 devices. – Matei David Aug 07 '15 at 20:15
  • @MateiDavid Some of the differences between a two-device raidz1 and a two-device mirror are that with a mirror, either side can satisfy any read request, while with raidz1 the parity data needs to be specifically calculated (as opposed to a mirror, which just writes the same thing to all mirrored devices). So with 2-dev raidz1, compared to 2-dev mirror, you lose flexibility (you can't add or remove devices in the vdev) and you potentially lose I/O performance (because of the redundancy data calculations). For a start... – user Aug 07 '15 at 20:57
  • Ok but all this is not that important for a temporary storage, right? Redundancy-wise they should be the same. – Matei David Aug 08 '15 at 03:15

2 Answers

3

This is the problem with raidz1 (and also RAID5). If the data on one disk changes but no drive fault occurs to tell ZFS or the RAID controller which drive caused the error, then it cannot know which drive holds the correct data. With raidz2 (and higher) or RAID6, you get a quorum of drives that can decide which drive's data to ignore during reconstruction.

Your only solution here is to overwrite the file, either by restoring it from a backup copy or by copying /dev/null over it (see the sketch below).
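
A rough sketch of both options (paths are placeholders; note the comment below about snapshots that may still reference the bad blocks):

    # restore the damaged file from a backup copy, if one exists
    cp /backup/some/file /some/file

    # otherwise truncate it by copying /dev/null over it, then clear the pool errors
    cp /dev/null /some/file
    zpool clear tmp_zpool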

longneck
  • Because ZFS is CoW, wouldn't overwriting also require that no snapshots refer to the bad blocks? Otherwise the snapshots will point to the bad data which will be traversed during any following scrub. – user Aug 07 '15 at 22:25
  • I'm also a bit confused about your saying that ZFS can't know which drive is correct. Wouldn't the same checksums that were used in detecting the error also be useful in determining which drive, *if either*, has valid data? Though maybe this becomes more difficult in a two-device raidz1 setup? (Note: I'm not arguing that your answer is wrong; I just think it could use a slight clarification in the specific case of ZFS.) – user Aug 07 '15 at 22:27
  • In addition to the clarification mentioned above, I would also ask: could I clear the errors, take one of the drives offline, and try to read that file again? What does it mean for the vdev to have errors but not the individual drives? If both drives are fine but discordant (I don't understand how that could happen, but let's say it did), can I get the two different copies of the file and decide by hand which one is correct? – Matei David Aug 08 '15 at 03:39
  • The checksums ZFS uses on all data/metadata blocks should allow ZFS to know which data is the correct one, and ZFS should correct the wrong data. If ZFS cannot do it, maybe the data was already written wrong (that's one reason ECC memory is recommended, so that RAM bit errors cannot corrupt the checksums). – Sunzi Aug 09 '15 at 09:04
0

I'm running into a similar issue. I'm not sure if it's helpful, but I found this relevant post about vdev-level checksum errors from a FreeBSD developer.

https://lists.freebsd.org/pipermail/freebsd-hackers/2014-October/046330.html

The checksum errors will appear on the raidz vdev instead of a leaf if vdev_raidz.c can't determine which leaf vdev was responsible. This could happen if two or more leaf vdevs return bad data for the same block, which would also lead to unrecoverable data errors. I see that you have some unrecoverable data errors, so maybe that's what happened to you.

Subtle design bugs in ZFS can also lead to vdev_raidz.c being unable to determine which child was responsible for a checksum error. However, I've only seen that happen when a raidz vdev has a mirror child. That can only happen if the child is a spare or replacing vdev. Did you activate any spares, or did you manually replace a vdev?

I myself am considering deleting my zpool.cache file and re-importing my pool so that the cache file is regenerated (a rough sketch is below).
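
This is only what I have in mind, not something I have verified helps; the pool name is a placeholder and /etc/zfs/zpool.cache is assumed to be the cache file location:

    # export the pool, remove the stale cache file, then re-import by scanning devices
    zpool export mypool
    rm /etc/zfs/zpool.cache
    zpool import -d /dev/disk/by-id mypool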

user260467
  • I don't see what possible help reimporting the zpool would be. – Michael Hampton Sep 02 '16 at 04:22
  • @MichaelHampton Any way to specifically find and discard the bad files? Something with `zdb`? – user260467 Sep 02 '16 at 05:03
  • Um... are you trying to answer the question or ask a question? – Michael Hampton Sep 02 '16 at 05:15
  • Both--the piece of information I posted might be useful to someone who understands more, but I too am theorizing about possible fixes for my own similar situation. – user260467 Sep 02 '16 at 05:28
  • This is not a forum and doesn't work that way. Nobody will see or respond to your question here, if posted as an answer. I've left this up _because_ it seems to provide at least part of an answer to the original question, but if you have a new question, you should post it as such. – Michael Hampton Sep 02 '16 at 13:48