19

Several permanent errors were reported on my zpool today.

  pool: seagate3tb
 state: ONLINE
status: One or more devices has experienced an error resulting in data
        corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
        entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        seagate3tb  ONLINE       0     0    28
          sda       ONLINE       0     0    56

errors: Permanent errors have been detected in the following files:

        /mnt/seagate3tb/Install.iso
        /mnt/seagate3tb/some-other-file1.txt
        /mnt/seagate3tb/some-other-file2.txt

Edit: I'm not sure if those CKSUM values are accurate. I was redacting data and may have mangled them by mistake; they may have been 0. Unfortunately, I can't find a conclusive answer in my notes, and the errors are resolved now, so I can't say for certain. Everything else accurately reflects what zpool was reporting.

/mnt/seagate3tb/Install.iso is one example file reported as having a permanent error.

Here's where I get confused. If I compare my "permanently errored" Install.iso against a backup of that exact same file on another filesystem, they look identical.

shasum "/mnt/seagate3tb/Install.iso"
1ade72fe65902b2a978e5504aaebf9a3a08bc328  /mnt/seagate3tb/Install.iso
shasum "/mnt/backup/Install.iso"
1ade72fe65902b2a978e5504aaebf9a3a08bc328  /mnt/backup/Install.iso
cmp /mnt/seagate3tb/Install.iso /mnt/backup/Install.iso
diff /mnt/seagate3tb/Install.iso /mnt/backup/Install.iso
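
Neither cmp nor diff prints anything, which means the two files are byte-for-byte identical. To make that explicit, the comparison's exit status can be checked directly; a minimal sketch using cmp's silent mode:

cmp -s /mnt/seagate3tb/Install.iso /mnt/backup/Install.iso && echo "identical" || echo "different"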

The files seem to be identical. What's more, the file works perfectly fine. If I use it in an application, it behaves like I'd expect it to.

As the docs state:

Data corruption errors are always fatal.

But based on my rudimentary file verifications, I'm not sure I understand the definition of fatal.

status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.

action: Restore the file in question if possible. Otherwise restore the entire pool from backup.

Maybe I'm missing something, but the file seems perfectly fine as far as I can tell: it doesn't need any restoration, nor does it show any corruption, despite the recommendation from ZFS.

I've seen other articles with the same error, but I have yet to find an answer to my question.

What is the permanent error with the file? Is there some lower level issue with the file that's just not readily apparent to me? If so, why would that not be detected by a shasum as a difference in the file?

From a layperson's perspective, I see nothing to indicate any error with this file.

Will Haley
  • Do you have snapshots? – ewwhite Sep 02 '16 at 07:07
  • Will, since no one else has said it, may I welcome you to ServerFault? This looks to me like an *excellent* first question, and I hope it continues to glean instructive answers. I hope you decide to stick around SF and contribute further. – MadHatter Sep 02 '16 at 08:54
  • Thank you @MadHatter! I appreciate your kind welcome, and will certainly be sticking around SF. I've already added it to my brief list of SE communities. – Will Haley Sep 02 '16 at 14:54

2 Answers

22

The wording of zpool status is a bit misleading. A permanent error (in this context) indicates that an I/O error has occurred and has been logged to the SPA (Storage Pool Allocator) error log for that pool. This does not necessarily mean there is irrecoverable data corruption.

What you should do is run a zpool scrub on the pool. When the scrub completes, the SPA error log will be rotated and will no longer show errors from before the scrub. If the scrub detects no errors then zpool status will no longer show any "permanent" errors.
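
For reference, a scrub and a follow-up check would look something like this on the pool from the question:

zpool scrub seagate3tb       # starts a background scrub of the whole pool
zpool status -v seagate3tb   # shows scrub progress and lists any files still flagged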

Regarding the documentation, it is saying that only "fatal errors" are logged in this way. A fatal error is an I/O error that could not be automatically corrected by ZFS and therefore was exposed to an application as a failed I/O. By contrast, if the I/O was immediately retried successfully or if the logical I/O was satisfied from a redundant device, it would not be considered a fatal error and therefore would not be logged as a data corruption error.

A fatal error does not necessarily mean permanent data loss; it just means that, at the time, the error could not be fixed before it propagated up to the application. For example, a loose cable or a bad controller could cause temporary fatal errors which ZFS would describe as "permanent." Whether it is truly a problem depends on the nature of the I/O and whether the application is capable of recovering from I/O errors.

EDIT: Fully agree with @bahamat that you should invest in redundancy as soon as possible.

Tom Shaw
  • The SPA error log reporting this as "permanent" does indeed seem a bit misleading. The `zpool scrub` did exactly what you suggested @tom-shaw, and your explanation makes perfect sense. I no longer see any "permanent errors" on this array after the scrub. I didn't think about fatal errors in the context of a failed read. I think it must have just been a temporary I/O error on a read like you suggest. I also totally agree on the need for redundancy. – Will Haley Sep 02 '16 at 14:52
  • Tom, haven't seen you in a while. Welcome back. – the-wabbit Sep 02 '16 at 18:41
7

A permanent error means that there has been a checksum error in the file and there were not sufficient replicas to repair it. In other words, at least one read returned corrupted data due to an I/O error. If whatever received that read then wrote the data back to the same file on disk, you would now have irrecoverable data corruption.

Looking at your pool configuration, it looks like you have no redundancy. This is very dangerous. You don't get any of the self-healing benefits of ZFS, but it will be able to tell you when there has been data corruption. Ordinarily ZFS will automatically and silently correct corrupted reads, but in your case it can't. It also looks like you've already run zpool clear because the CKSUM count is 0 for both drives.

Unfortunately, with no replicas there's really no way to know whether the data was actually damaged.
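
To add redundancy to a single-disk pool like this one, a second disk can be attached to convert the vdev into a mirror. A sketch, assuming a spare disk at least as large as sda (/dev/sdb here is hypothetical):

zpool attach seagate3tb sda sdb   # turns the single-disk vdev into a two-way mirror
zpool status seagate3tb           # resilvering starts automatically; monitor it here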

bahamat
  • Wouldn't `zpool clear` also clear the error message itself, not just the error counts? It is strange that the message persists, yet no errors are shown. – user121391 Sep 02 '16 at 08:59
  • My apologies. I had omitted the files from the list of permanent errors for privacy. In editing that output, I also mangled the CKSUM counts and lost valuable context. I've edited the question to reflect reality. @user121391 – Will Haley Sep 02 '16 at 15:11
  • In that case, if the numbers you show are correct, then you likely have a hardware error somewhere. Since both disks show nonzero CKSUM counts it might be the controller, cable, or any shared hardware between the two disks. It's also possible that *both* disks are failing. In any event this underscores the need to add redundancy ASAP, and to inspect the indicated files for corruption. – bahamat Sep 02 '16 at 16:12
  • OP doesn't seem to have any redundancy; the vdev has 56 CKSUM errors, and the pool has 28 CKSUM errors. So I'm not sure what you were referring to by "both disks" in your previous comment. I agree with your point on the value of redundancy. – user Sep 03 '16 at 14:20
  • You're right. I misread the pool name as if it were another disk. Thanks for pointing that out. – bahamat Sep 03 '16 at 14:54