How to interpret failure data provided by SMART and zfs

Question

In a small server system, I have a zfs file system with a mirrored pair of consumer grade drives (Seagate Barracudas). Recently, during a periodic scrub operation the following result was given:

  pool: storage
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
        attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
        using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 10.9M in 44h14m with 0 errors on Tue Jun  6 00:11:23 2017
config:

        NAME          STATE     READ WRITE CKSUM
        storage       ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            map2_sda  ONLINE       0     0     0
            map2_sdb  ONLINE       0     0    55

errors: No known data errors

There have been a few power failures and similar events between this scrub operation and the previous one, which I think may be a plausible cause of the failure, but I worry about the possibility that it is an impending hardware fault, particularly given that one disk was entirely clean and the other had multiple errors.

smartctl tells me that the suspect drive has had a total of 117 errors during its lifetime (of 935 days), but the most obvious error indicators are all well clear of their threshold values:

SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   109   081   006    Pre-fail  Always       -       22737688
  5 Reallocated_Sector_Ct   0x0033   092   092   010    Pre-fail  Always       -       9784
  7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       213798923
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22599
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
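
One rough way to check the "well clear of their threshold values" reading is to compute, for each attribute, how far the normalized VALUE and WORST columns sit above THRESH. The sketch below is only an illustration: it assumes the whitespace-separated column layout shown above, and the function name `smart_margins` is made up for this example. It ignores RAW_VALUE because the meaning of that column is vendor-specific.

```python
# Sketch only: parses the smartctl attribute rows quoted above.
# Assumes the 10-column, whitespace-separated layout from this post.
SMART_TABLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   109   081   006    Pre-fail  Always       -       22737688
  5 Reallocated_Sector_Ct   0x0033   092   092   010    Pre-fail  Always       -       9784
  7 Seek_Error_Rate         0x000f   083   060   030    Pre-fail  Always       -       213798923
  9 Power_On_Hours          0x0032   075   075   000    Old_age   Always       -       22599
 10 Spin_Retry_Count        0x0013   100   100   097    Pre-fail  Always       -       0
"""


def smart_margins(table):
    """Return {attribute_name: (value - thresh, worst - thresh)} per row."""
    margins = {}
    for line in table.splitlines():
        fields = line.split()
        # Data rows start with a numeric attribute ID; this skips the header.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name = fields[1]
        value, worst, thresh = (int(f) for f in fields[3:6])
        margins[name] = (value - thresh, worst - thresh)
    return margins


for name, (now, at_worst) in smart_margins(SMART_TABLE).items():
    print(f"{name:24} margin now {now:4d}, at worst {at_worst:4d}")
```

A margin of zero or below would mean the attribute has reached its threshold. None of the rows above has done so.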

Does anything here indicate that I need to be preemptively replacing this disk? I don't need 100% uptime on this machine, but would rather not have to worry about the multiple days of resilvering that would be required if I did have to replace the disk in an emergency situation.

Can you post the complete `smartctl` output of *both* disks? — shodanshok, Jun 12 '17 at 05:49
Related topic regarding the cause of Checksum errors: https://serverfault.com/questions/789194/zfs-checksum-errors-when-do-i-replace-the-drive — user121391, Jun 12 '17 at 09:46

score 2 · Answer 1 · answered Jun 12 '17 at 03:58

I wouldn't panic if I were you, and I certainly wouldn't replace the drive right away (that actually puts you in a dicier situation: only one drive, nearly three years running, carrying a 44+ hour resilver). I'd put the biggest drive I could reasonably afford into a free slot and add it to the pool (not as a spare, but as a three-way mirror). When (if) one of the other two failed first, I'd replace it with another big one and grow the pool, which is one of the nicer features of zfs. But that's just me.

Old, but see Google's experience with SMART, drive failure rates, heat, age...

This is a mirror vdev. You can easily add a new drive while both old are still online and providing data, then remove one of the old ones to get back to a two-way mirror. In fact, I'm pretty sure that's exactly what `zpool replace` will do for you. — user, Jun 12 '17 at 14:30

score 1 · Answer 2 · answered Jun 12 '17 at 09:58

Checksum errors are far less critical than read or write errors. While read/write errors indicate that a block could not be read or written at all (which is most likely because it is permanently damaged), checksum errors just mean that what was received is not what should have been received (according to ZFS' own checksums).
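
To make that distinction concrete, here is a minimal sketch that reads the READ/WRITE/CKSUM columns from the `zpool status` config section quoted in the question and labels each device accordingly. It assumes the five-column layout shown there and plain integer counters. The helper names `error_counters` and `classify` are made up for this example.

```python
# Sketch only: works on the config section pasted in the question.
# Assumes five whitespace-separated columns and plain integer counters.
ZPOOL_CONFIG = """\
        NAME          STATE     READ WRITE CKSUM
        storage       ONLINE       0     0     0
          mirror-0    ONLINE       0     0     0
            map2_sda  ONLINE       0     0     0
            map2_sdb  ONLINE       0     0    55
"""


def error_counters(config):
    """Return {device_name: {"state", "read", "write", "cksum"}}."""
    rows = {}
    for line in config.splitlines():
        fields = line.split()
        # Skip the header and anything that is not a five-column row.
        if len(fields) != 5 or fields[0] == "NAME":
            continue
        name, state, read, write, cksum = fields
        rows[name] = {
            "state": state,
            "read": int(read),
            "write": int(write),
            "cksum": int(cksum),
        }
    return rows


def classify(counts):
    """Label a device by the kind of error it reported, if any."""
    if counts["read"] or counts["write"]:
        return "read/write errors: a block could not be read or written"
    if counts["cksum"]:
        return "checksum errors only: data arrived but failed the ZFS checksum"
    return "clean"


for name, counts in error_counters(ZPOOL_CONFIG).items():
    print(f"{name:10} {classify(counts)}")
```

For the output in the question, only `map2_sdb` lands in the "checksum errors only" category; everything else is clean.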

You may want to investigate the cause of the errors:

Have they happened before, or is this the first time?
Has anything happened to the machine (somebody moved it, touched it, replaced other hardware)?
Were there unexpected reboots, power losses, or other power supply events (if your devices allow you to monitor that)?
What are the heat and shock conditions inside the case for both disks?
Do the two disks differ in any way (different cables, different positions in the case relative to the cables, different controllers, etc.)?
Does anything odd show up in any of the available logs?

If you cannot find anything AND you see additional (possibly increasing or high) numbers of checksum errors, you may want to replace the disk. You can do that by first adding a third disk to the mirror, as quadruplebucky suggested, and resilvering it in the off-hours. Any additional load on the machine will slow down the resilvering. Depending on the disk, the "good" disk alone might even resilver faster than both together, but only if the "bad" one is really bad (which I don't assume).
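
The order of operations for that approach can be sketched as follows. This only builds and prints the commands; it runs nothing. The pool and disk names come from the question, while `new_disk_id` is a placeholder for whatever the new drive is called on your system, and the function name is made up for this example. The detach step is meant to be run only after the resilver has finished and only if you have decided the suspect disk should go.

```python
# Sketch only: prints a command plan, does not execute anything.
import shlex


def three_way_then_detach(pool, healthy_disk, suspect_disk, new_disk):
    """Return the zpool commands, in order, as argument lists."""
    return [
        # Attach the new disk next to an existing mirror member,
        # which turns the two-way mirror into a three-way mirror.
        ["zpool", "attach", pool, healthy_disk, new_disk],
        # Check on the resilver; repeat until it reports completion.
        ["zpool", "status", pool],
        # Optional, and only after the resilver is done:
        # drop the suspect disk to get back to a two-way mirror.
        ["zpool", "detach", pool, suspect_disk],
    ]


# "new_disk_id" is a hypothetical device name.
plan = three_way_then_detach("storage", "map2_sda", "map2_sdb", "new_disk_id")
for argv in plan:
    print(shlex.join(argv))
```

Keeping all three disks online during the resilver means the pool never drops below two copies of the data.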

The second sentence could also be rephrased to state that "while read/write errors indicate that the disk knows it's having difficulties, checksum errors mean that something was read back that was different from what was originally written or intended to be written, but it was read back cleanly by the disk". I'm not sure that's generally better than actual errors. — user, Jun 12 '17 at 14:31
@MichaelKjörling Yes, from the perspective of the applications/system, silent corruption would be much worse than not delivering anything. My answer was more in regard to the question "what is more likely to indicate an imminent hardware failure of the disk", where read/write failures mean that either the disk refuses to work on this sector or the whole communication link is broken. It is maybe similar in idea to analog shortwave radio vs. digital: you get some static and broken/garbled sound, but at least you get *something*, so the sender cannot be completely dead. — user121391, Jun 13 '17 at 07:17
