Is this disk about to die?

Question

I have a WD Red 4 TB disk (WD40EFRX-68WT0N0, firmware 82.00A82) that is occasionally showing uncorrectable read errors in the SMART error log, e.g.:

Error 43 [18] occurred at disk power-on lifetime: 13157 hours (548 days + 5 hours)
  When the command that caused the error occurred, the device was active or idle.

  After command completion occurred, registers were:
  ER -- ST COUNT  LBA_48  LH LM LL DV DC
  -- -- -- == -- == == == -- -- -- -- --
  40 -- 51 00 00 00 00 00 02 e9 e0 40 00  Error: UNC at LBA = 0x0002e9e0 = 190944

  Commands leading to the command that caused the error were:
  CR FEATR COUNT  LBA_48  LH LM LL DV DC  Powered_Up_Time  Command/Feature_Name
  -- == -- == -- == == == -- -- -- -- --  ---------------  --------------------
  60 01 00 00 08 00 00 00 02 ea 48 40 00 12d+15:42:14.157  READ FPDMA QUEUED
  60 00 e0 00 00 00 00 00 02 e9 68 40 00 12d+15:42:14.157  READ FPDMA QUEUED
  60 00 e0 00 08 00 00 00 02 e8 88 40 00 12d+15:42:10.216  READ FPDMA QUEUED
  60 01 00 00 00 00 00 00 02 e7 88 40 00 12d+15:42:10.215  READ FPDMA QUEUED
  60 01 00 00 08 00 00 00 02 e6 88 40 00 12d+15:42:07.629  READ FPDMA QUEUED

(full report from smartctl here)
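As a cross-check on the log excerpt above, here is a minimal sketch of how the failing LBA can be recovered from the register dump. It assumes the column layout shown in the smartctl header (six `LBA_48` bytes, most significant first, following the `ER`, `--`, `ST` and two `COUNT` bytes); the function name `decode_lba48` is made up for illustration and is not part of smartctl.

```python
def decode_lba48(register_row: str) -> int:
    """Decode the 48-bit LBA from one smartctl register row.

    Assumes the layout from the log header above:
    ER, --, ST, COUNT (2 bytes), LBA_48 (6 bytes), DV, DC.
    """
    tokens = register_row.split()
    lba_bytes = tokens[5:11]          # the six LBA_48 bytes, most significant first
    return int("".join(lba_bytes), 16)


# Register row copied from the error entry above.
row = "40 -- 51 00 00 00 00 00 02 e9 e0 40 00"
lba = decode_lba48(row)
print(hex(lba), lba)
```

The decoded value agrees with the `0x0002e9e0 = 190944` that smartctl prints at the end of that row.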

With the latest error, zpool status reports the following:

$ zpool status cloudpool
  pool: cloudpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 0B in 3h57m with 0 errors on Wed Oct 17 03:53:57 2018
config:

    NAME                                          STATE     READ WRITE CKSUM
    cloudpool                                     ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17FZXF          ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17H5D3          ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3  ONLINE       1     0     0
        ata-ST4000VN000-2AH166_WDH0KMHT           ONLINE       0     0     0
      mirror-2                                    ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E  ONLINE       0     0     0
        ata-ST3000DM001-1CH166_Z1F1HL4V           ONLINE       0     0     0

errors: No known data errors

(Previously, some runs of zpool scrub reported that they had repaired some data, but this is the first time I'm seeing this status.)

However, running the short, conveyance, and extended SMART tests reveals nothing amiss.

I also thought the Load/Unload cycle count was suspiciously high, but this is a Red drive, not a Green one, and the official tool from WD (wd5741.exe) reports that there is nothing to do.

So do I have a drive that's about to die / needs to be replaced, or is that just normal occasional sector reallocation?

EDIT: I've had an issue with another drive, although I'm using ECC RAM:

  pool: cloudpool
 state: ONLINE
status: One or more devices has experienced an unrecoverable error.  An
    attempt was made to correct the error.  Applications are unaffected.
action: Determine if the device needs to be replaced, and clear the errors
    using 'zpool clear' or replace the device with 'zpool replace'.
   see: http://zfsonlinux.org/msg/ZFS-8000-9P
  scan: scrub repaired 768K in 2h56m with 0 errors on Sun Jan 13 03:20:40 2019
config:

    NAME                                          STATE     READ WRITE CKSUM
    cloudpool                                     ONLINE       0     0     0
      mirror-0                                    ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17FZXF          ONLINE       0     0     0
        ata-ST8000VN0022-2EL112_ZA17H5D3          ONLINE       0     0     0
      mirror-1                                    ONLINE       0     0     0
        ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5NFLRU3  ONLINE       0     0     0
        ata-ST4000VN000-2AH166_WDH0KMHT           ONLINE       0     0     0
      mirror-2                                    ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E  ONLINE       0     0     6
        ata-ST3000DM001-1CH166_Z1F1HL4V           ONLINE       0     0     0

errors: No known data errors
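To keep track of which device is accumulating errors between scrubs, something like the following sketch could pick out the rows with non-zero READ/WRITE/CKSUM counters from `zpool status` text. It is only an illustration: `devices_with_errors` is a hypothetical helper, the sample is an excerpt of the output above, and the parser assumes plain integer counters (it skips any row whose counters are not plain digits).

```python
def devices_with_errors(status_text):
    """Return {device_name: counters} for rows with any non-zero counter."""
    flagged = {}
    in_config = False
    for line in status_text.splitlines():
        fields = line.split()
        # The device table starts after the NAME/STATE header row.
        if fields[:2] == ["NAME", "STATE"]:
            in_config = True
            continue
        if not in_config or len(fields) < 5:
            continue
        name, counters = fields[0], fields[2:5]
        # Skip anything that is not a device row (e.g. the "errors:" line).
        if not all(c.isdigit() for c in counters):
            continue
        read, write, cksum = (int(c) for c in counters)
        if read or write or cksum:
            flagged[name] = {"READ": read, "WRITE": write, "CKSUM": cksum}
    return flagged


# Excerpt of the second zpool status output above.
SAMPLE = """\
    NAME                                          STATE     READ WRITE CKSUM
    cloudpool                                     ONLINE       0     0     0
      mirror-2                                    ONLINE       0     0     0
        ata-WDC_WD30EFRX-68EUZN0_WD-WCC4N3EHHA2E  ONLINE       0     0     6
        ata-ST3000DM001-1CH166_Z1F1HL4V           ONLINE       0     0     0

errors: No known data errors
"""

for name, counts in devices_with_errors(SAMPLE).items():
    print(name, counts)
```

On this excerpt only the WD30EFRX row is reported, with its 6 checksum errors.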

Since the Reallocated_Sector_Ct is 0, maybe the cable is bad. I once had something similar where the disk didn't have bad sectors but had read errors, and it turned out to be a bad cable. — Stone, Oct 18 '18 at 10:28
Hmm thanks for the cable suggestion, I'll take a look at that! As for the long smart test, I did it, and it completed without errors. — FlorentR, Oct 18 '18 at 13:07
Also, pay attention to the PSU and power cabling. If there are any molex/SATA splitters or drive cage power distributors, those are the kind of thing that can cause all sorts of trouble. — Peter Zhabin, Oct 18 '18 at 17:43
Also, is ZFS the only thing detecting errors? If so, it could be a RAM problem that ZFS finds via checksumming the data. It's not likely if it's always the same disk showing the error, but maybe if you have bad memory the read pattern that hits that memory only causes data from that disk to hit the bad memory. Are you using ECC RAM? — Andrew Henle, Oct 19 '18 at 15:03
Andrew - Yes, ZFS seems to be the only thing detecting errors. I've since had a few checksum issues with another drive (see addendum to original post). I am however using ECC RAM, so the problem should not originate from there? — FlorentR, Jan 13 '19 at 16:12
