ZFS pool continually DEGRADED or FAULTED

Question

I've got a pool with a single raidz1 vdev (raidz1-0) of 5 drives. I'm not sure exactly when, but all of a sudden all the drives went from always being ONLINE with no read, write or checksum errors to randomly spitting out all sorts of issues.

    NAME                                            STATE     READ WRITE CKSUM
    Data                                            DEGRADED     0     0     0
      raidz1-0                                      DEGRADED   149   185     0
        gptid/905fe084-a003-11e9-9d12-000c29c8a62a  DEGRADED    57   127     5  too many errors
        gptid/2b75693a-9f09-11e9-8310-000c29c8a62a  ONLINE       7     5     5
        gptid/b8b4dd8f-82e9-11eb-b23f-000c29c8a62a  DEGRADED    70   171     5  too many errors
        gptid/b88beac0-e1f3-11e7-aeb0-000c29c8a62a  DEGRADED    51     6    14  too many errors
        gptid/4eb702b3-e2c3-11e7-9896-000c29c8a62a  FAULTED      8    13     2  too many errors

I've done some basic troubleshooting:

- SMART shows that everything is fine (apart from temperatures around 40C, which is warmer than I'd like). So the drives look like they're in good shape: no bad sectors, no pending sectors, nothing out of the ordinary. All of the drives have been spinning for ~3 years at this point.
- Each of the drives is connected directly to the motherboard via its own SATA connection. I've reseated and replaced the SATA cables with no success.

At some point, I replaced the 3rd disk in the pool. At the time, it was spitting out the most errors and would always be the first to go into a DEGRADED state. I replaced it with a brand new drive, which has been running for months now and immediately picked up the same issue as the rest of the pool.

Even after a zpool clear, about 5 hours later I had the following status.

    NAME                                            STATE     READ WRITE CKSUM
    Data                                            DEGRADED     0     0     0
      raidz1-0                                      DEGRADED     1     0     0
        gptid/905fe084-a003-11e9-9d12-000c29c8a62a  ONLINE       2     4     0
        gptid/2b75693a-9f09-11e9-8310-000c29c8a62a  ONLINE       0     0     0
        gptid/b8b4dd8f-82e9-11eb-b23f-000c29c8a62a  FAULTED      1    11     0  too many errors
        gptid/b88beac0-e1f3-11e7-aeb0-000c29c8a62a  ONLINE       1     1     0
        gptid/4eb702b3-e2c3-11e7-9896-000c29c8a62a  ONLINE       1     6     0
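
To make it easier to see whether the errors are confined to one disk or spread across the whole pool, here is a minimal Python sketch that parses the table above and counts how many leaf devices report any error. The function names (`parse_status`, `devices_with_errors`) are made up for illustration, and it assumes the disks are listed by `gptid/` label and that the counters are plain integers (large counts that ZFS abbreviates, such as `1.2K`, would not match).

```python
import re

# One device row of `zpool status`: name, state, then READ / WRITE / CKSUM.
# Any trailing note such as "too many errors" is ignored.
ROW = re.compile(
    r"^\s*(\S+)\s+(ONLINE|DEGRADED|FAULTED|OFFLINE|UNAVAIL|REMOVED)"
    r"\s+(\d+)\s+(\d+)\s+(\d+)"
)

def parse_status(text):
    """Return {device_name: (read, write, cksum)} for leaf disks only."""
    devices = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if not m:
            continue  # header line or anything that is not a device row
        name, _state, read, write, cksum = m.groups()
        # Assumption: leaf disks appear under a gptid/ label, as in the output above.
        if name.startswith("gptid/"):
            devices[name] = (int(read), int(write), int(cksum))
    return devices

def devices_with_errors(devices):
    """Names of the disks that have at least one non-zero counter."""
    return [name for name, counts in devices.items() if any(counts)]

# The status captured about 5 hours after `zpool clear`.
STATUS = """\
    NAME                                            STATE     READ WRITE CKSUM
    Data                                            DEGRADED     0     0     0
      raidz1-0                                      DEGRADED     1     0     0
        gptid/905fe084-a003-11e9-9d12-000c29c8a62a  ONLINE       2     4     0
        gptid/2b75693a-9f09-11e9-8310-000c29c8a62a  ONLINE       0     0     0
        gptid/b8b4dd8f-82e9-11eb-b23f-000c29c8a62a  FAULTED      1    11     0  too many errors
        gptid/b88beac0-e1f3-11e7-aeb0-000c29c8a62a  ONLINE       1     1     0
        gptid/4eb702b3-e2c3-11e7-9896-000c29c8a62a  ONLINE       1     6     0
"""

devs = parse_status(STATUS)
bad = devices_with_errors(devs)
print(f"{len(bad)} of {len(devs)} disks report errors")  # 4 of 5 disks report errors
```

With 4 of the 5 disks showing errors only hours after a clear, including the brand new one, the problem looks like something the disks share rather than any single drive.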

I'm not exactly sure what's going on here or where else to look.

I don't know if it's a coincidence, but I noticed this started to happen after upgrading the ZFS pool as part of one of FreeNAS's updates (I think it was 11.2U - also yeah, I'm running FreeNAS)

The last thing I can think of is a bad SATA controller. But before I get to that, is there anything else I can troubleshoot? This is a hobby home server, and replacing the controller essentially means a whole new server, so I'd like to avoid that if possible. Unfortunately, there aren't any PCIe slots left to install an external controller.

Thanks in advance!

ASRock Fatal1ty B85 Killer https://www.asrock.com/mb/intel/fatal1ty%20b85%20killer/ One of the few motherboards at the time with Xeon support and a number of SATA connectors. (This was before I had learned about an external SATA controller) — SteppingHat, May 07 '21 at 04:23
Multiple concurrent read/write errors are generally related to a controller issue. Your B85 chipset is quite old, and some years ago Intel had significant age-related SATA issues. I would suggest replacing your mainboard or adding a standalone PCI-E SATA controller. — shodanshok, May 08 '21 at 17:18

score 1 · Accepted Answer · answered Jun 02 '21 at 10:58

After almost a month of debugging, it's safe to say that it was indeed the chipset's SATA controller.

@shodanshok brought to my attention that there is a "significant age-related SATA issue" with Intel chipsets, and some extra googling showed that I wasn't the only one affected.

I've bought some new hardware, along with an LSI 9205-8I H220 to connect all the drives to. Without any changes to the configuration (apart from a more modern motherboard + CPU), the ZFS pool was imported with no issue and has been running for a whole day with 0 checksum/read/write errors. On the old board, the counts would have been in the hundreds by now. This confirms that the issue was the onboard SATA controller.

Hope this helps anyone who is experiencing a similar issue!
