Externally attached ZFS pool hangs, with no sign of errors on the drives


I have an array of five 1 TB WD Red drives in an external enclosure behind a SATA multiplexer. This is fed into a desktop machine through a matching SATA multiplexer controller card.

After about a year of service (this has now happened twice) the array starts to reset itself, as shown in this video. There is no indication that any particular drive is at fault; the whole enclosure shuts down and every drive in the array disconnects at once.

I have two such enclosures, and the fault always follows the array when I move the drives from one to the other. The enclosures have remained constant for years, as have the interface cards, but installing new drives has solved the issue for roughly another year each time.

It could be dozens of things, from a noisy power supply slowly killing the drives' power circuitry to a bad ZFS implementation in the OS, and it is hard to know where to even start. What strategy would let me find out what the problem actually is?

  • OS: CentOS 7.0, kernel: 3.10.0

  • Enclosure: SiI 3726 multiplexer

  • Interface card: SiI 3132 demultiplexer

  • Hard Drives: WD10EFRX
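For reference, the exact running kernel and the loaded ZFS module version can be confirmed with something like the following (a minimal sketch; the /sys path and the "ZFS:" log prefix assume a standard ZFS-on-Linux install on CentOS 7):

$ uname -r                        # exact running kernel, e.g. 3.10.0-xxx.el7.x86_64
$ cat /sys/module/zfs/version     # version of the currently loaded ZFS kernel module
$ modinfo zfs | grep -i version   # same information from the module metadata
$ dmesg | grep 'ZFS:'             # the module also logs its version at load time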

Messages:

When the resets are occurring:

[ttt.tttt] ata4.03: exception Emask 0x100 SAct 0x0 SErr 0x0 action 0x6 frozen
[ttt.tttt] ata4.03: failed command: WRITE DMA EXT
[ttt.tttt] ata4.03: cmd 35/00:.. ...:00/e0 tag 3 dma 144688 out
[ttt.tttt] ata4.03: status: { Busy }
[ttt.tttt] ata4.03: error: { ICRC UNC AMNF IDNF ABRT }
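To catch these events live with wall-clock timestamps, and keep a copy for later analysis, something along these lines can be left running in a spare terminal while the fault occurs (a sketch; dmesg -w/-T need a reasonably recent util-linux, and journalctl -k is the systemd alternative on CentOS 7):

$ dmesg -wT | tee /var/tmp/ata-resets.log        # follow kernel messages with readable timestamps
$ journalctl -kf | tee /var/tmp/ata-resets.log   # alternative via the systemd journal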

Once zpool has stopped completely:

[ttt.tttt] INFO: task txg_sync:xxxxx blocked for more than 120 seconds
[ttt.tttt] INFO: task zpool:xxxxx blocked for more than 120 seconds

Once this second stage has been reached, terminal commands such as

$ zpool status

hang, the system is essentially useless, and a full reboot is required.
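Once it is in that state, it may still be possible to see where the I/O is stuck without issuing further zpool commands (which will only hang as well). A sketch, assuming root access and a stock CentOS 7 kernel with /proc/<pid>/stack available:

$ ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'   # tasks in uninterruptible sleep (state D)
$ cat /proc/$(pgrep -o txg_sync)/stack             # kernel stack of the stuck txg_sync thread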

The problem does not correlate with a drop in the supply voltages to the drives, as can be seen in the latest video. I think a key piece of information is that the box itself is resetting: all of its lights, even its own power light, go out and come back.

The messages in dmesg are vast, much too long to attach in full.
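Rather than attaching it all, the errors can be tallied per ATA link; if they concentrate on a single ataX.YY device the fault is probably one drive or slot, whereas a roughly even spread across all five points at the shared path (port multiplier, eSATA cable, controller or power). A rough sketch:

$ dmesg | grep -oE 'ata[0-9]+\.[0-9]+: (exception|failed command|hard resetting link)' | sort | uniq -c | sort -rn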

Output from badblocks:

$ badblocks -vn /dev/sdp1
irq_stat 0x00060002, device error via SDB FIS
SError: { Handshk }
failed command: WRITE FPDMA QUEUED
cmd 61/...
res 41/... ...Emask 0x410 (ATA bus error) <F>
status: { DRDY ERR }
error: { ICRC ABRT }

And this occurs equally for all five drives in the array. It is as if the box is getting overloaded and resetting itself.
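One way to test the overload theory is to compare a single-drive run against all five drives being exercised at once through the same enclosure: if one drive alone stays clean but the errors only appear under concurrent load, the shared link/power path is implicated rather than any individual disk. A sketch, with the sdX names standing in for the actual pool members (read-only, so safe on an exported pool):

$ badblocks -sv /dev/sdl   # read-only sweep of a single drive first

# then all five in parallel, to stress the multiplexer, cable and PSU together
$ for d in sdl sdm sdn sdo sdp; do badblocks -sv /dev/$d > /var/tmp/bb-$d.log 2>&1 & done; wait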

Update: 2017-12-06

All drives were moved to a second enclosure, on a USB 3.0 interconnect rather than eSATA.

  • Enclosure: ICY BOX IB-3810U3
    • Multiplexer chip: ASMedia ASM1074L
  • Server motherboard USB3 host: Gigabyte GA-B85-HD3 SKT 1150

With all drives moved to the new enclosure, the badblocks command was run on each drive without a single error. The pool was then imported and a scrub run; no errors were found and the scrub completed successfully. Today, however, the following message was logged for five drives (it was impossible to tell whether they were the drives of this pool/tank/array):

WARNING: Your hard drive is failing
Device: /dev/sdk [SAT], unable to open device
WARNING: Your hard drive is failing
Device: /dev/sdl [SAT], unable to open device
WARNING: Your hard drive is failing
Device: /dev/sdm [SAT], unable to open device
WARNING: Your hard drive is failing
Device: /dev/sdn [SAT], unable to open device
WARNING: Your hard drive is failing
Device: /dev/sdo [SAT], unable to open device

After this, an attempt to list the contents of the pool locked up the terminal, and a new terminal locked up on any zpool command. top lists txg_sync and a horde of z_rd_int_x processes that all have some CPU usage. Two other pools are successfully serving files over Samba, with one apparently continuing to resilver itself (evidenced only by the HDD lights, since zpool status hangs).
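Whether /dev/sdk through /dev/sdo actually belong to this pool can be checked by matching serial numbers, since sdX names are not stable across disconnects. A minimal sketch (the by-id symlinks encode model and serial; smartctl -i prints the serial of a raw device, and USB-attached drives may additionally need -d sat):

$ ls -l /dev/disk/by-id/ | grep -v part     # map persistent IDs (model + serial) to current sdX names
$ smartctl -i /dev/sdk | grep -i serial     # read the serial straight off one device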

smartctl data: 2017-12-12

As suggested by a commenter, the following is the smartctl data for UDMA_CRC_Error_Count.

For the second iteration of the array, currently failing:

4193, 4030, 3939, 2869, 3977

For the original array (with drive three having been swapped out):

3003, 3666,    0, 4536, 5309

For a RAID 0 stripe in the same enclosure, with the same connectivity:

 523,  504,  526,  553,  476

For a ZFS mirror with hot spare hosted inside the host machine:

   0,    0,    0

On a Seagate Archive drive, which reports seemingly nonsensical values:

Temperature_Celsius     40 (0 16 0 0 0)
UDMA_CRC_Error_Count    0
Head_Flying_Hours       57501022168585

This potentially just goes to show that eSATA and USB 3.0 are inherently noisy and some data corruption is inevitable.
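For reference, counts like these can be collected for a whole enclosure in one go with something along these lines (device letters are illustrative; the raw value is the last field of the attribute line, and USB-attached drives may need -d sat):

$ for d in /dev/sd[k-o]; do echo -n "$d: "; smartctl -A $d | awk '/UDMA_CRC_Error_Count/ {print $NF}'; done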

J Collins

Posted 2017-11-30T14:21:46.640

Reputation: 412

The ZFS driver is tied to your kernel. The current kernel is 4.14; do you have that version? The more current your kernel, the less likely it is the ZFS driver. Does dmesg log anything different during these periods? – cybernard – 2017-11-30T14:46:50.433

I am using CentOS, and by uname -r I believe the version is 3.10.0; however, it reports being fully up to date. dmesg does go quite mad: in any open terminal window, messages reporting hardware issues ('action 0x6 frozen', etc.) interleave with the normal screen output. – J Collins – 2017-11-30T14:56:19.790

You are trapped by the CentOS distro: they say 3.10.0 is current, but it is not. http://www.kernel.org is the official source, and it is possible to download and compile a new kernel yourself. They list 3.10 as EOL (end of life), and 3.10.108, not 3.10.0, is the newest version of that branch. If you're going for a new kernel, 4.14.3 is your best option. One of the reasons I switched to openSUSE with the Tumbleweed repo is that they are rarely more than about a month behind.

– cybernard – 2017-11-30T16:10:41.120

While I appreciate the comment, and have since researched installing a new kernel on CentOS, is there an aspect of the newer versions that you think might be more robust and avoid this issue? – J Collins – 2017-12-04T12:00:57.520

The kernel has literally been updated dozens of times since your old kernel. Although I have not read every changelog, I know from experience how big a deal a new kernel version is; I have been keeping 5+ Linux VMs updated over the years. Are you still having the issue? Did the dmesg events change? – cybernard – 2017-12-04T12:50:38.000

I have updated the question with more information: the seemingly salient excerpts from dmesg, as well as a video of the live output while the fault is occurring. Note that the on-screen messages are not requested; they override the terminal screen. – J Collins – 2017-12-04T17:00:49.367

"ICRC"; // INTERFACE CRC ERROR,"UNC"; // UNCORRECTABLE,"AMNF"; // ADDRESS MARK NOT FOUND and "IDNF"; // ID NOT FOUND Sound like one of the drives has at least 1 if not more bad sectors. – cybernard – 2017-12-04T17:40:59.143

Have you done zpool scrub rpool? Note: this consumes all available I/O, so plan downtime accordingly. – cybernard – 2017-12-04T17:50:57.373

Is 'rpool' a special keyword? Generally, when it gets into this bad state, all I/O on the array is suspended. At that point any 'zpool' call locks up that terminal and I need a new one (Alt+F2, etc.). Also, scrubbing activity seems to be interrupted by the drive enclosure restart. Perhaps a drive accumulating bad sectors is not being handled well by the multiplexer? – J Collins – 2017-12-04T18:11:07.400

I agree that all the LEDs going out is suspicious... though it depends heavily on what is driving them. What supply rails is the scope connected to? I'd suspect that the SATA link(s) are falling over - can you connect the disks directly and remove the enclosure from the equation? – Attie – 2017-12-04T19:31:54.290

Am I understanding correctly that you have moved all 5 disks to another enclosure, and the problem still occurs? – Attie – 2017-12-04T19:32:50.207

The scope is on the +5 V and +12 V rails coming straight from the power supply. There could be a bad connection after that point, but these voltages look solid. One thing I got from analysing those videos is that all the drive lights go out together a split second before it fails; that's why I was suspecting the power supply, but it seems good. In the past I have moved a misbehaving array to a new, identical enclosure and it took the fault with it. I am about to try moving this array to a different model of box, though historically the fault has moved to this box as well. – J Collins – 2017-12-04T19:44:20.607

A first test of one drive in this new box using badblocks looks solid so far. – J Collins – 2017-12-04T19:49:40.107

@JCollins You probably need to replace rpool with the name of your pool or volume. – cybernard – 2017-12-04T23:56:20.817

I ran the scrub, and have updated the question with new information. – J Collins – 2017-12-06T13:10:20.227

Another thing to check is the smartctl (counter) output. Some of the counters can indicate issues like faulty SATA cables... – Gert van den Berg – 2017-12-07T15:31:44.993

@GertvandenBerg I have updated the question with the smartctl data for UDMA_CRC_Error_Count; my understanding is that this is the best indicator of interconnect failure. – J Collins – 2017-12-12T16:28:26.303

@JCollins: Correct... There have been some errors. If they increase, the cables might be the issue. (It is a lifetime count; a non-zero value just means there have been some errors in the drive's lifetime.) Non-zero values usually show that the cables might be problematic. (SATA/SCSI over USB seems to be common with USB 3.0; I am not sure whether that might result in higher counts...) – Gert van den Berg – 2017-12-14T14:39:36.967

The Archive drive's error count seems correct, the flying hours probably not... (Or it is actually milliseconds or something similar instead of hours.) – Gert van den Berg – 2017-12-14T15:14:15.937

Answers


The SMART statistics indicate that the hard drives have seen CRC errors on their links. (To verify that it is not an issue that has since been resolved, you should monitor the value of UDMA_CRC_Error_Count over time; it is a total of the errors over the disk's lifetime.)

The cases where I have previously seen this involved bad SATA cables. Swapping the cable resolved the issue (the counters keep their values, but the values then stay constant). This is quite a complex setup, though, and the problem might be in a cable, in the mux/demux, or somewhere in the enclosure.
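A simple way to watch the counter over time is to append a timestamped snapshot periodically (for example from cron) and check whether the values keep climbing; the device paths here are placeholders:

$ for d in /dev/sd[k-o]; do printf '%s %s %s\n' "$(date -Is)" "$d" "$(smartctl -A $d | awk '/UDMA_CRC_Error_Count/ {print $NF}')"; done >> /var/log/crc-count.log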

Gert van den Berg

Posted 2017-11-30T14:21:46.640

Reputation: 538

I did suspect bad cables and, having a load of spares, replaced all of them. Not so with the USB 3.0 unit, as by that point I had discounted cables as the source of error. Are these mux/demux chips designed pretty close to the limit of the electrical standard, and do they even have an accepted error rate as standard? – J Collins – 2017-12-14T15:15:54.173

If the drives were moved over to the USB enclosure, they would still carry the error counts from previous issues... (The change in the values is what matters.) I'm not especially familiar with the muxes/demuxes; it might even be the PCB layout... It is a complex setup with many components; ideally each item would need to be tested separately to identify the exact problem. (I would assume that the port expanders retransmit the signal.) – Gert van den Berg – 2017-12-14T15:39:37.710

The eSATA cable might be the more likely issue, since it is common to all the drives. 'Demux' does not seem like the correct term for the SiI 3132; it seems to be a plain SATA interface IC. (One way of testing is to connect a single drive with unimportant data on the same SATA cable and run dd if=/dev/disk of=/dev/null and dd if=/dev/zero of=/dev/disk (the latter will wipe it) to generate lots of activity while checking for counter increases.) – Gert van den Berg – 2017-12-14T15:45:35.277