
Here's the `zpool status` output from one of my Solaris 10 servers after a couple of disk replacements in a zpool:

  pool: volume
 state: ONLINE
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
 scan: resilver in progress since Wed Jul  6 11:43:51 2016
    207M scanned out of 36.7T at 1.48M/s, (scan is slow, no estimated time)
    207M scanned out of 36.7T at 1.48M/s, 7235h37m to go
    13.5M resilvered, 0.00% done
config:

        NAME         STATE     READ WRITE CKSUM
        volume       ONLINE       0     0     0
          raidz2-0   ONLINE       0     0     0
            c4t0d0   ONLINE       0     0     1
            c4t0d1   ONLINE       0     0     0
            c4t0d2   ONLINE       0     0     0
            c4t0d3   ONLINE       0     0     0
            c4t0d22  ONLINE       0     0     0
            c4t0d5   ONLINE       0     0     0
            c4t0d6   ONLINE       0     0     0
            c4t0d23  ONLINE       0     0     0  (resilvering)
            c4t0d15  ONLINE       0     0     0
            c4t0d9   ONLINE       0     0     0
            c4t0d10  ONLINE       0     0     0
          raidz2-1   ONLINE       0     0     4
            c4t0d11  ONLINE       0     0     0
            c4t0d8   ONLINE       0     0     0  (resilvering)
            c4t0d13  ONLINE       0     0     0
            c4t0d14  ONLINE       0     0     0
            c4t0d20  ONLINE       0     0     0
            c4t0d16  ONLINE       0     0     0
            c4t0d4   ONLINE       0     0     0
            c4t0d18  ONLINE       0     0     2
            c4t0d19  ONLINE       0     0     0
            c4t0d17  ONLINE       0     0     0
            c4t0d21  ONLINE       0     0     0

errors: No known data errors

The scan status returns to 0.00% done every 10-15 minutes, restarting the resilver each time. Here is the output of `echo "::zfs_dbgmsg" | mdb -k`.
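
A quick way to confirm the counter reset from the shell (a sketch of my own, not part of the original question) is to poll the scan line of `zpool status`:

    # Sample the resilver progress line once a minute; the "scanned" figure
    # dropping back toward zero confirms the restart. The loop and the
    # one-minute interval are my own choices.
    while true; do
        date
        zpool status volume | grep 'scanned out of'
        sleep 60
    done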

`iostat -En` shows a high (and growing) number of errors on all disks.
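
A convenient way to spot-check those counters (my own one-liner, assuming the usual `iostat -En` layout where each device's soft/hard/transport counters share one line):

    # Keep only the per-device error-counter summary lines from the full output.
    iostat -En | grep 'Errors:'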

`zpool iostat -v volume` shows normal resilvering activity (writes to the new disks, reads from the old disks).
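
For completeness, the same view can be sampled continuously by giving an interval (the 10-second interval here is my own choice):

    # Print per-vdev I/O statistics for the pool every 10 seconds.
    zpool iostat -v volume 10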

`/var/adm/messages` is full of messages like this:

Jul  6 12:08:25 raid2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3597@4/pci8086,329@0/pci1000,1060@1/sd@0,15 (sd20):
Jul  6 12:08:25 raid2   SCSI transport failed: reason 'reset': retrying command
Jul  6 12:08:28 raid2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3597@4/pci8086,329@0/pci1000,1060@1/sd@0,b (sd8):
Jul  6 12:08:28 raid2   Error for Command: read(10)                Error Level: Retryable
Jul  6 12:08:28 raid2 scsi: [ID 107833 kern.notice]     Requested Block: 21523458                  Error Block: 21523458
Jul  6 12:08:28 raid2 scsi: [ID 107833 kern.notice]     Vendor: transtec                           Serial Number: 63881076-00 
Jul  6 12:08:28 raid2 scsi: [ID 107833 kern.notice]     Sense Key: Unit Attention
Jul  6 12:08:28 raid2 scsi: [ID 107833 kern.notice]     ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Jul  6 12:09:35 raid2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3597@4/pci8086,329@0/pci1000,1060@1 (mpt0):
Jul  6 12:09:35 raid2   Disconnected command timeout for Target 0
Jul  6 12:09:39 raid2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3597@4/pci8086,329@0/pci1000,1060@1/sd@0,15 (sd20):
Jul  6 12:09:39 raid2   incomplete read- retrying
Jul  6 12:10:46 raid2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3597@4/pci8086,329@0/pci1000,1060@1 (mpt0):
Jul  6 12:10:46 raid2   Disconnected command timeout for Target 0
Jul  6 12:10:49 raid2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3597@4/pci8086,329@0/pci1000,1060@1/sd@0,e (sd11):
Jul  6 12:10:49 raid2   incomplete read- retrying
Jul  6 12:11:56 raid2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3597@4/pci8086,329@0/pci1000,1060@1 (mpt0):
Jul  6 12:11:56 raid2   Disconnected command timeout for Target 0
Jul  6 12:13:03 raid2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3597@4/pci8086,329@0/pci1000,1060@1/sd@0,1 (sd35):
Jul  6 12:13:03 raid2   Error for Command: write                   Error Level: Retryable
Jul  6 12:13:03 raid2 scsi: [ID 107833 kern.notice]     Requested Block: 644                       Error Block: 644
Jul  6 12:13:03 raid2 scsi: [ID 107833 kern.notice]     Vendor: transtec                           Serial Number: 023CEC5B-00 
Jul  6 12:13:03 raid2 scsi: [ID 107833 kern.notice]     Sense Key: Unit Attention
Jul  6 12:13:03 raid2 scsi: [ID 107833 kern.notice]     ASC: 0x29 (power on, reset, or bus reset occurred), ASCQ: 0x0, FRU: 0x0
Jul  6 12:13:03 raid2 scsi: [ID 107833 kern.warning] WARNING: /pci@0,0/pci8086,3597@4/pci8086,329@0/pci1000,1060@1/sd@0,5 (sd2):
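
A rough way to see which sd instances generate most of this noise (a sketch of my own; the pattern simply matches the "(sdNN):" tag at the end of the WARNING lines above):

    # Count kernel warnings per sd instance in the messages file.
    nawk '/kern.warning/ && match($0, /\(sd[0-9]+\)/) {
        print substr($0, RSTART, RLENGTH)
    }' /var/adm/messages | sort | uniq -c | sort -rn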

Is there anything I can do to make resilvering finish so that the pool can return to a normal state?

Pavel
  • Your discs are dying. Get your data off while you still can. – MadHatter Jul 06 '16 at 11:00
  • I wouldn't expect all 24 disks to die at the same time. Can this be somehow linked to the controller / storage? I already tried powering off and rebooting both the storage (an ancient Transtec T6100S24R1-F) and the head node. – Pavel Jul 06 '16 at 11:15
  • Neither would I, though four of them (`sd{8,11,20,35}`) are explicitly fingered in the output above, and if I'd bought 22 discs altogether I wouldn't be completely shocked if four went south at the same time. My point stands: if this data is of value to you, whether it's a controller dying (which can *royally* screw up a file system) or just several discs dying, **get that data off while you still can**. Then validate the hardware at your leisure, and restore the data once you're happy with it. **Do not take chances with data.** – MadHatter Jul 06 '16 at 11:44
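
To illustrate the evacuation MadHatter suggests, a minimal sketch (assuming a Solaris 10 update recent enough for `zfs send -R`, and a second host `backuphost` with a receiving pool `backup`, both hypothetical names):

    # Sketch only: snapshot everything recursively and stream it to another
    # machine; "backuphost" and the pool "backup" are placeholders.
    zfs snapshot -r volume@evacuate
    zfs send -R volume@evacuate | ssh backuphost zfs receive -d backup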

0 Answers