
I have a RAIDZ1 array of 3 HDDs:

# zpool status
...
config:

    NAME        STATE     READ WRITE CKSUM
    gpool       ONLINE       0     0     0
      raidz1-0  ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdd     ONLINE       0     0     0
        sda     ONLINE       0     0     0

(When I created the pool I used /dev/disk/by-id paths, but they show up as /dev/sdX.)

I wanted to swap all 3 HDDs for SSDs, but gradually. Since I have 6 SATA slots and a spare cable, I plugged in a new SSD first and set it up, using the to-be-replaced disk as the source:

    # sgdisk --replicate=/dev/disk/by-id/newSSD1 /dev/disk/by-id/oldHDD1
        The operation has completed successfully.
    # sgdisk --randomize-guids /dev/disk/by-id/newSSD1
        The operation has completed successfully.
    # grub-install /dev/disk/by-id/newSSD1
        Installing for i386-pc platform.
        Installation finished. No error reported.
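
(Not part of the steps above, but worth noting since sgdisk is already in use: it can also dump the source disk's GPT to a file first, as a just-in-case copy. The backup file name here is only an example; oldHDD1 is the same placeholder as above.)

    # sgdisk --backup=/root/oldHDD1-gpt.bak /dev/disk/by-id/oldHDD1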

Then fdisk -l /dev/disk/by-id/newSSD1 showed that the partition layout was the same as on the 3 HDDs:

        Disk /dev/disk/by-id/newSSD1: 931.53 GiB, 1000204886016 bytes, 1953525168 sectors
        Disk model: CT1000MX500SSD1 
        Units: sectors of 1 * 512 = 512 bytes
        Sector size (logical/physical): 512 bytes / 4096 bytes
        I/O size (minimum/optimal): 4096 bytes / 4096 bytes
        Disklabel type: gpt
        Disk identifier: EF97564D-490F-4A76-B0F0-4E8C7CAFFBD2

        Device                             Start        End    Sectors   Size Type
        /dev/disk/by-id/newSSD1-part1       2048 1953507327 1953505280 931.5G Solaris /usr & Apple ZFS
        /dev/disk/by-id/newSSD1-part2         48       2047       2000  1000K BIOS boot
        /dev/disk/by-id/newSSD1-part9 1953507328 1953523711      16384     8M Solaris reserved 1

        Partition table entries are not in disk order.

Then I went ahead and replaced the disk:

    # zpool offline gpool /dev/sdb
    # zpool status
          pool: gpool
         state: DEGRADED
        status: One or more devices has been taken offline by the administrator.
            Sufficient replicas exist for the pool to continue functioning in a
            degraded state.
        action: Online the device using 'zpool online' or replace the device with
            'zpool replace'.
          scan: scrub repaired 0B in 0 days 00:30:46 with 0 errors on Sat Jun 27 12:29:56 2020
        config:

            NAME        STATE     READ WRITE CKSUM
            gpool       DEGRADED     0     0     0
              raidz1-0  DEGRADED     0     0     0
                sdb     OFFLINE      0     0     0
                sdd     ONLINE       0     0     0
                sda     ONLINE       0     0     0

        errors: No known data errors
    
    # zpool replace gpool /dev/sdb /dev/disk/by-id/newSSD1
    Make sure to wait until resilver is done before rebooting.

    # zpool status
      pool: gpool
     state: DEGRADED
    status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
    action: Wait for the resilver to complete.
      scan: resilver in progress since Thu Jul 16 20:00:58 2020
        427G scanned at 6.67G/s, 792M issued at 12.4M/s, 574G total
        0B resilvered, 0.13% done, 0 days 13:10:03 to go
    config:

        NAME                                    STATE     READ WRITE CKSUM
        gpool                                   DEGRADED     0     0     0
          raidz1-0                              DEGRADED     0     0     0
            replacing-0                         DEGRADED     0     0     0
              sdb                               OFFLINE      0     0     0
              ata-newSSD1                       ONLINE       0     0     0
            sdd                                 ONLINE       0     0     0
            sda                                 ONLINE       0     0     0

    errors: No known data errors

Eventually it resilvered.

    # zpool status
      pool: gpool
     state: ONLINE
    status: Some supported features are not enabled on the pool. The pool can
        still be used, but some features are unavailable.
    action: Enable all features using 'zpool upgrade'. Once this is done,
        the pool may no longer be accessible by software that does not support
        the features. See zpool-features(5) for details.
      scan: resilvered 192G in 0 days 00:27:48 with 0 errors on Thu Jul 16 20:28:46 2020
    config:

        NAME                                  STATE     READ WRITE CKSUM
        gpool                                 ONLINE       0     0     0
          raidz1-0                            ONLINE       0     0     0
            ata-SSD1                          ONLINE       0     0     0
            sdd                               ONLINE       0     0     0
            sda                               ONLINE       0     0     0

    errors: No known data errors

This time the new device shows up with its by-id label. Since I had replicated the partitions and installed GRUB on the new SSD, I wasn't expecting any trouble.
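
(In hindsight, one check worth doing at this point: zpool status can print the full device paths it actually ended up using, which makes it obvious whether ZFS took the new vdev as a whole disk or as a specific partition. Command only, I didn't capture the output at the time:)

    # zpool status -P gpool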

However, when I booted, GRUB dropped me into a grub rescue> prompt with a grub_file_filters not found error. I tried booting off the other 2 HDDs and off the SSD: the same error every time. Plugging the 3rd HDD back in gave the same result.

Today I booted off the SSD and it just worked: the zpool is as expected and there are no GRUB errors. I'm writing this on that system.

At the rescue prompt, ls does show the expected partitions, but I could only get GRUB to show meaningful information after insmod zfs (or similar). However, trying to ls something like (hd0,gpt1)/ROOT/gentoo@/boot results in compression algorithm 73 not supported (sometimes 80 instead).
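
Roughly what that looked like at the rescue prompt (reconstructed from memory, the device list is illustrative):

    grub rescue> ls
    (hd0) (hd0,gpt9) (hd0,gpt1) (hd1) (hd1,gpt9) (hd1,gpt2) (hd1,gpt1) ...
    grub rescue> insmod zfs
    grub rescue> ls (hd0,gpt1)/ROOT/gentoo@/boot
    error: compression algorithm 73 not supported.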

I'm running kernel 5.4.28 with its accompanying initramfs, and root=ZFS parameters for GRUB. I hadn't had any ZFS-root boot incidents until I decided to swap a drive. My /etc/default/grub has entries to find the ZFS root,

GRUB_CMDLINE_LINUX_DEFAULT="dozfs spl.spl_hostid=0xa8c06101 real_root=ZFS=gpool/ROOT/gentoo"

and it does. I'd like to keep replacing the other disks, but I'd prefer to know what happened and how to avoid it first.
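
(For reference, two quick checks on the now-running system, nothing beyond standard tooling: cat /proc/cmdline shows that the real_root parameter actually reaches the kernel, and grub-probe should report zfs for /boot if GRUB's view of the filesystem is sane. Outputs omitted, I didn't save them:)

    # cat /proc/cmdline
    # grub-probe --target=fs /boot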

Edit 1

Something I noticed: after running sgdisk --replicate I get 3 partitions, the same as on the original disks:

# fdisk -l ${NEWDISK2}
Disk /dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2: 931.53 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: CT1000MX500SSD1 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 2190C74D-46C8-44AC-81FB-36C3B72A7EA7

Device                                                      Start        End    Sectors   Size Type
/dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2-part1       2048 1953507327 1953505280 931.5G Solaris /usr & Apple ZFS
/dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2-part2         48       2047       2000  1000K BIOS boot
/dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2-part9 1953507328 1953523711      16384     8M Solaris reserved 1

Partition table entries are not in disk order.

...but after running zpool replace I lose one partition:

# fdisk -l ${NEWDISK2}
Disk /dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2: 931.53 GiB, 1000204886016 bytes, 1953525168 sectors
Disk model: CT1000MX500SSD1 
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: gpt
Disk identifier: 0FC0A6C0-F9F1-E341-B7BD-99D7B370D685

Device                                                      Start        End    Sectors   Size Type
/dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2-part1       2048 1953507327 1953505280 931.5G Solaris /usr & Apple ZFS
/dev/disk/by-id/ata-CT1000MX500SSD1_NEWDISK2-part9 1953507328 1953523711      16384     8M Solaris reserved 1

...the boot partition. This is weird, considering I managed to boot off the new SSD.

I'll keep experimenting. As for ZFS versions:

# zpool version
zfs-0.8.4-r1-gentoo
zfs-kmod-0.8.3-r0-gentoo

Edit 2

This is consistent. When I replicate with sgdisk --replicate I get the same 3 partitions as on the originals, including the BIOS boot partition. After running zpool replace and resilvering, I lose the boot partition.

I assume the system still boots because that partition's data is still physically on the disk, together with the boot code in the MBR, so the BIOS can still kickstart GRUB.
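
That assumption should be checkable by reading the raw sectors back: the MBR is the first 512 bytes, and sectors 48-2047 are where -part2 used to sit according to the tables above. A rough check, using the same ${NEWDISK2} placeholder as before (outputs not captured here):

    # dd if=${NEWDISK2} bs=512 count=1 2>/dev/null | hexdump -C | head            # MBR boot code
    # dd if=${NEWDISK2} bs=512 skip=48 count=16 2>/dev/null | hexdump -C | head   # former BIOS boot area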

This is the current status:

# zpool status
  pool: gpool
 state: ONLINE
status: Some supported features are not enabled on the pool. The pool can
    still be used, but some features are unavailable.
action: Enable all features using 'zpool upgrade'. Once this is done,
    the pool may no longer be accessible by software that does not support
    the features. See zpool-features(5) for details.
  scan: resilvered 192G in 0 days 00:08:04 with 0 errors on Fri Jul 17 21:04:54 2020
config:

    NAME                            STATE     READ WRITE CKSUM
    gpool                           ONLINE       0     0     0
      raidz1-0                      ONLINE       0     0     0
        ata-CT1000MX500SSD1_NEWSSD1 ONLINE       0     0     0
        ata-CT1000MX500SSD1_NEWSSD2 ONLINE       0     0     0
        ata-CT1000MX500SSD1_NEWSSD3 ONLINE       0     0     0

errors: No known data errors
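
One thing left to try before the next replacement (untested as of this edit; the partition number, type code and offsets are taken from the fdisk listings above): re-create the BIOS boot partition in the gap that zpool replace leaves behind and re-run grub-install on the resilvered disk, along these lines:

    # sgdisk --set-alignment=1 --new=2:48:2047 --typecode=2:EF02 /dev/disk/by-id/ata-CT1000MX500SSD1_NEWSSD1
    # grub-install /dev/disk/by-id/ata-CT1000MX500SSD1_NEWSSD1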