
I just upgraded to Ubuntu 14.04, and I had two ZFS pools on the server. There was a minor issue with me fighting the ZFS driver and the kernel version, but that's worked out now. One pool came online and mounted fine. The other didn't. The main difference between the two pools is that one was just a pool of disks (video/music storage), and the other was a raidz set (documents, etc.).

I've already attempted exporting and re-importing the pool, to no avail; attempting to import gets me this:

root@kyou:/home/matt# zpool import -fFX -d /dev/disk/by-id/
   pool: storage
     id: 15855792916570596778
  state: UNAVAIL
 status: One or more devices contains corrupted data.
 action: The pool cannot be imported due to damaged devices or data.
   see: http://zfsonlinux.org/msg/ZFS-8000-5E
 config:

        storage                                      UNAVAIL  insufficient replicas
          raidz1-0                                   UNAVAIL  insufficient replicas
            ata-SAMSUNG_HD103SJ_S246J90B134910       UNAVAIL
            ata-WDC_WD10EARS-00Y5B1_WD-WMAV51422523  UNAVAIL
            ata-WDC_WD10EARS-00Y5B1_WD-WMAV51535969  UNAVAIL

The symlinks for those in /dev/disk/by-id also exist:

root@kyou:/home/matt# ls -l /dev/disk/by-id/ata-SAMSUNG_HD103SJ_S246J90B134910* /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51*
lrwxrwxrwx 1 root root  9 May 27 19:31 /dev/disk/by-id/ata-SAMSUNG_HD103SJ_S246J90B134910 -> ../../sdb
lrwxrwxrwx 1 root root 10 May 27 19:15 /dev/disk/by-id/ata-SAMSUNG_HD103SJ_S246J90B134910-part1 -> ../../sdb1
lrwxrwxrwx 1 root root 10 May 27 19:15 /dev/disk/by-id/ata-SAMSUNG_HD103SJ_S246J90B134910-part9 -> ../../sdb9
lrwxrwxrwx 1 root root  9 May 27 19:15 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51422523 -> ../../sdd
lrwxrwxrwx 1 root root 10 May 27 19:15 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51422523-part1 -> ../../sdd1
lrwxrwxrwx 1 root root 10 May 27 19:15 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51422523-part9 -> ../../sdd9
lrwxrwxrwx 1 root root  9 May 27 19:15 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51535969 -> ../../sde
lrwxrwxrwx 1 root root 10 May 27 19:15 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51535969-part1 -> ../../sde1
lrwxrwxrwx 1 root root 10 May 27 19:15 /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51535969-part9 -> ../../sde9

Inspecting the various /dev/sd* devices listed, they appear to be the correct ones (The 3 1TB drives that were in a raidz array).
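
By "inspecting" I mean checks roughly along these lines (not necessarily the exact commands I ran; the device names are from the listing above, and I've omitted the output):

# Confirm which block device each by-id symlink resolves to
readlink -f /dev/disk/by-id/ata-SAMSUNG_HD103SJ_S246J90B134910

# The data partition should show up as a zfs_member labelled 'storage'
blkid /dev/sdb1

# The model/serial reported by SMART should match the by-id name
smartctl -i /dev/sdb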

I've run zdb -l on each drive, dumped the output to files, and diffed them (a rough sketch of the commands is below the label dump). The only differences between the three are the guid fields, which I assume is expected. Otherwise the labels are basically identical, and look like this:

version: 5000
name: 'storage'
state: 0
txg: 4
pool_guid: 15855792916570596778
hostname: 'kyou'
top_guid: 1683909657511667860
guid: 8815283814047599968
vdev_children: 1
vdev_tree:
    type: 'raidz'
    id: 0
    guid: 1683909657511667860
    nparity: 1
    metaslab_array: 33
    metaslab_shift: 34
    ashift: 9
    asize: 3000569954304
    is_log: 0
    create_txg: 4
    children[0]:
        type: 'disk'
        id: 0
        guid: 8815283814047599968
        path: '/dev/disk/by-id/ata-SAMSUNG_HD103SJ_S246J90B134910-part1'
        whole_disk: 1
        create_txg: 4
    children[1]:
        type: 'disk'
        id: 1
        guid: 18036424618735999728
        path: '/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51422523-part1'
        whole_disk: 1
        create_txg: 4
    children[2]:
        type: 'disk'
        id: 2
        guid: 10307555127976192266
        path: '/dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51535969-part1'
        whole_disk: 1
        create_txg: 4
features_for_read:
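
The comparison itself was nothing fancy; roughly the following, with illustrative output file names:

# Dump the ZFS labels from each member's data partition to a file
zdb -l /dev/disk/by-id/ata-SAMSUNG_HD103SJ_S246J90B134910-part1      > /tmp/label-sdb.txt
zdb -l /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51422523-part1 > /tmp/label-sdd.txt
zdb -l /dev/disk/by-id/ata-WDC_WD10EARS-00Y5B1_WD-WMAV51535969-part1 > /tmp/label-sde.txt

# Compare pairwise; only the guid lines should differ
diff /tmp/label-sdb.txt /tmp/label-sdd.txt
diff /tmp/label-sdb.txt /tmp/label-sde.txt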

Stupidly, I do not have a recent backup of this pool. However, the pool was fine before the reboot, and Linux sees the disks fine (I have smartctl running now to double-check).

So, in summary:

  • I upgraded Ubuntu, and lost access to one of my two zpools.
  • The difference between the pools is that the one that came up was just a bunch of disks (JBOD); the other was raidz.
  • All drives in the unimportable zpool are marked UNAVAIL, with no per-device notes about corrupted data.
  • The pools were both created with disks referenced from /dev/disk/by-id/.
  • Symlinks from /dev/disk/by-id to the various /dev/sd devices seem to be correct.
  • zdb can read the labels from the drives.
  • I've already tried exporting and re-importing the pool; it won't import again.

Is there some sort of black magic I can invoke via zpool/zfs to bring these disks back into a reasonable array? Can I run zpool create raidz ... without losing my data? Is my data gone anyhow?

Matt Sieker

3 Answers


After lots and lots more Googling of this specific error message I was getting:

root@kyou:/home/matt# zpool import -f storage
cannot import 'storage': one or more devices are already in use

(included here for posterity and for the search indexes), I found this:

https://groups.google.com/a/zfsonlinux.org/forum/#!topic/zfs-discuss/VVEwd1VFDmc

The same thing was happening here: mdraid was grabbing the same partitions and assembling them at every boot, before ZFS was loaded.

I remembered seeing some mdadm lines in dmesg and sure enough:

root@kyou:/home/matt# cat /proc/mdstat
Personalities : [linear] [multipath] [raid0] [raid1] [raid6] [raid5] [raid4] [raid10]
md126 : active raid5 sdd[2] sdb[0] sde[1]
      1953524992 blocks level 5, 64k chunk, algorithm 2 [3/3] [UUU]

These drives were, once upon a time, part of a software raid5 array. For some reason, during the upgrade, it decided to rescan the drives, found that they had once been part of an md array, and recreated it. This was verified with:

root@kyou:/storage# mdadm --examine /dev/sd[a-z]

Those three drives showed a bunch of information. For now, stopping the array:

root@kyou:/home/matt# mdadm --stop /dev/md126
mdadm: stopped /dev/md126

And re-running import:

root@kyou:/home/matt# zpool import -f storage

has brought the array back online.

Now I'm making a snapshot of that pool for backup, and will run mdadm --zero-superblock on the drives.
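
Roughly, that cleanup looks like this (a sketch only: the snapshot name and backup path are illustrative, and I'd export the pool before touching the md superblocks):

# Take a recursive snapshot and stream it somewhere safe (destination is illustrative)
zfs snapshot -r storage@post-recovery
zfs send -R storage@post-recovery > /backup/storage-post-recovery.zfs

# Export the pool, then wipe the stale md metadata from each member disk
zpool export storage
mdadm --zero-superblock /dev/sdb /dev/sdd /dev/sde

# Re-import using the by-id names
zpool import -d /dev/disk/by-id storage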

Matt Sieker

Ubuntu seems to have some annoying udev issues that we don't see on the Red Hat/CentOS side. I'd recommend using the WWN-based device names if you can, as they seem less susceptible to this.
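
For example, something along these lines when creating (or re-importing) a pool; the pool name and WWNs below are placeholders, not your disks':

# List the WWN-based names udev created for the disks
ls -l /dev/disk/by-id/wwn-*

# Reference those names instead of the ata-* aliases (placeholder pool name and WWNs)
zpool create tank raidz \
    /dev/disk/by-id/wwn-0x50024e9001234567 \
    /dev/disk/by-id/wwn-0x50014ee2089abcde \
    /dev/disk/by-id/wwn-0x50014ee20812f345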

Have you seen: Why did rebooting cause one side of my ZFS mirror to become UNAVAIL?

ewwhite

  • I've seen those, and reading the thread linked in one, it seems the problem is udev not creating symlinks for all of the partitions on a device. I just checked all three drives: they each have partitions 1 and 9, those partitions have symlinks in `/dev/disk/by-id`, and all of the symlinks for a given device point to the same `/dev/sd*` drive. And the closest thing I can find to a solution (use zpool replace), I can't do since I can't re-import the pool. – Matt Sieker May 28 '14 at 01:42

I ran into almost exactly this problem trying to upgrade to the 3.13-series kernels on Debian Wheezy. You are right in your comment; it is a udev bug. I never did get it sorted, unfortunately, but it's worth exploring other kernels, especially the 3.11 series, for compatibility with the 0.6.2 version of ZoL. Just use the older kernel until 0.6.3 comes out.
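
If you do stay on the older kernel for a while, one way to keep apt from pulling in a newer series is to hold the kernel metapackage (package name is for amd64; adjust for your architecture):

# Prevent the metapackage from dragging in a newer kernel series
apt-mark hold linux-image-amd64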

Joshua Boniface
  • It's pretty unacceptable that udev would break in this manner. I don't use Ubuntu, but things like this make it seem really unpolished compared to the RHEL offerings. – ewwhite May 28 '14 at 07:18