
I have a Proxmox installation (v5.4, I think) on top of ZFS in a server with 6 disks. There are two pools: rpool, a mirror of two SSDs that holds the Proxmox root filesystem plus some containers and zvols, and another pool (HDD) made of the other four disks in RAIDZ-1.

Two months ago I was doing some optimization work on the ZFS parameters and I set dnodesize=auto on all pools, including rpool and its child datasets. Now I have rebooted the server and the following message said hello:

    error: no such device: 40d7d14f38cc...
    error: unknown filesystem
    Entering rescue mode...
    grub rescue>

From the research I have done, I think it is related to the dnodesize parameter (check here, here and here). Quoting directly from the first link:

Some file must have triggered a non-legacy (512 bytes) dnode size in the dataset, which meant that GRUB could no longer read the drive.

Now I have some ideas on where to start, but also a lot of doubts (I'm not a very experienced admin), so I wanted to ask you first, before reaching a dead end and/or breaking something for good.

First, I could create a live USB with a ZFS-capable OS; probably the best option would be the same Proxmox 5.4 ISO. Once booted from it, would it be possible to import and manipulate the existing Proxmox pools from the live USB? How?
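Something like this is what I had in mind from the rescue/debug shell of the live USB (just a rough sketch; the altroot mountpoint is arbitrary and I haven't tried any of it yet):

    # from a ZFS-capable live environment (e.g. the Proxmox ISO's debug shell)
    zpool import                           # list the pools it can see
    zpool import -f -R /mnt rpool          # import rpool under an alternate root
    zfs get -r dnodesize rpool             # verify the current property values
    # ... fix whatever needs fixing ...
    zpool export rpool                     # export cleanly before rebooting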

The problem is that it isn't enough to set dnodesize=legacy, because the change is not applied retroactively to existing files. So I don't know exactly how to proceed here. The options I have thought of are the following:

  1. Find a way to detect which files exceed the legacy 512-byte dnode size, then move them to another location and move them back so they get rewritten with legacy dnodes. The problem here: how do I detect those files? (See the sketch after this list.)
  2. Copy all the rpool content to an external disk and then copy it back into rpool. The problem here is that I don't feel safe moving the Proxmox root filesystem back and forth and expecting it to work again. Maybe with cp -a?
  3. It seems the only content that needs to be readable by GRUB is what lives inside /boot. But /boot is inside the rpool pool, and ZFS pools aren't shrinkable, so I can't shrink any of the current pools to create a new boot pool, and I can't add any extra disk to the server. The only option left would be to move /boot to a pen drive and tell GRUB to search there, but that doesn't seem like a very robust or definitive solution.
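For option 1, the only candidate I have found so far is zdb, which dumps per-object metadata including a dnsize column (512 = legacy). I'm assuming the root dataset is rpool/ROOT/pve-1 (the Proxmox default, I believe), and I'm not sure zdb is the right tool, so treat this as a guess rather than a recipe:

    # dump one line per object of the root dataset; the 6th column (dnsize)
    # is the dnode size, so anything other than 512 would be an offender.
    # I'm guessing at the column position from the zdb output header.
    zdb -dd rpool/ROOT/pve-1 | awk '$6 != "" && $6 != "512"'

    # for a suspicious object number, a more verbose dump should also
    # print its file path:
    zdb -dddd rpool/ROOT/pve-1 <object-number>

For option 2, I also wonder whether zfs send/recv into a freshly created dataset with dnodesize=legacy would be safer than cp -a, but I don't know if that keeps everything Proxmox expects.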

Any advice?

EDIT: What about replacing GRUB 2 with a more ZFS-friendly bootloader? From the Proxmox wiki:

systemd-boot is a lightweight EFI bootloader. It reads the kernel and initrd images directly from the EFI System Partition (ESP) where it is installed. The main advantage of directly loading the kernel from the ESP is that it does not need to reimplement the drivers for accessing the storage. In the context of ZFS as root filesystem this means that you can use all optional features on your root pool instead of the subset which is also present in the ZFS implementation in grub or having to create a separate small boot-pool.

In fact it seems that:

Proxmox VE currently uses one of two bootloaders depending on the disk setup selected in the installer. For EFI Systems installed with ZFS as the root filesystem systemd-boot is used. All other deployments use the standard grub bootloader.

So I don't know why it was installed with GRUB instead of systemd-boot. Is it feasible to replace the bootloader without breaking anything?
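If it is feasible, is the procedure something like the following from the live USB? I'm assuming the ESPs are the second partition of each SSD (I still have to verify that, and the device name below is hypothetical), and I know Proxmox 6.x ships pve-efiboot-tool to automate the ESP setup, but I don't know the equivalent on 5.4, so this is only a sketch:

    # import the root pool and chroot into the installed system
    zpool import -f -R /mnt rpool
    mount /dev/sda2 /mnt/boot/efi                    # hypothetical ESP device
    for d in /dev /proc /sys; do mount --rbind $d /mnt$d; done
    chroot /mnt /bin/bash

    # inside the chroot: install systemd-boot into the mounted ESP
    bootctl --path=/boot/efi install
    # then the kernel and initrd still have to be copied onto the ESP and
    # loader entries written (what pve-efiboot-tool format/init does on 6.x),
    # and the same would need to be repeated for the second SSD's ESP.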

Thanks for your help.

H25E
  • Root on ZFS has way too many possible failure modes and potential for data loss for me to think about using it, but at the minimum, /boot should not have ever been in there. It should have remained on a separate partition, which eliminates some (but not all) of those failure modes, including I expect the one you're experiencing. I would do a fresh installation and restore from backup, and very strongly consider not putting the root filesystem on ZFS. – Michael Hampton Dec 29 '20 at 20:52
  • Thanks for the reply. I wasn't the admin when Proxmox was installed; I took over just after the installation. I have never installed Proxmox myself, but I think this is the default installation mode for Proxmox on ZFS. – H25E Dec 29 '20 at 21:07
  • Switching from legacy to EFI boot generally requires reinstalling, or at least repartitioning (which you can't really do). So yes, you can and probably should do that, and curse the previous admin for _not_ using EFI boot. – Michael Hampton Dec 29 '20 at 23:01
  • No no, it's an EFI boot. Tell me what made you think it was an old BIOS legacy boot and I will try to explain it better. – H25E Dec 29 '20 at 23:19
  • What the previous admin did was install GRUB 2 instead of systemd-boot, which is the Proxmox recommendation for installations where the root filesystem is on ZFS, which is our case. So the probable fix is "only" to replace the current GRUB bootloader with systemd-boot and configure it to boot from the already existing pools. – H25E Dec 29 '20 at 23:20
  • Well, that's an odd configuration. Sure, if it's already EFI, then you ought to be able to use systemd-boot. – Michael Hampton Dec 30 '20 at 03:45
  • But I don't know how. In the BIOS I have `Storage Boot Option Control = UEFI` and `Other PCI devices = UEFI`. I can't check it right now, but I think both drives of the rpool mirror have an EFI partition. What would be the best way to install systemd-boot now? – H25E Dec 30 '20 at 08:32

0 Answers