
I have a server running Debian Jessie with 4 drives, sda to sdd, all of which are partitioned identically. The system sits on a raid1 md array across all drives. Every drive has GRUB installed and I can swap the discs with each other; each one is bootable and the system boots up happily. All drives have exactly the same layout:

  sdx1 - Boot Partition, GRUB installed
  sdx2 - Raid 1 /boot
  sdx3 - Raid 1 /
  sdx4 - Raid 10 swap
  sdx5 - non-md btrfs Raid 6 /data

The data partition is btrfs raid6. I'm currently trying to upgrade my capacity by swapping out a drive for a bigger one. Since the array can take two failures, my first instinct was to just replace one of the drives, boot back up, restore the degraded RAID arrays with the newly installed drive, and after the rebuild everything would be back to normal.
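For reference, the re-add after installing the replacement disk would look roughly like this. This is only a sketch: the md device names, the new disk name (sdd here) and the btrfs devid are placeholders for my layout.

  # after partitioning the new disk to match the others (names are examples)
  mdadm --manage /dev/md1 --add /dev/sdd2   # raid1 /boot
  mdadm --manage /dev/md2 --add /dev/sdd3   # raid1 /
  mdadm --manage /dev/md3 --add /dev/sdd4   # raid10 swap
  # the btrfs raid6 data partition is rebuilt by btrfs itself,
  # replacing the missing device (devid 4 is a placeholder)
  btrfs replace start 4 /dev/sdd5 /data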

BUT the machine (which sadly is headless at the moment) does not boot once I swap in something that invalidates the RAID array. I can swap the discs with each other all day long and it happily boots, but if I remove a disc or swap in anything that is not part of the array, it fails to boot.

Am I missing something? How can I tell md that it is OK to boot automatically with missing discs / a degraded array? In the end, as far as md is concerned, even one of the four discs can carry the whole system by itself. The data partition is another beast, as it needs at least two drives, but md should not be concerned with that since it is a pure btrfs RAID.

I know that for the current use case I could just remove the drive from the RAID, upgrade it and put it back, but in the event of a failure I won't have the option of removing the drive cleanly if the system fails to start up.

bardiir
  • Which mount points are stored on which RAID arrays? For a standard Linux system, /boot and swap at a minimum should be on your raid1 array, as the standard default GRUB installed to the boot sectors of the disks cannot read raid5 or 6; that requires the programs stored in /boot. Also, since it isn't mounting the raid6, you may find it gets part way through the boot sequence until it needs something on there, such as from the /usr, /bin, /sbin or /etc folders? – BeowulfNode42 Feb 02 '17 at 09:02
  • I've updated the question to include this. But in the end everything that is required for the system is within a raid1 md raid. Everything else should be optional anyway for a boot. And it does boot perfectly fine from any of the discs, just not if the raid is not complete. – bardiir Feb 03 '17 at 14:51
  • Perhaps something to do with auto-starting degraded arrays, or starting them as read-only. I've come across a distro or two that defaulted to false for that feature, which could have included Debian. Try searching along those lines. Perhaps GRUB is not mounting the raid1 due to it being degraded, hence no / or /boot filesystems, hence no boot. – BeowulfNode42 Feb 15 '17 at 07:26

2 Answers


As an update and the answer: in the meantime I figured out that the only thing really missing here was the nofail flag in fstab. The filesystem was degraded, and the boot would not get past mounting it in that degraded state without the nofail option being set.
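For anyone running into the same thing, the relevant fstab entry now looks roughly like this (the UUID is just a placeholder for the btrfs data filesystem):

  # /etc/fstab - data filesystem; nofail lets the boot carry on even if this mount fails
  UUID=xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx  /data  btrfs  defaults,nofail  0  0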

bardiir

As far as I know it is not yet possible to create a RAID with mdadm that you can boot from without having separate boot partitions. I assume you set it up in a similar way as described here; it uses a raid10, but the same applies to other RAID levels:

How to create a bootable redundant Debian system with a 3 or 4 (or more) disk software raid10?

Is it possible you did not configure the other disks to be booted from in the BIOS? Or else the boot partitions are not exactly the same, that is, exact copies with the same UUID.

To enable a specific disk to boot, it needs to have a boot sector, and the BIOS needs to be configured to boot from it (along with the other boot disks that are part of the RAID). Of course, for a boot to complete successfully the disk also needs a boot partition. Since these boot partitions are not part of the RAID, each boot disk has its own. If you make sure each boot partition contains exactly the same filesystem (using dd, for example, to copy it over) and each disk has a boot sector created using the images on that boot partition, the system should be able to boot from any of the disks, even if the RAID is degraded. A degraded RAID should not prevent a successful boot; otherwise a big benefit of having a RAID would be moot.
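A rough sketch of that, assuming the non-raid boot partitions are the first partition on each of three disks (the device names are examples, adjust them to your actual layout):

  # copy the boot partition from the first disk to the others (exact copies, same UUID)
  dd if=/dev/sda1 of=/dev/sdb1 bs=1M
  dd if=/dev/sda1 of=/dev/sdc1 bs=1M

  # write a boot sector / GRUB images to every disk
  grub-install /dev/sda
  grub-install /dev/sdb
  grub-install /dev/sdc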

Quoting from the link:

Each disk that is part of the raid should have a bootable partition of about 1 GB that is NOT part of the raid. Create these partitions as normal, they have to be exactly the same size. Mark them as bootable, the mountpoint on one of the disks should be /boot, you can leave the others as unmounted.

Once you have used dd to make exact copies of the boot partition:

Now make sure that your bios is configured to attempt to boot from all 3 disks, order doesn't matter. As long as the bios will try to boot from any disk then in case one of the disks fails the system will automagically boot from the other disk because the UUIDs are exactly the same.
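A quick way to check that the UUIDs really are identical (again, the device names are just examples):

  # all boot partitions should report the same UUID if they are exact dd copies
  blkid /dev/sda1 /dev/sdb1 /dev/sdc1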

aseq
  • No, all disks are perfectly bootable; I can boot from every disc and it comes up fine. But if any disk is missing it doesn't boot at all. So the order of the disks doesn't matter, and which disk boots is sadly completely irrelevant, otherwise the matter would be easier to debug. The system boots up to at least GRUB, I guess. If I remove one disk and boot up, I wait a while and nothing happens; but if I put the disk back and boot again, it comes up with that disk marked as unsynced in the md RAID. After a rebuild of the disk everything is fine again. But if I swap in any new disk, or no disk, then there is no boot. – bardiir Jan 29 '17 at 16:49
  • You mentioned that nothing happens. Did you wait long enough? Typically if GRUB tries to boot and it cannot find a boot or root filesystem it will show an error of some kind. It may take a while for this to show up. Some speculation: it may just be waiting for the missing disk and after a timeout it will boot fine. The mdadm software may be causing this delay. – aseq Feb 04 '17 at 00:34
  • Have you tried removing a disk and then booting with rescue media? This might show you more about the state of the md arrays. If the md array is totally broken, your raid1 isn't what you think it is, which would be good to know. – Dylan Martin Feb 04 '17 at 00:40
  • As stated, the server is currently headless, no graphical output, so if it doesn't boot there is no way to tell what is showing up. Removing a disc does break the array: after putting the disc back and booting again, the removed disc is no longer in sync and needs to be re-added to the RAID, so something does happen in between; after the rebuild everything is fine. I once waited 4 hours for it to boot with one disc missing. I know it left the BIOS, as the status LED stopped blinking, but I guess it just stopped in GRUB with an error. I want a degraded boot, not an error. – bardiir Feb 05 '17 at 20:35
  • Lacking any other information I have no other solution I can provide. By the way it may prove helpful to upvote people who take the time and effort to help you out. – aseq Feb 06 '17 at 21:36