I have a problem on a server with 4 x 1 TB drives running Debian wheezy and GRUB 1.99-27+deb7u3. sda and sdb have partitions mirrored using (Linux software) RAID1, including `/boot`. sdc and sdd have a single partition each, mirrored as an LVM physical volume for data. GRUB is installed to sda and sdb. I used `mdadm` to `--fail` and `--remove` the 1 TB sdc, and replaced the old drive (a ST91000640NS) with a new 2 TB ST2000NX0243.
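For reference, the removal was along these lines (a sketch only; md4 and sdc1 are taken from the `lsblk` output further down):
# mark the member of the data mirror as faulty, then remove it from the array
mdadm /dev/md4 --fail /dev/sdc1
mdadm /dev/md4 --remove /dev/sdc1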
With the new drive in, GRUB gets as far as
GRUB loading.
Welcome to GRUB!
but fails to show the menu. The drive light on sdc is lit continuously, so presumably the GRUB core is trying to read that drive, even though it's not needed to access /boot/grub. I've tried two drives of the same model, both of which test fine with `smartctl`, with the same result. With the sdc drive bay empty, everything boots normally. The system boots from live USB and the new drive is accessible, so it's not a hardware incompatibility(*). I'm sure it was sdc that was removed, and there's no indication the BIOS reordered the drives.
(*) this may not have been a safe assumption. See answers.
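(The `smartctl` checks were along the lines of the following sketch, with illustrative device names:)
# run a SMART self-test, then review the attributes and self-test log
smartctl -t long /dev/sdc
smartctl -a /dev/sdc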
So I have the following related questions:
- Could the changed logical sector size (4096 rather than 512 bytes) be causing a problem, perhaps in the RAID support built into the GRUB core? Why don't I at least get a `grub rescue>` prompt? Could a 4K problem also prevent using the drive for Linux RAID?
- What's the quickest way to solve this? [Previous suggestions included: Do I need to reinstall GRUB with the new drive in place, and in that case how? Would a GRUB rescue USB (made from the same system) have the same problem? Is it a known bug in GRUB, and should I upgrade? Answers to these appear to be: no, yes and no.] Can I permanently configure the GRUB image prefix used by Debian?
- How would one go about debugging this stage of GRUB? It might be sensitive to what modules are built in, but how do you find that out?
I'm thinking of a `debug.cfg` with just `debug=all` and something like:
grub-mkimage -c debug.cfg -o dcore.img configfile normal raid fs multiboot
grub-setup -c dcore.img /dev/sda
Would that work? (I address point 3 in my own answer, but the hang in my case appears to happen before the embedded configuration is acted on.)
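For what it's worth, a fuller version of that experiment might look like the sketch below; the module list and the `(md/0)` prefix are guesses modelled on the Debian-built image shown further down (which uses an mduuid prefix), and newer grub-mkimage versions also need `-O` and `-d`. In my case it yields no extra output, since the hang seems to precede the embedded configuration.
# embedded configuration: enable all debug output as early as possible
cat > debug.cfg <<'EOF'
set debug=all
EOF
# build a core image carrying that config plus the modules the Debian image
# uses, and install it to the MBR of sda (the (md/0)/grub prefix is a guess)
grub-mkimage -c debug.cfg -d /usr/lib/grub/i386-pc -O i386-pc \
    -o /boot/grub/dcore.img --prefix='(md/0)/grub' \
    biosdisk ext2 part_msdos raid mdraid09 configfile normal
grub-setup -c dcore.img -d /boot/grub /dev/sda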
More system details
In case it helps visualise, here's part of the `lsblk` output:
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sdb 8:16 0 931.5G 0 disk
├─sdb1 8:17 0 957M 0 part
│ └─md0 9:0 0 956.9M 0 raid1 /boot
├─sdb2 8:18 0 9.3G 0 part
│ └─md1 9:1 0 9.3G 0 raid1 /
├─sdb3 8:19 0 279.4G 0 part
│ └─md2 9:2 0 279.4G 0 raid1 /var
└─sdb4 8:20 0 641.9G 0 part
└─md3 9:3 0 641.9G 0 raid1
├─vg0-home (dm-0) 253:0 0 1.4T 0 lvm /home
└─vg0-swap (dm-2) 253:2 0 32G 0 lvm [SWAP]
sdc 8:32 0 931.5G 0 disk
└─sdc1 8:33 0 931.5G 0 part
└─md4 9:4 0 931.5G 0 raid1
└─vg0-home (dm-0) 253:0 0 1.4T 0 lvm /home
sdd 8:48 0 931.5G 0 disk
└─sdd1 8:49 0 931.5G 0 part
└─md4 9:4 0 931.5G 0 raid1
└─vg0-home (dm-0) 253:0 0 1.4T 0 lvm /home
sda 8:0 0 931.5G 0 disk
├─sda1 8:1 0 957M 0 part
│ └─md0 9:0 0 956.9M 0 raid1 /boot
├─sda2 8:2 0 9.3G 0 part
│ └─md1 9:1 0 9.3G 0 raid1 /
├─sda3 8:3 0 279.4G 0 part
│ └─md2 9:2 0 279.4G 0 raid1 /var
└─sda4 8:4 0 641.9G 0 part
└─md3 9:3 0 641.9G 0 raid1
├─vg0-home (dm-0) 253:0 0 1.4T 0 lvm /home
└─vg0-swap (dm-2) 253:2 0 32G 0 lvm [SWAP]
This is a pre-2010 BIOS and has no EFI capability.
Irrelevant: on the running system the following gives the same LVM error from grub-probe 1.99 as I get on grub-install, although everything appears to work (this seems fixed in GRUB 2.02).
# grub-fstest /dev/sda cp '(loop0,msdos1)/grub/grub.cfg' grub.cfg
error: unknown LVM metadata header.
The debug methods in the answer below show how the image installed to sd[ab] is built, including its prefix:
grub-mkimage -d /usr/lib/grub/i386-pc -O i386-pc --output=/boot/grub/core.img '--prefix=(mduuid/<UUID of sdN1>)/grub' biosdisk ext2 part_msdos part_msdos raid mdraid09
I don't know why 'part_msdos' is repeated. There are no GPT tables. md0 (/boot) uses RAID superblock version 0.9, as do md1, md2 and md4 (these are old arrays); md3 uses superblock 1.2, but shouldn't be involved in booting.
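(One way to confirm the superblock versions, sketched here with the array names from the `lsblk` listing above, is `mdadm --detail`:)
# print the metadata (superblock) version of each array
for md in /dev/md0 /dev/md1 /dev/md2 /dev/md3 /dev/md4; do
    echo "$md:"; mdadm --detail "$md" | grep 'Version'
done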
Update
Thanks for the suggestions so far. After further testing:
- The BIOS was already set to boot using sda (ata1.00). After GRUB was reinstalled to all drives with `dpkg-reconfigure grub-pc`, nothing changed and GRUB still hangs before the menu when the new drive is connected by SATA. This couldn't have been accounted for by /boot/grub contents not matching the core image anyway. Similarly, physically rearranging drives makes no difference.
- Upgrading GRUB to 2.02 in Debian Jessie only has the effect that the `Welcome to GRUB!` messages are not printed - instead it gets as far as changing graphics mode. It still hangs under the same conditions.
- The hang appears to occur before the embedded configuration sets the `debug` variable. No useful debug information is emitted.
- GRUB shows a menu when booted from a removable medium where the prefix does not use UUIDs, and in this way it is possible to boot the system with the drive physically present. However, TAB enumeration of drives freezes. As expected, chainloading GRUB from a hard drive hangs as before. Booting from a USB drive made by `grub-mkrescue` from the same system also hangs.
- As a separate fault, on the live system (Linux 3.2.0-4-amd64), trying to add the new 4Kn drive to the RAID1 array, either via internal SATA or USB, results in `Bad block number requested` on the device, followed by the md system failing the drive, `BUG: unable to handle kernel paging request` and a kernel oops. (`mdadm --remove` says the failed element is busy and the md resync process doesn't respond to SIGKILL. I didn't try `echo frozen > /sys/block/mdX/md/sync_action`. Testing the drive using `dd` over SATA, everything appears fine.) Surely the Linux MD drivers are capable of syncing a 4Kn drive with older drives and do not use the BIOS? (A sector-size check is sketched after this list.)
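For completeness, the logical and physical sector sizes that the kernel sees for the old and new drives can be compared with something like this (a sketch; device names are illustrative):
# logical sector size and physical block size as reported by the kernel
blockdev --getss --getpbsz /dev/sdc    # new drive: 4096-byte logical sectors
blockdev --getss --getpbsz /dev/sdb    # old drive: 512-byte logical sectors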
So workarounds might include mounting a non-RAID partition as `/boot`; installing GRUB with a device-dependent prefix (sketched below); or flashing the BIOS. The most sensible thing is probably to contact the supplier to exchange the drives.
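The device-dependent prefix would look something like the sketch below, adapted from the Debian-generated command above; `(hd0,msdos1)` is an assumption for sda1, and whether `grub-install` can be made to do this persistently is part of the question.
# rebuild the core image with a prefix naming the partition directly,
# instead of the (mduuid/...) form Debian generates, then reinstall it
grub-mkimage -d /usr/lib/grub/i386-pc -O i386-pc \
    --output=/boot/grub/core.img \
    --prefix='(hd0,msdos1)/grub' \
    biosdisk ext2 part_msdos raid mdraid09
grub-setup -d /boot/grub /dev/sda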
In other words question 3 has a solution whose ineffectiveness is possibly subject of a GRUB feature request; question 2 was barking up the wrong tree, so I've revised it; and question 1, if it's not going too far off topic, is now additionally about why the drive apparently cannot be used for Linux RAID.
I'd be happy to award the bounty to a decent explanation of any of this, something about the RAID resync bug, or anecdotes of using flashrom
for 4Kn support, how to tell grub-install not to use UUIDs or any relevant sysadmin tips.