
I have a dedicated server (my own hardware) hosted in a remote datacenter, running CentOS release 6.4 (Final) with GRUB (GNU GRUB 0.97). The server has six 2 TB drives carrying two software RAID arrays: md0, a RAID 1 for the system and swap, and md1, a RAID 5 for data. Here is the fdisk and mdstat output:

[root@s3 ~]# fdisk -l

Disk /dev/sda: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x0004f1ce

   Device Boot      Start         End      Blocks   Id  System
/dev/sda1   *           1         653     5242880   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sda2             653      243202  1948270592   fd  Linux raid autodetect

Disk /dev/sdb: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x0008177f

   Device Boot      Start         End      Blocks   Id  System
/dev/sdb1   *           1         653     5242880   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdb2             653      243202  1948270592   fd  Linux raid autodetect

Disk /dev/sdc: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x0008afb7

   Device Boot      Start         End      Blocks   Id  System
/dev/sdc1   *           1         653     5242880   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdc2             653      243202  1948270592   fd  Linux raid autodetect

Disk /dev/sde: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00012b28

   Device Boot      Start         End      Blocks   Id  System
/dev/sde1   *           1         653     5242880   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sde2             653      243202  1948270592   fd  Linux raid autodetect

Disk /dev/sdd: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x000df271

   Device Boot      Start         End      Blocks   Id  System
/dev/sdd1   *           1         653     5242880   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdd2             653      243202  1948270592   fd  Linux raid autodetect

Disk /dev/sdf: 2000.4 GB, 2000398934016 bytes
255 heads, 63 sectors/track, 243201 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x000b6b9e

   Device Boot      Start         End      Blocks   Id  System
/dev/sdf1   *           1         653     5242880   fd  Linux raid autodetect
Partition 1 does not end on cylinder boundary.
/dev/sdf2             653      243202  1948270592   fd  Linux raid autodetect

Disk /dev/md1: 9974.5 GB, 9974471720960 bytes
2 heads, 4 sectors/track, -1859793536 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 524288 bytes / 2621440 bytes
Disk identifier: 0x00000000


Disk /dev/md0: 5368 MB, 5368643584 bytes
2 heads, 4 sectors/track, 1310704 cylinders
Units = cylinders of 8 * 512 = 4096 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disk identifier: 0x00000000


Disk /dev/mapper/vg_s3-LogVol01: 12.6 GB, 12582912000 bytes
255 heads, 63 sectors/track, 1529 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 524288 bytes / 2621440 bytes
Disk identifier: 0x00000000


Disk /dev/mapper/vg_s3-LogVol00: 3271 MB, 3271557120 bytes
255 heads, 63 sectors/track, 397 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 524288 bytes / 2621440 bytes
Disk identifier: 0x00000000


Disk /dev/mapper/vg_s3-LogVol02: 9958.6 GB, 9958615678976 bytes
255 heads, 63 sectors/track, 1210732 cylinders
Units = cylinders of 16065 * 512 = 8225280 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 524288 bytes / 2621440 bytes
Disk identifier: 0x00000000

[root@s3 ~]# cat /proc/mdstat
Personalities : [raid6] [raid5] [raid4] [raid1]
md0 : active raid1 sdc1[2] sda1[0] sdd1[3] sdb1[1] sde1[4] sdf1[5]
      5242816 blocks super 1.0 [6/6] [UUUUUU]
      bitmap: 1/1 pages [4KB], 65536KB chunk

md1 : active raid5 sda2[0](F) sdd2[3] sdc2[2] sdb2[1] sde2[4] sdf2[6]
      9740695040 blocks super 1.1 level 5, 512k chunk, algorithm 2 [6/5] [_UUUUU]
      bitmap: 15/15 pages [60KB], 65536KB chunk

unused devices: <none>

As you can see, the md1 array is in a degraded state: the /dev/sda drive is faulty. I have exchanged many faulty drives in other servers before, but here /dev/sda is (probably) also the boot drive. Here is my GRUB device map:

# this device map was generated by anaconda
(hd0)     /dev/sdb
(hd1)     /dev/sdc
(hd2)     /dev/sdd
(hd3)     /dev/sde
(hd4)     /dev/sdf
(hd5)     /dev/sdg

For some reason /dev/sda is not listed there, and the map includes /dev/sdg, which according to fdisk does not exist. Here is /boot/grub/grub.conf:

# grub.conf generated by anaconda
#
# Note that you do not have to rerun grub after making changes to this file
# NOTICE:  You do not have a /boot partition.  This means that
#          all kernel and initrd paths are relative to /, eg.
#          root (hd0,0)
#          kernel /boot/vmlinuz-version ro root=/dev/md0
#          initrd /boot/initrd-[generic-]version.img
#boot=/dev/sdb
default=0
timeout=5
splashimage=(hd0,0)/boot/grub/splash.xpm.gz
hiddenmenu
title CentOS (2.6.32-358.el6.x86_64)
    root (hd0,0)
    kernel /boot/vmlinuz-2.6.32-358.el6.x86_64 ro root=UUID=cf9ba269-255e-4650-a095-87f2cdc5e22e rd_NO_LUKS rd_LVM_LV=vg_s3/LogVol01 LANG=en_US.UTF-8 rd_MD_UUID=596e4bea:c4494ac6:b2007529:1c8053a7 SYSFONT=latarcyrheb-sun16 crashkernel=auto rd_MD_UUID=40df07b4:85b88119:f11dabdf:97836f34  KEYBOARDTYPE=pc KEYTABLE=us rd_NO_DM rhgb quiet
    initrd /boot/initramfs-2.6.32-358.el6.x86_64.img

It lists (hd0,0) as the boot device. I think that maps to /dev/sda, which is the faulty drive, so if I turned the server off now to change it, it would not boot again. I am trying to switch the boot drive to one of the other disks; I have done a lot of searching online, but I cannot figure it out. I tried these commands:

[root@s3 ~]# grub-install /dev/sdb
/dev/sda1 does not have any corresponding BIOS drive.
[root@s3 ~]# grub
Probing devices to guess BIOS drives. This may take a long time.


    GNU GRUB  version 0.97  (640K lower / 3072K upper memory)

 [ Minimal BASH-like line editing is supported.  For the first word, TAB
   lists possible command completions.  Anywhere else TAB lists the possible
   completions of a device/filename.]
grub> find /grub/stage1
find /grub/stage1

Error 15: File not found
grub> find /boot/grub/stage1
find /boot/grub/stage1
 (hd0,0)
 (hd1,0)
 (hd2,0)
 (hd3,0)
 (hd4,0)
 (hd5,0)
grub> cat (hd0,0)/grub/grub.conf
cat (hd0,0)/grub/grub.conf

Error 15: File not found
grub> quit
quit 

What changes should I make to GRUB so that the server can still boot after I swap /dev/sda? The HDD boot order will probably also have to be modified in the BIOS.

Josef

1 Answer


When using software RAID, the selected boot drive depends entirely on the BIOS (and on how it enumerates drives).

To keep the machine bootable, simply replace the failed drive and use grub-install to put the bootloader on the BIOS boot drive or, even better, on all the drives participating in the RAID 1 array.
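
For illustration, here is a minimal sketch of the whole replacement, assuming the new disk comes up as /dev/sda again and /dev/sdb is a healthy member (the device names are assumptions; adjust them to your system):

# Remove the failed member from md1; md0 still lists sda1 as healthy,
# so fail and remove it by hand before pulling the disk.
mdadm --manage /dev/md1 --remove /dev/sda2
mdadm --manage /dev/md0 --fail /dev/sda1 --remove /dev/sda1

# After swapping the disk, copy the MBR partition table from a healthy
# member to the new drive.
sfdisk -d /dev/sdb | sfdisk /dev/sda

# Re-add the new partitions to both arrays and let them resync.
mdadm --manage /dev/md0 --add /dev/sda1
mdadm --manage /dev/md1 --add /dev/sda2

# Put the GRUB boot code on every drive, so the BIOS can boot from
# whichever disk it picks; --recheck regenerates the stale
# /boot/grub/device.map.
for d in /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf; do
    grub-install --recheck "$d"
done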

Please see here for more information.
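
If grub-install keeps failing with "/dev/sda1 does not have any corresponding BIOS drive" (typically the symptom of a stale /boot/grub/device.map), the GRUB legacy shell can write the boot sector directly. A sketch, assuming /dev/sdb is the target; repeat it for each of the other healthy drives:

# Map the drive to a BIOS device name by hand, then install stage1/stage2.
# This works here because /boot lives on a RAID 1 with a 1.0 superblock,
# so each member partition is readable as a plain filesystem.
grub --batch <<EOF
device (hd0) /dev/sdb
root (hd0,0)
setup (hd0)
quit
EOF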

shodanshok
  • Thanks for your help. I looked at the link, and they are using the `grub-install` command as well. When I try to use `grub-install` like `[root@s3 ~]# grub-install /dev/sdb` or `[root@s3 ~]# grub-install /dev/sdc`, it gives this error: `/dev/sda1 does not have any corresponding BIOS drive.` – Josef Nov 19 '16 at 19:13
  • Have you rebooted the machine after installing the new drive? Does the BIOS recognize it? – shodanshok Nov 19 '16 at 19:24
  • No, I have not changed the drive yet; I am trying to prepare for the change. I wanted to shut the server down for the swap, as I am not 100% sure that the disk marked as /dev/sda really is /dev/sda. If the server were off, I could check it by the disk serial number, but if I pulled the wrong drive while the server is on, the RAID 5 would break. So I wanted to make sure that the other drives are bootable before I change the drive. But now I see that I will have to risk it and change the drive while the server is running. – Josef Nov 19 '16 at 20:47
  • Rather than pulling a random drive, try to identify the right disk by correlating the information you can obtain from `dmesg`, `mdadm -E` and `smartctl` (see the sketch after this thread). – shodanshok Nov 19 '16 at 22:04
  • I know the serial number of the drive from `hdparm`. The problem is that I will be able to confirm the serial number only after I pull the disk out. I put stickers with sda, sdb, etc. labels on the drives when I first installed the server; I just hope I did it correctly back then. If the server were off, I would be sure that I changed the right drive, but there would be a risk that it would not boot again. – Josef Nov 20 '16 at 00:05
  • From your comment above, you ran `grub-install` on both `sdb` and `sdc`, so you should have the boot code on those two drives. I would pull the drive *after* a server shutdown, rather than risk a full server crash due to removing the wrong drive. Anyway, by searching the output of `dmesg` and looking inside `/sys/`, you should be able to identify which drive is connected to which SATA channel. Have a look [here](http://serverfault.com/questions/244944/linux-ata-errors-translating-to-a-device-name) for more information. – shodanshok Nov 20 '16 at 06:50
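
For completeness, a minimal sketch of that kind of correlation, using the device names from this question (the exact output labels vary with tool versions):

# Serial number of the disk the kernel currently calls sda:
smartctl -i /dev/sda | grep -i serial
hdparm -I /dev/sda | grep -i serial

# Confirm the partition really is the failed md1 member:
mdadm -E /dev/sda2

# Which ATA host/port the disk hangs off, for matching dmesg errors
# to a physical bay:
ls -l /sys/block/sda
dmesg | grep -i ata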