
I'm running an Ubuntu server on an md RAID5 array. I started having issues with one disk, and I received the following email from mdadm:

A DegradedArray event had been detected on md device /dev/md/0.
md0 : active raid5 sdb2[1](F) sdd2[2] sda2[0]
      1952861184 blocks super 1.2 level 5, 512k chunk, algorithm 2 [3/2] [U_U]
md1 : active raid0 sdb3[1] sdd3[2] sda3[0]
      2927924736 blocks super 1.2 512k chunks

And the following from smartd :

Device: /dev/sdb [SAT], Self-Test Log error count increased from 0 to 2
Device info:
ST2000DM001-1CH164, S/N:Z1E3M3TE, WWN:5-000c50-050534ead, FW:CC24, 2.00 TB

md0 is my /, and md1 is just for some unimportant data.

So, sdb is definitely falling apart... The issue is, the system apparently crashed somehow and is not booting anymore. Right after the BIOS the screen goes black and that's it, nothing more... I was expecting it to still boot on 2 disks and just be slow, but that's not the case. Any idea why?

I would like to boot the server in degraded mode ASAP, as I need it to be running, but I don't know what to do. Can you suggest something? From there I should then be able to repair the RAID5 volume, shouldn't I?

Do you think the error is localized to the disk, so that I can repair it and get back to a stable state, or is the disk dead and I need to buy a new one?

Thanks for your help.

Xantra
    Replace the faulty disk, and then restore your backups. If you've been doing it right, you'll be back up and running in no time. If you *haven't* been doing it right, there's no time like the present to start. – HopelessN00b Mar 03 '16 at 19:40

2 Answers


Do you manage to get as far as the bootloader? If yes, remove all "splash" or "quiet" options to get as much output as possible.
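
With GRUB you can do this once at boot without permanently changing any config; a minimal sketch (the kernel line shown is illustrative, yours will differ):

    # At the GRUB menu, press 'e' on the default entry, then delete
    # "quiet splash" from the line that starts with "linux", e.g.:
    #
    #   linux /vmlinuz-... root=/dev/md0 ro quiet splash
    # becomes:
    #   linux /vmlinuz-... root=/dev/md0 ro
    #
    # Then press Ctrl-x (or F10) to boot with full kernel output.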

If it doesn't even reach the bootloader, I can only imagine the disk is so badly broken that it somehow prevents the whole SATA controller from functioning. You could physically unplug the failed drive (you have its serial number in the email) and see whether it boots then. It should boot off a degraded RAID5, and let you replace the disk and resynchronize.
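
Once the machine is up on the degraded array, the resync itself is plain mdadm. A minimal sketch, assuming the replacement disk shows up as /dev/sdb again and the other device names match the question:

    # Copy the partition layout from a healthy member to the new disk
    # (sfdisk handles MBR tables; use sgdisk for GPT):
    sfdisk -d /dev/sda | sfdisk /dev/sdb

    # Add the new partition back into the RAID5; resync starts automatically:
    mdadm /dev/md0 --add /dev/sdb2

    # Watch rebuild progress:
    cat /proc/mdstat

Note that md1 is RAID0, so it cannot be rebuilt; it has to be recreated and its data restored from elsewhere.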

And I would definitely buy a replacement disk; you'll need it anyway!

BeerSerc
    You're right, the HDD was so badly broken it was preventing my BIOS from POSTing. I couldn't figure this out sooner as I don't have physical access to the machine; the computer is in another country. So after asking someone there to remove the faulty drive, I now have my system properly booting on the degraded RAID5. Thanks everyone for your ideas – Xantra Mar 05 '16 at 19:25

I'm assuming you didn't forget to run grub-install on all disks. If it's the problem I think it is, it has been known and ignored for years. Because the documentation says the setup is supported, the people in the distros who could fix it don't acknowledge it. They even tell you to add a kernel command-line option like "bootdegraded=1", which doesn't seem to do anything.
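
If that's what happened, the fix is worth applying to every member disk, not just one. A minimal sketch, using the device names from the question:

    # Install GRUB to the MBR of each disk holding a RAID member,
    # so the machine can still boot when any single disk dies:
    grub-install /dev/sda
    grub-install /dev/sdb
    grub-install /dev/sdd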

You cannot reliably boot mdadm arrays with RAID levels other than 1, and you cannot reliably boot with metadata versions other than 0.90 and 1.0. The documentation says everything is supported, but these setups simply don't work properly in certain cases, such as when degraded. (Some distros have fixes for the metadata issue, but they don't warn you about the RAID level; Ubuntu's installer, for example, will even use metadata 1.2, even though that is a bad idea.) So you should have built the array long ago with RAID1 and metadata 0.90 or 1.0 on a separate /boot array.
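
For reference, this is roughly what such a /boot array looks like. A sketch only, assuming a small spare partition on each disk (the /dev/sdX1 names here are hypothetical):

    # RAID1 with metadata 1.0 keeps the superblock at the end of each
    # member, so the bootloader sees each one as a plain filesystem:
    mdadm --create /dev/md2 --level=1 --metadata=1.0 --raid-devices=3 \
          /dev/sda1 /dev/sdb1 /dev/sdd1
    mkfs.ext4 /dev/md2

    # Mount it at /boot, move the kernels and initramfs images onto it,
    # update /etc/fstab, then run grub-install on every disk.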

To fix it now, I guess you could boot a rescue system, free up some space on the disks (or on a new disk), and create a separate /boot. Or use the rescue system only to rebuild the failed disk (don't forget to grub-install to the new disk).
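
A minimal sketch of the rescue-system route, assuming the surviving members are still /dev/sda2 and /dev/sdd2 as in the question:

    # Assemble and start the degraded RAID5 (--run allows starting
    # with a missing member):
    mdadm --assemble --run /dev/md0 /dev/sda2 /dev/sdd2
    mount /dev/md0 /mnt

    # Bind-mount the pseudo-filesystems and chroot in:
    for d in dev proc sys; do mount --bind /$d /mnt/$d; done
    chroot /mnt

    # Reinstall the bootloader on the surviving disks:
    grub-install /dev/sda
    grub-install /dev/sdd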

Peter