sda1 (?) raid failed on debian - what to do now?

Question

ISPConfig reports that my server has RAID problems. The server is not mine; it is rented from a hosting company. I did not install the OS myself; the hosting firm did.

cat /proc/mdstat 

Personalities : [raid1] 
md0 : active raid1 sda1[2](F) sdb1[1]
      312568576 blocks [2/1] [_U]

I'm really not familiar with this problem; I have never run into anything like it.

I guess sda1 is dead. Can you tell me what to do now (apart from calling the hosting firm)? I have everything important backed up.

score 8 · Accepted Answer · answered Apr 29 '12 at 20:52

Don't panic, this is a common and recoverable error. Your hosting company set up a two-disk redundant array to protect the data in case one of the disks fails. This failure has now occurred. The output indicates that sda1 has failed, and that the RAID1 array is working, but degraded.
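For reference, this is how the mdstat output in the question reads: `(F)` after `sda1[2]` means the kernel has flagged that member as faulty, `[2/1]` means the array expects two devices but only one is active, and the underscore in `[_U]` marks the missing slot. As a rough sketch (the function name `failed_members` is made up for illustration), the failed members could be picked out like this:

```shell
#!/bin/sh
# Sketch: list members flagged "(F)" in mdstat-formatted text read from stdin.
# failed_members is a hypothetical helper name, not part of mdadm.
failed_members() {
    awk '
        /^md[0-9]+ :/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /\(F\)$/) {
                    dev = $i
                    sub(/\[.*/, "", dev)   # "sda1[2](F)" -> "sda1"
                    print $1 ": " dev " is marked failed"
                }
            }
        }
    '
}

# Typical use on the live system:
#   failed_members < /proc/mdstat
```

Fed the output from the question, this should name sda1 as the failed member of md0.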

Right now you're on borrowed time, though. If the second disk fails, that data is gone and you'll have to restore from backup. Ask your hosting company to replace the failed disk immediately and get back to you when it's done!

Joel E Salas


Thanks a lot. Will the hosting company need to log in to my server and do anything via the console, or do they 'only' have to physically replace the disk? Will I have to do anything after the disk has been replaced? – Wildfire Apr 29 '12 at 20:57
That depends on your service agreement with the hosting company. If they're a "full service" shop, then they'll do everything that's required and you can put your feet up on the table. If they ONLY physically swap the disk, you need to get your feet wet with mdadm. You don't sound comfortable with this option. If at ALL possible, give them access and have them do the entire job. Expect them to bill at a hefty rate (100-200 USD per hour). It's worth the peace of mind (and having someone to yell at). – Joel E Salas Apr 29 '12 at 21:00
OK, my last question before calling them (it's almost midnight here): if they do *not* do anything except replace the drive, do I need to do anything *before* the disk replacement? – Wildfire Apr 29 '12 at 21:04
Depends on the storage controller, and the way that booting is configured. Pulling the disk hot (while the server is powered on) could cause a kernel panic. Furthermore, that failed disk may hold other things (like the files and settings required for Linux to boot). These are questions that your hosting company should be able to answer. – Joel E Salas Apr 29 '12 at 21:06
The disk does not necessarily have to be dead. I have brought disks back to life after they showed a similar failure and made them an active part of the mdadm array again (running for years afterwards). You may have good luck failing/removing and re-adding the disk with the proper mdadm commands. Of course you need to know what you're doing, so if you feel uncomfortable with it, don't try it. – aseq Apr 30 '12 at 22:05
@aseq: Disks are cheap, your data probably isn't. Unless you know that a transient error happened and probably won't happen again, once a drive is marked bad, why not replace it? – Bill Weiss May 06 '12 at 02:30
Sure, if this is a server with critical data, go ahead and replace it. It's still useful to try to bring the disk back to life until you get a replacement, though. But if the server is a lowly QA machine or dev test box that's being abused anyway, I see little point in replacing a disk unless it is really broken. – aseq May 06 '12 at 04:28
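
The fail/remove/re-add cycle mentioned in the comments might look roughly like the sketch below. The device names are assumptions taken from the mdstat output in the question, and `run` and `readd_member` are hypothetical helper names. The commands are only printed unless DRY_RUN=0 is set, since running them for real needs root and changes the array. A blank replacement disk would also have to be partitioned to match the surviving disk first, which this sketch does not cover.

```shell
#!/bin/sh
# Sketch only. ARRAY and BAD are assumptions based on the question's mdstat output.
ARRAY=${ARRAY:-/dev/md0}
BAD=${BAD:-/dev/sda1}

# Print each command instead of executing it unless DRY_RUN=0 is set explicitly.
run() {
    if [ "${DRY_RUN:-1}" = "0" ]; then
        "$@"
    else
        echo "would run: $*"
    fi
}

readd_member() {
    run mdadm "$ARRAY" --fail "$BAD"     # make sure the member is marked faulty
    run mdadm "$ARRAY" --remove "$BAD"   # detach it from the array
    run mdadm "$ARRAY" --add "$BAD"      # add it back; this starts a resync
    run cat /proc/mdstat                 # watch the rebuild progress here
}
```

Calling `readd_member` with the defaults just lists the four commands, which makes it easy to double-check the device names before doing anything for real.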

score 1 · Answer 2 · answered May 06 '12 at 01:31

Also, making sure the backups are fine before messing or having someone mess with a degraded raid is a good idea. Cascade failures sadly happen, and so do mistakes by host staff (triple check that you and the hands at the hosting company are on the same page as to what is to be done to which disk).

AFAIK, if the device is called /dev/mdX, it is always Linux software RAID, so no hardware storage controller apart from a straight SATA or SAS host adapter is involved.

There are ways in Linux to tell it that a disk is to be logically removed or has been added; however, these should only ever be necessary when hotplugging directly attached PATA or parallel SCSI devices (which should be considered verboten anyway on hardware that does not explicitly support it).

smartctl (from the smartmontools package) can tell you a lot about WHAT is wrong with a drive, especially when it is directly attached as is the case here; so can dmesg. Do not run smartctl if there is an SSD involved AND the provider did not set up a smartmontools daemon or cron script on the server: there are versions of SMART utilities known to damage certain SSDs. The spinup_count and power_on_hours values on the replacement disks you get sometimes make for interesting discussion topics with hosters ;)
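
A minimal sketch of pulling a single value out of the attribute table, assuming the usual ATA layout of `smartctl -A` where the attribute name is the second column and the raw value starts in the tenth. `smart_raw` is a made-up helper name, and attribute names such as Power_On_Hours vary between drive vendors:

```shell
#!/bin/sh
# Sketch: print the raw value of one named attribute from `smartctl -A` style
# text on stdin. Assumes the ATA table layout (name in column 2, raw in column 10).
smart_raw() {
    awk -v name="$1" '$2 == name { print $10 }'
}

# Usage on the live system (as root, and only with the SSD caveat above in mind):
#   smartctl -H /dev/sda                              # overall health verdict
#   smartctl -A /dev/sda | smart_raw Power_On_Hours   # hours the disk has run
#   dmesg | grep -i sda                               # kernel messages for the disk
```

Some drives append extra text to the raw value; this sketch only prints its first token.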
