
But I don't believe it.

The machine is a Dell PowerEdge 2600 server running a 32-bit trial of Windows Server 2008 (yeah, it's not supposed to... but it works! [well, it used to]).

For the sake of confusion: the drives are numbered 0, 1 and 2.

I was coding away as usual when I noticed the Dell logo on the front of the case was orange. So I opened the case door and saw that the HD vents were completely covered with dust (I know it's not related to the orange light... but I hate dust). Since the drives are hot-swappable, I yanked drive 2 out, cleaned off the dust, and put it back in. I then yanked drive 1 out, cleaned the dust off that one, and put it back in. Someone asked me to help set up a printer on their machine, so I got up, and 20 minutes later came back to see "No boot device available - strike F1 to retry boot, F2 for setup utility" displayed on the server's monitor. I looked down at the drives, and drives 1 and 2 had orange lights instead of green ones!

Since then here is what I have tried:

  • Installed the drives in a Dell PowerEdge 2500. The drives were detected fine, but I got a "Missing operating system" message.
  • Reset the BIOS on the original PowerEdge 2600 (pulled the BIOS battery out). All drives appear fine and all drive lights are green, but I still get the "Missing operating system" message when booting.
  • Booted Ubuntu from a CD to inspect the drives. Two of the drives are displayed in Computer. Since the data is striped, the files/folders on those drives are gibberish.
  • Booted Ubuntu, opened Terminal, and executed sudo fdisk -l, which listed the 3 drives. For the third drive it reports "Disk identifier: 0x00000000" and "Disk /dev/sdb doesn't contain a valid partition table". (Some further read-only checks are sketched just below.)
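
For anyone trying to rule out an outright physical drive failure from the Ubuntu live CD, a few read-only checks along these lines can help (just a sketch - it assumes smartmontools is available on the live image, and none of these commands write to the disks):

    sudo fdisk -l                                     # list detected disks and their partition tables
    sudo smartctl -a /dev/sdb                         # SMART health report for one drive (repeat for sda/sdc)
    sudo dd if=/dev/sdb of=/dev/null bs=1M count=256  # quick read test of the first 256 MB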

Do you think the drives ARE actually toast?
Could it be a SCSI or other hardware failure?
Could it be incorrect system settings? Is there any way to create a virtual RAID in Ubuntu on the 2 drives that are "valid" so I can copy the data to a network share? (A rough sketch of one such approach follows below.)
Should I try reinstalling the Windows Server OS (eek!)? Do you have any other suggestions I can try?
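
On the "virtual RAID in Ubuntu" question, here is one possible approach, offered only as a sketch: the array came from a hardware PERC controller, so Linux's md layer won't recognise it directly, but if you can guess the disk order, chunk size, and parity layout, mdadm can sometimes rebuild a read-only view of it. The device names, overlay sizes, and chunk size below are assumptions, and the copy-on-write overlays exist so nothing is ever written to the original disks:

    # Sparse overlay files so any writes land in /tmp, never on the real drives
    dd if=/dev/zero of=/tmp/cow-sdb bs=1 count=0 seek=2G
    dd if=/dev/zero of=/tmp/cow-sdc bs=1 count=0 seek=2G
    sudo losetup /dev/loop1 /tmp/cow-sdb
    sudo losetup /dev/loop2 /tmp/cow-sdc
    sudo dmsetup create sdb_cow --table "0 $(sudo blockdev --getsz /dev/sdb) snapshot /dev/sdb /dev/loop1 N 8"
    sudo dmsetup create sdc_cow --table "0 $(sudo blockdev --getsz /dev/sdc) snapshot /dev/sdc /dev/loop2 N 8"

    # Try to assemble a degraded RAID-5 over the overlays (disk order and chunk size are guesses;
    # you may also need to try --layout=la / ls / ra / rs to match the controller's parity rotation)
    sudo mdadm --create /dev/md0 --assume-clean --level=5 --raid-devices=3 --chunk=64 \
        /dev/mapper/sdb_cow /dev/mapper/sdc_cow missing

    # If the guess was right, map the partitions inside the virtual disk and mount read-only
    sudo kpartx -av /dev/md0            # assumes the kpartx package is available
    sudo mkdir -p /mnt/recovered
    sudo mount -o ro /dev/mapper/md0p1 /mnt/recovered

This is effectively what tools like Raid Reconstructor automate - they try the permutations of disk order, chunk size, and parity rotation for you.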


UPDATE

After doing lots of googling I came across Raid Reconstructor. I tried running it on the Dell PowerEdge 2600 from a bootable Windows XP CD, but it did not work (no drives detected). I then installed two of the drives in the PowerEdge 2500, alongside the 2500's existing single-drive RAID 0 running Windows Server 2003, installed and activated Raid Reconstructor (which created a virtual image of the RAID-5 array), opened the image with Captain Nemo, and backed up my C:/Websites directory to another computer... with ALL files intact (so far)!!!
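
Once the recovered files are sitting on a healthy disk, robocopy is a convenient way to push another copy to a network share (the share name and paths here are made up; robocopy ships with Vista/2008 and is in the Resource Kit Tools for Server 2003):

    robocopy C:\Recovered\Websites \\BACKUPBOX\recovered\Websites /E /COPY:DAT /R:1 /W:1 /LOG:C:\websites-copy.log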

I will hopefully be able to restore the drives 100%.

Lessons learned:

  • I don't care if the server can "hot-swap" drives. DON'T FREAKING DO IT!
  • Back up your data, dummy!

Thanks for all your help, answers, and comments (and for being wrong about the data loss. haha)!

David Murdoch
  • Of course you can hot-swap drives with no problems, assuming you do it right. You can't just yank a drive, put it back, and immediately yank another drive. Arrays need to rebuild after a failure, and removing a drive is treated as a failure. Just because you didn't know this doesn't mean that there is a problem with hot-swapping drives. – MDMarra Jun 04 '10 at 22:49
  • I've spoken to others who had bad experiences with hot-swapping drives in Dell x9xx machines. On Dell 2950 and 1950 machines, we've had a number of problems when hot-swapping a drive (both SAS and SATA drives). It should work, but I've seen some pretty bad failures, but for us most of the failures were recoverable. – Stefan Lasiewski Jun 04 '10 at 22:57
  • @Stefan: The issue isn't with the hot swapping at all. The problem is when drive 2 was pulled, the array would have gone to *Degraded* state, then when drive 1 was pulled, the array would have gone *Failed*, and as Chopper3 aptly puts it - is toast. – Ben Pilbrow Jun 04 '10 at 23:02
  • @Stefan - We are almost exclusively a Dell shop and have dozens of 19xx and 29xx series servers in production and haven't had a single issue with hot-swap. We also keep the RAID controller firmware up to date when the release notes indicate a worthwhile fix. But many failed drives have been hot-swapped by me personally in those systems with no ill effects. – MDMarra Jun 04 '10 at 23:08
  • A better clarification of "I don't care if the server can "hot-swap" drives. DON'T FREAKING DO IT!" would be to make sure you thoroughly understand RAID: what it means, what the different RAID levels are, and how each RAID level handles disk loss/removal. If this had been a RAID10 array, you probably would have been fine (assuming you didn't pull two disks from the same mirrored pair). It all comes down to understanding your system before you touch things like that. Hot swap was never the problem here. Ignorance was. – Christopher Cashell Jun 23 '10 at 16:21
  • @Christopher +1. I agree. :-) – David Murdoch Jun 23 '10 at 16:31
  • It's also a good idea to keep the firmware in the RAID cards and the drives up to date. But I agree, you can't pull a drive, put it back, and immediately pull another. You need to check the state of the RAID array in the software tools before pulling anything out. Thanks for the tip on the recovery tools though. – mauvedeity Sep 23 '11 at 19:39

3 Answers


Biggest 'Doh!' of the week I reckon - sorry dude.

The drives themselves won't be physically broken; you've simply killed the array by removing a second disk before the first one had rebuilt - I'm >90% sure your array is toast. Basically you shouldn't have removed them at all while live; if you absolutely had to, you should have waited for the array to rebuild before doing the second disk.

It's reinstall/restore time I'm afraid - your data is gone.
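
For what it's worth, the safe sequence on these PERC-based boxes is to check the virtual disk state and rebuild progress before touching a second drive. If Dell OpenManage Server Administrator happens to be installed (an assumption - it often isn't), that check is roughly:

    omreport storage vdisk controller=0    # virtual disk State should be Ready, not Degraded/Failed
    omreport storage pdisk controller=0    # physical disk states; a rebuilding disk shows its progress here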

Chopper3
  • Sounds like this is exactly what happened. – egorgry Jun 04 '10 at 14:20
  • Yah. That's what I was afraid of too. – David Murdoch Jun 04 '10 at 14:26
  • Never, ever pull a working drive in a live server unless you absolutely have no choice in the matter (like someone has a gun to your head). Shut down the server (or heck, yank the power cords) before ever messing with working hardware. – Chris S Jun 04 '10 at 14:26
  • Hope you didn't have anything planned for the weekend :( - I think we've all found out the hard way that you should never remove a disk just for the hell of it. I did just the same kind of thing back when you were in kindergarten ;) – Chopper3 Jun 04 '10 at 14:27
  • I still get nervous swapping drives on our NetApp with RAID-DP. :) Funny little story: I once worked with a gentleman who tried to hot-swap a SCSI cable on an HP-UX system, I think a K580? That didn't work out in his favor. – egorgry Jun 04 '10 at 14:35
  • I have thousands of servers and I pull drives live all the time (can't take the server down no matter what, and true, they are usually in clusters so it's not an apples-to-apples comparison); the trick is to know when not to pull them. The biggest problem I see is that sysadmins don't install the array manager for the given OS, so they are blind to some issues that they need to be aware of before pulling a drive. – tony roth Jun 04 '10 at 14:38
  • I'm going to try to reinstall the OS first, I guess. – David Murdoch Jun 04 '10 at 14:45
  • You'll have to rebuild the array first, of course. – Chopper3 Jun 04 '10 at 14:51
  • Please read my update posted in the answer. – David Murdoch Jun 04 '10 at 22:20
  • I can't see what's changed? – Chopper3 Jun 04 '10 at 22:37
  • Ah, seen your update now - well done. I was certain it was dead and had never come across that app, so thanks. – Chopper3 Jun 05 '10 at 06:45

After retrieving my data with Raid Reconstructor I went to reconfigure my RAID and reinstall the OS.

When I got to the OS install prompt, I decided to try one last time to repair the OS boot files manually from the CMD prompt...

It Worked.

Computer is back up and running (limping). I still need to do a full repair install since some system files are being reported as corrupt.
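
The exact commands aren't recorded here, but a manual boot-file repair from the Server 2008 install disc's recovery command prompt typically looks something like this (the drive letter is an assumption - inside the recovery environment the system volume is often lettered differently):

    bootrec /fixmbr
    bootrec /fixboot
    bootrec /rebuildbcd
    chkdsk C: /f
    sfc /scannow /offbootdir=C:\ /offwindir=C:\Windows

The last line is an offline system-file check, which is also the usual next step when system files are being reported as corrupt.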

David Murdoch

A lot of times it's the backplane or SCSI controller that's bad; when it was a backplane issue, in my case 9 times out of 10 it was a firmware issue.

On the 2500, did you have it rebuild the array, or did you just put the drives in and it found the array?

Edit: I should have read your question better! Chopper3 is right.

tony roth