
I have a Hyper-V box that is showing bad blocks on one of its disks. I got this from diskpart:

DISKPART> list disk

  Disk ###  Status         Size     Free     Dyn  Gpt
  --------  -------------  -------  -------  ---  ---
  Disk 0    Online          148 GB  4096 MB
  Disk 1    Online         1863 GB      0 B   *
  Disk 2    Online         1863 GB      0 B   *
  Disk 3    Errors         1863 GB      0 B   *

I typed:

sel disk 3
offline disk
online disk

And now it's simply showing as Online. Is that enough? Presumably it can work around a bad block or two. Is there any way of re-formatting the failing disk and re-syncing it with the array from the command line, or will I have to replace it immediately?

Update - still shows 'Failed Rd' after repair

So, using a spare cable, I've plugged in a brand new additional HDD. Apparently I'm supposed to leave the existing unit in place, as diskpart cannot repair an array with a missing disk (I don't know why - that would have seemed to be the point). Then I did the following to initialise it:

sel disk 4
convert dynamic

Then, to repair the array:

sel vol 0
repair disk=4

As I understand it, this is supposed to use the new disk 4 to repair the array without the failing disk 3. And as expected, I get this;

  DISKPART> list vol

  Volume ###  Ltr  Label        Fs     Type        Size     Status     Info
  ----------  ---  -----------  -----  ----------  -------  ---------  --------
* Volume 0     E   E_RAID5_4TB  NTFS   RAID-5      3726 GB  Rebuild
  Volume 1     C   C_BOOT(MIR)  NTFS   Partition     39 GB  Healthy    System
  Volume 2     D   D_DATA(MIR)  NTFS   Partition     52 GB  Healthy
  Volume 3     G   G_IMAGES(MI  NTFS   Partition     52 GB  Healthy    Boot

However, after around 12-24 hours the array reverted back to Failed Rd, with 1863 GB free space on the new disk. I've tried twice now with the same results. I'm now trying the simpler recover command, but I'm expecting the same result.

Over the last decade or so, this has continued to be my experience with RAID. For personal servers where drive failures are relatively infrequent (around one every 2-3 years on average), I'm certainly of the opinion that Windows RAID is less hassle than any hardware controller, which always seems to be obsolete by the time I need it for recovery. However, I don't think I've ever managed to recover a RAID array with either hardware OR software easily and live the dream the way it was promised.

When (as I expect) the recover command fails to repair the disk, I'm going to try physically putting the new HDD in place of the failed disk and try to bring it online that way. I seem to vaguely recall doing something similar last time.

I'd be grateful for any further advice on this situation, though, even if it's just to remind me of the steps for replacing the physical disk.

cirrus
  • Seems I'm not alone http://social.technet.microsoft.com/Forums/windowsserver/en-US/70f2c113-0a99-411a-9399-08bfef7c568e/cannot-replace-failed-raid5-member-disk – cirrus Aug 17 '14 at 12:05

2 Answers


The RAID array will come with its own software. Most controllers will let you run that software under Core; if not, you can run some under the BIOS and some from a CD. With a RAID array you usually need to introduce the new hard drive as a hot spare; the controller will then add it to the RAID itself and re-sync. A replaced hard drive is treated as a brand new hard drive as far as the RAID array is concerned. Doing it any other way, you could lose your whole array. Read The Fine Manual for your RAID controller.

Glen
  • I believe it's windows dynamic raid. I've ordered a new 2TB disk which I hope will be large enough. I'd have thought I could just pull the failing disk out and plug a new one in but from reading the diskpart docs it doesn't look to be the case. Mind you, it's not clear at all from the docs what I need to do. – cirrus Aug 15 '14 at 00:52
  • Hi @glen, I've added an update to my question if you'd care to expand your answer further? – cirrus Aug 17 '14 at 11:08

So, as has always been my experience, the recovery process doesn't seem to work the way it's documented.

I solved it by cloning the failing disk to a brand new disk and then physically plugging it in place of the failed one. Then I issued the diskpart recover command (which I believe is shorthand for repair).

I managed to do this without ISO boot recovery CDs or external hardware, as follows:

1) Plug in the new disk (using a spare port) and note its disk ID. Make sure it's offline.
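On my system that step looked like this in diskpart (assuming the new disk enumerates as disk 4, as it did for me - substitute whatever ID list disk shows for yours):

```
list disk
sel disk 4
offline disk
```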

2) Take the RAID volume offline cleanly:

sel vol 0

REM Remove the drive letter association - you may need to
REM shut down any services using this volume first
remove

offline vol

3) Locate the physical disk responsible for the failure. detail vol will tell you which disks are in the volume, and list disk will show which disk ID has errors. All my disks are identical models, so I physically pulled the SATA cable out, waited a few seconds, and issued list disk again to see which disk ID was missing, and took a note of that. Then, in my case:

sel disk 3
offline disk

4) To clone the disk sufficiently for Windows to be fooled into thinking the new disk was simply the old disk repaired, I suspected it would need to have the same disk 'signature', so I needed a low-level sector copy.

Most cloning tools that use VSS or copy files wouldn't work, so I found this: http://hddguru.com/software/HDD-Raw-Copy-Tool/ which was brilliant. It has a zero-install EXE that looks like it is designed to run under WinPE, so it worked perfectly under Hyper-V Server (and presumably Server Core as well) when launched from the command line.

Again, however, I crucially needed to know which disks were source and target, but the tool showed disk model and serial number rather than diskpart ID, so I used the same trick of pulling out the physical cable on my (now known) failing HDD and re-launching the HDD Guru tool until I'd written down the identifiers for the two disks I needed to copy between.

Then I just ran the copy, which continued even after read errors. I suspect I only needed to copy the first few sectors, but I let it run to completion anyway (12 hours).

5) Now pull both SATA cables, remove the failing drive, and plug the newly cloned disk back into place where it was. When brought back online, Windows should see a drive with the same signature in the same slot where it thought the failed disk was.
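To double-check that the clone carried the original identifier across, diskpart's uniqueid command will display a disk's MBR signature (or its GPT GUID - these disks are GPT). This check isn't in my original transcript; it's just a quick way to confirm the identifiers match before rebuilding:

```
sel disk 3
uniqueid disk
```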

6) Then it's just a matter of rebuilding the array (another 12 hours) and bringing it back online:

sel disk 3
online disk
sel vol 0
online vol
recover

12 hours later...

sel vol 0
assign letter=e

Then I rebooted, because it was easier than re-starting all the services I'd stopped (namely Hyper-V):

c:\> shutdown /r /t 0

By the time I looked again, with a healthy disk, Hyper-V was running and my VMs were restored. It seems Hyper-V won't run VMs on failing disk arrays. It looks like I may have corruption on one of the VHDs, but that's another story.

It's incredible that the RAID recovery process isn't a little smarter, but I've noticed that a lot over the years with Windows backup products, from Windows Backup to ISA Backup/Restore - they seem to assume that you'll be recovering to the exact same hardware, even if that hardware is faulty, which makes the backup next to pointless.

For now I'm back up and running - I hope this transcript helps someone else in a similar position.

cirrus
  • This time it was too late for cloning. However I managed to get the software RAID recovery to work as it should. The trick is to add the new disk, make it dynamic in diskpart and then just run recover. It'll ask you which disk to use to repair and off it goes. You can later remove the M0 missing disk. It seems that it does indeed work the way it should, it's just that the documentation seems to be confusing for some, including me up until now. – cirrus Mar 17 '17 at 20:09
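For reference, the working sequence described in that last comment would look roughly like this in diskpart (disk and volume numbers are examples - substitute your own):

```
sel disk 4
convert dynamic
sel vol 0
recover

REM once the rebuild finishes, the missing member can be removed
sel disk m0
delete disk
```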