Good way to simulate a disk failure in vSphere

Question

I need to demonstrate SQL Server recovery procedures. Since our DBs are all virtualized and use a SAN for backend storage, what is the best way to simulate a disk failure so that I can demonstrate recovery using a tail-log backup?

I've tried removing the data volume from the VM in vSphere, but SQL Server doesn't seem to notice that the MDF file is no longer accessible and doesn't raise a fault, and none of the other options I can think of can be done "live". I'm just not enough of a vSphere guy to know my options here.

@ewwhite We do nightly/weekly VM backups based on server importance, and we use Avamar to do our SQL backups. Avamar isn't really relevant to the discussion at hand though because the problem space shouldn't be any different if we were to just use SQL Agent to do backups. — Drew Lanclos, Apr 13 '15 at 16:44
Do you care if all the disks go offline for that VM, or only the one housing the MDFs? Are you running a standalone host, or a cluster? Remote or local storage? — GregL, Apr 13 '15 at 16:44
@GregL It's non-clustered SQL, storage is in a SAN. I would need only the disk housing the MDFs to go down, yes. If the whole SAN fails, then our procedure would be to use our off-site replication to rebuild the VM, and then restore from the SQL backups done in Avamar. — Drew Lanclos, Apr 13 '15 at 16:46
If you're not worried about the whole SAN going down since you already have a procedure for that, what case can there be to test a single disk failing? If it's going to fail, it's all going to fail. — GregL, Apr 13 '15 at 16:49
@GregL I think this would be a case where perhaps the failure is in the array and not the entire device. In considering what you're saying further, I think you're basically saying "The SAN should handle moving things off of an unsafe array to redundant storage anyway, so it's not really useful to account for a single disk/volume failure." Is that about right? — Drew Lanclos, Apr 13 '15 at 16:50
Unless you've split up the VMDKs for this VM across multiple VMFSs, you shouldn't so much be worried that a single VMDK is going to drop out, but rather the whole LUN. And generally, if a whole LUN drops out it's cause something serious went sideways and you'll probably have bigger problems. — GregL, Apr 13 '15 at 16:52
This kind of fault isn't really a situation that's going to happen in the Real World, hence my questions about why you're looking to simulate it. — GregL, Apr 13 '15 at 16:54
@GregL That is precisely the scenario I have to demo for. The VMDKs are distributed across multiple LUNs. And the reason I have to demo it is largely political but it is still useful information for me. — Drew Lanclos, Apr 13 '15 at 16:54

score 1 · Accepted Answer · answered Apr 13 '15 at 17:13

I can really only see a couple of options here.

They both assume that the only thing on this datastore is the VMDK that's housing the SQL Server's MDFs. No VMX files, no log files, no datastore heartbeating.

The first would be to unmount the volume on the host in question. You can do it via the vSphere clients, esxcli or its PowerCLI equivalent. I don't actually know if vSphere will let you do this while it's got a running VM attached, but it might.
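For the esxcli route, here is a minimal sketch in Python that only builds the command lines rather than running them, so you can review them before pasting them into an SSH session on the host. The helper names and the datastore label `SQL-DATA-01` are made up for illustration; the `esxcli storage filesystem unmount` and `mount` subcommands with `--volume-label` are the standard ones, but check them against your ESXi version first.

```python
import shlex
from typing import List


def esxcli_unmount_command(volume_label: str) -> List[str]:
    """Build (but do not run) the esxcli argv that unmounts a datastore by label."""
    if not volume_label.strip():
        raise ValueError("volume_label must not be empty")
    return ["esxcli", "storage", "filesystem", "unmount",
            "--volume-label", volume_label]


def esxcli_mount_command(volume_label: str) -> List[str]:
    """Build the matching argv that mounts the datastore again after the demo."""
    if not volume_label.strip():
        raise ValueError("volume_label must not be empty")
    return ["esxcli", "storage", "filesystem", "mount",
            "--volume-label", volume_label]


if __name__ == "__main__":
    label = "SQL-DATA-01"  # hypothetical datastore label, replace with your own
    # Print shell-safe command lines for review; nothing is executed here.
    print(shlex.join(esxcli_unmount_command(label)))
    print(shlex.join(esxcli_mount_command(label)))
```

Running `esxcli storage filesystem list` on the host beforehand shows the labels and mount state of each datastore, which helps confirm you are targeting the right volume.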

The second is to just un-present the volume from the SAN or mask it on the host in question. It's a really drastic thing to do and might cause more pain than it's worth, since it's likely to make the host throw a PDL or APD error. But it will give you what you want if the first option isn't possible because VMware blocks it (which is actually a good thing).

If you're attempting the second option, you should read this article about PDL/APD and the articles listed in its 'See Also' section.

Oh, and don't do it to a production box.

And have backups before trying any of this.
