
Environment:

  • Storage: HP P2000 MSA G3 SAS array with 24 × 300 GB 10k SAS disks
    • Two storage controllers with redundant SAS connections to each host
  • Hosts: Three HP DL380 G7s, each with a 10 GB SD card, CPU, RAM, etc.

ESXi 5.0 is installed on the SD card in each host; this is the only local storage on each host. I have the P2000 split into two vDisks, each using 12 of the 24 disks. Call them LUN1 and LUN2. Each LUN is its own RAID6 volume.

I spoke with HP support over the phone about layering RAID1 over my two RAID6 arrays. This is not possible on the array itself, so I'm trying to figure out the best way to implement mirroring. I was looking at OpenFiler and FreeNAS, but I honestly don't know how those solutions would hold up in a mission-critical application.
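
A quick back-of-the-envelope comparison of the layouts under discussion (the current pair of 12-disk RAID6 vDisks, the proposed mirror over them, and plain RAID 1+0 across all 24 disks) helps frame the trade-off. The minimal Python sketch below assumes 24 × 300 GB disks and ignores formatting overhead and spare drives:

    # Back-of-envelope comparison of the layouts discussed in the question.
    # Assumptions: 24 x 300 GB disks; figures ignore formatting overhead and spares.
    DISKS = 24
    DISK_GB = 300

    # Current layout: two 12-disk RAID6 vDisks (LUN1 and LUN2).
    raid6_usable = 2 * (12 - 2) * DISK_GB      # 6000 GB usable
    raid6_tolerance = "2 failed disks per vDisk"

    # Proposed RAID1 over the two RAID6 vDisks ("RAID 61").
    raid61_usable = (12 - 2) * DISK_GB         # 3000 GB usable
    raid61_tolerance = "an entire vDisk plus 2 disks in the surviving vDisk"

    # Alternative: RAID 1+0 across all 24 disks.
    raid10_usable = (DISKS // 2) * DISK_GB     # 3600 GB usable
    raid10_tolerance = "1 disk per mirror pair (a 2nd failure in the same pair is fatal)"

    for name, usable, tol in [
        ("2 x RAID6 (current)", raid6_usable, raid6_tolerance),
        ("RAID6 mirrored (proposed)", raid61_usable, raid61_tolerance),
        ("RAID 1+0", raid10_usable, raid10_tolerance),
    ]:
        print(f"{name:28s} {usable:5d} GB usable, tolerates {tol}")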

Lucretius
    Why do you want RAID1 on top of your RAID6 arrays? What problem are you trying to solve and what is your desired result? – ewwhite May 25 '13 at 20:58
  • Desired result is the ability to survive many disk failures. RAID6 gives me the ability to lose 2 disks per vDisk and still be okay. Applications that the servers run cannot go down. This is most important. Performance isn't much of an issue because the system is small. Right now I'm using Storage vMotion to move VMs off of a LUN with a bad disk, so that the disk reconstruction doesn't affect application performance. – Lucretius May 25 '13 at 21:06
  • RAID 10 is not preferable to RAID 6 because losing a 2nd disk, if it happens to be the mirror partner of the first, would mean losing the entire array and the multiple VMs on it. A hot spare is not preferable because the RAID reconstruction performance hit during peak hours would cause such high latency that applications would be effectively unavailable. The stress induced by Storage vMotion to move a virtual machine from LUN1 to LUN2 (assuming a disk on LUN1 went bad) can be controlled and conducted during non-peak hours. Layering a RAID1 on top of these would make the data effectively indestructible. – Lucretius May 25 '13 at 21:18
  • RAID 1+0 with a hot spare won't have a performance hit during the period in which the array is degraded. A hot spare reduces the chance of another drive failure impacting the array. – ewwhite May 25 '13 at 21:24
  • A hot spare doesn't help there, though: if a drive fails during peak hours and the RAID begins automatic reconstruction onto the hot spare, you've placed a higher load on the most vulnerable drives in the array during peak hours. – Lucretius May 25 '13 at 21:28
  • I'd rather have a RAID rebuild cause array stress than a storage vMotion. – ewwhite May 25 '13 at 21:45
  • @Lucretius Most RAID controllers do a good job of making the load from disk resilvering a lower priority than real I/O work, so that the impact is minimal; have you seen actual problems from this? Is it realistic that you'll actually have 3 disk failures before you have an opportunity to get a spare rebuilt into the array? – Shane Madden May 25 '13 at 21:56
  • I'm wondering *why* you need this RAID setup. The chances of a triple drive failure are in the realm of winning the lottery twice in one day and then being struck by lightning twice (see the rough estimate after these comments). Have a couple of cold spares handy and decent monitoring. – Nathan C May 28 '13 at 15:15
  • RAID1 on top of RAID6 is not a recommended configuration. RAID6 on top of RAID1 would provide much faster rebuild and in some cases tolerate a larger number of disk failures. – kasperd May 19 '15 at 09:09
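
To put rough numbers on the "triple failure" debate above, here is a minimal Python sketch. The 3% annual failure rate and 24-hour rebuild window are assumptions rather than measured values for these drives, and correlated failures from a bad batch (raised later in the thread) are ignored:

    # Rough estimate: one disk has already failed in a 12-disk RAID6 vDisk;
    # how likely is it that two MORE of the remaining 11 members fail before
    # the rebuild completes? AFR and rebuild window are assumptions.
    from math import comb

    AFR = 0.03              # assumed annual failure rate per drive (3%)
    REBUILD_HOURS = 24      # assumed exposure window until the rebuild finishes
    remaining = 11          # surviving members of the 12-disk RAID6 vDisk

    # Probability that one given drive fails within the window (linear approximation).
    p_one = AFR * REBUILD_HOURS / (365 * 24)

    # Probability that at least two of the remaining drives fail in that window.
    p_two_more = sum(
        comb(remaining, k) * p_one**k * (1 - p_one)**(remaining - k)
        for k in range(2, remaining + 1)
    )

    print(f"P(one given drive fails in window) ~ {p_one:.2e}")
    print(f"P(2+ more failures during rebuild) ~ {p_two_more:.2e}")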

1 Answer


There's no software RAID option for the setup you've described.

VMware won't support it. If your hosts were Linux/Windows, you'd have some additional options.

If your concern is system stability, you could have used RAID 1+0 and/or designated hot-spare drives in your setup.

If performance isn't a concern (as your use of RAID6 suggests), why worry about the potential impact of a RAID6 rebuild? RAID6 on enterprise SAS drives is usually deemed overkill (versus RAID5) because rebuild times are quicker than with nearline/7,200rpm disks. However, you're also doing this across a larger group of disks than normal (12 drives is a lot for that RAID level).
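
As a rough illustration of why a rebuild onto a 10k SAS member tends to be short, here is a minimal sketch; the 50 MB/s sustained rebuild rate is an assumption (actual MSA rebuild speed depends on the controller's rebuild priority setting and foreground I/O load):

    # Rough rebuild-window estimate for one 300 GB member at an assumed rate.
    DISK_GB = 300
    ASSUMED_REBUILD_MB_S = 50   # assumption; tune to the observed rebuild rate

    rebuild_seconds = DISK_GB * 1024 / ASSUMED_REBUILD_MB_S
    print(f"~{rebuild_seconds / 3600:.1f} hours to rebuild one 300 GB member")

Even at half that assumed rate, the exposure window is only a few hours, which is the point about quick(er) rebuilds above.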

Why stress the system with a Storage vMotion away from the faulted LUN? These are standard HP SAS disks. They don't fail so often that you can't get a replacement in place in a timely manner.

But the best insurance here is to have a hot-spare disk configured and maybe a cold-spare drive handy to reduce the amount of time it takes to replace a disk.

Have you had a drive fail on this array before?

ewwhite
  • When I say "performance isn't much of an issue" I'm talking about the performance between RAID 10 and RAID 6. RAID 10 outperforms RAID 6, but I feel like RAID 6 is safer than RAID 10. – Lucretius May 25 '13 at 21:19
  • I've had 3 drives fail on the array within a span of about 5 months. I'm at work today doing some firmware upgrades during non-peak hours. – Lucretius May 25 '13 at 21:20
  • We've all had disks fail before. It's why you a) have backups, b) have redundant systems, and c) have load balancers for truly mission-critical services. Because while drives are important, it doesn't do you a bloody bit of good when the backplane shorts on your single-server application. – Magellan May 25 '13 at 22:33
  • The only reason RAID6 even exists is that corporate entities couldn't put up with the risk of a drive failure during a RAID5 rebuild. The rationale is that if you received a bad batch of disks then two consecutive failures are very possible. Not a whole lot of storage folks I know take care to load their arrays with disks from different manufacturing batches. I agree that disk failures on enterprise disks are not frequent, but still, RAID6 provides a lot more protection than RAID5. – Reality Extractor May 25 '13 at 22:49
  • @Adrian, I'm not sure what you're talking about with a single server backplane shorting out. The P2000 has two storage controllers (redundant) and with the cluster running in HA DRS mode each virtual machine can run on any of the three servers in the cluster. – Lucretius May 28 '13 at 13:57
  • @ewwhite, Storage vMotion doesn't place the same type of stress on the array as a RAID reconstruction. It's much less of an impact. Storage vMotion will simply read from the affected array and write to the unaffected array, while a reconstruction of a RAID6 volume involves a lot of distributed writes and parity calculations. I would have accepted your answer, but you added a lot of unnecessary lecturing and bad information. – Lucretius May 28 '13 at 14:01
  • @Lucretius You're being too literal. They were general examples. – Magellan May 28 '13 at 14:03
  • @Lucretius unless your datastore is significantly smaller than the array you've created, it is the same kind of stress - all functioning disks are read once, the replaced disk is written once. The parity calculation does not have any impact on the drives' failure risk and should not impact the performance of the controller too badly either. – the-wabbit May 28 '13 at 14:50
  • @Lucretius Sorry to appear rude. It just seemed like you were complicating the situation by focusing on the wrong solution. A spare drive is way handier than the manual work of svMotion or adding another abstraction layer. Disk failures are easy to prepare for and remedy. – ewwhite May 28 '13 at 14:56
  • @Lucretius if you really *need* mirroring across the two vDisks, the way to go seemingly would be to expose both of them as datastores to vSphere, create virtual disks on both and use software mirroring in the guests' OS instances (a sketch of that setup follows these comments). vSphere/ESXi is rather forgiving about datastore outages in that it would not switch off or stall your VMs, but the scenario would require thorough testing nonetheless. If you have an environment which is already in production, this might not be the time for testing though. – the-wabbit May 28 '13 at 15:02
  • @syneticon-dj - massively overcomplicated, but probably the best answer to his not-actually-a-question question; if you put that in an answer I'd upvote it. – Chopper3 May 28 '13 at 15:18
  • @syneticon-dj That's fraught with risk, too. I'd fire an engineer who rigged something like that up with VMware :) – ewwhite May 04 '14 at 11:19
  • @ewwhite why? With the 2 TB size boundary for virtual disks, setting up more than one virtual disk and gluing them together in a VM has been a rather common practice. – the-wabbit May 04 '14 at 13:35
  • I wouldn't deem that common. But that's not what's trying to be solved here. Dude is trying to do a RAID 61 on a platform that doesn't officially support software RAID. – ewwhite May 04 '14 at 14:23
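
For completeness, here is a minimal sketch of the guest-level mirroring idea from the comments above: give a VM one virtual disk backed by a datastore on LUN1 and one backed by a datastore on LUN2, then mirror them inside a Linux guest with mdadm. The device names and mount point below are hypothetical, and the approach would need the thorough testing mentioned above before going anywhere near production:

    #!/usr/bin/env python3
    # Sketch only: mirror two guest virtual disks (one per datastore/LUN)
    # with mdadm inside a Linux VM. Device names are assumptions; verify
    # which virtual disk maps to which datastore before running as root.
    import subprocess

    DISK_ON_LUN1 = "/dev/sdb"   # virtual disk placed on the LUN1 datastore (assumed)
    DISK_ON_LUN2 = "/dev/sdc"   # virtual disk placed on the LUN2 datastore (assumed)

    commands = [
        # Create a RAID1 md device spanning the two virtual disks.
        ["mdadm", "--create", "/dev/md0", "--level=1", "--raid-devices=2",
         DISK_ON_LUN1, DISK_ON_LUN2],
        # Put a filesystem on the mirror and mount it.
        ["mkfs.ext4", "/dev/md0"],
        ["mkdir", "-p", "/data"],
        ["mount", "/dev/md0", "/data"],
    ]

    for cmd in commands:
        print("running:", " ".join(cmd))
        subprocess.run(cmd, check=True)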