18

Why are the nested RAID levels 1+5 or 1+6 almost unheard of? The nested RAID levels Wikipedia article is currently missing their sections. I don't understand why they are not more common than RAID 1+0, especially when compared to RAID 1+0 triple mirroring.

It is apparent that rebuilding time is becoming increasingly problematic as drive capacities are increasing faster than their performance or reliability. I'm told that RAID 1 rebuilds quicker and that a RAID 0 array of RAID 1 pairs avoids the issue, but surely so would a RAID 5 or 6 array of RAID 1 pairs. I'd at least expect them to be a common alternative to RAID 1+0.

For 16 drives of 1TB each, here are my calculations of the naïve probability of resorting to backup, i.e. with the simplifying assumption that drive failures are independent and equally likely:

RAID | storage | cumulative probabilities of resorting to backup /m
 1+0 |     8TB | 0, 67, 200, 385, 590, 776, 910, 980, 1000, 1000, 1000
 1+5 |     7TB | 0,  0,   0,  15,  77, 217, 441, 702,  910, 1000, 1000
 1+6 |     6TB | 0,  0,   0,   0,   0,   7,  49, 179,  441,  776, 1000
(m = 0.001, i.e. milli.)
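The figures above can be reproduced with a short counting argument. This is a minimal sketch, assuming that the columns correspond to 1 through 11 failed drives, that every set of failed drives is equally likely, and that RAID 1+0, 1+5 and 1+6 tolerate 0, 1 and 2 completely dead pairs respectively. The names `p_backup` and `table_row` are made up for illustration.

```python
from math import comb


def p_backup(pairs: int, failed: int, tolerated_dead_pairs: int) -> float:
    """Chance that more than `tolerated_dead_pairs` RAID 1 pairs are wholly
    dead when `failed` drives fail uniformly at random."""
    total = comb(2 * pairs, failed)
    bad = 0
    # `dead` counts pairs that lost both drives; the remaining failures are
    # spread one per pair over the other pairs (2 choices of drive each).
    for dead in range(tolerated_dead_pairs + 1, min(pairs, failed // 2) + 1):
        singles = failed - 2 * dead
        bad += comb(pairs, dead) * comb(pairs - dead, singles) * 2 ** singles
    return bad / total


def table_row(pairs: int, tolerated_dead_pairs: int, max_failed: int) -> list[int]:
    """Probabilities in thousandths for 1..max_failed failed drives."""
    return [round(1000 * p_backup(pairs, k, tolerated_dead_pairs))
            for k in range(1, max_failed + 1)]


for name, tolerated in (("1+0", 0), ("1+5", 1), ("1+6", 2)):
    print(name, table_row(8, tolerated, 11))
```

Under those assumptions the three printed rows should match the three table rows for 8 pairs (16 drives).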

If this is correct then it's quite clear that RAID 1+6 is exceptionally more reliable than RAID 1+0 for only a 25% reduction in storage capacity. As is the case in general, the theoretical write throughput (not counting seek times) is storage capacity / array size × number of drives × write throughput of the slowest drive in the array (RAID levels with redundancy have a higher write amplification for writes that don't fill a stripe but this depends on chunk size), and the theoretical read throughput is the sum of the read throughputs of the drives in the array (except that RAID 0, RAID 5, and RAID 6 can still be theoretically limited by the slowest, 2nd slowest, and 3rd slowest drive read throughputs respectively). I.e., assuming identical drives, that would be respectively 8×, 7×, or 6× maximum write throughput and 16× maximum read throughput.

Furthermore, consider a RAID 0 quadruple of RAID 1 triples, i.e. RAID 1+0 triple mirroring of 12 drives, and a RAID 6 sextuple of RAID 1 pairs, i.e. RAID 1+6 of 12 drives. Again, these are identical 1TB drives. Both layouts have the same number of drives (12), the same amount of storage capacity (4TB), the same proportion of redundancy (2/3), the same maximum write throughput (4×), and the same maximum read throughput (12×). Here are my calculations (so far):

RAID      | cumulative probabilities of resorting to backup /m
1+0 (4×3) | 0, 0, 18,  ?,   ?,   ?,   ?,   ?, 1000
1+6 (6×2) | 0, 0,  0,  0,   0,  22, 152, 515, 1000

Yes, this may look like overkill, but where triple mirroring is used to split-off a clone for backup, RAID 1+6 can just as well be used, simply by freezing and removing 1 of each drive of all but 2 of the RAID 1 pairs, and while doing so, it still has far better reliability when degraded than the degraded RAID 1+0 array. Here are my calculations for 12 drives degraded by 4 in this manner:

RAID      | cumulative probabilities of resorting to backup /m
1+0 (4×3) | (0, 0, 0, 0), 0, 143, 429, 771, 1000
1+6 (6×2) | (0, 0, 0, 0), 0,   0,  71, 414, 1000
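For arrays whose mirror groups are not all pairs (triples, or pairs that have temporarily lost a drive to a backup clone), the same naive model can be counted group by group. This is a sketch under the same assumptions as the tables: every set of failed drives is equally likely, and the array is lost once more than a given number of mirror groups are completely dead. The function name and the list-of-group-sizes representation are illustrative choices, not anything from a real RAID tool.

```python
from math import comb


def p_backup_groups(group_sizes, failed, tolerated_dead_groups):
    """Chance that more than `tolerated_dead_groups` mirror groups are wholly
    dead when `failed` drives fail uniformly at random."""
    # ways[(f, d)] = number of ways to have f failed drives and d wholly
    # dead groups among the groups processed so far.
    ways = {(0, 0): 1}
    for size in group_sizes:
        nxt = {}
        for (f, d), count in ways.items():
            for lost in range(size + 1):
                key = (f + lost, d + (1 if lost == size else 0))
                nxt[key] = nxt.get(key, 0) + count * comb(size, lost)
        ways = nxt
    total = comb(sum(group_sizes), failed)
    bad = sum(count for (f, d), count in ways.items()
              if f == failed and d > tolerated_dead_groups)
    return bad / total


def row(group_sizes, tolerated, max_failed):
    """Probabilities in thousandths for 1..max_failed failed drives."""
    return [round(1000 * p_backup_groups(group_sizes, k, tolerated))
            for k in range(1, max_failed + 1)]


# Degraded by 4 drives: 1+0 becomes four pairs, 1+6 becomes 4 singles + 2 pairs.
print("1+0 degraded", row([2, 2, 2, 2], 0, 5))
print("1+6 degraded", row([1, 1, 1, 1, 2, 2], 2, 5))
```

Passing `[3, 3, 3, 3]` with 0 tolerated dead groups models the undegraded 4×3 layout, and `[2] * 6` with 2 tolerated models the undegraded 6×2 layout.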

Read throughput, however, could be degraded down to 6× during this time for RAID 1+6, whereas RAID 1+0 is only reduced to 8×. Nevertheless, if a drive fails while the array is in this degraded state, the RAID 1+6 array would have a 50–50 chance of staying at about 6× or being limited further to 5×, whereas the RAID 1+0 array would be limited down to a bottleneck. Write throughput should be pretty unaffected (it may even increase if the drives taken for backup were the limiting slowest drives).

In fact, both can be thought of as ‘triple mirroring’ because the degraded RAID 1+6 array is capable of splitting-off an additional RAID 6 group of 4 drives. In other words, this 12-drive RAID 1+6 layout can be divided into 3 degraded (but functional) RAID 6 arrays!

So is it just that most people haven't gone into the maths in detail? Will we be seeing more RAID 1+6 in the future?

James Haigh
  • 2
    Your throughput calc doesn't seem to have taken into account the write amplification to create the parity. – JamesRyan Mar 11 '15 at 12:25
  • 1
    @JamesRyan: Yes, I have indeed considered that the parity needs writing. That's what the “storage capacity / array size” is for – the reciprocal of this is the write amplification factor, not including further write amplification associated with solid-state drives. Note that this includes the write amplification of the RAID 1 redundancy as well. Basically, the write amplification factor is equal to the reciprocal of 1 minus the proportion of redundancy. So 50% redundancy gives a write amplification factor of 2; 62.5% (10/16) redundancy gives a write amplification factor of ~2.67 (16/6). – James Haigh Mar 12 '15 at 03:49
  • 1
    no that is incorrect. Each RAID6 write takes 6 IOs and each RAID1 write takes 2 IOs, these are multiplicative. So in RAID 1+6 each write will take 12 IOs, for RAID 10 it is 2 IOs. Write throughput on 12 drives will be 1x for RAID1+6 and 6x for RAID10! – JamesRyan Mar 12 '15 at 11:02
  • @JamesRyan: Oh, I see where you're going with this now – for writes that are less than a full stripe, the write amplification factor can double for RAID 1+6 thus halving the maximum write throughput. For a full stripe, yes there are 12 writes in the 6×2 example, but you forget that this is for 4 chunks worth of data. For 4, 3, 2, 1 chunks-worth respectively, the write amplification factors are (6×2)/4 = 3, (5×2)/3 = ~3.33, (4×2)/2 = 4, (3×2)/1 = 6, giving maximum write throughputs of 4×, 3.6×, 3×, 2×. For RAID 1+0 4×3 it's (4×3)/4, (3×3)/3, (2×3)/2, (1×3)/1 giving a constant 4×. … – James Haigh Mar 12 '15 at 14:09
  • …However, this depends on chunk size and the size of writes, which I've now noted in the question. If writes are typically not filling the stripes then it may be worth decreasing the chunk size. – James Haigh Mar 12 '15 at 14:14
  • no writing 4 chunks at once does not magically increase drive throughput by 4 :) Less than optimal chunksize can only lower throughput from the theoretical maximum. – JamesRyan Mar 12 '15 at 14:26
  • “the write amplification factor can double for RAID 1+6” – Well this can apply to RAID levels with parities in general, depending on the number of drives. On an infinite number of drives, single parity converges to an absolute maximum of double the write amplification for single-chunk writes (worst case) compared to writes that fill a stripe; double parity converges to triple in worst case. – James Haigh Mar 12 '15 at 14:41
  • I think you are quoting something without understanding it. What you are saying specifically about write amplification is technically correct only because RAID6 requires 3 writes the same as a 2nd mirror. But what you are missing is that parity in RAID6 also requires reads. The disk IOs limit throughput, not just the disk writes. – JamesRyan Mar 12 '15 at 15:00
  • @JamesRyan: Not [_magically_](https://en.wikipedia.org/wiki/Any_sufficiently_advanced_technology_is_indistinguishable_from_magic), no. If you're getting 1/3 of 12×, i.e. writing a 4-chunk stripe occupies all 12 drives with a chunk each, then yes, you get 4×. Writes that fill a stripe only have to write the parities once, rather than writing each chunk separately which would require the parities to be rewritten for each chunk (and would also require reads if the other chunks aren't cached somewhere). This will lower from theoretical maximum, but can be avoided by choosing a sensible chunk size. – James Haigh Mar 12 '15 at 15:04
  • “But what you are missing is that parity in RAID6 also requires reads.” – Only when not filling a stripe. Again, this depends on chunk size and the size of writes. – James Haigh Mar 12 '15 at 15:07
  • you don't get 4 chunks written for the price of one. Throughput per drive is dependent on the amount of data, so that has already been cancelled out. The number of reads does not depend on chunksize, it is always required to calculate parity. It might be helpful if you linked where you are getting these ideas from. – JamesRyan Mar 12 '15 at 15:11
  • I think basically you have read someone generalising the throughput calculation and are trying to apply it as if it is a rule when it isn't. And it certainly doesn't work for nested situations that are multiplicative. – JamesRyan Mar 12 '15 at 15:26
  • @JamesRyan: What? These are my own calculations! Based on the examples that _I've_ given! I have taken into consideration the nesting (which of course I'm aware of given that I'm asking about why a particular nested layout isn't used over other nested layouts) and I understand the multiplicative nature. You're just wasting my time now. And yours. And possibly other people's. I still need to complete the calculations for 5 more probabilities in the 2nd example (which are more tricky than the others), but it's clear that they'll all be greater than 0.018 (1/11*2/10) and less than 1. – James Haigh Mar 12 '15 at 15:52
  • 2
    You stated based on your calculations that RAID1+6 has the same write throughput as RAID10 with triples. **In reality RAID1+6 has not even remotely the write throughput of RAID10 so your calculations** or the assumptions they are based on **are wrong**. I was trying to help you understand why, if you refuse to listen then we might be wasting our time but it is you who is wasting it. – JamesRyan Mar 12 '15 at 16:12
  • @JamesRyan: “no that is incorrect.” – Actually **we were both incorrect**, but there's some truth in what we both say. It turns out that write amplification is far more complicated when considering chunk and write sizes. I was talking solely about write _throughput_ amplification (regarding large writes) whereas you may be talking about write _seek_ amplification (significant for small writes). Sorry for the confusion! I hadn't noticed 'til now that these 2 things behave very differently mathematically – see [my answer](https://serverfault.com/questions/674462/why-is-raid-/675430) for details. – James Haigh Mar 14 '15 at 08:10
  • “For 4, 3, 2, 1 chunks-worth respectively, the write amplification factors are (6×2)/4 = 3, (5×2)/3 = 3.33̰, (4×2)/2 = 4, (3×2)/1 = 6, giv…” – Correction: This was forgetting reads; taking the reads into consideration, the write throughput amplification factors are (6×2 + 0)/4 = 3, (5×2 + 1)/3 = 3.67̰, (4×2 + 2)/2 = 5, (3×2 + 2)/1 = 8, giving maximum write throughputs of 4×, 3.27̰×, 2.4×, 1.5×. I derive the general RAID 1+6 write throughput amplification factor formula to be `((v + 2)n₁ + min [2, n₆ - (v + 2)])/v` where v is the virtual number of chunks, and n₆×n₁ is the RAID 1+6 layout (6×2). – James Haigh Mar 15 '15 at 00:21
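The corrected formula in the last comment can be checked numerically. This is only a sketch of that one formula as written there; the function names are made up, and `v`, `n6` and `n1` follow the comment's v, n₆ and n₁.

```python
def raid16_write_amplification(v: int, n6: int, n1: int) -> float:
    """((v + 2) * n1 + min(2, n6 - (v + 2))) / v, as given in the comment.

    v  -- number of data chunks written within one stripe
    n6 -- number of RAID 1 groups in the RAID 6 array
    n1 -- number of drives in each RAID 1 group
    """
    return ((v + 2) * n1 + min(2, n6 - (v + 2))) / v


def max_write_throughput(v: int, n6: int, n1: int) -> float:
    """Total drive count divided by the amplification factor."""
    return n6 * n1 / raid16_write_amplification(v, n6, n1)


# The 6x2 layout from the question, for writes of 4, 3, 2 and 1 chunks.
for v in (4, 3, 2, 1):
    amp = raid16_write_amplification(v, n6=6, n1=2)
    print(v, round(amp, 2), round(max_write_throughput(v, 6, 2), 2))
```

For the 6×2 layout this gives factors of 3, about 3.67, 5 and 8, which agrees with the values quoted in the comment.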

6 Answers

17

Generally I'd say RAID 1+0 will tend to be more widely used than 1+5 or 1+6 because RAID 1+0 is reliable enough and provides marginally better performance and more usable storage.

I think most people would take the failure of a full RAID 1 pair within the RAID 1+0 group as a pretty incredibly rare event that's worth breaking out the backups for - and probably aren't too enthusiastic about getting under 50% of their physical disk as usable space.

If you need better reliability than RAID 1+0, then go for it! ..but most people probably don't need that.

Shane Madden
  • 1
    The issue that I have with RAID 1+0 is that it has a bad ratio of reliability to storage. If RAID 6 was arbitrarily extensible to any number of parities (below n - 1) then for the same drives you could achieve both increased storage _and_ better reliability than RAID 1+0. For the example above, if it was possible to have RAID 6 with 4 parities, you'd have 50% more storage and maximum write throughput than RAID 1+0 yet have exceptionally higher reliability. RAID 6 with 3 or 4 parities would have a good reliability–storage trade-off. – James Haigh Mar 11 '15 at 03:27
  • 4
    @JamesHaigh RAID 6 vs RAID 1+0 is a much different discussion than RAID 1+6 vs RAID 1+0, you kinda changed the subject. ZFS's raidz3 seems like it'd be up your alley? Anyway, to your point, there are some performance advantages that RAID 1+0 maintains over RAID 6, such as small single-block writes needing to touch a far smaller number of drives (and back to raidz3, ZFS handles this intelligently by writing multiple full copies instead of writing to all disks for small writes) – Shane Madden Mar 11 '15 at 04:01
  • Sorry, yes, I think that this is really what I'm chasing. Since that last comment I've been writing a [new question specifically about RAID with 3 or more parities](https://serverfault.com/questions/674478/can-raid-6-support-3-or-more-parities). That would be better than RAID 1+6 I think. It would also be more flexible and simpler to get the desired trade-off. You may want to continue this over on that question. – James Haigh Mar 11 '15 at 04:34
  • 3
    RAID 6 can't be linearly extended, because it doesn't work that way. The syndrome computation for second parity won't trivially scale to a third parity. But you can quite easily do smaller RAID 6 groups - there's no real reason you need to do 14+2, and could instead do 2+2 or 4+2 and gain a lot of reliability. – Sobrique Mar 11 '15 at 12:15
  • [Backblaze](https://www.backblaze.com/blog/vault-cloud-storage-architecture/) has used RAID6 for a long time because it has little overhead for storing the parity. – fgbreel Mar 12 '15 at 17:43
  • 1
    @JamesHaigh What you seem to be wanting is a 12-way raidz8. Based on the logic that goes into parity calculations, that's going to peg out processors in perpetuity even with trivial data. Single parity is essentially XOR (easy). Dual parity is something to do with squares (not hard, but not easy). Triple parity is cube based or similar (hard). 4, 5, 6, 7, or 8 parity demands even larger (by an exponential scale) calculations (which might need quantum computers to keep up with). Just remember that as the shape grows, there is ZERO increase in IOPS. For media, who cares? For VMs, it kills. – killermist Mar 17 '15 at 18:47
  • @killermist You're wrong in at least two points. The computational complexity does not grow, except for one exception: XOR is a trivial case, much simpler than the other functions involved. Error-correcting codes are well known and have been in use for many decades. The load is pretty trivial for a contemporary CPU and even more so since Carry-less Multiplication is a CPU instruction. – maaartinus Jan 12 '18 at 00:22
  • @maaartinus Prove your assertion. Let's start with the most simple case. 3-drive raidz (raid5). A xor B = C. How do you CPU trivial add more parity (D)? Square and sqrt are almost trivial, but far more complicated than xor. Then how would you calculate a third parity (E)? And how could that possibly be processor trivial? With only immaculate A+E, you must be able to recreate the immaculate data, and then be able to reproduce B, C, and D. A xor B xor C = 0. If D=0, A xor D = [not data], and A [something] D [MUST equal] [data]. Let's not even get into E or F. How's that trivial? – killermist Nov 01 '18 at 22:54
  • @killermist I'd bet there's no squaring and no sqrt in any parity calculation and no floating point operations, though I don't know all [types of ECC](https://en.wikipedia.org/wiki/Error_correction_code#Types_of_ECC). Most of them use some Galois field multiplication. As a trivial second (and third etc.) parity, you could use [more dimensions](https://en.wikipedia.org/wiki/Multidimensional_parity-check_code), but that's not optimal. You don't have to guess how expensive sqrt is as you don't use it. – maaartinus Nov 02 '18 at 01:21
  • @maaartinus Please elaborate. A [something-1] D = [data] This something-1 ain't trivial. A [something-2] E = [data] something-2 very totally isn't trivial. – killermist Nov 04 '18 at 02:31
16

The practical answer lies somewhere at the intersection of hardware RAID controller specifications, average disk sizes, drive form-factors and server design.

Most hardware RAID controllers are limited in the RAID levels they support. Here are the RAID options for an HP ProLiant Smart Array controller:

[raid=0|1|1adm|1+0|1+0adm|5|50|6|60]

note: the "adm" is just triple-mirroring

LSI RAID controllers support: 0, 1, 5, 6, 10, 50, and 60

So these controllers are only capable of RAID 50 and 60 as nested levels. LSI (née Dell PERC) and HP comprise most of the enterprise server storage adapter market. That's the major reason you don't see something like RAID 1+6, or RAID 61 in the field.

Beyond that consideration, nested RAID levels beyond RAID 10 require a relatively large number of disks. Given the increasing drive capacities available today (with 3.5" nearline SAS and SATA drives), coupled with the fact that many server chassis are designed around 8 x 2.5" drive cages, there isn't much of an opportunity to physically configure RAID 1+6, or RAID 61.

The areas where you may see something like RAID 1+6 would be large chassis software RAID solutions. Linux MD RAID or ZFS are definitely capable of it. But by that time, drive failure can be mitigated by hot or cold-spare disks. RAID reliability isn't much of an issue these days, provided you avoid toxic RAID level and hardware combinations (e.g. RAID 5 and 6TB disks). In addition, read and write performance would be abstracted by tiering and caching layers. Average storage workloads typically benefit from one or the other.

So in the end, it seems as though the need/demand just isn't there.

ewwhite
  • 1
    There's a demand in the form of array replication. I know several sites that do multi-site DR, which is practically speaking RAID 10 or 5 or 6 replicated to a remote (RAID 10 or 5 or 6) site. In no small part - beyond a certain level of disk reliability, your processors, controllers, networks, power, aircon, datacentre-catching-fire are bigger threats to your reliability. – Sobrique Mar 11 '15 at 12:29
  • 1
    I don't think the OP even considered replication or multi-site use. – ewwhite Mar 11 '15 at 12:35
  • 1
    No, probably not. As you say - there's just no demand because it's overkill. It's the only use-case I can think of where it isn't overkill though :) – Sobrique Mar 11 '15 at 12:44
  • I've (briefly) configured something like a raid 6+1- a Netapp local syncmirror will create an identical copy of itself and multiplex reads across both plexes, while mirroring writes. It's mostly used for migrating Netapp V-Series to new backend LUNs, however if I wanted to double my reliability, I could do that with this. – Basil Mar 12 '15 at 19:55
12
  • You have diminishing returns on reliability. RAID 6 is pretty unlikely to compound failure even on nasty SATA drives with a 1 in 10^14 UBER rate. On FC/SAS drives your UBER is 1 in 10^16 and you get considerably more performance too.

  • RAID group reliability doesn't protect you against accidental deletion. (so you need the backups anyway)

  • beyond certain levels of RAIDing, your odds of a compound failure on disks becomes lower than compound failure of supporting infrastructure (power, network, aircon leak, etc.)

  • Write penalty. Each incoming write on your RAID 61 will trigger 12 IO operations (naively done). RAID 6 is already painful in 'low tier' scenarios in terms of IOPs per TB random write. (and in higher tier, your failure rate is 100x better anyway)

  • it's not '25% reduction' it's a further 25% reduction. Your 16TB is turning into 6TB. So you're getting 37.5% usable storage. You need 3x as many disks per capacity, and 3x as much datacentre space. You would probably get more reliability by simply making smaller RAID6 sets. I haven't done the number crunching, but try - for example the sums of RAID 6 in 3x 3+2 sets (15 drives, less storage overhead than your RAID10). Or doing 3 way mirrors instead.
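The capacity figures in the last point can be worked through with exact fractions. This is a rough sketch that only counts usable versus raw drives for the layouts named above; the helper names are hypothetical and it ignores hot spares and filesystem overhead.

```python
from fractions import Fraction


def mirror_stripe(copies: int) -> Fraction:
    """RAID 1+0 with `copies`-way mirrors: one usable drive per mirror group."""
    return Fraction(1, copies)


def raid1_plus_6(groups: int, copies: int) -> Fraction:
    """RAID 6 across `groups` mirror groups: two groups' worth go to parity."""
    return Fraction(groups - 2, groups * copies)


def raid6_set(data: int, parity: int = 2) -> Fraction:
    """A plain RAID 6 set, e.g. 3+2; striping several sets keeps the ratio."""
    return Fraction(data, data + parity)


layouts = {
    "RAID 1+0, 16 drives": mirror_stripe(2),
    "RAID 1+6, 8 pairs": raid1_plus_6(8, 2),
    "RAID 6, 3 sets of 3+2": raid6_set(3),
    "3-way mirrors": mirror_stripe(3),
}
for name, usable in layouts.items():
    # usable fraction, and raw drives needed per drive's worth of usable space
    print(f"{name}: {float(usable):.1%} usable, {float(1 / usable):.2f} raw per usable")
```

On these assumptions the 16-drive RAID 1+6 comes out at 37.5% usable, as stated, and the 3+2 sets at 60%.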

Having said that - it's more common than you think to do it for multi-site DR. I run replicated storage arrays where I've got RAID5/6/DP RAID groups replicated asynchronously or synchronously to a DR site. (Don't do sync if you can possibly avoid it - it looks good, it's actually horrible).

With my NetApps, that's a metrocluster with some mirrored aggregates. With my VMAXes we've Symmetrix Remote Data Facility (SRDF). And my 3PARs do remote copy.

It's expensive, but provides 'data centre catching fire' levels of DR.

Regarding triple mirrors - I've used them, but not as direct RAID resilience measures, but rather as full clones as part of a backup strategy. Sync a third mirror, split it, mount it on a separate server and back that up using entirely different infrastructure. And sometimes rotate the third mirror as a recovery option.

The point I'm trying to make is that in my direct experience as a storage admin - in a ~40,000 spindle estate (yes, we're replacing tens of drives daily) - we've had to go to backups for a variety of reasons in the last 5 years, but none of them have been RAID group failure. We do debate the relative merits and acceptable recovery time, recovery point and outage windows. And underpinning all of this is ALWAYS the cost of the extra resilience.

Our arrays all do media scrubbing and failure prediction, and aggressively spare and test drives.

Even if there were a suitable RAID implementation, cost-benefit just isn't there. The money spent on the storage space would be better invested in a longer retention or more frequent backup cycle. Or faster comms. Or just generally faster spindles, because even with identical resilience numbers, faster rebuilding of spares improves your compound failure probability.

So I think I would therefore offer the answer to your question:

You do not see RAID 1+6 and 1+5 very often, because the cost benefit simply doesn't stack up. Given a finite amount of money, and given a need to implement a backup solution in the first place, all you're doing is spending money to reduce your outage frequency. There are better ways to spend that money.

Sobrique
  • “RAID group reliability doesn't protect you against accidental deletion. (so you need the backups anyway)” – I didn't imply that this makes backups unnecessary (I'm well aware that [RAID is not a backup](https://serverfault.com/questions/2888/why-is-raid-not-a-backup)). I actually imply the converse by saying “cumulative probabilities of resorting to backup” – I'm taking it as given that backups are standard practice. I agree with this point, however, it is presented as countering my reasoning about RAID 1+6, which doesn't make sense. – James Haigh Mar 12 '15 at 04:07
  • “RAID 61” – RAID 6+1 would be a RAID 1 array of RAID 6 arrays. That's a reversed nesting, and I think it would have much less reliability. I.e., what happens if 3 drives fail in the same nested RAID 6 array? Doesn't that whole nested RAID 6 array need rebuilding? The same drives nested as RAID 1+6 would sustain those same 3 drive failures without taking offline any working drives. – James Haigh Mar 12 '15 at 04:23
  • “beyond certain levels of RAIDing, your odds of a compound failure on disks becomes lower than compound failure of supporting infrastructure (power, network, aircon leak, etc.)”; “it's a _further_ 25% reduction” – True and true, it's an overkill nesting layout. But then why on Earth would anyone use a RAID 0 array of RAID 1 triples? Thanks for reminding me about RAID 1+0 triple mirroring! “I haven't done the number crunching”; “Or doing 3 way mirrors instead.” – You really should do some calculations before giving a supporting case as a counterexample. These calculations should be explored… – James Haigh Mar 12 '15 at 04:51
  • 1 - because your _risk_ is no longer 'data loss' as 'temporary outage'. Which means the cost-benefit is reduced. 2 - Take your pick. I have used mirrored R6 groups for DR. 3 - because your approximations no longer hold. They're fine for per drive comparisons, but as your failure rate tends to zero, then other factors become more significant. These are much harder to model naively, so I'm not even going to try. – Sobrique Mar 12 '15 at 09:40
  • Sure, these risks for RAID 1+6 are much lower than many other risks, but what I'm trying to get at is that it's more ‘risk-efficient’ as compared to other layouts. I've added the 12-drive RAID 1+0 triple mirroring example to the question – this is an excellent example because it is a square-on comparison of the probabilities that demonstrates the better risk efficiency of RAID 1+6, with the performance characteristics pretty much identical providing that it's implemented/configured correctly. If people feel the need for triple mirroring then maybe RAID 1+6 isn't overkill for some applications. – James Haigh Mar 12 '15 at 15:31
  • 1
    My direct experience is this - I have 40,000 spindles in my estate, in a variety of configurations. We have not had a raid group failure in the last 5 years. I have used triple mirrors, but not for resilience - they're for making clone copies for backup reasons. I have used multi-site replicas for DR reasons - which I've used - but none of these have been needed for RG failures either. – Sobrique Mar 12 '15 at 15:38
  • Oh, I see what you mean. The triple mirrors allow you to simply remove a clone and do a backup. Okay, I didn't think of that. I was only thinking about backups done at a higher level such as taking an atomic snapshot on a copy-on-write filesystem then copying that snapshot elsewhere. Can you not use RAID 1+6 for multisite mirroring in the same way that you did for RAID 6+1? I.e. have multisite RAID 1 arrays in a RAID 6 array. (Btw., I've been interpreting ‘DR’ as ‘disaster recovery’, but it could be a couple of other things. I take it that by ‘RG’ you mean ‘RAID group’. Can you confirm these?) – James Haigh Mar 12 '15 at 16:19
  • Yes, those are what I meant. A RAID5 or 6 that you then mirror to another RAID5 or 6 makes sense. I'm not sure how you'd do it the other way around, as you'd need multiple sites involved in the RAID parity calc. – Sobrique Mar 12 '15 at 16:22
  • I've just realised that RAID 1+6 also allows you to split-off for backup, simply by freezing and removing 1 of each drive of all but 2 of the RAID 1 pairs, and while doing so, it still has far better reliability when degraded than the degraded RAID 1+0 array! :-D (Also, I've just noticed and read your edit.) – James Haigh Mar 12 '15 at 16:48
  • Actually you'd need all your RAID 1 pairs, to avoid having to reconstruct your RAID group. And you still have the RAID-6 write penalty - 12 now, because of mirroring, rather than the 3 of a triple mirror. We already eschew RAID6 for a lot of scenarios because of write penalty - it's just not worth it unless you're running low grade disks. – Sobrique Mar 12 '15 at 18:50
  • If a RAID 6 array of 6 identical 1TB drives is unable to write all 4TB of storage data in the time that it takes for the slowest drive to write its 6th of the array data, then this is not a theoretical limitation but a limitation in the _implementation_. Maybe the RAID controller is the write throughput bottleneck due to not computing parity fast enough. Software RAID on a slow processor would also incur this write penalty. This is a different matter because not all implementations will have a processing bottleneck (especially if the parity computations are done on a dedicated ASIC). – James Haigh Mar 12 '15 at 19:12
  • 1
    You're misunderstanding what write penalty is. It's that for a single overwrite, you must read from your two parity devices, compute parity, write back to your two parity devices and your target block. Thus 6 IOs per 'write'. This is not a software or implementation limitation. You partially mitigate with good write caching, but only partially. – Sobrique Mar 12 '15 at 19:17
  • So synonymous with _write amplification_. Please see my comments in the question's main comment section about that. Particularly [this one](https://serverfault.com/questions/674462/why-is-raid-16-not-a-more-common-layout#comment827054_674462), i.e. when writing a full stripe you can just clobber the whole stripe, including the parity. – James Haigh Mar 12 '15 at 19:36
  • I'm aware of the purpose of write caching. The assumption that you can ignore the problem however, is mistaken. It's a _partial_ solution to the problem. – Sobrique Mar 12 '15 at 19:55
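The write-penalty argument in these comments can be put into numbers. This is a sketch of the naive small-random-write model only, using the penalties quoted in the thread (6 IOs for a RAID 6 overwrite, one IO per copy for a mirror) and treating nesting as multiplicative; it deliberately ignores full-stripe writes and write caching, which is exactly what the asker disputes. The per-drive figure of 100 IOPS is an arbitrary illustrative number, not a measurement.

```python
RAID6_PENALTY = 6  # per the comment above: 6 IOs per single-block overwrite


def mirror_penalty(copies: int) -> int:
    """A mirror writes every block once per copy."""
    return copies


def nested_penalty(*penalties: int) -> int:
    """Naive assumption: penalties of nested levels multiply."""
    result = 1
    for p in penalties:
        result *= p
    return result


def random_write_iops(drives: int, iops_per_drive: float, penalty: int) -> float:
    """Back-end IOPS shared out over the IOs each front-end write costs."""
    return drives * iops_per_drive / penalty


raid16 = nested_penalty(mirror_penalty(2), RAID6_PENALTY)  # mirrored RAID 6
triple = mirror_penalty(3)                                 # stripe of 3-way mirrors

print("RAID 1+6:", raid16, "IOs/write,", random_write_iops(12, 100, raid16), "IOPS")
print("3-way mirrors:", triple, "IOs/write,", random_write_iops(12, 100, triple), "IOPS")
```

In this model the 12-drive RAID 1+6 delivers one drive's worth of random-write IOPS and the stripe of triple mirrors four drives' worth.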
3

Modern and advanced systems don't implement shapes like that because they're excessively complicated, completely unnecessary, and contrary to any semblance of efficiency.

As others have pointed out, the ratio of raw space to usable space is essentially 3:1. That is essentially three copies (two redundant copies). Because of the calculation cost of "raid6" (twice over, if mirrored), and the resulting loss of IOPS, this is very inefficient. In ZFS, which is very well designed and tuned, the equivalent solution, capacity-wise would be to create a stripe of 3-way mirrors.

As an example, instead of a mirror of 6-way raid6/raidz2 shapes (12 drives total), which would be very inefficient (also not something ZFS has any mechanism to implement), you would have 4x 3-way mirrors (also 12 drives). And instead of 1 drive worth of IOPS, you would have 4 drives worth of IOPS. Especially with virtual machines, that is a vast difference. The total bandwidth for the two shapes may be very similar in sequential reads/writes, but the stripe of 3-way mirrors would definitely be more responsive with random read/write.

To sum up: raid1+6 is just generally impractical, inefficient, and unsurprisingly not anything anyone serious about storage would consider developing.

To clarify the IOPS disparity: With a mirror of raid6/raidz2 shapes, with each write, all 12 drives must act as one. There is no ability for the total shape to split the activity up into multiple actions that multiple shapes can perform independently. With a stripe of 3-way mirrors, each write may be something that only one of the 4 mirrors must deal with, so another write that comes in doesn't have to wait for the whole omnibus shape to deal with before looking at further actions.

killermist
2

Since no one said it directly enough: Raid6 write performance is not marginally worse. It is horrible beyond description if put under load.

Sequential writing is OK and as long as caching, write merging etc. is able to cover it up, it looks ok. Under high load, things look bad and this is the main reason a 1+5/6 setup is almost never used.

Florian Heigl
  • I agree, but that's primarily because what you said is just a super summed-up version of what I said. And of course I agree with myself. – killermist Mar 15 '15 at 20:10
1

Seek times

The issue is that the write seek amplification behaves very differently to the write throughput amplification. The minimal write throughput amplification with parity occurs when an entire stripe is written at once (let's call this adjective ‘full-stripe’) yet the minimal write seek amplification occurs, conversely, when the entire write following a seek in the virtual device fits into a single chunk. Before going into detail, the relationships are much easier to convey in tabulated form:

RAID | write throughput amplification factor | write seek amplification factor
     | full-stripe (e.g.) | single-chunk     | full-stripe  | single-chunk
   0 | 1           ;  1   | 1           ;  1 | n       ; 12 | 1           ;  1
   1 | n           ; 12   | n           ; 12 | n       ; 12 | n           ; 12
   5 | n/(n - 1)   ; ~1.1 | min [3, n]  ;  3 | n       ; 12 | min [3, n]  ;  3
   6 | n/(n - 2)   ;  1.2 | min [5, n]  ;  5 | n       ; 12 | min [5, n]  ;  5
*1+0 | n₁          ;  3   | n₁          ;  3 | n       ; 12 | n₁          ;  3*
 1+5 | n/(n₅ - 1)  ;  2.4 | expr₁       ;  5 | n       ; 12 | expr₁       ;  5
*1+6 | n/(n₆ - 2)  ;  3   | expr₂       ;  8 | n       ; 12 | expr₂       ;  8*
expr₁ = 2n₁ + min [1, n₅ - 2]
expr₂ = 3n₁ + min [2, n₆ - 3]

where n is the total number of drives, n₁ is the number of drives in each RAID 1 group, and n₀, n₅ and n₆ are the number of groups in the RAID 0, RAID 5 or RAID 6 array respectively. Examples relate to the 12-drive example in the question (relevant rows are ‘*bolded*’); the example layouts for RAID levels 1+0, 1+5 and 1+6 are 4×3, 6×2 and 6×2 respectively.

Note that only the full-stripe write throughput amplification factor is directly related to the proportion of redundancy. The single-chunk cases are more complicated for those with parity. They arise because writing a single chunk requires reading whichever is easiest of the parity chunks or the other data chunks, before writing the parity chunks along with the new data chunk. (They are not directly multiplicative because the induced reads must instead be multiplied by the respective read throughput/seek amplification factor for RAID 1, both being 1; see below.)
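The single-chunk column can be reproduced with a small sketch. It rests on a generalisation that is my reading of the paragraph above rather than something stated outright: a single-chunk write costs (1 + number of parity chunks) group writes, each multiplied by the mirror width, plus the cheaper of reading the parity chunks or reading the other data chunks. The function and parameter names are illustrative only.

```python
def single_chunk_write_amplification(groups, mirrors=1, parity=0):
    """Single-chunk write amplification factor.

    groups  -- members in the striped/parity array (1 for plain RAID 1)
    mirrors -- drives per RAID 1 group (1 when there is no mirroring)
    parity  -- parity chunks per stripe: 0 (RAID 0), 1 (RAID 5), 2 (RAID 6)
    """
    # The new data chunk and every parity chunk go to all mirrors of a group.
    writes = (1 + parity) * mirrors
    # Read whichever is fewer: the parity chunks or the other data chunks.
    # A read touches one drive of a mirror, so it is not multiplied.
    reads = min(parity, groups - 1 - parity)
    return writes + reads


# The 12-drive examples from the table (4x3 for 1+0, 6x2 for 1+5 and 1+6).
factors = {
    "0": single_chunk_write_amplification(groups=12),
    "1": single_chunk_write_amplification(groups=1, mirrors=12),
    "5": single_chunk_write_amplification(groups=12, parity=1),
    "6": single_chunk_write_amplification(groups=12, parity=2),
    "1+0": single_chunk_write_amplification(groups=4, mirrors=3),
    "1+5": single_chunk_write_amplification(groups=6, mirrors=2, parity=1),
    "1+6": single_chunk_write_amplification(groups=6, mirrors=2, parity=2),
}

for level, factor in factors.items():
    print(f"RAID {level:>3}: {factor}")
```

With `mirrors=2` the two nested cases reduce to expr₁ = 2n₁ + min [1, n₅ - 2] and expr₂ = 3n₁ + min [2, n₆ - 3].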

Unfortunately, choosing a chunk size that minimises this extra write throughput amplification has the side-effect of actually maximising the write seek amplification. For tiny writes with a negligible write time compared to the seek time, the write performance of striping with a very small chunk size (to be full-stripe) is only 1×, like mirroring, as it requires all of the drives to seek to the chunks for each write and the gained throughput from mobilising all of these drives is irrelevant. It has divided the ratio of write time to seek time by the number of drives in the array, but for tiny writes this was already negligible. It wouldn't make sense to use a chunk size so small as to make even tiny writes be full-stripe. For writes small enough to feel the effects of seeking, it is best that they fit within a single chunk.

RAID | large contiguous write throughput    | concurrent tiny writes throughput
     | full-stripe    | single-chunk        | full-stripe | single-chunk
   0 | n×       ; 12× | n×          ; 12×   | 1×     ; 1× | n×          ; 12×
   1 | 1×       ;  1× | 1×          ;  1×   | 1×     ; 1× | 1×          ;  1×
   5 | (n - 1)× ; 11× | max[n/3, 1]×;  4×   | 1×     ; 1× | max[n/3, 1]×;  4×
   6 | (n - 2)× ; 10× | max[n/5, 1]×;  2.4× | 1×     ; 1× | max[n/5, 1]×;  2.4×
*1+0 | n₀×      ;  4× | n₀×         ;  4×   | 1×     ; 1× | n₀×         ;  4×  *
 1+5 | (n₅ - 1)×;  5× | expr₃×      ;  2.4× | 1×     ; 1× | expr₃×      ;  2.4×
*1+6 | (n₆ - 2)×;  4× | expr₄×      ;  1.5× | 1×     ; 1× | expr₄×      ;  1.5×*
expr₃ = n/(2n₁ + min [1, n₅ - 2]) = max [n/(2n₁ + 1), n/(2n₁ + n₅ - 2)]
expr₄ = n/(3n₁ + min [2, n₆ - 3]) = max [n/(3n₁ + 2), n/(3n₁ + n₆ - 3)]

Note: The middle 2 throughput columns can be ignored given a sensible chunk size that is bigger than the writes for which the seek time is significant, but small enough such that large writes are full-stripe. The large chunk size of the 2nd throughput column is more akin to spanned drives. A ‘tiny’ write is where the effect of throughput is negligible.
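As a check on expr₃ and expr₄, the following sketch evaluates both the quotient form and the max form for the 6×2 layouts. It assumes n = n₁ × (number of groups), which holds for the examples used here; the function names are invented for illustration.

```python
import math


def expr3(n1, n5):
    """Single-chunk write throughput multiplier for RAID 1+5."""
    n = n1 * n5  # assumption: every group has the same n1 drives
    return n / (2 * n1 + min(1, n5 - 2))


def expr3_max_form(n1, n5):
    """The same quantity written as a max of two quotients."""
    n = n1 * n5
    return max(n / (2 * n1 + 1), n / (2 * n1 + n5 - 2))


def expr4(n1, n6):
    """Single-chunk write throughput multiplier for RAID 1+6."""
    n = n1 * n6
    return n / (3 * n1 + min(2, n6 - 3))


def expr4_max_form(n1, n6):
    n = n1 * n6
    return max(n / (3 * n1 + 2), n / (3 * n1 + n6 - 3))


raid15 = expr3(n1=2, n5=6)   # 6x2 RAID 1+5
raid16 = expr4(n1=2, n6=6)   # 6x2 RAID 1+6
raid10 = 4                   # n0 for 4x3 RAID 1+0

print(f"1+5: {raid15:.2f}x  1+6: {raid16:.2f}x  1+0: {raid10}x")
print(f"1+6 relative to 1+0 for tiny writes: {raid16 / raid10:.3f}")
```

The two forms agree because taking the min in the denominator is the same as taking the max of the two resulting quotients.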

Having an inappropriately small chunk size also increases the effect of seek amplification for reads, though not as much and only in the full-stripe case.

RAID | read throughput amplification factor | read seek amplification factor
     | full-stripe      | single-chunk      | full-stripe (e.g.) | single-chunk
   0 | 1                | 1                 | n      to n;    12 | 1
   1 | 1                | 1                 | 1      to n;  1–12 | 1
   5 | 1                | 1                 | n - 1  to n; 11–12 | 1
   6 | 1                | 1                 | n - 2  to n; 10–12 | 1
*1+0 | 1                | 1                 | n₀     to n;  4–12 | 1           *
 1+5 | 1                | 1                 | n₅ - 1 to n;  5–12 | 1
*1+6 | 1                | 1                 | n₆ - 2 to n;  4–12 | 1           *

Note: The ‘to n’ is because when there is only one read happening concurrently, it is theoretically possible to mobilise all drives to seek to appropriate places and collectively read the data for maximum large contiguous read throughput.

RAID | large contiguous read throughput | concurrent tiny reads throughput
     | full-stripe (e.g.)| single-chunk | full-stripe         | single-chunk
   0 | n×          ; 12× | n×     ; 12× | 1×          ;  1×   | n×     ; 12×
   1 | n×          ; 12× | n×     ; 12× | n×          ; 12×   | n×     ; 12×
   5 | n×          ; 12× | n×     ; 12× | n/(n - 1)×  ; ~1.1× | n×     ; 12×
   6 | n×          ; 12× | n×     ; 12× | n/(n - 2)×  ;  1.2× | n×     ; 12×
*1+0 | n×          ; 12× | n×     ; 12× | n₁×         ;  3×   | n×     ; 12×*
 1+5 | n×          ; 12× | n×     ; 12× | n/(n₅ - 1)× ;  2.4× | n×     ; 12×
*1+6 | n×          ; 12× | n×     ; 12× | n/(n₆ - 2)× ;  3×   | n×     ; 12×*

Note: Again, the middle 2 throughput columns can be ignored given a sensible chunk size. The 3rd throughput column is again closely connected to the proportion of redundancy.

However, a large enough chunk size means that tiny reads are never full-stripe. So given an efficient implementation and appropriate chunk size, read performance should be proportional to the number of identical drives when not degraded.

So really, ‘amplification factor’ is a lot more complicated than the formula in the question, where only full-stripe throughput amplification had been considered. In particular, the write performance of 6×2 RAID 1+6 for concurrent writes that are small enough to be seek-bound will be worse than that of 4×3 RAID 1+0. And for tiny writes, which are all seek, the performance can only be about a third of that of 4×3 RAID 1+0 at absolute best (1.5× versus 4×, and that is given a perfect implementation).

Having cleared that issue up, the 12-drive comparison does not have an outright winner:

                                  | 4×3 RAID 1+0 | 6×2 RAID 1+6
   number of identical 1TB drives | 12           | 12
                 storage capacity | 4TB          | 4TB
            redundancy proportion | 2/3          | 2/3
large contiguous write throughput | 4×           | 4×
 large contiguous read throughput | 12×          | 12×
concurrent tiny writes throughput |*4×           | 1.5×
 concurrent tiny reads throughput | 12×          | 12×
  safe no. of random drive losses | 2            |*5
    12 - 1 large write throughput | 4×           | 4×
     12 - 1 large read throughput | 8×           |*11×
    12 - 1 tiny writes throughput |*4×           | ~1.42×
     12 - 1 tiny reads throughput | 8×           |*~9.33×
  can split-off a copy for backup | yes[1]       | yes[1]
                  2-site failover | yes          | yes
    2-copy large write throughput | 4×           | 4×
     2-copy large read throughput |*8×           | 6×
    2-copy tiny writes throughput |*4×           | ~1.28×
     2-copy tiny reads throughput |*8×           | 6×
  2-copy safe random drive losses | 1            |*2
2-copy - 1 large write throughput | 4×           | 4×
 2-copy - 1 large read throughput | 4×           |*5× or 6×[2]
2-copy - 1 tiny writes throughput |*4×           | ~1.46× or 1.2×[2]
 2-copy - 1 tiny reads throughput | 4×           |*3.6× or 6×[2]
can be divided into 3 full copies | yes          | yes
                  3-site failover | yes          | yes
    1-copy large write throughput | 4×           | 4×
     1-copy large read throughput | 4×           | 4×
    1-copy tiny writes throughput |*4×           | ~0.85×
     1-copy tiny reads throughput |*4×           | 2×
  1-copy safe random drive losses | 0            | 0
                       complexity |*simple       | more complex

Note 1: A full copy of the stored data is respectively a RAID 0 quadruple or a 4/6 degraded RAID 6 array.

Note 2: There's an even chance as to whether the drive failure offlines one of the 4 degraded RAID 1 pairs or degrades one of the 2 normal pairs.

Nevertheless, 6×2 RAID 1+6 would have double the read performance of a RAID 6 array of 6 drives, and its tiny writes throughput should be 25% better (1.5/1.2) because the required reads are divided between the RAID 1 pairs. RAID 6 obviously does have suitable applications, so in high-availability applications that have larger writes, or that are more concerned about read performance than write performance, maybe there is a niche for RAID 1+6 after all. But that's not all…
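The 25% and 2× figures follow from the single-chunk formulas in the tables above. A minimal sketch of the arithmetic, with illustrative function names and assuming identical drives and a perfect implementation:

```python
def raid6_tiny_writes(n):
    """Concurrent tiny writes throughput multiplier for plain RAID 6: max[n/5, 1]."""
    return max(n / 5, 1)


def raid16_tiny_writes(n1, n6):
    """Concurrent tiny writes throughput multiplier for RAID 1+6 (expr4)."""
    n = n1 * n6  # assumption: n6 groups of n1 drives each
    return n / (3 * n1 + min(2, n6 - 3))


plain = raid6_tiny_writes(6)              # RAID 6 array of 6 drives
nested = raid16_tiny_writes(n1=2, n6=6)   # 6x2 RAID 1+6
write_gain = nested / plain - 1

# Non-degraded tiny reads scale with the number of drives (12 versus 6).
read_ratio = 12 / 6

print(f"tiny writes: {plain:.1f}x vs {nested:.1f}x, gain {write_gain:.0%}")
print(f"tiny reads: {read_ratio:.0f}x the plain RAID 6 array")
```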

Complexity

This is still just theory so far (mostly combinatorics); in practice, the complexity means that implementations of RAID 1+6 may have deficiencies that miss opportunities and don't achieve the theoretical outcomes. RAID 6 is already more complex, and nesting adds a bit more complexity on top of this.

For example, it's not immediately obvious that 6×2 RAID 1+6 can be abstracted as having 3 independent virtual read heads capable of performing 3 large contiguous reads simultaneously at 4× throughput each, just like 4×3 RAID 1+0. Simply nesting 6 RAID 1 pairs in a RAID 6 array using a software RAID may not be so elegant; the implementation may be stupid and thrash (I haven't tested this hypothesis yet though).
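One way to picture the ‘virtual read heads’ abstraction is the sketch below. It assumes that a large read is full-stripe, that each data chunk of a stripe is served by a different drive, and that an ideal implementation spreads concurrent reads over the mirror copies; `virtual_read_heads` is a hypothetical helper, not part of any RAID tool.

```python
def virtual_read_heads(total_drives, data_groups):
    """Return (heads, per_head_throughput) for non-degraded large reads.

    One full-stripe read keeps `data_groups` drives busy, so the remaining
    drives can serve other reads independently.
    """
    heads = total_drives // data_groups
    return heads, data_groups


raid10 = virtual_read_heads(12, 4)       # 4x3 RAID 1+0: 4 striped triples
raid16 = virtual_read_heads(12, 6 - 2)   # 6x2 RAID 1+6: 4 data chunks per stripe

for name, (heads, speed) in (("4x3 RAID 1+0", raid10), ("6x2 RAID 1+6", raid16)):
    print(f"{name}: {heads} heads at {speed}x each, {heads * speed}x in total")
```

Both layouts come out the same in this idealised model; whether a real nested software RAID schedules reads this well is exactly the open question.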

The complexity also presents an increased cost of development of implementations and tools. Even though there may be applications that could benefit from such a nesting, the improvements may not be worth the development costs.

James Haigh
  • Please state your source for this information. A practical test with large or tiny writes does not concur with the performance you have suggested. – JamesRyan Mar 15 '15 at 02:13
  • @JamesRyan: This is not second-hand information. The theoretical outcomes are derived from the fundamentals of how standard RAID levels work. All that's needed for the theory is an understanding of how RAID works and an understanding of logic and mathematical derivation. If these calculations were done by someone else then I would of course state this and provide links for reference if possible. Note that there are many ways in which a practical RAID 1+6 implementation can be suboptimal, but different implementations will vary. What I'd like to know is _why_ your practical test doesn't concur. – James Haigh Mar 15 '15 at 03:11
  • @JamesRyan: Please could you give more details on what implementation you used, what drives you used, in which configurations, with what benchmarking methods? Did you try both a RAID 6 array of 6 RAID 1 pairs and a RAID 0 array of 4 RAID 1 triples with the same 12 drives and chunk size? Was it a software RAID? – James Haigh Mar 15 '15 at 03:16
  • Since you are proposing an untested theory that defies conventional wisdom, why don't you detail your setup where you have shown it to work? I would guess that, since your maths differs from every other source on this subject as well as real-world tests, the reason why it doesn't work is that your maths is wrong. – JamesRyan Mar 16 '15 at 14:31