
According to the comments on my other question, it's quite possible that the reason my MD RAID array is performing poorly is that my RAID5 array has 5 data disks. I've tried searching for information on why this would matter, but haven't found anything, so I'm looking for an explanation of why this is an issue and what sort of impact it can have compared to having 4 data disks.

Matthew Scharley
  • possible duplicate of [Samba running slowly when writing files](http://serverfault.com/questions/363784/samba-running-slowly-when-writing-files) – MDMarra Mar 03 '12 at 03:01
  • Seriously, this should be addressed in your original question. No need to keep posting dupes with links to the original. – MDMarra Mar 03 '12 at 03:02
  • @MDMarra: Then close the old question. Rewriting the entire question isn't productive either. The issue has nothing to do with Samba, and everything to do with the hardware. – Matthew Scharley Mar 03 '12 at 03:04
  • You should edit your original question to be relevant. As you uncover more details about a problem, it's appropriate to refine your question to be more specific rather than discarding it and opening a new one. If you have a problem, you can open a question on [meta] about it. – MDMarra Mar 03 '12 at 03:06
  • @MDMarra: I disagree; this is a separate question that can stand alone, with a reference to a previous question for background if anyone wants more context. – womble Mar 03 '12 at 03:08
  • @womble the initial question was about a performance issue that the OP initially attributed to Samba, but which is actually possibly a problem with his disk geometry. This is the exact same problem as the original, just without the (wrongful?) attribution to Samba. – MDMarra Mar 03 '12 at 03:12
  • I disagree that the original question has garnered an answer that demonstrates that the problem isn't with Samba. A *comment* on that question has intimated that the problem *may* be RAID-related, and now a separate question has been asked about RAID performance issues. Think of it this way -- if the original question was rewritten to be this question, wouldn't it make the comments on that question (and any answers that the original question may have gotten) look like complete nonsense? If so, the question shouldn't be rewritten. – womble Mar 03 '12 at 03:15
  • http://meta.serverfault.com/questions/3073/how-much-should-you-refine-a-question-before-opening-a-new-question – Matthew Scharley Mar 03 '12 at 03:17

2 Answers


I have never heard of any sort of "odd/even" performance impact of a RAID5/6 array, and like you, I can't find anything useful from a quick web search.

There are potential issues with the number of disks and write performance in a RAID5/6 array, but they're of the form "more disks == slower writes", because (depending on the implementation) the RAID system may want to read all of the other data disks in the stripe in order to recalculate parity (so for a 6-disk RAID5, a single-block write would involve 4 reads -- one for each of the unchanged data blocks in the stripe -- and two writes -- one for the changed block, and one for the parity block). A good implementation will instead read the old contents of the changed data block and the parity block, recalculate the parity from the difference, and then write the new data block and the new parity block, meaning two reads and two writes regardless of the number of disks in the set.
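
Here's a minimal Python sketch of that read-modify-write path (just an illustration of the XOR arithmetic, not md's actual code; the block contents are made up). The point is that the new parity can be computed from only the old data block and the old parity block, so the cost stays at two reads and two writes no matter how wide the array is:

```python
def rmw_parity_update(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """Return the new parity block after overwriting a single data block.

    RAID 5 parity is the XOR of all data blocks in the stripe, so:
        new_parity = old_parity XOR old_data XOR new_data
    """
    return bytes(p ^ o ^ n for p, o, n in zip(old_parity, old_data, new_data))


# Two reads (old data block, old parity block), one recalculation,
# two writes (new data block, new parity block) -- independent of
# how many disks are in the stripe.
old_data   = bytes([0x00, 0xFF, 0x0F, 0xF0])   # made-up old block contents
new_data   = bytes([0xAA, 0x55, 0xAA, 0x55])   # made-up new block contents
old_parity = bytes([0x12, 0x34, 0x56, 0x78])   # made-up old parity

new_parity = rmw_parity_update(old_data, new_data, old_parity)
print(new_parity.hex())
```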

womble

It's the reverse. The issue is when the number of drives is even with RAID 5. With RAID 5, the ideal number of drives is one more than a power of 2, so 5 drives is one of the optimum sizes. This allows the implementation to make both the block size and the stripe size powers of two.

With five drives, the stripe size (the amount of user data that must be written to the RAID array as a unit) will be four times the block size (the amount of user data that must be written to a single drive as a unit). The block size must be a multiple of 512 bytes (or 4KB on newer drives), and frequently it must be a power of two equal to or greater than the drive's native block size. So with five drives, the stripe size must be 2KB or more (16KB or more on 4KB drives).
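
To make the arithmetic concrete, here's a small Python sketch (purely illustrative; the drive counts and block sizes below are examples, not recommendations) that computes the full-stripe size for a few drive counts and block sizes. Note that only the 5-drive case keeps the stripe a power of two whenever the block size is one:

```python
def raid5_stripe_size(num_drives: int, block_size: int) -> int:
    """User data per full RAID 5 stripe: block size times (drives - 1)."""
    return (num_drives - 1) * block_size


for drives in (4, 5, 6):
    for block in (512, 4096, 64 * 1024):
        stripe = raid5_stripe_size(drives, block)
        is_pow2 = (stripe & (stripe - 1)) == 0
        print(f"{drives} drives, {block:>6}-byte blocks -> "
              f"{stripe:>7}-byte stripe (power of two: {is_pow2})")
```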

As a general rule, the performance boost of adding an additional spindle will exceed the performance cost of having a sub-optimal drive count, so an array with six drives will still typically outperform an array with five drives. On typical RAID 5 performance graphs, 3 drives and 4 drives will be right near each other, with 4 slightly on top. Then a bit up will be 5 drives and 6 drives right near each other, with 6 slightly on top.

David Schwartz
  • I'm a bit confused. Isn't the stripe size independent of the number of disks (edit: and user-specifiable)? What do you mean by "block"? Disk sector size, which is 512B or 4KB? – Mark Wagner Mar 03 '12 at 03:41
  • No, the stripe size is not independent of the number of disks. If there are five disks, the stripe size must be four times the block size. That's how RAID 5 works and how it tolerates the loss of a disk. (It's `n-1` for RAID 5, `n-2` for RAID 6.) [See Wikipedia.](http://en.wikipedia.org/wiki/Standard_RAID_levels#RAID_5) I explained what I meant by block size, "amount of user data that must be written to a drive as a unit". It can be any multiple of the disk's native size. – David Schwartz Mar 03 '12 at 03:43
  • You are confusing the block size with the stripe size. You can specify the block size, and it's independent of the number of disks (but must be a multiple of the drive's native block size). But then the stripe size will be, for RAID 5, the block size times the number of disks minus one. That's how RAID 5 works. – David Schwartz Mar 03 '12 at 03:48
  • Blocks have nothing to do with it. Typical stripe factors are 64-512KB. Recent versions of `mdadm` default to 512KB so a 4 disk array will have a stripe size of 2 MB, and a 5 disk will be 2.5 MB. Whether it is 2 or 2.5 MB makes no difference; there is no reason it should be an even power of two. – psusi Mar 03 '12 at 05:36
  • How can a 4 disk RAID 5 array have a stripe size of 2MB? That would be 2/3MB per disk. If it writes, say 512KB per disk, that's 512KB*3 of data and 512KB of parity for each stripe. – David Schwartz Mar 03 '12 at 07:29
  • @psusi: Whether it is 2MB or 2.5MB makes a reasonably significant difference. Requests from higher levels tend to be for aligned blocks. With 2.5MB blocks having to go out to the RAID, a higher fraction of writes won't be complete blocks and more writes will require a preceding read. Let me put it to you this way -- if we did a survey of write sizes from major applications, do you think more write clusters (sequential writes from the same source) would be multiples of 2MB in size or multiples of 2.5MB in size? Even if you use, say, a 1MB buffer, two buffers will fit perfectly. – David Schwartz Mar 03 '12 at 09:00
  • this is nonsense, any layer above the controller treats disks as a continuous range of sectors. there's no need to make the stripe size a power of two. besides, a simple big file transfer would be mostly linearly writing whole stripes. the write amplification factor (two reads and two writes per write command) applies only for small writes. long writes just overwrite the whole stripe, so the factor is just (N+1)/N writes instead of 2N reads+2N writes – Javier Mar 03 '12 at 09:47
  • So you're saying that bulk writes with sizes that are even multiples of 2.5MB are just as common as bulk writes that are multiples of 2MB? That seems a bit implausible to me. – David Schwartz Mar 03 '12 at 09:52
  • it doesn't matter a bit. the IO schedulers split and join operations as needed. long (whole file size) writes are the perfect opportunity for that. – Javier Mar 03 '12 at 14:12
  • Oops, yes, I was off by one, yes, it's 5 disks for 2 MB. The buffer size the application uses doesn't matter since it just goes to/from the kernel cache and is transferred to/from the disks from there. As long as the read-ahead size is set appropriately, the disks will be kept saturated no matter what the stripe width is. – psusi Mar 03 '12 at 18:30
  • After they split and join operations, do you think they wind up with more 2MB operations or 2.5MB operations? – David Schwartz Mar 03 '12 at 21:44
  • the final scheduling is per device, written sequentially and aligned. – Javier Mar 05 '12 at 00:26
  • Right, and when that final scheduling is done, do you think there are more operations that align on multiples of 2MB or operations that align (start or end) on multiples of 2.5MB? That is, which underlying stripe size do you think will require the greatest number of reads-before-writes? Stripe sizes that are powers of two better adapt to buffers that are powers of two and most software uses buffer sizes that are powers of two. – David Schwartz Mar 05 '12 at 00:33
  • The final scheduling is done at the drive level, so neither 2MB nor 2.5 MB multiples matter at all. The individual drives just see requests that are as large as possible, given the original request, the readahead length, and how the request is split across the drives. As long as the original request + readahead is larger than the stripe size, the drives will all get a sufficiently large request for optimal throughput. So if the application asks for 1 MB and the readahead and stripe width are 2.5 MB, then all drives get requests for 512k+. – psusi Mar 06 '12 at 03:53
  • @psusi: That's nonsense. Whether an operation is sent to the drive as a pure write or a "read-then-write" depends on whether it can be aggregated to a multiple of the stripe size. So that aggregation has to be done before drive level, at the RAID level. – David Schwartz Mar 06 '12 at 04:02
  • For partial stripe writes, md has to read the parity block, correct it for the new data, then write the new data plus the new parity block. Neither reading nor writing whole stripes is required, and even if it did, there still would be no reason to keep stripes an even power of two. – psusi Mar 06 '12 at 16:39
  • @psusi: The reason would be that fewer reads are required to write because fewer partial stripe writes are necessary. Fewer partial stripe writes are necessary because after aggregation, at the RAID layer, more writes will be even multiples of 2MB than of 2.5MB. – David Schwartz Mar 06 '12 at 20:43
  • Aggregation does not care about even powers of two. It aggregates as much data as it can. If your load tends to be small random IO, then the more disks you add, the more partial writes you will get, but for sequential IO, the application can write one byte at a time if it wants to, and the kernel will aggregate it into whole stripes. In either case, whether the stripe size is an even power of two does not matter. – psusi Mar 06 '12 at 22:12
  • @psusi: I don't understand why you keep taking us in circles. Yes, there's aggregation. But **after aggregation**, if the write is an even multiple of the stripe size, no read before write is needed. Otherwise, a read is needed before the write. Do you think more writes will be even multiples of 2MB or will be even multiple of 2.5MB? If more writes are even multiples of 2MB than 2.5MB, then a RAID with a stripe size of 2MB will perform better since less data will need to be read in order to write. – David Schwartz Mar 06 '12 at 22:25
  • If you are writing a continuous stream of data, the aggregated writes will be on the order of 100 MB, so whether the underlying stripe size is 2 or 2.5 MB doesn't matter; each stripe will be completely written as a whole until the cache size is reduced enough, then writes will stop for a bit to allow more to be aggregated into several additional whole stripes. It's not like the kernel stops aggregating at 2 MB then tries to write a partial stripe every time. – psusi Mar 07 '12 at 04:11
  • @psusi: You're taking us in circles. Right, we all agree on that. Again, the question is: After this aggregation, do you think more writes will be an even multiple of 2MB or more writes will be an even multiple of 2.5MB? – David Schwartz Mar 07 '12 at 04:13
  • Neither. As I said before, the smaller size will be more prevalent for random IO, simply because it is smaller, not because it is an even power of two. For sequential IO, you will end up with full stripe writes whether the stripe size is 1, 2, 2.5, or 3 MB. – psusi Mar 07 '12 at 04:17
  • let us [continue this discussion in chat](http://chat.stackexchange.com/rooms/2716/discussion-between-david-schwartz-and-psusi) – David Schwartz Mar 07 '12 at 05:03