8

For a Dell R920 with 24 x 1.2TB disks (and 1TB RAM), I'm looking to set up a RAID 5 configuration for fast IO. The server will be used to host KVM VMs that will be reading/writing files of all sizes, including very large files. I am not terribly interested in data safety because if the server fails for any reason, we'll just re-provision the server from bare metal after replacing the failed parts. So, performance is the main concern. We're considering RAID 5 because it allows us to distribute data over multiple spindles and therefore gives us better performance and, while not our main concern, also gives us some data protection. Our NIC is dual 10Gbps.

I'm limiting this question to RAID 5 only because we think this will give the best performance. We'll consider something else only if there is a compelling performance reason; otherwise, I'd prefer answers related to RAID 5 configurations.

Okay, with the above stated, here are our current configuration thoughts:

  • 24 Hard Drives: RMCP3: 1.2TB, 10K, 2.5" 6Gbps
  • RAID Controller: H730P, 12Gbps SAS Support, 2GB NV Cache
  • 1 Hot Spare (just to give us some longer life if a drive does fail)
  • 23 Data Drives (one drive's worth of capacity is consumed by parity, leaving 22 drives' worth for data)
  • Stripe Size: 1MB (1MB / 22 data drives = ~46.5KB per disk -- or do I misunderstand stripe size?)
  • Read Policy: Adaptive Read Ahead
  • Write Policy: Write Back
  • Disk Cache Policy: Enabled

If the stripe size is the TOTAL across the data drives, then I figured ~46.5KB per drive will give us very good throughput. If the stripe size is per spindle, then I've got this all wrong.

Is the stripe size also the size that a single file takes up? For example, if there is a 2KB file, would choosing a stripe size of 1MB mean that we're wasting nearly an entire megabyte? Or can multiple files live within a stripe?

Lastly, when we install CentOS 6.5 (or latest), will we need to do something special to ensure that the filesystem uses RAID optimally? For example, mkfs.ext4 has an option -E stride that I'm told should correspond to the RAID configuration. But, during a CentOS installation, is there any way to have this done?

Many thanks for your thoughts on configuring RAID 5 for fast I/O.

ewwhite
Steve Amerige
  • 4
    RAID 5 is what you do *not* want to use if you want performance... its write speeds can be terrible. – Nathan C Jul 11 '14 at 13:32
  • 1
    Can you provide some context on the read/write workload and the application for this storage solution? – ewwhite Jul 11 '14 at 13:33
  • 1
    If you want performance, don't use HDDs at all: you can most likely achieve more with a good SSD storage system, or even a PCIe storage solution. If you don't care whether data is lost, that's all the more reason to go for SSDs or PCIe storage. – Dennis Nolte Jul 11 '14 at 13:49
  • 1
    RAID 5 is not for performance. Use RAID 1+0, or just RAID 0 if you really don't care about the data. By the way: those Dell controllers don't do more than 16 disks per RAID group, if I remember correctly. – Tonny Jul 11 '14 at 13:54
  • 2
    @Tonny I verified this. 16-disk maximum for that controller. – ewwhite Jul 11 '14 at 13:55
  • 7
    Everyone, please remember that you weren't born experts, everyone learns at some point. Please be nice to those who know less than you do. – Chris S Jul 11 '14 at 14:03
  • 1
    "I'm limiting this question to RAID 5 only because we think this will give the best performance." Why are you thinking? Instead of thinking, test! You'll only see the best performance on bulk write workloads. – MikeyB Jul 11 '14 at 14:12
  • Don't think - test, evaluate and make decisions based on your workload. – user9517 Jul 11 '14 at 14:19
  • 1
    @ChrisS This site is supposed to be for professionals. This is pretty basic 101 stuff. I'm not surprised that some of us sound a bit acerbic when we see a question like this come up again. But you are right: Because we ARE professionals we ought to be more professional in our answers/comments. P.S. Did you edit my comment? If so, thank you :-) – Tonny Jul 11 '14 at 14:48
  • Vote to close, since this is a clear "unclear what you're asking" case – poige Jul 12 '14 at 05:34

4 Answers

12

Please use RAID 1+0 with your controller and drive setup. If you need more capacity, a nested RAID level like RAID 50/60 could work. You can get away with RAID 5 on a small number of enterprise SAS disks (8 drives or fewer) because the rebuild times aren't bad. However, 24 drives is a terrible mistake. (Oh, and disable the individual disk caching feature... dangerous)

There are many facets to I/O and local storage performance. There are I/O operations/second, there's throughput, there's storage latency. RAID 1+0 is a good balance between these. Positive aspects here are that you're using enterprise disks, a capable hardware controller and a good number of disks. How much capacity do you require?

You may run into limits to the number of drives you can use within a virtual disk group. PERC/LSI controllers traditionally limited this to 16 drives for single RAID levels and RAID 1+0. The user guide confirms this. You wouldn't be able to use all 24 disks in a single RAID 5 or a single RAID 1+0 group.

Another aspect to consider, depending on your workload, is that you can leverage SSD caching using the LSI Cachecade functionality on certain PERC controllers. It may not be available for this, but understanding your I/O patterns will help tailor the storage solution.


As far as ext4 filesystem creation options, much of it will be abstracted by your hardware RAID controller. You should be able to create a filesystem without any special options here. The parameters you're referring to will have more of an impact on a software RAID solution.
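For illustration only, here is a minimal sketch of what those options would look like in a case where they do matter, e.g. an md software RAID (the device name, chunk size, and disk count below are assumptions, not a recommendation for this PERC setup): stride is the per-disk chunk size divided by the filesystem block size, and stripe-width is stride multiplied by the number of data disks.

    # Hypothetical md software-RAID example (NOT needed for a PERC virtual disk):
    # chunk = 64KB, filesystem block = 4KB  -> stride = 64 / 4 = 16
    # RAID 5 over 12 disks -> 11 data disks -> stripe-width = 16 * 11 = 176
    mkfs.ext4 -E stride=16,stripe-width=176 /dev/md0

On a hardware PERC virtual disk, a plain mkfs.ext4 during the CentOS install is normally all you need.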

ewwhite
  • Some very useful information here. We are looking for performance first, total disk space second. So, we're not considering RAID 10 because we'd lose half of our available 24TB of disk space. Our needs are very data heavy and we churn through TB of data quite easily. Because the consequence of server failure is a low priority for us (because we can just rebuild the server from the ground up after fixing bad hardware--that is, data safety is NOT our most important consideration), we want to maximize disk IO (both local and NFS/another story) and overall processing speed. – Steve Amerige Jul 11 '14 at 15:18
  • @SteveAmerige How much disk space do you need? You can't USE 24 disks in one RAID group, so there are some additional design considerations for your environment. Can you tell us what this system is here to do, what type of data is involved, and what the real performance requirements are? – ewwhite Jul 11 '14 at 15:19
  • The servers will be running VMs that, in turn, are doing heavy analytics on large quantities of data. The VMs are "throw away" as the results of the analytics are stored elsewhere. Essentially: bring data in, work on it, put data out. We're always short on disk space, so we want to lose as little disk space as possible, but we're willing to give some consideration to data safety (but not a LOT of consideration). So, from what you're saying we could, for example, create 4 disk groups, each having 6 drives, right? Or 3 disk groups, each having 8 drives. – Steve Amerige Jul 11 '14 at 15:26
  • So: 3 x 8 drives, or 2 x 12 drives in either RAID 0 or RAID 5. RAID 0 would give us no data safety, so we'd get 100% of our drives. RAID 5 would give us some data safety at the expense of effectively 1 hard drive per group, right? So, a 2 x 12 drive configuration, we'd lose 1/12 of the space for parity considerations = ~8.3% of our total disk space and some CPU and IO to do the additional writes that are needed for RAID 5, right? Importantly, with a 2 x 12 drive configuration, we'd be doing 12 reads/writes in parallel, giving us the performance boost we want, also right? – Steve Amerige Jul 11 '14 at 15:32
  • 1
    @SteveAmerige More detail! [The RAID 5 is a non-starter. You just should NOT use it in 2014.](http://www.reddit.com/r/sysadmin/comments/ydi6i/dell_raid_5_is_no_longer_recommended_for_any/) What type of data is this? What would the virtualization technology be? KVM? VMware? I think the design here really needs some refinement, especially before investing in so much hardware... Do you know what the size of the "working set" of data will be per-VM? In instances where that value is known, you can cache and optimize around that. Tiered storage. SSDs. Is the workload read-biased or write-biased? – ewwhite Jul 11 '14 at 15:43
  • Thanks... We're using KVM virtualization with hypervisors running CentOS 6.5 and VMs running RHEL 6.5 (for licensing reasons). The VMs tend to be around 200GB each, and process data ~50GB (copying that much into the server and then writing around 20GB of results). The equipment is already purchased. I'm at the configuration stage for the hardware. – Steve Amerige Jul 11 '14 at 15:46
  • 1
    `RAID 5 would give us some data safety at the expense of effectively 1 hard drive per group, right?` No, not right. With drives this size, and that many disks in a group, RAID 5 effectively gives you 0 data safety. May as well just throw the disks into two 12-disk RAID 0 arrays. – HopelessN00b Jul 11 '14 at 15:52
  • That link about RAID 5 is a bit deceptive. There is nothing wrong with RAID 5 provided you're being sensible about unrecoverable error rates. Consumer drives with an unrecoverable bit error rate of 1 in 10^14 mean roughly one error per ~12TB read. That's why there's a problem: with 24 1TB drives, the odds get eyewateringly high when you have to rebuild (i.e., read the whole lot to recalculate parity). However, decent-quality drives have a UBER spec of 1 in 10^16 (~1.2PB). Still, large RAID sets mean long rebuilds and higher odds of compound failures. I wouldn't run more than a 7+1 RAID 5. – Sobrique Jul 11 '14 at 15:54
  • @HopelessN00b You can safely use RAID5 with enterprise SAS disks... even 900GB and 1.2TB... but in groups of 8 or less. But here, the OP still hasn't answered exactly how much usable space is required. – ewwhite Jul 11 '14 at 15:55
  • @ewwhite We're trying to lose no more than about 10-15% of our available total disk space. So, our drives are ~1TB each x 24 = 24TB and 85-90% usability = 20.4-21.6TB of usable space. So, for round numbers, let's just say that for 24TB, we want at least 20TB of usable space (before formatting, OS use, etc.). – Steve Amerige Jul 11 '14 at 16:50
  • Is this data compressible at all? – ewwhite Jul 11 '14 at 16:53
  • No, not compressible--large chunks of binary data. – Steve Amerige Jul 11 '14 at 16:54
  • @HopelessN00b So, for 2 x 12-disk RAID 0 arrays, what stripe size would you pick, given that each array would total about 12TB of disk space? And the rest of the configuration would include: Read Policy: Adaptive Read Ahead, Write Policy: Write Back, Disk Cache Policy: Enabled. Thoughts? – Steve Amerige Jul 11 '14 at 17:12
  • @SteveAmerige Don't modify the stripe size from default on the controller. And definitely don't mess with the ext4 options. – ewwhite Jul 11 '14 at 17:13
  • 1
    In the end, I did the following configuration of the 24 physical drives: Disk Group 0, RAID 10 (4 drives): VD 0: BOOT 100GB; VD 1: ROOT 2134.5GB. Disk Group 1, RAID 0 (10 drives): VD 2: DATA1, 11172.5GB. Disk Group 2, RAID 0 (10 drives): VD 3: DATA2, 11172.5GB. It is possible that I might not have needed to have separate VD 0 BOOT and VD 1 ROOT Virtual Disks. I did it to ensure that the booting disk could do a standard (non-UEFI) boot. I used LVM later on so that I had / that exclusively used VD 0 and VD 1; and /data that used VD 2 and VD 3. Many thanks for all of the comments! – Steve Amerige Jul 14 '14 at 14:39
5

Do NOT use a single RAID 5 array across 24 x 1.2TB disks! I don't much care what you prefer to limit the answers to; it's a bad idea, and you should look at other options.

The odds of a disk failing go up with each disk. So does the time it takes to rebuild. When a drive fails, and you replace it, it will use as much IO across all the disks as possible to build the data for the new one. It's very likely that one of your 23 remaining good disks will fail during this process, forcing you to restore the server from backups. Which you say you don't care about...but are you willing to accept doing that once a month? Once a week? As the disks age, it very well could get that bad.
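To put a rough number on that risk, here's a hedged back-of-envelope sketch that only models unrecoverable read errors during a rebuild (it assumes the rebuild must read all ~27.6TB on the 23 surviving 1.2TB disks, and it ignores the separate risk of a second whole-disk failure):

    awk 'BEGIN {
      bits = 23 * 1.2e12 * 8;                 # bits that must be read to rebuild
      n = split("1e-14 1e-15", rate, " ");    # consumer-class vs. enterprise-class URE rates
      for (i = 1; i <= n; i++)
        printf "URE rate %s/bit -> P(>=1 read error during rebuild) ~ %.0f%%\n",
               rate[i], (1 - exp(-rate[i] * bits)) * 100
    }'

Even with the enterprise-class rate, that works out to roughly a 1-in-5 chance per rebuild before counting any additional disk failures.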

Besides, if you want performance, RAID5 is leading you in the wrong direction. In many cases, RAID5 has worse performance than other options, because it has to calculate parity for every write, and then write that to a drive as well. RAID5 wasn't designed for performance.
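As a hedged rule-of-thumb sketch of what that parity work costs for random writes (the per-disk IOPS figure is an assumption for 10K SAS, and a write-back cache will soften this for bursts):

    # Rough random-write IOPS ceiling for 24 spindles, divided by the
    # classic write penalty of each RAID level.
    awk 'BEGIN {
      disks = 24; iops = 140;                       # assumed random IOPS per 10K SAS disk
      printf "RAID 0  : ~%d write IOPS (penalty 1)\n", disks * iops / 1;
      printf "RAID 1+0: ~%d write IOPS (penalty 2)\n", disks * iops / 2;
      printf "RAID 5  : ~%d write IOPS (penalty 4)\n", disks * iops / 4;
      printf "RAID 6  : ~%d write IOPS (penalty 6)\n", disks * iops / 6;
    }'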

If you REALLY don't care about your data, go with RAID 0. But even then, create a few separate arrays, not one giant 24 disk RAID 0.

If you want performance and some integrity, use RAID10. You'll lose some disk space, but get quite a performance boost.

Or you can look at things like ZFS that are designed from the ground up to work with huge amounts of data on disks.

Grant
  • 1
    FYI, I'm a software developer who manages our division's servers as a side job. That means I have lots of gaps in my knowledge. But, I've been doing this reasonably successfully for a couple of years now. This is the first time I'm building this big a server, so your feedback is very much appreciated. I very much appreciate your comment about the RAID 5 limitations. What we want is performance, and total available disk space, but we're willing to make some consideration for RAID configurations that give us some data safety (our last priority). – Steve Amerige Jul 11 '14 at 15:19
1

Your options:

  • RAID 0: This turns all your disks into a single unit with no redundancy. This has the highest read and write performance and the most usable space of any of the options, but the loss of a single disk means the loss of all data.

  • RAID 1+0: This turns all your disks into a single unit with all data present on two disks. The read speed is about the same as RAID 0, the write speed is halved (since you need to write each piece of data twice), and you only have half as much space available. The loss of a single disk has no impact on data availability and minimal impact on read/write speeds.

  • RAID 5: This turns all your disks into a single unit, with parity information distributed across the disks (consuming one disk's worth of capacity). The read speed is slightly lower than RAID 0, the write speed is much slower, possibly slower than the write speed of a single non-RAID disk (each write requires a read-modify-write cycle on at least two disks), and you lose one disk's worth of space to parity. The loss of a single disk can cause a major reduction in read speed (reconstructing the data that was stored on it requires reading data from all the other disks), but has no impact on data availability.

  • RAID 6: This has essentially all the advantages and drawbacks of RAID 5, except that it stores a second, independently calculated parity block in addition to the first, and can handle the loss of two disks without data loss.

If data safety is truly irrelevant (this includes the time spent restoring the data from the original source, which may take days, and time lost re-doing interrupted calculations), I recommend RAID 0. Otherwise, if you have a workload that is almost exclusively reads and you want some reliability, I recommend RAID 6 (but note that performance will suffer when recovering from a failed disk). If you have a read-write workload, I recommend RAID 1+0.
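To put rough numbers on these trade-offs, here is a small sketch that assumes a single group spanning all 24 x 1.2TB disks and ignores the controller's 16-drive group limit mentioned elsewhere on this page:

    awk 'BEGIN {
      n = 24; d = 1.2;                                   # number of disks, size in TB
      printf "RAID 0  : %.1f TB usable, tolerates 0 failed disks\n", n * d;
      printf "RAID 1+0: %.1f TB usable, tolerates 1 per mirror pair\n", (n / 2) * d;
      printf "RAID 5  : %.1f TB usable, tolerates 1 failed disk\n",  (n - 1) * d;
      printf "RAID 6  : %.1f TB usable, tolerates 2 failed disks\n", (n - 2) * d;
    }'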

Depending on the precise nature of your workload (i.e., if a given task accesses a well-defined subset of your disk space), you may be able to set up multiple independent RAID arrays, so that the failure of one will not impact the others.

RAID 5 provides no benefits in your situation. It has a performance penalty (especially for writing) compared to RAID 0, and with the number of disks you have, it's virtually certain that a second disk will fail during recovery, giving no data safety benefit.

Mark
1

Okay, to address the one clear question here -- stripe size. A bigger stripe size is better, unless your RAID controller is dumb enough to always read/write a whole stripe of data as its minimum I/O unit.

Why? A small stripe size means that several disks get involved in any sizable I/O: the smaller the stripe, the better the chance that one logical I/O loads several disks. A big stripe means that just one disk (or a few) handles each I/O. That might look like a deficiency, since a single request gets no boost from multiple disks, but once your mostly random workload kicks in, the load ends up spread across all of the disks more or less evenly anyway.
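As a rough illustration of that trade-off (the logical I/O size and per-disk chunk sizes below are assumed values, and alignment is ignored):

    # Best case: how many disks a single 256KB logical I/O touches,
    # for a few per-disk stripe (chunk) sizes.
    awk 'BEGIN {
      io = 256;                                # logical I/O size in KB (assumed)
      n = split("64 256 1024", chunk, " ");    # per-disk chunk sizes in KB
      for (i = 1; i <= n; i++)
        printf "chunk %4d KB -> one %d KB I/O touches about %d disk(s)\n",
               chunk[i], io, int((io + chunk[i] - 1) / chunk[i])
    }'

With the small chunk, each request is spread over several spindles; with the big chunk, each request mostly stays on one spindle, and the parallelism comes from having many concurrent, mostly random requests instead.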

More theory behind this can be found here: http://www.vinumvm.org/vinum/Performance-issues.html

poige