This is a Canonical Question about RAID levels.
What are:
- the RAID levels typically used (including the RAID-Z family)?
- what deployments are they commonly found in?
- benefits and pitfalls of each?
RAID stands for Redundant Array of Independent Disks (some are taught "Inexpensive" to indicate that they are "normal" disks; historically there were internally redundant disks which were very expensive; since those are no longer available the acronym has adapted).
At the most general level, a RAID is a group of disks that act on the same reads and writes. SCSI IO is performed on a volume ("LUN"), and these are distributed to the underlying disks in a way that introduces a performance increase and/or a redundancy increase. The performance increase is a function of striping: data is spread across multiple disks to allow reads and writes to use all the disks' IO queues simultaneously. Redundancy is a function of mirroring. Entire disks can be kept as copies, or individual stripes can be written multiple times. Alternatively, in some types of raid, instead of copying data bit for bit, redundancy is gained by creating special stripes that contain parity information, which can be used to recreate any lost data in the event of a hardware failure.
There are several configurations that provide different levels of these benefits, which are covered here; each one is biased toward performance or toward redundancy.
An important aspect in evaluating which RAID level will work for you is weighing its advantages against its hardware requirements (e.g. the number of drives).
Another important aspect of most of these types of RAID (0, 1, 5) is that they do not ensure the integrity of your data, because they are abstracted away from the actual data being stored. So RAID does not protect against corrupted files. If a file is corrupted by any means, the corruption will be mirrored or parity-encoded and committed to disk regardless. However, RAID-Z does claim to provide file-level integrity of your data.
There are two layers at which RAID can be implemented on direct attached storage: hardware and software. In true hardware RAID solutions, there is a dedicated hardware controller with a processor dedicated to RAID calculations and processing. It also typically has a battery-backed cache module so that data can be written to disk, even after a power failure. This helps to eliminate inconsistencies when systems are not shut down cleanly. Generally speaking, good hardware controllers are better performers than their software counterparts, but they also have a substantial cost and increase complexity.
Software RAID typically does not require a controller, since it doesn't use a dedicated RAID processor or a separate cache. Typically these operations are handled directly by the CPU. In modern systems, these calculations consume minimal resources, though some marginal latency is incurred. RAID is handled by either the OS directly, or by a faux controller in the case of FakeRAID.
Generally speaking, if someone is going to choose software RAID, they should avoid FakeRAID and use the OS-native package for their system such as Dynamic Disks in Windows, mdadm/LVM in Linux, or ZFS in Solaris, FreeBSD, and other related distributions. FakeRAID uses a combination of hardware and software which results in the initial appearance of hardware RAID but the actual performance of software RAID. Additionally, it is often extremely difficult to move the array to another adapter (should the original fail).
The other place RAID is common is on centralized storage devices, usually called a SAN (Storage Area Network) or a NAS (Network Attached Storage). These devices manage their own storage and allow attached servers to access the storage in various fashions. Since multiple workloads are contained on the same few disks, having a high level of redundancy is generally desirable.
The main difference between a NAS and a SAN is block vs. file system level exports. A SAN exports a whole "block device" such as a partition or logical volume (including those built on top of a RAID array). Examples of SANs include Fibre Channel and iSCSI. A NAS exports a "file system" such as a file or folder. Examples of NASs include CIFS/SMB (Windows file sharing) and NFS.
RAID0 (aka Striping) is sometimes referred to as "the amount of data you will have left when a drive fails". It really runs against the grain of "RAID", where the "R" stands for "Redundant".
RAID0 takes your block of data, splits it up into as many pieces as you have disks (2 disks → 2 pieces, 3 disks → 3 pieces) and then writes each piece of the data to a separate disk.
This means that a single disk failure destroys the entire array (because you have Part 1 and Part 2, but no Part 3), but it provides very fast disk access.
It is not often used in production environments, but it can be used where you have strictly temporary data that can be lost without repercussions. It is somewhat commonly used for caching devices (such as a ZFS L2ARC device).
The total usable disk space is the sum of all the disks in the array (e.g. 3x 1TB disks = 3TB of space).
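As a quick illustration, here is a toy sketch (Python, with a made-up 4-byte chunk size) of how RAID 0 round-robins data across disks, and why losing one disk loses everything:

```python
def stripe(data: bytes, n_disks: int, chunk: int = 4):
    """Round-robin `chunk`-byte pieces of `data` across `n_disks` (RAID 0)."""
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, len(data), chunk):
        disks[(i // chunk) % n_disks].extend(data[i:i + chunk])
    return [bytes(d) for d in disks]

disks = stripe(b"ABCDEFGHIJKL", 3)
# Each disk holds only a third of the bytes; without all three pieces,
# the original stream cannot be reassembled.
```

Reads and writes can hit all the disks' queues at once, which is where the speed comes from, but there is no copy of anything anywhere.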
RAID 1 (aka Mirroring) takes your data and duplicates it identically on two or more disks (although typically only 2). If more than two disks are used, the same information is stored on each disk (they're all identical). It is the only way to ensure data redundancy when you have fewer than three disks.
RAID 1 sometimes improves read performance. Some implementations of RAID 1 will read from both disks to double the read speed. Some will only read from one of the disks, which does not provide any additional speed advantages. Others will read the same data from both disks, ensuring the array's integrity on every read, but this will result in the same read speed as a single disk.
It is typically used in small servers that have very little disk expansion, such as 1RU servers that may only have space for two disks or in workstations that require redundancy. Because of its high overhead of "lost" space, it can be cost prohibitive with small-capacity, high-speed (and high-cost) drives, as you need to spend twice as much money to get the same level of usable storage.
The total usable disk space is the size of the smallest disk in the array (e.g. 2x 1TB disks = 1TB of space).
The 1E RAID level is similar to RAID 1 in that data is always written to (at least) two disks. But unlike RAID 1, it allows for an odd number of disks by simply interleaving the data blocks among several disks.
Performance characteristics are similar to RAID 1, while fault tolerance is similar to RAID 10. This scheme can be extended to odd disk counts greater than three (possibly called RAID 10E, though rarely).
RAID 10 is a combination of RAID 1 and RAID 0, and the order of the 1 and the 0 is very important. With 8 disks, it creates 4 two-disk RAID 1 arrays and then stripes a RAID 0 across those 4 arrays. It requires at least 4 disks, and additional disks have to be added in pairs.
This means that one disk from each pair can fail. So if you have sets A, B, C and D with disks A1, A2, B1, B2, C1, C2, D1, D2, you can lose one disk from each set (A,B,C or D) and still have a functioning array.
However, if you lose both disks from the same set, the array is totally lost. So you can lose up to 50% of the disks, but that is not guaranteed; it depends on which disks fail.
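To make the failure rule concrete, here is a small sketch (Python; the disk numbering is my own, pairing disks 0-1, 2-3, and so on) that checks whether a given set of failed disks still leaves a RAID 10 array functional:

```python
def raid10_survives(n_pairs, failed):
    """RAID 10 survives as long as no mirror pair loses both of its disks."""
    failed = set(failed)
    pairs = [{2 * i, 2 * i + 1} for i in range(n_pairs)]
    return all(len(pair & failed) < 2 for pair in pairs)

# 8 disks = 4 mirror pairs: (0,1), (2,3), (4,5), (6,7)
still_up = raid10_survives(4, [0, 2, 4, 6])   # one disk from each pair
gone = raid10_survives(4, [2, 3])             # both halves of one pair
```

Note that losing half the disks (one per pair) is survivable, while losing just two can be fatal if they happen to mirror each other.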
You are guaranteed high speed and high availability in RAID 10.
RAID 10 is a very common RAID level, especially with high capacity drives where a single disk failure makes a second disk failure more likely before the RAID array is rebuilt. During recovery, the performance degradation is much lower than its RAID 5 counterpart as it only has to read from one drive to reconstruct the data.
The available disk space is 50% of the total raw capacity (e.g. 8x 1TB drives = 4TB of usable space). If you use different sizes, only the smallest size will be used from each disk.
It is worth noting that the Linux kernel's software RAID driver, md, allows for RAID 10 configurations with an odd number of drives, e.g. a 3- or 5-disk RAID 10.
RAID 01 (RAID 0+1) is the reverse of RAID 10. It creates two RAID 0 arrays and then puts a RAID 1 over the top. A single disk failure takes its entire RAID 0 set offline, so the array survives only while one complete set (A1, A2, A3, A4 or B1, B2, B3, B4) remains intact; lose one disk from each set and everything is gone. It's very rare to see in commercial applications, but is possible to do with software RAID.
To be absolutely clear:
It provides no additional speed over RAID 10, but substantially less redundancy, and should be avoided at all costs.
RAID 5 has been the most commonly-used RAID level for decades. It provides the system performance of all the drives in the array (except for small random writes, which incur a slight overhead). It uses a simple XOR operation to calculate parity. Upon single drive failure, the information can be reconstructed from the remaining drives using the XOR operation on the known data.
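The XOR trick can be shown in a few lines. This sketch (Python, with toy 4-byte strips) computes a parity strip and then reconstructs a lost strip from the survivors; real controllers do the same thing per stripe:

```python
from functools import reduce

def xor_strips(strips):
    """Byte-wise XOR of equal-length strips; XOR is its own inverse."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*strips))

data = [b"AAAA", b"BBBB", b"CCCC"]   # three data strips in one stripe
parity = xor_strips(data)

# Disk 1 fails: XOR the surviving strips with the parity strip to rebuild it.
rebuilt = xor_strips([data[0], data[2], parity])
```

Because parity is just XOR, reconstruction works no matter which single strip is lost, including the parity strip itself.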
Unfortunately, in the event of a drive failure, the rebuilding process is very IO-intensive. The larger the drives in the RAID, the longer the rebuild will take, and the higher the chance for a second drive failure. Since large slow drives both have a lot more data to rebuild and a lot less performance to do it with, it is not usually recommended to use RAID 5 with anything 7200 RPM or lower.
Perhaps the most critical issue with RAID 5 arrays, when used in consumer applications, is that they are almost guaranteed to fail when the total capacity exceeds 12TB. This is because the unrecoverable read error (URE) rate of SATA consumer drives is one per every 10^14 bits, or ~12.5TB.
If we take an example of a RAID 5 array with seven 2 TB drives: when a drive fails there are six drives left. In order to rebuild the array the controller needs to read through six drives at 2 TB each. Given the URE rate above, it is almost certain another URE will occur before the rebuild has finished. Once that happens, the array and all data on it are lost.
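A back-of-the-envelope version of that calculation (Python; this naive model assumes independent bit errors at the quoted 1-in-10^14 rate, which is a simplification):

```python
def p_ure_during_rebuild(bits_read, ure_rate=1e-14):
    """Chance of hitting at least one unrecoverable read error."""
    return 1 - (1 - ure_rate) ** bits_read

# Rebuilding the seven-drive example reads six surviving 2 TB drives end to end.
bits = 6 * 2 * 10**12 * 8
risk = p_ure_during_rebuild(bits)
print(f"{risk:.0%} chance the rebuild hits a URE")
```

Under this simple model the rebuild fails more often than not; real URE behaviour is burstier than independent bit flips, so treat this as an order-of-magnitude estimate rather than a precise figure.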
However, the URE/data loss/array failure issue with RAID 5 on consumer drives has been somewhat mitigated by the fact that most hard disk manufacturers have increased their newer drives' URE ratings to one in 10^15 bits. As always, check the specification sheet before buying!
It is also imperative that RAID 5 be put behind a reliable (battery-backed) write cache. This avoids the overhead for small writes, as well as flaky behaviour that can occur upon a failure in the middle of a write.
RAID 5 is the most cost-effective way of adding redundant storage to an array, as it costs only one disk's worth of capacity (e.g. 12x 146GB disks = 1606GB of usable space). It requires a minimum of 3 disks.
RAID 6 is similar to RAID 5 but uses two disks' worth of parity instead of just one (the first is a simple XOR; the second is a more complex Galois-field code), so you can lose two disks from the array with no data loss. The write penalty is higher than RAID 5 and you have one less disk of space.
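For the curious, here is a toy, one-byte-per-disk sketch of dual parity (Python). It uses GF(2^8) arithmetic of the kind commonly used for the second "Q" parity; the table setup and recovery algebra follow the standard construction, but this is an illustration, not a real implementation:

```python
from functools import reduce

# GF(2^8) log/antilog tables (generator 2, reduction polynomial 0x11d).
EXP, LOG = [0] * 512, [0] * 256
v = 1
for i in range(255):
    EXP[i], LOG[v] = v, i
    v <<= 1
    if v & 0x100:
        v ^= 0x11d
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    return 0 if 0 in (a, b) else EXP[LOG[a] + LOG[b]]

def gf_div(a, b):
    return 0 if a == 0 else EXP[LOG[a] - LOG[b] + 255]

def pq(data):
    """P is the plain XOR; Q weights disk i's byte by g**i before XORing."""
    p = reduce(lambda a, b: a ^ b, data)
    q = reduce(lambda a, b: a ^ b, (gf_mul(EXP[i], d) for i, d in enumerate(data)))
    return p, q

def recover_two(data, x, y, p, q):
    """Rebuild the bytes at failed positions x and y (those entries are never read)."""
    survivors = [(i, d) for i, d in enumerate(data) if i not in (x, y)]
    a = reduce(lambda s, t: s ^ t, (d for _, d in survivors), p)
    b = reduce(lambda s, t: s ^ t, (gf_mul(EXP[i], d) for i, d in survivors), q)
    dx = gf_div(b ^ gf_mul(EXP[y], a), EXP[x] ^ EXP[y])
    return dx, a ^ dx

data = [0x41, 0x42, 0x43, 0x44]   # one byte from each of four data disks
p, q = pq(data)
```

The point of the second weighted parity is that any two failed disks give two independent equations, which is why RAID 6 survives double failures where RAID 5 cannot.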
It is worth considering that eventually a RAID 6 array will encounter similar problems as a RAID 5. Larger drives cause larger rebuild times and more latent errors, eventually leading to a failure of the entire array and loss of all data before a rebuild has completed.
RAID 50 is a nested level, much like RAID 10. It combines two or more RAID 5 arrays and stripes data across them in a RAID 0. This offers both performance and multiple-disk redundancy, as long as no single RAID 5 group loses more than one disk.
In a RAID 50, usable capacity is n - x disks, where x is the number of RAID 5 groups being striped across. The smallest possible RAID 50 is six disks: with 6x 1TB disks arranged as two RAID 5 groups striped together, you would have 4TB of usable storage.
RAID 6 is to RAID 60 as RAID 5 is to RAID 50. Essentially, you have more than one RAID 6 that data is then striped across in a RAID 0. This setup allows for up to two members of any individual RAID 6 in the set to fail without data loss. Rebuild times for RAID 60 arrays can be substantial, so it's usually a good idea to have one hot-spare for each RAID 6 member in the array.
In a RAID 60, usable capacity is n - 2x disks, where x is the number of RAID 6 groups being striped across. The smallest possible RAID 60 is eight disks: with 8x 1TB disks arranged as two RAID 6 groups striped together, you would have 4TB of usable storage. That is the same usable capacity a RAID 10 would give on an 8-disk array; RAID 60 is slightly more redundant, but its rebuild times are substantially longer. Generally, you want to consider RAID 60 only if you have a large number of disks.
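The capacity rules for the parity levels can be written down directly (Python; `groups` is my name for the number of RAID 5/6 groups being striped across):

```python
def usable_tb(n_disks, disk_tb, level, groups=1):
    """Usable capacity: each RAID 5 group donates 1 disk to parity, each RAID 6 group 2."""
    parity_per_group = {"5": 1, "50": 1, "6": 2, "60": 2}[level]
    n_groups = groups if level in ("50", "60") else 1
    return (n_disks - parity_per_group * n_groups) * disk_tb

# The examples above: 6x 1TB in a two-group RAID 50, 8x 1TB in a two-group RAID 60.
raid50 = usable_tb(6, 1, "50", groups=2)   # 4 TB
raid60 = usable_tb(8, 1, "60", groups=2)   # 4 TB
```

The same function covers plain RAID 5 and 6, where only one group exists.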
RAID-Z is a bit complicated to explain since ZFS radically changes how storage and file systems interact. ZFS encompasses the traditional roles of volume management (RAID is a function of a volume manager) and file system. Because of this, ZFS can do RAID at the file's storage-block level rather than at the volume's stripe level. This is exactly what RAID-Z does: it writes the file's storage blocks across multiple physical drives, including a parity block for each set of stripes.
An example may make this much clearer. Say you have 3 disks in a ZFS RAID-Z pool and the block size is 4KB. Now you write a file that is exactly 16KB. ZFS will split that into four 4KB blocks (as would a normal operating system); then it will calculate two blocks of parity. Those six blocks will be placed on the drives similarly to how RAID 5 would distribute data and parity. This is an improvement over RAID 5 in that there is no reading of existing data stripes to calculate the parity.
Another example builds on the previous. Say the file is only 4KB. ZFS will still have to build one parity block, but now the write load is reduced to 2 blocks. The third drive is free to service any other concurrent requests. A similar effect will be seen anytime the file being written is not a multiple of the pool's block size multiplied by the number of drives less one (i.e. [File Size] != [Block Size] * [Drives - 1]).
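Those two examples follow a simple rule that can be sketched as (Python; this assumes single-parity RAID-Z with a fixed 4KB block size and ignores ZFS's dynamic record sizing and padding):

```python
import math

def raidz_blocks(file_size, block_size=4096, n_drives=3, parity=1):
    """Return (data_blocks, parity_blocks) written for one file."""
    data = math.ceil(file_size / block_size)
    # One parity block per stripe; a stripe holds at most n_drives - parity data blocks.
    stripes = math.ceil(data / (n_drives - parity))
    return data, stripes * parity

sixteen_kb = raidz_blocks(16 * 1024)   # the 16KB example: 4 data + 2 parity
four_kb = raidz_blocks(4 * 1024)       # the 4KB example: 1 data + 1 parity
```

Because the stripe width follows the file rather than a fixed volume geometry, small writes never trigger the RAID 5 read-modify-write cycle.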
ZFS handling both Volume Management and File System also means you don't have to worry about aligning partitions or stripe-block sizes. ZFS handles all that automatically with the recommended configurations.
The nature of ZFS counteracts some of the classic RAID-5/6 caveats. All writes in ZFS are done in a copy-on-write fashion; all changed blocks in a write operation are written to a new location on disk, instead of overwriting the existing blocks. If a write fails for any reason, or the system fails mid-write, the write transaction either occurs completely after system recovery (with the help of the ZFS intent log) or does not occur at all, avoiding potential data corruption. Another issue with RAID-5/6 is potential data loss or silent data corruption during rebuilds; regular zpool scrub operations can help to catch data corruption or drive issues before they cause data loss, and checksumming of all data blocks will ensure that any corruption during a rebuild is caught.
The main disadvantage of RAID-Z is that it is still software RAID (and suffers from the same minor latency incurred by the CPU calculating the write load instead of a hardware HBA offloading it). This may be resolved in the future by HBAs that support ZFS hardware acceleration.
Because there's no central authority enforcing any sort of standard functionality, the various RAID levels have evolved and been standardized by prevalent use. Many vendors have produced products which deviate from the above descriptions. It's also quite common for them to invent some fancy new marketing terminology to describe one of the above concepts (this happens most frequently in the SOHO market). When possible, try to get the vendor to actually describe the functionality of the redundancy mechanism (most will volunteer this information, as there's really no secret sauce anymore).
Worth mentioning, there are RAID 5-like implementations which allow you to start an array with only two disks. It would store data on one stripe and parity on the other, similar to RAID 5 above. This would perform like RAID 1 with the extra overhead of the parity calculation. The advantage is that you could add disks to the array by recalculating the parity.
Also RAID ONE MILLION!!!!
128 Disks so reads would be fast, horrible writes but very reliable I'd imagine, oh and you'd get 1/128th the available space, so not great from a budgetary perspective. Don't do this with flash drives, I tried and set fire to the atmosphere...