8

In my journey to understanding the advantages of RAIDZ, I came across the concept of the write hole.

As this page explains, a write hole is the inconsistency you get among the disks of the array when power is lost during a write. That page also explains that it affects both RAID-5/6 (if power is lost after the data has been written but before the parity has been calculated) and RAID-1 (data is written to one disk but not the others), and that it is an insidious problem that can only be detected during either a resync/scrub or (disastrously) during the reconstruction of one of the disks... However, most other sources talk about it as if it only affected parity-based RAID levels.

From what I understand, I think this could be a problem for RAID-1 too, as reads from the disks containing the hole would return garbage. So: is it a problem for every RAID level or not? Is it implementation-dependent? Does it affect software RAID only, or hardware controllers as well? (Extra: how does mdadm fare in this regard?)

Mario Vitale
  • 306
  • 3
  • 6
  • 1
    ZFS claims to have eliminated the write hole. Here's an interesting blog post from the debut of ZFS: https://blogs.oracle.com/bonwick/entry/raid_z – Andrew Henle Apr 17 '17 at 09:28

4 Answers

7

The term write hole is used to describe two similar, but different, problems that arise when dealing with non-battery-protected RAID arrays:

  • sometimes it is improperly defined as any corruption in a RAID array due to a sudden power loss. With this (erroneous) definition, RAID1 is vulnerable to the write hole because you cannot atomically write to two different disks;

  • the proper definition of the write hole, which is the loss of an entire stripe's data redundancy due to a sudden power loss during a stripe update, applies only to parity-based RAID.

The second, and correct, definition of the write hole needs some more explanation: let's assume a 3-disk RAID5 with a 64K chunk size and a 128K stripe size (plus 64K of parity per stripe). If power is lost after writing 4K to disk #1 but during the parity update on disk #3, we end up with a bogus (i.e. corrupted) parity chunk and an undetected data consistency issue. If, later, disk #2 dies and parity is used to recover the original data by XORing disk #1 and disk #3, the reconstructed 64K chunk, which originally resided on disk #2 and was not recently written, will nonetheless be corrupted.

This is a contrived example, but it should expose the main problem related to the write hole: the loss of untouched, at-rest, unrelated data that shares a stripe with the latest, interrupted writes. In other words, if fileA was written years ago but shares a stripe with the just-written fileB and the system loses power during the fileB update, fileA is at risk.
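
To make the arithmetic concrete, here is a minimal Python sketch of the scenario above (illustration only; the chunk contents and the xor() helper are made up for the example, this is not real RAID code):

```python
# Sketch of the 3-disk RAID5 example: disk1 and disk2 hold 64K data chunks,
# disk3 holds their XOR parity.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

CHUNK = 64 * 1024

disk1 = bytes([0xAA]) * CHUNK        # chunk being partially rewritten
disk2 = bytes([0xBB]) * CHUNK        # at-rest data ("fileA"), untouched
disk3 = xor(disk1, disk2)            # parity, consistent so far

# Interrupted update: 4K of new data reaches disk1, but power is lost
# before the matching parity update reaches disk3.
disk1 = bytes([0xCC]) * 4096 + disk1[4096:]
# disk3 = xor(disk1, disk2)          # <-- never happens: parity is now stale

# Later, disk #2 dies and is rebuilt by XORing the surviving members:
rebuilt_disk2 = xor(disk1, disk3)

# The first 4K of the rebuilt chunk no longer matches the at-rest data,
# even though that data was never touched by the interrupted write.
print(rebuilt_disk2[:4096] == disk2[:4096])   # False
print(rebuilt_disk2[4096:] == disk2[4096:])   # True
```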

Another thing to consider is the write policy of the array: read/reconstruct/write (i.e. the entire stripe is rewritten when a partial write happens) versus read/modify/write (i.e. only the affected chunk and parity are updated) exposes the array to different kinds of write hole.
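
For reference, the two parity-update policies can be sketched roughly like this (illustration only; the function names are invented for the example):

```python
# Rough sketch of the two parity-update policies on the same toy 3-disk
# RAID5 layout. Either way, the data and parity writes land on different
# disks and cannot be made atomic, which is where the hole can open.

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def read_modify_write(old_chunk: bytes, old_parity: bytes, new_chunk: bytes):
    # Reads only the chunk being changed plus the old parity, then
    # "removes" the old data from the parity and "adds" the new data.
    new_parity = xor(xor(old_parity, old_chunk), new_chunk)
    return new_chunk, new_parity          # writes: one chunk + parity

def read_reconstruct_write(other_chunks: list[bytes], new_chunk: bytes):
    # Reads every other data chunk of the stripe and recomputes the
    # parity from scratch; the whole stripe is then rewritten.
    parity = new_chunk
    for chunk in other_chunks:
        parity = xor(parity, chunk)
    return new_chunk, parity              # writes: the entire stripe
```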

From the above, it should be clear why RAID0 and RAID1 do not suffer from a proper write hole: they have no parity that can go "out of sync" and invalidate an entire stripe. Please note that RAID1 mirror legs can be out of sync after an unclean shutdown, but the only corruption will affect the latest written data. Previously written data (i.e. data at rest) will not face any trouble.

Having defined and scoped the proper write hole, how can it be avoided?

  • HW RAID uses a non-volatile write cache (i.e. BBU-protected DRAM or a capacitor-backed flash module) to persistently store the to-be-written updates. If power is lost, the HW RAID card will re-issue any pending operation, flushing its cache to the disk platters, when power is restored and the system boots up. This protects not only from the proper write hole, but also from last-written data corruption;

  • Linux MD RAID uses a write-intent bitmap which records the to-be-written stripes before updating them (see the sketch after this list). If power is lost, the dirty bitmap is used to recalculate the parity data for the affected stripes. This protects from the real write hole only; the latest written data can still be corrupted (unless backed by fsync() and write barriers). The same method is used to re-sync the out-of-sync portions of a RAID1 array (to be sure the two mirror legs are in sync, albeit no write hole exists for mirrors);

  • newer Linux MD RAID5/6 should have the option to use a logging/journal device, partly simulating the non-volatile writeback cache of a proper HW RAID card (and, depending on the specific patch/implementation, protecting from both the write hole and last-written data corruption, or from the write hole only);

  • finally, RAIDZ avoids both the write hole and last-written data corruption using the most "elegant", but performance-impacting, method: writing only full-sized stripes (and journaling any synchronized write in the ZIL/SLOG).
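
As an illustration of the write-intent bitmap from the second point, here is a simplified sketch (not the actual Linux MD implementation; the callbacks are hypothetical stand-ins for the real I/O paths):

```python
# Simplified sketch of a write-intent bitmap; write_data, write_parity and
# recompute_parity are hypothetical callbacks, not real MD interfaces.

dirty_bitmap: set[int] = set()            # persisted with the array metadata

def write_stripe(stripe_no: int, write_data, write_parity) -> None:
    dirty_bitmap.add(stripe_no)           # 1. record the intent (and persist it)
    write_data(stripe_no)                 # 2. update the data chunk(s)
    write_parity(stripe_no)               # 3. update the parity chunk
    dirty_bitmap.discard(stripe_no)       # 4. clear the bit (flushed lazily)

def resync_after_unclean_shutdown(recompute_parity) -> None:
    # Only stripes still marked dirty can hold stale parity; everything
    # else is skipped, which keeps the resync short.
    for stripe_no in list(dirty_bitmap):
        recompute_parity(stripe_no)
        dirty_bitmap.discard(stripe_no)
```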

Useful links:
https://neil.brown.name/blog/20110614101708
https://www.kernel.org/doc/Documentation/md/raid5-ppl.txt
https://www.kernel.org/doc/Documentation/md/raid5-cache.txt
https://lwn.net/Articles/665299/

shodanshok
  • 44,038
  • 6
  • 98
  • 162
3

This is why a cache battery or some other method of cache consistency validation is required for RAID. All RAID cards should have battery-backed cache, and all storage controllers should have mirrored cache. For software RAID, I don't think there is a good answer. I think even RAID-Z can fail on a power loss.

Basil
  • 8,811
  • 3
  • 37
  • 73
  • 2
    The question is tagged software-raid. You cannot implement a battery in software. The problem can be fixed in software without needing any battery, but I don't know if it has been done. – kasperd Apr 16 '17 at 21:43
  • Thank you for answering, but as @kasperd noted I'm interested in the software-RAID solutions. Do note, though, that RAID-Z was explicitly designed to eliminate the write hole problem. – Mario Vitale Apr 17 '17 at 08:01
2

I think there are 2 possible definitions of what a RAID array "write hole" is.

The page you mention takes "write hole" to mean RAID array inconsistency. To understand this, you should take into consideration how a RAID array works. The write operations are sent to the different discs of the array. But as the discs are independent, there is no guarantee about the order in which the write operations are really committed (by the discs) to physical media. In other words, when you write blocks to a RAID array, the write operations are not atomic. This is not a problem during the normal operation of the array. But it could be in cases like power-loss events or any other critical failure.

Internal inconsistency of a RAID array can happen in every RAID level that has some sort of data redundancy: RAID 1, 4, 5, 6, etc. RAID 0 is not subject to inconsistency issues, as there is no redundant data that needs to be synchronized among the different discs of the array.

There are several possible strategies to deal with RAID array inconsistency issues:

  • Linux MD software RAID uses, by default, a "sync" strategy when assembling a RAID array that is marked as "dirty". I.e., for RAID 1 arrays, one of the discs is taken as the master and its data is copied to the other discs. For RAID 4/5/6, the data blocks are read, then the parity blocks are regenerated and written to the discs. The sync process can be very lengthy. In order to make it much faster, there is a feature called the write-intent "bitmap", which keeps track of the hot chunks of the array. This bitmap feature significantly reduces the duration of the sync process, in exchange for some performance loss during write operations.

  • Hardware RAID arrays with battery-backed memory use a 2-step strategy (sketched after this list). First, the data blocks to be written are committed to the memory, which acts as a journal. After this step, the data blocks are sent to the discs. In case of a power-loss event or any other failure, the RAID controller will check that all the data blocks in the memory are really committed to the discs.

  • There is also a CoW (Copy on Write) strategy, which I will explain a bit later.
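
As a rough illustration of the 2-step journal strategy from the second point (toy code, not controller firmware; write_to_disc is a hypothetical callback):

```python
# Toy sketch of the "journal first, then discs" strategy; in real hardware
# the journal lives in battery- or flash-backed memory, not a Python list.

journal: list[tuple[int, bytes]] = []     # stands in for the non-volatile cache

def write_block(block_no: int, data: bytes, write_to_disc) -> None:
    journal.append((block_no, data))      # step 1: persist the intent
    write_to_disc(block_no, data)         # step 2: commit to the actual disc
    journal.remove((block_no, data))      # done: drop the journal entry

def replay_after_failure(write_to_disc) -> None:
    # Anything still in the journal may not have reached the discs;
    # re-writing a block is idempotent, so replaying everything is safe.
    for block_no, data in list(journal):
        write_to_disc(block_no, data)
        journal.remove((block_no, data))
```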


The other possible definition of "write hole" refers to data loss issues in RAID 4/5/6 under certain circumstances (RAID levels 1 and 10 are not subject to this kind of "write hole"). I'm quoting Neil Brown's definition of the problem in question:

"The write hole is a simple concept that applies to any stripe+parity RAID layout like RAID4, RAID5, RAID6 etc. The problem occurs when the array is started from an unclean shutdown without all devices being available, or if a read error is found before parity is restored after the unclean shutdown."

I.e., you have, for example, a RAID 5 array and there is a power-loss event. The RAID will try to bring the array to a consistent state. But one of the discs doesn't work any more, or some of its sectors cannot be read. Therefore, the parity cannot be regenerated from the data blocks, as some of them are missing. You could say: yes, but we have redundancy in the array, so we could use the parity to regenerate the missing data blocks, no? The answer is no. If you do this, you could potentially get garbage data in some data blocks. This is a very serious issue. It's not that some data blocks were written or not (modern journaled filesystems don't have any real issue with this). It's that some data blocks of the array are lost or (if regenerated) they are garbage. Either way, there is a serious issue here.

If we take this stricter definition of "write hole", we see that it is a special corner case that only happens under certain circumstances: there must be a critical failure like a power-loss event and, additionally, some disc has to fail (either completely or partially). But for RAID 4/5/6 (the levels with parity blocks), the risk is there.

This risk can be prevented by using the 2-step write strategy (or write-with-journal technique) that was previously explained. With the help of the journal, all data blocks can be safely written to the discs, even in those corner cases. Hardware RAID with battery-backed memory, if well implemented, is not subject to any "write hole" issues. Linux MD software RAID also got a write-with-journal feature some years ago, which effectively prevents the "write hole" issue.

I'm not so familiar with ZFS, but I think it uses a CoW (Copy on Write) technique in RAID-Z arrays to avoid any "write hole" issues. It would write all the data plus parity to some unused space, and then it would update the virtual reference to these physical blocks. By using this 2-step process, the write operations are guaranteed to be atomic, so the write hole issue is effectively prevented.
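
A very rough sketch of that 2-step CoW idea (not how ZFS is actually implemented; the names are invented for the illustration):

```python
# Very rough sketch of the CoW idea described above, not real ZFS code.
# New data plus parity always land in unused space; only afterwards is the
# single reference that readers follow switched to the new location.

storage: dict[int, bytes] = {}            # invented stand-in for the pool
stripe_pointer = {"live": None}           # the one reference readers follow

def cow_write_stripe(new_location: int, data_plus_parity: bytes) -> None:
    storage[new_location] = data_plus_parity   # step 1: full stripe, elsewhere
    stripe_pointer["live"] = new_location      # step 2: atomic pointer switch

# Lose power during step 1: the pointer still names the old, intact stripe.
# Lose power after step 2: the new stripe is already complete.
# At no point is a half-written stripe the live copy.
```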

  • Your definition of ZFS is spot on, I think I watched that in a ZFS presentation video on YouTube. Thanks for the definition(s) of a write hole – nwgat Jul 19 '22 at 20:41
2

The write hole can affect every RAID level but RAID-0; both parity-based (RAID-4/5/6) and mirrored (RAID-1) configurations may be vulnerable, simply because atomic writes across 2 or more disks are impossible.

I say "may" because the problem is implementation-dependent. Leaving aside next-gen filesystem solutions such as RAID-Z, also classic software-RAID implementations have found ways to tackle this: mdadm has relatively recently introduced a journal feature that uses dedicated cache disks to avoid it, and even if you choose not to use this feature, it also forces a resync after every unclean shutdown, thus catching and resolving the write-hole as soon as it happens.

Thanks to the #zfs irc channel for the help!

Mario Vitale
  • 306
  • 3
  • 6