16

Wikipedia says "RAID 2 is the only standard RAID level, other than some implementations of RAID 6, which can automatically recover accurate data from single-bit corruption in data."

Does anyone know whether the Linux mdadm implementation of RAID 6 is one of those that can automatically detect and recover from single-bit data corruption? This pertains to CentOS / Red Hat 6, in case they differ from other versions. I searched online but didn't have much luck.

With SATA error rates being 1 in 1E14 bits, and a 2TB SATA disk containing 1.6E13 bits, this is especially relevant to preventing data corruption.
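To put rough numbers on that, here is a minimal sketch of the arithmetic. It assumes the quoted 1-in-1E14 figure can be treated as an independent per-bit probability and that the disk holds 2 decimal terabytes; both are simplifications, and the function names are only for illustration.

```python
import math

DISK_BYTES = 2 * 10**12   # 2 TB, counted in decimal as drive vendors do
BITS_PER_BYTE = 8
ERROR_RATE = 1e-14        # assumed: one bad bit per 1E14 bits read


def expected_errors(bits_read, rate=ERROR_RATE):
    """Expected number of bad bits when reading `bits_read` bits."""
    return bits_read * rate


def prob_at_least_one(bits_read, rate=ERROR_RATE):
    """P(at least one bad bit) = 1 - (1 - rate)**bits_read, computed stably."""
    return -math.expm1(bits_read * math.log1p(-rate))


disk_bits = DISK_BYTES * BITS_PER_BYTE   # 1.6E13 bits, as in the question
print(expected_errors(disk_bits))
print(prob_at_least_one(disk_bits))
```

Under those assumptions, one full read of the disk gives about 0.16 expected errors, or roughly a 15% chance of hitting at least one.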

EDIT 17-Jun-2015

I believe this is less of a concern than I originally thought; see "Hard disk / SSDs - detection and handling of errors - is silent data corruption reliably prevented?" for more details.

sa289

4 Answers

16

Linux software RAID is not going to protect you from bit corruption, and silent data corruption is a well-known issue with it. In fact, if the kernel is able to read the data from one disk, it will never know that the data is bad. The RAID only kicks in if there is an I/O error when reading the data.

If you are worried about data integrity, you should consider using a file system like Btrfs or ZFS that ensures data integrity by storing and verifying checksums. These file systems also take care of the RAID functionality, so you don't need the kernel software RAID if you go that way.
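As a rough illustration of why stored checksums help, here is a toy sketch of a mirrored block store that verifies a checksum on every read and heals a bad copy from a good one. It is not how Btrfs or ZFS are implemented; the class name is hypothetical and `zlib.crc32` is only a stand-in for whatever checksum a real file system uses.

```python
import zlib


class MirroredChecksumStore:
    """Toy model: each block is kept on every mirror, plus one stored CRC32."""

    def __init__(self, copies=2):
        self.mirrors = [dict() for _ in range(copies)]
        self.sums = {}

    def write(self, block_no, data):
        self.sums[block_no] = zlib.crc32(data)
        for mirror in self.mirrors:
            mirror[block_no] = bytearray(data)

    def read(self, block_no):
        want = self.sums[block_no]
        for mirror in self.mirrors:
            data = bytes(mirror[block_no])
            if zlib.crc32(data) == want:
                # Found a copy that verifies: rewrite any copy that does not.
                for other in self.mirrors:
                    if zlib.crc32(bytes(other[block_no])) != want:
                        other[block_no] = bytearray(data)
                return data
        raise IOError(f"block {block_no}: no copy matches its checksum")


store = MirroredChecksumStore()
store.write(0, b"hello world")
store.mirrors[0][0][3] ^= 0x01   # silently flip one bit on the first mirror
print(store.read(0))             # served from the good copy, bad copy healed
```

The point of the design is that the checksum tells the reader which copy is wrong, which plain mirroring or single parity cannot do on its own.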

chutz
  • Thanks. In case it's helpful to anyone, I got some more search ideas from chutz's reply and saw that the maintainer of mdadm (I believe) said on Feb 17, 2011 that he has no plans to add the ability to force parity checking on every read. See http://www.spinics.net/lists/raid/msg32816.html – sa289 May 23 '12 at 18:44
  • This seems not entirely incorrect, mdadm has a `checkarray` that can check consistency between disks - which surely can help detect bit corruption? – Chris Stryczynski Jan 19 '21 at 00:25
  • Is `checkarray` the same as the data scrubbing described in @vy32's answer below? https://serverfault.com/a/454043/114782 – chutz Jan 19 '21 at 10:32
3

RAID5 and RAID6 can detect and usually correct bit corruption if you verify the parity of the entire drive. This is called "scrubbing" or "parity checking" and typically takes 24-48 hours on most production RAID systems. During that time performance may be significantly degraded. (Some systems allow the operator to give scrubbing a higher or lower priority than normal read/write access.) RAID6 has a higher chance of correcting the corruption, because it can tolerate two drive failures whereas RAID5 can only handle one, and drive failures are more likely while you are scrubbing because of the increased activity.
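On Linux md, a scrub is started through sysfs. The sketch below wraps the `sync_action` and `mismatch_cnt` files that md exposes under `/sys/block/<array>/md/`; the array name `md1`, the helper names, and the `sysfs_root` parameter (added only so the code can be tried against a fake directory) are assumptions. Writing to the real files needs root. As I understand the md documentation, `check` only counts mismatches, while `repair` also rewrites them, which for RAID5/6 means recomputing parity from the data blocks rather than working out which block was wrong.

```python
from pathlib import Path


def md_dir(array="md1", sysfs_root="/sys/block"):
    """Directory holding the md control files for one array."""
    return Path(sysfs_root) / array / "md"


def start_scrub(array="md1", action="check", sysfs_root="/sys/block"):
    """Ask md to scrub the array; `action` is 'check' or 'repair'."""
    if action not in ("check", "repair"):
        raise ValueError("action must be 'check' or 'repair'")
    (md_dir(array, sysfs_root) / "sync_action").write_text(action + "\n")


def mismatch_count(array="md1", sysfs_root="/sys/block"):
    """Mismatch counter reported by md for the most recent scrub."""
    text = (md_dir(array, sysfs_root) / "mismatch_cnt").read_text()
    return int(text.strip())
```

A typical use would be to call `start_scrub("md1")`, wait for the scrub to finish, and then look at `mismatch_count("md1")`.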

vy32
  • 1
    It is not universally true that performance will be noticeably degraded during a RAID scrub. If the scrub uses all available system resources and is "dumb" then it will. But all SANs, and I imagine most RAID controllers, will run the scrub at a lower or "nice" priority, adjusting the resource utilization dynamically so that it doesn't consume resources needed to maintain production performance. – Jeremy Nov 30 '12 at 22:36
  • You are correct. I edited the answer to add nuance. – vy32 Nov 30 '12 at 22:41
  • If your mdadm RAID 6 array is /dev/md1, is the command to make it verify the parity and attempt repair of single-bit corruption "echo check > /sys/block/md1/md/sync_action"? – BeowulfNode42 Nov 18 '13 at 01:20
  • 2
    They don't "protect against bit corruption", they *detect bit corruption* if you scrub. See my question [here](http://unix.stackexchange.com/questions/137384/raid6-scrubbing-mismatch-repair) for details. –  Jan 17 '15 at 04:40
  • I suggest changing the answer to "RAID5 and RAID6 is able to repair bit corruption" – Waxhead Jun 07 '15 at 13:18
  • 1
    **You can't recover from a bit flip in RAID5**, but you can with RAID6. Note that **a double flip can go completely undetected by RAID5**, whereas RAID6 can detect it, though without being able to recover reliably. – gavenkoa Jul 07 '17 at 14:51
3

All the answers above are incorrect regarding the capabilities of RAID 6. RAID 6 algorithms operate byte-by-byte, just as RAID 5 does, and if a single byte on any one drive is corrupt, even with no error indicated by the drive, it can be detected AND CORRECTED. The algorithm for doing so is fully explained in:

https://mirrors.edge.kernel.org/pub/linux/kernel/people/hpa/raid6.pdf

In order to perform this check, the P and Q parity drives must be read along with the data drives. If the computed parities P' and Q' differ from the stored P and Q and no drive has reported an error, an analysis can pinpoint which of the drives is incorrect and correct the data.

In addition, if the drive identification points to a drive that is not present (such as drive 137 if there are only 15 drives), more than one drive is providing corrupted data FOR THAT BYTE, signaling an uncorrectable error. When there are far fewer than 256 drives in the set, this is detected with high probability per byte, and since there are many bytes in a block, with extremely high probability per block. If the drive identification is not consistent for all bytes within the RAID block, again, more than one drive is providing corrupted data, and generally one might reject the condition, but so long as all the drive identifications are valid, the block need not necessarily be rejected.
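The analysis described above can be sketched for a single byte position as follows. This is a minimal illustration of the approach in the linked paper, not the md driver's code: it assumes the GF(2^8) field with polynomial 0x11D and generator 2, one byte per data drive, and hypothetical function names.

```python
# Build log/antilog tables for GF(2^8) with polynomial 0x11D, generator g = 2.
GF_EXP = [0] * 510
GF_LOG = [0] * 256
_x = 1
for _i in range(255):
    GF_EXP[_i] = _x
    GF_LOG[_x] = _i
    _x <<= 1
    if _x & 0x100:
        _x ^= 0x11D
for _i in range(255, 510):
    GF_EXP[_i] = GF_EXP[_i - 255]


def gf_mul(a, b):
    """Multiply two field elements."""
    if a == 0 or b == 0:
        return 0
    return GF_EXP[GF_LOG[a] + GF_LOG[b]]


def syndromes(data):
    """P = XOR of all data bytes, Q = XOR of g**i * data[i]."""
    p = q = 0
    for i, d in enumerate(data):
        p ^= d
        q ^= gf_mul(GF_EXP[i], d)
    return p, q


def diagnose(data, p, q):
    """Return (status, drive_index, p_delta) for one byte position."""
    p_calc, q_calc = syndromes(data)
    pd, qd = p ^ p_calc, q ^ q_calc
    if pd == 0 and qd == 0:
        return ("ok", None, 0)
    if pd == 0:
        return ("q_corrupt", None, 0)   # only Q disagrees
    if qd == 0:
        return ("p_corrupt", None, pd)  # only P disagrees
    z = (GF_LOG[qd] - GF_LOG[pd]) % 255  # g**z == qd / pd
    if z >= len(data):
        return ("uncorrectable", z, pd)  # points at a drive that does not exist
    return ("data_corrupt", z, pd)


def repair(data, p, q):
    """Fix a single corrupt data byte in place if one can be located."""
    status, z, pd = diagnose(data, p, q)
    if status == "data_corrupt":
        data[z] ^= pd
    return status, z
```

For example, with five data drives, flipping a bit in `data[3]` after computing P and Q makes `diagnose` report drive 3, and XORing that byte with the P difference restores it. A result that names a nonexistent drive is the uncorrectable case described above.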

It takes longer than the usual verification time to perform this correction, but it only needs to be performed when the syndrome (P and Q) calculation shows an error.

All this being said, however, I have not examined the mdadm code to determine whether single-byte corruption is handled. I am aware that mdadm reports RAID6 syndrome errors on the monthly scan, but from the error message it is not clear whether they are being corrected - it does not stop the drive array nor identify any particular drive in the message.

Cafe Hunk
2

I would have added this as a comment but I don't have sufficient reputation; I wanted to clarify: RAID5 can DETECT bit corruption, but it doesn't know which drive has the corruption without a read error. As a result, a scrub couldn't fix this without a read error; it would most likely just log it and update the parity to match. RAID6's algorithm is position-dependent, so it can detect which drive contained the error and correct the bit corruption.
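A small sketch of why single XOR parity cannot locate the bad drive: the same flipped bit produces the same syndrome no matter which drive (or the parity itself) it happened on. The stripe values and function names are made up for illustration.

```python
from functools import reduce
from operator import xor


def parity(stripe):
    """RAID5-style parity: XOR of one byte from each data drive."""
    return reduce(xor, stripe, 0)


def syndrome(stripe, stored_parity):
    """Non-zero means the stripe is inconsistent; it does not say where."""
    return parity(stripe) ^ stored_parity


stripe = [0x11, 0x22, 0x33]
p = parity(stripe)

flip = 0x08
bad_first = [stripe[0] ^ flip, stripe[1], stripe[2]]   # corruption on drive 0
bad_last = [stripe[0], stripe[1], stripe[2] ^ flip]    # corruption on drive 2

# All three cases are indistinguishable from the syndrome alone.
print(syndrome(bad_first, p), syndrome(bad_last, p), syndrome(stripe, p ^ flip))
```

With only one equation per byte there is nothing to tell the cases apart, which is why RAID6's second, position-weighted syndrome is needed to name the drive.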

sbingner
  • That would be great if it's true! Can you please provide any links to where it is documented? – Alek_A Mar 29 '18 at 20:39