39

A friend is talking with me about the problem of bit rot - bits on drives randomly flipping, corrupting data. Incredibly rare, but with enough time it could be a problem, and it's impossible to detect.

The drive wouldn't consider it to be a bad sector, and backups would just think the file has changed. There's no checksum involved to validate integrity. Even in a RAID setup, the difference would be detected but there would be no way to know which mirror copy is correct.

Is this a real problem? And if so, what can be done about it? My friend is recommending ZFS as a solution, but I can't imagine flattening our file servers at work and putting Solaris and ZFS on them.

scobi
  • 879
  • 3
  • 13
  • 17
  • 1
    Here's an article on it: https://web.archive.org/web/20090228135946/http://www.sun.com/bigadmin/content/submitted/data_rot.jsp – scobi Oct 23 '09 at 17:27
  • I just had a nice S.M.A.R.T. error crop up on an old 200GB Seagate disk. The bits, they have rotted too much :-( It's six months short of the 5-year warranty, so I'll probably get a replacement without much fuss. – ThatGraemeGuy Oct 23 '09 at 20:41
  • Here is the official Oracle [doc on scrub and bit rot](https://openzfs.github.io/openzfs-docs/man/8/zpool-scrub.8.html) – theking2 Mar 13 '22 at 09:16

9 Answers

27

First off: your file system may not have checksums, but your hard drive itself has them (S.M.A.R.T. reporting, for example, surfaces the resulting error counters). Once one bit too many gets flipped, the error can't be corrected, of course. And if you're really unlucky, bits can change in such a way that the checksum doesn't become invalid; then the error won't even be detected. So nasty things can happen, but the claim that a random bit flip will instantly corrupt your data is bogus.

However, yes, when you put trillions of bits on a hard drive, they won't stay like that forever; that's a real problem! ZFS can do integrity checking every time data is read; this is similar to what your hard drive already does itself, but it's another safeguard for which you sacrifice some space in exchange for greater resilience against data corruption.

When your file system is good enough, the probability of an error going undetected becomes so low that you no longer have to worry about it, and you might decide that checksums built into the data storage format you use on top of it are unnecessary.

Either way: no, it's not impossible to detect.

But a file system, by itself, can never be a guarantee that every failure can be recovered from; it's not a silver bullet. You still must have backups and a plan/algorithm for what to do when an error has been detected.
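To make the detection part concrete, here's a minimal sketch of the principle in Python. It is not ZFS's actual mechanism (ZFS keeps per-block checksums in parent metadata and verifies them on every read); it only shows why a checksum kept apart from the data turns a silent flip into a detectable one.

```python
import hashlib
import os

def checksum(block: bytes) -> bytes:
    """Digest for one data block, kept separately from the block itself."""
    return hashlib.sha256(block).digest()

# On write: remember the checksum somewhere other than inside the block.
data = os.urandom(4096)              # stand-in for one on-disk block
stored_sum = checksum(data)

# Simulate bit rot: flip a single bit somewhere in the block.
rotted = bytearray(data)
rotted[1234] ^= 0x01

# On read: recompute and compare. A mismatch means the corruption is no
# longer silent, and you can go fetch a known-good copy (mirror/backup).
print(checksum(bytes(rotted)) == stored_sum)   # False: rot detected
print(checksum(data) == stored_sum)            # True: intact block verifies
```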

  • 1
    OK, according to Wikipedia (http://en.wikipedia.org/wiki/Error_detection_and_correction), modern hard drives use CRCs to detect errors and try to recover using compact-disc-style error recovery. That's good enough for me. – scobi Oct 23 '09 at 22:21
  • 1
    But if the CRC is stored in the same location (sector) as the data this won't help for all error cases. E.g. if there is a head positioning error data could be written to a wrong sector - but with a correct checksum => you wouldn't be able to detect the problem. That's why checksums in ZFS are stored separately from the data they protect. – knweiss Dec 04 '09 at 08:22
  • Does ZFS have a maintenance feature like Windows has now? That basically rewrites the data regularly to refresh the magnetic coding. – TomTom Nov 29 '16 at 15:31
  • Modern hard drives do not use CRCs, they use Hamming code which is very different. It's the same thing that ECC memory uses. One-bit flip errors can be corrected, two-bit flip errors can be detected but not corrected, three or more bits flipping and the data is actually damaged. In any case, there is no replacement for data backups. ZFS and other filesystems do not provide any better protection than the Hamming code on a drive's platters does. If the data is damaged then ZFS won't save you. – Jody Bruchon Mar 07 '17 at 14:03
  • @JodyLeeBruchon You got a source on Hamming code being used predominantly now? What info gathering I've been doing lately has indicated that drive makers are still using CRC-RS. [1](https://web.archive.org/web/20170613124503/https://www.hgst.com/sites/default/files/resources/IDRC_WP_final.pdf) [2](https://www.researchgate.net/post/What_kind_of_error_erasure_detection_and_correction_codes_are_used_in_modern_hard_disk_drives) – Erin Schoonover Apr 13 '19 at 16:52
  • 1
    @IanSchoonover No, and now that you mention it, I don't know where I got that info from anymore. It's been over two years since I wrote that. It is not quite correct, but I can no longer edit it to correct it. – Jody Bruchon Apr 26 '19 at 02:05
17

Yes, it is a problem, mainly as drive sizes go up. Most SATA drives have a URE (uncorrectable read error) rate of one error per 10^14 bits read. In other words, for roughly every 12 TB of data read, the vendor says the drive will statistically return a read failure (you can normally look this up on the drive's spec sheet). The drive will continue to work just fine for all its other sectors. Enterprise FC and SCSI drives, along with a small number of SATA drives, generally have a URE rate of one per 10^15 bits (roughly 120 TB), which helps reduce the problem.
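Spelling out the arithmetic behind those figures (just a unit conversion, nothing vendor-specific):

```python
# One unrecoverable read error per 10^14 bits read (typical SATA spec-sheet figure).
bits_per_ure = 10**14

tb_read_per_ure = bits_per_ure / 8 / 10**12      # bits -> bytes -> terabytes
print(tb_read_per_ure)                           # 12.5 TB read per URE, on average

# Enterprise drives quoted at 10^15 bits per URE land around 125 TB.
print(10**15 / 8 / 10**12)
```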

I've never seen two disks stop rotating at the exact same time, but I have had a RAID 5 volume hit this issue (5 years ago, with 5400 RPM consumer PATA drives). A drive fails, it's marked dead, and a rebuild onto the spare drive begins. The problem is that during the rebuild a second drive is unable to read one little block of data. Depending on who's doing the RAID, either the entire volume or just that little block may be dead. Assuming only that one block is dead, you'll get an error if you try to read it, but if you write to it the drive will remap it to another location.

There are multiple ways to protect against this: RAID 6 (or equivalent), which protects against double disk failure, is best; others are a URE-aware filesystem such as ZFS, and using smaller RAID groups so you statistically have a lower chance of hitting the URE drive limits (mirror large drives, or RAID 5 smaller ones). Disk scrubbing and SMART also help, but they aren't really protection in themselves; they're used in addition to one of the above methods.
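A rough way to see why the smaller groups and the better URE rating matter, assuming the quoted rates and independent errors (real UREs tend to come in batches, as a comment below notes, so treat this as illustrative only):

```python
import math

def rebuild_survival(read_tb: float, ure_exponent: int = 14) -> float:
    """P(no URE) while re-reading `read_tb` terabytes during a rebuild,
    assuming one URE per 10**ure_exponent bits and independent errors."""
    bits_read = read_tb * 10**12 * 8
    return math.exp(-bits_read * 10**-ure_exponent)   # (1 - p)^n approximated as e^(-np)

print(rebuild_survival(12))      # ~0.38: RAID 5 rebuild that re-reads 12 TB
print(rebuild_survival(2))       # ~0.85: a smaller group fares much better
print(rebuild_survival(12, 15))  # ~0.91: enterprise drives rated at 10^15
```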

I manage close to 3000 spindles in arrays, and the arrays are constantly scrubbing the drives looking for latent UREs. I receive a fairly constant stream of them (every time one is found it's fixed ahead of the drive failing, and I'm alerted). If I were using RAID 5 instead of RAID 6 and one of the drives went completely dead... I'd be in trouble if the URE hit certain locations.

  • 3
    What units are you speaking in? "10^14" is not a "rate". – Jay Sullivan Feb 12 '15 at 22:01
  • 2
    The unit would be e.g. "10^14 bits read per error", which equals 12 TB read per error. – Jo Liss Jul 02 '15 at 16:46
  • 3
    And of course, keeping in mind that the error rate is normally quoted in terms of full sector errors per bits read. So when a manufacturer states URE rates at 10^-14, what they really mean is that the probability of any random sector read hitting a URE is 10^-14 and if it does, then the whole sector comes back as unreadable. That and the fact that this is statistics; in the real world, UREs tend to come in batches. – user Nov 05 '15 at 12:46
9

Hard drives do not generally encode data bits as single magnetic domains -- hard drive manufacturers have always been aware that magnetic domains could flip, and they build error detection and correction into their drives.

If a bit flips, the drive contains enough redundant data that it can and will be corrected the next time that sector is read. You can see this if you check the SMART stats on the drive, as the 'Correctable error rate'.

Depending on the details of the drive, it should even be able to recover from more than one flipped bit in a sector. There will be a limit to the number of flipped bits that can be silently corrected, and probably another limit to the number of flipped bits that can be detected as an error (even if there is no longer enough reliable data to correct it).

This all adds up to the fact that hard drives can automatically correct most errors as they happen, and can reliably detect most of the rest. You would have to have a large number of bit errors in a single sector, all occurring before that sector was read again, and the errors would have to be such that the internal error-detection codes see the result as valid data, before you would ever get a silent failure. It's not impossible, and I'm sure that companies operating very large data centres do see it happen (or rather, it occurs and they don't see it happen), but it's certainly not as big a problem as you might think.
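To get a feel for that "correct a few, detect a few more, miss the rest" behaviour, here is a toy single-error-correcting Hamming(7,4) code in Python. It is only an illustration of the limits described above; real drives use far stronger codes (Reed-Solomon or LDPC over whole sectors), not this.

```python
def hamming74_encode(d1, d2, d3, d4):
    """Encode 4 data bits into a 7-bit codeword laid out as p1 p2 d1 p3 d2 d3 d4."""
    p1 = d1 ^ d2 ^ d4   # parity over codeword positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4   # parity over positions 2, 3, 6, 7
    p3 = d2 ^ d3 ^ d4   # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p3, d2, d3, d4]

def hamming74_decode(code):
    """Return (data bits, syndrome). A non-zero syndrome is the codeword
    position the decoder believes was flipped, and it flips it back."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3
    if syndrome:
        c[syndrome - 1] ^= 1
    return [c[2], c[4], c[5], c[6]], syndrome

word = hamming74_encode(1, 0, 1, 1)

one_flip = list(word); one_flip[4] ^= 1
print(hamming74_decode(one_flip)[0] == [1, 0, 1, 1])   # True: one flip is corrected

two_flips = list(word); two_flips[1] ^= 1; two_flips[5] ^= 1
print(hamming74_decode(two_flips)[0] == [1, 0, 1, 1])  # False: past the code's limit it
                                                       # "corrects" to the wrong data, which
                                                       # is why drives add extra redundancy
```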

Ian Clelland
  • 762
  • 1
  • 5
  • 7
  • 2
    Actually, I regularly have bit-rot errors (in parts I don't read much), which the system silently recovers from (incorrectly). If at least it notified me there was bit-rot, I could re-read the data to recover it before it became unrecoverable; and if unrecoverable, I'd be able to compare it to the other hard drive. – Alex Nov 26 '14 at 14:39
  • 1
    Alex, please check your HDD SMART data, and system RAM to verify there is not another issue causing the corruption. Bit rot/random corruption is extremely rare, so there may be something else going on with your machine. – Brian D. Jul 12 '16 at 21:18
  • @BrianD. One issue was, I kept the hard drives inside their (insulated) packing material; this was causing the hard drives to heat up to over 60°C while working, for days on end. Does that sound like a legitimate reason why bit rot might have occurred? – Alex Dec 31 '16 at 22:20
  • It's definitely not recommended, as most HDDs have small air holes in them which should not be covered to operate properly. Whether your issue was bit-rot or something else, I would run a full diagnostic on the PC to verify everything is working correctly. – Brian D. Jan 03 '17 at 17:05
5

Modern hard drives (since 199x) have not only checksums but also ECC, which can detect and correct quite a bit of "random" bit rot. See: http://en.wikipedia.org/wiki/S.M.A.R.T.

On the other hand, certain bugs in firmware and device drivers can also corrupt data on rare occasions (otherwise QA would catch the bugs), and that is hard to detect if you don't have higher-level checksums. Early device drivers for SATA and NICs corrupted data on both Linux and Solaris.

ZFS checksums mostly aim at bugs in lower-level software. Newer storage/database systems like Hypertable also have checksums for every update to guard against bugs in filesystems :)
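As an illustration of that kind of application-level safeguard (this is not Hypertable's actual format, just the general pattern of checksumming every record you write so that corruption anywhere below you is caught on read):

```python
import struct
import zlib

def append_record(log_path: str, payload: bytes) -> None:
    """Append one record framed as: 4-byte length, 4-byte CRC32, payload."""
    with open(log_path, "ab") as f:
        f.write(struct.pack("<II", len(payload), zlib.crc32(payload)) + payload)

def read_records(log_path: str):
    """Yield payloads in order, refusing to return silently corrupted data."""
    with open(log_path, "rb") as f:
        while True:
            header = f.read(8)
            if len(header) < 8:
                break
            length, crc = struct.unpack("<II", header)
            payload = f.read(length)
            if zlib.crc32(payload) != crc:
                raise IOError("checksum mismatch: corruption somewhere below the application")
            yield payload

# Usage sketch ("updates.log" is a placeholder path):
append_record("updates.log", b"row 42 -> new value")
print(list(read_records("updates.log")))
```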

Waxhead
  • 791
  • 8
  • 15
obecalp
  • 326
  • 1
  • 3
3

Theoretically, this is cause for concern. Practically speaking, this is part of the reason that we keep child/parent/grandparent backups. Annual backups need to be kept for at least 5 years, IMO, and if you've got a case of this going back farther than that, the file is obviously not that important.

Unless you're dealing with bits that could potentially liquify someone's brain, I'm not sure the risk vs. reward is quite up to the point of changing file systems.

user
  • 4,267
  • 4
  • 32
  • 70
Kara Marfia
  • 7,892
  • 5
  • 32
  • 56
  • 2
    I don't see how child/parent/grandparent backups helps. There's no way to know with that system if a bit is flipped because a user intended to change it or if the drive did it on its own. Not without a checksum of some kind. – scobi Oct 23 '09 at 17:47
  • Having multiple backups won't help if you don't know that the data in them is good. You can manually checksum your files, but ZFS does so much more automatically and makes filesystem management easy. – Amok Oct 23 '09 at 17:55
  • 1
    Having backups that go back farther than a week/month increases your chance of having a good copy of the file. I probably could've been clearer about that. – Kara Marfia Oct 23 '09 at 19:14
  • 1
    The problem is: how do you know you have a bad copy? And how do you know which copy that is backed up is the good one? In an automated way. – scobi Oct 23 '09 at 19:34
  • I've seen maybe one file every few years fall to corruption that may be a result of bit rot, but I may be suffering from Small Fish Syndrome. I could understand talk of backups being useless, and I'll delete if it's offensive. It was time well spent reading the other answers, regardless. ;) – Kara Marfia Oct 23 '09 at 20:00
  • Oh, no reason to delete, this is a good discussion. The problem I've got is the "seen" in "seen maybe one file". It requires going back and validating by hand every file, in order to notice that something is wrong. The backups will definitely help if you do have the multi level backup, and you've noticed a bad file. You can go back until you find one that's good. I'm just concerned about these increasingly enormous data stores we're building having hidden demons that show up right when it's time to ship a product. That's when everything feels like it goes wrong. – scobi Oct 23 '09 at 21:49
2

Yes it is a problem.

This is one of the reasons why RAID 6 is now in vogue (that, and the fact that increasing HD sizes increase the time it takes to rebuild an array). Having two parity blocks provides an additional layer of redundancy.

RAID systems now also do RAID scrubbing, which periodically reads disk blocks, checks them against the parity, and rewrites a block if it is found to be bad.
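In miniature, a scrub pass looks something like the sketch below (plain RAID 5-style XOR parity, purely illustrative; real controllers do this in firmware and combine it with the drive's own error reporting):

```python
import os

def xor_blocks(blocks):
    """XOR equal-length byte blocks together (RAID 5-style parity)."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def scrub_stripe(data_blocks, parity):
    """One scrub step: re-read a stripe and recompute its parity. Single parity
    only says *something* is inconsistent; identifying the bad block relies on
    the drive reporting it (e.g. a URE)."""
    return xor_blocks(data_blocks) == parity

def reconstruct(data_blocks, parity, bad_index):
    """If drive `bad_index` reports its block unreadable, the missing data is
    the XOR of the parity with the surviving blocks."""
    survivors = [b for j, b in enumerate(data_blocks) if j != bad_index]
    return xor_blocks(survivors + [parity])

# Tiny demo: 3 data disks plus 1 parity disk, one 16-byte "stripe".
data = [os.urandom(16) for _ in range(3)]
parity = xor_blocks(data)
print(scrub_stripe(data, parity))               # True: stripe is consistent
print(reconstruct(data, parity, 1) == data[1])  # True: a lost block is rebuilt
```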

Matt Rogish
  • 1,512
  • 6
  • 25
  • 41
  • 1
    Be careful, data integrity is not a feature of all RAID systems. – duffbeer703 Oct 23 '09 at 21:13
  • 1
    With terabyte drives, there are so many bits sharing fate, and the physical storage area of a bit is so small, that this problem becomes more important. At the same time, the probability of failure increases so much with terabyte drives that RAID6 is not enough unless you are putting lots of drives in the pool, say 8 or more. With smaller numbers of drives it is better to use a stripe of mirrors aka RAID 10. Both RAID 6 (raidz2) and RAID 10 (zpool create mypool mirror c0t1d0 c0t2d0 mirror c0t3d0 c0t4d0) are possible on ZFS. – Michael Dillon Oct 23 '09 at 21:46
  • RAID can't tell which data is good and which isn't so it can't fix errors, it can just detect them. – Amok Oct 30 '09 at 16:23
  • Amuck: Not as part of the "RAID Standard", per se, but advanced RAID systems (firmwares, etc.) do that – Matt Rogish Oct 31 '09 at 23:29
  • 1
    @ Michael Dillion - RAID6 reliability does not increase as you increase the number of drives. For all data there is only the original data + 2 parity. Increasing drive number is worse for reliability as it increases the possible drive failure rate without increasing redundancy of any data. The only reason to increase drive numbers, is to increase your available storage size. – Brian D. Jul 12 '16 at 21:04
  • @ Amok - RAID6 has 2 parity locations, so data is represented in 3 locations. Every controller I have seen can easily detect and fix volume data integrity issues when using RAID6 due to this fact. – Brian D. Jul 12 '16 at 21:14
1

Yes, bitrot is a problem.

I wrote a tool called chkbit to help detect bitrot:

Any cloud or local storage media can be affected by data corruption and/or bitrot. While some filesystems have built-in protection, this protection is limited to the storage media.

chkbit will create a hash that follows your data from local media to cloud or backup. This enables you to verify the integrity of your data wherever it is moved (a minimal sketch of the same idea is shown after the steps below).

  • run chkbit on your system
  • move the data to a new system (backup/restore)
  • verify that everything is OK with chkbit
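A minimal sketch of the same idea using plain Python hashing rather than chkbit itself (the directory paths are placeholders):

```python
import hashlib
import json
import os

def hash_file(path, chunk_size=1 << 20):
    """SHA-256 of a file, read in chunks so large files don't need to fit in RAM."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def index_tree(root):
    """Map every file under `root` (by relative path) to its digest."""
    index = {}
    for dirpath, _, names in os.walk(root):
        for name in names:
            full = os.path.join(dirpath, name)
            index[os.path.relpath(full, root)] = hash_file(full)
    return index

# 1. Before the move/backup, record the digests next to (or apart from) the data.
with open("hashes.json", "w") as f:
    json.dump(index_tree("/data/projects"), f)

# 2. After restoring or migrating, re-hash and compare.
with open("hashes.json") as f:
    old = json.load(f)
new = index_tree("/restored/projects")
changed = [path for path, digest in old.items() if new.get(path) != digest]
print(changed or "all files verified")
```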
laktak
  • 626
  • 2
  • 9
  • 16
1

Regarding the OP's statement about RAID not understanding what data is good vs. bad:

RAID controllers use, at the very least, (odd/even) parity bits on every stripe of data. This applies to everything: the data-on-disk stripes and the parity (backup) stripes.

This means that for any RAID type that stripes for redundancy (RAID 5/6), the controller can accurately tell whether the original data stripe has changed, as well as whether the redundancy stripe has changed.

If you introduce a second redundant stripe, as RAID 6 does, then three stripes on three different drives, all corresponding to the same actual file data, would have to become corrupted. Remember that most RAID systems use relatively small data stripes (128 KB or less), so the chance of "bit rot" lining up in the same 128 KB of the same file is practically nil.
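As a back-of-envelope check on that claim, take the 10^-14 per-bit figure quoted elsewhere in this thread purely as a stand-in for "very rare" and assume independent errors (a simplification):

```python
# Chance that the *same* 128 KB stripe position silently rots on three
# separate drives, assuming an illustrative 1e-14 per-bit corruption rate.
p_bit = 1e-14
stripe_bits = 128 * 1024 * 8

p_stripe = 1 - (1 - p_bit) ** stripe_bits   # any rot within one 128 KB block
print(p_stripe)          # ~1e-8
print(p_stripe ** 3)     # ~1e-24: all three aligned copies rotten at once
```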

Brian D.
  • 469
  • 3
  • 11
0

It's a real-world problem, yes, but the question is whether you should worry about it or not.

If you've only got an HDD full of pictures, it might not be worth the effort. If it's full of important scientific data, it might be another kind of story; you get the idea.

Marc Stürmer
  • 1,894
  • 12
  • 15