1

When an error occurs on a drive, is it correct to assume that it will always be detected and reported as a failed read to the OS (for software RAID such as mdadm) or to the RAID controller (for hardware RAID), i.e. that the drive won't silently return corrupted data, and that the RAID software/controller will then use the other drive(s) in the array to read the data instead (assuming it's a RAID level with redundancy)?

From what I understand, modern enterprise-grade drives have error detection schemes in place, so I'm assuming this is the case, but I had trouble finding anything conclusive online. I imagine the answer hinges to some degree on the quality of the error detection built into the drive, so if it matters, I'm most interested in this with regard to the Intel DC S3500 series SSDs.

EDIT 5-Jun-2015 - clarification:

Specifically, I'm wondering whether the algorithms used today for error detection are bulletproof. As a simple example, if error detection were just an XOR over all the bits in a sector, then if two bits got flipped the error wouldn't be detected. I imagine real schemes are far more advanced than that, but I wonder what the odds are of an error going undetected, whether they are so low that we need not even worry about it, and whether there's some authoritative source or trustworthy article on this somewhere that could be cited.
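
To make the toy XOR example concrete (this is not how real drives work; they use far stronger codes), here is a small Python illustration showing that a single parity bit misses an even number of bit flips, while even a generic CRC-32 catches this particular corruption:

    import zlib

    def xor_parity(data: bytes) -> int:
        """Single parity bit over the whole buffer: 0 if the count of 1-bits is even."""
        return sum(bin(b).count("1") for b in data) & 1

    sector = bytes(b"example sector payload" * 20)

    # Flip two bits in one byte: the total number of 1-bits changes by -2, 0 or +2,
    # so the parity bit stays the same and the corruption goes unnoticed.
    corrupted = bytearray(sector)
    corrupted[10] ^= 0b00000011
    corrupted = bytes(corrupted)

    print("parity detects it:", xor_parity(sector) != xor_parity(corrupted))   # False
    print("CRC-32 detects it:", zlib.crc32(sector) != zlib.crc32(corrupted))   # True

Even CRC-32 is only a detection code with a small (but non-zero) collision probability; real drives dedicate a much longer ECC field to each sector, which is what makes undetected errors so rare.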

EDIT 10-Jun-2015

Updated the question title and the question body to make it more generic to the idea of disk errors (not centered around mdadm like it originally was).

sa289

3 Answers

6

Hard drives do have a multitude of error detection and correction methods in place to prevent data corruption. Drives are divided into sectors, some of which may become completely unwritable/unreadable, while others may return wrong data - let's call the first case bad sector corruption and the second silent data corruption.

Bad Sector Corruption

The first kind is already handled by the drive itself in several ways. At the factory, every manufactured drive is tested for bad sectors, which are recorded in a Primary Defect List (p-list). During normal use of the drive, its internal systems may find more bad sectors through ordinary wear and tear - these are added to the Grown Defect List (g-list). Some drives have even more lists, but these two are the most common.

The drive itself counters these errors by remapping access from bad sectors to spare sectors, without notifying the operating system. However, every time a remap happens, the corresponding values in the drive's SMART data are incremented, indicating growing wear. The indicator to watch is SMART attribute 5 (Reallocated Sector Count); other important ones are 187 (Reported Uncorrectable Errors), 197 (Current Pending Sector Count) and 198 (Offline Uncorrectable).
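
Those counters are easy to read with smartmontools. A minimal sketch (assuming smartctl is installed; the device path /dev/sda is a placeholder, and the exact column layout of the attribute table can vary between drives and smartctl versions):

    import subprocess

    # SMART attribute IDs mentioned above, with their usual names as labels.
    WATCHED = {5: "Reallocated_Sector_Ct",
               187: "Reported_Uncorrect",
               197: "Current_Pending_Sector",
               198: "Offline_Uncorrectable"}

    def read_smart_attributes(device="/dev/sda"):
        """Return the raw values of a few SMART attributes parsed from `smartctl -A`."""
        out = subprocess.run(["smartctl", "-A", device],
                             capture_output=True, text=True, check=False).stdout
        values = {}
        for line in out.splitlines():
            fields = line.split()
            if fields and fields[0].isdigit() and int(fields[0]) in WATCHED:
                values[WATCHED[int(fields[0])]] = fields[-1]   # RAW_VALUE is the last column
        return values

    if __name__ == "__main__":
        for name, raw in read_smart_attributes().items():
            print(f"{name}: {raw}")

A raw value that keeps creeping upward over time is generally a sign that the drive should be replaced.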

To find bad sectors, hard drives use internal error correction codes (ECC), which are used to verify the integrity of the data in a specific sector. That way the drive can detect read and write errors in a sector and update the g-list if necessary.

Silent Data Corruption

Since there is quite a lot of internal data integrity checking, silent data corruption should be very uncommon - after all, reliably persisting data is the one job a hard drive has, and it should do that job correctly.

To keep silent data corruption outside of user-requested reads and writes to a minimum, RAID systems periodically read the complete drives, verify each sector's ECC and update the g-list as necessary (data scrubbing). If an error turns up, the data is reconstructed from another RAID member.
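
On Linux software RAID, such a scrub can be started and monitored through sysfs (many distributions already ship a cron job that does this periodically). A minimal sketch, assuming an array named md0 and root privileges:

    import pathlib

    MD = pathlib.Path("/sys/block/md0/md")   # adjust the array name as needed

    def start_scrub():
        """Ask the md layer to read and verify every member device (a "check" scrub)."""
        (MD / "sync_action").write_text("check\n")

    def scrub_status():
        """Return the current sync action plus the mismatch counter left by the last scrub."""
        action = (MD / "sync_action").read_text().strip()
        mismatches = (MD / "mismatch_cnt").read_text().strip()
        return action, mismatches

    if __name__ == "__main__":
        start_scrub()
        print(scrub_status())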

However, all this correction and integrity checking has to be implemented somewhere - the firmware. Bugs in these low-level programs may still lead to problems, as may mechanical faults and false-positive ECC checks. An example would be an unchecked write, where the firmware erroneously reports a successful write while the data was never actually written, or was written incorrectly (an identity discrepancy).
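
This is why some file systems and applications add their own end-to-end checksums on top of the drive's ECC. A minimal, hypothetical sketch of the idea (a real verifier would have to bypass the page cache, e.g. via O_DIRECT, so that the re-read actually hits the drive):

    import hashlib
    import os

    def write_and_verify(path: str, payload: bytes) -> bool:
        """Write payload, flush it to stable storage, then re-read and compare checksums."""
        expected = hashlib.sha256(payload).hexdigest()

        fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC)
        try:
            os.write(fd, payload)
            os.fsync(fd)                    # ask the OS and drive to persist the data
        finally:
            os.close(fd)

        with open(path, "rb") as f:         # in real use: re-open with O_DIRECT
            actual = hashlib.sha256(f.read()).hexdigest()
        return actual == expected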

There are some studies on the statistical occurrence of these failures, in which a file-system-level data integrity check reported a failure while the underlying drive reported no problem - in other words, silent data corruption.

TL;DR: Across 1.5 million disks checked over a 17-month time span, on average less than 0.3% of consumer disks and less than 0.02% of enterprise disks contained such identity discrepancies (365 disks in total) - see Table 10 and Section 5 in this publication.

Lars
0

Yes, mdadm will detect such errors, mark the failed drive as defective and drop it from the array, which will continue to function in degraded mode if redundancy is available.

But AFAIK mdadm does this at the 'software' level, based on errors it receives from the drive in response to its generic I/O requests (which works with any drive), not by querying drive-specific error-detection capabilities.
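
The effect is visible at the md level, for example in /proc/mdstat, where a failed member shows up with an (F) flag and the status line shows a missing slot. A minimal sketch that reports degraded arrays (the exact layout of /proc/mdstat varies slightly between kernel versions):

    import re

    def degraded_arrays(mdstat_path="/proc/mdstat"):
        """Yield (array, detail) for md arrays with failed or missing member devices."""
        current = None
        with open(mdstat_path) as f:
            for line in f:
                if re.match(r"^md\d+\s*:", line):
                    current = line.split()[0]
                    if "(F)" in line:               # e.g. "sdb1[1](F)" marks a failed member
                        yield current, "member marked failed"
                elif current:
                    status = re.search(r"\[\d+/\d+\]\s+\[([U_]+)\]", line)
                    if status and "_" in status.group(1):   # "[U_]" means one slot is down
                        yield current, line.strip()

    if __name__ == "__main__":
        for array, detail in degraded_arrays():
            print(array, "->", detail)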

Dan Cornilescu
  • Are you sure the drive would be marked defective if there was a single unrecoverable error? I may be wrong but I thought just that sector would be marked bad. Yeah, I'm not thinking mdadm wouldn't query the drive's error detection capabilities, but rather wondering if those capabilities would reliably cause a read to a sector with an error to report a failure to the OS / mdadm. Thx – sa289 Jun 03 '15 at 16:56
  • Definitely for read errors - saw it on my system. For write errors there *might* be retries at the lower sw levels (I'm not sure), but if it bubbles up as *unrecoverable* at mdadm level I'd think it'd also cause the drive to be marked defective. Most (if not all) modern drives take care of the bad sector management in their firmware, tho. – Dan Cornilescu Jun 03 '15 at 18:23
  • Okay - good to know. Then I guess the other part of the question applies - is it safe to assume that unrecoverable read errors will always be detected by modern enterprise-grade drives, or may some have error detection systems that allow errors to slip through undetected and silent corruption result in that case? – sa289 Jun 04 '15 at 20:20
  • Pretty safe, I'd say, at least from the companies with good quality track record. Bugs are theoretically possible, but due to large production quantities they'd normally be found fairly quickly. If you're worried about that stick with mature models that have been sold for a while and have gathered good reviews, not the latest and greatest which have yet to prove themselves. – Dan Cornilescu Jun 04 '15 at 22:53
  • Have you ever come across anything authoritative on this? I tried searching, but maybe I just wasn't using the right terms. The closest thing I think I found was that the DC S3700 drives get tested in a particle accelerator, but those are quite expensive drives if you don't need the super high write endurance. Thanks – sa289 Jun 05 '15 at 17:56
  • That's about physical drive failures, I was talking about firmware bugs - good fw should handle physical failures properly - thus safe. Personally I take stats (but with a grain of salt, usage patterns can skew them), amend them with price, warranty terms and RMA speed & convenience and leave the rest into mdadm's hands (sooner or later they'll fail anyways) :) – Dan Cornilescu Jun 05 '15 at 18:49
  • That's a good point about firmware bugs - I got hit by the Intel 8MB bug personally which wiped all my data (fortunately I had a bare metal backup solution). I guess I was more thinking of if there are no bugs, are the algorithms used today for detection of errors bulletproof. In a simple example, if error detection was just doing an XOR on all the bits in the sector, then if two bits got flipped, the error wouldn't be detected. I imagine they are way more advanced than that, but I wonder what the odds of an error going undetected is and if it's so low that we need not even worry about it. – sa289 Jun 06 '15 at 00:51
0

Well, things are a bit more complex.

Modern hard drives don't just detect errors: they have spare sectors and smart controllers that try to relocate bad sectors. That is, when you try to read some logical sector and it doesn't read the first time, the controller retries several times and sometimes succeeds after a few retries; it then writes the data to a spare sector, remaps the logical sector to the new one and marks the old sector as bad, and finally gives you your data. All of this is completely transparent to the reader; you wouldn't notice any error. However, it will normally be reflected in the S.M.A.R.T. statistics, and if it happens more and more often, you can see that the drive is going to fail before it actually fails. That's why it's really important to use SMART monitoring tools on your system.

When a sector doesn't read at all, or the controller runs out of spare sectors, a read error is returned by the drive. Error detection is pretty bulletproof nowadays; it uses some kind of CRC over the sector data. When a read error is returned, mdadm will see it, mark the drive as unusable and switch the array into degraded mode.
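
From the software side, that failed sector read simply surfaces as an I/O error on the read call, which is what mdadm (or any other program) reacts to. A minimal sketch, with /dev/sdb and the sector number used only as placeholders (requires root):

    import errno
    import os

    def read_sector(device="/dev/sdb", sector=12345, sector_size=4096):
        """Read one sector from a block device; an unreadable sector raises EIO."""
        fd = os.open(device, os.O_RDONLY)
        try:
            os.lseek(fd, sector * sector_size, os.SEEK_SET)
            return os.read(fd, sector_size)
        except OSError as e:
            if e.errno == errno.EIO:
                # This is the "failed read" the drive reports once its ECC check
                # fails and all of its internal retries are exhausted.
                raise RuntimeError(f"unreadable sector {sector} on {device}") from e
            raise
        finally:
            os.close(fd)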

Eugene
  • Have you been able to find anything authoritative on error detection being pretty bulletproof these days? I wasn't able to find as such when searching, though like you, I believe it to be the case. – sa289 Jun 09 '15 at 16:24
  • HDD error detection in the vast majority of models is based on Reed-Solomon code (LDPC is basically a variation of it). It can detect single, double and triple bit errors in any circumstances, and detect burst errors with any number of bits (when corrupted bits appear one after one). It can miss multi-bit errors distributed over the block, but in most of real failure scenarios (physical damage of the magnetic surface, bad cabling, etc) this is close to impossible, it will most likely produce burst error. – Eugene Jun 11 '15 at 10:39
  • If you can understand heavy math that Reed-Solomon is based on, try to read: http://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction , https://math.berkeley.edu/~mhaiman/math55/reed-solomon.pdf and so on. – Eugene Jun 11 '15 at 10:41