
This article states that RAID controllers are smart about unrecoverable read errors, and try to rewrite such sectors with the redundancy of the component drives. If the sector is bad, the disk's firmware will reallocate the sector transparently.

Does Linux MD RAID do something similar? Maybe my Google-Fu is bad, but I can't find anything about it.

Halfgaar
  • Note what you've said above: the **disk**'s firmware will transparently reallocate the sector. That happens irrespective of whether the HDD is talking to a motherboard controller or a hardware RAID controller. In short: the discs do that for you under both hardware and software RAID. – MadHatter Jul 25 '14 at 07:24
  • @MadHatter: The argument is that the RAID controller will not throw the disk out of the array upon a URE, but will use the parity information to rebuild and remap it. To my knowledge, MD will not do this and will fail the array instead. As many disks don't die outright but fail slowly with more and more UREs, this will indeed increase the likelihood of a RAID5 surviving a URE, but if a disk dies outright, you are as screwed as ever, so I say this article is bollocks and RAID5 is as dead as ever. – Sven Jul 25 '14 at 07:30
  • That is not transparent remapping, and as you describe it it's not done by the HDD's firmware. Do you have any evidence of hardware controllers that do what you describe? The article you reference seems to claim this happens, but offers no model numbers or any other kind of documented proof. – MadHatter Jul 25 '14 at 07:39
  • @MadHatter how is that not transparent remapping? The MD will rewrite the sector and the drive will remap it if necessary. As for the evidence, as of yet, just that article. – Halfgaar Jul 25 '14 at 07:43
  • You said "*the disk's firmware will reallocate the sector transparently*"; that is not transparent remapping *by the controller*. I have probably misplaced a comma in my comment above, and should have said "*That is not transparent remapping as you describe it, and it's not done by the HDD's firmware*". The article, and your question, currently describe two different ways of dealing with UREs. Is that objection clearer? – MadHatter Jul 25 '14 at 07:45
  • @MadHatter I'm not saying the controller remaps anything. I'm saying the controller just tries to rewrite a sector it got an URE on. The drive will then remap it, and report success to the controller, without the controller knowing about the remap. – Halfgaar Jul 25 '14 at 07:47
  • @MadHatter: A URE (unrecoverable read error) happens when the disk really can't read the sector and the transparent remap fails. I can imagine that a disk with specialized firmware in a high-end storage system can report this to the storage system, which will then act by telling the disk to remap with the reconstructed data. Normal SATA/SAS disks lack this feature anyway, so MD can't do this. – Sven Jul 25 '14 at 07:49
  • @SvW actually, a URE is just that: the sector can't be read. But when the drive encounters a *write* error, it still knows what to write and will just write it elsewhere, performing a transparent sector reallocation. That's why a broken disk shows a lot more reallocated sectors (in SMART) after `dd`-ing over it. That article states that RAID controllers make use of this feature to perform one-sector repairs. – Halfgaar Jul 25 '14 at 07:54
  • Halfgaar, ah, OK, that's much clearer, thank you. SvW, as I read this, there really is a corner-case here: the HDD has a URE that could be remapped if only the drive knew what data should go in the remapped sector. It can't, but a RAID controller can (in certain RAID configurations), so the controller traps the URE and rewrites the failed sector to the failing HDD in such a way that the remap happens. I certainly agree that MD doesn't do this, but to be honest, **I don't want it done**. If any of my RAID drives is starting to return UREs, **I want it out of my RAID**. – MadHatter Jul 25 '14 at 07:56

2 Answers


SHORT ANSWER: mirroring and parity-based RAID layouts support repairing a bad sector with data rebuilt from the (presumably good) redundancy of the other component drives, both during normal reads and during scrubs. However, classical RAID (both hardware- and software-based) can do nothing against silent data corruption, which requires stronger protection in the form of data checksums (provided, for example, by BTRFS and ZFS).

LONG ANSWER: the question and the provided answers conflate different concepts about how disks, MD RAID and checksummed filesystems work. Let me explain them one by one; in any case, please keep in mind that the exact behavior is somewhat firmware- and implementation-dependent:

  • the first line of defense is the disk's own internal ECC: when some bits go bad, the embedded ECC recovery kicks in, correcting the affected bits in real time. A low ECC error rate will generally not cause an automatic sector repair/reallocation; however, if ECC errors accumulate and grow, the disk's firmware will eventually reallocate the affected sector before it becomes unreadable (this is counted by the "Reallocated Event Count" SMART attribute). Some enterprise disks periodically read all sectors to discover problematic sectors in a timely manner (see SAS/SATA surface scanning).

  • if the sector is read only very rarely and the disk does not "see" the gradual data corruption, a read can suddenly fail ("Current Pending Sector" SMART attribute) and the affected data are lost. The disk will report a read error to the operating system and move on. When using a RAID 1/5/6 scheme, the system has sufficient redundancy to reconstruct the missing data, overwriting the failing sectors and, depending on the disk firmware, forcing a sector reallocation. Traditionally, both hardware RAID cards and MD RAID (Linux software RAID) worked in this manner, relying on the HDD's own remapping feature. Newer hardware RAID cards and mdadm releases additionally provide an internal remapping list which kicks in if/when the HDD fails to remap the affected sector (e.g. because no spare sectors are available); you can read more in the md man page, especially the "RECOVERY" section. This obviously means the disk should be replaced immediately. To avoid discovering too many unreadable sectors too late, all RAID implementations support a "scrub" or "patrol read" operation, where the entire array is periodically read to test the underlying disks (a minimal sketch of triggering such a scrub on MD follows this list).

  • the protection scheme described above only works when the read/write error is clearly reported to the RAID card and/or the operating system. In the case of silent data corruption (i.e. a disk returning bad data instead of a clear error), such an approach is useless. To protect yourself from silent data corruption (which, by definition, is not reported by any SMART attribute), you need an additional checksum to validate the correctness of the returned data. This additional protection can be hardware-based (e.g. the SAS T10 protection information extension), software-based at the block-device level (e.g. dm-integrity) or provided by a fully integrated checksumming filesystem (BTRFS and ZFS). Speaking of ZFS and BTRFS, they support a "scrub" operation similar, but not identical (i.e. scanning only actually allocated data), to their RAID counterparts.
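
As a rough illustration of the scrub/check mechanism mentioned in the points above, the following minimal Python sketch starts an MD "check" pass through the sysfs interface documented in the md(4) man page and reports the mismatch counter once the pass finishes. The array name /dev/md0 and the polling interval are assumptions made for the example; most distributions already ship a periodic job that does essentially this.

```python
# Minimal sketch: start an MD "check" scrub and report the mismatch counter.
# Assumes a hypothetical array /dev/md0 and root privileges; the sysfs
# attributes (sync_action, mismatch_cnt) are described in the md(4) man page.
import time
from pathlib import Path

MD_SYSFS = Path("/sys/block/md0/md")    # hypothetical array name

def start_check() -> None:
    # "check" performs a read-only scrub; "repair" would also rewrite
    # unreadable/mismatched blocks from the remaining redundancy.
    (MD_SYSFS / "sync_action").write_text("check\n")

def wait_for_idle(poll_seconds: int = 30) -> None:
    # sync_action reads back as "idle" once the scrub has finished.
    while (MD_SYSFS / "sync_action").read_text().strip() != "idle":
        time.sleep(poll_seconds)

def report_mismatches() -> None:
    count = (MD_SYSFS / "mismatch_cnt").read_text().strip()
    print(f"mismatch_cnt after scrub: {count}")

if __name__ == "__main__":
    start_check()
    wait_for_idle()
    report_mismatches()
```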

NOTE: RAID6 or 3-way RAID1 layouts can theoretically offer some added protection against bitrot compared to RAID5 and 2-way RAID1 by using some form of "majority vote". However, as it would impose a massive performance hit, I have never seen such behavior in common implementations. See here for more details.
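
To make the "majority vote" idea concrete, here is a purely conceptual Python sketch (an illustration only, not something any common RAID implementation is known to do, as noted above): given three copies of the same block, it returns whatever content at least two copies agree on, which would mask a single silently corrupted replica.

```python
# Conceptual sketch only: majority vote across three replicas of a block,
# as a 3-way RAID1 could theoretically do to mask single-copy bitrot.
from collections import Counter
from typing import List, Optional

def majority_vote(copies: List[bytes]) -> Optional[bytes]:
    """Return the content most replicas agree on, or None if there is no majority."""
    content, votes = Counter(copies).most_common(1)[0]
    return content if votes > len(copies) // 2 else None

# Example: one of three mirrors returns silently corrupted data.
good = b"\x00" * 4096
bad = good[:-1] + b"\x01"               # single flipped bit in one copy
assert majority_vote([good, bad, good]) == good
```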

shodanshok
  • Nice answer. It may need a source though. I should have looked at this before, here is the source code in MD RAID1 that is responsible for this: [click](https://github.com/torvalds/linux/blob/16fbf79b0f83bc752cee8589279f1ebfe57b3b6e/drivers/md/raid1.c#L2498). How well this works may also depend on if your drive has TLER. – Halfgaar Mar 23 '20 at 18:56
  • Other than in source code, the [md man page](http://man7.org/linux/man-pages/man4/md.4.html) has plenty of details. I'll add the reference to the answer above. – shodanshok Mar 23 '20 at 19:27

Linux md RAID in its strictest sense can't do that, but the device mapper (dm, the kernel side of LVM) has a bad block remapper module.

Of course, dm and md can be used in parallel. In the most popular configuration, a RAID array has an LVM volume group on it; this can be extended with the bad block mapper as well.

I must note that current hard disk controllers already have such bad block remapping functionality in their firmware.

Most professional system administrators don't work with disks that have bad blocks; they throw them out after the first problem. They justify that with cost-risk calculations, but it is not true: the truth is that they are simply lazy. There is very good bad block handling in most operating systems (especially in Linux), so you can use such a hard disk without any fear.

peterh
  • 'bad block mapper' doesn't sound like what I mean; it sounds like a software implementation. I agree with the idea mentioned earlier that when disks are starting to reallocate sectors I want them out of my RAID. I even more dislike the idea of disks that are so bad that the disk firmware's sector reallocation doesn't work sufficiently anymore, and *software* is used to mark bad blocks as well... Besides, how can software bad block marking even work still, seeing as how a drive's firmware can remap sectors? – Halfgaar Jul 25 '14 at 08:34
  • can you elaborate on that very good bad block handling? As I said in my first comment, how can software be expected to handle bad blocks when it can't even be sure which block is which (because of transparent sector reallocations, or even SSDs with wear levelling)? I think software bad block handling is a thing of the past, and I don't understand why man pages for tools like `badblocks` don't warn against use on modern hardware. – Halfgaar Jul 25 '14 at 10:09
  • The way "software bad block remapping" works is like this: first the software finds the block number that is bad and asks the disk firmware to write something to that block. The disk firmware will remap the block (transparently: the block number does not change, only the physical location) and return a successful write status. In a software RAID configuration, the software can retrieve the correct data that is supposed to go into that block from parity or a mirror. Modern filesystems such as ZFS or BTRFS do exactly this on a scrub command (only). – Alecz Sep 01 '17 at 16:02
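
To illustrate the rewrite-to-remap flow described in the comment above, here is a minimal, hypothetical Python sketch: it assumes the failing LBA is already known (e.g. from the kernel log) and that the correct sector contents have been reconstructed from a mirror or from parity, and it simply rewrites that offset so the drive's firmware can reallocate the sector during the write. The device name, LBA and sector size are placeholders; an MD "repair" scrub or a ZFS/BTRFS scrub does this for you and is the sane way to trigger it.

```python
# Conceptual sketch: rewrite a known-bad sector with data reconstructed
# elsewhere (mirror/parity) so the drive firmware can remap it on write.
# DEVICE, BAD_LBA and SECTOR_SIZE are hypothetical placeholder values.
import os

DEVICE = "/dev/sdX"         # hypothetical failing member disk
BAD_LBA = 123456789         # hypothetical LBA taken from the kernel log
SECTOR_SIZE = 4096          # physical sector size of the example drive

def rewrite_sector(reconstructed: bytes) -> None:
    """Overwrite the bad LBA with the reconstructed sector contents."""
    assert len(reconstructed) == SECTOR_SIZE
    fd = os.open(DEVICE, os.O_WRONLY)
    try:
        os.pwrite(fd, reconstructed, BAD_LBA * SECTOR_SIZE)
        os.fsync(fd)        # make sure the write actually reaches the drive
    finally:
        os.close(fd)
```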