What is the granularity of a hard disk URE (unrecoverable read error)?

8

1

tl;dr: if a URE occurs on an HDD, will I lose 1 bit, 1 byte, or a whole sector (512 bytes, or 4096 bytes with Advanced Format)? And, if possible, please explain why.

Background: The question arises when a hard disk has a problem reading data. A disk can of course fail completely, leaving all its data lost (DISK FAIL), but the case I am asking about here is when just a small part of the data is lost (URE, an unrecoverable read error).

Even though I have looked for information about UREs, I have found little that is certain. The reason may be that what happens inside the drive, i.e. what is hidden from direct user interaction (such as ECC correction), is hard for me to relate to what I access as a user: the sectors.

Let us imagine that the hdd has trouble reading data.

In that situation, surely this must mean either that:

  • (a) some bits of the sector cannot be read, or
  • (b) all bits can be read, yet they do not pass a checksum test (of course, in anticipation of trouble, a 4096-byte sector is not just 8*4096 bits but also carries some additional bits/bytes for error checking/correction, i.e. parity bits), or
  • (c) ????

Now my belief is that when a combination of (a) and (b) has occurred and the sector's 4096 bytes cannot be reliably reconstructed, it is excessive to assume that all of them are necessarily garbage. If we knew the internal HDD error-correction logic, we might instead say: "Look, something does not check out, and there is a good chance that at least 1, 2, 3, ..., n bits/bytes of the block data are wrong." If we were redundantly saving "hello,hello,...,hello" ASCII byte strings in this sector, we might still have a fair succession of "hello,hello,..." before there is a "...Uellohello..." (i.e. "h" -> "U").

So what is the granularity of an URE?

UPDATE: A comment raised the idea of a bad sector, suggesting that this reflects the granularity of a URE event. That is not an absurd suggestion, and it may help in answering the question. Yet I just read another related question about pending unreadable sectors (https://unix.stackexchange.com/questions/1869/how-do-i-make-my-disk-unmap-pending-unreadable-sectors), which leads me to think that in some scenarios the line around which data is lost in a URE is blurrier.

humanityANDpeace

Posted 2015-09-08T09:56:34.413

Reputation: 642

Usually it is tens of thousands of blocks damaged at a time in the case of a crashed head. If it is dust, etc., accessing nearby blocks can spread the damage. So it's rarely as simple as reconstructing part of a larger area. – JamesRyan – 2015-09-08T16:36:46.643

@JamesRyan Good hint, it can always be worse. Maybe I was simply asking about the least bad case possible (that is, losing only a sector or, as the good answers partly resolved, part of a sector's data, depending on the kind of data inside it). Maybe the genesis of unreadable errors (and their persistence, i.e. random bit rot vs. head crash impact) will have to be considered as well. But we want answerable questions here, so I did not needlessly complicate the question any further. – humanityANDpeace – 2015-09-09T04:49:44.523

Answers

8

The error correction code on a hard drive is an additional chunk of data that's associated with each hardware sector. During writing the drive firmware calculates this data and writes it along with the user's data. During reading the firmware reads the ECC along with the data and checks them together.

For a traditional hard drive the hardware sector is 512 bytes. For an Advanced Format drive it's 4K bytes (it doesn't matter whether the drive is presenting 512-byte or 4K-byte sectors at the interface, i.e. 512e vs. 4kn).

The result of the check after a read has basically three possible results:

  • sector was read without error. This is actually not all that common on modern hard drives; the bit densities are such that they depend on ECC working.

  • sector was read with correctable errors. As implied above this is not uncommon; it is expected. The drive returns the data, with error correction applied, to the user.

  • sector was read but there were too many "wrong bits"; the errors could not be corrected.

In the last case the drive does not normally return any contents whatsoever; it just returns a status indicating the error. This is because it is not possible to know which bits are suspect, let alone what their values should be. Therefore the entire sector (ECC bits and all) is untrustworthy. It is impossible to determine which part of the bad sector is bad, let alone what its contents should be. The ECC is a "gestalt" that is calculated across the entire sector content, and if it doesn't match, it's the entire sector that isn't matched.

SpinRite works by simply trying to read the bad sector over and over again, using a "maintenance read" function that returns the data (but without ECC bits) even though the drive says "uncorrectable error". As said in the description linked by DavidPostill, it may succeed with an error-free (actually "correctable" is more likely) read; or it may be able to deduce, essentially by averaging the returned bits together, a reasonable guess at the sector contents. It has no more ability to precisely correct errors using the ECC than the drive does; that's mathematically impossible.
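The "averaging the returned bits together" idea can be sketched as a per-bit majority vote over several raw reads of the same sector. This is only an illustration under assumptions: the function name `majority_vote` is hypothetical, the raw reads are taken as given (how a tool obtains them from the drive is not shown), and nothing here claims to be SpinRite's actual algorithm.

```python
def majority_vote(reads: list[bytes]) -> bytes:
    """Combine several same-length raw reads of one sector, bit by bit.

    Each output bit is 1 only if more than half of the reads had a 1 at
    that position. This yields a best guess, not a verified result.
    """
    if not reads:
        raise ValueError("need at least one read")
    length = len(reads[0])
    if any(len(r) != length for r in reads):
        raise ValueError("all reads must have the same length")

    result = bytearray(length)
    threshold = len(reads) / 2
    for i in range(length):
        for bit in range(8):
            ones = sum((r[i] >> bit) & 1 for r in reads)
            if ones > threshold:
                result[i] |= 1 << bit
    return bytes(result)


# Hypothetical example: three noisy reads of the same 5-byte payload.
guess = majority_vote([b"hello", b"Uello", b"hellp"])
```

The result is still only a guess: without a matching ECC there is no way to confirm that the voted bits are the ones originally written.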

Jamie Hanrahan

Posted 2015-09-08T09:56:34.413

Reputation: 19 777

Is it still mathematically impossible if the data inside the 4096-byte payload was itself a combination of a 4000-byte payload and another 96 bytes of ECC on top (for instance, because I was willing to sacrifice capacity for recoverability in the data store layout)? – humanityANDpeace – 2015-09-08T12:02:38.793

My guess is that it's only mathematically impossible under the implicit assumption that there is no further redundancy inside the data, right? Also, great answer! – humanityANDpeace – 2015-09-08T12:03:45.093

Sure. At that point it's just another unreliable channel, but if there's enough redundancy in it.. The catch is that the OS's standard disk drivers won't give you the sector contents at all if the drive thinks the errors are uncorrectable. RAID-5 and similar parity schemes are doing the same thing at an "outer layer" rather than inside the data fields of existing sectors. – Jamie Hanrahan – 2015-09-08T19:00:24.733

"the catch" with the os drivers to give back (at request) all, even unverified data is a problem, as a non-windows user I asked about this specifically https://unix.stackexchange.com/questions/228254/how-can-force-hdd-to-give-the-bad-data-of-a-ure-sector-bad-sector

– humanityANDpeace – 2015-09-09T04:52:51.867
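The "outer layer" parity idea mentioned in the comments above can be sketched with plain XOR parity: if exactly one block in a stripe is reported unreadable, it can be rebuilt from the surviving blocks plus the parity block. This is a minimal sketch, not a RAID implementation; the names `xor_blocks` and `rebuild_missing` are hypothetical, and a missing block is represented here simply as `None`.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def rebuild_missing(blocks, parity):
    """Rebuild the single missing block (marked as None) from the parity.

    Works only if exactly one block is missing, which is the situation
    when one sector returns an uncorrectable read error.
    """
    missing = [i for i, b in enumerate(blocks) if b is None]
    if len(missing) != 1:
        raise ValueError("can rebuild exactly one missing block")
    survivors = [b for b in blocks if b is not None]
    return xor_blocks(survivors + [parity])


# Hypothetical stripe of three 8-byte "sectors" plus one parity block.
data = [b"hello,he", b"llo,hell", b"o,hello,"]
parity = xor_blocks(data)

# Pretend the middle block came back as a URE and rebuild it.
recovered = rebuild_missing([data[0], None, data[2]], parity)
```

Note that this scheme relies on the drive reporting the failed sector: XOR parity alone can rebuild a block known to be missing, but it cannot tell which block is wrong if all of them are returned.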

3

What is the granularity of an URE?

Unrecoverable read errors (URE) are sector read failures. If the sector cannot be read without error it doesn't matter whether it was just 1 byte or all of the sector's bytes.

The granularity is the sector size.

Even if only 1 byte failed you won't normally get any of the data from that sector back without using specialist software.


Can the data from a failed sector be recovered?

SpinRite says:

SpinRite is even able to recover most of the data in a sector that can never be perfectly read, and which any other utility software discards in full.

See How SpinRite Recovers Unreadable Data.


Disclaimer.

I am not affiliated with SpinRite in any way, and I've never used it.

DavidPostill

Posted 2015-09-08T09:56:34.413

Reputation: 118 938

I tend to think this is a good answer, not because I necessarily agree that a URE means losing a sector (that is, after all, 4k of data) completely, but because it shows that the HDD might discard even the part of the "bad sector" that would still be of value. The presentation of the SpinRite arguments supports this idea, so the answer also offers some more insight, great. – humanityANDpeace – 2015-09-08T11:27:15.010

2

There's no such thing as "can't read a bit", unless you have a really grievous hardware error like the head not being able to seek to the correct track, or the servo track is damaged and the correct sector can't be found. Obviously in either case you would have, at the very least, an entire unreadable sector.

Otherwise, you always get bits back, they're just possibly incorrect bits. This is where the error-correcting code comes in; it adds some number of extra ECC bits to every sector, such that any correct combination of data bits and ECC bits observes some algebraic rule. If all bits were read correctly, the code will validate and the data can be passed back directly. If a small number of bits were read incorrectly, the ECC code can be used to determine exactly which ones, and fix them, so all of the data is passed back correctly. If a larger number of bits was read incorrectly, the ECC code can detect that there was an error, but it no longer has enough information to figure out which bits are incorrect; this is an uncorrectable read error. If a very large number of bits is read incorrectly, then the code might validate correctly "by accident" and the drive will return corrupted data, but with enough ECC bits the probability of this happening can be made as small as you like.

So to answer the question I think you were getting at — if there was a partial read error but enough information was available to figure out where the error occurred, then it can also be corrected, and the computer won't see any error at all. This actually happens constantly. An uncorrected error happens when it's not possible to figure out which data bits are valid and which ones aren't, and since the error-correcting code is computed over a sector, this happens at sector granularity.
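The three outcomes described above (clean read, corrected read, detected but uncorrectable read) can be illustrated with a toy code. The sketch below uses an extended Hamming(8,4) code, which corrects one flipped bit and detects two, over a 4-bit "sector". It is an assumption chosen for brevity: real drives use far stronger codes over whole sectors, and the names `encode` and `decode` are hypothetical.

```python
def encode(data):
    """Encode 4 data bits into an 8-bit extended Hamming codeword.

    Positions 1, 2 and 4 hold parity bits, positions 3, 5, 6 and 7 hold
    data, and position 0 holds an overall parity bit.
    """
    code = [0] * 8
    code[3], code[5], code[6], code[7] = data
    code[1] = code[3] ^ code[5] ^ code[7]
    code[2] = code[3] ^ code[6] ^ code[7]
    code[4] = code[5] ^ code[6] ^ code[7]
    code[0] = sum(code[1:]) % 2
    return code


def decode(received):
    """Return (status, data) where data is None if uncorrectable."""
    code = list(received)  # work on a copy, leave the input untouched
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions of all set bits
    overall = sum(code) % 2  # 0 for any valid codeword

    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:
        # Odd number of flips: assume one, located at `syndrome`
        # (syndrome 0 means the overall parity bit itself flipped).
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Even number of flips with a nonzero syndrome: an error is
        # known to exist, but its location cannot be determined.
        return "uncorrectable", None
    return status, [code[3], code[5], code[6], code[7]]


word = encode([1, 0, 1, 1])

one_flip = list(word)
one_flip[5] ^= 1

two_flips = list(word)
two_flips[2] ^= 1
two_flips[6] ^= 1

results = [decode(word), decode(one_flip), decode(two_flips)]
```

With two flips the decoder knows the codeword is invalid but has no basis for saying which bits are wrong, so the only safe answer covers the whole codeword. With three or more flips even this toy code can be fooled into a wrong "correction", which mirrors the "validates by accident" case described above.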

hobbs

Posted 2015-09-08T09:56:34.413

Reputation: 701

1

Having looked into it, and inspired by the answer https://superuser.com/a/969917/160771 from https://superuser.com/users/337631/davidpostill, I would like to present a somewhat extended alternative answer. First, it is true that the hard disk and its firmware are the origin of a URE event, that is, the event that data cannot be read. Further, it is true that the data is written to disk in sectors of 512 or 4096 bytes of usable data plus some 50 or 100 bytes, respectively, of extra data that allow error checking and correction.

A URE is therefore naturally discussed in the context of a hard disk sector. The term bad sector is certainly related, but it is not identical to the situation at hand when we have a URE sector.

A sector that cannot be read without error is not necessarily completely meaningless. It could be that all 4096 bytes of data have indeed become corrupted, but it could also be that only 1 bit more than could be reliably corrected (via the redundant extra ECC data added to each sector) was corrupted.

In cases in which only a few bytes more than the HDD was able to correct have been corrupted, there is a chance that a fraction of the 4096 bytes still holds meaningful data.

An example could be that the 4096 bytes represent the ASCII characters of 2 sentences. Then it is possible that 1 sentence or more is completely intact. It is also possible that every 2nd or 3rd letter has been deleted. Whether the 4096 bytes of data are lost in a URE event is hence a matter of interpretation and depends on the data. One could imagine that the data itself had another layer of ECC, which would allow for further recovery.

Therefore it is good that most firmwares do treat URE sectors differently from bad sectors:

Typically, automatic remapping of sectors only happens when a sector is written to. The logic behind this is presumably that even if a sector cannot be read normally, it may still be readable with data recovery methods. (from https://en.wikipedia.org/wiki/Bad_sector)

Or, to extend on that, it might be that a part of the sector still contains usable data.

humanityANDpeace

Posted 2015-09-08T09:56:34.413

Reputation: 642

Note that the article is marked as "needs attention from an expert" and "possibly contains original research", and that particular statement is marked as "citation needed". The way it's written ("presumably"??) also makes it sound very much like someone is speculating, rather than something that can be corroborated with high-quality source material. – a CVn – 2016-01-20T15:15:04.767