23

My Linux system has started throwing SMART errors in the syslog. I tracked it down and believe the problem is a single block on the disk. How do I go about easily getting the disk to reallocate that one block? I'd like to know what file got destroyed in the process. (I'm aware that if one block fails on a disk others are likely to follow; I have a good ongoing backup and just want to try to keep this disk working.)

Searching the web leads to the Bad block HOWTO, which describes a manual process on an unmounted disk. It seems complicated and error-prone. Is there a tool to automate this process in Linux? My only other option is the manufacturer's diagnostic tool, but I presume that'll clobber the bad block without any reporting on what got destroyed. Worst case, it might be filesystem metadata.

The disk in question is the primary system partition. Using ext3fs and LVM. Here's the error log from syslog and the relevant bit from smartctl.

smartd[5226]: Device: /dev/hda, 1 Currently unreadable (pending) sectors

Error 1 occurred at disk power-on lifetime: 17449 hours (727 days + 1 hours)
... Error: UNC at LBA = 0x00d39eee = 13868782

There's a full smartctl dump on pastebin.

Nelson
  • I thought the disk firmware will automatically re-map the bad block on read, so theoretically it has already been done. As stated below, run fsck (or the correct equiv for your FS) to make sure the overlaying FS is still stable. – BuildTheRobots Jan 20 '10 at 18:31
  • 2
    My understanding is disk firmware will only remap the block on *write*, not on read. So really I need to force a write to the block in question. – Nelson Jan 21 '10 at 16:05
  • 1
    I finally retired this disk. It ran fine for several months, but after the 5th read error I gave up on it. – Nelson Apr 18 '10 at 17:16
  • In case it helps anyone who comes across this page now, the Smartmontools Bad Block Howto is now located at: https://www.smartmontools.org/wiki/BadBlockHowto – Nathan Sep 14 '20 at 22:27
  • Thanks Nathan, I've updated the link in the question. – Nelson Sep 16 '20 at 14:59

6 Answers

36

I used to write disk firmware for WD, and I once wrote the firmware which reassigned bad blocks.

First, most bad blocks are detected on reads, not writes. Writes are done blindly, meaning the data is written without being checked, so if the media is bad you won't know it until the host reads that sector. A small part of the sector (the sector header) is read on writes to locate the correct sector, so if there is an error reading the sector header, the drive will reassign the sector and write it with the data received from the write command. But the vast majority of bad blocks are detected on reads, and just because a write to a sector succeeds doesn't mean the media is good or that the sector has been reassigned.

Now about bad block reassignment (also called reallocation). Yes, normally the drive will attempt to reassign a sector if the error is bad enough (i.e., the ECC failure is bad enough) but the drive still could recover the data after ECC correction. Usually this is done automatically. The only exception is that the host could have previously told the drive not to do automatic reallocations, but this is seldom done.

So what happens if the drive does a read and cannot recover the data? Nothing. The error is reported to the host, but no reassignment is done. The problem is that the drive could reassign the sector, but it doesn't have the slightest idea what data to write in the newly reassigned sector. If it just wrote a bunch of zeros, say, and then the sector was read again, it would return all the zeros without any indication that the data wasn't valid. This is essentially the same thing as data corruption. The drive can't count on the host keeping track of errors for a variety of reasons (for example, what if the drive was moved to a new host?), so the best course of action is to do nothing when the data can't be recovered.

Modern drives, however, will save the location of the bad sector when it can't be reallocated. The number of bad sectors awaiting reallocation can be found in the SMART data. If a write is done to one of the bad sectors awaiting reallocation, the reallocation is done because the drive now has valid data to write to the sector after the reallocation. Thus when people say writing to a bad sector will reallocate it, that's really only half the story. The drive must be read first so it can discover all the bad sectors that can't be reallocated automatically. You can write an entire drive, and the SMART data will say there are no bad sectors awaiting reallocation, but you haven't necessarily cleared the drive of all bad sectors. So if you really want to clear a drive of all bad sectors, the best thing is to read the entire drive first, then write the entire drive (of course, this will destroy all previous data on the drive).

There are other ways of dealing with bad blocks which can't be reallocated. If the drive is part of a redundant RAID configuration (i.e., anything but RAID 0), the RAID software should automatically recover the data for a bad sector from the other drives and write it to the reallocated sector. SCSI disks have an explicit reassign blocks command which the host can use to force the reassignment even when there is no valid data to write to the block, but its use is pretty low-level.
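The count of sectors awaiting reallocation described above is normally reported as SMART attribute 197 (Current_Pending_Sector) in the `smartctl -A` table. Here is a minimal sketch for pulling out its raw value, assuming the usual ten-column attribute layout where the raw value is the tenth field. The function name `pending_sectors` is made up for illustration.

```shell
#!/bin/sh
# pending_sectors: read `smartctl -A` output on stdin and print the raw
# value of attribute 197 (Current_Pending_Sector).
# Assumption: standard attribute table, where field 1 is the attribute ID
# and field 10 is RAW_VALUE. Prints nothing if the drive does not report
# the attribute.
pending_sectors() {
    awk '$1 == 197 { print $10 }'
}

# Example (needs root; /dev/hda is the device from the question):
#   smartctl -A /dev/hda | pending_sectors
```

Checking this value after a full read pass, and again after writing to the reported sectors, is one way to see whether the drive has actually cleared them.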

tenner
  • 2
    Might be worth mentioning too that at least some Seagate HDDs support Write-Read-Verify, which can be turned on using `hdparm -R` (assuming a reasonably recent hdparm). This comes at a significant write performance penalty (approximately halving write throughput and write IOPS, because every write now incurs a subsequent read), but if your hardware supports it and your workload is read-heavy then this may be a very workable *preventative* measure. – user Mar 09 '16 at 12:36
13

You could try `hdparm --write-sector <LBA> /dev/hda` (substituting your own device), which overwrites that one sector with zeros.
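A hedged sketch of how that might look for the LBA from the question. `hdparm --read-sector` confirms that the sector really is unreadable before anything is overwritten. `--write-sector` is destructive and refuses to run without the `--yes-i-know-what-i-am-doing` flag. The helper name `sector_cmds` is hypothetical, and it only prints the commands so they can be reviewed before being run by hand as root.

```shell
#!/bin/sh
# sector_cmds: print (do not run) the hdparm commands for one sector.
#   $1 = LBA in decimal or 0x-prefixed hex
#   $2 = whole-disk device, e.g. /dev/hda
sector_cmds() {
    lba=$(( $1 ))    # normalise hex such as 0x00d39eee to decimal
    dev=$2
    # Step 1: expected to fail with an I/O error if the sector is bad.
    echo "hdparm --read-sector $lba $dev"
    # Step 2: DESTRUCTIVE, replaces the sector contents with zeros.
    echo "hdparm --yes-i-know-what-i-am-doing --write-sector $lba $dev"
    # Step 3: the same read should now succeed.
    echo "hdparm --read-sector $lba $dev"
}

# LBA and device taken from the SMART error log in the question.
sector_cmds 0x00d39eee /dev/hda
```

Printing instead of executing is a deliberate choice here: one wrong digit in the LBA zeroes a healthy sector.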

I don't know any other way of doing this - you need to manually convert the LBA into filesystem blocks (as you've already found)
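The conversion itself is short arithmetic. The Bad Block HOWTO gives it as b = (int)((L - S) * 512 / B), where L is the failing LBA, S is the partition's starting sector and B is the filesystem block size in bytes. Below is a sketch that assumes a plain partition holding ext2/ext3 directly. With LVM, as in the question, the offset of the logical volume inside the physical volume has to be accounted for as well. The start sector 63 and the 4096-byte block size are placeholder values, not taken from the question.

```shell
#!/bin/sh
# lba_to_fs_block LBA START_SECTOR BLOCK_SIZE
# Prints the filesystem block that contains a given disk LBA.
# Same result as (LBA - START) * 512 / BLOCK_SIZE with integer division,
# written here as a division by sectors-per-block.
lba_to_fs_block() {
    lba=$(( $1 ))     # accepts decimal or 0x-prefixed hex
    start=$2          # first sector of the partition (see `fdisk -lu`)
    bs=$3             # filesystem block size in bytes (see `tune2fs -l`)
    echo $(( (lba - start) / (bs / 512) ))
}

# Placeholder geometry: partition starts at sector 63, 4096-byte blocks.
lba_to_fs_block 0x00d39eee 63 4096
```

The resulting block number can then be handed to `debugfs`: `icheck <block>` reports the inode that owns the block, and `ncheck <inode>` turns that inode into a path name, which tells you what the write will clobber.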

James
  • Ooh, that's a new flag! That will definitely take care of reallocating the bad block. Now all I need is an easy way to find what it will clobber. – Nelson Jan 25 '10 at 19:54
  • 3
    Having used this method to fix a disk, I can say this is the correct method. Forcing a write to the sector in question will force the drive to face up to the sector and either (a) obtain a successful write, or (b) end up with a permanent bad sector along with a remap. – Avery Payne Mar 01 '10 at 08:36
  • Great! And so much easier than http://smartmontools.sourceforge.net/badblockhowto.html – Janning Jul 13 '12 at 19:07
  • It's strange that this iterative process (of looking for the next bad sector through SMART and forcing it to re-allocate) isn't automated with a simple utility!.. – imz -- Ivan Zakharyaschev Dec 10 '13 at 19:03
3

I think all you have to do is:

e2fsck -c /dev/hda1

assuming /dev/hda1 is the (unmounted) partition. Or:

e2fsck -c -c /dev/hda1

to do a (slower) non-destructive read-write test. It will still have to be unmounted. I don't think this will give you details on any lost data, though.

Matthew Flaschen
  • But it's a pity that that doesn't seem to use the information from SMART about the bad-blocks. I wonder why there is no fsck tool that would use the bad block information from SMART and try to avoid them or repair the affected files as described in http://smartmontools.sourceforge.net/badblockhowto.html or http://serverfault.com/a/106130/68972 ... – imz -- Ivan Zakharyaschev Jan 05 '14 at 18:26
2

Michael has it right, and in most cases I would say just replace the drive; they are cheap. However, if you don't have backups and can't get important data off the drive, or just want to attempt to repair the drive, then you may want to try SpinRite at its highest level.

I had a laptop drive that started making some noises a few years ago. badblocks showed that the drive had 118 or so bad blocks visible to the end user. Since I already had a copy of SpinRite, I decided to give it a try before buying a new drive. After running SpinRite on the drive, badblocks showed 0 bad blocks and the noises stopped. The drive has been working for over two years since then.

3dinfluence
  • Nelson, are you just going to downvote every answer that isn't what you want to hear? A healthy drive will automatically remap a bad block. If you have to go out of your way to force this, the drive is no longer healthy and should be replaced. – 3dinfluence Jan 20 '10 at 04:11
  • No, I only downvoted one response because it didn't answer my question. You suggested SpinRite, thanks! My understanding is a healthy drive will *not* remap a bad sector until it's written to. I'm trying to find the simplest way to force a write. I'm going to try Matthew's suggestion and see whether fsck is smart enough to do it. – Nelson Jan 20 '10 at 15:39
  • Sorry, I jumped to conclusions there. After seeing two answers voted down quickly and you responding to the other answer, I assumed that was you. – 3dinfluence Jan 20 '10 at 16:27
  • 2
    You are correct that the bad sector remap happens when a write to a block fails. If you just have a corrupted block as far as the file system is concerned, then fsck may sort out your issue if the block in question is a metadata block. fsck really just scans and corrects errors in the metadata, so it makes no guarantees about the data itself. Next-gen filesystems like BTRFS and ZFS can detect data errors and, if you have redundancy, correct them. SpinRite would also force this, as it reads each block, writes the inverted data, rereads it, then inverts the data back as part of its scan. – 3dinfluence Jan 20 '10 at 16:31
1

If you have backups and you know this is a logical error and not a physical one, then the best way to go about this would be to zero out the disk.

I would use MHDD. It is fairly easy to use, and as long as you remember to set your HDD to IDE emulation in the BIOS, and back to AHCI when your work is done, you have nothing to worry about.

Once you boot into MHDD, pick your drive, type in the ERASE command, and confirm your choice.

Get yourself a coffee; this might take a while.

After the drive is zeroed out, run a scan (F4) with Remap set to ON (the default is off). If there are still issues with the drive (which would mean there is physical damage on the platter and the drive is on a steady downward slope), this option will "fix" them by mapping the damaged areas to healthy parts of the drive.

If there are no UNC errors, then congratulations: you and your drive can still be friends for years to come.

Jahith
-1

If the disk is going bad, replace it. It's not worth the risk that it will fall apart more.

Michael Graff
  • I was explicit about knowing the disk is bad and having backups to avoid the risk. – Nelson Jan 20 '10 at 02:13
  • 2
    That just means you're willing to gamble. I don't think that means it should not be replaced, just that you're willing to ignore that advice. I doubt any backups can save your system from itself as the disk falls apart, and things will just get very flaky as things degrade. – Michael Graff Jan 20 '10 at 02:26
  • 3
    This answer should be a comment... The question is specific and exhaustive, and therefore this isn't an answer. – Pitto Jan 18 '12 at 15:44