42

I just tried to run a test on my hdd and it doesn't want to complete a self test. Here is the result:

smartctl --attributes --log=selftest /dev/sda
smartctl 6.2 2013-07-26 r3841 [x86_64-linux-3.13.0-32-generic] (local build)

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       697
  3 Spin_Up_Time            0x0027   206   160   021    Pre-fail  Always       -       691
  4 Start_Stop_Count        0x0032   074   074   000    Old_age   Always       -       26734
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       28
  9 Power_On_Hours          0x0032   090   090   000    Old_age   Always       -       7432
 10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   097   097   000    Old_age   Always       -       3186
191 G-Sense_Error_Rate      0x0032   001   001   000    Old_age   Always       -       20473
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       84
193 Load_Cycle_Count        0x0032   051   051   000    Old_age   Always       -       447630
194 Temperature_Celsius     0x0022   113   099   000    Old_age   Always       -       34
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   100   253   000    Old_age   Offline      -       0

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed: read failure       90%      7432         92290592
# 2  Conveyance offline  Completed: read failure       90%      7432         92290596
# 3  Conveyance offline  Completed: read failure       90%      7432         92290592
# 4  Short offline       Completed: read failure       90%      7431         92290596
# 5  Extended offline    Completed: read failure       90%      7431         92290592

So is this disk failing?

Sven
Michel
  • When I use the graphic tool it says self-test-failed – Michel Nov 24 '14 at 08:44
  • 3
    The repeated `read failure` messages usually indicate a failing disk, so yes... – HBruijn Nov 24 '14 at 08:48
  • 23
    Michel, welcome to SF, and thanks for a good first question. As you may see if you decide to stay around these parts (which I hope you will), a good first question is a rare and precious thing. You had a hypothesis appropriate to the site (*"my HDD is failing*"), you found the relevant tool and learned how to use it, but needed some help in interpreting the results. So you came here, gave us all the relevant information, no surplus rubbish, and asked a question that was a model of concision. Thank you - please stay around! – MadHatter Nov 24 '14 at 09:28
  • 3
    +1: Excellent first question. To make the most of Server Fault, please register your account, and check out some of the other sites on the [se] network. We hope to see you contribute more high-quality content to Stack Exchange. – bwDraco Nov 25 '14 at 06:40

6 Answers

43

Your drive is very happy to do a self-test; from the summary, it has done at least five of them in the past hour or two. And all of them have failed, early on in the test, with read errors.

Yes, this hard drive is failing. As the famous Google Labs report said (though I can't put my hand on a link to it at the moment), if smartctl says your drive is failing, it probably is (I paraphrase).

Edit: don't try to save it. Get all the data off it, and replace it.

MadHatter
  • Ok Thanks a lot, is there anything I can do to save it or should I just backup everything I can and throw it away ? – Michel Nov 24 '14 at 08:52
  • 9
    If it's failing, it's failing. Repairing it may be technically possible, but extremely unlikely to be cost effective compared to the cost of a new drive. – Sobrique Nov 24 '14 at 09:43
  • 7
    @Michel An absence of a self-test error isn't proof that a drive *is not* failing, sadly, but the presence of a self test error should always be considered proof that it *is* failing. – Rob Moir Nov 24 '14 at 15:28
  • @Michel It will only get worse with time, and you don't know where the next error will appear, so the drive is essentially unreliable from now on. Replace it. – Thorbjørn Ravn Andersen Nov 24 '14 at 23:45
  • 1
    @Michel: You could try to replace the cables. Sometimes a drive can fail not because of problems in the drive, but because of bad power or data cables. – Thomas Padron-McCarthy Nov 25 '14 at 14:02
  • I'm sorry, but those self-tests are not run automatically; you have to tell the drive to run them. You do that with smartctl -t TEST, where TEST is one of offline, short, long, conveyance, select, etc. – Jorge Nerín Nov 25 '14 at 21:40
  • @JorgeNerín: yes indeed, and all the evidence is that (s)he has done exactly that. Look at parameter 9, Power-On Hours, and compare that value with the lifetime at which the last five tests were run: that is, all within the last hour or two. – MadHatter Nov 26 '14 at 08:53
  • I know, I was trying to explain that those self-tests are not initiated by the drive itself. It is the smartctl tool that sends the hard drive a command to run a self-test (short, long, ...). It is called a "self-test" because the test is carried out by the drive's firmware, not because the drive starts it. The drive didn't decide to run the self-tests; the user told it to run the short and conveyance tests. BTW the conveyance test does not make much sense here ("conveyance: This self-test routine is intended to identify damage incurred during transporting of the device."); I would have tried long. – Jorge Nerín Nov 26 '14 at 09:11
  • 1
    @JorgeNerín: I think you make an excellent point, but the evidence is that both I and the OP already understand it - the OP must, for (s)he has initiated at least five of them in the past two hours. As for tests, I agree with you that a long test would be a better indicator that the drive is healthy, but when it fails both short and conveyance tests in the first 10% of the drive, I think we may reasonably conclude the drive is shot. What do you hope would be revealed by more extensive testing? – MadHatter Nov 26 '14 at 09:37
  • Nothing new, I'm pretty convinced that this drive should be replaced ASAP; it has already lost data (16 sectors are unreadable). I just wanted to clarify that the drive did not run the tests on its own, as the first sentence of your answer suggests. – Jorge Nerín Nov 26 '14 at 12:23
  • 2
    @JorgeNerín that makes sense! I only spoke so because the OP started off by anthropomorphising his drive: "*I just tried to run a test on my hdd and **it doesn't want to complete** a self test*". I don't think either of us thinks the drive is alive, nor that it schedules self-tests by itself! – MadHatter Nov 26 '14 at 12:53
  • 1
    I also wanted to add that when your drive is in warranty, most vendors will replace it for you if you can show them it has a SMART error, so you have no excuse not to replace it! – Jens Timmerman Nov 28 '14 at 08:55
10

To answer your question, a failed SMART test is a surefire indication of imminent drive failure. You should back up your data and replace the drive as soon as possible to prevent potential data loss.

@sj0h mentioned the Load Cycle Count, which is very high at 447,630. (Most modern hard drives are designed to withstand 600,000 load/unload cycles.) This is typically caused by the Advanced Power Management (APM) feature, which tries to conserve power by parking the heads (unloading them from the platters) after several seconds of idle. The heads are loaded back onto the platters when needed. On most systems, where hard drives get intermittent, on-and-off activity, this can cause lots of load/unload cycles to occur. To turn APM off, run the following command at a root prompt:

smartctl -s apm,off /dev/sda

This command will need to be run each time the system is power-cycled or put to sleep or the drive is otherwise powered off, as this setting is not retained when the drive is turned off.
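Since the setting is lost at power-off, one way to reapply it is a small script run at boot or on resume. The following is only a minimal sketch: `apm_off_commands` is a hypothetical helper name, the device list is an assumption, and hooking the script into the system's boot or resume mechanism is not shown. It prints the commands rather than running them, so it is safe to try.

```shell
#!/bin/sh
# Print one "smartctl -s apm,off" command per device given as an argument.
# Printing instead of executing keeps the sketch harmless;
# pipe the output to "sh" as root to actually apply the setting.
apm_off_commands() {
    for dev in "$@"; do
        printf 'smartctl -s apm,off %s\n' "$dev"
    done
}

# Example call: the device list is an assumption, adjust it to your system.
apm_off_commands /dev/sda
```

To check whether the setting took effect, `smartctl -g apm /dev/sda` should report the current APM state.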

In my experience, doing this will dramatically reduce the number of load/unload cycles and consequently the chances you'll experience this sort of failure again in the future. Do note, however, that doing this increases power consumption and drive temperature. If the drive constantly runs at temperatures in excess of 50 °C, the risk of premature failure is increased, so you may want to leave APM on (or turn it on if it is off) during the warmer months.

bwDraco
2

Apart from the read failures, consider also the Load Cycle Count. At nearly 450,000 it may point to a reason for the failure, or at least to heavy load-cycle wear. With 7,432 power-on hours, that works out to about one load cycle for every minute of powered-up time. After you replace the drive, make sure that the new drive isn't doing this as well.
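The per-minute figure can be checked from two raw values in the question's output, Load_Cycle_Count (447630) and Power_On_Hours (7432). A small sketch of the arithmetic, with `cycles_per_hour` as a made-up helper name:

```shell
#!/bin/sh
# Average load cycles per power-on hour, from two raw SMART values:
#   $1 = Load_Cycle_Count (attribute 193)
#   $2 = Power_On_Hours   (attribute 9)
cycles_per_hour() {
    awk -v cycles="$1" -v hours="$2" 'BEGIN { printf "%.1f\n", cycles / hours }'
}

# Values taken from the smartctl output in the question.
cycles_per_hour 447630 7432
```

This comes out at roughly 60 cycles per hour, i.e. about one per minute. Running the same calculation on the replacement drive after a few days would show whether it parks its heads just as aggressively.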

sj0h
2

Yes. You have 16 unreadable sectors, and the several tests you ran all failed in roughly the same area of the drive. So back up fast, but keep in mind that some data is already inaccessible, around sectors 92290592 and 92290596.

You may have other problematic areas, and you don't yet know whether those 16 sectors are consecutive or spread out. If you want to experiment after backing up, you can run a selective self-test with smartctl -t select,startlba-endlba.
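As an illustration of the selective test, here is a sketch that builds the range argument around a reported LBA. The helper name `select_range` and the margin of 1000 sectors are arbitrary choices for this example, and the sketch does not check the range against the size of the drive. It only prints the resulting command.

```shell
#!/bin/sh
# Build the argument for a selective self-test covering MARGIN sectors
# on either side of a reported LBA.
#   $1 = LBA_of_first_error from the self-test log
#   $2 = margin in sectors (an arbitrary choice)
select_range() {
    start=$(( $1 - $2 ))
    if [ "$start" -lt 0 ]; then
        start=0
    fi
    end=$(( $1 + $2 ))
    printf 'select,%s-%s\n' "$start" "$end"
}

# Print (not run) the command for the first failing LBA in the question.
echo "smartctl -t $(select_range 92290592 1000) /dev/sda"
```

Once the test has finished, the outcome shows up in the self-test log, as in the question.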

Current_Pending_Sector means that the hard disk firmware has tried to read the sector but could not. It will retry a few more times (whenever the OS asks for that sector) until it gives up and marks the sector as Offline_Uncorrectable, or, if the OS writes to it, it will replace the damaged sector with a spare one (increasing Reallocated_Sector_Ct in the process).
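To keep an eye on these three counters over time, their raw values can be pulled out of the attribute table. A minimal sketch, assuming the ten-column layout shown in the question (raw value in the last column); `sector_counters` is a hypothetical helper name:

```shell
#!/bin/sh
# Read "smartctl --attributes" output on stdin and print the raw values
# of the three sector-health attributes (IDs 5, 197 and 198).
sector_counters() {
    awk '$1 == 5 || $1 == 197 || $1 == 198 { print $2 "=" $10 }'
}

# Sample lines copied from the question; on a live system one would pipe
# "smartctl --attributes /dev/sda" into the function instead.
sector_counters <<'EOF'
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       16
198 Offline_Uncorrectable   0x0030   100   253   000    Old_age   Offline      -       0
EOF
```

A pending count that drops while the reallocated count rises would match the behaviour described above.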

Jorge Nerín
1

I would personally replace the drive. If, for some reason, you do not want to do that yet and would rather keep using the drive for a while, you need some way to ensure that you do not accidentally use the bad areas for new files.

I had such a drive on an old Mac just recording video, and decided that I did not want to change it yet, as the videos were just nice to have. So I needed to isolate the errors. First I created an empty folder only for bad files, and then I tried to read all existing files on the disk; any that produced a read error were moved to the bad-files directory (hopefully only unimportant ones).

Then I created a lot of uniquely named one-megabyte files to fill up the hard drive (so all free space was now inside one of these 1 MB files) and repeated the procedure. All files with errors in them were moved to the bad-files directory; those left were good and could be deleted to reclaim the space, while the bad areas stay occupied by the files in the bad-files directory.
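The procedure could be scripted roughly as follows. This is a sketch and not the exact commands I used: the function names and file naming are made up, and the fill function takes a fixed file count rather than running until the disk is full. On a real disk the files should be read back only after a remount or reboot, because freshly written files may otherwise be served from memory instead of from the platters.

```shell
#!/bin/sh
# Create COUNT files of 1 MiB each in DIR (fill_0001.bin, fill_0002.bin, ...).
# Stops early if a write fails, for example because the disk is full.
make_fill_files() {
    dir=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        name=$(printf '%s/fill_%04d.bin' "$dir" "$i")
        dd if=/dev/zero of="$name" bs=1048576 count=1 2>/dev/null || break
        i=$(( i + 1 ))
    done
}

# Read every regular file in DIR from start to end;
# move any file that cannot be read completely to BADDIR.
quarantine_unreadable() {
    dir=$1
    baddir=$2
    mkdir -p "$baddir"
    for f in "$dir"/*; do
        [ -f "$f" ] || continue
        cat "$f" > /dev/null 2>&1 || mv "$f" "$baddir"/
    done
}
```

A call such as `make_fill_files /mnt/disk/fill 100000` followed later by `quarantine_unreadable /mnt/disk/fill /mnt/disk/bad-files` would mirror the steps above; both paths are placeholders.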

You can now use the drive a bit longer, but do not use it for important stuff. It will fail more and it will most likely be inconvenient when it happens.

1

This is not a very good sign. You should make sure that the contents of the disk are backed up, and not use the disk for anything important.

However, I have seen disks with failed sectors that reallocated them and remained operational for years, so you could keep it around for a while, e.g., for unimportant stuff, or additional backups.

One thing to do then would be to see which files were corrupted by the unreadable sectors, and write to these sectors to force reallocation by the disk (moving them from "Current_Pending_Sector" to "Reallocated_Sector_Ct"). If using Linux, see http://smartmontools.sourceforge.net/badblockhowto.html. Once the sectors have been reallocated, the self-test should either pass or report more unreadable sectors.
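The first step in that HOWTO is to convert the failing LBA into a file system block number with the formula b = (int)((L-S)*512/B). A small sketch of that arithmetic, assuming 512-byte sectors; `lba_to_fs_block` is a hypothetical helper, and the partition start and block size below are made-up example values, not taken from the question:

```shell
#!/bin/sh
# Convert a failing LBA into a file system block number, using the formula
# from the Bad block HOWTO:  b = (int)((L - S) * 512 / B)
#   $1 = L, failing LBA reported by smartctl (512-byte sectors assumed)
#   $2 = S, starting sector of the partition (e.g. from "fdisk -lu")
#   $3 = B, file system block size in bytes (e.g. from "tune2fs -l")
lba_to_fs_block() {
    echo $(( ($1 - $2) * 512 / $3 ))
}

# L is from the question; S=2048 and B=4096 are made-up example values.
lba_to_fs_block 92290592 2048 4096
```

The HOWTO then explains how to map that block to a file before anything is overwritten.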

I disagree with most answers in that I do not think that bad sectors are necessarily an indication of imminent failure. As http://blog.mmueh.net/index.php/2010/12/09/luks-meets-badblocks/ says, "every harddrive starts to produce bad sectors at some point in its life".

a3nm
  • While I do agree that failure is not certain once a bad sector appears, the likelihood of a drive failing after one bad sector increases significantly (I think that was in the Google report as well, but I cannot find the actual source currently) – Dennis Nolte Nov 26 '14 at 08:50