We have a small "homemade" server running fully updated Debian Wheezy (amd64). One hard drive installed: WDC WD6400AAKS. The motherboard is ASUS M4N68T V2.

The usual load:

  • CPU: an average of 20%
  • Disk: about 50GB of additional space is occupied each week (roughly 47GB of uploaded files and 3GB of MySQL data).

I'm afraid that the hard drive may be about to fail. I saw Pre-fail in a few places when I ran:

root@SERVER:/tmp# smartctl -a /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-4-amd64] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

Model Family:     Western Digital Caviar Blue Serial ATA
Device Model:     WDC WD6400AAKS-XXXXXXX
LU WWN Device Id: 5 0014ee XXXXXXXXXXXXX
Firmware Version: 01.03B01
User Capacity:    640,135,028,736 bytes [640 GB]
Sector Size:      512 bytes logical/physical
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   8
ATA Standard is:  Exact ATA specification draft version not indicated
Local Time is:    Mon Oct 28 18:55:27 2013 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x85) Offline data collection activity
                    was aborted by an interrupting command from host.
                    Auto Offline Data Collection: Enabled.
Self-test execution status:      ( 247) Self-test routine in progress...
                    70% of test remaining.
Total time to complete Offline 
data collection:        (11580) seconds.
Offline data collection
capabilities:            (0x7b) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    Offline surface scan supported.
                    Self-test supported.
                    Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 136) minutes.
Conveyance self-test routine
recommended polling time:    (   5) minutes.
SCT capabilities:          (0x303f) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   157   146   021    Pre-fail  Always       -       5108
  4 Start_Stop_Count        0x0032   098   098   000    Old_age   Always       -       2968
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   051    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   079   079   000    Old_age   Always       -       15445
 10 Spin_Retry_Count        0x0032   100   100   051    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   100   051    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   098   098   000    Old_age   Always       -       2950
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       426
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       2968
194 Temperature_Celsius     0x0022   111   095   000    Old_age   Always       -       36
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   160   000    Old_age   Always       -       21716
200 Multi_Zone_Error_Rate   0x0008   200   200   051    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%     15444         -

Error SMART Read Selective Self-Test Log failed: scsi error aborted command
Smartctl: SMART Selective Self Test Log Read Failed

In one tutorial I read that Pre-fail is an indication of a coming failure; in another tutorial I read that this is not true. Can you help me decode the output of smartctl?

It would also be nice to get suggestions on what I should do to ensure data integrity (about 50GB of new data each week, up to 2TB over the whole period I'm interested in). Maybe I should go with 2x2TB Caviar Black in RAID4?

  • Make sure your backups are working, up to date, and more importantly, restore properly. RAID is not a substitute for a backup. – afrazier Oct 29 '13 at 00:26
  • Good note! 10x. I was thinking of using two hard drives RAID1, and when the one is going to fail (or failed) to just remove it and continue with the other drive while waiting for new drive. I'm not sure if this is possible with our current motherboard however. – Glister Oct 29 '13 at 11:39
  • Individual hard drives are simply not that expensive. There's no reason not to have a 3rd drive as a hot spare or sitting on a shelf ready to be installed at a moment's notice. Then you're not in the uncomfortable place where you're hoping the last drive doesn't die while you're waiting for the replacement and rebuild. – afrazier Oct 29 '13 at 13:08

2 Answers


"Completed without error"

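The table that mentions Pre-fail lists the attributes the drive monitors, each with its threshold (the third three-digit number on each row). Pre-fail is only the type of the attribute, not its current state. None of your worst values (the second three-digit number) has dropped to its threshold.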
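To make that comparison concrete, here is a minimal Python sketch that reads a few attribute rows copied from the output in the question and lists any attribute whose worst value has reached its threshold. The function names (`parse_attributes`, `past_threshold`) are made up for illustration, and the parser assumes the ten-column row layout shown above; it is not part of smartmontools.

```python
# Rows copied verbatim from the smartctl output in the question.
SAMPLE = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       0
  3 Spin_Up_Time            0x0027   157   146   021    Pre-fail  Always       -       5108
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
199 UDMA_CRC_Error_Count    0x0032   200   160   000    Old_age   Always       -       21716
"""


def parse_attributes(text):
    """Turn attribute rows into dicts (hypothetical helper, assumes 10 columns)."""
    rows = []
    for line in text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        rows.append({
            "id": int(fields[0]),
            "name": fields[1],
            "value": int(fields[3]),
            "worst": int(fields[4]),
            "thresh": int(fields[5]),
            "type": fields[6],
            "raw": fields[9],
        })
    return rows


def past_threshold(rows):
    """Names of attributes whose worst normalized value is at or below the threshold."""
    # A threshold of 0 cannot be reached, so those rows are skipped.
    return [r["name"] for r in rows if r["thresh"] > 0 and r["worst"] <= r["thresh"]]


if __name__ == "__main__":
    print(past_threshold(parse_attributes(SAMPLE)))  # prints [] for these rows
```

For the rows above the list is empty, which matches the PASSED overall assessment: "Pre-fail" appears in the TYPE column whether or not anything is wrong.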

Get 2 disks and use RAID1, or get 4 disks and use RAID10. RAID4 is essentially not used anymore.

  • 10x, can you suggest what drives to choose? Something in the price range of WD Blacks. – Glister Oct 28 '13 at 22:44
  • If you're going to use WD drives, make sure to get drives designed to run in a RAID, like the WD Reds. Google for TLER for more information. – afrazier Oct 29 '13 at 00:25
  • Thanks @JamesRyan. I marked the other answer because it was more detailed. However I'm new here and I'm not sure if I should mark the first answer or the best answer. If I'm wrong I'll fix myself, because I respect the people trying to help others. – Glister Oct 29 '13 at 20:15
  • The other answer is longer; personally I'm not sure it added much that is useful, and the votes reflect that. However, you should accept the answer that was best for you. – JamesRyan Oct 30 '13 at 12:42

As was mentioned already, the "Pre-fail" text only indicates the type of the entry, and depending on the entry, not every one records an error at all. Spin_Up_Time, for example, just reports how long the drive takes to spin up, Start_Stop_Count counts how often the drive was started, and Load_Cycle_Count counts how often the heads were parked.

The value that gives me pause is UDMA_CRC_Error_Count at 21716. That could be caused by a bad/loose cable or electronic failure.

Reallocated_Sector_Ct, Current_Pending_Sector or Offline_Uncorrectable rise if there is a surface error. They are all 0, so the disk might be totally ok.
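As a rough sketch of how I would read those raw values, the Python snippet below flags non-zero surface-error counters and treats the CRC counter separately. It assumes the raw UDMA_CRC_Error_Count is a lifetime total, so what matters is whether it keeps growing after the cable has been checked. The function name `review_raw_values` and the `previous_crc` parameter are hypothetical, chosen only for this example.

```python
# Counters that rise when the drive finds bad sectors on the platters.
SURFACE_COUNTERS = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)


def review_raw_values(raw, previous_crc=None):
    """Return human-readable findings for a dict of raw SMART values.

    previous_crc is an earlier reading of UDMA_CRC_Error_Count, if one exists
    (an assumption of this sketch: the counter is cumulative).
    """
    findings = []
    for name in SURFACE_COUNTERS:
        count = raw.get(name, 0)
        if count > 0:
            findings.append(f"{name}={count}: possible surface error")

    crc = raw.get("UDMA_CRC_Error_Count", 0)
    if previous_crc is None:
        # Only one reading: a non-zero total is worth a look at the cable.
        if crc > 0:
            findings.append(
                f"UDMA_CRC_Error_Count={crc}: check the cable, then re-read later"
            )
    elif crc > previous_crc:
        # The total grew since the last reading, so errors are ongoing.
        findings.append(
            f"UDMA_CRC_Error_Count grew by {crc - previous_crc}: "
            "link errors are still occurring"
        )
    return findings


if __name__ == "__main__":
    # Raw values taken from the smartctl output in the question.
    report = {
        "Reallocated_Sector_Ct": 0,
        "Current_Pending_Sector": 0,
        "Offline_Uncorrectable": 0,
        "UDMA_CRC_Error_Count": 21716,
    }
    for finding in review_raw_values(report):
        print(finding)
```

With the values from the question, only the CRC line is reported. If a later reading still shows 21716 after reseating or replacing the cable, passing it as `previous_crc` yields no findings.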

The following indicates that there is a test still running:

Self-test execution status:      ( 247) Self-test routine in progress...
                    70% of test remaining.

Once it's finished, there should be another entry in the section "SMART Self-test log structure". You should use the long test (smartctl -t long) to get meaningful results. The logged test was only a short one.
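A small sketch of how that could be planned, in Python: it only builds the two smartctl command lines (`-t long` to start the test, `-l selftest` to read the self-test log) and works out when to look again, using the 136 minutes that the drive reports as the recommended polling time for the extended test. Nothing is executed here; the function name `long_test_plan` and the returned dictionary layout are made up for this example.

```python
from datetime import datetime, timedelta


def long_test_plan(device, polling_minutes, start):
    """Build the commands for a long self-test and the time to check the log.

    polling_minutes comes from the drive's own report
    ("Extended self-test routine recommended polling time").
    """
    return {
        # Starts the extended (long) self-test; the drive runs it in the background.
        "start_cmd": ["smartctl", "-t", "long", device],
        # Prints the self-test log, where the finished test should appear.
        "log_cmd": ["smartctl", "-l", "selftest", device],
        # Earliest sensible time to look at the log.
        "check_after": start + timedelta(minutes=polling_minutes),
    }


if __name__ == "__main__":
    plan = long_test_plan("/dev/sda", 136, datetime.now())
    print(" ".join(plan["start_cmd"]))
    print("check the log after", plan["check_after"].strftime("%Y-%m-%d %H:%M"))
    print(" ".join(plan["log_cmd"]))
```

The commands need root, and on a busy server the test may take longer than the reported polling time, so treat `check_after` as a lower bound.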

Regarding higher data integrity: RAID4 would need at least 3 disks and isn't used anymore (replaced by RAID5). In your case, either RAID1 with 2 disks or RAID10 with 4 disks would be fitting. Both modes halve the amount of space available but keep working if one drive fails. RAID10 has the advantage of being faster, as it distributes the load over more drives, and, depending on which drives fail, it can survive even 2 failed drives. If one disk is fast enough, RAID1 will be fast enough too (write speed is the same as a single disk; read speed can be up to doubled for larger amounts of data).
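To put numbers on that, here is a minimal Python sketch for the two layouts discussed: usable space, how many of the possible two-disk failures a 4-disk RAID10 survives, and how long the space lasts at 50GB per week. It assumes 2TB disks, 1TB = 1000GB, and a RAID10 built from the mirror pairs (0, 1) and (2, 3); all function names are hypothetical.

```python
from itertools import combinations


def usable_tb(level, disks, disk_tb):
    """Usable capacity in TB for the two layouts discussed above."""
    if level == "RAID1" and disks == 2:
        return disk_tb          # both disks hold the same data
    if level == "RAID10" and disks == 4:
        return 2 * disk_tb      # two mirror pairs, striped together
    raise ValueError("only 2-disk RAID1 and 4-disk RAID10 are modelled here")


def raid10_two_disk_survival(mirror_pairs=((0, 1), (2, 3))):
    """Count the two-disk failures that leave every mirror pair with one disk."""
    disks = [d for pair in mirror_pairs for d in pair]
    cases = list(combinations(disks, 2))
    # The array is lost only when both failed disks belong to the same pair.
    survived = [
        c for c in cases
        if not any(set(c) == set(pair) for pair in mirror_pairs)
    ]
    return len(survived), len(cases)


def weeks_until_full(usable, growth_gb_per_week=50):
    """Weeks until the usable space is consumed, assuming 1 TB = 1000 GB."""
    return usable * 1000 / growth_gb_per_week


if __name__ == "__main__":
    print(usable_tb("RAID1", 2, 2.0))        # 2.0
    print(usable_tb("RAID10", 4, 2.0))       # 4.0
    print(raid10_two_disk_survival())        # (4, 6)
    print(weeks_until_full(2.0))             # 40.0
```

So 2x2TB in RAID1 gives 2TB, about 40 weeks at the stated growth rate, and a 4-disk RAID10 survives 4 of the 6 possible two-disk failures under the pairing assumed here.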