13

I have an Intel X25-M drive that was marked "failed" twice in a ZFS storage array, as noted here. However, after removing the drive, it seems to mount, read and write fine in other computers (Mac, PC, USB enclosure, etc.).

Is there a good way to determine the drive's present health? I suspect the earlier failure in the ZFS setup was a convergence of bugs, bad error reporting and hardware issues. It seems like this drive may have some life left in it, though.

ewwhite

4 Answers

12

A good, but not infallible, way of checking any drive's health is to check its SMART attributes.

Below is the SMART attribute set for an Intel X25-M G2 160GB disk, taken using smartctl v5.41. (The version is important: earlier versions of smartctl had different attribute-name mappings and didn't correctly interpret the vendor-specific attribute table for this drive.)

# ./smartctl -data -A /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-2.6.18-194.32.1.el5] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED     RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       1
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       4076
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       67
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       30
225 Host_Writes_32MiB       0x0030   200   200   000    Old_age   Offline      -       148418
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       755
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       49
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       16956537
232 Available_Reservd_Space 0x0033   099   099   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   099    Pre-fail  Always       -       0

This shows that the drive has had 1 reallocated sector, has used 1% of its available reserved space (attribute 232) and 2% of its projected program/erase cycles (attribute 233). It has had 148418 * 32MiB (attribute 225) written to it.
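To put attribute 225 into more familiar units, the conversion is simple arithmetic (a quick sketch; the 148418 raw value and the 32MiB unit come straight from the table above):

echo $(( 148418 * 32 )) MiB                          # 4749376 MiB of host writes
echo "scale=1; 148418 * 32 / 1024 / 1024" | bc       # about 4.5 TiB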

If the drive is showing any significant number of reallocated sectors, that may be cause for concern, as it probably points to a failing flash chip (in the same way that a significant number of reallocated sectors on a spinning disk generally points towards surface errors). End-to-End errors are also bad - I've had a few X25-M G2 160GB disks fail while reporting large (>1000) End-to-End error counts. There are really only two useful error-condition attributes present for these disks, though, as most of the useful SMART attributes for normal disks don't apply to SSDs.

However, SMART isn't generally regarded as 100% reliable. Google's study of disk failures found that while there were good correlations between the various SMART early-warning indicators and drive failure, SMART wasn't a useful tool for predicting the failure of an individual drive. For this reason I generally use SMART as a way of proving a drive is bad (if errors are showing, it's probably going to fail sometime soon), rather than proving a drive is still good.
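If you just want to keep an eye on the handful of attributes that matter here, rather than reading the whole table each time, something like the following will do (only a sketch: it assumes smartctl is installed, the SSD is /dev/sda, and the attribute IDs match the smartctl 5.41 output above):

smartctl -A /dev/sda | \
    awk '$1 == 5 || $1 == 184 || $1 == 232 || $1 == 233 {printf "%-24s value=%s raw=%s\n", $2, $4, $NF}'

A non-zero raw value on 5 or 184, or a normalised value on 232/233 that keeps dropping, is the sort of change worth paying attention to.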

Daniel Lawson
    Note that Google's study concentrated on pre-failure indications from SMART, which turned out to be less than reliable. Reporting on failure conditions is somewhat more accurate. – Chris S Jun 23 '11 at 03:43
2

Although it's made for "traditional" hard drives, the badblocks utility might be of some benefit, since it's meant to exercise every mappable sector on the drive. Because of the SSD's wear levelling and internal remapping it won't be able to tell you for sure that the drive is good. However, if it tells you the drive is bad, I would surely toss the drive out as dead.
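If it helps, the invocations would look something like this (only a sketch; /dev/sdb is a placeholder for wherever the SSD appears, and the -w test destroys all data on the drive):

badblocks -nsv /dev/sdb     # non-destructive read-write pass over every block, verbose with progress
badblocks -wsv /dev/sdb     # destructive write test, only if nothing on the drive matters any more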

bot403
1

HD Tune (and HD Tune Pro) are great tools for measuring the health and performance of your SSD. The free version (HD Tune) has a very limited feature set, but health analysis is part of it, so you lucked out. The Pro version has a 15-day trial period, which I highly recommend trying; it will give you a great, in-depth analysis of how your SSD performs.

0

For me, when "Reallocated_Sector_Ct" is anything but zero, I replace the disk.

Reallocated_Sector_Ct counts the sectors the disk has swapped out to its reserved spare pool. In the old days, a disk would always ship with a few bad sectors on day one; the disk could swap them out and you had a 100% working disk.

These days disks are far more complicated than that, so generally this swapping out only starts as the disk starts to fail.

This is a gross oversimplification, but you get the picture.

An alternative strategy would be to keep an eye on the number and check that it isn't going up. But often, when a disk starts going bad, you are only a short walk from a catastrophic failure. So, given the price of disks these days, I prefer to just throw them out rather than risk it.
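If you do go the monitoring route instead, a cron-able one-liner along these lines is enough to log the count over time (just a sketch; it assumes smartctl is installed and the disk is /dev/sda):

# append a dated Reallocated_Sector_Ct reading, e.g. from a daily cron job
echo "$(date -I) $(smartctl -A /dev/sda | awk '$1 == 5 {print $NF}')" >> /var/log/realloc-sda.log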

I've never lost data due to a disk failure.