12

We've been running an SSD (Intel X25-M) in a Linux (RHEL 5) server for a while, but never made any effort to figure out how much write load it was under for the past year. Is there any tool under Linux to tell us approximately how much has been written to the disk over time or (even better) how much wear it has accumulated? Just looking for a hint to see if it's near death or not...

JZeta

5 Answers

16

Intel SSDs do keep statistics on total writes and on how far through its likely lifespan the drive is.

The following is from an Intel X25-M G2 160GB (SSDSA2M160G2GC)

# smartctl -d ata -A /dev/sda
smartctl 5.40 2010-10-16 r3189 [x86_64-redhat-linux-gnu] (local build)
Copyright (C) 2002-10 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 5
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  3 Spin_Up_Time            0x0020   100   100   000    Old_age   Offline      -       0
  4 Start_Stop_Count        0x0030   100   100   000    Old_age   Offline      -       0
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       1
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6855
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       68
192 Unsafe_Shutdown_Count   0x0032   100   100   000    Old_age   Always       -       30
225 Host_Writes_32MiB       0x0030   200   200   000    Old_age   Offline      -       148487
226 Workld_Media_Wear_Indic 0x0032   100   100   000    Old_age   Always       -       3168
227 Workld_Host_Reads_Perc  0x0032   100   100   000    Old_age   Always       -       1
228 Workload_Minutes        0x0032   100   100   000    Old_age   Always       -       1950295543
232 Available_Reservd_Space 0x0033   099   099   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0
184 End-to-End_Error        0x0033   100   100   099    Pre-fail  Always       -       0

The Host_Writes_32MiB raw value shows how many 32MiB units of data have been written to this drive. Here that is 148487 x 32MiB, or roughly 4.5TiB.
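As a rough sketch of that arithmetic (the function names are my own, and the 32 MiB unit is taken from the attribute name, so check it against your own drive's attribute before trusting the result):

```python
MIB = 1024 ** 2
TIB = 1024 ** 4


def host_writes_bytes(raw_value, unit_mib=32):
    """Bytes written, assuming RAW_VALUE counts units of `unit_mib` MiB."""
    return raw_value * unit_mib * MIB


def host_writes_tib(raw_value, unit_mib=32):
    """The same quantity expressed in TiB."""
    return host_writes_bytes(raw_value, unit_mib) / TIB


if __name__ == "__main__":
    raw = 148487  # RAW_VALUE of attribute 225 in the output above
    print(f"{host_writes_tib(raw):.2f} TiB written")
```

For the drive above this works out to about 4.53 TiB over 6855 power-on hours.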

The Media_Wearout_Indicator value is a normalised percentage of how much of its useful wear-lifespan the drive has left (098 in the output above). It starts at 100 (or 099, I forget which) and counts down to 001, at which point Intel consider the drive to have exceeded its useful life. Intel use the MWI as part of warranty claims too: once the MWI reaches 001, the warranty is expired.
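If you want to pull that number out in a script, something like the following might do. It is only a sketch: the helper names are hypothetical, it assumes the ten-column row layout shown in the output above, and it assumes the value counts down from 100.

```python
def parse_attribute_row(row):
    """Split one `smartctl -A` attribute row into named fields.

    Assumes the ten-column layout shown above. Only the first token of
    RAW_VALUE is kept, so raw values with trailing notes are truncated.
    """
    parts = row.split()
    return {
        "id": int(parts[0]),
        "name": parts[1],
        "value": int(parts[3]),   # normalised VALUE column, e.g. "098" -> 98
        "worst": int(parts[4]),
        "thresh": int(parts[5]),
        "raw": parts[9],
    }


def wear_used_percent(normalized_value):
    """Percent of rated wear consumed, assuming the value counts down from 100."""
    return max(0, 100 - normalized_value)


row = "233 Media_Wearout_Indicator 0x0032   098   098   000    Old_age   Always       -       0"
attr = parse_attribute_row(row)
print(attr["name"], "wear used:", wear_used_percent(attr["value"]), "%")
```

For the row above this reports 2% of the rated wear consumed.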

The MWI reaching 001 does not mean the drive will fail immediately however! Intel will have tolerance built in to deal with variances in flash units. I've seen drives last well past this point, and I'm actively wear-testing some Intel 320 series SSDs to see how much longer they last.

However, as the warranty expires when the MWI reaches 001, I'd replace any drives at that point.
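To get a feel for how far away that point is, you could extrapolate from Power_On_Hours and the MWI. This is a back-of-the-envelope sketch only: it assumes wear accrues linearly under an unchanged workload and that the MWI started at 100, and with only a couple of points consumed the estimate is very coarse. The function name is made up for illustration.

```python
def projected_hours_remaining(power_on_hours, mwi_value, end_value=1):
    """Linear projection of power-on hours until the MWI reaches `end_value`.

    Returns None when no wear has been recorded yet, since there is
    nothing to extrapolate from.
    """
    used = 100 - mwi_value
    if used <= 0:
        return None
    hours_per_point = power_on_hours / used
    return hours_per_point * max(0, mwi_value - end_value)


if __name__ == "__main__":
    # Power_On_Hours and Media_Wearout_Indicator from the output above
    hours = projected_hours_remaining(6855, 98)
    print(f"{hours:.0f} hours, about {hours / 8766:.0f} years")
```

With the figures from the drive above, the projection is decades away, which mostly tells you the workload is light rather than giving a precise date.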

Daniel Lawson
  • For future reference, the `Media_Wearout_Indicator` starts at 100 for my Intel 520 Series SSD. – pableu Jan 07 '13 at 13:09
  • It's worth noting that even if the drive doesn't "fail" once it reaches 001, at some point afterwards (perhaps a long ways afterwards), some drives' ability to retain data when power is lost goes down to alarmingly short amounts of time. I think there have been some endurance tests posted online that have measured this. – sa289 Aug 21 '15 at 20:36
6

Corsair drives also export a similar percentage-life-left indicator. In their case it is attribute 231:

231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0

(Note that if smartctl is displaying this as a Temperature you need to update your device database. On my Debian system that means running /usr/sbin/update-smart-drivedb)

A Corsair blog post seems to show that the value never goes below 10% so I presume it should be replaced at 10%.

I also have an OCZ drive with the same SandForce controller, which exports the same SSD_Life_Left value.
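Since the attribute name depends on how current your drive database is, a script is probably safer matching on the attribute ID than on the name. A minimal sketch, assuming IDs 233 (Intel) and 231 (SandForce-based drives) as described in this thread; other vendors may use these IDs for something else entirely, and the helper name and sample text are mine:

```python
# Attribute IDs mentioned in this thread; vendor-specific, not universal.
WEAR_ATTRIBUTE_IDS = {233: "Media_Wearout_Indicator", 231: "SSD_Life_Left"}


def find_wear_attributes(smartctl_output):
    """Return {attribute_id: normalised VALUE} for known wear attributes."""
    found = {}
    for line in smartctl_output.splitlines():
        parts = line.split()
        # Attribute rows have at least ten columns and start with a numeric ID.
        if len(parts) < 10 or not parts[0].isdigit() or not parts[3].isdigit():
            continue
        attr_id = int(parts[0])
        if attr_id in WEAR_ATTRIBUTE_IDS:
            found[attr_id] = int(parts[3])
    return found


sample = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0032   100   100   000    Old_age   Always       -       6855
231 SSD_Life_Left           0x0013   100   100   010    Pre-fail  Always       -       0
"""
print(find_wear_attributes(sample))
```

You would feed it the text of `smartctl -A` for the drive in question rather than the hard-coded sample.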

Graham
3

The Media_Wearout_Indicator is what you are looking for. A value of 100 means your SSD has 100% of its life left; the lower the number, the less life is left.

# smartctl -a /dev/sda | grep Media_Wearout_Indicator

Output from my laptop

233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0

If you want to see more details and full attributes from your drive, you can run

# smartctl -d ata -A /dev/sda

and the output

# smartctl -d ata -A /dev/sda
smartctl 5.41 2011-06-09 r3365 [x86_64-linux-3.2.0-49-generic] (local build)
Copyright (C) 2002-11 by Bruce Allen, http://smartmontools.sourceforge.net

=== START OF READ SMART DATA SECTION ===
SMART Attributes Data Structure revision number: 10
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0032   100   100   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   000   000   000    Old_age   Always       -       232959027031342
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       279
170 Unknown_Attribute       0x0033   100   100   010    Pre-fail  Always       -       0
171 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       0
174 Unknown_Attribute       0x0032   100   100   000    Old_age   Always       -       278
184 End-to-End_Error        0x0033   100   100   090    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       278
225 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       10752
226 Load-in_Time            0x0032   100   100   000    Old_age   Always       -       65535
227 Torq-amp_Count          0x0032   100   100   000    Old_age   Always       -       66
228 Power-off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       65535
232 Available_Reservd_Space 0x0033   100   100   010    Pre-fail  Always       -       0
233 Media_Wearout_Indicator 0x0032   100   100   000    Old_age   Always       -       0
241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       10752
242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       21803
249 Unknown_Attribute       0x0013   100   100   000    Pre-fail  Always       -       357

http://namhuy.net/1024/how-to-check-ssd-life-left.html

1

Not really. If the drive doesn't keep statistics, you wouldn't know for sure. Even then, the drive abstracts its wear-leveling algorithms and the like to optimize things under the hood, away from the system calls and interfaces. In other words, the drive could easily lie to you about where the data is actually written on the "media", so you wouldn't know which cells are getting activity.

That still doesn't guarantee when/if you'll see failures or errors. Drive could fail tomorrow, could fail in three years.

Best bet is to keep it in a RAID configuration, have a plan in place to replace it when it does fail (before the other drive fails), and make sure your backups are current.

Bart Silverstrim
0

It seems one can't use the standard "sudo smartctl -A /dev/sda" on Intel SSDs, at least not on my Intel 545s 128GB SSD. To see the info needed to work out TBW, run "sudo smartctl -q noserial -x /dev/sda" and find the line that says "logical sectors written". Take the long number there and convert it with the following formula: long number * 512 / 1024^4 = TBW (or, written out, long number * 512 / 1024/1024/1024/1024 = TBW).
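The same conversion as a small sketch. The 512-byte sector size comes from the formula in this answer and is an assumption about your drive, the function names are mine, and the sample sector count is made up. Note that dividing by 1024^4 strictly gives TiB; rated TBW figures are usually quoted in decimal terabytes, so the sketch prints both.

```python
def sectors_to_bytes(logical_sectors_written, sector_size=512):
    """Total bytes written, assuming `sector_size`-byte logical sectors."""
    return logical_sectors_written * sector_size


def bytes_to_tib(n_bytes):
    """Binary terabytes (TiB), as in the 1024^4 formula above."""
    return n_bytes / 1024 ** 4


def bytes_to_tb(n_bytes):
    """Decimal terabytes (TB), the unit most rated TBW figures use."""
    return n_bytes / 10 ** 12


if __name__ == "__main__":
    sectors = 10_000_000_000  # hypothetical "logical sectors written" value
    written = sectors_to_bytes(sectors)
    print(f"{bytes_to_tib(written):.2f} TiB, {bytes_to_tb(written):.2f} TB")
```

The two figures differ by about 10%, which rarely matters for a "near death or not" check but is worth knowing when comparing against a spec sheet.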

For the record, it seems most decent SSDs have an official rated write life of at least 75TBW, but will likely go well beyond that before they fail from writes.

Generally speaking, assuming an SSD only dies from data being written to it, which is typically how it wears out, what I posted here is a good guideline.