I'm getting these errors in dmesg about half an hour after I turn on the computer:
[ 1355.677957] EXT4-fs error (device sda2): htree_dirblock_to_tree: inode #1318420: (comm updatedb.mlocat) bad entry in directory: directory entry across blocks - block=5251700offset=0(0), inode=1802725748, rec_len=179136, name_len=32
[ 1355.677973] Aborting journal on device sda2-8.
[ 1355.678101] EXT4-fs (sda2): Remounting filesystem read-only
[ 1355.690144] EXT4-fs error (device sda2): htree_dirblock_to_tree: inode #1318416: (comm updatedb.mlocat) bad entry in directory: directory entry across blocks - block=5251699offset=0(0), inode=2194783952, rec_len=53280, name_len=152
[ 1356.864720] EXT4-fs error (device sda2): htree_dirblock_to_tree: inode #1312795: (comm updatedb.mlocat) bad entry in directory: directory entry across blocks - block=5251176offset=1460(13748), inode=1432317541, rec_len=208208, name_len=119
/dev/sda is an SSD, and it's using the noop scheduler.
/etc/fstab entry:
UUID=acb4eefa-48ff-4ee1-bb5f-2dccce7d011f / ext4 errors=remount-ro,noatime,discard,user_xattr 0 1
System information:
$ cat /proc/mounts | grep /dev/sd
/dev/sda1 /boot ext2 rw,noatime,errors=continue 0 0
$ cat /etc/lsb-release
DISTRIB_ID=Ubuntu
DISTRIB_RELEASE=10.04
DISTRIB_CODENAME=lucid
DISTRIB_DESCRIPTION="Ubuntu 10.04.3 LTS"
$ uname -a
Linux leetpad 2.6.35-30-generic-pae #61~lucid1-Ubuntu SMP Thu Oct 13 21:14:29 UTC 2011 i686 GNU/Linux
Output of smartctl -a:
smartctl version 5.38 [i686-pc-linux-gnu] Copyright (C) 2002-8 Bruce Allen
Home page is http://smartmontools.sourceforge.net/
=== START OF INFORMATION SECTION ===
Device Model: STT_FTM28GX25H
Serial Number: P637510-MIBY-706A009
Firmware Version: 1916
User Capacity: 128,035,676,160 bytes
Device is: Not in smartctl database [for details use: -P showall]
ATA Version is: 8
ATA Standard is: Exact ATA specification draft version not indicated
Local Time is: Thu Nov 24 20:53:48 2011 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
See vendor-specific Attribute list for marginal Attributes.
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 0) seconds.
Offline data collection
capabilities: (0x1d) SMART execute Offline immediate.
No Auto Offline data collection support.
Abort Offline collection upon new
command.
Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
No Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x00) Error logging NOT supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 0) minutes.
Extended self-test routine
recommended polling time: ( 0) minutes.
SMART Attributes Data Structure revision number: 16
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
1 Raw_Read_Error_Rate 0x0000 005 000 000 Old_age Offline In_the_past 0
9 Power_On_Hours 0x0000 141 002 000 Old_age Offline - 0
12 Power_Cycle_Count 0x0000 115 002 000 Old_age Offline - 0
184 Unknown_Attribute 0x0000 084 000 000 Old_age Offline In_the_past 0
195 Hardware_ECC_Recovered 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
196 Reallocated_Event_Count 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
197 Current_Pending_Sector 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
198 Offline_Uncorrectable 0x0000 002 107 000 Old_age Offline - 21198
199 UDMA_CRC_Error_Count 0x0000 063 003 000 Old_age Offline - 26957
200 Multi_Zone_Error_Rate 0x0000 099 124 000 Old_age Offline - 446
201 Soft_Read_Error_Rate 0x0000 024 154 000 Old_age Offline - 328
202 TA_Increase_Count 0x0000 115 254 000 Old_age Offline - 115
203 Run_Out_Cancel 0x0000 247 245 000 Old_age Offline - 83
204 Shock_Count_Write_Opern 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
205 Shock_Rate_Write_Opern 0x0000 016 039 000 Old_age Offline - 0
206 Flying_Height 0x0000 005 000 000 Old_age Offline In_the_past 0
207 Spin_High_Current 0x0000 055 015 000 Old_age Offline - 0
208 Spin_Buzz 0x0000 248 001 000 Old_age Offline - 0
209 Offline_Seek_Performnce 0x0000 095 000 000 Old_age Offline In_the_past 0
211 Unknown_Attribute 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
212 Unknown_Attribute 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
213 Unknown_Attribute 0x0000 000 000 000 Old_age Offline FAILING_NOW 0
Warning: device does not support Error Logging
Warning! SMART ATA Error Log Structure error: invalid SMART checksum.
SMART Error Log Version: 1
No Errors Logged
Warning! SMART Self-Test Log Structure error: invalid SMART checksum.
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
Device does not support Selective Self Tests/Logging
I've run memtest for 7 hours; it didn't find any memory errors.
Does anyone have ideas about what could be going wrong here? The most plausible explanation I can think of is that the SSD is silently dropping some write requests, which eventually leads to an ext4 filesystem inconsistency (without any disk I/O errors being reported). How can this happen? Is there a relevant configuration option that I should make sure is set correctly?
What tools should I use to diagnose a hardware failure like this? Would it be possible to diagnose the SSD failure without overwriting data?
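To make the "without overwriting data" part concrete, here is a minimal sketch of the kind of read-only check I have in mind, assuming Python 3 is available. It reads the same region twice and reports chunks whose checksums differ between the two passes. The function names (`chunk_digests`, `drop_cached_pages`, `unstable_chunks`) are my own and hypothetical, not part of any existing tool. It would only catch reads that return different data across passes, not writes that were dropped earlier, and the `posix_fadvise` call is only a hint to the kernel, so the second pass may still be served from the page cache unless caches are dropped some other way.

```python
import hashlib
import os


def chunk_digests(path, chunk_size=1 << 20, max_bytes=None):
    """Read path sequentially (read-only) and return one SHA-1 hex digest per chunk."""
    digests = []
    remaining = max_bytes
    with open(path, "rb") as f:
        while remaining is None or remaining > 0:
            want = chunk_size if remaining is None else min(chunk_size, remaining)
            data = f.read(want)
            if not data:
                break  # end of file or device
            digests.append(hashlib.sha1(data).hexdigest())
            if remaining is not None:
                remaining -= len(data)
    return digests


def drop_cached_pages(path):
    """Ask the kernel to drop cached pages for path (advisory hint only)."""
    if hasattr(os, "posix_fadvise"):  # not available on every platform
        fd = os.open(path, os.O_RDONLY)
        try:
            os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        finally:
            os.close(fd)


def unstable_chunks(path, chunk_size=1 << 20, max_bytes=None):
    """Return indices of chunks whose digest differs between two read passes."""
    first = chunk_digests(path, chunk_size, max_bytes)
    drop_cached_pages(path)
    second = chunk_digests(path, chunk_size, max_bytes)
    mismatches = [i for i, (a, b) in enumerate(zip(first, second)) if a != b]
    # A length difference means the two passes did not read the same amount.
    if len(first) != len(second):
        mismatches.append(min(len(first), len(second)))
    return mismatches
```

I would run something like `unstable_chunks("/dev/sda2", max_bytes=1 << 30)` as root from a live CD, or with the filesystem mounted read-only, since blocks on a filesystem mounted read-write can legitimately change between the two passes.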