I have a web server hosting about 100-150 virtual hosts, all small websites. My first post is here: SATA hdd errors, but now I have a new error on another disk.

1. I had a problem with a SATA disk, as described in the link.
2. I put in another disk and had no luck: mkfs.ext3 produced the same flood of errors on the new disk.
3. The next step was to replace the SATA cable, and this helped. After that I was able to format the disk and started the file transfer from backup.

It is now 4 days since the cable and disk change, and I see the following message in dmesg:

ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
ata2.00: cmd 60/08:00:3f:25:db/00:00:01:00:00/40 tag 0 ncq 4096 in
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata2.00: status: { DRDY }
ata2: hard resetting link
ata2: SATA link up 3.0 Gbps (SStatus 123 SControl 300)
ata2.00: configured for UDMA/133
ata2: EH complete
SCSI device sdb: 490350672 512-byte hdwr sectors (251060 MB)
sdb: Write Protect is off
sdb: Mode Sense: 00 3a 00 00
SCSI device sdb: drive cache: write back
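
To judge whether this is a one-off or a recurring event, it may help to count how often such exceptions show up in the kernel log. Below is a minimal sketch that assumes the log has been saved as text (for example from `dmesg`) and that the lines look like the ones above. The function name `summarize_ata_errors` and the two patterns are my own illustration, not part of any tool.

```python
import re
from collections import Counter

# Patterns modelled on the libata lines quoted above (an assumption:
# other kernel versions may word these messages differently).
EXCEPTION = re.compile(r"\b(ata\d+)\.\d+: exception Emask")
TIMEOUT = re.compile(r"Emask 0x4 \(timeout\)")


def summarize_ata_errors(log_text):
    """Count exception lines per ATA port and timeout result lines overall."""
    exceptions = Counter()
    timeouts = 0
    for line in log_text.splitlines():
        match = EXCEPTION.search(line)
        if match:
            exceptions[match.group(1)] += 1
        if TIMEOUT.search(line):
            timeouts += 1
    return exceptions, timeouts


if __name__ == "__main__":
    sample = """\
ata2.00: exception Emask 0x0 SAct 0x1 SErr 0x0 action 0x6 frozen
ata2.00: cmd 60/08:00:3f:25:db/00:00:01:00:00/40 tag 0 ncq 4096 in
         res 40/00:00:00:4f:c2/00:00:00:00:00/00 Emask 0x4 (timeout)
ata2: hard resetting link
"""
    per_port, timeout_count = summarize_ata_errors(sample)
    print(dict(per_port), timeout_count)
```

Running it against a saved log once a day would show whether the exceptions stay on one port or grow in number.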

The SMART output:

ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       89
  3 Spin_Up_Time            0x0027   200   200   021    Pre-fail  Always       -       991
  4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       16
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x002e   200   200   000    Old_age   Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       1090
 10 Spin_Retry_Count        0x0032   100   253   000    Old_age   Always       -       0
 11 Calibration_Retry_Count 0x0032   100   253   000    Old_age   Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       15
192 Power-Off_Retract_Count 0x0032   200   200   000    Old_age   Always       -       8
193 Load_Cycle_Count        0x0032   200   200   000    Old_age   Always       -       7
194 Temperature_Celsius     0x0022   118   100   000    Old_age   Always       -       25
196 Reallocated_Event_Count 0x0032   200   200   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0032   200   200   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0030   200   200   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
200 Multi_Zone_Error_Rate   0x0008   200   200   000    Old_age   Offline      -       0

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      1024         -
# 2  Conveyance offline  Completed without error       00%       978         -
# 3  Extended captive    Interrupted (host reset)      90%       977         -
# 4  Extended captive    Interrupted (host reset)      90%       977         -
# 5  Extended offline    Completed without error       00%       977         -
# 6  Short offline       Completed without error       00%       974         -

Question 1: What does this mean, and what could the problem be? (From what I found on Google, 99% of the time it is a cable problem.)

Question 2: Raw_Read_Error_Rate increases by 20-30 every day. Is that OK? On disk sda the value is 7000 and there are no errors.
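
To put a number on that daily growth, the attribute table can be parsed and two snapshots compared. This is only a rough sketch: it assumes the table has the same column layout as the output above and that the raw values are plain integers, and the helper names `parse_smart_attributes` and `raw_deltas` are made up for illustration.

```python
import re

# One row of the attribute table, in the column layout shown above:
# ID, name, flag, value, worst, thresh, type, updated, when_failed, raw.
ROW = re.compile(
    r"^\s*(?P<id>\d+)\s+(?P<name>\S+)\s+0x[0-9a-fA-F]+\s+"
    r"(?P<value>\d+)\s+(?P<worst>\d+)\s+(?P<thresh>\d+)\s+"
    r"\S+\s+\S+\s+\S+\s+(?P<raw>\d+)"
)


def parse_smart_attributes(text):
    """Map attribute name to its value, worst, thresh and raw numbers."""
    attrs = {}
    for line in text.splitlines():
        match = ROW.match(line)
        if match:
            attrs[match.group("name")] = {
                key: int(match.group(key))
                for key in ("value", "worst", "thresh", "raw")
            }
    return attrs


def raw_deltas(old, new):
    """Return the change in raw value for every attribute that changed."""
    return {
        name: new[name]["raw"] - old[name]["raw"]
        for name in new
        if name in old and new[name]["raw"] != old[name]["raw"]
    }


if __name__ == "__main__":
    snapshot = """\
  1 Raw_Read_Error_Rate     0x002f   200   200   051    Pre-fail  Always       -       89
  5 Reallocated_Sector_Ct   0x0033   200   200   140    Pre-fail  Always       -       0
"""
    print(parse_smart_attributes(snapshot))
```

Comparing a saved snapshot from yesterday with one from today via `raw_deltas` would show which raw counters actually move and by how much.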

user46269
  • I've been having similar errors that I can't track down. Someone on the internet said that after changing the cables more than 10 times (one of the last times with a more expensive cable), it was finally fixed. I can't really say anything beyond what you already know about those SATA errors, but I do know that on my two WD 1.5 TB EARS disks, the raw read error rate is 0. Their power-on hours count is 1000. BTW, I noticed you have a Load_Cycle_Count parameter. You may want to read about IntelliPark and disable it. – Halfgaar Jul 25 '10 at 18:04

2 Answers

Your hardware is defective; fix it.

Craig

I didn't read (or think about) your problem thoroughly, so I may be wrong about this, but you might want to consider the IOPS limit of the drive. I read somewhere (I forgot the link) that current consumer SATA drives average only about 75-100 IOPS, while enterprise SATA drives can reach double that (and SAS drives can double it again). IIRC it was an article about SSDs, where it is possible to have over 9000 IOPS!

EDIT: Added a related Wikipedia article: http://en.wikipedia.org/wiki/IOPS

  • That would not result in timeouts on the SATA link, though. It is the driver/hardware layer that is failing here. – TomTom Oct 25 '10 at 05:56