PERC MegaRAID S.M.A.R.T. Status is not matching smartctl's - looking for clues what's wrong with the HDD

Question

I'm getting a strange SMART error from MegaCli on a Dell R720xd with a PERC H710P and five 4 TB SATA drives in RAID 5:

 /opt/MegaRAID/MegaCli/MegaCli64 -PDList -aALL

reports a nonzero "Last Predictive Failure Event Seq Number" and a SMART alert for the drive in slot 4:

Slot Number: 4
...
Last Predictive Failure Event Seq Number: 7309 
...
Inquiry Data:       PK2361PAGAZU8WHitachi HUS724040ALE640                 MJAOA3B0
...
Drive has flagged a S.M.A.R.T alert : Yes
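To check every slot at once instead of reading the listing by eye, the relevant fields can be pulled out of the -PDList text. This is a minimal sketch in Python, assuming the output uses exactly the field labels shown above. The helper name parse_pdlist is hypothetical and not part of MegaCli, and the sample text is just the excerpt from this post.

```python
import re

def parse_pdlist(text):
    """Return one dict per physical drive found in MegaCli -PDList text.

    Assumes each drive section starts with a "Slot Number:" line, as in
    the excerpt above. Fields that are missing stay None.
    """
    drives = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        m = re.match(r"Slot Number:\s*(\d+)", line)
        if m:
            current = {"slot": int(m.group(1)),
                       "pred_fail_seq": None,
                       "smart_alert": None}
            drives.append(current)
            continue
        if current is None:
            continue  # ignore anything before the first drive section
        m = re.match(r"Last Predictive Failure Event Seq Number:\s*(\d+)", line)
        if m:
            current["pred_fail_seq"] = int(m.group(1))
            continue
        m = re.match(r"Drive has flagged a S\.M\.A\.R\.T alert\s*:\s*(Yes|No)", line)
        if m:
            current["smart_alert"] = (m.group(1) == "Yes")
    return drives

# Excerpt copied from the MegaCli output in this post.
sample = """Slot Number: 4
...
Last Predictive Failure Event Seq Number: 7309
...
Inquiry Data:       PK2361PAGAZU8WHitachi HUS724040ALE640                 MJAOA3B0
...
Drive has flagged a S.M.A.R.T alert : Yes
"""

for drive in parse_pdlist(sample):
    print(drive)
```

In practice the text would come from the MegaCli64 -PDList -aALL command above, for example captured with subprocess.run, rather than from a pasted string.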

but smartctl gives no clue at all about what's wrong with the drive:

# smartctl -a -d sat+megaraid,4 /dev/sda
smartctl 5.42 2011-10-20 r3458 [x86_64-linux-2.6.32-279.19.1.el6.x86_64] (local build)
...
Serial Number:    PK2361PAGAZU8W     # Note same serial, no mistake
...
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
...
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.
...
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED   RAW_VALUE
  1 Raw_Read_Error_Rate     0x000b   100   100   016    Pre-fail  Always       -       0
  2 Throughput_Performance  0x0005   137   137   054    Pre-fail  Offline      -       79
  3 Spin_Up_Time            0x0007   100   100   024    Pre-fail  Always       -       426
  4 Start_Stop_Count        0x0012   100   100   000    Old_age   Always       -       7
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       0
  7 Seek_Error_Rate         0x000b   100   100   067    Pre-fail  Always       -       0
  8 Seek_Time_Performance   0x0005   114   114   020    Pre-fail  Offline      -       37
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       4912
 10 Spin_Retry_Count        0x0013   100   100   060    Pre-fail  Always       -       0
 12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       7
192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       182
193 Load_Cycle_Count        0x0012   100   100   000    Old_age   Always       -       182
194 Temperature_Celsius     0x0002   176   176   000    Old_age   Always       -       34 (Min/Max 19/40)
196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0022   100   100   000    Old_age   Always       -       0
198 Offline_Uncorrectable   0x0008   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x000a   200   200   000    Old_age   Always       -       0

SMART Error Log Version: 1
No Errors Logged

Nothing in the output above looks worrying.

I ran a short self-test and it did not reveal anything; I have now started a long test:

Serial Number:    PK2331PAG7EENT
SMART Self-test log structure revision number 1
Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
# 1  Short offline       Completed without error       00%      4911         -

Meanwhile, another disk in the same array has 39 reallocated sectors, and the PERC doesn't flag it as soon-to-fail. smartctl output below:

Serial Number:    PK2331PAG7EENT
...
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       39
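One reason a raw count of 39 may not trip anything: by the usual SMART convention, an attribute is only considered failing when its normalized VALUE drops to or below THRESH, and the RAW_VALUE is not compared against the threshold directly. The sketch below illustrates that check on the line above. The helper names parse_attribute and is_failing are hypothetical, and the failing rule is an assumption about how the attribute check works, not something taken from the PERC firmware.

```python
def parse_attribute(line):
    """Split one row of the smartctl attribute table into named fields.

    The first nine columns are whitespace separated; everything after
    them is kept together as the raw value, since it can contain spaces
    (for example "34 (Min/Max 19/40)").
    """
    fields = line.split(None, 9)
    return {
        "id": int(fields[0]),
        "name": fields[1],
        "value": int(fields[3]),   # normalized value
        "worst": int(fields[4]),
        "thresh": int(fields[5]),
        "type": fields[6],
        "raw": fields[9].strip(),
    }

def is_failing(attr):
    """Assumed rule: failing when the normalized value is at or below a
    nonzero threshold. A threshold of 0 is treated as never failing."""
    return attr["thresh"] > 0 and attr["value"] <= attr["thresh"]

# Row copied from the smartctl output in this post.
row = "  5 Reallocated_Sector_Ct   0x0033   100   100   005    Pre-fail  Always       -       39"

attr = parse_attribute(row)
print(attr["name"], "raw =", attr["raw"], "failing =", is_failing(attr))
```

With VALUE 100 against THRESH 005, the attribute is far from its threshold even though the raw count is nonzero, which is consistent with neither smartctl nor the controller flagging this disk.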

and MegaCli64 output for the same disk with 39 reallocated sectors:

Slot Number: 0
Last Predictive Failure Event Seq Number: 0
Inquiry Data:       PK2331PAG7EENTHitachi HUS724040ALE640                 MJAOA3B0
...
Drive has flagged a S.M.A.R.T alert : No

MegaRAID Storage Manager's event log is not enlightening either:

ID = 113
SEQUENCE NUMBER = 7310
TIME = 11-07-2013 20:58:01
LOCALIZED MESSAGE = Controller ID:  0   Unexpected sense:   PD       =   -:-:4Hardware impending failure general hard drive failure,   CDB   =    0x03 0x00 0x00 0x00 0x40 0x00    ,   Sense   =    0xf0 0x00 0x00 0x00 0x00 0x00 0x00 0x0a 0x00 0x00 0x00 0x00 0x5d 0x10 0x00 0x00 0x00 0x00

ID = 96
SEQUENCE NUMBER = 7309
TIME = 11-07-2013 20:58:01
LOCALIZED MESSAGE = Controller ID:  0   PD Predictive failure:       -:-:4
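The sense bytes in event 113 can be decoded by hand. The first byte, 0xf0, is response code 0x70 (fixed-format sense data) with the valid bit set. In that format the sense key is in the low nibble of byte 2, and the additional sense code (ASC) and qualifier (ASCQ) are bytes 12 and 13. The CDB opcode 0x03 is REQUEST SENSE. A minimal Python sketch of that decoding follows; decode_fixed_sense is a hypothetical helper, and it only handles the fixed format.

```python
def decode_fixed_sense(sense):
    """Decode the main fields of SCSI fixed-format sense data."""
    if len(sense) < 14:
        raise ValueError("need at least 14 bytes to read ASC/ASCQ")
    response_code = sense[0] & 0x7F
    if response_code not in (0x70, 0x71):
        raise ValueError("not fixed-format sense data")
    return {
        "valid": bool(sense[0] & 0x80),   # top bit of byte 0
        "response_code": response_code,
        "sense_key": sense[2] & 0x0F,     # low nibble of byte 2
        "additional_length": sense[7],
        "asc": sense[12],
        "ascq": sense[13],
    }

# Sense bytes copied from event 113 in this post.
sense = bytes.fromhex(
    "f0 00 00 00 00 00 00 0a 00 00 00 00 5d 10 00 00 00 00"
)

decoded = decode_fixed_sense(sense)
print("sense key: 0x%02x" % decoded["sense_key"])
print("ASC/ASCQ:  0x%02x/0x%02x" % (decoded["asc"], decoded["ascq"]))
```

ASC/ASCQ 0x5d/0x10 is the code that the log message spells out as "Hardware impending failure general hard drive failure", so the event text carries no information beyond what the drive itself reported in the sense data.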

So the disk seems healthy. Any ideas how to reset the SMART alert? I don't think these SMART stats are enough to claim warranty for it.

PS: We removed drive #4 and plugged it in as #5. It showed up as "foreign", which is expected, and it now shows as healthy, so we assigned it as a global hot spare. We placed a new drive in slot #4 and the RAID rebuilt the volume. Dell support suggested using omconfig to get a more detailed controller log.

Can you dump the entire megacli event output? For SMART alerts, there should be an event code 0x60. Ideally, it will give some indication in the unexpected sense data above it which might shed some light on the sequence which led to SMART being indicated. — Jon Brauer, Jul 18 '13 at 17:37
I had to resolve it quickly, so I swapped the drive in question with a new drive, rebuilt the RAID, reused the drive it complained about in a different slot, and rebuilt again with it in use, and the error disappeared. I reached out to Dell support and they recommended dumping a DSET report from the controller with OMSA Live or the DSET utility, but I didn't get back to it. So far it's up and running with no errors. I will investigate if it reappears. — kuz8, Aug 13 '13 at 05:13
