
I am using a hardware RAID 50 with a PERC810 controller in my server and recently came across a metric I am not sure about. Until now I have used the smartctl value "Elements in grown defect list" as a hint that a drive is failing and should be removed, but perccli (or storcli/megacli) also reports a per-drive value called "Media Error Count". From what I have read about these metrics, they are basically the same thing: both show reallocated sectors or physical defects on a disk. However, some of my HDDs show a non-zero "Elements in grown defect list" together with a Media Error Count of zero, and vice versa. For example, this disk:

perccli /c0/e37/s7 show all
CLI Version = 007.1327.0000.0000 July 27, 2020
Operating system = Linux 4.19.0-0.bpo.9-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive /c0/e37/s7 :
================

----------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type 
----------------------------------------------------------------------------
37:7     72 Onln   1 3.637 TB SAS  HDD N   N  512B WD4001FYYG-01SL3 U  -    
----------------------------------------------------------------------------

EID=Enclosure Device ID|Slt=Slot No.|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild


Drive /c0/e37/s7 - Detailed Information :
=======================================

Drive /c0/e37/s7 State :
======================
Shield Counter = 0
Media Error Count = 38
Other Error Count = 118063
Drive Temperature =  41C (105.80 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No


Drive /c0/e37/s7 Device attributes :
==================================
SN = WMC1F0D41KD5
Manufacturer Id = WD      
Model Number = WD4001FYYG-01SL3
NAND Vendor = NA
WWN = 50000C0F01F55DD1
Firmware Revision = VR08
Firmware Release Number = N/A
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = 01

This shows Media Error Count = 38, but when I run smartctl against the same disk:

smartctl -a -d megaraid,72 /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-0.bpo.9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VR08
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01f55dd0
Serial number:        WMC1F0D41KD5
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Jan 28 14:14:51 2022 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     41 C
Drive Trip Temperature:        40 C

Accumulated power on time, hours:minutes 60298:10
Manufactured in week 46 of year 2014
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  18
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  118
Elements in grown defect list: 0

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    2538437     9298     76289   2547735       9392     215124.761          94
write:   5550372  5405661   5407707  10956033    5405661     571404.363           0
verify:      184        0         0       184          0        352.277           0

Non-medium error count:   202249

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      11                 - [-   -    -]

Long (extended) Self-test duration: 31120 seconds [518.7 minutes]

smartctl reports Elements in grown defect list: 0.

Here is another example from the same server, on a different HDD:

perccli /c0/e37/s4 show all
CLI Version = 007.1327.0000.0000 July 27, 2020
Operating system = Linux 4.19.0-0.bpo.9-amd64
Controller = 0
Status = Success
Description = Show Drive Information Succeeded.


Drive /c0/e37/s4 :
================

----------------------------------------------------------------------------
EID:Slt DID State DG     Size Intf Med SED PI SeSz Model            Sp Type 
----------------------------------------------------------------------------
37:4     63 Onln   1 3.637 TB SAS  HDD N   N  512B WD4001FYYG-01SL3 U  -    
----------------------------------------------------------------------------

EID=Enclosure Device ID|Slt=Slot No.|DID=Device ID|DG=DriveGroup
DHS=Dedicated Hot Spare|UGood=Unconfigured Good|GHS=Global Hotspare
UBad=Unconfigured Bad|Sntze=Sanitize|Onln=Online|Offln=Offline|Intf=Interface
Med=Media Type|SED=Self Encryptive Drive|PI=Protection Info
SeSz=Sector Size|Sp=Spun|U=Up|D=Down|T=Transition|F=Foreign
UGUnsp=UGood Unsupported|UGShld=UGood shielded|HSPShld=Hotspare shielded
CFShld=Configured shielded|Cpybck=CopyBack|CBShld=Copyback Shielded
UBUnsp=UBad Unsupported|Rbld=Rebuild


Drive /c0/e37/s4 - Detailed Information :
=======================================

Drive /c0/e37/s4 State :
======================
Shield Counter = 0
Media Error Count = 0
Other Error Count = 118060
Drive Temperature =  35C (95.00 F)
Predictive Failure Count = 0
S.M.A.R.T alert flagged by drive = No


Drive /c0/e37/s4 Device attributes :
==================================
SN = WMC1F0D222KF
Manufacturer Id = WD      
Model Number = WD4001FYYG-01SL3
NAND Vendor = NA
WWN = 50000C0F01352C35
Firmware Revision = VR08
Firmware Release Number = N/A
Raw size = 3.638 TB [0x1d1c0beb0 Sectors]
Coerced size = 3.637 TB [0x1d1b00000 Sectors]
Non Coerced size = 3.637 TB [0x1d1b0beb0 Sectors]
Device Speed = 6.0Gb/s
Link Speed = 6.0Gb/s
Write Cache = N/A
Logical Sector Size = 512B
Physical Sector Size = 512B
Connector Name = 01 


Drive /c0/e37/s4 Policies/Settings :
==================================
Drive position = DriveGroup:1, Span:1, Row:0
Enclosure position = 0
Connected Port Number = 0(path0) 
Sequence Number = 2
Commissioned Spare = No
Emergency Spare = No
Last Predictive Failure Event Sequence Number = 0
Successful diagnostics completion on = N/A
FDE Type = None
SED Capable = No
SED Enabled = No
Secured = No
Cryptographic Erase Capable = No
Sanitize Support = Not supported
Locked = No
Needs EKM Attention = No
PI Eligible = No
Certified = No
Wide Port Capable = No

Port Information :
================

-----------------------------------------
Port Status Linkspeed SAS address        
-----------------------------------------
   0 Active 6.0Gb/s   0x50000c0f01352c36 
   1 Active Unknown   0x0                
-----------------------------------------


Inquiry Data = 
00 00 06 12 5b 01 10 02 57 44 20 20 20 20 20 20 
57 44 34 30 30 31 46 59 59 47 2d 30 31 53 4c 33 
56 52 30 38 57 44 2d 57 4d 43 31 46 30 44 32 32 
32 4b 46 20 20 20 20 20 00 00 00 a0 0c 40 20 c0 
04 60 04 c0 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 
00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 

Here Media Error Count = 0, but smartctl for the same disk:

smartctl -a -d megaraid,63 /dev/sdg
smartctl 7.2 2020-12-30 r5155 [x86_64-linux-4.19.0-0.bpo.9-amd64] (local build)
Copyright (C) 2002-20, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Vendor:               WD
Product:              WD4001FYYG-01SL3
Revision:             VR08
Compliance:           SPC-4
User Capacity:        4,000,787,030,016 bytes [4.00 TB]
Logical block size:   512 bytes
Rotation Rate:        7200 rpm
Form Factor:          3.5 inches
Logical Unit id:      0x50000c0f01352c34
Serial number:        WMC1F0D222KF
Device type:          disk
Transport protocol:   SAS (SPL-3)
Local Time is:        Fri Jan 28 14:39:52 2022 CET
SMART support is:     Available - device has SMART capability.
SMART support is:     Enabled
Temperature Warning:  Enabled

=== START OF READ SMART DATA SECTION ===
SMART Health Status: OK

Current Drive Temperature:     35 C
Drive Trip Temperature:        40 C

Accumulated power on time, hours:minutes 60299:24
Manufactured in week 46 of year 2014
Specified cycle count over device lifetime:  1048576
Accumulated start-stop cycles:  18
Specified load-unload count over device lifetime:  1114112
Accumulated load-unload cycles:  118
Elements in grown defect list: 44

Error counter log:
           Errors Corrected by           Total   Correction     Gigabytes    Total
               ECC          rereads/    errors   algorithm      processed    uncorrected
           fast | delayed   rewrites  corrected  invocations   [10^9 bytes]  errors
read:    4899063        1         1   4899064          1     215489.217           0
write:   6593514      494       496   6594008        499     571584.348           0
verify:      345        0         0       345          0        349.197           0

Non-medium error count:   202287

SMART Self-test log
Num  Test              Status                 segment  LifeTime  LBA_first_err [SK ASC ASQ]
     Description                              number   (hours)
# 1  Background long   Completed                   -      11                 - [-   -    -]

Long (extended) Self-test duration: 31120 seconds [518.7 minutes]

This time smartctl reports Elements in grown defect list: 44.
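
To compare the two counters side by side across all drives, something like the following could be used. It is only a minimal sketch: it assumes the text output of `perccli ... show all` and `smartctl -a -d megaraid,<DID>` has already been captured into strings, and it relies on the exact field labels seen in the output above. The function names are made up for illustration and are not part of either tool.

```python
import re
from typing import Optional


def parse_media_error_count(perccli_output: str) -> Optional[int]:
    """Return the 'Media Error Count' value from perccli text, or None if absent."""
    m = re.search(r"^Media Error Count\s*=\s*(\d+)", perccli_output, re.MULTILINE)
    return int(m.group(1)) if m else None


def parse_grown_defects(smartctl_output: str) -> Optional[int]:
    """Return 'Elements in grown defect list' from smartctl text, or None if absent."""
    m = re.search(
        r"^Elements in grown defect list:\s*(\d+)", smartctl_output, re.MULTILINE
    )
    return int(m.group(1)) if m else None


def compare_counters(perccli_output: str, smartctl_output: str) -> dict:
    """Collect both counters and flag drives where only one of them is non-zero."""
    media = parse_media_error_count(perccli_output)
    grown = parse_grown_defects(smartctl_output)
    if media is None or grown is None:
        disagree = None  # one of the fields was not found in the input text
    else:
        disagree = (media == 0) != (grown == 0)
    return {"media_error_count": media, "grown_defects": grown, "disagree": disagree}


# Excerpts copied from the two drives shown in this post.
S7_PERCCLI = "Shield Counter = 0\nMedia Error Count = 38\nOther Error Count = 118063\n"
S7_SMARTCTL = "Accumulated load-unload cycles:  118\nElements in grown defect list: 0\n"
S4_PERCCLI = "Shield Counter = 0\nMedia Error Count = 0\nOther Error Count = 118060\n"
S4_SMARTCTL = "Accumulated load-unload cycles:  118\nElements in grown defect list: 44\n"

s7 = compare_counters(S7_PERCCLI, S7_SMARTCTL)
s4 = compare_counters(S4_PERCCLI, S4_SMARTCTL)
print("e37/s7:", s7)
print("e37/s4:", s4)
```

For both drives above the `disagree` flag comes out true, which is exactly the mismatch I am asking about.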

Can you please explain the difference between these two metrics and which one to go by in determining a faulty drive? Thank you.

chpZ