9

If I do the following

/opt/MegaRAID/MegaCli/MegaCli -LDInfo -Lall -aAll -NoLog  > /tmp/tmp
/opt/MegaRAID/MegaCli/MegaCli -LDPDInfo     -aAll -NoLog >> /tmp/tmp

then I see these errors

Media Error Count: 11
Other Error Count: 5

Question

What do they mean? Are they critical?

Full output:

Adapter 0 -- Virtual Drive Information:
Virtual Disk: 0 (target id: 0)
Name:Virtual Disk 0
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
Size:951296MB
State: Optimal
Stripe Size: 64kB
Number Of Drives:5
Span Depth:1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default


Adapter #0

Number of Virtual Disks: 1
Virtual Disk: 0 (target id: 0)
Name:Virtual Disk 0
RAID Level: Primary-5, Secondary-0, RAID Level Qualifier-3
Size:951296MB
State: Optimal
Stripe Size: 64kB
Number Of Drives:5
Span Depth:1
Default Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Current Cache Policy: WriteBack, ReadAheadNone, Direct, No Write Cache if Bad BBU
Access Policy: Read/Write
Disk Cache Policy: Disk's Default
Number of Spans: 1
Span: 0 - Number of PDs: 5
PD: 0 Information
Enclosure Device ID: N/A
Slot Number: 0
Device Id: 0
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Raw Size: 238418MB [0x1d1a94a2 Sectors]
Non Coerced Size: 237906MB [0x1d0a94a2 Sectors]
Coerced Size: 237824MB [0x1d080000 Sectors]
Firmware state: Online
SAS Address(0): 0x1221000000000000
Connected Port Number: 0 
Inquiry Data: ATA     WDC WD2500JS-75N2E04     WD-WCANK9523610

PD: 1 Information
Enclosure Device ID: N/A
Slot Number: 1
Device Id: 1
Sequence Number: 2
Media Error Count: 11
Other Error Count: 5
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Raw Size: 238418MB [0x1d1a94a2 Sectors]
Non Coerced Size: 237906MB [0x1d0a94a2 Sectors]
Coerced Size: 237824MB [0x1d080000 Sectors]
Firmware state: Online
SAS Address(0): 0x1221000001000000
Connected Port Number: 1 
Inquiry Data: ATA     WDC WD2500JS-75N2E04     WD-WCANK9507278

PD: 2 Information
Enclosure Device ID: N/A
Slot Number: 2
Device Id: 2
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Raw Size: 238418MB [0x1d1a94a2 Sectors]
Non Coerced Size: 237906MB [0x1d0a94a2 Sectors]
Coerced Size: 237824MB [0x1d080000 Sectors]
Firmware state: Online
SAS Address(0): 0x1221000002000000
Connected Port Number: 2 
Inquiry Data: ATA     WDC WD2500JS-75N2E04     WD-WCANK9504713

PD: 3 Information
Enclosure Device ID: N/A
Slot Number: 3
Device Id: 3
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Raw Size: 238418MB [0x1d1a94a2 Sectors]
Non Coerced Size: 237906MB [0x1d0a94a2 Sectors]
Coerced Size: 237824MB [0x1d080000 Sectors]
Firmware state: Online
SAS Address(0): 0x1221000003000000
Connected Port Number: 3 
Inquiry Data: ATA     WDC WD2500JS-75N2E04     WD-WCANK9503028

PD: 4 Information
Enclosure Device ID: N/A
Slot Number: 4
Device Id: 4
Sequence Number: 2
Media Error Count: 0
Other Error Count: 0
Predictive Failure Count: 0
Last Predictive Failure Event Seq Number: 0
Raw Size: 238418MB [0x1d1a94a2 Sectors]
Non Coerced Size: 237906MB [0x1d0a94a2 Sectors]
Coerced Size: 237824MB [0x1d080000 Sectors]
Firmware state: Online
SAS Address(0): 0x1221000004000000
Connected Port Number: 4 
Inquiry Data: ATA     WDC WD2500JS-75N2E04     WD-WCANK9503793
Sandra

4 Answers

10

You have problems with the drive in slot 1. It's RAID 5, so your data is protected, but you've lost redundancy (one disk is not reliable). A media error means the drive has run out of spare sectors to remap bad sectors to (http://kb.lsi.com/KnowledgebaseArticle15809.aspx, http://mycusthelp.info/LSI/_cs/AnswerDetail.aspx?inc=7468). If it were my data, I'd be doubly scrupulous about backing up, remove the drive, replace it with a new one and resynchronise the array. Some vendors (e.g. IBM) will accept an RMA based on predictive failure indicators, some won't. If your vendor does not accept a disk with bad, un-remappable sectors as faulty, take it out of the array and exercise it in a test system. It should fail within a reasonable time.

Edit:

Media error counts were non-zero only for the disk with slot ID 1; the log you've provided shows the slot ID for each entry. The strange thing is that the RAID reports its state as Optimal despite the media errors on that disk. Still, I wouldn't trust the disk.
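If you want to see the per-slot counters at a glance, a quick filter over the physical drive listing should do it (a sketch using the same MegaCli path as in the question; adjust to taste):

/opt/MegaRAID/MegaCli/MegaCli -PDList -aAll -NoLog | egrep 'Slot Number|Error Count|Predictive Failure Count|Firmware state'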

A RAID 5 made with n disks of the same size gives you the capacity of (n-1) disks, because it stores one disk's worth of redundancy data. Therefore, if you have six 250 GB disks and 1 TB of usable space, they are most likely configured as a 5-disk RAID 5 (which gives you 4 x 250 GB of usable space) plus 1 spare disk.
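To see whether that sixth disk is configured as a dedicated hot spare, one option is to list the physical drives and check their firmware state; a hot spare normally reports a state such as Hotspare rather than Online (a sketch, not verified against this particular controller):

/opt/MegaRAID/MegaCli/MegaCli -PDList -aAll -NoLog | egrep 'Slot Number|Firmware state'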

Paweł Brodacki
  • How were you able to tell that it is the disk in slot 1 that is defective? – Sandra Aug 16 '11 at 12:35
  • Btw, is it possible to tell from this output how the RAID 5 is configured? I have 6x250 GB disks and 1 TB usable. Does that mean I have 1 hot spare or 2 hot spares? – Sandra Aug 16 '11 at 12:35
  • The capacity of a RAID5 with n disks is (2/3)*n. For 2 bits of data, you store 1 bit of parity. That's one third gone. – Antoine Benkemoun Aug 16 '11 at 16:23
  • @Antoine Benkemoun : According to Wikipedia I get (n-1) and not (2/3)*n http://en.wikipedia.org/wiki/Raid5#RAID_5 – Sandra Aug 16 '11 at 21:59
  • @Paweł Brodacki : Excellent. Is it possible to tell if the spare disk has kicked in, so I still have a full RAID 5, or whether the spare disk was already in use? – Sandra Aug 17 '11 at 08:03
  • @Sandra, you are perfectly right. I got it all wrong in my head. Even though what I previously said could make sense, the real thing is that for n-1 bits of data you store 1 bit of parity, which does give you (n-1) capacity, allowing for just one drive to fail regardless of the number of drives in your array. Makes you wonder about the reliability of large RAID 5 arrays... – Antoine Benkemoun Aug 17 '11 at 08:44
  • @Sandra I'm afraid I'm not able to tell if the hot spare is in use. There are two ways I can see: 1) read the MegaCLI reference manual and see if you can check it, or 2) go to the physical box and look at the lights (if these are hot-swap disks). Failed disks usually announce their state with red/orange lights. – Paweł Brodacki Aug 17 '11 at 08:54
  • The array is still Optimal, which means the array thinks it can still recover from a drive failure. The reality may be different depending on whether you have run a media verify on the array yet. In my case, I was getting this error and smartctl (see the other answer) said there were 115 uncorrected read errors. It could be that the controller got a read error, recalculated the data from the other discs and re-wrote the missing data, correcting the marginal block on the drive. – Sean Reifschneider Oct 08 '13 at 15:14
7

Actually, smartctl can provide you with detailed information about every disk behind a MegaRAID controller. To get information about physical disk #0, run:

smartctl -a -d megaraid,0 /dev/sda | less

As Paweł rightly points out, it is most probably reallocated sectors, but I have had a few cases where communication problems [visible in smartctl -l xerror -d megaraid,5 /dev/sda] were reported as Media Error Count.
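To scan all of the drives in one go, a small loop along these lines should work (assuming device IDs 0-4 as in the output above and that /dev/sda is the block device behind the controller; adjust as needed):

for i in 0 1 2 3 4; do
  echo "=== megaraid,$i ==="
  smartctl -a -d megaraid,$i /dev/sda | egrep -i 'reallocated|pending|uncorrect|overall-health'
done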

pQd
  • When I try it, I get `INVALID ARGUMENT TO -d: megaraid,0`. Changing `/dev/sda` to `ata` outputs the same error. – Sandra Aug 18 '11 at 08:31
  • @Sandra - works here with Dell's PERC 6 [MegaRAID SAS 1078] and smartctl 5.40 – pQd Aug 18 '11 at 13:17
  • @Sandra, older smartctl, e.g. version 5.38 on Ubuntu 10.04, does not have megaraid support (5.41 on Ubuntu 12.04 has it). – Peter May 20 '14 at 08:30
2

As long as your array is up and running, it should be OK. The media error counter can increase from events such as the reallocation of a failing sector on one of the drives, while the other error counter can be incremented by any non-problematic event (bus device reset, power cycle, etc.). However, if an error is critical, the controller will automatically take the drive out of the array and report it as failed, in which case you will have to take action.

It would be great if smartctl were able to provide detailed SMART info and individual unit status through the MegaRAID controller, but I don't think it supports that. Give it a try just in case.
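If your smartctl build does have the megaraid support shown in the other answer, a quick per-drive health summary would look something like this (disk ID 1 here is just the suspect drive from the question):

smartctl -H -d megaraid,1 /dev/sda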

O G
0

Sometimes drives will generate read errors. In my experience, when that happens I can run "badblocks" to stress test the drive; it may report some errors early on, but once the drive has been stressed a bit it will either continue reporting errors, in which case it's bad, or it will stop reporting errors.

I've figured that this is due to some sectors of the drive being marginal, and bad-sector remapping can only kick in when you are writing to the disc, not reading from it. If a sector that already holds data goes bad, the drive has to report an error when you read it, because if it silently remapped that sector to one of the spare sectors, it would give you back invalid data rather than an error. But on a write, if the drive notices the sector is bad, it can write the data to a spare sector and remap it.
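On a drive attached directly to a test machine (not through the RAID controller), if SMART reports the LBA of a pending sector, you can sometimes trigger that remap by overwriting just that sector. This destroys whatever was stored there, so only do it on a disk you are about to wipe anyway; the LBA below is a placeholder:

#  WARNING: overwrites sector 12345 on /dev/sdX -- whatever data was there is lost
dd if=/dev/zero of=/dev/sdX bs=512 count=1 seek=12345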

Unfortunately, you can't clear this error count, so if you have monitoring that reports media errors, you either have to replace the drive or configure the monitoring to ignore that many errors and only alert when the count changes again.
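Since the counters never reset, one workaround is to alert only when the count grows past a stored baseline rather than on any non-zero value. A rough sketch (the baseline file location is arbitrary):

#  Sum the Media Error Count lines across all physical drives
NEW=$(MegaCli64 -PDList -aALL | grep 'Media Error Count' | awk -F: '{ sum += $2 } END { print sum + 0 }')
OLD=$(cat /var/tmp/media-error-baseline 2>/dev/null || echo 0)
if [ "$NEW" -gt "$OLD" ]; then
  echo "Media error count went from $OLD to $NEW"
  echo "$NEW" > /var/tmp/media-error-baseline
fi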

You can check the drive's SMART status with smartctl (thanks, @pQd, I didn't know about that):

MegaCli64 -PDList  -aALL | grep -e '^$' -e Slot -e Count
#  Find the slot number to use for "X".
#  For "Y" use the device name the system knows, such as "sda".
smartctl -a -d megaraid,X /dev/sdY

It's probably not entirely unreasonable to rebuild the drive and see if it continues to have problems. With MegaRAID, you can do that with these commands:

#  WARNING: Make sure the array is "Optimal" first, this will degrade it.
MegaCli64 -LDInfo -Lall -aALL | grep State
#  NOTE: This assumes drive 3 of enclosure 32 for adapter 0
MegaCli64 -PDOffline -PhysDrv [32:3] -a0
MegaCli64 -PDRbld -Start -PhysDrv [32:3] -a0

#  Now check the rebuild status until it completes:
MegaCli64 -PDRbld -ShowProg -PhysDrv [32:3] -a0

# And the array status should go back to Optimal
MegaCli64 -LDInfo -Lall -aALL | grep State

I used to have drives fall out of the RAID array all the time (maybe once every month or two, across a sample of 100 to 200 drives). But the drives weren't showing up as bad after I replaced them.

I started burning in all drives before putting them into production, using "badblocks" for around a week, and after I started doing that the number of these array drop-outs reduced dramatically. Now it happens maybe twice a year, across 500 drives.

This is a destructive test, so make sure you have no data on the drive:

badblocks -svw -p 5 /dev/sdX

Were "sdX" is the device to test. Be very careful here, picking the wrong drive will destroy your data. I run my tests on a standalone machine on my testbench.

Sean Reifschneider