
I need some expert assistance.

I have a home lab ESXi box where I'm using a Dell PERC H700 RAID card with RAID 6 on ports 0-5 and RAID 1 on ports 6-7. The last few nights, when my backups run, I get several errors like the one below:

    Controller ID: 0   Event From: 192.168.123.100
    Unexpected sense: PD = -:-:4, Unrecovered read error
    CDB   = 0x28 0x00 0x21 0x3b 0x29 0x80 0x00 0x00 0x80 0x00
    Sense = 0xf0 0x00 0x03 0x21 0x3b 0x29 0x80 0x0a 0x00 0x00 0x00 0x00 0x11 0x00 0x81 0x80 0x00 0x96
    Event ID: 113
    Generated On: Thu Jul 19 03:26:36 EDT 3917

I believe I have a failing drive, but I'm not certain which port the problematic drive is on. I'm assuming that PD = -:-:4 means port 4, but when I Google this error I don't see any references with -:- as part of the port description.

Thanks in advance to anyone who can add some clarity and assure me I'm going to swap out the correct drive.

EDIT:

I should have included that this alert is generated by the MegaRAID Storage Manager (MSM) software, and the same information is contained in the log I download from MSM.
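
For reference, if I get MegaCLI installed, I believe the physical-drive listing and the exported controller event log should show the same error keyed by an enclosure/slot pair, which would cross-check the -:-:4 identifier. A minimal sketch of what I have in mind (the /opt/lsi/MegaCLI path on ESXi and the grep filter strings are assumptions on my part, not output from this box):

    # List physical drives with their enclosure/slot IDs and error counters
    /opt/lsi/MegaCLI/MegaCli -PDList -aALL | grep -E "Enclosure Device ID|Slot Number|Media Error Count|Predictive Failure Count|Firmware state"

    # Dump the full controller event log to a file
    /opt/lsi/MegaCLI/MegaCli -AdpEventLog -GetEvents -f /tmp/perc-events.txt -aALL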

Zenonk
  • Since this is your home lab, this question technically belongs on SuperUser: "Server Fault is for questions about managing information technology systems in a business environment." – JimNim Jul 19 '17 at 15:40
  • If you have OpenManage Server Administrator (OMSA) installed on the server, you should be able to utilize that for exporting a PERC controller log. PERC logs are a little more descriptive in calling out specific slot numbers for errors like that, and you can't trust the hypervisor's disk numbering as a guide for that. – JimNim Jul 19 '17 at 15:43
  • Thanks @JimNim. My apologies for posting in the wrong forum; I use this server as a home lab to test configurations before applying them to an identical production system at a small business whose systems I manage, so it's a bit of both. I edited the post to include the source of the error. Since the server hardware is not Dell, I'm not able to run the OMSA software, hence my use of the MSM software. Do you still believe the disk numbering may be inaccurate, knowing that it's from MSM and not the hypervisor? – Zenonk Jul 19 '17 at 20:23
  • Not sure on that one... though MSM might have an option to export the controller log (if not, MegaCLI does) to give you more certainty. And the use case of your home lab probably sits it right atop the blurry line between SF and SU forums... – JimNim Jul 19 '17 at 20:29
  • MSM does allow me to export the log as a .txt, but the information in it is identical to what I see in MSM, so I'm not sure whether there is a more detailed controller log or not. I don't have the vib installed for MegaCLI, so that will probably be my next task. Another question, if you have an opinion: should I run a consistency check before swapping the drive, or swap the drive, let it rebuild, and then run a consistency check? Interestingly, despite getting these errors when running backups, patrol reads complete with no warnings... Is that normal? – Zenonk Jul 21 '17 at 00:58
  • It's possible for a patrol read to miss things, sure. Not common/normal though. A consistency check BEFORE would be best, though you're likely safe either way with dual parity on RAID6. – JimNim Jul 21 '17 at 06:59
  • So after much reading and research I was able to ascertain with certainty that MSM is reporting the port connected to the failing drive. I have the replacement drive and am planning to swap it out. I don't have a hot spare, nor do I need to do a hot swap, so my planned order of events is as follows; can you confirm whether I'm missing anything? 1. Run a consistency check 2. Mark the drive offline 3. Shut down 4. Replace the drive 5. Reboot 6. Mark the drive online 7. The rebuild should start automatically, correct? (A MegaCLI equivalent of this sequence is sketched after the comments.) – Zenonk Jul 27 '17 at 05:30
  • No, don't shut down a system to replace hot-swappable drives. It's DESIGNED to be hot swapped. Consistency check, fail/offline the drive, replace the drive, and assign the new drive as a spare. New drives likely won't be automatically claimed as the target for a rebuild. – JimNim Jul 27 '17 at 14:49
  • Thanks for the answers and assistance; everything went exceptionally well. As discussed, I ran the consistency check (OK), then marked port 4 as OFFLINE. The array status changed to partially degraded; I removed the problematic drive and inserted the new one. The card automatically added the drive to the array and initiated the rebuild. The rebuild took about an hour and I was back to 100% health. (Six 450 GB 15K drives in RAID 6.) – Zenonk Aug 01 '17 at 00:06
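
For reference, a MegaCLI equivalent of the sequence worked out in the comments above (a minimal sketch only; the logical drive index -L0, the adapter index -a0, and the [E:4] enclosure:slot pair are placeholders to fill in from -LDInfo and -PDList output, not values confirmed on this controller):

    # 1. Start a consistency check on the RAID 6 logical drive and watch its progress
    MegaCli -LDCC -Start -L0 -a0
    MegaCli -LDCC -ShowProg -L0 -a0

    # 2. Blink the drive's LED to confirm the right bay, then take it offline
    MegaCli -PdLocate -Start -PhysDrv [E:4] -a0
    MegaCli -PDOffline -PhysDrv [E:4] -a0

    # 3. After hot-swapping the disk, watch the rebuild progress
    MegaCli -PDRbld -ShowProg -PhysDrv [E:4] -a0

In this case the controller claimed the new drive and started the rebuild on its own, so the last step was just a progress check.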

0 Answers