2

Got a notice last night that a drive had failed on a server. I came in this morning to replace it, and we're now getting the following. The controller config report for the array looks fine, apart from the unusual status Ready for Rebuild.

 ~ # hpacucli controller all show config
Smart Array P400i in Slot 0 (Embedded)    (sn: XXXXXXXX     )
   array A (SAS, Unused Space: 0 MB)
   logicaldrive 1 (341.7 GB, RAID 5, Ready for Rebuild)
   physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 72 GB, OK)
   physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 72 GB, OK)
   physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 72 GB, OK)
   physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 146 GB, OK)
   physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS, 72 GB, OK)
   physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS, 72 GB, OK)

The logical drive shows a hint, Parity Initialization Status: Initialization Failed:

~ # hpacucli controller slot=0 logicaldrive 1 show 
Smart Array P400i in Slot 0 (Embedded)
   array A
      Logical Drive: 1
         Size: 341.7 GB
         Fault Tolerance: RAID 5
         Heads: 255
         Sectors Per Track: 32
         Cylinders: 65535
         Strip Size: 64 KB
         Full Stripe Size: 320 KB
         Status: Ready for Rebuild
         Array Accelerator: Enabled
         Parity Initialization Status: Initialization Failed
         Unique Identifier: XXXXXXX
         Disk Name: /dev/cciss/c0d0
         Mount Points: /boot 191 MB, / 28.6 GB
         OS Status: LOCKED
         Logical Drive Label: XXXXX     6797

Controller configuration, if it helps:

 ~ # /usr/sbin/hpacucli ctrl slot=0 show
Smart Array P400i in Slot 0 (Embedded)
   Bus Interface: PCI
   Slot: 0
   Serial Number: XXXXXXXX     
   Cache Serial Number: XXXXXXXX
   RAID 6 (ADG) Status: Enabled
   Controller Status: OK
   Hardware Revision: B
   Firmware Version: 1.18
   Rebuild Priority: Low
   Expand Priority: Low
   Surface Scan Delay: 15 secs
   Surface Scan Mode: Idle
   Post Prompt Timeout: 0 secs
   Cache Board Present: True
   Cache Status: OK
   Accelerator Ratio: 50% Read / 50% Write
   Drive Write Cache: Disabled
   Total Cache Size: 256 MB
   Total Cache Memory Available: 208 MB
   No-Battery Write Cache: Disabled
   Cache Backup Power Source: Batteries
   Battery/Capacitor Count: 1
   Battery/Capacitor Status: OK
   SATA NCQ Supported: False

How do I go about debugging this?

Edit:

All of the individual drives appear fine:

~ # hpacucli controller all show config detail | grep Status
   RAID 6 (ADG) Status: Enabled
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK
      Status: OK
         Status: Ready for Rebuild
         Parity Initialization Status: Initialization Failed
         OS Status: LOCKED
         Status: OK
         Status: OK
         Status: OK
         Status: OK
         Status: OK
         Status: OK
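For monitoring, the same check can be scripted. The following is a minimal sketch, not an HP-provided tool: it parses the text of the `show config detail` report shown above and lists every status line that is not considered healthy. The function name `find_unhealthy` and the set of values treated as healthy are assumptions; in particular, treating `OS Status: LOCKED` as normal for a mounted logical drive is a guess that should be confirmed before alerting on it.

```python
import re

# Values treated as healthy. This set is an assumption for illustration;
# "LOCKED" is assumed to mean the OS has the logical drive in use.
HEALTHY = {"OK", "Enabled", "LOCKED"}

def find_unhealthy(report):
    """Return (field, value) pairs for each '<field> Status: <value>' line
    whose value is not in HEALTHY."""
    problems = []
    for line in report.splitlines():
        match = re.match(r"\s*(.*Status):\s*(.+?)\s*$", line)
        if match and match.group(2) not in HEALTHY:
            problems.append((match.group(1), match.group(2)))
    return problems

# Sample input copied from the grep output in the question.
SAMPLE = """\
   RAID 6 (ADG) Status: Enabled
   Controller Status: OK
   Cache Status: OK
   Battery/Capacitor Status: OK
      Status: OK
         Status: Ready for Rebuild
         Parity Initialization Status: Initialization Failed
         OS Status: LOCKED
         Status: OK
"""

if __name__ == "__main__":
    for field, value in find_unhealthy(SAMPLE):
        print(f"WARNING: {field} = {value}")
```

In practice the report text would come from running `hpacucli controller all show config detail` rather than from a hard-coded sample. Note that this only catches what the report exposes: the Rebuild Aborted From Read Error flag mentioned below lives in the diag output, which this sketch does not parse.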

Edit 2:

I'm debugging some adverse interactions between hpacucli and grsec (also mp-SSH and Ubuntu), but we do have hpacucli diag results available, and buried in the Logical Drive Status Flags is Rebuild Aborted From Read Error. What confuses me is how a read error during a rebuild can stop the rebuild without also marking one of the drives as predictive failure, or worse.

HopelessN00b
jldugger
  • Rather than pursue the failure codes further, we're rsync'ing data over to a new host. I'm still baffled by Broco's explanation that a read error would succeed at the OS level but fail during rebuild when you're already down a drive. – jldugger Aug 15 '14 at 21:36

2 Answers

3

Ready for Rebuild is a bad status if you're using a parity RAID level, like 5 or 6. It means that you likely have read errors on another drive in the array, i.e. another failing drive.

If the system is still online your best option is to recover data or rebuild. There's no good fix for this, and definitely not much you can do to debug.

See the following:

Force LUN in a HP Smart Array to rebuild

HP Proliant ML350 G5 SAS HDD

HP SmartArray P400: How to repair failed logical drive?

And of course: RAID-5: Two disks failed simultaneously?

ewwhite
  • I've updated the question with the results of **config detail**: none of the drives are failing, or predicted to fail. – jldugger Aug 13 '14 at 18:50
  • @jldugger You may not *see* that there's a drive issue, but there is... An Array Diagnostics Utility (ADU) report will confirm this. – ewwhite Aug 13 '14 at 18:55
  • @jldugger trust ewwhite, he knows what he is talking about and I can confirm this. There is a certain tolerance before a device will be marked as failed (e.g. a certain number of dysfunctional blocks). It may be tolerable for the system to keep working but it is **not** tolerable for rebuilding a RAID (this is especially the case if parity data is being compromised by bad blocks). Fetch your data, reinstall the machine with a fresh RAID, and replace all the old hard drives. For the future, you should have a fixed cycle for exchanging HDDs, ideally once a year, and at least every 2 years. – Broco Aug 13 '14 at 20:13
  • I don't doubt it's broken, I just don't see any evidence that it's the disks in the usual place I check. I follow the 'anything that leads to a CRITICAL is a WARNING' model, so I want to know what happened and how we might monitor for it going forward. – jldugger Aug 14 '14 at 00:18
  • See the above links about UREs (unrecoverable read errors). The key is basically to avoid RAID 5. – ewwhite Aug 14 '14 at 02:41
  • It could just be a driver issue. I have the same with Adaptec: it just does not start the rebuild. Reboot, and the rebuild starts and goes through. Something about programmers making errors that somehow never get fixed. – TomTom Aug 10 '15 at 05:44
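
To put a rough number on the URE point raised in the comments above, here is a back-of-the-envelope sketch. It assumes a RAID 5 rebuild has to read all five surviving drives in full, that each contributes about 72 GB (the array's smallest member size), and that read errors occur independently at a fixed per-bit rate. The rates used are illustrative spec-sheet-style figures, not values measured on these drives, and `p_read_error` is a hypothetical helper name.

```python
import math

def p_read_error(bytes_to_read, ure_per_bit):
    """Probability of at least one unrecoverable read error while reading
    bytes_to_read bytes, assuming independent errors at ure_per_bit."""
    bits = bytes_to_read * 8
    # 1 - (1 - p)^n, written with log1p/expm1 so tiny p does not lose precision
    return -math.expm1(bits * math.log1p(-ure_per_bit))

SURVIVING_DRIVES = 5   # 6-drive RAID 5 with one member being rebuilt
DRIVE_BYTES = 72e9     # assumed usable size per member

if __name__ == "__main__":
    for rate in (1e-14, 1e-15, 1e-16):  # illustrative per-bit URE rates
        p = p_read_error(SURVIVING_DRIVES * DRIVE_BYTES, rate)
        print(f"URE rate {rate:g}/bit: P(read error during rebuild) = {p:.4%}")
```

Under these assumptions the risk per rebuild is small but not negligible at the pessimistic rate, and aging drives can exceed their rated error rates, which fits a drive reporting OK while still producing the one read error that aborts a rebuild.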
2

Have you upgraded your firmware? v1.18 seems pretty old for the P400i controller. Having all drives report OK while parity initialization fails looks like a bug to me.

I've had a number of cases where HP shipped older firmware and doing the upgrade fixed parity initialization issues (but I needed to reconstruct the array from scratch) and significantly improved performance as well (not exactly the same unit though, I'm using the P440AR).

Tel