Got a notice last night that a drive had failed on a server. I came in this morning to replace it, and now we're seeing the following. The controller config report for the array looks fine, apart from the unusual status Ready for Rebuild.
~ # hpacucli controller all show config
Smart Array P400i in Slot 0 (Embedded) (sn: XXXXXXXX )
array A (SAS, Unused Space: 0 MB)
logicaldrive 1 (341.7 GB, RAID 5, Ready for Rebuild)
physicaldrive 1I:1:1 (port 1I:box 1:bay 1, SAS, 72 GB, OK)
physicaldrive 1I:1:2 (port 1I:box 1:bay 2, SAS, 72 GB, OK)
physicaldrive 1I:1:3 (port 1I:box 1:bay 3, SAS, 72 GB, OK)
physicaldrive 1I:1:4 (port 1I:box 1:bay 4, SAS, 146 GB, OK)
physicaldrive 2I:1:5 (port 2I:box 1:bay 5, SAS, 72 GB, OK)
physicaldrive 2I:1:6 (port 2I:box 1:bay 6, SAS, 72 GB, OK)
The logical drive shows a hint, Parity Initialization Status: Initialization Failed:
~ # hpacucli controller slot=0 logicaldrive 1 show
Smart Array P400i in Slot 0 (Embedded)
array A
Logical Drive: 1
Size: 341.7 GB
Fault Tolerance: RAID 5
Heads: 255
Sectors Per Track: 32
Cylinders: 65535
Strip Size: 64 KB
Full Stripe Size: 320 KB
Status: Ready for Rebuild
Array Accelerator: Enabled
Parity Initialization Status: Initialization Failed
Unique Identifier: XXXXXXX
Disk Name: /dev/cciss/c0d0
Mount Points: /boot 191 MB, / 28.6 GB
OS Status: LOCKED
Logical Drive Label: XXXXX 6797
Controller configuration, if it helps:
~ # /usr/sbin/hpacucli ctrl slot=0 show
Smart Array P400i in Slot 0 (Embedded)
Bus Interface: PCI
Slot: 0
Serial Number: XXXXXXXX
Cache Serial Number: XXXXXXXX
RAID 6 (ADG) Status: Enabled
Controller Status: OK
Hardware Revision: B
Firmware Version: 1.18
Rebuild Priority: Low
Expand Priority: Low
Surface Scan Delay: 15 secs
Surface Scan Mode: Idle
Post Prompt Timeout: 0 secs
Cache Board Present: True
Cache Status: OK
Accelerator Ratio: 50% Read / 50% Write
Drive Write Cache: Disabled
Total Cache Size: 256 MB
Total Cache Memory Available: 208 MB
No-Battery Write Cache: Disabled
Cache Backup Power Source: Batteries
Battery/Capacitor Count: 1
Battery/Capacitor Status: OK
SATA NCQ Supported: False
How do I go about debugging this?
Edit:
All of the individual drives appear fine:
~ # hpacucli controller all show config detail | grep Status
RAID 6 (ADG) Status: Enabled
Controller Status: OK
Cache Status: OK
Battery/Capacitor Status: OK
Status: OK
Status: Ready for Rebuild
Parity Initialization Status: Initialization Failed
OS Status: LOCKED
Status: OK
Status: OK
Status: OK
Status: OK
Status: OK
Status: OK
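The grep above drops the context that says which device each Status line belongs to. As a minimal sketch, the helper below pairs every bare Status line with the most recent device header above it. The function name label_status is hypothetical, and the header patterns ("Array: A" or "array A", "Logical Drive: 1", "physicaldrive 1I:1:1") are assumptions based on the output pasted in this post, so they may need adjusting for other hpacucli versions.

```shell
# label_status: hypothetical helper, not part of hpacucli.
# Reads hpacucli "show config detail" style text on stdin and prints each
# bare "Status:" line prefixed with the last device header seen above it.
# Assumed header forms: "Array: A" / "array A", "Logical Drive: 1",
# "physicaldrive 1I:1:1". Lines such as "Controller Status:" or
# "Parity Initialization Status:" are deliberately ignored.
#
# Intended use (assumes hpacucli is installed and on PATH):
#   hpacucli controller all show config detail | label_status
label_status() {
    awk '
        # Remember the most recent array, logical drive, or physical drive header.
        /^[ \t]*(Array: |array |Logical Drive: |physicaldrive )/ {
            sub(/^[ \t]+/, "")
            dev = $0
            next
        }
        # Print only lines that begin with "Status:" after indentation.
        /^[ \t]*Status:/ {
            sub(/^[ \t]+/, "")
            print dev " -> " $0
        }
    '
}
```

With the output in this post, that should make it obvious that the only non-OK status belongs to logical drive 1 and not to any physical drive.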
Edit 2:
I'm debugging some adverse interactions between hpacucli and grsec (also mp-SSH and Ubuntu), but we do have hpacucli diag results available, and buried in the Logical Drive Status Flags is Rebuild Aborted From Read Error. What confuses me is how a read error during a rebuild can cause the rebuild to stop without marking one of the drives as predictive failure, or worse.