I have RAID 50 configured on a XenServer host with a PERC H700 card, and a few weeks ago I replaced a disk which had failed. The RAID has since rebuilt, and I am now checking the status of the array via omreport:

# omreport storage vdisk

Controller PERC H700 Integrated (Slot 4)
ID                            : 0
Status                        : Critical
Name                          : Virtual Disk 0
State                         : Resynching
Hot Spare Policy violated     : Not Assigned
Virtual Disk Bad Blocks       : Yes
Encrypted                     : Not Applicable
Layout                        : RAID-50
Size                          : 14,900.00 GB (15998753177600 bytes)
Associated Fluid Cache State  : Not Applicable
Device Name                   : /dev/sda
Bus Protocol                  : SATA
Media                         : HDD
Read Policy                   : Adaptive Read Ahead
Write Policy                  : Write Through
Cache Policy                  : Not Applicable
Stripe Element Size           : 64 KB
Disk Cache Policy             : Enabled

My question is: why has the state been stuck at Resynching for such a long time? There is not much I/O activity, as there are no VMs running on the host at the moment. Also, what does Resynching actually involve?
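
If it helps with diagnosis, I believe I can also pull progress details from MegaCLI with something like the following (the enclosure:slot of the replaced disk is my guess, so treat it as a placeholder):

# MegaCli -LDInfo -Lall -aAll                      # state and properties of each virtual disk
# MegaCli -PDRbld -ShowProg -PhysDrv [32:2] -a0    # rebuild progress for the replaced disk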

The other point to mention is that the battery status is critical:

# omreport storage battery

Controller PERC H700 Integrated (Slot 4)
ID                  : 0
Status              : Critical
Name                : Battery 0 
State               : Failed
Recharge Count      : Not Applicable
Max Recharge Count  : Not Applicable
Learn State         : Idle
Next Learn Time     : 15 days 22 hours
Maximum Learn Delay : 7 days 0 hours
Learn Mode          : Auto

However, MegaCLI is showing the battery as Optimal:

BBU status for Adapter: 0
BatteryType: BBU
Voltage: 4035 mV
Current: 0 mA
Temperature: 27 C
Battery State: Optimal

What is the reason for the conflict between the two reports?
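
For reference, I believe the commands for pulling battery status from each tool are along these lines (MegaCLI syntax from memory, so it may need adjusting for the installed version):

# omreport storage battery controller=0    # OMSA view of the battery, as above
# MegaCli -AdpBbuCmd -GetBbuStatus -a0     # LSI view of the BBU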

Thanks in advance, please ask if you require any further information.

W Khan

1 Answer

It's possible that the disks being read to calculate the "resync" data are encountering bad blocks during the process. Since you're using RAID 50, hitting ANY unreadable block (a URE) on the remaining drives of the RAID 5 span that is rebuilding means that block cannot be reconstructed; Dell refers to this as a "puncture".
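
One quick way to check whether the member disks are logging read problems is to look at the per-drive error counters; a rough sketch (controller and slot numbers will vary on your system):

# omreport storage pdisk controller=0                                                  # per-disk state and failure prediction
# MegaCli -PDList -aAll | grep -E 'Slot Number|Media Error|Other Error|Predictive'     # error counters per drive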

I suspect this because you're seeing Virtual Disk Bad Blocks : Yes. Bad blocks don't occur at the virtual disk level unless the underlying RAID "loses" a block because the pieces needed to reconstruct it are bad or missing. This is one reason why production data is typically much safer on RAID 10 or RAID 6. In almost every case of virtual-level bad blocks I've encountered, the only fix is to re-initialize the RAID and restore from backup. The only way out is if the affected block happens to contain data that never needs to be read (or empty space at the file system level) and is eventually overwritten; otherwise you likely have some degree of data corruption that should be investigated and addressed.
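
If you want to see which blocks are affected, MegaCLI can dump the controller's bad block table for the virtual disks; something like this should do it (again, verify the exact syntax against your MegaCLI version):

# MegaCli -GetBbtEntries -Lall -aAll    # list bad block table entries for all virtual disks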

As for the battery status discrepancy, I would trust MegaCLI over omreport. MegaCLI comes from the controller OEM (LSI) and is designed specifically for that task, while omreport is responsible for monitoring all of the Dell hardware components. Most likely, restarting the OMSA services or updating to the current version will clear up the discrepancy.
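
A minimal sketch of that, assuming a standard OMSA install under /opt/dell/srvadmin:

# /opt/dell/srvadmin/sbin/srvadmin-services.sh restart    # restart all OMSA services
# omreport storage battery controller=0                   # re-check the battery status afterwards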

If you have an active warranty on the system, you may also want to consider contacting Dell to advise on both matters.

JimNim
  • Informative answer, thanks. The resynching stage has now finished. I have read up more on virtual bad blocks and this doesn't look good! Is there really no way to recover now, except to start again? RAID 6 here would have helped, right? Here is the output from MegaCli -PDList -aAll: `... Slot Number: 2 Media Error Count: 0 Other Error Count: 1188 ...` However, Predictive Failure Count is 0. I am now seeing awful performance on this host with just 3 VMs. – W Khan Jul 23 '15 at 08:34
  • Yes, RAID 6 likely would have prevented this, though it may not be a suitable RAID level for your performance needs and workload. You might consider RAID 10 if you can get by with the drop in usable capacity. Once a virtual block is lost/bad, its data has been lost due to an inability to calculate it. Typically I suggest migrating the virtual machines to alternate storage, starting fresh on the damaged VD, then performing file system checks at the guest OS level for each VM. Read up on RAID rebuild UREs for more on why RAID 5 is no longer considered an industry standard for production data. – JimNim Jul 23 '15 at 15:10