Oracle Exalytics X4-4 server - RAID 1 volume corrupted while data sync was going on

Question

We replaced a failed drive in an Oracle Exalytics X4-4 machine. The failed drive was replaced without problems and the rebuild started, but when the rebuild reached 70%, the main (source) disk hit a bad sector and the rebuild failed. I tried rebuilding manually with megacli, but it failed again. Oracle says that the RAID 1 volume is corrupted and the only remaining option is to rebuild the entire server. The server is still running, in degraded mode. Is there any way to recover from this situation? Can a full server rebuild be avoided? Any help is appreciated.

score 4 · Answer 1 · answered Feb 20 '20 at 16:12

LSI RAID controllers should let the user rebuild a RAID 1 array even when the source drive has an uncorrectable read error, resulting in a punctured array (the unreadable blocks are flagged as bad on both members, so they stay unreadable, but the rest of the array is rebuilt). This, however, can be implementation dependent (i.e., the firmware and utilities of your Oracle box may not support it). Are you sure you cannot rebuild it, not even using megacli?
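To see what state the controller reports and whether it accepts another rebuild attempt, the usual MegaCli status and rebuild commands can be wrapped as below. This is only a sketch: the binary path `/opt/MegaRAID/MegaCli/MegaCli64`, the adapter number, the `[enclosure:slot]` value and the function names are assumptions to be replaced with what your own `-PDList` output shows, and it assumes the replacement disk is still assigned to the degraded array. Whether the firmware will complete a rebuild over a bad source sector is exactly the implementation-dependent part. By default it only prints the commands (`DRY_RUN=1`).

```shell
#!/bin/sh
# Sketch only: MEGACLI path, adapter number and [enclosure:slot] are
# assumptions; read the real values from the -PDList output first.
MEGACLI=${MEGACLI:-/opt/MegaRAID/MegaCli/MegaCli64}

# Print the command (default, DRY_RUN=1) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = enclosure:slot of the replacement drive, $2 = adapter number
rebuild_steps() {
    drive="[$1]"
    adapter="-a$2"
    run "$MEGACLI" -LDInfo -Lall "$adapter"                       # virtual drive state (Degraded, Optimal, ...)
    run "$MEGACLI" -PDList "$adapter"                             # per-disk state and media error counters
    run "$MEGACLI" -PDRbld -Start -PhysDrv "$drive" "$adapter"    # start the rebuild again
    run "$MEGACLI" -PDRbld -ShowProg -PhysDrv "$drive" "$adapter" # show rebuild progress
}
```

For example, `rebuild_steps 252:1 0` prints the four commands for slot 1 of enclosure 252 on adapter 0; review them, then rerun with `DRY_RUN=0`.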

If you really can't rebuild the array, the suggested plan is to back up all your data, destroy the array, recreate it and restore all data. If, and only if, this is not possible, you can try to attach the original disk to a spare machine and, from there, use `ddrescue` to clone it onto a new, identical disk. Then use the newly cloned disk to boot your Oracle box, and rebuild the array onto a third disk.
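The cloning step on the spare machine could look roughly like this. It is a sketch, not a tested procedure: the device names, the map file path and the function name `clone_disk` are hypothetical, and it prints the commands unless run with `DRY_RUN=0`. The first pass (`-n`, no scraping) saves the readable areas quickly; the second pass returns only to the areas the map file lists as bad, using direct disc access (`-d`) and up to three retry passes (`-r3`). Reusing the same map file is what lets the second run resume instead of starting over.

```shell
#!/bin/sh
# Sketch of the clone step. Device names and the map file path are
# hypothetical: confirm them with lsblk and the disk serials before running.

# Print the command (default, DRY_RUN=1) or execute it (DRY_RUN=0).
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# $1 = failing source disk, $2 = new target disk, $3 = ddrescue map file
clone_disk() {
    src=$1 dst=$2 map=$3
    # Refuse an obviously wrong invocation: the same device twice.
    if [ "$src" = "$dst" ]; then
        echo "source and destination must differ" >&2
        return 1
    fi
    # Pass 1: copy everything that reads cleanly; -n skips the slow
    # scraping phase so the healthy areas are saved first.
    run ddrescue -f -n "$src" "$dst" "$map"
    # Pass 2: retry only the areas recorded as bad in the map file, with
    # direct disc access (-d) and up to 3 retry passes (-r3).
    run ddrescue -f -d -r3 "$src" "$dst" "$map"
}
```

For example, `DRY_RUN=0 clone_disk /dev/sdb /dev/sdc /root/recovery.map`, only once you are certain which device is which: swapping source and destination would overwrite the only surviving copy.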

Disclaimer: this will cause downtime and any error can lead to complete data loss; don't even think about trying it without recent backups and a good understanding of the problem.

shodanshok

There's also [the `lsiutil` tool](https://docs.broadcom.com/docs/12351668), and that will have a lot more capabilities than `megacli` does - including turning the HBA into a paperweight... – Andrew Henle Feb 20 '20 at 16:28
How do I rebuild a RAID array when the main disk itself has gone bad and syncing to the second disk is not possible? – Nikhil Patwardhan Feb 21 '20 at 10:03
@NikhilPatwardhan Hi, I explained the basic steps in the answer above. If you don't know how to use `ddrescue`, please stop here and seek help from a professional IT consultant. Otherwise you will cause more damage and/or data loss. – shodanshok Feb 21 '20 at 10:10
I googled ddrescue and found this: – Nikhil Patwardhan Feb 21 '20 at 12:48
`ddrescue -f -n /dev/[baddrive] /dev/[gooddrive] /root/recovery.log`. Looks like this should work. I will try it and let you know the status. I am an experienced Linux admin but have never used ddrescue before, so I wanted to be sure. Found some useful info about it. Thanks – Nikhil Patwardhan Feb 21 '20 at 12:49
I could do the cloning with the `dd` command as well, but I'm not sure what will happen to the bad block while cloning. Please let me know if you have any info about it. – Nikhil Patwardhan Feb 21 '20 at 12:57
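On the `dd` question in the last comment: plain `dd` aborts at the first read error. With `conv=noerror,sync` it carries on, and `sync` pads every short input block with NUL bytes up to the block size; a failed read is typically treated as an empty block, so the offsets on the target stay correct but a whole block is zero-filled per error (hence a small block size such as `bs=4096` is usually preferred). Unlike `ddrescue`, `dd` keeps no map of the bad areas, so it cannot resume or retry only those. The hypothetical helper `padded_size` below only demonstrates the padding rule on temporary files; it does not touch any disk.

```shell
#!/bin/sh
# Demonstrates, on ordinary files, the padding rule dd's conv=sync applies:
# every input block shorter than the block size is padded with NUL bytes up
# to that size. No disk or read error is involved here.

# $1 = block size in bytes; prints the size of dd's output for a 5-byte input
padded_size() {
    tmp=$(mktemp -d) || return 1
    printf 'hello' > "$tmp/in"
    dd if="$tmp/in" of="$tmp/out" bs="$1" conv=sync 2>/dev/null
    wc -c < "$tmp/out" | tr -d ' '
    rm -r "$tmp"
}
```

For example, `padded_size 3` gives 6: the 5-byte input is read as one full 3-byte block plus a 2-byte block padded to 3, and `padded_size 4096` gives 4096.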
