
I am trying to install a host OS, but whatever I try installing always fails with an I/O-related error. Similar issues described on the internet all point to an imminent or existing disk failure. I have tried multiple RAID combinations to narrow down the issue, but no luck so far. The machine is a Dell R720 with a PERC H710p RAID controller, and it originally came with 6 x 600GB 6G SAS 10k 2.5" drives. I tried these configurations:

  • 1 disk in RAID 0 (group 1), 5 disks in RAID 5 (group 2): errors when attempting to install on /dev/sda (group 1) and on /dev/sdb (group 2); I also tried using different disks from the bay to form the same groups
  • 3x RAID 1: errors when attempting to install on /dev/sda, /dev/sdb and /dev/sdc
  • removed 3 drives, tried 1x RAID 5 on the 3 remaining drives: same errors

The operating systems I tried so far:

  • Alpine 3.13: reported I/O error, installer exits to ash when trying to write the partition table
  • Ubuntu 16 LTS: reported I/O error right at the start of the installation process
  • Ubuntu 18 LTS: udevadm settle retried multiple times, I/O errors reported right before, installer crashes and restarts to region selection
  • Ubuntu 20 LTS: same as Ubuntu 18
  • CentOS 7: reported a Python error in Anaconda when trying to write to disk, before I was even able to enter the root password; installer hangs, machine requires hard reboot
  • XenServer 7.0: installer stopped at 68%, machine required hard reboot

With every one of these, regardless of which disk group (VD?) I use for the OS, as soon as the installer attempts to write the partition table, all disks in the selected disk group start blinking amber. With Ubuntu 18 / 20 this consistently happens at the step where I enter the user name, server name and password. After a reboot, the disks blink green again. In the RAID configuration utility (CTRL+R), all disks are online and the VD state is reported as Optimal. I have SATA AHCI set in the boot properties in the BIOS.

I ran the lifecycle manager tests on the server and everything is dandy. No errors are reported, except for the missing PERC battery, as the server does not have one physically installed. I understand why I would need this battery for data consistency on power loss, but it should not prevent me from installing the OS, should it? I suspect that the RAID controller is faulty, but I am not an expert.

Is there anything else I can do to further diagnose the problem?

  • It certainly sounds like your RAID controller, though it could be a backplane or cable problem too. I'd just pick up another controller on eBay or some similar site and try it out. – Michael Hampton Jan 27 '21 at 23:01

1 Answer


I would check that the firmware across the board has been updated to the latest versions (BIOS, PERC, disks, etc.) in case there is an issue that has been resolved already.

If the issue continues, I would replace the PERC card and get one with the battery if possible. You really don't want corrupted disks in the event of a power loss, and the battery will allow you to use write-back caching safely, which will improve write throughput to the array.

If it still persists then look at the backplane itself. I've never experienced a backplane problem, but it can happen.
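While swapping parts, it can help to record which block devices the kernel actually reports I/O errors against, so you can tell whether the errors follow a disk group or appear on everything behind the controller. Below is a minimal sketch in Python, assuming you can save the `dmesg` output from the installer shell to a file and that the kernel logs lines containing `I/O error, dev <name>, sector <n>`; the exact prefix differs between kernel versions. The function name `count_io_errors` and the sample log lines are hypothetical and are not taken from the machine in the question.

```python
import re
from collections import Counter

# Matches the common tail of kernel block-layer error lines, for example
# "blk_update_request: I/O error, dev sda, sector 2048".
# Assumption: the prefix before "I/O error" varies by kernel version,
# so it is deliberately left out of the pattern.
IO_ERROR = re.compile(r"I/O error, dev (\w+), sector (\d+)")


def count_io_errors(log_text):
    """Return a Counter mapping device name to the number of I/O error lines."""
    counts = Counter()
    for line in log_text.splitlines():
        match = IO_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts


# Illustrative sample only; not output from the server in the question.
SAMPLE = """\
[  101.1] blk_update_request: I/O error, dev sda, sector 2048
[  101.2] blk_update_request: I/O error, dev sda, sector 2056
[  102.0] end_request: I/O error, dev sdb, sector 0
[  103.0] sd 0:2:0:0: [sda] Attached SCSI disk
"""

if __name__ == "__main__":
    for dev, n in sorted(count_io_errors(SAMPLE).items()):
        print(f"{dev}: {n} I/O error line(s)")
```

If errors show up on every virtual disk no matter which physical drives back it, that is consistent with the controller, cable or backplane being at fault rather than an individual drive. It does not prove it, though.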

Justin Scott