Dell PE2950 Will not boot


I have a fresh install of CentOS 7.2 on a Dell PE2950 that will not reboot properly.

Every time I try to reboot, whether after BIOS changes, RAID changes, or after booting into the OS from the HDDs or a USB drive, it hangs. No matter what, it will not complete a reboot.

The fans spin partially down, sometimes even stop, then spin back up a bit, but no signal ever goes back to the monitor and the keyboard and mouse stay unpowered.

The key detail is that I can get into the OS just fine: all I have to do is unplug the PSUs for ~1 minute, and then I get one proper boot cycle out of it.

Since I'm new to Linux I don't even know where to start debugging, but I believe the PSUs are working fine, because they carry the heaviest load while the OS is running without any problem.


Clearing NVRAM Helped

So after considerable tinkering I found a solution: all I had to do was clear the NVRAM. It seems there was some BIOS setting, either something I changed or something set before I got the machine, that was keeping it from rebooting properly.

I've now run it through 3 boot cycles and it's working fine without any errors.

Hopefully this helps in case anyone runs into the same kind of issue.


NVRAM Fix Didn't Stick

So clearing the NVRAM seemed to help, but after a few cycles, once I actually started configuring the server, it got stuck in the same problem again.

I moved from GNOME to the console using init 3.

Once at the console, I couldn't log in as root:

host login: root
Password: [correct password]
Login incorrect

host login: localUser
Password: [correct password]

$ sudo passwd
localUser is not in the sudoers file...

$ reboot

The above is slightly paraphrased. Now it's stuck again. :/
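
In case it helps anyone trying the same thing, the steps above map to roughly this on CentOS 7 (a minimal sketch, assuming the default setup where sudo rights come from the wheel group; init 3 is effectively an alias for the systemd multi-user target):

```
# Drop from the GNOME session to the text console (CentOS 7 / systemd).
# "init 3" still works, but it simply switches to the multi-user target.
systemctl isolate multi-user.target

# Check whether a user can use sudo; on a default CentOS 7 install,
# sudo access is granted through membership in the "wheel" group.
id localUser            # look for "wheel" in the groups list

# If the user is missing from wheel, root can add them:
usermod -aG wheel localUser
```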

kyle_engineer

Posted 2017-05-05T17:34:17.213

Reputation: 131

Sounds like you have a hardware issue. Start by trying a different, known-good power supply. – Ƭᴇcʜιᴇ007 – 2017-05-05T17:39:17.843

Ok. I'll see if I have any others and update. – kyle_engineer – 2017-05-05T18:09:42.853

I found another set of PSUs and am trying them now. Loading RAID now and all is good so far... – kyle_engineer – 2017-05-05T18:15:40.920

So, it loaded up to the point of choosing a boot partition (I chose CentOS 7 (core)), and then the screen went into power-save mode, the fans ramped up, and it's stuck again as though it had tried to reboot. Should I leave it alone for a bit in case it is doing something? – kyle_engineer – 2017-05-05T18:18:18.570

So I switched the PSUs with a couple of others (which I believe work properly) and booted into the rescue partition on the primary drive. Everything worked. I installed ClamAV, then clicked update and shutdown, and it did the same thing. :'( Still stuck on the reboot action... I'll try another PSU, but it doesn't seem like that helped at all... – kyle_engineer – 2017-05-06T00:20:07.980

Kyle - you say... `At the end of the installation process I clicked reboot, waited for the server to shutdown, then removed the USB drive. The problem is`. I say "the problem is your installation may be the issue as well, since after you made that change, you never confirmed that the change was successful"... I would not do a RAID 0, though, as that's not really RAID unless you don't care about redundancy and just need the performance. I would do a RAID 5 and split out separate partitions as needed before I'd run a RAID 0. Check your RAID configs, disk drive health, etc. just in case. – Pimp Juice IT – 2017-05-06T05:50:24.827
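
For the drive-health check, something along these lines could work from inside CentOS (a rough sketch, assuming smartmontools is installed and the six disks sit behind the PERC/MegaRAID controller, so each physical disk is addressed with the megaraid device type rather than as plain /dev/sdX):

```
# Install SMART tools on CentOS 7.
yum install -y smartmontools

# Physical disks behind a PERC/MegaRAID controller are queried through the
# controller's block device (often /dev/sda) using -d megaraid,<ID>.
for i in 0 1 2 3 4 5; do
    echo "=== physical disk $i ==="
    smartctl -H -d megaraid,$i /dev/sda    # overall SMART health verdict
done
```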

@Spittin'IT that's fine. So aside from the RAID configuration being non-ideal, what might be causing the intermittent ability to boot? Because it also happens after using the pre-OS RAID configuration tool... so I don't understand how an OS issue would affect the pre-boot behavior... unless part of the install affected some firmware on the server... :/ – kyle_engineer – 2017-05-06T05:59:39.240

Sometimes, yeah, but I have to unplug it for 30+ seconds before powering back on. It's a PowerEdge 2950 with the A07 BIOS update. – kyle_engineer – 2017-05-06T06:05:47.100

I say unplug all power and then hold down on the power button for 10 seconds to discharge any residual voltage. Then with it still unplugged, unplug each SCSI drive from each bay one by one and reseat them. Be sure you have all 6 healthy drives, and then plug the server back in and boot to the RAID and carve all the drives to a RAID 5. From there boot to your OS install media and carve out a partition and install the OS and then reboot and confirm if it works as expected. – Pimp Juice IT – 2017-05-06T06:10:26.810

If you installed the OS onto a RAID 0 and one of those two drives failed with the error you saw, then the OS install is hosed; RAID 0 is the opposite of redundancy, so if either drive fails, the whole array fails. – Pimp Juice IT – 2017-05-06T06:12:47.460

Let us continue this discussion in chat. – Pimp Juice IT – 2017-05-06T06:20:28.863

Answers


Hardware Fixes

So I replaced the CMOS battery and then started getting errors that had not been showing before, namely RAM failures. I replaced all the RAM in the bank that was generating errors (DIMMs 5-8), and now it boots properly every time. It's faster than it was before, too.

At this point it's been booting/rebooting properly for ~22 hours and everything seems to check out as far as hardware stability is concerned. So this is looking like the final fix.


Moral of the story: check ALL aspects of the hardware before spending a bunch of time deploying a machine, especially one that's been sitting for several years. I could have saved myself some stress by checking the CMOS battery and RAM first, before trying to install and deploy. But hindsight is 20/20, so there you go.
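
For anyone doing the same kind of triage, RAM can at least be sanity-checked from a running system before committing to a deployment (a small sketch, assuming the memtester package from EPEL; a full memtest86+ run from boot media is still the more thorough option):

```
# memtester is available from the EPEL repository on CentOS 7.
yum install -y epel-release && yum install -y memtester

# Lock and test 1 GB of RAM for one pass; raise the size/iterations as time allows.
memtester 1024M 1

# Any ECC/DIMM errors the kernel has already logged will show up here:
grep -iE 'edac|mce|memory' /var/log/messages
```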

kyle_engineer

Posted 2017-05-05T17:34:17.213

Reputation: 131