The company I work for just bought 3 PowerEdge 2970 servers and they all have the same problem.
- Is this server worth buying, or do the problems that come with it make it not worth it?
- Are there a lot of issues with using AMD processors (it's an Opteron)?
- Are you guys able to pinpoint the problem if I give details on which errors I get in the event logs?
Here is the problem:
1. Power on the server. It boots up to the Red Hat splash screen.
2. In the middle of the boot-up the server crashes with the following errors:
-CPU Machine Chk: processor sensor, transition to non-recoverable was asserted
-PCI Parity Err: critical event sensor, PCI PERR (BUS 0 DEVICE 1 FUNC 0)
Then I tried to update the BIOS and the BMC, but the problem was still there. After that I tried to update the OS from Red Hat Enterprise 5.1 to Red Hat 5.3. There was something odd there too. I booted the server with the Build and Update Utility, then selected Install OS. I selected Red Hat Enterprise 5.3 x86_64. It asked me for the x86_64 media, so I put in the disc that said: supplementary disc 1 of 1 for 64-bit AMD64 and Intel 64. It said wrong disc. So then I used the disc that said: installation disc 1 of 1 for 64-bit Intel Itanium. My guess is that's the disc I needed to use all along.
After this the system was able to boot up to the command line login screen. I logged in and typed startx to get into the GUI environment. At that point less than a page of text scrolled by fast and the server crashed without showing anything GUI related.
At that point I had 2 different errors (notice the device is 4 now, gonna check which device it is):
-PCI Parity Err: critical event sensor, PCI PERR (BUS 0 DEVICE 4 FUNC 0)
-PCI System Error: critical event sensor, PCI SERR (BUS 0 DEVICE 4 FUNC 0)
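For checking which device that is, here is a rough sketch of turning the BUS/DEVICE/FUNC numbers from the event log into the bus:device.function address that lspci takes. The function name pci_address is made up for this example, and I'm assuming the log prints the numbers in decimal (lspci uses hex), which I haven't confirmed.

```python
import re

# Matches the "(BUS 0 DEVICE 4 FUNC 0)" part of the event log messages.
BDF_PATTERN = re.compile(r"BUS\s+(\d+)\s+DEVICE\s+(\d+)\s+FUNC\s+(\d+)", re.IGNORECASE)


def pci_address(event_line):
    """Return an lspci-style bus:device.function string, or None if not found."""
    match = BDF_PATTERN.search(event_line)
    if match is None:
        return None
    # Assumption: the event log shows decimal numbers; lspci expects hex.
    bus, device, func = (int(group) for group in match.groups())
    return f"{bus:02x}:{device:02x}.{func:x}"


events = [
    "PCI PERR (BUS 0 DEVICE 1 FUNC 0)",
    "PCI PERR (BUS 0 DEVICE 4 FUNC 0)",
    "PCI SERR(BUS 0 DEVICE 4 FUNC 0)",
]
for line in events:
    print(line, "->", pci_address(line))
```

If the server stays up long enough to get a shell, running lspci -s 00:04.0 should then show what sits at that address.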
So today the tech guy came with a bunch of parts and basically rebuilt the server on site (PCI riser, motherboard, DIMMs, a SAS card and something else I can't remember off the top of my head), but after that the problems were even worse. Some of these errors were (mind you, at that point he was putting back some of the original parts, so things got messy):
ECC uncorr Err: memory sensor, uncorrectable ECC (DIMM1 DIMM2) was asserted.
E1231 1.2V HT core power GD
E1911 <3 ERRORS check log
E1000 failsafe
Tomorrow he is coming back with a power supply...
UPDATE: Seems like I can't waste any more time on this. We are calling the sales people and asking for new servers.