
One of our Dell PowerEdge LCDs was showing "CPU 2 machine check error", but I couldn't find anything in the logs regarding MCE or "Hardware Error." I cleared the message, but I wanted to run the machine through some heavy stuff to see if I could make it stumble again.

I ran an infinite-loop bash script 64 times in parallel (once per core) for a few minutes. Then I used a program called "stress" to do the same thing with CPU and memory. My question is: how long is long enough before it's generally OK to say, "okay, this machine is good to go"? A few minutes? An hour? As long as CPU temps remain OK?
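For reference, the one-busy-loop-per-core approach can be sketched roughly as follows. This is a minimal illustration, not the exact script that was run: the function name `burn_cpus` and its parameters are hypothetical, and each child process is just the Python equivalent of a shell `while true` loop.

```python
import os
import subprocess
import sys
import time


def burn_cpus(workers=None, seconds=60.0):
    """Run one busy-loop process per worker for `seconds`, then stop them all.

    `burn_cpus` is a hypothetical helper for illustration only.
    Returns the number of busy-loop processes that were started.
    """
    if workers is None:
        # Default to one worker per logical CPU reported by the OS.
        workers = os.cpu_count() or 1
    # Each child spins forever until it is terminated by the parent.
    cmd = [sys.executable, "-c", "while True: pass"]
    procs = [subprocess.Popen(cmd) for _ in range(workers)]
    try:
        time.sleep(seconds)
    finally:
        # Always clean up, even if the sleep is interrupted with Ctrl-C.
        for p in procs:
            p.terminate()
        for p in procs:
            p.wait()
    return len(procs)
```

Calling `burn_cpus(seconds=300)` would keep every logical CPU busy for about five minutes. Note that a pure busy loop only generates load and heat; it does not check that any computation came out correctly.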

CptSupermrkt

2 Answers


If the server is under warranty, have the vendor replace the part.

If the server is not under warranty and the part cannot be replaced, any answer will be subjective.

Is this a server that CANNOT fail (e.g., running life support or handling real-time financial transactions)? Or is this just a web server for a puppy fan site?

Either way, just run the server through whatever 'burn in' process you have for new hardware.


I will add: If you came here hoping to find someone to sign off on the risk involved with leaving this server in production, none of our answers should be construed as saying that it is acceptable to leave the server in production as is. THAT is something you will have to send through your company's internal risk assessment process. No one here can give a definitive "Run memtest and prime for x days without error and you are guaranteed a stable server"...

Daniel Widrick

For memory: at least several hours using memtest86. The more time you can spend on it, the better. In my experience, anything under 3 hours is not reliable at all. I'd say let it run at least 12 to 24 hours to be certain.

For testing the CPU, you can run prime-number-crunching programs (e.g., mprime) or other stress tests, such as compiling huge amounts of code, to verify that the calculations are correct. The longer these run, the better.
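To illustrate what "verifying that the calculations are correct" means, here is a rough sketch of a self-checking load. It is not how mprime works internally; it is a swapped-in technique that simply repeats a deterministic SHA-256 chain and compares each result with the first one. The name `verified_burn` and its parameters are hypothetical, and the loop only exercises a single core.

```python
import hashlib
import time


def verified_burn(seconds=60.0, rounds=2000):
    """Repeat a deterministic computation and count results that differ.

    `verified_burn` is a hypothetical helper for illustration only.
    Returns (iterations, mismatches).
    """
    def work():
        # Chain SHA-256 over a fixed seed; the output must always be identical.
        digest = b"burn-in seed"
        for _ in range(rounds):
            digest = hashlib.sha256(digest).digest()
        return digest

    # The reference is computed on the same machine, so this is a
    # consistency check, not a comparison against a known-good value.
    reference = work()
    iterations = 0
    mismatches = 0
    deadline = time.monotonic() + seconds
    while time.monotonic() < deadline:
        if work() != reference:
            mismatches += 1
        iterations += 1
    return iterations, mismatches
```

Any nonzero mismatch count would mean the same input produced different output, which points at hardware rather than software. A zero count over a short run proves very little, so the duration still matters.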

Even if all of these run clean, you still have no guarantee whatsoever. If one of the tests fails, though, you at least have a way to reproduce the problem.

A Machine Check Error, on the other hand, looks like something you really should report to the vendor, even if you can't reproduce it. Your machine could run fine for weeks or months, even under testing, and then crash again at the most unfortunate moment.

kei1aeh5quahQu4U
  • Even "better" than 24 hours is a "simple weekend": start the tests on Friday and see if they are still running on Monday ;) It can be done at your desk if needed, since no one is in the office anyway ;) – TomTom Feb 10 '14 at 17:33
  • I don't know that a "weekend" is technically better than 24hrs. While a weekend is statistically more likely to find an error than a single 24hr span, it does not address any of the subjectivity of the issue. There's a big difference between the server that the middle-schoolers are using to host minecraft and a server backing the stock tickers on the floor @wallstreet. – Daniel Widrick Feb 10 '14 at 17:37