Machine Check Exception reported by kernel

3

I built a new computer:

  • Intel Core i7 4770K
  • Gigabyte Z87N-WIFI
  • Samsung 840 Evo S x2 (in RAID 0)
  • 450w Corsair RM 80Plus
  • Dark Rock Pro 3 Cooling
  • Kingston 1600 DDR3
  • NO DEDICATED GPU

Operating System:

  • Linux Mint 16 Petra

The BIOS settings are completely default, except for the RAID configuration. The CPU is NOT overclocked and never has been since I bought it.

Since I built the system, it has crashed unexpectedly about 3 times per day, dropping to a black screen that says "Machine Check Exception ...", image below:

[Image: screenshot of the Machine Check Exception screen]

The temperature looks good:

➜  ~  sensors
acpitz-virtual-0
Adapter: Virtual device
temp1:        +27.8°C  (crit = +105.0°C)
temp2:        +29.8°C  (crit = +105.0°C)

coretemp-isa-0000
Adapter: ISA adapter
Physical id 0:  +42.0°C  (high = +80.0°C, crit = +100.0°C)
Core 0:         +40.0°C  (high = +80.0°C, crit = +100.0°C)
Core 1:         +41.0°C  (high = +80.0°C, crit = +100.0°C)
Core 2:         +41.0°C  (high = +80.0°C, crit = +100.0°C)
Core 3:         +40.0°C  (high = +80.0°C, crit = +100.0°C)

pkg-temp-0-virtual-0
Adapter: Virtual device
temp1:        +42.0°C 
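
For anyone who wants to check readings like these against their limits automatically, here is a minimal sketch that parses text in the format shown above. It assumes the plain `sensors` layout from this post (a label, a temperature, and optional limits in parentheses); `parse_sensors` and `headroom` are hypothetical helper names, not part of lm_sensors. Idle readings alone do not rule out a fault that only appears under load.

```python
import re

# One reading per line, e.g.
# "Core 0:         +40.0°C  (high = +80.0°C, crit = +100.0°C)"
LINE = re.compile(
    r"^(?P<label>[^:]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°C"
    r"(?:\s+\((?P<limits>[^)]*)\))?"
)
# One limit inside the parentheses, e.g. "crit = +100.0°C"
LIMIT = re.compile(r"(\w+)\s*=\s*\+?(-?\d+(?:\.\d+)?)°C")


def parse_sensors(text):
    """Return a list of (label, temperature, limits) tuples.

    Lines that are not temperature readings (chip names, "Adapter:" lines,
    blank lines) are skipped.
    """
    readings = []
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue
        limits = {
            name: float(value)
            for name, value in LIMIT.findall(match.group("limits") or "")
        }
        readings.append((match.group("label"), float(match.group("temp")), limits))
    return readings


def headroom(readings, limit="crit"):
    """Return (label, degrees below the given limit) for readings that have it."""
    return [
        (label, limits[limit] - temp)
        for label, temp, limits in readings
        if limit in limits
    ]


if __name__ == "__main__":
    sample = (
        "coretemp-isa-0000\n"
        "Adapter: ISA adapter\n"
        "Core 0:         +40.0°C  (high = +80.0°C, crit = +100.0°C)\n"
        "Core 1:         +41.0°C  (high = +80.0°C, crit = +100.0°C)\n"
    )
    for label, margin in headroom(parse_sensors(sample)):
        print(f"{label}: {margin:.1f} degrees C below crit")
```

To use it on a live system, the output of `sensors` could be captured with `subprocess.run` and passed to `parse_sensors`.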

I have updated the BIOS to the latest version.

Can somebody tell me what the problem could be?

StuR

Posted 2014-04-21T15:48:05.383

Reputation: 83

Sounds like a hardware problem. Was the CPU working before? Have you checked for bent pins on the motherboard? Have you tried distros other than Mint 16? – DanteTheEgregore – 2014-04-25T14:16:26.360

I have tried a number of Linux distros: Fedora, Ubuntu, and Linux Mint with PCRE. All of them crash similarly. Is it likely to be a faulty motherboard or faulty CPU causing this error? – StuR – 2014-04-25T17:07:16.267

It might be a CPU failure. Try downloading Prime95 (don't bother registering) and running the Blend test (mprime -m to run the config utility) for 6-8 hours (3 minimum). It'll keep running until you stop it or it encounters an error. – DanteTheEgregore – 2014-04-25T17:45:41.313

ACPI temperatures are no good. My server also reports these exact temperatures—at any given time. Try using lm_sensors. – Daniel B – 2014-04-30T09:17:18.837

Answers

2

This is definitely a hardware problem. mcelog --ascii reports the following:

Hardware event. This is not a software error.
CPU 0 BANK 4 TSC 2d95278285f8
RIP !INEXACT! 10:ffffffff816f6570
MISC 0
TIME 1398091195 Mon Apr 21 16:39:55 2014
MCG status:RIPV MCIP
MCi status:
Uncorrected error
Error enabled
MCi_MISC register valid
Processor context corrupt
MCA: Internal unclassified error: 402
STATUS ba00000052000402 MCGSTATUS 5
CPUID Vendor Intel Family 6 Model 60
SOCKET 0 APIC 0 microcode 9

...which is unfortunately not very helpful. It’s probably some undocumented internal CPU error. Your best bet would be to go for a warranty exchange (of your CPU), if possible.
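
For reference, the STATUS and MCGSTATUS values in that output can be unpacked by hand. The sketch below is a minimal illustration, assuming the architectural bit layout of IA32_MCi_STATUS and IA32_MCG_STATUS as described in the machine-check chapter of the Intel SDM; the function and table names are made up for this example and are not part of mcelog. It only reproduces the flags mcelog already printed and does not make the model-specific code any more meaningful.

```python
# Architectural flag bits of IA32_MCi_STATUS (bit position -> short name).
STATUS_FLAGS = {
    63: "VAL",    # register contains a valid error
    62: "OVER",   # a previous error was overwritten
    61: "UC",     # uncorrected error
    60: "EN",     # error reporting enabled
    59: "MISCV",  # MCi_MISC register valid
    58: "ADDRV",  # MCi_ADDR register valid
    57: "PCC",    # processor context corrupt
}

# Flag bits of IA32_MCG_STATUS.
MCG_FLAGS = {
    0: "RIPV",  # restart IP valid
    1: "EIPV",  # error IP valid
    2: "MCIP",  # machine check in progress
}


def decode_flags(value, table):
    """Return the names of all flags in `table` whose bit is set in `value`."""
    return [name for bit, name in table.items() if (value >> bit) & 1]


def decode_status(status):
    """Split a 64-bit MCi_STATUS value into its flags and error-code fields."""
    return {
        "flags": decode_flags(status, STATUS_FLAGS),
        "mca_error_code": status & 0xFFFF,               # bits 15:0
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
    }


if __name__ == "__main__":
    decoded = decode_status(0xBA00000052000402)  # STATUS from the log above
    print("flags:", " ".join(decoded["flags"]))
    print("MCA error code: %#x" % decoded["mca_error_code"])
    print("model-specific code: %#x" % decoded["model_specific_code"])
    print("MCG flags:", " ".join(decode_flags(0x5, MCG_FLAGS)))  # MCGSTATUS 5
```

The set flags (UC, EN, MISCV, PCC) line up with the "Uncorrected error", "Error enabled", "MCi_MISC register valid" and "Processor context corrupt" lines, and the low 16 bits give the 402 in "Internal unclassified error: 402".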

Daniel B

Posted 2014-04-21T15:48:05.383

Reputation: 40 502

You were right, it was a faulty CPU. – StuR – 2014-05-20T11:27:15.100