I have a desktop running as an Ubuntu server at another office. Lately it's been shutting itself down once in a while, and I'm unsure how to diagnose this. The syslog looks like this:
May 20 15:42:35 hostname sensord: Chip: coretemp-isa-0000
May 20 15:42:35 hostname sensord: Adapter: ISA adapter
May 20 15:42:35 hostname sensord: Core 0: 67.0 C
May 20 15:42:35 hostname sensord: Core 1: 66.0 C
May 20 15:42:35 hostname sensord: Core 2: 61.0 C
May 20 15:42:35 hostname sensord: Core 3: 58.0 C
May 20 16:04:16 hostname kernel: [ 5243.049529] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)
May 20 16:04:16 hostname kernel: [ 5243.050011] CPU0: Core temperature/speed normal
May 20 16:05:48 hostname kernel: [ 5335.083540] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)
May 20 16:05:48 hostname kernel: [ 5335.084028] CPU2: Core temperature/speed normal
May 20 16:06:52 hostname kernel: [ 5399.816039] mce: [Hardware Error]: Machine check events logged
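To see how often each core throttles, the kernel lines above can be pulled out of the syslog with a small script. This is only a rough sketch, assuming Python 3 and the exact message format shown in this log; the names `THROTTLE_RE` and `throttle_events` are made up for illustration.

```python
import re
from collections import Counter

# Matches kernel lines such as:
# "May 20 16:04:16 hostname kernel: [ 5243.049529] CPU0: Core temperature above threshold, ..."
THROTTLE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+kernel:\s+"
    r"\[\s*(?P<uptime>\d+\.\d+)\]\s+CPU(?P<cpu>\d+): Core temperature above threshold"
)


def throttle_events(lines):
    """Yield (timestamp, uptime_seconds, cpu) for every throttle message."""
    for line in lines:
        match = THROTTLE_RE.match(line)
        if match:
            yield (
                match.group("stamp"),
                float(match.group("uptime")),
                int(match.group("cpu")),
            )


# Sample lines copied from the syslog excerpt above.
sample = [
    "May 20 15:42:35 hostname sensord: Core 0: 67.0 C",
    "May 20 16:04:16 hostname kernel: [ 5243.049529] CPU0: Core temperature above threshold, cpu clock throttled (total events = 1)",
    "May 20 16:04:16 hostname kernel: [ 5243.050011] CPU0: Core temperature/speed normal",
    "May 20 16:05:48 hostname kernel: [ 5335.083540] CPU2: Core temperature above threshold, cpu clock throttled (total events = 1)",
    "May 20 16:05:48 hostname kernel: [ 5335.084028] CPU2: Core temperature/speed normal",
]

events = list(throttle_events(sample))
per_cpu = Counter(cpu for _, _, cpu in events)  # number of throttle events per core
```

On the real machine the same function could be fed the lines of /var/log/syslog instead of the sample list.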
At first I suspected a broken fan or some other thermal problem, and activated sensord. But the temperatures seem stable over time.
Edit: I've installed mcelog and the daemon is running. I'm now waiting for it to happen again to see whether the mcelog output makes any sense.
Update
The mcelog indicates that it's a thermal issue. I have logs like the one below, which match the times of the GitLab server backup cron job.
MCE 0
CPU 0 THERMAL EVENT TSC 16ec0aadec3a0
TIME 1401260314 Wed May 28 08:58:34 2014
Processor 0 heated above trip temperature. Throttling enabled.
Please check your system cooling. Performance will be impacted
STATUS 88020003 MCGSTATUS 0
MCGCAP 806 APICID 0 SOCKETID 0
CPUID Vendor Intel Family 6 Model 15
Hardware event. This is not a software error.
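When matching records like this against a cron schedule, it helps to remember that the number after TIME is a Unix timestamp, while the date printed next to it is local time. A minimal sketch, assuming Python 3 and the record layout shown above (the name `thermal_event_time` is hypothetical):

```python
import re
from datetime import datetime, timezone

# The mcelog record from above, trimmed to the relevant lines.
RECORD = """\
MCE 0
CPU 0 THERMAL EVENT TSC 16ec0aadec3a0
TIME 1401260314 Wed May 28 08:58:34 2014
Processor 0 heated above trip temperature. Throttling enabled.
"""


def thermal_event_time(record):
    """Return the record's TIME field as a timezone-aware UTC datetime, or None."""
    match = re.search(r"^TIME (\d+)", record, re.MULTILINE)
    if match is None:
        return None
    return datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)


event_utc = thermal_event_time(RECORD)
print(event_utc.isoformat())  # 2014-05-28T06:58:34+00:00
```

The record prints 08:58:34, so the server's local time was two hours ahead of UTC at that point. Cron runs on local time by default, so the printed time is the one to compare with the crontab entry.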
I also stress-tested the system today with stress -c 4 -i 1 -m 1 -t 120
and very quickly reached 100 °C on the CPU.
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +100.0°C (high = +84.0°C, crit = +100.0°C)
Core 1: +96.0°C (high = +84.0°C, crit = +100.0°C)
Core 2: +85.0°C (high = +84.0°C, crit = +100.0°C)
Core 3: +79.0°C (high = +84.0°C, crit = +100.0°C)
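For repeated test runs, the sensors output can be checked against its own high and crit limits automatically. The following is just a sketch, assuming Python 3 and the coretemp line format shown above; `CORE_RE` and `classify` are illustrative names, not part of lm-sensors.

```python
import re

# Matches lines such as:
# "Core 0: +100.0°C (high = +84.0°C, crit = +100.0°C)"
CORE_RE = re.compile(
    r"Core (?P<core>\d+):\s*\+?(?P<temp>[\d.]+)°C\s*"
    r"\(high = \+?(?P<high>[\d.]+)°C, crit = \+?(?P<crit>[\d.]+)°C\)"
)


def classify(sensors_text):
    """Map each core number to 'critical', 'high', or 'ok'."""
    result = {}
    for match in CORE_RE.finditer(sensors_text):
        temp = float(match.group("temp"))
        high = float(match.group("high"))
        crit = float(match.group("crit"))
        if temp >= crit:
            status = "critical"
        elif temp >= high:
            status = "high"
        else:
            status = "ok"
        result[int(match.group("core"))] = status
    return result


# The readings taken during the stress run above.
SAMPLE = """\
coretemp-isa-0000
Adapter: ISA adapter
Core 0: +100.0°C (high = +84.0°C, crit = +100.0°C)
Core 1: +96.0°C (high = +84.0°C, crit = +100.0°C)
Core 2: +85.0°C (high = +84.0°C, crit = +100.0°C)
Core 3: +79.0°C (high = +84.0°C, crit = +100.0°C)
"""

statuses = classify(SAMPLE)
```

With these readings, Core 0 is at its critical limit, Cores 1 and 2 are above the high mark, and only Core 3 is still within range.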
I suspect that the heatsink isn't properly mounted, and I will check this when I find the time.
Solution
As a quick fix, I'll check the CPU's thermal paste and heatsink.
I got hold of a used Dell PowerEdge R200 to replace this server, and I will try to set it up next week. Thank you very much for the advice.