51

I have a dual Opteron server running Linux with libvirt to host several VMs. The VMs work fine and the server processes OK, but I notice one CPU always runs about 69C (throttles at 70C) and the other runs about 15C.

This doesn't seem normal to me? Shouldn't they both be a little closer in temperature?

I'm not sure how to diagnose this any further. Maybe there isn't enough thermal paste on one of the CPUs?

Edit: The motherboard is ASUS KGPE-D16 and cooled by dual Noctua NH-U9DO fans.

Note that I think the temperatures might be degrees above ambient, rather than absolute values? When the server is idling, the CPU temperatures drop to 2C and 13C. I am using the lm-sensors configuration from here

samoz

5 Answers

107

The problem ended up being a poorly fitted heatsink, though maybe "poorly fitted" isn't the right description. It turns out you have to put the thermal paste on the heatsink itself, not on the plastic cover that goes over the heatsink.

[image]

After removing the plastic cover, the CPU is nice and cool, thanks everyone!

samoz
  • 52
    +1 just because it's funny – HBruijn Nov 28 '14 at 00:25
  • 1
    I'm pretty sure the plastic cover is meant to be removed... ;-) – Davidw Nov 28 '14 at 01:50
  • Just as well you checked eh! – GreenAsJade Nov 28 '14 at 01:56
  • I hope you put some thermal paste on the heatsink after removing the plastic cover. :) – user Nov 28 '14 at 08:52
  • 10
    You mean someone left the plastic cover in place and then put paste on it and then put the heatsink on that? Epic. – TomTom Nov 28 '14 at 10:30
  • Glad to have steered you in the right direction! – MadHatter Nov 28 '14 at 10:32
  • 4
    Baaaaaahaaahaaahahahaa!! – Craig Tullis Nov 28 '14 at 22:09
  • 2
    That is definitely a large part of your problem! However, it still doesn't explain the sensors reading 2 deg C unless you encased the server in dry ice or something. – Grant Nov 29 '14 at 01:46
  • 3
    @Grant: He already addressed that. Read the _whole_ paragraph. His sensors are giving temp above either ambient or a calibrated "expected" value. Indeed, with a heatsink fitted in _this_ way, getting just 69ºC absolute on the hotter core would be something of a miracle. – Lightness Races in Orbit Nov 29 '14 at 19:46
  • 8
    I love how you can see the terms and conditions, limited warranty and returns policy in the background. :) – Lightness Races in Orbit Nov 29 '14 at 19:47
  • 6
    If it makes you feel any less stupid, (and it won't), I did a similar thing with my new office coffee-maker. The coffee was too cold to drink and I was packing it back up for return to the shop before a disk of protective cardboard dropped off the heating element:) – Martin James Nov 29 '14 at 20:16
  • 1
    @LightnessRacesInOrbit actually he says he *thinks* that. The config file he linked to sets cpu temp max to 70. That would mean 70 deg above ambient before it alerts you... which seems way too high. – Grant Nov 29 '14 at 23:16
  • 1
    You have no idea how many times I've had people return their iPhones because they can barely hear people. It's always the plastic cover that goes over the front of the phone. – PsychoData Dec 01 '14 at 01:21
25

In my experience, it is normal for paired components in a case to run at different temperatures, because airflow is not the same everywhere. Here's a graph of HDD temperature from my colo box. The drives are mirrored, so the workloads on them are near to identical.

munin graph of HDD temps over past year

As you can see, they track each other, but they're not the same; they're also, on average, only 6C apart. Whether your sensors report absolute temperature or overtemperature, a difference of 55C under load seems very badly wrong. If you have confidence the data are right, then given the quiescent difference drops to 10C, which is the sort of difference I see due to airflow, I'd suspect a poorly-fitted heatsink.
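One way to gain confidence in the data is to read the raw values the kernel exposes, bypassing any lm-sensors configuration that might rescale them. The following is a minimal Python sketch, assuming a Linux hwmon driver is loaded and publishes `temp*_input` files (in millidegrees Celsius) under `/sys/class/hwmon`. The function name `read_hwmon_temps` is purely illustrative, and whether the driver on this board reports absolute or relative values is not something the sketch can tell you.

```python
from pathlib import Path


def read_hwmon_temps(root="/sys/class/hwmon"):
    """Return {(chip, label): degrees_C} for every temp*_input file under root."""
    temps = {}
    for chip_dir in sorted(Path(root).glob("hwmon*")):
        name_file = chip_dir / "name"
        chip = name_file.read_text().strip() if name_file.exists() else "unknown"
        for input_file in sorted(chip_dir.glob("temp*_input")):
            # An optional temp*_label file carries a human-readable sensor name.
            label_file = input_file.with_name(
                input_file.name.replace("_input", "_label")
            )
            if label_file.exists():
                label = label_file.read_text().strip()
            else:
                label = input_file.name[: -len("_input")]
            try:
                millideg = int(input_file.read_text().strip())
            except (OSError, ValueError):
                continue  # some sensors refuse to read; skip them
            # Prefix with the hwmon directory so two chips of the same type
            # (one per socket, for example) do not overwrite each other.
            temps[(f"{chip_dir.name}:{chip}", label)] = millideg / 1000.0
    return temps
```

Calling `read_hwmon_temps()` on the host and comparing the result with what `sensors` prints should show whether the configuration file is applying an offset or scaling.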

MadHatter
  • 1
    Using mpstat (from Christopher Perrin, thanks!) I confirmed that the load is fairly evenly distributed. Things are idling right now at +3C and +20C. I'm going to try fiddling with the heatsink to see if it is loose. Do you think it could be a thermal paste issue? – samoz Nov 27 '14 at 13:02
  • That is very possible (and more so after you start wiggling it). – MadHatter Nov 27 '14 at 13:16
8

It is not normal, unless you have serious airflow problems or one of the coolers is bad. Temperatures WILL vary, but not by that much (70 vs. 15 degrees Celsius).

Given how low 15 degrees is, I would assume your sensor is off (do you really keep the server in a room that cool?).

I would also suspect that one of the CPUs is simply doing no work at all, for whatever reason.

Small differences are normal. Somewhat larger ones may be too (airflow comes to mind). But here we are talking about one CPU being COLD.

TomTom
2

This could be either cooling or uneven loading (given the temperature difference, your situation is probably uneven loading). Use something like Prime95 to load all the cores evenly and see whether the temperatures still differ. If they don't, you need to balance the VMs and check that your apps are multithreaded and busy. How to do that depends on your software and individual workload, so it is really beyond the scope of the question. Bear in mind there is no real advantage to doing this if you don't have enough load to top out a single CPU/core; in fact, your VM may deliberately avoid using a second CPU so that it can go into power-saving modes on multi-CPU systems.
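A rough way to check for uneven loading is to compare per-CPU utilisation over a short interval, which is essentially what `mpstat` reports. The sketch below is a minimal Python version that assumes a Linux host with the usual `/proc/stat` layout (`cpuN user nice system idle iowait irq softirq steal ...`). The function names are illustrative, not part of any library.

```python
import time


def parse_proc_stat(text):
    """Map 'cpuN' -> (busy_jiffies, total_jiffies) from /proc/stat contents."""
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # Keep only per-CPU rows; skip the aggregate "cpu" row and other lines.
        if not fields or not fields[0].startswith("cpu") or fields[0] == "cpu":
            continue
        # user .. steal; guest time is already counted inside user.
        values = [int(v) for v in fields[1:9]]
        total = sum(values)
        idle = sum(values[3:5])  # idle + iowait
        counters[fields[0]] = (total - idle, total)
    return counters


def utilisation(before, after):
    """Percent busy per CPU between two parse_proc_stat() snapshots."""
    result = {}
    for cpu, (busy1, total1) in after.items():
        busy0, total0 = before.get(cpu, (0, 0))
        elapsed = total1 - total0
        result[cpu] = 100.0 * (busy1 - busy0) / elapsed if elapsed > 0 else 0.0
    return result


def sample_utilisation(interval=5.0, path="/proc/stat"):
    """Take two snapshots `interval` seconds apart and return percent busy per CPU."""
    with open(path) as f:
        first = parse_proc_stat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_proc_stat(f.read())
    return utilisation(first, second)
```

Running `sample_utilisation()` while the VMs are busy gives one figure per logical CPU. On a dual-socket machine those logical CPUs still have to be mapped to sockets (`lscpu` shows the mapping) before the figures say anything about which physical CPU is doing the work.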

If you have narrowed it down to cooling, a small difference of up to 10C could be too little (or too much!) thermal paste. A bigger difference indicates a significant problem with, or difference between, the CPU coolers: one could have blocked airflow, a heatsink could have been knocked loose, etc.

JamesRyan
0

I would have to concur with the defective temperature sensor theory, as 15C is only 59F! Unless the computer is in an extremely frigid datacenter, I would imagine the ambient air temperature would be higher than 59F. Try assigning the VMs to the low-temperature core and see if there is any change; if not, I would strongly suspect the sensor of being faulty.
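As a sketch of how that experiment might be set up with libvirt: the Python below groups the host's logical CPUs by socket using the `physical_package_id` files in sysfs, then builds `virsh vcpupin` command lines that pin every vCPU of a guest to one socket. The guest name `guest1` and the function names are hypothetical, and which socket corresponds to the cooler-reading sensor is something you would have to establish yourself. The commands are only built, not executed.

```python
from collections import defaultdict
from pathlib import Path


def cpus_by_socket(root="/sys/devices/system/cpu"):
    """Group logical CPU numbers by physical package (socket) id."""
    sockets = defaultdict(list)
    for pkg_file in Path(root).glob("cpu[0-9]*/topology/physical_package_id"):
        cpu_number = int(pkg_file.parent.parent.name[3:])  # "cpu12" -> 12
        sockets[int(pkg_file.read_text().strip())].append(cpu_number)
    return {socket: sorted(cpus) for socket, cpus in sockets.items()}


def vcpupin_commands(domain, vcpu_count, host_cpus):
    """Build one `virsh vcpupin <domain> <vcpu> <cpulist>` argv per guest vCPU."""
    cpulist = ",".join(str(c) for c in host_cpus)
    return [
        ["virsh", "vcpupin", domain, str(vcpu), cpulist]
        for vcpu in range(vcpu_count)
    ]
```

For example, `vcpupin_commands("guest1", 2, cpus_by_socket()[1])` yields two argument lists that could be passed to `subprocess.run` once you have checked them by hand.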

You may also want to look at the output of dmesg (boot messages) and see if there is anything out of the ordinary there.

binki
J. Simons