I have two HPE ProLiant DL360 Gen10 servers that are configured nearly the same. They both run CentOS 7.5. The only difference is that one has newer firmware and a newer kernel, installed in an attempt to fix this problem.

dmesg is reporting the following repeatedly and the performance of the server is suffering.

[Oct12 11:43] CPU5: Package temperature above threshold, cpu clock throttled (total events = 539077151)
[  +0.000001] CPU1: Package temperature above threshold, cpu clock throttled (total events = 539077144)
[  +0.000003] CPU4: Package temperature above threshold, cpu clock throttled (total events = 539077179)
[  +0.000002] CPU7: Package temperature above threshold, cpu clock throttled (total events = 539077201)
[  +0.000001] CPU3: Package temperature above threshold, cpu clock throttled (total events = 539077211)
[  +0.000004] CPU6: Package temperature above threshold, cpu clock throttled (total events = 539077197)
[  +0.000001] CPU2: Package temperature above threshold, cpu clock throttled (total events = 539077208)
[  +0.000001] CPU0: Package temperature above threshold, cpu clock throttled (total events = 539077122)
[Oct12 11:44] CPU6: Core temperature above threshold, cpu clock throttled (total events = 447115263)
[  +0.000001] CPU2: Core temperature above threshold, cpu clock throttled (total events = 447115267)
[  +0.002025] CPU6: Core temperature/speed normal
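
For anyone who wants to track these counters rather than eyeball them, here is a rough Python sketch that pulls the per-CPU "total events" values out of dmesg text. The function name summarize_throttling is made up for illustration, and the regex assumes the message wording shown above; other kernel versions may phrase it differently.

```python
import re
from collections import defaultdict

# Matches messages such as:
# "CPU5: Package temperature above threshold, cpu clock throttled (total events = 539077151)"
# The wording is taken from the log above and is an assumption for other kernels.
THROTTLE_RE = re.compile(
    r"CPU(?P<cpu>\d+): (?P<scope>Package|Core) temperature above threshold, "
    r"cpu clock throttled \(total events = (?P<events>\d+)\)"
)


def summarize_throttling(dmesg_text):
    """Return {(scope, cpu): highest event counter seen} for throttle messages."""
    latest = defaultdict(int)
    for match in THROTTLE_RE.finditer(dmesg_text):
        key = (match["scope"], int(match["cpu"]))
        latest[key] = max(latest[key], int(match["events"]))
    return dict(latest)
```

Feeding it the output of dmesg at two points in time and comparing the counters should give a feel for how fast the throttle events are accumulating.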

The HPE iLO reports temperatures about 30°C lower than sensors does.

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +95.0°C  (high = +86.0°C, crit = +96.0°C)
Core 0:        +95.0°C  (high = +86.0°C, crit = +96.0°C)
Core 2:        +95.0°C  (high = +86.0°C, crit = +96.0°C)
Core 3:        +95.0°C  (high = +86.0°C, crit = +96.0°C)
Core 4:        +94.0°C  (high = +86.0°C, crit = +96.0°C)

The HPE iLO interface reports the CPU at 55°C at the same time the sensors reading above was taken.
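
To make the comparison against the "high" and "crit" thresholds repeatable, a small Python sketch like the following could parse the sensors output. The name parse_sensors is hypothetical, and the regex assumes the coretemp line layout shown above (a label, a temperature, then high and crit in parentheses).

```python
import re

# Matches coretemp lines such as:
# "Core 0:        +95.0°C  (high = +86.0°C, crit = +96.0°C)"
SENSOR_RE = re.compile(
    r"^(?P<label>[^:\n]+):\s+\+?(?P<temp>-?\d+(?:\.\d+)?)°C\s+"
    r"\(high = \+?(?P<high>-?\d+(?:\.\d+)?)°C, "
    r"crit = \+?(?P<crit>-?\d+(?:\.\d+)?)°C\)",
    re.MULTILINE,
)


def parse_sensors(text):
    """Return one dict per temperature line, with threshold comparisons."""
    readings = []
    for m in SENSOR_RE.finditer(text):
        temp = float(m["temp"])
        high = float(m["high"])
        crit = float(m["crit"])
        readings.append({
            "label": m["label"].strip(),
            "temp": temp,
            "over_high": temp > high,    # True once the "high" mark is exceeded
            "crit_margin": crit - temp,  # degrees left before "crit"
        })
    return readings
```

Lines that do not look like a temperature reading, such as the adapter header, are simply skipped.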

When I run sensors, I get the following in dmesg:

[Oct12 11:46] ACPI Error: SMBus/IPMI/GenericSerialBus write requires Buffer of length 66, found length 32 (20180313/exfield-393)
[  +0.000726] ACPI Error: Method parse/execution failed \_SB.PMI0._PMM, AE_AML_BUFFER_LIMIT (20180313/psparse-516)
[  +0.000500] ACPI Error: AE_AML_BUFFER_LIMIT, Evaluating _PMM (20180313/power_meter-338)

I updated to the latest kernel (4.18.13-1.el7.elrepo.x86_64) this morning and that didn't help either.

Kerry Knopp
  • Check iLO or physically inspect the server for failing/failed fans. – Michael Hampton Oct 12 '18 at 18:34
  • I would add to Michael's comment about inspecting it: if a fan or the cooling is faulty, the system will make a lot of noise as the other fans try to compensate for the faulty one. – yagmoth555 Oct 12 '18 at 18:42

3 Answers

Open the system's IML log from the iLO web interface and see what events it's reporting.

That is the authoritative way to check hardware status on HPE server equipment.

ewwhite

I was able to mostly resolve this by updating the kernel in the OS. I'm now on 4.18.13-1.el7.elrepo.x86_64. The temperature is still reported differently than in the iLO UI, but the ratio between CPU temperature and "high" is much better and lines up more closely with the iLO ratios.

coretemp-isa-0000
Adapter: ISA adapter
Package id 0:  +74.0°C  (high = +86.0°C, crit = +96.0°C)
Core 0:        +72.0°C  (high = +86.0°C, crit = +96.0°C)
Core 2:        +72.0°C  (high = +86.0°C, crit = +96.0°C)
Core 3:        +74.0°C  (high = +86.0°C, crit = +96.0°C)
Core 4:        +71.0°C  (high = +86.0°C, crit = +96.0°C)
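
As a quick sanity check of the ratio mentioned above, here is the arithmetic in Python using the package readings quoted in this thread (95°C before, 74°C after, with "high" at 86°C). The helper names are made up for illustration, and a ratio of Celsius values is only a rough heuristic; the headroom in degrees is shown alongside it.

```python
def fraction_of_high(temp_c, high_c):
    """Reading expressed as a fraction of the "high" threshold."""
    return temp_c / high_c


def headroom_c(temp_c, high_c):
    """Degrees left before "high"; negative means the threshold is exceeded."""
    return high_c - temp_c


HIGH_C = 86.0
before = fraction_of_high(95.0, HIGH_C)  # package reading from the question
after = fraction_of_high(74.0, HIGH_C)   # package reading after the kernel update

print(f"before: {before:.2f} of high, headroom {headroom_c(95.0, HIGH_C):+.1f} C")
print(f"after:  {after:.2f} of high, headroom {headroom_c(74.0, HIGH_C):+.1f} C")
```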
Kerry Knopp

Intel's thermal monitoring can lead to a number of different "temperatures" depending on what interface / MSR you use. Also, different processors can have different thresholds based on fabrication.
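
One of those interfaces on Linux is the hwmon sysfs tree that the coretemp driver populates, which as far as I know is also where sensors gets its values. A minimal Python sketch for reading it directly is below; the function names are hypothetical, and it assumes the usual hwmon layout (a name file plus tempN_input and tempN_label files, with values in millidegrees Celsius).

```python
from pathlib import Path


def read_text_or_none(path):
    """Return the stripped file contents, or None if the file is unreadable."""
    try:
        return Path(path).read_text().strip()
    except OSError:
        return None


def coretemp_readings(hwmon_root="/sys/class/hwmon"):
    """Return {(hwmon_dir, label): degrees C} for every coretemp sensor found."""
    readings = {}
    for name_file in sorted(Path(hwmon_root).glob("hwmon*/name")):
        if read_text_or_none(name_file) != "coretemp":
            continue  # some other hwmon driver
        hwmon_dir = name_file.parent
        for input_file in sorted(hwmon_dir.glob("temp*_input")):
            prefix = input_file.name[: -len("_input")]
            raw = read_text_or_none(input_file)
            if raw is None or not raw.lstrip("-").isdigit():
                continue
            # Fall back to the file prefix (e.g. "temp1") when no label exists.
            label = read_text_or_none(hwmon_dir / (prefix + "_label")) or prefix
            readings[(hwmon_dir.name, label)] = int(raw) / 1000.0
    return readings
```

Keying by hwmon directory as well as label avoids collisions on dual-socket machines, where each package can expose its own "Core 0".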

You may also want to experiment with some of the thermal settings in UEFI. There are "Max Cooling" options that may keep you from reaching the threshold.

Finally, take note of the option cards you use and see if they have any impact. IO cards may trip up thermal monitoring, making FW / OS SW think the system is in thermal distress.

Dan