I've been having problems with the BMC/IPMI Event Log registering over-temperature errors (in some cases critical) for the CPUs. I am concerned that these are mainly false positives and that the default sensor thresholds set on the BMC are wrong.
Hardware: RS924A-E6/RS8 with 4x AMD 6376 CPUs. The AMD CPUs provide a Temperature Control Margin (Tctl Margin) instead of a raw temperature reading. My understanding of Tctl Margin is that it is a reverse scale from 0..255, whereby 0 represents the maximum operating temperature of the CPU (69 Celsius in this case). In essence, the closer we get to 0, the hotter the CPU physically is - more info here.
Data: The two tables below provide information on the thresholds and the registered events.
Sensor Thresholds:
ID | Name | Type | Reading | Units | Lower NR | Lower C | Lower NC | Upper NC | Upper C | Upper NR | Event
1 | CPU1 Tctl Margin | Temperature | 26.00 | unspecified | -10.00 | -5.00 | 0.00 | 127.00 | 127.00 | 127.00 | 'OK'
2 | CPU2 Tctl Margin | Temperature | 26.00 | unspecified | -10.00 | -5.00 | 0.00 | 127.00 | 127.00 | 127.00 | 'OK'
Event Log:
ID | Date | Time | Name | Type | Event
1 | Mar-28-2017 | 17:25:45 | CPU1 Tctl Margin | Temperature | Upper Non-recoverable - going low ; Sensor Reading = 31.00 unspecified ; Threshold = 127.00 unspecified
2 | Apr-09-2017 | 10:12:38 | CPU1 Tctl Margin | Temperature | Upper Non-recoverable - going low ; Sensor Reading = 24.00 unspecified ; Threshold = 127.00 unspecified
As you can see in the table above, CPU1 typically suffers an Upper Non-recoverable temperature error. Where I am confused is that this error occurs at a sensor reading of 24 (or 31) but the threshold is 127. Is it the case that the BMC is misinterpreting the sensor reading, or that the thresholds are wrong? What can I do to fix this?
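To make the mismatch concrete, here is a minimal Python sketch that checks the two logged readings against the CPU1 thresholds from the table. The comparison rule is my assumption of how a threshold sensor conventionally behaves (a lower threshold trips when the reading is at or below it, an upper threshold when the reading is at or above it); I have not confirmed that the BMC firmware works this way. The function and variable names are made up for illustration.

```python
# Thresholds copied from the "CPU1 Tctl Margin" row of the sensor table.
THRESHOLDS = {
    "lower_nr": -10.0,
    "lower_c": -5.0,
    "lower_nc": 0.0,
    "upper_nc": 127.0,
    "upper_c": 127.0,
    "upper_nr": 127.0,
}

# Sensor readings copied from the two event log entries.
LOGGED_READINGS = [31.0, 24.0]


def crossed_thresholds(reading, thresholds=THRESHOLDS):
    """Return the names of all thresholds the reading is at or beyond.

    ASSUMPTION (not confirmed for this BMC): lower thresholds trip when
    reading <= threshold, upper thresholds trip when reading >= threshold.
    """
    crossed = []
    for name, limit in thresholds.items():
        if name.startswith("lower") and reading <= limit:
            crossed.append(name)
        elif name.startswith("upper") and reading >= limit:
            crossed.append(name)
    return crossed


if __name__ == "__main__":
    for reading in LOGGED_READINGS:
        # An empty list means the reading sits inside the "OK" band.
        print(reading, crossed_thresholds(reading))
```

Under that assumption, both logged readings return an empty list, i.e. neither 31 nor 24 is anywhere near the 127 upper thresholds or the 0 / -5 / -10 lower thresholds, which is why I suspect the events are false positives.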