I've been having problems with the BMC/IPMI Event Log registering over-temperature errors (in some cases critical) for the CPUs. I am concerned that these are mainly false positives and that the default sensor thresholds set on the BMC are wrong.

Hardware: RS924A-E6/RS8 with 4x AMD 6376 CPUs - the AMD CPUs provide a Temperature Control Margin (Tctl Margin) instead of a raw temperature reading. My understanding of Tctl Margin is that it is a reverse scale from 0..255 whereby 0 represents the maximum operating temperature of the CPU (69 Celsius in this case). In essence, the closer we get to 0, the hotter the CPU physically is - more info here.

Data: The two tables below provide information on the thresholds and the registered events.

Sensor Thresholds:

ID | Name             | Type         | Reading    | Units       | Lower NR   | Lower C    | Lower NC   | Upper NC   | Upper C    | Upper NR   | Event
1  | CPU1 Tctl Margin | Temperature  | 26.00      | unspecified | -10.00     | -5.00      | 0.00       | 127.00     | 127.00     | 127.00     | 'OK'
2  | CPU2 Tctl Margin | Temperature  | 26.00      | unspecified | -10.00     | -5.00      | 0.00       | 127.00     | 127.00     | 127.00     | 'OK'
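
To make the threshold table concrete, here is a minimal Python sketch that classifies a margin reading against the CPU1 values listed above. The `classify` helper is hypothetical, and the inclusive comparisons (at or below a lower threshold, at or above an upper one) are an assumption about how the BMC evaluates thresholds, not something confirmed for this board.

```python
# Thresholds copied from the "Sensor Thresholds" table (CPU1 Tctl Margin).
# Each list is ordered from most severe to least severe.
LOWER = [
    (-10.0, "lower non-recoverable"),
    (-5.0, "lower critical"),
    (0.0, "lower non-critical"),
]
UPPER = [
    (127.0, "upper non-recoverable"),
    (127.0, "upper critical"),
    (127.0, "upper non-critical"),
]


def classify(reading: float) -> str:
    """Return the most severe threshold state for a reading, or 'ok'.

    Assumption: comparisons are inclusive on both sides.
    """
    for limit, state in LOWER:
        if reading <= limit:
            return state
    for limit, state in UPPER:
        if reading >= limit:
            return state
    return "ok"


if __name__ == "__main__":
    for value in (26.0, 24.0, 31.0, -1.0, 127.0):
        print(value, classify(value))
```

Under these assumptions, the readings that appear in the event log (24 and 31) sit well inside the "ok" band, between the lower non-critical threshold of 0 and the upper thresholds of 127.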

Event Log:

ID | Date        | Time     | Name             | Type         | Event
1  | Mar-28-2017 | 17:25:45 | CPU1 Tctl Margin | Temperature  | Upper Non-recoverable - going low ; Sensor Reading = 31.00 unspecified ; Threshold = 127.00 unspecified
2  | Apr-09-2017 | 10:12:38 | CPU1 Tctl Margin | Temperature  | Upper Non-recoverable - going low ; Sensor Reading = 24.00 unspecified ; Threshold = 127.00 unspecified

As you can see in the table above, CPU1 typically logs an Upper Non-recoverable temperature error. What confuses me is that this error occurs at a sensor reading of 24 (or 31), while the threshold is 127. Is the BMC misinterpreting the sensor reading, or are the thresholds wrong? What can I do to fix this?

Hans

1 Answer

I believe you may be misinterpreting the text. The "going low" indicates that the reading was above 127 but is now below it, which appears to be consistent with the thresholds you list above.

I'm assuming there are no "going high" events as well. It's possible the motherboard simply does not report those events, as they are what should be "normal".
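
As a rough illustration of that reading of the log, the Python sketch below parses an Event string in the format shown in the question and checks whether the logged reading is on the side of the threshold that the direction implies. The `describe` helper and the regular expression are hypothetical, written only against the two log lines quoted above; other BMC firmware may format events differently.

```python
import re

# Matches the Event column format shown in the question's event log.
EVENT_RE = re.compile(
    r"(?P<name>.+?) - going (?P<direction>low|high) ; "
    r"Sensor Reading = (?P<reading>-?\d+(?:\.\d+)?) \S+ ; "
    r"Threshold = (?P<threshold>-?\d+(?:\.\d+)?) \S+"
)


def describe(event: str):
    """Return (threshold name, direction, consistent) for one log entry.

    'consistent' is True when the reading lies on the side of the
    threshold implied by the direction: below it for "going low",
    at or above it for "going high".
    """
    match = EVENT_RE.search(event)
    if match is None:
        raise ValueError(f"unrecognised event format: {event!r}")
    reading = float(match["reading"])
    threshold = float(match["threshold"])
    if match["direction"] == "low":
        consistent = reading < threshold
    else:
        consistent = reading >= threshold
    return match["name"], match["direction"], consistent


if __name__ == "__main__":
    entry = (
        "Upper Non-recoverable - going low ; "
        "Sensor Reading = 24.00 unspecified ; "
        "Threshold = 127.00 unspecified"
    )
    print(describe(entry))
```

For both logged entries (readings 24 and 31 against a threshold of 127), the reading is below the threshold, which matches a "going low" crossing rather than an over-temperature condition.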

Albert Chu
  • Yup, that's it - for a week now no more events have been logged (the servers have been under heavy load in that time)! – Hans May 24 '17 at 20:02