
I am running CentOS 7 (kernel 3.10.0-514.26.2.el7.x86_64) on a Supermicro H8QG6 board with four AMD Opteron 6276 CPUs (16 cores each), for a total of 64 cores. I use it for scientific computing, and usually everything runs smoothly, as in the first htop image.

Then, suddenly, htop starts reporting per-process CPU usage above 1000%, as in the second image, and the computer becomes nearly unresponsive. Nearly every process seems to overload the CPUs: even htop itself reports a load of 1600%!

At the same time, these errors appear in journalctl:

perf: interrupt took too long (3973 > 3883), lowering kernel.perf_event_max_sample_rate to 50000
kernel: IPMI message handler: BMC returned incorrect response, expected netfn 7 cmd 52, got netfn 5 cmd 2d
kernel: IPMI message handler: BMC returned incorrect response, expected netfn 7 cmd 52, got netfn 5 cmd 2
kernel: IPMI message handler: BMC returned incorrect response, expected netfn 7 cmd 37, got netfn 7 cmd 52
kernel: IPMI message handler: BMC returned incorrect response, expected netfn 7 cmd 52, got netfn 7 cmd 37

I cannot determine whether they are a symptom or the cause of the strange behavior. Does anyone have any idea what is going on? Thanks a lot!

Update: I installed ipmitool and started the ipmievd daemon. As soon as the problem starts, the following appears in journalctl:

ipmievd[7567]: Memory sensor - Correctable ECC (@DIMMD1(CPU3))

Is it possible that the problem is related to a faulty memory module?

ehyG

1 Answer


The culprit was indeed a faulty RAM module. The anomalous spikes in CPU usage appeared at the same time as the memory errors started to show up in the logs. The module has been identified and replaced, and the problem is now solved.
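For anyone trying to pin down which module is at fault, here is a minimal Python sketch that tallies ECC events per DIMM. It assumes the log lines look exactly like the ipmievd message quoted in the question; the function name `count_ecc_events` and the regular expression are my own illustration, not part of ipmitool, and other BMCs may format the sensor text differently.

```python
import re
from collections import Counter

# Assumed line format, taken from the message quoted in the question:
#   ipmievd[7567]: Memory sensor - Correctable ECC (@DIMMD1(CPU3))
ECC_LINE = re.compile(
    r"ipmievd\[\d+\]: Memory sensor - (?P<kind>.*?ECC) "
    r"\(@(?P<dimm>\w+)\((?P<cpu>\w+)\)\)"
)


def count_ecc_events(lines):
    """Count ECC events per (DIMM, CPU, kind) in an iterable of log lines."""
    counts = Counter()
    for line in lines:
        match = ECC_LINE.search(line)
        if match:
            counts[(match["dimm"], match["cpu"], match["kind"])] += 1
    return counts


# Demo on two lines copied from the question; only the first is an ECC event.
sample = [
    "ipmievd[7567]: Memory sensor - Correctable ECC (@DIMMD1(CPU3))",
    "kernel: IPMI message handler: BMC returned incorrect response, "
    "expected netfn 7 cmd 52, got netfn 5 cmd 2d",
]
for (dimm, cpu, kind), n in count_ecc_events(sample).most_common():
    print(f"{dimm} ({cpu}): {n} x {kind}")  # prints "DIMMD1 (CPU3): 1 x Correctable ECC"
```

To run it against real logs, pass any iterable of lines to `count_ecc_events`, for example `sys.stdin` with the output of `journalctl -t ipmievd` piped in. A DIMM whose count keeps climbing is the first candidate for replacement.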

ehyG