I am running Centos 7 (3.10.0-514.26.2.el7.x86_64) on a supermicro H8QG6 board with 4 AMD 6276 cpus (16 cores), for a total of 64 cores. I use it for scientific computing, and usually everything runs smoothly, as in the htop first image.
Then, suddendly, htop starts reporting single cpu usage above 1000%, as in this second image, and the computer becomes nearly unresponsive. Incidentally, nearly every process seems to overload the cpus: even htop itself reports a load of 1600%!!! At the same time these errors appear in journalctl:
perf: interrupt took too long (3973 > 3883), lowering kernel.perf_event_max_sample_rate to 50000
kernel: IPMI message handler: BMC returned incorrect response, expected netfn 7 cmd 52, got netfn 5 cmd 2d
kernel: IPMI message handler: BMC returned incorrect response, expected netfn 7 cmd 52, got netfn 5 cmd 2
kernel: IPMI message handler: BMC returned incorrect response, expected netfn 7 cmd 37, got netfn 7 cmd 52
kernel: IPMI message handler: BMC returned incorrect response, expected netfn 7 cmd 52, got netfn 7 cmd 37
but I cannot determine if they are a symptom or the cause of the strange behavior. Does anyone have any idea about what's going on? Thanks a lot!
Update: I installed ipmitool and launched ipmievd daemon. As soon as the problem starts, the following appears in journalctl:
ipmievd[7567]: Memory sensor - Correctable ECC (@DIMMD1(CPU3))
Is it possible that the problem is related to a faulty memory module?