Identify the event that temporarily stalled a server running GPU applications

1

I'm running 4 intensive applications (training machine learning models on GPUs) on a Linux 14.04 machine. They regularly print information about how fast they're running. Strangely enough, the server slowed down for roughly 2.5 hours, with these applications running 3x slower than normal. As far as I know, there were no changes to the server or the applications before, during, or after this happened. I've experienced something similar before on the same server, but I didn't investigate it further.

Running htop and iotop during the stall revealed no hints: CPU usage was low, with 6 of 12 cores almost completely unused, memory usage was low (16 of 64 GB used), and there was little I/O activity. Each of the server's 4 GPUs has 95% of its memory allocated to a single instance of these intensive applications. This doesn't change while the applications run. The applications perform identical operations over and over again (matrix multiplications), so the slowdown should not be related to any activity caused by the applications themselves.

How can I identify what was causing this stall of my applications?

pir

Posted 2017-05-10T03:10:45.447

Reputation: 221

Have you thought of thermal throttling? 4 GPUs in a server running full throttle need quite a thermal solution and quite low ambient temperatures. All GPUs I know of throttle when they become hot - a factor of 3 would sound quite realistic. – Eugen Rieck – 2017-05-10T03:17:47.553

That sounds like it! Thanks! I can see that the GPU temperatures are now around 80 degrees Celsius (right after they stopped stalling). I'll try to do some logging, but that's most likely the cause. – pir – 2017-05-10T03:21:32.710

If you happen to know of simple ways to log the temperature of the GPUs, I'd love to hear about them. I can't seem to find any simple solutions. – pir – 2017-05-10T03:26:28.937

lm-sensors can read at least the Nvidia temps. – Eugen Rieck – 2017-05-10T03:37:22.180

Please let us know if this solved it. I would then create an answer, so that others in a similar situation can find it. – Eugen Rieck – 2017-05-10T03:38:30.403

I'll need to set up the sensors and experience another stall before I can confirm anything, but feel free to add an answer now. – pir – 2017-05-10T17:45:55.013
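
For the temperature logging asked about in the comments, one possible approach is to poll `nvidia-smi` and append the readings to a CSV file. The following is a minimal sketch, not a tested setup for this server: it assumes the Nvidia driver's `nvidia-smi` tool is on the PATH, and the file name `gpu_temps.csv`, the 10-second interval, and the 80 °C warning threshold are arbitrary choices for illustration. Logging the SM clock alongside the temperature may help, since a clock drop that coincides with high temperatures would be consistent with thermal throttling.

```python
import csv
import os
import subprocess
import time

# Fields requested from nvidia-smi via its --query-gpu option.
FIELDS = ["timestamp", "index", "temperature.gpu", "clocks.sm", "utilization.gpu"]


def parse_line(line):
    """Turn one CSV line of nvidia-smi output into a dict keyed by field name."""
    values = [v.strip() for v in line.split(",")]
    return dict(zip(FIELDS, values))


def hot_gpus(rows, threshold_c=80):
    """Return indices of GPUs at or above threshold_c (an assumed cutoff)."""
    hot = []
    for row in rows:
        temp = row["temperature.gpu"]
        # Skip readings the driver could not provide (non-numeric values).
        if temp.isdigit() and int(temp) >= threshold_c:
            hot.append(row["index"])
    return hot


def sample():
    """Query all GPUs once and return a list of parsed rows."""
    out = subprocess.check_output(
        ["nvidia-smi",
         "--query-gpu=" + ",".join(FIELDS),
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return [parse_line(line) for line in out.splitlines() if line.strip()]


def main(path="gpu_temps.csv", interval_s=10):
    """Append one row per GPU to `path` every `interval_s` seconds."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        while True:
            rows = sample()
            writer.writerows(rows)
            f.flush()  # keep the log usable even if the script is killed
            hot = hot_gpus(rows)
            if hot:
                print("GPUs at or above threshold:", ", ".join(hot))
            time.sleep(interval_s)


if __name__ == "__main__":
    main()
```

Left running in the background (for example inside `screen` or `tmux`), this would produce a log that can be compared against the speed figures the training applications print, to see whether the next slowdown lines up with a rise in temperature and a drop in clock speed.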

No answers