
We have a new Supermicro AS-4124GS-TNR server equipped with eight NVIDIA RTX A6000 GPUs. The OS is Ubuntu 20.04.2, the NVIDIA driver version is 460.73.01 (the Nouveau driver is not used), and the CUDA version is 11.2.

We ran several long-running tests on the GPUs and the system was stable. However, the system crashed repeatedly after the GPUs had been idle for a while.

We assume that GpuPowerMizerMode has to be set to 1 to prevent crashes during GPU idling (an assumption backed by other user reports found on the internet).

The only way to do this that we know of is to start X (e.g. by starting gdm) and then set the value accordingly via nvidia-settings (running nvidia-settings without X/gdm leads to "Unable to init server: Could not connect: Connection refused."). But when stopping X/gdm, the GpuPowerMizerMode value is automatically reset to 2. Unfortunately, keeping X/gdm running is not an option because this also leads to system instability.

So, our problem seems to be as follows:

  1. GPU idling + GpuPowerMizerMode != 1 can result in a system freeze. GpuPowerMizerMode can only be set via nvidia-settings connected to a running X/dm(?). To keep the value at 1, X/dm(?) has to keep running.
  2. A running X/gdm can cause a system crash.

Are our assumptions correct? / Are others also experiencing these specific problems?

How can we solve the problem of freezing during GPU idling?

user776206

1 Answer


It should not be necessary to start a GUI session (or even have one installed!) to change settings such as this; nvidia-settings should work fine from the framebuffer console or even in a script you write that runs at startup.

Check to be sure:

# nvidia-settings -q GpuPowerMizerMode

  Attribute 'GPUPowerMizerMode' (blacktemple:1[gpu:0]): 1.
    Valid values for 'GPUPowerMizerMode' are: 0, 1 and 2.
    'GPUPowerMizerMode' can use the following target types: GPU.

For eight GPUs just write a simple script, something like:

for n in $(seq 0 7); do
    nvidia-settings -a "[gpu:$n]/GpuPowerMizerMode=1"
done

and run it at startup in whatever manner you find convenient.
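If you want the startup script to report which GPUs could not be set, a slightly more defensive version of the loop above might look like the sketch below. The function name `set_powermizer_mode` and the `NVIDIA_SETTINGS` variable are made up for illustration; only the `nvidia-settings -a "[gpu:N]/GpuPowerMizerMode=..."` call is the same one used above. It assumes nvidia-settings can reach the driver in your environment, which you should confirm first with the query shown earlier.

```shell
#!/bin/sh
# Sketch only: set GpuPowerMizerMode on GPUs 0..(count-1) and report failures.
# NVIDIA_SETTINGS is a hypothetical override so the command can be swapped out.
NVIDIA_SETTINGS="${NVIDIA_SETTINGS:-nvidia-settings}"

# usage: set_powermizer_mode <gpu-count> <mode>
set_powermizer_mode() {
    count="$1"
    mode="$2"
    failed=0
    n=0
    while [ "$n" -lt "$count" ]; do
        # Same assignment as in the loop above, one GPU at a time.
        if ! "$NVIDIA_SETTINGS" -a "[gpu:$n]/GpuPowerMizerMode=$mode"; then
            echo "gpu $n: could not set GpuPowerMizerMode=$mode" >&2
            failed=1
        fi
        n=$((n + 1))
    done
    # Non-zero exit status if any GPU could not be set.
    return "$failed"
}

# For the eight GPUs in the question, call it as:
#   set_powermizer_mode 8 1
```

The non-zero return status lets whatever runs the script at boot notice that the setting did not take, instead of failing silently.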


I can't say whether your crashes are due to running with GpuPowerMizerMode!=1. If that is the case, then you probably have some sort of defective hardware that you should track down and replace.

Michael Hampton