I have a problem running nvidia-docker containers on a Slurm cluster. Inside the container, all GPUs are visible, so the container effectively ignores the CUDA_VISIBLE_DEVICES environment variable set by Slurm. Outside the container, the visible GPUs are correct.

Is there a way to restrict the container, e.g. with -e NVIDIA_VISIBLE_DEVICES? Or is there a way to set NVIDIA_VISIBLE_DEVICES to CUDA_VISIBLE_DEVICES?
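
For example, something along these lines in the job script (a sketch; the image name and GPU count are placeholders):

```bash
#!/bin/bash
#SBATCH --gres=gpu:1

# Forward the GPU list Slurm assigned to this job to the NVIDIA
# container runtime, which reads NVIDIA_VISIBLE_DEVICES to decide
# which GPUs to expose inside the container.
# (--runtime=nvidia may be unnecessary if nvidia is the default runtime.)
docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES" \
  my-cuda-image nvidia-smi -L
```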


1 Answer

This problem happened to me as well, and the solution was to install rootless Docker on the compute nodes. I think this is because the root Docker daemon is started before any Slurm processes, so the containers it launches do not inherit the job's environment or cgroup constraints, and you lose Slurm's abstraction layer.
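
You can see the mismatch from inside an allocation (the image tag is just an example):

```bash
# Inside a 1-GPU Slurm allocation:
echo "$CUDA_VISIBLE_DEVICES"   # e.g. "0" -- set by Slurm for this job

# The same check in a container started via the root daemon lists
# every GPU on the node, because dockerd is not a child of the job:
docker run --rm --runtime=nvidia nvidia/cuda:11.0-base nvidia-smi -L
```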

To install rootless Docker, you can use a method similar to the DeepOps installation process, which uses an Ansible playbook; DeepOps provides a guide and playbook for this.
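
For reference, a minimal manual sketch of what such a playbook automates, using Docker's documented rootless install script (run as the unprivileged user on each compute node):

```bash
# Requires the uidmap package (newuidmap/newgidmap) on the node.
curl -fsSL https://get.docker.com/rootless | sh

# Point the Docker CLI at the per-user daemon, as the installer suggests:
export PATH=$HOME/bin:$PATH
export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock

docker run --rm hello-world
```

Because the rootless daemon runs under your own user and can be started from within the job, the containers it launches stay inside Slurm's device constraints.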

I hope this solves your problem.