I have a problem running nvidia-docker containers on a Slurm cluster. Inside the container, all GPUs are visible, so the container effectively ignores the CUDA_VISIBLE_DEVICES environment variable set by Slurm. Outside the container, the visible GPUs are correct.

Is there a way to restrict the container, e.g. with -e NVIDIA_VISIBLE_DEVICES? Or is there a way to set NVIDIA_VISIBLE_DEVICES to CUDA_VISIBLE_DEVICES?
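
For example, something along these lines in the job script (a sketch; the image name and GPU count are placeholders):

```bash
#!/bin/bash
#SBATCH --gres=gpu:1

# Forward the GPU list Slurm assigned to this job to the NVIDIA
# container runtime, which reads NVIDIA_VISIBLE_DEVICES to decide
# which GPUs to expose inside the container.
# (--runtime=nvidia may be unnecessary if nvidia is the default runtime.)
docker run --rm --runtime=nvidia \
  -e NVIDIA_VISIBLE_DEVICES="$CUDA_VISIBLE_DEVICES" \
  my-cuda-image nvidia-smi -L
```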


1 Answer

This problem happened to me as well, and the solution was to install rootless Docker on the compute nodes. I think this is because the root Docker daemon is started before any Slurm processes, so the containers it launches do not inherit the job's environment or cgroup constraints, and you lose Slurm's abstraction layer.
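
You can see the mismatch from inside an allocation (the image tag is just an example):

```bash
# Inside a 1-GPU Slurm allocation:
echo "$CUDA_VISIBLE_DEVICES"   # e.g. "0" -- set by Slurm for this job

# The same check in a container started via the root daemon lists
# every GPU on the node, because dockerd is not a child of the job:
docker run --rm --runtime=nvidia nvidia/cuda:11.0-base nvidia-smi -L
```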

To install rootless Docker, you can use a method similar to the DeepOps installation process, which uses an Ansible playbook; DeepOps provides a guide and playbook for this.
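
For reference, a minimal manual sketch of what such a playbook automates, using Docker's documented rootless install script (run as the unprivileged user on each compute node):

```bash
# Requires the uidmap package (newuidmap/newgidmap) on the node.
curl -fsSL https://get.docker.com/rootless | sh

# Point the Docker CLI at the per-user daemon, as the installer suggests:
export PATH=$HOME/bin:$PATH
export DOCKER_HOST=unix://$XDG_RUNTIME_DIR/docker.sock

docker run --rm hello-world
```

Because the rootless daemon runs under your own user and can be started from within the job, the containers it launches stay inside Slurm's device constraints.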

I hope this solves your problem.