
I am able to launch a job on a GPU server the traditional way (using CPUs and memory as consumable resources):

~ srun -c 1 --mem 1M -w serverGpu1 hostname
serverGpu1

but trying to request a GPU gives an error:

~ srun -c 1 --mem 1M --gres=gpu:1 hostname
srun: error: Unable to allocate resources: Invalid generic resource (gres) specification
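
From what I have read, this error usually means that slurmctld does not currently know about any gpu GRES in the cluster, regardless of what the node itself detects. One way to check what the controller has registered for the node (output omitted here):

~ scontrol show node serverGpu1 | grep -i gres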

I checked this question but it doesn't help in my case.

slurm.conf

On all nodes

SlurmctldHost=vinz
SlurmctldHost=shiny
GresTypes=gpu
MpiDefault=none
ProctrackType=proctrack/cgroup
ReturnToService=1
SlurmctldPidFile=/media/Slurm/slurmctld.pid
SlurmctldPort=6817
SlurmdPidFile=/var/run/slurmd.pid
SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
StateSaveLocation=/media/Slurm
SwitchType=switch/none
TaskPlugin=task/cgroup

InactiveLimit=0
KillWait=30
MinJobAge=300
SlurmctldTimeout=120
SlurmdTimeout=300
Waittime=0
DefMemPerCPU=1
SchedulerType=sched/backfill
SelectType=select/cons_tres
SelectTypeParameters=CR_CPU_Memory
AccountingStorageType=accounting_storage/none
AccountingStoreJobComment=YES
ClusterName=cluster
JobCompLoc=/media/Slurm/job_completion.txt
JobCompType=jobcomp/filetxt
JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/cgroup
SlurmctldDebug=info
SlurmctldLogFile=/media/Slurm/slurmctld.log
SlurmdDebug=info
SlurmdLogFile=/var/log/slurm-llnl/slurmd.log
MaxArraySize=10001
NodeName=docker1 CPUs=144 Boards=1 RealMemory=300000 Sockets=4 CoresPerSocket=18 ThreadsPerCore=2 Weight=100 State=UNKNOWN
NodeName=serverGpu1 CPUs=96 RealMemory=550000 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 Gres=gpu:nvidia_tesla_t4:4 ThreadsPerCore=2 Weight=500 State=UNKNOWN

PartitionName=Cluster Nodes=docker1,serverGpu1 Default=YES MaxTime=INFINITE State=UP

cgroup.conf

On all nodes

CgroupAutomount=yes 
CgroupReleaseAgentDir="/etc/slurm-llnl/cgroup" 

ConstrainCores=yes 
ConstrainDevices=yes
ConstrainRAMSpace=yes

gres.conf

Only on GPU servers

AutoDetect=nvml
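
For reference, my understanding is that a hand-written gres.conf equivalent to what AutoDetect=nvml should discover would look roughly like this (assuming the four T4s appear as /dev/nvidia0 through /dev/nvidia3):

Name=gpu Type=nvidia_tesla_t4 File=/dev/nvidia[0-3]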

As for the log of the GPU server:

[2021-12-06T12:22:52.800] gpu/nvml: _get_system_gpu_list_nvml: 4 GPU system device(s) detected
[2021-12-06T12:22:52.801] CPU frequency setting not configured for this node
[2021-12-06T12:22:52.803] slurmd version 20.11.2 started
[2021-12-06T12:22:52.803] killing old slurmd[42176]
[2021-12-06T12:22:52.805] slurmd started on Mon, 06 Dec 2021 12:22:52 +0100
[2021-12-06T12:22:52.805] Slurmd shutdown completing
[2021-12-06T12:22:52.805] CPUs=96 Boards=1 Sockets=2 Cores=24 Threads=2 Memory=772654 TmpDisk=1798171 Uptime=8097222 CPUSpecList=(null) FeaturesAvail=(null) FeaturesActive=(null)

I would like some guidance on how to resolve this issue, please.

Edit: as requested by @Gerald Schneider:

~ sinfo -N -o "%N %G"
NODELIST GRES
docker1 (null)
serverGpu1 (null)
  • can you please add the output of `sinfo -N -o "%N %G"`? – Gerald Schneider Dec 06 '21 at 14:56
  • @GeraldSchneider done! – user324810 Dec 06 '21 at 14:58
  • Try adding the GPUs to gres.conf on the node directly, instead of setting it to AutoDetect. I get the correct GPU definitions in the %G column with sinfo on my nodes. – Gerald Schneider Dec 06 '21 at 15:00
  • I removed the `AutoDetect=nvml` and I set in the `gres.conf` the following line: `Name=gpu File=/dev/nvidia[0-3]` and in the slurm.conf, I changed the NodeName of the GPU by modifying to `Gres=gpu`. In the log, I got `[2021-12-06T16:05:47.604] WARNING: A line in gres.conf for GRES gpu has 3 more configured than expected in slurm.conf. Ignoring extra GRES.` – user324810 Dec 06 '21 at 15:06
  • My config looks very similar to yours. The only difference I see is that I have AccountingStorage enabled and have set `AccountingStorageTRES=gres/gpu,gres/gpu:tesla`, but I don't think that should be necessary. I also have a `Type=` set in gres.conf, you could try setting it to `nvidia_tesla_t4` so it matches your definition in slurm.conf. – Gerald Schneider Dec 07 '21 at 09:05
  • Are the slurm.conf files identical on your nodes? Try setting `DebugFlags=gres` and see if something helpful shows up in the logs. – Gerald Schneider Dec 07 '21 at 09:05
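
Note for future readers: as far as I understand the warning quoted above, `Gres=gpu` in slurm.conf implies a count of 1, while `File=/dev/nvidia[0-3]` in gres.conf declares 4 devices, hence the "3 more configured than expected" message. A consistent (untested) pairing would be, in slurm.conf:

NodeName=serverGpu1 CPUs=96 RealMemory=550000 Boards=1 SocketsPerBoard=2 CoresPerSocket=24 ThreadsPerCore=2 Gres=gpu:nvidia_tesla_t4:4 Weight=500 State=UNKNOWN

and in gres.conf on serverGpu1:

Name=gpu Type=nvidia_tesla_t4 File=/dev/nvidia[0-3]

with slurmctld and slurmd restarted afterwards so both daemons read the same configuration.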
