The problem I am facing with SLURM can be summarized as follows. Consider a bash script test.sh
that requests 8 CPUs but actually starts a job using 10 CPUs:
#!/bin/sh
#SBATCH --ntasks=8        # request 8 tasks (i.e. 8 CPUs)
stress -c 10              # but spawn 10 CPU-bound workers
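I submit several copies simply by calling sbatch repeatedly, e.g. something like:
for i in $(seq 5); do sbatch test.sh; done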
On a server with 32 CPUs, if I submit this script 5 times with sbatch test.sh
, 4 of the jobs start running right away and the last one appears as pending, as shown by the squeue
command:
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
5 main test.sh jack PD 0:00 1 (Resources)
1 main test.sh jack R 0:08 1 server
2 main test.sh jack R 0:08 1 server
3 main test.sh jack R 0:05 1 server
4 main test.sh jack R 0:05 1 server
The problem is that these 4 running jobs are actually using 40 CPUs and overload the system. I would instead expect SLURM either to refuse to run jobs that use more resources than requested by the user, or to hold them until there are enough resources to run them.
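The overload is easy to see on the node itself, for example with something along these lines:
# load average ends up well above the 32 available CPUs
uptime
# count the CPU-bound stress workers spawned by the 4 running jobs
ps -C stress --no-headers | wc -l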
Some useful details about my slurm.conf
file:
# SCHEDULING
#DefMemPerCPU=0
FastSchedule=1
#MaxMemPerCPU=0
SchedulerType=sched/backfill
SchedulerPort=7321
SelectType=select/cons_res
SelectTypeParameters=CR_CPU
# COMPUTE NODES
NodeName=server CPUs=32 RealMemory=10000 State=UNKNOWN
# PARTITIONS
PartitionName=main Nodes=server Default=YES Shared=YES MaxTime=INFINITE State=UP
I am just starting with SLURM and I am puzzled by this behavior. How can I make sure that users of my server cannot run jobs that use more CPUs than they request? I have read the manual and spent a lot of time looking for information on forums, but unfortunately I did not find anything helpful.
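Would cgroup-based task confinement be the right mechanism here? The following is only a sketch of what I imagine from the documentation (the exact parameters are my guess and may well be wrong or incomplete):
# in slurm.conf (my guess)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup
# in cgroup.conf (my guess)
CgroupAutomount=yes
ConstrainCores=yes
If this is the right direction, I would appreciate confirmation; if not, what is the intended way to keep jobs within the CPUs they requested?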
Many thanks in advance for your help!