SLURM allows jobs using more CPUs than requested to start

The problem I am facing with SLURM can be summarized as follows. Consider a bash script test.sh that requests 8 CPUs but actually starts a job using 10 CPUs:

#!/bin/sh
#SBATCH --ntasks=8
stress -c 10

On a server with 32 CPUs, if I submit this script 5 times with sbatch test.sh, 4 of the jobs start running right away and the last one appears as pending, as shown by the squeue command:

JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
    5      main  test.sh     jack PD       0:00      1 (Resources)
    1      main  test.sh     jack  R       0:08      1 server
    2      main  test.sh     jack  R       0:08      1 server
    3      main  test.sh     jack  R       0:05      1 server
    4      main  test.sh     jack  R       0:05      1 server

The problem is that these 4 jobs are actually using 40 CPUs and overload the system. I would instead expect SLURM either not to start jobs that actually use more resources than the user requested, or to put them on hold until there are enough resources to start them.

Some useful details about my slurm.conf file:

# SCHEDULING                                                                       
#DefMemPerCPU=0                                                                    
FastSchedule=1                                                                     
#MaxMemPerCPU=0                                                                    
SchedulerType=sched/backfill                                                       
SchedulerPort=7321                                                                 
SelectType=select/cons_res                                                         
SelectTypeParameters=CR_CPU
# COMPUTE NODES                                                                 
NodeName=server CPUs=32 RealMemory=10000 State=UNKNOWN                   
# PARTITIONS                                                                    
PartitionName=main Nodes=server Default=YES Shared=YES MaxTime=INFINITE State=UP

I am just starting with SLURM and I am puzzled by this behavior. How can I make sure that the users of my server do not start jobs that use too many CPUs? I read the manual and spent a lot of time looking for information on forums, but unfortunately I did not find anything helpful.

Many thanks in advance for your help!

remek

Posted 2015-06-19T17:15:22.840

Reputation: 111

Answers

Slurm cannot know how many processes/threads a script is going to create. It can only rely on the resources requested, and that is what it uses to schedule jobs.

The best approach here is to use one of the affinity plugins in Slurm to prevent jobs from using more resources than requested. These plugins bind a job to the CPUs it requested. (Affinity documentation)
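
For example, a minimal sketch of what this could look like in slurm.conf (the exact plugin and related options are described in the affinity documentation, so treat this as something to verify for your Slurm version):

# TASK BINDING (sketch -- binds each job to the CPUs it was allocated)
TaskPlugin=task/affinity

After changing slurm.conf, the daemons need to pick up the new configuration, typically by restarting slurmctld and the slurmd on the node.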

Obviously you cannot control how many processes/threads a user starts in their script, but by limiting the number of cores a job can use you will reduce the impact that an uncontrolled user may have on other users' jobs.

This will not prevent your system from appearing overloaded, but the "bad" users will only affect themselves.
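
If your nodes support cgroups, a stricter alternative in the same spirit (not an affinity plugin as such, but it enforces the limit at the kernel level) is task/cgroup. A sketch, assuming cgroup support is available on the node:

# in slurm.conf (sketch)
TaskPlugin=task/cgroup
# in cgroup.conf, next to slurm.conf (sketch)
ConstrainCores=yes

With ConstrainCores=yes the kernel confines each job's processes to its allocated cores, so 10 stress workers in an 8-CPU allocation would simply share those 8 CPUs.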

Carles Fenoy

Posted 2015-06-19T17:15:22.840

Reputation: 151

Following our discussion over at SO, I've been trying to use the --exclusive argument to achieve this. My architecture is different from yours (I have 7 processors available to Slurm), but here is what I did:

#!/bin/sh
#SBATCH --ntasks=2    
srun -n 2 --exclusive stress -c 1

and then running

sbatch test.sh ; sbatch test.sh ; sbatch test.sh ; sbatch test.sh

gives me 6 stress processes:

15050 tom       20   0    7308    212    108 R 100.0  0.0   1:47.46 stress                                                                                                              
15054 tom       20   0    7308    208    108 R 100.0  0.0   1:47.47 stress                                                                                                              
15063 tom       20   0    7308    208    108 R 100.0  0.0   1:47.47 stress                                                                                                              
15064 tom       20   0    7308    212    108 R 100.0  0.0   1:47.47 stress                                                                                                              
15080 tom       20   0    7308    208    108 R 100.0  0.0   1:47.46 stress                                                                                                            
15076 tom       20   0    7308    212    108 R  99.7  0.0   1:47.45 stress      

with the last one waiting in the queue:

     JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
      2368       Tom  test.sh      tom PD       0:00      1 (Resources)
      2365       Tom  test.sh      tom  R       5:03      1 Tom
      2366       Tom  test.sh      tom  R       5:03      1 Tom
      2367       Tom  test.sh      tom  R       5:03      1 Tom

So in this case, using srun -n 2 causes the same process to be launched twice. The same thing happens if I use

#!/bin/sh
#SBATCH --ntasks=2
srun -n 1 --exclusive stress -c 1 &
srun -n 1 --exclusive stress -c 1 &
srun -n 1 --exclusive stress -c 1 &
wait

i.e. SLURM knows this batch script has two tasks so it will let two run simultaneously; the third has to 'wait its turn'.

On the other hand

#!/bin/sh
#SBATCH --ntasks=1
srun -n 1 --exclusive stress -c 2

gives me the behaviour you describe in your question.
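
Applying the same pattern to your original 8-task script would look something like this sketch (the idea being that the number of stress workers matches the number of tasks requested):

#!/bin/sh
#SBATCH --ntasks=8
# launch 8 single-worker stress tasks, one per requested CPU
srun -n 8 --exclusive stress -c 1

That way the job starts 8 workers on the 8 CPUs it asked for, rather than 10 workers on an 8-CPU request.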

Not sure if this answers 100% but maybe it helps a little.

Tom

Posted 2015-06-19T17:15:22.840

Reputation: 101