I am configuring a Slurm scheduler, and I need to limit the maximum number of jobs running concurrently on a partition (queue).

I have read a lot about accounting and resource limits, but it all relates to per-user limits. I have also read about creating associations, but I am not sure whether that is necessary.

I need to limit the number of jobs per partition (queue), since the same compute nodes belong to more than one partition.

That is, I have 2 partitions, short and long, with the same compute nodes but with different time limits and priorities. If all the users submit long jobs to the long partition, they can block the cluster. So I want to limit the maximum number of jobs running in the long partition.

Thanks in advance.

José Manuel

3 Answers

Now that I have seen your edit: this should actually be accomplished via priority and node sharing, not job limiting.
If you don't implement accounting, see both multifactor priority and preemption.

Preemption is by far the simpler of the two to configure: set PreemptType=preempt/partition_prio and give the short jobs queue the higher priority.
You will also have to set PreemptMode=SUSPEND,GANG in slurm.conf, and Shared=FORCE on the default queue or on each queue configured for priority.

It works quite well, but it can result in starvation of the long-duration jobs.
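
As a rough illustration of the preemption setup described above, a slurm.conf fragment might look like the following. This is only a sketch: the node names, CPU counts, time limits, and priority values are made up for the example, and newer Slurm releases spell Shared as OverSubscribe, so check man slurm.conf for your version.

```conf
# slurm.conf fragment (sketch; node names and all numbers are hypothetical)

# Jobs in a higher-priority partition preempt jobs in a lower-priority one
PreemptType=preempt/partition_prio
# Preempted jobs are suspended and later resumed rather than killed
PreemptMode=SUSPEND,GANG

NodeName=node[01-04] CPUs=16 State=UNKNOWN

# Both partitions share the same nodes; short has the higher priority.
# Shared=FORCE is called OverSubscribe=FORCE in newer Slurm releases.
PartitionName=short Nodes=node[01-04] MaxTime=02:00:00   Priority=10 Shared=FORCE Default=YES
PartitionName=long  Nodes=node[01-04] MaxTime=7-00:00:00 Priority=1  Shared=FORCE
```

With something like this in place, a job submitted to short should be able to suspend a running long job on the same nodes, which is exactly why long jobs can starve if short jobs keep arriving.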

Multifactor is fairer, but you will have to experiment to see what works for you. You will probably want to set PriorityWeightPartition, as there is no priority factor directly related to job wall time.
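
A minimal starting point for the multifactor approach could look like this. The weights and partition priorities below are arbitrary placeholders to experiment with, not recommended values, and the node names are hypothetical.

```conf
# slurm.conf fragment (sketch; weights and node names are hypothetical)

PriorityType=priority/multifactor

# Weight of the partition factor in the overall job priority
PriorityWeightPartition=10000
# Let waiting jobs slowly gain priority so long jobs are not starved forever
PriorityWeightAge=1000

# The partition factor is derived from each partition's priority setting
PartitionName=short Nodes=node[01-04] MaxTime=02:00:00   Priority=10 Default=YES
PartitionName=long  Nodes=node[01-04] MaxTime=7-00:00:00 Priority=1
```

Here jobs in short get a larger partition contribution than jobs in long, while the age factor gives queued long jobs a chance to catch up. Running sprio on pending jobs is a convenient way to see how the factors add up while tuning.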

Otherwise, install accounting and simply charge more for long-duration jobs.

Dani_l

Since I can't comment yet, I'm posting this as an answer.
Can you share your reasoning? Slurm works great as a resource manager: it will not allow more resources to be used than are available, unless you allow oversubscription. Why would you want to impose an artificial limit on top of that?

Anyway, if you are using backfill, you might get away with a simple bf_max_job_part=# or the more general partition_job_depth=# in SchedulerParameters.

Read about those options in man slurm.conf
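
For illustration, both options are set through SchedulerParameters in slurm.conf. The numbers below are placeholders. Note that, as described in the man page, these options bound how many jobs per partition the scheduler attempts to start in each scheduling pass, so they are not a strict cap on running jobs.

```conf
# slurm.conf fragment (sketch; the numeric values are hypothetical)

SchedulerType=sched/backfill

# bf_max_job_part:     max jobs per partition the backfill scheduler attempts to start
# partition_job_depth: max jobs per partition considered by the main scheduling logic
SchedulerParameters=bf_max_job_part=20,partition_job_depth=50
```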

Dani_l

The best way to accomplish this is to use QoS. For each QoS you can set different limits, such as the number of CPUs or the maximum walltime. QoS are more flexible than partitions in terms of limits.

So my recommendation is that you use only one partition with 2 QoS and set the limits at the QoS level.
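
A sketch of what this could look like follows. It assumes accounting through slurmdbd is already running; the partition name, QoS names, node names, and limits are all hypothetical. The QoS themselves are stored in the accounting database, so the sacctmgr commands are shown as comments next to the slurm.conf lines they go with.

```conf
# slurm.conf fragment (sketch; names and limits are hypothetical)

AccountingStorageType=accounting_storage/slurmdbd
# QoS limits are only enforced when this is set
AccountingStorageEnforce=limits,qos

# A single partition that accepts both QoS
PartitionName=batch Nodes=node[01-04] MaxTime=7-00:00:00 AllowQos=short,long Default=YES

# The QoS are created and limited with sacctmgr, for example:
#   sacctmgr add qos short
#   sacctmgr add qos long
#   sacctmgr modify qos short set MaxWall=02:00:00
#   sacctmgr modify qos long  set MaxWall=7-00:00:00 GrpJobs=20
```

GrpJobs limits the total number of jobs running under that QoS at any one time, which is the kind of cap the question asks for. Users would then pick a QoS at submission time with something like sbatch --qos=long, provided their association has been granted access to it.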

Carles Fenoy