
We have a cluster performing different tasks. It computes batch simulations via the Torque scheduler. We also have an interactive simulation, which also needs the full compute power. The interactive simulation is an OpenMPI program that starts processes on each node.

So we want the following: when the interactive simulation is started, all of the PBS jobs should be shifted to the background, releasing their workload for the interactive simulation.

Is this even possible with these two different parallelization schemes?

I tried the following: I assigned a lower priority to the users of the Torque queue by appending to /etc/security/limits.conf the line

user    hard    priority    10

for each user on each node. But this is ignored by the scheduler; the PBS jobs still get a niceness value of 0.

The cluster is running CentOS.

Does the qsub -p priority option affect the OS priority of the corresponding job processes, or is it only used by the scheduler?

I hope someone here has experience with the correct configuration of the queuing system.

stephanp
1 Answer


The -p option of qsub only affects the job's priority within the queue. If you want to start the processes with a lower priority (in Unix/Linux terms, a higher nice value), you can submit with qsub -l nice=19. Depending on the MPI implementation and the application this still might not work, but it should in general.
However, this method only tells the OS to assign fewer CPU time slices to the niced processes; the batch jobs will still compete with your interactive job running at niceness 0. Also keep memory in mind, which might not satisfy the needs of the concurrent jobs.
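For illustration, what `qsub -l nice=19` ultimately amounts to is the job's processes running with a high nice value, the same effect you can reproduce locally with plain `nice` (the `sleep` here is just a stand-in for a compute process):

```shell
# Start a stand-in compute process at niceness 19 and verify it.
nice -n 19 sleep 60 &
pid=$!
niceness=$(ps -o ni= -p "$pid" | tr -d ' ')  # nice value as seen by the OS
kill "$pid"                                  # clean up the stand-in
echo "niceness=$niceness"
```

Niceness only changes the relative scheduling weight, which is why a niced batch job can still degrade an interactive job once memory or I/O become the bottleneck.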

Normally you want to suspend a job to free up resources in a cluster, which brings its own problems. One approach could be to stop the processes with SIGSTOP and resume them with SIGCONT. I have not worked with Torque for a couple of years, but I guess there should be a way to specify your own script when a job is suspended. Also look into preemption in your further research.
Unfortunately this approach might also not work correctly with MPI; it depends on the MPI implementation, so you will have to read up on the capabilities of the MPI version you are using. Additionally, suspending does not free up memory, so the node could start swapping or the OOM killer could kill your jobs.
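The signal mechanics themselves are easy to try outside the scheduler; under Torque you would deliver the signals to a job with `qsig` (e.g. `qsig -s SIGSTOP <jobid>`, but verify the exact invocation against your Torque version). A local demonstration with a stand-in process:

```shell
# Suspend and resume a local stand-in process with SIGSTOP/SIGCONT.
sleep 60 &
pid=$!

kill -STOP "$pid"   # freeze: the kernel stops scheduling CPU time for it
sleep 0.5           # give ps a moment to see the new state
state_stopped=$(ps -o stat= -p "$pid" | tr -d ' ')   # state starts with "T"

kill -CONT "$pid"   # resume execution
sleep 0.5
state_resumed=$(ps -o stat= -p "$pid" | tr -d ' ')   # back to sleeping/running

kill "$pid"         # clean up
echo "stopped=$state_stopped resumed=$state_resumed"
```

Note the memory caveat above still applies: a stopped process keeps all of its RSS, which is exactly why suspension alone often is not enough on a full node.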

Normally you want to checkpoint your jobs, which makes it possible to really end the processes and free up the resources. Checkpointing is probably best supported within the application itself, by writing out interim results from which you can restart. There are also checkpointing approaches that do not rely on the application, but as far as I know these do not work reliably. Just use a search engine of your choice and look for "mpi checkpoint".
One could also cancel the MPI job and requeue it to start over at a later time, depending on the jobs and on user acceptance.
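A minimal sketch of that cancel-and-requeue route, assuming the batch job was submitted as rerunnable (`qsub -r y`) and using a placeholder job id:

```shell
# Abort the running batch job and put it back in the queue (Torque).
# The job id is a placeholder; use the id reported by qsub/qstat.
JOBID="123.headnode"
qrerun "$JOBID"
# Alternatively, delete it and resubmit the script explicitly:
#   qdel "$JOBID" && qsub batch_job.sh
```

Requeued work is lost back to its last checkpoint (or its start, if it has none), which is where this option ties back into the checkpointing discussion above.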

You could also keep one or two nodes free for interactive work during working hours, if you can afford this or if the need for free interactive resources is high enough.
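If you go that route, one common Torque pattern is to tag the reserved nodes with a node property and bind a dedicated queue to it. A sketch, with hypothetical node and queue names (the nodes file usually lives under the server's `server_priv/` directory; check the paths and attribute names against your Torque version):

```
# server_priv/nodes: tag two nodes for interactive use
node01 np=16 interactive
node02 np=16 interactive

# Route jobs from a dedicated queue onto those nodes only:
#   qmgr -c "create queue interactive queue_type=execution"
#   qmgr -c "set queue interactive resources_default.neednodes=interactive"
#   qmgr -c "set queue interactive enabled=true started=true"
```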

Another point to consider is switching the scheduler to Maui, as the default Torque scheduler is very limited when it comes to configuring advanced scheduling.
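Maui, in particular, supports preemption natively. A hedged `maui.cfg` sketch of the idea (the QOS names are made up, and the exact parameters should be verified against the Maui documentation):

```
# maui.cfg fragment: let interactive jobs preempt batch jobs
PREEMPTPOLICY         SUSPEND      # or REQUEUE / CHECKPOINT
QOSCFG[interactive]   QFLAGS=PREEMPTOR PRIORITY=1000
QOSCFG[batch]         QFLAGS=PREEMPTEE
```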

Thomas
  • Thanks for the detailed insights. Lots of routes to try; I'll report on our progress. Fortunately, memory is not an issue yet and both simulations fit, but I'll keep that in mind. The interactive MPI program needs at least half of the cluster and is only used rarely, so it would be a waste to reserve nodes all the time. – stephanp Jul 03 '18 at 09:20