
My supercomputing center recently moved from SGE to PBS/Torque. Now, when I submit my array jobs, only half of the jobs in the array get scheduled; when those finish, the other half get scheduled. This happens even though the cluster's nodes are largely underutilized.
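
For reference, I submit the arrays with Torque's -t array syntax, something like the line below (myscript.sh is a placeholder; the real script name is truncated in the qstat output further down):

[myuserna@sub ~]$ qsub -t 1-10 myscript.sh   # myscript.sh is a placeholder name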

For example, I just submitted an array of 10 jobs. Here is the qstat output 10 minutes later:

[myuserna@sub ~]$ qstat -t
Job id                    Name             User            Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
3100[1].systemm2           ...-to-work.sh-1 myuserna        00:07:40 R short          
3100[2].systemm2           ...-to-work.sh-2 myuserna        00:07:32 R short          
3100[3].systemm2           ...-to-work.sh-3 myuserna        00:09:55 R short          
3100[4].systemm2           ...-to-work.sh-4 myuserna        00:09:44 R short          
3100[5].systemm2           ...-to-work.sh-5 myuserna        00:09:07 R short          
3100[6].systemm2           ...-to-work.sh-6 myuserna               0 Q short          
3100[7].systemm2           ...-to-work.sh-7 myuserna               0 Q short          
3100[8].systemm2           ...-to-work.sh-8 myuserna               0 Q short          
3100[9].systemm2           ...-to-work.sh-9 myuserna               0 Q short          
3100[10].systemm2          ...to-work.sh-10 myuserna               0 Q short          
[myuserna@sub ~]$ 

Any clues on how to fix the scheduler?

Here is the relevant portion of the server and queue configuration (from qmgr):

create queue short
set queue short queue_type = Execution
set queue short Priority = 10000
set queue short max_user_queuable = 500
set queue short max_running = 200
set queue short resources_max.walltime = 24:00:00
set queue short resources_default.nodes = 1
set queue short max_user_run = 50
set queue short enabled = True
set queue short started = True
#

#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = systemm2
set server acl_roots = root@*
set server managers = root@systemm2.local
set server operators = root@systemm2.local
set server default_queue = route
set server log_events = 511
set server mail_from = adm
set server resources_default.walltime = 01:00:00
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server mom_job_sync = True
set server keep_completed = 300
set server submit_hosts = submit-1
set server submit_hosts += submit-0
set server auto_node_np = True
set server next_job_number = 6217
set server max_job_array_size = 512
set server max_slot_limit = 5
  • It is hard to tell with little information. How many nodes and processors are available to the scheduler? What is your scheduler config like? – ryanlim Dec 01 '10 at 15:34
  • We have 1100 nodes. Right now about 80% of them are idle. – vy32 Dec 01 '10 at 15:36
  • Could you run: qmgr -c 'p n node01' ... where node01 is any arbitrary node in your cluster. What does "np" show? – ryanlim Dec 01 '10 at 15:40

1 Answer


Check with your administrator. It is possible to limit the number of slots in use per user per queue.
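
If you can't see the configuration yourself, qmgr's read-only print commands are usually available to ordinary users and will show any per-user or per-queue limits:

[myuserna@sub ~]$ qmgr -c 'print server'
[myuserna@sub ~]$ qmgr -c 'print queue short'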

Update: okay, now you've updated the question to show

set server max_slot_limit = 5

which I'm pretty sure answers the question: in Torque, max_slot_limit caps how many sub-jobs of a single array may run concurrently (and serves as the default slot limit for arrays that don't request one), which lines up exactly with your 5 running and 5 queued jobs.
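
If that is the culprit, the fix is server-side. A minimal sketch, assuming you (or your admin) have Torque manager rights; 200 is just an example value chosen to match the queue's max_running:

# raise the cap on concurrently running array sub-jobs (run as a Torque manager)
qmgr -c 'set server max_slot_limit = 200'

# verify the change took effect
qmgr -c 'list server' | grep max_slot_limit

Torque also accepts a per-array slot limit at submission time (qsub -t 1-10%N), but the server's max_slot_limit is a ceiling on what users may request, so raising the server setting is the real fix.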

mattdm