
Jobs I add to the queue stay in the "Queued" state and are never executed (unless I manually qrun them).

/var/spool/torque/server_logs says only:

04/11/2011 12:43:27;0100;PBS_Server;Job;16.localhost;enqueuing into batch, state 1 hop 1
04/11/2011 12:43:27;0008;PBS_Server;Job;16.localhost;Job Queued at request of test@localhost, owner = test@localhost, job name = Qqq, queue = batch

The job requires just 1 CPU on 1 node.

# qmgr -c "list queue batch"
Queue batch
    queue_type = Execution
    total_jobs = 0
    state_count = Transit:0 Queued:0 Held:0 Waiting:0 Running:0 Exiting:0 
    max_running = 3
    acl_host_enable = True
    acl_hosts = localhost
    resources_min.ncpus = 1
    resources_min.nodect = 1
    resources_default.ncpus = 1
    resources_default.nodes = 1
    resources_default.walltime = 00:00:10
    mtime = Mon Apr 11 12:07:10 2011
    resources_assigned.ncpus = 0
    resources_assigned.nodect = 0
    kill_delay = 3
    enabled = True
    started = True

I can't set resources_assigned to a nonzero value; qmgr rejects it with "Cannot set attribute, read only or insufficient permission resources_assigned.ncpus".

When I qrun a job, this is written to the MOM log:

04/11/2011 21:27:48;0001;   pbs_mom;Svr;pbs_mom;LOG_DEBUG::mom_checkpoint_job_has_checkpoint, FALSE
04/11/2011 21:27:48;0001;   pbs_mom;Job;TMomFinalizeJob3;job 18.localhost started, pid = 28592
04/11/2011 21:27:48;0080;   pbs_mom;Job;18.localhost;scan_for_terminated: job 18.localhost task 1 terminated, sid=28592
04/11/2011 21:27:48;0008;   pbs_mom;Job;18.localhost;job was terminated
04/11/2011 21:27:48;0080;   pbs_mom;Svr;preobit_reply;top of preobit_reply
04/11/2011 21:27:48;0080;   pbs_mom;Svr;preobit_reply;DIS_reply_read/decode_DIS_replySvr worked, top of while loop
04/11/2011 21:27:48;0080;   pbs_mom;Svr;preobit_reply;in while loop, no error from job stat
04/11/2011 21:27:48;0080;   pbs_mom;Job;18.localhost;obit sent to server

Scheduler log (/var/spool/torque/sched_logs/20110705):

07/05/2011 21:44:53;0002; pbs_sched;Svr;Log;Log opened
07/05/2011 21:44:53;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20110705 opened
07/05/2011 21:44:53;0002; pbs_sched;Svr;main;/usr/sbin/pbs_sched startup pid 16234

qstat -f:

Job Id: 26.localhost
    Job_Name = qwe
    Job_Owner = test@localhost
    job_state = Q
    queue = batch
    server = localhost
    Checkpoint = u
    ctime = Tue Jul  5 21:43:31 2011
    Error_Path = localhost:/home/test/jscfi/default/0.738784810485275/qwe.e26
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Tue Jul  5 21:43:31 2011
    Output_Path = localhost:/home/test/jscfi/default/0.738784810485275/qwe.o26

    Priority = 0
    qtime = Tue Jul  5 21:43:31 2011
    Rerunable = True
    Resource_List.ncpus = 1
    Resource_List.neednodes = 1:ppn=1
    Resource_List.nodect = 1
    Resource_List.nodes = 1:ppn=1
    Resource_List.walltime = 00:01:00
    substate = 10
    Variable_List = PBS_O_HOME=/home/test,PBS_O_LANG=en_US.UTF-8,
    PBS_O_LOGNAME=test,
    PBS_O_PATH=/usr/local/bin:/usr/bin:/bin:/usr/bin/X11:/usr/games,
    PBS_O_MAIL=/var/mail/test,PBS_O_SHELL=/bin/sh,PBS_SERVER=127.0.0.1,
    PBS_O_WORKDIR=/home/test/jscfi/default/0.738784810485275,
    PBS_O_QUEUE=batch,PBS_O_HOST=localhost
    euser = test
    egroup = test
    queue_rank = 1
    queue_type = E
    etime = Tue Jul  5 21:43:31 2011
    submit_args = run.pbs
    Walltime.Remaining = 6
    fault_tolerant = False

How can I make it execute jobs automatically, without a manual qrun?

Vi.
  • If you do a qrun to force the job to run, does it work? What do you see on the mom_log on either your scheduler node or the execution node after you do a qrun? I saw this issue once a while back (jobs refusing to autostart), but it was a really weird condition and I'm trying to remember how I fixed it. I'm assuming that restarting pbs_server, pbs_mom, etc makes no difference? – ajdecon Apr 11 '11 at 13:14
  • @ajdecon, No, restarting changes nothing. – Vi. Apr 11 '11 at 18:30
  • OK, I found my notes from this bug, but I'm not sure it will help. When I saw this issue, it was caused by a mismatch of the /etc/group and /etc/passwd files between the head node and the computes. Only doing qrun as root would make the jobs start. – ajdecon Apr 12 '11 at 00:15
  • I'm running everything on a single system. How can /etc/{hosts,passwd,group,whatever} affect it, especially without any loud log messages? Is there something like a "debug log" or some other place where I can see why it is holding back the task? – Vi. Apr 12 '11 at 00:30
  • I do not see any communication between the scheduler and the server in the log. Also, localhost is not a good name for a server. You should configure a proper hostname that can be resolved correctly on every node of the cluster. – Dmitri Chubarov Nov 03 '12 at 16:29

2 Answers


I spent several hours on a problem with similar symptoms, and in the end it was a single option missing from the server settings:

qmgr -c "set server scheduling = True"
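To see whether the flag is already set before changing it, something like the following could be used. This is a minimal sketch, assuming `qmgr -c "list server"` prints attributes as indented `name = value` lines in the same style as the `list queue batch` output in the question. The helper name `scheduling_enabled` is hypothetical and not part of TORQUE.

```shell
# Hypothetical helper (not a TORQUE command): reads "list server" output
# on stdin and succeeds only if it contains a line "scheduling = True".
# Assumption: attributes are printed one per line as "name = value".
scheduling_enabled() {
    grep -Eq '^[[:space:]]*scheduling[[:space:]]*=[[:space:]]*True[[:space:]]*$'
}

# Hypothetical wrapper: enables scheduling only when it is not already on.
# Needs qmgr manager privileges (for example root on the server host).
ensure_scheduling() {
    if qmgr -c "list server" | scheduling_enabled; then
        echo "scheduling is already True"
    else
        qmgr -c "set server scheduling = True"
    fi
}
```

On the server host, calling `ensure_scheduling` runs the check and applies the setting from this answer only if it is needed.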
anonymous
  • Evidently, according to the logs, that makes it actually use the "basic" scheduler (pbs_sched). Any hint how you found out about that? – exic Nov 30 '19 at 14:27

Normally it would be the scheduler that decides when jobs are to be run, i.e. when there are sufficient resources, and tells the server to run the job. Are you running a scheduler? TORQUE includes a basic scheduler (pbs_sched), or you could install and run the more sophisticated maui (free) or moab (pay-for).

The pbs_server part of PBS/TORQUE is a "resource manager" - essentially just a 'framework'. It makes no decisions itself: that is the job of the scheduler.

Norky
  • Yes, scheduler is running (basic Torque scheduler, not maui). Attaching the scheduler log. – Vi. Jul 05 '11 at 18:51
  • @Vi: in that case, the standard TORQUE scheduler might have attached a comment to the job: run `qstat -f` and check for a comment at the end of each job's metadata, which might give you a clue as to why it is not running. – Norky Jul 06 '11 at 10:51
  • I see nothing strange in `qstat -f` output. Where to look? (attached it to the question). – Vi. Jul 06 '11 at 12:21
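
The comment check suggested above can be sketched as a small filter. It assumes that a scheduler comment, when present, appears in `qstat -f` output as a line of the form `comment = ...`. The helper name `job_comments` is hypothetical, and values that wrap onto continuation lines are not joined.

```shell
# Hypothetical helper (not a TORQUE command): prints the value of every
# "comment = ..." attribute found in `qstat -f` output read from stdin.
# Prints nothing when no job carries a comment.
# Assumption: the attribute is named "comment" and sits on a single line.
job_comments() {
    sed -n 's/^[[:space:]]*comment = //p'
}

# Usage on a host with TORQUE installed:
#   qstat -f 26.localhost | job_comments
```

With the `qstat -f` output shown in the question this prints nothing, since that output contains no `comment` line.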