
I am using Torque 4.0.1 on openSUSE 12.1 in a cluster environment. When I qsub a job (as simple as "echo hello"), it stays in the 'Q' state and never gets scheduled. I can force the job to run with qrun, and it executes on the first node without error.

I have been trying to find a solution for the past few days without success. I read the manual, the logs, and even the source code, but still cannot locate the problem. Of course I also googled a lot and tried various suggestions, but none of them worked.

Here is some information that may be helpful:

  • pbs_sched is running, but its log suggests it receives no notification when jobs are queued:

    05/13/2012 18:55:08;0002; pbs_sched;Svr;Log;Log opened
    05/13/2012 18:55:08;0002; pbs_sched;Svr;TokenAct;Account file /var/spool/torque/sched_priv/accounting/20120513 opened
    05/13/2012 18:55:08;0002; pbs_sched;Svr;main;pbs_sched startup pid 32604
  • The pbs_server log only shows the job being enqueued into the default queue batch:

    05/13/2012 19:33:08;0002;PBS_Server;Svr;PBS_Server;Torque Server Version = 4.0.1, loglevel = 0
    05/13/2012 19:33:56;0100;PBS_Server;Job;16.head;enqueuing into batch, state 1 hop 1
    05/13/2012 19:33:56;0008;PBS_Server;Job;16.head;Job Queued at request of pubuser@head, owner = pubuser@head, job name = STDIN, queue = batch
  • qstat -f 16 shows nothing unusual:

    Job Id: 16.head
    Job_Name = STDIN
    Job_Owner = pubuser@head
    job_state = Q
    queue = batch
    server = head
    Checkpoint = u
    ctime = Sun May 13 19:33:56 2012
    Error_Path = head:/fserver/home/pubuser/STDIN.e16
    Hold_Types = n
    Join_Path = n
    Keep_Files = n
    Mail_Points = a
    mtime = Sun May 13 19:33:56 2012
    Output_Path = head:/fserver/home/pubuser/STDIN.o16
    Priority = 0
    qtime = Sun May 13 19:33:56 2012
    Rerunable = True
    Resource_List.walltime = 01:00:00
    substate = 10
    Variable_List = PBS_O_QUEUE=batch,PBS_O_HOME=/,
        PBS_O_WORKDIR=/fserver/home/pubuser,PBS_O_HOST=head,PBS_O_SERVER=head,
        PBS_O_WORKDIR=/fserver/home/pubuser
    euser = pubuser
    egroup = users
    queue_rank = 4
    queue_type = E
    etime = Sun May 13 19:33:56 2012
    fault_tolerant = False
    job_radix = 0
    submit_host = head
    init_work_dir = /fserver/home/pubuser
  • All nodes are free:

    sun1
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910403,varattr=,jobs=,state=free,netload=44492032184,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1697420kb,totmem=1802616kb,idletime=241085,nusers=0,nsessions=0,uname=Linux sun1 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun2
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910408,varattr=,jobs=,state=free,netload=39762812881,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1701012kb,totmem=1802616kb,idletime=239982,nusers=0,nsessions=0,uname=Linux sun2 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun3
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910400,varattr=,jobs=,state=free,netload=45984311925,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1699772kb,totmem=1802616kb,idletime=212303,nusers=0,nsessions=0,uname=Linux sun3 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun4
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910407,varattr=,jobs=,state=free,netload=37538584401,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805480kb,totmem=1908308kb,idletime=211197,nusers=0,nsessions=0,uname=Linux sun4 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun5
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910411,varattr=,jobs=,state=free,netload=173547166,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803816kb,totmem=1908308kb,idletime=211199,nusers=0,nsessions=0,uname=Linux sun5 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun6
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910411,varattr=,jobs=,state=free,netload=24641446,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805704kb,totmem=1908308kb,idletime=212999,nusers=0,nsessions=0,uname=Linux sun6 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun7
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910412,varattr=,jobs=,state=free,netload=1548383055,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805432kb,totmem=1908308kb,idletime=215630,nusers=0,nsessions=0,uname=Linux sun7 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun8
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910400,varattr=,jobs=,state=free,netload=128755968,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1803448kb,totmem=1908308kb,idletime=211866,nusers=0,nsessions=0,uname=Linux sun8 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0

    sun9
         state = free
         np = 2
         ntype = cluster
         status = rectime=1336910374,varattr=,jobs=,state=free,netload=1371896399,gres=,loadave=0.00,ncpus=2,physmem=888188kb,availmem=1805664kb,totmem=1908308kb,idletime=211161,nusers=0,nsessions=0,uname=Linux sun9 3.1.0-1.2-desktop #1 SMP PREEMPT Thu Nov 3 14:45:45 UTC 2011 (187dde0) x86_64,opsys=linux
         mom_service_port = 15002
         mom_manager_port = 15003
         gpus = 0
  • qmgr -c 'p s':

    #
    # Create queues and set their attributes.
    #
    #
    # Create and define queue batch
    #
    create queue batch

    set queue batch queue_type = Execution

    set queue batch resources_default.walltime = 01:00:00

    set queue batch enabled = True

    set queue batch started = True

    #
    # Set server attributes.
    #
    set server scheduling = True

    set server acl_hosts = head

    set server managers = pubuser@head

    set server managers += root@head

    set server operators = pubuser@head

    set server operators += root@head

    set server default_queue = batch

    set server log_events = 511

    set server mail_from = adm

    set server scheduler_iteration = 600

    set server node_check_rate = 150

    set server tcp_timeout = 300

    set server job_stat_rate = 45

    set server poll_jobs = True

    set server mom_job_sync = True

    set server keep_completed = 0

    set server submit_hosts = head

    set server next_job_number = 17

    set server moab_array_compatible = True
  • momctl -d 13 on first node:

    Host: sun1/sun1   Version: 4.0.1   PID: 5362
    Server[0]: head (192.168.0.1:15001)
      Last Msg From Server:   1584 seconds (DeleteJob)
      Last Msg To Server:     7 seconds
    HomeDirectory:          /var/spool/torque/mom_priv
    stdout/stderr spool directory: '/var/spool/torque/spool/' (4457492 blocks available)
    MOM active:             229485 seconds
    Check Poll Time:        45 seconds
    Server Update Interval: 45 seconds
    LogLevel:               0 (use SIGUSR1/SIGUSR2 to adjust)
    Communication Model:    TCP
    MemLocked:              TRUE  (mlock)
    TCP Timeout:            0 seconds
    Trusted Client List:  127.0.0.1:0,192.168.0.1:0,192.168.0.101:0,192.168.0.101:15003,192.168.0.102:15003,192.168.0.103:15003,192.168.0.104:15003,192.168.0.105:15003,192.168.0.106:15003,192.168.0.107:15003,192.168.0.108:15003,192.168.0.109:15003:  0
    Copy Command:           /usr/bin/scp -rpB
    NOTE:  no local jobs detected

    diagnostics complete

One thing that does not look normal is that TCP Timeout is 0 seconds. While running the diagnostics, I also found the following error in mom_logs:

    05/13/2012 20:30:10;0001;   pbs_mom;Svr;pbs_mom;LOG_ERROR::Resource temporarily unavailable (11) in tcp_read_proto_version, no protocol version number End of File (errno 2)

I googled this error as well, but found nothing.
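For reference, this is the small helper script I run on the head node to sanity-check the daemons (the script is my own, not part of Torque; note that Torque 4.x also requires the trqauthd authorization daemon, which is easy to forget):

```shell
#!/bin/bash
# check_torque.sh - my own helper (not part of Torque) to verify that
# the three daemons Torque 4.x needs on the head node are running.
check_daemons() {
    for d in pbs_server pbs_sched trqauthd; do
        if pgrep -x "$d" >/dev/null 2>&1; then
            echo "$d: running"
        else
            echo "$d: NOT running"
        fi
    done
}

check_daemons
```

On my head node all three report "running", so a missing daemon does not seem to be the cause here.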

  • I compiled OpenMPI against this Torque 4.0.1 (for tm support), and I can mpirun test programs without any problem.
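In case it matters, this is the restart sequence I use between tests (again my own script, shown here in dry-run form so it only echoes the commands; trqauthd must be started before pbs_server on Torque 4.x):

```shell
#!/bin/bash
# restart_torque.sh - my own helper (not part of Torque). With DRY_RUN=1
# (the default) it only prints the commands instead of executing them.
DRY_RUN=${DRY_RUN:-1}

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

run qterm -t quick   # shut down pbs_server cleanly
run trqauthd         # Torque 4.x authorization daemon, start this first
run pbs_server       # then the server
run pbs_sched        # then the scheduler
```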

I hope someone can solve this problem. Thank you!

liding
  • We've been having the same problem for some time. qrun is a pretty rough workaround, but I have found no other information on the issue. – Erik Garrison Nov 06 '13 at 17:35
  • I once had a similar problem with Torque and Maui. It turned out that the conf file Torque automatically created had my hostname in uppercase. Changing it to lowercase solved the problem. – Azad May 06 '15 at 13:59
