Torque reports error when posting job to client nodes

Question

The system has two machines: one (macondo02) runs pbs_server and pbs_sched, and the other (macondo01) runs pbs_mom. I have confirmed that the host can see the guest:

$ pbsnodes -a
macondo01
state = free
np = 64
ntype = cluster
status = rectime=1403183300,varattr=,jobs=,state=free,netload=1102560564743,gres=,loadave=0.00,ncpus=64,physmem=131988228kb,availmem=263457400kb,totmem=266160896kb,idletime=705,nusers=6,nsessions=17,sessions=2817 59201 59937 18341 21924 27356 30089 31663 32133 32934 34374 7341 42678 58843 59605 59606 59741,uname=Linux macondo01 3.2.0-38-generic #61-Ubuntu SMP Tue Feb 19 12:18:21 UTC 2013 x86_64,opsys=linux

However, whenever I submit a job through qsub, the job does not run, and I get this error in the PBS_Server log:

06/19/2014 23:00:19;0040;PBS_Server;Svr;macondo02.edu.au;Scheduler was sent the command new
06/19/2014 23:00:19;0008;PBS_Server;Job;54.macondo02.edu.au;Job Modified at request of Scheduler@macondo02.uq.edu.au
06/19/2014 23:00:19;0008;PBS_Server;Job;54.macondo02.edu.au;Job Run at request of Scheduler@macondo02.uq.edu.au
06/19/2014 23:00:19;0040;PBS_Server;Svr;macondo02.edu.au;Scheduler was sent the command recyc
06/19/2014 23:00:20;0010;PBS_Server;Job;54.macondo02.uq.edu.au;Exit_status=0 resources_used.cput=00:00:00 resources_used.mem=7680kb resources_used.vmem=23876kb resources_used.walltime=00:00:01
06/19/2014 23:00:24;000d;PBS_Server;Job;54.macondo02.uq.edu.au;Post job file processing error; job 54.macondo02.uq.edu.au on host macondo01/0
06/19/2014 23:00:24;0100;PBS_Server;Job;54.macondo02.uq.edu.au;dequeuing from batch, state COMPLETE
06/19/2014 23:00:24;0040;PBS_Server;Svr;macondo02.uq.edu.au;Scheduler was sent the command term

Apparently the failure comes from posting the job from the host (i.e. macondo02) to the guest (i.e. macondo01).

I have several ideas in mind:
1. I know it is necessary to set up passwordless SSH between the host and the guest, using NFS. I have done that for my own normal user, and I submit the qsub job as this user, yet the error still occurs.
2. In the error log I see another user called Scheduler@macondo02.uq.edu.au. However, I can neither find any information about this user with cat /etc/group, nor give it passwordless access to macondo01.

Any suggestions would be appreciated!

Tombart · Answer 1 · 2015-04-22T15:56:46.883

Try checking /var/log/syslog or the PBS log files on the machine where the job was running, which here is macondo01.

You're looking for something like this, probably an error while copying the job's log file:

pbs_mom: LOG_ERROR::sys_copy, command '/usr/bin/scp -rpB /var/spool/torque/spool...
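
A quick way to search for that message could look like the sketch below. The function name find_copy_errors is made up for illustration, and the mom_logs path in the example is an assumption based on the /var/spool/torque prefix in the message above, so adjust it to your installation.

```shell
# Hypothetical helper: print lines mentioning sys_copy from the given log files.
find_copy_errors() {
    # -h drops file names from the output; missing or unreadable files are skipped quietly
    grep -h 'sys_copy' "$@" 2>/dev/null
}

# Example, run on macondo01 (both paths are assumptions):
# find_copy_errors /var/log/syslog /var/spool/torque/mom_logs/*
```

If this prints nothing, the copy step may not be the culprit, or the logs may live somewhere else on your system.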

You can find the actual log from that run in /var/spool/torque/undelivered/.

The problem might be with the PBS_SCP command, which requires passwordless SSH access to the machine. Typically it uses a command like this:

$PBS_SCP -rpB <path to source> <user>@<destination.host>:<path to destination>
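
To test this by hand, something like the following sketch may help. As far as I know, the copy runs as the job owner from the compute node back to the submit host, so run the check as that user on macondo01 against macondo02. The helper names and the SCP_BIN variable are made up for illustration, and the scp path is an assumption taken from the log line above.

```shell
# Path to scp is an assumption taken from the log line above; override if yours differs.
SCP_BIN=${SCP_BIN:-/usr/bin/scp}

# Hypothetical helper: print the copy command for manual inspection.
build_scp_cmd() {
    # $1 = source path, $2 = user, $3 = destination host, $4 = destination path
    printf '%s -rpB %s %s@%s:%s\n' "$SCP_BIN" "$1" "$2" "$3" "$4"
}

# Hypothetical helper: succeed only if key-based SSH works without a prompt.
check_passwordless_ssh() {
    # $1 = user, $2 = host; BatchMode=yes makes ssh fail rather than ask for a password
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$1@$2" true
}

# Example, run on macondo01 as the user who submitted the job:
# check_passwordless_ssh "$USER" macondo02 && echo "passwordless SSH works"
# build_scp_cmd /tmp/test.out "$USER" macondo02 /tmp/test.out
```

The -B flag puts scp in batch mode, so it fails instead of prompting for a password. That is why a login that works interactively with a password can still break the copy.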
