
We upgraded our OS from Debian 5 to Debian 6 and consequently upgraded Torque.

Now qstat and qsub work for about a minute, then fail for about a minute.

I have torque-2.5.5 (but I tried 2.4.8 and it had the same issues).

When we run qstat half of the time it works and half of the time we get:

pbs_iff: cannot read reply from pbs_server
No Permission.
qstat: cannot connect to server torque-server (errno=15007) Unauthorized Request

On the mom syslog:

pbs_mom: LOG_ERROR::Operation now in progress (115) in
TMomFinalizeChild, cannot open interactive qsub socket to host
girkelab-3.ucr.edu:51056 - 'cannot connect to port 777 in
client_to_svr - errno:115 Operation now in progress' - check routing
tables/multi-homed host issues
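The mom-side error is about the compute node failing to connect back to the submit host for an interactive qsub. A minimal reachability probe (my own sketch, not part of Torque; it uses bash's /dev/tcp pseudo-device, and the host and port shown are the ones from the log above) can tell routing/firewall problems apart from Torque problems:

```shell
# probe_tcp HOST PORT -> prints "open" if a TCP connection succeeds,
# "closed" otherwise. Relies on bash's /dev/tcp pseudo-device.
probe_tcp() {
    if timeout 5 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null; then
        echo open
    else
        echo closed
    fi
}

# e.g., run on the compute node against the values in the mom log:
# probe_tcp girkelab-3.ucr.edu 51056
```

If this prints "closed" from the mom host while qsub is waiting, that points at exactly the routing/multi-homed-host issue the log message hints at.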

On the server:

/opt/torque-2.5.5/bin/qmgr -c 'print server'
#
# Create queues and set their attributes.
#
#
# Create and define queue batch
#
create queue batch
set queue batch queue_type = Execution
set queue batch resources_default.nodes = 1
set queue batch enabled = True
set queue batch started = True
#
# Set server attributes.
#
set server scheduling = True
set server acl_hosts = torque-server
set server acl_hosts += torque-server+biocluster+parrot+owl
set server acl_hosts += owl-33+biocluster-33
set server acl_hosts += girkelab-3+girkelab-4
set server operators = root@torque-server
set server default_queue = batch
set server log_events = 511
set server mail_from = adm
set server query_other_jobs = True
set server scheduler_iteration = 600
set server node_check_rate = 150
set server tcp_timeout = 6
set server log_level = 0
set server submit_hosts = biocluster+parrot+owl
set server submit_hosts += girkelab-3+girkelab-4
set server submit_hosts += owl-33+biocluster-33
set server allow_node_submit = True
set server next_job_number = 206082

Why does it say permission error when it works half of the time?

What can I do to diagnose the problem?

Aleksandr Levchuk
  • I got [some comments](http://www.supercluster.org/pipermail/torqueusers/2011-March/012539.html) on this from Torque's mailing list. In a way, it explains the "Permission error". (Note: pbs_iff is setuid root.) – Aleksandr Levchuk Mar 31 '11 at 15:00

2 Answers


Conclusion: The server was jammed because of a dead node.

Before we figured it out, many things were tried:

  • Looked at individual packets via tcpdump.
  • Server, clients, and mom logs.
  • Tested whether the network file system was freezing.
  • Tested whether UDP traffic was losing packets.
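For anyone retracing those steps, this is roughly what they looked like on the command line (the interface name and spool path are from our setup and may differ on yours; pbs_server listens on TCP 15001 by default, and Torque names its daily log files YYYYMMDD):

```shell
# Assumed standard spool location; adjust for your install.
SPOOL=/var/spool/torque
TODAY=$(date +%Y%m%d)   # Torque's daily log file name

# 1. Watch PBS traffic on the server (default pbs_server port):
#    tcpdump -i eth0 -nn port 15001

# 2. Tail server and mom logs while reproducing the failure:
#    tail -f "$SPOOL/server_logs/$TODAY" "$SPOOL/mom_logs/$TODAY"

# 3. Check for UDP packet loss: compare `netstat -su` counters
#    before and after a burst of qstat calls.
```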

Nothing was wrong, and no matter what I tried, the transient "No Permission" error would not go away.

I had one node that went dead the night before, and we had had problems before with Torque getting jammed instead of detecting dead nodes. So I removed the node from /var/spool/torque/server_priv/nodes (the standard Torque configuration location) and restarted Torque, but that did not help.

Late at night, working with my boss, we found the solution. There were a bunch of old files ("running jobs") in /var/spool/torque/server_priv/jobs/ that belonged to the removed dead node. Delete. Restart. Solved.
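A sketch of that cleanup as a shell function (my reconstruction, not a Torque tool; "deadnode" stands in for the name of the node that died, and the spool path is the standard one):

```shell
# Delete server_priv/jobs files that mention a given (dead) node.
# Stop pbs_server before running this, and restart it afterwards.
purge_dead_node_jobs() {
    local spool=$1 node=$2
    # grep -l lists only the files whose contents mention the node;
    # xargs -r skips the rm entirely if nothing matched.
    grep -l "$node" "$spool/server_priv/jobs/"* 2>/dev/null | xargs -r rm -f
}

# e.g.:
# purge_dead_node_jobs /var/spool/torque deadnode
```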

"No Permission"?!

Aleksandr Levchuk

Well, you're not the only one: http://comments.gmane.org/gmane.comp.clustering.torque.user/8401

Jeff Albert