
I have a small number of old Ubuntu desktops connected by a switch, acting as a mini test cluster. The workers take commands from a master node via the SLURM queue manager. Over NFS, they share a data mount and a mount containing the executables that act on the data; both are served by a separate fileserver box. All the machines are around 5 years old. Jobs are split into tasks on the master node, which then feeds the tasks into SLURM. The splitting generates work directories in which symlinks to the corresponding data files are deposited:

../job_workdir/task_1/datafile.dat -> ../datadir/dataset/task_1/datafile.dat

By the time a task runs, the splitting framework has done its job, but sometimes the executable doesn't accept the symlink's extension (.dat or similar) because it demands e.g. .txt files. Therefore the job runs a wrapper that symlinks the symlink to a name that is accepted, after which the wrapper pretty much immediately calls the executable:

../job_workdir/task_1/datafile.dat -> ../datadir/dataset1234/task_1/datafile.dat
../job_workdir/task_1/datafile.txt -> ../job_workdir/task_1/datafile.dat

Sometimes the executable inexplicably exits with 'file does not exist' for the symlink it is supposed to process. I cannot reproduce this for specific tasks; it usually works, but not always.
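For illustration, here is a minimal Python sketch of the wrapper step described above, with one addition that is not in the original setup: before starting the executable, it polls until the new symlink actually resolves. The function names, the relative link target, and the timeout values are all hypothetical choices made for this sketch. It is a way to narrow down where the failure occurs, not a confirmed fix.

```python
import os
import subprocess
import time


def link_with_accepted_extension(src, new_ext):
    """Create a sibling symlink to src that carries new_ext; return its path."""
    base, _ = os.path.splitext(src)
    dst = base + new_ext
    if os.path.lexists(dst):
        os.remove(dst)  # replace a stale link left by an earlier run
    # Assumption: a relative target (just the file name) is used here, since
    # both links live in the same task directory.
    os.symlink(os.path.basename(src), dst)
    return dst


def wait_until_resolvable(path, timeout=10.0, interval=0.5):
    """Poll until path resolves to an existing file or the timeout expires."""
    deadline = time.monotonic() + timeout
    while True:
        if os.path.exists(path):  # follows the whole symlink chain
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval)


def run_task(executable_args, src, new_ext=".txt"):
    """Link src under an accepted extension, verify it, then run the executable."""
    dst = link_with_accepted_extension(src, new_ext)
    if not wait_until_resolvable(dst):
        raise FileNotFoundError(f"{dst} does not resolve to an existing file")
    # The renamed link is passed as the last argument (hypothetical convention).
    return subprocess.run(list(executable_args) + [dst], check=True)
```

If `run_task` raises `FileNotFoundError`, the link chain was not visible on that worker within the timeout, which would point at the filesystem rather than at the executable.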

So my question is: is there some issue with symlink creation timing on NFS? The NFS server is an old i3 machine with two hard disks combined into a logical volume, and the switch is a 3Com gigabit 8 switch ('for small offices').

glormph

1 Answer


No answers, so I'll describe what I've done. I'm not sure whether this was the underlying problem, but I found there was a clock difference between the computers. The worker nodes and fileserver are not connected to the internet, so I installed an NTP server on the master node and NTP clients on the workers and the fileserver, and the clients then synced with the master node. I haven't seen the problem since.
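A rough way to check for such a clock difference from a worker, sketched in Python: create a file on the shared mount and compare its modification time with the local clock. This assumes that the NFS server, not the client, stamps the mtime of a newly written file, which is the usual behaviour but is worth verifying on your setup. The function name and the `skewcheck_` prefix are made up for this sketch.

```python
import os
import tempfile
import time


def clock_skew_seconds(directory):
    """Return (mtime of a freshly written file in directory) - (local clock)."""
    before = time.time()
    fd, path = tempfile.mkstemp(dir=directory, prefix="skewcheck_")
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(b"x")
            handle.flush()
            os.fsync(handle.fileno())  # push the write out to the server
        after = time.time()
        mtime = os.stat(path).st_mtime
    finally:
        os.remove(path)
    # Compare against the midpoint of the local timestamps taken around the write.
    return mtime - (before + after) / 2.0
```

Calling `clock_skew_seconds` with the path of the shared data mount on each worker gives an estimate of how far that worker's clock is from the fileserver's. On a local filesystem the result should be close to zero, and after NTP synchronisation the same should hold on the NFS mount.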

glormph