
I couldn't find any info on this short of spinning up a Nomad cluster and experimenting, so maybe someone here can help.

Say you want to run 100 iterations of a batch Java job, each with a different set of parameters, and get the resulting output files back.

1) Does Nomad have a concept of input files, where you point at a local file on your machine and it distributes that file to the nodes running the job?

# in HTCondor would be something like this
transfer_input_files = MyCalculator.jar,logback.xml

2) Does Nomad bring back the results of such a calculation, say the *.csv files that were produced?

# this would do it in HTCondor
should_transfer_files = YES
when_to_transfer_output = ON_EXIT
transfer_output_files = /output_dir

3) Does Nomad allow the use of parameters the way Condor does, letting you submit one job with n parameter sets that is then distributed to the cluster as multiple jobs?

# this would do it in HTCondor
Arguments    = x=1
Queue 
(...)
Arguments    = x=100
Queue

1 Answer


Containers are considered stateless, which means you will need to add steps to your process for this to work. Condor provides this functionality for you, but I never found it useful and it never worked properly when I used it (the last time was in 2009). To get around it I separated data transfer out from Condor altogether. To do that you will need the following:

Your output data files need to be stored in a persistent data store of some sort, not in the container itself. Some container runtimes allow mounting the host's local disk, or even mounting a remote disk over the network (NFS, Samba, SSHFS, etc.). In the past I have used a distributed (or network-mountable) filesystem, such as AWS S3, to handle this requirement.
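As a rough sketch of the network-mount option, the worker host could expose a shared directory to the job like this (the server name and mount paths are placeholders, nothing specific to Nomad or Condor):

# hypothetical NFS export mounted on the worker host
sudo mount -t nfs fileserver:/exports/batch-output /mnt/batch-output

# or the same share over SSH with sshfs
sshfs user@fileserver:/exports/batch-output /mnt/batch-output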

When I worked with Condor in 2009 for my master's thesis, I handled this requirement by building Bash wrapper scripts around the Java applications I was running as batch jobs. The script would fetch the appropriate inputs for each run (downloading them from the distributed filesystem), and when the job completed it would kick off transfers of the output files back to the same distributed file resource, with the job name, job number, hostname that ran the job, and a datetime stamp in each output file's name.
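A minimal sketch of that kind of wrapper, assuming an S3 bucket and the AWS CLI on the worker node (the bucket name, file names, and parameter handling below are illustrative only, reusing the jar and logback.xml from the question):

#!/usr/bin/env bash
# Hypothetical wrapper sketch -- bucket name and arguments are placeholders.
set -euo pipefail

JOB_NAME="$1"     # e.g. calculator
JOB_NUMBER="$2"   # e.g. 42
PARAM="$3"        # the per-run parameter, e.g. x=42

# 1. Pull the inputs for this run from the shared data store.
aws s3 cp s3://my-batch-bucket/input/MyCalculator.jar .
aws s3 cp s3://my-batch-bucket/input/logback.xml .

# 2. Run the batch job with its parameter.
java -jar MyCalculator.jar "$PARAM"

# 3. Push the produced CSV files back, tagging each file name with the
#    job name, job number, hostname, and a datetime stamp.
STAMP="$(date +%Y%m%dT%H%M%S)"
for f in *.csv; do
  aws s3 cp "$f" "s3://my-batch-bucket/output/${JOB_NAME}-${JOB_NUMBER}-$(hostname)-${STAMP}-${f}"
done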

HTCondor, Nomad, or even Kubernetes can run this kind of workload for you, but in each case you will need some logic in your job-runner wrapper scripts to handle data transfer before starting and after shutting down the application itself.

I hope this helps.

Levi
It's helpful, thanks. Basically, no to 1, no to 2, and yes to 3. I'll eventually try out Nomad, but HTCondor currently fits my needs in a more straightforward way. Thanks! – Frankie Oct 18 '18 at 17:37