Running a large number of small jobs in Windows, in parallel, with timeout capability

1

I need to process >50,000 files using a third-party .exe command-line application. The application takes only one input file at a time, so I have to launch the application >50,000 times.

Each file (each job) usually takes about one second. However, sometimes the application hangs indefinitely.

I have written a Windows shell script that runs all the jobs serially, and checks every second to see whether the job is done. After 10 seconds, it kills the job and moves on to the next. However, it takes about 20 hours. I believe I can bring the total runtime down by a large amount if I run multiple jobs in parallel. The question is how?

In CMD I launch the task with Start, but there is no simple way to recover the process ID (PID) and therefore I cannot easily keep track of which instance has run for how long. I feel like I am trying to reinvent the umbrella. Any suggestions?

Mattia Landoni

Posted 2017-08-14T02:12:27.197

Reputation: 43

Questions seeking product, service, or learning material recommendations are off-topic because they become outdated quickly and attract opinion-based answers. Instead, describe your situation and the specific problem you're trying to solve. Share your research. – Xavierjazz – 2017-08-14T02:14:38.337

I have described my problem in detail in the post title and the first two paragraphs. The third paragraph talks about what I did. I changed the fourth paragraph but I don't know that the question is better now. – Mattia Landoni – 2017-08-14T02:20:14.483

Answers

0

PowerShell is your friend.

https://serverfault.com/questions/626711/how-do-i-run-my-powershell-scripts-in-parallel-without-using-jobs asks something similar.

"Quick" and "robust" are of course subjective.

quadruplebucky

Posted 2017-08-14T02:12:27.197

Reputation: 521

Thanks, PowerShell is what I needed. I will add an answer below with the exact code I used, which I think is very reusable. I used the "Invoke-Parallel" tool mentioned in the answer you pointed to. – Mattia Landoni – 2017-08-14T20:05:14.387

I also removed "quick" and "robust" from the title. Thx – Mattia Landoni – 2017-08-14T20:16:13.357

2

PowerShell did the trick, as suggested in quadruplebucky's answer. Here is the code I used. The second-to-last line (./xml2csv...) is the job itself. The rest of the script can be reused for any similar task.

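# PARAMETERS
$root = 'D:\Ratings'   # working directory that holds xml2csv and the data folders
$folder = 'SP'         # prefix of the input (SP-xml) and output (SP-csv) folders

# Import Invoke-Parallel by dot-sourcing the script
. ".\Invoke-Parallel.ps1"

# Run in parallel: at most 10 runspaces at a time, each with a 10-second timeout.
# -ImportVariables makes $root and $folder visible inside the script block.
Get-ChildItem ".\$folder-xml" -Filter *.xml |
Invoke-Parallel -throttle 10 -runspaceTimeout 10 -ImportVariables `
  -ScriptBlock {
    $file = $_.BaseName
    echo $file
    cd $root
    # Paths are quoted so that the variables expand predictably
    (./xml2csv "$folder-xml\$file.xml" "$folder-csv\$file.csv" "fields-$folder.txt" -Q) | Out-Null
  }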

Some notes:

  • The Invoke-Parallel function (aka cmdlet) can be downloaded here.
  • A runspace is what I would have called an "instance". -runspaceTimeout sets the maximum running time, in seconds, for each instance.
  • -throttle sets the maximum number of simultaneous running instances.
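
For readers who cannot use Invoke-Parallel, the same throttle-plus-timeout idea can be sketched with plain .NET process objects, which also give you the PID that Start in CMD does not. This is only a rough sketch under assumptions: the function name Invoke-JobsWithTimeout and its parameters are hypothetical (they are not part of Invoke-Parallel), and each job is assumed to be one argument string passed to the same executable.

```powershell
# Sketch only: Invoke-JobsWithTimeout and its parameter names are hypothetical.
function Invoke-JobsWithTimeout {
    param(
        [Parameter(Mandatory)] [string]   $FilePath,      # executable to launch (full path is safest)
        [Parameter(Mandatory)] [string[]] $ArgumentList,  # one argument string per job
        [int] $Throttle = 10,                             # maximum simultaneous processes
        [int] $TimeoutSeconds = 10                        # kill a job that runs longer than this
    )

    $next    = 0     # index of the next job to launch
    $running = @()   # jobs currently in flight

    while ($next -lt $ArgumentList.Count -or $running.Count -gt 0) {

        # Launch new processes until the throttle limit is reached.
        while ($next -lt $ArgumentList.Count -and $running.Count -lt $Throttle) {
            $psi = New-Object System.Diagnostics.ProcessStartInfo
            $psi.FileName         = $FilePath
            $psi.Arguments        = $ArgumentList[$next]
            $psi.WorkingDirectory = (Get-Location).Path
            $psi.UseShellExecute  = $false
            $psi.CreateNoWindow   = $true
            $running += [pscustomobject]@{
                Process   = [System.Diagnostics.Process]::Start($psi)
                Arguments = $ArgumentList[$next]
                Started   = Get-Date
            }
            $next++
        }

        Start-Sleep -Milliseconds 200

        # Report finished jobs; kill and report jobs that exceeded the timeout.
        $stillRunning = @()
        foreach ($job in $running) {
            $elapsed = ((Get-Date) - $job.Started).TotalSeconds
            if ($job.Process.HasExited) {
                [pscustomobject]@{
                    Id = $job.Process.Id; Arguments = $job.Arguments
                    TimedOut = $false; ExitCode = $job.Process.ExitCode
                }
            }
            elseif ($elapsed -ge $TimeoutSeconds) {
                # The process may exit on its own just before Kill(); ignore that error.
                try { $job.Process.Kill() } catch { }
                [pscustomobject]@{
                    Id = $job.Process.Id; Arguments = $job.Arguments
                    TimedOut = $true; ExitCode = $null
                }
            }
            else {
                $stillRunning += $job
            }
        }
        $running = $stillRunning
    }
}
```

A possible call for the job above, assuming the executable is named xml2csv.exe, the current directory is $root, and the file names contain no spaces (otherwise each path inside the argument string needs its own quotes):

```powershell
$jobArgs = Get-ChildItem 'SP-xml' -Filter *.xml | ForEach-Object {
    "SP-xml\$($_.Name) SP-csv\$($_.BaseName).csv fields-SP.txt -Q"
}
Invoke-JobsWithTimeout -FilePath (Resolve-Path '.\xml2csv.exe').Path -ArgumentList $jobArgs |
    Where-Object TimedOut   # list only the files whose job had to be killed
```

The function emits one object per job with its PID, arguments, exit code and a TimedOut flag, so the files that hung can be collected and retried later.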

Mattia Landoni

Posted 2017-08-14T02:12:27.197

Reputation: 43