Execute shell loop in parallel, but with only N workers

Asked Jul 08 '15 at 08:25

Active Jul 10 '15 at 08:00

Viewed 1,958 times

We have more than 100 git repos, and sometimes I want to grep over all of them.

To update the repos I use this:

for repo in *; do (cd $repo; git checkout master; git pull); done

This is quite slow, because the repos are updated one after another.

How can I speed it up?

Running all updates at once would spawn too many processes.

I need a way to limit the load to N workers.

Does anyone have a solution for this?

asked Jul 08 '15 at 08:25

guettli


Have you checked GNU parallel? – Nehal Dattani Jul 08 '15 at 09:52
@NehalDattani no, I did not check GNU parallel. – guettli Jul 08 '15 at 09:55
This should give you some motivation: "A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel." – Nehal Dattani Jul 08 '15 at 09:57
@NehalDattani why don't you write an answer? I could accept and up-vote it. – guettli Jul 08 '15 at 14:21

3 Answers

You can use GNU parallel for this task. From GNU parallel's home page:

"A job can also be a command that reads from a pipe. GNU parallel can then split the input and pipe it into commands in parallel."

There is an excellent tutorial, and one section of it addresses exactly what you asked.

Edit: Here is the command you can use (slightly modified from Ole Tange's answer):

parallel -j<number of jobs to run> 'cd {} && git checkout master && git pull' ::: */

This runs at most the number of jobs you specify at the same time, and executes the quoted command once for each directory.

HTH

edited Jul 10 '15 at 07:16

answered Jul 09 '15 at 08:16

Nehal Dattani

How do I use GNU parallel for the shell loop from the question? – guettli Jul 09 '15 at 09:48

You can use xargs to do the job, for example:

(for repo in *
    do
    # Quote the name so directories containing spaces still work.
    [ -d "${repo}" ] && echo "${repo}"
    done ) | xargs -I{} -P4 ./gitActions.sh {}

The flag -P4 tells xargs to run up to 4 processes simultaneously, so you can adjust that number to what you want or need.

Then your gitActions.sh file should contain:

#!/bin/bash
repo=$1
# Use && so git never runs in the wrong directory if cd fails.
cd "$repo" && git checkout master && git pull
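
If you would rather not keep a separate helper script, the same idea can be sketched as a single function. This is only a sketch under some assumptions: run_in_dirs is a name made up for this example, and it relies on find -mindepth/-maxdepth/-print0 and xargs -0/-P, which GNU and BSD tools provide but POSIX does not require. Hidden directories are skipped to match what the * glob would do.

```shell
#!/usr/bin/env bash
# run_in_dirs N CMD (hypothetical helper): run the shell snippet CMD inside
# every non-hidden subdirectory of the current directory, with at most N
# commands running at the same time.
run_in_dirs() {
    local workers=$1 cmd=$2
    # -print0 / -0 pass names NUL-separated, so spaces in names are safe.
    find . -mindepth 1 -maxdepth 1 -type d ! -name '.*' -print0 |
        xargs -0 -P "$workers" -I{} sh -c 'cd "$1" && eval "$2"' _ {} "$cmd"
}

# Usage for the case in the question (4 workers):
#   run_in_dirs 4 'git checkout master && git pull'
```

The directory name reaches sh as a positional argument instead of being pasted into the command string, so unusual names are not interpreted as shell code.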

answered Jul 08 '15 at 09:38

alphamikevictor


Using GNU Parallel it looks like this:

parallel -j77 'cd {} && git checkout master && git pull' ::: */

The -j77 option gives 77 workers, i.e. up to 77 jobs run at the same time.

GNU Parallel is a general parallelizer and makes it easy to run jobs in parallel on the same machine or on multiple machines you have ssh access to. It can often replace a for loop.

If you have 32 different jobs you want to run on 4 CPUs, a straightforward way to parallelize is to run 8 jobs on each CPU:

[Figure: Simple scheduling]

GNU Parallel instead spawns a new process when one finishes, keeping the CPUs active and thus saving time:

[Figure: GNU Parallel scheduling]
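
To make the difference concrete, the "start a new job as soon as one finishes" behaviour can be approximated in plain Bash. This is a rough sketch, not how GNU Parallel is implemented: run_pool is a name invented for this example, and it assumes Bash 4.3 or newer, because it depends on wait -n.

```shell
#!/usr/bin/env bash
# run_pool N CMD ARG... (hypothetical helper): run "CMD ARG" once per ARG,
# keeping at most N background jobs alive. A new job is started as soon as
# any running one exits, instead of splitting the list up in advance.
run_pool() {
    local max=$1 cmd=$2
    shift 2
    local running=0 arg
    for arg in "$@"; do
        if [ "$running" -ge "$max" ]; then
            wait -n                    # block until any one job has exited
            running=$((running - 1))
        fi
        "$cmd" "$arg" &                # CMD may be a program or a function
        running=$((running + 1))
    done
    wait                               # wait for the jobs still running
}

# Usage for the case in the question (4 workers):
#   update() { (cd "$1" && git checkout master && git pull); }
#   run_pool 4 update */
```

With the simple scheduling from the first figure, a worker that was handed eight quick jobs sits idle while another is still busy with slow ones. Here the next job always goes to whichever slot frees up first.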

Installation

If GNU Parallel is not packaged for your distribution, you can do a personal installation, which does not require root access. It can be done in 10 seconds by doing this:

(wget -O - pi.dk/3 || curl pi.dk/3/ || fetch -o - http://pi.dk/3) | bash

For other installation options see http://git.savannah.gnu.org/cgit/parallel.git/tree/README

Learn more

See more examples: http://www.gnu.org/software/parallel/man.html

Watch the intro videos: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Walk through the tutorial: http://www.gnu.org/software/parallel/parallel_tutorial.html

edited Jul 10 '15 at 08:00

answered Jul 09 '15 at 21:39

Ole Tange



A script alone is not enough; explain what it does. – peterh Jul 09 '15 at 22:33
The images are nice. But I don't understand what the first image wants to show me. You say "GNU Parallel instead spawns a new process when one finishes - keeping the CPUs active and thus saving time". This sounds like a comparison. One tool is `GNU Parallel`. What is the second tool in your comparison? – guettli Aug 04 '15 at 09:51
The first method (and picture) divides the jobs among the CPUs before running them. The second is GNU Parallel, which assigns the jobs while they are being run. – Ole Tange Aug 04 '15 at 11:35