What's the best way to perform a parallel copy on Unix?

I routinely have to copy the contents of a folder on a network file system to my local computer. There are many files (thousands) in the remote folder, all relatively small, but due to network overhead a regular copy (cp remote_folder/* ~/local_folder/) takes a very long time (about 10 minutes).

I believe it's because the files are being copied sequentially: each file waits until the previous one has finished before its copy begins.

What's the simplest way to increase the speed of this copy? (I assume it is to perform the copy in parallel.)

Zipping the files before copying will not necessarily speed things up, because they may all be saved on different disks on different servers.

dsg

Posted 2011-08-24T20:50:20.397

Reputation: 1 019

http://serverfault.com/questions/152331/parallel-file-copy – Ciro Santilli 新疆改造中心法轮功六四事件 – 2015-08-03T09:24:18.507

Zipping the files before copying will speed things up massively because there will not need to be any more "did you get that file", "yes, I did", "here's the next one", "okay", ... It's those "turnarounds" that slow you down. – David Schwartz – 2013-01-15T17:19:51.643

It's probably disk speed, rather than network speed, that is your limiting factor, and if that is the case then doing this per file in parallel will make the operation slower, not faster, because you will force the disk to constantly seek back and forth between files. – Joel Coehoorn – 2013-01-15T17:20:51.260

While zipping might not be a good idea (running a compression algorithm over thousands of files might take a little while), tar might be viable. – Rob – 2013-01-15T17:27:50.193
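
For illustration, a rough sketch of that tar approach with the paths from the question, streaming the whole folder as a single archive instead of issuing thousands of per-file copies (uncompressed, since the overhead here is per-file round trips rather than data size; whether it actually helps depends on the filesystem):

# pack the mounted remote folder into one tar stream and unpack it locally
tar -C remote_folder -cf - . | tar -C ~/local_folder -xf -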

@JoelCoehoorn still, there are cases where that doesn't hold: e.g. multiple spindles + small files (or simply random reads). In this scenario, "parallel cp" would help. – CAFxX – 2013-04-13T06:37:25.137

Answers

8

As long as you limit the number of copy commands you're running, you could probably use a script like the one posted by Scrutinizer:

SOURCEDIR="$1"
TARGETDIR="$2"
MAX_PARALLEL=4
# count the files in the source directory
nroffiles=$(ls "$SOURCEDIR" | wc -w)
# size of each batch so that there are at most MAX_PARALLEL batches
setsize=$(( nroffiles/MAX_PARALLEL + 1 ))
# xargs regroups the filenames into lines of $setsize names each;
# every line becomes one background cp job
ls -1 "$SOURCEDIR"/* | xargs -n "$setsize" | while read workset; do
  cp -p $workset "$TARGETDIR" &   # $workset deliberately left unquoted so it splits into filenames
done
wait

OldWolf

Posted 2011-08-24T20:50:20.397

Reputation: 2 293

Note of warning though: This script breaks with filenames containing spaces or globbing characters. – slhck – 2011-08-24T21:56:33.950
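
For reference, a space-safe variant of the same idea is possible with NUL-delimited filenames, assuming GNU find, xargs and cp (for -print0/-0, -P and -t):

# hand cp the files 16 at a time, running at most 4 cp processes concurrently;
# -print0/-0 keep filenames with spaces intact, -t (GNU cp) names the target directory first
find "$SOURCEDIR" -maxdepth 1 -type f -print0 |
  xargs -0 -n 16 -P 4 cp -p -t "$TARGETDIR"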

@OldWolf -- Can you explain how this script works? For example, which part does the parallelization? – dsg – 2011-08-24T22:42:43.597

@dsg: The & at the end of the cp command allows the while loop to continue and start the next cp command without waiting. The xargs command splits the filenames into (at most) MAX_PARALLEL groups of $setsize names each, so the loop launches at most four cp jobs. – RedGrittyBrick – 2011-08-24T23:21:56.747

Didn't work for me. I'm not sure it is possible to speed up cp. You can obviously speed up computation through multithreading, but I don't think the same holds for hard drive data copying. – Adobe – 2011-09-04T17:12:22.907

9

If you have GNU Parallel http://www.gnu.org/software/parallel/ installed, you can do this:

parallel -j10 cp {} destdir/ ::: *
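
With the paths from the question, that would look something like this (10 concurrent cp jobs; adjust -j to what the network and disks can sustain):

parallel -j10 cp {} ~/local_folder/ ::: remote_folder/*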

You can install GNU Parallel simply by:

$ (wget -O - pi.dk/3 || lynx -source pi.dk/3 || curl pi.dk/3/ || \
   fetch -o - http://pi.dk/3 ) > install.sh
$ sha1sum install.sh | grep 3374ec53bacb199b245af2dda86df6c9
12345678 3374ec53 bacb199b 245af2dd a86df6c9
$ md5sum install.sh | grep 029a9ac06e8b5bc6052eac57b2c3c9ca
029a9ac0 6e8b5bc6 052eac57 b2c3c9ca
$ sha512sum install.sh | grep f517006d9897747bed8a4694b1acba1b
40f53af6 9e20dae5 713ba06c f517006d 9897747b ed8a4694 b1acba1b 1464beb4
60055629 3f2356f3 3e9c4e3c 76e3f3af a9db4b32 bd33322b 975696fc e6b23cfb
$ bash install.sh

Watch the intro videos for GNU Parallel to learn more: https://www.youtube.com/playlist?list=PL284C9FF2488BC6D1

Ole Tange

Posted 2011-08-24T20:50:20.397

Reputation: 3 034

3

One way would be to use rsync, which will only copy the changes: new files and the changed parts of existing files.

http://linux.die.net/man/1/rsync
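
For the folders in the question, a minimal invocation might look like this (-a recurses and preserves attributes; the trailing slash on the source copies the folder's contents rather than the folder itself):

rsync -a remote_folder/ ~/local_folder/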

Running any form of parallel copy operation will probably just flood your network, and the copy will grind to a halt or hit bottlenecks at the source or destination disk.

Linker3000

Posted 2011-08-24T20:50:20.397

Reputation: 25 670

2

Honestly, the best tool is Google's gsutil. It handles parallel copies with directory recursion. Most of the other methods I've seen can't handle directory recursion. They don't specifically mention local filesystem to local filesystem copies in their docs, but it works like a charm.

It's another binary to install, but probably one you already run, given how much cloud service adoption there is nowadays.
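
The kind of parallel, recursive copy described above would look roughly like this with the question's paths; -m is gsutil's switch for parallel (multi-threaded/multi-process) operation and -r recurses into subdirectories:

gsutil -m cp -r remote_folder/* ~/local_folder/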

diq

Posted 2011-08-24T20:50:20.397

Reputation: 121

2

Parallel rsync using find:

export SOURCE_DIR=/a/path/to/nowhere
export DEST_DIR=/another/path/to/nowhere

# replicate the folder structure first (directories only, no files);
# the trailing slashes copy the directory's contents rather than the directory itself
rsync -a -f'+ */' -f'- *' "$SOURCE_DIR"/ "$DEST_DIR"/

# work from inside the source directory so find emits relative paths
cd "$SOURCE_DIR"

# use find to build the file list and pipe it into GNU parallel to run 4 rsync jobs
# simultaneously; {//} expands to the directory part of each path
find . -type f | SHELL=/bin/sh parallel --linebuffer --jobs=4 'rsync -av {} $DEST_DIR/{//}/'

On a corporate LAN, a single rsync does about 800 Mbps; with 6-8 jobs I am able to get over 2.5 Gbps (at the expense of high load), limited by the disks.

yee379

Posted 2011-08-24T20:50:20.397

Reputation: 121

0

There are many things one may have to consider depending on the topology you have. But before you start thinking about complex solutions, you could simply try dividing the task into two jobs and check whether the total time drops significantly.

Next time, try:

  # copy two halves of the file set in parallel, each as a background job
  cp remote_folder/[a-l]* ~/local_folder/ &
  cp remote_folder/[!a-l]* ~/local_folder/ &
  # a single wait is enough: it waits for all background jobs to finish
  wait

(You may want to replace [a-l]* with something else that matches about half of the files, maybe [0-4]*, depending on the contents of the folder.)

If the time does not improve dramatically, it may be more important to check whether it's necessary to copy all the files at all (what's the ratio of changed files to all files?).

ktf

Posted 2011-08-24T20:50:20.397

Reputation: 2 168