What is the fastest way to move a million images from one directory to another in Linux?

13

1

I have a million images that take up 30GB of disk space and need to be moved from one local directory to another local directory.

What would be the most efficient way to do this? Using mv? Using cp? Using rsync? Something else?

I need to take these:

/path/to/old-img-dir/*
                     00000000.jpg
                     --------.jpg  ## nearly 1M of them! ##
                     ZZZZZZZZ.jpg

and move them here:

/path/to/new/img/dir/

Ryan

Posted 2012-10-16T06:54:33.390

Reputation: 345

I don't think you can beat mv, performance-wise, if both the source and target directories reside in the same filesystem. – Frédéric Hamidi – 2012-10-16T06:57:50.917

Answers

24

rsync would be a poor choice because it does a lot of client/server background work, since its machinery has to account for remote as well as local systems.

mv is probably the best choice. If possible, you should try mv directory_old directory_new rather than mv directory_old/* directory_new/. This way, you move one thing instead of a million things.
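A sketch of the difference, using hypothetical /tmp paths: renaming the directory is a single rename() call on one directory entry, while the glob form issues one operation per file.

```shell
set -e
# Hypothetical demo paths; start from a clean slate.
rm -rf /tmp/mv-demo && mkdir -p /tmp/mv-demo/old-img-dir
touch /tmp/mv-demo/old-img-dir/00000000.jpg

# One rename() for the whole tree (works only within a single filesystem):
mv /tmp/mv-demo/old-img-dir /tmp/mv-demo/new-img-dir
```

With a million files, the glob form also has to expand and pass a million arguments; the directory rename touches none of them.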

Richard

Posted 2012-10-16T06:54:33.390

Reputation: 2 565

If there are many images, using a simple shell wildcard will overflow the maximum command line. – Raúl Salinas-Monteagudo – 2017-07-07T11:25:07.723

Moving between disks will still copy all the data. On the same disk, mv just updates inode information, so mv directory_old directory_new works faster than mv directory_old/* directory_new – Anshul – 2018-01-10T04:59:18.270

+1 for the advice to move the directories instead of the files. – Ex Umbris – 2012-10-16T07:11:14.513

Plus, the wildcard expansion would likely exceed the maximum number of arguments supported by mv if we're talking about millions. – slhck – 2012-10-16T09:29:52.827

rsync handles transfers on local storage media just fine. It forces things like --whole-file (disabling the delta-transfer algorithm) and prevents things like --compress, which serve no purpose in local transfers. If the directories reside on different filesystems, mv won't provide any performance advantage. If they DO reside on the same filesystem, then just mv the directories like these folks said. – UtahJarhead – 2012-10-16T14:04:58.740

13

find src_image_dir/ -type f -name '*.jpg' -print0 | xargs -0r mv -t dst_image_dir/ 
  • This will not overflow argument expansion.
  • You can specify the file extension, if you want to. (-name ...)
  • find -print0 with xargs -0 allows you to use spaces in the names.
  • xargs -r will not run mv unless there is something to be moved. (mv will complain if no source files are given).
  • The syntax mv -t allows you to specify first the destination and then the source files, needed by xargs.
  • Moving the whole directory is of course much faster, since it takes place in constant time regardless of the number of files contained in it, but:
    • the source directory will disappear for a fraction of a second, which might cause you problems;
    • if a process is writing into the moved directory via a relative path (rather than always resolving a full path from a non-moving location), you would have to relaunch it, as you do with log rotation.
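If renaming the whole directory is acceptable but a producer is still writing into it, a common workaround (a sketch with hypothetical paths) is to swap the directory out and immediately recreate an empty one in its place:

```shell
set -e
# Hypothetical demo paths; start from a clean slate.
rm -rf /tmp/swap-demo && mkdir -p /tmp/swap-demo/img-dir
touch /tmp/swap-demo/img-dir/a.jpg

# Rename the full directory, then put an empty one back, so the path
# is only missing for an instant:
mv /tmp/swap-demo/img-dir /tmp/swap-demo/img-dir.processing
mkdir /tmp/swap-demo/img-dir
```

The renamed copy can then be processed at leisure while new files accumulate in the fresh directory.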

By the way, I would ask myself whether I really have to move such a big amount of files at once. Batch processing is overrated. I try not to accumulate huge amounts of work if I can process things at the moment they are generated.

Raúl Salinas-Monteagudo

Posted 2012-10-16T06:54:33.390

Reputation: 1 058

This works well enough for moving files across filesystems on the same server. Well enough that I didn't bother looking for a solution with rsync. Sure, it took an hour or two, but it worked.

One thing to note: if you give find a directory name instead of ".", be sure to use the trailing slash in the find command, or the directory itself will be recreated in the destination of the mv command. – Speeddymon – 2017-07-06T20:08:31.897

6

If the two directories reside on the same filesystem, use mv on the DIRECTORY and not the contents of the directory.

If they reside on two different filesystems, use rsync:

rsync -av /source/directory/ /destination

Notice the trailing / on the source. This means it will copy the CONTENTS of the directory and not the directory itself. If you leave the / off, it will still copy the files, but they will end up in /destination/directory. With the /, the files will be directly in /destination.

rsync will maintain file ownership if you run it as root or if the files are owned by you. It will also maintain the mtime of each individual file.

UtahJarhead

Posted 2012-10-16T06:54:33.390

Reputation: 1 755

For copying a large folder from one hard drive to a different hard drive, rsync seems to run circles around mv. Thanks for the tip! – leo-the-manic – 2013-07-20T05:52:40.273

2

tar cf - dir1 | (cd dir2; tar xf -)

tar cf - dir1 | ssh remote_host "( cd /path/to/dir2; tar xf - )"

When you use 'cp', each file goes through a full open-read-close, then open-write-close cycle. The tar pipeline uses separate processes for reading and writing, so the read side and the write side overlap; even on a single-CPU box, that pipelining is faster.

maholt

Posted 2012-10-16T06:54:33.390

Reputation: 31

While this may answer the question, it would be a better answer if you could provide some explanation why it does so. – DavidPostill – 2016-04-16T19:04:28.533

If they are in the local machine, chances are they reside in the same filesystem.

By using tar c | tar x you get a cost of O(total_size) instead of O(file_count). – Raúl Salinas-Monteagudo – 2017-07-07T11:26:39.070

1

As both directory_old and directory_new are on the same filesystem, you could use cp -l (with -r for directories) instead of mv. cp -l creates hard links to the original files instead of copying their data. When you are done with the 'move' and satisfied with the result, you can remove the files from directory_old. In terms of speed it is the same as mv, since you first create the links and then remove the originals, but this approach lets you start over from the beginning if something goes wrong.
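A minimal sketch of the link-then-delete approach, with hypothetical paths (cp -al is the usual spelling for directories: -a recurses and preserves attributes, -l links instead of copying):

```shell
set -e
# Hypothetical demo paths; start from a clean slate.
rm -rf /tmp/link-demo && mkdir -p /tmp/link-demo/directory_old
echo data > /tmp/link-demo/directory_old/a.jpg

# Hard-link every file instead of copying its data (fast, no extra space):
cp -al /tmp/link-demo/directory_old /tmp/link-demo/directory_new

# Once satisfied with the result, drop the old names; the data stays
# reachable through the new links:
rm -rf /tmp/link-demo/directory_old
```

Until the final rm, both trees point at the same inodes, so aborting at any point leaves the originals untouched.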

Serge

Posted 2012-10-16T06:54:33.390

Reputation: 2 585

0

When copying ~10k files (no directories) with a shell wildcard, cp complained with:

unable to execute /bin/cp: Argument list too long

The best option was rsync:

rsync source target

And it was done very quickly!

Nico

Posted 2012-10-16T06:54:33.390

Reputation: 1

0

If you have the free space, archive them into a single .tar file (without compression, which is faster), move that file over, then unarchive it.
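A sketch of that archive-move-unarchive round trip, with hypothetical paths (plain tar, no compression flag):

```shell
set -e
# Hypothetical demo paths; start from a clean slate.
rm -rf /tmp/tar-demo && mkdir -p /tmp/tar-demo/images /tmp/tar-demo/dest
touch /tmp/tar-demo/images/a.jpg

# One big sequential write instead of a million small ones:
tar -C /tmp/tar-demo -cf /tmp/tar-demo/images.tar images
# Move the single .tar (e.g. across disks), then unpack it:
mv /tmp/tar-demo/images.tar /tmp/tar-demo/dest/
tar -C /tmp/tar-demo/dest -xf /tmp/tar-demo/dest/images.tar
```

The tar pipe in the answers above achieves the same effect without needing the intermediate space for the archive.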

endolith

Posted 2012-10-16T06:54:33.390

Reputation: 6 626

0

The nature of the destination would determine the most efficient way to do this task. Let's assume you are on a local system, your PWD is /, and /a contains the millions of images. Our task is to move all of the images to /b while maintaining the sub-directory structure. Let's also assume /a and /b are mount points for two different partitions, each on a locally connected disk. We'd want to do this task with a tar pipe. This might take some time, so make sure you're using screen or tmux, or run it as a background process.

tar -C /a -cf - . | tar -C /b -xf -

That would copy all files and directories in /a to /b, so now you'll need to clean up /a once you confirm it completed without error.

J. M. Becker

Posted 2012-10-16T06:54:33.390

Reputation: 593

0

It depends(tm). If your filesystem is copy-on-write, then copy (cp or rsync, for instance) should be comparable to a move. But for most common cases, move (mv) will be the fastest, since it can simply switch around the pieces of data that describe where a file is placed (note: this is overly simplified).

So, on your average Linux installation, I'd go for mv.

EDIT: @Frédéric Hamidi has a good point in the comments: This is only valid if they are both on the same filesystem and disk. Otherwise the data will be copied anyway.
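On a copy-on-write filesystem such as Btrfs or XFS, GNU cp can share data blocks instead of duplicating them; a sketch with hypothetical paths (--reflink=auto falls back to a normal copy on filesystems without reflink support):

```shell
set -e
# Hypothetical demo paths; start from a clean slate.
rm -rf /tmp/cow-demo && mkdir -p /tmp/cow-demo
echo data > /tmp/cow-demo/a.jpg

# Near-instant on CoW filesystems; an ordinary data copy elsewhere:
cp --reflink=auto /tmp/cow-demo/a.jpg /tmp/cow-demo/b.jpg
```

This is the case where a "copy" really can be comparable in cost to a move, since only metadata is written until one of the copies is modified.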

carlpett

Posted 2012-10-16T06:54:33.390

Reputation: 285