Leveraging rsync for a local copy between two slow drives, benefits without a daemon

Question

I am unclear on what rsync (3.1.1) gains me when no daemon is running remotely, e.g. when copying from a drive mounted via SMB2 (over a VPN) to an external HDD (USB 2.0, sadly). Both connections are slow (and my data is ~1TB), but I don't see how compression or careful diffing could speed things up if all of it requires my CPU to read the data in the first place. Both drives are local in this sense. (I cannot replace the SMB connection with rsync over SSH, as it cannot handle my password.) Even with a truly remote drive, I don't see how rsync could do its magic if nothing on the other end compresses the data before it reaches the local CPU.

Is this a reasonable setup for such a copy? rsync -vhcrC --progress src dest

-c: Maybe checksums are a bad idea; file size and timestamp might be the only things rsync can check without loading the data in the first place.
-h: human-readable output
-v: verbose
-C: skipping what CVS skips

omitting:

-a: I am not interested in archiving; as the files move from Windows to Mac, permissions will change anyway, I think
-z: this is the compression issue
-W: copying whole files only sometimes uses less CPU, but some files here are really big (~100GB), and an interrupted transfer is better picked up where it left off

You wrote: "I cannot replace the SMB connection with SSH via rsync, as it cannot handle my password." Why? IMHO rsync with passwordless ssh works fine. — Jayan, Aug 03 '14 at 03:46
@Jayan But I do have a password, and I cannot turn it off. Did I write anything different? — László, Aug 03 '14 at 09:26

score 2 · Accepted Answer · answered Aug 03 '14 at 02:09

Note: the following is all going off theory -- the real right way to make sure this is correct in your situation is to run tests on various combinations of options.

The data connections in an rsync operation look something like this:

Source disk <-> rsync instance <-> other rsync instance <-> destination disk

In general, rsync is designed for the case where the first and last links (between the rsync instances and their disks) are fast, and the middle link (between the rsync instances) is slow. This is especially true of the -z (compression) and -c (checksum files to decide which to transmit) options; in a situation where both rsyncs are on the same computer (and therefore connected by a fast link), these options basically make no sense.

More specifically: the -z option compresses data over the middle link, trading off higher CPU load on both ends for lower bandwidth need on the middle link. If the middle link's fast, save CPU by skipping this option.

As for the -c option, this forces both rsyncs to read in full even the files that don't need to be synced, just to really make sure they don't need to be synced. If either or both of the disk links are slow, and there are a lot of files that're already in sync, this will slow the process down proportionally. As long as you don't need to worry about file contents changing without their timestamps also changing, you should avoid this one. Note that omitting -c isn't much use unless you also add the -t option (or -a) so it'll copy timestamps -- without matching timestamps, the size-and-time check fails for every file, and rsync will process them all again anyway.
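The effect of -t on the size-and-time check can be seen with a small experiment. This is a sketch only: the function name and the throwaway directories are made up for illustration, and -n (dry run) plus -i (itemize changes) are used on the second pass just to list what rsync would transfer again.

```shell
# Hypothetical demo; quick_check_demo and all paths are invented for this example.
quick_check_demo() {
    demo=$(mktemp -d) || return 1
    mkdir "$demo/src" "$demo/with_t" "$demo/without_t"
    echo "some data" > "$demo/src/file.txt"
    # Give the source file an old mtime so it clearly differs from "now".
    touch -t 201408010000 "$demo/src/file.txt"

    # First pass: one copy preserving mtimes (-t), one without.
    rsync -rt "$demo/src/" "$demo/with_t/"
    rsync -r  "$demo/src/" "$demo/without_t/"

    # Second pass as a dry run: -i itemizes only what would be transferred.
    echo "second pass with -t:"
    rsync -rtni "$demo/src/" "$demo/with_t/"
    echo "second pass without -t:"
    rsync -rni "$demo/src/" "$demo/without_t/"

    rm -rf "$demo"
}
```

Running quick_check_demo should list file.txt only under the "without -t" heading: that copy got a fresh timestamp, so size and mtime no longer match the source and rsync wants to send the file again.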

You might also want to add the -W option (just copy whole files, skip the compare-and-find-just-the-changes step), as that'll avoid extra reading of modified files. This probably isn't necessary, though, as all the versions of rsync I'm familiar with do this automatically when both source and destination are specified as local paths (which should apply even if one of those local paths happens to be within a network mount point).

Short summary: remove -c, add -t and maybe -W.
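Applied to the command from the question, that summary might look roughly like the following. It is a sketch under assumptions: the wrapper name copy_tree is made up, src/ and dest/ are placeholders, and -W is left out on the assumption that rsync already picks whole-file copying for two local paths.

```shell
# Hypothetical wrapper; the name copy_tree is invented for this example.
copy_tree() {
    # -r recurse, -t preserve mtimes (needed for the size-and-time check),
    # -C skip what CVS skips, -v/-h/--progress for readable output.
    # No -c (it would read every file on both slow drives) and no -z.
    rsync -vhrtC --progress "$@"
}

# Usage: copy_tree -n src/ dest/   (dry run first, nothing is written)
#        copy_tree src/ dest/      (the real copy)
```

Doing the dry run first costs little, since it only compares sizes and timestamps, and it shows what a rerun after an interruption would still have to copy.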
