16

I have a ton of relatively small data files, but together they take up about 50 GB and I need them transferred to a different machine. I was trying to think of the most efficient way to do this.

The options I considered were: gzip the whole thing, rsync it, and decompress it on the other side; rely on rsync -z for compression; or gzip first and then also use rsync -z. I am not sure which would be most efficient, since I am not sure exactly how rsync -z is implemented. Any ideas on which option would be the fastest?
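
For concreteness, the approaches I'm comparing look roughly like this (paths and hostnames are placeholders):

    # 1) compress first, transfer the archive, then decompress on the far side
    tar -czf data.tar.gz /path/to/data
    rsync -av data.tar.gz user@remote:/tmp/
    ssh user@remote 'tar -xzf /tmp/data.tar.gz -C /destination'

    # 2) let rsync compress on the wire
    rsync -avz /path/to/data/ user@remote:/destination/data/

    # 3) like 1), but also passing -z, so rsync tries to compress the already-gzipped archive again
    rsync -avz data.tar.gz user@remote:/tmp/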

5 Answers

11

You can't "gzip the whole thing", as gzip only compresses one file. You could create a tar file and gzip it to "gzip the whole thing", but then you would lose rsync's ability to copy only modified files.

So the question is: is it better to store the files you need to rsync gzipped, or to rely on the -z option of rsync?
The answer is probably that you don't want the files stored unzipped on your server? I guess yes, so I don't see how you could manage to gzip the files before doing the rsync.

Maybe you don't need rsync's ability to copy only modified files? In that case, why use rsync at all instead of doing an scp of a tar.gz file containing your stuff?
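
A minimal sketch of that approach, with hypothetical paths and hostname:

    tar -czf stuff.tar.gz /path/to/files          # bundle and compress locally
    scp stuff.tar.gz user@target:/destination/    # one transfer over ssh
    ssh user@target 'tar -xzf /destination/stuff.tar.gz -C /destination'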

Anyway, to answer the question: rsync's gzip will be a little less efficient than gzipping the files with gzip. Why? Because rsync gzips the data chunk by chunk, so a smaller set of data is used to build the table that gzip uses to do compression; a bigger set of data (gzip would use the whole file at once) gives a better compression table. In most cases the difference will be very, very small, but in rare cases it can matter more (for example, a very large file with a very long pattern repeating many times throughout the file, but far apart from each other). (This is a very simplified example.)
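
A rough way to see the effect being described, run in a scratch directory (hypothetical path, arbitrary chunk size):

    # compare compressing one continuous stream vs. the same data chunk by chunk
    tar -cf - /path/to/data | split -b 64k - chunk.
    cat chunk.* | gzip -c | wc -c                        # whole stream at once
    for f in chunk.*; do gzip -c "$f"; done | wc -c      # each chunk compressed independently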

radius
  • 9,545
  • 23
  • 45
  • 1
    From how I read his question, he'll compress to get it over the wire and then decompress on the other side. I'd go with rsync's native compression over gzip, simply because compressing and decompressing 50GB can take a significant amount of time. Then again, if the files are mostly text, they'll compress nicely. Third option: copy the files to a USB drive. –  Jun 24 '10 at 00:08
  • 3
    @Randolph Potter: yes, the time lost to compress 50GB locally and then rsync it would be higher than using rsync -z; anyway, if he wants to take advantage of rsync itself (copying only changed files), compression can't be done beforehand – radius Jun 24 '10 at 00:15
  • very good point. +1 for you :-) –  Jun 24 '10 at 00:27
  • Recall also that gzip is a stream compressor. – Falcon Momot Oct 01 '12 at 02:31
6

If you're only copying the data once, rsync isn't going to be a big win in and of itself. If you like gzip (or tar+gzip, since you have many files), you might try something like:

tar -czf - /home/me/source/directory | ssh target tar -xzf - --directory /home/you/target/directory

That would get the compression you are looking for and just copy directly without involving rsync.

Hubert Kario
  • 6,351
  • 6
  • 33
  • 65
Slartibartfast
  • 3,265
  • 17
  • 16
  • i'd probably use --lzop for that instead of gzip ... much faster and lower cpu overhead and still has good compression ratios for text – underrun Jul 11 '13 at 18:37
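
A hedged sketch of that variant, assuming GNU tar with lzop installed on both ends (paths as in the answer above):

    tar -c --lzop -f - /home/me/source/directory | ssh target tar -x --lzop -f - --directory /home/you/target/directory
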
6

@radius, a minor nit to pick about how gzip works - gzip is a block-based compression algorithm, and a fairly simple one at that. The whole file is not considered for the compression table - only each block. Other algorithms may use the whole contents of the file and there are a few that use the contents of multiple blocks or even variably-sized blocks. One fascinating example is lrzip, by the same author as rsync!

The skinny on gzip's algorithm.

So, in summary, using rsync -z will likely yield the same compression as gzipping first - and if you're doing a differential transfer, better, because of rsync's diffing algorithm.

That said, I think one will find that regular scp handily beats rsync for non-differential transfers - because it has far less overhead than rsync's algorithm (which would use ssh under the hood anyway!)
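
A plain scp of the tree would look something like this (hypothetical paths):

    scp -r /home/me/source/directory user@target:/home/you/target/directory
    # add -C if you also want ssh-level compression on the wire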

If your network does become a bottleneck, then you would want to use compression on the wire.

If your disks are the bottleneck, that's when streaming into a compressed file would be best. (for example, netcat from one machine to the next, streaming into gzip -c)
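
A rough sketch of that netcat approach (host and port are made up; netcat flag syntax differs between the BSD and traditional variants, so you may need -p for the listener, or -N / -q 0 so the sender exits at end of stream):

    # on the receiving machine: listen and write a compressed archive straight to disk
    nc -l 9000 | gzip -c > /destination/data.tar.gz

    # on the sending machine: stream an uncompressed tar over the network
    tar -cf - /home/me/source/directory | nc target 9000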

Usually, if speed is key, compressing an existing file before-hand is wasteful.

TIMTOWTDI, YMMV, IANAL, etc.

HopelessN00b
  • 53,385
  • 32
  • 133
  • 208
Hercynium
  • 161
  • 1
  • 2
2

According to this guy it may just be faster to use rsync -z, although I would guess it would be close to as efficient as compressing each file first before transferring. It should be faster than compressing the tar stream, as suggested by others.

From the man page:

          Note  that  this  option  typically  achieves better compression
          ratios than can be achieved by using a compressing remote  shell
          or  a  compressing  transport  because it takes advantage of the
          implicit information in the matching data blocks  that  are  not
          explicitly sent over the connection.
Insyte
  • 9,314
  • 2
  • 27
  • 45
  • 1
    I would suggest using --compress-level=1 with rsync -z if you have a fast network. You want the network to be your bottleneck, not CPU or disk IO, to minimize total transfer time. If the network is slow, using the default -z (which is equivalent to gzip -6 I think) might still make the process network bound. – rmalayter Jul 09 '10 at 13:36
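
    For example, a hypothetical invocation along those lines (paths and host made up):

        rsync -avz --compress-level=1 /path/to/data/ user@remote:/destination/data/
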
1

Since both scp of a compressed file and rsync will take very similar transfer times, the "most efficient way to do this" would be on-the-fly compression rather than compressing and then transferring.

In addition to "fastness" other considerations include:

rsync can be easily restarted if not all of the files get transferred (see the sketch after this list).

rsync can be used to maintain the files on the remote machine.

local tar or gzip requires local space.

Port usage considerations for both the target machine and firewalls: 1) scp uses port 22 (by default), which may not be acceptable. 2) rsync uses port 873 (by default) when talking to an rsync daemon; when run over ssh it also uses port 22.
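
A minimal sketch of the restartable, re-runnable rsync use mentioned above (paths and host are placeholders):

    # safe to re-run after an interruption, or later to keep the remote copy in sync;
    # unchanged files are skipped, and --partial keeps partially transferred files
    rsync -avz --partial /path/to/data/ user@remote:/destination/data/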

I am not sure why radius expects that the original poster does NOT want unzipped files stored.

DGerman
  • 13
  • 4