I just did a little experiment where I created a tar archive with duplicate files to see whether the duplicates would be compressed away; to my surprise, they were not! Details follow (results indented for reading pleasure):
$ dd if=/dev/urandom bs=1M count=1 of=a
1+0 records in
1+0 records out
1048576 bytes (1.0 MB) copied, 0.114354 s, 9.2 MB/s
$ cp a b
$ ln a c
$ ll
total 3072
-rw-r--r-- 2 guido guido 1048576 Sep 24 15:51 a
-rw-r--r-- 1 guido guido 1048576 Sep 24 15:51 b
-rw-r--r-- 2 guido guido 1048576 Sep 24 15:51 c
$ tar -c * -f test.tar
$ ls -l test.tar
-rw-r--r-- 1 guido guido 2109440 Sep 24 15:51 test.tar
$ gzip test.tar
$ ls -l test.tar.gz
-rw-r--r-- 1 guido guido 2097921 Sep 24 15:51 test.tar.gz
$
First I created a 1 MiB file of random data (a). Then I copied it to a file b and also hard-linked it to c. When creating the tarball, tar was apparently aware of the hard link, since the tarball was only ~2 MiB and not ~3 MiB.
Now I expected gzip to reduce the size of the tarball to ~1 MiB, since a and b are duplicates and there should be 1 MiB of contiguous data repeated inside the tarball, yet this didn't occur.
Why is this? And how could I compress the tarball efficiently in these cases?
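For completeness, one can check that the two identical 1 MiB blocks really do sit inside the tarball; the offsets below are an assumption based on GNU tar writing the members in the order a, b, c, each preceded by a 512-byte header:
# compare a's data region (blocks 1-2048) with b's data region (blocks 2050-4097)
cmp <(dd if=test.tar bs=512 skip=1 count=2048 2>/dev/null) \
    <(dd if=test.tar bs=512 skip=2050 count=2048 2>/dev/null) \
  && echo "a and b occupy identical 1 MiB regions of test.tar"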
Fair enough! Do you happen to know of any alternative that doesn't work on streams? – Guido – 2012-09-24T19:15:16.073
I don't know of any packaged solution to your problem. If I expected this would be a recurring, serious problem, I (personally) would attack it with a script that did the n-way cmp (compare) operations to find duplicates, write the list to a file, then tar + gzip only the unique items + the list. To restore, I'd use a second script to ungzip and untar, then create the dups from the list. Another alternative would be to turn the dups into hard links, since you know tar does spot those. Sorry, I know that's probably not what you were hoping. – Nicole Hamilton – 2012-09-24T19:22:27.870
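A rough sketch of that script idea, assuming a single flat directory and using md5sum instead of pairwise cmp for brevity (all file names below are made up):
# record one checksum per file, then map each duplicate to the first file with the same content
md5sum * > sums.md5
awk 'h[$1] { print $2, h[$1]; next } { h[$1] = $2 }' sums.md5 > dups.map   # lines of "duplicate original"
# archive only the first copy of each content, plus the map needed to restore the rest
tar -czf unique.tar.gz dups.map $(awk '!h[$1]++ { print $2 }' sums.md5)
# restore: tar -xzf unique.tar.gz && while read dup orig; do cp "$orig" "$dup"; done < dups.map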
Yeah, I thought about doing that (fdupes is a nice program to detect duplicates and even hard link them if you want!). But I just tried using xz for compressing and it worked! Apparently it scans the whole file. It's a big CPU/memory hog when compressing though. Thanks! – Guido – 2012-09-24T19:30:25.017
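A small sketch of that pre-linking route (illustrative commands only; fdupes is used here just to report the duplicate sets): since the question already shows tar storing hard links once, turning b into a link of a before archiving avoids the problem entirely.
fdupes -r .              # list sets of files with identical content
ln -f a b                # replace the duplicate b with a hard link to a
tar -czf test.tar.gz a b c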
gzip and bzip2 both have to be relatively "stream friendly" because of their design; being able to work as part of a pipe is absolutely necessary. What you are looking for here is actually deduplication, not just compression. Tar splits the process into two parts: archiving with tar itself, then using a second program as a filter to compress. I couldn't find any compressed archive format with deduplication in my searches, but I found this previous related question. http://superuser.com/questions/286414/is-there-a-compression-or-archiver-program-for-windows-that-also-does-deduplicat – Stephanie – 2012-09-24T19:34:32.923
Scratch that. It works for small files, but for large ones the problem exists. I'm guessing it just has a bigger default buffer size. – Guido – 2012-09-24T19:39:14.723
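That stream-friendliness is the crux: gzip's DEFLATE format can only reference data inside a 32 KiB sliding window, so copies that sit close together are collapsed while copies 1 MiB apart are not. A quick illustration (hypothetical commands, not from the original session):
dd if=/dev/urandom bs=16K count=1 of=small 2>/dev/null
cat small small | gzip -c | wc -c    # roughly 16 KiB: the second copy is still inside the window
dd if=/dev/urandom bs=1M count=1 of=big 2>/dev/null
cat big big | gzip -c | wc -c        # roughly 2 MiB: the second copy is 1 MiB back, out of reach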
@Stephanie, NicoleHamilton: There is https://en.wikipedia.org/wiki/Lrzip#Lrzip. – Mechanical snail – 2012-09-25T00:10:26.950
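A minimal usage sketch (default options assumed): lrzip's long-range pre-pass finds matches far beyond what gzip's window allows, so the repeated 1 MiB block in test.tar should be collapsed before the final compression stage.
lrzip test.tar           # writes test.tar.lrz
lrzip -d test.tar.lrz    # decompresses it back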
@Guido Of course nothing can remove duplicates of something it doesn't remember in a stream, but try something like xz -9 -M 95%, or even xz -M 95% --lzma2=preset=9,dict=1610612736. It won't be fast, but your duplicates are unlikely to be left in the result. – Eroen – 2012-09-25T00:20:47.703
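Applied to the 2 MiB test.tar from the question, the first of those suggestions would look like this (the resulting size is what one would expect, not a measured figure); xz's preset 9 uses a 64 MiB LZMA2 dictionary, comfortably larger than the ~1 MiB gap between the two copies:
xz -9 -M 95% -k test.tar    # -k keeps the original tarball around
ls -l test.tar.xz           # should come out near 1 MiB rather than 2 MiB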