Why doesn't Gzip compression eliminate duplicate chunks of data?

30

8

I just did a little experiment where I created a tar archive with duplicate files to see whether the duplicates would be compressed away; to my surprise, they were not! Details follow (results indented for reading pleasure):

$ dd if=/dev/urandom bs=1M count=1 of=a
  1+0 records in
  1+0 records out
  1048576 bytes (1.0 MB) copied, 0.114354 s, 9.2 MB/s
$ cp a b
$ ln a c
$ ll
  total 3072
  -rw-r--r-- 2 guido guido 1048576 Sep 24 15:51 a
  -rw-r--r-- 1 guido guido 1048576 Sep 24 15:51 b
  -rw-r--r-- 2 guido guido 1048576 Sep 24 15:51 c
$ tar -c * -f test.tar
$ ls -l test.tar 
  -rw-r--r-- 1 guido guido 2109440 Sep 24 15:51 test.tar
$ gzip test.tar 
$ ls -l test.tar.gz 
  -rw-r--r-- 1 guido guido 2097921 Sep 24 15:51 test.tar.gz
$ 

First I created a 1 MiB file of random data (a). Then I copied it to a file b and also hard-linked it to c. When creating the tarball, tar was apparently aware of the hard link, since the tarball was only ~2 MiB and not ~3 MiB.

Now I expected gzip to reduce the size of the tarball to ~1 MiB, since a and b are duplicates and there should be 1 MiB of contiguous data repeated inside the tarball, yet this didn't occur.

Why is this? And how could I compress the tarball efficiently in these cases?

Guido

Posted 2012-09-24T18:58:29.110

Reputation: 304

Answers

24

gzip is based on the DEFLATE algorithm, which is a combination of LZ77 and Huffman coding. It's a lossless data compression algorithm that works by transforming the input stream into compressed symbols using a dictionary built on the fly, watching for duplicates as it goes. But it can't find duplicates separated by more than 32 KB, the size of its sliding window. Expecting it to spot duplicates 1 MB apart isn't realistic.
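
A quick way to see the window limit in action (a rough sketch reusing the 1 MiB random file a from the question; the file name small and the sizes are only illustrative):

$ dd if=/dev/urandom bs=16k count=1 of=small
$ cat small small | gzip | wc -c   # the repeat lies within the 32 KB window, so it is found: output stays near 16 KB
$ cat a a | gzip | wc -c           # the repeat is 1 MiB away, outside the window: output stays near 2 MiB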

Nicole Hamilton

Posted 2012-09-24T18:58:29.110

Reputation: 8 987

Fair enough! Do you happen to know of any alternative that doesn't work on streams? – Guido – 2012-09-24T19:15:16.073

I don't know of any packaged solution to your problem. If I expected this would be a recurring, serious problem, I (personally) would attack it with a script that did the n-way cmp (compare) operations to find duplicates, write the list to a file, then tar + gzip only the unique items + the list. To restore, I'd use a second script to ungzip and untar, then create the dups from the list. Another alternative would be to turn the dups into hard links, since you know tar does spot those. Sorry, I know that's probably not what you were hoping. – Nicole Hamilton – 2012-09-24T19:22:27.870

Yeah, I thought about doing that (fdupes is a nice program to detect duplicates and even hard link them if you want!). But I just tried using xz for compressing and it worked! Apparently it scans the whole file. It's a big CPU/memory hog when compressing, though. Thanks! – Guido – 2012-09-24T19:30:25.017

1

gzip and bzip2 both have to be relatively "stream friendly" because of their design - it's absolutely necessary for being able to work as part of a pipe. What you are looking for here is actually deduplication, not just compression, and tar breaks the process into two parts: archiving only with tar, then using a second program as a filter to compress. I couldn't find any compressed archive format with deduplication in my searches, but I did find this previous related question: http://superuser.com/questions/286414/is-there-a-compression-or-archiver-program-for-windows-that-also-does-deduplicat

– Stephanie – 2012-09-24T19:34:32.923

Scratch that. It works for small files, but for large ones the problem persists. I'm guessing it just has a bigger default buffer size. – Guido – 2012-09-24T19:39:14.723

2

@Stephanie, NicoleHamilton: There is https://en.wikipedia.org/wiki/Lrzip#Lrzip.

– Mechanical snail – 2012-09-25T00:10:26.950

@Guido Of course nothing can remove duplicates of something it doesn't remember in a stream, but try something like xz -9 -M 95%, or even xz -M 95% --lzma2=preset=9,dict=1610612736. It won't be fast, but your duplicates are unlikely to be left in the result. – Eroen – 2012-09-25T00:20:47.703

39

Nicole Hamilton correctly notes that gzip won't find distant duplicate data due to its small dictionary size.

bzip2 is similar, because its block size is limited to 900 KB.

Instead, try:

LZMA/LZMA2 algorithm (xz, 7z)

The LZMA algorithm is in the same family as DEFLATE, but uses a much larger dictionary (customizable; xz's default preset uses an 8 MiB dictionary, and -9 uses 64 MiB). The xz utility, which should be installed by default on most recent Linux distros, works much like gzip and uses LZMA.

As LZMA detects longer-range redundancy, it will be able to deduplicate your data here. However, it is slower than Gzip.
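
For the tarball from the question, a minimal sketch (assuming xz is installed; -9 selects a 64 MiB dictionary, far larger than the 1 MiB distance between the duplicates) would be:

$ xz -9 test.tar       # replaces test.tar with test.tar.xz, just as gzip replaces its input
$ ls -l test.tar.xz    # should now be close to 1 MiB instead of ~2 MiB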

Another option is 7-zip (7z, in the p7zip package), which is an archiver (rather than a single-stream compressor) that uses LZMA by default (written by the author of LZMA). The 7-zip archiver runs its own deduplication at the file level (looking at files with the same extension) when archiving to its .7z format. This means that if you're willing to replace tar with 7z, you get identical files deduplicated. However, 7z does not preserve nanosecond timestamps, permissions, or xattrs, so it may not suit your needs.
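
A minimal sketch of the 7z route, using the file names from the question (assuming the 7z command from p7zip is available):

$ 7z a test.7z a b c   # 'a' creates a .7z archive from the given files
$ ls -l test.7z        # with the default solid LZMA settings, the duplicate content should add almost nothing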

lrzip

lrzip is a compressor that preprocesses the data to remove long-distance redundancy before feeding it to a conventional algorithm like Gzip/Deflate, bzip2, lzop, or LZMA. For the sample data you give here, it's not necessary; it's useful for when the input data is larger than what can fit in memory.

For this sort of data (duplicated incompressible chunks), you should use lzop compression (very fast) with lrzip, because there's no benefit to trying harder to compress completely random data once it's been deduplicated.
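
A minimal lrzip sketch (assuming the lrzip package is installed; -l selects the fast LZO backend after the long-range deduplication pass):

$ lrzip -l test.tar      # writes test.tar.lrz
$ lrunzip test.tar.lrz   # restores test.tar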

Bup and Obnam

Since you tagged the question as backup-related, if your goal here is backing up data, consider using a deduplicating backup program like Bup or Obnam.
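
For example, a minimal Bup sketch (assuming bup is installed; the directory path is just a placeholder) that stores a chunk-deduplicated snapshot:

$ bup init                             # create the default repository in ~/.bup
$ bup index /path/to/data              # scan the files to back up
$ bup save -n mybackup /path/to/data   # save a deduplicated snapshot named "mybackup"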

Mechanical snail

Posted 2012-09-24T18:58:29.110

Reputation: 6 625

7-Zip using LZMA2 compression and a 1536 MB dictionary size (the maximum available in the Windows GUI) works great for me! – Leopoldo Sanczyk – 2016-11-02T19:49:49.903

This lrzip looks interesting. It even has an author known for non-traditional solutions. Now I'll have to revise my backup scripts. Again. – Eroen – 2012-09-25T00:26:48.510

+1 Wow, what a fountain of knowledge/experience there. Appreciated. May I add dedup-enabled filesystems to the mix? ZFS (and, I think, Btrfs is slated to have it) would work with block-aligned duplication. – sehe – 2012-09-25T00:40:40.803

2

gzip won't find duplicates, and even xz with a huge dictionary size won't. What you can do is use mksquashfs - this will indeed save the space taken by duplicates.

Some quick test results with xz and mksquashfs, using three random binary files (64 MB each), two of which are identical:

Setup:

mkdir test
cd test
dd if=/dev/urandom of=test1.bin count=64k bs=1k
dd if=/dev/urandom of=test2.bin count=64k bs=1k
cp test{2,3}.bin
cd ..

Squashfs:

mksquashfs test/ test.squash
> test.squash - 129M

xz:

XZ_OPT='-v --memlimit-compress=6G --memlimit-decompress=512M --lzma2=preset=9e,dict=512M --extreme -T4 ' tar -cJvf test.tar.xz test/
> test.tar.xz - 193M
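
To get the files back out of the squashfs image later (a usage note, assuming unsquashfs from the same squashfs-tools package):

unsquashfs test.squash   # extracts the contents into ./squashfs-root/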

Izzy

Posted 2012-09-24T18:58:29.110

Reputation: 273

Does mksquashfs only find duplicates on file-level, or does it also work on smaller chunks? That is: Will it also compress slightly-different-but-mostly-the-same files? – Chaos_99 – 2016-05-19T12:31:56.280

This works, AFAIK, on a file basis only. You can see that by tarring those three test files into an uncompressed tar archive and then compressing the tar with mksquashfs. On the other hand, when it finds duplicates, mksquashfs reports Number of duplicate files found on stdout. – Izzy – 2016-05-19T12:56:17.057

2

In the case of a backup, possibly with a largish set of smaller files, one trick that might work for you is to sort the files going into the tar by extension:

find archive_dir -type f | rev | sort | rev | tar czf my_archive.tar.gz -I -
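
As the comments below point out, -I is probably not the intended tar option here; a corrected sketch (untested, using -T to read the file list from stdin) would be:

find archive_dir -type f | rev | sort | rev | tar -czf my_archive.tar.gz -T -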

user216110

Posted 2012-09-24T18:58:29.110

Reputation: 21

I'd cut out all the rev's (why even reverse and then sort?) and look at the sort option "-r, --reverse" (though I'm not sure why you'd even want any reverse). But I think your tar option "-I" does not do what you think it does "-I, --use-compress-program PROG", you probably want "-T, --files-from FILE" – Xen2050 – 2015-02-02T04:02:42.853

I believe the | tar czf my_archive.tar.gz -I - should be | xargs tar Azf my_archive.tar.gz – Olivier Dulac – 2017-06-01T12:51:35.657

@Xen2050, rev reverses the order of the characters in each line, not the line order in the stream. Because of this, sort groups the files by their extension. I suspect the -I - should have been -T -, which provides the file list on stdin. – billyjmc – 2018-02-14T04:31:07.060

@billyjmc I see, that rev would kind of arrange by extension, not that there are many extensions in linux anyway. I'd imagine sorting by size would have a higher chance of finding dup's – Xen2050 – 2018-02-21T03:37:34.383

1

As an addition to Mechanical snail's answer:

Even xz (or lzma) won't find duplicates if the size of the uncompressed file (or, more accurately, the distance between the duplicates) exceeds the dictionary size. Even on its highest setting, -9e, xz (or lzma) only reserves 64 MB for this.

Luckily, you can specify your own dictionary size with the option --lzma2=dict=256MB (only --lzma1=dict=256MB is allowed when using the lzma alias for the command).

Unfortunately, when overriding the settings with a custom compression chain as in the example above, the default values for all the other parameters are not set to the same level as with -9e, so compression density is not as high for single files.
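
A workaround sketch (relying on xz's option handling, where later settings in the chain override only the named parameter; worth verifying on your xz version) is to start the chain from the preset and raise only the dictionary:

xz --lzma2=preset=9e,dict=256MiB test.tar   # keeps the other -9e parameters, enlarges only the dictionary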

Chaos_99

Posted 2012-09-24T18:58:29.110

Reputation: 681

1

On my system, lzma test.tar results in a 1063175-byte (1.1 MB) test.tar.lzma file.

rmweiss

Posted 2012-09-24T18:58:29.110

Reputation: 624

-2

gzip with no command line switches uses the lowest possible algorithm for compression.

Try using:

gzip -9 test.tar

You should get better results

J Baron

Posted 2012-09-24T18:58:29.110

Reputation: 800

gzip with no command line switches uses the lowest possible algorithm for compression. => This is not true - "man gzip" states that "(t)he default compression level is -6 (that is, biased towards high compression at expense of speed)." This is true for all gzip versions I know of, unless the compiled-in default settings are overridden by the GZIP environment variable. Even level -9 won't help you here, as already explained in the given answers. – Gunter Ohrner – 2019-02-13T14:02:21.823

Not really, the difference is minimal. I also tried bzip2 with similar results. – Guido – 2012-09-24T19:15:45.463