Compressing many similar big files

I have hundreds of similar big files (30 megabytes each) that I want to compress. Every pair of files has 99% of its data in common (less than 1% difference), so I expect an archive of no more than 40-50 megabytes.

A single file can be compressed from 30 MB to 13-15 MB (with xz -1, gz -1, or bzip2 -1), but when compressing two or more files I want an archive of size 13-15 MB + N*0.3 MB, where N is the number of files.

When using tar (to create a solid archive) and xz -6 (to make the compression dictionary bigger than one file - update: this was not enough!), I still get an archive of size N*13 MB.

I think that neither gzip nor bzip2 will help me, because their dictionaries are smaller than 1 MB and my tar stream has repetitions every 30 MB.

How can I achieve this on modern Linux using standard tools?

Is it possible to tune xz to compress fast but use a dictionary bigger than 30-60 MB?

Update: Did the trick with tar c input_directory | xz --lzma2=dict=128M,mode=fast,mf=hc4 --memory=2G > compressed.tar.xz. I am not sure whether the mf=hc4 and --memory=2G options are necessary, but dict=128M makes the dictionary big enough (bigger than one file), and mode=fast makes the process a bit faster than -e.
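
For reference, the full round trip looks roughly like this (input_directory and the archive name are just placeholders):

    # compress: solid tar stream piped into xz with a 128 MiB dictionary
    tar c input_directory | xz --lzma2=dict=128M,mode=fast,mf=hc4 --memory=2G > compressed.tar.xz

    # decompress (GNU tar calls xz itself via -J)
    tar xJf compressed.tar.xz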

osgx

Posted 2014-03-18T19:35:48.463

Reputation: 5 419

Running xz -1 --memory=2G did not help, tested on 2 and 4 files from the set. – osgx – 2014-03-18T19:41:50.053

Answers

12

Given your details, I assume that you have verified that your files really have 99% of data in common, with a contiguous (or almost contiguous) 1% of difference in them.

First, you should use tar to make one archive with your files inside it. For tests, I would create a .tar with 10 files, giving a size of about 300 MB.

Then, using xz, you have to set it so that the dictionary is bigger than the size of one file. Since you don't say if you have memory restrictions, I'd go with xz -9. There's no point in not using all available memory.

I'd also use the --extreme preset, to test whether it makes a difference.

Dictionary size

In one piece of documentation I have available - site - it says that the dictionary size is roughly equal to the decompressor memory usage, and that the -1 preset means a 1 MiB dictionary while -6 means 10 MiB (or 8 MiB in another part of the same manual). That's why you're not getting any advantage from tarring those files together. Using -9 would make the decompressor memory usage (and so the dictionary) 64 MiB, and I think that is what you want.
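
For instance, a rough sketch (file names are illustrative, and the 192 MiB dictionary is just an arbitrary value larger than one input file):

    # tar the files first, then compress with the largest preset (64 MiB dictionary)
    tar cf test.tar file01 file02 file03     # ... up to the 10 test files, ~300 MB
    xz -9 --extreme --keep test.tar          # keeps test.tar, writes test.tar.xz

    # alternatively, request the dictionary size explicitly
    xz --keep --lzma2=preset=9e,dict=192MiB test.tar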

Edit

Another possibility would be using another compressor. I'd go with 7-Zip, but I would tar those files first and then 7-Zip the result.

Depending on your files' content, perhaps you could use 7-Zip with the PPMd method (instead of LZMA or LZMA2, which is the default and the same algorithm used by xz).
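
A rough sketch of that, assuming the 7z command-line tool is installed (archive and directory names are placeholders):

    # tar first so the similar files form one solid stream, then compress with PPMd
    tar cf files.tar input_directory
    7z a -m0=PPMd archive.7z files.tar    # without -m0=PPMd, 7z would use its default LZMA2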

Not good: Zip (dict = 32 kB), bzip2 (dict = 900 kB).

woliveirajr

Posted 2014-03-18T19:35:48.463

Reputation: 3 820

Xz and 7-Zip both use LZMA2, so there'd be no benefit there. PPMd is optimized for extremely slow but high-compression-rate entropy extraction from already-compressed media (e.g. MP3s and video). It is not particularly likely to find the large similarities between the two files and store them in the dictionary -- no more likely than LZMA2. – allquixotic – 2014-03-18T20:29:58.080

woliveirajr, what about not using the -1 or -9 presets, but instead specifying dict=64MB or dict=128MB and setting mode=fast? – osgx – 2014-03-18T20:39:52.377

Using dict=xxMB instead of -1 or -9 would go straight to the point, but since I don't know how xz sets the other parameters when you just use -9, I don't know whether you'd miss something else. I think you're heading in the right direction, and only testing will give you a precise answer. – woliveirajr – 2014-03-18T20:43:43.783

With xz --lzma2=dict=128M,mode=fast,mf=hc4 --memory=2G I was able to compress 250 files (7.5 GB) to an 18 MB tar.xz archive. – osgx – 2014-03-18T20:48:30.273

@osgx :) that's pretty nice. If it didn't take too much time (i.e., it's within your needs), problem solved! :) So you got final_size = 13MB + x * 6kB, more or less. – woliveirajr – 2014-03-18T20:58:34.630

9

If they are truly 99% similar as you say, you should be able to use bsdiff or a similar algorithm to calculate differences between the files. Is the difference cumulative (i.e., each file differs a little more from the first), or is the difference between any two files pretty much the same?

If it's not cumulative, you should be able to:

  • Take any arbitrary file as the "baseline"
  • Run bsdiff comparing the baseline file to each additional file
  • Store each diff as a separate file, alongside the baseline file
  • Run a compressor like xz across the results (the baseline + the diffs).

The result should be much smaller than just xzing the entire archive.

You can then "reconstitute" the original files by "applying" the diff on top of the baseline to get each of the other files out.
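
A minimal sketch of that workflow, assuming the files share a naming pattern and bsdiff/bspatch are installed (all names here are placeholders):

    # pick an arbitrary baseline and store a binary diff for every other file
    baseline=file001.bin
    mkdir -p diffs
    for f in file*.bin; do
        [ "$f" = "$baseline" ] && continue
        bsdiff "$baseline" "$f" "diffs/$f.bsdiff"
    done

    # compress the baseline together with the (tiny) diffs
    tar c "$baseline" diffs | xz -9 > archive.tar.xz

    # later, reconstitute any file from the baseline and its diff
    bspatch "$baseline" file002.restored "diffs/file002.bin.bsdiff"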

allquixotic

Posted 2014-03-18T19:35:48.463

Reputation: 32 256

Not cumulative. ("Every pair of files has 99% of its data in common...") – osgx – 2014-03-18T20:40:43.723

If the differences aren't cumulative, then this should be a good application of the bsdiff algorithm. Give it a try. – allquixotic – 2014-03-18T20:48:39.083

Thank you for your answer, but I already did the task with xz: tar c directory|xz --lzma2=dict=128M,mode=fast, and deleted the input files. Actually my input files were text, so I could even use diff instead of bsdiff (which is not installed on my PC). – osgx – 2014-03-18T20:53:17.153

5

You (I) may use tar together with an archiver capable of long-range pattern detection, for example rzip or lrzip (Readme). Both use long-range redundancy detection/deduplication; rzip then compresses with bzip2, while lrzip uses xz (LZMA) or ZPAQ:

rzip is a compression program, similar in functionality to gzip or bzip2, but able to take advantage of long distance redundancies in files, which can sometimes allow rzip to produce much better compression ratios than other programs. ... The principal advantage of rzip is that it has an effective history buffer of 900 Mbyte. This means it can find matching pieces of the input file over huge distances compared to other commonly used compression programs. The gzip program by comparison uses a history buffer of 32 kbyte and bzip2 uses a history buffer of 900 kbyte.

lrzip has a larger buffer and can apply one of several compression algorithms (very fast, fast, good, and one of the best - ZPAQ) after deduplication:

Lrzip uses an extended version of rzip which does a first pass long distance redundancy reduction. The lrzip modifications make it scale according to memory size.

The data is then either: 1. Compressed by lzma (default) which gives excellent compression at approximately twice the speed of bzip2 compression ...
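
For example (names are placeholders; as far as I remember, lrzip uses the LZMA backend by default and -z switches to ZPAQ):

    # tar first, then let lrzip do the long-range deduplication pass
    tar cf big.tar input_directory
    lrzip big.tar          # writes big.tar.lrz using the LZMA backend
    # lrzip -z big.tar     # slower, but uses the stronger ZPAQ backend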

Another way is to use bup - a backup program with block-/segment-level deduplication, based on git packfiles:

It uses a rolling checksum algorithm (similar to rsync) to split large files into chunks.
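
A minimal sketch of using bup this way (directory and branch names are placeholders):

    # create a bup repository, index the files, then save a deduplicated snapshot
    bup init
    bup index input_directory
    bup save -n big-files input_directory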

osgx

Posted 2014-03-18T19:35:48.463

Reputation: 5 419