I have hundreds of similar large files (30 MB each) that I want to compress. Each pair of files shares 99% of its data (less than 1% difference), so I expect the archive to be no more than 40-50 MB.
A single file can be compressed from 30 MB to 13-15 MB (with `xz -1`, `gz -1`, or `bzip2 -1`), but when compressing two or more files I want an archive of size 13-15 MB + N*0.3 MB, where N is the number of files.
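For reference, a minimal sketch of how the per-file baseline can be measured (the directory name is a placeholder):

```sh
# Compress each file separately and print the compressed size in bytes;
# with this data, each file lands around 13-15 MB.
for f in input_directory/*; do
    xz -1 -c "$f" | wc -c
done
```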
When using `tar` (to create a solid archive) and `xz -6` (to define a compression dictionary bigger than one file; update: this was not enough!), I still get an archive of size N*13 MB.
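The failing pipeline was presumably along these lines (directory name as in the update below):

```sh
# Solid tar stream piped through the default -6 preset of xz.
# Preset -6 uses an 8 MiB dictionary, smaller than one 30 MB file,
# so redundancy between files never falls within the compressor's window.
tar c input_directory | xz -6 > compressed.tar.xz
```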
I think that neither `gzip` nor `bzip2` will help me, because their dictionaries are smaller than 1 MB while my tar stream has repetitions every 30 MB.
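This is easy to confirm empirically, assuming two near-identical files from the set (names are hypothetical):

```sh
# gzip's DEFLATE window is 32 KiB (and bzip2 blocks are at most 900 kB),
# so a match 30 MB back in the stream can never be found.
cat file1.bin | gzip -1 | wc -c             # single-file baseline
cat file1.bin file2.bin | gzip -1 | wc -c   # roughly double, not baseline + 1%
```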
How can I solve my problem on modern Linux using standard tools? Is it possible to tune `xz` to compress fast, but use a dictionary bigger than 30-60 MB?
Update: Did the trick with `tar c input_directory | xz --lzma2=dict=128M,mode=fast,mf=hc4 --memory=2G > compressed.tar.xz`. Not sure about the necessity of the `mf=hc4` and `--memory=2G` options, but `dict=128M` sets the dictionary to be big enough (bigger than one file), and `mode=fast` makes the process a bit faster than `-e`.
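A quick sanity check on the result (archive name from the command above):

```sh
# xz --list reports compressed and uncompressed sizes plus the ratio;
# for N files the archive should be near 13-15 MB + N*0.3 MB, not N*13 MB.
xz --list compressed.tar.xz
```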
Running `xz -1 --memory=2G` did not help; tested on 2 and 4 files from the set. – osgx – 2014-03-18T19:41:50.053