I frequently need to compress files that are very similar to each other.
Currently I use 7Zip, which compresses a 16GB file down to 1.2GB in about 35 minutes using 8 cores with Ultra settings.
It seems to me that much of that time is spent computing the dictionary to use for compression. Since the files are highly similar, the dictionary actually used is likely also similar.
Is there a Windows-based compression tool (7Zip with an option I'm not aware of, or a different tool) that can save the dictionary and reuse that saved dictionary for subsequent files?
Is there a better way to approach the problem of maintaining a compression ratio similar to what I have, while compressing significantly faster?
With Huffman and a predefined dictionary, would just one pass be required? Are there any off-the-shelf Huffman-based tools that support saved dictionaries? – Eric J. – 2013-03-15T13:11:15.430
@EricJ. yes, with a predefined dictionary it would be single-pass encoding. I don't know of any software off-hand that can do this, although I have personally written programs that do. While I haven't tried it, this tool looks like it can do just that. Note, however, that (again, unlike LZW) to decode a Huffman-encoded bitstream, you still need the original dictionary. – Breakthrough – 2013-03-15T13:12:44.413

Based on the age of that tool, I'm guessing it is single-threaded. I would guess using 1 core rather than 8 would offset any benefit of a fixed dictionary :-( Having the dictionary available on the other end is feasible in my scenario (transferring large files between data centers). – Eric J. – 2013-03-15T13:28:33.173
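The idea from the comments above can be sketched in a few lines of Python: build the Huffman code table once from a representative sample (the expensive step), then reuse that fixed table for single-pass encoding of subsequent similar files. This is a minimal illustration, not production code — it assumes the sample contains every byte value that later files will use, and as Breakthrough notes, the decoder needs the same table.

```python
import heapq
from collections import Counter

def build_code_table(sample: bytes) -> dict[int, str]:
    """Build a Huffman code table from a representative sample.

    This is the costly step; saving and reusing the resulting table
    is the 'saved dictionary' idea discussed in the thread.
    """
    freq = Counter(sample)
    # Heap entries are (weight, tiebreak, tree); tree is a byte value
    # or a (left, right) pair. The tiebreak keeps comparisons on ints.
    heap = [(w, i, sym) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    i = len(heap)
    while len(heap) > 1:
        w1, _, t1 = heapq.heappop(heap)
        w2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, i, (t1, t2)))
        i += 1
    table: dict[int, str] = {}

    def walk(tree, prefix: str) -> None:
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            table[tree] = prefix or "0"  # degenerate single-symbol case

    walk(heap[0][2], "")
    return table

def encode(data: bytes, table: dict[int, str]) -> str:
    # Single pass: every byte is just a lookup in the fixed table.
    # Raises KeyError if data contains a byte absent from the sample.
    return "".join(table[b] for b in data)
```

Because the table is prefix-free, the receiver can decode greedily bit by bit — but only if it has the same table, which is why the dictionary must travel with (or already exist at) the other data center.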