Is there a compression or archiver program for Windows that also does deduplication?

12


I'm looking for an archiver program that can perform deduplication (dedupe) on the files being archived. Upon unpacking the archive, the software would restore any duplicate files it removed during compression.

So far I've found:

Anyone aware of any others?

This would probably be an awesome addition to 7-zip.

Larry Silverman

Posted 2011-05-20T20:37:16.640

Reputation: 253

Question was closed 2016-07-12T05:15:13.713

Answers

12

Almost all modern archivers do exactly this; the only difference is that they call it a "solid" archive, meaning all of the files are concatenated into a single stream before being fed to the compression algorithm. This is different from standard zip compression, which compresses each file individually and adds each compressed file to the archive.

7-Zip effectively achieves deduplication by its very nature. It sorts the files it finds by similar file types and file names, so two files of the same type and content end up side by side in the stream going to the compressor. The compressor then sees a lot of data it has seen very recently, and those two files compress far more efficiently than they would one by one.

Linux has shown similar behaviour for a long time through the prevalence of the ".tgz" format (or ".tar.gz", to use its full form): tar simply merges all the files into a single stream (albeit without sorting and grouping them) which is then compressed with gzip. What this misses is the sorting that 7-Zip does, which may slightly decrease efficiency, but the result is still far better than simply blobbing together a lot of individually compressed files the way zip does.
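A rough illustration of the solid-versus-per-file effect, using Python's standard zipfile and lzma modules as stand-ins for the archivers above (this is a sketch of the principle, not how 7-Zip itself is implemented; the file names and sizes are made up):

```python
import io
import lzma
import os
import zipfile

# Two "files" with identical contents; random data, so the only available
# saving is the duplication itself.
payload = os.urandom(5 * 1024 * 1024)          # 5 MB
files = {"a.bin": payload, "b.bin": payload}

# Classic zip: every member is compressed independently, so the duplicate
# content is compressed and stored twice.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", compression=zipfile.ZIP_DEFLATED) as zf:
    for name, data in files.items():
        zf.writestr(name, data)

# "Solid"-style: concatenate everything into one stream first, then compress.
# LZMA's dictionary (8 MiB at preset 6) spans both copies, so the second one
# collapses into a cheap back-reference.
solid = lzma.compress(b"".join(files.values()), preset=6)

print("per-file zip :", len(zip_buf.getvalue()))   # roughly 10 MB
print("solid stream :", len(solid))                # roughly 5 MB
```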

Mokubai

Posted 2011-05-20T20:37:16.640

Reputation: 64 434

7-Zip does a fair job of deduplicating, but it's also designed to compress non-duplicate data efficiently, and it uses a lot of CPU and memory to achieve that, which makes it a very inefficient way to deduplicate data. If you compress two identical 100 MB files, it will go to a lot of trouble to compress the first file efficiently, and only then (if the dictionary size is large enough) compress the second file as a duplicate of the first. – mwfearnley – 2016-10-22T22:08:18.060

Doesn't gzip (as used in .tar.gz) only look back over a small window (DEFLATE's 32 KB), and thus lack the ability to deduplicate two large but identical files (e.g., a couple of 4 MB images)? – binki – 2018-06-11T20:15:25.280

E.g., 7z was able to dedupe between large files but gzip wasn't: https://gist.github.com/binki/4cf98fb4f1f4aa98ee4a00edaf6048fa – binki – 2018-06-11T20:40:04.590
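A minimal sketch of that point with Python's zlib (DEFLATE, the algorithm gzip uses) and lzma modules on two identical ~4 MB blobs; the sizes in the comments are illustrative, not taken from the gist:

```python
import lzma
import os
import zlib

# Two identical ~4 MB blobs back to back, standing in for two copies of an image.
blob = os.urandom(4 * 1024 * 1024)
data = blob + blob

# DEFLATE can only look back 32 KB, so by the time it reaches the second copy,
# the first one is long gone from its window.
deflated = zlib.compress(data, 9)

# LZMA's default dictionary at preset 6 is 8 MiB, which easily covers the
# distance back to the first copy.
xz = lzma.compress(data, preset=6)

print("zlib/deflate:", len(deflated))   # close to 8 MB: no dedupe
print("lzma        :", len(xz))         # close to 4 MB: second copy deduped
```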

Learn something new every day. I did not realize that zip compressed each file separately, but after running a couple of tests on my computer I realized that you are indeed correct. Very interesting, thank you! – CenterOrbit – 2011-05-20T21:39:14.357

4

There is no point in using deduplication with a compression process. Most compression algorithms build what is called a "dictionary" of the most common or reused bits of data. From there they simply reference the dictionary entry instead of writing the whole "word" out again. In this way most compression processes already cut redundant or duplicate data out of all the files.

For example, if you take a 1 MB file and copy it 100 times with a different name each time (totalling 100 MB of disk space) and then compress it into a 7-Zip or zip file, you will end up with roughly a 1 MB archive. This is because all of your data was put into one dictionary entry and referenced 100 times, which takes up very little space.

This is a very simplified explanation of what happens, but the point still comes across.
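A hedged sketch of that idea with Python's lzma module, compressing the repeated data as one solid stream (note the caveat in the comments below: this only works while the repeats stay within the compressor's dictionary, and per-file zip compression would not benefit in the same way):

```python
import lzma
import os

# One 1 MB chunk, copied 100 times: 100 MB of input, but only 1 MB of it unique.
chunk = os.urandom(1024 * 1024)
data = chunk * 100

# Compressed as a single solid stream. Every repeat lies only 1 MB behind the
# current position, well inside the dictionary, so each one collapses into a
# short back-reference.
compressed = lzma.compress(data, preset=6)

print(len(data), "->", len(compressed))   # ~100 MB in, roughly 1 MB out
```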

CenterOrbit

Posted 2011-05-20T20:37:16.640

Reputation: 1 759

As the dictionary size is very limited for most compression formats, this does not hold in everyday use. Try it with 50 MB files and your compressed size will double with two identical input files. – Chaos_99 – 2016-05-19T08:37:00.307

Zip files, unlike 7-Zip files, don't support deduplication across files. Zip compresses and stores each file separately, so duplicate files will simply be stored multiple times in the archive. – mwfearnley – 2016-10-22T21:58:40.313

While 7-Zip does support deduplication across files, it is designed to find and compress much shorter matches. Its algorithms are a lot slower and more memory-intensive than what is potentially possible for something designed to find large-scale data duplication. – mwfearnley – 2016-10-22T22:01:46.123

4

7-Zip, zip, gzip and all other archivers do not detect identical regions that lie far apart from each other (a few megabytes or more), whether inside the same file or at different positions inside different files.

So no, normal archivers do not perform as well as exdupe and similar tools in some situations. You can see this if you compress virtual machine images or other large data sets.
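A small sketch with Python's lzma module showing the effect: the same duplicated region is only found when the compressor's dictionary is large enough to span the gap between the copies (the dict_size values and data sizes here are arbitrary choices for illustration):

```python
import lzma
import os

# Two identical 4 MB regions separated by 16 MB of unrelated data.
dup = os.urandom(4 * 1024 * 1024)
filler = os.urandom(16 * 1024 * 1024)
data = dup + filler + dup

def xz_size(dict_size):
    # A raw LZMA2 filter chain with an explicit dictionary size.
    filters = [{"id": lzma.FILTER_LZMA2, "dict_size": dict_size}]
    return len(lzma.compress(data, format=lzma.FORMAT_XZ, filters=filters))

print("1 MiB dictionary :", xz_size(1 * 1024 * 1024))    # ~24 MB: duplicate missed
print("32 MiB dictionary:", xz_size(32 * 1024 * 1024))   # ~20 MB: duplicate found
```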

Ian

Posted 2011-05-20T20:37:16.640

Reputation: 41

This is correct. As soon as the unique data volume exceeds the compressor's dictionary size, compression goes down the drain. exdupe offers superior performance for large data volumes. – usr – 2011-12-09T21:46:33.207