Almost all modern archivers do exactly this; the only difference is that they refer to it as a "solid" archive, meaning all of the files are concatenated into a single stream before being fed to the compression algorithm. This differs from standard zip compression, which compresses each file individually and then adds each compressed file to the archive.
7-Zip, by its very nature, effectively achieves de-duplication. It will search for files and sort them by similar file types and file names, so two files of the same type and data end up side by side in the stream going to the compressor. The compressor then sees a lot of data it has seen very recently, and those two files get a large increase in compression efficiency compared to being compressed one by one.
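7-Zip's exact grouping heuristics aren't documented in the answer, but the idea of sorting files so that similar content sits adjacent in the solid stream can be sketched like this (the file names here are made up for illustration):

```python
# Toy sketch (not 7-Zip's actual heuristic): order files by extension,
# then by name, so files of the same type land next to each other
# in the solid stream fed to the compressor.
files = ["notes.txt", "logo.png", "readme.txt", "icon.png", "data.csv"]
solid_order = sorted(files, key=lambda p: (p.rsplit(".", 1)[-1].lower(), p.lower()))
print(solid_order)
# → ['data.csv', 'icon.png', 'logo.png', 'notes.txt', 'readme.txt']
```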
Linux has seen similar behaviour for a long time through the prevalence of its ".tgz" format (or ".tar.gz" in its full form), since tar simply merges all the files into a single stream (albeit without any sorting or grouping of files) which is then compressed with gzip. What this misses is the sorting that 7-Zip does, which may slightly decrease efficiency, but it is still a lot better than blobbing together a lot of individually compressed files the way zip does.
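The per-file-versus-solid difference can be measured directly with Python's standard library. This is only a sketch of the two strategies, using `zipfile` (DEFLATE per member, like zip) against `lzma` on a concatenated stream (like a solid archive); the file names and sizes are arbitrary:

```python
import io
import os
import lzma
import zipfile

# Two identical "files" of incompressible (random) data.
data = os.urandom(256 * 1024)  # 256 KB

# zip-style: each file is compressed independently and appended to the archive,
# so the second copy gains nothing from the first.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("a.bin", data)
    zf.writestr("b.bin", data)
zip_size = len(buf.getvalue())

# solid-style: the files are concatenated into one stream and compressed together,
# so the second copy is encoded as a reference back to the first.
solid_size = len(lzma.compress(data + data))

print(zip_size, solid_size)  # zip ~2x the data size, solid ~1x
```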
7-Zip does a fair job of deduplicating, but it's also designed to compress non-duplicate data efficiently, and it uses a lot of CPU and memory to achieve that, which makes it a very inefficient way to deduplicate data. If you compress two identical 100MB files, it will go to a lot of trouble trying to compress the first file efficiently, and only then (if the dictionary size is large enough) compress the second file as a duplicate of the first. – mwfearnley – 2016-10-22T22:08:18.060
Doesn't gzip with .tar.gz only compress relatively small blocks (like 900KB) at a time, completely independently from each other, and thus not have the ability to deduplicate two large but identical files (e.g., a couple of 4MB images)? – binki – 2018-06-11T20:15:25.280

E.g., 7z was able to dedupe between large files but gzip wasn't: https://gist.github.com/binki/4cf98fb4f1f4aa98ee4a00edaf6048fa – binki – 2018-06-11T20:40:04.590

Learn something new every day. I did not realize that zip compressed each file separately, but after running a couple of tests on my computer I realized that you are indeed correct. Very interesting, thank you! – CenterOrbit – 2011-05-20T21:39:14.357
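The limitation binki describes can be reproduced with the standard library: DEFLATE (the algorithm behind gzip) only has a 32 KB back-reference window, while LZMA's dictionary is megabytes by default, so only the latter can encode a large duplicate as a match. The sizes below are arbitrary choices for the demo:

```python
import os
import zlib
import lzma

data = os.urandom(1024 * 1024)  # 1 MB of random (incompressible) bytes
double = data + data            # two identical 1 MB "files" back to back

# DEFLATE's 32 KB window: by the time the second copy starts, the first
# copy is out of reach, so the duplicate cannot be deduplicated.
gz_size = len(zlib.compress(double))

# LZMA's default dictionary is several MB, so the entire second copy
# is encoded as one long match against the first.
xz_size = len(lzma.compress(double))

print(gz_size, xz_size)  # gz ~2 MB, xz ~1 MB
```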