Why is my .tar.gz larger than the sum of the separately compressed files in it?


I observed the following, which I did not expect:

I have a csv file and a corresponding txt file. Uncompressed, their sizes are 375MB and 5KB.

  • When I compress the csv file using gzip with standard settings, its size is reduced to 95MB. So together I have ~95MB.
  • When I bundle both files in a tarball and then compress them with gzip standard settings, I end up with 189MB.

From what I know, the compressed tarball should, if anything, be smaller than the compressed csv file plus the txt file, because gzip can then search for redundancies across all files in the archive. I know that this does not matter for my specific case since the txt file is so small.

However, shouldn't the .tar.gz be about the same size as the compressed csv + txt file? In my case it's more than twice the size...
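To make that expectation concrete, here is a minimal Python sketch that compares the two approaches on in-memory data. The helper names `gzip_size` and `tar_gz_size` and the sample file names are made up for illustration, and Python's `gzip` and `tarfile` modules default to compression level 9 whereas the `gzip` command line defaults to level 6, so absolute sizes will differ from a shell session.

```python
import gzip
import io
import tarfile
from typing import Mapping


def gzip_size(data: bytes) -> int:
    """Size in bytes of `data` after gzip compression (Python default level)."""
    return len(gzip.compress(data))


def tar_gz_size(files: Mapping[str, bytes]) -> int:
    """Size in bytes of a gzip-compressed tarball holding the given files."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    # The archive is only complete once the context manager has closed it.
    return len(buf.getvalue())


if __name__ == "__main__":
    # Hypothetical stand-ins for the large csv and the small txt file.
    csv_data = "\n".join(f"{i},{i * i},{i % 7}" for i in range(50000)).encode()
    txt_data = b"column notes\n" * 20

    separate = gzip_size(csv_data) + gzip_size(txt_data)
    bundled = tar_gz_size({"data.csv": csv_data, "notes.txt": txt_data})
    print("compressed separately:", separate, "bytes")
    print("compressed as tar.gz: ", bundled, "bytes")
```

With data like this the two numbers should come out close to each other, so a tarball twice the expected size points at its contents rather than at gzip.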

I would like to avoid several layers of archiving / compressing but still want to achieve good compression. Am I missing something?

der_grund

Posted 2018-11-15T09:42:31.400

Reputation: 121

It's almost as if the 95 MB file got included twice - have you confirmed that did not happen? – Andrew Morton – 2018-11-15T09:53:02.913

We need a record of your session to understand what happened. – harrymc – 2018-11-15T09:58:58.697

@AndrewMorton You were right. I created the archive in a script, aiming to bundle three files. I actually did put three files in the archive, but instead of another small text file the regular expression found the already compressed csv so it ended up twice in the archive. I only checked for three files, but I missed that the wrong one was in there. Thanks for making me look twice! – der_grund – 2018-11-16T08:14:01.447
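A check that would have caught this could look like the following Python sketch. It lists the regular files in a tarball and flags any member that already starts with the gzip magic bytes `1f 8b`. The function name `audit_tarball` is hypothetical, and the original script is not shown in the thread, so this is only one possible way to inspect the archive.

```python
import os
import tarfile

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def audit_tarball(source):
    """Return (name, size, looks_gzipped) for each regular file in a tarball.

    `source` may be a filesystem path or an open binary file object.
    """
    if isinstance(source, (str, os.PathLike)):
        tar = tarfile.open(source, mode="r:*")
    else:
        tar = tarfile.open(fileobj=source, mode="r:*")

    report = []
    with tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, links, and other special entries
            head = tar.extractfile(member).read(2)
            report.append((member.name, member.size, head == GZIP_MAGIC))
    return report


if __name__ == "__main__":
    import sys

    # Usage: python audit.py archive.tar.gz
    for name, size, gzipped in audit_tarball(sys.argv[1]):
        note = "  <- already gzip-compressed" if gzipped else ""
        print(f"{size:>12}  {name}{note}")
```

Checking names and sizes, not just the member count, is the point: an archive with the right number of files can still contain the wrong ones.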

No answers