Why are there binary differences among compressed files generated exactly the same way from the exact same starting file?

4

I use the "diff" command to compare two compressed files generated using zip on the exact same starting file and they are reported as being different. However, when I uncompress them and use the "diff" command, no differences are shown. I've noticed this with both zip and gzip.

Christopher Bottoms

Posted 2009-12-18T19:48:54.137

Reputation: 1 309

Answers

2

You might also like to use zdiff, which decompresses the files on the fly and compares their uncompressed contents, so header differences are ignored.
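
As a rough illustration of the same idea outside the shell, here is a minimal Python sketch that compares what two gzip files decompress to rather than their raw bytes. The helper name same_uncompressed_contents is made up for this example and is not part of zdiff or of Python's gzip module.

```python
import gzip


def same_uncompressed_contents(path_a: str, path_b: str) -> bool:
    """Return True if two gzip files decompress to identical bytes.

    Hypothetical helper for illustration: it reads both files fully into
    memory, so it is only suitable for files that fit in RAM.
    """
    with gzip.open(path_a, "rb") as a, gzip.open(path_b, "rb") as b:
        return a.read() == b.read()
```

Calling it with the paths of two archives made from the same input should return True even when a byte-for-byte comparison of the archives reports a difference.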

cyborg

Posted 2009-12-18T19:48:54.137

Reputation: 334

13

The two files differ in one of the gzip header fields. One such field is the last-modified time of the original file that was compressed (in seconds since 1970) or, if the data was not read from a file, the time at which compression took place.

Even a one second difference is enough to make the gzip files not match.
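
To make this concrete, here is a small sketch using Python's standard gzip module (not the gzip command itself) with two arbitrary timestamps one second apart. It assumes Python 3.8 or later, where gzip.compress accepts an mtime argument. The timestamp is stored as a little-endian 32-bit integer in bytes 4 to 7 of the header.

```python
import gzip
import struct

data = b"the exact same starting file\n" * 100

# Same input and same compression level; only the header timestamp differs.
# The two timestamps are arbitrary values one second apart.
first = gzip.compress(data, compresslevel=6, mtime=1261165734)
second = gzip.compress(data, compresslevel=6, mtime=1261165735)

# Byte offsets at which the two compressed streams differ.
diff_offsets = [i for i, (a, b) in enumerate(zip(first, second)) if a != b]

# Read the MTIME field back out of each header (bytes 4..7, little-endian).
mtime_first = struct.unpack("<I", first[4:8])[0]
mtime_second = struct.unpack("<I", second[4:8])[0]

print("identical archives:", first == second)
print("differing offsets:", diff_offsets)
print("header timestamps:", mtime_first, mtime_second)
print("identical contents:", gzip.decompress(first) == gzip.decompress(second))
```

The differing offsets fall inside the timestamp field, while the decompressed contents compare equal.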

Kevin Panko

Posted 2009-12-18T19:48:54.137

Reputation: 6 339

@ChristopherBottoms This article investigates the same phenomenon: https://medium.com/@mpreziuso/is-gzip-deterministic-26c81bfd0a49#.6l1qbp9me

– XCore – 2017-01-20T00:12:34.863

2

You can use the gzip option --no-name (or -n) to stop gzip from adding the original file name and the time stamp to the gzip header. That should prevent mismatches when the data is the same, assuming the same compression level is used. One way to apply this option to all gzip commands is to set the GZIP environment variable, so that every gzip command picks it up. For example, in a Bourne-compatible shell such as bash,

export GZIP="--no-name -6"

or

export GZIP=--no-name
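
For comparison, a similar effect can be sketched with Python's standard gzip module by writing an empty file name and a zero timestamp into the header. This only mirrors what --no-name does and is not how the gzip command is implemented. The function name gzip_no_name is invented for this example.

```python
import gzip
import io


def gzip_no_name(data: bytes, level: int = 6) -> bytes:
    """Compress data with no stored file name and a zeroed timestamp.

    Hypothetical helper for illustration only.
    """
    buf = io.BytesIO()
    # filename="" keeps the FNAME field out of the header; mtime=0 zeroes MTIME.
    with gzip.GzipFile(filename="", mode="wb", compresslevel=level,
                       fileobj=buf, mtime=0) as gz:
        gz.write(data)
    return buf.getvalue()


payload = b"the exact same starting file\n" * 100
run_one = gzip_no_name(payload)
run_two = gzip_no_name(payload)
print("identical archives:", run_one == run_two)
```

With the name and timestamp removed, repeated runs at the same compression level produce byte-identical output in the same environment.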

jrw32982 supports Monica

Posted 2009-12-18T19:48:54.137

Reputation: 145

2

Two possible causes:

  • different compression algorithm used by the same compression program, or
  • different compression programs

JMD

Posted 2009-12-18T19:48:54.137

Reputation: 4 427

2 I didn't think to add that the PKZip file spec, for example, includes data areas that are reserved for comments. It may be that gzip (etc.) is putting data in the comment locations that includes values like the date and time, which would cause the binary differences you're seeing. They wouldn't affect the data that is compressed, just the final compressed archive. – JMD – 2009-12-18T20:41:26.197
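
For the zip case, a small sketch with Python's standard zipfile module shows a date and time stored in the archive changing its bytes without touching the data. Here the value is each entry's last-modified field, which the zip format records alongside the entry, rather than a comment area. The helper names make_zip and read_entry and the entry name data.txt are invented for this example.

```python
import io
import zipfile


def make_zip(data: bytes, date_time: tuple) -> bytes:
    """Build an in-memory zip with one entry stamped with date_time."""
    buf = io.BytesIO()
    info = zipfile.ZipInfo("data.txt", date_time=date_time)
    info.compress_type = zipfile.ZIP_DEFLATED
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr(info, data)
    return buf.getvalue()


def read_entry(archive: bytes) -> bytes:
    """Return the uncompressed bytes of the data.txt entry."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return zf.read("data.txt")


data = b"the exact same starting file\n" * 100
# Same data, entry timestamps a few seconds apart.
zip_a = make_zip(data, (2009, 12, 18, 19, 48, 54))
zip_b = make_zip(data, (2009, 12, 18, 19, 49, 0))

print("identical archives:", zip_a == zip_b)
print("identical contents:", read_entry(zip_a) == read_entry(zip_b))
```

Giving both entries the same date_time makes the two archives compare equal in this sketch.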

Thanks. I did not think to mention that they were generated by the same program with the same command line options. I did not change any parameters when zipping the two files. +1 when I can. – Christopher Bottoms – 2009-12-18T20:42:13.590