Does recompiling a program produce a bit-for-bit identical binary?

25

5

If I were to compile a program into a single binary, make a checksum, and then recompile it on the same machine with the same compiler and compiler settings and checksum the recompiled program, would the checksum fail?

If so, why is this? If not, would having a different CPU result in a non-identical binary?

David

Posted 2013-08-31T23:54:15.557

Reputation: 451

8It depends on the compiler. Some of them embed time stamps, so the answer is "no" for those. – ta.speot.is – 2013-09-01T00:54:05.920

Actually it depends on the executable format, not the compiler. Some executable formats, like Windows’ PE format, include a timestamp that is set to the compile time and date, while other formats, like Linux’ ELF format, do not. Either way, this question hinges on the definition of “identical binary”. The image itself will/should be bitwise identical if the same source file is compiled with the same compiler, libraries, switches and everything else, but the header and other metadata can vary.

– Synetech – 2013-12-27T05:13:19.947
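To make the PE timestamp point concrete, here is a minimal Python sketch that locates the TimeDateStamp field in the COFF file header and blanks it before comparing two images. The function names are made up for illustration, and it touches only that one field; real PE files may carry other varying metadata that this does not handle.

```python
import struct


def pe_timestamp_offset(image: bytes) -> int:
    """Return the byte offset of the COFF TimeDateStamp field in a PE image."""
    if image[:2] != b"MZ":
        raise ValueError("not an MZ/PE image")
    # e_lfanew, at offset 0x3C of the DOS header, points at the PE signature.
    (e_lfanew,) = struct.unpack_from("<I", image, 0x3C)
    if image[e_lfanew:e_lfanew + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # Signature (4 bytes) + Machine (2) + NumberOfSections (2) come first.
    return e_lfanew + 8


def read_pe_timestamp(image: bytes) -> int:
    """Read the 32-bit little-endian link timestamp."""
    (stamp,) = struct.unpack_from("<I", image, pe_timestamp_offset(image))
    return stamp


def without_pe_timestamp(image: bytes) -> bytes:
    """Return a copy of the image with the timestamp field zeroed."""
    off = pe_timestamp_offset(image)
    return image[:off] + b"\0\0\0\0" + image[off + 4:]
```

Comparing `without_pe_timestamp(a) == without_pe_timestamp(b)` then answers whether two builds differ only in that header field.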

Answers

19

  1. Compile same program with same settings on same machine:

    Although the definitive answer is "it depends", it is reasonable to expect that most compilers will be deterministic most of the time, and that the binaries produced should be identical. Indeed, some version control systems depend on this. Still, there are always exceptions; it is quite possible that some compiler somewhere will decide to insert a timestamp or some such (iirc, Delphi does, for example). Or the build process itself might do that; I've seen makefiles for C programs which set a preprocessor macro to the current timestamp. (I guess that would count as being a different compiler setting, though.)

    Also, be aware that if you statically link the binary, then you are effectively incorporating the state of all relevant libraries on your machine, and any change in any one of those will also affect your binary. So it is not just compiler settings which are relevant.

  2. Compile same program on a different machine with a different CPU.

    Here, all bets are off. Most modern compilers are capable of doing target-specific optimizations; if this option is enabled, then the binaries are likely to differ unless the CPUs are similar (and even then, it's possible). Also, see the above note about static linking: the configuration environment goes far beyond the compiler settings. Unless you have very strict configuration control, it's extremely likely that something differs between the two machines.
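The check described in the question (build twice, checksum both outputs, compare) can be sketched in Python as below. The helper names and the choice of SHA-256 are assumptions for illustration; any cryptographic hash would serve.

```python
import hashlib


def sha256_of(path, chunk_size=1 << 16):
    """Hash a file in chunks so large binaries need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def builds_match(first, second):
    """True if the two build outputs are bit-for-bit identical."""
    return sha256_of(first) == sha256_of(second)
```

A mismatch only says that something differs, not what; it could be a timestamp, a changed library, or different code generation.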

rici

Posted 2013-08-31T23:54:15.557

Reputation: 3 493

1Say I was using GCC, and I wasn't using the march option (the option that optimizes the binary for a specific family of CPU's), and I was to compile a binary with one CPU, and then with another CPU would there be a difference? – David – 2013-09-01T17:46:27.727

1@David: It still depends. First, the libraries you're linking to may have architecture-specific builds. So the output of gcc -c may well be identical, but the linked versions differ. Also, it's not just -march; there is also -mtune/-mcpu and -mfpmath (and possibly others). Some of these may have different defaults on different installations, so you may need to force the worst-possible case for your machines explicitly; doing so might significantly reduce performance, particularly if you revert to i386 without sse. And, of course, if one of your cpus is an ARM and the other an i686... – rici – 2013-09-01T17:56:27.483

1Also, is GCC one of the compilers in question that add a timestamp to binaries? – David – 2013-09-01T17:58:32.670

@david: afaik, no. – rici – 2013-09-01T17:59:34.417

8

What you are asking is "is the output deterministic?" If you compiled the program once and then immediately compiled it again, you would probably end up with the same output file. However, if anything changed, even something small, and especially in a component the compiled program uses, then the output of the compiler might also change.

headkase

Posted 2013-08-31T23:54:15.557

Reputation: 1 690

2

Very good point indeed. This article has some very interesting observations. In particular, compilation with GCC may not be deterministic with regard to its inputs in certain cases, for instance in how it mangles functions in anonymous namespaces, for which it uses a random number generator internally. To get determinism in this particular case, supply an initial random seed by specifying the option -frandom-seed=string.

– ack – 2014-09-23T18:09:24.280

7

  • -frandom-seed=123 controls some GCC internal randomness. man gcc says:

    This option provides a seed that GCC uses in place of random numbers in generating certain symbol names that have to be different in every compiled file. It is also used to place unique stamps in coverage data files and the object files that produce them. You can use the -frandom-seed option to produce reproducibly identical object files.

  • __FILE__: put the source in a fixed folder (e.g. /tmp/build)

  • for __DATE__, __TIME__, __TIMESTAMP__:
    • libfaketime : https://github.com/wolfcw/libfaketime
    • override those macros with -D
    • -Wdate-time or -Werror=date-time: warn or fail if any of __TIME__, __DATE__ or __TIMESTAMP__ is used. The Linux kernel 4.4 uses it by default.
  • use the D flag with ar, or use https://github.com/nh2/ar-timestamp-wiper/tree/master to wipe stamps
  • -fno-guess-branch-probability: older manual versions say it is a source of non-determinism, but not anymore. Not sure if this is covered by -frandom-seed or not.
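As a rough sketch, some of the flags listed above could be collected into command lines like this. The Python function names are hypothetical, using the source path as the random seed is just one common choice, and the D modifier assumes GNU ar. The functions only build argument lists; they do not run anything.

```python
def gcc_compile_argv(source, obj):
    """Argument list for compiling one file with reproducibility flags."""
    return [
        "gcc",
        # Fixed per-file seed instead of GCC's internal random numbers.
        "-frandom-seed=" + source,
        # Fail the build if __DATE__, __TIME__ or __TIMESTAMP__ is used.
        "-Werror=date-time",
        "-c", source,
        "-o", obj,
    ]


def ar_archive_argv(archive, objects):
    """Argument list for GNU ar in deterministic mode (the D modifier)."""
    return ["ar", "rcD", archive, *objects]
```

Each list can be handed to `subprocess.run(argv, check=True)`. Running the build from a fixed directory, as suggested for `__FILE__`, is still up to the caller.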

The Debian Reproducible Builds project attempts to make Debian packages reproducible byte for byte, and recently got a Linux Foundation grant. That covers more than just compilation, but it should be of interest.

Buildroot has a BR2_REPRODUCIBLE option which may give some ideas on the package level, but it is far from complete at this point.


Ciro Santilli 新疆改造中心法轮功六四事件

Posted 2013-08-31T23:54:15.557

Reputation: 5 621

7

Does recompiling a program produce a bit-for-bit identical binary?

For all compilers? No. The C# compiler, at least, is not allowed to.

Eric Lippert has a very thorough breakdown on why the output of the compiler is not deterministic.

[T]he C# compiler by design never produces the same binary twice. The C# compiler embeds a freshly generated GUID in every assembly, every time you run it, thereby ensuring that no two assemblies are ever bit-for-bit identical. To quote from the CLI specification:

The Mvid column shall index a unique GUID [...] that identifies this instance of the module. [...] The Mvid should be newly generated for every module [...] While the [runtime] itself makes no use of the Mvid, other tools (such as debuggers [...]) rely on the fact that the Mvid almost always differs from one module to another.

Although it's specific to a version of the C# compiler, many points in the article can be applied to any compiler.

First off, we are assuming that we always get the same list of files every time, in the same order. But that's in some cases up to the operating system. When you say "csc *.cs", the order in which the operating system proffers up the list of matching files is an implementation detail of the operating system; the compiler does not sort that list into a canonical order.

ta.speot.is

Posted 2013-08-31T23:54:15.557

Reputation: 13 727

5The ECMA standard does not have to have timestamps or MVID differences. Without those, it is at least possible for identical binaries in C#. Thus the main reason is a questionable design decision and not a real technical constraint. – Shiv – 2015-01-30T05:37:53.527

It shouldn't be hard to make the build reproducible (apart from a few easily discarded fields like compilation time and the assembly GUID). For example, sorting input files into a canonical order is a one-liner. Even that GUID could be a hash of the remainder of the assembly instead of newly generated. – CodesInChaos – 2013-09-01T11:29:34.970

I assume you mean the Microsoft C# compiler, or is it a requirement of the specification? – David – 2013-09-01T17:56:27.957

@David The CLI spec requires it. Mono's C# compiler would have to do the same. Ditto for any VB .NET compiler. – ta.speot.is – 2013-09-01T21:15:52.197
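Two of the ideas floated in the comments above, sorting the input files into a canonical order and deriving the GUID from a hash of the contents, might look like this in Python. This is only an illustration of the idea, not how any C# compiler works; the function names are made up, and the GUID is simply the first 16 bytes of a SHA-256 digest.

```python
import hashlib
import uuid
from pathlib import Path


def canonical_sources(directory, pattern="*.cs"):
    """List matching files sorted by name, independent of OS listing order."""
    return sorted(Path(directory).glob(pattern), key=lambda p: p.name)


def content_guid(payload: bytes) -> uuid.UUID:
    """Derive a GUID from the payload so identical input gives identical IDs."""
    return uuid.UUID(bytes=hashlib.sha256(payload).digest()[:16])
```

With a content-derived ID, two modules still get different GUIDs whenever their bytes differ, which is the property that tools such as debuggers are said to rely on.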

3

The project https://reproducible-builds.org/ is all about this, and is trying hard to make the answer to your question "no, they will not differ" in as many places as possible. NixOS and Debian are now over 90% in reproducibility for their packages.

If you compile a binary, and I compile a binary, and they're bit-for-bit identical, then I can be reassured that the source code and the tools are what determine the output, and that you didn't sneak in some trojan code along the way.

If we combine reproducibility with bootstrappability from human-readable source, as http://bootstrappable.org/ is working on doing, we get a system determined from the ground up by human-readable source, and only then are we at a point where we can trust that we know what the system is doing.

clacke

Posted 2013-08-31T23:54:15.557

Reputation: 218

1Cool links. I'm a Buildroot fanboy, but if someone gives me a Nix ARM cross arch setup that boots on QEMU, I'll be happy :-) – Ciro Santilli 新疆改造中心法轮功六四事件 – 2019-06-04T16:45:03.770

I didn't mention Guix because I don't know where to find their numbers, but they were before NixOS on the reproducibility train with verification tooling and such, so I'm sure they're on equal footing or better. – clacke – 2019-07-07T08:46:01.300

3

I'd say NO, it is not 100% deterministic. I previously worked with a version of GCC which generates target binaries for the Hitachi H8 processor.

It is not a problem with the time stamp. Even if the time stamp issue is ignored, the specific processor architecture may allow the same instruction to be encoded in two slightly different ways, where some bits can be either 1 or 0. In my experience the generated binaries were the same MOST of the time, but occasionally gcc would generate binaries of identical size in which some bytes differed by a single bit, e.g. 0xE0 became 0xE1.
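To track down this kind of difference, a small helper that lists which bytes differ between two same-sized images, and by how many bits, can be useful. This is a Python sketch; the function name is hypothetical.

```python
def differing_bytes(a: bytes, b: bytes):
    """Return (offset, byte_in_a, byte_in_b, bits_differing) for each mismatch."""
    if len(a) != len(b):
        raise ValueError("images differ in size")
    return [
        # XOR leaves a 1 in every bit position where the two bytes disagree.
        (offset, x, y, bin(x ^ y).count("1"))
        for offset, (x, y) in enumerate(zip(a, b))
        if x != y
    ]
```

An entry whose last field is 1 corresponds to the single-bit case described above, such as 0xE0 versus 0xE1.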

JavaMan

Posted 2013-08-31T23:54:15.557

Reputation: 447

And did that lead to different behavior or "serious problems"? – Florian Straub – 2019-04-05T13:14:17.883

1

In general, no. Most reasonably sophisticated compilers will include the compile time in the object module. Even if you were to reset the clock you'd have to be very accurate with regard to when you kicked off the compile (and then hope that disk accesses, etc, were the same speed as before).

Daniel R Hicks

Posted 2013-08-31T23:54:15.557

Reputation: 5 783