16

Two separate discussions have very recently opened my eyes to an issue I had not considered: how to confirm that the Open Source binary one uses was actually built from the published source code.

There is a large discussion thread on the cryptography-randombit mailing list, based on an open letter from Zooko Wilcox-O'Hearn, founder and CEO of LeastAuthority.com, to Phil Zimmermann and Jon Callas, two of the principals behind Silent Circle, the company that ran Silent Mail; that thread touched on the subject. Additionally, a Dr. Dobb's article published today, entitled Putting Absolutely Everything in Version Control, touched on it as well.

The issue of concern for this question is the ability to recompile Open Source code and get the same result as the published binary. In other words, if you rebuild the binary from the source code and hash it, the result is unlikely to be identical to the published binary, due to differences in tool chains and some randomization in the compilers themselves.
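To illustrate the check I have in mind, the naive attempt would look roughly like this (the repository URL, file names, and build command are only placeholders):

$ sha256sum vendor/tool-1.0-linux-amd64       # hash of the binary the project publishes
$ git clone https://example.org/tool.git && cd tool
$ make                                        # build with whatever tool chain I happen to have
$ sha256sum ./tool                            # in practice this almost never matches the published hash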

The Dr. Dobb's article suggests putting even the tool chain under version control for reasons of reproducibility. Jon Callas points out that in many cases it may be impossible to redistribute the tool chain, for various reasons including licensing restrictions. Unless you compile the code yourself, you are adding a trust step to your assumption set, as the binary cannot even be recreated by others with the same result.

I now understand that this is a generally accepted risk. My question is: are there other discussions or initiatives related to making source code byte-for-byte reproducible when compiled, thus eliminating the need to trust the provider of even Open Source binaries? As referenced in Jon Callas’ discussion, Ken Thompson showed that “You can't trust code that you did not totally create yourself.” What are your thoughts on the security implications of this subject?

Gilles 'SO- stop being evil'
  • 50,912
  • 13
  • 120
  • 179
zedman9991
  • 3,377
  • 15
  • 22
  • 7
    I'd like to point out that if you're compiling the program yourself to check against the published executable, you might as well use your own version instead. That way you save yourself some time and can't make a mistake. (Oh and also, comparing two files byte by byte is faster than comparing their hashes) – LS97 Jul 30 '14 at 19:57
  • @LS97: Strictly speaking, comparing hashes actually can be faster assuming random access is expensive and each file is stored contiguously on disk, since reading both files in parallel while comparing them incurs random access while hashing allows reading one after the other in series. – R.. GitHub STOP HELPING ICE Jul 31 '14 at 04:31
  • @R.. that's true, speed of hashing vs byte comparison depends on a lot of factors. If you have a large enough memory, for example, you can still load both serially. Or, you do it in parallel and hope to find a difference early on in a big file (and then stop there). Also, byte-by-byte is actually the only feasible method in a computer with tiny RAM (I'm talking a few kilobytes). – LS97 Jul 31 '14 at 08:25

8 Answers

24

It's not that simple.

With the huge number of platforms on which the program could have been built, it can be extremely difficult to replicate the original build environment: you could be using a different compiler, with different settings, and different versions of libraries. These slight variations in the environment can definitely affect the compiled binary. Of course, if the author is willing to specify their build environment precisely, or if you're lucky (the choice of language can affect this), it may be possible to rebuild the exact same binary.

For a recent situation in which this was an issue, see TrueCrypt, an open-source-ish0 full-disk encryption program. When the TrueCrypt site was abruptly replaced with an announcement declaring the unexpected end of the TrueCrypt project, people were obviously interested in checking the code. However, different people building TrueCrypt often got binaries that differed wildly from the official build, owing to variations in the build environment. One person apparently managed (after some arduous work in recreating something very close to the original environment) to replicate the TrueCrypt build from scratch with only slight variations in the compiled output.1 Of course, it's not possible to verify this yourself unless you're willing to attempt the same thing.

Of interest on that page is the fact that the binary contains a timestamp of the compile time. This alone means that compiling and comparing the hashes would never work.
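As a rough illustration (not specific to TrueCrypt; the binary name is a placeholder), you can see this kind of difference by building twice yourself and comparing the results byte by byte; the handful of differing bytes often turn out to be an embedded timestamp:

$ make clean && make && cp program build1
$ make clean && make && cp program build2
$ cmp -l build1 build2 | head                 # offsets and values of the bytes that differ
$ diff <(xxd build1) <(xxd build2) | head     # the same differences, shown as hex dumps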

0: TrueCrypt has a strange license with some issues; it's not certain whether it would actually be safe to fork the project.

1: Actually, it looks like they did this before the TrueCrypt site strangeness, but have since succeeded in replicating the version 7.2 build as well.

Lily Chung
  • 968
  • 1
  • 9
  • 13
  • 1
    You did choose a wise example with TrueCrypt. – Marcel Jul 31 '14 at 07:05
  • 3
    The Tor Project has a very interesting blog post about [Deterministic Builds](https://blog.torproject.org/blog/deterministic-builds-part-two-technical-details) that explains some of the troubles with this. – Konerak Jul 31 '14 at 07:15
16

If you compile the code yourself, then you may obtain the same binary. Or not. Basically, your chances are good if the compiler uses deterministic optimization algorithms (that's the usual case) and you use the exact same compiler version with the same command-line options (that's usually much harder to ensure).

Deterministic re-compilation is easier with programming frameworks where the "compiled" format is formally specified and not heavily optimized. I am talking here about Java bytecode or .NET assemblies. When using such tools, recompiling the source code and obtaining the same binary is possible, though hard. With C or C++, forget it.
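For instance (compiler names, versions, flags and file names here are purely illustrative), "same compiler version, same options" means recording and replaying something like this:

$ gcc --version                          # record the exact compiler version that was used
$ gcc -O2 -o prog prog.c                 # and the exact command-line options
$ sha256sum prog
$ javac Main.java && sha256sum Main.class   # Java bytecode: better odds of a byte-identical output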

The usual methods are:

  • Compile yourself.
  • Have some trusted third-party do the compilation. That third-party will get a copy of the source, perform the compilation on their machines, and sign (with cryptography or with paper) both the source archive and the produced binary (see the verification sketch after this list).
  • Have the provider of the binary sign the binary, and trust that reverse engineering will be feasible enough to demonstrate foul play if need be (here again, this is much more plausible when talking about Java bytecode than about compiled C code).
  • Don't use external software; reimplement everything in-house (and yeah, this is a usual method, which is not the same as recommended).
  • Just go ahead and trust in your good luck (certainly not a recommended method, but surely the cheapest in the short term).
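For the "trusted third-party" or "provider signature" options above, the verification step might look roughly like this with GnuPG (key and file names are hypothetical):

$ gpg --verify tool-1.0.tar.gz.sig tool-1.0.tar.gz              # signed source archive
$ gpg --verify tool-1.0-linux-amd64.sig tool-1.0-linux-amd64    # signed binary produced from it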

Note that (re)compiling code also requires that the machine on which the compilation takes place is not under hostile control. This very classic essay is a must-read on the subject. The underlying idea is that your trust must still start somewhere (if only in the hardware itself, whose firmware is assumed to be malware-free) so the best you can really do is to maintain a clear audit trail. Such a trace does not guarantee against backdoor insertion, but can help a lot in assigning blame and responsibility when trouble arises.

Gilles 'SO- stop being evil'
  • 50,912
  • 13
  • 120
  • 179
Thomas Pornin
  • 320,799
  • 57
  • 780
  • 949
  • 3
    If one had a cross-compiler for the target system which was designed to produce fully deterministic output, one inspected the code of that compiler to ensure there was nothing evil in it, and compiling an open-source program on multiple independent systems yielded identical results, it wouldn't be necessary to know that any particular machine was trustworthy--merely that there's no plausible mechanism by which they could all be infected the same way. – supercat Jul 30 '14 at 18:30
14

If you can recompile the source code and obtain your own binary, then maybe you won't be able to get the exact same binary as the one that is distributed; but why would it matter? At that point, you have your own binary, which necessarily matches the source code (assuming your compiler is not itself malicious): you can just ditch the binary package and use your own binary.

In other words, situations where you would be able to verify the compilation output are situations where you can compile yourself, making the verification a moot point.

There are package distribution frameworks out there, which rely on source code distribution and local compilation instead of binary packages; e.g. pkgsrc (the native system for NetBSD) or MacPorts (for MacOS X machines). However, they don't do that for trust or security, but because distribution of binary packages involves build systems somewhere, and these are not free; also, one point of pkgsrc is to provide easy management of local compilation options.

The famous Thompson essay highlights the idea that even making your own compilation is not enough. Taken to the extreme, you should write your own code, but also your own compiler, and run that on hardware which you designed and engraved yourself: you cannot trust the machine unless you started with a bucket of sand (for silicon, the main component of semiconductors). This is, of course, quite impractical. Therefore, we need the second best thing, and that second best is a paradigm shift: replace trust with violence.

What we do is that binary packages are signed. The package installer verifies the signature before installing the package, and rejects packages which do not come from "trusted sources". The same concept applies to Java applets, which can be granted extra permissions (and, indeed, permission to do whatever they want with your computer) provided that they are signed. Note that this is indeed a signature, not just authentication; it is not sufficient (nor indeed necessary) that the package was downloaded from a "trusted repository" through HTTPS. Such a download would give you quite a strong guarantee that the package comes from whom you believe, and has not been modified in transit. But you want more: you want a proof. You want a signature because IF the package turns out to be chock-full of malware, THEN you can use the signature to demonstrate that the package provider was an accomplice, at least "by negligence". From signatures comes responsibility, and responsibility works on fear. Fear of litigation from abused customers. Fear of retaliation from law enforcement agencies. Ultimately, fear of violence.
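Concretely, this is the kind of check the tooling performs for you (package and jar names are hypothetical):

$ rpm --checksig some-package.rpm            # an RPM-based installer verifies the package signature
$ jarsigner -verify -verbose Applet.jar      # a signed Java applet or jar can be checked the same way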

Tom Leek
  • 168,808
  • 28
  • 337
  • 475
  • You had me at melting the sand. – zedman9991 Sep 04 '13 at 00:37
  • 8
    How can you guarantee the sand has not been compromised? –  Sep 04 '13 at 01:05
  • Don't care! You can make a correct compiler out of compromised sand! Just have to check that the compiler compiled with this compiler gives… itself :). – dan Jul 31 '14 at 20:54
  • @danielAzuelos, disregarding the sand joke... a compromised compiler could compile itself (with the compromised part added). – domen Aug 01 '14 at 08:52
  • → domen: unfortunately you're right :(. The first C compiler was handwritten in PDP-7 assembler. If such machine code had been introduced in that era, so as to be repeatedly embedded within every newly compiled compiler, there would still be fossils of that machine code today in *any* C compiler in the world, exactly as there is a trace of our reptilian brain embedded in every human brain. – dan Aug 01 '14 at 09:33
11

Yes, it is possible, but it is very hard, as the whole compilation process was not designed with that goal in mind. It is often called "deterministic builds", "reproducible builds", or "idempotent builds", and it is a challenge.

Bitcoin, Tor, and Debian are attempting to use deterministic builds, and the technical process is described here.

Admittedly the process is imperfect, fragile, and very difficult to get right. When considering cross-platform builds the problem is even more complex.
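As a very rough sketch of the kind of normalization these projects perform (the variables and values below are only illustrative; the real recipes, linked above, are far more involved):

$ export LC_ALL=C TZ=UTC                   # pin locale and timezone so they cannot leak into the build
$ export SOURCE_DATE_EPOCH=1388534400      # fixed timestamp, honored by some newer tool chains
$ faketime '2014-01-01 00:00:00' make      # or force the clock for tools that embed build times
$ sha256sum ./program                      # everyone following the same recipe should get the same hash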

user10008
  • 4,315
  • 21
  • 33
makerofthings7
  • 50,090
  • 54
  • 250
  • 536
  • Both of those links go to the same place. Did you mean for the first one to go to [part one](https://blog.torproject.org/blog/deterministic-builds-part-one-cyberwar-and-global-compromise)? – Lily Chung Jul 30 '14 at 18:03
  • @IstvanChung - Thanks, was on a mobile device. Cut&Paste fail – makerofthings7 Jul 30 '14 at 18:07
  • 1
    Once a binary is verified to match the source code, only use the binary with the verified checksum for the whole archive. Even two tarballs proven to contain identical files and permissions can be subtly different. We have seen this with file creation order: one install crashes and another runs correctly, in spite of every file having the same md5 checksum. A before B in one directory traversal versus B before A in another; one of those orderings could be related to a vulnerability, etc. – Rob Jul 30 '14 at 21:09
3

I like determinism.

A compiler, or any software tool, is really a devious mathematical transform. It takes s (source code), puts it through a function C(), and produces a binary output b.

b = C(s) every time! Otherwise determinism fails and we all go mad.

So the theory goes, as long as we start with the same s, and the same C(), we will always produce the same b.

And this is good, because we can compute a hash of b, H(b), and get a relatively short value which we can compare to someone else's H(b) to make sure that our binary is the one we expect.

And then change happens: s changes to s', C() changes to C'(). Oh no!

Because C(s) = b1 and C'(s) = b2 and C(s') = b3 and C'(s') = b4,

and of course no two of H(b1), H(b2), H(b3), or H(b4) will ever match.

And the problem is that as the components (tool chain, environment, configuration, OS, etc) that are required to produce binary b get more numerous and interdependent it becomes harder and harder to reproduce the same b.

Wait, what if we didn't need the exact same b?

Then we are dealing with b and b' and the difference between them.

All you need to do is find the difference between a reference binary b and your generated binary b', and look at what that difference means. If the source for both b and b' is s, then we are dealing with C() and C'(), and thus we can correlate the difference between C() and C'() with the difference between b and b'. So even if we cannot exactly reproduce b, we can gain some confidence in b' by learning what difference is caused by using C'() instead of C().
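For example (binary names are placeholders), one way to look at what the difference means is to diff disassemblies rather than raw bytes, so that the noise from C() versus C'() is at least readable:

$ objdump -d reference-binary > /tmp/b.asm
$ objdump -d my-binary > /tmp/b-prime.asm
$ diff /tmp/b.asm /tmp/b-prime.asm | less    # differences here come from C() vs C'(), not from s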

this.josh
  • 8,843
  • 2
  • 29
  • 51
  • Good points. I guess the technique for determining that the source for b and b' is indeed the same is the sticky point since the compilers are in the hands of others. – zedman9991 Sep 04 '13 at 12:14
3

Even with the same sources, the same OS, the same libraries, the same compiler and the same loader, two binaries won't match, since they include information about the dates of the compile and load operations.

On the exact same system and development environment, if you build the same binary twice, the two results will differ, and hence any hash will differ:

$ md5 nmap
MD5 (nmap) = 8ef4b7c1cb2c96ce68d9e08224419b4f
$ # make clean, make install
$ md5 nmap
MD5 (nmap) = 94467bc53973550f919293f891f245f9

On the other hand, if the symbol tables weren't stripped, then these symbol tables will match, and comparing them is a good approximation for checking that a binary was really built from a given source:

$ nm -a nmap >/tmp/nmap.nm.1
$ # make clean, make install
$ nm -a nmap >/tmp/nmap.nm.2
$ diff /tmp/nmap.nm.[12]
$

This is only valid for me to verify that a binary really comes from a given version of my source tree. If I suspect an external source of tampering with everything, then even these symbol tables could be "arranged".

dan
  • 3,033
  • 14
  • 34
3

Generally, if you aren't sure you trust someone else's compilation, you would make the effort to do your own, or find someone to obtain it from whom you do trust.

But are you sure you can trust that your compiler hasn't been infected?

It's turtles all the way down. At some point you will always wind up having to make a judgement call and/or rely on virus-checkers, firewalls, and other security systems.

AFTERTHOUGHT -- This is one of the reasons that companies which distribute productized versions of open-source code exist. They police their codebase, promise their builds are clean, and (if you purchase it) they provide ongoing support. Remember, even Stallman's GNU manifesto said "software should be free, support should cost."

Keeping the downloads trustworthy is a form of support. You may get good support from a free community... but you may get better support if you throw some dollars at it. Pick your preferred tradeoff point.

I'm willing to use some random Linux build for hacking around on a secondary machine. I'd prefer something like Fedora for the personal machine that I actually rely upon. And if I was betting a business on it, I'd go with the full purchased-product version, Red Hat Enterprise or similar. (Endorsement not implied; Fedora and RHEL are just good illustrations of how one company addresses two different points on that spectrum.)

keshlam
  • 450
  • 2
  • 6
1

One of the fundamental things that you are trusting in a binary is the place you got it from. If SourceForge or download.com or whoever says it's free of viruses and that's good enough for you, go for it. You're taking their word for it.

If you don't want to trust a binary, then the only other real answer is to compile from source code: either to something like Java bytecode that you can run, or a jar, or all the way to a native binary.

If you compile your own binary, yes, you might end up with something that is the same as the standard binary (meaning that everything is the same, a bit-for-bit match). Great! You happened to be running the same hardware, compiling for the same processors, and nobody accidentally left an extra line break in your copy of the code.... Whether it matches or not, at that point you're trusting code that you at least had the ability to read. If you don't know C++ and you don't trust other people who have vetted the code, then tough: learn C++ and vet it yourself.

This all boils down to this: you can't verify a binary unless everything matches exactly. You can always review the open source code itself, though. Whether you take the time to do so, or trust the analysis that someone out there has presumably done, is your choice.

PsychoData
  • 296
  • 1
  • 11