21

I recently found the GitHub repository https://github.com/userEn1gm4/HLuna. After cloning it, I noticed that the binary I compiled from source (HLuna.cxx, using g++) differs from the binary included in the repository (HLuna); the comparison reports: differ: byte 25, line 1. Is the provided binary safe to run?

I've already scanned the provided binary with VirusTotal, which reported no issues, but I don't have the expertise to decompile it and read the output, and I had already executed the provided binary without thinking about the risks.
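
For readers unsure what "differ: byte 25, line 1" means, here is a minimal Python sketch of how such a position can be computed. The function name and the sample byte strings are hypothetical and are not taken from the repository. Assuming the binaries are ELF files (which the GCC-on-Linux setup suggests but the question does not state), byte 25 is where the header's entry-point address field begins, so a mismatch there points to a different code layout rather than to anything about the program's behaviour.

```python
from typing import Optional, Tuple


def first_difference(a: bytes, b: bytes) -> Optional[Tuple[int, int]]:
    """Return (byte, line) of the first mismatch, both 1-based, or None.

    The line number is one plus the count of newline bytes before the
    mismatch, matching the numbering in "differ: byte N, line M".
    """
    limit = min(len(a), len(b))
    for i in range(limit):
        if a[i] != b[i]:
            return i + 1, a.count(b"\n", 0, i) + 1
    if len(a) != len(b):
        # Choice made for this sketch: when one input is a prefix of the
        # other, report the position of the first missing byte.
        return limit + 1, a.count(b"\n", 0, limit) + 1
    return None


# Hypothetical stand-ins for the two files: identical for 24 bytes, then different.
built = bytes(24) + b"\x40\x10" + bytes(6)
shipped = bytes(24) + b"\x60\x0f" + bytes(6)
print(first_difference(built, shipped))  # (25, 1)
```

To use it on real files, the two byte strings could be read with `open(path, "rb").read()`.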

Peter Mortensen
mcruz2401
  • 3
    If you're able to compile from source, then just use your compiled version. – Daisetsu Mar 25 '19 at 05:05
  • 18
    It takes a lot of effort to make builds reproducible (deterministic) due to the nature of legacy tools (because no one cared about that in the past). [Debian has been trying to be deterministic since 2014 and is still not done](https://wiki.debian.org/ReproducibleBuilds) :) – PTwr Mar 25 '19 at 08:27
  • 1
    There is a relevant post (full disclosure: mine) on OpenSource.SE with several helpful links about deterministic and non-deterministic builds: [Is there any way to assert that source code corresponds to compiled code?](https://opensource.stackexchange.com/q/2737/50) – apsillers Mar 25 '19 at 13:09
  • 1
    How do you know you can trust the source code in the repo? Do you audit every single line of code? (the 175 line source code file you linked to is small enough that you can audit it, but if it were 10,000 or 100,000 lines of code, is the source code any safer than the published binaries?) – Johnny Mar 25 '19 at 21:35

3 Answers

58

Compilation is not deterministic across compiler versions, library versions, operating systems, or a number of other variables, so two builds of the same source cannot be verified by comparing them byte for byte. The only way to verify is to perform a diff at the assembly level. There are lots of tools that can do this, but you still need to put in the manual work.
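
As a rough illustration of what an assembly-level diff involves, here is a minimal Python sketch. It assumes the two listings were produced by `objdump -d` and follow its usual tab-separated layout; the function names and the sample listings are hypothetical. It strips addresses and raw opcode bytes so that only the instruction text is compared, which is a simplification: displacements and immediates that depend on layout will still show up as differences.

```python
import difflib
import re
from typing import List

# One instruction line of `objdump -d` output: address, raw bytes, then text,
# separated by tabs, e.g. "  401127:\te8 04 ff ff ff \tcall   401030 <puts@plt>".
INSN = re.compile(r"^\s*[0-9a-f]+:\t[0-9a-f ]+\t(.*)$")
# A numeric branch target followed by its symbolic form, e.g. "401030 <puts@plt>".
TARGET = re.compile(r"\b[0-9a-f]+ (<[^>]+>)")


def normalise(disassembly: str) -> List[str]:
    """Keep only instruction text, dropping addresses and raw opcode bytes."""
    result = []
    for line in disassembly.splitlines():
        match = INSN.match(line)
        if match:
            text = re.sub(r"\s+", " ", match.group(1).strip())
            # Keep the symbolic target and drop the numeric address before it.
            result.append(TARGET.sub(r"\1", text))
    return result


def asm_diff(old: str, new: str) -> List[str]:
    """Unified diff of two normalised listings; empty if they match."""
    return list(difflib.unified_diff(
        normalise(old), normalise(new), "old", "new", lineterm="", n=0))


# Hypothetical listings: same instructions placed at different addresses.
built = ("  401126:\t55                   \tpush   %rbp\n"
         "  401127:\te8 04 ff ff ff       \tcall   401030 <puts@plt>\n")
shipped = ("    1139:\t55                   \tpush   %rbp\n"
           "    113a:\te8 f1 fe ff ff       \tcall   1030 <puts@plt>\n")
print(asm_diff(built, shipped))  # []
```

Whatever the diff leaves behind is what still has to be read by hand.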

Polynomial
  • 35
    Even that isn't going to be reliable across optimization levels. – chrylis -cautiouslyoptimistic- Mar 25 '19 at 05:48
  • 44
    Even *if* the compiled object code is 100% identical, there may still be timestamps in the executable file's metadata which cause the resulting binaries to differ even though the code is identical. – Jörg W Mittag Mar 25 '19 at 07:00
  • 2
    Reproducible builds solve this problem. – forest Mar 25 '19 at 08:34
  • This is the real answer. A build was never supposed to produce the same binary on two different machines, even with the same OS, compiler version and configuration. It is just not stated anywhere, and no one actually assumed this, at least in the C++ world. I don't like the accepted answer because it is specific to the app and does not explain this. – Croll Mar 26 '19 at 09:07
22

Polynomial tells you what may happen, and how to solve it. Here I will illustrate it:

I ran both binaries through strings and diffed the output. That alone shows some completely harmless differences, in particular the compiler used:

GCC: (Debian 6.3.0-18) 6.3.0 20170516                         | GCC: (GNU) 8.2.1 20181105 (Red Hat 8.2.1-5)
                                                              > GCC: (GNU) 8.3.1 20190223 (Red Hat 8.3.1-2)
                                                              > gcc 8.2.1 20181105

Some of the private names used are also different:

_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEaSEOS4_@ | _ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEaSERKS4_

And some sections seem to be shuffled, so the diff cannot match them exactly.
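
A comparison like this can be approximated in a few lines of Python. This is only a sketch: the function names are hypothetical, the extraction is a simplified imitation of what `strings` does (runs of at least four printable ASCII characters), and the sample bytes merely reuse two compiler strings from the diff above. Comparing sets instead of line-by-line output sidesteps the shuffling problem, at the cost of losing ordering information.

```python
import re
from typing import List, Set, Tuple


def printable_strings(data: bytes, min_len: int = 4) -> Set[str]:
    """Collect runs of printable ASCII at least min_len characters long."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return {m.group().decode("ascii") for m in pattern.finditer(data)}


def compare_strings(left: bytes, right: bytes) -> Tuple[List[str], List[str]]:
    """Return the strings found only in left and only in right, sorted."""
    a, b = printable_strings(left), printable_strings(right)
    return sorted(a - b), sorted(b - a)


# Hypothetical file contents built around strings seen in the diff above.
left = b"\x7f\x01GCC: (Debian 6.3.0-18) 6.3.0 20170516\x00\x00MENU\x00"
right = b"\x00GCC: (GNU) 8.2.1 20181105 (Red Hat 8.2.1-5)\x00MENU\x00\x02ab\x00"
only_left, only_right = compare_strings(left, right)
print(only_left)   # ['GCC: (Debian 6.3.0-18) 6.3.0 20170516']
print(only_right)  # ['GCC: (GNU) 8.2.1 20181105 (Red Hat 8.2.1-5)']
```

The shared string MENU drops out, leaving only the compiler banners.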

Even on the same computer, compiling without optimisation and compiling with -O3 produce different files:

_ZNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEE6appendE | _ZNSt7__cxx1115basic_stringbufIcSt11char_traitsIcESaIcEED2Ev

Even the internal data is shuffled:

Diccionario creado!                                           <
MENU                                                          <
1. Generador de Diccionarios                                  <
0. Salir                                                      <
/***                                                          <
*    $$|  |$$ |$$|                                            <
*    $$|  |$$ |$$|                                              *    $$|  |$$ |$$|                                  
*    $$|  |$$ |$$|     $$| |$$  |$$$$$$|  |$$$$$$|              *    $$|  |$$ |$$|     $$| |$$  |$$$$$$|  |$$$$$$|  
*    $$$$$$$$ |$$|     $$| |$$ |$$ __ $$|  ____$$|              *    $$$$$$$$ |$$|     $$| |$$ |$$ __ $$|  ____$$|  
*    $$|  |$$ |$$|     $$| |$$ |$$|  |$$| $$$$$$$|              *    $$|  |$$ |$$|     $$| |$$ |$$|  |$$| $$$$$$$|  
*    $$|  |$$ |$$|___  $$|_|$$ |$$|  |$$| $$___$$|              *    $$|  |$$ |$$|___  $$|_|$$ |$$|  |$$| $$___$$|  
*    $$|  |$$ |$$$$$$$| $$$$$  |$$|  |$$| $$$$$$$|              *    $$|  |$$ |$$$$$$$| $$$$$  |$$|  |$$| $$$$$$$|  
*    ----------------------------------------------             *    ---------------------------------------------- 
                                                              > -------------------
                                                              > Diccionario creado!
                                                              > MENU
                                                              > 1. Generador de Diccionarios
                                                              > 0. Salir
                                                              > /*** 
                                                              > *    $$|  |$$ |$$| 

This shows that comparing binary files raises many false positives and doesn't tell you anything about the binary's safety.

In this case, I'd use the version I compiled myself, since you have no way of knowing which version was uploaded: the author may have forgotten to recompile after the last tweaks.

Davidmh
  • 7
    I don't think those are different names - what's actually happened is that when the immediately adjoining data are printable, `strings` grabs slightly more text. `nm` might be a better tool for extracting identifiers. – Toby Speight Mar 25 '19 at 16:14
  • @TobySpeight good point, I shall investigate and correct. – Davidmh Mar 25 '19 at 22:06
  • …and even an honest author might be unknowingly infected by some malware. – spectras Mar 26 '19 at 03:05
  • 2
    Protip/warning: GNU Strings was [at one point](https://cve.mitre.org/cgi-bin/cvename.cgi?name=CVE-2014-8485) vulnerable to arbitrary code execution if used on a malicious file. So it may be wise to avoid running it on untrusted files, just in case. – Kevin Mar 26 '19 at 07:20
  • @Kevin any piece of software may be vulnerable to arbitrary code execution if used on a malicious file. That doesn't mean you can't use those tools to examine them, it just means that you need to airgap the system that runs them. – Braiam Mar 26 '19 at 14:51
2

If the software is exactly the same at the source level, then the question boils down to whether you can trust your compiler, system libraries and the various utilities used during compilation. If you installed your toolchain from a trusted source and you trust that your computer hasn't been compromised in the meantime, then there's no reason to suspect that the binary file you generated will be malicious, even if it differs from the "reference" build.

Dmitry Grigoryev