Does a binary executable have to have some critical plain-text components?

Question

When companies package binary executables, they are often encrypted, compressed, scrambled, and otherwise made so that your lazy hacker can't simply open the program up in Notepad++ and see the code.

In all the ones I've looked at, however, they each have some critical code components which are unencrypted, uncompressed, and human-readable. Here it seems the methodology is more "security by obfuscation": creating nonsensical variable names and making the code as difficult to make sense of as possible. But the fact remains that it's there in plain text to be deciphered.

Is this by necessity? I was thinking that it might have to be this way so that the OS has something sensible to execute (which then has instructions to further decompress/unencrypt the rest of the executable), but I don't know enough about it to be sure. Or is there some way to actually scramble an entire executable without having any human-readable components?

Your base assumption is wrong. Most code is not encrypted, compressed, or scrambled. Code is *compiled* from source code to machine code, because processors do not execute textual source code. — Stephen Touset, Feb 07 '13 at 18:07
@StephenTouset I understand, but in looking at the actual executables I've seen evidence of at least partial compression (the standard copyright message of Mark Adler and the inflate/deflate version numbers). So while some of it may be simple machine code, some of it certainly isn't. If binary executables were distributed only as compiled machine code, it would be simple to run it through a decompiler and get the original. — asteri, Feb 07 '13 at 18:10
I wonder if the code encryption technology by [SLP Server](http://security.stackexchange.com/a/1091/396) leaves the final private key obfuscated. — makerofthings7, Feb 07 '13 at 18:12
Decompiling doesn't give you the original, just a bizarre approximation of optimized and undocumented code. — mgjk, Feb 07 '13 at 18:21
@Jeff A lot of code *uses* zlib, but that doesn't necessarily mean the code itself is compressed. Regardless, the point of that wouldn't be obfuscation of code — compiling already does that to a reasonable extent. — Stephen Touset, Feb 07 '13 at 18:28
@Jeff: Most binary executables **are** distributed only as compiled machine code. And no, decompiling doesn't give you the original source code that was compiled into that executable. — Christoffer Hammarström, Feb 07 '13 at 23:48

score 8 · Accepted Answer · answered Feb 07 '13 at 18:06

Ultimately, the CPU runs the code. And the CPU expects instructions in "clear text". You could envision some application code where a small initial part of the executable first decrypts the rest of the code, but this has several issues:

- This forces all the code to be loaded into RAM instead of staying on disk and being loaded on demand, which implies higher RAM consumption and longer start times.
- That "decryption" routine must, necessarily, not be encrypted.
- The decryption routine knows everything needed to decrypt the rest of the code, so the attacker can decompile and emulate it; it is not really "decryption" since there is no key (or, equivalently, the key is embedded in the routine, which the attacker has in his hands).
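
The point about the embedded key can be illustrated with a toy sketch. This is not a real packer: the XOR scheme, the `KEY` value, and the function names are all assumptions made up for the illustration. It only shows that whatever the stub can undo, a reader of the stub can undo too.

```python
# Toy "packer": a hypothetical illustration, not a real protection scheme.
KEY = b"\x5a\xa7\x13"  # assumption: the key ships inside the unencrypted stub


def xor_bytes(data: bytes, key: bytes) -> bytes:
    """XOR every byte with the repeating key (the operation is its own inverse)."""
    return bytes(b ^ key[i % len(key)] for i, b in enumerate(data))


def pack(payload: bytes) -> bytes:
    """What the vendor would do at build time."""
    return xor_bytes(payload, KEY)


def stub_unpack(packed: bytes) -> bytes:
    """What the clear-text stub would do at run time."""
    return xor_bytes(packed, KEY)


original = b"print('licensed feature unlocked')"
shipped = pack(original)

# The shipped bytes no longer contain the readable text...
assert b"licensed" not in shipped

# ...but anyone who can read the stub can read KEY and replay the same step.
recovered_by_attacker = xor_bytes(shipped, KEY)
assert recovered_by_attacker == original
```

A stronger cipher would not change the picture: as long as the key travels with the routine, the attacker holds both.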

Experimentally, this kind of encryption does not hinder attackers much, so the general wisdom is that it is "not worth the effort", unless you are in a very specific scenario where the "attacker" is a mindless automaton that may be fooled by these hide-and-seek methods. That is the case for viruses, where the "attacker" is the antivirus software.

Really encrypted code is possible if the CPU is doing the decryption itself, internally, with some key management. That's what happens in a PS3 console, for instance.

Thomas Pornin

There is a field of cryptography called "white-box cryptography", whose goal is to develop algorithms that incorporate the secret required for decryption into the code itself in an indistinguishable way. It's still executable code, but nothing in it can be discretely identified as being a "key" or even a specific piece of it. The ideal would be an algorithm that an attacker could watch executing in a conceptual "white box", see every register, memory location and instruction step by step, and *still* be unable to reverse-engineer the secret. – KeithS Feb 07 '13 at 18:17
@KeithS: yes, this research field exists, but, as far as I know, the only tangible result it has currently produced is that "white-box cryptography looks hard". – Thomas Pornin Feb 07 '13 at 18:23
A few comments: 1) you wouldn't need to load the entire binary into memory to decrypt but could do it on the fly. If nothing else, then through modularity. 2) the bits that do the decryption don't need to hold the key. Old school techniques used dongles that held keys or even program bits. 3) you're also overlooking that previously decrypted bits can be re-encrypted when not used – Fake51 Feb 07 '13 at 20:24
This reminds me of the Morris worm. – user Mar 21 '16 at 15:27

score 3 · Answer 2 · answered Feb 07 '13 at 17:56

You are talking of two different components. One is the loader, which is not human-readable but must be machine-readable (therefore unencrypted) in order to be executed. And this has to be this way, as you say: otherwise you'd get a chunk of unexecutable data.

Several other "plaintext components" may also be present such as copyright, manifest, file info, and so on and so forth, which are human readable but are not sensitive - i.e., the developer couldn't care less if you are able to read his name. Actually he probably prefers it that way.

The loader performs several tasks of self-integrity checking, debugger-checking, and what not, and then decrypts the "true" executable in memory.

The extent to which the executable is encrypted depends on the use case. For example the binary might only keep encrypted certain critical routines having to do with copy protection, customization, or branding; so that you can open the file with a binary editor and see all the resources, strings, cursors etc. plain as day.

Or it could be encrypted with a standard "binary-agnostic-so-I-will-just-encrypt-everything" executable packer/encryptor, in which case you will see in plain text the strings belonging to the decryptor code, but not those of the original executable. Of course, decryptors are often further obfuscated to make it harder for your average hacker to recognize the encryptor and obtain the suitable decryptor (and the more widespread the encryptor, the more likely such a decryptor is to exist).

score 1 · Answer 3 · answered Feb 07 '13 at 17:53

At some point an executable has to look like an executable, otherwise the system won't know what to do with it. This usually entails a header identifying it as an executable (e.g. the MZ header in a Windows EXE), as well as some structures containing pointers to things like the starting point for execution, followed by a blob of binary data that is the executable body of the file.

Often there is also metadata attached that the OS uses, like Authenticode signatures, and attributes like publisher, version, etc.

At the bare minimum, it needs those header bits, and the executable body. That executable body would need enough clear-text code to execute the decryption/decompression mechanism, and then enter the new code for execution.
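
As a rough sketch of what "looking like an executable" means, the check below classifies a byte string by its leading magic bytes. The function name and the synthetic header are assumptions for illustration; a real loader validates far more than this. The ELF case is added for comparison and is not part of the answer above.

```python
import struct


def describe_header(data: bytes) -> str:
    """Classify a file by its leading magic bytes (toy check, not a full parser)."""
    if data[:4] == b"\x7fELF":
        return "ELF"
    if data[:2] == b"MZ":
        # Offset 0x3C of the DOS header holds the file offset of the PE signature.
        if len(data) >= 0x40:
            (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
            if data[pe_offset:pe_offset + 4] == b"PE\x00\x00":
                return "PE"
        return "MZ (DOS)"
    return "unknown"


# Synthetic stand-in for a real EXE header (assumption: only the fields we test).
fake_exe = bytearray(0x44)
fake_exe[0:2] = b"MZ"
struct.pack_into("<I", fake_exe, 0x3C, 0x40)   # PE signature lives at 0x40
fake_exe[0x40:0x44] = b"PE\x00\x00"

print(describe_header(bytes(fake_exe)))
print(describe_header(b"\x7fELF" + bytes(12)))
print(describe_header(b"#!/bin/sh\n"))
```

These few bytes have to stay in the clear no matter how the body is packed, because the OS reads them before any of the program's own code runs.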

score 1 · Answer 4 · edited Jun 16 '20 at 09:49

The way an executable is compiled, and what is visible inside it, varies quite a bit depending on the platform and the programming language involved. The "encrypted, scrambled" portions aren't really encrypted or scrambled. They are just non-textual data: machine code, which is executed by the CPU.

For example, on Windows, if you were able to get your hands on a .dll created in .NET, one compiled in VB6, and one compiled in C++, you'd likely find a big difference in how much "plain text" is visible if you open each in Notepad.

.NET .dll or .exe files aren't really compiled down to machine code; they are compiled to MSIL, a form of bytecode that the .NET runtime compiles to machine code. Java bytecode works in a similar way. There is information in there that is very easy to decompile, and for someone who knows how to read bytecode or MSIL, it's not gibberish at all.
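
Python is not one of the platforms discussed here, but its bytecode makes a convenient analogy for why bytecode formats are easy to read back. In this sketch the `check_license` function and its names are invented for the example; the point is that identifiers and constants survive compilation intact.

```python
import dis


def check_license(serial_number):
    # Hypothetical routine, used only to have something to compile.
    expected_prefix = "ACME-"
    return serial_number.startswith(expected_prefix)


code = check_license.__code__

# Names and constants are stored alongside the bytecode, in the clear.
print(code.co_varnames)   # local variable names
print(code.co_names)      # attribute/global names referenced
print(code.co_consts)     # literal constants

# The disassembly refers to those same names directly.
dis.dis(check_license)
```

MSIL and Java class files carry comparable metadata, which is why decompilers for those platforms can produce output close to the original source unless an obfuscator has renamed things first.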

C++ files, on the other hand, are compiled to machine code, and are much harder to read (if readable at all) by opening them in Notepad.

In other words, it has nothing to do with encrypting or scrambling; it has to do with how the file is prepared for the PC to read and execute it.

When it comes to the plain text in the files, you are correct. It does seem like a security risk, but as has been pointed out dozens of times over on Stack Overflow, it's really not possible to prevent people from decompiling/examining your code, particularly if it is built on a platform like .NET or Java. You have to assume that your code is completely open to a skilled person. (As was pointed out to me when I asked a question about how to protect the sensitive data in a Windows app built with .NET.)

As for what information is displayed, that's pretty much determined by the compiler and tools used to make the .exe as well.

score 0 · Answer 5 · answered Feb 07 '13 at 18:51

There are a few different things going on here. There are two main ways that code can be run on a computer. Most programs (native applications) are compiled to what is known as machine code. It is a non-human-readable set of instructions to be processed by the CPU. This machine code can include data that should be loaded into memory as part of the execution. This is why some plain-text strings are visible within the executable. These are simply data elements that will be loaded at a memory address that the unreadable machine code will then point to. Their purpose is purely display to the user and/or use as constants for string comparison. They are not actually instructions that the computer works on.
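
A small sketch of how such strings show up: the function below pulls runs of printable ASCII out of a byte blob, roughly what opening a binary in a text editor lets you spot by eye. The blob is a hand-made stand-in (a few x86-64 instruction bytes followed by two made-up string constants), not taken from any real program.

```python
import re


def printable_strings(blob: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len printable ASCII bytes found in blob."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [m.decode("ascii") for m in re.findall(pattern, blob)]


# Illustrative blob: instruction bytes first, then NUL-terminated string data.
blob = (
    b"\x55\x48\x89\xe5"          # push rbp; mov rbp, rsp
    b"\xbf\x00\x10\x40\x00"      # mov edi, 0x401000 (address of a string)
    b"\xc3\x00"                  # ret, padding
    b"Enter serial number:\x00"
    b"Invalid key\x00"
)

print(printable_strings(blob))
# ['Enter serial number:', 'Invalid key']
```

The instruction bytes contribute nothing readable; only the data the code points to shows up as text.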

Such machine code can be run through a disassembler; however, the resulting code is often very difficult to read and is completely undocumented. Thus, it is exceedingly rare for any additional protection to be put on such an executable: it really is not necessary, and since the code has to be readable by the CPU eventually, someone who wanted to know what was going on could simply look at the instructions as they are decoded and sent to the CPU.

The other major type of program is written in languages that use an interpreter, VM, or runtime to execute their behavior. Languages such as PHP, Java, and .NET (C#, VB) fall into this category. Since they do not actually compile to machine code, they can be much more easily "decompiled" into something much closer to the original code. To make this process harder, there is a technique known as code obfuscation. It is not encryption; rather, it simply removes the identifiers that would make it easier for someone to tell what is going on and possibly alters code paths to make it more difficult to turn the code back into the original source.
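
The identifier-removal half of obfuscation can be sketched in a few lines. This is a toy under stated assumptions: the `RenameIdentifiers` class, the sample function, and the rename mapping are all hypothetical, and real obfuscators for Java or .NET work on bytecode and do much more. It needs Python 3.9 or later for `ast.unparse`.

```python
import ast


class RenameIdentifiers(ast.NodeTransformer):
    """Replace chosen identifiers with meaningless ones (toy obfuscator)."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_FunctionDef(self, node):
        node.name = self.mapping.get(node.name, node.name)
        self.generic_visit(node)  # continue into arguments and body
        return node

    def visit_arg(self, node):
        node.arg = self.mapping.get(node.arg, node.arg)
        return node

    def visit_Name(self, node):
        node.id = self.mapping.get(node.id, node.id)
        return node


source = """
def compute_discount(price, customer_years):
    loyalty_rate = min(customer_years, 10) * 0.01
    return price * (1 - loyalty_rate)
"""

mapping = {"compute_discount": "a", "price": "b",
           "customer_years": "c", "loyalty_rate": "d"}

tree = RenameIdentifiers(mapping).visit(ast.parse(source))
obfuscated = ast.unparse(tree)
print(obfuscated)

# Behavior is unchanged; only the hints for a human reader are gone.
plain_ns, obf_ns = {}, {}
exec(source, plain_ns)
exec(obfuscated, obf_ns)
assert plain_ns["compute_discount"](100, 5) == obf_ns["a"](100, 5)
```

The logic is still fully present in the output, which is why obfuscation slows a reader down rather than stopping one.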

There are some platforms that attempt to allow encrypted code to be executed by using a segment of native code which handles decryption so that the actual code being used by the runtime is only available immediately before and during use, but such systems are rarely used due to the overhead they introduce.

If you are seeing plain-text strings in an executable image, they are almost certainly strings to be loaded into memory as data, since if the executable were compressed or encrypted, the strings would not be visible either.

Thanks for your input! But the plain text I'm seeing isn't simply string constants. Some of these are actual code fragments. One executable in particular was interesting, which had 222 `void main()`s defined throughout a ~75,000 line file. I didn't spend the time to figure out which one was the actual entrance point for the program. — asteri, Feb 07 '13 at 18:55
@Jeff - do you know what language it was built with? That might give more insight in to why it was there. It is possible that it was built with debugging information present, in which case the function headers might still appear as human readable data for the sake of the debugger. It's also possible it is an interpreted language (like PHP) that simply bundled up the code along with something that will call the interpreter. The CPU can never read English though. It would have to either be interpreted or be a data element. — AJ Henderson, Feb 07 '13 at 19:08
I'm afraid I don't know for sure what language it was written in, however it looked like C to me (I'm notoriously bad at distinguishing C from C++ at a glance, however). — asteri, Feb 07 '13 at 20:00
If it is a C language (other than C#), then it is most likely simply debug information in the executable that you are seeing when you see the void main() as a string. — AJ Henderson, Feb 07 '13 at 20:18
