31

I am still in college for a Computer Security degree and took my first assembly language based class last semester. We touched upon the subject of reverse engineering and why it is an important part of fighting malware and ill-wished applications.

During my class we mainly used IDA pro, but also checked out some similar and free browser based applications.

In these applications, we were able to obtain so much information about the instructions and low level code that I was wondering why we even need a human to go over it and recreate the higher level languages (like writing a 'C' version of a piece of malware).

Here my question:

Why can't a program use the information that is present in assembly code and turn it into a simplistic language automatically?

I understand that it wouldn't look the exact same way it would when it was first written, but shouldn't it be possible to re-create it in a way that makes it easier to read and follow?

It's just something that I can't wrap my head around, thanks!

PositriesElectron
  • 1,595
  • 1
  • 13
  • 17
  • 12
    Why do you think a program can't? – schroeder Jun 22 '15 at 18:42
  • Because as far as I know there isn't a tool out there that can take (for example) an .exe and re-write it in a higher level language. Tools like IDA can help a lot, but I never heard of a program that can re-write it in a none assembly structure. – PositriesElectron Jun 22 '15 at 18:46
  • 24
    The main problem is that there is a lot of information in source code that isn't directly related to the resulting byte code. For example, a variable of width of 16 bits could be a int/uint/char/char[]/ptr etc, and its from context that you can figure out which it is. Further more, how an int is used can easily be guessed from its variable name, but again is much harder with just the byte code. – Aron Jun 22 '15 at 18:48
  • 1
    If I could I'd upvote that answer Aron. I never really thought about how important context is when writing a program, thanks! :) – PositriesElectron Jun 22 '15 at 18:50
  • 2
    Can *forward* engineering be automated?! – user541686 Jun 24 '15 at 04:26
  • @Mehrdad You mean like an Assembly Line? or [self-generating C code](http://www.codeproject.com/Articles/20725/Programming-Self-generating-Code-for-Windows-Appli)? – RoraΖ Jul 01 '15 at 17:30
  • @raz: I didn't know working on an assembly line was called engineering. – user541686 Jul 01 '15 at 19:02
  • 2
    @Mehrdad An automated assembly line that puts entire cars together, I'd consider that to be automated engineering. – RoraΖ Jul 01 '15 at 19:19
  • 1
    @raz: I certainly don't, and I don't think the majority of people think of that as engineering either. The first definition I saw when I looked up the word "engineering" in my desktop dictionary was exactly what I expected, *"application of science to **designing** things"*. You can go ahead and argue that screwing pieces together is engineering if you'd like, which you're free to do but which I'm not going to entertain. – user541686 Jul 01 '15 at 19:23
  • There is a cool thread about this on dailydave -- https://lists.immunityinc.com/pipermail/dailydave/2015-September/000992.html – atdre Sep 22 '15 at 17:41

3 Answers3

66

Short Answer

It's absolutely possible, but the accuracy and readability is a completely different matter. One clarification to be made: Reverse Engineering is not Decompiling.


Long Answer

Reverse Engineering is generally the process by which you take something (anything really) apart to see how it works. Disassembling is when you take a binary formatted file and interpret the machine code into is assembly code. Decompiling is interpreting the assembly code into a higher level language.

I believe your question is really, Why can't decompiling a program be automated? Well it can be!

There are several different Java Decompilers. Java byte-code is completely reversible due to its architecture independence. What becomes tricky is decompiling a language like C. Hex Rays does provide a C decompiler, but C is a complicated language. There are 10 different ways to accomplish the same task. What can be done in 20 lines, can be done in 3, or 10. It's the interpretation of the language that makes the automation of decompiling C difficult.

Sure you can decompile C to its most simplistic instructions. Then you get lines like **(*var1) = 3; or (*bytecode)(param1) which can be a call to a function pointer. What's worse is that you must remember that these are still just an interpretation. I can't stress that enough. What if the interpretation is wrong? This is something you have to worry about at the disassembly level, but at least there are a reasonable amount of outcomes for 5-6 bytes for an instruction. Now you have to interpret 15-20 bytes in order to figure out a function call or a for-loop. If there are anti-reverse engineering techniques then it makes the interpretation even more difficult.

Context plays a huge role. What's the difference between a function pointer, a char * pointer, and a uint32? Absolutely nothing, except the context in which its used. Compiler optimizations might use __fastcall rather than __stdcall. Which means now you have to interpret where parameters to functions are going to be; either on the stack or in a register? Inline functions, macro's, #defines will all become part of a larger subroutine. There's no real way to interpret those types of contexts.

RoraΖ
  • 12,317
  • 4
  • 51
  • 83
  • 2
    Thanks for that clarification. I just read about software Obfuscation and I can see where it might be close to impossible to automate a readable reconstruction of a .exe file... – PositriesElectron Jun 22 '15 at 19:05
  • 2
    `Reverse Engineering is generally the process by which you something (anything really)` i think you skipped a verb. – Cthulhu Jun 23 '15 at 09:01
  • 3
    @Cthulhu: it was optimised away by the StackExchange interpretor. — re. “Java bytecode is completely reversible due to its architecture independence” – that's a bit of a non-sequitur, isn't it? Java bytecode _can_ easier be reversible because it doesn't need to account for details of the various target architectures. But you could implement a completely non-reversible Java → JVM compiler, and you could certainly do a reversible C ⇌ x86 compiler, too. – leftaroundabout Jun 23 '15 at 09:55
  • If you are really talking about "reverse engineering", not just "decompilation", then the problem is that you are fighting against the [Halting Problem](https://en.wikipedia.org/wiki/Halting_problem). Nevertheless, this doesn't stop antivirus and IDS/IPS writers from creating a heuristic analysis engine which is essentially an automatic reverse engineering system. – Lie Ryan Jun 23 '15 at 13:23
  • 1
    @LieRyan My post is specific to the problem of automating decompilation. Heuristic analysis with respect to malware is still just giving the probability that a piece of software is in fact malware. Which is an initial step to reverse engineering automation, but I would argue that the heuristic engine still doesn't understand exactly what the malware does. Only that it's actions are statistically malware. – RoraΖ Jun 23 '15 at 13:29
  • @leftaroundabout I'm not sure how you could make a non-reversible Java to JVM compiler. Symbols could be stripped or obfuscated, but for the JVM to successfully run the bytecode it needs to be reversible. I agree that you could decompile back into C; my argument is that readability and accuracy are in question. – RoraΖ Jun 23 '15 at 13:32
  • @raz: yes, your answer is perfectly fine for the question; my comment was going on the tangent about the reverse engineer aspect that you brought up. `... the heuristic engine still doesn't understand exactly what the malware does. Only that it's actions are statistically malware.` on which I argue is exactly what a human reverse engineer would be doing. Even a moderately experienced reverse engineer is still a lot better in this task than the most sophisticated heuristic engines though, but the methods aren't really that different. – Lie Ryan Jun 23 '15 at 13:38
  • I'd say that "disassembling" is the process of taking something apart; whereas "reverse engineering" is understanding (from so doing) how it can be put back together again. – eggyal Jun 23 '15 at 17:45
  • @LieRyan: The problem in question isn't really the Halting Problem; it's *strong AI.* Reverse engineering, at its core, is the process of taking a work of engineering and analyzing it to answer the question "how does this work?" (And sometimes the question "what does this do?" as well.) This is a question that requires real intelligence and comprehension to be meaningful. – Mason Wheeler Jun 23 '15 at 17:46
16

It is possible to automatically re-create something that looks like C-code from assembly, but the amount of guess-work that the decomplier would have to do is monumental.

Compilers are very complicated things that do complicated transformation on the source code. Optimizations, macro/pre-compiler substitutions, code in-lining, type and error checking, static linking, etc. The decomplier would have to guess (or pick defaults for) which compiler you used, which compiler flags were set, which version of which OS it was compiled on, which libraries it was compiled against, etc.

So, if you took some C-code, complied it, then decompiled it, the result would look nothing like the original.

And that's just to produce C-code that runs, we haven't even talked about readability yet. Variable names, function names, etc, are ripped out by the compiler and replaced with raw addresses, so decompilers like the one you're imagining typically name your functions A(), B(), C() ... and all variables a, b, c because it has no way of knowing the semantics (ie what these things are supposed to represent).

The bottom line is that anybody with a bit of assembly experience would say that reading decompiled code is actually harder than reading the raw assembly. (With a few exceptions: Java, for example, actually decompiles quite cleanly).

Mike Ounsworth
  • 57,707
  • 21
  • 150
  • 207
  • That makes a lot of sense. Like Aron mentioned in a comment earlier, context plays a big role, thanks for your answer. – PositriesElectron Jun 22 '15 at 19:01
  • 4
    I would add for example malware authors will code part of the program in assembly (like unpacking routines) and break as much as possible conventions to trick the decompiler/debugger/reverser. So the decompiled code will be garbage. – r00t Jun 22 '15 at 19:05
  • 4
    In Java, all the names are stored in the bytecode (except for local variables), and the types are also expressed in the bytecode, so the decompiler doesn't have to do much guesswork like in the case of C. – nhahtdh Jun 23 '15 at 06:18
  • @nhahtdh unless the java source code was run through an obfuscator, which any java malware will be. An obfuscator will do its best to make the decompiled code hard to read. – Mike Ounsworth Jun 23 '15 at 13:11
2

I have yet to find a paper or a software that completely automates reverse engineering, but there are some fields that are particularly interested in reverse engineering automation, like forensic analysis. Reverse engineering automation is not (at least currently) meant to totally automate the whole process of reverse engineering, but at least some parts so that you can scale your procedure to a whole filesystem. This is for example described in this article:

Introduction

This article will address the growing need for automation in reverse engineering (Reverse Engineering Automation), how the introduction of automation into the research process can save valuable time and aid the retrieval of information without which, either our research would be less thorough, or it would be conducted on a significantly smaller scale.

Also in this article, I will present examples of automation scripts, and will make them accessible via my Github account. I also invite you to add further examples, beyond those examples found in this article, and to send me Pull Requests, so that I can add them.

Why is Automation Needed in Reverse Engineering?

First of all, I feel it is important to point out that Reverse Engineering Automation saves investigation time but does not replace the rest of the process, and secondly, for reasons including the following:

  1. Repeated actions.
  2. Dynamic code fragments that are detected at a later stage of the program (Crypters, Packers).
  3. Bypassing software protections, such as SSDT.
  4. Allocating memory of the software running within Ollydbg

The article goes on to describe a few tools like OllyDBG-Python and OllyDBG-Playtime, and then some code snippets which is also available here and here in opensource.

gaborous
  • 121
  • 3
  • 1
    You might want to describe details related to and/or include quoted statements from the linked article. – RoraΖ Sep 22 '15 at 18:22
  • Ok, I added an excerpt of the intro and the links to the source code of the examples and tools. Thank you guys for the feedback :) – gaborous Sep 22 '15 at 20:23