Decompilation

Decompilation is the reverse process of compilation i.e.
creating high level language code from machine/assembly language
code. At the basic level, it just requires to understand the
machine/assembly code and rewrite it into a high level language,
but things are not as simple as they seem, particularly when it
comes to implementing a decompiler. Throughout this discussion,
we will be using the C language for the high level language, and
the 8086 assembly language for the low level language.

The ethics of decompilation

Is decompilation legal, and is it allowed?

There are many situations when decompilation can be used...

To recover lost source code. You may have written a
program for which you only have the executable now (or
you got the exe of a program you wrote long back, from
someone else!). If you want to have the source for such a
program, you can use decompilation to recover it. In all
rights, you are the owner of the program, so nobody is
going to question you.

Just as stated above, applications written long back for
a legacy computer may not have the source code now, and
you may need to port it to a new platform. Either you
have to rewrite the application from the scratch, or use
decompilation to understand the working of the
application and write it again.

Say you have code written in some language for which you
cant find a compiler today! If you have the executable,
just decompile it and rewrite the logic in the language
of your choice today.

To discover the internals of someone else's program (like
what algorithm they have used...)

Usually all software are copyrighted by the authors. This
means, copying or expressing the same idea in another program is
prohibited. Hence if you are using decompilation to discover the
internals of a program and if that particular part is breaching
the copyright of the owner, you are liable for legal action.
However, there are some permitted uses of decompilation, like the
first three cases stated above. Also, decompilation of parts of
software which do not come under the copyright laws (e.g.
algorithms) is permitted. In any case, it is better to contact
your legal advisor if you are doing any serious work with
decompilation.

In all practical purposes, decompiling programs which were
created by you can't be questioned! After all, you are the owner
of all rights to the program. But be careful if you are trying it
out on someone else's programs.

Is decompilation possible?

Let's take a look at a normal C compiler. When a C program is
compiled, the first stage of the compiler will generate a very
rudimentary assembly language output (or nearly equivalent to
it), which is nothing but a line-by-line translation of the C
source code. If any optimizations are chosen to be done, the next
stages perform the code optimization to replace redundant
instructions and improve the overall efficiency of the output
program. This output is then linked with the standard libraries
for any library function calls, and saved in the executable
format of the platform.

If no optimizations are performed while the code is generated,
it is very easy to understand what the output code does, and an
equivalent C code can be written/generated automatically. Note
that we can only generate "an equivalent C code", and
not the "same C code" which was compiled to get this
executable. In other words, it is always impossible to get the exact
source code, but we can generate an equivalent program
which will function in the same way.

However, things are a bit different if any compiler
optimizations are used while building the original executable. It
turns out to be more difficult to understand the flow of the
program now, and the more rigorous the optimizations are, the
worse are our chances to figure out what the code is doing
exactly.

One more factor which joins the opposition is that each
compiler generates code in its own way. There are very few points
where all compilers generate similar code. This means that we
need to tailor the decompilation procedure for each compiler, so
that if we know which compiler was used to generate this
executable, we have a better chance of understanding the code.

For example, the simple "if" statement can be
compiled in many ways, like the one given below.

Code written for critical applications like games, where
every machine cycle counts for the performance, sometimes
self-modifying code is used. For example, if the same
condition is used at many places in a critical loop, it
is evaluated once and the code after that is modified to
branch to the same location without evaluating the
condition again. But with today's processor speeds, this
technique is fast becoming antique!

2. User-defined datatypes

User defined data types, like "struct"s,
"typedef"s, "union"s and bit-fields
add more to the confusion of the decompiler. Though it is
easy to use structures while writing the code, it is
almost impossible to figure out if a variable is a part
of a structure or is it a basic type on its own, by
looking at the compiled output.

3. Use of
processor-specific instructions/optimizations

Inspite of all these reasons, it is still possible to
decompile a binary executable (though not 100% automated, and not
100% accurate!). Carry on reading...

How to decompile?

A simple approach to decompile a binary executable, is to
first parse it and separate it into functions (C style). Once we
know where the entry point of the program ("main" for a
C program), we can start decompiling that function, and any other
function it calls. This way, we can focus on one function at a
time carefully.

To use this approach, the first thing that should be known is
the entry-point of the program. For a normal C program, it is the
"main" function, and for a Win32 program, it is the
"WinMain" function. But to find out where these
functions begin, we must analyze the executable and figure out
which compiler was used, because each compiler has its own
entry/exit code added to the program which we need not decompile.
If the compiler used is known, we can trace out where the
"main" function is starting, and from thereon, till we
get a "return" instruction, we can separate out the
function.

A simple method to decompile a code
snippet

Let's say we have separated out the instructions for a
function. To understand how this piece of code works, we need to
emulate the processor, and interpret each and every machine
instruction! (though it seems a roundabout way, there is nothing
called the best method to understand the working
of the code, so this is not the only way. If you think this can
be done in some other way, please try it out and i'd be
interested to hear about it.). And as we are interpreting the
code, we need to combine logical group of instructions into
simple high level language statements. And voila! We've
decompiled the function!

The high level language statements can be grouped as
assignment statements, condition evaluation and branches, and function calls. While
interpreting the machine code for a function, we have to look for
instructions which fall in one of these categories, and
accordingly generate the high level code.

Understanding assignment statements and expressions

Take a look at the following code snippet.

mov ax, [bp+4]
mov bx, 20
mul bx
add ax, 4
mov [bp+4], ax

Sample
expression eval & assignment.

It is not difficult to understand what it does. To put it in
steps,

AX is loaded with value from memory [bp+4]

BX is loaded with value 20

AX and BX are multiplied, and the result is in DX:AX (of
which we will ignore DX for the time being)

A value of 4 is added with AX

The result in AX is stored in memory location [bp+4]

And we can easily write a program which will interpret this
code. Let's say we have two variables wAX, wBX. The above 5 steps
can be translated by our program as

wAX = [bp+4]

wBX = 20

wAX = wAX*wBX = [bp+4]*20

wAX = wAX+4 = ([bp+4]*20) + 4

[bp+4] = wAX = ([bp+4]*20) + 4

And if we substitute a variable name "i" for
"[bp+4]", we have arrived at the high level language
statement

i = ( i*20 ) + 4;

That was pretty easy, wasn't it? But then, when do we decide
that we have got a full high-level language statement? Whenever
there is a instruction which stores some value to external
memory, it is a full high-level language statement by itself.
Just think about it. All the following high-level language
statements inevitably end with a memory store operation!

i = 0;
x = i*30 + (j>>1)+(i/2)
k = (i & 2) + z;

Understanding condition evaluation and branches

A condition evaluation almost always uses a
"compare" instruction. So while interpreting the
machine code, if we get a "cmp" instruction, we
translate it into a "if" statement, and the branch
instruction that follows this compare will lead us to the block
of code which gets executed if the condition is true, and if it
is false.

When we are interpreting this code and come across the
"cmp" instruction, the value in our variable wAX
is "[bp+4]" (which is a memory variable). So we now
form a condition between "[bp+4]" and "10".
But what are we comparing for? That can be found using the branch
instruction which comes next. Since we find a "jnz"
instruction ("jnz" means "jump if not zero",
or "jump if not equal"), we can conclude that this is a
equality compare i.e. we are comparing for
"[bp+4]!=10". The rest is pretty easy, and after
replacing "[bp+4]" with a name "i", we can
generate the C code for this as below.

if (i != 10) goto lab1;
j = 15;
goto lab2;
lab1:
j = 20;
lab2:

After a bit of rearrangement, and removing the
"goto" statements, we can get the 'beautified' C code
as

if (i!= 10)
{
j = 20;
}
else
{
j = 15;
}

But beware! Conditions need not always be
evaluated using "compare" instructions. Even arithmetic
instructions can set/reset the conditions flag bits in the
processor's FLAGS register, and it is these condition flag bits
which are actually used to determine if a branch should be taken
or not! A classic example is the C statement "i--", as
shown below.

if (i--)
j = 20;
j++;

sub [bp+4], 1
jz lab1
mov [bp+2], 20
lab1:
add [bp+2], 1

Sample
"if" statement

Branch
without "compare"

So if we come across such code, we must look at the previous
arithmetic instruction, and find out which variable/expression is
actually used as the condition.

Interpreting function calls and return values

Function calls are a bit different. The usual "C"
convention of function calling, is to push the parameters (last
parameter first!) onto the stack, and then call the function. The
function will read the values from the stack, and the return
value is usually placed in some register (usually the
"AX" register). So the trick we can use here is,
whenever we find a "push" instruction (which pushes a
value on the top of the stack) we make a note of it. And
eventually when we reach a "call" instruction (which is
actually a subroutine branch instruction), we look at our
variables to see what all values/variables have been pushed to
the stack, and put them in reverse order to create the high-level
C language statement for the function call!

The example for this method is given below. Please note that
the code on the right is obtained after replacing
"[bp+4]" with "i" and "[bp+2]" with
"j".