Wednesday, December 31, 2008

Re-creating source code, Part 4

Now, on to decompilation...

I talked briefly before about disassembly, which is the first step in decompilation.

So, for example, if the output from your disassembler is:

   push  bp
   mov   bp,sp
   mov   word ptr [d_0021],0000h
c_000c:
   cmp   word ptr [d_0021],+0ah
   jge   c_0026
   mov   si,[d_0021]
   shl   si,1
   mov   ax,[d_0021]
   mov   [si+d_0023],ax
   inc   word ptr [d_0021]
   jmp   short c_000c
c_0026:
   pop   bp
   ret

what is the corresponding C code?

Well, let's look at this in pieces:
   push  bp
   mov   bp,sp
this is standard C entry code for a function:
   fn()
      {
Notice that no local variables are allocated -- this would be shown by
a statement like:
   sub   sp,+02h


The next standard piece is at the end:
c_0026:
   pop   bp
   ret
this is standard C exit code for a function:
      }

In the actual code block, we have a 2-byte integer being set:
   mov   word ptr [d_0021],0000h
So we know we have a global integer:
   int d_0021;
that is set in a C statement:
   d_0021 = 0;
Note that we don't know what the integer was originally named.

Next we check the variable we just set:
c_000c:
   cmp   word ptr [d_0021],+0ah
   jge   c_0026
The JGE skips the block when d_0021 is 10 or more, so the block
runs when the opposite is true. In C:
   if (d_0021 < 10)
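
JGE is a signed comparison, which is a hint that the variable was declared int rather than unsigned (an unsigned compare would normally use JAE instead). A minimal sketch of the difference, using the fixed-width types from <stdint.h> to stand in for the 16-bit int of the original compiler; the function names are hypothetical and do not come from the disassembly:

```c
#include <stdint.h>

/* Signed 16-bit compare: the reading that matches "cmp ...,+0ah / jge". */
int below_ten_signed(int16_t v)
{
    return v < 10;
}

/* Unsigned 16-bit compare: the same bits, read as 0 .. 65535. */
int below_ten_unsigned(uint16_t v)
{
    return v < 10;
}
```

The two only disagree for bit patterns with the top bit set: 0FFFFh is -1 when read as signed (below ten) and 65535 when read as unsigned (not below ten).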

We notice that d_0021 is being used as an index (the SI addressing
in the following):
   mov   si,[d_0021]
   shl   si,1
   mov   ax,[d_0021]
   mov   [si+d_0023],ax
and that this is a 2-byte data reference (the SHL SI,1 is a
multiply by 2). This also tells us that we have a global array
of integers:
   int d_0023[some size];
If we think about it, since we know that the index variable is
limited to 0 .. 9 by the previous IF statement, we can guess that
the size is 10. Or, we could just look further down in the
disassembly:

d_0021  dw    0000h
d_0023  dw    000ah dup (0000h)

and realize that this is really
   int d_0023[10];
Again, we don't know what it was originally named.

The previous block also shows d_0021 being used as a data value
as well as an index and a counter:
   d_0023[d_0021] = d_0021;
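
The SHL SI,1 / [si+d_0023] pair is just array indexing written out by hand: byte offset = index * element size. A small sketch, assuming a 2-byte int (int16_t here, since int is usually 4 bytes on current compilers); element_addr is a hypothetical helper, not something from the disassembly:

```c
#include <stddef.h>
#include <stdint.h>

int16_t d_0023[10];

/* Mimic "shl si,1" followed by "[si+d_0023]": scale the index by the
   2-byte element size, then add it to the array's base address. */
int16_t *element_addr(uint16_t index)
{
    size_t byte_offset = (size_t)index << 1;   /* shl si,1 */
    return (int16_t *)((unsigned char *)d_0023 + byte_offset);
}
```

For any index in 0 .. 9, element_addr(index) is the same address as &d_0023[index], which is why the four instructions collapse into a single subscripted assignment.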

Next, we have:
   inc   word ptr [d_0021]
Or, in C:
   d_0021++;

And, finally, we have:
   jmp   short c_000c
which goes back to the
   if (d_0021 < 10)
statement. Since GOTO is almost never used in C, a backward jump to
a test like this means that the original IF translation was wrong,
and this really should be
   while (d_0021 < 10)
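
To see why the WHILE reading is the natural one, here is the literal, instruction-by-instruction translation next to the structured version. This is only a sketch: fn_literal and fn_structured are hypothetical names, and the labels are copied from the disassembly.

```c
int d_0021;
int d_0023[10];

/* Literal translation: one C statement per instruction group. */
void fn_literal(void)
{
    d_0021 = 0;
c_000c:
    if (d_0021 >= 10)            /* cmp / jge c_0026 */
        goto c_0026;
    d_0023[d_0021] = d_0021;
    d_0021++;
    goto c_000c;                 /* jmp short c_000c */
c_0026:
    return;
}

/* Structured translation: the backward jump becomes a while loop. */
void fn_structured(void)
{
    d_0021 = 0;
    while (d_0021 < 10) {
        d_0023[d_0021] = d_0021;
        d_0021++;
    }
}
```

Both functions leave d_0021 at 10 and fill d_0023 with 0 .. 9; the second is simply what a person would have written.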

Putting this all together, the decompiled function is:

int d_0021;
int d_0023[10];
fn()
   {
   d_0021 = 0;
   while (d_0021 < 10)
      {
      d_0023[d_0021] = d_0021;
      d_0021++;
      }
   }


Compare this to the original C code:

int i, j[10];
main()
   {
   for (i = 0; i < 10; i++)
      j[i] = i;
   }

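You'll note that this is the same kind of ambiguity I mentioned before: the FOR loop in the original and the WHILE loop in the decompiled version describe the same instructions, so the disassembly alone can't tell us which one was written. Also, while I could have determined that this was really the main() function rather than a general function, that would involve a few more steps that I don't want to get into at this time.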
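
A quick way to check that the two listings behave the same is to run both and compare the results. A minimal sketch: the functions are renamed (decompiled_fn, original_fn) and the globals kept separate so that both versions fit in one file, and same_result is a hypothetical helper.

```c
#include <string.h>

int d_0021, d_0023[10];   /* names from the disassembly */
int i, j[10];             /* names from the original source */

/* The decompiled version, with the while loop. */
void decompiled_fn(void)
{
    d_0021 = 0;
    while (d_0021 < 10) {
        d_0023[d_0021] = d_0021;
        d_0021++;
    }
}

/* The original version, with the for loop. */
void original_fn(void)
{
    for (i = 0; i < 10; i++)
        j[i] = i;
}

/* Returns 1 if both versions leave the counter and array identical. */
int same_result(void)
{
    decompiled_fn();
    original_fn();
    return d_0021 == i && memcmp(d_0023, j, sizeof j) == 0;
}
```

This only compares the final state of the variables; it says nothing about whether a given compiler would emit identical instructions for the two forms.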
