Monday, December 29, 2008

Re-creating source code, Part 3

Before I talk about decompilation I need to talk a little about compilers and compilation.

If you don't know it already, a compiler is a computer program that translates source code into machine code. Source code is (supposedly) readable by people, while machine code is what the processor actually executes.

Source code differs from the languages people use with each other: human language has very sloppy "rules" and leans heavily on context. With source code there is no room for sloppiness, because the compiler cannot guess what you really wanted to do.

(Side note: Some computer languages do try to guess what you are trying to do and "help" you. I hate them. They never get it right and are much harder to deal with than a "dumb" compiler. If the computer could guess what I wanted and write it for me, I wouldn't be writing it myself.)

If you have any interest in compilers, look at _Let's Build A Compiler_ by Jack Crenshaw. It's a bit dated now: it uses Turbo Pascal for the compiler examples and targets the 68000. Turbo Pascal is still available from places like the Borland community museum, but when I played with it I wrote the examples in C instead and targeted the x86 rather than the 68000. Even so, it will give you the basics of token parsing, recursive descent expression handling, and code generation.
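To give a feel for what "recursive descent expression handling" and "code generation" mean, here is a minimal sketch in C. It is not Crenshaw's code: the function names (`factor`, `term`, `expression`, `compile_expression`), the tiny grammar (unsigned integers, `+`, `-`, `*`, parentheses, no whitespace), and the simple push/pop style of x86 output are all assumptions made for illustration.

```c
#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* All names here are hypothetical; this is an illustration only. */
static const char *src;   /* next unread character of the input */
static char out[1024];    /* generated assembly text            */

/* Append one instruction; silently drop it if the buffer is full. */
static void emit(const char *line)
{
    if (strlen(out) + strlen(line) + 2 < sizeof out) {
        strcat(out, line);
        strcat(out, "\n");
    }
}

static void expression(void);

/* factor := number | '(' expression ')' */
static void factor(void)
{
    if (*src == '(') {
        src++;                       /* skip '(' */
        expression();
        if (*src == ')')
            src++;                   /* skip ')' */
    } else {
        char line[32];
        int value = 0;
        while (isdigit((unsigned char)*src))
            value = value * 10 + (*src++ - '0');
        snprintf(line, sizeof line, "mov eax, %d", value);
        emit(line);
    }
}

/* term := factor { '*' factor } */
static void term(void)
{
    factor();
    while (*src == '*') {
        src++;
        emit("push eax");            /* save the left operand   */
        factor();                    /* right operand is in eax */
        emit("pop ebx");
        emit("imul eax, ebx");
    }
}

/* expression := term { ('+' | '-') term } */
static void expression(void)
{
    term();
    while (*src == '+' || *src == '-') {
        char op = *src++;
        emit("push eax");            /* save the left operand   */
        term();                      /* right operand is in eax */
        emit("pop ebx");
        if (op == '+') {
            emit("add eax, ebx");
        } else {
            emit("sub ebx, eax");    /* ebx = left - right      */
            emit("mov eax, ebx");
        }
    }
}

/* Translate one expression; the result is left in eax. */
const char *compile_expression(const char *text)
{
    src = text;
    out[0] = '\0';
    expression();
    return out;
}
```

Calling `compile_expression("1+2*3")` walks the grammar from the top: `expression` calls `term`, `term` calls `factor`, and each level emits its instructions once its operands have been handled. That is why the multiply comes out before the add even though the `+` appears first in the source.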

One important thing to consider when looking at compilers with regard to decompilation is that different source code sequences can produce the same machine code sequence. For example,

int i, j[10];
main()
   {
   for (i = 0; i < 10; i++)
      j[i] = i;
   }

produces the same code as

int i, j[10];
main()
   {
   i = 0;
   while (i < 10)
      {
      j[i] = i;
      i++;
      }
   }

While this may be a trivial example, it shows that you can't assume that anything you have decompiled is actually the "original" source code.
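As a rough sketch of the same point, here is the loop written three ways in C. The third version uses `goto` and is roughly the shape a decompiler has to start from: a test, a body, and a jump back. The function names and the array parameter are assumptions for illustration (the listings above use globals). Whether a particular compiler emits byte-for-byte identical machine code for all three depends on the compiler and its optimization settings; the check at the end only confirms that they behave the same.

```c
#include <string.h>

#define N 10

/* The loop written with for, as in the first listing. */
void fill_for(int *a)
{
    int i;
    for (i = 0; i < N; i++)
        a[i] = i;
}

/* The loop written with while, as in the second listing. */
void fill_while(int *a)
{
    int i = 0;
    while (i < N) {
        a[i] = i;
        i++;
    }
}

/* The same loop as a test, a body, and a jump back to the test. */
void fill_goto(int *a)
{
    int i = 0;
top:
    if (!(i < N))
        goto done;
    a[i] = i;
    i++;
    goto top;
done:
    return;
}

/* Returns 1 if all three versions fill the array identically. */
int all_versions_agree(void)
{
    int a[N], b[N], c[N];
    fill_for(a);
    fill_while(b);
    fill_goto(c);
    return memcmp(a, b, sizeof a) == 0 && memcmp(a, c, sizeof a) == 0;
}
```

Given only the machine code, any of these three is a reasonable answer to "what did the source look like?", and nothing in the binary says which one the programmer actually typed.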
