Wednesday, December 31, 2008

Re-creating source code, Part 4

Now, on to decompilation...

I talked briefly before about disassembly, which is the first step in decompilation.

So, for example, if the output from your disassembler is:

   push  bp
   mov   bp,sp
   mov   word ptr [d_0021],0000h
c_000c:
   cmp   word ptr [d_0021],+0ah
   jge   c_0026
   mov   si,[d_0021]
   shl   si,1
   mov   ax,[d_0021]
   mov   [si+d_0023],ax
   inc   word ptr [d_0021]
   jmp   short c_000c
c_0026:
   pop   bp
   ret

what is the corresponding C code?

Well, let's look at this in pieces:
   push  bp
   mov   bp,sp
this is standard C entry code for a function:
   fn()
      {
Notice that no local variables are allocated -- this would shown by
a statement like:
   sub   sp,+02h


The next standard piece is at the end:
c_0026:
   pop   bp
   ret
this is standard C exit code for a function:
      }

In the actual code block, we have a 2-byte integer being set:
   mov   word ptr [d_0021],0000h
So we know we have a global integer:
   int d_0021;
that is set in a C statement:
   d_0021 = 0;
note that we don't know what the integer was originally named.

Next we check the variable we just set:
c_000c:
   cmp   word ptr [d_0021],+0ah
   jge   c_0026
In C:
   if (d_0021 < 10)

We notice that d_0021 is being used as an index (the SI addressing
in the following):
   mov   si,[d_0021]
   shl   si,1
   mov   ax,[d_0021]
   mov   [si+d_0023],ax
and that this is a 2-byte data reference (the SHL SI,1 is a
multiply by 2). This also tells us that we have a global array
of integers:
   int d_0023[some size];
If we think about it, since we know that the index variable is
limited to 0 .. 9 by the previous IF statement, we can guess that
the size is 10. Or, we could just look furthur down in the
disassembly:

d_0021  dw    0000h
d_0023  dw    000ah dup (0000h)

and realize that this is really
   int d_0023[10];
Again, we don't know what it was originally named.

The previous block also shows d_0021 being used as a data value
as well as an index and a counter.
   d_0023[d_0021] = d_0021;

Next, we have:
   inc   word ptr [d_0021]
Or, in C:
   d_0021++;

And, finally, we have:
   jmp   short c_000c
which goes back to the
   if (d_0021 < 10)
statement. Since GOTO is almost never used in C, this means that
the original translation was wrong, and this really should be
   while (d_0021 < 10)

Putting this all together, the decompiled fuction is:

int d_0021;
int d_0023[10];
fn()
   {
   d_0021 = 0;
   while (d_0021 < 10)
      {
      d_0023[d_0021] = d_0021;
      d_0021++;
      }
   }


Compare this to the original C code:

int i, j[10];
main()
   {
   for (i = 0; i < 10; i++)
      j[i] = i;
   }

And you'll note that this is the same ambiguity I mentioned before. Also, while I could have determined that this was really the main() function, not a general function, it would involve a few more steps that I didn't want to get into at this time.

Monday, December 29, 2008

Re-creating source code, Part 3

Before I talk about decompilation I need to talk a little about compilers and compilation.

If you don't know it already, a compiler is a computer program that translates source code into machine code. Source code is (supposedly) readable by people, while machine code is what the processor actually executes.

Source code is different than languages that people use with each other, in that language has very sloppy "rules" and much is based on context. With source code, there is no room for sloppyness, as the compiler cannot guess what you really wanted to do.

(Side note: Some computer languages do try to guess what you are trying to do and "help" you. I hate them. They never get it right and are much harder to deal with than a "dumb" compliler. If the computer could guess what I wanted and write it for me, I wouldn't be writing it myself.)

If you have any interest in compilers, look at _Let's Build A Compiler_ by Jack Crenshaw. It's a bit dated now (it uses Turbo Pascal to write the compiler examples. While this is still available from places like the Borland community museum, when I played with it I wrote the examples in C instead and targeted the processor as the x86 instead of the 68000.), it will give you the basics of token parsing, recursive decent expression handling, and code generation.

One important thing to consider when looking at compiler in regards to decompilation is that different source code sequences produce the same machine code sequence. For example,

int i, j[10];
main()
   {
   for (i = 0; i < 10; i++)
   j[i] = i;
   }

produces the same code as

int i, j[10];
main()
   {
   i = 0;
   while (i < 10)
      {
      j[i] = i;
      i++;
      }
   }

While this may be a trivial example, it shows that you can't assume that anything you have decompiled is actually the "original" source code.

Monday, December 22, 2008

Re-creating source code, Part 2

As I said before, I'm going to talk more about disassembly.

With a von Neumann architecture processor like the x86, there has to be an algorithm to distinguish code from data.

Long ago, when I was writing anti-virus software, I wrote my own disassembler called codegen. The algorithm I used to separate code from data was based on noticing that processor instructions came in 4 "flavors" of flow control:


NORMAL flow control is standard instruction like ADD or SUB. The instruction does not change the instruction sequence, and so the next instruction is the one after the current instruction.


GOTO flow control is an instruction like JMP. The instruction changes the instruction sequence, and the next instruction is the target of the instruction. It is not known if there is an instruction after this one, that must be determined elsewhere.


CALL flow control is an instruction like CALL or INT or JNE. The instruction can change the instruction sequence, and there are instructions both at the target of the instruction and after the current instruction.


EXIT flow control is an instruction like INT 20H. While the processor does not really "stop" processing, there is no target instruction associated with this instruction, the program has exited and it is not known if there is an instruction after this one, that must be determined elsewhere.

These 4 rules, combined with the known entry point for the program, can find most of the code in a program. By examining the instructions found more closely, it can be determined if they access data, and if so where, thus finding the data in the program.

There are some places where this algorithm needs a little human intervention, such as an interrupt vector set by the program (which is data access that determines a code location that is not able to be found by the 4 "flavors" of flow control described above), or a memory indirect JMP (for an x86 processor, this would be an instruction like
JMP BX
or
JMP WORD PTR [SI+1234H]

Thursday, December 18, 2008

Re-creating source code, Part 1

As I talked about last time, I've started re-creating the source to DeSmet C version 2.51 (aka PCC 1.2d, and the version I use the most).

How do you re-create the source to a program you don't have? Well, if you think about it, you have the program -- sort of.

You have the executable.

When the compiler source is run through a compiler, you get an executable program. So, there is a mapping from the source code to the executable. What you want to do is reverse, that and go from the executable to the source.

There are two levels to doing this, disassembly and decompilation.

Disassembly is fairly straight forward, it translates the object code into assembly statements. For an x86 program, there can be a little ambiguity, as code and data are in the same memory space (von Neumann architecture) , and so you have to have some algorithm to determine if the object byte you are looking at is code or data, but this is a minor obstacle (typical embedded processors are Harvard architecture, and so this problem does not exist).

I'll talk more about disassembly in my next post.

Monday, December 15, 2008

Back again...

AAAARRRRRGGGGGHHHHHHH!!!!!!!!!!!!!

Ok, that makes me feel a little better.

Insane kitty is getting wise to the "medicine hidden in the food" trick, and I had to catch her without her being medicated. Got her to the vet, and it's not what we had originally though (infection with abcess). Test results for fungal infection (ringworm) should be back today, but we expect it to be negative. Now, assuming it is negative, we just have to catch her twice a day to put ointment on it and hope it heals...

On the technical side, I've mentioned my favorite C compiler before. It really is.

I've been using it long enough so that my fingers "think" in this dialect. And a few years ago, I found myself doing (as usual) a big project in DeSmet C. And started thinking about C compilers and playing with them.

So, I went on a hunt for the source to the compiler (the compiler vendor had long since vanished). With a little help from Google and Superpages, I managed to find the people, and recovered the source.

But the one thing I couldn't get was the source to the version I use all the time, and the most common version out there -- version 2.51 (found around the net as PCC 1.2d). So, a while back I started re-creating it. I'll talk more about that next time.

Monday, December 8, 2008

No time today...

to do a real post. Insane kitty needs to go back to vet today, and I have to take off from work (among other things) to drug her in time for her to calm down enough to catch her to take her to vet...

{ sigh }

Friday, December 5, 2008

Geek credit...

When you're a computer geek, you really should be proud and proclaim it to the world.

Almost everyone carries a calculator (check out your cell phone).

Having an RPN calculator shows low level geekdom.

Having a calculator watch shows moderate level geekdom.

Having a calculator you built yourself shows high level geekdom.

Having a RPN calculator watch that you built yourself will make all other geeks bow down before you (with the possible exception of those few who built a nixie tube watch).

Wednesday, December 3, 2008

And if you think that's bad...

As I mentioned before, sometimes my programming doesn't sit well with my employer.

It's not because I'm a bad programmer (at least I think so), but other people may have other opinions.

But what if you were trying to write bad code? What if you were trying to confuse, confound, and mislead someone else?

In that case, you should enter the IOCCC (The International Obfuscated C Code Contest). Look over the past winners, and see if your code is legendary.

Monday, December 1, 2008

Microcontrollers and code protection (with pictures!)

A microcontroller is generally considered to be a microprocessor with built-in program memory (usually FLASH) and data memory, plus (maybe) EEPROM and peripherals. It is a "system on a chip".

If you look at many embedded hardware designs, you will see that the circuitry varies from incredibly complex (think cell phone) to fairly straight-forward (although considerable effort may have been put into the printed circuit board layout).

For the straight-forward embedded system, the electronics are reasonably easy to copy. The real value of the system is in the code controlling it. To help protect this, most (if not all) microcontrollers include some sort of code protection, so that the program memory cannot be read out of the microcontroller. That way, someone cannot copy your design without having to re-develop the code.

Of course, there is a value to getting someone else's code instead of developing your own. And where there is money, there are people to do the job. This does not make them bad people, but it does make you consider if it would be quicker / cheaper to get someone else's code instead of writing your own.

Flylogic can, among other things, break into a microcontroller and read out the program code. There are even valid reasons for the company that produces the embedded system to do this (the programmer dropped dead and nobody can find his source code). They also do security analysis of chips to make sure that they are as hard as possible to break into.

And, if you check out their blog, you will see how they tear apart microcontrollers, and they have many wonderful and beautiful photos of the actual silicon chips they contain.