Wednesday, December 31, 2008

Re-creating source code, Part 4

Now, on to decompilation...

I talked briefly before about disassembly, which is the first step in decompilation.

So, for example, if the output from your disassembler is:

   push  bp
   mov   bp,sp
   mov   word ptr [d_0021],0000h
c_000c:
   cmp   word ptr [d_0021],+0ah
   jge   c_0026
   mov   si,[d_0021]
   shl   si,1
   mov   ax,[d_0021]
   mov   [si+d_0023],ax
   inc   word ptr [d_0021]
   jmp   short c_000c
c_0026:
   pop   bp
   ret

what is the corresponding C code?

Well, let's look at this in pieces:
   push  bp
   mov   bp,sp
this is standard C entry code for a function:
   fn()
      {
Notice that no local variables are allocated -- this would be shown by
a statement like:
   sub   sp,+02h


The next standard piece is at the end:
c_0026:
   pop   bp
   ret
this is standard C exit code for a function:
      }

In the actual code block, we have a 2-byte integer being set:
   mov   word ptr [d_0021],0000h
So we know we have a global integer:
   int d_0021;
that is set in a C statement:
   d_0021 = 0;
note that we don't know what the integer was originally named.

Next we check the variable we just set:
c_000c:
   cmp   word ptr [d_0021],+0ah
   jge   c_0026
In C:
   if (d_0021 < 10)

We notice that d_0021 is being used as an index (the SI addressing
in the following):
   mov   si,[d_0021]
   shl   si,1
   mov   ax,[d_0021]
   mov   [si+d_0023],ax
and that this is a 2-byte data reference (the SHL SI,1 is a
multiply by 2). This also tells us that we have a global array
of integers:
   int d_0023[some size];
If we think about it, since we know that the index variable is
limited to 0 .. 9 by the previous IF statement, we can guess that
the size is 10. Or, we could just look further down in the
disassembly:

d_0021  dw    0000h
d_0023  dw    000ah dup (0000h)

and realize that this is really
   int d_0023[10];
Again, we don't know what it was originally named.

The previous block also shows d_0021 being used as a data value
as well as an index and a counter.
   d_0023[d_0021] = d_0021;
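
To make the SHL SI,1 step concrete, here is a minimal sketch of the same address arithmetic in C. The names table, byte_offset and store_by_offset are made up for illustration (they are not from the disassembly), and int16_t is used as an assumption so that the element stays 2 bytes wide on a modern compiler, as it was for int on the original 16-bit one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for d_0023: ten 2-byte integers,
   matching the 16-bit int of the original compiler. */
static int16_t table[10];

/* Byte offset of element `index`, computed the way the
   disassembly does it: SHL SI,1 doubles the index because
   each element is 2 bytes wide. */
static size_t byte_offset(unsigned index)
{
    return (size_t)index << 1;
}

/* Store `value` at table[index] using only the base address
   plus the hand-computed byte offset, i.e. [si+d_0023]. */
static void store_by_offset(unsigned index, int16_t value)
{
    int16_t *slot;

    assert(index < 10);
    slot = (int16_t *)((char *)table + byte_offset(index));
    *slot = value;
}
```

Calling store_by_offset(3, 3) lands in the same slot as table[3] = 3, which is why the shift-then-add pattern can be read back as plain array indexing.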

Next, we have:
   inc   word ptr [d_0021]
Or, in C:
   d_0021++;

And, finally, we have:
   jmp   short c_000c
which goes back to the
   if (d_0021 < 10)
statement. Since GOTO is almost never used in C, this means that
the original translation was wrong, and this really should be
   while (d_0021 < 10)
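
For comparison, here is a sketch of what a strictly literal, instruction-by-instruction translation would look like next to the loop form. The names counter, table, fill_with_goto, fill_with_while, table_is_filled and check_both are hypothetical, chosen only for this illustration; counter stands in for d_0021 and table for d_0023.

```c
#include <assert.h>

/* Hypothetical names: `counter` stands in for d_0021 and
   `table` for d_0023. */
static int counter;
static int table[10];

/* Literal translation: every jump in the disassembly
   becomes a goto. */
static void fill_with_goto(void)
{
    counter = 0;                 /* mov word ptr [d_0021],0000h */
top:                             /* c_000c:                     */
    if (counter >= 10)           /* cmp ... / jge c_0026        */
        goto done;
    table[counter] = counter;    /* the mov/shl/mov/mov lines   */
    counter++;                   /* inc word ptr [d_0021]       */
    goto top;                    /* jmp short c_000c            */
done:                            /* c_0026:                     */
    return;
}

/* The same control flow recognized as a loop. */
static void fill_with_while(void)
{
    counter = 0;
    while (counter < 10) {
        table[counter] = counter;
        counter++;
    }
}

/* Returns 1 if table holds 0..9 and counter ended at 10. */
static int table_is_filled(void)
{
    int k;

    for (k = 0; k < 10; k++)
        if (table[k] != k)
            return 0;
    return counter == 10;
}

/* Run both versions and confirm they leave the same state. */
static void check_both(void)
{
    int k;

    fill_with_goto();
    assert(table_is_filled());
    for (k = 0; k < 10; k++)     /* wipe the table between runs */
        table[k] = -1;
    fill_with_while();
    assert(table_is_filled());
}
```

Both functions leave the same values behind; the goto version just mirrors the jumps one for one, while the while version is what a C programmer would more plausibly have written.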

Putting this all together, the decompiled function is:

int d_0021;
int d_0023[10];
fn()
   {
   d_0021 = 0;
   while (d_0021 < 10)
      {
      d_0023[d_0021] = d_0021;
      d_0021++;
      }
   }


Compare this to the original C code:

int i, j[10];
main()
   {
   for (i = 0; i < 10; i++)
      j[i] = i;
   }

And you'll note that this is the same ambiguity I mentioned before. Also, while I could have determined that this was really the main() function, not a general function, it would involve a few more steps that I didn't want to get into at this time.

Monday, December 29, 2008

Re-creating source code, Part 3

Before I talk about decompilation I need to talk a little about compilers and compilation.

If you don't know it already, a compiler is a computer program that translates source code into machine code. Source code is (supposedly) readable by people, while machine code is what the processor actually executes.

Source code is different from the languages that people use with each other, in that human language has very sloppy "rules" and much is based on context. With source code, there is no room for sloppiness, as the compiler cannot guess what you really wanted to do.

(Side note: Some computer languages do try to guess what you are trying to do and "help" you. I hate them. They never get it right and are much harder to deal with than a "dumb" compiler. If the computer could guess what I wanted and write it for me, I wouldn't be writing it myself.)

If you have any interest in compilers, look at _Let's Build A Compiler_ by Jack Crenshaw. It's a bit dated now: it uses Turbo Pascal to write the compiler examples and targets the 68000. Turbo Pascal is still available from places like the Borland community museum, but when I played with it I wrote the examples in C instead and targeted the x86. Even so, it will give you the basics of token parsing, recursive descent expression handling, and code generation.

One important thing to consider when looking at compilers with regard to decompilation is that different source code sequences can produce the same machine code sequence. For example,

int i, j[10];
main()
   {
   for (i = 0; i < 10; i++)
      j[i] = i;
   }

produces the same code as

int i, j[10];
main()
   {
   i = 0;
   while (i < 10)
      {
      j[i] = i;
      i++;
      }
   }

While this may be a trivial example, it shows that you can't assume that anything you have decompiled is actually the "original" source code.

Monday, December 22, 2008

Re-creating source code, Part 2

As I said before, I'm going to talk more about disassembly.

With a von Neumann architecture processor like the x86, there has to be an algorithm to distinguish code from data.

Long ago, when I was writing anti-virus software, I wrote my own disassembler called codegen. The algorithm I used to separate code from data was based on noticing that processor instructions came in 4 "flavors" of flow control:


NORMAL flow control is a standard instruction like ADD or SUB. The instruction does not change the instruction sequence, and so the next instruction is the one after the current instruction.


GOTO flow control is an instruction like JMP. The instruction changes the instruction sequence, and the next instruction is the target of the instruction. It is not known if there is an instruction after this one; that must be determined elsewhere.


CALL flow control is an instruction like CALL or INT or JNE. The instruction can change the instruction sequence, and there are instructions both at the target of the instruction and after the current instruction.


EXIT flow control is an instruction like INT 20H. While the processor does not really "stop" processing, there is no target instruction associated with this instruction: the program has exited. It is not known if there is an instruction after this one; that must be determined elsewhere.

These 4 rules, combined with the known entry point for the program, can find most of the code in a program. By examining the instructions found more closely, it can be determined if they access data, and if so where, thus finding the data in the program.
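
The tracing idea can be sketched in a few lines of C. This is not codegen itself and it does not decode real x86. It assumes a made-up instruction set in which every instruction is 2 bytes (an opcode followed by an absolute target address), with one opcode per flavor. All names here (find_code, sample, OP_NORMAL and so on) are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { IMAGE_SIZE = 16, INSN_SIZE = 2 };

/* Hypothetical toy opcodes, one per flow-control flavor. */
enum { OP_NORMAL = 1, OP_GOTO = 2, OP_CALL = 3, OP_EXIT = 4 };

/* Toy program image: code with two pockets of data in it. */
static const unsigned char sample[IMAGE_SIZE] = {
    OP_NORMAL, 0,    /*  0: falls through to 2              */
    OP_CALL,   8,    /*  2: code at 8 AND at 4              */
    OP_GOTO,  12,    /*  4: code at 12, nothing known at 6  */
    0xAA,   0xBB,    /*  6: never reached, presumed data    */
    OP_NORMAL, 0,    /*  8: target of the CALL              */
    OP_EXIT,   0,    /* 10: end of this path                */
    OP_EXIT,   0,    /* 12: target of the GOTO              */
    0x12,   0x34     /* 14: never reached, presumed data    */
};

/* Mark every byte reachable from `entry` as code (1).
   Bytes never reached stay 0 and are presumed to be data. */
static void find_code(const unsigned char *image, size_t entry,
                      unsigned char *is_code)
{
    size_t pending[IMAGE_SIZE];   /* addresses still to trace */
    size_t count = 0;

    assert(entry < IMAGE_SIZE);
    memset(is_code, 0, IMAGE_SIZE);
    pending[count++] = entry;

    while (count > 0) {
        size_t pc = pending[--count];

        /* Follow one path until it ends or rejoins known code. */
        while (pc + INSN_SIZE <= IMAGE_SIZE && !is_code[pc]) {
            unsigned char op = image[pc];
            size_t target = image[pc + 1];

            if (op < OP_NORMAL || op > OP_EXIT)
                break;                     /* not an instruction */
            is_code[pc] = is_code[pc + 1] = 1;

            if (op == OP_EXIT)
                break;                     /* no successor known */
            if (op == OP_GOTO) {
                pc = target;               /* only the target    */
                continue;
            }
            if (op == OP_CALL && count < IMAGE_SIZE)
                pending[count++] = target; /* target, traced later */
            pc += INSN_SIZE;               /* NORMAL and CALL fall through */
        }
    }
}
```

Running find_code(sample, 0, is_code) marks bytes 0 through 5 and 8 through 13 as code and leaves 6, 7, 14 and 15 unmarked. A real disassembler would also need variable-length instruction decoding and a second pass over the operands to locate data references, both of which this sketch leaves out.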

There are some places where this algorithm needs a little human intervention, such as an interrupt vector set by the program (a data access that determines a code location that the 4 "flavors" of flow control described above cannot find), or an indirect JMP through a register or memory. For an x86 processor, this would be an instruction like
JMP BX
or
JMP WORD PTR [SI+1234H]
where the target is not known until the program runs.

Thursday, December 18, 2008

Re-creating source code, Part 1

As I talked about last time, I've started re-creating the source to DeSmet C version 2.51 (aka PCC 1.2d, and the version I use the most).

How do you re-create the source to a program you don't have? Well, if you think about it, you have the program -- sort of.

You have the executable.

When the compiler source is run through a compiler, you get an executable program. So, there is a mapping from the source code to the executable. What you want to do is reverse that, and go from the executable to the source.

There are two levels to doing this, disassembly and decompilation.

Disassembly is fairly straightforward: it translates the object code into assembly statements. For an x86 program, there can be a little ambiguity, as code and data are in the same memory space (von Neumann architecture), so you need some algorithm to determine if the object byte you are looking at is code or data, but this is a minor obstacle (typical embedded processors are Harvard architecture, and so this problem does not exist).

I'll talk more about disassembly in my next post.

Monday, December 15, 2008

Back again...

AAAARRRRRGGGGGHHHHHHH!!!!!!!!!!!!!

Ok, that makes me feel a little better.

Insane kitty is getting wise to the "medicine hidden in the food" trick, and I had to catch her without her being medicated. Got her to the vet, and it's not what we had originally thought (infection with abscess). Test results for fungal infection (ringworm) should be back today, but we expect it to be negative. Now, assuming it is negative, we just have to catch her twice a day to put ointment on it and hope it heals...

On the technical side, I've mentioned my favorite C compiler before. It really is.

I've been using it long enough so that my fingers "think" in this dialect. And a few years ago, I found myself doing (as usual) a big project in DeSmet C. And started thinking about C compilers and playing with them.

So, I went on a hunt for the source to the compiler (the compiler vendor had long since vanished). With a little help from Google and Superpages, I managed to find the people, and recovered the source.

But the one thing I couldn't get was the source to the version I use all the time, and the most common version out there -- version 2.51 (found around the net as PCC 1.2d). So, a while back I started re-creating it. I'll talk more about that next time.

Monday, December 8, 2008

No time today...

to do a real post. Insane kitty needs to go back to vet today, and I have to take off from work (among other things) to drug her in time for her to calm down enough to catch her to take her to vet...

{ sigh }

Friday, December 5, 2008

Geek credit...

When you're a computer geek, you really should be proud and proclaim it to the world.

Almost everyone carries a calculator (check out your cell phone).

Having an RPN calculator shows low level geekdom.

Having a calculator watch shows moderate level geekdom.

Having a calculator you built yourself shows high level geekdom.

Having a RPN calculator watch that you built yourself will make all other geeks bow down before you (with the possible exception of those few who built a nixie tube watch).

Wednesday, December 3, 2008

And if you think that's bad...

As I mentioned before, sometimes my programming doesn't sit well with my employer.

It's not because I'm a bad programmer (at least I don't think I am), but other people may have other opinions.

But what if you were trying to write bad code? What if you were trying to confuse, confound, and mislead someone else?

In that case, you should enter the IOCCC (The International Obfuscated C Code Contest). Look over the past winners, and see if your code is legendary.

Monday, December 1, 2008

Microcontrollers and code protection (with pictures!)

A microcontroller is generally considered to be a microprocessor with built-in program memory (usually FLASH) and data memory, plus (maybe) EEPROM and peripherals. It is a "system on a chip".

If you look at many embedded hardware designs, you will see that the circuitry varies from incredibly complex (think cell phone) to fairly straight-forward (although considerable effort may have been put into the printed circuit board layout).

For the straight-forward embedded system, the electronics are reasonably easy to copy. The real value of the system is in the code controlling it. To help protect this, most (if not all) microcontrollers include some sort of code protection, so that the program memory cannot be read out of the microcontroller. That way, someone cannot copy your design without having to re-develop the code.

Of course, there is a value to getting someone else's code instead of developing your own. And where there is money, there are people to do the job. This does not make them bad people, but it does make you consider if it would be quicker / cheaper to get someone else's code instead of writing your own.

Flylogic can, among other things, break into a microcontroller and read out the program code. There are even valid reasons for the company that produces the embedded system to do this (the programmer dropped dead and nobody can find his source code). They also do security analysis of chips to make sure that they are as hard as possible to break into.

And, if you check out their blog, you will see how they tear apart microcontrollers, and they have many wonderful and beautiful photos of the actual silicon chips they contain.

Monday, November 24, 2008

Interruptions...


Well, as mentioned before, sometimes life gets in the way of doing stuff.

Last Tuesday, it interrupted me again.

My wife and I foster kittens for a local cat rescue group. As we have found out, sometimes this is a lifelong (the cat's life, hopefully not ours...) obligation.

We have four cats that, while technically not "ours", are going to be with us for (their) lives. Two refuse to have anything to do with being adopted out, two are insane.

Yes, insane. We have been fostering them since they were about 10 weeks old, and someone else was fostering them before that. It was a litter of six kittens. Four of them were normal. These two were afraid of people, and never got any better (we've taken in half-feral kittens from TNR projects that we have made into lap cats. Not these two...). I call them my "indoor ferals".

One of them, Tange (pictured above), has, for some reason, terrible gingivitis. We finally got her to the vet to get her teeth cleaned. Things were so bad, the vet wound up removing 7 of her teeth (poor kitty).

Because of this, she needs special food and medication until she heals. But, since we can't put our hands on her, we've had to put her in a playpen that we normally use to keep kittens so that we don't have to hunt her down and catch her twice a day. Of course, she acts about how you would expect...

So, I haven't had much time for the last week or so. I'll try to get back into it soon.

Monday, November 17, 2008

I can't see what I'm doing

Being an "old fart", I have special glasses to view a computer screen. Not my normal bifocals for both distance and reading, these are for "intermediate distance" like looking at a computer screen.

Today, I can't find them. Maybe the cats are using them...

In the meantime, go check out this mechanical computer that you might remember from the back pages of comic books when you were young (if you're an "old fart" like me...)

Friday, November 14, 2008

Sometimes I annoy my employer with my programming...

...but not the way you might think.

I'm not a Windows programmer. Most computers run Windows (or some other "windowing" operating system, like OS X or a Linux desktop). It's what people expect these days.

But I'm a luddite about computer programming. 640K was good enough for Bill Gates (just kidding... it's often attributed to him, but there are no citations). It's good enough for me. In fact, I almost never use more than 128K of RAM (the old DOS small memory model -- it's what my favorite C compiler generates).

And if you're programming for an old-style DOS computer, the Disk Operating System (DOS) and Basic Input/Output System (BIOS) are very helpful, but you don't have to use them. If you want to talk directly to the hardware, you are free to do so (just like an embedded system...)

I've tried to get my feet wet in Windows programming, but never could. I always wound up with something like "I want to call function _foo() here. What do you mean I can't -- it's not in my class inheritance!?!? And, despite what the manual says, the compiler won't let me call it classlessly. Grrr... time to drop this steaming pile of ****". Or, trying to do a simple system call, you need all sorts of funky constructs like "try ... catch" for error handling. Of course, you still need an error return, so your program knows if it can proceed. But you still need to do tons of special error handling to satisfy Windows (and the compiler), on top of what you need to write to make sure your software works smoothly.

The upshot of all this is that, when working on an embedded system, chances are that you need to simulate the "other half" as well. It may be a simple potentiometer that varies the simulated input to the system. It is often a computer program that pretends to be another part of the embedded system. Or it could be a PC program that sends setup / calibration information to the embedded system.

When I need to write the "other half", I write it as a DOS program. It works. It's fast and (for me) very easy (I have about 20 years of software to draw on for snippets).

So, I now have a program that took me a day or so to write that lets me talk to / calibrate / whatnot the embedded system I'm working on. Great! Except for one thing...

When it's time for me to hand the project off to someone else, they ask me how to setup / calibrate / whatnot the system. What I use is my DOS program. I can give them a copy, but:

1) They don't like it (not pretty enough, I guess...)
2) Management doesn't like it (not corporate standards or something)


So, why not get the equivalent Windows program written? It can't be that hard, can it? It only took me a day or two, and I can give them all the algorithms, data structures, and whatnot.

But it never works that way. It often takes a couple of weeks to get the equivalent Windows program written. Sometimes they never do get it written.

And this is why my employer gets annoyed...

Thursday, November 13, 2008

What is "embedded"?

In looking at computers, some systems are called "embedded". Looking at the hardware, some of them (like ATMs) could be a normal desktop system (though not in a standard case), and run Windows.

So, what makes an "embedded" system? It's the fact that the system is dedicated (or even specifically designed) to run a specific application. The ATM hardware is dedicated to running the ATM application. Or it could be a dedicated control board, like an Arduino.

It's not something that you can write a term paper on, then fire up the latest FPS. It's a computer that is designed to do only 1 (or a limited number of) specific tasks, and they cannot be changed by the end user.

Monday, November 10, 2008

What is a computer?

As a side note: Well, I really do intend to update M W F. All I have to do is put life aside occasionally...

But, to continue my last post, the question of "what is programming?" leads into the question of "what is a computer?" and that gets to be a tricky little question.

Most people think that a "computer" is what they are looking at / using when they read this blog. But that doesn't even begin to cover what it is.

Looking back to the start of the computer age, you have systems like ENIAC (Electronic Numerical Integrator and Computer) and the like, where you have a room full of vacuum tube electronics.

Or a modern microcontroller, the size of a grain of rice!

And everything in between. How many people look at their TV remote control and think "computer"? You may not be able to program it, but someone wrote the software for it...

Professionally, I write software for very small embedded systems. And have a lot of fun doing it.

Tuesday, November 4, 2008

USA Election Day!

If you are in the USA, today (11/4/2008) is election day!

VOTE!

Monday, November 3, 2008

What is programming?

Before I talk (more) about programming and what I do for fun and profit, the question should be asked:

What is computer programming?

Back "in the day", I wrote in Assembly for speed and compactness, and in C for complexity. I, and the people I knew, looked down on people who thought that putting numbers in a spreadsheet (in Lotus 1-2-3) was "programming".

But, if you look at computers today and what is being done with them, the real answer becomes "whatever is written that controls the operation of a computer". So, writing in C, or Assembly, or Java, or Cobol, or even writing a web page is programming, but posting on your blog (or Slashdot or Digg or ...) is not.

But, this also leads into the next question "What is a computer?", which is the subject of my next post on Wednesday.

Tuesday, October 28, 2008

Sometimes you just want to have fun...

I got back from a small "vacation" on Sunday (went to Minneapolis, with my wife, for a trade show, and stayed to see family/friends I have there).

I'm having trouble getting my mind back on work, so instead of thinking about physics, microprocessors, and anything like that, go check out the Roller Coaster Data Base and make your plans.

Monday, October 20, 2008

One last spacey post...

It's time to talk about other things, but if you have any interest in space, please take the time to join NSS, the National Space Society.

And check out The Moon Society and The Lunar Project while you're at it.

Ad astra per aspera!

Friday, October 17, 2008

The Artsy side of life

One of the wonderful things about "old-school" science fiction is the wonderful cover art. Sometimes realistic, sometimes inspiring, sometimes lurid, and always eye-catching. How often do we measure current progress against visions of the future from the 1940's and 1950's (where's my flying car, dammit)?

If you have a week or two to take a trip down memory lane (and find some roads you never even knew existed), check out VISCO, the Visual Index of Science Fiction Cover Art.

And, for a wonderful way to randomly browse covers quickly, check out Jim Bumgardner's CoverPop of SF cover art.

Wednesday, October 15, 2008

Would you like a little science with your fiction?

Not surprisingly, I'm a science fiction fan.

Personally, my preference is for hard science fiction over
soft, though things like BSG are among my favorites...

But, for those of you who want to think about the science of the rockets, space travel, and such things, Winchell Chung has a wonderful website full of math and science that will keep you occupied for days, if not weeks.

Monday, October 13, 2008

M.A.R.C.H. back to the future...

Hmmm... I posted my first entry on the 9th, and it says it was on the 1st. Interesting...

I'm going to try to update this blog M W F. And this will be broken next week (Oct 22nd and 24th, 2008) because I will be out of town on business. If I can find a spare moment, I'll see if I can post something.


M.A.R.C.H. stands for Mid-Atlantic Retro Computing Hobbyists.

In "the good old days", computer programming meant you worried about how many bytes (not megabytes / gigabytes) you were using, how many machine cycles something took, and the I/O state of the system.

A lot like embedded programming today.

Wednesday, October 1, 2008

Welcome!

Well, since everyone else is doing it...

This blog will be less about my personal life, and more about things that I find interesting. There are billions of webpages out there, and no way to sample even a fraction of them. But, as I "stumble upon" things, I'll put them here and maybe guide you to things that you will find interesting or useful.

I'm an "old fogie" of computers -- started in 1974 with teletypes over modems to a mainframe (HP2000C running timeshare BASIC), then other mainframe systems (CDC Cyber).

Eventually, I started programming microcomputers (notably Teraks in UCSD Pascal) and an S-100 bus system with a Seattle Computer Products board set (using SCP DOS version 0.1, serial number 11!).

After college, I worked at IBM, programming the space shuttle (which isn't the great job it sounds like), and later wound up doing anti-virus programming on the IBM-PC.

As time passed, I got out of PC programming and entered the embedded software world, which is actually surprisingly close to programming an IBM-PC in assembly -- you have complete control of the system, and have to worry about what is going on in hardware at the microsecond level.