Friday, January 16, 2009

Re-creating source code, Part 8

OK, enough suspense...

DeSmet C was written in DeSmet C.

But which version? Aye, there's the rub.

I don't know.

As I said before, later versions of the compiler have additional functionality and improvements. Among the improvements are better code generation and better register utilization.

For example, the C source file

int i;
char *j;
main(argc,argv)
   int argc;
   char *argv[];
   {
   for (i = 0; i < argc; i++)
      j = argv[i];
   }

compiled with either version 2.51 or 3.03 generates (for the "j = argv[i];")

mov   si,word i_
shl   si,1
add   si,word [bp+6]
mov   si,word [si]
mov   word j_,si

but compiled with version 2.40, we get

mov   ax,word i_
shl   ax,1
mov   si,ax
add   si,word [bp+6]
mov   si,word [si]
mov   word j_,si

In addition, the x86 architecture has both "short JMPs" (for locations within -126 to +129 bytes of the current location, measured from the start of the JMP instruction) and "long JMPs" (for locations within the current 64K segment). (It also has FAR JMPs, but we don't have to concern ourselves with those for the small memory model of DeSmet C.)
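
The slightly odd-looking -126 to +129 range falls out of the encoding: a short JMP is two bytes (opcode 0xEB plus a signed 8-bit displacement), and the displacement is counted from the end of the instruction, not the start. Here is a minimal sketch of the range check. The function name fits_short_jmp is my own for illustration, and the sketch ignores wrap-around at the 64K segment boundary.

```c
#include <stdint.h>

/* A short JMP is 2 bytes: opcode 0xEB followed by a signed 8-bit
   displacement, measured from the END of the instruction. */
#define SHORT_JMP_LEN 2

/* Returns 1 if a short JMP whose first byte is at offset `from`
   can reach offset `to`, 0 otherwise.
   (Illustrative helper; ignores 64K segment wrap-around.) */
int fits_short_jmp(long from, long to)
{
    long disp = to - (from + SHORT_JMP_LEN);
    return disp >= INT8_MIN && disp <= INT8_MAX;
}
```

With a JMP at offset 1000, targets from 874 (1000 - 126) through 1129 (1000 + 129) are reachable; one byte further in either direction is not.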

All conditional JMPs are "short JMPs". (Strictly speaking, "near JMP" is Intel's name for the within-segment form that I am calling a long JMP.) If your conditional has to send you to a location that is further away, you have to do a reverse-sense conditional JMP around a long JMP.
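
To make the "reverse-sense JMP around a long JMP" trick concrete, here is a sketch in C of how a code generator might emit the bytes. It relies on three encoding facts: the 8086 conditional JMPs are opcodes 0x70 through 0x7F with an 8-bit displacement, flipping the low bit of the opcode reverses the condition (JE 0x74 becomes JNE 0x75), and the long JMP is 0xE9 followed by a 16-bit displacement. The function name emit_jcc and its interface are my own invention, not anything taken from DeSmet C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Emit a conditional jump located at offset `from` that targets offset `to`.
   `jcc_opcode` is one of the 8086 short conditional jumps (0x70..0x7F).
   Writes 2 or 5 bytes into buf and returns the count.
   (Hypothetical helper for illustration; ignores segment wrap-around.) */
size_t emit_jcc(uint8_t *buf, uint8_t jcc_opcode, long from, long to)
{
    long disp;
    uint16_t d16;

    assert(jcc_opcode >= 0x70 && jcc_opcode <= 0x7F);

    /* Short form: displacement is relative to the end of the 2-byte Jcc. */
    disp = to - (from + 2);
    if (disp >= -128 && disp <= 127) {
        buf[0] = jcc_opcode;
        buf[1] = (uint8_t)disp;
        return 2;
    }

    /* Too far: reverse the condition and hop over a 3-byte long JMP. */
    buf[0] = (uint8_t)(jcc_opcode ^ 1); /* e.g. JE (0x74) -> JNE (0x75) */
    buf[1] = 3;                         /* skip the E9 lo hi that follows */

    /* Long JMP: 16-bit displacement relative to the end of the JMP,
       which sits at from + 5. */
    d16 = (uint16_t)(to - (from + 5));
    buf[2] = 0xE9;
    buf[3] = (uint8_t)(d16 & 0xFF);     /* low byte first (little-endian) */
    buf[4] = (uint8_t)(d16 >> 8);
    return 5;
}
```

For example, a JE at offset 0x100 to 0x110 comes out as 74 0E, while the same JE to 0x300 comes out as 75 03 E9 FB 01: a JNE over a long JMP. Three extra bytes per out-of-range conditional is exactly why the choice matters for executable size.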

The compiler tries to use short JMPs wherever possible (both unconditional and conditional) to produce smaller executables. But guessing how far away the target of a forward JMP will be is tricky, and sometimes the compiler errs on the side of safety and uses a longer JMP just to make sure that it doesn't try to go "out of bounds". The exact rules, however, vary from compiler version to compiler version.
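
I don't know the actual rules any DeSmet C version uses, but the general shape of the problem can be sketched. When a forward JMP is emitted, the code between it and its target hasn't been generated yet, so all the compiler has is an estimate. The sketch below assumes a purely hypothetical rule (count the instructions to be skipped and charge each one a worst-case size) and compares it with the answer you get once the real sizes are known. Every name and constant here is an assumption for illustration only.

```c
enum jmp_form { JMP_SHORT, JMP_LONG };

/* Assumed worst-case size, in bytes, of one generated instruction.
   A made-up figure for this sketch, not a DeSmet C constant. */
#define WORST_CASE_INSN_BYTES 6

/* Pessimistic guess, made before the skipped code has been generated:
   assume every skipped instruction is as large as it could be. */
enum jmp_form guess_forward_jmp(int insns_to_skip)
{
    long worst = (long)insns_to_skip * WORST_CASE_INSN_BYTES;
    return worst <= 127 ? JMP_SHORT : JMP_LONG;
}

/* Exact answer, available only after the skipped code exists:
   the forward displacement is the sum of the real instruction sizes. */
enum jmp_form exact_forward_jmp(const int *sizes, int n)
{
    long total = 0;
    int i;
    for (i = 0; i < n; i++)
        total += sizes[i];
    return total <= 127 ? JMP_SHORT : JMP_LONG;
}
```

Skipping 30 instructions that each turn out to be 3 bytes needs only a 90-byte displacement, so a short JMP would have worked, but the pessimistic guess (30 x 6 = 180) picks the long form. Change the estimating rule even slightly and different JMPs flip between short and long, which is the kind of version-to-version difference I am seeing.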

The result of all this is that, even after re-creating the source to version 2.51, I cannot get the same executable as the distributed 2.51. All the code sequences are equivalent, but the details of register usage and short JMP / long JMP are different. Very close, but no cigar.
