Assembly is Too High-Level: Signed Displacements

For those who don’t know about unsigned and signed data types, it’s not all that complicated. One byte can hold a total of 256 possible values. If those values are all non-negative, we get a range of 0 through 255. But what if we want negative numbers? The range is split: we now have -128 through 127, which, counting zero, is still 256 possible values. The data is stored in two’s complement format: https://en.wikipedia.org/wiki/Two’s_complement

With a 32-bit register, we have four bytes, giving us 0 through 4,294,967,295 for unsigned values and -2,147,483,648 through 2,147,483,647 for signed values.
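As a quick sanity check on those ranges, here is a minimal Python sketch. The helper names `unsigned_range` and `signed_range` are made up for this illustration and do not come from any library.

```python
def unsigned_range(bits):
    """Smallest and largest value of an unsigned integer with `bits` bits."""
    return 0, (1 << bits) - 1


def signed_range(bits):
    """Smallest and largest value of a two's complement integer with `bits` bits."""
    return -(1 << (bits - 1)), (1 << (bits - 1)) - 1


# Print the ranges for one byte and for a 32-bit register.
for bits in (8, 32):
    print(bits, "unsigned:", unsigned_range(bits), "signed:", signed_range(bits))
```

Note that the signed range always reaches one value further on the negative side than on the positive side.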

When an instruction takes a memory location as an operand, we are allowed to use literal displacements. For example, we can take an address/pointer stored in EAX and add 128 to the address (not to the value stored there). This 128 is called a displacement, or offset. The displacement can be up to 32 bits, but if it is small enough, it can be encoded as 8 bits.

One thing to note, however, is that this displacement is a signed value. At least for me, the documentation of this fact was subtle. And in practice, if you operate at the level of abstraction that assembly language provides and don’t know this subtlety, it is possible to run into some unfortunate bugs.

A good example is the two following lines of assembly language:
lea ebx, [eax - 1337]
lea ebx, [eax + 4294965959]

As it turns out, both of these instructions assemble to identical machine code. Recall the range of a 32-bit signed value from above: it stops at around 2 billion and doesn’t go all the way up to the 4 billion used in the second lea instruction. That 4 billion value would be valid as an unsigned 32-bit number, however. NASM doesn’t appear to warn us; it just encodes the positive 4 billion number as if it were unsigned. But when the instruction actually executes, the displacement is treated as a signed number (which turns out to be -1337).
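To see why the two displacements end up as the same bytes, here is a small Python sketch using the standard `struct` module. It only packs the two numbers as 32-bit little-endian values; it is not how NASM works internally, and the variable names are my own.

```python
import struct

# Pack -1337 as a signed 32-bit value and 4294965959 as an unsigned one,
# both in little-endian byte order (the order x86 uses).
signed_bytes = struct.pack("<i", -1337)
unsigned_bytes = struct.pack("<I", 4294965959)

print(signed_bytes.hex())
print(unsigned_bytes.hex())
print(signed_bytes == unsigned_bytes)
```

Both calls produce the same four bytes, so nothing in the encoded displacement records whether the source was written with a minus sign or with the large positive number.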

Exhibit A:

We see that the 2nd line of the disassembly shows eax-1337, even though the machine code holds the value 0xfffffac7 (two’s complement, stored little endian). My assembly source actually used lea ebx, [eax + 4294965959] to generate that line. Even though my source added a large number, and the machine code appears to hold a large number (if it weren’t read as two’s complement), the disassembled version in edb (which is not how my original source was written) is what is actually executed. We know this because 0x80000000 - 1337 is 0x7ffffac7 (the result that gets stored into ebx, as shown in the screenshot).
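The arithmetic can be reproduced with a short Python sketch. It assumes eax held 0x80000000, which is what the subtraction above implies, and it models 32-bit address arithmetic by masking to the low 32 bits. `effective_address` is a hypothetical helper written for this illustration.

```python
MASK32 = 0xFFFFFFFF


def effective_address(base, disp32):
    """Add a displacement to a base address, wrapping around at 32 bits."""
    return (base + disp32) & MASK32


eax = 0x80000000  # assumed register value, inferred from the arithmetic in the text

# The large "unsigned" displacement and -1337 land on the same address.
print(hex(effective_address(eax, 4294965959)))
print(hex(effective_address(eax, -1337)))
```

Because the sum wraps around at 32 bits, adding 4294965959 and subtracting 1337 are indistinguishable in the result.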

Here’s another thing to think about that can get glossed over when not paying attention to the details. In 32 bits, negative 1337 is represented as 0xfffffac7 in two’s complement. If this value were interpreted as unsigned, its value would be 4,294,965,959. This is where I got the value used in my examples above.

Does this same logic work in 8 bits? Say we use negative 100. This is 0x9c in two’s complement, and as an unsigned value 0x9c is 156. So to review, these two instructions are the same:
lea ebx, [eax - 1337]
lea ebx, [eax + 4294965959]

Does that mean that these two are the same too?
lea ebx, [eax - 100]
lea ebx, [eax + 156]

The answer is no. Note the below screenshot showing the difference between adding 127 and 156:

As a signed number, 156 is too big to fit into 8 bits, so NASM promotes it to 32 bits instead, and it really does get interpreted as positive. The 4 billion positive number, on the other hand, cannot be promoted to anything wider than 32 bits, as there is no machine encoding for that (even on a 64-bit architecture; such a displacement is not part of the ModR/M encoding).
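The size decision described here can be sketched in a few lines of Python. This is a simplified model of the behaviour observed above, not NASM’s actual implementation; `displacement_size` is a hypothetical helper, and it assumes the displacement has already been reduced to a value in the signed 32-bit range.

```python
def displacement_size(disp):
    """Return the number of displacement bytes: 1 if `disp` fits in a
    signed 8-bit value, otherwise 4."""
    return 1 if -128 <= disp <= 127 else 4


# 127 and -100 fit into one signed byte; 156 does not.
for disp in (127, 156, -100):
    print(disp, "->", displacement_size(disp), "byte(s)")
```

So -100 and 156 share the byte pattern 0x9c only if 156 is squeezed into 8 bits, and it never is: as a signed value it needs the wider encoding.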

What about this instruction?
lea ebx, [eax + 4294967292]
Below is the same instruction represented in hex instead of decimal:
lea ebx, [eax + 0xfffffffc]

The resulting machine code is 8d58fc. Where’d all the f’s go? They are leading f’s, just as there is such a thing as leading zeros: 00003 is the same thing as 3. In two’s complement, the high f’s of a negative number are leading f’s. This number represents -4, which is small enough to fit into 1 byte, so all the leading ff bytes can be dropped.

And on the other hand, the below assembly instruction:

lea ebx, [eax + 0xfc]

has the machine code 8d98fc000000; a bunch of leading zeros (they look like trailing zeros because of little endian) get added to the machine code. NASM must resolve the ambiguity of 0xfc: does the author mean ‘+ 252’, or do they mean a signed value of ‘-4’? NASM chooses to interpret 0xfc as positive, but if it were to put this value into a single signed byte, it would actually represent negative 4, so the value has to be placed into 4 bytes.
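To tie the last two examples together, here is a rough Python model that reproduces both byte sequences. It is a sketch under assumptions: it handles only lea ebx, [eax + disp], the function names are invented for this illustration, and it imitates the behaviour observed in this post rather than NASM’s real code. It also does not model the shorter no-displacement form an assembler could use for a displacement of zero.

```python
def to_signed32(value):
    """Reinterpret the low 32 bits of `value` as a two's complement number."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value & 0x80000000 else value


def encode_lea_ebx_eax(disp):
    """Encode `lea ebx, [eax + disp]`, using a single displacement byte
    whenever the signed 32-bit value fits into a signed 8-bit one."""
    disp = to_signed32(disp)
    opcode = 0x8D  # lea r32, m
    if -128 <= disp <= 127:
        modrm = 0x58  # mod=01 (8-bit displacement), reg=011 (ebx), rm=000 (eax)
        tail = disp.to_bytes(1, "little", signed=True)
    else:
        modrm = 0x98  # mod=10 (32-bit displacement), reg=011 (ebx), rm=000 (eax)
        tail = disp.to_bytes(4, "little", signed=True)
    return bytes([opcode, modrm]) + tail


print(encode_lea_ebx_eax(0xFFFFFFFC).hex())  # 8d58fc
print(encode_lea_ebx_eax(0xFC).hex())        # 8d98fc000000
```

The key step is `to_signed32`: 0xfffffffc becomes -4 before the size check and therefore fits in one byte, while 0xfc stays at +252 and needs four.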