ARM Assembly is Too High Level: Moving by Shifting

The syntax for the immediate form of a Logical Shift Left is:
LSL{S}c Rd, Rm, #imm5

This takes the value stored in the source register Rm and shifts its bits left by the amount given in #imm5. The result is stored in the destination register Rd. Like many instructions on ARM, you can make it conditional with a condition code (c), and you can choose whether the flags get set with {S}.

The real question (for freaks like me) is what happens if you use #0 as the shift value. Doesn’t it just effectively move Rm into Rd? Wouldn’t LSL r0, r1, #0 do the same thing as MOV r0, r1?

To say that the answer is ‘yes’ should be obvious and also boring. The interesting thing is that ARM decides to make the machine encoding for LSL Rd, Rm, #0 equivalent to MOV Rd, Rm. Look at the encoding graphic near the bottom of the post for further context of this and the following info.
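To make that concrete, here is a rough Python sketch that packs the fields into a 32-bit A32 word. The function name encode_shift_imm and the bit positions are my own reading of the A32 layout (cond in bits 31:28, S in bit 20, Rd in 15:12, imm5 in 11:7, shift type in 6:5, Rm in 3:0), so treat it as an illustration rather than an assembler:

```python
# Hypothetical helper: packs an immediate-shift instruction into an A32 word.
# Bit positions are an assumption based on the A32 data-processing layout.
SHIFT_TYPES = {"LSL": 0b00, "LSR": 0b01}

def encode_shift_imm(op, rd, rm, imm5, cond=0b1110, s=0):
    """Return the 32-bit word for `op Rd, Rm, <raw imm5 field>`."""
    assert 0 <= rd <= 15 and 0 <= rm <= 15 and 0 <= imm5 <= 31
    word = cond << 28               # condition code (c); 0b1110 means "always"
    word |= 0b1101 << 21            # opcode shared by MOV and the immediate shifts
    word |= s << 20                 # the {S} flag-setting bit
    word |= rd << 12                # destination register Rd
    word |= imm5 << 7               # raw 5-bit shift amount field
    word |= SHIFT_TYPES[op] << 5    # shift type; bit 4 stays 0 in this form
    word |= rm                      # source register Rm
    return word

lsl_zero = encode_shift_imm("LSL", rd=0, rm=1, imm5=0)
print(hex(lsl_zero))  # 0xe1a00001, the same word as MOV r0, r1
```

With every shift field zeroed there is nothing left in the word to tell LSL #0 apart from a plain MOV.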

What about LSR? A shift right by 0 would effectively do the same thing as our MOV, but the machine encoding is off by one bit. Interestingly, LSL can have shifts from 0-31 (well, the manual says 1-31…), but LSR has shifts from 1-32, and there is no fucking way to force it to shift by 0. If you write the binary directly and set the imm5 field of this instruction to 00000, it actually ends up being LSR Rd, Rm, #32; that’s how they fit that 32 in there (as 11111 is still 31 in binary). You can look to the last image in this post for a comparison of how the source assembly is encoded.
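The imm5 rule is easier to see as a tiny decoder. This is a Python sketch of how I understand the field to be interpreted; decode_shift_amount is a made-up name and it only covers the two shifts discussed here:

```python
def decode_shift_amount(op, imm5):
    """Map the raw 5-bit field to the shift amount the instruction performs."""
    if not 0 <= imm5 <= 31:
        raise ValueError("imm5 is a 5-bit field")
    if op == "LSL":
        return imm5                      # 0-31, taken literally
    if op == "LSR":
        return 32 if imm5 == 0 else imm5  # 00000 is reused to mean 32
    raise ValueError("only LSL and LSR are handled in this sketch")

print(decode_shift_amount("LSR", 0b00000))  # 32
print(decode_shift_amount("LSR", 0b11111))  # 31
print(decode_shift_amount("LSL", 0b00000))  # 0
```

No input makes the LSR branch return 0, which is the whole problem.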

So the next question is what happens when we just write LSR Rd, Rm, #0? If what the ARM manual says about the 1-32 range is true, this instruction should be illegal (it kind of is). The ‘as’ assembler changes it to a MOV Rd, Rm, with the exact same machine encoding as LSL Rd, Rm, #0. See, the 3 bits between the imm5 and Rm fields should be ‘010’ for an LSR, but they are ‘000’ for the MOV and LSL. So LSR Rd, Rm, #0 gets encoded as an actual different instruction, even at the machine encoding level.
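If you want to poke at those 3 bits yourself, here is a short Python sketch that pulls bits 6:4 out of two instruction words. The two constants are values I assembled by hand for r0 and r1 with the always condition, not output copied from the listing in this post, and bits is just a helper name I picked:

```python
def bits(word, hi, lo):
    """Extract the inclusive bit range [hi:lo] from a 32-bit word."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)

# Hand-assembled example words (assumed values, r0 and r1, condition = always).
MOV_R0_R1 = 0xE1A00001      # also what LSL r0, r1, #0 encodes to
LSR_R0_R1_32 = 0xE1A00021   # a real LSR, with the imm5 field at 00000

print(format(bits(MOV_R0_R1, 6, 4), "03b"))     # 000
print(format(bits(LSR_R0_R1_32, 6, 4), "03b"))  # 010
print(bits(MOV_R0_R1, 11, 7), bits(LSR_R0_R1_32, 11, 7))  # 0 0
```

Both words carry the same all-zero imm5 field, so the only thing separating the MOV from an LSR by 32 is that single bit in the shift type.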

This is all interesting. It just goes to show that it is regular behavior for an assembler to just take your shitty assembly and do whatever it fucking feels like with it. This is, I should remind everyone, because assembly language is too high level.

For reference, here is a nice little graphic illustrating the machine encodings of each instruction discussed:

Also for reference, here are some source assembly instructions (cat’ed out) followed by a disassembly of the assembled program (using objdump):