Small code

A project that I’m working on at the moment calls for a
very small footprint. This post is about how to make code
really small for the ARM architecture.

As you can probably guess, I’m interested in operating system
code, so as an example, I’ve taken a very simple piece of operating
system code, and stripped it right back to demonstrate some of the
techniques to use when optimising for space. So here is the snippet
of code we are going to optimise:

In this very simple operating system we have threads (thread data
is stored in struct thread). Each thread has a set of 32
signals (encoded in a single word notify_bits). This poll
is used by a thread to determine if it has been sent any signals. The
mask parameter is the set of signals that the thread is
interested in checking. So, a thread can check if a single signal has
been thread, or if any signal has been set, or if a specific subset of
signals has been set. The function returns the signals that are
available (which is simply the bit-wise and of
notify_bits and mask). It is important that
the function clears any signals that have been returned. This makes
sure that if poll is called twice the same signals are
not returned. This is achieved in lines 10—12.

So, our goal is to try and get this code as small as possible.
And every byte counts! First off, we just try and compile
this with the standard ARM gcc compiler. I’m using version 4.2.2.
So we start with: $ arm-elf-gcc -c poll.c -o poll.o.
We can then use object dump to work out what the compiler did:
$ arm-elf-objdump -dl poll.o.

So, our first go gets us 100 bytes of code. Which is 10 bytes
(or 2.5 instructions) for each of our lines of code. We should be able
to do better. Well, the first thing is we should try is to use
optimisation: $ arm-elf-gcc -c poll.c -o poll.o -O2. This
gives us a much better code generation output:

So this got us down to 28 bytes. A factor 4 improvement for one
compiler flag, not bad. Now, -O2 does some standard
omptimisations, but -Os, will do optimisations
specifically for reducing the amount of code. So trying:
$ arm-elf-gcc -c poll.c -o poll.o -Os. This gives
a little bit better code-gen:

Down to 24 bytes (6 instructions), is pretty good. Now, as you can
see the generated code has 32-bits per instruction. The some of the
ARM architectures have two distinct instruction sets,
ARM and Thumb. The Thumb instruction
set uses 16-bit per instruction, instead of 32-bit. This denser
instruction set can enable much smaller code sizes. Of course there is
a trade-off here. The functionality of the 16-bit instructions is
going to be less than the 32-bit instructions. But lets give it a
try. At the same time, we will tell the compiler the exact CPU we want
to compile for (which is the ARM7TDMI-S) in our case. The compiler
line is: $ arm-elf-gcc -c poll.c -o poll.o -Os -mcpu=arm7tdmi
-mthumb.
Which produces code like:

So, now we are down to 16 bytes, so in Thumb we need 8 instructions
(2 more than ARM), but each is only 2 bytes, not 4, so we end up with
a 1/3 improvement. To get any further, we need to start looking at our
code again, and see if there are ways of improving the code. Looking
at the code again:

You may notice that the branch instruction on line 5, you may
notice that this is actually redundant. If result is
zero, then ~result well be 0xffffffff. Given
this thread->notify_bits &= 0xffffffff will not change
the value of thread->notify_bits. So, we can reduce this
to:

This gets us down to 6 instructions (12 bytes). Pretty good since
we started at 100 bytes. Now lets look at the object code in a little
bit more detail. If you look at address 0x8, the instruction simply
moves register r1 into register r0 so that
it in the right place for return. (Note: The ARM ABI has the
return value stored in r0). This seems like a bit of a waste, it
would be good if there was a way we could have the value stored
in r0 and not waste an instruction just moving values
between registers. Now, to get a better understanding of what the
instructions are doing, I’m going to slightly rewrite the code, and
then compile with debugging, so we can see how the generated code
matches up with the source code. So first, lets rewrite the code a little
bit:

You should convince yourself that this code is equivalent to the
existing code. (The code generated is identical). So, we can now line
up the source code with the generated code. On line 03 (unsigned
int tmp_notify_bits = thread->notify_bits), this matches up
with address 0x0 (ldr r3, [r0, #0]). So, register
r3 is used to store variable
tmp_notify_bits. The parameter thread is stored in
register r0. Now, line 04 (mask &=
tmp_notify_bits) matches directly with address 0x2 (ands
r1, r3). Register r1 matches directly with the
mask parameter. The important part of restructuring the
code as we have done, is that it becomes obvious that we can directly
using the mask parameter instead of needing an extra variable like in
the previous code. As we continue line 05 (tmp_notify_bits &=
~mask), matches directly to 0x4 (bics r3, r1). The bics
instruction is quite neat in that it can essentially do the &=
~ in one 16-bit instruction. Line 06
(thread->notify_bits = tmp_notify_bits) stores the
result back to memory, matches directly to 0x6. Now the final line of
code (return mask;), needs two instructions 0x8 and 0xa
(adds r0, r1, #0 and bx lr). Now the reason we need to instructions
is because mask is stored in register r1, and the return
value needs to be in register r0. So how can we get mask
to be stored in r0 all along? Well, if we switch the
parameters around poll(unsigned int mask, struct thread
*thread), the mask will instead be stored in
r0 instead of r1. (Note: We don’t
necessarily have to change the interface of the function. If we want
to support keep the interface we can use a simple macro to textually
swap the parameters.) If we compile this, we get the following
generated code: