Why we have to wait for Android on the Neo 1973

So after a day of much fun and hacking, I sadly blog here in the
face of defeat, thwarted by binary-only distribution and
non-forwards-compatible architectures. This tale of woe documents my
attempt to get the Android stack running on the FIC Neo 1973 phone.

This post describes what I did to try and get this working, and
ultimately why it isn't going to work until you get the source for the
stack. And if you find the story excruciating, just skip to the
conclusion.

The day started out promisingly. I took the diff of the kernel
I had earlier produced, and started to hack it down into something a
little more manageable. First I got rid of the patches that enabled Qemu, since I only
care about running this on the real hardware. Then I got rid of the
patches that enabled the goldfish platform. The goldfish platform is
the hardware platform that the Android SDK simulates. I don't need
that for running on the Neo, so gone! Next there was a whole big
patchset for enabling yaffs2. The
Openmoko kernel already has yaffs2 patched in there, so this just
causes confusion. Once all that is done, the final
patch is much more manageable; 8000 lines rather than 30000
lines.

So with that in place, I pulled down the 2.6.22.5 kernel.org kernel, and then applied the
Openmoko patchset with quilt. With that in place I applied my stripped
down diff. Now this was against 2.6.23, rather than 2.6.22 so there
was a bit of fuzz and a couple of failed hunks. (That sounds more
like gangster slang than hacking!) Anyway, after fixing up the patch,
I now have a patch that applies cleanly against the Openmoko kernel.
(Not that it will do you much good, as you are about to see.)

So I took the recommended default config for openmoko, and ran trusty
make oldconfig. This prompted for a couple of new options:

PANIC_TIMEOUT=0
CONFIG_BINDER=y
CONFIG_LOW_MEMORY_KILLER=y

I ended up removing the low memory killer option because it had
compile errors. I was going to go back and fix it, but in the end it
didn't really matter.

This was a lot of progress in my first hour or two of hacking. I then
proceeded to waste a whole bunch of time on stupid stuff. I'll save you
all the gory details, but highlights follow.

Firstly, just trying to get stuff running on the Neo 1973 proved a bit
of a challenge. I eventually found some known binaries, and worked out how to get them onto the phone.

The next challenge was actually just getting the kernel I had built
from source to work with the rootfs rather than just the binary kernel. This
was particularly difficult to actually debug because there were no
error messages or panics, just the kernel sitting at Freeing
init memory, and no more. As far as I could tell it was the
same kernel source, and I had used the default config. So after lots
of messing around (different compilers, with/without modules, etc), I
stumbled upon the fact that the user mode binary applications on the
rootfs image are all using the new EABI, as opposed to the old ABI. It
turns out that special kernel support needs to be enabled for EABI,
and this isn't in the default openmoko kernel config. (I
don't have a good reference for EABI vs. OABI. Linux devices has a story,
but the floating point stuff is really only one small part of the
differences.) Anyway, after enabling CONFIG_AEABI,
things started going a lot smoother. I really wish that more people
enabled the /proc/config.gz option; it would have made
life a lot easier.
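
Without /proc/config.gz, one thing you can still check is which ABI a user-space binary expects. As a minimal sketch (not something taken from the setup described here): the ELF header's e_flags field carries the ARM EABI version in its top byte, and a value of zero there indicates an old-ABI binary. The function name is made up for illustration.

```c
#include <stdint.h>

/* Top byte of the ELF header's e_flags holds the ARM EABI version. */
#define EF_ARM_EABI_MASK 0xFF000000u

/* Returns the EABI version encoded in e_flags; 0 means an old-ABI binary. */
unsigned arm_eabi_version(uint32_t e_flags)
{
    return (e_flags & EF_ARM_EABI_MASK) >> 24;
}
```

For example, an e_flags value of 0x05000002 decodes as EABI version 5, while 0x00000002 decodes as 0 (old ABI).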

So at this point I had a kernel with the Android patches loading
and running the standard OpenMoko distribution. Next step was to run a
different rootfs. Taking the filesystems
I had extracted earlier as well as the rootfs I had extracted (see
this post
for details), I combined these and used mkfs.jffs2 to build
an Android jffs2 filesystem. (For reference the full command is:
$ sudo mkfs.jffs2 -x lzo -r android-root-image -o android.jffs2 --eraseblock=0x4000 --pad -n -squash).
At this point I thought I was home free. How very wrong I was.

On booting this I was back at the dreaded Freeing init
memory, with no other output. Confused by this, I compiled
a very simple hello world program to see if this would work
as a replacement init (just for testing). This didn't work either.

With this failure I tried another tack. I would revert to my known
good, of the working openmoko rootfs, and install my hello program on
this rootfs just to test it. I didn't think I would have a problem
here. It turns out it failed to run. Luckily the openmoko rootfs
has gdb, which is great for fixing problems like this. Firing up
gdb soon led me to the real problem.

ARMv4 vs. ARMv5

So, it turns out that my hello binary (and all the Android
binaries) are compiled for an ARM926EJ-S chip. This is a problem
because the Neo 1973 has an ARM920T core. Now you would think that
ARM926 and ARM920 would be pretty close. But if you thought that you
would, unfortunately, be wrong, wrong, wrong! The ARM926EJ-S implements
the ARMv5TEJ instruction set, but the ARM920T implements the ARMv4T
instruction set. So what happens in my hello program is that we hit an
ARMv5 instruction, which is undefined in the earlier ARMv4 ISA, which
generates an undefined instruction trap to the kernel, and the kernel
responds by sending SIGILL to the running
process. Assuming that the program hasn't installed any special signal
handlers this will kill the process. And this is what was happening to
my hello program, and what I assumed was happening to
init as well. (Of course, assumptions make an ass out of
u and me, or in this case, mostly me.)

Now I really wasn't going to let a pesky little thing
such as the CPU not implementing the instructions stand in my way!
(Note: I could of course have compiled hello for ARMv4
architecture, but that isn't an option for the rest of the stack, and
I was only interested in getting hello running so I could get the rest
of the stack running). So, in an act of stupid defiance, I decided,
if the CPU can't implement the instruction, I'll do it myself.

Luckily the kernel provides a neat infrastructure for managing
undefined instructions, and even emulating them. So the first instruction
to emulate was the ARM clz instruction. This is the instruction
that counts the number of leading zero bits. The code below implements this.
The only other thing to do is ensure that this hook is
registered at startup using: register_undef_hook(&clz_hook);
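
What follows is a user-space sketch of this kind of emulation, not the kernel patch itself. It assumes a made-up struct cpu_regs in place of the kernel's struct pt_regs, assumes r[15] holds the address of the trapped instruction, and leaves out the condition-code check. In the kernel, the mask and value pair would go into the instr_mask and instr_val fields of the struct undef_hook that is registered.

```c
#include <stdint.h>

/* Hypothetical stand-in for the kernel's struct pt_regs: r0-r15 plus CPSR. */
struct cpu_regs {
    uint32_t r[16];
    uint32_t cpsr;
};

/* CLZ is encoded as: cond 0001 0110 1111 Rd 1111 0001 Rm */
#define CLZ_MASK  0x0FFF0FF0u
#define CLZ_VALUE 0x016F0F10u

/* Find last set: 1-based index of the highest set bit, 0 if x is 0. */
unsigned fls32(uint32_t x)
{
    unsigned n = 0;
    while (x != 0) {
        n++;
        x >>= 1;
    }
    return n;
}

/*
 * Emulate one trapped CLZ. Returns 0 if the instruction was handled and
 * 1 if it is not a CLZ, mirroring the undef hook return convention.
 */
int emulate_clz(struct cpu_regs *regs, uint32_t instr)
{
    if ((instr & CLZ_MASK) != CLZ_VALUE)
        return 1;

    unsigned rd = (instr >> 12) & 0xF;
    unsigned rm = instr & 0xF;

    regs->r[rd] = 32 - fls32(regs->r[rm]);  /* clz(0) is 32 */
    regs->r[15] += 4;                       /* step past the emulated instruction */
    return 0;
}
```

For example, clz r0, r1 is encoded as 0xE16F0F11; with r1 holding 1, the sketch leaves 31 in r0.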

One thing that may not be clear from the comments is that ARM
supports conditionally executed instructions. The top 4 bits
of the instruction are its condition field. Depending on the condition
field, and the value of the N, Z,
C and V flags (which are stored in
the CPSR register), the instruction may or may not be executed. This
is used to avoid having to branch for all if statements and
the associated problems... but you didn't come here for an introduction
to computer architecture. To correctly implement this, some code is needed,
and I clag it here for posterity.
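
A minimal sketch of such a check, assuming the standard CPSR layout (N, Z, C and V in bits 31 down to 28). The function name is hypothetical, and this is an illustration rather than the code used in the patch.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPSR_N (1u << 31)
#define CPSR_Z (1u << 30)
#define CPSR_C (1u << 29)
#define CPSR_V (1u << 28)

/* Returns true if an ARM instruction's condition field passes for this CPSR. */
bool condition_passed(uint32_t instr, uint32_t cpsr)
{
    bool n = (cpsr & CPSR_N) != 0;
    bool z = (cpsr & CPSR_Z) != 0;
    bool c = (cpsr & CPSR_C) != 0;
    bool v = (cpsr & CPSR_V) != 0;

    switch (instr >> 28) {
    case 0x0: return z;              /* EQ */
    case 0x1: return !z;             /* NE */
    case 0x2: return c;              /* CS */
    case 0x3: return !c;             /* CC */
    case 0x4: return n;              /* MI */
    case 0x5: return !n;             /* PL */
    case 0x6: return v;              /* VS */
    case 0x7: return !v;             /* VC */
    case 0x8: return c && !z;        /* HI */
    case 0x9: return !c || z;        /* LS */
    case 0xA: return n == v;         /* GE */
    case 0xB: return n != v;         /* LT */
    case 0xC: return !z && n == v;   /* GT */
    case 0xD: return z || n != v;    /* LE */
    case 0xE: return true;           /* AL */
    default:  return false;          /* 0xF is not a condition; the caller
                                        must decode that space itself */
    }
}
```

An emulation routine would call this first and, if it returns false, simply skip the instruction by advancing the pc.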

OK, one down. That wasn't so hard. The next one gets a little bit
trickier. The compiler will use the BLX instruction if it
is available. This is the Branch, Link and Exchange instruction.
There are two versions of the instruction, and at this stage we only really
care about version 2. In this version the address to branch to is stored in
a register, and a flag indicates whether or not an exchange
is required. (You can ignore exchange for now, more about that later.).
This instruction is a little bit more effort to implement, but it is not too hard:
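
As a rough user-space sketch (not the kernel patch itself): it again assumes a made-up struct cpu_regs, assumes r[15] holds the address of the trapped instruction, and skips the condition-code check.

```c
#include <stdint.h>

#define CPSR_T (1u << 5)   /* Thumb state bit */

/* Hypothetical stand-in for the kernel's struct pt_regs. */
struct cpu_regs {
    uint32_t r[16];
    uint32_t cpsr;
};

/* BLX (2): cond 0001 0010 1111 1111 1111 0011 Rm */
#define BLX_REG_MASK  0x0FFFFFF0u
#define BLX_REG_VALUE 0x012FFF30u

/* Returns 0 if the instruction was emulated, 1 if it is not a BLX (2). */
int emulate_blx_reg(struct cpu_regs *regs, uint32_t instr)
{
    if ((instr & BLX_REG_MASK) != BLX_REG_VALUE)
        return 1;

    /* Read the target before touching lr, since Rm may be lr itself. */
    uint32_t target = regs->r[instr & 0xF];

    regs->r[14] = regs->r[15] + 4;      /* return to the next instruction */

    if (target & 1u)
        regs->cpsr |= CPSR_T;           /* exchange: enter Thumb state */
    else
        regs->cpsr &= ~CPSR_T;          /* stay in ARM state */

    regs->r[15] = target & ~1u;
    return 0;
}
```

For example, blx r3 is encoded as 0xE12FFF33; with r3 holding 0x9001 the sketch branches to 0x9000 with the Thumb bit set.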

After this, success! Hello world ran correctly. Of
course this emulation isn't going to be particularly fast, but it is
still infinitely faster than not running at all. (Well, OK, not
really, divide by zero is undefined, not infinite.) At this point we
were feeling pretty good about ourselves. I must also
acknowledge Carl and Matt for their
assistance with this.

Thumb interworking

So now I really thought I was home free, but wrong once again. (A
pattern emerging maybe?) So first a bit of a primer on ARM's Thumb
mode (so punny!). ARM has two different instruction sets, the
ARM instruction set, and the Thumb
instruction set. The Thumb instruction set is a 16-bit instruction
set, which has a higher code density than the ARM instruction set.
Now the neat thing about this is that you can actually combine both
ARM and Thumb instructions in the same program. So if your
compiler is smart, it should be able to use both instruction sets for
optimisation. The CPU knows whether code is executing in ARM or Thumb
mode by a bit in the CPSR register. When the bit is set the
instruction stream is assumed to be 16-bit Thumb instructions. Now if
you are running in ARM mode, and want to enter Thumb mode, you need to
do an exchange operation, which is part of the
bx and blx instructions. Now it turns out
that Android is compiled with Thumb mode, so this means it uses
blx to switch from ARM to Thumb mode. So at this stage
I ended up needing to implement the blx (version 1)
instruction. This is shown below:
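
A user-space sketch of the idea, under the same assumptions as before (a made-up struct cpu_regs, with r[15] holding the address of the trapped instruction). BLX (1) is always unconditional and always switches to Thumb state; its target is relative to the instruction address plus 8, with an extra H bit supplying the halfword part of the offset.

```c
#include <stdint.h>

#define CPSR_T (1u << 5)   /* Thumb state bit */

/* Hypothetical stand-in for the kernel's struct pt_regs. */
struct cpu_regs {
    uint32_t r[16];
    uint32_t cpsr;
};

/* BLX (1): 1111 101H followed by a signed 24-bit word offset */
#define BLX_IMM_MASK  0xFE000000u
#define BLX_IMM_VALUE 0xFA000000u

/* Returns 0 if the instruction was emulated, 1 if it is not a BLX (1). */
int emulate_blx_imm(struct cpu_regs *regs, uint32_t instr)
{
    if ((instr & BLX_IMM_MASK) != BLX_IMM_VALUE)
        return 1;

    uint32_t imm24 = instr & 0x00FFFFFFu;
    uint32_t offset = imm24 << 2;           /* word offset to byte offset */
    if (imm24 & 0x00800000u)
        offset |= 0xFC000000u;              /* sign-extend to 32 bits */
    offset |= ((instr >> 24) & 1u) << 1;    /* H bit selects the halfword */

    regs->r[14] = regs->r[15] + 4;          /* return to the next instruction */
    regs->r[15] = regs->r[15] + 8 + offset; /* unsigned wrap handles negatives */
    regs->cpsr |= CPSR_T;                   /* always lands in Thumb state */
    return 0;
}
```

For example, 0xFA000002 trapped at address 0x8000 branches to 0x8010 in Thumb state and leaves 0x8004 in lr.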

Now, we get a bit further. But still no go. It turns out that Thumb also has
a new BLX instruction in V5. So, we have to go through and emulate
this instruction for Thumb as well. Below is the code for that.
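
A sketch of how the immediate form might be emulated, with several assumptions stated up front. Thumb BL and BLX are each a pair of halfwords: the first halfword is shared, is valid on ARMv4T, and leaves a partial target in lr. Only the BLX second halfword (top five bits 11101) is new in ARMv5T, so that is the one assumed to trap here. The struct, the function name and the exact tag value are illustrative; tagging the return address with the top bit follows the description in this post.

```c
#include <stdint.h>

#define CPSR_T        (1u << 5)    /* Thumb state bit */
#define LR_THUMB_BIT  1u           /* low bit marks a return into Thumb code */
#define LR_TOP_BIT    (1u << 31)   /* extra tag described in the post */

/* Hypothetical stand-in for the kernel's struct pt_regs. */
struct cpu_regs {
    uint32_t r[16];
    uint32_t cpsr;
};

/*
 * Emulate the second halfword of a Thumb BLX. r[15] is assumed to hold
 * the address of that halfword, and lr the value left by the first
 * halfword. Returns 0 if handled, 1 if this is not a BLX suffix.
 */
int emulate_thumb_blx_suffix(struct cpu_regs *regs, uint16_t instr)
{
    if ((instr & 0xF800u) != 0xE800u)
        return 1;

    uint32_t offset = (uint32_t)(instr & 0x07FFu) << 1;
    uint32_t target = (regs->r[14] + offset) & ~3u;  /* ARM code is word aligned */

    /* Return address is the halfword after the suffix, tagged twice. */
    regs->r[14] = (regs->r[15] + 2) | LR_THUMB_BIT | LR_TOP_BIT;
    regs->cpsr &= ~CPSR_T;                           /* switch to ARM state */
    regs->r[15] = target;
    return 0;
}
```

The ordinary BL second halfword (top five bits 11111) exists on ARMv4T and never reaches this path, which is why the sketch rejects it.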

Now if you are still with me, and actually read the code, you might
recognise some pretty interesting code. Spot it? No? OK, so the
problem is the way in which ARM code returns to Thumb mode. The
blx instruction updates the link register with the return
address. In Thumb mode it also sets the lowest bit. This ensures that
when bx is called from ARM mode it will jump back into
Thumb mode. It turns out that having to use bx to return
from functions is a bit of a pain, so in ARMv5, the architecture was
updated so that if you popped values from the stack into the
pc register, the CPU would also check the low bit and
switch to Thumb mode if required. Unfortunately ARMv4 doesn't do this.
Rather than checking the lower bit, it simply ignores it and masks
it off, which means you jump back to the return address but remain
in ARM mode, so you end up executing 16-bit instructions as though
they were 32-bit instructions. It may not surprise you to learn that
this generally doesn't work so well.
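
The difference can be modelled in a few lines. This is only a sketch of the architectural behaviour for an ARM-state load into pc (for example a pop that includes pc); the type and function names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

enum arm_arch { ARCH_V4T, ARCH_V5T };

/* Minimal model of the state that matters here. */
struct cpu_state {
    uint32_t pc;
    bool thumb;   /* mirrors the Thumb bit in the CPSR */
};

/* Model an ARM-state load of 'value' into pc on the given architecture. */
void arm_load_pc(struct cpu_state *cpu, enum arm_arch arch, uint32_t value)
{
    if (arch == ARCH_V5T) {
        cpu->thumb = (value & 1u) != 0;  /* low bit selects the new state */
        cpu->pc = value & ~1u;
    } else {
        cpu->pc = value & ~3u;           /* low bits dropped, state unchanged */
    }
}
```

Loading the Thumb return address 0x8005 gives pc 0x8004 in both cases, but only the ARMv5T model ends up in Thumb state.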

Which gets us to the truly evil code found above. As well
as setting the low bit, we also go and set the top bit of the LR.
When the ARM code returns from the function, rather than going
to the correct location, it ends up at an unmapped location, which
causes a prefetch abort. The prefetch abort handler was then
updated to handle this error case.

Now, at this stage, we have something pretty hacked up, but all
these hacks are pretty solid. Unfortunately it still doesn't work. We
have successfully ensured that ARM code returns correctly when called
from Thumb mode, what we have failed to do is ensure that Thumb code
returns correctly to ARM code. In ARMv4, this is only possible through
the bx instruction, which correctly sets the Thumb bit
in the CPSR. On ARMv5, the pop instruction
was extended to also correctly update the Thumb bit. But we aren't on
an ARMv5, so the bit is simply ignored. Which means we get stuck in Thumb
mode and can't correctly return to ARM code.

The prefetch abort trick works to an extent the other way as well, i.e.
for getting from Thumb back into ARM, but it relies on the ARM code
using the blx instruction. Unfortunately this isn't always
the case, and it is perfectly reasonable for code to use a bl
followed by a bx. As neither of these traps, it is not possible to
put our magic fake value into the LR register.

The only other option left at this stage is some kind of code
scanning technique. In this we scan the object code looking for the
unsafe pop instructions, and replace them with undefined
instructions so that we can safely emulate them with the ARMv5
behaviour. Unfortunately ARM makes this approach basically impossible.
It is not possible to tell if any block of code is Thumb or ARM
instructions. More importantly, it is impossible to determine if a
random word in the text segment is actually an instruction, or is in
fact a literal value. Simply scanning for pop could
actually modify some constants, which would lead to potentially subtle
bugs. If ARM had separate execute and read permissions we could use
the MMU to distinguish between code and data, but unfortunately the ARM
MMU can't really do this. Which means that this approach is basically a
no-go, at least not without some pretty nasty
heuristics, or some really awesome static analysis. Of course we could
just emulate every instruction, but this isn't exactly appealing to me.
(And the performance would really suck!)
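
To make the ambiguity concrete, here is a sketch of what a naive scanner would match on. The function names are hypothetical. The point is that the tests only look at bit patterns, so they cannot tell an instruction from a literal that happens to have the same bits, or ARM code from Thumb code.

```c
#include <stdbool.h>
#include <stdint.h>

/* ARM: ldmia sp!, {..., pc} is cond 1000 1011 1101 with reglist bit 15 set. */
bool looks_like_arm_pop_pc(uint32_t word)
{
    return (word & 0x0FFF8000u) == 0x08BD8000u;
}

/* Thumb: pop {..., pc} is 1011 1101 followed by an 8-bit register list. */
bool looks_like_thumb_pop_pc(uint16_t halfword)
{
    return (halfword & 0xFF00u) == 0xBD00u;
}
```

The word 0xE8BD8010 is pop {r4, pc} when executed as ARM code, but a scanner would match it just the same if it were a constant sitting in a literal pool. Likewise any 32-bit value whose low halfword is 0xBD10 matches the Thumb test, whether or not it is Thumb code.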

Conclusion

In summary, Android is compiled for ARMv5, and the Neo 1973 is ARMv4. These
instruction sets are not compatible. Therefore Android will not run on the
Neo 1973. Solutions to this problem would be either:

FIC releasing a version of the Neo based around an ARM926 core.

Google compiling for ARMv4 and making that available.

Google releasing the source and someone else compiling for ARMv4.

My guess is none of those three things is going to happen any time
soon (although I'll be really happy to be disproved!), so it is better
to focus on trying to get this running on an actual ARMv5 based
chipset (e.g. PXA270, i.MX21).

Finally, thanks to Jaq, Carl, David and Matt
for providing inspiration and advice.

Update: Thanks to andrzej for spotting the bug in my
clz() emulation. It should of course be 32 - fls(), not fls(). This is now
updated.