November 7, 2012

So, I started reading [The Definitive Guide to the Xen Hypervisor] (again :P), and I thought it would be fun to start with the example guest kernel, provided by the author, and extend it a bit (ye, there’s mini-os already in extras/, but I wanted to struggle with all the peculiarities of extended inline asm, x86_64 asm, linker scripts, C macros etc, myself :P).

After doing some reading about x86_64 asm, I ‘ported’ the example kernel to 64bit, and gave it a try. And of course, it crashed. While I was responsible for the first couple of crashes (for which btw, I can write at least 2-3 additional blog posts :P), I got stuck with this error:

when trying to boot the example kernel as a domU (under xen-unstable).

0x2000 is the address where XEN maps the hypercall page inside the domU’s address space. The guest crashed when trying to issue any hypercall (HYPERCALL_console_io in this case). At first, I thought I had screwed up with the x86_64 extended inline asm, used to perform the hypercall, so I checked how the hypercall macros were implemented both in the Linux kernel (wow btw, it’s pretty scary), and in the mini-os kernel. But, I got the same crash with both of them.

After some more debugging, I made it work. In my Makefile, I used gcc to link all of the object files into the guest kernel. When I switched to ld, it worked. Apparently, when using gcc to link object files, it calls the linker with a lot of options you might not want. Invoking gcc using the -v option will reveal that gcc calls collect2 (a wrapper around the linker), which then calls ld with various options (certainly not only the ones I was passing to my ‘linker’). One of them was –build-id, which generates a .note.gnu.build-id” ELF note section in the output file, which contains some hash to identify the linked file.

Apparently, this note changes the layout of the resulting ELF file, and ‘shifts’ the .text section to 0x30 from 0x0, and hypercall_page ends up at 0x2030 instead of 0x2000. Thus, when I ‘called’ into the hypercall page, I ended up at some arbitrary location instead of the start of the specific hypercall handler I was going for. But it took me quite some time of debugging before I did an objdump -dS [kernel] (and objdump -x [kernel]), and found out what was going on.

The code from bootstrap.x86_64.S looks like this (notice the .org 0x2000 before the hypercall_page global symbol):

One solution, mentioned earlier, is to switch to ld (which probalby makes more sense), instead of using gcc. The other solution is to tweak the ELF file layout, through the linker script (actually this is pretty much what the Linux kernel does, to work around this):