After my previous post about abandoning the JIT case in favor of a quick static compiler with a distributed type system, I spent some time in December adapting the TCC codebase to run on Viridis infrastructure. I was finishing up with that, but hadn’t gotten to testing when the holidays hit, and then I left IBM at the end of January to start at AMD and so I sort of avoided freetime programming with the knowledge that soon I’d be knee deep in establishing a new project.

The good news is that I’m paid to hack on x86 now instead of PowerPC and, once again, I owe a new job to experience gained with Viridis. In a lot of ways I’m far more familiar with x86 anyway. I was paid to write PowerPC assembly on occasion, and I was intimately familiar with the guts of the architecture thanks to being conversant in Linux /arch/powerpc and the programmer’s manual, but that’s quite different than actually sitting down and writing that support from scratch like I’ve done on x86. I hit the ground running at AMD, and firing up QEMU to debug on x86 makes me think of Viridis every time.

Anyway, the bottom line is that I expect to start hacking again soon, even though it’s going to be a while before code actually hits the public repo. My next step is to learn how TCC generates assembly in memory and how I’m going to link that assembly to the kernel and potentially other libraries/programs. My immediate goal is to be able to compile some C source into a program that can use kernel syscalls in the Linux fashion. After that, I can begin the more experimental stage of development, modifying TCC to have a “user mode” setting that accepts input in a C dialect without explicit memory pointers or unrestricted type casting, including giving consideration to the structure of the program fragments and how they’ll be scheduled below the task level.

A week ago I detailed Vua, a memory safe language that unprivileged Viridis programs have to use in order to exist within a single memory context.

I also mentioned I have a decent amount of preliminary work done generating a bytecode for this language, and it seems like the next logical step is to interpret that bytecode and then, likely, JIT it. This is a pretty standard path, and JITs are usually good compromises between loosely typed dynamic languages and the hard earned performance of tuned assembly.

However, now that I’ve converted modified Lua syntax into modified LuaJIT bytecode to a first approximation, and I’m pondering writing a VM to execute this bytecode, I’m wondering if it’s not a better idea to just skip writing the VM and make some simple modifications to the bytecode to serve as an intermediate representation of the program rather than an (easily) executable artifact itself. This way, limitations on the bytecode could be lifted (like the set instruction size and the need to encode constants in 8 bit fields) and Vua can still have a version of LuaJIT’s super fast parser, but instead of writing a VM, I could just write an architecture specific backend to the compiler to convert this intermediate bytecode into assembly.

My reasoning here is that, the more I think about it, the less I think a VM is really buying us much. In a scripting environment, where Lua and JITs in general find a lot of traction, a lot of code is short running and likely to execute once and only once, so skipping the expensive compilation stage actually improves performance, and then JITing handles the potentially longer running, more intensive parts of the code.

Well, short running doesn’t describe much of our usecase. In fact pretty much everything at the system level is going to be running the entire time the machine is up, so even “long running” sounds like an understatement compared to “always running”. In that context, it seems likely to me that the overhead of AOT (ahead of time) compiling everything is going to be recouped over the running life of the system, especially if we have a kernel controlled cache of true binaries on disk.

Another advantage of the VM is the debuggability. The VM knows a lot about the code it’s running and when something breaks it’s usually the VM that handles it. It knows the exact bytecode instruction, the exact error, the line in the source, where on a traditional binary the kernel can only be so specific (e.g. this program caused a hardware exception, or misused a kernel interface). The VM also provides a body of common code in software, which conveys certain advantages to debugging (e.g. giving the ability to stop whenever a given table is read or written without having to identify every single place in the code a read or write is done).

But Viridis is free from the problem of accuracy because of the 100% opt-in to Vua. The “virtual machine” for Vua could actually be the physical machine. Viridis can know the exact instruction, exact error because it also knows where the source is and compiled it to assembly itself. As for common code, I’m willing to give up this ease in favor of GDB style debugging, or simply debug recompilation, especially since I bet that this common code is actually bad for performance. Yes, it’s more likely to be in cache, but similar to the highly optimized instruction dispatch code in LuaJIT’s VM, we likely gain more by having far fewer and more predictable branches than we do by preventing cache misses.

I also like the idea of static compilation because it gives us a freer hand with assembly, and particularly register usage. The LuaJIT VM still obeys the platform’s C ABI because it’s designed to cooperate with arbitrary C code. We don’t have that restriction (there is no arbitrary C code, just kernel that we can warp however we want), but even without it, the most efficient use of registers in a VM is to pin certain info and locations into known registers and then use them consistently in specific instruction handlers. For example, the LuaJIT VM always has a register with the current instruction in it. Obviously that makes sense when every handler is likely to need information from the instruction to complete its work, but static compilation doesn’t need to care about that. Same with having a register that always points to the constant lookup table, etc. etc. During execution, VM “registers” are almost always just stack locations because the VM’s handlers aren’t flexible enough to accommodate using more than the handful of registers it’s defined to be consistent.

Which isn’t to say that the VM handlers are poorly written, it’s just that adding comparisons to deal with hardware registers and stack “registers” in the same code obviates any performance gain from using the hardware registers in the first place. You could specialize the bytecode, but then your bytecode isn’t portable and you double the amount of code required to deal with pretty much every instruction.

Anyway, tight register usage is fine when you’re trying to keep to yourself and co-exist with C… but with Viridis, if the VM never uses a register, say R11, it just doesn’t ever get used outside of the kernel and that’s obviously not acceptable.

Aside from the VM, the benefits of JIT, like being able to use runtime feedback optimization, aren’t off the table with static compilation. Meanwhile other interesting, if heretical, avenues of optimization open up. Like what if you could do extreme inlining across the program->library->kernel barriers? I’m interested in creating a system where there is no set ABI, except for interacting with the kernel. Of course that’s all future work.

The only real problem I see immediately is that static compilation makes a loose type system quite a pain. If a function can be called with multiple types, or getting a value from a table can yield multiple types, that’s hard to rectify in pure assembly – which is part of why successful JITs carefully select pieces of code to compile. I need to do more research into how this should be dealt with, but I’m confident that a combination of guards and repetitive compilation can be used to manage dynamic typing on the fly.

This started as a July status, but I kept wanting to dig just a little deeper to give a more detailed update.

As I mentioned in the June status, the core idea of Viridis is to create a zero copy operating system with a finer grain unit of execution than the process or thread. The most fundamental difference is that the entire operating system will exist in a single memory context which should reduce the context switch overhead to practically nothing, although with the cost of having to self enforce memory protection.

I won’t get into more detail about the execution model, because I’m not there yet, but in this single-context it’s clear that we’ve thrown running native binaries out the window, at least for untrusted code. Because of this, the first step towards the goal is writing a compiler for a memory safe language which is what my focus has been for the last couple of months.

Vua

Vua is what I call this language, and as you might guess it’s based on Lua in syntax. I chose to use Lua as a base for a few reasons:

It has relatively minimal, C-like syntax. Lua has few bells and whistles (although it does have some), it’s memory safe, and it’s based around just three basic data types (numbers, strings, tables).

LuaJIT exists, is fast, and has a permissive license (MIT). I am still rolling my own for various reasons, but it’s nice to have a working model to look at and I have included a small amount of LuaJIT code directly (mostly bytecode definitions).

Of course, that’s not to say that Vua is Lua, in fact they’re already incompatible with each other in some very minor ways and the compiler/bytecode interpreter/JIT is going to need to be further tweaked to squeeze more predictability (i.e. “2” + 1 is an error, not 3) and performance out of the resulting native instructions. I have ideas on this, but they mostly come down to replacing CData with a table-esque “struct” data structure that can be read and written byte accurately using a new long integer type “reg”, both with specialized bytecode. The default numeric type is also an integer.

The big thing I’m concerned about at this point is how garbage collection will work, or if it’s smarter to keep memory management explicit. In a traditional system, explicit memory management is obviously better for performance both in terms of fewer instructions and running memory usage since programs explicitly release memory as soon as they’re done with it. Maybe that’s the end of the discussion, and honestly that would be less work in the long run because a lot of effort goes into effective garbage collection. The only thing that makes me question this position is that stale references to freed memory are security issues in this system. Traditional processes get punished for use-after-free by faulting, or just screwing some other piece of their own data. In this system, whatever gets placed in that address space after the memory is freed could be entirely unrelated to your program. Then again, Viridis has control over the running code either through interpretation or compilation and we’re counting on some level of forced opt-in to safely deal with memory buffers, so maybe that’s a non-issue.

Anyway, at the moment I’m hedging my bets on this front, in the hopes that as I develop in the language (leaking memory everywhere) the best answer will become apparent. There is no GC at the moment, but the memory allocator and object header have some space reserved for its use.

Thoughts on Performance

Memory tracking aside, the big issue that comes to mind with using a compiler like this is performance. JITs, or other interpreter strategies have some advantages with runtime optimization, but fundamentally you’re adding a compile stage and/or interpreter overhead versus just loading a big chunk of instructions and letting the CPU go at it. It would seem that it’s impossible to beat that, but I’m hoping that this obstacle will be overcome partially through avoidance (perhaps bytecode or marked up native code could be cached instead of generated each time), partially through mitigation (making the compiler faster and better), and partially through the structure of the system favoring small, non-blocking operations on shared memory far moreso than any existing operating system paradigm.

Of course all three of these strategies will have to be employed just to break even on the performance front, but if we can come close then hopefully the other benefits of using Vua at a system level will make this a net gain. After all, even using the CPU isn’t technically as fast as rendering your algorithms directly into silicon but we put up with the abstraction because it’s fast enough and the extreme versatility of the processor (the fact that you don’t have to spend years in logic design, verification and manufacturing) more than make up for the performance loss. No piece of software is going to represent quite that level of quantum leap, but the core conceit of Viridis (and the JVM, and pretty much dynamic languages in general) is that a small amount of performance is worth trading for other concerns, like debuggability, portability, or expressivity.

Progress

It’s always best to look at the Viridis Git to see what I’ve been doing. To summarize, Vua’s parser and bytecode generation are almost complete. They need to have their flow control and table manipulation fleshed out, but they’re capable of generating a function prototype (compiled bytecode) with most of the same bytecode instructions as LuaJIT.

At the moment, and the reason I’m procrastinating and writing this post instead of working on it now, I’m just about to begin coding the actual virtual machine to execute the prototype.

I’ve also made a number of improvements on the kernel front, like moving to gas from NASM and calibrating TSC to allow for rough wall clock timing feedback, but most of my effort is focused on getting Vua up to snuff.

In a break from the in-depth technical format of the previous posts, in this post I’ve elected to give a brief snapshot on development and just intersperse links to interesting bits and pieces and let the code speak for itself. My reasoning for this shift is that it’s just too time consuming to write in-depth technical articles to what amounts to development snapshots of a final product that probably won’t resemble its early stages very much.

One thing is certain in kernel development, and that’s that there is a ton of work to be done. Even mature kernels like Linux are constantly in flux, and for a kernel like Viridis this means that there are about a thousand things to be done at any given time. Since I’m the sole developer, the currency of the project is my attention and that’s not infinite. This leads to, shall we say, scaffolding where some “subsystems” aren’t anything more than a shadow of what they need to be. For example, the vsalloc implementation I wrote about in the last MM article… utter garbage, and I disclaimed that it was garbage, but considering handing out virtual addresses (at the time) was just the last small hurdle in having a working page allocator, I just needed something in place that would give me a valid answer for the handful of calls I made to it. Or the physical page allocator itself, it currently takes one gigantic alloc of 0.3% of the total memory available and eventually that will be unacceptable. For now though, with no drivers up or programs running it’s just not worth the effort until there’s something to do with the other 99.7% of memory.

Anyway, the result of the scaffolding code is that there are corners of the codebase that just aren’t worth writing about in detail because they’ll cease to exist the minute they become inadequate. Like vsalloc, in the current git it’s been reworked entirely to use a generalized domain allocator that is just all around better… but of course I haven’t written a single sentence about it until now. And that’s just stuff I’ve already written about, which is a small fraction of what I’ve actually implemented.

Bringing us up to speed, here’s the stuff I’ve implemented in Viridis and haven’t had time or desire to dissect:

EFI. Viridis can be loaded directly by EFI firmware, transition to Runtime (ExitBootServices) and can use GOP framebuffers directly.

I could write pages about all of these, but I haven’t because that time is better spent implementing more stuff.

Which brings me to another transition. Viridis has transitioned from a toy to an experiment.

Initially I was going to implement this kernel as a sort of self-contained universe with its own quirky userspace, and a toolchain completely separate from GNU, but still very much modeled after Linux even if it wasn’t POSIX compliant. Running native ELF binaries, one memory context per process, similar schedulers, similar movement of data through the system.

Now I think, if I’m going through all of this trouble anyway, why not attempt to do something really novel? If I’ve already decided to throw away any sort of portability, then why not think about a really interesting way to organize the OS and see if you can get some interesting behavior? So I came up with a concept, and now I’m going for it.

I don’t really want to get into more detail on it, because this is still pre-prototype stage and it might be a complete pipe dream (although I’ve thought about it a lot and can’t see any obvious shortcomings), but the core idea is to fit the entire userspace more into the reactive programming paradigm where the system is oriented around data and signaling changes to that data in the fewest instructions possible – and I don’t just mean practically the fewest where you’re jumping between processes and every layer has a buffer it’s reading into and writing from, blocking on this or that, I mean literally the fewest where changes are propagated without kernel intervention if at all possible.

It’s going to require an entirely different concept of what a task is and how it’s scheduled and executed. In other words, I aim to slaughter some sacred cows and create something wholly unlike the Unix and NT descended kernels that dominate computing today. The question is whether that’s possible without making a lethal mistake, like becoming a worthless cooperative multitasking system, or losing an inordinate amount of performance or readability.

Anyway, this and a lot of other related questions, I am asking in code and we’ll see if I get any interesting answers. Worst case, and this is pretty likely, I end up with an even quirkier self contained universe to put on my mantle and admire. Best case, I revolutionize computing, or at least learn something interesting that I can write a paper about or apply elsewhere.

Concepts

Page Allocation

In our last article, we jumped through all of the hoops to get our kernel into a 64 bit C environment. Now we can start looking toward writing drivers, and the most important driver we have to get started on is the one for the device that we’ve already started to use. Memory.

Last time we had to wrangle the concept of paging just enough to get us into the kernel, and most of the basic stuff we did, I’ve also implemented in C.

But just handling paging doesn’t mean we’re done with memory. Now, we need to be able to write systems that allocate and free memory without caring about the underlying mechanisms and, most importantly, without giving out memory that’s reserved or already used.

That’s where the page allocator comes in. This allocator is the only allocator that’s aware of physical memory directly and has the coarse grain control given to us by the hardware. The page allocator will know what sections of memory are reserved, what sections are already used, and what sections can still be handed out.

This is not the general purpose smaller allocator (which would implement kmalloc) but that system will be one of the primary users of the page allocator eventually.

The Buddy Allocator

The main concept of this article is the “buddy allocator” which is a specific type of allocator designed to allow us to track large amounts of memory with comparatively little overhead. This is the approach used by the Linux kernel and, similar to Viridis, can be found in mm/page_alloc.c of the Linux source.

The idea of a buddy allocator is that you have multiple linked lists, each one with a number of blocks of memory that can be easily searched. Each list only contains blocks of memory of a certain size. In Linux and Viridis, the top size is 8M and, since it’s the smallest allocated unit, 4k is the bottom size, which means that we have 12 linked lists, one for each power of two from 4k up to 8M.
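To make the list arithmetic concrete, here is a minimal sketch assuming 4k pages and an 8M top size. The names (PAGE_SHIFT, MAX_ORDER, order_size, order_for_size) are made up for illustration and aren’t necessarily what the Viridis tree uses.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u              /* 4k pages */
#define MAX_ORDER  11u              /* hypothetical name: 4k << 11 == 8M */
#define NUM_ORDERS (MAX_ORDER + 1u) /* one list per order, 0..11 */

/* Size in bytes of a block that lives on the list for `order`. */
static inline uint64_t order_size(unsigned order)
{
    return (uint64_t)1 << (PAGE_SHIFT + order);
}

/* Smallest order whose block size can hold `bytes`, or -1 if the
 * request is larger than the 8M top block. */
static inline int order_for_size(uint64_t bytes)
{
    for (unsigned order = 0; order <= MAX_ORDER; order++)
        if (order_size(order) >= bytes)
            return (int)order;
    return -1;
}
```

With these definitions order 0 is a single 4k page and order 11 is the 8M top block, which is where the count of 12 lists comes from.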

An 8M top block size might not seem like a lot, but it’s bigger than most allocations are going to be and a 1 GB block of memory only divides into 128 8M blocks, so it’s a good compromise between usefulness and resource usage.

When we set up the buddy allocator, we divide the given memory into these chunks, but for simplicity’s sake, let’s say we’re booting in an embedded environment and 8MB is actually our total amount of memory. Easy, that’s represented by one big 8 MB chunk, with an address of 0 and no next. All of the other lists are empty.

Then, someone comes along and uses the page allocator to grab a single 4k page.

That 8M block is then divided up. 8MB gets split into 4M and 4M. Still can’t get 4k out of it, so one of those 4M chunks gets split, and so on until that 8MB is split into 4M, 2M, 1M, 512k, 256k, 128k, 64k, 32k, 16k, 8k, 4k, 4k and one of those 4k chunks is moved to the allocated list of the right size (used_pages[0]), before getting handed off to the requester.

Now, the cool part. Those two 4k blocks we created are “buddies” because they’re contiguous and their addresses are going to differ by exactly one bit, the bit that represents the size of the block. These two blocks are 4k in size, 4k in bytes is 0x1000, and that single bit in 0x1000 is the bit that will be different between those two. 0x2000 and 0x3000 are buddies (they only differ by the 0x1000 bit). 0x3000 and 0x4000 are not buddies even though they’re contiguous because their addresses differ by more than one bit (bits 0x4000, 0x2000, and 0x1000 are different).

This might seem a bit confusing, but it’s actually extremely easy to calculate an address’s buddy of a certain size using XOR.
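As a minimal sketch of that calculation, assuming 4k pages and block addresses aligned to their own size (buddy_of and are_buddies are illustrative names, not taken from the Viridis source):

```c
#include <stdint.h>

#define PAGE_SHIFT 12u  /* 4k pages */

/* Address of the buddy of the block at `addr` for a given order.
 * Flipping the single bit that equals the block size yields the buddy. */
static inline uint64_t buddy_of(uint64_t addr, unsigned order)
{
    return addr ^ ((uint64_t)1 << (PAGE_SHIFT + order));
}

/* Two blocks are buddies at `order` if their addresses differ in
 * exactly the block-size bit and nothing else. */
static inline int are_buddies(uint64_t a, uint64_t b, unsigned order)
{
    return (a ^ b) == ((uint64_t)1 << (PAGE_SHIFT + order));
}
```

So buddy_of(0x2000, 0) is 0x3000 and vice versa, while are_buddies(0x3000, 0x4000, 0) is false because those two addresses differ in three bits.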

The reason that this is useful is that when we free the pages, we know the address, and we can find the size (since we have to find the freed address in the used lists) so we can calculate the allocation’s buddy and if it’s also free, we can then merge those two blocks into one bigger block.

So if our lists look like 4M, 2M, 1M, 512k, 256k, 128k, 64k, 32k, 16k, 8k, 4k and that 4k page’s buddy gets freed, it will merge back into an 8k block, whose buddy is also free so it merges into a 16k block whose buddy is free… and so on back until you have your pristine 8M block back.
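The split and merge walk described above can be sketched in a few dozen lines. This is a toy model of the 8M example, not the real allocator: it assumes a single 8M region at address 0, unsorted lists, and no used lists (a freed caller has to pass the order back in), and every name here is hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12u
#define MAX_ORDER  11u
#define POOL_SIZE  2048u   /* one struct per 4k page of the 8M region */

struct block { uint64_t addr; struct block *next; };

static struct block pool[POOL_SIZE];
static struct block *unused;                    /* spare block structs */
static struct block *free_list[MAX_ORDER + 1];  /* one list per order */

static void push(struct block **list, struct block *b) { b->next = *list; *list = b; }
static struct block *pop(struct block **list)
{
    struct block *b = *list;
    if (b) *list = b->next;
    return b;
}

/* Unlink and return the block at `addr` from a list, or NULL if absent. */
static struct block *take(struct block **list, uint64_t addr)
{
    for (; *list; list = &(*list)->next) {
        if ((*list)->addr == addr) {
            struct block *b = *list;
            *list = b->next;
            return b;
        }
    }
    return NULL;
}

/* Start with one free 8M block at address 0; all other lists empty. */
static void buddy_init(void)
{
    unused = NULL;
    for (unsigned i = 0; i <= MAX_ORDER; i++) free_list[i] = NULL;
    for (unsigned i = 0; i < POOL_SIZE; i++) push(&unused, &pool[i]);
    struct block *b = pop(&unused);
    b->addr = 0;
    push(&free_list[MAX_ORDER], b);
}

/* Returns the block address, or UINT64_MAX if nothing big enough is free. */
static uint64_t buddy_alloc(unsigned order)
{
    unsigned o = order;
    while (o <= MAX_ORDER && !free_list[o]) o++;   /* at most 12 checks */
    if (o > MAX_ORDER) return UINT64_MAX;

    struct block *b = pop(&free_list[o]);
    while (o > order) {                /* split: keep low half, free high half */
        o--;
        struct block *half = pop(&unused);
        half->addr = b->addr + ((uint64_t)1 << (PAGE_SHIFT + o));
        push(&free_list[o], half);
    }
    uint64_t addr = b->addr;
    push(&unused, b);   /* a real allocator would move b to a used list */
    return addr;
}

/* Free a block and merge upward for as long as its buddy is also free. */
static void buddy_free(uint64_t addr, unsigned order)
{
    while (order < MAX_ORDER) {
        uint64_t buddy = addr ^ ((uint64_t)1 << (PAGE_SHIFT + order));
        struct block *b = take(&free_list[order], buddy);
        if (!b) break;
        push(&unused, b);
        if (buddy < addr) addr = buddy;   /* merged block starts at the lower one */
        order++;
    }
    struct block *b = pop(&unused);
    b->addr = addr;
    push(&free_list[order], b);
}
```

Calling buddy_init(), then buddy_alloc(0), leaves exactly one free block on each list from 4k through 4M; calling buddy_free(0, 0) afterwards walks the merges back up until the 8M list holds the single original block again.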

Advantages of the Buddy Allocator

This system has a lot of advantages, especially if you plan on having heavy usage.

It’s really fast to search for memory to allocate. Finding a block of free memory (or realizing you don’t have enough memory to satisfy the request) takes at most MAX_ORDER comparisons so it runs in constant time no matter how much or little memory you’re tracking whereas a simple linked list or a bitmap would require an exhaustive linear search to determine this. Don’t underestimate this feature, as eventually we may be tracking and fragmenting terabytes of memory and allocating in linear time instead of constant time would add significant time overhead, especially in the failure case.

It’s a great compromise on memory usage. A bitmap would be the most memory efficient fully functional allocator. For 1GB of memory, with a bitmap bit-per-page approach, you’d have a constant 262,144 bits, or 32,768 bytes, or 32k worth of overhead per gigabyte. The buddy allocator, with the absolute very worst case fragmentation (i.e. all of memory allocated in 4k chunks) would have 4MB worth of overhead per gigabyte (128 times as much overhead). However, best case, 1GB totally unallocated, you have only 2k worth of overhead (128x 8MB chunks, 16 bytes of overhead apiece) or 1/16th the overhead of the bitmap. In short, with real use cases (which will not approach maximum fragmentation unless we’re stupid about it) we get roughly comparable memory usage with a much faster data structure. A simple singly-linked list would have similar memory usage (or even slightly better if you didn’t restrict block sizes to be convenient powers of 2).

It’s dead simple. This advantage can’t be overstated in a fundamental kernel data structure. Any CS 101 student is familiar with how to manipulate singly-linked lists. The data structure used to represent free memory is the same one used to represent used memory (a failing of a bitmap). There’s no awkward bit masking and manipulation or weird SSE optimizations as there would be for a bitmap, and the buddy system means the logic to merge blocks is exceedingly simple – checking for the existence of a single known address at a single known block size, rather than having to search for essentially any adjacent free block.
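The overhead figures in the memory usage comparison above are easy to double check. A small sketch, assuming 4k pages, an 8M top block and the 16 bytes per block struct quoted in the text (the function names are mine):

```c
#include <stdint.h>

#define PAGE_SIZE       4096ull
#define TOP_BLOCK_SIZE  (8ull << 20)  /* 8M */
#define BLOCK_STRUCT_SZ 16ull         /* bytes per block struct, per the text */

/* Bitmap: one bit per page, regardless of fragmentation. */
static inline uint64_t bitmap_overhead(uint64_t mem)
{
    return mem / PAGE_SIZE / 8;
}

/* Buddy, worst case: every page is its own 4k block. */
static inline uint64_t buddy_worst_overhead(uint64_t mem)
{
    return mem / PAGE_SIZE * BLOCK_STRUCT_SZ;
}

/* Buddy, best case: memory is entirely unsplit 8M blocks. */
static inline uint64_t buddy_best_overhead(uint64_t mem)
{
    return mem / TOP_BLOCK_SIZE * BLOCK_STRUCT_SZ;
}
```

For 1GB these come out to 32,768 bytes for the bitmap, 4,194,304 bytes for the fully fragmented buddy lists, and 2,048 bytes for the pristine buddy lists, which matches the 128x and 1/16th ratios above.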

Quirks of My Implementation

Now I’ve made some choices that compromise some of the above, but they’re not inherent to the algorithm. Crucially, both of these quirks could be re-visited at a future time and changed without altering the base algorithm we use.

I sort the lists. While true constant time allocation is possible with this algorithm, I’ve decided to keep the blocks sorted by address. This turns the split / re-sort operation into a linear operation because you have to search for the correct place to patch in the block, instead of just prepending it. Strictly this makes allocation linear as well, but in practice it’s linear over a much smaller subset of blocks and it’s still constant time when there are already blocks of the correct size. On the other hand sorting makes searching take less time, so freeing pages and searching for blocks to merge gets a small boost. The reason I sort, however, is that it makes it easy to test and examine the data in a debugger to verify the allocator’s memory map looks right.

I assume worst case fragmentation. Obviously, the kernel needs this data structure to be positively bulletproof and to make that true, while retaining simplicity, I’ve made the setup code reserve enough memory to have a page block struct for every single page in the system. That means we’re always hitting the worst case of 4M of overhead per 1G of memory and thus using 0.3% of total memory, but it also means we don’t need to complicate the page allocator by making it call itself either.

Virtual Address Allocator

The buddy allocator isn’t the only one that the kernel needs even though it forms the backbone of the page allocator.

Remember from our previous work that in paging there is a difference between the physical memory address and the virtual address to use it when paging is enabled. The page allocator controls how physical memory is handed out, but we still need to allocate that virtual address so we don’t overwrite someone else’s mapping.

The good news is that we already know the size of the virtual address space we’re dividing up. It’s architecture specific, but constant (in our case we’ll just assume it’s a full 64 bits) so we don’t really need to care about what is “free”, only what we have already allocated. We also get a totally free virtual address space for every context (process / thread) we create, so we don’t have to worry too much about performance (the number of virtual allocation ranges is likely to be small per context) but obviously we can’t waste too much memory if we’re potentially going to have thousands of copies of it.

The bad news is that we’re going to want to know a lot more about these address allocations. We’re going to want to know not just address and size, but also what is expected to be there if the kernel loaded it (like what file data is there), or which driver requested the mapping. This is not only useful for debugging purposes, but later when we look at something like having swap memory, or implementing an mmap() style syscall.

Fortunately, for now, we can afford to effectively stub this out. Our implementation is a simple linked list that will merge adjacent allocations, but it doesn’t do anything smart with the name and, like our page allocator, it assumes a constant amount of pre-allocated overhead – in this case one page which will probably be enough until we start forking processes and threads at which point we’ll need to revamp the virtual address allocator to be smarter anyway.
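To give a feel for what “a simple linked list that will merge adjacent allocations” might look like, here is a rough sketch. It is not the vsalloc code: the names, the fixed pool standing in for the one pre-allocated page, and the choice to have the caller pick the address are all assumptions made to keep the example short.

```c
#include <stddef.h>
#include <stdint.h>

#define VS_MAX_RANGES 128u  /* hypothetical stand-in for one page of structs */

struct vs_range {
    uint64_t start;           /* inclusive */
    uint64_t end;             /* exclusive */
    struct vs_range *next;
};

static struct vs_range vs_pool[VS_MAX_RANGES];
static size_t vs_pool_used;
static struct vs_range *vs_spare;      /* structs released by merging */
static struct vs_range *vs_allocated;  /* allocated ranges, sorted by start */

static struct vs_range *vs_new(void)
{
    if (vs_spare) {
        struct vs_range *r = vs_spare;
        vs_spare = r->next;
        return r;
    }
    return vs_pool_used < VS_MAX_RANGES ? &vs_pool[vs_pool_used++] : NULL;
}

/* Record [start, start + size) as allocated. Returns 0 on success, or -1
 * if the range overlaps an existing one or the pool is exhausted. */
static int vs_reserve(uint64_t start, uint64_t size)
{
    uint64_t end = start + size;
    struct vs_range *prev = NULL, *cur = vs_allocated;

    /* Skip every range that ends strictly below the new one. */
    while (cur && cur->end < start) { prev = cur; cur = cur->next; }

    if (cur && cur->end == start) {
        /* Touches cur from above: grow cur, then absorb its neighbor
         * if the grown range now touches that as well. */
        struct vs_range *next = cur->next;
        if (next && next->start < end) return -1;
        cur->end = end;
        if (next && next->start == end) {
            cur->end = next->end;
            cur->next = next->next;
            next->next = vs_spare;
            vs_spare = next;
        }
        return 0;
    }
    if (cur && cur->start < end) return -1;   /* overlaps cur */
    if (cur && cur->start == end) {           /* touches cur from below */
        cur->start = start;
        return 0;
    }

    struct vs_range *r = vs_new();            /* isolated: insert before cur */
    if (!r) return -1;
    r->start = start;
    r->end = end;
    r->next = cur;
    if (prev) prev->next = r; else vs_allocated = r;
    return 0;
}
```

Because touching ranges collapse into one node, a long run of back to back mappings costs a single list entry, which is why a small fixed pool goes a long way here.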

Early Page Allocator Implementation

The code for the page allocator is browseable in git. I won’t cover the entirety of it (it’s almost 1000 lines long with comments), but I will highlight a few of the pieces.

Also note that I wrote some very basic mapping functions that essentially implement what we did in assembly to get to long mode, but in C. That is, take a physical address and a virtual address, go through the page tables, fill it in, and reset CR3. I don’t think these need to be directly covered except to say that the “early” variants expect to be pointed at the beginning of some amount of blank pages and update that pointer as necessary, where the standard variants will call out to the page allocator to get pages when necessary.

The Chicken and the Egg

The trickiest part of initializing the page allocator is getting it to account not only for the memory that’s present in the system, but for everything you’ve already done, as well as its own overhead, before you have the allocator set up.

There are three sources for this information.

For the system memory, we get a pretty good memory map from GRUB and that was passed to main() in EDI.

For the kernel memory, we still have the linker trick (kernel_size) to know the size of the binary, but we also passed ESI from head.asm which contained the computed end of the structures we “allocated” after the binary just to get into long mode.

The final piece is how much overhead the page allocator needs to allow for itself, which is a function of how much system memory there is. As I mentioned in the concepts section, I’ve chosen to just prepare for maximum fragmentation to make the page allocator code itself bulletproof (i.e. the page allocator may tell you it has no memory left, but it will never run out of memory itself).

First we’ll start with figuring out exactly how much memory there is, which will give us a size for how much memory we need to reserve to setup the page allocator.

You can see here that I’m traversing the GRUB provided memory map to find both the total amount of memory, as well as the largest address. We use the largest address instead of total memory to compute the maximum overhead because we will have to mark reserved or non-existent chunks of memory as “allocated” even if they can’t be handed out.
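For readers following along without the source open, the traversal amounts to something like the sketch below. The mem_region struct is a simplified, hypothetical stand-in rather than the actual layout GRUB hands over, and I’m assuming here that only usable regions count toward the largest address.

```c
#include <stddef.h>
#include <stdint.h>

struct mem_region {        /* hypothetical, simplified memory map entry */
    uint64_t base;
    uint64_t length;
    int usable;            /* nonzero if the region is ordinary RAM */
};

struct mem_summary {
    uint64_t total_usable; /* sum of all usable region lengths */
    uint64_t max_addr;     /* end of the highest usable region */
};

/* Walk the map once, accumulating total usable memory and the largest
 * usable address. The latter is what sizes the allocator's overhead. */
static struct mem_summary summarize_map(const struct mem_region *map, size_t n)
{
    struct mem_summary s = { 0, 0 };

    for (size_t i = 0; i < n; i++) {
        if (!map[i].usable)
            continue;
        uint64_t end = map[i].base + map[i].length;
        s.total_usable += map[i].length;
        if (end > s.max_addr)
            s.max_addr = end;
    }
    return s;
}
```

The gap between total_usable and max_addr is exactly the reserved or non-existent memory that still needs block structs so it can be marked as allocated.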

After we know exactly how many pages we’ll need for the allocator in the worst case, we traverse the GRUB memory map again to find a piece of usable memory that’s big enough to hold all of those pages. We have to be careful to avoid allocating space that our kernel is using already, as well as accounting for how many pages it will take to actually map the pages the page allocator needs (sheesh!).

Pretty straightforward. One note is that the __overhead_pages function on line 44 accounts for exactly how many pages of page table structures we’d need to map a number of pages at a certain address based on how many page structure boundaries the mapping crosses. Mapping 2 pages, for example, could take between 3 and 6 overhead pages for the tables depending on what address those two pages are mapped at, assuming that the page table is completely empty except for the PML4 page we have to already have.
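The boundary counting can be sketched as follows. This is my own reconstruction of the idea, not the __overhead_pages code itself, and it assumes 4-level paging with 4k pages, an empty table below the PML4, and a lower-half address range that doesn’t wrap.

```c
#include <stdint.h>

#define PAGE_SHIFT 12u
#define PT_SHIFT   21u  /* one page table covers 2M */
#define PD_SHIFT   30u  /* one page directory covers 1G */
#define PDPT_SHIFT 39u  /* one PDPT covers 512G */

/* Number of distinct (1 << shift)-sized regions touched by [first, last]. */
static uint64_t regions_spanned(uint64_t first, uint64_t last, unsigned shift)
{
    return (last >> shift) - (first >> shift) + 1;
}

/* Table pages needed to map `pages` 4k pages starting at `virt`:
 * one PT per 2M region, one PD per 1G region, one PDPT per 512G region. */
static uint64_t overhead_pages_sketch(uint64_t virt, uint64_t pages)
{
    if (pages == 0)
        return 0;

    uint64_t last = virt + (pages << PAGE_SHIFT) - 1;

    return regions_spanned(virt, last, PT_SHIFT)
         + regions_spanned(virt, last, PD_SHIFT)
         + regions_spanned(virt, last, PDPT_SHIFT);
}
```

Two pages at address 0 need one table at each level, so 3 pages, while two pages straddling a 512G boundary need two tables at every level, so 6, which lines up with the range quoted above.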

So at the end of this bit of code, we either screwed up and returned -1, or we have an address that should be able to fit all of our page allocator overhead into it.

Then we move on. We point free_pages_address at our memory block, which will begin with the page overhead. We point unused_page_blocks at the memory after the overhead, the memory that we’ll actually be using, and __init_unused_page_blocks then initializes it to be one giant singly linked list of page block structures that the page allocator can use.

We start by assuming all of system memory is free from 0 to max_pages. Then, we iterate over the GRUB memory map for a third and final time, reserving each of the unusable sections. Then we reserve the structures GRUB gave to us, our stack, and the entire space of memory that we used for the kernel including the early paging structures (up to end_structures) and the memory we just set aside for the page allocator itself.

I won’t spend too much time quoting code, but the two functions that are integral here follow.

Here you can see that we’re using the unused block linked list like a stack where we just pop one off of the front when we need it and push one when we discard (elsewhere on free). Note that this is not how we actually free, this does no checking or merging and is only used during init when it’s assumed that you’re not relying on these features.
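The stack-like use of the unused list can be sketched in a few lines of C. The `page_block` fields and the function names below are hypothetical; only the push and pop discipline is taken from the description above.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical page block record; the real structure carries more state. */
struct page_block {
    unsigned long address;
    struct page_block *next;
};

/* Head of the singly linked list of records nobody is using. */
static struct page_block *unused_blocks = NULL;

/* Return a record to the front of the unused list. */
static void push_unused(struct page_block *b)
{
    assert(b != NULL);
    b->next = unused_blocks;
    unused_blocks = b;
}

/* Take a record off the front of the unused list, or NULL if it is empty. */
static struct page_block *pop_unused(void)
{
    struct page_block *b = unused_blocks;

    if (b != NULL) {
        unused_blocks = b->next;
        b->next = NULL;
    }
    return b;
}
```

Both operations touch only the head of the list, so they are constant time and never need to allocate.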

This provides a better example of how the order lists work. We search each free block list from 8M down to 4k and, when we find a block that contains the address we’re trying to reserve, we split the block and then advance to work on a smaller region.
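To make the splitting concrete, here is a simplified sketch that reserves a single 4k page out of one larger block. It is not the Viridis reserve function (which works on address ranges and real free lists); `reserve_page` and `free_out` are hypothetical names, and orders are assumed to run from 0 (4k) to 11 (8M).

```c
#include <assert.h>
#include <stdint.h>

#define PAGE_SIZE 4096ULL
#define MAX_ORDER 11            /* 4 KB << 11 == 8 MB */

/* Split the block of the given order at `base` down to the single 4k page
 * that contains `target`. Every split leaves one half free; the base of the
 * free half of order k is stored in free_out[k]. Returns the base of the
 * page that ends up reserved. */
static uint64_t reserve_page(uint64_t base, unsigned order, uint64_t target,
                             uint64_t free_out[MAX_ORDER])
{
    assert(order <= MAX_ORDER);
    assert(target >= base && target < base + (PAGE_SIZE << order));

    while (order > 0) {
        order--;
        uint64_t half = PAGE_SIZE << order;

        if (target < base + half) {
            free_out[order] = base + half;  /* upper half stays free */
        } else {
            free_out[order] = base;         /* lower half stays free */
            base += half;
        }
    }
    return base;
}
```

Reserving the page at 0x1000 out of a 16k block at 0 leaves an 8k block at 0x2000 and a 4k block at 0 on the free lists.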

Page Allocator Use Functions

At this stage we have all of the system physical memory tracked, which is the hard part, but here is the text for the actual core allocate and free functions. Note that these are still entirely based on physical addresses.

Simple Virtual Address Allocator

As I mentioned in the concepts section, the virtual address allocator is a much simpler affair, but it’s required for the page allocator to really be useful. We want consumer functions to be able to do page_alloc(order) and get back a valid pointer of the right size without having to manually specify a virtual address.

NOTE: this allocator is extremely basic and at this point is only used to properly mark parts of the virtual address space as taken. The name can be ignored and discarded, and there is a restrictive limit on the number of virtual blocks that can be used. These caveats are only acceptable in the current state where virtual address allocation is just a hurdle for the page allocator but later this code will have to be improved.

The initialization begins with getting a page from the early allocator through a helper function (essentially just page_alloc_phys + mapping to the given address). At the moment, this 4k chunk of space is assumed to be enough to handle all of our virtual blocks and the pop function will hang if we go beyond that to indicate that that needs to be implemented (a problem that was sidestepped in the page allocator by pre-allocating everything).

NOTE: The only space that we pre-map into the allocator is what we’ve allocated from the VSALLOC_HEAP_VIRTUAL ourselves. This is because when we eventually use vsalloc we give a hint address that functions as a minimum value. We probably should reserve a massive chunk for the kernel just to be safe, but at this point that’s unnecessary.

The interesting part of the virtual address allocator is the following

As you can see, these functions keep the virtual block list in order, and on allocate will attempt to merge blocks that are adjacent or overlapping and split them apart when you free a subset of a block. As I mentioned above, name is currently ignored when merging blocks, which is bad behavior, but again this is intended to be just enough to get the page allocator running.

Final Page Allocator

Now that we (nominally) have the ability to get a physical page and a virtual address, we can combine these two into our final product.

Basic VGA Support

It might seem like an off the wall tangent to go from page allocation to display support, but x86 machines have a primitive frame buffer that’s exposed at a known physical address (0xB8000) with a window of 80 columns and 25 rows, each with one byte for a character and one byte for a color foreground / background. This has already been eliminated from the page allocator by being part of a reserved block in the GRUB memory map, but now that we have the ability to allocate virtual address space and map pages reliably, it’s trivial to start putting characters on the screen.
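The cell layout described above can be sketched as follows. The function names are hypothetical, and the frame buffer is passed in as a pointer so the same code works on an ordinary array; on real hardware it would be whatever virtual address 0xB8000 has been mapped to.

```c
#include <assert.h>
#include <stdint.h>

#define VGA_COLS 80
#define VGA_ROWS 25

/* Build one 16-bit cell: character in the low byte, color in the high byte
 * (background in the upper nibble, foreground in the lower nibble). */
static uint16_t vga_cell(char c, uint8_t fg, uint8_t bg)
{
    uint8_t color = (uint8_t)((bg << 4) | (fg & 0x0F));

    return (uint16_t)((color << 8) | (uint8_t)c);
}

/* Write one character at (row, col) of an 80x25 text frame buffer. */
static void vga_put(volatile uint16_t *fb, int row, int col,
                    char c, uint8_t fg, uint8_t bg)
{
    assert(row >= 0 && row < VGA_ROWS && col >= 0 && col < VGA_COLS);
    fb[row * VGA_COLS + col] = vga_cell(c, fg, bg);
}
```

Because x86 is little-endian, storing the 16-bit value puts the character byte first in memory and the color byte second, which is the order the hardware expects.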

Nothing too mystifying. I’ve also implemented a very nice version of vsnprintf() using GCC’s built in var args (stdarg.h) and some CS 101 buffer manipulation which gives us a very nice printk on top of a more general console layer to abstract through.

I feel the need to mention this here because I don’t want to spend an entire article on it, but the next one I write I’ll likely be using console output to debug.

The Code

The tag for this article is ‘page-alloc-and-vga’. You can browse it here.

Next Time

Next time, we’re going to look at handling interrupts in C, and using the xapic / x2apic timer (which will include doing CPUID for detection, MSR, and MMIO functions). If I’m feeling ambitious we might actually start scheduling kernel threads too.

Concepts

Long Mode

Long Mode is 64-bit mode. The AMD64 spec changes a lot of the behavior of the processor based on whether it’s in Long Mode or not. The opposite of Long Mode is Legacy Mode in which an AMD64/Intel64 processor will run like a Pentium (albeit a fast one).

Descriptors (aka Selectors)

Descriptors are an old way to describe memory. x86 and x86_64 have three descriptor tables, each of which holds a number of descriptors of different types. For now, only the GDT and IDT are interesting.

GDT

In the GDT (Global Descriptor Table), descriptors define code and data “segments” (and others we won’t cover here). The x86 and x86_64 processors have “segment registers” (CS for code and DS through GS for data access) that are byte references into this table. Each of these descriptors roughly contains a start address, size, a type, operation size (16,32 bit) and permissions info.

In 32-bit kernels, a code and data descriptor are setup for the kernel (privilege 0, supervisor) and also for userspace (privilege 3, user) for a total of 2 code and 2 data descriptors (among others that are irrelevant right now).

In Long Mode, a previously reserved bit is defined as the “Long bit” which tells the processor that the descriptor is 64-bit. In this case, the base and size of the descriptor are ignored and many of the obsolete type values are invalid. 32-bit descriptors can be used when Long Mode is enabled however, which is how 32-bit compatibility mode works (code run with a CS register pointing to a 32-bit code descriptor is in 32-bit compatibility mode).

IDT

The IDT (Interrupt Descriptor Table) includes a bunch of descriptors that tell the processor what to do when it receives an interrupt (which can be everything from a timer going off, an error, or a hardware notification). This table includes a separate (but similar) type of descriptor called the Interrupt Gate that includes a code segment (CS setting, referencing the GDT) and an address to jump to when an interrupt is received. In Long Mode, this descriptor is extended to allow a 64-bit target address and the code segment must be a 64-bit one.

Paging

Paging is the mechanism that replaced segmentation (descriptors) to manage memory. It’s extremely powerful and flexible. The core idea of paging is the separation of “virtual” and “physical” address spaces.

Without paging, when you reference memory @ 0x1000 you’re referencing the 0x1000th (4096th) byte of physical memory. If you access memory beyond the end of physical memory, you’ll error.

With paging, 0x1000 is a “virtual” address, which means that it can be mapped to any page (4k chunk) of physical memory. 0x1000 “virtual” could be 0x7f8000 “physical”.

Each process in a typical kernel has its own context, which includes its set of mappings from virtual to physical addresses. When one process is running, another process’ memory isn’t reachable (unless by design for something like threading). Separate contexts for each process also means that multiple programs linked at the same (virtual) address can happily run simultaneously because they occupy separate physical memory.

There are other advantages to paging as well. Fine-grained permissions, write monitoring, and page faults allow us a lot of flexibility with how we handle memory, but just for getting to Long Mode we need to make one context (the kernel context), that maps the kernel’s link address (virtual) to the memory GRUB loaded the kernel to (physical) so that we can enable paging and continue to run.

We’ll briefly get in to the mechanics of paging in this article, but it will come up over and over as memory management is one of the core tasks of the kernel. We’ll encounter paging again when we write our memory manager, and yet again when we start forking processes, and yet again when we deal with IO.

Implementation Note

Why Assembly?

It’s definitely possible to write a kernel in C with a bare minimum of inline instructions when you need to do something special (like loading the GDT/IDT registers, or switching paging contexts), but this would entail a lot of double checking assembly output, and a key problem with doing this is that addresses are not as easy to manipulate in C without a whole lot of casting pointers and other ugliness. This is especially an issue for our initial code because the link address (virtual address, the one you get when referencing a symbol) is different than our load address (where GRUB puts us and where addresses should be before we enable paging).

Why NASM?

For all the assembly in Viridis, I’ve decided to use NASM instead of the built-in GCC assembler as. The reasoning behind this is that I find NASM syntax to be clearer than as (even with -masm=intel), in addition to supporting the BITS directive to mix 32 and 64-bit code in the same file (which is important for this chapter).

I’m using the GCC C pre-processor (cpp) on top of the NASM files which seems like a hack, but it’s intended to allow us to share headers between NASM and C and avoid having to keep two sets synchronized.

The Road to Long Mode

Chapter 14 of the AMD64 Programmer’s Manual v2 covers the power-on state of the chip and then the initialization process. Fortunately, GRUB has already gotten us into Protected Mode so we needn’t worry about that part, although the Multiboot Specification section 3.2 mentions that the GDT it set up is probably no longer valid, so we should set up our own in known memory.

According to 14.5 (Long Mode initialization) we need to do the following. I’ve re-ordered them to the order in which we’ll actually do them.

The GDT must be setup with a 64-bit Code Segment

The IDT must be setup with 64-bit Interrupt Gates

PAE paging structures must be in place

At which point, you can move on to 14.6 which describes the mechanics of enabling long mode (set EFER[LME]), and activating it by enabling paging (getting EFER[LMA] set which confirms that Long Mode has actually engaged).

Initial Tweaks

When we last left off, we were loading a simple C “kernel” that did nothing but loop in place forever. This time we’re actually going to do something complex, so there are a handful of miscellaneous tweaks I’ve made to simplify things.

linker.ld

If you remember, last time we setup a linker script to force our GRUB signature to be properly placed in the resulting binary.

The first thing of note is the addition of .text_early. Before we setup paging, we’re dealing with physical addresses only, so this section will include all of the code that expects us to be using physical addresses directly. This is so, for example, we can use call populate_gdt and the address will be the correct physical address, rather than a currently invalid virtual address.

The second thing of note is that we get three linker variables: kernel_start, kernel_end, and kernel_size. These are extremely useful. We’ll use them when setting up paging to make sure that we’ve included our entire kernel.

Essentially, a bunch of directives and an infinite loop. We’ll add on to this later, but for now let’s talk about how we’ll host our segment descriptors.

Hosting the GDT

After we’re dropped into our code, we want to load our own GDT with three descriptors. One for 32-bit code (that we’ll use after we load the GDT but before we enable paging), one for 64-bit code (that we’ll use after paging) and a data descriptor that will work in either mode.

Structure

The structure of a single GDT descriptor is shown in figures 4-13 (legacy code and data), 4-20 (long code), and 4-21 (long data) in the AMD64 Programmer’s Manual v2. Legacy and long data descriptors are compatible (as 4-21 shows, only the valid bit matters for long data descriptors).

In short, we use FLAG_USER to show that these are code and data descriptors (FLAG_SYSTEM would indicate a system type descriptor which is an entirely separate thing). FLAG_R0 because we are ring/privilege 0, the most privileged (kernel). FLAG_P for present, FLAG_32 for 32-bit operations, and FLAG_4k because we’re going to use a 4k size instead of bytes.

For each of these descriptors, we’re going to cover the entire 32-bit address space so base is 0 and the segment limit (4k size) is going to be 0xFFFFF (the maximum).

This is a 3 argument NASM macro that will take our constant values and jockey their bits around to create an entry directly in the binary (DW and DB are pseudo instructions that embed words (16-bits) and bytes directly). If the &,|,>> confuses you, you may want to refresh your memory about bitwise operations.
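For readers more comfortable in C than in NASM macros, the same bit jockeying can be sketched as a function. This is not the Viridis macro: `gdt_entry` and the constant names are hypothetical, chosen to mirror the flags described above.

```c
#include <assert.h>
#include <stdint.h>

/* Access byte bits (hypothetical names). */
#define ACC_PRESENT   0x80u  /* descriptor is present                 */
#define ACC_RING0     0x00u  /* privilege level 0                     */
#define ACC_USER_SEG  0x10u  /* code/data rather than system segment  */
#define ACC_CODE      0x0Au  /* executable and readable               */
#define ACC_DATA      0x02u  /* writable data                         */

/* Flag nibble bits (hypothetical names). */
#define FL_4K         0x8u   /* limit counted in 4k units             */
#define FL_32         0x4u   /* 32-bit default operation size         */
#define FL_LONG       0x2u   /* 64-bit code segment                   */

/* Scatter base, limit, access byte and flag nibble into the 8-byte
 * legacy descriptor layout. */
static uint64_t gdt_entry(uint32_t base, uint32_t limit,
                          uint8_t access, uint8_t flags)
{
    assert(limit <= 0xFFFFFu && flags <= 0xFu);

    uint64_t d = 0;
    d |= (uint64_t)(limit & 0xFFFFu);             /* limit bits 15:0  */
    d |= (uint64_t)(base & 0xFFFFFFu) << 16;      /* base bits 23:0   */
    d |= (uint64_t)access << 40;                  /* type, S, DPL, P  */
    d |= (uint64_t)((limit >> 16) & 0xFu) << 48;  /* limit bits 19:16 */
    d |= (uint64_t)flags << 52;                   /* L, D, G          */
    d |= (uint64_t)(base >> 24) << 56;            /* base bits 31:24  */
    return d;
}
```

With base 0 and limit 0xFFFFF, a ring 0 32-bit code descriptor comes out as 0x00CF9A000000FFFF and the matching data descriptor as 0x00CF92000000FFFF.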

A bit of dissection. ALIGN 8 makes this long aligned, for access performance (processors take longer to access byte offsets from unaligned bases). This is irrelevant to the actual operation of the GDT however.

GDT and GDTEND are labels we’ll use later. CODE_SEL_32 and friends are the calculated byte offsets from the beginning of the GDT. These are the values that we’ll place in our CS and DS-GS segment registers.

The rest of the lines are calls to our entry macro, using the flag combos we already defined and the base / limit we already discussed.

Also included here is the NULL descriptor that is required to be the first descriptor (and thus used if a segment register is loaded with 0x0).

The GDTR

Software can define as many descriptors in the GDT as it likes (within reason), so the GDT can be any size. Since it can also be any place, the processor would have to store two registers worth of information to know the location and limit of the GDT. Well, hardware generally doesn’t use two registers when it can make do with one, so instead of pointing the processor directly at the GDT, we point it at the GDTR, a structure of known size.

Figures 4-7 and 4-8 in the programmer’s manual describe the GDTR in legacy and long mode. Fortunately, the only difference is the size of the address allowed for the GDT and it’s easy to make the two compatible.
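One way to picture a structure that works in both modes is the C sketch below. The `dtr` and `make_dtr` names are hypothetical, and `__attribute__((packed))` is a GCC extension. The idea is to always reserve eight bytes for the base: because x86 is little-endian, the low four bytes land exactly where legacy mode looks for its 32-bit base.

```c
#include <assert.h>
#include <stdint.h>

/* Descriptor table register image: a 16-bit limit followed by the base.
 * Legacy mode reads only the low 32 bits of base; long mode reads all 64. */
struct dtr {
    uint16_t limit;   /* size of the table in bytes, minus one */
    uint64_t base;    /* address of the first descriptor       */
} __attribute__((packed));

/* Build the register image for a table of `entries` descriptors. */
static struct dtr make_dtr(uint64_t table_addr, unsigned entries,
                           unsigned entry_size)
{
    struct dtr r;

    assert(entries > 0 && entries * entry_size <= 0x10000u);
    r.limit = (uint16_t)(entries * entry_size - 1);
    r.base = table_addr;
    return r;
}
```

A GDT with a NULL descriptor plus three 8-byte descriptors therefore gets a limit of 31.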

Load it up

To load the address of the GDTR, we make use of the special lgdt instruction, and put it into a basic assembly function to be called from our general code.

GLOBAL populate_gdt
populate_gdt:
    lgdt [GDTR]
    ret

Hosting the IDT

Structure

The IDT is very similar to the GDT, it’s a list of descriptors that we’ll point to with an IDTR (identical to the GDTR). The structure and purpose of each descriptor is a little different however, and it’s described in figure 4-24 of the AMD64 Programmer’s Manual v2.

There is one difference from the GDT and that’s that we can’t macro this one so easily. The reason is that each IDT entry describes a code descriptor (from the GDT) and an address to jump to if an interrupt is received (among other permission information). Each address points to an ISR (Interrupt Service Routine) and is only known at link-time, but we have to do the same masking and shifting we did with the GDT which requires a value at compile-time.

We could resolve this one of two ways. We could make it macro-able by assigning the ISRs to a section that we could give a known address in the linker script (so we would know the ISR addresses before link-time). Or, we could just init the entries at run-time. I believe that linker sections would be a good solution except for one thing: even though right now all of our ISRs are two byte stubs and they can be easily indexed and grouped, later we’re going to programmatically change the IDT entries anyway (albeit in C as drivers are loaded and request interrupts) and later we might want this boot code to put in actual values for built-in drivers which may not be of constant size and grouped into the ISR section.

Then again, perhaps we’ll route all of the IDT entries to a single interrupt master function and the section approach would work perfectly with a little modification to the section sizing.

Here are the relatively few interesting values for the IDT. The 64-bit code descriptor offset we defined in gdt.asm, the FLAG_INTERRUPT which is basically the type, as well as the common FLAG_R0 and FLAG_P for Ring 0 (most privilege) and present. The number of IDT_ENTRIES is referenced as “VECTOR” in section 16.2 of the programmer’s manual.

Now we define all 256 of our ISRs. Each ISR is a simple infinite loop which may appear useless to us, but for right now we’re not capable of actually handling any of them, and our debug environment can give us the current instruction pointer (IP) to tell us where we’re looping if we take an interrupt.

Init and Load It

NOTE: We are just going ahead and setting up a 64-bit IDT, as you can see from our usage of VIRT_BASE in this chunk as well as the expanded size of the IDT entries. We will have interrupts masked (cli) until after we’re in a 64-bit environment, so this is okay. When we load the GDT, we’ll load the 32-bit address just so we’re self-hosted when we enable Long Mode, but then we return with a fixup function that will convert the IDTR to a 64-bit address afterwards.

Okay, so we’ve created a working IDTR, and a bunch of stub IDT entries. Similar to the GDT, we have to populate the IDT, and load the IDTR. This time it’s a bit more complicated because we’re actually initializing the IDT at runtime instead of compile time.

This looks a lot more complex than it is. EAX is the pointer we’re writing to. EBX is the current isr address we defined, and ECX is just a holder to shift bits around. After we’ve done that for each ISR, we load the IDTR just like we loaded the GDTR.
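The shifting that the assembly loop performs is easier to see in C. The sketch below fills one 16-byte long mode interrupt gate; the struct, constant, and function names are hypothetical, and `__attribute__((packed))` is a GCC extension.

```c
#include <assert.h>
#include <stdint.h>

/* One 16-byte long mode interrupt gate. */
struct idt_gate64 {
    uint16_t offset_low;   /* handler address bits 15:0  */
    uint16_t selector;     /* code segment selector      */
    uint8_t  ist;          /* interrupt stack table slot */
    uint8_t  type_attr;    /* type, DPL, present         */
    uint16_t offset_mid;   /* handler address bits 31:16 */
    uint32_t offset_high;  /* handler address bits 63:32 */
    uint32_t reserved;
} __attribute__((packed));

#define GATE_PRESENT   0x80u
#define GATE_RING0     0x00u
#define GATE_INTERRUPT 0x0Eu  /* 64-bit interrupt gate type */

/* Split a 64-bit handler address across the three offset fields. */
static struct idt_gate64 make_gate(uint64_t handler, uint16_t selector)
{
    struct idt_gate64 g;

    /* Assumes a ring 0 GDT selector, i.e. a plain byte offset. */
    assert((selector & 0x7) == 0);
    g.offset_low  = (uint16_t)(handler & 0xFFFF);
    g.selector    = selector;
    g.ist         = 0;
    g.type_attr   = GATE_PRESENT | GATE_RING0 | GATE_INTERRUPT;
    g.offset_mid  = (uint16_t)((handler >> 16) & 0xFFFF);
    g.offset_high = (uint32_t)(handler >> 32);
    g.reserved    = 0;
    return g;
}
```

Doing this at run time sidesteps the problem that the handler addresses are only known once the linker has run.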

At this point, we still can’t take an interrupt (since our table isn’t a valid 32-bit table), but as soon as we switch to 64-bit and update the IDTR, we’ll be prepared to take an interrupt… we just won’t be able to do anything intelligent for now.

Our Init At This Point

I’ve thrown a lot of code at you for doing all of this initialization, but we haven’t actually invoked it yet. It’s time to look at head.asm, which is the very earliest code in our kernel.

This gets us far enough that we’re hosting our own IDT and GDT, and all of our segment registers, CS->GS and SS, are properly set to offsets within our new GDT. Trickily, CS can only be changed with a jump that specifies the CS, instead of a register move like all of the others.

Note that we define the stack related stuff in early.h, which we’ll get to later. For now, STACK_PAGES_PHYS is 0, and S_PAGES is 2, for an 8k stack. If you recall your basic C courses, the stack grows downward (i.e. the second thing on the stack will have a lower address than the first) so we init our stack registers, ESP and EBP, to point to the highest address.

We’re doing great, but now it’s time to move on to the tough part. Getting our initial page tables setup.

Initial Paging

To translate a single virtual address into its physical counterpart, you have to do four table look ups. Each table is a single, 4k page, containing 1024 (32-bit) or 512 (64-bit) entries. That entry will contain a physical address, as well as some flags, that will either be the next table’s physical address, or on the last table, the physical address that corresponds to the virtual address you started the lookup with.

The indices into this table are built directly into the virtual address. You can see how these break down in figure 5-1 of the Programmer’s Manual. These structures allow us to define some useful macros in include/early.h
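The contents of include/early.h aren’t reproduced here, so the following is only a guess at the shape of those index calculations, written as a C function rather than as preprocessor macros. The `va_indices` and `split_va` names are hypothetical; each index is nine bits wide, starting at bits 39, 30, 21 and 12.

```c
#include <assert.h>
#include <stdint.h>

/* The four table indices encoded in a 48-bit virtual address. */
struct va_indices {
    unsigned pml4e;  /* bits 47:39 */
    unsigned pdpe;   /* bits 38:30 */
    unsigned pde;    /* bits 29:21 */
    unsigned pte;    /* bits 20:12 */
};

static struct va_indices split_va(uint64_t va)
{
    struct va_indices ix;

    /* Bits 63:47 must be all zeros or all ones (sign extension). */
    assert((va >> 47) == 0 || (va >> 47) == 0x1FFFF);
    ix.pml4e = (unsigned)((va >> 39) & 0x1FF);
    ix.pdpe  = (unsigned)((va >> 30) & 0x1FF);
    ix.pde   = (unsigned)((va >> 21) & 0x1FF);
    ix.pte   = (unsigned)((va >> 12) & 0x1FF);
    return ix;
}
```

For 0x100000 this yields indices 0, 0, 0 and 256. For 0xFFFFFFFF80100000 it yields 511, 510, 0 and 256.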

Since we’re currently in assembly, and these are C preprocessor macros, we have to be careful to only insert constants into them that the C preprocessor can resolve immediately. So, even in assembly we can do PTE(CONSTANT_ADDRESS) but PTE(eax) would fail as NASM can’t convert that into the appropriate assembly.

We’re going to assume that GRUB didn’t put us right next to an unusable memory chunk, and that we can use the pages immediately after the kernel for these structures. Because these page structures are so sensitive to data, we zero from the end of the kernel (page aligned) up to EDX pages after it.

We’re mapping both the target virtual address (VIRT_BASE | KERNEL_START => KERNEL_START) as well as the identity map (KERNEL_START => KERNEL_START) so that when we first start paging, we’re still executing in a valid address space.

As I mentioned above, translating from a virtual address to a physical address requires a series of lookups. Once we enable paging, however, we’ll only have direct access to the memory that we have recorded in the page tables; there’s no way to read or write physical addresses anymore. Logically, then, the memory the page tables reside in should be mapped into itself so that you can modify them without disabling paging (which would likely be a disaster anyway).

Fortunately, the design of the x86_64 architecture, like the design of the i386 before it, uses a recursive data structure, meaning that each table’s entries are compatible. So an entry in the first lookup table, the PML4, looks the same as a valid entry in the PDP, PD, and PT data structures as well. This means that if you make an entry in the PML4 that is the PML4’s address itself, then the PML4 is also a valid PDP that has the PML4 mapped into it (because it’s the same physical memory location), which makes it a PD with the PML4 mapped into it, which makes it a PT with the PML4 mapped into it, which means that the PML4 itself is a leaf node, so its physical memory is mapped somewhere in your address space.

I’ll cover the process of finding the virtual addresses of arbitrary page structures in the next article. For now it’s enough to know that because the same recursive effect occurs for each page table structure, mapping the PML4 into itself is the best way to ensure that all of your page tables are accessible at all times.

This is the first thing we tackle while we setup the PML4.

PML4

NOTE: The following would be a lot of basic pointer math in C, and it’s not too bad in assembly either. I’ve grouped the instructions such that EAX is always the address we’re writing to, and EDX is always the value we’re going to write to it.

8 being the size of each PML4 entry, 510 being the index. We chose 510, second to last, because the virtual kernel mapping is going to be in entry 511 so we’re grouping kernel resources at the far end of the address space.

Here we map both the target virtual address (VIRT_BASE | KERNEL_START => KERNEL_START) as well as the identity map (KERNEL_START => KERNEL_START)

/* Now, map two PDPs, one where we want our kernel, and one
* where we'll end up after we start paging but before we jump to our
* kernel address. Here's the break down of our two addresses:
*
* KERNEL_START (0x100000) - where we are running when paging turns on
* PML4E: 0
* PDPE: 0
* PDE: 0
* PTE: 256
*
* VIRT_BASE | KERNEL_START - where we want to run
* PML4E: 511
* PDPE: 510
* PDE: 0
* PTE: 256
*
* We're going to be lazy and merge these together because we're mapping
* the identical content and because we'll clean up immediately after
* paging is enabled. Looking at the above it's clear that we should
* eliminate PML4E[0] and PDPE[0], but leave the identical PDEs and PTEs
* in place.
*
* We can also see that, if we are flexible in the number of PTs, we'd
* have to have 512 of them before we'd have to allocate another PD. Since
* 512 PTs can map 1GB of memory, I don't think that's an issue for our
* kernel, thus we're safe hardcoding 1 PML4/PDP/PD page.
*/
/* First, the scrap mapping */
mov eax, ebx
add eax, 8 * PML4E(KERNEL_START)
mov edx, ebx
add edx, PAGE_SIZE
or edx, (PF_RW | PF_P)
mov dword [eax], edx
/* Now, the real entry */
mov eax, ebx
add eax, 8 * PML4E(VIRT_BASE | KERNEL_START)
mov dword [eax], edx

So now we’ve setup the PML4 @ EBX, that links to itself, as well as the PDPs we’re about to construct in the following two pages.

Note the PAGE_SIZE offsets to avoid the PML4 in the pointer (EAX) as well as the 2*PAGE_SIZE offset in the data we’re writing it. We’re being lazy, as I mentioned in the comment, by having a single PDP. Technically here we’re creating four separate addresses that map the kernel, but we just have to remember to clean it up when we’re done.

Page Directory

Now we come across the first bit of paging initialization where we don’t know exactly how many page table structures we’re setting up at compile time. Fortunately, at run time we already calculated where the last page table ends and kept that value in ECX, so we just map every page aligned address between EBX + 3*PAGESIZE (end of kernel + a page each for PML4/PDP/PD, aka the start of the Page Tables) up to, but not including, ECX into EBX + 2*PAGESIZE (end of kernel + a page apiece for PML4/PDP, aka the start of the Page Directory).

Note that we bring in ESI here, but only so that the comparison between EDX and ECX can be made without having to OR our page flags into ECX and clean it up afterwards. It’s just sitting there anyway.

We loop writing consecutive page directory entries from EDX up to ECX. These addresses are now our page tables.

Page Tables

Similar to the page directory setup, we’re not entirely sure how many page tables or entries we’ll need at compile time, but at runtime we know the details. We’ve set the stack at STACK_PAGES_PHYS, and we know exactly how many pages we’re going to use for that regardless of the size of the kernel, so we macro that. Since we’ve chosen STACK_PAGES_START to be right on top of the kernel (STACK_PAGES_START = KERNEL_START - (S_PAGES * PAGE_SIZE)), we can just keep EAX incrementing to write the PTEs for the kernel itself from KERNEL_START down to EBX.

Not much to add on the comments, but note that once again we have to perform a jump in order to switch code selectors to the new 64-bit selector.

Now that we’re in 64-bit paging mode, we have just a few more housekeeping duties before we call main(). The stack registers are now invalid, so we update them before we do anything else. The fixup functions just write the new virtual addresses to the GDTR and IDTR and reload them.

Finally, remove the kernel identity map from the PML4, and the extra PDP entry, which reduces the number of kernel address ranges back down to one, where it should be, and jump into main.

/* Update our stacks to the new paged mapping, resetting the stack */
mov rax, (VIRT_BASE + (KERNEL_START - 8))
mov rsp, rax
mov rbp, rax
/* Move GDTR / IDTR addresses to new 0xFFFF8....... */
call fixup_gdtr
call fixup_idtr
/* Now that we are executing in 0xF... space, and our
* stacks are there too, we can clean up the 0x0 kernel
* mapping that had to be in place for us to successfully
* return from setting CR0.PG
*/
/* Unfortunately I can't just use the C macros directly */
/* Also note that it's important to do this from the bottom up, on qemu (if
* not hardware), the PDPE mapping disappears with the PML4E mapping, despite
* cr3 not being reset
*/
/* mov rax, PDPE_ADDR(KERNEL_START) */
mov rax, 0xffffff7fbfc00000
mov dword [rax], 0x0
/* mov rax, PML4E_ADDR(KERNEL_START) */
mov rax, 0xffffff7fbfdfe000
mov dword [rax], 0x0
/* Reset CR3 to update. */
mov rax, cr3
mov cr3, rax
/* Move to rax first to avoid linker relocation truncation. */
mov rax, main
call rax
jmp $ // We should never get here.
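The two hardcoded addresses in the cleanup aren’t arbitrary. Assuming the self-mapping lives in PML4 entry 510 as described earlier, they can be reproduced by replacing the table indices in a virtual address with the recursive slot. The sketch below uses hypothetical function names (the real macros are the commented-out PML4E_ADDR and PDPE_ADDR) and is only a reconstruction of the arithmetic.

```c
#include <assert.h>
#include <stdint.h>

#define RECURSIVE_SLOT 510u   /* PML4 entry that points at the PML4 itself */

/* Build a canonical virtual address from four 9-bit table indices. */
static uint64_t table_va(unsigned i3, unsigned i2, unsigned i1, unsigned i0)
{
    assert(i3 < 512 && i2 < 512 && i1 < 512 && i0 < 512);

    uint64_t a = (uint64_t)i3 << 39 | (uint64_t)i2 << 30
               | (uint64_t)i1 << 21 | (uint64_t)i0 << 12;
    if (a & (1ULL << 47))
        a |= 0xFFFF000000000000ULL;   /* sign-extend bit 47 */
    return a;
}

/* Extract the 9-bit index that starts at bit `shift`. */
static unsigned va_index(uint64_t va, unsigned shift)
{
    return (unsigned)((va >> shift) & 0x1FF);
}

/* Virtual address of the PML4 entry that covers `va`. */
static uint64_t pml4e_addr(uint64_t va)
{
    return table_va(RECURSIVE_SLOT, RECURSIVE_SLOT, RECURSIVE_SLOT,
                    RECURSIVE_SLOT) + 8ULL * va_index(va, 39);
}

/* Virtual address of the PDP entry that covers `va`. */
static uint64_t pdpe_addr(uint64_t va)
{
    return table_va(RECURSIVE_SLOT, RECURSIVE_SLOT, RECURSIVE_SLOT,
                    va_index(va, 39)) + 8ULL * va_index(va, 30);
}
```

For KERNEL_START (0x100000), where both the PML4 and PDP indices are 0, these give 0xffffff7fbfdfe000 and 0xffffff7fbfc00000, the same values the assembly moves into RAX.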

The Code

To look at anything not covered explicitly here, like the *DTR fixup functions, where I got those PML4E/PDPE address values in the cleanup, or the build system, you can browse the code. The tag for this article is “long-mode”.

Next Time

The next section will be almost entirely C, thankfully, and will center around writing an industrial grade page allocator similar to Linux’s.

The Code

The kernel we’re going to run is extremely simple and pointless, but it will form the basis for our future experiments and get our codebase kicked off.

void main (void)
{
while(1) {}
}

Nothing to see here, just a loop to keep the CPU in place.

Compiling

Now, how do we get this to work? Well, if you’re interested, you can compile this as a (pointless) Linux program with the standard gcc main.c -o kernel but that will generate a binary with a lot of stuff in it that we don’t want, and can’t have even if we did. Looking at the output of objdump -d kernel (a tool that will come in handy later) you can see a lot of symbols and sections from glibc stuff. From the end, for example:

These ELF sections are from libc which this binary has been implicitly linked against. These sections allow GCC to insert things like constructors and destructors into your code, let it interact with the operating system to do things like argv and other magic that is 100% irrelevant to our kernel.

No, we need to find some flags to GCC that let us ignore everything else and just compile what’s written in our source. No libraries, nothing. A brief look through the GCC manpage leads us to:

-ffreestanding
Assert that compilation takes place in a freestanding environment.
This implies -fno-builtin. A freestanding environment is one in
which the standard library may not exist, and program startup may
not necessarily be at "main". The most obvious example is an OS
kernel. This is equivalent to -fno-hosted.
-nostdlib
Do not use the standard system startup files or libraries when
linking. No startup files and only the libraries you specify will
be passed to the linker, options specifying linkage of the system
libraries, such as "-static-libgcc" or "-shared-libgcc", will be
ignored. The compiler may generate calls to "memcmp", "memset",
"memcpy" and "memmove". These entries are usually resolved by
entries in libc. These entry points should be supplied through
some other mechanism when this option is specified.
One of the standard libraries bypassed by -nostdlib and
-nodefaultlibs is libgcc.a, a library of internal subroutines
which GCC uses to overcome shortcomings of particular machines, or
special needs for some languages.
In most cases, you need libgcc.a even when you want to avoid other
standard libraries. In other words, when you specify -nostdlib or
-nodefaultlibs you should usually specify -lgcc as well. This
ensures that you have no unresolved references to internal GCC
library subroutines. (For example, __main, used to ensure C++
constructors will be called.)

These look like good suspects. -nostdlib is the real workhorse option, stripping the glibc cruft from our binary. -ffreestanding is less important, but it will suppress GCC complaining about our main function being non-standard at least.

Excellent. Much more concise and understandable. main() is just setting up an empty stack frame and looping in place infinitely.

Linking

We have a number of problems with our current ELF output. The first of which, as ld told us above, is that it doesn’t know what the starting address is, so it guessed. The second is that the link address GCC chose is completely arbitrary and isn’t a good default. And the third is that, if you use objdump -D (capital D) to dump all of the sections of the file, we still have two extraneous sections, .eh_frame and .comment that are wasting space.

All three of these problems can be solved with a linker script, which tells the linker, ld, how to lay out the output file.

This script keeps the relevant sections (.text, which is code, .data which is inited data, and .bss which is basically un-inited data) by grouping them together. It discards the extra GCC sections (.comment, and .eh_frame) by placing them in the ld special “/DISCARD/” section. It also sets the output format as 64-bit x86 ELF, which is correct for our kernel to be loaded by GRUB, and sets the entry point to main().

Most importantly, it sets the link address for code to 0xFFFFFFFF80100000, but loads the code to physical memory at 0x100000 with the AT directive. If we omit this AT directive, GRUB will attempt to load to 0xFFFFFFFF80100000 physical, and unless you’ve got 16 million terabytes of memory in your VM it will complain about being out of memory and subsequently fail.
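To make the link-versus-load distinction concrete, here is a minimal C sketch of the arithmetic. The fixed offset KERNEL_VIRT_BASE and both helper names are my own assumptions for illustration, not something the linker script defines.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: the kernel's virtual window starts at the base of the top 2G. */
#define KERNEL_VIRT_BASE 0xFFFFFFFF80000000ULL

#define LINK_ADDR 0xFFFFFFFF80100000ULL /* where the code believes it runs */
#define LOAD_ADDR 0x100000ULL           /* where GRUB copies it (the AT address) */

/* Hypothetical helper: translate a link-time address to its load address. */
uint64_t link_to_load(uint64_t vaddr)
{
    assert(vaddr >= KERNEL_VIRT_BASE);
    return vaddr - KERNEL_VIRT_BASE;
}

/* Whole terabytes of RAM a machine would need before vaddr could be a
 * valid physical address. */
uint64_t terabytes_needed(uint64_t vaddr)
{
    return vaddr >> 40;
}
```

Feeding LINK_ADDR to terabytes_needed() gives a little under 16.8 million, which is where the “16 million terabytes” figure comes from.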

Why are we linking at 0xFFFFFFFF80100000?

First let’s just note that x86-64 currently implements only 48-bit virtual addresses, and the top 16 bits must be sign-extensions of the 48th bit (bit 47). There’s a massive hole of unaddressable memory between 0x7FFFFFFFFFFF and 0xFFFF800000000000 because of this sign extension. We take advantage of this hole by using it to separate user (0 – 128 TB) and kernel (the top 128 TB, starting just shy of 16 exabytes) addresses. This gives both halves (user and kernel) plenty of space.
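The sign-extension rule is easy to express in C. This is only a sketch for poking at addresses on the host, assuming 48 implemented address bits; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: 48 implemented virtual address bits, as described above. */
#define VA_BITS 48

/* An address is canonical when bits 63..47 are all copies of bit 47. */
bool is_canonical(uint64_t addr)
{
    uint64_t top = addr >> (VA_BITS - 1); /* the 17 bits from 63 down to 47 */
    return top == 0 || top == 0x1FFFF;
}

/* Kernel half: canonical with the sign bit set. */
bool is_kernel_half(uint64_t addr)
{
    return is_canonical(addr) && (addr >> 63) != 0;
}

/* Sign-extend the low 48 bits into a full canonical 64-bit address. */
uint64_t canonicalize(uint64_t addr)
{
    uint64_t low = addr & 0x0000FFFFFFFFFFFFULL;
    if (low & (1ULL << (VA_BITS - 1)))
        low |= 0xFFFF000000000000ULL;
    assert(is_canonical(low));
    return low;
}
```

Anything strictly between 0x7FFFFFFFFFFF and 0xFFFF800000000000 fails is_canonical(), which is exactly the hole separating the two halves.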

However, there is one more wrinkle. When object files are linked together, ld has to patch in ‘relocations’, the final addresses behind pointer math. Consider loading a pointer like int *bar = &foo. Syntactically and logically that is sound. However, as part of optimizing the 99% (non-kernel) use case, GCC assumes that your code is going to be linked at addresses between 0 and 2G. The result is that GCC assumes &foo fits in four bytes, and when ld discovers at link time that it’s actually eight bytes (a 64-bit address), it throws an error complaining that this relocation has been truncated (i.e. the top four bytes would be discarded if this program was run).

GCC’s 0 to 2G assumption can be controlled with the -mcmodel flag. By default, it’s set to “small” (code in 0-2G), but there are also “large” (makes no assumption about addresses but generates less efficient assembly by assuming all pointers and jumps could be anywhere in the 64-bit range), “medium” (a compromise between small and large) and, most importantly, “kernel”, which was added so that the Linux kernel could have the assembly efficiency of “small” with the desired virtual address separation. The downside is that “kernel” assumes the code is in the -2G to MAX range of addresses, or 0xFFFFFFFF80000000+. So, to take advantage of this compromise between address restrictions and assembly efficiency, we link at 0xFFFFFFFF80100000 and specify -mcmodel=kernel on the GCC command line.
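Here is a rough C sketch of why the top 2G works with four-byte addresses: a 32-bit value that gets sign-extended lands right back in the 0xFFFFFFFF80000000+ range. The helper names are hypothetical, and this models the idea rather than what GCC or ld do internally.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* "small" code model: everything lives in the low 2G. */
bool fits_small_model(uint64_t addr)
{
    return addr < 0x80000000ULL;
}

/* "kernel" code model: everything lives in the top 2G. */
bool fits_kernel_model(uint64_t addr)
{
    return addr >= 0xFFFFFFFF80000000ULL;
}

/* What you get back from a four-byte value that is sign-extended to 64 bits. */
uint64_t sign_extend_32(uint32_t imm)
{
    return (imm & 0x80000000u) ? (0xFFFFFFFF00000000ULL | imm) : (uint64_t)imm;
}

/* Keep only the low four bytes of a kernel-model address. */
uint32_t encode_kernel_addr(uint64_t addr)
{
    assert(fits_kernel_model(addr));
    return (uint32_t)addr;
}
```

The round trip sign_extend_32(encode_kernel_addr(addr)) returns the original address for anything in the top 2G, so no bits are lost. An address like 0xFFFF800000000000 fits neither model, which is the “truncated relocation” case.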

To use this linker script, we split the compilation process into two parts. First, the compilation of the C into object (.o) files. Then the linking of object files into an ELF binary, with the linker script.

This now yields kernel, a 64-bit ELF file linked at 0xFFFFFFFF80100000 and ready to be loaded at 0x100000.

Unfortunately, on x86-64 hosts, this also generates an executable that’s positively massive (1 or 2M) compared to the amount of code we have. This is no good because it’s a waste of space and, worse, it pushes the actual sections of our code out of the 8k that GRUB is going to search for a magic header.

On x86-64 we can solve this by giving the -n flag to ld which tells it to not align the program sections at a huge offset.

The following produces a kernel under 1k on x86-64.

jack@sagan:$ ld -T linker.ld -n -o kernel main.o

GRUB Magic

If you tried to load the kernel at this point, GRUB would complain that the binary is missing a signature and you wouldn’t get any farther.

GRUB expects to find a known “Multiboot Header”. Section 3.1 of the Multiboot Specification describes what must be embedded into the binary for GRUB to recognize the ELF file as bootable.

We’ll be using more of the GRUB features when we want to take advantage of some of the values that it can give us (denoted in flags) but for right now we just want to make GRUB happy so we can load our kernel.

And, just to make it easy on GRUB, the signature has to show up in the first 8192 bytes (8k) of the binary. Considering that ours is only 12 bytes long (three 32-bit fields, not counting the ELF header) we could place it anywhere, but let’s take advantage of our linker script to place the GRUB magic immediately after the ELF header.

Specifying the GRUB Signature

Using the above information and some basic information about default types on 64-bit (i.e. that unsigned int is 32-bit) we can easily create a struct to contain the information.
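A minimal sketch of such a struct, going by the Multiboot header layout: a magic value, a flags word, and a checksum chosen so that the three sum to zero as a 32-bit unsigned value. The field and macro names, and the choice of zero flags, are my assumptions here.

```c
#include <assert.h>

#define GRUB_MAGIC    0x1BADB002u /* Multiboot header magic */
#define GRUB_FLAGS    0x0u        /* assumption: no optional features requested */
#define GRUB_CHECKSUM (0u - (GRUB_MAGIC + GRUB_FLAGS))

/* Three 32-bit fields, 12 bytes total. */
struct grub_signature {
    unsigned int magic;
    unsigned int flags;
    unsigned int checksum;
};

struct grub_signature gs = { GRUB_MAGIC, GRUB_FLAGS, GRUB_CHECKSUM };

/* Hypothetical host-side check: the three fields must sum to zero mod 2^32. */
void check_signature(const struct grub_signature *s)
{
    assert(s->magic == GRUB_MAGIC);
    assert(s->magic + s->flags + s->checksum == 0u);
}
```

The checksum relies on unsigned wraparound, so it stays correct if the flags change later.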

But now we have to ensure that the signature shows up in the first 8k of the file so GRUB can find it.

Considering the kernel is less than 1k, that’s already done and this kernel will boot. But eventually the kernel will be far larger than 8k, so we can’t rely on it.

The easy way to accomplish this is to split the GRUB signature into a separate file (grub.c) and make sure that that file’s object code (grub.o) is the first file linked into the kernel by making sure it’s the first object argument to ld. However, that seems too fragile since it’s based on the build system that we haven’t even touched yet.

In my opinion, we need to enforce that the GRUB signature is the first thing. To that end, let’s add a new code section to the linker script and tell GCC to put our grub_signature struct gs into it.
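On the GCC side, placing a variable in a named section is done with the section attribute. A sketch of what that might look like follows. The section name .grub_sig is a placeholder of mine, and the linker script would still need a matching output section placed ahead of .text.

```c
#include <assert.h>
#include <stdint.h>

struct grub_signature {
    unsigned int magic;
    unsigned int flags;
    unsigned int checksum;
};

/* ".grub_sig" is a hypothetical section name; "used" keeps GCC from dropping
 * the variable even though no code references it; "aligned(4)" matches the
 * 32-bit alignment the Multiboot spec asks for. */
struct grub_signature gs
    __attribute__((section(".grub_sig"), aligned(4), used)) = {
    0x1BADB002u,      /* magic */
    0x0u,             /* flags (assumed zero) */
    0u - 0x1BADB002u  /* checksum: magic + flags + checksum == 0 */
};

/* Host-side sanity check that the object is where and what we expect. */
const struct grub_signature *signature(void)
{
    assert(((uintptr_t)&gs & 3u) == 0);
    return &gs;
}
```

This way the ordering is enforced by the linker script itself rather than by the order of object files on the ld command line.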

After the sync completes (should be momentarily unless you’ve got a bunch of other IO going), you can then fire up QEMU.

jack@sagan:$ qemu -hda disk.img -m 1024

Which will quickly drop you at the GRUB prompt.

grub> multiboot (hd0,msdos1)/kernel
grub> boot

And if no errors are printed, the kernel is running.

Double Checking

I wouldn’t be much of a hacker if I thought that no output and no confirmation means everything is okay. Let’s check and make sure that everything looks good.

If this was a real machine, we’d be in a hurry to get output to the screen, or flashing LEDs, or we’d be breaking out hardware debuggers to analyze the chip state in the worst case. Fortunately, using QEMU, you can use GDB on your kernel like any other piece of software. We’ll get into more detail later, but for now let’s just see if the machine is looping.

First, make sure you have GDB installed. QEMU won’t complain if you don’t.

Second, (re)start QEMU with the -s option that tells QEMU to start a gdbserver for your system on TCP port 1234. If you wanted to use breakpoints or walk through GRUB you could also pass it -S which will keep the CPU from starting until you’ve engaged GDB and issued a ‘continue’.

jack@sagan:$ qemu -hda disk.img -m 1024 -s

Now simply fire up GDB from another terminal and give it a remote target:

Assumptions

This is not a hand-holdy kind of write-up because this isn’t a hand-holdy subject. Before you start you should have a Linux machine running natively (unless you want to nest VMs, which is a terrible idea for debugging). You should have qemu installed, preferably the kvm variant, but that will be an unimportant distinction for quite a while.

Reasoning

Why QEMU?

QEMU provides seamless KVM integration, supports a myriad of devices, and has GDB support. In my prior efforts I used Bochs, which is a fine x86 emulator with a decent debugger, but its device selection is limited and I’ve become more familiar with QEMU in the intervening years.

Why GRUB2?

GRUB is a solid, flexible bootloader. Traditionally, toy OSes target floppy-like devices and part of the learning experience is writing that initial 512 bytes of assembly and the boot signature. I’ve written too many of those already, but the honest truth is that bootloaders aren’t part of the kernel and so they are uninteresting.

The second reason is that using GRUB allows us to be hosted on a real filesystem (we use an ext2 boot partition in this article) and with a real file-format (ELF) which comes in handy when manipulating the FS from Linux or using debug tools on the binary.

The last reason is that GRUB provides an abstraction from BIOS interrupts like e820 to count memory.

And, as for why version 2 specifically, for no other reason than it’s the latest.

Create a QEMU disk

QEMU comes with a nice tool to create disks. We’re going to use the raw format because it will allow us to easily mount portions of the disk later with losetup.

First, create a decent size raw disk. I made mine 10G which is overkill for our purposes, but I’ve got plenty of room to spare.

jack@sagan:$ qemu-img create -f raw disk.img 10g

Grab a Linux ISO

I tried and tried to convince my local copy of the grub utilities to install to a partition I created in the disk.img but when booting in qemu it failed to find the secondary boot files, tossed a message and dumped to a useless rescue prompt. In the end I decided it would be 10x easier to install from within the QEMU environment so that all of the device maps and IDs would naturally sort themselves out.

I chose to install from the Arch Linux install CD. In particular the August 2012 version, although future installation media will probably be okay.

Boot the ISO in QEMU

Use the following command to boot QEMU with the Linux ISO, and the disk image.

jack@sagan:$ qemu -hda disk.img -cdrom [path to Linux ISO] -m 1024

This should get you to a prompt. At this point, the architecture (x86/x86_64) doesn’t matter as the GRUB media is identical.

Setup the Disk

I used cfdisk to create two new primary partitions: one 100MB partition for boot and one with the rest.

livecd:# cfdisk /dev/sda

Then create your filesystems. I want to use ext2 for the /boot and ext4 for the other partition because later I plan on having some fun with ext4.

livecd:# mkfs.ext2 /dev/sda1
...
livecd:# mkfs.ext4 /dev/sda2

Install GRUB2 to the Disk

Mount the boot FS to /mnt

livecd:# mount /dev/sda1 /mnt

And then use grub-install to copy the relevant files and set up the MBR. We specify the boot directory because otherwise grub-install will try to use /boot on the live CD and will fail to map that to a usable device.

livecd:# grub-install --boot-directory=/mnt /dev/sda

It should report no errors.

Finally, unmount the boot FS

livecd:# umount /mnt

And you can either shut down like it’s a real machine or just kill QEMU afterwards.

Testing the Boot

Now, restart QEMU without the ISO argument to see if you can get to a GRUB prompt.

jack@sagan:$ qemu -hda disk.img -m 1024

Almost instantaneously you should get a window that looks like:

Mounting Partitions from Linux

Now that we have a bootloader in place, hopefully we won’t have to use any other outside help in our QEMU environment. However, obviously the next step requires we have a file to boot. That’s for the next post, but to get to that point we need to be able to copy the file into our disk, which is complicated by the fact that it’s not just a stupid filesystem image, but a full disk image with a partition table and everything.

To mount the partitions we need to use losetup to create loopback devices for just the relevant parts of the image file. Conveniently, losetup takes the -o (offset) and --sizelimit arguments, which allow you to loopback map a portion of a file.
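Both of those arguments are plain byte counts, so the only work is converting from sectors. Here is a small sketch of that arithmetic, assuming the partition table reports inclusive start and end sectors of 512 bytes each. The struct and function names are made up, and the sector numbers in the note below are placeholders, not values from my disk.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 512-byte sectors. */
#define SECTOR_SIZE 512ULL

/* The two byte values losetup wants for one partition. */
struct loop_range {
    uint64_t offset;    /* for -o */
    uint64_t sizelimit; /* for --sizelimit */
};

/* start_sector and end_sector are inclusive. */
struct loop_range partition_range(uint64_t start_sector, uint64_t end_sector)
{
    struct loop_range r;

    assert(end_sector >= start_sector);
    r.offset = start_sector * SECTOR_SIZE;
    r.sizelimit = (end_sector - start_sector + 1) * SECTOR_SIZE;
    return r;
}
```

For example, a partition spanning sectors 2048 through 206847 would map to an offset of 1048576 bytes and a size limit of 104857600 bytes (100MB).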

First, let’s take a look at the output of fdisk’s p (print) command to get the byte offsets and sizes of our partitions.