Template Assembler

The Graybeards tell us that all of C++ rests on the back of templating, to which the skeptic responds, “Then what supports the templating?” For a true metaprogrammer like myself, the answer is obvious: “It’s templates all the way down.” Templates of templates compiling templates and begetting more templates, or at least that’s what I used to believe…

In this post, we’ll build a simple x86 assembler as a C++ template metaprogram. The assembler will run at compiletime and generate machine code that can be run as part of the compiled program, similar to inline assembly. We’ll define a simple embedded domain specific language directly in C++ to express our assembly programs. By building the language directly in C++, we get a lot for free, such as powerful type checking and macro-like operations, and the syntax is not as horrific as you may imagine.

Feel free to check out the complete source on Github. It’s far from complete, but it supports many basic operations.

I hear it’s gonna be a good time yo, and you’re gonna like it.

Overview

Assemblers are conceptually pretty simple: they map assembly mnemonics to machine code. More advanced assemblers may support symbols and macros, but, if anything, implementing a basic assembler is more tedious than challenging. At least, that is the case when targeting a RISC assembly language, the kind covered in your typical CS class and most books on the subject. When it comes to x86 assembly, that most CISCy of CISCers, all bets are off.

x86

The x86 instruction set architecture is expansive: around one thousand top level instructions by my rough estimate, with about 3500 instruction forms (overloads) among the lot. That breadth alone is a challenge, but a manageable one. While it would be impractical to write the encoding of each instruction by hand, we can write scripts to generate code for each instruction from a specification. Instructions follow a common format, so, if we do things correctly, we’ll only have to implement something like memory address encoding once.

But there’s more to x86 assembly’s complexity than operation breadth. Take addressing modes. There are so many goddamn addressing modes that the mov operator itself is Turing-complete. And a whole lot of complexity comes from x86’s legendary backwards compatibility (fossil records indicate that T.Rex cut his teeth, so to speak, programming x86 assembly in the Cretaceous Period, and that was 70 million years ago).

Crushing that code - Charles R. Knight, 1897

It’s impressive just how much engineers have been able to cram into the instruction set over the years, including the transitions from 16 to 32bits and from 32 to 64bits, but, at the machine code level at least, the consequence of such augmentation is clear.

All this to say, writing an x86 assembler is considerably more complicated than writing a MIPS assembler. That does not mean the task is unmanageable though; the assembler is still basically just a big mapping function after all, but it’s gonna be one hell of a function.

Pseudo Inline Assembly

Our compiletime assembler will take an assembly program, written in C++ as a small embedded domain specific language, and output machine code. And because the compiled program itself will be run on an x86 machine, we can evaluate the generated machine code directly within the compiled program.

For this year’s underhanded C contest, I also looked at embedding machine code in a C program, albeit in a much simpler manner. To summarize the approach: a C style function pointer is just a pointer to code and, through the magic of casting, there is nothing to stop us from converting any byte array into a function pointer. Call the resulting function and the byte array is evaluated as raw machine code.

We’ll use the same basic approach to write the machine code our assembler generates back into the program, and to invoke this machine code at runtime. This does limit the assembly program to being a function, but that’s not so bad. Functions even let us pass arbitrary data in as arguments.

Embedded Language

Most previous Stupid Template Tricks have been run of the mill template metaprogramming, so what do we gain by writing an embedded domain specific language to express the assembly? Well consider the following:

Both approaches are forms of a domain specific language and produce the exact same compiletime result. But whereas the pure template approach uses templates exclusively to build up a computation, the metaprogram built with the embedded domain specific language can use C++ language syntax, such as operators and operator overloading. Values are only used to shuttle around the types. Using C++ language features makes the language more familiar and more expressive.

As the syntax of the targeted language becomes more complicated, the benefits of the embedded language become much more clear. Trying to write good looking memory addressing in pure template code for example is a bit of a nightmare, but it’s easy when we can overload the subscript operator and the + and * operators.

Byte String

Our template assembler will take assembly code and output a byte array of machine code. We’ve worked with compiletime byte arrays before, using them as strings for parsing and playing Tetris. The only new requirement this time around is that our compiletime string must have a static data address in memory that we can treat as a function pointer later.

ByteString encodes an array of bytes, exposing the complete byte array as the static data member.

Combining Byte Strings

Machine code is made up of individual instructions, and each instruction is made up of a number of components: prefixes, the operator, and operand data. Each of these components may be further broken into bit level meanings, but all of our encoding logic will operate on bytes.

Operands

Each x86 operator takes between zero and three operands. At the machine code level, an operand can be a register, immediate, or memory address, the assembler having replaced symbols within the program with their values. Our assembler will target NASM syntax, which orders operands mov DEST, SRC.

General Purpose Registers

General purpose registers are fixed size registers. For each size, an index uniquely identifies the register when encoding instructions.

The complete source also defines 16bit, 8bit, and 64bit registers. Segment registers and SIMD registers are currently not supported.

Immediates

Immediates are constant values, such as 0 or -42. We’ll only worry about integer immediate values for now. All the actual instruction encoding will be implemented using templates, so we have to find a way to encode a value like -42 as a type.

Immediate handles this encoding for us. It’s similar to std::integral_constant, but stores additional metadata about the value (size) and provides some common operator overloads.

Note that while all of these operator overloads take values, we only care about the types of the values and the type of the result. This allows us to use normal C++ syntax, such as operators in our embedded language, instead of writing everything with lispy looking pure template code.

Bytes and Words and DWords! Oh My!

Our assembler will support four sizes of immediates: 8bit (byte), 16bit (word), 32bit (dword), and 64bit (qword). Instead of writing byte<4>, we can abuse C++ user defined literals for our mini assembly language. The full implementation for constructing value types from C++ literals was previously covered and is simply referenced here as ImmediateFromString. 4_b creates a byte, 4_w creates a word, 4_d creates a dword, and 4_q creates a qword.

Memory

x86 memory addressing is complex, both in the range of addressing modes the instruction set supports and in how these addressing modes are encoded. Let’s start by considering a few valid forms of memory addressing:

[0x1234] - Direct

[esi] - Base

[ebx + 8] - Base + displacement

[esi + ebx] - Base + index

[esi + ebx + 8] - Base + index plus displacement

[esi + ebx * 2] - Base + scaled index

[esi + ebx * 2 + 8] - Base + scaled index + displacement

[ebx * 2] - Scaled only

[ebx * 2 + 8] - Scaled only + displacement

Base Only

Heap corruption, Oh yeah!

The form [esi + ebx * 2 + 8] is the most complex of the lot, with all other modes being a subset of it. Breaking it down, its components are:

Base register (esi).

Index register (ebx).

Scaling factor literal (2).

Displacement literal (8).

All of these components are optional and may appear in any combination. The scale is limited to either 1, 2, 4, or 8 (the default is 1 if not specified), while the displacement may be a signed 8, 32, or 64 bit number.

Since memory addresses are all subsets of a single form, we’ll use a single type, Memory, to encode all memory addresses. Differences in address type will be handled during encoding.

The additional size parameter of Memory is the size in bytes of the memory targeted (typically 1, 2, 4, or 8) and is used to select the correct overload for certain instructions.

_ allows writing _[eax] to create a memory address from a register, the special None type being used to indicate the lack of a register value for the index register. The second subscript overload of _ will allow us to more easily support the other memory addressing forms with a consistent syntax.

Displacement

Displacement is a signed number added to the base memory address. The binary + and - operators are overloaded for a memory address and an Immediate displacement value.

The above enables forms such as: _[ebx] + 8_b. Any number of displacements can be added to a memory address, with the compiler automatically combining all the displacements before the code is assembled.

To support the more NASM-like syntax _[ebx + 8_b], we’ll also overload the binary + and - operators for a register plus an Immediate.

The source also includes flipped versions of all these overloads, so you can write addresses like: _[2_b + ebx - 8_b].

Scaling

The scaled index part of a memory address consists of an index register and a scaling factor. The scaling factor is just a constant value, either 1, 2, 4, or 8. Scaling factors are created with * in NASM assembly syntax, so we’ll overload the * operator in C++ on a register and an Immediate scaling factor.

Further overloads of the + operator add a base register to a scaled index memory address. There are a few cases to handle, depending on whether the base and index were previously converted to Memory types or not.

Double Assembled for Twice the Assembly

Consider this simple assembly fragment:

jmp a
mov eax, 4
.a
mov eax, 3

Clearly the a in jmp a should refer to the label below, but how can our poor computer reading the assembly sequentially know this? Perhaps a is never defined or perhaps it is defined so far away that a different jmp instruction overload has to be used. Checking all this in a single pass is not practical, so we’ll instead use a two pass assembler: pass one to generate the symbol table and pass two to generate the machine code.

Symbol Table

The symbol table maps symbols to program values. We’ll only use our symbol table to map labels to code indices at the moment, but many assemblers support more advanced uses of symbols (being embedded in C++ gets us a lot of this for free).

We don’t need anything fancy; in fact, pretty much the simplest symbol table possible will work for our assembler: the symbol table as a list of key value pairs.

A more comprehensive symbol table implementation might support forward and backward symbol lookup, but let’s just keep things super simple and disallow redefining symbols altogether. symbol_table_add inserts a new entry into the symbol table, explicitly checking that no entry currently exists.

State

The symbol table is one part of the assembler’s state, the other component being the address of the current instruction (relative to the first instruction in our assembler).

To keep things as simple as possible, both pass one and pass two of our assembler will operate on essentially the same assembly program data. The differences in behavior between the two passes will come from different implementations of the state object each pass takes.

Both pass one and pass two specialize the BaseState for common functionality.

template <template <size_t, typename> class self, size_t ic, typename _labels>
struct BaseState {
    /// Location of current instruction.
    static constexpr size_t index = ic;

    /// List of valid labels in the program.
    using labels = _labels;

    /// Increment the instruction counter by `x` bytes.
    template <size_t x>
    using inc = self<index + x, labels>;
};

In pass one, we expect forward reference symbol lookups to fail. These failures are perfectly acceptable; pass one only generates the symbol table, and the symbol values themselves are not needed until pass two. Any undefined lookups return None, which will not be treated as an error during pass one.

The symbol table is complete by the time we run pass two. At this stage, it is still possible that a symbol may be undefined, but, this time around, instead of returning None we’ll generate a compiletime error.

lookup_label in pass two passes the undefined no_such_label type to symbol_table_lookup as the default value. If an undefined symbol is referenced, the compiler will generate an error stating that no_such_label<...> is undefined.

You’re Nobody ‘Til Somebody Assembles You

An assembly program is just a list of instructions and assembly directives. No nesting, syntax trees, or anything like that. And the only (sort-of) directives we care about for our simple assembler are program labels. Labels and instructions go in, machine code comes out.

As we’ve seen with the symbol table, the assembler must be able to thread state through the top level units during assembly, very much like the state monad in Haskell. Think of each unit of the assembly as a function, a function that takes an input state and returns an output state along with some generated machine code. Simple instructions may return constant machine code and only increment the instruction counter of the state, whereas a label may update the state but not generate any machine code.

Instruction

Instruction is the base unit that we’ll use for all x86 instructions. It encodes the 1 to 15 byte machine code for a single x86 instruction, such as MOV or JMP. components is a list of ByteStrings or objects that can be converted to ByteStrings. During assembly, after a simple rewriting pass, Instruction joins these components together into the final machine code for the entire instruction.

The list of components of an instruction may contain symbols that must be resolved before the machine code is generated. For our simple assembler, we only have to worry about replacing label references with the correct relative jump offsets.

The fmap in Instruction uses the Functor interface for lists that we’ve previously covered, applying Rewrite to each element of components. fold uses the Foldable interface to combine the resulting byte strings.

Rewrite looks for sized labels, and replaces them with jumps computed relative to the current instruction counter.

The programmer must still explicitly specify the size of the expected jump on the instruction though, either 8 or 32 bits:

Asm<int>(MOV(eax,3_d),JMP("a"_rel8),ADD(eax,2_d),"a"_label,RET())

In this example, if a turns out to be more than 128 bytes away, our assembler will produce undefined machine code. A more complete assembler would select the correct jmp overload during pass two, depending on the actual size of the jump needed.

Encoding and Generating Instructions

I’ll just provide a quick overview of how all the instructions are generated here. The actual code is not all that interesting.

This supports instruction overloads easily, and also makes the language somewhat more palatable: MOV(eax, 1_d). As for actually generating all those instructions, I used a simple Javascript program to transform an x86 specification XML file into C++ source code. The code generator is pretty horrific code, even by Javascript standards:

But it gets the job done. The XML file specifies the encoding of each instruction as well. Instruction encodings share many components, such as REX prefixes and ModR/M and SIB bytes. The actual logic for generating the encoding is split between C++ and the Javascript. The Javascript may know that a REX or ModR/M byte is required, for example:

But logic like make_rex and modrm are implemented in normal C++. Again, the logic for these is not too interesting. Take a look at the source for the details.

Labels

Labels are the other top level units of assembly code. A label attaches a symbol to an address in the assembly code itself. Labels are purely an assembly language construct: they generate no code, and indeed machine code has no real concept of labels at all.

During assembly, labels update the state to map the label symbol to the current instruction counter with add_label on the state object.

When dealing with macro-like functions later, we’ll use block to generate a sequence of assembly instructions from function parameters. block adds no actual logic and is only used to pass around types.

I am given birth to nothing but machine code

Bringing everything together, assemble converts an assembly program into machine code at compile time. Pass one is run first on the program to generate the symbol table, then pass two is run with the resulting symbol table. The result of assemble is a ByteString of machine code.

Macro-ish

One quick final thought on the power of embedding the assembly in C++. Because the assembly source is just C++ values and types, we can use any template metaprogramming techniques to manipulate the assembly code. One simple example is writing basic macros within the language:

Conclusion

C++ template metaprogramming enables the development of fairly powerful embedded domain specific languages and can make template metaprograms much more expressive. x86 assembly may not be the most practical application of this, but I feel that this is an interesting little project and, with a little work, we could probably make our simple assembler more portable and support a pretty good subset of x86 assembly language.

Check out the complete source and send a pull request or open a bug if you would like a new instruction supported or find a bug (there are plenty of them).

PS. While this post started off well enough, I can’t help but feel that relying on runtime evaluation of the assembly program is somewhat of a debasement. Why must we rely on runtime at all? Why not evaluate the machine code with templates? Replace the rotten old runtime with a clean new compiletime! The time is coming! More Template! More Template!