If you read my article The One Instruction Wonder you know that I like to build my own CPUs. While that's great (if geeky) fun, it also means you have to build your own tools. And while building CPUs gives you some geek street cred, writing another assembler doesn't really do it.

For a while I'd hack together simple assemblers using a crude C program or maybe awk. But eventually I got tired of that and decided to write one last cross assembler. (The complete source code and related files are available here.) I made a few observations:

My PC has way more memory than any of my target computers

My PC is ridiculously fast so optimizing the assembler too much is pointless

CPUs change (at least they do when you build them yourself) so the assembler has to change easily

I'm basically lazy

That last point is key. Being lazy means you borrow the best of what you can find to help you avoid doing real work. My previous experience told me that I liked using C or C++ for its output options, but I really liked using awk for its easy string matching. Sure, I can add string matching libraries to C, but then if I were that motivated, I wouldn't be lazy. Or, if you prefer, then I'd have to spend less time working on my CPU and more time working on my assembler.

After thinking about the problem for a few days, I realized one more thing. Pretty much every assembler I've ever seen accepts input like this:

somelabel: opcode arg1,arg2 ; comment

Sure, there are some that use slightly different syntax, but that line covers about 99% of a typical assembler's basic syntax. Forget labels for a minute and dump the comment. What if I could get my assembler to look like a C macro? Like this:

opcode(arg1,arg2);

I could easily write some C macros that would fill in an array with the right bit values for the instruction. Remember, I said my PC has way too much memory, so the assembler will just assemble to an array image of the target's memory. Then after it's all done, I just have to dump the array out in my format of choice. Simple.
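To make the idea concrete, here is a minimal sketch of the macro-into-array approach. It is not the code from soloasm.c: the 16-bit instruction format, the ADD and MOV opcodes, the memory size, and the helper names (emit, ENCODE, program, dump) are all made up for illustration.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

#define MEM_WORDS 256            /* hypothetical target memory size */

static uint16_t mem[MEM_WORDS];  /* image of the target's memory */
static unsigned pc;              /* next address to fill */

/* Store one word in the image and advance the address. */
static void emit(uint16_t word)
{
    assert(pc < MEM_WORDS);
    mem[pc++] = word;
}

/* Hypothetical encoding: 4-bit opcode, then two 6-bit operands. */
#define ENCODE(op, a, b) \
    ((uint16_t)((((op) & 0xFu) << 12) | (((a) & 0x3Fu) << 6) | ((b) & 0x3Fu)))

/* Each "instruction" is a macro that drops its bits into the array. */
#define ADD(a, b) emit(ENCODE(0x1, a, b))
#define MOV(a, b) emit(ENCODE(0x2, a, b))

/* The "assembly source" is now just a sequence of macro calls. */
static void program(void)
{
    MOV(1, 2);
    ADD(3, 4);
}

/* Dump the filled part of the image, one hex word per line. */
static void dump(FILE *out)
{
    for (unsigned i = 0; i < pc; i++)
        fprintf(out, "%04X: %04X\n", i, (unsigned)mem[i]);
}
```

Calling program() and then dump(stdout) would list the two encoded words. Supporting a different CPU means changing only the ENCODE-style macros, which is the point of the design.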

What about the labels? Well that's a little more complicated. Labels are a special case, as you'll see shortly.

The resulting assembler uses four files:

soloinc.awk processes "include" files for the assembler

solopre.awk converts assembler lines into C macros

soloasm.c, the core routines that all assemblers use

axasm, a shell script driver that ties it all together

The preprocessor (soloinc.awk; see Listing 1) is simple enough. It just copies files from its input to its output unless it sees a ##include token at the start of a line. When that happens, the program pushes the current file onto a stack and starts printing out the included file (expanding any ##include tokens it finds in the included file, of course). The preprocessor also emits #line directives to the C compiler to help the C compiler identify any errors at the correct location.
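The real preprocessor is an awk script, but the logic is easy to sketch in C. The version below is an illustration only: it assumes the directive looks like "##include filename" with a bare, unquoted file name, it uses recursion in place of the explicit file stack, and the function name expand and the MAX_DEPTH limit are my own inventions.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define MAX_DEPTH 16   /* arbitrary guard against runaway nesting */

/* Copy `path` to `out`, expanding "##include name" lines recursively.
   Returns 0 on success, -1 if a file cannot be opened or nesting is
   too deep. Lines longer than the buffer would throw off the count. */
static int expand(const char *path, FILE *out, int depth)
{
    char line[1024];
    char name[512];
    int lineno = 0;
    FILE *in;

    assert(path != NULL && out != NULL);
    if (depth > MAX_DEPTH || (in = fopen(path, "r")) == NULL)
        return -1;

    /* Tell the C compiler which file the following lines came from. */
    fprintf(out, "#line 1 \"%s\"\n", path);
    while (fgets(line, sizeof line, in) != NULL) {
        lineno++;
        if (strncmp(line, "##include", 9) == 0 &&
            sscanf(line + 9, " %511s", name) == 1) {
            if (expand(name, out, depth + 1) != 0) {
                fclose(in);
                return -1;
            }
            /* Resume the parent file's numbering after the include. */
            fprintf(out, "#line %d \"%s\"\n", lineno + 1, path);
        } else {
            fputs(line, out);
        }
    }
    fclose(in);
    return 0;
}
```

The recursion does the same job as the stack: each call remembers where it was in its own file while the included file is printed.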

The second line includes a .inc file that is specific to the target processor.

The final line of the preamble includes the label definition file (passed in on the command line as an argument).

The solopre.awk program generates this label definition file as well (although it hasn't created it at the time the preamble is written). However, the file will exist by the time the compiler reads the output. So this is a handy way to let the awk script output information about labels throughout the processing and then have the C compiler read it all up front. The alternative would have been to make multiple passes through the source: one to collect label information and a second pass to do output. Using the includes is simpler and, well, lazier.

When the awk script encounters a label, it does two things. First it outputs a DEFLABEL macro to the label definition file. Then it writes a LABEL macro to the output file. This means the assembler macros can "know" about forward reference labels since all labels will be defined up front with a DEFLABEL macro (remember, the compiler will read the label definition file as part of the preamble before any user code appears). This does require a compiler (like gcc) that supports variable declarations that don't appear at the start of a block.
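Here is one way the two macros could fit together. Treat it as a sketch under assumptions, not the definitions from soloasm.c: I'm guessing that DEFLABEL declares a variable and LABEL assigns the current address to it, and I resolve forward references by simply running the generated code twice. The JMP and NOP encodings and the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 64             /* hypothetical target memory size */

static uint16_t mem[MEM_WORDS];
static unsigned pc;

static void emit(uint16_t w)
{
    assert(pc < MEM_WORDS);
    mem[pc++] = w;
}

/* Assumed definitions: DEFLABEL reserves storage that survives between
   passes; LABEL records the current address in that storage. */
#define DEFLABEL(name) static unsigned name
#define LABEL(name)    (name = pc)

/* Hypothetical instructions: JMP keeps its target in the low 12 bits. */
#define JMP(target) emit((uint16_t)(0xF000u | ((target) & 0x0FFFu)))
#define NOP()       emit(0x0000u)

static void assemble_once(void)
{
    pc = 0;            /* a statement first, so the declaration below
                          is not at the start of the block */
    DEFLABEL(done);    /* would come from the label definition file */

    NOP();
    JMP(done);         /* forward reference: compiles because the
                          label is already declared */
    NOP();
    LABEL(done);       /* the label's address is 3 */
    NOP();
}

static void assemble(void)
{
    assemble_once();   /* first run: every LABEL records its address */
    assemble_once();   /* second run: forward references are correct */
}
```

After assemble() returns, the jump at address 1 holds 0xF003. On the first run it held 0xF000, because done had not been assigned yet.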

As the program converts assembly to macros, it also converts the opcode to uppercase. That means you can ignore case when writing assembler code, but not when defining opcodes in the target.inc file. The script also takes special note of STRING and STRINGPACK pseudo operations. These create DATA (or DATA4) statements with the characters of a string either byte by byte (STRING) or with the characters packed into a 32-bit word (STRINGPACK). Obviously, if you want to use these, your assembler definition will need to handle DATA and DATA4.
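The difference between the two string operations is easier to see in code. In the real tool the awk script writes out the DATA and DATA4 lines itself; the sketch below does the equivalent work at run time just to show the resulting memory layout. The byte order (first character in the most significant byte), the zero padding of the last word, and the absence of a terminator are all assumptions on my part, as are the function names.

```c
#include <assert.h>
#include <stdint.h>

#define MEM_WORDS 64             /* hypothetical target memory size */

static uint32_t mem[MEM_WORDS];
static unsigned pc;

static void emit(uint32_t w)
{
    assert(pc < MEM_WORDS);
    mem[pc++] = w;
}

/* Stand-ins for what a target definition would have to provide. */
#define DATA(v)  emit((uint32_t)(v) & 0xFFu)   /* one byte per location */
#define DATA4(v) emit((uint32_t)(v))           /* one 32-bit word */

/* STRING-style output: one DATA per character. */
static void string_bytes(const char *s)
{
    for (; *s != '\0'; s++)
        DATA((unsigned char)*s);
}

/* STRINGPACK-style output: four characters per DATA4, with the
   first character in the top byte (an assumed ordering). */
static void string_packed(const char *s)
{
    uint32_t word = 0;
    unsigned count = 0;

    for (; *s != '\0'; s++) {
        word = (word << 8) | (unsigned char)*s;
        if (++count == 4) {
            DATA4(word);
            word = 0;
            count = 0;
        }
    }
    if (count != 0)                       /* pad a partial last word */
        DATA4(word << (8 * (4 - count)));
}
```

With these assumptions, string_packed("Hello") stores 0x48656C6C followed by 0x6F000000, while string_bytes("Hello") uses five locations.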

When the awk script completes, the shell script driver compiles the resulting C program along with soloasm.c and executes it. The soloasm.c file contains a main() and some code to output in different formats. So how does the assembly actually occur?
