Statically Recompiling NES Games into Native Executables with LLVM and Go

2013 June 7

I have always wanted to write an emulator. I made a half-hearted attempt in college
but it never made it to a demo-able state. I also didn't want to write Yet Another Emulator.
It should at least bring something new to the table. When I discovered
LLVM, I felt like I finally had a worthwhile idea for a project.

This article presents original research regarding the possibility of
statically disassembling and recompiling
Nintendo Entertainment System
games into native executables.
I attempt to bring the reader along from beginning to end in a sort
of "let's build it together" fashion.
I assume as little as possible about what the reader knows
about the subjects at hand, without compromising the depth of the article.

Discovering LLVM

I became giddy with excitement when I discovered LLVM.
Generate code in LLVM IR format,
and LLVM can optimize and generate native code for any of its supported backends.
Here's an incomplete list:

ARM

STI CBEA Cell SPU [experimental]

C++ backend

Hexagon

MBlaze

Mips

Mips64 [experimental]

Mips64el [experimental]

Mipsel

MSP430 [experimental]

NVIDIA PTX 32-bit

NVIDIA PTX 64-bit

PowerPC 32

PowerPC 64

Sparc

Sparc V9

Thumb

32-bit X86: Pentium-Pro and above

64-bit X86: EM64T and AMD64

XCore

Furthermore, you can write additional backends to support additional targets. For example,
emscripten provides a JavaScript backend.

To seal the deal, the LLVM project offers clang, a
C, C++, Objective C, and Objective C++ front-end for LLVM.
This means that because of the way LLVM is designed, emscripten allows you to compile C, C++,
Objective C, and Objective C++ code for the browser. Wow!

Just to show you how refreshing this technology is, have a look at the following C code:

Aha! In C, & has lower precedence than ==. In addition to
outputting with fancy terminal colors, clang found the bug in our program and issued a warning.
Once we fix that up and run the code again, we get the expected behavior.

Note: It has been shown to me that gcc -Wall produces a similar warning
for the same code example here. I apologize for the bogus example. However,
my point still stands.
LLVM has a much better explanation
of the philosophy and differences in error messages between clang and gcc.

There are many instances where cryptic gcc errors in your code are made clear by clang.
In the words of my friend
Josh Wolfe,

Clang makes gcc look like a rusty old grandpa.

This is exciting because clang is merely a front-end to LLVM. What this means is that
if you generate LLVM IR code, you share the same code generation backend as clang.
With this I felt that powerful optimization and wide target platform support were within my
grasp.

What if we could translate NES assembly code into LLVM IR code? We could completely,
statically recompile NES games into native binaries!

NES Crash Course

Take a break from LLVM for a moment and consider the Nintendo Entertainment System.

Cartridges usually also include a mapper: custom hardware used to provide extra ROM (and sometimes other stuff).

The NES uses memory-mapped I/O, which means that, to interact with hardware, you read from and
write to special memory addresses. Take a look at a simplified version of the memory layout
of the NES at runtime:

$0000 - $07ff
    RAM. Unless they use a mapper, NES games only get 2KB of RAM!

$2000 - $2007
    These memory addresses are hooked up to the PPU and hence are used to control
    what is being displayed on the screen.

$4000 - $4015
    These are hooked up to the APU and are used to control the sound that plays.

$4016 - $4017
    These are used to read the state of the game controllers.

$8000 - $ffff
    This is Read Only Memory. The game code itself is loaded here, so games can
    embed data in their program code and read it in this address range.

Let's break this down. First, note that the entire addressable range of the NES is
64KB. Thus all addresses can be represented by 2 bytes, or 4 hex digits. Next,
note that in 6502 assembly, $ is standard for hexadecimal notation.
So the address range is $0000 to $ffff.
Of the 64KB addressable range, 29KB is (usually) completely unused!

Note the $8000 - $ffff range. When you create an NES game, you
write 6502 assembly code, which is assembled into a 32KB binary, which is loaded into the $8000 - $ffff range during bootup.

Let's write some example 6502 code to illustrate.

Creating a Custom ABI for Testing Purposes

One of the challenges of NES development is that a basic "Hello, World" program is
pretty complicated.
This is why I created a custom Application Binary Interface to make it a little bit easier.
Refer back to the memory layout above, and recall that many addresses are unused.
I decided to add a couple things:

$2008
    Write a byte to this address and the byte will go to standard out.

$2009
    Write a byte to this address and the program will exit with that byte as the return code.

With this add-on ABI, I was able to create a simple 6502 application that prints
"Hello, World!\n" to stdout and then exits.
In 6502 land, that looks like:

; Remember that this code gets loaded at memory address $8000.
; This instruction tells the assembler that the code and data
; that follow are starting at address $8000.
; For example, the 'W' character in the "Hello, World!\n\0" text
; below is located at memory address $8007.
.org $8000
; This is a label. The assembler will replace references to
; this label with the actual address that it represents, which
; in this case is $8000.
msg:
; 'db' stands for "Data Bytes". This line puts the string
; "Hello, World!\n\0" in bytes $8000 - $800e when this
; code is assembled.
.db "Hello, World!", 10, 0
Reset_Routine:
LDX #$00 ; put the starting index, 0, in register X
loop:
; This instruction does 3 things:
; 1. Take the address of msg and add the value of register X to get a pointer.
; 2. Read the byte at the pointer and put it into register A
; 3. Set the zero flag to 1 if A is zero; or 0 if A is nonzero.
LDA msg, X ; read 1 char
; If the zero flag is 1, goto loopend. Otherwise, go to the next instruction.
BEQ loopend ; end loop if we hit the \0
; Put the value of register A into memory address $2008. But remember,
; with our custom ABI, this means write the value of register A to stdout.
STA $2008 ; putchar
; Register X = Register X + 1
INX
JMP loop ; goto loop
loopend:
LDA #$00 ; put the return code, 0, in register A
STA $2009 ; exit with a return code of register A (0 in this case)
; These are interrupt handlers. Let's not think about interrupts yet.
IRQ_Routine: RTI ; do nothing
NMI_Routine: RTI ; do nothing
; Another org statement. Remember that NES programs are 32KB. This tells the assembler
; to fill the unused space up until $fffa with zeroes and that we are now
; specifying the data at $fffa.
.org $fffa
; 'dw' stands for "Data Words". These lines put the address of the NMI_Routine
; label at $fffa, the address of Reset_Routine at $fffc, and the address
; of IRQ_Routine at $fffe.
.dw NMI_Routine
.dw Reset_Routine
.dw IRQ_Routine

There are some things to point out here. First, note the entry points into the program.
Whereas in most programming environments, there is a main function,
in NES games, the data at $fffa - $ffff has hardcoded special
meaning. When a game starts, the NES reads the 2 bytes at $fffc and
$fffd and uses that as the memory address of the first instruction to run.

Next let's talk about the CPU. In the code snippet above I used register A and register X.
There are 6 registers total:

A (8-bit)
    The "main" register. Most of the arithmetic operations are performed on this register.

X (8-bit)
    Another general purpose register.
    Fewer instructions support X than A.
    Often used as an index, as in the above code.

Y (8-bit)
    Pretty much the same as X.
    Fewer instructions support Y than X.
    Also often used as an index.

P (8-bit)
    The "status" register.
    When the assembly program mentions the "zero flag" it is
    actually referring to bit 1 (the second smallest bit) of the status
    register. The other bits mean other things, which we will get into later.

SP (8-bit)
    The "stack pointer". The stack is located at $100 - $1ff.
    The stack pointer is initialized to $1ff.
    When you push a byte on the stack, the stack pointer
    is decremented, and when you pull a byte from the stack,
    the stack pointer is incremented.

PC (16-bit)
    The program counter. You can't directly modify this register
    but you can indirectly modify it using the stack.

After coming up with this well-defined task -
get "Hello World" running natively - it is time to write some code.
First things first - this project should be able to assemble our
source code into binary machine code format.

Assembly

One tried and true method to parsing source code is to use a
lexer
to tokenize the source, and then a
parser
to turn the tokens into an
Abstract Syntax Tree.
From there we can process the AST and turn it into the 32KB binary payload.

In Go land, this is straightforward thanks to nex,
a lexer generator, and go tool yacc, a parser generator that
is bundled with Go.
The parser builds an AST for our "Hello, World" example. From there,
producing the binary takes a few passes over the AST: compute the byte
offset of each statement, use the computed byte offsets to resolve the
instructions with labels, and make a final pass to write the payload to the disk.

Disassembly

Now we need to work backwards - we have the binary payload and we want to get
the source.
There is a challenge here. Remember that in assembly programs, we can insert
arbitrary data with .db or .dw statements alongside instructions.
In order to disassemble effectively, we have to be able to figure out what
is "data" and what is "code".

One possible technique is to emulate the assembly program, and then record
the ways in which memory addresses are accessed. After playing a game for a while,
you would have a pretty good record of exactly which sections are data and which are code.
I decided not to use this technique, however, since the goal of this project is static
recompilation. I want to explore just how much is possible to do at compile-time.

So what can we do?

First, recall that the last 6 bytes in NES programs are 3 memory addresses which are the
3 entry points into the program:

.org $fffa
.dw NMI_Routine
.dw Reset_Routine
.dw IRQ_Routine

Given this, a workable strategy becomes clear:

Create an AST where every single byte is a single .db statement.

Replace the .db statements at $fffa and $fffb
with a .dw statement which references an NMI_Routine label.

Calculate the address that the .db statements at $fffa and $fffb
were referring to, and insert a LabelStatement with the NMI_Routine
label before the .db statement at that address.

Mark the .db statement at that address as an instruction.

When I say "mark as an instruction", what I mean is that we should interpret the
.db statement at that location as an op code, and then use that
to replace the following .db statements as part of the instruction
as necessary. Then, based on the instruction, we want to recursively
mark other locations as instructions:

BPL, BMI, BVC, BVS, BCC, BCS, BNE, BEQ, JSR
    Mark the jump target address and the next address as an instruction.

JMP absolute
    Mark the jump target address as an instruction.

RTI, RTS, BRK, JMP indirect
    Do nothing.

everything else
    Mark the next address as an instruction.

The instructions that start with "B" are all conditional branch instructions:
they test some condition, and then transfer control flow either to
the next instruction or to a different label. This means that we can mark both the
possible branch address and the next address as instructions.

JSR stands for "Jump to SubRoutine". This will transfer control to a
target address and then later when the RTS ("ReTurn from Subroutine")
instruction is reached, continue execution where the JSR instruction left off.

It is possible for assembly programmers to use JSR and then
inside the subroutine, do tricks with the stack to return to a different location.
This is a problem that will be tackled later. It's not an issue with our "Hello World" example.

RTI, RTS, and BRK modify control flow, but the destination
address is not constant, so these instructions do not help us know what else to mark as instructions.

As seen in this table, there are 2 types of JMP instructions, absolute and indirect:

Absolute - e.g. JMP Label_80a2
    This version is used in the "Hello World" program. It sends control
    flow to the operand address - which in this example is a label. This
    version is convenient for disassembly because the destination address
    is known statically - the address is hard-coded.

Indirect - e.g. JMP ($0123)
    Uses the operand address as a pointer, sending control flow
    to the address at the pointer. This will prove to be one of the big challenges
    of this project. More on this later.

The instructions in the earlier table are the only ones that modify control flow.
All other instructions execute serially. Thus if we encounter one of these,
we can reliably decode the next byte as an instruction.

Here's what happens when we apply this algorithm to our Hello World binary code:
the technique decodes all of the instructions, but we can't
read the text, and it's pretty annoying having $ff -
the filler byte value - repeated
so many times. Let's add a pass to detect ASCII strings. We can do this
by counting how many characters in a row within a .db statement
are considered "ASCII", and when a threshold of 4 is reached, replacing
the .db statements with a quoted string.

Better! Let's solve the problem of the repeating $ff
by adding a pass that detects where it would make sense to place
.org statements. We can do this in much the same way
as ASCII detection, but this time we look for repeating
bytes rather than bytes which fit in the ASCII range. 64 seems
like a good threshold: if a byte repeats 64 times, replace all the
repeated occurrences with a .org statement.

Not bad. This is as close as we can get to the original source.
It's impossible to know what label names were used, but we can
give good names to the interrupt vectors. We just turned a
binary machine code program into human-readable assembly.

Now that we can figure out the assembly source code from 6502 machine code,
we can start the fun part - converting the assembly program into native
machine code.

Code Generation

Our code generation code will generate an LLVM bitcode module.
We can then use llc to compile the bitcode into an object file,
the same as if we used gcc -c module.c and looked at the resulting
module.o.

LLVM is written in C++, but it also exposes a C interface.
This means we can integrate cleanly using cgo.
In fact, Andrew Wilkins maintains a convenient Go package called
gollvm which gives us seamless integration.

At any time we can debug the LLVM module we are generating by calling module.Dump()
which prints the LLVM IR code
for the module to stderr. Let's start by manually creating the
IR code that we want to generate for Hello World, so we know what to work toward.
We can get a head start by writing it in C and using clang to generate the IR
code for us:

; Here we declare the text that we will print.
; "private" means that only this module can see it - we do not export this
; symbol. Always declare private when possible. There are optimizations to be
; had when a symbol is not exported.
; "constant" means that this data is read-only. Again use constant when
; possible so that optimization passes can take advantage of this fact.
; [15 x i8] is the type of this data. i8 means an 8-bit integer.
@msg = private constant [15 x i8] c"Hello, World!\0a\00"
; Here we declare a dependency on the `putchar` symbol.
; When this module is linked, `putchar` must be defined somewhere, and with
; this signature.
declare i32 @putchar(i32)
; Same thing for `exit`.
; `noreturn` indicates that we do not expect this function to return. It will
; end the process, after all.
; `nounwind` has to do with LLVM's error handling model. We use `nounwind`
; because we know that `exit` will not throw an exception.
declare void @exit(i32) noreturn nounwind
; Note that we will be performing the final link step with gcc, which will
; automatically statically link against libc. This will provide the `putchar`
; and `exit` symbols, as well as set up the executable entry point to call `main`.
define i32 @main() {
; This label statement indicates the start of a basic block.
Entry:
; Here we allocate some variables on the stack. These are X, Y, and A,
; 3 of the 6502's 8-bit registers.
%X = alloca i8
%Y = alloca i8
%A = alloca i8
; Note that here we are allocating variables which are single bits.
; These represent 2 of the bits from the status register.
; After this source listing there is a table explaining each bit
; of the status register.
%S_neg = alloca i1
%S_zero = alloca i1
; Send control flow to the Reset_Routine basic block.
br label %Reset_Routine
Reset_Routine:
; This is the code to generate for
; LDX #$00
; Store 0 in the X register.
store i8 0, i8* %X
; Clear the negative status bit, because we just stored 0 in X,
; and 0 is not negative.
store i1 0, i1* %S_neg
; Set the zero status bit, because we just stored 0 in X.
store i1 1, i1* %S_zero
br label %Label_loop
Label_loop:
; This is the code to generate for
; LDA msg, X
; Load the value of X into %0.
%0 = load i8* %X
; Get a pointer to the character in msg indexed by %0, which contains the
; value of X.
%1 = getelementptr [15 x i8]* @msg, i64 0, i8 %0
; Read a byte of memory located at the pointer we just computed into %2.
%2 = load i8* %1
; Store the byte we just loaded into %A, which is the variable we have
; allocated for A.
store i8 %2, i8* %A
; Now we need to set the negative status bit correctly.
; The byte of memory we just loaded into %A is negative if
; and only if the highest bit is set.
; Perform a bitwise AND with 1000 0000.
%3 = and i8 128, %2
; Test if the result is equal to 1000 0000.
%4 = icmp eq i8 128, %3
; Save the answer to the negative status bit.
store i1 %4, i1* %S_neg
; Now we need to set the zero status bit correctly.
; Test if the byte is equal to zero.
%5 = icmp eq i8 0, %2
; Store the answer to the zero status bit.
store i1 %5, i1* %S_zero
; This is the code to generate for
; BEQ loopend
%6 = load i1* %S_zero
; If zero bit is set, go to Label_loopend. Otherwise, go to AutoLabel_0
br i1 %6, label %Label_loopend, label %AutoLabel_0
AutoLabel_0:
; This is the code to generate for
; STA $2008
%7 = load i8* %A
; Convert the 8-bit integer that we just loaded from A into
; a 32-bit integer to match the signature of `putchar`.
%8 = zext i8 %7 to i32
%9 = call i32 @putchar(i32 %8)
br label %AutoLabel_1
AutoLabel_1:
; This is the code to generate for
; INX
%10 = load i8* %X
%11 = add i8 %10, 1
store i8 %11, i8* %X
; Set the negative status bit correctly.
%12 = and i8 128, %11
%13 = icmp eq i8 128, %12
store i1 %13, i1* %S_neg
; Set the zero status bit correctly.
%14 = icmp eq i8 0, %11
store i1 %14, i1* %S_zero
; This is the code to generate for
; JMP loop
br label %Label_loop
Label_loopend:
; This is the code to generate for
; LDA #$00
store i8 0, i8* %A
; Clear the negative status bit.
store i1 0, i1* %S_neg
; Set the zero status bit.
store i1 1, i1* %S_zero
; This is the code to generate for
; STA $2009
%15 = load i8* %A
%16 = zext i8 %15 to i32
call void @exit(i32 %16) noreturn nounwind
; Terminate this basic block with `unreachable` because
; exit never returns.
unreachable
; Generate dummy basic blocks for the
; interrupt vectors, because we don't support them yet.
IRQ_Routine:
unreachable
NMI_Routine:
unreachable
}

In this code we use S_neg and S_zero, 2 of the status register bits.
These bits, along with the other status bits that we did not mention yet,
are updated after certain instructions and used for things such as branching.
Here is a full description of all the status bits:

0000 0001 - S_carry
    Used for arithmetic and bitwise instructions, typically to make it
    easier to deal with integers that are larger than 8 bits.
    We don't need to deal with this yet.
    BCC and BCS use this status bit to decide whether to branch.

0000 0010 - S_zero
    When a computation results in a value that is equal to zero, this
    status bit is set. Otherwise, it is cleared. BEQ and
    BNE use this status bit to decide whether to branch.

0000 0100 - S_int
    This bit indicates whether interrupts are disabled. You can use SEI
    to set this bit, and CLI to clear this bit. More on interrupts later.

0000 1000 - S_dec
    Normally, this bit would toggle decimal mode on and off.
    However, the NES disables this feature of the CPU, so it effectively
    does nothing. You can use SED to set this bit, and
    CLD to clear this bit.

0001 0000 - S_brk
    This bit is set when a BRK instruction has been executed and an
    interrupt has been generated to process it.
    We'll ignore this bit for now.

0010 0000 - (unused)
    This bit is unused. It remains 0 at all times.

0100 0000 - S_over
    When a computation results in an invalid
    two's complement
    value, this bit is set. Otherwise, it is cleared.
    BVC and BVS use this to decide whether to branch.

1000 0000 - S_neg
    When a computation results in a negative two's complement value, this bit is set.
    Otherwise, it is cleared. BPL and BMI use this status
    bit to decide whether to branch.

llc converts the LLVM IR code into a native object file, and then
gcc does the final link step, statically linking against libc
to hook up main, putchar, and exit.
By default, gcc creates an executable named a.out,
which we run, and voila!

The next step is to generate this code from our disassembly.
With the help of gollvm
we will:

Create a LLVM module.

Declare our dependency on putchar and exit.

Create the main function and allocate stack variables for the
registers.

Perform one pass over the AST to identify labeled data
and create global variables, saving a mapping from label name
to global variable.

Perform a second pass over the AST to generate the basic blocks,
saving a mapping from label name to basic block.

Perform a final pass over the AST to generate code inside of the basic blocks.

Here is a structure that contains the state we need while
compiling:

type Compilation struct {
	// Keep track of the warnings and errors
	// that occur while compiling.
	Warnings []string
	Errors   []string
	// program is our abstract syntax tree -
	// we will make several passes over it during
	// the compilation process.
	program *Program
	// LLVM variable which represents the module we
	// are creating.
	mod llvm.Module
	// This is the object that we will use to do all the
	// code generation. It's used to create every kind
	// of statement.
	builder llvm.Builder
	// Reference to our main function.
	mainFn llvm.Value
	// References to the functions we declared, so we
	// can call them.
	putCharFn llvm.Value
	exitFn    llvm.Value
	// References to the variables we allocate on the stack
	// so we can use them in instructions.
	rX     llvm.Value
	rY     llvm.Value
	rA     llvm.Value
	rSNeg  llvm.Value
	rSZero llvm.Value
	// Maps label name to the global variables that we add,
	// for when the code loads data from a label.
	labeledData map[string]llvm.Value
	// Maps label name to basic block, so that when
	// code branches to another label, we can branch to
	// the relevant basic block.
	labeledBlocks map[string]llvm.BasicBlock
	// Keeps track of the basic block we are currently
	// generating code for, if any.
	currentBlock *llvm.BasicBlock
	// Keeps a reference to the reset routine basic block
	// so we know where to first jump to from the entry point.
	resetBlock *llvm.BasicBlock
}

At this point in the project we are able to recompile a simple
"Hello, World" 6502 program with a small custom ABI into native machine code
and then execute it.

Optimization

Did you notice that we never used the value of S_neg?
We only ever stored it.
This is a waste of CPU cycles. We can do better. However, we don't want to
completely remove the ability to compute S_neg -
although this "Hello World" example does
not use the value, other code might.

Optimization is an enormously complicated topic with its own well-deserved field of study.
Luckily, we won't have to wrap our heads around it in order to benefit: LLVM IR is designed to be optimized,
and LLVM comes with state-of-the-art optimization techniques, in the form of passes
that you run on a module.

Let's run several optimization passes on the module we generate before rendering
bitcode. The crucial one is the pass that promotes memory to registers.
It allows our code generation to load every "register" variable (X, Y, A, etc.) before performing
an instruction, and then store the register variable back after performing the instruction;
the pass then converts these allocated variables into SSA registers, eliminating
all the redundancy.

Now, not only are we recompiling a simple 6502 program for native machine code,
but we're actually generating highly optimized code.
But it's not time to congratulate ourselves yet. This is a contrived case.
Let's see if these techniques can work on an actual NES game.

Layout of an NES ROM File

The generally accepted standard for distributing NES games is the .nes
format, originally started by the iNES emulator.
A .nes file looks like:

16 byte header with metadata, such as:

What mapper, if any, the game uses.

Whether the PPU uses vertical or horizontal mirroring.
For now, don't worry so much about exactly what this means, as much as that
it is a setting that the PPU needs to know about at bootup.

The 32 KB of assembled code which gets loaded into $8000 - $ffff on bootup.
If there is a mapper, there might be more of this program data, but we'll talk
about that later.

8KB of graphics data, known as CHR-ROM.
This gets loaded into the PPU on bootup.
Again if there is a mapper,
there might be more of this graphics data.

The first "real" ROM I tried to recompile is this demo ROM made by
Chris Covell in 1998 called
Zelda Title Screen Simulator.
If you run this ROM in an emulator, you get a title screen that looks nearly identical to Zelda 1:

Now that we're recompiling a real ROM, we'll have to do several more
things to make it work:

Include the graphics data in the binary.

Include the mirroring setting (vertical or horizontal) in the binary.

Support code gen for more instructions.

Include a runtime to create a window and render
the video to the screen.

Including the graphics data and mirroring setting in the binary is trivial.
We get all the information we need when we read the ROM file; all
we need to do is convert it to global variables in our LLVM module.

In retrospect,
I think it would have been simpler to have the assembler
support special declarations for this metadata rather than having a separate
file. But hey, it works.

You may notice there is some metadata there which is never addressed in
this article. For the purposes of this project, we don't need to think about
SRAM, battery, or tvsystem. None of the examples we look at use them.

Again, writing the code to pack the rom back together into an
.nes file is left as an exercise to the reader.

Now we observe the ROM running in an emulator and see if it still works.

Ha! It works but the sword is bent. That's actually pretty convenient - now
when we want to make interrupts work, we know how to test them.
But like true software developers, we're going to ignore any and
all problems as long as possible.

The only thing left to do then, is to add code gen support for the new
instructions. This leads us to our first challenge.

Challenge: Runtime Address Checks

Looking at the disassembled code for Zelda Title Screen Simulator,
there is one instruction that stands out as a bit more tricky to
recompile than the others:

STA ($00), Y

This is an indirect instruction. It tells the CPU to:

Interpret the values in $0000 and $0001 as a memory address.

Add the value of register Y to the address to compute a new address.

Store the value of register A to the new address.

Quite a useful instruction, but it poses a challenge for recompilation.
Because values in memory addresses $0000 and $0001 could be anything,
only at runtime will we learn which memory address is about to be updated.

Recall that the NES uses memory mapped I/O. This means that this instruction
could be saving a value in memory, but it could also be talking to the PPU,
the APU, the game pad controller, or even a mapper.
Since this one instruction could be doing any of these things
depending on runtime state, we will have to add a
runtime address check.

With this framework in place, we can recompile instructions that store to
memory addresses only known at runtime.
Code generation for the other instructions in this assembly program
is straightforward at this point.

So now we can generate a new, native program to run. But how do we
get the video to actually display on the screen?

This question brings us to the next challenge.

Challenge: Parallel Systems Running Simultaneously

To find out how to get something rendering on the screen, I looked at
Fergulator - an
already-working NES emulator written in Go.

Fergulator correctly emulates Super Mario Brothers 1 as well as
Zelda Title Screen Simulator, so it will certainly work for our purpose -
understanding how the video gets onto the screen.

The answer is easy to find in this well factored codebase.
Looking at the main loop in machine.go, the core
logic is revealed:

The CPU runs one instruction and returns how many cycles the instruction
took to run, and then the PPU is stepped for 3 times as many cycles, and
the APU is stepped the same number of cycles as the CPU.

A problem presents itself here.
Our recompiled code replaces the CPU stepping, but we still have the PPU
and the APU code to reckon with.
We can start by eliminating audio from the equation and dealing with sound
later.
But no matter how you spin it, the fact remains that the PPU and the CPU
run independently of one another, and at differing speeds.

You can choose to recompile program code and run that as the main loop, but
then after every instruction
you must emulate the PPU for 3 times as many cycles as the instruction took.
Or you can choose to run the PPU as the main loop, but then after the
appropriate amount of cycles you must emulate the CPU for one instruction.

One of the systems must be emulated.

This is how I solved the problem:

Have code generation call
rom_cycle, an external function, after every instruction completes,
passing the number of cycles the instruction took as a parameter.

Bundle in a runtime with the generated executable which implements rom_cycle
and emulates the PPU for the appropriate amount of steps.

It is a shame that we have to do some amount of emulation in this project, where the
goal is to statically recompile as much as possible.
The solution to this challenge represents a slight compromise to the project's integrity.
Yet we press on.

Given this solution, I ported the PPU code as well as the
SDL
and
OpenGL
front-end code from Fergulator to a
small C runtime,
which is compiled with clang.
Here is a snippet explaining the rom_cycle and
main functions:

#include "rom.h"
#include "assert.h"
#include "ppu.h"
#include "SDL/SDL.h"
#include "GL/glew.h"

Ppu* p;

// This function is called by the generated module after every instruction.
void rom_cycle(uint8_t cycles) {
    // Check the SDL event loop and quit if the Close event occurs.
    flush_events();
    // Step the PPU for 3 times the number of cycles that just finished.
    for (int i = 0; i < 3 * cycles; ++i) {
        Ppu_step(p);
    }
}

// This function is our new main entry point. We rename the
// main rom entry point to `rom_start` so that we can call it
// from this function.
int main(int argc, char* argv[]) {
    // Create a new instance of the PPU emulator core.
    p = Ppu_new();
    // The PPU code will call `render` when there is a frame ready to display
    // on the screen. The `render` function performs the SDL and OpenGL
    // calls to render the frame to the window.
    p->render = &render;
    // Remember that in the generated rom module, we export the ROM mirroring
    // setting after reading it from the .nes file. Here we use it to configure
    // the nametable code.
    Nametable_setMirroring(&p->nametables, rom_mirroring);
    // We currently only support ROMs with 1 CHR bank.
    assert(rom_chr_bank_count == 1);
    // In the generated rom module, we have the CHR ROM data in memory, as well
    // as `rom_read_chr`, an exported function which will copy the data to a
    // pointer. We use that to initialize the video RAM.
    rom_read_chr(p->vram);
    // This does the SDL and OpenGL setup such as creating the display window.
    init_video();
    // Here we call into the main entry point to the recompiled ROM code.
    rom_start();
    // Free up memory associated with the PPU emulator instance.
    Ppu_dispose(p);
}

uint8_t rom_ppu_read_status() {
    return Ppu_readStatus(p);
}
void rom_ppu_write_control(uint8_t b) {
    Ppu_writeControl(p, b);
}
void rom_ppu_write_mask(uint8_t b) {
    Ppu_writeMask(p, b);
}
void rom_ppu_write_oamaddress(uint8_t b) {
    Ppu_writeOamAddress(p, b);
}
void rom_ppu_write_address(uint8_t b) {
    Ppu_writeAddress(p, b);
}
void rom_ppu_write_data(uint8_t b) {
    Ppu_writeData(p, b);
}
void rom_ppu_write_oamdata(uint8_t b) {
    Ppu_writeOamData(p, b);
}
void rom_ppu_write_scroll(uint8_t b) {
    Ppu_writeScroll(p, b);
}

Notice that we include rom.h. This is a header file that defines the contract
that the generated rom module will fulfill:
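The header is not reproduced in full here, but based on the calls in the runtime listing above, it looks roughly like this (a reconstruction; the exact types and comments are assumptions):

```c
/* rom.h -- the contract between the runtime and the generated rom
 * module, reconstructed from the runtime listing above. */
#ifndef ROM_H
#define ROM_H

#include <stdint.h>

/* Provided by the generated rom module. */
extern uint8_t rom_mirroring;      /* nametable mirroring from the .nes header */
extern uint8_t rom_chr_bank_count; /* number of CHR banks in the ROM */
void rom_read_chr(uint8_t* dest);  /* copy the CHR ROM data to dest */
void rom_start(void);              /* entry point of the recompiled code */

/* Provided by the runtime, called by the generated module. */
void rom_cycle(uint8_t cycles);
uint8_t rom_ppu_read_status(void);
void rom_ppu_write_control(uint8_t b);
void rom_ppu_write_mask(uint8_t b);
void rom_ppu_write_oamaddress(uint8_t b);
void rom_ppu_write_address(uint8_t b);
void rom_ppu_write_data(uint8_t b);
void rom_ppu_write_oamdata(uint8_t b);
void rom_ppu_write_scroll(uint8_t b);

#endif
```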

And so here we have it: an executable binary that depends on SDL and OpenGL, which
supposedly will execute the Zelda Title Screen Simulator when run. Does it work?

Note: There is an obvious discrepancy between the screenshot and the output listed
above. All output is real; the difference is that the article is organized, ordered,
and simplified for clarity and understanding. I did not necessarily write the code
in the same order that it is listed in this article.

So we made a small concession to solve this challenge, and we have a
simple demo NES ROM running natively.
Next let's see if we can fix the dent in that sword.

Challenge: Handling Interrupts

In NES programming, after any instruction, it is possible that
the program counter is yanked away from the next expected instruction and instead
sent to a predefined location.
This is called an interrupt.

There are 3 kinds of interrupts in NES programming:

Reset

Occurs when the user presses the reset button.
This is also where the program counter starts when the NES
powers on.

IRQ

Stands for Interrupt ReQuest. This interrupt can only be fired
if a game uses a mapper or uses the BRK instruction.
It can be enabled with the CLI instruction and
disabled (masked) with the SEI instruction.

NMI

Stands for Non-Maskable Interrupt because there are no instructions
to enable and disable this interrupt. It occurs when the
vertical blank
begins.

When an interrupt occurs, the program counter and status register are
pushed onto the stack, and the program counter is set to the location
defined by the interrupt vector table, as seen in code listings above:

.org $fffa
.dw NMI_Routine
.dw Reset_Routine
.dw IRQ_Routine

When a program has finished processing an interrupt,
it typically executes the RTI instruction
to return control flow back to where it was before
the interrupt occurred.

We don't yet have to deal with handling the reset button
being pressed, mappers, or games which execute the
BRK instruction,
so let's work on solving this
problem for the NMI interrupt only.

The PPU performs the vertical blanking which signals the NMI interrupt,
and the PPU emulation code is in the runtime.
In order to handle NMI interrupts correctly, we have to be ready to jump
to the interrupt routine after every single instruction.

We already call into the runtime by calling rom_cycle after
every instruction. This is where the PPU emulation is performed, which
is where the NMI interrupt is generated.
Given this, a natural solution might be something like this:

Have rom_cycle return a value indicating which interrupt,
if any, has occurred.

After every instruction, check if an interrupt has occurred, and if so,
jump to the interrupt routine. Otherwise, continue.

This solution is rather ugly, however. Branching after every instruction
hurts both execution speed and executable size. Further, it leaves a
critical problem unsolved: how to jump back to where the program counter
was before the interrupt occurred.
The solution I came up with instead feels better, but it comes with a caveat.
Here's the idea:

Switch the register variables from being allocated on the stack in
rom_start to global variables in the generated module.

Update rom_start, the main entry point, to take a parameter
indicating which interrupt vector to execute.

In the runtime, when rom_start is called for the first time,
pass Reset as the interrupt vector to execute.

In the generated rom_start code, insert code at the beginning
of the NMI interrupt routine block to push the program counter and the
processor status to the stack.

In rom_cycle, if there is no interrupt, return as normal.
However, if there is an interrupt, call rom_start, passing
the correct interrupt vector to execute.

Generate a return statement for the RTI instruction.

The beauty of this solution is that it uses the real native stack for interrupts,
in what is probably the most efficient and elegant way to get the desired behavior.

The caveat is that if the game uses RTI for its side-effects instead
of the usual "return from interrupt" behavior, the executable will unexpectedly exit.
Further, if the game simulates returning from an interrupt without using
RTI, the game will crash due to a stack overflow.

Acknowledging these weaknesses, let's plow ahead until we are forced to solve
this problem a different way.

Let's see what it looks like to implement the plan.

Update rom_start to take a parameter indicating which interrupt
vector to execute:
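The generated module is LLVM IR, but the shape of the change can be sketched in C, with goto labels standing in for IR basic blocks. The vector constants, register names, and the last_vector recording variable are illustrative assumptions, not the project's actual names:

```c
#include <stdint.h>

/* Vector selector values; the exact encoding is an assumption. */
enum { INTERRUPT_NONE, INTERRUPT_RESET, INTERRUPT_NMI, INTERRUPT_IRQ };

/* Register variables are now globals, so a nested rom_start call made
 * for an interrupt operates on the interrupted state. */
uint8_t reg_a, reg_x, reg_y, reg_status;

/* For illustration only: records which routine ran. */
uint8_t last_vector = INTERRUPT_NONE;

/* The generated entry point now dispatches on the interrupt vector. */
void rom_start(uint8_t vector) {
    switch (vector) {
    case INTERRUPT_RESET: goto Reset_Routine;
    case INTERRUPT_NMI:   goto NMI_Routine;
    case INTERRUPT_IRQ:   goto IRQ_Routine;
    default:              return;
    }
Reset_Routine:
    /* ... recompiled code reached from the Reset vector ... */
    last_vector = INTERRUPT_RESET;
    return;
NMI_Routine:
    /* ... recompiled NMI handler; RTI becomes a native return ... */
    last_vector = INTERRUPT_NMI;
    return;
IRQ_Routine:
    last_vector = INTERRUPT_IRQ;
    return;
}
```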

In the runtime, when rom_start is called for the first time,
pass Reset as the interrupt vector to execute.
Also, in rom_cycle, if there is no interrupt, return as normal.
However, if there is an interrupt, call rom_start, passing
the correct interrupt vector to execute:
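The runtime side can be sketched in C, with stubs standing in for the PPU core and the generated module. The pending_interrupt flag and the stub names are assumptions for this sketch; the real runtime reads the interrupt state from the Ppu struct:

```c
#include <stdint.h>

enum { INTERRUPT_NONE, INTERRUPT_RESET, INTERRUPT_NMI };

/* Stub standing in for the generated module's entry point; it records
 * the vector it was asked to execute. */
uint8_t started_with = INTERRUPT_NONE;
void rom_start(uint8_t vector) { started_with = vector; }

/* Assumed: set by the PPU emulation when vertical blank begins. */
uint8_t pending_interrupt = INTERRUPT_NONE;

/* Stub for the real PPU emulation step. */
static void ppu_step(void) { }

/* Called by the generated module after every instruction. */
void rom_cycle(uint8_t cycles) {
    /* Step the PPU for 3 times the number of CPU cycles. */
    for (int i = 0; i < 3 * cycles; ++i) {
        ppu_step();
    }
    /* If the PPU signaled an interrupt, re-enter the recompiled code.
     * The native call stack now mirrors the NES interrupt behavior. */
    if (pending_interrupt != INTERRUPT_NONE) {
        uint8_t vector = pending_interrupt;
        pending_interrupt = INTERRUPT_NONE;
        rom_start(vector);
    }
}
```

At startup, the runtime simply calls rom_start with the Reset vector.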

And so here we have it: an elegant but imperfect solution to the interrupt problem.
Let's see if it fixes the bent sword:

Looks like progress to me!
At this point the project can recompile a small, simple title screen demo
NES program. The real challenge awaits: can it be made to work for a real NES game?

There are many games to choose from. Ideally we would pick one that poses fewer
additional challenges. One filter we can apply is to eliminate games that use mappers,
since we have hitherto ignored mapper support entirely.

This limits the choices significantly. The only games worth noting that do not use
a mapper are:

Challenge: Detecting a Jump Table

After the value at $0770 is loaded into register A, and we return
from the Label_8e04 subroutine,
there is an uncommon indirect AND instruction,
followed by data.
What could possibly be happening here?

Super Mario Brothers 1 is using a common assembly programming technique
called a dynamic jump table.
Take a look at the Label_8e04 subroutine:

Notice that although this label is jumped to with JSR,
it never uses the RTS instruction.
Let's break it down further into readable pseudocode:

; Dynamic Jump Table. Call this label with JSR
; so that the old PC is on the stack.
; Immediately following the JSR statement should be
; .dw statements indicating the labels to jump to
; depending on the value of register A.
Label_8e04:
; Register A holds the index of the label that we wish to jump to.
; Multiply A by 2 because each table entry is 2 bytes.
ASL ; A = A * 2
; The useful indirect instructions use Y as the index, and we need
; to repurpose A.
TAY ; Y = A
; Since this label was called with JSR, the old PC is on the top
; of the stack. Here we get the lower byte since this is a little
; endian system.
PLA ; A = Stack.Pop()
; Save the lower byte of the old PC into memory.
STA $04 ; Memory[$04] = A
; Get the higher byte of the old PC off the stack.
PLA ; A = Stack.Pop()
; Save the higher byte of the old PC into memory.
STA $05 ; Memory[$05] = A
; JSR pushes the address - 1 of the next instruction to the stack.
; So we add 1 to Y to get the index of the first byte of the jump
; destination.
INY ; Y = Y + 1
; Get the first byte of the jump destination.
LDA ($04), Y ; A = Memory[Memory[$04] + Y]
; Save the first byte of the jump destination.
STA $06 ; Memory[$06] = A
; Increment Y to get the index of the 2nd byte of the jump destination.
INY ; Y = Y + 1
; Get the 2nd byte of the jump destination.
LDA ($04), Y ; A = Memory[Memory[$04] + Y]
; Save the 2nd byte of the jump destination.
STA $07 ; Memory[$07] = A
; Jump to the location that was just constructed.
JMP ($0006) ; Jump to address at $0006 - $0007

If we know that Label_8e04 is a jump table, we can
mark the bytes following the JSR as .dw
labels which enables us to further disassemble the program.
The disassembled snippet from earlier would look like this:

Without this jump table detection, the bytes at each of those labels
remain .db statements, unable to be decoded.
This is problematic because our strategy currently
depends on all instructions being completely disassembled so that
they can be decompiled and recompiled.

This is not a mere corner-case either. This technique is repeated in many games,
including
Super Mario Brothers 3 and Pac-Man.

Fortunately, it is straightforward to identify and process a jump table
like this without changing too much code.
I solved it with a state machine - essentially pattern-matching or
a regular expression for instructions:
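The project's detector is written in Go; here is a C sketch of the same idea. March through the opcode stream, comparing each opcode byte against the fixed sequence from the listing above while skipping operand bytes (the function name is illustrative):

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Standard 6502 opcode bytes for the exact instruction sequence used
 * by the jump table routine above. */
static const uint8_t jump_table_pattern[] = {
    0x0a, /* ASL A        */
    0xa8, /* TAY          */
    0x68, /* PLA          */
    0x85, /* STA zeropage */
    0x68, /* PLA          */
    0x85, /* STA zeropage */
    0xc8, /* INY          */
    0xb1, /* LDA (zp),Y   */
    0x85, /* STA zeropage */
    0xc8, /* INY          */
    0xb1, /* LDA (zp),Y   */
    0x85, /* STA zeropage */
    0x6c, /* JMP (abs)    */
};

/* Length of each instruction in the pattern, used to skip operands. */
static const uint8_t pattern_lengths[] = {
    1, 1, 1, 2, 1, 2, 1, 2, 2, 1, 2, 2, 3,
};

/* Return true if the code at prg[offset] matches the dynamic jump
 * table idiom. */
bool is_jump_table(const uint8_t* prg, size_t prg_len, size_t offset) {
    for (size_t i = 0; i < sizeof(jump_table_pattern); ++i) {
        if (offset >= prg_len) return false;
        if (prg[offset] != jump_table_pattern[i]) return false;
        offset += pattern_lengths[i];
    }
    return true;
}
```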

Given this detection function, it is a matter of adding logic to
the JSR disassembly. If a jump table is detected,
mark the following bytes as .dw label statements.
Otherwise, continue marking the next address as an instruction
as usual.

After adding jump table detection, it looks like all the program instructions
are disassembled successfully. But there are still some tricks up the
assembly programmers' sleeves.

Challenge: Indirect Jumps

We just looked at a dynamic jump table implementation which included this
instruction:

JMP ($0006)

This is problematic because we do not know what will be in the memory addresses
$0006 and $0007 until the instruction is actually executed.

One solution is to use our own jump table.
We can use our knowledge of the address of each label
to create a basic block to jump to when an indirect jump is encountered:
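The generated module does this as a switch over every known label address. Sketched in C, with goto labels standing in for generated basic blocks (the two label addresses here are illustrative, and last_block exists only to make the sketch observable):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* For illustration only: records which basic block we landed in. */
uint16_t last_block = 0;

/* Sketch of what the recompiler emits for JMP ($0006): look up the
 * 16-bit target and branch to the block generated for that label. */
void dyn_jump(uint16_t target) {
    switch (target) {
    case 0x8e1e: goto Label_8e1e;
    case 0x8e2a: goto Label_8e2a;
    default:
        /* The game jumped somewhere we never disassembled. */
        fprintf(stderr, "indirect jump to unknown address $%04x\n", target);
        exit(1);
    }
Label_8e1e:
    /* ... recompiled code for Label_8e1e ... */
    last_block = 0x8e1e;
    return;
Label_8e2a:
    /* ... recompiled code for Label_8e2a ... */
    last_block = 0x8e2a;
    return;
}
```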

This solution will work as long as the indirect jump chooses to jump to one
of the labels.
However, it causes a runtime crash if the indirect jump
sets the PC to anything else.
As we will soon find out, assembly programmers not only force the processor
to do exactly that, but commit even more heinous acts.

Challenge: Dirty Assembly Tricks

Super Mario Brothers 1 is an amazing technical feat.
Every last byte of the 32KB available program space is utilized.
In fact, some bytes are even dual purposed to save space.
Have a look at this code from our Super Mario 1 disassembly:

Label_8220:
LDY #$00 ; $a0 $00
.db $2c ; op code for BIT absolute
Label_8223:
LDY #$04 ; $a0 $04
LDA #$f8

$2c is the op code for BIT absolute, which is 3 bytes - 1 byte op code
and then 2 bytes for the absolute address.

LDY #$04 is 2 bytes - $a0 for the op code and then
$04 for the immediate value.

Here is how this works: if you jump to Label_8220, it does Y = $00, and then
the stray $2c byte sabotages the next instruction, causing the Y = $04 to
not happen. Instead, the BIT instruction sets some status bits in a way that
does not matter. Then execution picks up the next instruction, LDA #$f8,
as if you had jumped to Label_8223 with a different Y.
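We can check the overlap by decoding the byte stream from both entry points; both converge on the LDA #$f8. A small C sketch, using the standard 6502 instruction lengths for the opcodes involved (the helper names are illustrative):

```c
#include <stdint.h>
#include <stddef.h>

/* The seven bytes starting at Label_8220: LDY #$00, a bare $2c, then
 * LDY #$04 and LDA #$f8. Label_8220 is offset 0, Label_8223 offset 3. */
static const uint8_t code[] = { 0xa0, 0x00, 0x2c, 0xa0, 0x04, 0xa9, 0xf8 };

/* Standard 6502 instruction lengths for the opcodes involved. */
static size_t insn_length(uint8_t opcode) {
    switch (opcode) {
    case 0xa0: return 2; /* LDY immediate */
    case 0xa9: return 2; /* LDA immediate */
    case 0x2c: return 3; /* BIT absolute  */
    default:   return 1;
    }
}

/* Decode from `offset` until we reach the LDA opcode; return its
 * offset. From offset 0, the BIT at offset 2 swallows the LDY #$04
 * bytes as its operand. */
size_t find_lda(size_t offset) {
    while (code[offset] != 0xa9) {
        offset += insn_length(code[offset]);
    }
    return offset;
}
```

Both entry points decode their way to the same LDA #$f8 at offset 5, just with different values loaded into Y along the way.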

This occurs over a dozen times in Super Mario 1. Similarly, there are instances
where the program jumps into the middle of an instruction.

Yet another trick is pushing an address onto the stack and then using the
RTS instruction to jump there, kind of like a homebrew
JMP indirect instruction.
And even with indirect JMP instructions, the programmer
may choose to jump to RAM or somewhere other than a label.

These issues must be resolved if we want a playable game.
Sadly, the solution marks the final nail in the coffin of the integrity
of this project.

The solution is to embed an interpreter runtime in the generated binary:

Instead of identifying data that is read and only including that in the
generated module, include all the PRG ROM, since we don't know which
addresses may be accessed at runtime.

After every instruction, update the program counter variable.

Include a basic block in rom_start called Interpret
which reads the program counter variable, reads the op code from the PRG ROM,
performs the necessary operation, and then jumps to the DynJumpTable block.

Update the DynJumpTable block so that the default case jumps to the
Interpret block instead of panicking.

When control flow runs into data, branch to the Interpret block.
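The Interpret block amounts to a classic fetch-decode-execute loop over the embedded PRG ROM. A C sketch of a single step (names are illustrative; the real version is generated LLVM IR, and only a handful of op codes needed handlers):

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* The generated module keeps the entire PRG ROM plus the program
 * counter and registers as globals; when static control flow runs
 * into data, this fallback takes over. */
uint8_t prg_rom[0x8000]; /* PRG ROM is mapped at $8000-$ffff */
uint16_t pc;             /* program counter global */
uint8_t reg_a;           /* accumulator global */

void interpret_step(void) {
    uint8_t opcode = prg_rom[pc - 0x8000];
    switch (opcode) {
    case 0xa9: /* LDA immediate */
        reg_a = prg_rom[pc + 1 - 0x8000];
        pc += 2;
        break;
    /* ... handlers for the other op codes that turn out to be needed ... */
    default:
        fprintf(stderr, "unimplemented op code $%02x at $%04x\n", opcode, pc);
        exit(1);
    }
    /* After each step, control returns to the DynJumpTable block with
     * the new pc; if it matches a known label, recompiled code resumes. */
}
```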

Here's a diagram to help clarify:

This strategy ensures that NES games will run correctly,
at the cost of efficiency.
Normally, an optimization pass would eliminate something like updating the
program counter variable after every instruction, but in our case interrupts
thwart that.
Because we call rom_cycle after every instruction, and rom_cycle might in
turn call rom_start with an interrupt, all of the global variable state
must be correct at every instruction boundary.
This defeats the entire point of this project.
At this point we might as well emulate.
In fact, with the new Interpret block, we are doing just that.

Although, to be fair, I only had to add emulation support for 6 op codes to get
Super Mario Brothers 1 working.

Video: Playing Super Mario Brothers 1 on Native Machine Code

In this video I demonstrate my (poor) Super Mario 1 skills in a recompiled executable.
I also demonstrate the movie playback feature that was instrumental in debugging.

Unsolved Challenge: Mappers

At this point in the project we have Super Mario Brothers 1 running
mostly on native code, although not very highly optimized.
We've learned that static compilation, while possible, is rendered
pointless by some of the inherent challenges that emulating a system presents.
Thus there is no reason to solve the challenge that mappers provide.

The only thing I will say about mappers here is that they present
an additional layer of complexity for static disassembly, and they
make it blatantly obvious that Just-In-Time compilation is a better
technique than static recompilation. More on that in the conclusion.

Community Support

NES seems to be a common system for first-time emulator programmers.
As such, I was happy to find a large swath of documentation online
explaining in great detail how the NES works, emulator tutorials,
fascinating optimization articles,
the invaluable
Nesdev wiki,
and more.
Even so, there is nothing like asking a question and having a knowledgeable
person answer in real-time.
The folks in #nesdev on EFNet are fun, engaging, working on all kinds
of interesting projects, and helpful.
Thanks especially to Ulfalizer, sherlock_, Bisqwit, Movax12, and
scottferg for answering
my questions, even when they seemed stupid.
The Nesdev Forums are nice as well.

There were also instances where I asked for some help with Go -
how to write good code, the best way to implement certain things, and so on.
#go-nuts on Freenode is great for that. The channel is active - you'll
nearly always get an answer immediately.

As for my contributions back to the community, I did help scottferg
fix a bug
in his emulator, which fixed support for
Maniac Mansion.

I also filed an
LLVM feature request
asking for the ability to generate
comments in IR code. This would make it much, much easier to debug
generated IR code.

Conclusion

After completing this project, I believe that static recompilation
does not have a practical application for video game emulation.
It is thwarted by the inability to completely disassemble a game without
executing it, as well as by the fact that multiple systems execute in
parallel, possibly causing interrupts in the game code.
There is a constant struggle between correctness and optimized code.
Nearly all optimizations must be tossed out the window in the interest of
correctness.
Even more compromises would have to be made to start supporting advanced emulator
features such as saving state or rewinding.

A comparison could be made between a console game and an interpreted language
such as JavaScript.
There are amazingly fast JavaScript interpreters such as
V8, but they do not
work by statically compiling the script. Instead they use
Just-In-Time compilation, along with some advanced techniques, to achieve great speeds.
These techniques could be applied to JIT console game compilation.

For example, one such technique is to identify a section of code,
make some assumptions based on heuristics which allow for highly
optimized native code generation, and then detect if those assumptions
are broken.
If the assumptions are broken, the generated native code is tossed,
and emulation takes over.
However, if the assumptions are upheld, the recompiled block of code
will execute with blazing fast native speed.

This technique is much more suited to an emulator with a JIT core,
rather than trying to do everything statically, especially since
the emulator can "notice" at runtime which memory addresses are used
as instructions and which memory addresses are used as data.

Furthermore, distributing static executables that function as games
would be problematic as far as copyright infringement is concerned.
By keeping ROMs separate from the emulator executable, the emulator
can be distributed freely and easily without risking trouble.

That being said, I do feel like this project was worthwhile in that it
was intellectually stimulating and highly effective at teaching me,
and hopefully now you, how the Nintendo Entertainment System works and
how to use LLVM effectively, while introducing a wide range of
problems that compilers and interpreters face.