No registers?

What do I mean by "no registers"? Well, really just whatever makes the puzzle interesting, but the basic guideline is:

No instruction's observable behavior can depend on the contents of any ordinary user-space register.

So we can't read from R[ABCD]X, R[SD]I, R[SB]P (that's right, no stack), R8-R15, any of their smaller sub-registers, or any of the x87 / MMX / SSE registers. This forbids implicit register access like push or movsb as well as explicit operands. I think I would allow RIP-relative addressing, but it's probably not useful when you're building a single executable which loads at a fixed address.

We also can't use the condition flags in EFLAGS, so conditional jumps and moves are right out. Many instructions will set these flags, but those dead stores are okay by me.

All memory access depends on segment selectors, the page table base in CR3, and so on. We trust that the OS (Linux in my example) has set up a reasonable flat memory model, and we shouldn't try to modify that. Likewise there are debug registers, parts of EFLAGS (such as the trap bit), and numerous MSRs which can influence the execution of nearly any instruction. We ignore all that too. Basically, the parts of CPU state which normal user-space code doesn't touch are treated as constants.

So what's left that we can work with? Just

- the instruction pointer,
- memory operands, and
- self-modifying code.

But it would be too easy to self-modify an instruction into having a register operand. The above restrictions must hold for every instruction we execute, not just those appearing in our binary. Later on I'll demonstrate experimentally that we aren't cheating.

The instruction set

In a RISC architecture, every memory access is a register load or store, and our task would be completely impossible. But x86 does not have this property. For example we can store a constant directly into memory. Here's machine code along with NASM (Intel syntax) assembly:
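A representative encoding (the specific address and byte value here are illustrative):

```nasm
c6042500004000ba        mov byte [0x400000], 0xba
```

Note that the whole instruction is immediates and opcode bytes: the destination address and the stored value are both baked into the encoding, so no register is consulted.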

But moving data around is going to be hard. As far as I know, every instruction which loads from one address and stores to another, for example movsb, depends on registers in some way.

Conditional control flow is possible thanks to this gem of an instruction:

ff242500004000 jmp qword [0x400000]

This jumps to whatever address is stored as a 64-bit quantity at address 0x400000. This seems weird but it's really just a load where the destination register is the instruction pointer. Many RISC architectures also allow this.

Compiling from Brainfuck

Let's get more concrete and talk about compiling Brainfuck code to this subset of x86. Brainfuck isn't the simplest language out there (try Subleq) but it's pretty familiar as an imperative, structured-control language. So I think compiling from Brainfuck makes this feel "more real" than compiling from something super weird.

A Brainfuck program executes on a linear tape of (typically) byte-size cells.

As in many Brainfuck implementations, the tape has a fixed size (more on this later) and we start in the middle. head is not a variable with a memory location; it's just an assembler constant for the address of the middle of the tape.

Since our only way to read memory is jmp [addr], the tape must store pointers to code. We create 256 short routines, each representing one of the values a cell can hold.
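A tape of this shape could be declared in NASM along these lines (the size matches the text; the exact layout is my guess, and cell_0 is the routine for value zero):

```nasm
TAPESIZE equ 30000

align 8
tape:   times TAPESIZE dq cell_0        ; every cell starts out at value 0
head    equ tape + 8 * (TAPESIZE / 2)   ; assembler constant: middle of the tape
```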

There are two things we need to do with a cell: get its byte value for output, and test whether it's zero. So each routine moves a byte into out_byte and jumps to the address stored at either cont_zero or cont_nonzero.

We produce most of the routines using a NASM macro. We also have functions to handle underflow and overflow, so a cell which would reach -1 or 256 is bumped back to 0 or 255. (We could implement the more typical wrap-around behavior with somewhat more code.)
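A sketch of what that macro might look like (an illustrative reconstruction, not the post's exact code; out_byte, cont_zero, and cont_nonzero are the variables described above):

```nasm
%macro cell 1
align 16
cell_%1:
    mov byte [out_byte], %1     ; expose this cell's value for output
%if %1 == 0
    jmp qword [cont_zero]
%else
    jmp qword [cont_nonzero]
%endif
%endmacro

%assign i 0
%rep 256
    cell i
%assign i i+1
%endrep
; (guard routines placed just before cell_0 and just after cell_255
;  would clamp underflow and overflow back to 0 and 255)
```

Each routine is at most 15 bytes of code, so it fits in its 16-byte slot.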

The routines are aligned on 16-byte boundaries so that we can implement Brainfuck's + and - by adding or subtracting 16. But how do we know where the head is? We can't store it in a simple memory variable because we'd need a double-indirect jump instruction. This is where the self-modifying code comes in.

Recall that head is an assembler constant for the middle of the tape. So inc_cell etc. will only touch the exact middle of the tape — except that we modify the instructions when we move left or right. The address operand starts at byte 3 or 4 of the instruction (check the disassembly!) and we change it by 8, the size of a function pointer.

Also note that inc_cell and dec_cell jump to test_cell in order to handle overflow / underflow. By contrast the move instructions don't test the current cell and just jump to [cont_zero] unconditionally.
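These routines might be sketched as follows (hypothetical reconstruction; the offsets 3 and 4 come from the encodings of `jmp qword [abs]` and `add qword [abs], imm8`, matching the "byte 3 or 4" remark above):

```nasm
test_cell:
    jmp qword [head]                ; dispatch on the current cell's value

inc_cell:
    add qword [head], 16            ; point the cell at the next routine
    jmp test_cell                   ; re-dispatch, catching overflow

dec_cell:
    sub qword [head], 16
    jmp test_cell                   ; re-dispatch, catching underflow

move_right:                         ; patch every instruction that names head
    add dword [test_cell + 3], 8    ; address operand at byte 3 of jmp [mem]
    add dword [inc_cell + 4], 8     ; at byte 4 of add/sub qword [mem], imm8
    add dword [dec_cell + 4], 8
    jmp qword [cont_zero]           ; continue unconditionally
```

move_left is the same with `sub` in place of `add` on the patched operands.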

To output a byte we perform the system call write(1, &out_byte, 1). There's no escaping the fact that the Linux system call ABI uses registers, so I allow them here. We can do arbitrary computation without output; it's just nice if we can see the results. Input is messier still but it's not fundamentally different from what we've seen here. Code that self-modifies by calling read() is clearly the future of computing.
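With registers permitted in this one routine, the output path is an ordinary Linux syscall sequence, roughly:

```nasm
out_cell:
    mov eax, 1              ; SYS_write
    mov edi, 1              ; fd 1 = stdout
    mov esi, out_byte       ; buffer = address of out_byte
    mov edx, 1              ; length = one byte
    syscall
    jmp qword [cont_zero]   ; back to register-free code
```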

Putting it all together, I wrote a small Brainfuck compiler which does little more than match brackets. For each Brainfuck instruction it outputs one line of assembly, a call to a NASM macro which will load cont_[non]zero and jump to one of test_cell, inc_cell, etc. For the program [+] the compiler's output looks like
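I don't have the compiler's exact output at hand, but for [+] it would be a handful of macro invocations of roughly this shape (the macro and label names are hypothetical; each macro takes the zero and nonzero continuations as arguments):

```nasm
_start:
bf0:    BF_LOOP_BEGIN bf3, bf1      ; "[" : cell == 0 -> jump past "]"
bf1:    BF_INC        bf2, bf2      ; "+" : increment, same continuation either way
bf2:    BF_LOOP_END   bf3, bf1      ; "]" : cell != 0 -> back to loop body
bf3:    BF_EXIT
```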

And now let's try a Brainfuck interpreter written in Brainfuck. There are several, but we will choose the fastest one (by Clive Gifford), which is also compatible with our handling of end-of-file and cell overflow.

Verifying it with ptrace

How can we verify that a program doesn't use registers? There's no CPU flag to disable registers, but setting them to zero after each instruction is close enough. Linux's ptrace system call allows us to manipulate the state of a target process.

We allow registers in the first part of the program, which is responsible for system calls. regs_boundary is set by looking for the symbol FORBID_REGS in the binary.

We run the target using PTRACE_SINGLESTEP, which sets the trap flag so that the CPU will raise a debug exception after one instruction. Linux handles this exception, suspends the traced process, and wakes up the tracer, which was blocked on waitpid.

while (1) {
    // For this demo it's simpler if we don't deliver signals to
    // the child, so the last argument to ptrace() is zero.
    CHECK(ptrace(PTRACE_SINGLESTEP, child, 0, 0));
    CHECK(waitpid(child, &status, 0));
    if (WIFEXITED(status))
        finish(WEXITSTATUS(status));
    inst_count++;
    clobber_regs(child);
}

Notes on universality

Our tape has a fixed size of 30,000 cells, the same as Urban Müller's original Brainfuck compiler. A system with a finite amount of state can't really be Turing-complete. But x86 itself also has a limit on addressable memory. So does C, because sizeof(void *) is finite. These systems are Turing-complete when you add an external tape using I/O, but so is a finite-state machine!

So while x86 isn't really Turing-complete, with or without registers, I think the above construction "feels like" arbitrary computation enough to meet the informal definition of "Turing-complete" commonly used by programmers, for example in the mov is Turing-complete paper. If you know of a way to formalize this idea, do let me know (I'm more likely to notice tweets @miuaf than comments here).

4 comments:

I think Turing-completeness in the vulgar or technical sense implies that no more than polynomially much memory (in the input length) is allocated or used as program space. This is both because of practical constraints (it does not seem feasible) and theoretical considerations: with arbitrary program size, compilers would spend arbitrary time; every conceivable function can be encoded as a table (though the problem is still enumerating that table); such a table might be quite long; and there appears to be nothing to learn about programming this way.