I'm not sure if this might be overkill--it's a bit lengthy, especially
considering it's "undebugged". Still, I was a bit surprised at the
results, and others might be as well. Also, I haven't seen anyone else
explicitly address the question of the cost of using machine registers for
language registers.

-- begin article --
[In the old compilers archive there were some estimates by David Keppel and
Robert Firth of the cost of doing threaded interpreted code. Since there
was a recent question about optimizing for an interpreted stack machine,
I figured I'd add the following estimates, which are a bit more thorough
on one level. This is a second draft of something I had sketched out for
my own use a few weeks before the question was raised.]

--

This article was intended to be a quick sketch of the (dis)advantages of
trying to add registers to a stack machine. It's basically off the top of
my head; I've only ever implemented one of the kinds of interpreters
described, and it wasn't even optimized. It is quite likely that there
are mistakes, and that I have made some assumptions that will often not be
valid, or that I have chosen some very unrepresentative examples.
Nevertheless, it tries to be thorough, and it rather failed to be a "quick
sketch". (Importantly, it ignores the storage advantage of byte-coding
over direct threading, although it tries to capture the caching
advantage.)

--

Deciding what optimizations are useful for interpreted languages
depends on how exactly the language is designed, and thus many things that
are critical for performance have little to do with compiling.
Furthermore, most compiler-level optimizations are fighting with the
interpreter overhead. If your interpreter runs at 10% of the speed of
machine code, a strength reduction from multiplication to addition will
not be anywhere near as large a win as it is in machine language. At 50%
speed, many normal optimization techniques can pay off. Most importantly,
any optimizations which decrease the number of instructions executed will
probably be a win--motion of loop-invariant code is probably the most
important one.

In this post, I'll look at five straightforward ways of implementing
an interpreter, and try to estimate the costs of such an implementation.
I do this with very explicit calculations, and mostly explicit
assumptions; nevertheless, it may be possible to tune the code that I'm
implicitly using a bit better, and some of my assumptions may be false.
The best way to decide which method is the best on a given machine is to
write all of them and find out. You probably won't write compilers to
generate code for all of them, so be careful to hand-code examples that
you think an optimizing compiler would be capable of. To a certain
extent, I tend to assume the host machine is a RISC-- you should watch out
for this assumption if it's not valid for you.

The two most familiar machine models are register-based and
stack-based. Most interpreted languages follow one of these machine
models. However, it is important to remember that what hardware can
efficiently decode and operate on in parallel, software cannot.

For a register-oriented interpreter language, it may be possible to
store some or all of the registers in the host machine's registers. This
would obviously appear to be a performance enhancement, if it can be
pulled off.

In a stack machine, operands are normally stored on the stack. The
language FORTH is a famous stack based interpreted language. However, it
is not a particularly good one to compile to. Generally, the compiler
would want to keep common subexpressions on the stack. The instructions
'dup' and 'pick' allow the compiler to bring a common expression to the
top of the stack at the right time, but there is no way to delete the
entry further down the stack. (Actually, the general rotate instruction
provides a way, but in practical implementations will not be anywhere near
as efficient as simply storing the temporary values in memory.) In
general, the elements on the stack get stored to memory as new elements
are pushed on, and so keeping common subexpressions on the stack may be
less effective for a compiler than simply storing them to temporary
locations. One could get some speedup by making those temporary locations
host machine registers, in theory. One could also attempt to cache some
of the top elements of the stack in machine registers. The problem with
this is that it is infeasible in software to have the size of the cache
change on the fly, and therefore any instruction which changes the size of
the stack still has to read from or write to memory, as well as shifting
things through the stack cache. Thus this will often not be a win, beyond
caching the top of stack.
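To make the top-of-stack caching concrete, here is a minimal sketch of my own (not code from this article) of an interpreter inner loop that keeps the top of stack in a local variable, which a C compiler will normally hold in a machine register. The opcode names and the layout of the code as an array of longs are assumptions for illustration.

```c
#include <assert.h>

enum op { OP_PUSH, OP_ADD, OP_HALT };

long run(const long *code)
{
    long stack[64];
    long *sp = stack;   /* first free slot below the cached top */
    long tos = 0;       /* top-of-stack cache                   */

    for (;;) {
        switch ((enum op)*code++) {
        case OP_PUSH:
            *sp++ = tos;    /* growing the stack still writes memory */
            tos = *code++;
            break;
        case OP_ADD:
            tos += *--sp;   /* one read, no write: result stays cached */
            break;
        case OP_HALT:
            return tos;
        }
    }
}
```

Note how OP_ADD touches memory only once; without the cache it would need two reads and a write, which is exactly the saving estimated later in this article.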

Given that these are our two basic models, we can try to estimate the
performance advantages to using registers for variable and temporary
variable storage.

We will ignore the cost of execution of particular interpreted language
instructions, and rather only look at the overhead due to various
implementation techniques.

The actual costs of various techniques will depend on the host machine's
characteristics; thus, we will use symbolic timing values to represent the
costs of the various methods. Arbitrarily, I will assume a 32-bit
wordsize, rather than propagating this symbolically. This will allow the
values to be a bit more intuitive.

Let I be the average cost (including cache performance) of fetching an
instruction. D is the cost of a memory reference, assuming good locality
(for example when accessing the interpreter's stack or temporary value
storage). E is the cost of a memory reference to get an interpreted code
instruction; E may be larger than D due to worse cache behavior; it will
probably never be less than D. Let A be the cost of a basic arithmetic
operation. Let B be the cost of a branch to a computed address. Let W be
the cost of writing to a memory location (one in the cache).

First, let's calculate general overhead due to interpretation.
Instruction fetch (in the interpreted machine) may proceed in one of three
ways: 32-bit word read, 8-bit byte read, or 32-bit word read once every 4
instructions, plus extra work to extract each instruction. In each case,
a pointer must be incremented for every instruction.

On some machines, the pointer increment may be free; we'll assume a
RISC-like machine and count that as an A.

Note that for pipelined machines, the cost of fetching the instruction
generally disappears into the costs of operation, and so we can set I to
0. For a RISC machine with very poor memory behavior, the last method may
do well; in many other cases, though, the prior two will.

Next, we look at the cost of decoding the instruction. There are two
general ways to do this: either a jump table, or the instruction fetched
above is actually the address to branch to (direct threading). Clearly,
in the latter case, 8 bits will not be sufficient (although 16 may well
be, another case that I don't think I want to try to measure explicitly).
Unfortunately, the latter case is difficult to code in many high level
languages.
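A direct-threading sketch may help here: each cell of the "compiled" code holds the host address of its handler, so dispatch is one load and one indirect branch, with no table lookup. This relies on the computed-goto extension (&&label, goto *) of GCC and Clang; it is not ISO C, which is one reason the method is hard to express in many high-level languages. The tiny program and names are illustrative.

```c
#include <assert.h>

long run_threaded(long a, long b)
{
    /* "Compiled" program: acc = (a + b) * 2.  Each cell is a handler
     * address, so there is no decode step at all. */
    void *code[] = { &&op_add, &&op_double, &&op_halt };
    void **ip = code;
    long acc = a;

    goto *(*ip++);                    /* dispatch first instruction */
op_add:    acc += b;  goto *(*ip++);  /* next instruction: load + branch */
op_double: acc *= 2;  goto *(*ip++);
op_halt:   return acc;
}
```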

If a jump table is used, clearly the jump table is not going to have 2^32
entries. Normally, a 32-bit instruction has an "operator" field that
would be decoded with a switch, and the registers will be decoded in
another way; to decode for the switch, we'll need to mask out our
"instruction" bits. (We'll assume they don't need shifting, but that
other fields will.)

If data references to memory are all you count, the first method has the
fewest. However, if E is significantly larger than D, then the fourth
method may do better than the first. Interpreters implemented in
higher-level languages may not be able to use the first method, either.
However, the fourth has the largest number of instructions, and will only
be faster than those above it if arithmetic instructions are significantly
faster than memory references. Finally, the second method still has other
information available in the instruction word that has not yet been used,
and this means that in general it may be able to get more work done than
the others. That is, it'll have to execute fewer interpreted
instructions, somewhere between 2-4 times fewer. To give it the benefit
of the doubt, we'll assume 4 times fewer, but we still have to factor in
the further decoding costs.

Now that we've completed instruction fetch and decode, we want to look at
the costs of getting at our operands. We now need to explicitly state
which machine models we're considering:

There's another kind of machine that I'm not going to examine, because
it's hard to generalize, but it may be the "right" method-- a "powerful
register machine". Here, the operands are in "registers", but the kinds
of operations available are much more complex than those found in general
purpose processors--this will help amortize the cost of instruction decode
better.

Anyway, in each case with registers, we can either put the registers in
machine registers, or in memory locations (or some combination, but we'll
ignore this for now.) Let's indicate that registers are in machine
registers by calling this implementation "fast", or "F" in the acronym.
So we have five kinds of machines:

RM, SM, SRM, FRM, FSRM

The RM and the FRM need instruction values longer than 8 bits. The
three stack machines can use any of the other three instruction
fetch/decode methods, but that choice will not affect the choice of
implementation method here. Since the full 32-bit instruction was fastest
for instruction fetch and decode, we'll look at the register machines
first.

RM

In our register machine, our "registers" are stored in memory. Thus, to
get at our operands requires memory references. Typically, we need two
reads and a write--two input operands and an output operator. More
complex instructions are possible, but we'll use this as a simple example
for now. We need a shift and a mask to decode an operand, so the operand
access costs are two arithmetic and a memory operation each, or 3I + 2A +
D or W. With three operands, this totals to 9I + 6A + 2D + W. We can now
add this into the cost of our instruction fetch, giving us a total cost of
14I + 8A + B + 3D + E + W.
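A sketch of the handler whose cost is being tallied: the "registers" live in a memory array, and each operand needs a shift, a mask, and a memory access. The field layout here (8-bit opcode in the low byte, then three 8-bit register fields) is an assumption made purely for illustration.

```c
#include <assert.h>
#include <stdint.h>

long regs[256];   /* the interpreted machine's registers, in memory */

void exec_add3(uint32_t insn)
{
    long s1 = regs[(insn >> 16) & 0xff];  /* shift + mask + read  */
    long s2 = regs[(insn >> 24) & 0xff];  /* shift + mask + read  */
    regs[(insn >> 8) & 0xff] = s1 + s2;   /* shift + mask + write */
}
```

The three operand accesses above are the 2A + D (or W) each that the estimate counts, two reads and a write in total.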

FRM

Here we have a problem. We have a field in our "instruction", and we need
to access a machine register based on this field. We are probably
obligated to use yet another jump table for each operand (unless the
machine we're on allows indexing into registers). If we stick to two
operand instructions, we can save a little. Decoding each will cost 4I +
2A + B + D, or 8I + 4A + 2B + 2D total. This requires a total cost of 13I
+ 6A + 3B + 3D. Compared to the above, we haven't actually saved any
memory reads--just a write, and we've had to do three more computed
branches. For a fairer comparison, we should allow three operands: this
costs 12I + 6A + 3B + 3D, or a total of 17I + 8A + 4B + 4D.

The advantage that the stack machines have is that they can have one
instruction decode operation tell them everything they need to know about
the instruction. Could we do something like this for registers? If we
only had four registers, for a two operand instruction, there'd be 12 or
16 possible variations, depending on whether we allow the same register
for both. Requiring sixteen copies of all the common instructions will
balloon code size quite a bit--plus, four registers is not really enough
to be useful. Furthermore, this squishes instruction space quite a
lot--an 8-bit instruction word would hardly be sufficient. One
possibility would be to have more registers, but restrict which operations
each can be used for. Of course, this will make compilation extremely
difficult.
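The "one decode tells everything" idea can be sketched like this: with the interpreter's registers in host variables, each (dst, src) pair gets its own opcode, so operand access needs no decoding at all. Only two of the sixteen ADD variants a 4-register machine would need are spelled out; the rest are the code-size blowup described above. All names are illustrative.

```c
#include <assert.h>

enum op { ADD_R0_R1, ADD_R1_R0, HALT_R0 };

long run_specialized(const enum op *code, long a, long b)
{
    register long r0 = a, r1 = b;   /* interpreter registers */

    for (;;) {
        switch (*code++) {
        case ADD_R0_R1: r0 += r1; break;   /* no operand decode */
        case ADD_R1_R0: r1 += r0; break;
        case HALT_R0:   return r0;
        }
    }
}
```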

SM

With a stack machine, the operand values are found on the stack. Without
any registers, even for local variables, the compiler has to generate
multiple instructions to get values to or from the stack--a memory address
and a pop, for instance. Because of this, we'll ignore this case and just
look at the decidedly advantaged register-based stack machines.

SRM

With a register-based stack machine, there are a number of basic stack
operations that read from and write to a register pool. The exact address
is implicit in the instruction; probably the register pool is stored on a
"return stack", so that they do not need to be saved across function
calls. The return stack pointer will be in a machine register, so the
cost of reading or writing an operand is just D or W, plus the cost of
adjusting the stack, which will be a stack pointer move, and a read or
write to memory (regardless of whether the top of stack is cached). Thus,
it costs 3I + D + A + W to push, and 3I + W + A + D to pop--i.e. 3I + D +
A + W regardless.
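As a sketch, the SRM's register pool can be a frame on the return stack addressed off a host-register frame pointer, so a push-from-register is one pool read plus one data-stack write, with a pointer adjustment: the D + A + W counted above. Names and sizes here are illustrative.

```c
#include <assert.h>

long dstack[64], *sp = dstack;   /* data stack; sp = first free slot */
long rpool[16],  *rp = rpool;    /* current register frame            */

void push_reg(int n) { *sp++ = rp[n]; }   /* read pool, write stack */
void pop_reg(int n)  { rp[n] = *--sp; }   /* read stack, write pool */
```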

FSRM

The cost here is like the above, in that the stack has to be updated
still, but the cost of reading or writing memory is removed. Thus, the
cost here is 2I + W + A for a push, and 2I + D + A for a pop.

Now, we're going to look at two sample bits of code and see what the cost
of executing them is on each of the above machines. Admittedly, my choice
of code is probably a bit weighted against the register machines, but I'd
like to keep the examples interesting. (For example, I don't look at the
cost of accessing a constant. The register machine may have already done
all the work of loading the constant, whereas the stack machine still has
to fetch it from memory.)

First we want code to do this:

e = (a + b) * (c + d);

on each machine. We can ignore the underlying implementation while
writing the code. Assume all are local variables, and are stored in
registers if possible, and as many temp registers as needed are available,
and assume c is not used later.

3-op register machine:
ADD F, A, B
ADD C, C, D
MULT E, F, C

2-op register machine:
LD E, A
ADD E, E, B
ADD C, C, D
MULT E, E, C

Stack machine:
PUSH A
PUSH B
ADD
PUSH C
PUSH D
ADD
MULT
POP E

We are going to ignore the costs of the addition and multiplies
themselves, which will be the same regardless of the implementation
(although they would be important if we wanted to estimate precise
relative speeds).

Surprisingly, if memory writes are reasonably fast, it may be more
efficient (or at least as efficient) to store registers in memory. Since
we can have larger register pools in memory, and they may not need to be
saved across procedure calls (using software register windows, basically),
the "fast" register machines may actually be slower than the memory-based
ones.

For the stack machine, we have 4 operand fetches, one operand store, and
three "operators." Remember our original values for the cost of
instruction fetch&decode were scaled up by a factor of 4; conveniently,
the above example is exactly 8 instructions. The cost of instruction
fetch&decode will be:

Each of our three operators gets two values, creating a third. They need
to read their values off the stack, write to the stack, and shorten the
stack by one. If the top of the stack is cached in a register, this means
one memory reference for each operation, plus the instruction computing it
can explicitly read from and store to the top of stack register; however,
each operand fetch and store needs an extra instruction to transfer
properly through the top of stack cache. Thus, with a cache, each
operator costs (again, ignoring the cost of actually computing it) I + D,
for a total of 3I + 3D, plus the extra five instructions for manipulating
the cache, giving 8I + 3D. (For comparison, without the cache, each
operator must read two values and write one, costing 3I + 2D + 1W
(assuming a RISC-like machine, though), or, for all three, 9I + 6D + 3W;
this is why we use the cache! On the other hand, if we want to throw away
the top of stack unseen, the cache hurts.) We still have to figure the
costs of accessing our operands. We have four reads, and one write. For
the SRM, each is 3I + D + A + W. For the FSRM, the reads are 2I + W + A,
and the write is 2I + D + A. Then the total cost for operands for the SRM
is 15I + 5D + 5A + 5W, while the FSRM costs 10I + 5A + 4W + D. Adding in
the operator costs:

SRM: 23I + 8D + 5A + 5W
FSRM: 18I + 4D + 5A + 4W

Finally, we need to add the instruction fetch&decode back on. Let 1 be
the "32-bit explicitly addressed", 2 be the "8-bit byte read, jump table",
and 3 be the "32-bit read, 4 8-bit instructions" (plus, for comparison we
copy the register machine costs below them):

That's a lot of numbers, so it's a bit hard to get a feel for. If you
only look at the 'I' coefficient, that's the number of instructions. If
we assume a RISCish machine, and let I be one cycle, then A is 0, B
depends on the pipeline depth, but it should be possible to fill the delay
slot with the increment pc operation; we'll assume our cache is very good,
and D disappears into I, W disappears into I, and most of E does, except
perhaps .1 or .2 cycles. Leaving E and B symbolic, the above becomes:

Unless E is at least 3 + 3B, the 3-op RM is faster than the other two
FRMs. For this RISCish machine, the stack fetch/decode method of 3 is
always worse than 2, and method 1 will be the best as long as 8E < 2E + 8,
or E < 1.3. While this might not apply on some machines with very slow
memories, generally this will be true. However, the implementation method
may not be available. Thus, we can narrow our comparison down to just
FSRM 1, FSRM 2, and 3-op RM, where, again, we remind you that FSRM 1 is
the stack machine with registers in registers, with explicit branch
addresses fetched from the instruction stream, while FSRM 2 is byte coded,
branching through a jump table; and the 3-op RM stores its registers in
memory. The symbolic costs for these three, again, are:

Comparing the first and the last, we see they both read 12 items from
memory, but the RM reads more values from better-cached addresses. The RM
writes one fewer value, and makes fewer branches. It is easy to see why.
FSRM 1 has much higher bandwidth requirements to read its
instruction stream, since it reads explicit branch addresses. It writes
one extra value because it has to write its operands onto the stack. The
register machine doesn't branch to get at its operands; instead it does
extra decoding and fetches them directly from memory; hence the extra
arithmetic costs. FSRM 2 has many more data accesses because it uses an
explicit read to get each byte-coded instruction; with 8 instructions, and
the stack needing touching each time, we would expect ~16 data accesses.
In FSRM 1, this is *exactly* how many we have. FSRM 2 has eight more;
this is from jump-table lookup overhead. Clearly noticeable on all three
is the number of words of instructions read; this is the E coefficient.

The second sample bit of code we're not even going to spell out. It is a
procedure call. The only point of mentioning it is that registers stored
in machine registers will need saving. This will not be executed by
writing an explicit save instruction for each register, but instead with
some sort of register save mask, to reduce instruction overhead costs. We
might just save a whole block of registers, although this is difficult
from a higher level language.

The cost of this for an interpreter implementation with registers in
registers is one or zero extra interpreter instructions, plus the memory
cost of saving the registers. If m out of n registers are live, then
the costs of saving will be whichever is cheaper: n(I + W), or n(I) + m(I
+ W), where the actual cost of decoding which registers to save has
mysteriously disappeared.
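A register-save-mask sketch: one interpreted call instruction carries a bitmask of the live registers, so deciding what to save costs a host test per register rather than an interpreted instruction per register (this loop is the decoding cost that "mysteriously disappeared" above). Layout and names are illustrative.

```c
#include <assert.h>

#define NREGS 8
long iregs[NREGS];       /* interpreter registers */
long save_area[NREGS];   /* callee's save frame   */

int save_live(unsigned mask)   /* returns number of registers saved */
{
    int saved = 0;
    for (int i = 0; i < NREGS; i++)
        if (mask & (1u << i))             /* the m-of-n liveness test */
            save_area[saved++] = iregs[i];
    return saved;
}
```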

If people are writing extremely modular code, or object-oriented code with
a lot of short methods (like accessing methods), then there will be a
major advantage to not keeping registers in machine registers (given that
there wasn't a humongous win from this to begin with). As such, we simply
throw out all FRM and FSRM, leaving us only SRM and RM. RM is a little
faster under the RISCish assumptions; the general symbolic costs for the
first example again are:

Another possible machine design that I did not consider so far is a stack
machine with the top of stack cached, plus a special accumulator register.
We can directly push to the accumulator, doubling our push instruction
space, plus we double the operators to take as input, instead of the top
two stack items, the top of stack and the accumulator. The advantage
here is that when we compute values in our expression tree where either
operand is a leaf, we don't generate extra memory traffic pushing down our
stack. In a tree with only binary operators, this will always be at least
half the operations. A slightly more powerful model would allow the
computed value to go either onto the top of stack or the register, either
tripling or quadrupling our operator code size; computations left in the
accumulator would not automatically disturb the top of stack, and we'd
have instructions to push into the top of stack destructively. This would
even allow computation of our sample hunk of code with almost no stack
traffic. If we had a value on top that we knew we could destroy, we could
do

This makes FSARM 1 somewhat advantageous, unless the branches hurt a lot,
and brings SARM 1 into competition with the 3-op RM if we assume procedure
calls are frequent. However, it may be difficult to compile very well for
this machine. You should look at how deep your expression trees get in
your intermediate code to decide if this method is worthwhile.
Furthermore, for the full example above, there are quite a few versions of
each operator, roughly [using '+' as a typical operator]:
TOS = TOS + acc
TOS = TOS + pop()
acc = TOS + acc
acc = pop() + acc
[I'm abusing the meaning of pop()--the first time, it means the
second item on the stack; the second time, it means the top of stack.]

In general, designing a stack-based instruction set to allow the compiler
to explicitly schedule top-of-stack overwrites and implicit discards of
top-of-stack may save some overhead, while taking up more instruction
space--requiring extra instructions will likely not be a win.

Of course, the most important thing to do to optimize your interpreted
language design is to reduce your instruction fetch and decode and operand
access overhead. If certain instruction combinations are extremely
common, it may pay to make single instructions to perform them. Some
examples might, in theory, be: 3 operand add, (i.e. a = a + b + c), memory
indirection (a = **b, or in FORTH [ @ @ ]), etc. Procedure/function calls
should be big complex instructions in most cases, because of the
interpreter overhead. Better still is to shift the scale at which your
primitives work.
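A superinstruction sketch may make the saving visible: fusing a common sequence (here the 3-operand add, a = a + b + c, mentioned above) into one opcode saves a full fetch/decode/dispatch round trip per use. The dispatch counter and opcode names are illustrative.

```c
#include <assert.h>

enum sop { S_PUSH, S_ADD, S_ADD3, S_HALT };

long run_counted(const long *code, int *dispatches)
{
    long stk[16], *sp = stk;

    for (;;) {
        ++*dispatches;   /* count fetch/decode/dispatch round trips */
        switch ((enum sop)*code++) {
        case S_PUSH: *sp++ = *code++;                    break;
        case S_ADD:  sp[-2] += sp[-1];          sp -= 1; break;
        case S_ADD3: sp[-3] += sp[-2] + sp[-1]; sp -= 2; break; /* fused */
        case S_HALT: return sp[-1];
        }
    }
}
```

Computing 1 + 2 + 3 with two S_ADDs takes six dispatches; with one S_ADD3 it takes five, for the same result.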

If you go with a stack machine, the next step is to determine whether it's
more efficient to keep temporaries on the stack or in "registers".
One-time temporaries should go on the stack (or in the accumulator in the
SARM machine above), since the implicit access to temporaries is the main
instruction bandwidth savings for the stack machine. But you need to
measure the costs of pushing the value on the stack and bringing it back
up later against the costs of storing to a register and pushing it later. This
will depend on the specific implementation. (On a FSARM, this could even
end up merely transferring from a machine register representing an
interpreter register to the FSARM's accumulator register, another machine
register.) Getting rid of the extra temporary values down on the stack
may just require leaving them there until they happen to rise up to the
top--in which case it's critical not to leave them in between two values
you want to operate on directly.