Compilation into a virtual machine, and interpret the virtual machine code.

If you are interested in performance, the virtual machine approach is
the way to go (because fetching and decoding is simpler, and therefore
faster). If you are not interested in performance (yet), you still may
want to consider the virtual machine approach, because it often is
as simple as the others.

Threaded code, in its original meaning [bell73], is one of the techniques for implementing
virtual machine interpreters. Nowadays, at least the Forth community
uses the term threading for almost any technique used for
implementing Forth's virtual machine.

Let's look at a piece of straight-line code consisting of the virtual
machine instructions A, B and C. We could write machine-level
subroutines Ar, Br and Cr for performing the action of the virtual
machine instructions. Then we could write the following machine code
to execute these virtual machine instructions:

call Ar
call Br
call Cr

This is known as subroutine-threaded code, although it is not
threaded code in the original sense; in fact, subroutine threading is
not even an interpretive technique.

Now, let's eliminate the calls:

Ar
Br
Cr

i.e., a sequence of code addresses that represent the virtual
machine instructions.

As a consequence, we cannot execute this piece of code by jumping
to its start. We also have to keep track of the pointer to the current
instruction in a separate register (instead of using the processor's
program counter register and return address stack/register).

How do we execute the next instruction? Let's assume that the
instruction pointer register (ip) always points to the word
in the code sequence that directly follows the current instruction
word. Then we just have to load the word, jump to the routine pointed
to by it, and increment the instruction pointer. E.g., in MIPS
assembly language:

This routine is the interpreter (aka inner interpreter in the
Forth community). It is also known as NEXT routine. Either every
virtual machine instruction routine has a copy of NEXT at the end, or
they share one copy of NEXT and jump to it. With modern processors,
the shared NEXT not only costs a jump, but also dramatically increases
the misprediction rate of the indirect jump on CPUs with BTBs and
similar indirect branch predictors [ertl&gregg03jilp]. Therefore, I recommend not to
share NEXT.

The method described above is the one described in [bell73] as threaded code and has later
been called direct threaded code.

Indirect Threaded Code

Let's consider constants: They can be represented by the virtual
machine instruction lit (for literal), followed
by the value of the constant. If the constant occurs frequently, it is
more space-efficient to have a virtual machine instruction for this
constant. If we have several constants, the code for their virtual
machine instructions will look very similar. So, we may want to share
the code, while having separate data.

We can do this by adding a level of indirection in NEXT (resulting
in indirect threaded code. Each word (a generalization of the
virtual machine instruction) has a code field and a parameter
field. E.g., in the case of the constant:

The code for the next instruction can compute the parameter-field
address from the code-field address in $2.

Forth and Direct Threading

Traditionally Forth is implemented using indirect
threading. Therefore, direct threaded Forth implementations have much
in common with indirect threaded implementations: Non-primitives have
a code field, but it now contains a jump to the code instead of its
address. On most processors this jump costs more time than the
additional load of indirect threading, so direct threading only pays
off when primitives are executed. The resulting speedup is 2%-8% on
the 486.

(Note: On the Pentium, the K5 and the K6(-2) it is extremely expensive to mix code and data, so
this form of direct threading is much slower on this processor than
indirect threading.)

Token Threaded Code

Direct threaded code cannot be simply transported from one machine to
another, because it contains code addresses, which vary from machine
to machine. Token threaded code uses a fixed virtual machine
instruction encoding, allowing code portability at the price of a
table lookup (that maps the intruction token to its code address) in
each NEXT. Indirect threading and token threading are orthogonal, so
they can be combined into indirect token threading (with an even
slower NEXT).

Other Terms

Switch Threading

A method used for implementing virtual machine
interpreters in languages like C. See below.

A technique used for the 8086
architecture. The code consists of a sequence of segments instead of a
sequence of code addresses. This allows using the whole address space
of the 8086 for threaded code with sixteen-bit "pointers", but
requires 16-byte alignment for all words.

Native Code

This is not a threading technique. This is the
term used in the Forth community and elsewhere for implementations
that generate machine code instead of some interpreted code. Simple
native code systems start with subroutine-threaded code, then inline and peephole-optimize
some subroutines. Related terms are true compiler,
just-in-time compiler etc.

Byte code

Each virtual machine instruction is represented by
one byte. This can be seen as a variant of token threading.

Many languages don't offer a way to produce indirect jumps, so they
cannot be used for implementing direct or indirect threaded code. This
section presents the available options. You can find more on this
topic in [ertl95pldi] and [ertl93].

Here I present variants corresponding to direct threading. Add an
indirection in NEXT to get the indirect threaded version.

This is one of GNU C's extensions to standard C, and it is currently
the most portable method for implementing threaded code. A similar
feature is FORTRAN's assigned goto (not the computed
goto, which is more similar to C's switch
statement).

In GNU C, a direct threaded NEXT looks like this:

typedef void *Inst;
Inst *ip; /* you should use a local variable for this */
#define NEXT goto **ip++

Continuation-passing style

In continuation-passing
style, all calls are tail-calls, and can be tail-call-optimized
into jumps. Thus, we could implement the indirect jumps of threaded
code as indirect tail-calls by writing the interpreter in
continuation-passing style and using a language or compiler that
guarantees tail-call optimization (Scheme? ML?). In C (which, AFAIK
has no compilers that guarantee tail-call optimization) direct
threading with this technique would look like this:

Switch threading (aka switch dispatch) uses the C switch
statement (or similar statements in other languages, e.g., Pascal's
CASE). The result has the same advantages as token
threaded code; the disadvantage over token threaded code is the lower
speed due to the range check generated by most (all?) C compilers for
the switch; moreover, with the right encoding token threaded code can
avoid scaling, while the C compilers I have seen do not avoid it.

The main performance drawback of switch threading on modern CPUs is
is that it uses one shared indirect branch; this leads to close to
100% mispredictions in the BTB (branch target buffer) used for
indirect branch prediction on many modern CPUs (e.g., Athlon 64,
Pentium 4, PPC 970), whereas threaded code with separate NEXTs has
only about 50% mispredictions [ertl&gregg03jilp].

Call threading uses indirect calls (which most programming languages
have) instead of indirect jumps. For every call, there must also be a
return (except for optimized tail-calls, which don't occur here). So
the cost of this method is that of using a call-return sequence
instead of a simple jump.

Moreover, each virtual machine instruction is represented by a
function (procedure). This is very elegant to look at, and allows
separate compilation, but it means that global variables have to be
used for the virtual machine registers (e.g., the instruction pointer
ip), which most (all?) compilers allocate into memory; in
contrast, with switch threading you can use local variables, which are
(hopefully) allocated into registers.

Note: Call threading is a term I have used for this
technique. There seems to be no established term yet. I have seen
subroutine threading used for this technique, but that term
has a well-established and different meaning (see above).

History

Charles Moore invented (indirect) threaded Code in 1970, while working
at the National Radio Astronomy Observatory (NRAO), according to [moore87]. The first publication is [bell73], which was first received by CACM in June
1971; this paper does not contain any historic information.

Interestingly, the system described in [moore&leach70] (before Moore worked at NRAO)
does not use threaded code, but string interpretation. However, it
uses a code field, i.e., it has the seed for indirect threading; Then
again, code fields were folklore at that time [kay96].

@Article{bell73,
author = "James R. Bell",
title = "Threaded Code",
journal = cacm,
year = 1973,
volume = 16,
number = 6,
pages = "370--372"
}
@Article{dewar75,
author = {Robert B.K. Dewar},
title = {Indirect Threaded Code},
journal = cacm,
year = {1975},
volume = {18},
number = {6},
month = jun,
pages = {330--331}
}
@string{ieeecomputer="Computer"}
@Article{kogge82,
author = "Peter M. Kogge",
title = "An Architectural Trail to Threaded-Code Systems",
journal = ieeecomputer,
year = "1982",
pages = "22--32",
month = mar,
annote = "Explains the design of (a classical
implementation of) Forth, starting with threaded
code, then adding the parameter stack, constants,
variables, control structures, dictionary, outer
interpreter and compiler."
}
@InProceedings{ertl93,
author = "M. Anton Ertl",
title = "A Portable {Forth} Engine",
booktitle = "EuroFORTH '93 conference proceedings",
year = "1993",
address = "Mari\'ansk\'e L\'azn\`e (Marienbad)",
url = "http://www.complang.tuwien.ac.at/papers/ertl93.ps.Z",
abstract = "The Forth engine discussed in this paper is written
in GNU C, which provides several extensions that are
important for Forth implementation. The indirect
threaded Forth engine is completely
machine-independent, direct threading requires a few
machine-specific lines for each machine. Using
a portable language like GNU C encourages producing
an engine with many primitives. In order to make the
development of primitives easier and less
error-prone, an automatic tool generates most of the
code for a Forth primitive from the stack effect
notation, even if the TOS is kept in a register. The
engine is combined with the parts of the system
written in Forth by loading a machine-independent
image file that contains the executable Forth code
in relocatable form."
}
@InProceedings{ertl95pldi,
author = "M. Anton Ertl",
title = "Stack Caching for Interpreters",
crossref = "sigplan95",
pages = "315--327",
url = "http://www.complang.tuwien.ac.at/papers/ertl95pldi.ps.gz",
abstract = "An interpreter can spend a significant part of its
execution time on arguments of virtual machine
instructions. This paper explores two methods to
reduce this overhead for virtual stack machines by
caching top-of-stack values in (real machine)
registers. The {\em dynamic method} is based on
having, for every possible state of the cache, one
specialized version of the whole interpreter; the
execution of an instruction usually changes the
state of the cache and the next instruction is
executed in the version corresponding to the new
state. In the {\em static method} a state machine
that keeps track of the cache state is added to the
compiler. Common instructions exist in specialized
versions for several states, but it is not necessary
to have a version of every instruction for every
cache state. Stack manipulation instructions are
optimized away."
}
@Proceedings{sigplan95,
booktitle = "SIGPLAN '95 Conference on Programming Language
Design and Implementation",
title = "SIGPLAN '95 Conference on Programming Language
Design and Implementation",
year = "1995",
key = "SIGPLAN '95"
}
@Article{ertl&gregg03jilp,
author = {M. Anton Ertl and David Gregg},
title = {The Structure and Performance of \emph{Efficient}
Interpreters},
journal = {The Journal of Instruction-Level Parallelism},
year = {2003},
volume = {5},
month = nov,
url = {http://www.complang.tuwien.ac.at/papers/ertl%26gregg03jilp.ps.gz},
url2 = {http://www.jilp.org/vol5/v5paper12.pdf},
note = {http://www.jilp.org/vol5/},
abstract = {Interpreters designed for high general-purpose
performance typically perform a large number of
indirect branches (3.2\%--13\% of all executed
instructions in our benchmarks). These branches
consume more than half of the run-time in a number
of configurations we simulated. We evaluate how
accurate various existing and proposed branch
prediction schemes are on a number of interpreters,
how the mispredictions affect the performance of the
interpreters and how two different interpreter
implementation techniques perform with various
branch predictors. We also suggest various ways in
which hardware designers, C compiler writers, and
interpreter writers can improve the performance of
interpreters.}
}
@TechReport{moore&leach70,
author = "Charles H. Moore and Geoffrey C. Leach",
title = "FORTH -- A Language for Interactive Computing",
institution = "Mohasco Industries, Inc.",
year = "1970",
address = "Amsterdam, NY",
url = "http://www.dnai.com/~jfox/F70POST.ZIP"
}
@Article{moore87,
author = {Charles Moore},
title = {Forth -- eine persönliche Sprache},
journal = {Vierte Dimension},
year = {1987},
volume = {3},
number = {3},
month = oct,
pages = {11--13},
note = {Translated into German by Klaus Schleisiek, the
original is probably in More on NC4000},
}
@InProceedings{kay96,
author = "Alan C. Kay",
title = "The Early History of Smalltalk",
crossref = "hopl2",
pages = "511--579",
}
@Proceedings{hopl2,
title = {History of Programming Languages},
booktitle = {History of Programming Languages},
year = 1996,
key = {HOPL-II},
publisher = {ACM Press/Addison-Wesley}
}