Performance of Forth systems

The benchmarks we used were the ubiquitous Sieve (counting the
primes <16384 a thousand times), bubble sort (6000 integers), and
matrix multiplication (200*200 matrices); the latter two come from the
Stanford integer benchmarks (originally in Pascal, but available in C;
the C version and Martin Fraeman's original translations to Forth can
be found at
ftp.complang.tuwien.ac.at/pub/forth/stanford-benchmarks.tar.gz) and
were translated into Forth by Martin Fraeman and included in the
TILE Forth package. These three benchmarks share one disadvantage:
they contain unusually few calls. To benchmark calling
performance, we also computed the 34th Fibonacci number using a
recursive algorithm with exponential run-time complexity. You can find
the benchmarks in Forth, C and forth2c-generated C at http://www.complang.tuwien.ac.at/forth/bench.tar.gz.
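The Fibonacci benchmark can be sketched in C as follows (an assumption on my part: this uses the variant common in Forth benchmarking, where fib(0) = fib(1) = 1, so that fib(n) equals the standard F(n+1); the archive above contains the authoritative sources):

```c
/* Recursive Fibonacci with exponential run-time complexity, used to
   stress calling performance.  Assumed variant: fib(0) = fib(1) = 1,
   as is common in Forth benchmark suites. */
long fib(long n)
{
    if (n < 2)
        return 1;
    return fib(n - 1) + fib(n - 2);
}
```

A driver that calls fib(34) once and prints the result reproduces the benchmark run; the exponential recursion makes call overhead the dominant cost, which is exactly what this benchmark is meant to measure.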

Win32Forth, NT Forth and its NCC were benchmarked under Windows NT,
bigForth and iForth under DOS/GO32, all other systems under Linux. The
results for Win32Forth, NT Forth and its NCC were provided by Kenneth
O'Heskin, those for iForth and eForth (with and without peephole
optimization of intermediate code) by Marcel Hendrix. All measurements
were performed on PCs with an Intel 486DX2/66 CPU, a 256K secondary
cache, and similar memory performance. The times given are the median
of three measurements of the user time (the system time is negligible
anyway).

The table above (and the accompanying figure) shows
the time each system needs for the benchmarks, relative to the
time of f2c opt (or, in other words, the speedup factor that
our translator (and GCC) achieves over the other systems). Empty
entries indicate that I did not succeed in running the benchmark on
the system.

The code produced by Timbre's Forth-to-C translator is slow, as
expected (since Timbre does not have DO..LOOP and friends, I could
only measure fib, but I consider this result representative). Combining our
translator with a non-optimizing GCC results in code that is even
slower than the interpretive Gforth system. Hand-coded C is between
14% faster and 10% slower than the output of the Forth translator. I
was a little surprised by the matrix multiplication result, where the
code translated to Forth and back was faster than the original. Closer
inspection showed that an optimization had been performed in the
translation from C to Forth that the C compiler does not perform (an
interprocedural optimization requiring interprocedural alias analysis)
and that reduced the number of memory accesses; this is probably
responsible for the speedup, perhaps combined with vagaries such as
instruction cache alignment.
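The kind of optimization involved can be illustrated with a hypothetical C fragment (my own example, not the actual benchmark code): when a C compiler cannot prove that a pointer parameter never aliases a global variable, it must load and store the global inside the loop, whereas a translator with interprocedural alias analysis can keep the value in a register, here simulated with a local variable:

```c
/* Hypothetical illustration, not the benchmark code.  Without
   interprocedural alias analysis, the compiler must assume that a[i]
   may alias the global total, so total is read from and written to
   memory on every iteration. */
int total;

void accumulate(int *a, int n)
{
    for (int i = 0; i < n; i++)
        total += a[i];      /* memory traffic for total each iteration */
}

/* With whole-program knowledge that no caller passes &total, the sum
   can be kept in a local (i.e., a register) and written back once. */
void accumulate_opt(int *a, int n)
{
    int t = total;
    for (int i = 0; i < n; i++)
        t += a[i];          /* no memory traffic for the accumulator */
    total = t;
}
```

Both functions compute the same result when no aliasing occurs; the second simply makes the reduced memory traffic explicit.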

BigForth and iForth achieve a speedup of about 3 over the fastest
interpreters, but there is still a lot of room for improvement (at
least a factor of 1.3-3). Even on the fib benchmark, which should be
the strong point of Forth compilers, f2c opt was better. The
results of NT Forth NCC are somewhat worse on average and show a large
variance (a speedup of 1-4 over Gforth, i.e., 1.2-7.2 times slower
than f2c opt). These results show that research into better native
code generation techniques is not a waste of time and that there is
still a lot to be gained in this area. They also show that
the following statement has not yet become outdated: ``The resulting
machine code is a factor of two or three slower than the equivalent
code written in machine language, due mainly to the use of the stack
rather than registers.'' [Ros86]

The interpretive systems written in assembly language (except
eforth opt) are, surprisingly, slower than Gforth. One
important reason for the disappointing performance of these systems is
probably that they are not written optimally for the 486 (e.g., they
use the lods instruction). eforth opt demonstrates
that peephole optimization of intermediate code offers substantial
gains. eforth opt is 3.5-6.5 times slower than f2c
opt. We can expect even better results when the peephole
optimization is applied to a more efficient baseline interpreter
(e.g., Gforth).

Gforth is 3.5-6.5 times slower than f2c opt, PFE
7.5-11 times. The slowdown of PFE with respect to Gforth can be
explained by its self-imposed restriction to standard C (although
the measured configuration of PFE uses a GNU C extension: global
register variables), which makes efficient threading impossible (PFE
uses indirect call threading).
ThisForth and TILE were obviously written with a certain disregard
for efficiency issues and for the limited optimization
abilities of state-of-the-art C compilers, resulting in a slowdown
factor of more than 49 for TILE on the Sieve.
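The threading techniques mentioned above can be sketched in C with a deliberately tiny VM (my own toy example; PFE's and Gforth's real inner interpreters differ in detail): indirect call threading dispatches through a portable loop over function pointers, paying a call, a return, and a loop branch per primitive, while direct threading with GNU C's labels-as-values extension (the kind of construct standard C rules out) turns NEXT into a single indirect jump:

```c
/* Toy VM for illustration only; not PFE's or Gforth's actual code. */
static int stack[16], *sp = stack;

static void lit1(void) { *sp++ = 1; }
static void plus(void) { sp--; sp[-1] += *sp; }

typedef void (*prim)(void);

/* Indirect call threading: portable ANSI C, but every primitive costs
   a call, a return, and the dispatch loop's branch. */
static void run_call_threaded(prim *ip)
{
    while (*ip)                         /* NULL-terminated threaded code */
        (*ip++)();
}

/* Direct threading with GNU C's labels-as-values: the threaded code
   holds label addresses, and NEXT is one indirect jump (goto **ip++). */
static int run_direct_threaded(void)
{
    static void *code[] = { &&do_lit1, &&do_lit1, &&do_plus, &&halt };
    void **ip = code;
    goto **ip++;
do_lit1: *sp++ = 1;                goto **ip++;
do_plus: sp--; sp[-1] += *sp;      goto **ip++;
halt:    return *--sp;
}
```

Both versions execute the threaded-code equivalent of `1 1 +`; the second compiles only with GCC-style compilers, which is exactly the portability trade-off PFE's restriction to standard C avoids.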

I measured not only run-time, but also code size and compile time. For
threaded code (interp. size), I measured the space allotted in the
Gforth system during compilation of the program and subtracted the
allotted data; i.e., the interpreted code size includes the space
needed for headers. Gforth uses one cell (32 bits on the
measured system) per compiled word, two cells for the code field, and
pads the header such that the body is maximally (i.e., 8-byte)
aligned. For the size of the machine code produced by the
translator/compiler combination (.o size), I used the sum of the text
and data sizes reported by the Unix size command, as applied to
the object (.o) file. This does not include the size of the
symbol table information contained in the object file (which is easy
to strip away after linking). The data size in the object file does
not include the allotted space, as that is allocated later, at
run-time.

The code size measurements dispel another popular myth,
that of the inherent size advantage of stack architecture code and of
the bloat produced by optimizing C compilers. While a comparison of a
header-stripping 16-bit Forth with a RISC (about 50% bigger code
than CISCs) would give a somewhat different result, the reported size
differences of more than an order of magnitude need a different
explanation: differences in the functionality of the software and
different software engineering practices come to mind.

For the compile time measurements, only the user time needed by GCC to
compile and link the program is displayed. The system time was
constant at 0.6s. The compilation to Gforth's interpreted code needed
a negligible amount of time; the translation to C also vanished in the
measurement noise, although it was not written for speed and although
the present implementation should be much slower than normal Forth
compilation. The compile time data indicate that, after a startup time
of about 1.4s (user+system), GCC compiles about 90 lines of Forth
code (1500 lines of translator output) per second. Interestingly, less
than one byte of machine code is generated per line of C code.