elan@cheme.cornell.edu (Elan Feingold) writes:
>[I want to write a fast emulator for the Z80. Dynamically compiling
> to 8086 code would limit portability. Instead, I'd like to use an
> IR that's faster than pure emulation but more portable than machine
> code generation.]

Prepare to be astonished: most fast instruction set simulators use either
translation to host machine code or translate+interpret. Few use direct
decode-and-dispatch interpretation. Blatant plug: see the (extensive)
related work section in our upcoming SIGMETRICS '94 paper on Shade [CK94].

Here's the intuition: instruction set simulation (interpretation in
general) is composed of several phases: decode, dispatch, and simulation.
Suppose you're simulating an instruction at address 0x53.
Decode-and-dispatch interpreters execute full instruction decode and
dispatch each time the instruction is executed. In contrast, systems that
execute using translate+interpret can cache the translation for reuse.
Thus, the first time the instruction at 0x53 gets executed, there's full
overhead for decode and some dispatch to create the translation, some
added overhead to store the translation, and finally some more overhead to
finish the dispatch and simulate the instruction. The next time 0x53 is
executed, however, the simulator can use a simple dispatch to the
translation and use that to simulate the instruction without incurring the
overheads for decode and general dispatch.
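
To make the difference concrete, here is a minimal sketch in C. Everything
in it is an assumption made for illustration: the toy 16-bit instruction
format, the four opcodes, and all of the names (`Machine', `Decoded',
`run_decode_dispatch', `run_translate_interpret') are hypothetical and are
not taken from `g88' or Shade. Both loops simulate the same program; the
only difference is whether the decoded form is cached per target address.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { OP_HALT, OP_LOADI, OP_ADDI, OP_JNZ };  /* hypothetical toy opcodes */
enum { MEM_WORDS = 256 };

typedef struct {
    uint16_t mem[MEM_WORDS];  /* target instruction memory */
    int32_t  reg[4];
    uint8_t  pc;              /* 8 bits, so it always indexes mem[] safely */
} Machine;

/* Decoded form of one instruction; doubles as the cached "translation". */
typedef struct { uint8_t valid, op, r, imm; } Decoded;

/* Pack a toy instruction: 4-bit opcode, 4-bit register, 8-bit immediate. */
#define INSN(op, r, imm) \
    ((uint16_t)(((op) << 12) | ((r) << 8) | ((imm) & 0xff)))

/* Full decode: pull the fields back out of the packed word. */
static Decoded decode(uint16_t w)
{
    Decoded d;
    d.valid = 1;
    d.op  = (uint8_t)(w >> 12);
    d.r   = (uint8_t)((w >> 8) & 3);
    d.imm = (uint8_t)(w & 0xff);
    return d;
}

/* Simulate one decoded instruction; returns 0 on HALT, 1 otherwise. */
static int simulate(Machine *m, Decoded d)
{
    m->pc++;
    switch (d.op) {
    case OP_LOADI: m->reg[d.r]  = (int8_t)d.imm; break;  /* signed immediate */
    case OP_ADDI:  m->reg[d.r] += (int8_t)d.imm; break;
    case OP_JNZ:   if (m->reg[d.r] != 0) m->pc = d.imm; break;
    default:       return 0;    /* OP_HALT, or anything unrecognized */
    }
    return 1;
}

/* Decode-and-dispatch: decode again on every execution.
   Returns the number of decodes performed. */
long run_decode_dispatch(Machine *m)
{
    long decodes = 0;
    for (;;) {
        Decoded d = decode(m->mem[m->pc]);
        decodes++;
        if (!simulate(m, d))
            return decodes;
    }
}

/* Translate+interpret: decode only the first time an address is executed,
   then reuse the cached form. Returns the number of decodes performed. */
long run_translate_interpret(Machine *m, Decoded cache[MEM_WORDS])
{
    long decodes = 0;
    assert(cache != NULL);
    for (;;) {
        Decoded *d = &cache[m->pc];
        if (!d->valid) {
            *d = decode(m->mem[m->pc]);
            decodes++;
        }
        if (!simulate(m, *d))
            return decodes;
    }
}

/* Demo program: r0 = 5; do { r0 += -1; } while (r0 != 0); halt. */
void load_countdown(Machine *m)
{
    memset(m, 0, sizeof *m);
    m->mem[0] = INSN(OP_LOADI, 0, 5);
    m->mem[1] = INSN(OP_ADDI,  0, -1);
    m->mem[2] = INSN(OP_JNZ,   0, 1);
    m->mem[3] = INSN(OP_HALT,  0, 0);
}
```

On the countdown program the first loop decodes once per executed
instruction, while the second decodes each of the four addresses once no
matter how often the loop body runs. In this toy the decode is trivially
cheap, so the saving is only visible in the counts; with a densely encoded
real instruction set the decode step is where the time goes.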

For example, Robert Bedichek's `g88' 88000 emulator [Bedichek89]
translates to threaded code and simulates user/kernel mode execution,
address translation, and a bunch of other stuff, yet it runs real
applications (including the operating system) at around 30 host
instructions per simulated target instruction. A simpler simulation
(e.g. no simulation of address translation) should further speed things by
quite a bit.
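
For readers who haven't met threaded code, the following is a rough sketch
of the general idea in portable C, not a description of how `g88' itself
is built. Each translated instruction becomes a slot holding a pointer to
the routine that simulates it plus operands extracted at translate time.
The three-byte toy encoding and every name here (`Slot', `Cpu',
`translate', `run_threaded') are hypothetical, and plain function pointers
are assumed in place of whatever dispatch mechanism a production simulator
would use.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct Cpu  Cpu;
typedef struct Slot Slot;

/* A handler simulates one instruction and returns the next slot to run,
   or NULL to stop. */
typedef const Slot *(*Handler)(Cpu *cpu, const Slot *self);

struct Slot {              /* one translated target instruction */
    Handler fn;            /* routine that simulates it */
    int32_t a, b;          /* operands, already extracted */
};

struct Cpu {
    int32_t reg[4];
    long    executed;      /* instructions simulated, excluding HALT */
};

static const Slot *h_halt(Cpu *c, const Slot *s)
{
    (void)c; (void)s;
    return NULL;
}

static const Slot *h_loadi(Cpu *c, const Slot *s)
{
    c->reg[s->a] = s->b;                 /* b is an immediate */
    c->executed++;
    return s + 1;
}

static const Slot *h_add(Cpu *c, const Slot *s)
{
    c->reg[s->a] += c->reg[s->b & 3];    /* b is a register number */
    c->executed++;
    return s + 1;
}

/* Hypothetical toy encoding: three bytes per instruction {op, a, b}. */
enum { T_HALT, T_LOADI, T_ADD };
static const Handler handlers[] = { h_halt, h_loadi, h_add };

/* Translate once: all decoding happens here, none in the run loop.
   An unknown opcode is treated as a translate-time error in this sketch. */
void translate(const uint8_t *code, size_t n_insns, Slot *out)
{
    for (size_t i = 0; i < n_insns; i++) {
        uint8_t op = code[3 * i];
        assert(op <= T_ADD);
        out[i].fn = handlers[op];
        out[i].a  = code[3 * i + 1] & 3;
        out[i].b  = code[3 * i + 2];
    }
}

/* The entire dispatch loop: one indirect call per target instruction. */
void run_threaded(Cpu *c, const Slot *s)
{
    while (s != NULL)
        s = s->fn(c, s);
}
```

Translating the four-instruction sequence `LOADI r0,7; LOADI r1,35;
ADD r0,r1; HALT' and passing the resulting slots to `run_threaded' leaves
the sum in r0. Note that the run loop never looks at an opcode; that work
was all done in `translate'.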

At 30 host instructions per target instruction (30I/I), `g88' is still not
as good as compilation to native host code (and the discrepancy between
fast interpretation and native compilation is greater for trace generators
like Shade), but 30I/I is better than decode-and-dispatch interpretation.
`g88' is written in C, plus (originally) a postpass fixup of the assembly
code or (lately) a GNU CC C extension. The closest figure I have for a
comparable decode-and-dispatch interpreter is about 40I/I for a RISC
machine simulator written in assembler (ugh!), and that one doesn't ``do
the details'' the way `g88' does.

I expect the decoding overhead would be greater on many CISCs, so the
relative advantage of translate+interpret would be even greater there.

Fast interpreters _do_ generally make time/space tradeoffs and thus
consume more memory than decode-and-dispatch interpreters. `g88', for
example, uses 5X the space for code: 1X for the original copy plus 4X for
the threaded code (although some `g88' derivatives cache just the
most-recently-used translations). Decode-and-dispatch interpreters may
also handle application run-time code generation (RTCG) more simply,
because they cache nothing and instead re-decode on every execution;
systems that store translations must worry about translation
consistency [CK93] (if these techniques aren't good enough for you, write
me; there's more).
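
As an illustration of the consistency problem, here is about the simplest
scheme one could imagine, sketched in C: every simulated store also
invalidates any cached translation for the address it writes. This is an
assumed toy design and is not claimed to be one of the techniques in
[CK93]; the one-byte "instruction" format and the names `Sim', `Xlat',
`fetch_translated', and `target_store' are all hypothetical.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { MEM_BYTES = 256 };

/* Cached translation of the one-byte toy instruction at an address. */
typedef struct { uint8_t valid, op, arg; } Xlat;

typedef struct {
    uint8_t mem[MEM_BYTES];    /* target memory: code and data together */
    Xlat    xlat[MEM_BYTES];   /* one translation slot per address */
    long    translations;      /* how many times we had to translate */
} Sim;

void sim_init(Sim *s)
{
    assert(s != NULL);
    memset(s, 0, sizeof *s);   /* all translations start out invalid */
}

/* Return the translation for addr, creating it on first use. */
const Xlat *fetch_translated(Sim *s, uint8_t addr)
{
    Xlat *x = &s->xlat[addr];
    if (!x->valid) {
        x->op    = (uint8_t)(s->mem[addr] >> 4);
        x->arg   = (uint8_t)(s->mem[addr] & 0x0f);
        x->valid = 1;
        s->translations++;
    }
    return x;
}

/* Every simulated store goes through here. Writing to an address throws
   away its cached translation, so self-modifying or freshly generated
   code is retranslated the next time it is executed. */
void target_store(Sim *s, uint8_t addr, uint8_t value)
{
    s->mem[addr] = value;
    s->xlat[addr].valid = 0;
}
```

If `target_store' skipped the invalidation, a program that rewrote the
instruction at 0x53 would keep executing the stale translation. The cost
of this naive approach is a check on every store, which is exactly the
kind of overhead that makes more careful schemes worth studying.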

In short, if you can afford some code space explosion, the techniques used
by `g88' are really worth the slight added effort to separate decode and
simulation. Actually, they may eventually make the simulator *simpler* --
real machine instructions use dense encodings, so simulators that
integrate decode and dispatch with simulation tend to have quite hairy
code. If you're thinking about going the fast simulator route, the `g88'
paper [Bedichek89] is as much of a `must read' as Deutsch and Schiffman's
paper [DS84] is for dynamic translation. `g88' is also publicly available
under the GNU public license; write to `robertb@cs.washington.edu' for
details.

%L [CK94]
%A Robert F. Cmelik
%A David Keppel
%T Shade: A Fast Instruction-Set Simulator for Execution Profiling
%J SIGMETRICS '94 (to appear)
%D 1994
%X a substantially trimmed version of the TR, plus an extended related
work section. The TR is Sun Microsystems Laboratory TR 93-12 and
University of Washington CS&E TR 93-06-06; the latter is available via
anonymous ftp from `cs.washington.edu' (128.95.1.4) in
`tr/1993/06/UW-CSE-1993-06-06.PS.Z'

%L [DS84]
%A Peter Deutsch
%A Alan M. Schiffman
%T Efficient Implementation of the Smalltalk-80 System
%J 11th Annual Symposium on Principles of Programming Languages (POPL-11)
%D January 1984
%P 297-302

[Re astonishment: I knew that current emulators generally translate to chunks
of machine code, and have done so at least as long ago as Cathy May's 370
emulator for the ROMP and Peter Woo's ROMP emulator for the 370. What would
astonish me would be if you could still get usefully fast performance compiling
to something other than machine code. -John]