ed_davis2 asked:
> I have developed a simple stack based virtual machine. I would like
> it to be as efficient as possible, in terms of both speed and object
> code size. Size, as in size of the object code the VM processes
> (byte-code), and not necessarily the size of the VM.

You'll always have to make some tradeoffs between speed and space.
However, you can probably do quite well if you note that a lot of
operands are repeated. So are a lot of opcodes and code sequences,
but we'll concentrate on the operands first.

For example, there are probably on the order of 255 common operands in
a code sequence, so use something akin to sliding window data
compression to capture that fact. Here is one possibility:

struct operation {
    unsigned char opcode;
    unsigned char operand;  /* 0-254: recent-operand index; 255: "new" */
};

Use the operand values 0-254 to refer to the 255 most recent operands
(as they appeared in the code). Use operand value 255 to indicate a
"new" operand being taken from a list of word-aligned locations. When
you fetch a new operand, retire the oldest operand from the
most-recently-used list. You also increment the pointer into the
operand list, so that the next "new" operand fetches a different
element from the list.
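A minimal sketch of that scheme in C (the struct layout, names, and the exact retirement policy are illustrative; the text above leaves them open):

```c
#include <stdint.h>

#define CACHE_SIZE  255   /* operand bytes 0-254 index this window */
#define NEW_OPERAND 255   /* sentinel: take the next operand from the list */

struct operand_cache {
    uint32_t slot[CACHE_SIZE];  /* the 255 most recently seen operands */
    int head;                   /* oldest slot, overwritten on a "new" fetch */
    const uint32_t *pool;       /* word-aligned list of "new" operands */
    int next;                   /* index of the next unseen pool entry */
};

/* Decode one operand byte into an actual operand value. */
uint32_t fetch_operand(struct operand_cache *c, unsigned char b)
{
    if (b != NEW_OPERAND)
        return c->slot[b];              /* reuse a recent operand */

    uint32_t v = c->pool[c->next++];    /* pull a fresh operand */
    c->slot[c->head] = v;               /* retire the oldest entry */
    c->head = (c->head + 1) % CACHE_SIZE;
    return v;
}
```

Here the ring-buffer `head` is the simplest possible retirement rule; a true LRU or frequency-based policy is one of the heuristics discussed below.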

There are lots of heuristics you can build into the 255-entry
recent-operand memory. You can think of it as a cross between a
register file and a cache.

For example, you can preload the cache with the operand list. This
saves the first 255 uses of the "new" operand. You can also
load/store the cache on a per-call/return basis. Note, you probably
don't want to actually copy the cache on the call/return, but instead
keep a cache pointer. This is a place where profiling will
pay off. Measure a couple of different strategies and keep measuring
as you tune.

Next, since most routines do not have 255 local variables (unless
you've done extensive inlining), you can consider making the cache
smaller, and using some of the operand values to represent function
parameters (carving 16 or even 32 function parameters out of the
cache won't significantly impact the cache, and will handle almost
any calls).

You can compress even further. Say your profiling suggests that you
can get away with 24 cache entries, 7 parameter locations, and 1
"new" operand; that fits in 5 bits. Then, if you can find 7 common
opcodes (3 more bits), you can pack the entire opcode and operand
into 1 byte.
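As a sketch, the packing and unpacking for that 3-bit/5-bit split might look like the following (the field widths come from the example above; the split itself is whatever your profiling picks):

```c
#include <stdint.h>

/* 3-bit opcode in the high bits, 5-bit operand selector in the low bits.
   Selector values 0-23 index the cache, 24-30 name parameters, 31 = "new". */
static inline uint8_t pack_op(unsigned opcode, unsigned operand)
{
    return (uint8_t)(((opcode & 0x7) << 5) | (operand & 0x1f));
}

static inline unsigned unpack_opcode(uint8_t b)  { return b >> 5; }
static inline unsigned unpack_operand(uint8_t b) { return b & 0x1f; }
```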

You can also compress the operation space the same way by reserving
some number of opcodes as "compressed" opcodes that do a table lookup
into an instruction cache. If the instruction cache can hold
sequences of instructions, you are close to Ziv-Lempel encoding.
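One way to sketch the compressed-opcode lookup, assuming a reserved opcode range and a table the code generator fills in (all names and the 0xF0 base are hypothetical):

```c
#include <stddef.h>
#include <stdint.h>

#define COMPRESSED_BASE 0xF0   /* opcodes 0xF0-0xFF are table references */

struct sequence {
    const uint8_t *code;   /* the expanded instruction bytes */
    size_t len;
};

/* Filled by the code generator with the 16 most common sequences. */
static struct sequence seq_table[16];

/* Return the bytes to execute for one fetched opcode, and how many. */
static const uint8_t *expand(uint8_t opcode, const uint8_t *pc, size_t *len)
{
    if (opcode >= COMPRESSED_BASE) {
        const struct sequence *s = &seq_table[opcode - COMPRESSED_BASE];
        *len = s->len;
        return s->code;    /* replay the cached sequence */
    }
    *len = 1;
    return pc;             /* ordinary opcode: execute in place */
}
```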

Back to operand encoding: it is worth noting that the entries in the
"address" cache et al. don't have to be just simple pointers. You can
re-invent the concepts of indirect and indexed addressing (and even
auto-increment if it suits you) in your virtual machine. RISC
hardware architectures tended to drop these concepts; however, the
tradeoffs in a hardware implementation are different from those in a
software one. Again, measure and tune.
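For illustration, a cache entry could carry a mode tag rather than a bare address (the modes, fields, and memory layout here are just an assumed example):

```c
#include <stdint.h>

static uint32_t mem[256];   /* the VM's data memory */

enum addr_mode { DIRECT, INDIRECT, INDEXED };

struct operand_entry {
    enum addr_mode mode;
    uint32_t addr;    /* a cell number in mem[] */
    int32_t  index;   /* used only by INDEXED */
};

/* Resolve a cache entry to the value it names. */
static uint32_t resolve(const struct operand_entry *e)
{
    switch (e->mode) {
    case DIRECT:   return mem[e->addr];             /* the cell itself */
    case INDIRECT: return mem[mem[e->addr]];        /* cell holds an address */
    case INDEXED:  return mem[e->addr + e->index];  /* base plus index */
    }
    return 0;
}
```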

Of course, the farther you push this, the farther you get from a
simple VM.

Going the opposite direction, one can borrow a concept that hardware
designers have used: aligning instructions with no-ops. Suppose you
want your essential instructions to be half-word (or byte) aligned,
but you want your instructions with 32 bit operands to have the
operands aligned on 32 bit boundaries. First, determine what opcode
length/position would put the operand on a 32 bit boundary. For
example, if you decide to have byte-boundary instructions with a 1
byte opcode field, putting an instruction on the 3rd byte should then
align the following operand field. If you want to generate such an
instruction and you aren't on the appropriate boundary, you can
either insert no-op instructions to get to the desired alignment, or
pull instructions from further down the instruction list to try to
fill the space. Note that doing so is potentially difficult; whole
compiler phases are written to "schedule" instructions to try to
minimize various penalties. Now, many of those penalties don't apply,
since your VM is likely to execute each instruction serially, rather
than doing the multi-issue execution that hardware can do.
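The simpler no-op half of that idea can be sketched as follows, assuming a 1-byte opcode immediately followed by a 4-byte operand (the no-op encoding is an assumption):

```c
#include <stddef.h>
#include <stdint.h>

#define OP_NOP 0x00   /* assumed no-op encoding */

/* Emit no-ops so that an instruction placed at the returned position
   has its operand (at pos + 1) on a 32-bit boundary. */
static size_t pad_for_operand(uint8_t *buf, size_t pos)
{
    while ((pos + 1) % 4 != 0)
        buf[pos++] = OP_NOP;
    return pos;   /* where to emit the opcode byte */
}
```

Pulling later instructions forward to fill the gap instead of padding is the scheduling problem mentioned above, and is considerably more work.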

However, as more hardware gets to be multi-issue and out-of-order,
the things which are cheap and parallel in hardware become more and
more applicable to software. Not long ago, I would not have
recommended using an 8 bit opcode and a 24 bit address in a VM,
because the mask instruction to get a 32 bit pointer out of the 24
bit address might have been expensive. These days I would recommend
measuring the effect before making the decision. I suspect that on
any P4 or better class machine, the mask is free. Of course, on the
same class of machine I bet the mis-alignment penalty is nil also.
The real measure is the amount of work one needs to do, especially in
terms of memory traffic, and most importantly non-cache memory
traffic.
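For concreteness, the 8/24 split is just a shift and a mask (assuming here that the 24-bit operand is an offset into a VM data area rather than a raw pointer):

```c
#include <stdint.h>

/* One 32-bit instruction word: opcode in the top byte, 24-bit operand
   (e.g. an offset into the VM's data area) in the low three bytes. */
static inline unsigned opcode_of(uint32_t insn)  { return insn >> 24; }
static inline uint32_t operand_of(uint32_t insn) { return insn & 0x00ffffffu; }
```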