jbuck@Synopsys.COM (Joe Buck) writes:

>[PowerPC code is on the order of 1.5 times larger than 68k code.]

This discussion (Pittman's IEEE Micro paper, ``The RISC Penalty'')
doesn't have anything to do with the 68k. Evidently I wasn't clear
the first time I wrote about it, so let me try again:

Pittman had an application, ``foo'', to which he applied
code-expanding transformations (``optimizations''). Just based on an
instruction cycle count (adds take one cycle, fp ops take 3 cycles,
etc.), the code-expanding transformations should have made ``foo''
three times faster. Instead, the transformations made ``foo'' slower,
due to the space costs of the transformations.

As I read the paper, Pittman stared at this for a while and said ``the
obvious problem is that less-dense instructions use up more
instruction cache space and memory fetch bandwidth. Since processors
are getting faster and faster relative to memory, it seems that this
problem is going to get worse before it gets better, and spending more
time decoding the instructions might slow the processor but improve
overall performance.''
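A toy model makes that arithmetic concrete. All the numbers below are
invented for illustration (they are not Pittman's measurements): charge
each dynamic instruction its base cycles, then add icache stall cycles
once the code overflows the cache.

```python
# Crude, invented model of Pittman's observation: a transformation that
# cuts the per-instruction cycle count can still lose once the expanded
# code spills out of the instruction cache.
def run_cycles(n_dynamic, cpi, code_bytes, cache_bytes,
               miss_penalty=50, insn_bytes=4, line_bytes=32):
    cycles = n_dynamic * cpi
    if code_bytes > cache_bytes:
        # very rough: assume the fraction of fetches that miss equals
        # the fraction of the code that can't stay resident
        miss_frac = 1 - cache_bytes / code_bytes
        line_fetches = n_dynamic * insn_bytes / line_bytes
        cycles += line_fetches * miss_frac * miss_penalty
    return cycles

# ``foo'' before: fits comfortably in a 32KB cache.
before = run_cycles(1_000_000, cpi=1.5,
                    code_bytes=16_384, cache_bytes=32_768)
# ``foo'' after code-expanding transformations: 3x fewer cycles by the
# naive count, but 8x the code.
after = run_cycles(1_000_000, cpi=0.5,
                   code_bytes=131_072, cache_bytes=32_768)
```

With these made-up numbers the ``three times faster'' version actually
comes out several times slower, which is the shape of Pittman's result.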

On a related note, ARM (one of the leaders in the embedded processor
market) recently created an instruction set extension they call
``thumb''. Instead of fixed 32-bit instructions it uses fixed 16-bit
instructions, with a mode switch between the 16-bit and 32-bit
instruction sets. The basic implementation is
that they have an on-chip decoder that turns 16-bit instructions into
32-bit instructions and then sends them in their ``uncompacted'' form
to the instruction decoder. There are a lot of compromises in the
16-bit mode: all operations are two-operand, destroying one of the
arguments and thus necessitating extra copies; it gives access to only
half of the 16-entry register set; and it has a very limited set of
operations compared to the 32-bit instruction set. However,
despite these compromises, some codes run FASTER using 16-bit
instructions. With 16-bit instructions, the instruction counts for a
given program are higher, but the space cost is lower: each cache miss
fetches twice as many instructions; the cache holds twice as many
instructions, and so on [Turley 94].
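The decode step can be sketched in a few lines. The bit layout and
opcode below are invented for illustration (this is not ARM's real
Thumb encoding); the point is how a destructive two-operand 16-bit
instruction expands into the three-operand 32-bit form:

```python
# Hypothetical 16-bit layout: [opcode:8][Rd:4][Rm:4].  Not the real
# Thumb format -- just an illustration of the expansion idea.
ADD16 = 0x18  # made-up opcode for "ADD Rd, Rm" (Rd is also a source)

def expand(halfword):
    op = (halfword >> 8) & 0xFF
    rd = (halfword >> 4) & 0xF
    rm = halfword & 0xF
    if op == ADD16:
        # the destructive 16-bit form "ADD Rd, Rm" becomes the
        # three-operand 32-bit form "ADD Rd, Rd, Rm"
        return ("ADD", rd, rd, rm)
    raise ValueError("unknown 16-bit opcode")

# ADD r1, r2  ->  ADD r1, r1, r2
expanded = expand((ADD16 << 8) | (1 << 4) | 2)
```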

In a similar vein, somebody who used to work for a company making VLIW
machines told me that they had a `nop N' instruction that said ``the
next N instructions are nops, so don't even bother fetching them.''
Programs ran faster than with explicit nops, thanks to the cache space
and fetch bandwidth the explicit nops would have cost.
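A toy fetch loop (hypothetical instruction format, invented here) shows
why: the slots covered by a `nop N' are never fetched at all.

```python
# Hypothetical instruction stream: ("nop_n", N) means the next N slots
# are empty, so the fetch unit can skip straight over them.
def count_fetches(program):
    pc = fetches = 0
    while pc < len(program):
        op = program[pc]
        fetches += 1
        if op[0] == "nop_n":
            pc += 1 + op[1]   # skip the empty slots without fetching
        else:
            pc += 1
    return fetches

explicit = [("add",)] + [("nop",)] * 6 + [("add",)]
compact  = [("add",), ("nop_n", 6)] + [("empty",)] * 6 + [("add",)]
```

The compact version occupies the same space but touches far fewer
instruction words on its way through.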

>[All we can conclude is that code-expanding transformations are
>sometimes a poor idea.]

That's one conclusion, and it's true if you regard the instruction set
and cache size as fixed. However, if you're a processor designer,
you'd like to produce an architecture that provides high performance.
One mechanism is to enable lots of code-expanding transformations,
since specialized code is often one of the keys to better performance.
How do you keep from falling off the ``cliff'' when the compiler
transformations overflow the L1 cache? Rather than expanding a 32KB
cache to 64KB (4 transistors/bit * 8 bits/byte * 32KB = 1M
transistors), a processor architect might think about spending that
real estate on a denser instruction encoding -- compacting the
instructions effectively enlarges the cache, the bus widths, the L2
cache, etc.

At the end of Pittman's article, he suggests a stack architecture with
5-bit instructions. Combining ARM's ``thumb'' idea, CRISP's
stack-ISA-with-register-implementation architecture [Ditzel et al.
87], and the WISC (``writable instruction set'') idea [Koopman 87],
you can imagine an instruction decoder that's tailored to the
application (and even updated on-the-fly) to enable a very dense
instruction encoding, which in turn enables lots of code-expanding
transformations (``optimizations'') without losing all the potential
gains in a cloud of icache misses.
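One way to picture that combination is a decode table the application
itself loads, so its hottest operation sequences get the shortest
codes. This is purely a sketch -- the opcodes, operation names, and
interface below are invented, not any real WISC machine's:

```python
# Sketch of a "writable instruction set" decoder: the application
# installs its own mapping from short opcodes to expanded operations.
decode_table = {}

def load_decoder(entries):
    decode_table.clear()
    decode_table.update(entries)

def decode(code_bytes):
    expanded = []
    for op in code_bytes:
        expanded.extend(decode_table[op])
    return expanded

# This (hypothetical) application is full of multiply-accumulates, so
# give the whole sequence a single one-byte opcode.
load_decoder({
    0x01: [("load", "x")],
    0x02: [("mul",), ("add",)],   # one code expands to two operations
})
program = decode([0x01, 0x02])
```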

To summarize, this doesn't really have anything to do with ``RISC''
vs. ``CISC''; it has everything to do with running out of instruction
fetch bandwidth and trying to make the most performance-oriented
decisions in order to keep boosting real-world performance. Indeed,
this seems like an ideal time to revive the idea of a
``proceduraling'' optimizer, which increases the instruction count but
decreases the code size by finding common sequences and turning them
into shared ``procedures'' [Fraser et al. 84].
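The idea can be sketched in a few lines. This is a toy pass, not
Fraser et al.'s actual algorithm: replace each occurrence of one known
common sequence with a call to a shared ``procedure''.

```python
# Toy procedural abstraction: swap repeats of a common sequence for
# calls.  Static size shrinks; the dynamic count grows by the extra
# call/return instructions.
def abstract(code, seq, name):
    out, i = [], 0
    while i < len(code):
        if code[i:i + len(seq)] == seq:
            out.append(("call", name))
            i += len(seq)
        else:
            out.append(code[i])
            i += 1
    return out, seq + [("ret",)]

seq = [("push", "a"), ("push", "b"), ("add",)]
code = seq * 3 + [("mul",)]             # 10 instructions
body, proc = abstract(code, seq, "f0")  # 4 + 4 instructions
```

With three occurrences of a three-instruction sequence, ten static
instructions become eight; the savings grow with longer and more
frequent sequences.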

Lest I sound adamant about all of this, let me say that I don't know
whether denser instruction sets are the way to go. I do believe it's
probably a good idea to go stare at it for a while.