A Minimal CISC

Abstract

The minimal CISC architecture presented here is an extremely simple
zero-address architecture suitable for microprogrammed implementation.
It is sufficiently simple to be introduced in one lecture, with time left
over for discussion of implementation or enhancement options.

This writeup is a revision of a paper by the same name published
in ACM Computer Architecture News, 16, 3 (June 1988), pages 56-63.

Introduction

A clear distinction has come to be recognized between two schools of
instruction set design, frequently characterized as RISC, standing for
reduced instruction set computer architecture and CISC, standing
for complex instruction set computer architecture. In fact, the
distinction between these schools emerged long before the names were
coined. For example, this distinction is quite apparent in the
comparison of the Data General Nova (RISC) and the DEC PDP-11 (CISC)
architectures developed in the late 1960s.

The Nova has an instruction set in which most instructions can execute
in a single fixed-length cycle involving an instruction fetch, and one
of either a fetch, a store, or an operation on registers. In contrast,
the PDP-11 has numerous addressing modes; depending on the mode, an
instruction may execute in from 1 to 7 memory cycles. Both machines
were designed in the late 1960s, and except for the RISC-CISC distinction,
they are quite comparable; they competed for similar applications, offering
similar performance at similar prices.

In teaching introductory computer architecture courses, it is frequently
difficult to find simple architectures which illustrate the distinction
between these two schools of design; this is particularly difficult on the
CISC side, where, as the name suggests, complexity is the rule. The
architecture presented here was developed to meet this need.

The minimal CISC instruction set is presented and formally specified in
the next section. An example program for the minimal CISC is presented
in the section after that, and the example is then used to compare the
potential performance of the minimal CISC with the DEC PDP-11. The final
section discusses the effects of modifications to the architecture and its
implementations.

Instruction Set

The minimal CISC instruction set presented here is stack-based and composed
entirely of zero-address instructions. There are only 8 instructions, so
each can be coded as a 3-bit syllable. Assuming a 16-bit word, 5 instruction
syllables can be packed in each word (the least significant syllable is the
first). The instructions are:

(000) NOP:

No operation.

(001) DUP:

Duplicate the stack top. This is the only way to allocate stack space.

(010) ONE:

Shift the stack top left one bit,
shifting one into the least significant bit.

(011) ZERO:

Shift the stack top left one bit,
shifting zero into the least significant bit.

(100) LOAD:

Use the value on the stack top as a memory address; replace it with
the contents of the referenced location.

(101) POP:

Store the value from the top of the stack in the memory location
referenced by the second word on the stack; pop both.

(110) SUB:

Subtract the top value on the stack from the value below it, pop both
and push the result.

(111) JPOS:

If the word below the stack top is positive, jump to the word pointed
to by the stack top. In any case, pop both.

This instruction set is not particularly convenient, but that is not the
point of this exercise. The important thing is that it is very simple
yet it is sufficient to write any program, given enough memory and
memory-mapped input-output devices.

Using this instruction set, any constant can be pushed on the stack by a
DUP followed by 16 ONE or ZERO instructions.
Zero may be pushed on the stack by the sequence DUP DUP SUB;
negation may be done by subtracting from zero; addition may be done
by subtracting a negated value, and pushing zero prior to pushing an
unconditional branch address allows an unconditional branch. The code in
the example given later illustrates these tricks.
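These tricks can be checked with a small interpreter. The following is a sketch in Python, not from the paper: the stack is modeled as a Python list rather than as words in memory, and the names `step` and `push_const` are invented here for illustration.

```python
# A minimal sketch of the instruction semantics; stack is a Python
# list, memory a dictionary.  Names are illustrative, not from the
# paper.

MASK = 0xFFFF                            # 16-bit words

def step(op, stack, mem):
    """Execute one instruction; return a jump target or None."""
    if op == 'NOP':
        pass
    elif op == 'DUP':
        stack.append(stack[-1])
    elif op == 'ONE':
        stack[-1] = ((stack[-1] << 1) | 1) & MASK
    elif op == 'ZERO':
        stack[-1] = (stack[-1] << 1) & MASK
    elif op == 'LOAD':
        stack[-1] = mem[stack[-1]]
    elif op == 'POP':
        value, addr = stack.pop(), stack.pop()
        mem[addr] = value
    elif op == 'SUB':
        top = stack.pop()
        stack[-1] = (stack[-1] - top) & MASK
    elif op == 'JPOS':
        target, below = stack.pop(), stack.pop()
        if 0 < below < 0x8000:           # positive as a signed 16-bit value
            return target
    return None

def push_const(n, stack, mem):
    """DUP followed by the 16 bits of n, most significant bit first."""
    step('DUP', stack, mem)
    for bit in format(n & MASK, '016b'):
        step('ONE' if bit == '1' else 'ZERO', stack, mem)

stack, mem = [7], {}
for op in ('DUP', 'DUP', 'SUB'):         # the push-zero idiom
    step(op, stack, mem)                 # stack is now [7, 0]
# addition as subtraction of a negation: 7 + 3 = 7 - (0 - 3)
push_const(3, stack, mem)                # stack is now [7, 0, 3]
step('SUB', stack, mem)                  # stack is now [7, 0 - 3 mod 2**16]
step('SUB', stack, mem)                  # stack is now [10]
assert stack == [10]
```

Note the ordering in the addition idiom: the zero must be pushed before the value to be negated, since SUB subtracts the top from the value below it.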

The NOP instruction is required because the JPOS instruction
can only address the first instruction in a word. If the logical destination
does not fall on a word boundary, it must be moved to a word boundary by
padding with NOP instructions.

The following high level description of this architecture is presented using
a Pascal-like pseudocode. This description will be used to support the
evaluation of this architecture, and it will be used to systematically
derive the register-transfer logic and the corresponding microcoded control
unit.
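The pseudocode itself is not reproduced in this revision; the following Python sketch is offered in its place. The downward-growing stack, the word-count argument, and the `run` interface are assumptions made here, not taken from the paper; the registers (pc, sp, ir, acc) are those named in the implementation discussion below.

```python
# A Python sketch standing in for the paper's Pascal-like pseudocode.
# One variable per register; the stack-growth direction is assumed.
MASK = 0xFFFF

def run(m, pc, sp, words):
    """Interpret `words` instruction words starting at m[pc];
    sp points at the current stack top."""
    for _ in range(words):
        ir = m[pc]                      # fetch
        pc = (pc + 1) & MASK
        for _ in range(5):              # five syllables per word
            op = ir & 7                 # decode the next syllable
            ir = ir >> 3
            if op == 1:                 # DUP
                acc = m[sp]
                sp = (sp - 1) & MASK
                m[sp] = acc
            elif op == 2:               # ONE
                m[sp] = ((m[sp] << 1) | 1) & MASK
            elif op == 3:               # ZERO
                m[sp] = (m[sp] << 1) & MASK
            elif op == 4:               # LOAD
                acc = m[sp]
                m[sp] = m[acc]
            elif op == 5:               # POP
                acc = m[sp]             # value to store
                sp = (sp + 1) & MASK
                adr = m[sp]             # address below it
                m[adr] = acc
                sp = (sp + 1) & MASK
            elif op == 6:               # SUB
                acc = m[sp]
                sp = (sp + 1) & MASK
                m[sp] = (m[sp] - acc) & MASK
            elif op == 7:               # JPOS
                acc = m[sp]             # jump target
                sp = (sp + 1) & MASK
                cond = m[sp]
                sp = (sp + 1) & MASK
                if 0 < cond < 0x8000:   # positive, as a signed value
                    pc = acc
                    break               # jump targets begin on word boundaries
    return pc, sp

NOP, DUP, SUB = 0, 1, 6
m = [0] * 256
m[0] = DUP | (DUP << 3) | (SUB << 6)    # DUP DUP SUB NOP NOP: push zero
m[100] = 7                              # initial stack top
pc, sp = run(m, 0, 100, 1)
assert (sp, m[sp]) == (99, 0)
```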

Some effort was made to limit the variety of operations used in the
above pseudocode. Among the operations avoided were subscript expressions
such as m[sp+1] or m[m[sp]], and double decrements such as
sp:=sp-2. These would have allowed a more compact textual
representation, at the expense of making the derivation of the hardware
less obvious.

Programming

As an illustration of programming the minimal CISC, consider the problem
of summing the contents of an array in memory. A Pascal-like
outline of a solution to this problem follows:
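The outline itself is not reproduced in this revision; the following Python sketch conveys the intended algorithm. The function name and parameters are invented here for illustration.

```python
# A sketch of the Pascal-like outline, in Python: sum an array by
# counting an index down to zero, as the loop-control comparison
# with the PDP-11 SOB instruction assumes.
def array_sum(mem, base, length):
    total = 0
    i = length
    while i > 0:
        i = i - 1
        total = total + mem[base + i]
    return total

assert array_sum({10: 3, 11: 4, 12: 5}, 10, 3) == 12
```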

The SMAL
assembly language will be used to present the machine code that solves
this problem. In this language, the W assembly directive causes a word
to be assembled into memory, and most of the other details are borrowed
from the MACRO-11 assembler for the PDP-11 (VAX assembly language is very
similar). To simplify the coding, a macro will be defined that assembles
instruction syllables into words:

In the above, note that a<<b shifts a left by
b bits. The CODE macro accumulates one instruction
syllable, while FLUSH should be used before each label to align
the next instruction on a word boundary. To make the code readable,
symbolic constants will be defined for each machine instruction:

NOP = 0 LOAD = 4
DUP = 1 POP = 5
ONE = 2 SUB = 6
ZERO = 3 JPOS = 7
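The packing performed by CODE and FLUSH can be sketched in Python. The helper names mirror the macros but are illustrative, not the SMAL source.

```python
# A sketch of the CODE and FLUSH macros: syllables accumulate into
# 16-bit words, least significant syllable first.
NOP, DUP, ONE, ZERO, LOAD, POP, SUB, JPOS = range(8)

words, _pending = [], []

def flush():
    """Pad the pending word with NOPs and emit it."""
    if _pending:
        while len(_pending) < 5:
            _pending.append(NOP)
        word = 0
        for i, syl in enumerate(_pending):
            word |= syl << (3 * i)       # syl << b shifts syl left b bits
        words.append(word)
        _pending.clear()

def code(syllable):
    """Accumulate one 3-bit instruction syllable."""
    _pending.append(syllable)
    if len(_pending) == 5:
        flush()

for op in (DUP, DUP, SUB):               # the push-zero idiom
    code(op)
flush()
assert words[0] == DUP | (DUP << 3) | (SUB << 6)
```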

Although these tools are sufficient for programming, macros for pushing
constants on the stack are needed if examples are to be presented
compactly. The parameters to these macros are passed by value, as indicated
by the leading equals sign on the formal parameter declarations.

The macros in this program expand to a total of 206 instructions. Thus, the
program fills 41 words, plus 3 bits of a 42nd word. One pass through the
loop body involves 115 instructions and can be executed with 23 memory
cycles for instruction fetch, 8 memory cycles for operand loading and storing,
and 32 memory cycles for stack manipulation. Thus, the run-time for this
program will be dominated by the need to perform 63 memory cycles per
iteration of the loop, where most cycles access the top elements of the
stack.
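The arithmetic behind these figures can be double-checked; the counts themselves come from the text, assuming 5 syllables per 16-bit word as described earlier.

```python
# Sanity check of the figures quoted above.
words_full, leftover = divmod(206, 5)
assert (words_full, leftover) == (41, 1)  # 41 words plus one 3-bit syllable

assert 115 // 5 == 23                     # instruction-fetch cycles per iteration
assert 23 + 8 + 32 == 63                  # memory cycles per iteration in total
```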

In contrast, the equivalent PDP-11 program using general registers and
using the SOB (subtract one and branch) instruction for loop control
can be written in 5 instructions occupying 7 words. In this case, the
loop body is 2 instructions long and requires 3 memory references per
iteration.

The above comparison clearly demonstrates that the minimal CISC
architecture is not competitive, but it does not demonstrate that it is
suboptimal! The class of optimal architectures can be thought of as a
surface in a multidimensional computer design space. Taking typical
axes of the space to be processor complexity (measured, for example, as the
number of nand gates required to implement the processor, or the number of
square millimeters of silicon needed under a fixed set of design rules),
the program size for some benchmark, and the memory traffic required to
execute that benchmark, it is clear that the PDP-11 is better than the
minimal CISC on two axes, but it also requires a more complex
processor. The minimal CISC can only be proven to be suboptimal
if a processor can be found that is better when measured along at least one
axis of the design space while being no worse along any other axes.
Of course, serious evaluation of an architecture must rest on a benchmark
that is more comprehensive than that used here!

Implementation

A systematic approach to implementing the minimal CISC architecture described
by the pseudocode given above involves, first, identifying the
register-transfers used in the code, then building a register transfer
machine that can perform these transfers, and finally designing a control
unit to evoke the transfers in the right order.

Each assignment statement in the pseudocode describes a register transfer.
One register can hold the value of each simple variable, and the list of
assignments to each variable determines the functional units that must
process the input to the corresponding register and the registers
from which those functional units take their inputs. This approach was
followed in deriving the register-transfer logic outlined in Figure 1.
Here, connections to the control unit are shown to the left, and connections
to the memory are to the right. A tri-state data bus is assumed, where
the control signal MRE (memory read) causes the contents of the addressed
word to be gated onto the data bus, and a positive transition of MWR (memory
write) causes the contents of the bus to be stored in the addressed word.

Figure 1

It is worth noting that all of the functional boxes in Figure 1 correspond
closely to standard MSI chips. The sp register and its functional
unit can be implemented by a 74LS169A up-down counter, the pc
register can be implemented by a 74LS161 counter, and the ir
register can be implemented by a 74LS298 register. Finally, a general
purpose ALU such as the 74LS381 can perform the operations in the functional
unit feeding acc.

The reader is invited to rewrite the pseudocode for the minimal CISC with
control signals to evoke particular register transfers substituted for
the assignment statements. Consider using the notation
(ACW=1, IRF=0, IRC, SPF=1, SPC), for example, to mean `hold
ACW and SPF at one, hold IRF at zero, and apply
clock pulses to IRC and SPC'; this would evoke the
register transfers ir:=acc and sp:=sp-1 concurrently.

In carrying out this exercise, note that the operation of shifting the
instruction register can be moved from the end of the inner while
loop to the end of each alternative in the case statement. This
allows the shift operation to be done in parallel with the final register
transfer of each alternative.

Before a microprogram can be written, one hardware detail must be determined:
How does the microcode interpreter perform conditional branches? The solution
used here is outlined in Figure 2. The next microinstruction address is
formed by oring the condition selected by the condition select field
of each microinstruction with the least significant bits of the
next address field. Note the use of a two phase clock; all
registers in the data part change on the negative edge of the clock pulse,
while the microprogram state advances on the positive clock edge.
If a single-phase clock were used, where all registers changed simultaneously,
the overall speed of the system could probably be higher, but the
microprogram would be larger because conditional branches in the microcode
could not depend on the results of the current register transfer.

Figure 2

Microcode for the minimal CISC is given in Table 1. The comments indicate
what phase of which machine instruction each microinstruction performs.
Each instruction execution cycle begins with a fetch or a decode
microinstruction, depending on whether the previous instruction was the
last in a word. NOP, DUP, ONE, ZERO and
LOAD can be executed with one additional microinstruction
(assuming that memory latencies equal register latencies, in the latter case).
The other instructions require more cycles. The line commented
JPOS 3- is used when JPOS detects a negative operand.
The same line is used as the final microinstruction of POP, since
in both cases, the accumulator must be loaded from the stack top.

As noted, the microcode presented in Table 1 assumes that the memory access
time can be handled in a single microcycle. If the clock rate is fixed,
this means that the microcycle time must match the memory cycle time, even
if many register transfers can be finished faster. This assumption does not
hold for most viable computer technologies. Typically, the microcycle
time for non-memory reference operations can be quite fast, while memory
reference operations are slower. With such a technology, it is common to
overlap microexecution with memory latency by designing the microengine
so it can execute multiple microinstructions while a memory cycle is being
completed.