5.3 ARCHITECTURE OF THE RTX 32P

The Harris Semiconductor RTX 32P is a 32-bit member of the Real Time
Express (RTX) processor family. The RTX 32P is a prototype machine that is the
basis of Harris' commercial 32-bit stack machine design.

The RTX 32P is a CMOS chip implementation of the WISC Technologies CPU/32
(Koopman 1987c) which was originally built using discrete TTL components. The
CPU/32 was in turn developed from the WISC CPU/16 described in Chapter 4.
Because of this history, the RTX 32P is a microcoded machine, with on-chip
microcode RAM and on-chip stacks.

The RTX 32P is a 2-chip stack processor designed primarily for maximum
flexibility as an architectural evaluation platform. It contains very large
data and return stacks on-chip, as well as a large amount of on-chip microcode
memory. This large amount of high speed RAM forced the design to use two
chips, but this was consistent with the goal of producing a research and
development vehicle. Real time control is the primary application area for the
RTX 32P.

The primary language for programming the RTX 32P is Forth. However, the
RTX 32P's commercial successor will be enhanced for excellent support of more
conventional languages such as C, Ada, and Pascal; special purpose languages
such as LISP and Prolog; and functional programming languages.

An important design philosophy of the RTX 32P is that as processor speeds
increase, an ALU can be cycled twice for every off-chip memory access.
Therefore the RTX 32P executes two microinstructions for each main memory
access, including instruction fetches. Every instruction is two or more clock
cycles in length, with a different microinstruction executed on each clock
cycle. The reasons for adopting this strategy are discussed at greater length
in Section 9.4.

The Data Stack and Return Stack are implemented as identical hardware
stacks consisting of a 9-bit up/down counter (the Stack Pointer) feeding an
address to a 512 element by 32 bit wide memory. The stack pointers are
readable and writable by the system to provide an efficient way of accessing
deeply buried stack elements.
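
As a rough illustration of the stack organization just described, the
following Python sketch models one hardware stack as a 9-bit wrapping pointer
that addresses a 512-entry memory. The class and method names are
hypothetical, and the pre-increment push ordering is an assumption; the text
does not say how the counter is sequenced.

```python
class HardwareStack:
    """Toy model: a 9-bit up/down counter addressing a 512 x 32-bit RAM."""
    DEPTH = 512            # 2**9 elements
    MASK32 = 0xFFFFFFFF

    def __init__(self):
        self.ram = [0] * self.DEPTH
        self.sp = 0        # Stack Pointer, readable and writable

    def push(self, value):
        self.sp = (self.sp + 1) % self.DEPTH   # count up, wrapping at 9 bits
        self.ram[self.sp] = value & self.MASK32

    def pop(self):
        value = self.ram[self.sp]
        self.sp = (self.sp - 1) % self.DEPTH   # count down
        return value

    def peek_deep(self, depth):
        """Read an element `depth` entries below the top by rewriting the pointer."""
        saved = self.sp
        self.sp = (saved - depth) % self.DEPTH
        value = self.ram[self.sp]
        self.sp = saved                        # restore the original pointer
        return value
```

The peek_deep method shows why a writable stack pointer helps: a deeply buried
element is reached by changing the pointer rather than by popping everything
above it.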

The ALU section includes a standard multifunction ALU with a DHI register
for holding intermediate results. By convention, the DHI register acts as a
buffer for the top stack element. This means that the Data Stack Pointer
actually addresses the element perceived by the programmer to be the
second-from-top stack element. The result is that an operation on the top two
stack elements, such as addition, can be performed in a single cycle, with the
B side of the ALU reading the second stack element from the Data Stack and the
A side of the ALU reading the top stack element from the DHI register.
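
The effect of buffering the top element in DHI can be sketched as follows.
This is a minimal model under stated assumptions, not the actual data path:
the DataPath class and its method names are hypothetical, and a Python list
stands in for the Data Stack RAM.

```python
MASK32 = 0xFFFFFFFF

class DataPath:
    """Toy model: DHI buffers the top of stack, RAM holds everything below it."""

    def __init__(self):
        self.dhi = 0       # programmer-visible top of stack
        self.stack = []    # Data Stack RAM; last item is second-from-top

    def push(self, value):
        self.stack.append(self.dhi)    # old top moves into stack RAM
        self.dhi = value & MASK32

    def add(self):
        # One ALU pass: A side reads DHI, B side reads the Data Stack,
        # and the result is loaded back into DHI.
        self.dhi = (self.dhi + self.stack.pop()) & MASK32
```

Because both operands are available at the same time, the addition needs no
separate step to fetch the second operand.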

The Data Latch on the B side of the ALU input is a normally transparent
latch that can be used to retain data for one clock cycle. This speeds up swap
operations between the DHI register and the Data Stack.

There are no condition codes visible to machine language programs. Add
with carry and other multiple precision operations are supported by microcoded
instructions that push the carry flag onto the data stack as a logical value (0
for carry clear, -1 for carry set).
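
A short Python sketch of this convention follows. The function names are
hypothetical and the code models only the visible result of such an
instruction, not its microcode.

```python
MASK32 = 0xFFFFFFFF

def add_with_carry_out(a, b, carry_in=0):
    """Return (sum, flag); flag is 0 for carry clear, -1 for carry set."""
    total = (a & MASK32) + (b & MASK32) + (1 if carry_in else 0)
    flag = -1 if total > MASK32 else 0
    return total & MASK32, flag

def add64(a_lo, a_hi, b_lo, b_hi):
    """64-bit add built from two 32-bit adds, passing the flag between them."""
    lo, carry = add_with_carry_out(a_lo, b_lo)
    hi, _ = add_with_carry_out(a_hi, b_hi, carry)
    return lo, hi
```

Since the flag is an ordinary stack value, a multiple precision routine can
test it or feed it into the next addition like any other operand.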

The DLO register acts as a temporary holding register for intermediate
results within a single instruction. Both the DHI and DLO registers are shift
registers, connected to allow 64-bit shifting for multiplication and division.
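
To show why a linked 64-bit shift path is useful, here is a textbook
shift-and-add unsigned multiply written in Python. It is a sketch only: the
text does not give the actual microcode algorithm, so the step ordering and
the function name are assumptions.

```python
MASK32 = 0xFFFFFFFF

def multiply_unsigned(multiplicand, multiplier):
    """Return the 64-bit product as (DHI, DLO) after 32 add-and-shift steps."""
    dhi, dlo = 0, multiplier & MASK32
    for _ in range(32):
        carry = 0
        if dlo & 1:                                  # low multiplier bit set
            total = dhi + (multiplicand & MASK32)
            dhi, carry = total & MASK32, total >> 32
        # Shift the DHI:DLO pair right by one; the carry enters at the top.
        dlo = (dlo >> 1) | ((dhi & 1) << 31)
        dhi = (dhi >> 1) | (carry << 31)
    return dhi, dlo
```

Each loop iteration corresponds to one add followed by one shift of the
combined register pair, with multiplier bits leaving DLO as product bits
enter it.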

An off-chip Host Interface is used to connect to the personal computer
host. Since all on-chip storage is RAM-based, an external host is required for
initializing the CPU.

The RTX 32P has no program counter. Every instruction contains the address
of the next instruction or refers to the address on the top of the return
address stack. This design decision is in keeping with the observation that
Forth programs contain a very high proportion of subroutine calls. Section
6.3.3 discusses the effects of the RTX 32P's instruction format in greater
detail.

Instead of a program counter, the block described as the Memory Address
Logic contains a Next Address Register (NAR), which holds the pointer for
fetching the next instruction. The Memory Address Logic uses the top element
of the Return Stack to address memory for subroutine returns, while it uses the
RAM address register (ADDR REG) for doing memory fetches and stores
efficiently. The Memory Address Logic also contains an increment-by-4 circuit
for generating return addresses for subroutine call operations. Since the
Return Stack and Memory Address Logic can be isolated from the system Data Bus,
subroutine calls, subroutine returns, and unconditional jumps can be performed
in parallel with other operations. This results in these control transfer
operations costing zero clock cycles in many cases.

Program memory is organized as up to 4 Gbytes of memory, addressable on
byte boundaries. Instructions and 32-bit data items are required to be aligned
on 32-bit memory boundaries, since data is accessed in 32-bit words from
memory. The actual RTX 32P chips can address only 8M bytes because of the
limited number of pins on the package.

Microprogram Memory is an on-chip read/write memory containing 2K elements
by 30 bits. The memory is addressed as 256 pages of 8 words each. Each opcode
in the machine is allocated its own page of 8 words. The Microprogram Counter
supplies a 9-bit page address of which only the lowest 8 bits are used in this
implementation. This scheme allows supplying 3 bits from the current
microinstruction, the lowest bit of which is the result of a 1-in-8 conditional
microbranch selection, as the address for the next microinstruction within the
same microcode page. This allows conditional branching and looping during the
execution of a single opcode.
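
The address formation described above can be written out as a small Python
function. The function and parameter names are hypothetical; the layout (an
8-bit page, two constant bits, and one selected condition bit) follows the
description in the text.

```python
def next_micro_address(page, high2, cond_select, conditions):
    """Form an 11-bit Microprogram Memory address.

    page        -- page address (only the lowest 8 bits are used)
    high2       -- two constant bits from the current microinstruction
    cond_select -- which of the 8 condition inputs supplies the lowest bit
    conditions  -- sequence of 8 truth values (the condition inputs)
    """
    low_bit = 1 if conditions[cond_select & 0b111] else 0
    offset = ((high2 & 0b11) << 1) | low_bit      # 3-bit offset within the page
    return ((page & 0xFF) << 3) | offset          # 256 pages of 8 words each
```

For example, with page 5 and constant bits 10, the next microinstruction is
fetched from word 44 if the selected condition is false and from word 45 if it
is true.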

Instruction decoding is accomplished simply by loading the 9-bit opcode
into the Microprogram Counter and using that as the page address to
Microprogram Memory. Since the Microprogram Counter is built with a counter
circuit, operations can span more than one 8-microinstruction page if required.

The Microinstruction Register (MIR) holds the output of the Microprogram
Memory. This allows the next microinstruction to be accessed from Microprogram
Memory in parallel with execution of the current microinstruction. The MIR
completely removes the Microprogram Memory access delay from the system's
critical path. Its use also enforces a lower limit of two clock cycles on
instructions. If an instruction could be accomplished in a single clock cycle,
a second no-op microinstruction must be added to allow the next instruction to
flow through the MIR fetching sequence properly.

The Host Interface allows the RTX 32P to operate in two possible modes:
Master Mode and Slave Mode. In Slave Mode, the RTX 32P is controlled by the
personal computer host to allow program loading, microprogram loading, and
alteration of any register or memory location on the system for initialization
or debugging. In Master Mode, the RTX 32P runs its program freely, while the
host computer monitors a status register for a request for service. While the
RTX 32P is in Master Mode the host computer may enter a dedicated service loop,
or may perform other tasks such as prefetching the next block of a disk input
stream or displaying an image, and only periodically poll the status register.
The RTX 32P will wait for service from the host for as long as is necessary.

The RTX 32P has only one instruction format, shown in Figure 5.4. Every
instruction contains a 9-bit opcode which is used as the page number for
addressing microcode. It also contains a 2-bit program flow control field that
invokes either an unconditional branch, a subroutine call, or a subroutine
exit. In the case of either a subroutine call or unconditional branch, bits
2-22 are used to specify the high 21 bits of a 23-bit word-aligned target
address. This design limits program sizes to 8M bytes unless the page register
in the Memory Address Logic is used with special far jump and call
instructions. Data fetches and stores see the memory as a contiguous 4G byte
address space.
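
As a sketch of how such an instruction word might be packed and unpacked,
consider the Python functions below. The text places the address field in
bits 2-22; putting the opcode in bits 23-31 and the flow control field in bits
0-1 is an assumption made here because those are the bits that remain. The
meaning of each flow control value is not given in the text, so the field is
passed through as a raw number. The function names are hypothetical.

```python
ADDRESS_MASK = 0x007FFFFC     # bits 2-22: a word-aligned 23-bit byte address

def encode_instruction(opcode, target, flow):
    """Pack a 9-bit opcode, a word-aligned target, and a 2-bit flow field."""
    return ((opcode & 0x1FF) << 23) | (target & ADDRESS_MASK) | (flow & 0b11)

def decode_instruction(word):
    """Split a 32-bit instruction into (opcode, target, flow)."""
    opcode = (word >> 23) & 0x1FF     # assumed position: top 9 bits
    target = word & ADDRESS_MASK      # low 2 address bits are implicitly zero
    flow = word & 0b11                # assumed position: bottom 2 bits
    return opcode, target, flow
```

The largest encodable target is 0x7FFFFC, the last word below 8M bytes, which
matches the program size limit stated above.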

Wherever possible, the RTX 32P's compiler compacts an opcode followed by a
subroutine call, return, or jump into a single instruction. In those cases
where such compaction is not possible, a NOP opcode is compiled with a call,
jump, or return, or a jump to next in-line instruction is compiled with an
opcode. Tail-end recursion elimination is performed by compressing a
subroutine call followed by a subroutine return into a simple jump to the
beginning of the subroutine that was to be called, saving the cost of the
return that would otherwise be executed in the calling routine.
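
The compaction rules above can be sketched as a small peephole pass in Python.
This is an illustration only, not the actual compiler: the token format, the
"next" marker for a jump to the next in-line instruction, and the function
name are all hypothetical.

```python
def compact(tokens):
    """Combine ('op', name), ('call', addr), ('jump', addr), ('ret',) tokens
    into (opcode, flow, target) instructions."""
    out, i = [], 0
    while i < len(tokens):
        start = i
        opcode = "NOP"
        if tokens[i][0] == "op":
            opcode = tokens[i][1]
            i += 1
        if i < len(tokens) and tokens[i][0] in ("call", "jump", "ret"):
            kind = tokens[i][0]
            target = tokens[i][1] if kind != "ret" else None
            i += 1
            # Tail call: a call immediately followed by a return becomes a jump.
            if kind == "call" and i < len(tokens) and tokens[i][0] == "ret":
                kind = "jump"
                i += 1
            out.append((opcode, kind, target))
        elif i == start:
            raise ValueError(f"unexpected token {tokens[i]!r}")
        else:
            out.append((opcode, "next", None))   # opcode with jump to next in-line
    return out
```

Given the sequence DUP, +, call 0x100, call 0x200, return, this pass emits
three instructions: DUP with a jump to the next instruction, + combined with
the call to 0x100, and a NOP combined with a jump to 0x200.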

Since the RTX 32P uses RAM for the microcode memory, the microcode may be
completely changed by the user if desired. The standard software environment
for the CPU/32 is a version of MVP-FORTH, a FORTH-79 dialect (Haydon 1983).
Some of the Forth instructions included in the standard microcoded instruction
set are shown in Table 5.2. One thing that is noticeable in this instruction
set is the number and complexity of instructions supported.

Table 5.2b shows some common Forth word combinations that are available as
single instructions. Table 5.2c shows some words that are used to support
underlying Forth operations such as subroutine call and exit. Table 5.2d lists
some high level Forth words that are directly supported by specialized
microcode. Table 5.2e shows words that were added in microcode to support
extended precision integer operations and 32-bit floating point calculations.

Since the instructions vary considerably in complexity, execution time of
instructions ranges accordingly. Simple instructions that manipulate data on
the stack such as + and SWAP take 2 microcycles (one memory cycle) each.
Complex instructions such as Q+ (128-bit addition) may take 10 or more
microinstructions, but are still much faster than comparable high level code.
If desired, microcoded loops can be written that can potentially last thousands
of clock cycles to do things such as block memory moves.

Figure 5.5 -- RTX 32P microinstruction format.

As mentioned earlier, each instruction invokes a sequence of
microinstructions on a Microprogram Memory page corresponding to the 9-bit
opcode for the instruction. Figure 5.5 shows the microinstruction format. The
microcode used is horizontal, which means that there is only one format for
microcode, and that the format is broken into separate fields to control
different portions of the machine.

As with the WISC CPU/16, the simplicity of the stack machine approach and
the RTX 32P hardware results in a simple microcode format, in this case only
using 30 bits per microinstruction. The microcode format of the RTX 32P is
similar to that of the CPU/16 discussed in the previous chapter.

Bits 0-3 of the microinstruction specify the source of the system Data Bus.
Two of the bus sources are used as special control signals to configure the
RTX 32P for one-clock-cycle-per-bit multiplication and nonrestoring division of
32/64 bit numbers.

Bits 4-7 specify the Data Bus destination. Two special cases for
destinations exist: DLO may be independently specified as a bus destination
using bits 22-23, and the DHI register is always loaded with the ALU output.
Bits 8-9 and 10-11 specify Data Stack Pointer and Return Stack Pointer control,
respectively. Bits 12-13 control a shifter on the output of the ALU. This
shifter allows shifting left or right, as well as an 8-bit rotation function.

Bits 14-15 of the microinstruction are unused, and therefore not included
in the Microcode RAM. Bits 16-20 control the function of the ALU. Bit 21
specifies a carry-in of 0 or 1. To synthesize multiple precision arithmetic,
the microcode does a conditional microbranch based on the carry-out of the low
half of the result, and then forces the next carry-in to 0 or 1 as appropriate.
Bits 22-23 control the loading and shifting of the DLO register.

Bits 24-29 of the microinstruction are used to compute a 3-bit offset into
the microprogram page for fetching the next microinstruction. Bits 24-26
select one of eight condition codes to form the lowest address bit, while bits
27-28 are used as constants to generate the two high order address bits. This
allows jumping and 2-way conditional branching anywhere within the microprogram
page on every clock cycle. Bit 29 can be used to increment the contents of the
9-bit Microprogram Counter to allow opcodes to use more than 8 Microprogram
Memory locations. Bit 30 initiates the instruction decoding sequence for the
next instruction. This is required since instructions are a variable number of
clock cycles long. Bit 31 controls the return address incrementer for use as
a counter into memory for block data accesses.
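
The field layout described in the preceding paragraphs can be collected into a
table and used to unpack a microinstruction, as in the Python sketch below.
The field names are hypothetical. The bus destination is placed in bits 4-7 as
an assumption, since those are the only bits the description otherwise leaves
unassigned.

```python
# (low_bit, high_bit) for each microinstruction field.
FIELDS = {
    "bus_source":       (0, 3),
    "bus_destination":  (4, 7),     # assumed position
    "dsp_control":      (8, 9),
    "rsp_control":      (10, 11),
    "shifter":          (12, 13),
    # bits 14-15 are unused and not stored in microcode RAM
    "alu_function":     (16, 20),
    "carry_in":         (21, 21),
    "dlo_control":      (22, 23),
    "condition_select": (24, 26),
    "next_addr_high":   (27, 28),
    "increment_mpc":    (29, 29),
    "decode_next":      (30, 30),
    "addr_increment":   (31, 31),
}

def decode_microinstruction(word):
    """Return a dict mapping each field name to its value in `word`."""
    fields = {}
    for name, (low, high) in FIELDS.items():
        width = high - low + 1
        fields[name] = (word >> low) & ((1 << width) - 1)
    return fields

# Total field width; the two unused bits are excluded.
STORED_BITS = sum(high - low + 1 for low, high in FIELDS.values())
```

Summing the field widths gives 30, which agrees with the 30-bit width of the
Microprogram Memory.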

One microinstruction is executed on every clock cycle, with two or more
microinstructions executed for every machine macroinstruction.

The heritage of the WISC CPU/16 in the RTX 32P architecture is unmistakable.
The most obvious area of improvement is the addition of more efficient Memory
Address Logic and the isolation of the Return Address Stack from the Data Bus
during subroutine call and return operations. These changes, along with the
RTX 32P's unique instruction format, allow subroutine calls, returns, and jumps
to be processed "for free" to the extent that they can be combined
with opcodes.

The RTX 32P's clock runs at twice the speed that main memory can be
accessed, thus giving two clock cycles per memory cycle, and a minimum of two
clock cycles per instruction.

There are a number of uses for the RTX 32P's instruction format, many of
which are not immediately obvious. One of them is for executing conditional
branches. The RTX 32P does not have direct hardware support for conditional
branches, since this would slow down the rest of the hardware too much on other
instructions or require excessively fast program memory. Conditional branches
are accomplished by using a special 0BRANCH opcode combined with a
subroutine call to the branch target. The subroutine call is processed by the
hardware in parallel with the opcode's evaluation of whether the top stack
element is zero (in which case the branch is taken). If the branch is to be
taken, the Return Stack is popped, converting the subroutine call to just a
jump, and execution continues. If the branch is not to be taken, the microcode
pops the Return Stack and uses the value to fetch the branch fall-through
instruction, in effect performing an immediate subroutine return. The cost for
this conditional branch is 3 clock cycles to take a branch, 4 clock cycles to
not take a branch. Remember that on this processor each memory cycle is 2
clock cycles.
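
The conditional branch sequence can be modeled in a few lines of Python. This
is a behavioral sketch with hypothetical names; it assumes that 0BRANCH
consumes the top stack element, as in conventional Forth, and that the saved
return address is the instruction address plus 4.

```python
def zero_branch(instr_addr, target, data_stack, return_stack):
    """Return (next_address, clock_cycles) for 0BRANCH compiled with a call."""
    return_stack.append(instr_addr + 4)   # the call proceeds in parallel
    flag = data_stack.pop()               # assumed: 0BRANCH consumes the flag
    fall_through = return_stack.pop()     # both outcomes pop the Return Stack
    if flag == 0:
        return target, 3                  # taken: the call becomes a jump
    return fall_through, 4                # not taken: immediate return
```

In both cases the Return Stack ends up unchanged, so the only lasting effects
are the consumed flag and the choice of next instruction address.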

Another interesting capability of the RTX 32P is quick access of any memory
location as a variable. Even though the 0-operand instruction format would
seem to require a second memory location to specify the variable address, the
following operation can be used. A special opcode is compiled with a
subroutine call, where the address of the "subroutine" is actually
the address of the variable desired to be fetched. The microcode then "steals"
the variable value as the instruction fetching logic reads it in, then forces a
subroutine return before the value can be executed as an instruction.
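
A behavioral sketch of this variable fetch in Python is given below. The
function name and the use of a dictionary for memory are hypothetical, and the
return address is assumed to be the instruction address plus 4.

```python
def fetch_variable(instr_addr, var_addr, memory, data_stack, return_stack):
    """Model a fetch opcode compiled with a 'call' to the variable's address.

    Returns the address of the next instruction to execute.
    """
    return_stack.append(instr_addr + 4)   # call hardware saves the return address
    fetched = memory[var_addr]            # the "instruction fetch" reads the variable
    data_stack.append(fetched)            # microcode steals the word as data
    return return_stack.pop()             # forced return to the calling code
```

The variable's address travels in the address field of the instruction itself,
so no second memory word is needed to name it.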

The point of discussing these two methods is to illustrate that there are
several significant capabilities of the hardware that are not immediately
obvious to programmers who are used to more conventional machines. These
capabilities are especially useful in programming data structure accesses (for
example, expert system decision trees), and actually allow direct execution of
data structures. This direct execution is accomplished by storing the data in
a tagged format having a 9-bit tag (corresponding to special user-defined
opcodes) and a 23-bit address that is a subroutine call or jump to the next
data element in the structure, or a subroutine return for a nil pointer.
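
The idea of executing a data structure can be illustrated with the Python
sketch below. It is a loose model with hypothetical names: each cell is a
(tag, next address) pair, handlers stand in for user-defined opcodes, and a
next address of None plays the role of the subroutine return used for a nil
pointer.

```python
def execute_structure(memory, start, handlers):
    """Walk a chain of tagged cells, running each tag as an opcode."""
    results, addr = [], start
    while addr is not None:
        tag, next_addr = memory[addr]
        results.append(handlers[tag]())   # the tag selects a user-defined opcode
        addr = next_addr                  # the address field acts as a jump
    return results
```

Traversal requires no separate interpreter loop on the real machine, because
following the address field is simply what instruction fetch does.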

An important implementation feature of the RTX 32P is that all resources on
the machine can be directly controlled by the host computer. This can be done
because the host interface supports Microinstruction Register load and
single-step clock features. With these features, any microinstruction desired
can be executed by first loading values into any or all registers in the
system, loading a microinstruction, cycling the clock, then reading data values
back to examine the results. This design technique makes writing microcode
extremely straightforward, eliminating the need for expensive external analysis
hardware. It also makes testing and diagnostic programs very simple to write.

The RTX 32P supports interrupt handling, including interrupt on stack
underflow and overflow for both the Data Stack and Return Stack. The usual
technique for handling these overflows and underflows is to page in or out half
the on-chip stack contents to a holding area in program memory. This allows
programs to use arbitrarily deep stacks. With a 512 element hardware stack
buffer size, typical Forth programs never experience a stack overflow.
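
The paging technique can be sketched in Python as follows. This is a minimal
model of the policy only, with hypothetical names; real handlers would run as
interrupt service routines and move data between the on-chip stack RAM and
program memory.

```python
class SpillingStack:
    """Toy model: an on-chip buffer that pages half its contents in or out."""

    def __init__(self, capacity=512):
        self.capacity = capacity
        self.on_chip = []     # newest element last
        self.spilled = []     # holding area in program memory

    def push(self, value):
        if len(self.on_chip) == self.capacity:          # overflow interrupt
            half = self.capacity // 2
            self.spilled.extend(self.on_chip[:half])    # page out the oldest half
            del self.on_chip[:half]
        self.on_chip.append(value)

    def pop(self):
        if not self.on_chip and self.spilled:           # underflow interrupt
            half = self.capacity // 2
            self.on_chip = self.spilled[-half:]         # page newest half back in
            del self.spilled[-half:]
        return self.on_chip.pop()
```

Moving half the buffer at a time leaves room in both directions after each
interrupt, so a program hovering near the boundary does not trap on every
push and pop.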

The RTX 32P is implemented on 2.5 micron CMOS standard cell technology in a
2-chip set. The data path chip, which contains the ALU, data stack, and ALU
bits of the microcode memory, is an 84 pin Leadless Chip Carrier (LCC). The
control chip, which contains the rest of the system, is packaged in a 145 pin
Pin Grid Array (PGA). The RTX 32P runs at 8 MHz.

The RTX 32P is designed for real time control applications, especially in
the area of embedded systems with low power and small size requirements. As
was mentioned previously, the RTX 32P is a prototyping vehicle for a commercial
processor which, as of this writing, is planned to be called the RTX 4000.
This new processor will have several features that make it suitable for use in
real time control applications and personal computer coprocessor acceleration
tasks including: a mixture of ROM and RAM microcode to shrink the system onto a
single chip, stand-alone operation, on-chip hardware support for floating point
math, a significantly faster clock speed, and on-chip support for dynamic
program memory chips. Some versions of the chip may not have all these
features. In addition, architectural enhancements will be made to support
languages such as C, Ada, and LISP by allowing use of the address field in the
instruction to specify fast-access 21-bit literals. This will allow crucial
operations such as frame-pointer-plus-offset addressing to run at high speed.

The information in this section is based on the descriptions of the WISC
CPU/32 in Koopman (1987c) and Koopman (1987d), and the introduction of the RTX
32P in Koopman (1989).