National Semiconductor Swordfish

The Swordfish is a unique design with a superscalar external appearance
but a long-instruction-word (LIW) internal microarchitecture based on a
decoded instruction cache (DINC).

Swordfish chip photo (1.4M TIF file). A 1 KB data cache lies along the
bottom of the die on the left. Above it, in the left-hand column, are the
floating-point unit and the DSP multiplier (at the very top left).
The CPU core is in the middle column, extending two-thirds of the way down.
The instruction emulator is in the lower middle, below the CPU core. At the
top right-hand edge is a 4 KB instruction cache. Below it, and slightly
toward the middle, is the instruction loader. Along the lower right-hand
edge are the DMA, ICU, and timers. The BIU is along the bottom of the die,
one quarter of the way in from the right.

The Swordfish was designed to be a successor chip to the NS32532.
As such it was initially known as the NS32732 (and later as the NS32764).
Even though it was not delivered as an NS32K family member, the design
lives on in one of National's embedded processor lines (see
CompactRISC).

The people involved in the Swordfish design were:

Ran Talmudi, project manager

Don Alpert, chief architect

Dror Avnon, manager for the FPU

Amos Ben Meir, responsible for the MMU

Sorin Iacobovici and Amos Intrater, design of the bus

Gideon Intrater, design of the cache subsystem

The Swordfish design was started in the late 1980s in Israel.
The design featured dual integer pipelines, A and B, for superscalar issue.
The pipelines were standard RISC-like designs, with five stages
(fetch, decode, execute, memory, writeback). Pipeline B was the primary,
in the sense that all instructions could execute on it. Pipeline A was
secondary and in particular could not execute branches or initiate
floating-point instructions. However, the first instruction in an
instruction pair fetched from the decoded instruction cache (see below)
was always assigned to pipeline A, and the second to pipeline B.
The pipelines operated in lockstep except when the decoder in pipeline
B stalled due to a dependency between the paired instructions or some
other condition. A new instruction pair would not be obtained until
both instructions from the previous pair had exited the decode stages.
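
The pairing and stall behavior described above can be pictured with a
small Python sketch. It is an illustration only, not a description of the
actual hardware: the Pair record, the decode_schedule function, and the
one-cycle stall penalty for a dependent pair are all assumptions
introduced here.

```python
from dataclasses import dataclass


@dataclass
class Pair:
    a: str           # first instruction in program order, always sent to pipeline A
    b: str           # second instruction in program order, always sent to pipeline B
    dependent: bool  # True if b needs a result produced by a


def decode_schedule(pairs, stall_cycles=1):
    """Return (cycle, pipeline, instruction) for when each instruction leaves decode.

    Assumption: a dependent pair holds pipeline B's decoder for
    `stall_cycles` extra cycles; the real penalty is not given in the text.
    """
    schedule = []
    cycle = 0
    for pair in pairs:
        cycle += 1                      # a new pair enters decode
        schedule.append((cycle, "A", pair.a))
        if pair.dependent:
            cycle += stall_cycles       # B waits for A's result
        schedule.append((cycle, "B", pair.b))
        # The next pair is not fetched until both instructions have left decode.
    return schedule


pairs = [
    Pair("add r1, r2", "sub r3, r4", dependent=False),
    Pair("add r1, r2", "sub r5, r1", dependent=True),
    Pair("add r6, r7", "sub r3, r4", dependent=False),
]
schedule = decode_schedule(pairs)
```

Under these assumptions the three pairs need four decode cycles instead of
three, because the dependent middle pair issues sequentially.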

A register scoreboard was used to control RAW and WAW stalls.
Additionally, a load reservation FIFO was included so that the
pipelines could continue execution past data cache misses, each
of which required about six cycles to satisfy. The register
scoreboard would stall a load-dependent instruction if it was
decoded prior to the missing data being returned from cache.
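
A minimal sketch of how a scoreboard and a load FIFO could interact is
given below. The Scoreboard class and its method names are hypothetical,
the six-cycle latency is the approximate figure quoted above, and in-order
return of missed loads is an assumption. Only the stall on a
load-dependent instruction is modeled.

```python
from collections import deque


class Scoreboard:
    """Toy model: tracks registers that are waiting for missed load data."""

    def __init__(self):
        self.pending = set()      # registers whose load data has not returned
        self.load_fifo = deque()  # outstanding misses, oldest first: (dest, ready_cycle)

    def issue_load_miss(self, dest, cycle, miss_latency=6):
        # Assumes at most one outstanding load per destination register.
        self.pending.add(dest)
        self.load_fifo.append((dest, cycle + miss_latency))

    def tick(self, cycle):
        """Complete loads whose data has returned by `cycle`, in FIFO order."""
        while self.load_fifo and self.load_fifo[0][1] <= cycle:
            dest, _ = self.load_fifo.popleft()
            self.pending.discard(dest)

    def must_stall(self, sources):
        """An instruction stalls in decode if any source register is still pending."""
        return any(reg in self.pending for reg in sources)


sb = Scoreboard()
sb.issue_load_miss("r1", cycle=0)
sb.tick(1)
stalls_on_r1 = sb.must_stall(["r1", "r2"])    # depends on the missing load
independent_ok = not sb.must_stall(["r3"])    # unrelated work continues
sb.tick(6)
released = not sb.must_stall(["r1", "r2"])    # data has returned
```

The point of the sketch is that instructions not touching the missing
register keep flowing while the miss is outstanding.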

Each instruction that was issued to pipeline B was also supplied to
the floating-point pipeline, so that a floating-point instruction could
be immediately started. The floating-point pipeline consisted of five
stages after fetch: decode, execute-1, execute-2, round and normalize,
and write back. Pipeline B operated in lockstep with an instruction
in the floating-point pipeline; this was done in order to control program
sequencing. If the floating-point instruction could trap, pipeline B
additionally cycled twice in its memory stage so that both pipeline B and
the floating-point pipeline would enter their respective write-back stages
simultaneously. Floating-point traps were thereby made precise (i.e.,
no instructions beyond the trapping one would be allowed into
a writeback stage).
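
The stage alignment can be checked with a short sketch. The stage names
come from the text; the pipeline_b_stages function is a hypothetical
helper, and treating the repeated memory stage as simple padding is an
assumption about how the extra cycle was spent.

```python
INT_STAGES = ["decode", "execute", "memory", "writeback"]
FP_STAGES = ["decode", "execute-1", "execute-2", "round/normalize", "writeback"]


def pipeline_b_stages(fp_can_trap):
    """Stage occupied by pipeline B in each cycle after fetch."""
    stages = list(INT_STAGES)
    if fp_can_trap:
        # Repeat the memory stage until both pipelines reach writeback together.
        pad = len(FP_STAGES) - len(INT_STAGES)
        i = stages.index("memory")
        stages[i:i] = ["memory"] * pad
    return stages


b_stages = pipeline_b_stages(fp_can_trap=True)
aligned = b_stages.index("writeback") == FP_STAGES.index("writeback")
```

With the padding in place, pipeline B spends two cycles in its memory
stage and both pipelines reach write back in the same cycle.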

The initial chip ran at 50 MHz, and could perform a 32bx32b integer
multiply in one cycle or a 16bx16b->32b signed integer multiply in one
cycle (with selection of the 16b from either low or high halves of the
registers to help implement complex arithmetic). Three floating-point
units were provided: an adder, a multiplier, and a divider.

Don Alpert states,

Swordfish was most strongly influenced by:

MIPS-X at Stanford. We followed a similar integer pipeline and looked at
their branch handling as well. I visited Stanford in summer 1987(?) and was
exposed to the work in detail.

Multiflow VLIW. I had met Josh Fisher once when I was a student at
Stanford, then heard him give a talk about Multiflow at UC Berkeley in 1987
(?). We were trying to figure out how to get parallelism out of multiple
functional units, and adopted a microarchitecture that was like VLIW: each FU
was assigned to fixed slots in a 2-wide instruction word fetched from the
cache. We had the HW detect dependencies as instructions were placed in the
cache slots, so it was a superscalar architecture with a VLIW machine
organization. To improve icache efficiency we allowed dependent instructions
to be packed together with a bit per pair of instructions that indicated
whether or not they were dependent. Independent instructions could be
executed in parallel, dependent instructions had to be executed sequentially,
but still on the pipeline assigned to that slot. Just about the only wasted
cache slots were for FP instructions that could not be paired with a load or
integer op.

... Overall it was a very efficient architecture. With little extra cost for a
second integer pipe and a simple control structure, it was possible to
derive a lot of parallelism on many embedded loops.

As mentioned in the quote, the instruction cache was organized into instruction
pair entries (or, 2-wide LIW), with each instruction mapped to one of the two
integer pipelines. Pre-decoding ("instruction loading") was performed during
instruction cache refill to determine the instruction pairs, identify any
true dependency (i.e., RAW) between the two instructions in a pair, and to
calculate and store the branch target address (rather than storing the branch
offset). A predict-taken branching policy was used and yielded 0-cycle taken
branches.

Two additional bits were used in each instruction cache entry: one to
indicate a dependent pair and thus force sequential issue, and another
to indicate an emulated rather than hardwired instruction.
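
The pre-decoding step and the resulting cache entry can be sketched as
follows. This is a simplified illustration, not the documented entry
format: the Instr and CacheEntry records, their field names, and the
predecode_pair function are hypothetical, and the branch target is assumed
to be the branch's own address plus its offset.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Instr:
    op: str
    dest: Optional[str] = None
    srcs: Tuple[str, ...] = ()
    branch_offset: Optional[int] = None  # set only for branches


@dataclass(frozen=True)
class CacheEntry:
    slot_a: Instr                 # issued to pipeline A
    slot_b: Instr                 # issued to pipeline B
    dependent: bool               # forces sequential issue of the pair
    emulated: bool                # entry refers to the instruction emulator
    branch_target: Optional[int]  # stored instead of the branch offset


def predecode_pair(first, second, second_pc, emulated=False):
    """Build one instruction cache entry during refill (hypothetical format)."""
    if first.branch_offset is not None:
        raise ValueError("pipeline A cannot execute branches")
    # True (RAW) dependency: the second instruction reads what the first writes.
    dependent = first.dest is not None and first.dest in second.srcs
    target = None
    if second.branch_offset is not None:
        target = second_pc + second.branch_offset  # assumed addressing rule
    return CacheEntry(first, second, dependent, emulated, target)


dep_entry = predecode_pair(
    Instr("add", "r1", ("r2", "r3")), Instr("sub", "r4", ("r1", "r5")), 0x1004
)
br_entry = predecode_pair(
    Instr("add", "r1", ("r2",)), Instr("br", branch_offset=-16), 0x1004
)
```

Storing the computed target in the entry is what lets a predicted-taken
branch redirect fetch without first adding the offset.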

The initial plans were to implement only a "performance-critical" core
of the NS32K instruction set. When a marked instruction was fetched, an
instruction emulator unit would feed a sequence of core instructions to
the pipelines in order to interpret an unimplemented instruction (cf. PPro).
Later, an approach using a native RISC-like instruction set was adopted.
The native instructions used the undefined opcodes in the NS32K definition.
Pre-decoding would then classify the NS32K instructions into 3 groups:

instructions that could be translated into one native instruction,
which was then placed in one of the two instruction slots
in the instruction cache entry according to instruction-pairing rules

instructions that could be translated into two native instructions,
which were then placed into both instruction slots in the current
instruction cache entry

instructions that required more than two native instructions; for these,
the instruction cache stored an entry point for the instruction emulation
unit (a microcoded state machine that emitted a series of native
instructions into the two pipelines to emulate complex NS32K instructions)
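
The three-way classification above amounts to a decision on how many
native instructions an NS32K instruction expands into. A minimal sketch,
assuming that count is already known to the pre-decoder; the Group names
and the classify function are hypothetical labels, not terms from the
design:

```python
from enum import Enum


class Group(Enum):
    ONE_SLOT = "one native instruction, placed in a single slot"
    BOTH_SLOTS = "two native instructions, filling both slots of the entry"
    EMULATED = "entry point for the instruction emulation unit"


def classify(native_count):
    """Map the number of native instructions needed to a pre-decode group."""
    if native_count < 1:
        raise ValueError("an instruction needs at least one native instruction")
    if native_count == 1:
        return Group.ONE_SLOT
    if native_count == 2:
        return Group.BOTH_SLOTS
    return Group.EMULATED


groups = [classify(n) for n in (1, 2, 5)]
```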

Patents on Swordfish techniques

5,669,011 - Partially decoded instruction cache

5,481,751 - Apparatus and method for storing partially-decoded
instructions in the instruction cache of a CPU having multiple
execution units

5,263,153 - Monitoring control flow in a microprocessor

5,249,286 - Selectively locking memory locations within a
microprocessor's on-chip cache (includes the revision 2.0 architecture
manual as an appendix, which is dated February 1990)