V22.0436 - Prof. Grishman

Lecture 19: Pushing CPU Performance

Pipelining, cont'd (chapter 6)

structural hazards: two instructions want to use the same module
(e.g., the ALU) in the same clock cycle. This problem is reduced
in machines with a very uniform instruction set, such as MIPS. Now
that logic is cheap, we may duplicate some components to avoid structural
hazards.
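The "same module in the same clock cycle" condition can be sketched as a small check. This is a toy model, not any real pipeline: the unit names and request format are invented for illustration.

```python
# Toy pipeline model: flag structural hazards, i.e. cycles in which two
# in-flight instructions request the same hardware unit. Unit names
# ("mem", "alu") and the request format are illustrative assumptions.

def structural_conflicts(schedule):
    """schedule: list of (cycle, unit) requests; return conflicting (cycle, unit)s."""
    seen = set()
    conflicts = []
    for cycle, unit in schedule:
        if (cycle, unit) in seen:
            conflicts.append((cycle, unit))
        seen.add((cycle, unit))
    return conflicts

# With one shared memory port, an instruction fetch and a data load
# that land in the same cycle collide:
requests = [(1, "mem"), (1, "mem"), (2, "alu")]
print(structural_conflicts(requests))  # [(1, 'mem')]
```

Duplicating the unit (a second memory port, a second ALU) removes the conflict, which is exactly the "logic is cheap" fix described above.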

data hazards: one instruction uses the result of the previous instruction.
The simplest solution is to "stall" ... to hold up the current instruction
until the prior one has finished. A more efficient solution is data
forwarding ... to send the result of one instruction directly to the ALU
for the next instruction, in addition to writing it into the register file.
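The cost of stalling versus forwarding can be sketched with a toy cycle counter for a classic 5-stage pipeline. The specific penalties (result usable one cycle later with forwarding, three cycles later without) are illustrative assumptions, not figures for any particular CPU.

```python
# Toy cycle count for ALU instructions in a 5-stage pipeline, with and
# without data forwarding. Stall penalties are illustrative assumptions.

def total_cycles(instrs, forwarding):
    """instrs: list of (dest_reg, [src_regs]). Returns total cycles to finish."""
    cycle = 0
    ready = {}                      # register -> first cycle it can feed EX
    for dest, srcs in instrs:
        issue = cycle + 1           # cycle this instruction reaches EX
        for s in srcs:
            if s in ready:
                issue = max(issue, ready[s])   # stall until operand is ready
        cycle = issue
        # With forwarding, the result feeds the very next EX stage;
        # without it, the consumer waits until after write-back.
        ready[dest] = cycle + 1 if forwarding else cycle + 3
    return cycle + 4                # drain the remaining stages of the pipe

prog = [("r1", []), ("r2", ["r1"]), ("r3", ["r2"])]   # each uses the previous result
print(total_cycles(prog, forwarding=False))  # 11 cycles (stalls)
print(total_cycles(prog, forwarding=True))   # 7 cycles (no stalls)
```

With no dependences, three instructions finish in 7 cycles on a 5-stage pipeline; forwarding recovers that ideal figure even for this fully dependent sequence.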

branch hazards: we must wait until a conditional branch completes
before we know whether the following instructions should be executed.
Again, we could stall the instruction after the branch, but this is inefficient.
Alternatively, we can guess whether or not the branch is taken, start subsequent
instructions based upon our guess, but wait to store their results until
we know the outcome of the branch. If our guess is correct, we continue;
if it is wrong, we invalidate the instructions we issued following the
branch, and try again. Modern CPUs use a branch prediction table
which keeps track of recent branches and whether they were taken in order
to "guess" more accurately.
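The prediction table described above is commonly built from 2-bit saturating counters. Here is a hedged sketch of that scheme; the table size and indexing by low address bits are illustrative choices.

```python
# Sketch of a 2-bit saturating-counter branch prediction table: each entry
# remembers whether a recent branch at that address was taken. Table size
# and the modulo indexing are illustrative assumptions.

class BranchPredictor:
    def __init__(self, size=16):
        self.counters = [1] * size      # 0,1 = predict not-taken; 2,3 = taken

    def predict(self, pc):
        return self.counters[pc % len(self.counters)] >= 2

    def update(self, pc, taken):
        i = pc % len(self.counters)
        if taken:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)

# A loop branch taken many times in a row: the predictor misses once
# (counters start weakly not-taken), then guesses correctly every time.
bp = BranchPredictor()
hits = 0
for _ in range(8):
    if bp.predict(0x40):
        hits += 1
    bp.update(0x40, True)
print(hits)  # 7 hits out of 8
```

The two-bit hysteresis means a single anomalous outcome (say, the final exit of a loop) does not flip the prediction, which is why it beats a 1-bit scheme on loop branches.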

Pipelining is more complex for CISC machines, because the instructions
may take different lengths of time to execute. However, RISC-style pipelining
is now incorporated into high-performance CISC processors (such as the
Pentium) by translating most instructions into a series of RISC-like
operations.
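The translation step can be sketched for one case: a register-memory add (legal in x86, impossible in a load/store RISC) splits into a load plus a register-register add. The instruction syntax and the temporary-register name are invented for illustration.

```python
# Hedged sketch of translating a CISC instruction into RISC-like operations:
# ADD reg, [mem] becomes a LOAD into a hidden temporary followed by a
# register-register ADD. Syntax and temp-register naming are assumptions.

def to_micro_ops(instr):
    op, dst, src = instr
    uops = []
    if op == "ADD" and src.startswith("["):        # ADD reg, [mem]
        uops.append(("LOAD", "t0", src))           # t0 <- memory operand
        uops.append(("ADD", dst, dst, "t0"))       # dst <- dst + t0
    else:                                          # ADD reg, reg
        uops.append(("ADD", dst, dst, src))
    return uops

print(to_micro_ops(("ADD", "eax", "[ebx]")))
# [('LOAD', 't0', '[ebx]'), ('ADD', 'eax', 'eax', 't0')]
```

Each resulting micro-operation is uniform and single-purpose, so it flows through a RISC-style pipeline even though the source instruction did not.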

Superscalar (text, section 6.8)

Some machines now go beyond pipelining to execute more than one
instruction per clock cycle, producing an effective CPI < 1. This is
possible if we duplicate some of the functional parts of the processor
(e.g., have two ALUs or a register file with 4 read ports and 2 write ports),
and have logic to issue several instructions concurrently. However,
it requires even more complex logic to guard against hazards. Such designs
are called superscalar.
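The "more complex logic to guard against hazards" amounts to dependence checks between the instructions issued together. A minimal sketch of a dual-issue check, covering only register dependences (real issue logic checks much more):

```python
# Toy dual-issue check for a 2-way superscalar: two adjacent instructions
# may issue in the same cycle only if the second does not read (RAW) or
# rewrite (WAW) a register the first writes. A sketch, not real issue logic.

def can_dual_issue(i1, i2):
    """Each instruction is (dest_reg, [src_regs])."""
    dest1, _srcs1 = i1
    dest2, srcs2 = i2
    if dest1 in srcs2:      # RAW: second reads what first writes
        return False
    if dest1 == dest2:      # WAW: both write the same register
        return False
    return True

print(can_dual_issue(("r1", ["r2"]), ("r3", ["r4"])))  # True: independent
print(can_dual_issue(("r1", ["r2"]), ("r3", ["r1"])))  # False: RAW on r1
```

Every extra issue slot multiplies the number of such pairwise checks, which is why superscalar issue logic grows so quickly with width.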

Taking advantage of technology improvements

How has the steady progress in integrated circuit technology been translated
into improvements in processor performance?

The technology improvements lead to faster transistors and smaller transistors.
Faster transistors mean faster clock times. Smaller transistors mean that
we can put more transistors on a chip (the Pentium III is approaching 10M
transistors). What can we do with the increasing number of transistors
to improve performance?

increase the width of the data we process. The first microprocessor (Intel
4004) had 4-bit data paths; later processors had wider paths. However,
widening the data paths beyond 32- or 64-bit units yields little further benefit.

All of these techniques can be observed in the progress of x86 implementations:

increase in register size to 16 bits in 8086, 32 bits in 80386, and associated
increases in width of data paths

more operations (e.g., floating point) in CPU

instruction prefetch (instruction buffer in 80386)

execution overlap (pipelining) in 80486

superscalar (two execution pipelines in Pentium)

Architectural Approaches

At some point most of these methods have diminishing returns; it
is very hard to squeeze out additional parallelism from a serial architecture.
New architectural approaches are needed:

SIMD instructions (single-instruction multiple-data) for specialized
applications. A single instruction specifies operations, element
by element, on arrays (vectors) of data. The operations are explicitly
parallel, so no complex checking is required. Intel added MMX instructions
to the Pentium for this purpose, and added further instructions (including
floating-point vector operations) in the Pentium III.
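The element-by-element semantics can be sketched in plain Python standing in for what MMX/SSE hardware does in a single instruction on packed data:

```python
# SIMD semantics sketch: one operation applied element by element to whole
# vectors. Each lane depends only on the corresponding input elements, so
# the lanes could run simultaneously with no dependence checking --
# Python here merely models what one packed hardware instruction does.

def vadd(a, b):
    return [x + y for x, y in zip(a, b)]

print(vadd([1, 2, 3, 4], [10, 20, 30, 40]))  # [11, 22, 33, 44]
```

Because the parallelism is stated by the instruction itself, none of the superscalar hazard logic above is needed, which is what makes SIMD units cheap for the performance they add.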

EPIC (explicitly parallel instruction computing) architectures. An entirely
new architecture (instruction set) based upon large instructions containing
several operations which are to be performed in parallel. By making
the parallelism explicit, we gain two advantages over superscalar:
much less logic (and hence less time) is required to identify parallelism
at execution time, and compilers can generate code to take full advantage
of the parallelism. Intel and HP have developed an architecture (IA-64)
of this type; research on the compiler technology is being conducted
at NYU (Trimaran project).
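The "large instructions containing several operations" idea can be sketched as bundles whose slots all read their operands before any slot writes a result. The bundle format and register names are invented; real IA-64 bundles are more elaborate.

```python
# Sketch of EPIC/VLIW execution: the compiler packs independent operations
# into one wide instruction (a "bundle"), and the hardware runs each
# bundle's slots in parallel with no dependence checking at execution time.
# Bundle format and the add-only operation set are illustrative assumptions.

def execute(bundles, regs):
    for bundle in bundles:
        # Read all operands first: slots are explicitly parallel, so no
        # slot observes another slot's result from the same bundle.
        results = [(dst, regs[a] + regs[b]) for (dst, a, b) in bundle]
        for dst, val in results:
            regs[dst] = val
    return regs

regs = {"r1": 1, "r2": 2, "r3": 3, "r4": 4, "r5": 0, "r6": 0}
# One bundle: two independent adds the compiler scheduled together.
execute([[("r5", "r1", "r2"), ("r6", "r3", "r4")]], regs)
print(regs["r5"], regs["r6"])  # 3 7
```

Note where the work moved: the compiler, not the issue logic, guarantees the slots are independent, which is the source of both advantages claimed above.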

Multiprocessors (text, chapter 9). For applications which
can take advantage of high-level parallelism, multiprocessors can provide
a large improvement in performance, even though this adds to software complexity.
Many systems now use large collections of workstations as the most effective
approach for high-performance computing, and for large on-line services,
such as web servers, database servers, and transaction processing.
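How large that improvement can be is bounded by the standard speedup model (Amdahl's law): if a fraction p of a program parallelizes perfectly across n processors, the rest stays serial. The workload figures below are illustrative, not measurements.

```python
# Amdahl's-law sketch of multiprocessor speedup: a fraction p of the work
# runs in parallel on n processors, the remaining (1 - p) stays serial.
# The 95%-parallel figure below is an illustrative assumption.

def speedup(p, n):
    return 1.0 / ((1.0 - p) + p / n)

# A workload that is 95% parallel scales well across a cluster of
# workstations, but the serial 5% caps the benefit:
print(round(speedup(0.95, 8), 2))    # about 5.93 on 8 processors
print(round(speedup(0.95, 100), 2))  # about 16.81 even on 100 processors
```

This is why applications with high-level parallelism (many independent web or database requests) benefit most: their serial fraction is tiny, so large collections of workstations pay off.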