Pipelining: An Overview

Part I of the present series covered the basics of pipelining, and concluded with a preliminary discussion of how pipelining increases application performance. Specifically, pipelining increases the rate at which instructions are completed, with the result that a program's overall execution time is lowered. Put another way, pipelining allows the processor to complete more instructions in a given period of time, so that a particular batch of instructions (i.e., a program) gets processed more quickly.

The present article quantifies the speedup from pipelining more precisely, and details some of the drawbacks to very deep pipelines, like those found in Intel's Pentium 4.

Note to the reader: The two articles in this series are meant to be read back-to-back, so if you dive right into the section below and find that you're lost, it might help to go back and review the relatively brief (4 pages) first article. In fact, if you just go back and read the last page of Part I you should be all set to begin Part II.

The speedup from pipelining

In general, the speedup in completion rate versus a single-cycle
implementation that's gained from pipelining is ideally equal to the number of
pipeline stages. A four-stage pipeline yields a four-fold speedup in the
completion rate versus single-cycle, a five-stage pipeline yields a five-fold
speedup, a twelve-stage pipeline yields a twelve-fold speedup, and so on. This
speedup is possible because the more pipeline stages there are in a processor,
the more instructions the processor can work on simultaneously and the more
instructions it can complete in a given period of time. So the more finely you can slice the four phases of the instruction's lifecycle, the more of the hardware used to implement those phases you can put to work at any given moment.
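
To put rough numbers on that claim, here's a minimal Python sketch (my own illustration, not from the article; the function name and the perfectly balanced stages are assumptions) that computes the ideal completion rate for pipelines of various depths:

    def ideal_completion_rate(total_latency_ns, num_stages):
        # Assumes perfectly balanced stages: one instruction completes
        # every (total latency / number of stages) nanoseconds.
        stage_time_ns = total_latency_ns / num_stages
        return 1.0 / stage_time_ns

    baseline = ideal_completion_rate(4.0, 1)  # single-cycle: 0.25 instructions/ns
    for stages in (4, 5, 12):
        rate = ideal_completion_rate(4.0, stages)
        print(f"{stages}-stage pipeline: {rate:.2f} instructions/ns, "
              f"{rate / baseline:.0f}x the single-cycle rate")

Running this prints exactly the four-fold, five-fold, and twelve-fold speedups described above.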

To return to our assembly line analogy, let's say that each crew is made up
of six workers, and that each of the hour-long tasks that each crew performs can
be readily subdivided into two shorter, 30-minute tasks. So we can double our
factory's throughput by splitting each crew into two smaller, more specialized
crews of three workers each, and then having each smaller crew perform one of
the shorter tasks on one SUV per 30 minutes.

Stage 1: build the chassis.

Crew 1a: Fit the parts of the chassis together and spot-weld the joins.

Crew 1b: Fully weld all the parts of the chassis.

Stage 2: drop the engine in the chassis.

Crew 2a: Place the engine in the chassis and mount it in place.

Crew 2b: Connect the engine to the moving parts of the car.

Stage 3: put doors, a hood, and coverings on the chassis.

Crew 3a: Put the doors and hood on the chassis.

Crew 3b: Put the other coverings on the chassis.

Stage 4: attach the wheels.

Crew 4a: Attach the two front wheels.

Crew 4b: Attach the two rear wheels.

Stage 5: paint the SUV.

Crew 5a: Paint the sides of the SUV.

Crew 5b: Paint the top of the SUV.

After the modifications described above, the ten smaller crews in our factory
would now have a collective total of ten SUVs in progress during the course of
any given 30-minute period. Furthermore, our factory could now complete a new
SUV every 30 minutes, a ten-fold improvement over our first factory's completion
rate of one SUV every five hours. So by pipelining our assembly line even more
deeply, we've put even more of its workers to work simultaneously, thereby
increasing the number of SUVs that can be worked on simultaneously and
increasing the number of SUVs that can be completed within a given period of
time.
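
For readers who like to see the analogy's arithmetic spelled out, here's a short Python sketch (purely illustrative; the stage timings simply restate the numbers in the text):

    def suvs_per_hour(stage_minutes):
        # Steady-state completion rate: one SUV rolls off the line per
        # slowest-stage interval, so the longest stage sets the pace.
        return 60.0 / max(stage_minutes)

    unpipelined = [300]        # one crew does all five hours of work per SUV
    five_stage  = [60] * 5     # five crews, one hour-long task each
    ten_stage   = [30] * 10    # ten smaller crews, 30-minute tasks each

    for name, line in (("unpipelined", unpipelined),
                       ("five-stage", five_stage),
                       ("ten-stage", ten_stage)):
        print(f"{name} factory: {suvs_per_hour(line):.1f} SUVs/hour")
    # The ten-stage line's 2.0 SUVs/hour is ten times the unpipelined
    # factory's 0.2 SUVs/hour (one SUV every five hours).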

Deepening the pipeline of our four-stage processor works on similar principles and has a similar effect on completion rates. Just as the five stages in our SUV assembly line could be broken down further into a longer sequence of more specialized stages, the execution process that each instruction goes through can be broken down into a series of many more than just four discrete stages. By breaking the processor's four-stage pipeline down into a longer series of shorter, more specialized stages, we can put even more of the processor's specialized hardware to work simultaneously on more instructions and thereby increase the number of instructions that the pipeline completes each nanosecond.

We first moved from a single-cycle processor to a pipelined processor by taking the four-nanosecond period that each instruction spent traveling through the processor and slicing it into four discrete pipeline stages, each one nanosecond in length. These four discrete pipeline stages corresponded to the four phases of an instruction's lifecycle. A processor's pipeline stages aren't always going to correspond exactly to those four phases, though. Some processors have a five-stage pipeline, some have a six-stage pipeline, and many have pipelines deeper than ten or twenty stages. In such cases, the CPU designer must slice up the instruction's lifecycle into the desired number of stages in such a way that all the stages are equal in length.

Now let's take that four-nanosecond execution process and slice it into eight discrete stages. Because all eight pipeline stages must be of exactly the same duration for pipelining to work, each stage must be 4 ns / 8 = 0.5 ns in length. Since we're presently working with an idealized example, let's pretend that splitting up the processor's four-phase lifecycle into eight equally long (0.5 ns) pipeline stages is a trivial matter, and that the results look like what you see in figures PIPELINING.6.1 and PIPELINING.6.2. (In reality, this task is not trivial and involves a number of tradeoffs. As a concession to that reality, I've chosen to use the eight stages of a real-world pipeline, the MIPS pipeline, in the diagrams below, instead of just splitting each of the four traditional stages in two.)
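
As a quick sanity check on that arithmetic, a few lines of Python (again my own sketch, not anything taken from the figures) relating pipeline depth, stage length, and the clock that drives the pipeline:

    total_latency_ns = 4.0   # the instruction's full lifecycle, as before
    num_stages = 8

    stage_length_ns = total_latency_ns / num_stages  # 4 ns / 8 = 0.5 ns
    clock_period_ns = stage_length_ns                # one stage per clock cycle
    clock_rate_ghz  = 1.0 / clock_period_ns          # a 0.5 ns period is 2 GHz

    print(stage_length_ns, clock_period_ns, clock_rate_ghz)  # 0.5 0.5 2.0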

Because pipelining requires that each pipeline stage take exactly one clock cycle to complete, our clock cycle can now be shortened to 0.5 ns to match the length of the eight pipeline stages. Take a look at figures PIPELINING.6.1 and PIPELINING.6.2 below to see the impact that this increased number of pipeline stages has on the number of instructions completed per unit time.

Figure PIPELINING.6.1: An eight-stage pipeline

Figure PIPELINING.6.2: An eight-stage pipeline

Our single-cycle processor could complete one instruction every four nanoseconds, for a completion rate of 0.25 instructions/ns, and our four-stage pipelined processor could complete one instruction every nanosecond, for a completion rate of 1 instruction/ns. The eight-stage processor depicted above improves on both of these by completing one instruction every 0.5 ns, for a completion rate of 2 instructions/ns. Note that because each instruction still takes 4 ns to execute, the first four nanoseconds of the eight-stage processor's execution are still dedicated to filling up the pipeline. But once the pipeline is full, the processor can begin completing instructions twice as fast as the four-stage processor and eight times as fast as the single-cycle processor.

This eight-fold increase in completion rate versus a single-cycle design
means that our eight-stage processor can execute programs much faster than
either a single-cycle or a four-stage processor. But does the eight-fold
increase in completion rate translate into an eight-fold decrease in program
execution time? Not exactly.