Pipelining: An Overview (Part I)

Ars CPU Editor Jon Stokes looks at pipelining in the first of a two-part …

Understanding pipelining performance

The original Pentium 4 was a radical design for a number of reasons, but perhaps its most striking and controversial feature was its extraordinarily deep pipeline. At over 20 stages, the Pentium 4's pipeline was almost twice as deep as the pipelines of the P4's competitors. More recently, Prescott, the 90nm successor to the Pentium 4, took pipelining to the next level by adding another 10 stages onto the Pentium 4's already unbelievably long pipeline.

Intel's strategy of deepening the Pentium 4's pipeline, a practice that Intel calls "hyperpipelining", has paid off in terms of performance, but it is not without its drawbacks. In previous articles on the Pentium 4 and Prescott, I've referred to the drawbacks associated with deep pipelines, and I've even tried to explain these drawbacks within the context of larger technical articles on Netburst and other topics. In the present series of articles, I want to devote some serious time to explaining pipelining, its effect on microprocessor performance, and its potential downsides. I'll take you through a basic introduction to the concept of pipelining, and then I'll explain what's required to make pipelining successful and what pitfalls face deeply pipelined designs like Prescott. By the end of the article, you should have a clear grasp on exactly how pipeline depth is related to microprocessor performance on different types of code.

Note that if you read an earlier article of mine from a few years back entitled "Understanding Pipelining and Superscalar Execution," you'll find the first part of this text (specifically, the assembly line analogy) vaguely familiar. The present article is based in part on that earlier article, but it has been reworked from the ground up to be clearer, more precise, and more up-to-date.

The lifecycle of an instruction

The basic action of any microprocessor as it moves through the instruction stream can be broken down into a series of four simple steps, which each instruction in the code stream goes through in order to be executed:

1. Fetch the next instruction from the address stored in the program counter.

2. Store that instruction in the instruction register and decode it, and increment the address in the program counter.

3. Execute the instruction currently in the instruction register. If the instruction is not a branch instruction but an arithmetic instruction, send it to the proper ALU.

   a. Read the contents of the input registers.
   b. Add the contents of the input registers.

4. Write the results of that instruction from the ALU back into the destination register.
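The four steps above can be sketched as a simple interpreter loop. This is a minimal illustration, not a hardware model: the tuple-based instruction format, the register names, and the one-operation "ISA" are all hypothetical, invented here just to make the fetch/decode/execute/write sequence concrete.

```python
# Hypothetical two-instruction program; each instruction is a tuple of
# (operation, source register 1, source register 2, destination register).
program = [
    ("add", "r1", "r2", "r0"),  # r0 = r1 + r2
    ("add", "r0", "r3", "r4"),  # r4 = r0 + r3
]
registers = {"r0": 0, "r1": 5, "r2": 7, "r3": 10, "r4": 0}
pc = 0  # program counter

while pc < len(program):
    # 1. Fetch the next instruction from the address in the program counter.
    instruction_register = program[pc]
    # 2. Decode the instruction and increment the program counter.
    op, src1, src2, dest = instruction_register
    pc += 1
    # 3. Execute: read the input registers and (for an add) sum their contents.
    if op == "add":
        result = registers[src1] + registers[src2]
    # 4. Write the result from the "ALU" back into the destination register.
    registers[dest] = result

print(registers["r4"])  # 5 + 7 = 12, then 12 + 10 = 22
```

Note that in this software sketch the four steps happen one after another for each instruction before the next fetch begins, which is exactly the non-pipelined behavior the rest of this section describes.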

In a modern processor, the four steps above get repeated over and over again until the program is finished executing. These are, in fact, the four stages in a classic RISC pipeline. (I'll define the term "pipeline" shortly; for now, just think of a pipeline as a series of stages that each instruction in the code stream must pass through when the code stream is being executed.) Here are the four stages in their abbreviated form, the form in which you'll most often see them:

Fetch

Decode

Execute

Write (or "write-back")

Each of the above stages could be said to represent one phase in the "lifecycle" of an instruction. An instruction starts out in the fetch phase, moves to the decode phase, then to the execute phase, and finally to the write phase. Each phase takes a fixed, but by no means equal, amount of time. In most of the example processors with which we'll be working in this article, all four phases take an equal amount of time; this is not usually the case in real-world processors. In any case, if a simple example processor takes exactly 1 nanosecond to complete each stage, then that processor can finish one instruction every 4 nanoseconds.
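The arithmetic behind that last claim is worth making explicit. The sketch below assumes the idealized example processor just described (four stages, 1 ns each, no overlap between instructions); the variable names are my own.

```python
# Timing for the idealized, non-pipelined example processor: each
# instruction must pass through all four stages before the next begins.
STAGE_TIME_NS = 1.0   # assumed: every stage takes exactly 1 ns
NUM_STAGES = 4        # fetch, decode, execute, write

instructions = 100

time_per_instruction = NUM_STAGES * STAGE_TIME_NS        # 4 ns
total_time_ns = instructions * time_per_instruction      # 400 ns
throughput = instructions / total_time_ns                # 0.25 instructions/ns

print(time_per_instruction, total_time_ns, throughput)
# 4.0 400.0 0.25
```

One instruction every 4 nanoseconds works out to a quarter of an instruction per nanosecond, and pipelining, as we'll see, attacks exactly this number.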