
The Pentium 4 and the G4e: an Architectural Comparison: Part I

When the Pentium 4 hit the market in November of 2000, it was the first major …

Design philosophies

While some processors still have the classic, four-stage pipeline described above, most modern CPUs are more complicated. The G4e breaks the classic, four-stage pipeline into seven stages in order to allow it to run at increased clock speeds on the same manufacturing process. Less work is done in each of the G4e's shorter stages, but each stage takes less time to complete. Since each stage always lasts exactly one clock cycle, shorter pipeline stages mean shorter clock cycles and higher clock frequencies. The P4, with a whopping 20 stages in its basic pipeline, takes this tactic to the extreme. Take a look at the following chart from Intel, which shows the relative clock frequencies of Intel's last six x86 designs. (This picture assumes the same manufacturing process for all six cores.) The vertical axis shows the relative clock frequency, and the horizontal axis lists the processors side by side.

Figure 2.1: I know it says "Figure 2," but ignore that.

Intel's explanation of this diagram and the history it illustrates is enlightening, as it shows where their design priorities were.

Figure 2 shows that the 286, Intel386™, Intel486™ and Pentium® (P5) processors had similar pipeline depths - they would run at similar clock rates if they were all implemented on the same silicon process technology. They all have a similar number of gates of logic per clock cycle. The P6 microarchitecture lengthened the processor pipelines, allowing fewer gates of logic per pipeline stage, which delivered significantly higher frequency and performance. The P6 microarchitecture approximately doubled the number of pipeline stages compared to the earlier processors and was able to achieve about a 1.5 times higher frequency on the same process technology. The NetBurst microarchitecture was designed to have an even deeper pipeline (about two times the P6 microarchitecture) with even fewer gates of logic per clock cycle to allow an industry-leading clock rate. (The Microarchitecture of the Pentium 4 Processor, p. 3)
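To make the relationship between gates of logic per stage and clock rate concrete, here's a minimal sketch. It assumes a toy model in which a fixed amount of logic delay is split evenly across the stages and each stage adds a fixed latch overhead. The function name clock_mhz and both delay values are hypothetical, chosen only for illustration; they are not measurements of any real processor.

```python
def clock_mhz(stages, total_logic_ns=10.0, latch_overhead_ns=0.25):
    """Estimate clock frequency for a pipeline with the given number of stages.

    Both default delays are made-up illustrative values, not measurements.
    """
    # Each stage does an equal share of the logic, plus a fixed latch cost.
    cycle_ns = total_logic_ns / stages + latch_overhead_ns
    return 1000.0 / cycle_ns  # a 1 ns cycle corresponds to 1000 MHz


if __name__ == "__main__":
    # Stage counts taken from the article: classic, G4e, and P4 pipelines.
    for stages in (4, 7, 20):
        print(f"{stages:2d} stages: about {clock_mhz(stages):.0f} MHz")
```

In this toy model, doubling the number of stages yields less than double the frequency, because the per-stage overhead doesn't shrink along with the logic. That's at least consistent in spirit with Intel's remark that the P6 doubled the stage count for about a 1.5 times higher frequency.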

As we'll see, the Pentium 4 makes quite a few sacrifices for clock speed, and although Intel tries to spin it differently, an extraordinarily deep pipeline is one of those sacrifices. (For even more on the relationship between clock speed and pipeline depth, see my first K7 article.)

Some might be tempted to attribute the vast differences in pipeline depth between the P4 and the G4e to the fact that modern x86 processors like the Athlon, PIII, and P4 need to break down large, complex x86 instructions into smaller, more easily scheduled operations. While such instruction translation does add pipeline stages to the P4, those stages aren't part of its basic, 20-stage pipeline. (Yes, the P4 still needs to translate x86 instructions into µops, but as we'll see later on, the P4's trace cache takes the translation and decode steps out of the P4's "critical execution path.")

The drastic difference in pipeline depth between the G4e and the P4 actually reflects some very important differences in the design philosophies and goals of the two processors. Both processors want to run as many instructions as quickly as possible, but they attack this problem in two different ways. The G4e's approach can be summarized as "wide and shallow." Its designers added more functional units to its back end for executing instructions, and its front end tries to fill up all these units by issuing instructions to each functional unit in parallel. In order to extract the maximum amount of instruction-level parallelism (ILP) from the (linear) instruction stream, the G4e's front end first moves a small batch of instructions onto the chip. Then, its out-of-order (OOO) execution logic examines them for interdependencies, spreads them out to execute in parallel, and pushes them through the execution engine's nine functional units. Each of the G4e's functional units has a fairly short pipeline, so the instructions take very few cycles to move through and finish executing. Finally, in the last pipeline stages, the instructions are put back in their original program order before the results are written back to memory.

At any given moment the G4e can have up to 16 instructions spread throughout the chip in various stages of execution simultaneously. As we'll see when we look at the P4, this instruction window is quite small. The end result is that the G4e focuses on getting a small number of instructions onto the chip at once, spreading them out widely to execute in parallel, and then getting them off the chip in as few cycles as possible.
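The idea of examining a small batch of instructions for interdependencies and then spreading them out to execute in parallel can be sketched in a few lines. This is a toy illustration, not a model of the G4e's actual issue logic: the schedule function and the (dest, sources) tuple format are hypothetical, every instruction is assumed to take one cycle, there's no limit on functional units, and only true (read-after-write) dependencies are tracked.

```python
def schedule(window):
    """Return the earliest issue cycle for each instruction in the window.

    window: list of (dest, sources) tuples in program order.
    Assumptions (illustrative only): one-cycle latency, unlimited
    functional units, and only read-after-write dependencies matter.
    """
    ready_cycle = {}  # register name -> cycle in which its value is available
    cycles = []
    for dest, sources in window:
        # An instruction can issue once all of its source values are ready.
        issue = max((ready_cycle.get(src, 0) for src in sources), default=0)
        cycles.append(issue)
        ready_cycle[dest] = issue + 1
    return cycles


# Five instructions in program order; r3 needs r1 and r2, r5 needs r3 and r4.
program = [
    ("r1", []),            # load a
    ("r2", []),            # load b
    ("r3", ["r1", "r2"]),  # r3 = r1 + r2
    ("r4", []),            # load c
    ("r5", ["r3", "r4"]),  # r5 = r3 * r4
]

if __name__ == "__main__":
    print(schedule(program))
```

Here the three loads have no dependencies and can all go in the first cycle, so the five instructions finish in three cycles rather than the five they'd need if they were issued strictly one at a time.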

Figure 2.2: The G4e's approach

The P4 takes a "narrow and deep" approach to moving through the instruction stream. It has fewer functional units, but each of these units has a deeper, faster pipeline. The fact that each functional unit has a very deep pipeline means that each unit has a large number of available execution slots and can thus work on quite a few instructions at once. So instead of having, say, three short FP units operating slowly in parallel, the P4 has one long FP unit that can hold and rapidly work on more instructions in different stages of execution at once.
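Here's a rough sketch of that trade-off, using the standard textbook estimate that a pipelined unit with a given number of stages finishes n independent instructions in (stages + n - 1) cycles. The unit counts, stage counts, and cycle times below are invented for illustration, and the instructions are assumed to be completely independent of each other, which real code rarely is.

```python
import math


def total_time_ns(num_instructions, units, stages, cycle_ns):
    """Time to push independent instructions through identical pipelined units."""
    # Split the work evenly; the busiest unit determines the finish time.
    per_unit = math.ceil(num_instructions / units)
    # The first result appears after `stages` cycles, then one per cycle.
    return (stages + per_unit - 1) * cycle_ns


def slots_in_flight(units, stages):
    """Maximum number of instructions that can be in execution at once."""
    return units * stages


# Hypothetical configurations, not real G4e or P4 parameters.
wide_shallow = dict(units=3, stages=2, cycle_ns=1.5)
narrow_deep = dict(units=1, stages=8, cycle_ns=0.5)

if __name__ == "__main__":
    for name, cfg in (("wide and shallow", wide_shallow),
                      ("narrow and deep", narrow_deep)):
        t = total_time_ns(300, **cfg)
        slots = slots_in_flight(cfg["units"], cfg["stages"])
        print(f"{name}: {slots} slots, {t:.1f} ns for 300 instructions")
```

With these made-up numbers the two configurations finish 300 instructions in nearly the same time, even though one uses three slow units and the other a single fast one. The point is only that the two designs can aim at similar throughput by different routes, not that either real chip behaves this way.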

It's important to note that in order to keep the P4's fast, deeply pipelined functional units full, the machine's front end needs deep buffers that can hold and schedule an enormous number of instructions. The P4 can have up to 126 instructions in various stages of execution simultaneously. This way, the processor can have many more instructions on-chip for the out-of-order execution logic to examine for dependencies and then rearrange to be rapidly fired to the execution units.
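To see why a bigger instruction window gives the out-of-order logic more to work with, consider the following sketch. It borrows only the two window sizes quoted in this article (16 and 126); everything else is an assumption made for illustration: the cycles_to_drain and chain helpers are hypothetical, every instruction takes one cycle, there are unlimited functional units, each register is written only once, and the workload is deliberately contrived.

```python
def chain(name, length):
    """Build a chain in which each instruction depends on the previous one."""
    return [(f"{name}{i}", [f"{name}{i - 1}"] if i else [])
            for i in range(length)]


def cycles_to_drain(stream, window_size):
    """Count cycles needed to issue every instruction in the stream.

    Each cycle, only the oldest `window_size` unissued instructions are
    examined, and every one whose sources are ready is issued.
    Assumes the stream is in valid program order and each dest is unique.
    """
    produced = {dest for dest, _ in stream}  # registers written in the stream
    done = set()                             # registers whose values are ready
    pending = list(stream)
    cycles = 0
    while pending:
        window = pending[:window_size]
        ready = [ins for ins in window
                 if all(src in done or src not in produced for src in ins[1])]
        # Results become visible only in the following cycle.
        for ins in ready:
            pending.remove(ins)
            done.add(ins[0])
        cycles += 1
    return cycles


# Two independent 64-instruction chains, one after the other in program order.
stream = chain("a", 64) + chain("b", 64)

if __name__ == "__main__":
    for size in (16, 126):
        print(f"window of {size}: {cycles_to_drain(stream, size)} cycles")
```

With the large window, the scheduler can see both chains at once and overlap them, so the whole stream drains in 64 cycles. With the small window, the second chain stays out of sight until most of the first has issued, and the stream takes noticeably longer. A workload without such distant parallelism wouldn't show this difference.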

Figure 2.3: The P4's approach

It might help you to think about these two approaches in terms of a McDonald's analogy. At McDonald's, you can either walk in or drive through. If you walk in, there are five or six short lines to choose from, and in each one a single server processes your order in one long step. If you choose to drive through, you'll wind up in a single, long line, but that line is geared to move faster because more servers process your order in more, quicker steps: a) you pull up to the speaker and tell them what you want; and b) you drive around and pick up your order. And since the drive-through approach splits the ordering process up into multiple, shorter stages, more customers can be waited on in a single line because there are more stages of the ordering process for different customers to find themselves in. So the G4e takes the multi-line, walk-in approach, while the P4 takes the single-line, drive-through approach.