Prescott summary table

The 90nm Prescott introduces some significant changes to the P4's NetBurst architecture in an effort to squeeze more performance out of it and to make it a bit more efficient. As with the Pentium M, the details are scarce, but I'll present the main points that are publicly known. Most of these enhancements are along the same lines as the enhancements we have talked about earlier, i.e., lengthened buffers and queues, improved branch prediction, etc.

Let's first run down the list of miscellaneous improvements and tweaks that I do not have too much to say about, before moving on to more detailed stuff.

Pipeline, execution core, SIMD, other Prescott improvements

As was publicized at the time the processor was unveiled, Prescott's basic pipeline is a little longer than that of the P4; we're not sure how long, but rumors are that two extra stages have been added. I won't speculate on what those stages are, but Intel claims that they were added for ease of clock speed scaling. Yes, in case you were wondering, this does amount to beating a dead horse.

Intel has beefed up Prescott's core a bit with some changes to the integer ALUs. Specifically, they added a shifter/rotator to one of the fast ALUs, which means that more common forms of shift and rotate can now be executed twice as fast. They also added a dedicated integer multiplier, which I'm assuming is an addition to the complex integer ALU. Previous versions of the P4 used the floating-point hardware to do integer multiplies, which means extra latency because operands have to be moved over to the FPU. The addition of the dedicated multiplier to the complex integer ALU does away with this move, improving multiplication latency and performance.

Prescott's ROB is still 126 entries deep, but the system of queues and buffers that makes up its instruction window has been enlarged. Specifically, the floating-point/SIMD schedulers have been increased in size, as have the sizes of the queues that feed all of the schedulers. The former changes should yield some improvement on floating-point and SIMD code, and the latter may help slightly with other types of code as well.

One way of keeping a deeply-pipelined design full of code and data is to improve the size and hit rate of the processor's caches, and Intel does this by upping the L1 data cache size from 8K (4-way associative) to 16K (8-way associative). Also, the P4's unified L2 cache (256K on the low end versions, 512K on the high end) gets a boost in Prescott to 1MB.
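To make the L1 numbers a bit more concrete, here's a quick back-of-the-envelope sketch of how cache size and associativity relate to the number of sets. The 64-byte line size is my assumption for illustration (the text above doesn't give it), and `cache_sets` is just a helper name I made up.

```python
# Rough arithmetic: sets = cache size / (ways * line size).
# LINE_BYTES is an assumed value for illustration, not a figure from the article.
LINE_BYTES = 64


def cache_sets(size_bytes, ways, line_bytes=LINE_BYTES):
    """Return the number of sets in a set-associative cache."""
    return size_bytes // (ways * line_bytes)


p4_l1_sets = cache_sets(8 * 1024, ways=4)         # 8K, 4-way
prescott_l1_sets = cache_sets(16 * 1024, ways=8)  # 16K, 8-way
```

Under that line-size assumption both configurations come out to the same number of sets, which would mean the extra capacity shows up entirely as extra ways per set rather than as more sets.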

Prescott also brings with it the latest extension to the x86 ISA: SSE3. SSE3 consists of 13 new instructions designed to speed multimedia code. I won't describe all of those here, but the list includes instructions designed to speed floating-point-to-integer conversion, complex arithmetic, video encoding, graphics, and thread synchronization.
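To give a flavor of the "complex arithmetic" item, here's a small sketch that models the lane behavior of three SSE3 instructions (MOVSLDUP, MOVSHDUP, and ADDSUBPS) in plain Python and uses them to multiply packed complex numbers. This is only a model of what the instructions do to four-element vectors, not real SIMD code; `mulps` stands in for the older SSE multiply, `swap_pairs` stands in for a shuffle, and the function names are my own.

```python
def movsldup(v):
    """Duplicate the even-indexed (low) element of each pair."""
    return [v[0], v[0], v[2], v[2]]


def movshdup(v):
    """Duplicate the odd-indexed (high) element of each pair."""
    return [v[1], v[1], v[3], v[3]]


def addsubps(a, b):
    """Subtract in even lanes, add in odd lanes."""
    return [a[0] - b[0], a[1] + b[1], a[2] - b[2], a[3] + b[3]]


def mulps(a, b):
    """Lane-by-lane multiply (plain SSE, not new in SSE3)."""
    return [x * y for x, y in zip(a, b)]


def swap_pairs(v):
    """Swap the two elements of each pair; stands in for a shuffle."""
    return [v[1], v[0], v[3], v[2]]


def complex_mul_packed(x, y):
    """Multiply two complex numbers at once.

    x and y are laid out as [re0, im0, re1, im1].
    For (a + bi)(c + di) the result is (ac - bd) + (ad + bc)i.
    """
    t0 = mulps(movsldup(x), y)              # [a*c, a*d, ...]
    t1 = mulps(movshdup(x), swap_pairs(y))  # [b*d, b*c, ...]
    return addsubps(t0, t1)                 # [ac - bd, ad + bc, ...]
```

The point of the sketch is that the mixed subtract/add in ADDSUBPS matches the sign pattern of a complex multiply, so the real and imaginary parts fall out of one instruction instead of needing separate sign-flipping steps.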

Finally, there are a few miscellaneous improvements to the internals of the processor that will give better performance on hyperthreaded code. Prescott is a little bit smarter about how it shares microarchitectural resources between concurrently-running threads, and it also includes software instructions which coders can use to help with thread management.

Branch prediction on the Prescott

I said earlier that branch prediction is a great place to spend transistors because of its performance-enhancing potential on all types of code, and with that in mind Prescott has two new tricks up its sleeve for predicting branches.

The first of these two tricks is an improved static branch predictor. In the previous section on the Pentium 4 I briefly described static branch prediction, and I promised that I'd go into a bit more detail this time. Here's an explanation from a previous article on the P4 that sums up static branch prediction well enough:

There are two main types of branch prediction: static prediction and dynamic prediction. Static branch prediction is simple, and relies on the assumption that the majority of backwards-pointing branches occur in the context of repetitive loops, where a branch instruction is used to determine whether or not to repeat the loop again. Most of the time, a loop's conditional will evaluate to "taken," thereby instructing the machine to repeat the loop's code one more time. This being the case, static branch prediction merely assumes that all backwards branches are "taken." For a branch that points forward to a block of code that comes later in the program, the static predictor assumes that the branch is "not taken."

By studying loop behavior in actual code, Intel has discovered something about loops that has allowed it to improve the plain old static branch predictor a bit. Here's Intel's own description of their new method:

We can try to ascertain the difference between loop-ending branches and other backwards branches by looking at the distance of the branch and the condition on which the branch is dependent. Our studies showed that a threshold exists for the distance between a backwards branch and its target; if the distance of the branch is larger than this threshold, the branch is unlikely to be a loop-ending branch. If the BTB has no prediction for a backwards branch, the Intel Pentium 4 processor will then predict taken for the branch only if the branch distance is less than this threshold.

So in situations where the static predictor is used, Prescott's static predictor compares the distance of the branch to a hardwired threshold number, and if that distance is less than the number it assumes that the branch is a loop-ending branch and marks it taken.
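The difference between the two static schemes can be sketched in a few lines. This is a toy model under stated assumptions: the threshold value below is made up (the real hardwired number isn't given), distance is measured in bytes between the branch and its target, and I've ignored the part of Intel's description that looks at the branch condition.

```python
# Hypothetical threshold in bytes; the actual hardwired value is not public here.
LOOP_DISTANCE_THRESHOLD = 64


def classic_static_predict(branch_addr, target_addr):
    """Plain static prediction: backwards taken, forwards not taken."""
    return target_addr < branch_addr


def prescott_style_static_predict(branch_addr, target_addr,
                                  threshold=LOOP_DISTANCE_THRESHOLD):
    """Predict taken only for backwards branches that stay within the threshold."""
    if target_addr >= branch_addr:
        return False  # forward branch: predict not taken
    distance = branch_addr - target_addr
    return distance < threshold  # short backwards hop: likely a loop-ending branch


# A short backwards branch looks like a loop, so both schemes say "taken".
short_loop = prescott_style_static_predict(0x1040, 0x1020)
# A long backwards branch is predicted taken by the classic scheme only.
long_jump_classic = classic_static_predict(0x5000, 0x1000)
long_jump_new = prescott_style_static_predict(0x5000, 0x1000)
```

Remember that this only matters when the BTB has no prediction for the branch; once the dynamic predictor has history to work with, the static guess isn't used.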

Intel also improved Prescott's dynamic branch predictor by taking a page from the Pentium M's playbook and adding an indirect branch predictor. The literature doesn't really say how it works, but it does credit the Pentium M folks with the motivation for the addition, so it's possible that it works similarly to what I've elsewhere described for the Pentium M.

Trace cache improvements

Prescott's trace cache was improved so that it now holds more types of uops than the P4's trace cache. Although I didn't mention the fact previously, the P4's trace cache doesn't hold the long uop sequences that correspond to really complex, multicycle legacy x86 instructions. When the P4's decoder comes across an x86 instruction that will decompose into a whole string of uops, it inserts into the trace cache a pointer to a place in the Microcode ROM that holds the proper uop sequence. When the time comes to execute this string of uops, the pointer is fetched from the trace cache and the front end is redirected to look in the microcode ROM for the proper instruction sequence.

This little jump into the microcode ROM takes time, so for a few of the less lengthy instructions Intel has decided that it would be better if Prescott decoded them the old-fashioned way and stored them in the trace cache. This saves time by allowing the instructions to be fetched more quickly, since they now come directly from the trace cache. The downside is that it pollutes the trace cache with these longer strings of uops that were previously stored in ROM, thus reducing the cache's effective size.
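Here's a toy model of that tradeoff. Everything in it is hypothetical: the `inline_limit` cutoff, the ROM address, and the uop names are invented for illustration, since the actual cutoffs and encodings aren't described above. The idea is simply that a trace cache entry either holds the uops themselves or holds a pointer into the microcode ROM, and raising the cutoff moves some sequences from the second category into the first.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class InlineUops:
    """Trace cache entry that stores the decoded uops directly."""
    uops: List[str]


@dataclass
class RomPointer:
    """Trace cache entry that only points at a sequence in the microcode ROM."""
    rom_addr: int


def build_entry(uops, rom_addr, inline_limit):
    """Store the uops inline if the sequence is short enough, else store a pointer."""
    if len(uops) <= inline_limit:
        return InlineUops(list(uops))
    return RomPointer(rom_addr)


def fetch(entry, rom):
    """Return (uops, redirected), where redirected means a detour into the ROM."""
    if isinstance(entry, InlineUops):
        return entry.uops, False
    return rom[entry.rom_addr], True


# Hypothetical six-uop instruction stored at a made-up ROM address.
rom = {0x10: ["u1", "u2", "u3", "u4", "u5", "u6"]}
older_entry = build_entry(rom[0x10], 0x10, inline_limit=4)  # stays in ROM
newer_entry = build_entry(rom[0x10], 0x10, inline_limit=8)  # stored inline
```

With the lower cutoff the fetch takes the detour through the ROM; with the higher one it doesn't, but the entry now occupies six uops' worth of trace cache space instead of a single pointer.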

Prescott conclusions

I won't say much about Prescott here, because I've already said everything I want to say about it in my most recent Prescott article. In short, the initial benchmarks for Prescott are very disappointing, and its power requirements are through the roof. Prescott's days are numbered, and it represents Intel's last major version of the Pentium 4's NetBurst architecture.

General conclusions

So we've now followed the Pentium name from its beginnings in the original Pentium to its current, schizophrenic role as the brand name of two radically different architectures which embody two quite different approaches to personal computing performance. On the one side is Prescott, the final incarnation of an ambitious, commercially-successful, but ultimately flawed architecture that reflects both the heady days of the gigahertz race and a supreme confidence in the onward march of Moore's Curves. On the other side is the Pentium M, which with its roots in the venerable P6 and its status as Intel's Next Big Thing assured, is the once and future king of Intel's consumer product line.

Back in 1993 I would have been the last person to think that the peculiar "Pentium" name would endure for over a decade after its introduction. But endure it has, just like the P6 core that made it a household name. And in fact, the not-so-distant future may very well see the day that a multicore Pentium M derivative sports x86-64 support and supplants Itanium as the focus of Intel's 64-bit server efforts, thus bringing the Pentium name and the P6 core back into every segment of today's highly segmented computer market.