Into the Core: Intel’s next-generation microarchitecture

Earlier this year at its Developer Forum, Intel unveiled Core, the next- …

General approach and design philosophy

In a time when an increasing number of processors are moving away from out-of-order execution (OOOE, or sometimes just OOO) toward in-order, more VLIW-like designs that rely heavily on multithreading and compiler/coder smarts for their performance, Core is as full-throated an affirmation of the ongoing importance of OOOE as you can get. Core represents the current apex of OOOE design, where as much code and data stream optimization as possible is carried out in silicon.

Core is bigger, wider, and more massively resourced in terms of both execution units and scheduling hardware than just about any mass-market design that has come before it. "More of everything" seems to have been the motto of Core's design team, because in every phase of Core's pipeline there's more of just about anything you could think of: more decoding logic, more reorder buffer space, more reservation station entries, more issue ports, more execution hardware, more memory buffer space, and so on. In short, Core's designers took everything that has already been proven to work and added more of it, along with a few new tricks and tweaks that extend some tried-and-true ideas into different areas.

Core's microarchitecture

Wider doesn't automatically mean better, though. There are real-world limits on the number of instructions that can be executed in parallel, so the wider the machine, the more execution slots can go unused on each cycle because of limits to instruction-level parallelism (ILP). Also, memory latency can starve a wide machine of code and data, leaving execution resources idle.
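To make the ILP point concrete, here is a minimal toy model, not a description of any real scheduler. It assumes every instruction takes one cycle and that the machine can issue any `width` ready instructions per cycle; the function name `schedule` and the dependency format are made up for this illustration.

```python
def schedule(deps, width):
    """Greedy scheduling on an idealized machine (toy model).

    deps[i] is the set of instruction indices that instruction i
    depends on. Every instruction is assumed to take one cycle, and
    a result becomes usable in the cycle after it is produced.
    Returns (cycles, unused_slots).
    """
    done = set()
    remaining = set(range(len(deps)))
    cycles = 0
    while remaining:
        # An instruction is ready once all of its inputs are complete.
        ready = sorted(i for i in remaining if deps[i] <= done)
        if not ready:
            raise ValueError("dependency cycle: nothing can issue")
        issued = ready[:width]  # at most `width` instructions per cycle
        done.update(issued)
        remaining.difference_update(issued)
        cycles += 1
    unused = cycles * width - len(deps)
    return cycles, unused


# Eight instructions where each one needs the previous one's result.
chain = [set()] + [{i - 1} for i in range(1, 8)]
# Eight instructions with no dependencies at all.
independent = [set() for _ in range(8)]

print(schedule(chain, 4))        # (8, 24): a 4-wide machine wastes 24 slots
print(schedule(independent, 4))  # (2, 0): every slot is filled
```

In this model, the dependent chain takes eight cycles no matter how wide the machine is, so extra width only adds empty slots. Real code sits somewhere between these two extremes.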

Core has a number of features that are there solely to address ILP and memory latency issues, and to ensure that the processor is able to keep its execution core full. In the front end, macro-fusion, micro-ops fusion, and a robust branch prediction unit (BPU) work together to keep code moving into the execution core; and on the back end, a greatly enlarged instruction window ensures that more instructions can reach the execution units on each cycle. Intel has also fixed an important SSE bottleneck that existed in previous designs, thereby massively improving Core's vector performance over its predecessors.

In the remainder of this article, I'll talk about all of these improvements and many more. I'll attempt to place Core's features in the context of Intel's overall focus on balancing performance, scalability, and power consumption.

The P6 lineage from the Pentium Pro to the Pentium M

One of the most distinctive features of the P6 line is its issue port structure. (Intel calls these "dispatch ports," but for the sake of consistency with the rest of my work I'll be using the terms "dispatch" and "issue" differently from Intel.) Core uses a similar structure in its execution core, although there are some major differences between Core's combination of issue ports and reservation station (RS) and that of the P6.

To get a sense of the historical development of the issue port scheme, let's take a look at the execution core of the original Pentium Pro.

The Pentium Pro's execution core

As you can see from the above figure, ports 0 and 1 host the arithmetic hardware, while ports 2, 3, and 4 host the memory access hardware. The P6 core's reservation station is capable of issuing up to five instructions per cycle to the execution units—one instruction per issue port per cycle.
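The one-instruction-per-port rule can be sketched in a few lines. This is a simplified illustration rather than a model of the real reservation station: the instruction categories, the `PORTS` table, and the oldest-first selection policy are all assumptions made for the example. In particular, the split of ports 2, 3, and 4 into load, store-address, and store-data duties is a simplification, and the real hardware's selection logic is more involved.

```python
# Hypothetical mapping from instruction category to the ports that can
# accept it. Ports 0 and 1 take arithmetic; ports 2-4 take memory ops.
PORTS = {
    "arith": (0, 1),
    "load": (2,),
    "store_addr": (3,),
    "store_data": (4,),
}


def issue_one_cycle(waiting):
    """Issue at most one instruction per port, oldest first.

    `waiting` is a list of instruction categories in program order.
    Returns (issued, still_waiting), where `issued` maps a port
    number to the category of the instruction sent to it.
    """
    issued = {}
    still_waiting = []
    for kind in waiting:
        # Find a port that accepts this category and is still free.
        free = [port for port in PORTS[kind] if port not in issued]
        if free:
            issued[free[0]] = kind
        else:
            still_waiting.append(kind)
    return issued, still_waiting


issued, left = issue_one_cycle(
    ["arith", "arith", "arith", "load", "store_addr", "store_data"]
)
print(len(issued))  # 5: one instruction on each of the five ports
print(left)         # ['arith']: the third arithmetic op waits a cycle
```

Even with five ports available, the third arithmetic instruction has to wait, because only ports 0 and 1 can accept it.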

As the P6 core developed through the Pentium II and Pentium III, Intel began adding execution units to handle integer and floating-point vector arithmetic. This new vector execution hardware was added on ports 0 and 1, with the result that these two ports got a bit overcrowded. By the time the PIII rolled around, the P6 execution core looked as follows:

The Pentium III's execution core

The PIII's core is fairly wide, but because its vector execution resources are split between the two main issue ports, its vector performance can be bottlenecked by a lack of issue bandwidth. All of the code stream's vector and scalar arithmetic instructions contend with each other for those two ports. Combined with the two-cycle SSE limitation that I'll outline in a moment, this means that the PIII's vector performance could never really reach the heights of a cleaner design like Core.

Information on how the Pentium M (a.k.a. Banias) distributes labor across its issue ports is hard to come by, but the arrangement appears to be substantially the same as on the Pentium III.