AMD shakes up x86 CPU world with two new designs

One of the main messages from AMD's financial analyst day was, "x86 …

SUNNYVALE — Companies rarely make big news at financial analyst day events, but AMD bucked that trend Wednesday by unveiling details of its newly revamped roadmap, its two brand-new processor architectures, and its plans for CPU/GPU integration. (AMD and Intel also made some other news together). Rather than attempt a comprehensive overview of what was announced, I'll walk you through the two new processor architectures, leaving the CPU/GPU "Fusion" revelations and roadmap specifics for a second article.

Bobcat: AMD's new mobile architecture

The slide below shows Bobcat, the codename for AMD's new-from-the-ground-up microarchitecture that's aimed at portables and SoCs. Bobcat will compete with Atom and with VIA's Nano, though it has much more in common with the latter than the former.

AMD's Bobcat

Bobcat is an out-of-order processor that can dispatch up to two instructions per cycle from its front-end to the integer and/or floating-point schedulers. Attached to the integer scheduler are four pipelines: two integer pipes and two memory pipes (one load, one store). There's no word yet on the depth of the integer pipeline, but I would be shocked if it were less than 12 stages or more than 20.
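If you want a rough mental model of what "two-wide dispatch" means here, the toy sketch below steers a small instruction stream from a front-end into separate integer and FP queues, two per cycle. The pipe names and instruction classes are my own illustrative assumptions, not anything AMD has disclosed:

```python
# Toy model of a two-wide dispatch stage (illustrative only; pipe names
# and instruction classes are assumptions, not AMD's).

INT_PIPES = {"alu0", "alu1", "load", "store"}  # two integer, two memory pipes
FP_PIPES = {"fp0", "fp1"}                      # two FP/SIMD pipes

def dispatch(instruction_stream, width=2):
    """Each cycle, send up to `width` instructions from the front-end
    to the integer or floating-point scheduler based on their type."""
    int_queue, fp_queue = [], []
    cycles = 0
    stream = list(instruction_stream)
    while stream:
        for insn in stream[:width]:  # at most `width` instructions per cycle
            (fp_queue if insn["unit"] == "fp" else int_queue).append(insn)
        stream = stream[width:]
        cycles += 1
    return int_queue, fp_queue, cycles

insns = [{"op": "add", "unit": "int"}, {"op": "mulps", "unit": "fp"},
         {"op": "load", "unit": "int"}, {"op": "addps", "unit": "fp"},
         {"op": "store", "unit": "int"}]
ints, fps, cycles = dispatch(insns)
print(len(ints), len(fps), cycles)  # 3 integer ops, 2 FP ops, 3 cycles
```

The point of the model is simply that dispatch width, not the number of execution pipes, caps how fast work can enter the schedulers.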

Attached to the floating-point scheduler are two floating-point/SIMD pipelines. Not much was said about either of these pipes, but if I had to guess, I'd bet that both support common scalar, fully pipelined double-precision floating-point operations, with one pipe handling FDIV/FMUL and SIMD permute instructions while the other handles scalar SIMD operations. AMD did reveal that the unit supports SSE flavors 1 through 3.

AMD notes that this core is a synthesizable IP block that's designed to be mixed and matched with other blocks on an SoC. What this means in plain English is that AMD stores the CPU block in a high-level hardware description language, which an automated toolset then compiles down into logic gates and lays out on the chip. Doing things this way, versus the traditional method of hand-customizing much of the lower-level design, trades off some performance and power efficiency for flexibility and time-to-market.

As you can see from the slide, AMD is targeting the sub-1W power envelope with Bobcat, though at launch it will probably hit this target only for the very lowest clockspeed parts; the higher-clocked parts will certainly be above 1W, and possibly up to 2 or 2.5W.

By the time this core launches in 2011 on a 32nm SOI process, Intel will have had Atom on 32nm for a while and will be eyeing 22nm. So while an out-of-order design like Bobcat will certainly smoke Atom on a clock-for-clock basis, it's hard to predict where Intel will have taken Atom's absolute performance in that timeframe.

Bulldozer: AMD's server architecture

AMD's newly announced high-end processor architecture is a significant departure from the architecture that powers the company's existing processor line. It represents the implementation of an idea that quite a few folks have tossed around, but no one has really made work yet. Take a look at the Bulldozer "module" depicted in the slide below:

AMD's Bulldozer

It may or may not be immediately apparent to you what AMD has done here—I know I had to get some clarification directly from AMD CTO Chuck Moore (also one of the key engineers behind this design) before I was clear that AMD was doing what I thought.

In a nutshell, AMD has taken two out-of-order back-ends and made them share a single front-end and a single floating-point/SIMD unit. Here's how this works.

A single Bulldozer "module" looks to the OS like a single processor core with simultaneous multithreading (SMT) enabled, which makes sense, because that's essentially what it is. But unlike a normal SMT core, instructions from each thread are dispatched, tracked throughout the execution process, and retired by a dedicated, per-thread instruction window. And when instructions from one thread retire, they write their results out to that thread's dedicated data cache (so each module has two d-caches).
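The structure described above can be sketched in a few lines: one shared front-end steering instructions into two per-thread back-ends, each with its own instruction window and its own data cache. This is purely my own illustrative model of the arrangement, not AMD's design:

```python
# Toy sketch of a Bulldozer-style module: a shared front-end feeds two
# per-thread back-ends, each with a dedicated instruction window and
# data cache. Structure and names are my assumptions, for illustration.

class Backend:
    def __init__(self, thread_id):
        self.thread_id = thread_id
        self.window = []    # dedicated instruction window for this thread
        self.dcache = {}    # dedicated data cache (one per back-end)

    def retire(self):
        # Retiring stores write results into this thread's own d-cache.
        for insn in self.window:
            if insn["op"] == "store":
                self.dcache[insn["addr"]] = insn["value"]
        self.window.clear()

class Module:
    """One OS-visible 'core': shared fetch/decode, two execution back-ends."""
    def __init__(self):
        self.backends = {0: Backend(0), 1: Backend(1)}

    def fetch_and_dispatch(self, insn):
        # The single front-end steers each instruction into the window
        # of the thread it belongs to.
        self.backends[insn["thread"]].window.append(insn)

m = Module()
m.fetch_and_dispatch({"thread": 0, "op": "store", "addr": 0x10, "value": 1})
m.fetch_and_dispatch({"thread": 1, "op": "store", "addr": 0x10, "value": 2})
for b in m.backends.values():
    b.retire()
print(m.backends[0].dcache, m.backends[1].dcache)  # each thread wrote its own cache
```

The contrast with ordinary SMT is the key: in a conventional SMT core the two threads would share one instruction window and one d-cache, whereas here only the front-end (and, as described below the slide, the FP/SIMD unit) is shared.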

AMD has not said how many instructions per cycle the front-end can dispatch, but it can't be less than four, and it may be as high as six or eight, depending on the amount of decode hardware.

As you can see in the diagram above, there are two integer schedulers, each of which feeds four pipelines: two integer pipes and two memory pipes (load and store). Right now, AMD is referring to each integer scheduler and the pipelines associated with it as a "core," making each Bulldozer module "dual-core." I think this terminology is a huge mistake, and I hope AMD rethinks it. It's probably better to call each back-end an "execution core"—a term that I actually use in my book—in contrast to a "processor core" or just a "core," which is the front-end and everything behind it.

Both threads share a large floating-point/SIMD unit whose two 128-bit pipes support a new, probably single-cycle FMAC instruction. Though AMD didn't say, my guess is that both of these FMAC pipes are symmetric, meaning that they have identical functionality. It's not really clear how AMD feeds this shared unit's scheduler from two separate instruction windows.
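For readers wondering why FMAC is worth dedicating hardware to: a fused multiply-accumulate computes a*b + c with a single rounding step, instead of rounding once after the multiply and again after the add. The snippet below emulates the fused result exactly with rational arithmetic; it illustrates the numerics only and says nothing about AMD's implementation:

```python
# Why fused multiply-accumulate matters numerically: a*b + c is rounded
# once, rather than once after the multiply and again after the add.
# We emulate a fused result exactly with rationals (illustration only).
from fractions import Fraction

a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

separate = a * b + c                                    # two roundings
fused = float(Fraction(a) * Fraction(b) + Fraction(c))  # one rounding

print(separate)  # 0.0 -- the tiny product term was rounded away
print(fused)     # -8.673617379884035e-19, i.e. exactly -2**-60
```

Here a*b is exactly 1 - 2**-60, which a double rounds up to 1.0, so the separately rounded version loses the small term entirely while the fused version keeps it. Beyond accuracy, a fused unit also does two FLOPs per instruction, which is why FMA hardware matters for peak throughput numbers.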

Right now, there isn't enough information out there to speculate on how competitive Bulldozer and Bobcat will be with Intel's 2011 lineup; AMD will begin doling out more details in a series of papers, starting next year. As the picture begins to get fleshed out, we'll be able to gain a better understanding of AMD's long-term competitive prospects.