AMD shakes up x86 CPU world with two new designs

One of the main messages from AMD's financial analyst day was, "x86 …

SUNNYVALE — Companies rarely make big news at financial analyst day events, but AMD bucked that trend Wednesday by unveiling details of its newly revamped roadmap, its two brand-new processor architectures, and its plans for CPU/GPU integration. (AMD and Intel also made some other news together). Rather than attempt a comprehensive overview of what was announced, I'll walk you through the two new processor architectures, leaving the CPU/GPU "Fusion" revelations and roadmap specifics for a second article.

Bobcat: AMD's new mobile architecture

The slide below shows Bobcat, the codename for AMD's new-from-the-ground-up microarchitecture that's aimed at portables and SoCs. Bobcat will compete with Atom and with VIA's Nano, though it has much more in common with the latter than the former.

AMD's Bobcat

Bobcat is an out-of-order processor that can dispatch up to two instructions per cycle from its front-end to the integer and/or floating-point schedulers. Attached to the integer schedulers are four pipelines, two integer pipes and two memory pipes (one load, one store). There is no word on the depth of the integer pipeline as of yet, but I would be shocked if it were less than 12 stages or more than 20.

Attached to the floating-point scheduler are two floating-point/SIMD pipelines. Not much was said about either of these pipes, but if I had to guess, I'd bet that both support common, scalar, fully pipelined double-precision floating-point operations, with one pipe handling FDIV/FMUL and SIMD permute instructions while the other handles SIMD scalar operations. AMD did reveal that the unit supports SSE flavors 1 through 3.

AMD notes that this core is a synthesizable IP block that's designed to be mixed and matched with other blocks on an SoC. What this means in English is that AMD stores the CPU block in a high-level description language that then gets compiled down into logic gates and laid out on the chip by an automated toolset. The decision to do things this way, vs. the traditional method of hand-customizing much of the lower-level design, means trading off some performance and power efficiency for flexibility and time-to-market.

As you can see from the slide, AMD is targeting the sub-1W power envelope with Bobcat, though at launch it will probably hit this target only for the very lowest clockspeed parts; the higher-clocked parts will certainly be above 1W, and possibly up to 2 or 2.5W.

By the time this core launches in 2011 on a 32nm SOI process, Intel will have had Atom on 32nm for a while and will be eyeing 22nm. So, while an out-of-order design like Bobcat will certainly smoke Atom on a clock-for-clock basis, it's hard to predict where Intel will have taken Atom in that timeframe.

Bulldozer: AMD's server architecture

AMD's newly announced high-end processor architecture is a significant departure from the architecture that powers the company's existing processor line. It represents the implementation of an idea that quite a few folks have tossed around, but no one has really made work yet. Take a look at the Bulldozer "module" depicted in the slide below:

AMD's Bulldozer

It may or may not be immediately apparent to you what AMD has done here—I know I had to get some clarification directly from AMD CTO Chuck Moore (also one of the key engineers behind this design) before I was clear that AMD was doing what I thought.

In a nutshell, AMD has taken two out-of-order back-ends and made them share a single front-end and a single floating-point/SIMD unit. Here's how this works.

A single Bulldozer "module" looks to the OS like a single processor core with simultaneous multithreading (SMT) enabled, which makes sense, because that's essentially what it is. But unlike a normal SMT core, instructions from each thread are dispatched, tracked throughout the execution process, and retired by a dedicated instruction window. And when instructions from one thread retire, they write their results out to a dedicated data cache (so each module has two d-caches).

AMD has not said how many instructions per cycle the front-end can dispatch, but it can't be less than four, and it may be as high as six or eight, depending on the amount of decode hardware.

As you can see in the diagram above, there are two integer schedulers, each of which feeds four pipelines: two integer pipes and two memory pipes (load and store). Right now, AMD is referring to each integer scheduler and the pipelines associated with it as a "core," making each Bulldozer module "dual-core." I think this terminology is a huge mistake, and I hope AMD rethinks it. It's probably better to call each back-end an "execution core"—a term that I actually use in my book—in contrast to a "processor core" or just a "core," which is the front-end and everything behind it.

Both threads share a large, 128-bit FPU/SIMD that supports a new, probably single-cycle FMAC instruction. Though AMD didn't say, my guess is that both of these FPU/SIMD units are symmetric, meaning that they have identical functionality. It's not really clear how AMD manages this shared scheduler via two separate instruction windows.

Right now, there isn't enough information out there to speculate on how competitive Bulldozer and Bobcat will be with Intel's 2011 lineup; AMD will begin doling out more details in a series of papers, starting next year. As the picture begins to get fleshed out, we'll be able to gain a better understanding of AMD's long-term competitive prospects.

Looks nice. I just wonder if the timing of the announcement had something to do with the dropped Intel lawsuits and the cross-licensing agreement. Like maybe this is borrowing some IP from Intel's QPI or something. Can't wait to see what this can do. May even switch my CPU back to AMD at that point in a new build.

Looking forward to the next article, Jon. Seems like they're opening the door to farming parallel execution off to an integrated GPU seamlessly. The idea being that SIMD code will just run on the GPU without the need for a recompile or special attention from the OS / developers.

I'm also interested to see how they treat memory caches between the cpu and gpu on a fusion chip.

One can argue 'til the cows come home as to whether or not it's better or worse than Intel, but one thing is for certain: AMD wins the processor naming war. Things like "Bulldozer" and "Clawhammer" before it beat the heck out of names like "Lynnfield" and "Wolfdale." I'm looking forward to "Chainsaw" and "Locomotive."

If you are suggesting that they have one pipe that does only the non-packed SSE instructions that sure doesn't make sense to me!

And having one pipe that did both the shuffles (ain't no permutes in SSE ) and the floating point multiply add instructions would ALSO be a bad idea ... because most serious "high performance computing" SIMD code is almost all floating-point multiply-add with some shuffles/permutes thrown in.

The reason Altivec ran so fast in the early going compared to SSE (in the CPU at least, presuming you could move the data in the G4 era) was that Altivec was truly 128 bits wide per cycle (early SSE wasn't), issued a fused multiply-add as a single instruction, AND could issue a permute instruction ... this has been true of every PPC implementation since .. too.

SSE has a bunch of ISA attributes which make it less-than-great ... Larrabee (if it ever gets out the door) fixes a bunch of them.

With SSE the way it is, you REALLY want to be able to issue at least a packed fp multiply, packed fp add, a packed shuffle, and hopefully a packed load and a packed store ... per cycle. That's a lot of issue. And on top of that, because SSE is duple coded, you often need to do register-to-register moves, so you'd like one more issue to do that.

That's a ton of issue -- 6 issue per cycle, sustained. I've never figured out exactly how much of this Core2 can do, but quite a bit.

Jon, Any commentary on the bulldozer design? You provided the 'what' but not the 'why'. By sharing a front end and FP unit do they maximize the die space for 2 processors by trading off loss of throughput? Will single threaded apps run like a beast on these because they basically get access to the internals of 2 CPU cores?

Originally posted by mpat:Interesting. Any ideas why AMD didn't simply implement SMT instead of doubling the integer execution units? That seems like a very efficient way to use the available transistors.

This really doesn't make sense to me either. One advantage of SMT is that it allows you to hide some of the memory latency; doubling up on the execution units like this does not, and now they are both competing for an L2 cache as well.

Also, this leaves the chip rather FP starved, and as BadAndy points out, there just isn't enough issue width for the vector unit to begin with.

Originally posted by Ravedave:Will single threaded apps run like a beast on these because they basically get access to the internals of 2 CPU cores?

I don't think so. It looks like the execution stream divides permanently after the reshuffle and decode stages. One instruction stream will coast right through the underutilized front end and funnel through to *a* back end.

I am also not a hardware guy, but I have to ask, why are they focusing on x86 architecture for 2011? The mobile CPU I get. I know that Windows isn't the only O/s there is, especially in the server space, but seeing how Windows 7 / Server 2008 are the last iterations to support x86 CPUs, what do they hope to do with this Bulldozer CPU?

It's a shame they don't come out for so long. I'm just about to build a home file server, and the Atom doesn't appear to be strong enough for my needs, but everything else just has really high (relatively) power consumption.

Originally posted by severusx:I am also not a hardware guy, but I have to ask, why are they focusing on x86 architecture for 2011? The mobile CPU I get. I know that Windows isn't the only O/s there is, especially in the server space, but seeing how Windows 7 / Server 2008 are the last iterations to support x86 CPUs, what do they hope to do with this Bulldozer CPU?

There's so much here that's fascinating, and unexpected. It's essentially dual core, but only for integer math! Presumably that's going to have some interesting performance implications: great threaded performance on integer code (and perhaps good power/performance characteristics for multiple threaded int workflows), but maybe a certain amount of cross-thread competition when there's much FP on more than one thread.

Of course, for many applications, FP is relatively rare. And there will be 8 (? 16?) of these cores per die. Perhaps the idea is that max performance can be achieved by carefully grouping int-heavy and fp-heavy threads amongst cores. A sort of asymmetry-without-asymmetry. At least, it looks like a single integer-intensive thread could coexist with a float-heavy thread on the same core very happily.

So I guess the contention is that few people will need more than N threads of FP, where N = number of cores. At least, not on the CPU.

I think the proper way to look a Bulldozer is as "dual core lite" and not as an odd SMT machine. If you size the L1I cache properly, you can get most of the benefits of SMT with less complexity. Depending on the design of the front end, it could take less area than a "true" dual core design, and use less power. The biggest downside when compared to SMT is that you likely have more execution hardware idle when the workload is light.

Of course, you've also added complexity to the front end and the FP unit, but I can see that being more manageable than SMT if you're sufficiently clever.

Originally posted by Cenotaph:I think the proper way to look a Bulldozer is as "dual core lite" and not as an odd SMT machine. If you size the L1I cache properly, you can get most of the benefits of SMT with less complexity. Depending on the design of the front end, it could take less area than a "true" dual core design, and use less power. The biggest downside when compared to SMT is that you likely have more execution hardware idle when the workload is light.

What benefit of SMT is this getting you? SMT is basically a way of making better use of existing execution hardware; this has a completely different purpose. It seems like a simple dual core with a shared front end and L2.

Originally posted by Cenotaph:Of course, you've also added complexity to the front end and the FP unit, but I can see that being more manageable than SMT if you're sufficiently clever.

Well, only AMD's engineers know, but I would think that SMT is less complex, and offers more.

Look, more PPT slides from AMD. If PPT slides were shipping products, AMD would have put Intel out of business 2 years ago. Wake me when AMD has a shipping product and then let's ooooo and ahhhhhh over it.

PS- And no, I have nothing personal against AMD or its products, other than not caring about anything they have put out over the last few years. I used to be an AMD man. But Intel has been on their game for a while, while AMD has...stagnated, I guess, would be the most appropriate word. Again, I don't care about more PPT slides. Show us working samples or something. ANYTHING!?? Instead of slides that are at minimum a year out.

Though AMD didn't say, my guess is that both of these FPU/SIMD units are symmetric, meaning that they have identical functionality. It's not really clear how AMD manages this shared scheduler via two separate instruction windows.

Heh. This I somehow doubt AMD will clarify anytime Real Soon. I don't know enough to argue with BadAndy's issue reservations, but otherwise I'd say that, at least from a quick eyeball, a single-cycle 128-bit wide FMAC SIMD unit per execution core will be anything but FP starved. I'm assuming of course, that a roomful of AMD engineers running detailed simulations over a period of six or (many) more months can optimize core resources at least as well as anyone here can after a moment's glance. But the proof of this pudding will be in the application benchmarks. We shall see!

I don't agree that AMD is aiming for the <1W target. The slide says "capable", a rather conservative word. Intel did much the same when announcing the x86 replacement for XScale (ARM), and we all know how Atom exceeded those 'expectations' by about two times.

AMD is, as AMD is wont to be, lacking innovation by confusing it with novelty. Bobcat is a K8 (which was a K7) lacking width. Bulldozer is a K8 with more width.

Bulldozer is aimed at ILP which really doesn't limit performance as much as it did when K7 was the awesome.

AMD has refreshed K7 too much, it's time for a big redesign (e.g. i7 or even Core).

Originally posted by Emon:I hope these are actually groundbreaking unlike their last few "innovations." It would be nice to buy AMD again.

I always buy AMD. I was about to buy some HP/Intel notebook, but after hearing this good news? F*** that. I'm going to wait for this stuff. ARM CPUs run under 1 watt, but they can't play HD without the aid of dedicated graphics, and with that aid comes another 1.2 watts of power requirement. AMD, with its Fusion tech, can do HD in under 2 watts.

What I think many people are missing here is that this might allow AMD to get back into the GHz game without creating smoke. Intel has a pretty obvious lead here and even if it isn't marketing fodder right now, it keeps Intel comfortably ahead of AMD. It probably shouldn't be looked at as a game either as clock rate is still important as long as power can be managed.

So in effect what we have here is a light dual-core processor. I actually think this is a smart move relative to SMT, which has highly variable payoff. So let's imagine we have four of these "cores" in a desktop chip, effectively 8 threads, and that allows them to run the chip at 3.5GHz and at less than 45 watts. Would people want that? I would, and in fact would love to see a modern OS like Linux, BSD or Mac OS X running on such a chip. Especially if the OS is heavily threaded and maybe supporting libdispatch or a similar threading library. Interestingly, I mentioned 4 "cores," but just what is the die area here? They might skip four cores and go with even more.

At least this is what I'm thinking when I see this chip description, AMD is targeting highly threaded OS'es with a design that is both low power and relatively high in clock rate capability. The other thing is that there seems to be an awareness that if you are doing heavy floating point you are likely to be doing it outside of the CPU complex. That is GPU based computing, especially now that GPU's are coming with better threading and DP support.

Personally I think this is a forward looking design that needs to get here faster than 2011. The biggest problem here is the lack of knowledge about what is outside the box the "core" is in. Bulldozer could make for an excellent SoC processor even if they aren't publicly going after that market.

Originally posted by BadAndy:If you are suggesting that they have one pipe that does only the non-packed SSE instructions that sure doesn't make sense to me!

And having one pipe that did both the shuffles (ain't no permutes in SSE ) and the floating point multiply add instructions would ALSO be a bad idea ... because most serious "high performance computing" SIMD code is almost all floating-point multiply-add with some shuffles/permutes thrown in.

I thought the same thing when I read the article. I would love to see the wafer design, as it strikes me that this is more of a move to optimize the number of good chips per wafer than any kind of a stab at real performance gains. With pipelines organized in sets (I assume a lot here from these very high-level diagrams), as errors creep into wafer production, those deficient FP pipelines simply become INT pipelines instead. Do-able if they are serviced from the same cache. Perhaps it would give them a huge jump in production quality capabilities and allow them to severely lower the price of each good unit.

Originally posted by BadAndy:The reason Altivec ran so fast in the early going compared to SSE (in the CPU at least, presuming you could move the data in the G4 era) was that Altivec was truly 128 bits wide per cycle (early SSE wasn't), issued a fused multiply-add as a single instruction, AND could issue a permute instruction ... this has been true of every PPC implementation since .. too.

SSSE3 has a permute (pshufb), and also palignr, so you can do vperm pretty well.

I meant to hit this point in the article but I guess I forgot... the reason you split the instruction window like this is that it is a hugely costly structure in terms of power. The structures that track in-flight instructions are one part of the chip (along with the decoder) that is /always/ active... there's no way to do power optimization on the instruction window by shutting it off.

Every entry you add to this structure increases the power draw until you reach a point of diminishing returns. So my impression is that splitting it up into one window per thread makes it more efficient, at the very least by making two half-size structures that you can turn on and off as needed. So if one thread is stalled you can shut down its window, vs. a traditional SMT design where, if one thread is stalled, the other one is still using the full shared window.

Jon, what do you mean by a "single-cycle FMAC"? Surely you mean pipelined, can issue one every cycle, not that it takes one cycle to execute the instruction. If they could execute an FMAC in one cycle, that would be a *true* revelation!

(I'd estimate it takes about 5 cycles to do a double-precision FMAC at 2GHz.)

What I want to know is how many DIMMs per core Bulldozer will support, because THAT is the number one driver of power efficiency in the datacenter right now. Nehalem isn't a winner because it's so much faster than Istanbul; it's a winner because I can get 72GB in a cheap 2-socket server today and will be able to get 144GB in that same server next year when 8GB DIMMs get cheap. That's really the one big breakthrough for Cisco's UCS (6 DIMMs per core) and an area where I think a lot of other people will be looking over the next year or two.

NOTE: I don't mean to make myself sound like an expert; this is entirely amateur conjecture. I'm trying to bait Jon or one of the other people who really know what they're talking about into talking more about this topic. Please don't take me as an authoritative source on this topic.

Trying to put together the analysis of AMD's intention here with the K10 block diagrams I have seen it looks like this from an execution perspective:

AMD looks to take K10's 3 ALU's + 1 FP per core and more than double INT power to 8 ALU's and 2 improved FP per core.

I believe it was Anand's article that was making a lot of hay out of AMD's attempt to make a substantive response to SMT. My recollection is, going back to the K7 days, SMT was a 'patch' for Netburst's comically deep pipe. I did not get to see enough detail about what Intel changed with HTT to get a sense of what has changed about it to make substantive improvements out of a much less deeply pipelined core in Nehalem, but my guess is that they simply wanted to wring more performance out of their much 'wider' (in terms of execution hardware) core. In the K7 days, AMD never implemented SMT - and as far as I can recall it was always because compared to Netburst, AMD's pipeline was about a third as long and wasn't as significantly impacted by stalls.

Given what I have read about Conroe, I am guessing that Intel's Nehalem SMT implementation has more to do with execution width and less with pipeline depth (branch mispredict), though through attrition it's possible the Nehalem/Conroe pipeline is long enough to benefit from a Netburst-style implementation. AMD was farther along the path of a broad core than Intel was, so its feet never left the ground; and I think there was probably a sense at AMD that such a seemingly 'sloppy' solution to a design problem (pipeline too long) wasn't a good use of die area. It was much better to build better branch prediction or more cache (for which AMD processors have always been starved) than invest engineering resources into SMT (which gives highly workload-dependent improvements, if anything, and on occasion hurts throughput).

As we enter the 'multi-core aware' era, though, and as AMD broadens their execution hardware further, it sounds like they had to confront the reality of idle hardware. Jon's point about utilizing the front-end units more efficiently makes a lot of sense in this regard. Where Intel's solution attempts to graft a second thread onto a fundamentally single-threaded core, AMD is designing their hardware with core-level multi-threading in mind, in a way that balances against that workload more efficiently (instead of depending on latency stalls to free up the execution hardware needed to run the second thread).

The thing that is most confusing to me is the link to fusion. How exactly will AMD implement FP offload to the GPU element of the system? In software? It seems like this couldn't be accomplished without significant changes in the ISA (and thus significant changes in compilers and OS's).

Yay, new architecture for AMD, though. Here's to hoping they continue putting more and better information about their designs out there (a la Intel) to enhance our understanding of what they are trying to accomplish.

Originally posted by Hat Monster:AMD has refreshed K7 too much, it's time for a big redesign (e.g. i7 or even Core).

Reinventing everything isn't necessarily a good thing. There's pretty distinct similarities between a Core i7 and a Pentium 3. It's not a bad way to build a chip, and it's not that different from what AMD's doing. Intel's just better at it, and doing it on a smaller process.

At first blush, my inclination would be to think doing things this way will help threads avoid stomping on each other as much as you usually see in SMT, but throughput per watt will probably suffer.

That'll probably hurt for the server, and help for the PC. The other thing that might help is the fact that Intel's still not going to have a competitive graphics architecture, though that depends on how important GPGPU stuff becomes over the next few years.

Originally posted by Karoch Sharon:The thing that is most confusing to me is the link to fusion. How exactly will AMD implement FP offload to the GPU element of the system? In software? It seems like this couldn't be accomplished without significant changes in the ISA (and thus significant changes in compilers and OS's).

The SSE hardware is right there in the block diagram. If developers want the GPU's help, they'll have to target the GPU through software just like they do now. Fusion isn't going to change that, it's just going to make that GPU cheaper.

A lot of the articles on Bulldozer so far that talk about this design philosophy (the increase in INT power while, depending on how the new FP units are laid out, idling or at best doubling FP power) seem to imply that it's predicated on the use of an on-die GPU for FP offload; that is to say, it's easy to throw lots of die area at INT performance when you've already got the mother of all FP pipelines on the same die/package. The roadmaps indicate that Fusion will first use a K10/10.5 core and then ultimately Bulldozer, which means that at some point this will have to be addressed.

While the simplest solution is what you are implying (that AMD is simply going to stick a PCIe controller into the K10.5 uncore and plug the Fusion GPU into that), most sites/analyses suggest that AMD is going for more integration than that (a la something more like Larrabee: a general-purpose data-parallel accelerator).