AMD CTO reveals first Steamroller details

Ever since the relatively disappointing debut of AMD's Bulldozer microarchitecture, we've been curious to find out what happens next. New architectures sometimes have their share of troubles, but they often bring with them quite a bit of headroom for improvement, especially once there's operating silicon to be examined and optimized. Bulldozer seemed to have more than the usual share of problems, so the question became: does it have a correspondingly large amount of headroom for improvements in successive revisions?

The first update to the architecture, code-named Piledriver, debuted aboard the Trinity mobile APU. Although the performance improvements to the Piledriver core were fairly modest, the new architecture's superior dynamic voltage and frequency scaling helped Trinity achieve substantially higher performance per watt than Llano, which was based on the older "Stars" CPU architecture. Piledriver is slowly making its way into desktop systems aboard the Trinity APU, and we should see a broader desktop Trinity launch very soon. However, the eight-core "Vishera" processor isn't expected until next year. Regardless, our conversations with AMD architects have made one thing clear: the generation after Piledriver, code-named Steamroller, is where the big gains in performance should happen.

That leads us to today, at the Hot Chips conference, where AMD CTO Mark Papermaster delivered a keynote speech detailing some of the tweaks the firm is making to the Steamroller core in order to boost its per-clock throughput and power efficiency. Sadly, we didn't attend Hot Chips this year, but we are in possession of the slides from Papermaster's speech, which we can share with you. Let's walk through them and see what we can learn about AMD's upcoming architectural refresh.

The first slide reads like a simple acknowledgement of Bulldozer's current weaknesses, including the Amdahl's Law problem created by its relatively weak single-core performance. We'd hope for these areas to improve in subsequent generations. Otherwise, the basic layout shown here looks like any other Bulldozer overview.
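For readers who want to see why weak single-core performance hurts even a chip with plenty of cores, here is a minimal sketch of Amdahl's Law in Python. The function name and the sample parallel fractions are our own illustrative choices, not figures from AMD's slides.

```python
def amdahl_speedup(parallel_fraction: float, cores: int) -> float:
    """Ideal speedup from Amdahl's Law: 1 / ((1 - p) + p / n).

    parallel_fraction (p) is the share of the work that can run in parallel;
    cores (n) is the number of cores available. Both values are hypothetical
    inputs for illustration, not measurements of any AMD processor.
    """
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if cores < 1:
        raise ValueError("cores must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / cores)


if __name__ == "__main__":
    # Illustrative parallel fractions only; real workloads vary widely.
    for p in (0.50, 0.90, 0.99):
        print(f"p = {p:.2f}: ideal speedup on 8 cores = {amdahl_speedup(p, 8):.2f}x")
```

The serial portion of a program runs on one core no matter how many are available, so the lower the parallel fraction, the more overall performance depends on how fast a single core is.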

The 'net has been rife with speculation about the primary sources of Bulldozer's problems. Looks like the shared front end of the dual-core Bulldozer "module" is indeed one of the culprits. Steamroller gets separate, dedicated decoders for each integer core, along with larger instruction caches.

There are some very big numbers in this slide, given what they represent. Branch mispredictions drop by 20%, instruction cache misses by 30%. Per-thread instruction dispatches that use the full width of the execution units are up by a quarter. Overall, these changes add up to a whopping 30% improvement in ops dispatched per clock cycle—and these numbers are based on simulation, not just hopeful estimation. Even more notably, this 30% figure comes from simulated client-focused workloads, including "digital media, productivity and gaming applications," not just the server-class applications for which the original Bulldozer core was so obviously tuned.

Presumably, the revised front end is the single biggest improvement in Steamroller. Provided the rest of the engine can cope with how it's being fed, these changes could result in a formidable boost to overall performance.

Steamroller's cores should be better equipped to cope with the front end's higher dispatch rate thanks to some changes to the schedulers and the memory subsystem. We don't have too many specifics here, but the 5-10% improvement in scheduling efficiency again comes from simulation of client-side workloads like "digital media, productivity and gaming applications."

Zooming back out, this slide offers a look at some power-efficiency provisions baked into Steamroller. The instruction fetch optimization, which detects loops and handles them more efficiently, is a familiar trick. The dynamic L2 cache resizing makes sense, too, since it's a shared resource used by both integer cores and the working data set of different threads can vary. If not all of the L2 cache is needed, portions of it can be powered down.

We're unsure what the floating-point "rebalance" is all about. Currently, Bulldozer's floating-point performance is relatively weak, in part because a single FPU is shared between two integer cores. Streamlining the FPU's execution hardware might save power, as is being claimed here, but we worry about performance. If "adjust to application trends" means hardware better suited to common workloads, then fine. If it means "gut the FPU and rely on the graphics processor to do floating-point math," well, that's less promising. We'll have to get more specifics about what's happening here. Update: AMD tells us it's not reducing the execution capabilities of the FP units at all. They've simply identified some redundancies (for instance, in the MMX units) and will be re-using some hardware in order "to save power and area, with no performance impact." So no worries there, it seems.

Moving from architecture to design opens up more opportunity. As you may recall, Bulldozer is relatively large for a 32-nm chip with its transistor count, especially after AMD revised down the transistor count estimate. Apparently, there's plenty of room for improvement even in the same process node.

Shown above is a portion of the chip's FPU. The top image comes from a current Bulldozer chip, which employs the hand-drawn custom logic that's generally used in high-end x86 CPUs. The lower image comes from a potential future chip that uses a more automated high-density cell library. On the same 32-nm process node, the high-density library purportedly crams the same logic into 30% less area, with 30% less power use. As the slide notes, gains on this order would usually come from the transition to a newer, smaller fabrication process. We'd expect the more automated approach to design to reduce AMD's time to market, as well.
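To put the 30% area figure in perspective, here is a rough back-of-the-envelope sketch in Python. It rests on an idealized assumption of ours, that die area scales with the square of the feature size, which real fabrication processes only loosely follow. The function name is hypothetical.

```python
import math


def equivalent_feature_size(current_nm: float, area_reduction: float) -> float:
    """Feature size that would yield the same area savings from a pure shrink.

    Assumes, ideally, that area scales with the square of the feature size.
    That is a simplification for illustration, not a claim about any
    particular process or about how AMD arrived at its numbers.
    """
    if not 0.0 <= area_reduction < 1.0:
        raise ValueError("area_reduction must be in [0, 1)")
    return current_nm * math.sqrt(1.0 - area_reduction)


if __name__ == "__main__":
    # 30% less area at 32 nm, per the slide discussed above.
    size = equivalent_feature_size(32.0, 0.30)
    print(f"Equivalent idealized feature size: {size:.1f} nm")
```

Under that simplified model, a 30% area reduction at 32 nm corresponds to a linear shrink to a bit under 27 nm.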

What we don't know is when we'll see a product designed using a high-density cell library like this one. AMD tells us the future processor illustrated here is a post-Steamroller design, and it therefore seems likely that any improvements realized by using these tools will happen on a future process node, not at 32 nm.

When AMD purchased SeaMicro earlier this year, it acquired the technology used to build low-power, high-density server arrays that combine multiple CPU and system modules with a shared pool of virtualized I/O resources. The firm has since stated that it plans to hand over this fabric technology to its system vendor partners, to enable them to create high-density AMD-based solutions. Now, that fabric tech has a name: Freedom Fabric.

At the time of the acquisition, SeaMicro didn't offer Opteron-based solutions, but AMD stated its intention to deliver Opteron-based offerings in the second half of this year. The card pictured above is the hardware that delivers on that promise; it's populated with a Bulldozer-derived Opteron 4256 processor and dual SO-DIMMs. The four chips across the top are labeled "SeaMicro" and likely manage the interconnect between this module and the rest of the system. As you can see, the card itself is pretty compact, at roughly 12" in length and about half that in height. You can imagine a host of these cards packed into a single enclosure as part of a very high-density cloud server solution. Offerings like this one may be attractive enough to allow AMD to get a toe-hold in the growing cloud and blade server markets.

Meanwhile, we're looking forward to more potent CPUs based on Steamroller, which should boost server performance while also addressing Bulldozer's biggest liability: client-side single-thread performance. Although we have a few numbers now about improvements to the individual portions of Steamroller, we don't yet have any sense of the overall IPC gains or how those might combine with clock frequency increases to affect overall performance. For that, we'll probably have to wait a while yet.
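As a reminder of how those two unknowns interact, here is a minimal Python sketch that treats single-thread performance as roughly proportional to IPC times clock frequency. That proportionality is a common first-order approximation, not AMD's model, and the gains plugged in below are placeholders of our own, not predictions for Steamroller.

```python
def relative_performance_gain(ipc_gain: float, clock_gain: float) -> float:
    """Combined gain when performance is modeled as IPC x clock frequency.

    Both arguments are fractional gains (0.10 means +10%). The model ignores
    memory bottlenecks and other real-world effects, so treat the result as a
    first-order estimate only.
    """
    return (1.0 + ipc_gain) * (1.0 + clock_gain) - 1.0


if __name__ == "__main__":
    # Placeholder inputs for illustration only; AMD has not disclosed
    # overall IPC or clock-speed figures for Steamroller.
    gain = relative_performance_gain(ipc_gain=0.10, clock_gain=0.05)
    print(f"Combined gain: {gain:.1%}")
```

The gains compound rather than simply add, which is why a modest IPC improvement paired with a modest clock bump can be worth slightly more than the sum of the two.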