Today at the annual Hot Chips conference, AMD’s new CTO Mark Papermaster unveiled the first details about the Steamroller x86 CPU core.

Steamroller is the third instantiation of AMD’s Bulldozer architecture, first conceived in the mid-2000s and finally brought to market in late 2011. Committed to this architecture for at least one more design after Steamroller, AMD has settled on roughly yearly updates to the architecture. For 2012 we have the introduction of Piledriver, the optimized Bulldozer derivative that formed the CPU foundation for AMD’s Trinity APU. By the end of the year we’ll also see a high-end desktop CPU without processor graphics based on Piledriver.

Piledriver saw a switch to hard edge flip flops, which allowed for a considerable decrease in power consumption at the expense of careful design and validation work. Performance didn’t change, but AMD saw a 10% - 20% reduction in active power. Piledriver also brought some scheduling efficiency improvements, while better prefetching and branch prediction were its two other major design improvements.

Steamroller is designed to keep the ball rolling. It takes fundamentals from the Bulldozer/Piledriver architectures and offers a healthy set of evolutionary improvements on top of them. In Intel speak Steamroller wouldn’t be a tick as it isn’t accompanied by a significant process change (28nm bulk is pretty close to 32nm SOI), but it’s not a tock either, as the architecture is enhanced but fundamentally unchanged. Steamroller falls somewhere between those two extremes.

Front End Improvements

One of the biggest issues with the front end of Bulldozer and Piledriver is the shared fetch and decode hardware. This table from our original Bulldozer review helps illustrate the problem:

Front End Comparison

                                   AMD Phenom II          AMD FX             Intel Core i7
Instruction Decode Width           3-wide                 4-wide             4-wide
Single Core Peak Decode Rate       3 instructions         4 instructions     4 instructions
Dual Core Peak Decode Rate         6 instructions         4 instructions     8 instructions
Quad Core Peak Decode Rate         12 instructions        8 instructions     16 instructions
Six/Eight Core Peak Decode Rate    18 instructions (6C)   16 instructions    24 instructions (6C)

Steamroller addresses this by duplicating the decode hardware in each module. Now each core has its own 4-wide instruction decoder, and both decoders can operate in parallel rather than alternating every other cycle. Don’t expect a doubling of performance since it’s rare that a 4-issue front end sees anywhere near full utilization, but this is easily the single largest performance improvement from all of the changes in Steamroller.
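The decode arithmetic behind that comparison can be sketched in a few lines of Python. This is only an illustration of peak rates: the function name and its parameters are hypothetical, and the Steamroller figures are derived from the description above (one 4-wide decoder per core) rather than from numbers published by AMD.

```python
def peak_decode_rate(cores: int, decode_width: int, cores_per_decoder: int = 1) -> int:
    """Peak instructions decoded per cycle across `cores` active cores.

    `cores_per_decoder` is an illustrative parameter: 2 models a
    Bulldozer/Piledriver module (two cores sharing one decoder),
    1 models a design where every core has its own decoder.
    """
    # Round up: a partially used module still has one whole decoder.
    decoders = -(-cores // cores_per_decoder)
    return decoders * decode_width


# Bulldozer/Piledriver (AMD FX): 4-wide decoder shared by two cores.
fx = [peak_decode_rate(n, 4, cores_per_decoder=2) for n in (1, 2, 4, 8)]

# Steamroller as described: a dedicated 4-wide decoder per core.
steamroller = [peak_decode_rate(n, 4, cores_per_decoder=1) for n in (1, 2, 4, 8)]

print("FX:         ", fx)
print("Steamroller:", steamroller)
```

These are theoretical peaks only. As noted above, a 4-issue front end rarely runs anywhere near full utilization, so real gains will be far smaller than the doubling the raw numbers suggest.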

The penalties are pretty obvious: area goes up as does power consumption. However the tradeoff is likely worth it, and both of these downsides can be offset in other areas of the design as you’ll soon see.

Steamroller inherits the perceptron branch predictor from Piledriver, but in an improved form for better performance (mostly in server workloads). The branch target buffer is also larger, which contributes to a reduction in mispredicted branches of up to 20%.
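For readers unfamiliar with the idea, here is a minimal sketch of a textbook perceptron branch predictor in Python. It is not AMD's implementation, whose details haven't been disclosed. The class name, table size, history length and training threshold are all assumptions chosen for illustration, and real hardware would also use saturating weights, which this sketch omits.

```python
class PerceptronPredictor:
    """Textbook-style perceptron predictor (illustrative, not AMD's design)."""

    def __init__(self, history_length: int = 8, table_size: int = 64):
        self.table_size = table_size
        # Training threshold: an assumed heuristic that scales with history length.
        self.threshold = int(1.93 * history_length + 14)
        # One weight vector per table entry; index 0 is the bias weight.
        self.weights = [[0] * (history_length + 1) for _ in range(table_size)]
        # Global history: +1 for taken, -1 for not taken.
        self.history = [1] * history_length

    def _output(self, pc: int) -> int:
        w = self.weights[pc % self.table_size]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc: int) -> bool:
        # A non-negative dot product means "predict taken".
        return self._output(pc) >= 0

    def update(self, pc: int, taken: bool) -> None:
        y = self._output(pc)
        t = 1 if taken else -1
        # Train on a misprediction or when confidence is below the threshold.
        if (y >= 0) != taken or abs(y) <= self.threshold:
            w = self.weights[pc % self.table_size]
            w[0] += t
            for i, h in enumerate(self.history, start=1):
                w[i] += t * h
        # Shift the actual outcome into the global history.
        self.history = self.history[1:] + [t]


# Usage: a branch that strictly alternates taken / not taken.
predictor = PerceptronPredictor()
for step in range(200):
    predictor.update(pc=0x400, taken=(step % 2 == 0))
```

The appeal of this approach is that each weight learns how strongly one past branch outcome correlates with the current branch, which lets the predictor exploit long histories at a cost that grows only linearly with history length.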

Execution Improvements

AMD streamlined the large, shared floating point unit in each Steamroller module. There’s no change in the execution capabilities of the FPU, but there’s a reduction in overall area. The MMX unit now shares some hardware with the 128-bit FMAC pipes. AMD wouldn’t offer many specifics, saying only that the sharing applies to mutually exclusive MMX/FMA/FP operations and thus wouldn’t result in a performance penalty.

The reduction of pipeline resources is supposed to deliver the same throughput at lower power and area, basically a smarter implementation of the Bulldozer/Piledriver FPU.

There’s no change to the integer execution units themselves, but other changes elsewhere in the design improve integer performance.

The integer and floating point register files are bigger in Steamroller, although AMD isn’t being specific about how much they’ve grown. Load operations (two operands) are also compressed so that they only take a single entry in the physical register file, which helps increase the effective size of each RF.

The scheduling windows also increased in size, which should enable greater utilization of existing execution resources.

Store-to-load forwarding also improves. Steamroller is better than its predecessors at detecting interlocks, cancelling the load and getting the data from the store instead.
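As a rough illustration of what store-to-load forwarding means, the Python sketch below models a queue of in-flight stores that a later load can read from directly. It is a toy model under simplifying assumptions (exact address matches only, no partial overlaps, no speculation), and the `StoreQueue` class and its methods are hypothetical names, not a description of AMD's hardware.

```python
from collections import deque


class StoreQueue:
    """Toy store queue with store-to-load forwarding (illustrative only)."""

    def __init__(self, memory: dict):
        self.memory = memory    # committed state: address -> value
        self.pending = deque()  # in-flight stores, oldest first

    def store(self, address: int, value: int) -> None:
        # The store is buffered; memory is not updated until retirement.
        self.pending.append((address, value))

    def load(self, address: int):
        """Return (value, forwarded)."""
        # Search youngest to oldest so the load sees the most recent store.
        for addr, value in reversed(self.pending):
            if addr == address:
                return value, True  # data comes straight from the store
        # No matching in-flight store: read committed memory instead.
        return self.memory.get(address, 0), False

    def retire_oldest(self) -> None:
        # Commit the oldest buffered store to memory.
        if self.pending:
            addr, value = self.pending.popleft()
            self.memory[addr] = value


# Usage: a load that follows a store to the same address is forwarded.
sq = StoreQueue({0x10: 1})
sq.store(0x10, 7)
print(sq.load(0x10))  # value comes from the pending store
```

The hard part in real hardware is the detection step: spotting quickly and reliably that a load depends on an older store still in flight, which is the area the article says Steamroller improves.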

126 Comments

So they're going to build...
First they Bulldoze the place
Then they bring in the Piledriver laying the postings
Then the Steamroller for the building surroundings
Next comes Excavator! to destroy all the former work
Great plan AMD...

It looks really promising, indeed. Lots of fine tuning there, actually more than just "fine". And they don't need to beat Intel for top performance anyway, just keep up the pressure and give us good mainstream chips with solid single thread performance!

That's what they get beat on all the time, single thread performance - oh and multi thread for that matter. They've been getting creamed on single thread, specifically. Your exclamation point sure points to a fine fantasy never happening future though given the failure that the present is. Must take a lot of fanboyism and some strong prozac in the water.

Fewer and fewer applications are single-threaded, it's a dying part of the market. AMD is every bit as good as Intel and better in its price class. Most apps perform better on FX-8350 than I5 3570k. The FPS are up there with 3770k on many new games. This will only get better over the next year as more and more games offer 8 core processor support. There is absolutely no compelling reason to go Intel for CPUs under $300. With Steamroller the ascension of AMD to a BETTER alternative to Intel will only accelerate. All the initial bad reviews which were based on erroneous testing procedures and old benchmarks are proving to be ancient history and poor analysis.