New Bulldozer deep dive sheds fresh light on AMD’s troubled CPU

Of all the CPUs launched in the past decade, none has been as puzzling as AMD’s Bulldozer. While it’s been widely derided as AMD’s version of Intel’s Pentium 4 Prescott, even early analysis indicated that such comparisons were fundamentally flawed. Bulldozer was an ambitious design to fuse chip logic and improve performance, but teasing out why it didn’t work has been a long, difficult process.

Thanks to Johan De Gelas over at Anandtech, we now have considerably more information on what doesn’t work, and some insight into how AMD can fix it. De Gelas used analysis tools from AMD and Intel to gather data on how Sandy Bridge, Magny-Cours, and Bulldozer/Interlagos execute various workloads and where the performance bottlenecks lie. Conventional wisdom, including my own articles, implied that Bulldozer’s high L2 cache latency was a major factor in the chip’s disappointing performance. His findings indicate that cache latency isn’t the problem I thought it was, at least not in server workloads.

Death by a thousand cuts

There are two major problems undermining Bulldozer’s performance in server workloads. First, there’s the issue of branch prediction. Bulldozer’s branch predictor is more accurate than Magny-Cours’, but Interlagos takes a 20-cycle penalty on a mispredicted branch, whereas Magny-Cours’ penalty is 12 cycles. As Johan explains, this is where comparisons to Prescott miss the mark; Prescott’s misprediction penalty could be as high as 100 cycles.

Sandy Bridge’s branch misprediction penalty is also fairly high, at 17 cycles, but Intel uses a decoded µop cache (roughly 1.5K µops, about the equivalent of a 6KB instruction cache) that brings the penalty down to 14 cycles if the instruction is found there. This is one avenue AMD could potentially explore to improve the chip’s performance.
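To see why a more accurate predictor can still lose to a less accurate one, it helps to put rough numbers on the trade-off. The sketch below uses a simple first-order model: stall cycles per instruction equal the fraction of instructions that are branches, times the misprediction rate, times the penalty. The penalties are the ones quoted above; the branch fraction and misprediction rate are illustrative assumptions, not measurements from the Anandtech analysis, and the function names are made up for this example.

```python
# First-order model of branch misprediction cost. Penalties come from the
# article; the workload numbers below are ASSUMED for illustration only.

PENALTY_CYCLES = {
    "Magny-Cours": 12,
    "Bulldozer/Interlagos": 20,
    "Sandy Bridge (uop cache hit)": 14,
    "Sandy Bridge (uop cache miss)": 17,
}

BRANCH_FRACTION = 0.20   # assumed: 1 in 5 instructions is a branch
MISPREDICT_RATE = 0.05   # assumed: 5% of branches are mispredicted


def mispredict_stall_cpi(branch_fraction, mispredict_rate, penalty_cycles):
    """Average stall cycles per instruction lost to mispredicted branches."""
    return branch_fraction * mispredict_rate * penalty_cycles


def breakeven_mispredict_rate(rate_a, penalty_a, penalty_b):
    """Misprediction rate at which chip B stalls exactly as much as chip A."""
    return rate_a * penalty_a / penalty_b


if __name__ == "__main__":
    # Same assumed workload on every chip, so only the penalty differs.
    for chip, penalty in PENALTY_CYCLES.items():
        cpi = mispredict_stall_cpi(BRANCH_FRACTION, MISPREDICT_RATE, penalty)
        print(f"{chip:32s} {cpi:.3f} stall cycles per instruction")

    # How accurate must Bulldozer's predictor be to match Magny-Cours?
    needed = breakeven_mispredict_rate(
        MISPREDICT_RATE,
        PENALTY_CYCLES["Magny-Cours"],
        PENALTY_CYCLES["Bulldozer/Interlagos"],
    )
    print(f"Bulldozer break-even misprediction rate: {needed:.1%}")
```

In this model the break-even point depends only on the ratio of the two penalties, 12/20. Bulldozer’s predictor would therefore need roughly 40% fewer mispredictions than Magny-Cours’ just to match its stall cost, whatever the actual workload numbers are. Real pipelines overlap some of this latency, so treat the output as a rough guide rather than a prediction.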

L1 cache associativity is the other issue discussed in depth. We’ve known for quite some time that Interlagos’ efficiency suffers when both cores in a module are enabled; additional evidence suggests that turning on the second core in an Interlagos module doubles the number of L1 cache misses in certain workloads. Increasing cache associativity could help counter this.
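The link between associativity and sharing can be illustrated with a toy cache simulator. The sketch below models a set-associative cache with LRU replacement and runs one or two hypothetical threads whose hot lines all map to the same set. The capacity, line size, and access pattern are assumptions chosen to make the effect obvious; this is not a model of Bulldozer’s actual cache, and the pattern is deliberately pathological, so it exaggerates the effect well beyond the doubling described above.

```python
from collections import OrderedDict

LINE_BYTES = 64                  # assumed cache line size
CAPACITY_BYTES = 64 * 1024       # assumed total capacity, held constant
TOTAL_LINES = CAPACITY_BYTES // LINE_BYTES
LINES_PER_THREAD = 2             # hot lines per hypothetical thread


class SetAssociativeCache:
    """Toy set-associative cache with LRU replacement within each set."""

    def __init__(self, num_sets, ways):
        self.num_sets = num_sets
        self.ways = ways
        # One OrderedDict per set; insertion order tracks recency of use.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.misses = 0

    def access(self, address):
        line = address // LINE_BYTES
        index = line % self.num_sets
        tag = line // self.num_sets
        lines_in_set = self.sets[index]
        if tag in lines_in_set:
            lines_in_set.move_to_end(tag)      # mark as most recently used
            return True
        self.misses += 1
        if len(lines_in_set) >= self.ways:
            lines_in_set.popitem(last=False)   # evict least recently used
        lines_in_set[tag] = True
        return False


def count_misses(ways, threads, iterations=100):
    """Interleave the threads' accesses and return the total miss count."""
    cache = SetAssociativeCache(num_sets=TOTAL_LINES // ways, ways=ways)
    for _ in range(iterations):
        for k in range(LINES_PER_THREAD):
            for t in range(threads):
                # Addresses one full cache capacity apart always share a set.
                cache.access((t * LINES_PER_THREAD + k) * CAPACITY_BYTES)
    return cache.misses


if __name__ == "__main__":
    for ways in (2, 4):
        for threads in (1, 2):
            misses = count_misses(ways, threads)
            print(f"{ways}-way, {threads} thread(s): {misses} misses")
```

With one thread, the two hot lines fit in a 2-way set and only the initial cold misses occur. With two threads, four lines compete for two ways and LRU evicts each line just before it is needed again. Doubling the associativity at the same capacity lets all four lines coexist, which is the intuition behind the suggestion that higher associativity could help.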

Anandtech hints at (but does not disclose) a fourth “showstopper,” which leaves clock speed as the final factor. Here, there’s evidence of improvements in the latest Piledriver core at the heart of AMD’s Trinity. Piledriver uses what are known as “hard flops” and a resonant clock mesh to reduce power consumption and improve clock speeds. AMD claims Piledriver delivers a 10% reduction in dynamic power consumption compared to Bulldozer, with certain workloads improving by as much as 20%.

Putting it all together

The data suggests that Interlagos’ future in the datacenter is more hopeful than some have thought. The clock speed improvements built into Piledriver should allow that CPU to compete more effectively against Intel’s Xeons when it launches later this year, even if the other improvements to the CPU’s branch prediction and IPC have only a small net impact.

The client roadmap is a bit murkier. Here, the high L2 latency is a much greater factor, and there are precious few desktop apps that scale well beyond four cores. Higher clock speeds will still improve the competitive situation, but high cache latencies remain a millstone around the architecture’s neck.

It’s entirely possible that we won’t see these problems truly addressed until Kaveri, the 28nm successor to Piledriver scheduled for 2013.

AMD’s roadmap still shows Vishera (aka Piledriver) holding the top of the performance spectrum, but it’s Kaveri that integrates Steamroller, the third-generation Bulldozer core. None of AMD’s roadmaps currently show a server/high-end desktop Steamroller variant; it’s not clear how the next-gen core transitions into the product line. Kaveri will be the first AMD APU to integrate a GPU based on the company’s current 28nm hardware, but we suspect it’ll be a few more years before servers are able to consistently leverage the graphics cores in everyday workloads.

Still, there’s reason to be marginally more optimistic about AMD’s long-term ability to scale the Bulldozer architecture. We don’t expect it to challenge Intel’s performance any time in the near future, but some of the more pessimistic appraisals of its scalability may also turn out to be misguided.

Use of this site is governed by our Terms of Use and Privacy Policy. Copyright 1996-2015 Ziff Davis, LLC. PCMag Digital Group All Rights Reserved. ExtremeTech is a registered trademark of Ziff Davis, LLC. Reproduction in whole or in part in any form or medium without express written permission of Ziff Davis, LLC. is prohibited.