Analyzing Bulldozer: Why AMD’s chip is so disappointing


AMD’s Bulldozer is finally here, after years of development — and its performance is significantly worse than anyone expected. The situation is ugly enough that it may explain why so many executives left AMD over the past twelve months, and why the company was so tight-lipped about their departures. Bulldozer’s general performance has been widely covered; our goal here is to drill into why the CPU performs the way it does rather than covering it in a wide range of real-world scenarios.

Note: AMD’s Turbo Core and Intel’s Turbo Mode were disabled on all chips, in order to prevent them from adjusting the CPU’s clock speed and throwing off results. As a consequence, the results here will be lower than in a standard review, particularly for single-thread performance.

The first thing to understand about Bulldozer is that it leverages aspects of simultaneous multi-threading to combine the functions of what would normally be two discrete cores into a single package (AMD refers to this combination as a “module”). Each module contains what Windows identifies as two cores, but combining instruction scheduling and CPU resources has an impact on CPU scaling in multi-threaded tests when compared to the same programs running on “traditional” multi-core processors.

When AMD designed Bulldozer, it was aiming for a CPU that would be easier to ramp to higher frequencies while maintaining the same IPC (instructions per clock cycle) as its six-core predecessor. In order to hit higher clock speeds, AMD lengthened the CPU’s pipeline and increased latencies throughout the architecture. The concept of building chips for higher frequency has had a bad rap since the disastrous Prescott Pentium 4, and Bulldozer’s overall performance suggests AMD’s decision to take this route was not a good one. As things stand, the FX-8150 struggles to surpass Thuban in a number of tests, and its IPC has definitely taken a hit.

Before we dig into the CPU’s architecture, however, there’s an OS factor to discuss. According to AMD, Windows 7 doesn’t understand Bulldozer’s resource allocation very well. Windows 7 “sees” eight independent CPU cores, despite the fact that each module shares scheduling and execution resources. Sometimes it makes the most sense to spin threads off to idle cores before scheduling them on cores already busy with something else. Other times, it’s best to spin two related threads off to the same module. Windows 8 will apparently be much more proficient at scheduling workloads where it makes the most sense to execute them.

This issue has a practical impact on the CPU’s performance because of the way AMD’s Turbo Core is implemented. The new flavor of Turbo Core is meant to increase maximum clock speed by up to two speed grades if only four cores are enabled. Since Windows 7 doesn’t understand which cores to turn off, however, the CPU is less likely to increase its clock speed as high as it otherwise would. “Turbo” speeds were originally introduced by Intel as a way to squeeze more performance out of lightly-threaded or single-threaded workloads, but Bulldozer’s architecture makes those extra megahertz particularly important.

We checked the impact of Windows 7’s scheduler by measuring CPU performance in Maxwell Render 1.7 and Cinebench 11.5. Both programs allow the user to define a specific number of threads (four, in our case). The 4M/8C label means that all eight cores are active, 4M/4C means that all four modules are active, with one core operating per module, and 2M/4C denotes a dual-module/quad-core configuration. Both of these tests show a 4M/4C arrangement outperforming a 4M/8C system by roughly eight percent when four threads are used. This suggests that scheduler inefficiencies could indeed be hurting Bulldozer’s general performance in workloads that can’t take advantage of all eight cores.
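
Readers who want to reproduce the module-scaling comparison on their own systems can approximate the 4M/4C case by pinning a four-thread workload to one logical core per module. The sketch below is a minimal illustration using the third-party psutil library; the core numbering (logical CPUs 0-1 in module 0, 2-3 in module 1, and so on) and the benchmark command line are assumptions to verify against your own hardware, not guaranteed mappings.

    # Minimal sketch: approximate a 4M/4C run by restricting a benchmark
    # to one logical core per Bulldozer module.
    # Assumes logical CPUs 0-1, 2-3, 4-5, 6-7 map to modules 0-3 (verify first).
    import subprocess
    import psutil  # third-party: pip install psutil

    ONE_CORE_PER_MODULE = [0, 2, 4, 6]    # 4M/4C for a four-thread test
    TWO_MODULES_FULL = [0, 1, 2, 3]       # 2M/4C for comparison

    def run_pinned(cmd, cpus):
        """Launch a benchmark and restrict its CPU affinity to `cpus`."""
        proc = subprocess.Popen(cmd)
        psutil.Process(proc.pid).cpu_affinity(cpus)
        return proc.wait()

    # Hypothetical command line -- substitute the actual benchmark executable.
    run_pinned(["cinebench.exe"], ONE_CORE_PER_MODULE)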


It’s still possible we’ll see something game-changing if the yield issues at GF are resolved–but I wouldn’t hold my breath.

Anonymous

I think it’s safe to say we won’t.
After all, at most GF can decrease the TDP and improve the clockspeed.
However, what we’ve seen from reviews is that even with 1 GHz of overclocking, BD still isn’t very convincing against Intel’s offerings (at stock).
I’d say it’s unrealistic to expect GF to improve manufacturing so much that the CPUs can be clocked 1 GHz higher and still stay within their TDP envelope, which is what would be required to sell the CPUs at higher stock speeds (motherboards/cases/CPU coolers are not designed to handle more than the maximum TDP).
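
As a rough sanity check on that, dynamic CPU power scales roughly with frequency times voltage squared, and higher clocks usually demand more voltage. The sketch below is a back-of-the-envelope model only; the voltage figures are invented for illustration, not measured FX-series values.

    # Back-of-the-envelope dynamic power scaling: P ~ C * V^2 * f.
    # Voltages are hypothetical illustrations, not measured values.
    def relative_power(f_base, v_base, f_new, v_new):
        return (f_new / f_base) * (v_new / v_base) ** 2

    base_tdp = 125.0  # watts, the FX-8150's rated TDP
    # +1 GHz (3.6 -> 4.6 GHz) with a modest voltage bump:
    print(base_tdp * relative_power(3.6, 1.30, 4.6, 1.40))  # ~185 W
    # Even with no voltage increase at all, frequency alone adds ~28%:
    print(base_tdp * relative_power(3.6, 1.30, 4.6, 1.30))  # ~160 W

Either way, a chip that already fills a 125W envelope leaves little room for another full gigahertz at stock.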

Joel Hruska

The idea of them competing very effectively with Intel is…highly problematic. My goal is more modest. (All chip designations are hypothetical).

An FX-8170 at 3.9GHz / 4.5GHz TC with a 125W TDP would be fast enough to edge past the X6 1100T in nearly all of the less-flattering tests.

A hypothetical FX-8190 with a 4.2GHz stock speed / 4.8GHz Turbo at a 140W TDP would be fast enough to cleanly out-distance the X6, particularly if it could actually make more use of TC. (Tests indicate the current crop of BD’s don’t actually run in TC mode all that often).

I think the FX-8170 is predictable given long-term trends in manufacturing and how both Intel and AMD reliably reduce power consumption at a given clock speed over the long term. The FX-8190 might be possible depending on what’s really going on at GF.

AMD’s Q3 conference call strongly implied that GF’s production issues are related to 32nm as opposed to being inherent to BD. (The CEO didn’t even *mention* BD’s launch or that AMD was shipping a new CPU. Not even indirectly. He didn’t even comment that Interlagos would ship in Q4). There were questions as to whether AMD would shift its long-term plans in a way that would impact its commitments at the ‘other foundry.’

The production troubles the company talked about focused on Llano, desktop Llano in particular, and they *didn’t* claim that everything was resolved. We already know that AMD had trouble scaling Llano for the desktop–the A8-3850 that launched in June ran at 2.9GHz and had a 100W TDP. The A8-3800 from early October is a 65W chip–but cuts the core frequency down to 2.4GHz/2.7GHz TC.

It’s entirely possible that 32nm issues are causing both Llano and Bulldozer to draw far more power than they should. It would explain a lot.

Anonymous

Well, I figured “game-changing” would mean a bit more than just competing against their own old lineup. Sandy Bridge-E and Ivy Bridge are only a few months away. So competing with Sandy Bridge is not competing with Intel. It would just mean that AMD can more or less hold on to the position they are currently in (operating in the sub-$300 market).
Because once newer/faster CPUs from Intel arrive, price/performance will be redefined according to these new CPUs. Meaning that in a few months time it could very well be that even an FX at 4.2 GHz won’t be able to sell at $300 anymore with good price/performance.

I don’t really buy into the theory that 32 nm issues are what ails Bulldozer. There is a much simpler answer: it’s a 316 mm^2 chip with 2 billion transistors running at 3.6 GHz and more. Those specs have high leakage and poor TDP written all over them.
Llano is a different story.

I’d like to compare the situation to Pentium 4’s last iteration on 65 nm, the same process as the first Core2 Duo. A lot of people thought that Intel’s 65 nm process was failing, because Pentium 4 didn’t really benefit much from the die-shrink. TDP was still high, and the chip still seemed to leak a lot.
Then Core2 Duo came around, and had MUCH lower TDP, and much higher performance to boot. It even overclocked very well.
There was nothing wrong with the 65 nm process, the poor TDP was just inherent in Pentium 4’s design: a large chip running at extreme clockspeeds.

Now, I’m not saying that GF’s 32 nm is a resounding success… No, it obviously needs to mature a bit, it will get better with time.
I’m just saying that it’s not going to have a dramatic effect on Bulldozer. Bulldozer is bound to run into the thermal wall, just like Pentium 4 did. The problem is basically that the leakage/TDP is of an exponential nature, relative to clockspeed, where improvements in process will have more of a linear effect.

Joel Hruska

One of the charges leveled against BD is that AMD opted not to hand-optimize the design layout and instead relied entirely on automatic tuning. From Xbit: http://www.xbitlabs.com/news/cpu/display/20111013232215_Ex_AMD_Engineer_Explains_Bulldozer_Fiasco.html

“The management decided there should be such cross-engineering [between AMD and ATI teams within the company], which meant we had to stop hand-crafting our CPU designs and switch to an SoC design style. This results in giving up a lot of performance, chip area, and efficiency.”

Thuban is 346 mm^2 and 948M transistors for just under 10MiB of cache. BD is a bit smaller but has over 2B transistors for 16.5MiB of cache. The chip’s size and transistor count make a lot more sense if that information is accurate.
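
A quick, hedged way to see why the cache dominates that transistor count: a conventional SRAM cell uses six transistors per bit, so the data arrays alone account for a huge slice of each budget (tags, ECC, and redundancy push the real figures higher still).

    # Rough 6T-SRAM estimate of cache transistor counts (data arrays only).
    def sram_transistors(mib):
        return mib * 1024 * 1024 * 8 * 6   # bytes -> bits -> 6 transistors/bit

    print(sram_transistors(16.5) / 1e6)  # ~830M of Bulldozer's ~2,000M transistors
    print(sram_transistors(10.0) / 1e6)  # ~503M of Thuban's 948M transistors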

What it doesn’t give us is a way to separate manufacturing issues from design flaws. BD is nowhere near as bad as Prescott was upon debut–we had a Prescott-certified SFF system from Shuttle–one of their top-end launch systems–whose PSU actually ignited when we dropped a Prescott in it. I could tell whether I was using a Northwood or a Prescott in a testbed just by resting my hand on the desktop PSU when the chip was under load–and this was using then-upper-end equipment, not cheap Chinese knock-offs.

If XBit is correct, it also suggests that AMD should have a damned good idea where to start when it comes to fixing BD. Hand optimization won’t magically turn a turd into a tea kettle, but it’d be a good place to start.

Anonymous

Yes, I’ve read about that… but that is not something that can be fixed with manufacturing. For now, GF will just have to work with that 316 mm^2 behemoth.
It’s going to take AMD at least a year to hand-optimize the design and get it production-ready, I suppose. And I’m not sure if that’s even the proper answer. If you’re going to hand-optimize the design, you won’t be able to change that design (any changes you make have to be re-optimized). So it might be better to spend your time coming up with a better design. After all, if we assume the estimate of 20% larger chips with 20% worse performance is reasonable… Apply that to Sandy Bridge, and you’d still come up with a CPU that is considerably smaller than Bulldozer, still faster, and still with lower TDP. So there’s room for improvement in that area, probably more than just in hand-optimizing the current design.

As for Prescott… I disagree. Prescott may have been a power-hungry CPU at the time, but that was years ago. Even the most power-hungry Prescotts had ‘only’ 115W TDP (which was only about 5W more than the previous Gallatin-based Pentium 4 EE). These days it’s common to have 125-130W TDP for high-end parts.
When I had an AMD Thunderbird 1400, it also burnt out my PSU, and I had trouble cooling the CPU with the aftermarket coolers available at the time. Okay, it was considered a power-hungry, hot CPU at the time, but really it was ‘only’ 72W TDP, which is hardly shocking today.

The problem we had back then was more that the PSUs, motherboards and HSF solutions were often poorly designed, and as such, couldn’t handle these CPUs. In my case, the PSU had a few resistors mounted directly on top of each other. Normally that wasn’t a problem… However, because the PSU was stressed harder than with most other CPUs, and because the air inside the case was hotter as well (since the CPU put out so much heat), the PSU just did not cool down enough. At some point the insulation of the resistors started to melt, and eventually they short-circuited against each other.

Prescott stayed within its TDP just fine, as did my Thunderbird… However, that doesn’t mean that all hardware available at that time was actually capable of dealing with that TDP properly. Shuttle may have used a PSU that was not up for the job, or it may not have received the cooling required to do its job (a good PSU in a case with poorly designed airflow can still fail)… Or you could just have had a dud.

Those same CPUs will run without problems with today’s PSUs, cases and cooling solutions, since we’ve come a long way since then. Both Intel and AMD have released plenty of CPUs with more than 115W TDP in recent years, and they’ve all worked quite nicely.

Joel Hruska

Scali,

They pulled the Prescott certification for the whole line after launch.

Michael Schuette’s idle and load draws actually measure CPU power, not system power. Look at the P4 670 (Prescott) and 840EE. You’re right when you say that equipment wasn’t ready for them, but they were far more than just a tad worse than the chips they replaced.

Anonymous

So apparently they weren’t Prescott-certified then :P

And I think most people forget that Prescott was not just a die-shrink of Northwood. Intel reworked the entire chip to support x64.

I’m not sure what to make of those measurements. I sincerely doubt that Intel would ever sell any CPU that exceeds its rated TDP. The TDP is usually rather conservative anyway.
I don’t see any details on how they measured the CPU power consumption specifically.
There usually is no way to measure just the CPU draw. You’ll have to measure somewhere at the PSU or on the motherboard. Which means you’re also measuring the inefficiency of their components. In which case it’s not all that surprising that a Pentium 4 motherboard from 2006 has a less efficient VRM than a motherboard from 2011. Assuming 90% efficiency would mean that both Prescott and Smithfield are still within their rated TDP.
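
The arithmetic behind that last point, with made-up readings purely for illustration: if the meter sits on the 12V input to the VRM, the CPU’s actual draw is the measured figure multiplied by the VRM’s efficiency.

    # Measuring upstream of the VRM includes its conversion losses.
    # The wattage readings below are invented examples, not real measurements.
    def cpu_power(measured_watts, vrm_efficiency):
        return measured_watts * vrm_efficiency

    print(cpu_power(128.0, 0.90))  # 115.2 W -- a "128 W" reading can still be a 115 W-TDP chip
    print(cpu_power(145.0, 0.90))  # 130.5 W -- same reasoning for a 130 W part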

Joel Hruska

I talked to Michael, who confirmed that he measures 12V power consumption. Since the Core i7 family also draws power through the 3.3V and 5V rails, this means that his measurements don’t include ‘Uncore’ power consumption. I showed him your post and he responded:

“He is correct that you can only measure on the PSU or motherboard level unless you modify the boards, but you don’t really have to do that. Load efficiency has not really increased that much, most improvements have focused on improving VRM efficiency at idle.

It’s a known fact that Intel CPUs can greatly exceed TDP at least transiently. Details on how/why Intel splits CPU power draw across the 3.3, 5, and 12V rails for Nehalem is here: http://www.lostcircuits.com/mambo//index.php?option=com_content&task=view&id=44&Itemid=1 “

I can independently confirm that Intel defines TDP as a target dissipation capacity for “thermal design targets.” Company literature on the topic repeatedly states “TDP is not the maximum power that the processor can dissipate.” Intel’s Thermal Monitor technology is designed to prevent a CPU from exceeding its rated TDP by throttling back clock speed.

But this is all ancillary to the larger issue. Even if BD isn’t as bad as Prescott was relative to Northwood, AMD isn’t in a position to absorb the disappointment.

Anonymous

Yes, I know what the literature says… but I assume he measured an average, and not a maximum peak value.
In which case the logic is as follows:
The heat dissipated by a chip is roughly equal to the electric power passing through it (a chip doesn’t really perform any ‘work’ in the physical sense).
The TDP rating is the minimum amount of heat that the cooling solution has to dissipate. If the cooling cannot dissipate as much, then the system will slowly heat up. Therefore it is possible for a chip to temporarily exceed TDP… as long as the average remains below TDP, so the cooling can compensate.

Hence it follows that you shouldn’t see an average power draw above the TDP rating, because that would mean the system would run out of spec of the cooling solution.
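
A toy numerical version of that argument (the power trace below is invented purely to show the averaging, not taken from any real chip):

    # Illustrative only: samples that transiently exceed a 95 W TDP,
    # while the average stays under it, so the cooler keeps up over time.
    trace_watts = [110, 104, 98, 72, 68, 75, 90, 101, 83, 70]  # hypothetical 1 s samples
    tdp = 95

    average = sum(trace_watts) / len(trace_watts)
    print(average, average <= tdp)   # 87.1 True  -> within the cooling budget on average
    print(max(trace_watts) <= tdp)   # False      -> individual samples still exceed TDP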

Also, the efficiency of modern parts isn’t really that relevant. It is common knowledge that modern CPUs, especially Intel’s, generally stay well below their rated TDP. So even if modern parts weren’t more efficient, my theory still holds: the CPUs could operate within TDP, but you’re not measuring just the CPU.

Anonymous

I believe that there have been enough documented cases where the CPU “could” operate within the TDP margin under light to moderate load, but as soon as the system was under full load, the TDP would be exceeded and the CPU would be throttled after asserting the ProcHot signal and inducing duty cycle reduction, which is completely transparent to the end user since it does not show up as a clock speed reduction. However, benchmarks show significant drops in performance, and Throttlemark was for some time the favorite review tool for any Intel CPU.

“but you’re not measuring just the CPU.” Very true, there is a 10-20% overhead, but it becomes a matter of what is reasonable to “correct”, and showing the raw data including the VRM at least prevents false assumptions. We have measured “before VRM” and after the VRM (by desoldering the chokes and measuring the voltage drop across a series resistor) and the differences were typically within less than 10% under load. So yes, you are correct, but the distinction is not really relevant.

Anonymous

I think it *is* important.
Namely, the measurements are within about 10% of the rated TDP of the CPU.
If we are to assume that there is about 10% error introduced by the method of measurement, then at the very least we can conclude that the CPUs aren’t going to exceed the rated TDP drastically.

Now, in this light it is hard to believe that Joel’s Shuttle PSU was ‘destroyed’ by Prescott. If the CPU went over TDP at all, it was likely only going to be a few percent at most, and that should not be enough to blow out a PSU. Firstly, because the system designers shouldn’t pick a PSU that *exactly* matches the CPU’s specifications, but they should always overdimension it a bit, to be on the safe side.
Secondly, because the PSU builders will likely also have built in a small ‘safety margin’… so exceeding the specs by a handful of watts isn’t going to blow out the PSU. It will just run a bit hotter than it should, but just a few % shouldn’t just blow out the PSU like that. It should be able to handle such situations for short periods of time, as long as it has time to cool off in between.
Lastly, proper PSUs shouldn’t blow out at all, but should just shut down when an overload is detected.
I really think the PSU was just a dud.


Anonymous

One of the most thoughtful tech articles I have ever read. I think you deserve a medal or a little award or something for this article. Whatever they pay you, it’s probably not enough.

http://www.mrseb.co.uk Sebastian Anthony

He gets paid in love and adoration, like those found in your comment.

Darren Means

Yes, and they taste great with a little steak sauce… and some steak.

Joel Hruska

Stocklone,

Thank you very much for your kind words. Much appreciated.

overall a good article, but Joel definitely needs to investigate Bulldozer’s performance more thoroughly. BD is obviously designed for heavily threaded workloads, as is evidenced by its huge L2 and L3 caches; a more complete picture would have emerged if he ran the tests with more than 8 threads each, really cranking up the thread count to 12, 16, 20, maybe 24 threads.

i have an i3 2100 (dual core, HT), 8 gigs ddr3 1333 and a gigabyte h61 based mb. i ran the cinebench 11.5 64 bit multithreaded benchmark at the default (for my system) 4 threads and got a score of 2.22. i then upped the thread count to 6 and got a score of 2.33. switching to the 32 bit variant i ran the 4 thread test and got 2.33 and going to 6 threads i got

rerun the cinebench test and while you’re at it run the x264 benchmark but manually edit the vb script and change “threads=auto” in both the first pass and second pass lines to “threads=16” and then do the same test with “threads=24”, i have a hunch that we’ll end up seeing BD in a new light under those conditions.
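
For anyone who wants to try that thread-scaling experiment without editing benchmark scripts, the sketch below times a fixed amount of CPU-bound work at several worker counts. It is only a crude stand-in for Cinebench or x264 (it uses processes rather than Python threads to sidestep the GIL), so the trend matters more than the absolute numbers.

    # Crude thread-scaling probe: split a fixed amount of CPU-bound work
    # across 4, 8, 16, and 24 workers and time each configuration.
    import time
    from concurrent.futures import ProcessPoolExecutor

    def burn(n):
        total = 0
        for i in range(n):
            total += i * i
        return total

    TOTAL_WORK = 24_000_000  # fixed total, divided among the workers

    if __name__ == "__main__":
        for workers in (4, 8, 16, 24):
            chunk = TOTAL_WORK // workers
            start = time.perf_counter()
            with ProcessPoolExecutor(max_workers=workers) as pool:
                list(pool.map(burn, [chunk] * workers))
            print(workers, "workers:", round(time.perf_counter() - start, 2), "s")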

http://www.mrseb.co.uk Sebastian Anthony

Sounds interesting — will investigate :)

Joel Hruska

Cranking up thread counts doesn’t magically give a chip more performance or create new execution units out of whole cloth.

Chris K

i’m going to go out on a limb and guess that you don’t have a background in comp sci; x86 processors are out-of-order cpus, meaning they can execute instructions as the data becomes available; furthermore, both bulldozer and post-conroe intel processors can fuse instructions and execute 2 as 1 under the right conditions.

as i said i ran the 32bit benchmark on my i3 2100 and here’s the results:

Joel Hruska

I’m going to go out on a limb and say that you don’t have a background in performance analysis. CPUs with lousy cache latencies, a low IPC, and a small number of instructions issued per clock relative to the number of cores don’t magically gain performance or create new execution units out of whole cloth just because we throw more threads at them. In fact, the OS time required to spin off and manage all those extra threads can hurt performance.

Take my word for it.

Anonymous

Out-of-order execution and instruction fusing can only happen within a single thread.
So throwing more threads at the problem is not a way to exploit these core features.

HT is a way to exploit the out-of-order execution of a single core by feeding instructions from 2 threads. However, most software will already run with the same number of threads as there are logical cores.

More threads than (logical) cores are mainly useful when the threads are waiting for I/O (which happens at the OS level, not at instruction level, so nothing to do with ooo or instruction fusing).
Therefore especially in server tasks, having many more threads than there are cores can be beneficial for performance… once a thread goes into idle mode waiting for I/O completion, the OS can schedule another thread on the same core.
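
A minimal illustration of that point: when threads spend most of their time blocked, far more threads than cores still improves throughput. The sleep below is just a stand-in for a disk or network wait.

    # More threads than cores helps when the threads mostly wait on I/O.
    # time.sleep() stands in for a blocking disk/network call.
    import time
    from concurrent.futures import ThreadPoolExecutor

    def fake_io_request(_):
        time.sleep(0.05)   # thread is blocked; the core is free for other threads

    REQUESTS = 200

    for threads in (4, 32):
        start = time.perf_counter()
        with ThreadPoolExecutor(max_workers=threads) as pool:
            list(pool.map(fake_io_request, range(REQUESTS)))
        print(threads, "threads:", round(time.perf_counter() - start, 2), "s")
    # On any machine, the 32-thread run finishes several times sooner than the
    # 4-thread run, regardless of how many cores are present.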

Bunthoeun Has

” The 4M/4C label means that all eight cores are active, 4M/4C means that all four modules are active,”

is that a typo??

http://www.mrseb.co.uk Sebastian Anthony

Yes, will fix it now, thanks.

andrew yu

OMG! it looks like bulldozer and its descendants will NEVER EVER match intel’s sandy/ivy bridge and Haswell/Broadwell. They should just scrap bulldozer and start from scratch! it’s not even remotely competitive! SHAME ON YOU AMD! Look at those ex-AMD executives and ex-CEO on the run! AMD investors and shareholders should sue these people for mis-management!

Joel Hruska

Not to put too fine a point on it, but you’re basically saying: “Bulldozer is terrible compared to Intel’s products from 2014-2016!”

Well…yes. And the 65nm Core 2 Duo (Conroe) would absolutely suck compared to Sandy Bridge. Let’s not get ahead of ourselves. Furthermore, scrapping the architecture and starting from scratch is a non-option. BD was a new design from the ground up, and those take years. AMD doesn’t have the money to say “Oh well, this one isn’t what we wanted, let’s see what we get in 2015.”

andrew yu

well, from the way things are going, anything based around the BD core doesn’t look competitive with anything from intel for at least 2012-2013. The CPU design is flawed; AMD should be targeting more performance per core rather than ramping up clockspeed. While multi-core CPUs seem awesome, it’s incredibly difficult to write great multi-threaded applications.

We should have one or two cores that do REALLY well for single threaded applications, and power up other cores when multi-threaded applications come into play.

With the carbon tax and climate change issue, we should be conserving power and improving power efficiency, not gunning for more clockspeed. That’s lame from the way I look at it. Thank goodness AMD has Bobcat cores to fall back on!

http://pulse.yahoo.com/_D5CZDMABOPTGFJPQI3JE55N4NI Gary

Joel – Gary Silcott from AMD here. This piece starts off with two disclaimers that essentially invalidate the rest of your premise. (Below.) We were also very clear that i7 is not our competitive target and that scheduling will improve on Win8. The article doesn’t read nearly as brutally as the headline does; I only hope people read that far. I encourage your readers to expand their search to make sure some of the more moderate opinions are taken into account. Appreciate your consideration of AMD, as always. Thx.

“our goal here is to drill into why the CPU performs the way it does rather than covering it in a wide range of real world scenarios.”

Note: AMD’s Turbo Core and Intel’s Turbo Mode were disabled on all chips, in order to prevent them from adjusting the CPU’s clock speed and throwing off results. As a consequence, the results here will be lower than in a standard review, particularly for single-thread performance.

http://www.mrseb.co.uk Sebastian Anthony

Hey Gary! Thanks for stopping by.

What do the disclaimers invalidate exactly? The purpose of the story is to work out why Bulldozer seems to perform so badly. Scheduling is certainly one aspect, but Joel seems to have covered some other issues, too.

Joel Hruska

Gary,

There’s a difference between having to re-position a product that doesn’t perform as expected and deliberately designing a mid-range part from the ground up.

Calling Bulldozer “disappointing” isn’t brutal–it’s accurate. Perhaps more to the point, it’s a broad expression of what the chip means for AMD’s near-term prospects in the server and desktop markets.

I will readily admit, it’s possible that foundry troubles at GF are responsible for Bulldozer’s TDP and relatively low clock speeds. Improvements on that front could allow for substantially faster chips at better TDPs. Certainly it’s true that some chips seem to be dramatically better than others–Tech Report reported needing 1.46v to stabilize a chip at 4.6GHz, while mine runs at the same clock speed on just 1.32v with the same cooling solution.

None of that, however, changes the fact that BD’s cache latencies are atrocious and its IPC is significantly lower than Thuban’s. Improvements to the Windows scheduler won’t appear for at least 12-15 months.

Right now, the FX-6100 (6-core, 3.3GHz) and the X6 1100T are both $189 at Newegg. Asked to pick between them, I’d aim buyers at the X6, 10 times out of 10. Not only does it offer better single-threaded performance, it scales much more effectively.

I genuinely believe AMD can fix Bulldozer–but thus far, your company has given zero guidance on when to expect any sort of improvement. Until such improvements materialize, Bulldozer is a dubious value proposition wrapped in vague promises of a better tomorrow.

http://www.facebook.com/ajay.desai Ajay A. Desai

Gary,

I’ve been an AMD fan and shareholder for around 8 years and I need you and your company to understand two things.

1. The reviews of Bulldozer sent resounding shock waves into the core audience that has been waiting several quarters for Bulldozer to finally show up. The 8150 being comparable to the i5 lineup and not screaming victory and a new era of performance in computing disheartened your fans and, worse, betrayed their trust. I myself delayed an upgrade (with 400$ burning a hole in my pocket) and the day the reviews came out, I marched out to Microcenter to buy a 1055T X6. I don’t know if you hired Intel’s marketing department to claim the “fastest CPU” and “Guinness world record-breaking CPU” along with “MOAR COARS!” AMD marketed to its own core audience for years about the MHz Myth, IPC, and how to determine true performance, then tried to play the same trick.

2. You cannot, and should not, internally, attempt to skew this as a “stalemate”. Internally, you should be restructuring your team to rebuild the processor using “hand made” transistor placement and not the automated process that AMD attempted to employ. Your marketing team should be fired for perpetrating a fraud on your consumers, destroying your credibility and tarnishing the brand.

Had you been honest and upfront about the performance of the lineup, and claimed that this was the first step in establishing a new platform for next-next generation architecture and computing, you would not have suffered the backlash that you did. Claims would have met up with expectations. I know the Bulldozer architecture as a principle is the right pivot in the marketplace (as proven by Oracle’s SPARC similarities and the rise of multi-thread programming), but you failed to make a case with the informed consumer and you created a catastrophe with the uninformed media and stock analysts. Do me a favor and adopt this slogan company wide: “Under Promise and Over Deliver”. Perhaps then, one day, I will be able to sell my stock.

Joel Hruska

For the record, I think Ajay’s portrayal of this as a ‘betrayal’ is overblown. I don’t think AMD lied–I think AMD was truthful about what it expected to get and got something completely different.

Furthermore, I think anyone who believed Bulldozer would re-establish AMD as being on an even keel with Intel in a single bound was fooling themselves. The gap had simply grown too wide; the chance that Bulldozer (or any new chip) would, at a bound, close the 40%+ gaps that had opened between AMD and Intel in lightly-threaded workloads was ludicrous.

With both of those things said, I *do* agree that the emphasis needs to be on rearchitecting, not on trying to spin what was clearly a screw-up. BD was never supposed to turn out this way. That much is obvious–so let’s not pretend otherwise.

http://www.facebook.com/ajay.desai Ajay A. Desai

Sure, magnitude of betrayal is in the eye of the betrayed. And I’m really targeting AMD fans that have been waiting for Bulldozer, discussing its merits with great spirit on forums, and on October 12th got slapped in the face by Intel fans.

Many people (including ones I met at Microcenter) were purchasing Thuban cores or i7 CPUs. These people are your early adopters and influencers, and they shouldn’t have to contend with a negative sentiment within their peer base. I am a part-time system integrator in the premium ($1k-2k) computer range and I know that I will have a difficult time selling Bulldozer to any of my customers that can use Google. I don’t have the market research team to back up my theories, but Gary, you’re a Senior PR guy; either your polling or your gut instinct has to tell you that articles like this one across tech sites are not what you had in mind. As for my vigor and candor, I’ll repeat that I’ve been an AMD shareholder for the past 8 years; looking at that part of my portfolio is a tearjerker.

Anonymous

AJ, I also bought AMD stock (I think it was in 2000/2001) and have been holding it (what choice?) ever since. I currently own a few computers (AMD and Intel) and don’t need a new build, but was really pulling for Bulldozer/AMD from the perspective of stock value, as well as wanting to see (and benefit from) AMD/Intel competition and consumer choices (especially for us performance minded/centric). Unfortunately for us, this latest release from AMD means we’ll be holding the stock a bit longer : ) – and waiting for 22nm Ivy Bridge to show us the true next gen..

http://www.Something.Something.Darkside/ Loki Fenrir

while your stocks run at a loss… it’s a pity. AMD just released a less than acceptable product, stated part-truths about said product, and then expects to succeed.

“oh, where has my logic run to.”

I was lucky to have got google shares before the Android fad hit the market.

Anonymous

most of this mess is due to Windows 7.
how does it perform under Linux?

Joel Hruska

No. The Windows 7 issue AMD highlights only affects Turbo Mode scaling and is modest at best. But if you want to see Linux performance, I always recommend phoronix.com.

Anonymous

Indeed, the problem is not ‘in Windows 7’.
It is in every OS released to date, including linux, since no OS has specific logic in the scheduler to avoid this scenario, which is unique to BD.

http://pulse.yahoo.com/_QOTCSTXUKNPYJUZXW67MFQBVG4 John Smith

The Linux review of Bulldozer I read seemed to indicate similarly disappointing performance. Even under the Linux kernel, Zambezi didn’t do well compared to Sandy Bridge.

Anonymous

OK, let’s wait just a minute. I’ve read a lot of comments about how bad Bulldozer is, and I simply don’t see it. First, my FX 8120 loaded Win-7 from a clean install in 10 minutes! My i7 couldn’t even come close to that mark, nor could my 1090t. Yes, I have all three. The issues with Bulldozer, as I see them, are these:

1. Windows 7. The 8120 ran much faster before Windows downloaded all of the updates, including service pack one. This was noted on load times and non-synthetic benchmarks, and the article is right that it seems to be hamstringing Bulldozer quite a bit.

2. Synthetic benchmarks have never been favorable to anything but Intel’s chips. The press is rehashing the same issues that were stated back when the Phenom II X4 965 came out. The only mistake that AMD has made is not sticking to its guns. If you look at the article’s benchmarks, it beats the 1100t in almost everything and competes quite nicely with the i7 until overclocked, when it beats it.

3. Ram Ram Ram. Memory has a huge amount to do with Bulldozer’s performance. If you are looking at any test, including a synthetic benchmark, that uses less than a minimum of 8 gigs, then it’s completely underpowered. Perhaps Gary from AMD can shed some light on this, but Bulldozer seems to require at least 8 gigs of ram to achieve great performance and does even better at 16 gigs. I’ve not tried 32, but it stands to reason that if you have an i7 or Phenom II x4 with 4 gigs of ram, then that equates to 1 gig per core. So with 8 cores you need 8 gigs minimum. Again, I would like to hear Gary’s take on this.

4. Graphics card: If any of the benchmarks or tests are using anything less than a 6-series graphics card, then that will deal a 20-25% performance blow to Bulldozer.

Performance: Using Pinnacle Studio HD version 14 the Phenom II X6 1090t took 2 hours and 22 minutes to render a 1 hour HD video of me lecturing at 1080p. The FX 8120 took 1 hour 31 minutes to do the same in the exact same system.

Summary: Bulldozer is an enthusiast class processor that is well ahead of most software vendors at this time. It also seems to be designed to use enthusiast class products (graphics, memory, and motherboard), and going el-cheapo with it will render poor performance.

Joel Hruska

Jarod_A,

Storage throughput and whether or not you’re using AHCI will impact Windows install performance much more than any chip. That said:

1) Windows *updates* are not hamstringing Bulldozer. Windows 7 SP1 is required to enable AVX support in any case.

Furthermore, the performance “penalty” in question boils down to “Your OS won’t let our CPU overclock as far as we want it to” as opposed to “Our CPU is less efficient because your OS screws up scheduling.”

2) This wasn’t a review. I not only disabled Turbo Core, I chose tests that allowed me to control thread propagation. Anandtech and Tech-Report both wrote excellent full reviews of BD. Both of them demonstrate that the FX-8150 has a great deal of trouble consistently outperforming the X6 1100T in real-world, non-synthetic benchmarks.

3) Stuffing RAM into a system doesn’t make tests faster. Desktop programs are considerably more latency sensitive than bandwidth sensitive. That’s why adding an integrated memory controller helped AMD so much in 2003 and boosted Nehalem’s performance so much in 2008.

Furthermore, your attempt to subdivide the amount of RAM in a system into a “per core” breakdown shows a fundamental lack of understanding for how RAM works or how a chip accesses it. If what you were saying were true, each core would need an independent memory channel *and* a dedicated memory controller.

4) The only benchmarks that depend on the speed of your GPU…are GPU benchmarks. The idea that the video card somehow hurt the performance of a chess simulator that runs in a command line window is beyond absurd.


Anonymous

As long as AMD keeps living in denial and believes that everything is nice and dandy with Bulldozer, they (AMD) will become the next VIA of the processor industry in no more than 2-3 years from now; and we all know what happened to VIA, right? ;)

Lance Colton

thanks for the article and taking the time to respond to so many reader comments.
I bought one of these just so I can try and overclock it to 5GHz. Sounds like it needs a lot of power; I hope I don’t need a new PSU! I also hope AMD can work with developers to optimize compilers and software for this architecture; the “improvements” seem a bit disappointing so far. I think I can benefit from more cores anyway, as I tend to run a lot of background processes. I see 33 tray icons on my laptop; my desktop is probably worse :)

http://pulse.yahoo.com/_4RJXE5TU6OJ7VINBRJYG4WEN2U AlbertoL

So are they saying this cpu was designed for windows 8?

Anonymous

No, but this CPU could benefit from a special thread scheduling strategy. The same goes for Intel’s HT, but Intel’s HT has been supported since Windows XP, and received an update for the new Core CPUs in Windows 7.

For some reason, there is no support for Bulldozer in Windows 7 (AMD was already working on this concept for years, so they could have communicated with Microsoft… at the least they could run the HT scheduler on BD).
And for some reason, Microsoft is not going to release a patch for Windows 7.

Hence the first Windows to receive a special scheduler for Bulldozer will be Windows 8.
It’s not a magic bullet though. The current Win8 developer preview does improve performance somewhat, but not by the 30-40% it would need to catch up with Intel. We’re still talking single-digits.

Joel Hruska

AMD has made noise about Win 7’s sub-optimal scaling, but only as it applies to the application of Turbo Core.

Windows 7 may be more aware of the difference between an HT vs non-HT core, but running a test like Cinebench quad-threaded on a 2600K is still faster with HT turned off than with HT turned on. The reason it doesn’t really matter is that SB’s performance is good enough that no one really cares.

With BD, running 4M/4C is also faster than 2M/4C, and the chip’s performance is low enough that the extra boost is important. Unfortunately, AMD has no plans to sell BD chips in 4M/4C configs. The reason they talk about Turbo Core mattering for scheduling purposes is that they increase TC frequency to offset the 20% performance hit from running 2M/4C vs. 4M/4C.
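
The arithmetic behind that trade-off, as a hedged sketch assuming performance scales linearly with clock speed (an optimistic assumption): recovering an N% throughput deficit through frequency alone takes more than N% extra clock, which is why the scheduler-dependent Turbo headroom matters so much.

    # Extra clock needed to offset a throughput deficit, assuming linear
    # frequency scaling (optimistic; real scaling is usually worse).
    def required_boost(deficit_fraction):
        return 1.0 / (1.0 - deficit_fraction) - 1.0

    print(required_boost(0.20))  # 0.25 -> a 20% deficit needs a 25% higher clock
    print(required_boost(0.10))  # ~0.11 -> a 10% deficit needs ~11% more clock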

Anonymous

As I mentioned before, HT is a partitioning scheme on the OoO-logic. So it is expected that a CPU with HT enabled is slower than a CPU with HT disabled. Certain resources are split in half to accommodate two threads. Basically the OoO-window for each logical core is half that of a physical core (you see now why it is so important not to make the mistake of thinking HT gives you one physical core plus one logical core? It’s two equal logical cores on one physical core).
Therefore it is theoretically impossible to have the CPU running as fast with HT enabled as with HT disabled, with 4 or less threads.
This has nothing to do with the OS.
However, in practice the difference is very small (as long as the OS schedules properly, and doesn’t give an additional hit by forcing threads to share resources of one physical core while other cores sit idle), so there is no real need to ever disable HT.
Aside from that, most software that is capable of using multiple worker threads will also scale to more than 4 cores anyway, so they can take full advantage of all 8 logical cores, in which case HT will be the winner.
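
To make that trade-off concrete, here is a toy model; the 1.25x figure for two HT threads sharing a core is a commonly quoted ballpark, not a measured value, and the rest follows from simple arithmetic.

    # Toy model of 4 threads on an HT-capable quad-core. Assumes one physical
    # core runs two HT threads at ~1.25x the throughput of one thread.
    SINGLE_CORE = 1.0   # throughput of one thread on one physical core
    HT_PAIR = 1.25      # combined throughput of two threads sharing a core via HT

    best_case = 4 * SINGLE_CORE    # scheduler spreads 4 threads over 4 physical cores
    worst_case = 2 * HT_PAIR       # scheduler packs them onto 2 physical cores
    print(worst_case / best_case)  # 0.625 -> roughly a third slower in the worst case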

Anonymous

Actually, MANY highly threaded productivity program developers have explicitly stated that hyper-threading is not and will not be supported, for reasons that you’ve managed to either forget or factor out with broken “math”. I’m not sure where you get your ridiculously inaccurate percentages and complete disregard for the severe inherent penalties of hyper-threading.

“Therefore it is theoretically impossible to have the CPU running as fast with HT enabled as with HT disabled, with 4 or less threads.
This has nothing to do with the OS.
However, in practice the difference is very small (as long as the OS schedules properly, and doesn’t give an additional hit by forcing threads to share resources of one physical core while other cores sit idle), so there is no real need to ever disable HT.”

Oh yeah, that’s a really minor issue, isn’t it. What if, for some strange reason, you don’t have cores sitting idle… Well, that would be strange, wouldn’t it.

Guess what, some of us do use our CPUs to their full potential daily.

I’ve seen you posting on other “benchmarking” websites. You are akin to a Macintosh owner and blatantly ignorant. I also suspect the only time you’ve actually “fully” utilized your CPU is by running synthetic benchmarks. Winrar and child’s-play “HD” video encoding doesn’t count.

All I hear is whining about IPC. Honestly, it is the bleating of uneducated children. I’ve noticed that the writer of this article is completely unqualified and fairly blind to the actual benefits of bulldozer, which include the ability to potentially execute considerably more at a lower cost to the consumer. As always, AMD has delivered, especially for those of us that actually DO need an architecture for something other than games.

Imagine that.

Additionally, you need to educate yourself on the stability issues experienced by many a simple-minded intel “ocer” with hyper-threading enabled.
We needn’t discuss the inherently inferior “QPI” or intel integrated memory controller. It’s funny how you, or hardly any other intel owner (or perhaps minor functionary employee), never mention the unacceptable exceptions that must be made to “overclock” the i7 line(s).

Scali, since it seems that you have ample amounts of time to post various inaccuracies on benchmark sites throughout google, perhaps you should set a portion of that time aside for something other than writing bullshit and playing games.

Joel, your continual complete misunderstanding of how you OC an AMD differently than an Intel “OC” alone is unequivocal proof of insufficient knowledge. Please stop creating child-like bar graphs with meaningless benchmarks as evidence of anything other than your ignorance.

Evidence? even the k10 9550 hit 3.0ghz fairly easily, IF you knew what you were doing. Of course, I’m sure you didn’t back then, because you certainly don’t have a clue now.

Seriously, who even gives a damn about single threaded performance? Oh wait, I guess you didn’t actually need anything more than that net-burst P4.

Let me spell it out for you: If you take any amount of money (be it 200 or 2 million) and purchase however much AMD or Intel hardware you can for whatever that price or budget may be, you WILL have more purely mathematical total computational power by buying the most cost effective AMD chips.

Bulldozer is for people that need it, the i7 is for people that don’t.

The END.

Joel Hruska

E92m3,

The only thing sadder than blind faith that leads to wildly inaccurate assumptions is blind faith + a serious dose of vitriol. I’m only going to bother to respond to one aspect of your rant.

Potentially is a very good word to use. The FX-4100 isn’t a terrible deal at $129 compared to AMD’s old chips, but anyone who actually needs multi-threading performance would be better served by a six-core Thuban at $150. The FX-6100, at $189, compares directly against the 3.3GHz X6. It’s a terrible deal at that price.

You can rant about Intel as much as you want, but it doesn’t change the fact that AMD’s new products aren’t particularly compelling when compared to AMD’s *old* products. An FX-8150 at $189 would be an arguable value. At $279, it’s a joke.

Anonymous

I suppose I don’t have to reply to him at all. The Cinebench numbers speak for themselves.

Joel Hruska

Scali,

Interesting. Your words make me almost want to revisit the question using a test like Cinebench 11.5 and Windows 7 (CB11.5 scales much better above 4 cores than CB10).

Back in 2008, I handled the Nehalem launch for Ars Technica using Windows Vista. You can see CB10 performance here:

The most important point was that having HT on gave the best performance, obviously–but the performance difference in a quad-threaded test of CB10 shows the 965 without HT beating the HT enabled version by 28%. If what you’re saying is true, OS improvements in Windows 7 should have significantly closed the gap between the two.

Anonymous

Well, it would be relatively easy to take the scheduler out of the equation altogether: just run a single-threaded task on a CPU with HT on and off.
Since I’ve never heard anyone complain that single-threaded tasks are slow when HT is enabled, it follows that the same would go for N threads on an N-core CPU, as long as the scheduler does its job.

I know Windows 7 actually does a good job of scheduling for HT… As I posted before, this is what it does on an i7 860: http://bohemiq.scali.eu.org/Win7HTScheduling.png
If it does this with Cinebench with 4 threads (which I assume it will), then yes, I expect very little difference between HT on and off.

But it’s hard to say how much of the 28% was caused by poor scheduling.
I suppose a test would be in order… Or actually, multiple tests, and paying good attention to what happens in Task Manager. Since Vista doesn’t pay any special attention to what cores it schedules on, the results could be quite random. It might not always be worst-case.
In fact, if it was worst-case, then 28% is actually quite impressive… that’d mean that it ran on just 2 physical cores, yet it was only 28% slower than 4 physical cores.

In fact, I think I’ll just give it a try myself.

Joel Hruska

Installing Vista for a cause. Brave man! ;)

(I don’t recall which SPs were out for Vista at the time, but I don’t think any of them included HT optimizations, unlike Windows XP SP1, which definitely improved HT performance compared to baseline XP.)

As far as I know, HT *can’t* hurt single-core performance. The entire point of HT is to interleave workloads from multiple threads. If there aren’t multiple threads to interleave, the CPU shouldn’t care. I don’t recall ever seeing a test where HT hurt single-thread performance, even back when the tech was new.

It used to be possible for a badly-behaved program to create contention issues by demanding resources that were currently otherwise occupied, but that was a P4-era problem and, IIRC, more of a Win 2K issue.

Anonymous

Well, as I tried to explain earlier, when you enable HT, certain resources of the OoO logic are split in two (see the Intel Optimization manuals, which I linked to).
So you have smaller reorder buffers per thread, effectively giving you a smaller OoO-window. The CPU DOES care.

This could theoretically hurt single-threaded performance, if the code you are running has more stalls than the smaller OoO-window can compensate for.
However, in practice the buffers are large enough that you can’t really measure a difference.
But I suppose it is possible to construct a piece of code for this corner-case.
Anyway, as I expected, as long as the thread affinity is HT-friendly, this isn’t an issue in multi-threaded situations either.

The issue of contention has never been solved completely. This is partly the application’s responsibility as well. If an application spawns too many threads, the OS will have to schedule them somehow. It’s also nearly impossible for the OS to decide which threads should run on which cores to make the most use of HT (to maximize cache sharing for example, or combining an ALU-only thread with an FPU-only thread).

Anyway, with Windows 7 it appears that even the worst case scenarios for HT aren’t that dramatic. You may run into situations where disabling HT is faster, but if the difference is in the range of 4-10%, does it really matter? Applications scale much better with multiple cores anyway these days (like Cinebench 11.5, I had to artificially limit it to 4 threads, by default it runs with 8 threads, and then it scores much better than 4 threads without HT).
So usually you’ll only get more performance from HT, not less. And more than 4-10% better performance as well, so on average, HT is a win.
I wouldn’t recommend disabling it as a general rule… although there may be specific cases (if your system mainly runs a single application which happens to scale poorly with HT for whatever reason… and can not be fixed with manual setting of affinity).

Anonymous

So there’s still a difference, but HT disabled is only about 10% faster with 4 threads, not 28%. Much better anyway.
What I noticed though was that although the CPU usage was nicely at 50%, as expected, the threads didn’t seem to stick to a single core. You still saw all 8 cores getting a load, none of them 100% though: http://bohemiq.scali.eu.org/Cinebench4thread.png
So apparently the scheduling is not as perfect as I would like.

So, I decided to set the affinity to all even cores only, in Task Manager, and I ran the 4-thread test again, with HT enabled…
And I scored a magic 3.96! (I guess it’s in the margin of error that it scored higher than the non-HT run).
So that proves the theory that enabling HT doesn’t really hamper performance in itself, even though theoretically it might.
It’s the scheduling in the OS that hurts the performance of HT, not the CPU itself.

Anonymous

Well, as I said, I have yet to see BD delivering more performance than HT in *any* case. Not having the IPC hurts BD in most tasks…
And the few multithreaded tasks that it DOES win aren’t that convincing. It needs roughly twice the amount of transistors that Sandy Bridge has. As I said before, architecture-wise, it’s better to compare it against a 6-core Sandy Bridge CPU. Then the transistor count is more evenly matched, and it is pretty much a given that Intel’s 6-core CPUs with HT will beat BD at anything, even the most multithread-friendly scenarios.

Joel Hruska

Scali,

I *still* think you’re drawing stronger conclusions regarding core performance and transistor count than is necessarily warranted. The problem here is that Bulldozer is loaded up with enormous amounts of cache that aren’t necessarily doing much for the chip’s performance.

I’ve crunched some numbers based on the diagram AMD sent over. 41% of Bulldozer’s total die size is L2 / L3 cache. Given the chip’s lackluster performance, I’m entirely unconvinced that huge, slow caches are doing much to help.
I strongly suspect that re-architected, smaller caches with lower latencies could dramatically trim die size *and* improve performance. BD’s scaling factor of about 1.6x is in-line with what I expected. It’s not really the limiting factor holding the chip back.

Anonymous

Well, to that I say: why didn’t the geniuses at AMD think of that while designing this processor?
If AMD has any level of self-awareness, they should know that building large, dense caches with low latency and high associativity is not an area where they can compete. Traditionally AMD has always had relatively modest caches.

There must be a reason why AMD chose to go with this cache-configuration. Sure, the caches aren’t working as well as they should (compared to Intel, the latencies are ridiculously high)… then again, they don’t seem all that bad either (more or less in line with Phenom, despite the obviously more complex nature of the cache, where L2 is shared with 2 cores, and L3 is now shared with all 8 cores).

I don’t think that trimming caches is the answer here. It would only result in even lower IPC (clock speed is not the answer, as everyone knows after the Pentium 4… you need to improve IPC).

I think the answer is to get the caches to work better. Perhaps it’s time for AMD to ditch the exclusive caching strategy, and just follow Intel’s lead. That alone should lead to a simpler, lower latency cache. They should also be able to shave a few cycles off by just tweaking the cache logic, because compared to Intel, there’s still a lot to gain.

Other than that, my ‘save the Bulldozer’ strategy would be to ditch one module and go for a 6-core design instead (which already cuts some of that L2, and you could scale the L3 down to 6 MB accordingly… although I probably wouldn’t do that).
With the die space that frees up, I would beef up the cores themselves: back to a 3-ALU model, and perhaps the FPU can be beefed up as well (each core should already have one 128-bit FPU, but performance doesn’t indicate as much… so make it a proper dedicated FPU per core, and if die size doesn’t permit more, just have it perform 256-bit AVX as two 128-bit operations). Whatever it takes to get that IPC above and beyond Phenom levels and closer to Intel’s.

Joel Hruska

Scali,

No one knows. From following threads at RealWorldTech, Lost Circuits, etc., the general feeling is that BD’s cache configuration makes very little sense. A small, write-through L1 backed by an ultra-high-latency L2? Sharing the L2 within a module theoretically hides latency through multithreading, but not nearly well enough.

The current prediction is for an 18-cycle L2 on Piledriver vs. 20 cycles at present. That’s not very compelling. I suspect BD’s low IPC is partly caused by high cache latencies; the L2 latency is nearly twice as long as Phenom II’s.
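For context, cache-latency figures like these are usually measured with a pointer-chasing loop: walk a randomized cycle of pointers through a buffer sized to the cache level of interest, so each load depends on the previous one. A rough, illustrative sketch (not the methodology behind the numbers quoted here; the 512 KB buffer size and iteration count are arbitrary):

/* Illustrative pointer-chase latency probe. Builds a random cycle through a
 * ~512 KB buffer (roughly L2-sized) and times dependent loads. ns-per-load
 * multiplied by the clock in GHz gives an approximate cycle count. */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define BUF_BYTES (512 * 1024)
#define N (BUF_BYTES / sizeof(void *))
#define ITERS 50000000L

int main(void)
{
    void **buf = malloc(N * sizeof(void *));
    size_t *order = malloc(N * sizeof(size_t));
    size_t i;

    /* Random permutation so the hardware prefetcher can't predict the walk
     * (rand() is crude, but fine for an illustration). */
    for (i = 0; i < N; i++) order[i] = i;
    for (i = N - 1; i > 0; i--) {
        size_t j = (size_t)rand() % (i + 1);
        size_t tmp = order[i]; order[i] = order[j]; order[j] = tmp;
    }
    for (i = 0; i < N; i++)
        buf[order[i]] = &buf[order[(i + 1) % N]];

    void **p = &buf[order[0]];
    clock_t start = clock();
    for (long k = 0; k < ITERS; k++)
        p = (void **)*p;            /* each load depends on the previous one */
    clock_t end = clock();

    double ns = (double)(end - start) / CLOCKS_PER_SEC * 1e9;
    printf("~%.1f ns per dependent load (final p=%p)\n", ns / ITERS, (void *)p);
    free(order);
    free(buf);
    return 0;
}

Shrink the buffer to L1 size or grow it to L3 size and the latency steps between cache levels usually show up directly in the output.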

The problem with your suggestion (I suspect) is that BD was designed to share resources at the deliberate expense of high single-thread performance. A number of functions that were coupled in K10 were de-coupled in BD and there’s no simple way to re-optimize the chip.

I agree with expanding the FPU to handle 256-bit AVX instructions in a single cycle, but the FPU performance hit relative to Thuban wasn’t actually that bad. Thuban has 50% more FPUs than BD, but in some FPU tests I ran (and elected not to include here), the X6 1100T outperformed the FX-8150 by 10-30%.

I think the FPU tradeoff makes some sense given that BD is a first-gen part and the long-term plan is to shift FPU code to the GPU. The integer performance is where this chip drops the ball. Frankly, if AMD *can’t* significantly improve performance by adjusting cache size and latency, I’m not sure what they can do in any sort of timely fashion.

Anonymous

Well, my assumption is that they weren’t planning on making the L2 high-latency; they just couldn’t get it any faster than this.
They had to make it shared in order to make the whole SMT/module approach work, I suppose (technically, the two logical cores of HT share the L1 and L2 cache as well).
But making a shared cache efficient is difficult, and AMD hasn’t really had much experience with it yet.
Intel already had an L2 cache shared between two cores in the Core Duo, and later the Core 2 Duo (and in a sense the Pentium 4 with HT as well). But AMD… the only shared cache they’ve made so far was the L3 on Phenom, which wasn’t that much of a success.
I think there’s a lot of room for improvement there.

Well, I think you’re turning things around here… My suggestion is based on the *conclusion* that BD’s resource-sharing scheme doesn’t work. Yes, I know what it was *supposed* to do, but it failed on both counts: it delivered neither the intended multithreading performance nor the intended savings in transistor count. So now is the time to get practical and just fix the problems in the design.
However, note that I didn’t say anything about ‘re-coupling’ things, or about changing the module-based design in general. Just make the IPC higher by adding some execution units (and reworking other bottlenecks with the extra transistor budget you get from removing a module).

The problem I have with AMD talking about GPU processing is that they’re just talking about it. nVidia actually gets commercial software with GPU support onto the market. But Stream or OpenCL? Nothing except for some ‘toys’ like Folding@home or Bitcoin mining. AMD just doesn’t have the capability to push any technology. Intel and nVidia are surely not going to push OpenCL, so when exactly will this OpenCL revolution start happening? OpenCL has been around for a few years now, and Stream even longer. By the time OpenCL becomes even remotely useful, Bulldozer will be long gone.

Joel Hruska

Scali,

Seen this: http://www.pcper.com/files/review/2011-06-16/amd_fsa01.jpg

Those are AMD’s plans for CPU/GPU integration as laid out this past summer. I agree with you re: them just talking about it up until Llano. Since Llano, they’ve at least started to do some genuine work in the area — most of it focused on small-scale consumer-level stuff rather than NV’s huge push for GPGPU.

I agree with you 100% regarding NV’s long-term commitment to the field compared with AMD’s commitment thus far and the overall level of resources AMD has put behind it, which is to say: not much.

Regarding BD Changes: A lot of what you’re suggesting boils down to “bolt parts of Thuban back into BD.” That might be a good idea (the chip’s maximum dispatch of 16 instructions per clock is pretty low), but I sincerely hope AMD has a better roadmap than that.

Bulldozer was a very ambitious design. Dropping the second core and attacking the IPC issue in the way you suggest would mean abandoning a significant number of the architecture’s primary features. It’s hard to see that as a viable move forward — but then, BD itself doesn’t look all that viable long-term without some significant changes somewhere.

Anonymous

“bolt parts of Thuban back into BD.”
Not at all. Re-read what I said.

“but I sincerely hope AMD has a better roadmap than that.”
I’m afraid they don’t. As far as we know, Piledriver is just a slightly tweaked version of Bulldozer, going for 10-15% more performance. That would not be enough.

“Dropping the second core and attacking the IPC issue in the way you suggest would mean abandoning a significant number of the architecture’s primary features.”
I didn’t say to drop the second core. Re-read what I said.
I said that they should go from a four-module design to a three-module design. That would be three modules of two cores, six cores in total.

Joel Hruska

Scali,

My bad on that. I was quite ill the last few days. Sorry for misreading.

Colby Family

Joel,

Thanks for this article. I think this kind of analysis is required if AMD is ever going to compete on the *enthusiast* desktop. I use AMD myself: I have a dual-socket 6000-series SQL server in a SOHO setting, and it does what I need for a price I can afford. I have another single-socket VM server which I just upgraded to AM3+, and I’m hanging out waiting for the FX chips to become available again.

Having said all that, I came within a hair of going with Intel for the VM server. I have a single VM where per-thread performance is absolutely required, and I so wanted that Intel IPC for that VM. However, I also need multiple cores and lots of memory. It was a tough decision, believe me. In the end I was left wondering how the Intel Hyper-Threading “fake cores” would perform in a VM. What would happen if I assigned a single “core” and it was one of the hyper-threaded cores? Hyper-V does not tell me which “cores” are Hyper-Threading siblings, nor does it allow me to assign specific cores to a VM (AFAIK). This is one of the places where all the “benchmarks” just fall flat, and I have never seen this specific issue addressed. So I went with AMD (again), because at the end of the day it is a real core I am assigning, and I can predict what the performance will be across all of my VMs.
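As an aside on the “which cores are siblings” question: on the Windows host itself (this says nothing about how Hyper-V maps virtual processors, which is the part that remains opaque), the physical topology can be queried directly. A minimal, illustrative sketch:

/* Illustrative sketch: list which logical CPUs share a physical core on a
 * Windows host, i.e. which of the "cores" Windows shows are SMT siblings. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    GetLogicalProcessorInformation(NULL, &len);          /* ask for buffer size */
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION *info = malloc(len);

    if (!info || !GetLogicalProcessorInformation(info, &len)) {
        fprintf(stderr, "GetLogicalProcessorInformation failed: %lu\n", GetLastError());
        return 1;
    }

    for (DWORD i = 0; i < len / sizeof(*info); i++) {
        if (info[i].Relationship == RelationProcessorCore) {
            printf("physical core -> logical CPU mask 0x%llx%s\n",
                   (unsigned long long)info[i].ProcessorMask,
                   info[i].ProcessorCore.Flags ? " (SMT pair)" : "");
        }
    }
    free(info);
    return 0;
}

Each RelationProcessorCore entry covers one physical core; when the Flags field is set, the logical CPUs in that mask are Hyper-Threading siblings rather than independent cores.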

I absolutely agree with you that the AMD design team needs to do this same kind of analysis and get their IPC up. In my world (the server), real cores matter, but so does IPC. On the desktop, IMHO, all this stuff is fluff except for the enthusiast. We have so much CPU power now that mom and dad surfing the internet, or Joe Employee creating a Word doc, will never see a difference and won’t care about the number of cores or IPC. I would bet a day’s pay that only 0.01% of the systems out there ever transcode a video or do any of the other things the enthusiast so eagerly searches the benchmark results for. All that stuff just doesn’t matter for a non-workstation machine.

However, in these forums (and only in these forums) the enthusiast matters, and in these forums AMD’s reputation takes a beating.

It looks to me like AMD made a business decision to try and address both markets – server and desktop – with a single design.

At the end of the day, AMD’s entire gross income is less than Intel’s research budget. You can bitch and moan about AMD’s “lack of competitiveness” but IMO they do an absolutely remarkable job with what they have. I truly believe that AMD just can’t afford to address the enthusiasts.

But of course… what matters to your audience is (mostly) benchmarks and how many frames per second they get out of their latest favorite game. And so we have threads like these.

Joel Hruska

Colby,

“It looks to me like AMD made a business decision to try and address both markets – server and desktop – with a single design. ”

Pull up reviews of the original Athlon 64. It was absolutely a server-centric chip, and it excelled on the desktop.

The problem here isn’t that AMD built a server chip that simply doesn’t scale down to the desktop very well. If that were the case, AMD would be doing a lot more behind-the-scenes work to prep everyone for Interlagos. I hope servers turn out to be a good fit for the chip, but I don’t think things are going to turn out that way.

You’re absolutely right when you talk about the huge resource mismatch between AMD and Intel–but that doesn’t change the fact that BD doesn’t really meet the needs of the markets AMD wants to compete in.

http://pulse.yahoo.com/_QOTCSTXUKNPYJUZXW67MFQBVG4 John Smith

Let’s hope AMD can redo bulldozer. Remember the Phenom I? It took a speed hit due to an architecture bug that required a software fix, and essentially went down the rabbit hole to obscurity when AMD quickly released the Phenom II. If the executives have any brains they had better be working on a Bulldozer reboot.

I would like in the meantime to have two X6 processors shoehorned together on a single chip, 12 cores whooo.

Joel Hruska

The problem with Phenom (the original) is that it was a hot-running, oft-delayed design that didn’t hit its original goal of competing with Core 2 Duo. Ironic that you bring it up — the issue here is that AMD has been driven out of the server market due to ongoing problems like this.

January 2009: AMD releases Phenom II, which competes effectively with C2D but can’t match Nehalem. It’s obvious, even then, that Shanghai/Deneb aren’t going to be powerful enough to put the company back on an even keel against Intel.

October 2011: Nearly three years later, Bulldozer stumbles again. It doesn’t deliver anything like the performance targets AMD wanted and certain aspects of its design (the huge caches) undercut the savings advantage of combining so much core logic.

The problem with saying: “Just wait for Piledriver” is that AMD is even farther behind the performance curve now than they were after Phenom launched. AMD finished 2007 with something like 30% of the server market. They’re currently down to 4.5%. Worst of all, Piledriver will have to be substantially better than BD just to give the new core a solid price/performance standing against *Thuban.*

I’m starting a new comment to prevent the absurd right-justification that occurs in this posting system.

It’s been seven years and I didn’t write the article, but here’s the original link. http://web.archive.org/web/20040510010524/http://www.sudhian.com/showdocs.cfm?aid=494&pid=1845

The chip *was* overclocked when the PSU burned out–but given that we’re discussing Prescott’s overall TDP and temperature, the other data points still demonstrate the trend. The PSU in question was a 220W unit, which, at the time, was a high-end option. Shuttle later opted for a different, 250W unit in that line of Prescott hardware.

When Intel shipped the P4 670, I tested it in an open-air configuration using the stock, approved heatsink. The chip would trip its own thermal throttle under load in a room where the ambient air temperature was below 70°F. You can find additional anecdotal info on the chip’s heat output here: http://www.theinquirer.net/inquirer/news/1036882/is-intel-prescott-p4-hot-handle

I promise you, there was a lot of email exchanged regarding what PSUs were and weren’t Prescott-capable between Intel, reviewers, and various manufacturers. There were genuine widespread issues and certification problems, and it’s no accident that Intel debuted its BTX form factor in 2004-2005.

http://pulse.yahoo.com/_OJS2BF5UYJBHD6HF4RJ3RXS5SM Ravi

Joel, thanks for the great article. This is my first time reading anything on extremetech (usually sticking to Tom’s Hardware or Anandtech), and I’m incredibly impressed not only by the article, but the quality of the comments and the fact that you are seriously engaged in the conversation with your readers. That’s something I don’t see a lot of. Looks like I’ll be coming back to this site more often.

Joel Hruska

Thank you, Ravi. I very much appreciate that. You may want to check some of the more recent developments for AMD:

OK, I am not going to say Bulldozer sucks or that it’s the greatest chip in the world, but those who say it sucks in gaming do not game like I do. I like to two-box or three-box games, meaning I play two or three games at the same time, and I have a Bulldozer chip that lets me do this. I cannot do this with my i7, even though I wish I could, because it’s a sweet chip as well. But for those who say it sucks: it does things my i7 can’t do.

http://profile.yahoo.com/DN64GE2AMC7L6Z7SVCCRGL7B7M nipinthebud76

What people are not realizing is that benchmarks are not real-world work. People are forgetting that four and six are more than two. The point I’m making is that the average PC does not do benchmark work but real-world work. So saying that Bulldozer is crap must mean either you don’t own one or you’re just too stupid to realize that six and eight are more than four, with a lot of cache.

Victor Custódio

This can also happen because of the single floating-point unit shared by each module. If you have 50% integer and 50% floating-point operations, you will see roughly a 25% average performance loss on that process.

michael ray

I know this article was written quite some time ago, but I am glad it is still up to read on this website. I like the idea of thorough research to build an understanding of the topic, since the lawsuit is still a hot issue.
Sorry for the incoherency, but I appreciate the work you did in getting this put together.
You deserve a cookie and a t-shirt.

Joel Hruska

Thank you very much!

chris cranmer

SiSoft Sandra is an undecipherable mess.

