AMD confirmed that the power management of the Bulldozer core is an improved version of the power management that is part of the “Llano” CPU. Just like Llano, Bulldozer has a Digital APM Module. The APM module samples a number of performance counter signals, and these samples are used to estimate dynamic power with 98% accuracy. Now combine this power estimate with Bulldozer's power gating at the module level and vastly improved clock gating, and you can start to understand what is possible.
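To make the idea of a counter-based power estimate concrete, here is a minimal sketch. AMD has not disclosed the actual model in this article, so the weighted-sum approach, the event names, and the per-event energy weights below are all assumptions chosen for illustration only.

```python
# Toy activity-based power estimate.
# ASSUMPTION: event names and weights are invented for illustration;
# they are not AMD's actual APM signals or calibration values.
HYPOTHETICAL_WEIGHTS_NJ = {  # energy per event, in nanojoules
    "int_ops": 0.5,
    "fp_ops": 1.0,
    "l2_accesses": 2.0,
}


def estimate_dynamic_power_w(event_counts, interval_s):
    """Estimate dynamic power (watts) from event counts sampled over interval_s seconds."""
    energy_nj = sum(
        HYPOTHETICAL_WEIGHTS_NJ[name] * count
        for name, count in event_counts.items()
    )
    # Convert nanojoules to joules, then divide by time to get watts.
    return energy_nj * 1e-9 / interval_s


# An integer-only sample: the FPU counter stays at zero,
# so it contributes nothing to the estimate.
sample = {"int_ops": 2_000_000, "fp_ops": 0, "l2_accesses": 100_000}
print(estimate_dynamic_power_w(sample, interval_s=0.001))
```

The point of the sketch is only that idle units add nothing to the estimate, which is what lets the chip know it has headroom left under the configured TDP.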

Bulldozer reduces the number of active and power consuming circuits by vastly improved clock gating

If your application runs only one or a few threads on your 8-module, 16-core Interlagos CPU, several of those modules might be power gated. Or if you run integer-only threads, the fact that quite a few unused parts of the module (i.e. the FPU) will be clock gated might be enough to stay under the configured TDP. In those cases, it won't be necessary to limit the clock speed. And that is really great, especially in the real world.

In the real world, only a few HPC applications behave like the SPEC CPU rate benchmarks, which spawn threads across all cores. Most server applications do not fully utilize all available cores all the time. Sometimes, only one thread will be really critical and the perceived application performance will depend on it. A little bit later several threads might demand CPU power (but not all cores will be busy). Only a certain percentage of the time are all the cores used. That is exactly the reason why the cheaper Magny-Cours makes so much sense for HPC applications, yet struggles to keep up with the higher clocked, higher IPC Xeon Westmere cores when running OLTP and ERP applications. Putting a power cap on a Magny-Cours means even lower frequencies, and as a result even higher response times (as we have measured here).

Thanks to the power consumption measurements added to the CPU, Bulldozer will run most server applications at full speed unless you lower the TDP too far. (Obviously, if the TDP is lowered enough, the CPU will not be able to operate at higher frequencies, thus degrading response times too.) The maximum throughput will be a little bit lower, but most server applications almost never run at maximum throughput. In fact, maximum throughput only matters for HPC applications and benchmarks. For real human users, response times are the only thing that matters.
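The argument above can be sketched as a tiny power-cap governor: pick the fastest clock whose estimated power still fits under the cap. Everything numeric here (the frequency steps, the per-module wattage, the fixed uncore cost) is a made-up assumption, and the linear power model is a deliberate simplification, not a description of how Bulldozer's hardware actually decides.

```python
# Toy power-cap governor.
# ASSUMPTION: all wattages and frequencies are hypothetical numbers,
# not Bulldozer specifications.
P_STATES_GHZ = [2.6, 2.3, 2.0, 1.7, 1.4]  # hypothetical, fastest first
WATTS_PER_MODULE_AT_1GHZ = 5.0            # hypothetical cost of one active module
UNCORE_WATTS = 20.0                       # hypothetical fixed cost


def estimated_power_w(active_modules, freq_ghz):
    # Power-gated modules are assumed to draw nothing; active ones are
    # assumed to scale linearly with frequency (real chips also scale
    # with voltage, which this sketch ignores).
    return UNCORE_WATTS + active_modules * WATTS_PER_MODULE_AT_1GHZ * freq_ghz


def pick_frequency_ghz(active_modules, cap_w):
    """Return the fastest P-state whose estimated power fits under the cap."""
    for freq in P_STATES_GHZ:
        if estimated_power_w(active_modules, freq) <= cap_w:
            return freq
    return P_STATES_GHZ[-1]  # cap too low: fall back to the slowest state


# With a 90 W cap, a lightly loaded chip keeps its top clock,
# while a fully loaded one has to step down.
for busy in (2, 4, 8):
    print(busy, "active modules ->", pick_frequency_ghz(busy, cap_w=90.0), "GHz")
```

Under these assumed numbers, only the all-modules-busy case loses clock speed, which mirrors the claim that response times at moderate load are barely affected while peak throughput drops a little.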

The beauty of this new power cap system is that in normal circumstances (e.g. the server is running at 40-70% load), the response times will hardly be any longer. At the same time, the administrator can make sure that the server cluster does not exceed the capacity of the cooling equipment and the power lines.

This TDP Power Cap technology could be very interesting to small and medium businesses too, and not only to owners of large server clusters. TDP Power Cap could be a way to make sure that your colocated servers never exceed the maximum amount of amps allocated to you, and as a result you will not have to pay unexpectedly high electricity bills. However, whether or not this ideal world of low response times and low electricity bills will become a reality for Bulldozer server owners will also depend on the availability of a good and decently priced management software tool that allows the administrator to configure the TDP on all servers simultaneously.

On a standard server, you will get a section in the BIOS that allows you to tweak the TDP in 1W increments (or a maximum of 64 power settings), a good step forward compared to the current p-state setting. But to control a server cluster in an efficient way, good management software is needed. Currently, you have to buy all your servers from the same vendor (HP for example) and then pay for management software such as HP's Insight Control software. To really unlock this technology, AMD or one of their partners needs to make sure this kind of software is widely available. Some open source code, perhaps?
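As a rough illustration of the arithmetic such a tool would have to do, the sketch below turns a colocation power allocation into a per-socket TDP cap in whole watts. The function name, its parameters, and the example figures are all hypothetical, and the calculation ignores power supply losses and treats the TDP cap as if it were wall power, which is a simplification.

```python
import math


def per_socket_cap_w(circuit_amps, volts, servers, sockets_per_server,
                     non_cpu_watts_per_server):
    """Split a power allocation into a per-socket TDP cap, in whole watts.

    ASSUMPTION: hypothetical helper for illustration; ignores PSU
    efficiency and assumes every server has the same non-CPU draw.
    """
    total_w = circuit_amps * volts
    cpu_budget_w = total_w - servers * non_cpu_watts_per_server
    if cpu_budget_w <= 0:
        raise ValueError("allocation does not even cover non-CPU power")
    # Round down to whole watts so the sum of all caps never exceeds the budget.
    return math.floor(cpu_budget_w / (servers * sockets_per_server))


# Hypothetical rack: a 16 A, 230 V feed shared by ten dual-socket servers
# that each draw 150 W outside the CPUs.
print(per_socket_cap_w(16, 230, servers=10, sockets_per_server=2,
                       non_cpu_watts_per_server=150))
```

Rounding down matches the 1W granularity mentioned above; pushing the resulting value to every server is the part that still needs vendor or open source tooling.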

59 Comments

"According to leaked product positioning slides, Zambezi is aimed to fight against Intel's Core i5 and i7 lineups. Zambezi will feature up to eight cores, which is twice as many as i7-2600(K)'s four cores. AMD said that they won't join the Hyper-Threading club and they will deliver as many physical cores as Intel delivers physical and virtual cores combined. It looks like AMD is keeping their word, though they're only delivering half as many "FP/SSE cores". "

With hyperthreading and now Bulldozer's double integer core/shared FPU design, core counts are becoming an increasingly difficult metric to compare. It's important to note that while Bulldozer has doubled the number of integer cores compared to Istanbul, each integer core is actually weaker since Bulldozer only uses 2 non-symmetric ALUs and 2 AGUs compared to 3 symmetric ALUs and 3 AGUs in Istanbul. Perhaps other architectural efficiencies can make up the difference, but I wouldn't be surprised if clock-for-clock each of Bulldozer's integer cores is slightly slower than Istanbul's. I believe Sandy Bridge's integer performance is clock for clock better than Istanbul, so Bulldozer likely needs very well threaded code for its doubled integer cores to shine.

FPU resources look to be beefed up from 3 units in Istanbul to 4 units in Bulldozer. Compared to Sandy Bridge, Intel's big advantage is native 256-bit AVX units compared to Bulldozer, which only has 128-bit FP/SSE resources and needs to split 256-bit AVX instructions, halving performance. So if Intel can convince developers to quickly adopt 256-bit AVX, Sandy Bridge should have a pretty large SIMD advantage.

dude, you just sound like a horrified Intel fanboy. "convince developers to adopt 256bit AVX". Then what about FMA3 and FMA4, which intel doesn't even have.....

A single BD Module can handle a 256bit AVX or can decide to split into 2 x 128 for each core. It is a decision from AMD to go that way just like intel decides to have a 256bit full for a PH + HT core..... 2 x 256 logic would just need more die space without usage, just like the choice to go for 2 ALU/AGU while the usage of 3 is almost no gain in server loads besides benchmarking....

While the FPU 128+128 might be a bit slower we are talking here about perhaps 2-3% since all other parts like cache and memory are shared for a single module and very neglictable difference unless you are a fanboy which is obvious.

I believe Bulldozer supports FMA4, but not FMA3 due to Intel flip-flopping on which one they'll support at the last minute breaking commonality. While FMA4 is a great capability to have, you pointing out that Intel doesn't have it is the concern. AVX could see faster adoption because it's supported by both Bulldozer and Sandy Bridge.

"While the FPU 128+128 might be a bit slower we are talking here about perhaps 2-3% since all other parts like cache and memory are shared for a single module and very neglictable difference unless you are a fanboy which is obvious."

I mention AVX performance because I'm under the impression that Bulldozer gangs its two 128-bit FMACs together to do 1 AVX per module per cycle while Sandy Bridge has 3x256-bit AVX units per physical core. Sandy Bridge's AVX units are non-symmetric and there are no doubt other factors that will impact performance so it won't be a 3x performance difference, but I'd think it'd be more than 2-3% given the big difference in raw processing resources.

my 2-3% was only the difference between a single 256 vs 2 x 128, not against the intel part... lets see first how much AVX will be really used and how much will end up being 128 bit... doesn't mean something which is 256bit is always better than 128bit.

I believe I heard once that Intel's implementation can execute either one 128-bit or one 256-bit instruction per clock. Bulldozer's fused implementation may give up on AVX throughput, but only AVX.

'It's important to note that while Bulldozer has doubled the number of integer cores compared to Istanbul, each integer core is actually weaker since Bulldozer only uses 2 non-symmetric ALUs and 2 AGUs compared to 3 symmetric ALUs and 3 AGUs in Istanbul.'

why does everyone get hung up on this? yes, phenom had 3 ALUs and 3 AGUs. big deal! it could only complete 3 instructions per clock- any combination of ALU and AGU instructions but no more than 3. so how often could it process 3 ALUs consecutively? AMD has said that removing the 3rd AGU won't hurt performance and core 2, nehalem, and sandy bridge all have 2 AGU's. Bulldozer can complete 4 instructions per clock- same as core 2, nehalem and sandy bridge. granted, they all have 3 ALU's available, but how often is the extra one used?

Got the kids a Phenom II X6 1055T based PC for their games like GTA and just for fun ran some scientific FP-oriented tests on it - parallel algebra codes and some single-core ones. Was shocked that at its 2.8GHz stock clock it is twice as fast as my Intel processors overclocked to 4GHz. Is this what you guys get too? Kind of contradicts all these game- and office-oriented benchmarks where Intel is always on top.

So i'm waiting for these 8-core 32nm chips in the hope to drive them to 4.5 GHz and get an additional factor of 2.