There's a good chance that the majority is correct and I am in fact wrong, but... Well, that's just how I read their response. I feel there is a good chance of some more confusion afoot, much like the percentages being thrown around in the original article. Reply

From what I understand, AMD figured out how to reduce core size by 25% without impacting performance.
Each pair of cores will now share the same fetch/decode units (using SMT like Intel), and also the same FP unit (doubled to 256 bits, so actually it's two 128-bit units), but with separate integer units as before. So two cores share half of their logic, meaning 2 cores now use 150% of the die area of one core, or in other words each core saves 25% (75% * 2 = 150%).
But it will still have half the FP throughput of Sandy Bridge, and half the fetch/decode bandwidth, because two cores share one front end instead of having one each.

Nevertheless it looks like a wise decision in terms of power/performance. So nice, but it won't give AMD the performance crown.

From the way it seems, I'm afraid the badly delayed, highly anticipated, much-hyped Bulldozer, AMD's only hope to retake the performance crown from Intel, will fall short of expectations. Unless they really come up with a competitive and powerful processor, I'm afraid the AMD we knew from the A64 days will remain history until the next major architecture after Bulldozer, which could well be five years or so after 2011. AMD will be a budget player until then. Reply

Bulldozer seems too late against Intel's upcoming offerings.
An eight-core Bulldozer will be clearly slower than an eight-core Sandy Bridge, in both integer and FP.
This implementation seems designed to fight Nehalem (two 128-bit units, both usable by one core alone).
Sandy Bridge will have twice the FP power and threads per die,
assuming the article is right.
The only way to be competitive is to count a single "block"
as a monolithic core. Intel can answer with 50% more cores/die,
delivering better overall integer and FP performance.

We still don't know what the integer performance of the
Sandy Bridge integer unit will be. I believe it will be higher than in Nehalem.
Reply

I don't buy this claim that FP will be eliminated from CPUs in favor of doing it all on a GPU. There are too many situations where FP is still needed on a per core basis with a primarily integer load. About two minutes after the first systems ship with no integrated FP in the CPU (Bulldozer SX?) there will be engineers thinking themselves clever by proposing to boost FP performance by integrating it into the CPU die!

What will happen instead is the FP and onboard low-end graphics solution will merge. The monster GPUs will be there for high-end FP as needed and the die area consumed by the FP and IGA minimized so as to be beneath concern. FP may be external to the cores but they won't be sold without at least one FP/IGA module in the mix. That way you have a chip that is versatile for a wide range of different boxes but also cost competitive. Reply

You are right.
The eight core Sandy Bridge will have over 200 Gflops Double Precision with a power budget of 130W in 32nm and 95W in 22nm.
In these conditions, the "dream" of throwing the FP unit out of the CPU is only an Nvidia desire..... to survive.

Clearly AMD are providing enough CPU power for OpenCL, etc, to run "well", but if you need "serious" power then you'll plug in an RV900 series GPU that will probably try to get near 1TFLOP in DP in the same timeframe. With OpenCL, the exact same code will run (AMD's OpenCL driver can switch between CPU and GPU without any application changes). Reply

It looks like AMD is engaging in another war of words instead of performance. Remember when they claimed ownership of what was or was not 'dual core' and 'quad core'? While AMD declared the C2Q line 'not true quad-core', the Intel product was actually shipping and available for use a year before AMD's 'true' chips came out, with less performance and some serious bugs for added enjoyment.

This gets tiresome to the point where I hold AMD in great suspicion when they lead with a new official vocabulary instead of the product and how it actually performs.

I truly don't give a damn about your modules, AMD. Take your new architecture and define the smallest portion that could be sold as a discrete product to run a PC. That is a core. It doesn't matter how many threads it runs. It is a core. If we cannot have meaningful definition to which all companies adhere, the conversation is dead and all that remains is useless PR blather. Reply

It's good to see that the existence of AMD is healthy for competition, progress and innovation. The existence of AMD is even good for the Intel fanboys. Intel CPUs wouldn't be half as fast today if there weren't any competition on the market. Reply

An AMD representative said that the picture you provided is one core, but it has two integer units. These two integer units are the hardware basis of a feature similar to Intel's Hyper-Threading. The following picture is a dual core.

This is all assuming the Bulldozer core is for their enthusiast or high-end setups. For the low end, these pictures will not include two integer units. Though it all depends on what AMD has in store for the Bulldozer microcode, because it could go one way or the other, or both, by exposing a switch in the BIOS or software, but it is too soon to tell. Reply

Looks like the AMD CPUs are slowly getting structures "borrowed" from ATI GPUs, which is very interesting. The traditional CPU structure from the seventies is on the way out. The future looks really exciting! Reply

Fruehe LOLed this into 80% more performance for 5% more area (ooops!), and now this meme has taken hold.

It's wrong. Each module is 50% larger to get 80% more integer throughput, and even adding in all the "uncore" portions on a chip does not get this number anywhere NEAR 5%. (The uncore is nowhere near 10x the area of all the core area combined) Reply

Because JF said, distinctly and repeatedly, he was talking about total die size, while the 50% is referring to the area of the module, sans L3$/IMC/NB/etc. And more specifically the Int-core area, which clearly doubles when going from 1 Int-core to 2 Int-cores.

So, while to get up to a 180% increase in integer performance you need to double the area dedicated to integer operations (the added core being 50% of the resulting integer area), relative to the total die size that may well take only 5% of the die space.

A single integer core (just the unique per-core parts, not the shared functionality in the module) takes up 5% of a typical quad-core Bulldozer die (including uncore and L3)? Or maybe even an octo-core die.

Also assume rounding up, down, or to nearest. Could be 47% and 5.4%, etc.

5% always sounded very unrealistic, as that would mean a remarkable increase in IPC for such a small increase in 'core' size.
If it were only 5%, we would expect to see a native 8-module version for the desktop, looking purely at die size or cost. But at 50% extra, it means that, all other things being equal, 4 modules = 6 'simple' cores in space terms, ignoring the uncore.

I'm still not 100% clear on the 50% thing. If a die is 50% cores and 50% uncore and measures 100 sq mm, then when we make the cores 50% larger, the cores become 75 sq mm and the die becomes 25% larger, or 125 sq mm. Or is there another portion of the module/core that is excluded, so the total size increase is less than 25 sq mm?
Reply
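The die-size arithmetic in the question above can be sanity-checked in a few lines. This is a back-of-envelope sketch using the hypothetical 100 sq mm die with a 50/50 core/uncore split from the post, not real Bulldozer numbers:

```python
# Hypothetical numbers from the post above: 100 mm^2 die, half cores, half uncore.
die = 100.0
core_area = 0.5 * die      # 50 mm^2 of cores
uncore_area = 0.5 * die    # 50 mm^2 of uncore

# Grow only the core area by 50% (the "50% bigger module" reading):
new_core_area = core_area * 1.5          # 75 mm^2
new_die = new_core_area + uncore_area    # 125 mm^2

growth = (new_die - die) / die
print(f"new die: {new_die} mm^2, growth: {growth:.0%}")  # new die: 125.0 mm^2, growth: 25%
```

So with a 50/50 split the whole die grows by exactly 25%; a smaller core fraction or extra excluded module parts would shrink that number.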

That 50% sounds much more realistic.
On the K10 die shot http://en.wikipedia.org/wiki/File:K10h.jpg you can see
that doubling the integer pipeline, data cache and load/store unit is clearly more than 5% :P.
The thing is that the L2 and L3 caches are in the Bulldozer module picture, and they are several times bigger in die area than the core. And there are other things in the uncore, like the memory controller and HyperTransport. Whole die vs. core is quite different from whole die vs. module. They say 50% more core area invested, not module or die area. Reply

[quote]I think the difference between 50% and 5% might be the difference between marketing and engineering. Engineers tend to be very literal.

If 2 cores get you 180% the performance of 1, then in simple terms, that extra core is the 50% that gets you the 80%.

What I asked the engineering team was "what is the actual die space of the dedicated integer portion of the module"? So, for instance, if I took an 8 core processor (with 4 modules) and removed one integer core from each module, how much die space would that save. The answer was ~5%.

Simply put, in each module, there are plenty of shared components. And there is a large cache in the processor. And a northbridge/memory controller. The dies themselves are small in relative terms.[/quote] Reply

Meh, I mis-clicked and reported your post by mistake, sorry about that. :(

Anyway, imagine an int core is 5% of the area of a total module and that a module's size is 100 (size units, not mm^2), so the int core size is 5. 4 modules will then be 400 and 4 int cores will be 20. 20 is 5% of 400, not 20%. Same for 8 modules.

You have to see that they do need 50% more area to get the 80% int performance boost, as they are using a second int core to accomplish that. So the area of the module dedicated to int operations is 2x the size of a regular int core. Reply

Nah, you don't understand him. His assumptions are:
1. one int core is about 5% of the whole die (including uncore).
2. one int core is about 50% of a module.
3. the uncore makes up about half of the die.

Put this in numbers:
Take a module as 100 size units. 4 modules means 400 size units, adding the uncore makes the size of the whole die 800. 5% of 800 is 40 size units. And tadaa, this makes an int core 40% of the size of a module ;) The number gets closer to 50% if one takes the uncore bigger.

If his assumptions are correct, a 25% total die increase (four extra int cores at 5% of the final die each, which is 25% of the original die) results in 80% extra performance. This is about as good as Intel's 5% die increase for 15-20% extra performance (I know, this is a bold statement; a lot of unknown variables could alter this situation drastically). Reply

The data we have is: removing 1 int core from each module would result in a 5% reduction of total die size, and 1 int core = 50% of the total area dedicated to integer operations.

So, for a total die of, let's say, 1000 units with 8 int cores, the 4 removable int cores represent 5% of the total die size, or 50 size units.

So each int core is 12.5 size units, and the 8 int cores take 100 size units, or 10% of the die.

Assuming numbers for the total die size, or for the Bulldozer module size relative to the total, is pure speculation, as we don't have anything other than that JF affirmation.

To remember:
"What I asked the engineering team was "what is the actual die space of the dedicated integer portion of the module"? So, for instance, if I took an 8 core processor (with 4 modules) and removed one integer core from each module, how much die space would that save. The answer was ~5%. "

That was the affirmation.

In no way does this contradict the affirmation that AMD increased the module area dedicated to integer operations by 50% to achieve 80% more performance.

I am disputing the JF claim: " Removing 1 int core 5% from each module would result on 5% reduction of total die size. "

I suspect that his engineers misunderstood his question, and it is actually the removal of the "extra core" from ONE BD module that would result in 5% overall die savings.

You can take it to the bank that Moore is correct that adding another integer execution unit group, L1D, etc. to the core (thus making 2 cores, or a 'module') increased the size by 50%. Moore is the designer, not a marketing guy.

In order for Fruehe's claim to be correct, the uncore area would have to be VERY large:

Some more (different numbers):

Assume a BD module is 30 mm2 (thus increased by 10 mm2, or 50%, from 20 mm2 to add the second 'core', per Moore)

If 5% were actually the correct estimation of the area added for 4 BD modules (4 * 10 mm2 increase = 40 mm2 increase), then the overall die size would need to be... 800 mm^2.

This is nuts.

On the other hand, if "5% of the total die area" is the estimate of the space needed to add the integer resources to just 1 BD module, then the overall die can be 200 mm^2, so uncore 80 mm^2, 4 BD modules at 120 mm^2, and then Moore's numbers can be consistent with what JF heard back from the engineers.

So, my theory is that his engineers thought they were being asked how much of the total die (for a 4-BD-module part) the increase in integer units to 1 BD module represented, while JF thought he was asking how much the increase to ALL 4 modules would be. This would be an easy misunderstanding to have, and I don't see another way to reconcile Moore's information (which I trust) with JF's claim.
Reply
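The two readings of the "~5%" claim discussed above can be put side by side numerically. This sketch uses the hypothetical sizes from the post (a 30 mm^2 module, up from 20 mm^2); all numbers are assumptions, not actual die measurements:

```python
# Hypothetical sizes from the post: module grew from 20 to 30 mm^2.
extra_per_module = 10.0  # mm^2 added per module to get the second int core
modules = 4

# Reading 1: "~5% of total die" covers the extra int cores in ALL four modules.
total_extra = modules * extra_per_module       # 40 mm^2
die_if_reading1 = total_extra / 0.05           # implied die size: 800 mm^2 (implausible)

# Reading 2: "~5% of total die" covers the extra int core in ONE module only.
die_if_reading2 = extra_per_module / 0.05      # implied die size: 200 mm^2 (plausible)

print(die_if_reading1, die_if_reading2)  # 800.0 200.0
```

The 800 mm^2 result is what makes reading 1 look broken, while 200 mm^2 fits the "uncore 80 mm^2 + 4 modules at 120 mm^2" split in the post.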

I really hate the move to call each "module" 2 "cores". AMD is shooting themselves in the foot when it comes to software licensing, in particular Oracle DB licensing, where they charge a 0.5 CPU license for each x86 core on a multi-core chip. AMD's decision will double the cost of software running on Bulldozer.

Microsoft licenses per mainboard socket. Oracle charges a reduced per-core licensing cost when the cores are part of one socket. Meanwhile, some other companies license per core.
Everyone has their own way. Reply

Fair enough. So indeed this is a concern when buying this new stuff. I'd rather AMD not call this module 2 cores, for the simple reason that it is a sort of Siamese twin core, not a true dual core. Though that is just the naming game. It looks promising nonetheless. Hopefully AMD/Oracle can enlighten the big system buyers by the time the decisions need to be made. Reply

A lot of licenses and MRCs are based on socket count; unfortunately, many of the most expensive software packages derive their licensing from core count. Since Oracle changed their multi-core licensing near the end of 2006, quad-core x86 processors have counted as two licenses, six-cores as three licenses, etc. A quad-module Bulldozer die will therefore need 4 licenses for Oracle DB.

Does this suck? Yes. Is SQLServer's licensing model better for end-users? Yes. Is SQLServer anywhere near as awesome as Oracle Database? Hell no, not even close. Reply
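The license math above can be sketched as a toy calculation. The 0.5 core factor for x86 is from the posts; the function name and round-up behavior here are illustrative assumptions, not Oracle's actual price list:

```python
import math

# Toy license count: x86 core factor of 0.5 (per the discussion above),
# rounded up to whole licenses. Check Oracle's core factor table for real use.
CORE_FACTOR_X86 = 0.5

def oracle_licenses(marketed_cores: int) -> int:
    return math.ceil(marketed_cores * CORE_FACTOR_X86)

# A chip marketed as "4 cores" needs 2 licenses; a quad-module Bulldozer
# marketed as "8 cores" needs 4 -- double, for the same silicon.
print(oracle_licenses(4), oracle_licenses(8))  # 2 4
```

This is why the "module = 2 cores" naming choice directly affects the software bill, not just marketing.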

It means Anand gets these calls ("short conference calls") requested by AMD when:

1) Anand makes a mistake

or

2) Intel is being naughty (like telling OEM to not sell AMD).

I asked Anand in the Bulldozer article if a quad-core Zambezi meant 4 cores/8 threads or 4 cores/4 threads.

He said (and I was convinced at the time it was the correct answer too) that a Zambezi quad-core meant 4 cores/8 threads and an octo-core would be 8 cores/16 threads. Or if you prefer, 4 modules/8 cores/8 threads and 8 modules/16 cores/16 threads.

But it seems it is 2Modules/4cores/4threads and 4modules/8cores/8threads.

The first x86 CPU with L3 was actually the AMD K6-III, released in 1999. Of course, the L3 was actually on the mobo, while the L2 was on-die. But it did use a three-level cache, helping it outperform the Pentium III Katmai on integer workloads. Reply

Only in instructions per clock - the K6-III was available at 400 and 450 MHz, while the Pentium III eventually (much later) reached up to 1300 MHz.
However, the K6-III competed against the Pentium III at up to 550-600 MHz, as the original K7 appeared around that time. Reply

K6-2+ and K6-III were PII competitors that were able to outperform the PII and even early PIII purely thanks to ON-DIE L2 and 3DNow!.
L3 was on the motherboard back then, was slow, and had little to do with the K6-2+/III performance gains.

Also, the K6-2+/III was the top AMD CPU for a very short time, as the K7 came right afterward.
The K6-2+/III was the notebook and low-cost business desktop solution of the time, while the K6-2 (without on-die L2) was the budget solution. Reply

Any info on the number of transistors in each module?
At least they could make a decent notebook CPU from the modules. Mobile Nehalems and Phenoms are many things with their 4 cores, but low-power notebook CPUs they are not. Reply

Something like 1 module with no L3 cache for netbooks and notebooks,
2 modules with L3 cache for average notebooks, and 4 or more modules for desktop-replacement notebooks. They could play with the cache sizes too.
The 1-module, no-L3 version, for example, would kill Atom in performance, and the power usage could still be quite good. There is no point in a 2-4 W slow CPU in a netbook when the mainboard, HDD and display together eat several times more power than the CPU itself. Reply

I am wondering about the shared FPU. Does one thread really have to be purely integer for the other thread to use the 2 FMACs at the same time, or can one thread use the 2 FMACs whenever the other one is currently not sending FPU instructions? "If one thread is purely integer, the other can use all of the FP execution resources to itself" sounds like the former, but that would (1) waste FPU resources and (2) cause problems when threads switch cores (how would the non-switched thread know it can now use all of the FPU resources?).

My presumption is that AMD chose the latter: if 2 (FMAC) instructions from different 'cores' reach the FPU at the same time, they will be executed using one FMAC each, and if 2 instructions from one core reach the FPU without the other core sending any at that point in time, they will be executed in parallel using both FMACs.

If my presumption is correct, AMD decided not to hyper-thread their ALUs but did hyper-thread their FPU. Reply

The shared FPU has ONE common scheduler. Both threads can issue Ops into the scheduler queue. If there are no FPU Ops from the first thread then - of course - the second thread has the power of the whole FPU.

Very simple ... it's like a chat room. Several people type in messages and you can see it serialized in the chat window (that would be equivalent to the queue).

You will read the messages one after another, according to their posted time / when they were issued. It is the same for the Bulldozer FPU.

You do not need to switch from one chat member to another to read their messages. Neither does the FPU have to switch ;-) Reply
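The chat-room analogy above can be sketched as a toy single-queue model: both threads append ops into one shared scheduler queue, and the FPU simply drains it in issue order, with no explicit "switch" between threads. This illustrates only the queueing idea, not the real Bulldozer scheduler:

```python
from collections import deque

# One shared scheduler queue for the whole module's FPU (the "chat window").
fpu_queue = deque()

def issue(thread: str, op: str) -> None:
    """Either thread posts its FP op into the single shared queue."""
    fpu_queue.append((thread, op))

# Thread 0 is integer-only for a while, so thread 1 fills the FPU alone;
# when thread 0 finally issues an op, it just queues behind in issue order.
issue("T1", "fmac a")
issue("T1", "fmac b")
issue("T0", "fmac c")

# The FPU reads ops one after another, in the order they were issued.
executed = [op for _, op in fpu_queue]
print(executed)  # ['fmac a', 'fmac b', 'fmac c']
```

The point of the single queue is that no arbitration mode change is needed when one thread stops sending FP work: the other thread's ops simply occupy the whole queue.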

I hope AMD regains its common sense and uses the term "core" in a more conventional sense. According to their definition, a quad core will only have 2 FP pipelines. What I see in a quad-core Bulldozer is a dual core with 2 int and 1 FP pipeline each. I wish them the very best in their effort to regain the performance crown, but abusing existing terminology will not help with that. Reply

OS, driver and API layers thrashing the CPU constantly are usually integer loads. For the average math in code, current FPUs are fast enough. Things that really need parallel FP performance (multimedia, graphics) use SIMD SSE units, and those should run much faster on GPUs anyway. For 5% extra die area, the extra integer pipeline rocks. Reply

This was started by Sun's Niagara (I think): marketed as 32 threads, but with only one FP unit on the chip. Each physical integer core ran four threads at a time (one instruction from each, with instant context switching between them), so you had eight physical integer cores sharing a single FP unit.
Niagara 2 then gave each of those integer cores its own FP unit. Reply

This confusion has nothing to do with int or FP cores. By a common definition, a core is a standalone unit which can function on its own if necessary.

For example, each of Niagara's 8 cores had its own fetch and decode unit. In AMD's case, the "module" is the unit with its own fetch and decode units, and the integer ALU clusters only have their own schedulers. Therefore, it's very confusing to call these clusters "cores".

Fair enough. The use of the term "core" is still confusing though. There seems to be only one complete pipeline, which branches after instruction decode. It's interesting whether the I-cache is a trace cache, i.e. contains decoded instructions, or a regular cache that has to be fed back through the fetch/decode bottleneck. In the latter case I see no merit in calling the two integer (for lack of a better term) pipelines separate cores. Reply

Uhm, the Sun UltraSPARC T1 was an 8-core chip with eight integer units and one floating-point unit, and with CMT it ran four threads per core, 32 threads total. It was still called an 8-core CPU, but the T2 included an FP unit for every integer unit, so I doubt AMD will use this config for long. But who knows. It's impressive if it's up to 30% faster than a PII with twice the number of FPUs.

Also, one integer unit already includes three ALUs. They (the core/scheduler) seem independent enough. Reply

From a marketing standpoint, '8 cores' is more desirable to the lay buyer than '4 cores', so that's one reason why they might have chosen to do it. Indeed, I think Intel will quickly follow AMD's definition so as not to have 'less cores'. Reply

Its a poor choice of words. Unfortunately, since the term "core" was never fully defined, everyone gets to have their own understanding of what it means. I took core to mean a complete processor that could, if packaged alone, perform all the functions of a processor (both integer and floating point). I think AMD should have taken the high road and called "modules" "dual-integer cores" instead of splitting them into two "cores" with this extraneous, shared FPU tacked on like an afterthought. It makes the term core meaningless. I would guess that making the term "Core" meaningless would be advantageous to AMD, however, since that is the term Intel uses for their entire architecture. Reply

How the hell could you embrace Intel after all the harm they've caused AMD, nVidia, VIA and ultimately us, the consumers with their criminal tactics? They don't give a damn about you, they just want your money. A lot of people say that AMD is no different (I know nVidia sure is the same as Intel in that regard) but at least they operate with integrity. They've never been accused of anything underhanded or sneaky. For that matter, neither has VIA. Intel and nVidia on the other hand, while nVidia has never done anything downright CRIMINAL, they've still been dishonest as hell. Intel on the other hand, has stooped about as low as you can go. So go ahead, embrace Intel, just like a stupid biatch who won't leave her abusive spouse. She just keeps going back for more and people like me who have brains can only shake our heads and wonder. Reply

While I agree Intel has the performance crown now, I can't knock AMD for being the value right now. Picked up an AMD x4 955 BE and Asus motherboard (full ATX/crossfire) for $230 to build my parents a computer with (Newegg combo). Intel can't compete in that price space easily. Reply

Intel can't compete in that price range? No, Intel doesn't want to. Manufacturing capacity is limited; if I can sell more higher-margin products, why should I go after the lower-margin segment? Leave that segment to AMD: the more AMD sells in that segment, the more money AMD loses. If Intel wanted to compete in that segment, they could easily kill AMD, but that's not what Intel wants to do. Reply

That sounds a bit low; I hope the final comparable CPUs can manage something more like 15-40% better integer performance over their Phenom II counterparts. Then again, perhaps that's just Intel's large performance increases between recent architectures making us expect more -- they are more the exception than the rule, so 10-35% shouldn't be sneezed at, although it just may not be competitive by their 2011 release. Reply

Considering that the int cores actually have fewer execution units (there used to be 3 ALUs plus a shared load/store unit that can do two operations per clock; Bulldozer has only 2 ALUs plus separate load and store units), I think 10-35% better integer performance is amazing. More than that would be a miracle imho... Reply

Based on AMD's redefinition of the word core, that's actually a HUGE improvement. A quad-core Zambezi has a similar transistor budget to a dual-core Phenom II, plus a 10-35% performance improvement.
In other words, quad-core integer performance at a dual-core price. Reply

A quad-core Bulldozer has the same transistor budget as a tri-core Phenom II (if they existed natively), yet performs around 20% better than a quad-core.

I think SMT would have provided easier performance pickings (20% for 5% die space). I don't understand why AMD has avoided SMT so far. Sure, 80% more performance for 50% more die space isn't to be sneezed at, but it's not such easy pickings.

In addition, there are more integer resources than in a Phenom II core, and the FPU has two 128-bit FMACs, so each core could reasonably be bigger. In effect, one Bulldozer module could be the same size as two Phenom II cores - so all you have then is the 10-35% performance increase. I hope this is per clock... Reply

The thing is that the picture in this article contains the shared L2 cache and L3 cache too, and it's quite unclear from the picture whether the L2 is shared by one module or by all modules (sharing both L2 and L3 across all modules would be quite useless).
The Bulldozer picture in the other AnandTech article http://it.anandtech.com/IT/showdoc.aspx?i=3681&amp... clearly shows that the L2 cache belongs to the module.
So clearly, adding 50% to the core (which is everything up to L1) is much less than two whole cores each with its own same-size L2 cache (Nehalem has only a tiny 256KB cache per core for die-area reasons).
If we take the whole die size with 8MB L3 cache and 1MB L2 cache per module (+ things like the memory controller and HyperTransport core/module interconnects), the final die increase could end up at 10-15% or even less. Reply

So a 4-module Bulldozer die with 512KB L2 cache and 6MB L3 cache could be something like 10-15% bigger than a 4-core Phenom II with 512KB L2 cache per core and the same 6MB L3 cache. For 80% more integer performance that wouldn't be bad.
And about Oracle: the server CPUs from both Intel and AMD run from a few hundred dollars to over 2k dollars with minimal performance increase, just more sockets supported, and everyone buys them. So I couldn't care less about them than about a fly on my window. In the end it will come down to the final per-core price of the CPU, not the per-core/module license price. Reply