As the title says, this blog is about the PC world: new trends and new technologies. I'm personally a chip fanatic, but I like programming too. Anything that fits these categories will be discussed here.

Wednesday, October 26, 2011

After the not-so-successful launch of its new flagship desktop processor, AMD has started talking up the next chip that will succeed Bulldozer version 1 (BDver1). Say hello to FX Next, based on the Piledriver core:

Piledriver is supposed to fix certain shortcomings in BDver1 processors. I summed up some of them in my previous blog, so I won't reiterate them here.
What will AMD have to offer before Piledriver arrives? The following roadmap gives us some clues:

The FX8170 is supposedly launching in Q1 2012. It will be based on the new (somewhat improved) B3 stepping. Hopefully B3 will be enough to polish out speed path issues (if there are any) and bring up clock speed. The rumored 8170 is supposed to run at 3.9/4.2/4.5GHz (base/Turbo/max Turbo). This is a very solid uplift (~7.7% over the 8150). The bad news is that there is no "8190" on the current roadmaps, so the 8170 is supposed to tide AMD over until Piledriver (PD) arrives in Q3. That is 2 quarters... The good news is that PD will be 10-15% faster overall than what AMD has at the moment of its introduction (so the 8170). This lines up well with the rumored ~5% IPC increase over BD, with the rest coming from clock speed.
So:
Q1-Q2 2012: 3.9/4.2/4.5GHz FX8170
Q3 onward: 4.2?/4.5?/4.8?GHz FX8280?
Effectively that is 4.2/3.9=1.07, or a 7% clock uplift with PD. Now count in the IPC uplift of 5% or so: 1.07x1.05=1.12, or 12% faster overall than the 8170. AMD said 10-15% more x86 performance with PD, so this lands right in the middle of the projected range. How much faster is this than the current 8150? The 8170 should be around 7% faster than the 8150, so a 4.2GHz (base) PD-based FX is going to be roughly 20% faster than the 8150. Not a bad upgrade if you look at it from a time perspective: 8 months after BDver1 we will get a 20% faster FX part (stock vs stock). It's not unreasonable to expect better OC and thermal characteristics too, so all in all it will be a good "fix" for the current competitive situation of BD on the desktop.
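The arithmetic above can be replayed as a quick sketch. The clocks and the ~5% IPC gain are the rumored figures from this post, the function name is made up for illustration, and the whole thing assumes performance scales linearly with clock, which is the same best-case simplification used above.

```python
def relative_speed(clock_new, clock_old, ipc_gain=0.0):
    """Best-case speedup: clock ratio times (1 + IPC gain)."""
    return (clock_new / clock_old) * (1.0 + ipc_gain)

# Rumored base clocks: PD-based FX at 4.2GHz vs FX8170 at 3.9GHz, ~5% IPC gain.
pd_vs_8170 = relative_speed(4.2, 3.9, ipc_gain=0.05)

# The post takes the 8170 to be ~7% faster than the 8150.
fx8170_vs_8150 = 1.07
pd_vs_8150 = pd_vs_8170 * fx8170_vs_8150

print(f"PD vs 8170: {pd_vs_8170:.2f}x")
print(f"PD vs 8150: {pd_vs_8150:.2f}x")
```

Without rounding the clock ratio down to 1.07 first, the result comes out a touch higher than the 12% quoted above, but still inside AMD's 10-15% range.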

PS: All the info above is my interpretation of the available data. I used the "best case scenario". The real PD core and 8170 may turn out to be totally different from what I speculated. We have seen this happen with the BD core. Back then I also based my predictions on publicly available data coming from AMD (which turned out to be a bit optimistic on their side), so this may happen again with PD and the 8170. Take it for what it is: just speculation.

Thursday, October 13, 2011

OK, I have read a lot of reviews now. Some things are clearer now.
I suppose I overreacted a bit in my previous blog. Zambezi is hot, but overall it's not a slow chip. It performs rather well in MT applications. It does have some weaknesses which AMD must correct. Some of the weaknesses are not solely AMD's fault, but GloFo's too.

So this is what, in my humble opinion, AMD must focus on in the future (think Piledriver and Steamroller):
1) First and foremost, AMD must invest heavily in its relationship with developers. They must hire a brand new team of young and motivated guys who will literally go out and help developers maximize the potential of the Bulldozer design. This first iteration is just that, a first. It has some flaws which AMD will try to fix, and hopefully they will succeed. But the underlying design, which is truly revolutionary, will need GOOD software support in order to give the best performance to end users. This means FMA4, XOP, BMI and the rest will need to be properly supported in future multimedia desktop workloads. Notice I'm speaking about the DESKTOP space here. Server has no such need, since recompiling is the norm there.

2) AMD must improve cache performance, especially L1 and L2 writes. This is a major bottleneck and it shows its ugly face in many workloads. AMD is aware of this, and hopefully Piledriver has at least somewhat better write performance at these two cache levels. L3 looks fine, or even better than fine. It is much faster than the L3 on Thuban.
They also need to work on improving the FP unit. It may be great in FMA4 code, but it's much less impressive in legacy SSE or AVX128 workloads. Maybe widening it a bit and enlarging the buffers could help. Single-thread performance is nowhere near what this thing SHOULD be capable of, so there must be a bottleneck somewhere, since in numerous SIMD workloads it's not faster than a single K10 core (and its 128b unit).

3) AMD must twist GloFo's arm very hard and very fast. Not only is their 32nm production yielding many defective Llano parts (which is truly a shame, since most of the time it is the GPU that is broken, and then it's not an APU any more), but now they can't break the 3.6GHz barrier on a design that was SPECIFICALLY DESIGNED FOR CLOCK speed (while it does have some IPC improvements in certain areas too). The original goal set by AMD was a 30% clock uplift at the same power draw as the previous design. We get this ONLY in limited Turbo mode now. We should have 4.1GHz 125W stock-clocked Zambezi parts with 4.7GHz half-core Turbo and 4.5GHz full-core Turbo. This Zambezi would effectively be 12% faster than the 8150 at the same power draw. This Zambezi would allow AMD to use an SMT-style core affinity scheme and release a patch for Windows 7 that would schedule threads to modules first and not cores. The performance uplift ranges from 5% to a massive 40% in some cases, averaging around 15-20%, depending on benchmark selection.

So what we need is a 95W 3.6GHz FX8150, a 125W 4GHz 8170 and a 4-4.2GHz 125W 8270 (Piledriver).
This lineup would hold off SB and IB, at least in the mid and mid-high performance segments, without many problems.

4) AMD should work closely with MS and release a patch for the Windows scheduler. As in the link I provided above, the performance uplift is not a small number but a very nice 15-20%.
The trade-off is power draw, though. It is all explained well in this great review by hardware.fr.

So there you have it. Bulldozer is not what we expected, but it's not a complete failure either. It's a solid chip which will shine in future applications, which are moving to multiple threads. Single-core speed, while still important, is not the main selling point any more. Those who want good single-core performance along with great MT performance (though still slower MT performance than the FX8150) can pick the 2500K. It's currently Intel's best chip from a perf./$ POV. The 8150 is not as good but very close! It needs a 10% shave off its MSRP and AMD may sell a sh*t load of these things :).

Final Thoughts
Bulldozer has been in the works for so long that I don’t even remember when was the first time I heard about it. The concept, at first somewhat odd, gradually started making sense but sometimes these things happen by assimilation. Or, too many geeks who are “in the know” really knew that this was not only a great but a grand design. And who was I to doubt them?
Still, all things considered, there was still more than a shadow of a doubt. Especially when Intel re-introduced HyperThreading and got enormous mileage out of it – for the cost of essentially nothing in terms of real estate which directly translates into cost per die. But we were told over and again that none of this would make sense in light of the analyses performed by AMD’s engineering team. And now we are supposed to believe that Zambezi was designed as a direct competition to the Core i5. That was not a question but a statement.
Of course, this begs the old question, how predictable is performance on a new design? Apparently, it is hit and miss and in so far, my argument still stands, even if it is against the personal religion of some of the decision makers at AMD. This is at the end of a frustrating week, trying to find that one application that would justify buying an FX processor.

So this loader gets stuck, then another dozer comes to help, then he gets stuck coming out, so another dozer comes over and pushes him out, then gets stuck himself and is in turn pushed out by the Cat dozer (the first dozer stuck). See a trend? I got fed up, so I go swimming for the winch cable (it was under over 2ft of mud and water) and pull it out with my work truck. Done.

Friday, September 16, 2011

Due to the popularity of this one post, I will re-post it ^^. Since I was notified via PM on the XS forum about the lies and crap he is writing about me and others, I will be re-posting the truth about this guy. "The blogger with mental disorder" will be exposed again, and again if needed. Enjoy :).

Many of you have already heard about the notorious blogger from the Czech Republic, a guy who goes by the username "OBR". Well, in his latest posts he went completely overboard with insults and bad-mouthing: insulting groups of internet forum users and accusing many folks of lying (mostly AMD's people), including some guys he even knows personally, AMD's PR department, AMD's John Fruehe (the guy who stands behind the server division and posts occasionally on the tech forums), etc.

Now some forum posters found this very amusing, especially knowing that OBR himself admitted he lied, purposely faked Zambezi results and fed the fake results to many IT websites. They all took the results seriously and posted them citing "genuine sources". Then the FUD train started to roll, and we have what we can today call a mess. This guy is a pure hypocrite.
A picture is worth a thousand words:

Thursday, September 15, 2011

Looks like THG got a pre-production SB-E system and ran some tests on it. Nothing spectacular compared to the 990X/980X, around 10% faster. Still a pretty good improvement for a part with the same core count.

On the other hand, the controversy regarding Zambezi (the desktop Bulldozer part) continues. We have some new data that points to multithreaded performance in between the 2500K and 2600K and abysmal single-thread performance. Nothing is final yet, but don't expect some sort of SB killer from Zambezi. Even in its strongest department, multithreaded performance, it looks like it will fail to beat the X6 Thuban by a decent margin. Performance is indeed just a little better, if better at all. Price is still hard to figure out. AMD plans to charge approximately $250 for the FX8150. Unless they are counting on more OC headroom, nothing else warrants a higher price than the existing X6 1100T. This is indeed kind of a letdown. It turns out that the 4-module part is closer in performance to a 4-core part with SMT than to an 8 or even 6 core part. Performance, especially in floating point applications, is all over the place, ranging from much slower (Linpack) to even ("8C" FX @ 3.6GHz vs 1100T in C11.5) to slightly faster (wPrime). This uncertainty kind of defeats the whole module concept, because AMD touted it as giving a much more consistent performance uplift than Intel's SMT approach. As it turns out, it is not so different after all... The good thing about Zambezi is that it will at least clock high. Some reports point to 4.8GHz as a "normal" overclock across 4 modules, on air.

Update: I've just read on XS (thanks chew*) that Zambezi will have even higher OC potential than 4.8GHz. On air (stock cooling), one can supposedly expect 5+GHz for regular daily use and benchmarking. By chew*:

I think that any info other than what I or AMD have personally offered which is not much if any can be discarded until further notice.

There are still pieces to the puzzle missing that I can assure you 99.9% don't have yet regarding CPU rev and bios support and agesa.

To sum up BD facts

BD is physically a 4 core 8 thread part.

It has no coldbug

Samples can bench at 5+ on stock cooler, 6+ on phase change and 8.4 on lhe.

And most important of all SATA works

Like I said before the joke is on that guy we won't name, he was sent intentionally quite possibly the worst chip ever produced. 6.4 on ln2,
or he needs to learn how to OC, 6.4 is my validation speed on phase change......he who laughs last laughs best.............

So that is 6+GHz with phase change, and 7+GHz and maybe 8GHz with LN2. For water cooling, I suppose 5.5GHz is not out of the realm of possibility. So with this latest information in mind (which is very reliable), Zambezi may be pretty competitive with an OCed SB 2600K. Since the 2600K reaches around 4.5-4.8GHz on good air cooling, Zambezi will have some frequency headroom over it under similar conditions. This may equate to better performance in MT applications for those who actually use their overclocked machines for something useful.

Sunday, September 11, 2011

OK, a small update on the SiSoft numbers for Opterons. Here is the 6282SE and here is the 6220. I'm pretty sure now about the following things:

1) Both Opteron SiSoft results are real. Some features are turned off though.

2) The 6220 part has a correct SIMD throughput, while the 6282SE has a somewhat inflated number (maybe the memory configuration is specced higher there).

3) Both the 6282SE and 6220 servers had Turbo off in the integer tests. The first ran at 2.5GHz (2x16C), the second at 3GHz (2x8C). The multimedia test uses AVX and gives an 11% better score than legacy SSE.

4) The Opterons that will launch very soon will have roughly (2P top bin next gen vs 2P top bin previous gen) a 28-30% higher spec_int_rate score and a 33-35% higher spec_fp_rate score. This is the 2.3GHz (2.8GHz Turbo on all cores for integer workloads) Interlagos vs the 2.5GHz 12C Magny Cours. In IPC terms this is ~7-8% higher integer IPC and 8-10% higher floating point/SIMD IPC, all in non-recompiled workloads. AVX 256/128 brings ~10% more in floating point and FMA brings up to 2x more over AVX, but this is a "what if" case and not the norm (we have to wait for applications to be written with FMA in mind). I guess XOP will bring speedups similar to AVX, 10% or maybe more, in recompiled integer workloads.

5) The Zambezi SiSoft results that were leaked are not correct. I don't know whether Turbo was on or off in that test, but even if it was off, the results are ~17-20% lower in the integer part than what the Opterons show. The FP part is more or less correct, since the Opterons score the same per core and clock, but the test was run on an ES platform with 1333MHz DDR3 memory and unknown BIOS settings. Even if we take the legacy SSE score of 132Mpix/s, which was scored at a fixed clock of 2.8GHz (100% sure about this), and correct it for the launch clock of the FX8150 part, we arrive at 170Mpix/s. This is 54% more than the 1100T. The best case for Zambezi is AVX, where it is 71% faster than the 1100T (stock versus stock => 189Mpix/s vs 110Mpix/s).
Integer throughput is around 35% higher on the FX8150 versus the 1100T (stock+Turbo versus stock => 88Gops vs 65Gops).
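Here is a rough sketch of how the corrected figures in point 5 fall out. It assumes the FX8150 launches at a 3.6GHz base clock, that the score scales linearly with clock, and that AVX adds the same ~11% over SSE that the Opteron results show; the function name is just for illustration.

```python
def rescale(score, from_ghz, to_ghz):
    """Scale a score linearly from one clock speed to another."""
    return score * to_ghz / from_ghz

X6_1100T_MPIX = 110.0   # 1100T multimedia score quoted above
X6_1100T_GOPS = 65.0    # 1100T integer score quoted above

# Leaked legacy SSE score: 132 Mpix/s at a fixed 2.8GHz, rescaled to 3.6GHz.
sse_fx8150 = rescale(132.0, 2.8, 3.6)
# AVX assumed to give ~11% over legacy SSE, as on the Opterons.
avx_fx8150 = sse_fx8150 * 1.11

sse_gain = sse_fx8150 / X6_1100T_MPIX
avx_gain = avx_fx8150 / X6_1100T_MPIX
int_gain = 88.0 / X6_1100T_GOPS

print(f"SSE: {sse_fx8150:.0f} Mpix/s, {sse_gain:.2f}x the 1100T")
print(f"AVX: {avx_fx8150:.0f} Mpix/s, {avx_gain:.2f}x the 1100T")
print(f"Integer: {int_gain:.2f}x the 1100T")
```

The small differences from the percentages in the text come from rounding the intermediate scores.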

I suppose the numbers can be about 5% higher for the desktop version compared to the ones I posted in point 5). As for the Opterons, I'm 99% sure this will be the speedup that the SPEC benchmarks will show. Oh, and STREAM (memory BW) will be around 50% faster on Interlagos, but this is already known.

PS: And yes, this means Zambezi shouldn't score lower than the Thuban 1100T in Cinebench... At least not according to the above. But who knows, anything is possible. Leaks so far suggest that the top Zambezi should get a score of around 6pts in the C11.5 64-bit test. According to the SiSoft numbers it should get close to 9pts or more.

Friday, September 9, 2011

The last week or so has been full of Bulldozer news. It has started shipping (finally!) in the server segment, and we have confirmation that the desktop launch is in Q4, presumably mid-October.

We have a few more leaks, none of which point in a good direction (performance-wise). Whether or not the leaks are genuine and the platform is final, to me it looks like Bulldozer will have a tough time even against the previous X6 cores in the desktop space. Not only is the integer performance abysmal, FP/SIMD looks to be equally bad. There is still some hope that things can get better with B2G or whatever the launch stepping will be, but I'm now pessimistic when it comes to Zambezi and its desktop performance. A Zambezi X8 @ 2.8GHz base and 4GHz Turbo cannot beat the X6 1100T @ default (with its own Turbo). This is what some leaks are showing and what some guys in the know kinda confirm. The performance level of the FX8150 will then be slightly below the 2500K (without SMT, just a plain QC Sandy Bridge with 4 threads). In single-thread workloads even Thuban @ 3.7GHz (Turbo) may end up being faster. I don't know if this is a design decision or some problem in the design, but the performance picture does look grim. 6-7 years of development and a lot of R&D money invested, and we get something like this?

Thursday, August 25, 2011

Many of you have already heard about the notorious blogger from the Czech Republic, a guy who goes by the username "OBR". Well, in his latest posts he went completely overboard with insults and bad-mouthing: insulting groups of internet forum users and accusing many folks of lying (mostly AMD's people), including some guys he even knows personally, AMD's PR department, AMD's John Fruehe (the guy who stands behind the server division and posts occasionally on the tech forums), etc.

Now some forum posters found this very amusing, especially knowing that OBR himself admitted he lied, purposely faked Zambezi results and fed the fake results to many IT websites. They all took the results seriously and posted them citing "genuine sources". Then the FUD train started to roll, and we have what we can today call a mess. This guy is a pure hypocrite.
A picture is worth a thousand words:

Monday, August 22, 2011

Same thread, page six: http://diybbs.zol.com.cn/11/11_100430_6.html
8C 2.8GHz, A1 revision?? So the samples seem to work fine (bug-wise) but have been power limited to stay within a certain power cap.
Quote from the guy who posted this:

8-core results here

Bulldozer and then check the performance of eight threads can not pay would in 9000, the main reason for the test problems is power supply and motherboard BIOS issue, bulldozers high demand for power, and now the BIOS only 60-70% of shipments performance only.

The logic above is the following: the first silicon revisions (A1, B0, B1, B2?) were functioning performance-wise (mostly) as simulations predicted, BUT they were very leaky and were gobbling power like crazy. So in order to validate the platform, AMD used the new feature in the Bulldozer design called "power cap" to limit the power draw of the CPU and still make it work on the latest 900 and 800 series AM3+ boards (check the official Bulldozer blog about this useful feature in the Interlagos variant). The effect of this was that those early Bulldozer samples were throttling down aggressively (voltage- and frequency-wise) in order to stay within the spec. This resulted in much lower clocks than what applications like CPU-Z reported, in the range of 60-70% of specification. Turbo functionality was impaired as well.
By the time the platform was validated, AMD was already producing the B3 (C0?), which supposedly has the leakage/power draw problems fixed and meets the clock targets within the 95/125W specs (say, the 3.6GHz 8150 8C is now hitting all targeted specifications, Turbo included, all within 125W, unlike the early A1/B0/B1 which were gobbling power like crazy with poorer yields). So after all is said, retail will perform 40-60% better than what the latest leak from Chiphell showed and what the two images above (kinda) illustrate, if they are genuine that is.

As for the numbers for both 4C and 8C, they kinda confirm it's possible. The 4C at 3.9GHz allegedly gets 18.9K and the 8C at 2.8GHz allegedly gets 23K. Scaling is probably not perfect going from 4C to 8C (I assume around 1.7x due to software and hardware limitations):
18.9K / 3.9 x 2.8 x 1.7 = 23K. So the scores kinda align. But the problem is that the scores are roughly 1.9x higher per core than Phenom... I can't see where this speedup comes from. So take it with salt.
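The consistency check above is easy to replay. The sketch below uses only the alleged leaked scores and my assumed 1.7x scaling from 4 to 8 cores; the variable names are made up for illustration.

```python
# Alleged leaked scores (in thousands of points) and clocks.
score_4c_k, clock_4c_ghz = 18.9, 3.9
score_8c_k, clock_8c_ghz = 23.0, 2.8

# Assumed (not measured) scaling when going from 4 to 8 cores.
core_scaling = 1.7

# Normalize the 4C score per GHz, move it to the 8C clock, then apply scaling.
predicted_8c_k = score_4c_k / clock_4c_ghz * clock_8c_ghz * core_scaling

print(f"Predicted 8C score: {predicted_8c_k:.1f}K vs leaked {score_8c_k:.1f}K")
```

The prediction lands within a fraction of a percent of the leaked 8C score, which only shows that the two leaks are consistent with each other, not that either one is genuine.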

Monday, August 1, 2011

As you can see here, next-gen Opteron 6200 series parts are up for preorder (1240+ units already ordered).
Part info: OPTERON 6200 SERIES PROCESSORS 2.1G 16MB 115WT G34 TRAY

This is not the top model though. There is a 115W part that runs at a 2.3GHz base clock and 2.7GHz on all cores with Turbo (and 3.3GHz with half the cores or one core active). This one will have 35% higher throughput in HPC workloads than the 2.5GHz 12C Magny Cours we have today.
Enjoy!

Wednesday, July 20, 2011

Update: some new leaks have appeared on zol.com.cn, so I have included them just for fun. The new leaked clock speeds are a bit unrealistic IMO.

Since I've already speculated enough about Zambezi, why not make a small table with my estimates of the desktop performance we can expect from Intel's and AMD's chips in 2011/2012?
We have quite a few contenders: Nehalem 4C, Westmere 6C, Sandy Bridge 4C, Sandy Bridge E (6C and maybe 8C), Zambezi 8C/6C and Komodo 10C. Some of these are available now, some will be soon and some will be available in Q1 2012.
The points in the chart are based on hardware.fr testing results and are consistent across a few generations of chips.

Some of the CPUs are not on public roadmaps, like the 2800K, the Zambezi 8170 (3.8GHz?) and the 8C SB-E for desktop. I still included them since they may launch eventually.
As can be seen, I expect a pretty even fight in the desktop segment, with Intel having a slight lead with SB-E until Komodo launches. In the mainstream and performance segments, Zambezi 8C/6C and Sandy Bridge will be the logical choices, while Westmere 6C is going to be too pricey for what it offers.

On zol.com.cn we have 3 new leaked images of what is supposed to be AMD's Zambezi roadmap, complete with clock speeds, stepping info and time to market. Take it with salt, of course.

If AMD somehow manages to pull off these frequencies for base and Turbo, then not only will Bulldozer be impressive, the wait for it (with all the delays) will be well worth it. It will be the top-performing chip, beating even SB-E cores, and Komodo will not be needed for this feat. But ONLY if the sky-high clocks are correct, which is not very likely.

Looks like someone from the Far East leaked the errata document that points to possible reasons why we had such poorly performing B0 and B1 samples in the wild. Since the original source (a forum post on ZOL) from China is gone, here is a post by Fellix @ Beyond3D forum.

As I speculated before in my original 2 blogs about the Zambezi ES, we have some Turbo Core issues (among other things). The good news is that the launch stepping has everything working properly and is on track for a September launch.

Monday, July 18, 2011

Disclaimer: everything written here is pure speculation by me, based on what we already know about past and future AMD cores.

Update: corrected some wrong figures towards the end of the post.
Update #2: put the final estimates in concrete numbers, for all Zambezi models.
Update #3: if you want to see my estimate for all desktop chips in 2011/2012, I wrote a new blog covering this.

There have been a lot of rumors and fake data about Zambezi lately. One "blogger" seeded many websites with his rigged Zambezi results and has practically spread a massive amount of FUD. Maybe he has some agenda? Anyway, I will try to dissect a few public statements and see what kind of performance levels we can expect from the Orochi design in client (desktop) workloads.

Server Vs Desktop performance targets

One "Bulldozer" module. Note the shared and dedicated parts.

The first thing that comes to mind when someone says "Bulldozer performance" is the AMD server division's statement of "50% more throughput performance than previous 12C MC products". Looking back at the past (Barcelona) and knowing the difference between server and client workloads, we can say with great certainty that those (server) figures are not representative of desktop performance. Server is all about throughput, where more cores usually scale nicely and where memory BW matters a lot. Interlagos, a server product, will have 16 "Bulldozer" cores, grouped in pairs and arranged in so-called modules (module is AMD's term for "an optimized dual core", and it is the building block of the Orochi design). It will be a 2x8C design, a so-called Multi Chip Module (different from AMD's "module" term) product, consisting of 2 Orochi dice linked together to create a 16C product. Since each Orochi die has a 2-channel integrated memory controller, this gives a 16C Interlagos a total of 4 memory channels, 2 per 8-core die. The MCM technique gives AMD an opportunity to better match cores that have similar real-world power/perf. characteristics and to cut costs at the same time (imagine a monolithic Interlagos with a huge 4-channel IMC and 16 cores on a massive 600+mm^2 die, yikes! Now compare it with 2 x ~290mm^2 for the MCM Interlagos).

Concept behind one Orochi die. 4 modules mean 8 cores. Interlagos will link 2 of these in an MCM for a total of 16 cores.

A great picture by Hans De Vries (http://chip-architect.com/). An actual (real) Orochi die shot. 4 modules, each of which has 2 cores. Note the die area savings due to the "sharing philosophy". The huge L3 cache is partitioned into 4 groups of 2MB. Each module also has a shared L2 cache, 2MB in size.

Back to the "50% more throughput" statement. Since this is a server statement, we can't deduce much about desktop performance from it. We know that even though many server workloads scale nicely with more cores, some don't scale that well with core count. But we also know that Bulldozer will have a very aggressive Turbo Core boost, so it will adapt itself to this eventuality: up the clocks by 1GHz and power/clock gate the idle parts in order to maximize serial code performance in single-threaded or poorly threaded workloads; return from the C6 state and clock all cores up to the default clock if FPU-intensive workloads use up all the cores in a system; clock up all integer cores by 500MHz if all of them are running integer-heavy workloads while the FP units are clock gated.

So the "50% more throughput" statement is more of an average over many workloads (both integer and FP, and both serial and parallel in nature, though mostly parallel). So with Interlagos we will have 33% more cores (1.33x), running at somewhat higher clocks (depending on the nature of the workload) and having X% better IPC than Family 10h cores. I say "X% better IPC" because it is unknown by how much IPC will grow; we only know it will be higher because AMD stated so (it could be just a few % for all we know, and it would still be a true statement).

80% throughput performance of CMP approach with much less die area.

This X% better IPC is a variable with Bulldozer, mainly for 2 reasons.

The first is the shared front end, for which AMD publicly stated that it offers 80% of the throughput of a conventional dual core design (a hypothetical conventional Bulldozer dual core done "the old way").

Shared front end in Bulldozer module.

The second is the new integer core design: 2 ALU + 2 AGU with a unified integer scheduler and other improvements vs. the 3 ALU/AGU design with separate math and address schedulers.

The first reason (shared front end) means that if a Bulldozer (BD) module, which is an optimized dual core, scores 100pts on average in a mix of parallel workloads (2 threads), each thread is somewhat penalized. This penalty probably varies a lot and can be as little as 1% or as much as 25% (or even more). AMD listed an average of 80% throughput (a 25% penalty, since 100/80=1.25), but we have to understand that no dual core product out there scales perfectly with more threads. Usually (for non-SMT dual cores) scaling is around 95%. So essentially a BD module is around 19% slower (95/80=1.19) due to this penalty, but with much less die area than a hypothetical Bulldozer done "the old way". So each thread on its own (running solo in the module) is up to 19% "stronger" than when both of them run together.
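The 19% figure is just the ratio of two scaling numbers. A tiny sketch, where the 80% is AMD's public statement and the 95% is my assumed typical scaling for a conventional non-SMT dual core:

```python
# AMD's figure: a module delivers 80% of the throughput of a conventional dual core.
module_vs_ideal_cmp = 0.80
# Assumption: a real conventional dual core reaches ~95% of perfect 2x scaling.
real_cmp_scaling = 0.95

# Effective sharing penalty of a module relative to a realistic dual core.
sharing_penalty = real_cmp_scaling / module_vs_ideal_cmp

print(f"Sharing penalty: {sharing_penalty:.2f}x")
```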
Another thing to note is that sharing is not always bad ;). This is the case with the floating point unit, which is shared between the 2 integer cores within the module.

BD module's FP unit organization.

While the integer cores may have this slight performance penalty, the FP units that work in SMT mode in a BD module may even see a boost from this type of operation. A Bulldozer module has a so-called FlexFP floating point unit. It consists of 2 FMAC units that share a unified scheduler. Practically we have one FMAC per integer core, but note that one integer core can use both if the opportunity arises (i.e. a single-threaded SIMD workload). Each FMAC is fused multiply-accumulate capable and is 128 bits wide. It is estimated that, performance-wise, one 128-bit FMAC will be roughly on par with one Family 10h "core" in previous MC products, even though an FMAC can do either an ADD, a MUL or an FMA while one Family 10h core can do one ADD and one MUL (in theory). This is achieved via core optimizations (new L/S unit, new scheduler, improved reg-reg transfer rate). AMD estimated that a 16C Interlagos (~2.5 or 2.6GHz) will have around 63% better FP throughput performance than the 12C 2.5GHz Magny Cours. How do we know this, you may ask? Well, the following slide is an AMD estimate from back in the day when Magny Cours topped out at 2.3GHz.

Chart made by HP and AMD for a presentation held back in 2010. All performance figures for Interlagos are estimates. Magny Cours performance is based on the 2.3GHz model.

If we correct for the speed bump of 2.3->2.5GHz that Magny Cours saw in the meantime, the specfp_rate increase that comes with Interlagos (1.78x) is lowered to around 1.63x (1.78x2.3/2.5). To be on the safe side, let's say it's around 60% or 1.6x*. So this gets us around ~20% improvement per core with Interlagos (1.6/1.33=~1.2 or 20%). In integer, AMD estimated a 50% improvement versus the 12C 2.3GHz MC, but after correction for the clock speed bump this is down to 38% (1.5x2.3/2.5=1.38). On average, for both integer and FP, according to the adjusted projected performance on the above chart, Interlagos should give sqrt(1.38x1.6)=1.485, or close to 50% more throughput than Magny Cours, as AMD now claims.

*I assumed that Interlagos won't have a lower starting clock speed at launch than the 12C 45nm MC product (2.5GHz top model). The Interlagos clock speed may end up between 2.4GHz and 2.6GHz.
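The clock correction and the averaging above can be sketched in a few lines. The 1.78x and 1.5x inputs are the slide's estimates against the 2.3GHz Magny Cours, the 1.6x is my rounded-down "safe side" figure, and the function name is made up for illustration.

```python
import math

MC_OLD_GHZ, MC_NEW_GHZ = 2.3, 2.5

def rebase(speedup_vs_old_mc):
    """Re-express a speedup over the 2.3GHz MC as a speedup over the 2.5GHz MC."""
    return speedup_vs_old_mc * MC_OLD_GHZ / MC_NEW_GHZ

fp_speedup = rebase(1.78)      # specfp_rate estimate from the slide
int_speedup = rebase(1.50)     # specint_rate estimate from the slide

fp_safe = 1.6                  # rounded down "to be on the safe side"
core_ratio = 16 / 12           # 33% more cores
fp_per_core = fp_safe / core_ratio

# Geometric mean of the integer and FP speedups.
average_speedup = math.sqrt(int_speedup * fp_safe)

print(f"FP: {fp_speedup:.2f}x, INT: {int_speedup:.2f}x")
print(f"FP per core: {fp_per_core:.2f}x, average: {average_speedup:.2f}x")
```

A geometric mean is used for the average because these are ratios, which matches the sqrt() in the text.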

This improvement in fp_rate may be the effect of SMT execution inside the FlexFP unit. Since both FMACs operate inside the module in SMT fashion, they may see a boost of ~20%, similar to the boost Intel's Nehalem/SB cores see in threaded workloads from Intel's own implementation of simultaneous multithreading. Note that with BD this will probably happen in FP-heavy workloads that are multithreaded. In single-threaded FP-heavy workloads, one core can use both FMACs, which is 2x the floating point execution units in the single-thread FP usage model. This could lead to very big performance uplifts versus Family 10h.

Thread control/selection in a Bulldozer module. Note the SMT organization for FP execution (red boxes, sorry for the small image). This helps with hiding pipeline bubbles and may lead to improved multithreaded FP performance.

Putting it all together we have: 50% better throughput performance with 33% more (and improved) cores at around the same base clock, with 500-1000MHz of Turbo on top of this base clock. According to the slide I linked earlier (which represents AMD's rough performance estimate back in 2010), we have:

-integer throughput
50% higher score than the 2.3GHz MC (90"pts" vs 60"pts"). To work out how much Interlagos would score (without the sharing penalty) we have to figure out some variables. First, adjust the MC result for the speed bump: 60x2.5/2.3~=65pts in specint_rate for the MC 12C @ 2.5GHz. Since scores don't scale perfectly with clocks, we have to adjust by ~4.5% (source: spec.org results for the 12C MC @ 2.3GHz and the 12C MC @ 2.1GHz in specint_rate; the 2.3GHz MC 12C chip scores roughly as if it ran at 2.2GHz, so the correction is 2.3/2.2=1.045 or 4.5%). Final adjusted score for the 12C MC @ 2.5GHz: 65/1.045=62.2pts.
Now we need to estimate the effect of integer Turbo Core on the Interlagos score. The base is 2.5GHz (the top Interlagos model according to some leaks) and the maximum clock is 3.5GHz (half the cores idle). But for specint_rate we have a 500MHz boost on all integer cores, since they are all loaded => 1.2x or 20%. Again we correct by 4.5% since scores don't scale perfectly with clock speed (1.2/1.045=1.15x or a 15% Turbo effect). The sharing penalty is 25% or 1.25x (80% of the performance of the CMP approach), but since we know that the conventional CMP approach usually scales less than perfectly (~95%), the sharing penalty is around ~19% or 1.19x (95/80). Also of interest is scaling with more cores in the specint_rate tests. The 8C Magny Cours @ 2.3GHz in a 4P configuration has a 1.4x lower score than the 4P 12C Magny Cours @ 2.3GHz in this test (source: spec.org). This shows us that with 50% more cores we get a roughly 40% higher score in specint_rate. With 33% more cores we will therefore have a ~24% higher score in specint_rate, or 1.24x. All summed up, we get for Interlagos @ 2.5GHz: 90/1.15x1.19/1.24=75.1pts. Very approximate improvement versus MC per core and per clock in integer workloads: 75.1/62.2=~20%. If we leave 5% for margin of error we get ~14%. If Interlagos launches at a 2.6GHz base clock, then the improvement is down to 15%, or 9% with the error margin. All a very rough approximation, of course.
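The integer estimate chains several correction factors, so here it is as a short sketch. Every factor is one of my rough estimates from the text (the 4.5% clock scaling correction, 15% Turbo effect, 19% sharing penalty and 24% gain from 33% more cores), not a measured value, and the variable names are made up.

```python
CLOCK_SCALING = 1.045   # scores scale ~4.5% worse than clock speed

# Magny Cours 12C: 60pts at 2.3GHz, rebased to 2.5GHz with imperfect scaling.
mc_int_2p5 = 60 * 2.5 / 2.3 / CLOCK_SCALING

# Interlagos: 90pts projected on AMD's slide.
turbo_effect = 1.15     # 500MHz all-core Turbo: 1.2 / 1.045, rounded
sharing_penalty = 1.19  # 95 / 80, rounded
more_cores = 1.24       # estimated gain from 33% more cores

# Strip out Turbo and core-count gains, add back the sharing penalty.
il_int_normalized = 90 / turbo_effect * sharing_penalty / more_cores

per_core_per_clock = il_int_normalized / mc_int_2p5

print(f"MC @ 2.5GHz: {mc_int_2p5:.1f}pts")
print(f"Interlagos normalized: {il_int_normalized:.1f}pts")
print(f"Per core, per clock: {per_core_per_clock:.2f}x")
```

The result moves by a few percent if any single factor is nudged, which is why the 5% error margin above is probably the minimum one should allow.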

-fp throughput
78% higher score than the 2.3GHz MC (154 "pts" vs 86 "pts"). To work out how much Interlagos would score, we have to figure out a few variables. First, adjust the MC result for the speed bump: 86x2.5/2.3~=93.5pts in specfp_rate for a 12C MC @ 2.5GHz. Since scores don't scale perfectly with clocks, we have to correct by ~4.5% (source: spec.org results for 12C MC @ 2.3GHz and 12C MC @ 2.1GHz in specfp_rate; the 2.3GHz chip scores roughly what perfect scaling would predict for 2.2GHz, so the correction is 2.3/2.2=1.045). Final adjusted score for 12C MC @ 2.5GHz: 93.5/1.045=89.5pts.
In FP-heavy workloads there won't be any Turbo Core available since the TDP budget is exhausted. Base is 2.5GHz (top Interlagos model according to some leaks), maximum clock is 3.5GHz (half the cores idle). For specfp_rate we have no Turbo Core boost, so 1.0x.
As discussed before, in FP-heavy workloads the SMT benefit may come into play. Now we have 8 FlexFP units running 16 threads across 16 FMACs (SMT is done 2-way within each module; 2 threads run simultaneously on 2 FMACs via the unified FP scheduler in the FP coprocessor). For now let's leave this one out; we will return to it later. In the final performance equation the SMT benefit is labeled Y.
Also of interest is scaling with more cores in the specfp_rate tests. 8C Magny Cours @ 2.3GHz in a 4P configuration has a 1.27x/1.3x (base) lower score than 4P 12C Magny Cours @ 2.3GHz in this test (source: spec.org). This shows us that with 50% more cores we get a roughly 30% higher score in specfp_rate with Magny Cours. With 33% more cores we will therefore have a ~15.3% higher score in specfp_rate, or 1.15x; this is the effect of 33% more cores in the specfp_rate benchmark. All summed up, with the SMT benefit still unknown, we get for Interlagos @ 2.5GHz: 154/1.0/1.15=~133pts. To figure out the SMT benefit we divide 133 by the score of MC @ 2.5GHz, which is 89.5:
133/89.5=1.48 or 48% better per core. This may look like a huge jump in performance, but there may be a reason why Interlagos scores so well in FP rate: scaling with more cores is much better than with Magny Cours (which doesn't scale very well in this MT floating point test suite). So if we take perfect scaling with 33% more cores: 154/1.33=115pts. Now we have 115/89.5=1.28 or 28% better "per core", which in turn may be the effect of Simultaneous Multithreading in FlexFP (each FMAC on its own should be on par with a Magny Cours core, but working in SMT mode we get a boost similar to the one Intel's cores get from their SMT implementation).
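The two FP scenarios above (Magny Cours-like core scaling vs. perfect core scaling) can be sketched the same way. This is only my own restatement of the arithmetic with hypothetical names; the inputs are the slide and spec.org-derived numbers quoted in the text, and unrounded intermediates shift the results by about a percentage point.

```python
# Inputs as quoted in the post; names are illustrative labels only.
MC_FP_AT_2P3 = 86.0       # 12C Magny Cours @ 2.3GHz, specfp_rate "pts" from the slide
INTERLAGOS_FP = 154.0     # 16C Interlagos, same slide
CLOCK_CORRECTION = 1.045  # scores scale ~4.5% worse than clock
FP_TURBO_EFFECT = 1.0     # the post assumes no Turbo headroom in FP-heavy runs


def mc_fp_at_2p5():
    """Magny Cours FP score scaled to 2.5GHz, corrected for imperfect clock scaling."""
    return MC_FP_AT_2P3 * 2.5 / 2.3 / CLOCK_CORRECTION


def fp_gain_per_core(core_scaling):
    """Interlagos vs Magny Cours after dividing out Turbo and the core-count effect."""
    return INTERLAGOS_FP / FP_TURBO_EFFECT / core_scaling / mc_fp_at_2p5()


# Scenario 1: 33% more cores scale like Magny Cours does (1.15x).
print(round(fp_gain_per_core(1.15), 2))
# Scenario 2: 33% more cores scale perfectly (16/12).
print(round(fp_gain_per_core(16 / 12), 2))
```

Whatever is left over in scenario 2 is what the post attributes to the SMT benefit Y.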

Very approximate improvement versus MC per core and per clock in floating point workloads (serial and parallel): up to 30% in well-threaded (parallel) workloads and more than that (probably >50%) in single-threaded (serial) FP workloads. If Interlagos launches at a 2.6GHz base clock, then the improvement is down to ~25% "per core" in multithreaded workloads. All a very rough approximation, of course.

How does all this relate to Desktop performance?

Above I've tried to speculate about how much effect each aspect of the Bulldozer design will have on final (server) performance. Since many things are still blurry, we don't know for sure how big the core improvement (IPC) is relative to Family 10h (K10). It may vary a lot, but it should still be higher in the case of Bulldozer. Last March at the CeBIT event, AMD's John Taylor said in an interview for OCTV that Bulldozer was designed to offer 30-50% more performance within the same power envelope as the previous design (2:20) (the previous design being the 125W K10-based 6C Thuban @ 3.3GHz). The geometric mean of this wide performance range (30-50%) lands at 39.6%. Since I like to use hardware.fr's average chart for performance comparisons, let's see where Bulldozer would land with ~1.38-1.39x better performance than the 3.3GHz Thuban (with Turbo enabled).

Chart looks like this:

A 38% (average) higher score would put 8C Zambezi at 165.8x1.38=229pts. In the review hardware.fr did when the 1090T X6 launched, they calculated the effect of 50% more cores in desktop workloads (and the effect of Turbo, which was really small since in AMD's case it kicks in only with workloads that stress no more than 3 cores). They found that the 1090T X6 @ 3.2GHz is 24.5% faster than the 955BE X4 @ 3.2GHz: 50% more cores, 24.5% higher performance in the chart. Let's see what kind of score we get if we apply each Zambezi improvement on top of Thuban's base score.

Core count improvement: for Zambezi, this would mean 33% more cores than the X6 and 16.7% higher performance in the chart. This would be the "more cores effect", and it is as expected: with more cores you get a lot less of a performance boost, since most desktop applications are not well threaded (they like faster cores, be it IPC or clock or a combination). The Turbo effect in 8C Zambezi is equally hard to estimate. Latest rumors point to the following models:

So we supposedly have a top 8C model with a 3.6GHz base clock and 4.2GHz Turbo with up to half of the cores active (if a 5th core is activated, the max Turbo is lowered to some value below 4.2GHz). Approximate effect of Turbo in desktop workloads (which are mostly poorly threaded):
geometric mean of 3.6 and 4.2GHz => 3.88GHz or 8% over the default clock. This in turn is 14% above a Thuban X6 @ 3.4GHz. I used 3.4 instead of 3.3GHz in order to account for the limited effect Thuban gets from its own Turbo Core with 3 active cores. Since Zambezi will usually run above stock and close to the geometric mean value (~3.9GHz) in most if not all desktop workloads, it's safe to say that 14% or 1.14x is the approximate clock-related improvement that the top Zambezi 8C model will see over the Thuban 1100T @ 3.3GHz.
Next is the infamous "IPC" or core improvement. Since I spent a good deal of time on this topic in the first part of this blog post, I will use the speculated results, although the "more conservative" variants: integer ~10% per core, FP/SIMD 20% per core. In the case of integer, ~10% is just an average speculative single-core improvement. There is a penalty when both threads run in the module (up to 19%). FP/SIMD benefits from sharing, so there is no penalty there but the opposite effect. The geometric average between int and FP is ~14%. I decided to figure in 0.9x as an average penalty factor that is applied to the final Zambezi score.
All factors combined: Thuban's score x (1.167x1.14x1.14x0.9)=165.8x1.365~=226pts in the chart above. A score of ~226 fits quite well with what John Taylor said, namely between 30 and 50% faster than the previous AMD design at the same TDP. If Zambezi really ends up being 38% faster than the top Thuban (~229pts), it becomes the fastest desktop chip on the market, posting the same average score as Intel's top desktop Westmere 6C product, the 990X.
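As a cross-check of the combined factor, here is a small Python sketch that rebuilds it from the individual pieces. The variable names are mine, the inputs are the figures used in the text, and the Turbo factor is computed from the geometric mean instead of being rounded to 1.14, so the result comes out roughly a point higher than the ~226 quoted above.

```python
import math

# Inputs as used in the post; names are illustrative labels only.
THUBAN_CHART_SCORE = 165.8   # 1100T in hardware.fr's average chart
CORE_COUNT_EFFECT = 1.167    # 33% more cores in desktop workloads
IPC_EFFECT = 1.14            # geometric average of the int/FP per-core estimates
SHARING_PENALTY = 0.9        # average module-sharing penalty factor
BASE_GHZ, TURBO_GHZ = 3.6, 4.2   # rumored top 8C Zambezi clocks
THUBAN_EFFECTIVE_GHZ = 3.4       # 3.3GHz plus a small allowance for its own Turbo


def turbo_effect():
    """Typical Zambezi clock (geometric mean of base and Turbo) vs effective Thuban clock."""
    return math.sqrt(BASE_GHZ * TURBO_GHZ) / THUBAN_EFFECTIVE_GHZ


def zambezi_8c_estimate():
    """Thuban's chart score multiplied by all four factors."""
    factor = CORE_COUNT_EFFECT * turbo_effect() * IPC_EFFECT * SHARING_PENALTY
    return THUBAN_CHART_SCORE * factor


print(round(turbo_effect(), 3), round(zambezi_8c_estimate(), 1))
```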

My final estimate for 8C Zambezi at 3.6GHz/4.2GHz Turbo: 36-38% faster than the 3.3GHz 1100T Thuban. As you can see, this is very different from the server statement of "50% more throughput than current 12C products", even though the core ratio between the corresponding server and desktop parts is the same, exactly 33% more cores. Note also that the clock speed ratio is even higher for desktop Bulldozer models vs Thubans (3.6/4.2 vs 3.3/3.7) than for server BD vs Magny Cours parts (more even clock speeds of around 2.5GHz for both). Still, desktop will see a lower increase in performance due to the nature of the workloads, which are more serial and less parallel.

In the end, here is a quick estimate for the Zambezi 6C and 4C models, assuming the same clock speeds for both (3.6GHz base for the top 6C and 4C models):

Top Zambezi 6C 3.6GHz/4.2GHz with Turbo: slower on average than the 8C 3.6GHz/4.2GHz model (229pts) by the 1.167x effect of 33% more cores. Approx. chart result: 194-196pts <- faster than 2500K

Top Zambezi 4C 3.8GHz?/4.4GHz? (based on the leaked info above) with Turbo: ~15% slower on average than the 6C 3.6GHz/4.2GHz model. This is a combination of the "fewer cores effect" (/1.24) and the higher clock speed effect (x1.05). Approx. chart result: 165-166.5pts <- faster than top-end Thuban @ 3.3GHz
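The 6C and 4C numbers follow from the 8C estimate by dividing out the core-count effects. A minimal Python sketch, using hypothetical names and only the factors quoted in this post:

```python
# Inputs as used in the post; names are illustrative labels only.
ZAMBEZI_8C_SCORE = 229.0   # estimated 8C chart score
EFFECT_8C_VS_6C = 1.167    # 33% more cores in desktop workloads
EFFECT_6C_VS_4C = 1.24     # 50% more cores (hardware.fr's X6 vs X4 result, rounded)
CLOCK_BONUS_4C = 1.05      # rumored 3.8GHz vs 3.6GHz base clock


def zambezi_6c_estimate():
    """8C score with the 33%-more-cores effect divided out."""
    return ZAMBEZI_8C_SCORE / EFFECT_8C_VS_6C


def zambezi_4c_estimate():
    """6C score with the 50%-more-cores effect divided out, plus the clock bonus."""
    return zambezi_6c_estimate() / EFFECT_6C_VS_4C * CLOCK_BONUS_4C


print(round(zambezi_6c_estimate(), 1), round(zambezi_4c_estimate(), 1))
```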

If you have any comments or suggestions, please post them in the comments section.

Tuesday, July 5, 2011

Update: At the bottom of the original blog post I have examined coolaler's latest results from his Sandy Bridge i3 sample downclocked to 1.8GHz. His numbers are very low for SB and skew the results in Ivy Bridge's favor by 9 to 17% (could be a BIOS glitch in the SB system).

Coolaler's forum has a new Ivy Bridge (IB) leak. This time we have an ES which is 2C/4T, working at 1.8GHz and having 4MB of L3 (vs 6MB of L3 in current SB CPUs).

Cine 11.5 - single and multithreaded benchmark, uses a lot of FP SSE instructions. In this particular case the tester used the MT benchmark. To figure out how a 2C/4T SB performs in it, we can find a comparable retail SB i3 chip and its C11.5 score and then scale it down to a 1.8GHz clock.

Summary:
The IB ES is roughly on par with SB at the same clock and with the same number of cores and threads. There are some minor differences, so actual scores vary from it being 6% slower than SB (SPI), through being practically equal to it (CPUmark99), to being 4% faster (C11.5) at the same clock/core/thread configuration. All this is based on an early ES, so clock speeds will move up and Turbo will probably be a bit more aggressive with IB. This could lead to better overall scores, but going by Intel's ES chips from the past, they usually perform within 5% of retail parts (at the same clock). This practically leaves only clock speed/Turbo as a variable.

edit: Note also that there will be mainstream IB parts in 2C/4T and 4C/4T or 4C/8T configurations which will have more L3 onboard, maybe matching or surpassing SB's 6MB of L3. This could bring a few additional % of performance on average in desktop apps.

UPDATE:
Coolaler updated his thread with comparative scores between a retail SB 2C/4T @ 1.8GHz and the IB 2C/4T @ 1.8GHz. But there is some problem with his SB i3 system.

Coolaler's SB i3 numbers are way slower than what reviews show for retail parts that work @ 3.1GHz (if we scale them down to 1.8GHz). Something is wrong with his SB i3.
Example:
In C11.5 an SB i3 2120 gets 3.19pts @ 3.3GHz here, so at 1.8GHz it should score roughly 1.74pts.
This lines up with the i3 2100 score which bit-tech reviewed here. Bit-tech got 2.97pts @ 3.1GHz, so an i3 2100 @ 1.8GHz should be getting 1.72pts, practically the same as the Hardware Canucks SB i3. One more example is here, which shows a scaled-down i3 score of 1.73pts (at 1.8GHz). All 3 of these results are ~11% better per clock/per core than what coolaler gets with his sample... Or in numbers: 1.72pts (rounded from the 3 sources)/1.55pts (coolaler's SB i3)~=1.11 or 11%.

Let's see what Super Pi shows us. Here we have the i3 2120's result in Super Pi 1M. It scores 11.9s @ 3.3GHz. At 1.8GHz the SB 2C/4T would get ~21.81s. This again is faster than what coolaler's i3 is getting. The difference is 23.83/21.81~=1.09 or 9%. So coolaler's i3 is 9% slower in Super Pi than other retail SB i3s, per clock.

We move on to CPUmark99. I found this result of a retail i3 2100 on the same (coolaler's!) forum here. The i3 2100 @ 3.1GHz scores 475pts. So an i3 @ 1.8GHz should be getting 475x1.8/3.1~=275pts, and 275/235=1.17, or 17% faster than coolaler's own i3 chip. Hmm, a similar pattern occurs. His downclocked sample is 9% to 17% slower than other retail SB i3 chips out there.

Conclusion: coolaler's i3 @ 1.8GHz is performing anywhere from 9 to 17% slower than "normal" SB i3 CPUs would if they ran at the same 1.8GHz clock. This skews the results a bit and shows his IB @ 1.8GHz being slightly faster than his SB i3 at the same clock. The reason could be some BIOS glitch or a power management issue.
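The comparisons above all rest on simple linear clock scaling: scores go down in proportion to clock, run times go up. A minimal Python sketch of that scaling, with helper names of my own choosing and only the numbers quoted in this post (it assumes perfectly linear scaling, which is the same simplification the text makes):

```python
TARGET_GHZ = 1.8  # clock that coolaler's samples were run at


def scale_score(score, from_ghz, to_ghz=TARGET_GHZ):
    """Scale a higher-is-better score linearly with clock."""
    return score * to_ghz / from_ghz


def scale_time(seconds, from_ghz, to_ghz=TARGET_GHZ):
    """Scale a lower-is-better run time inversely with clock."""
    return seconds * from_ghz / to_ghz


# Retail results scaled to 1.8GHz, divided by coolaler's own SB i3 results.
c115_gap = scale_score(2.97, 3.1) / 1.55        # bit-tech i3 2100 vs 1.55pts
superpi_gap = 23.83 / scale_time(11.9, 3.3)     # 23.83s vs scaled i3 2120 time
cpumark_gap = scale_score(475, 3.1) / 235       # forum i3 2100 vs 235pts

for name, gap in [("C11.5", c115_gap), ("SuperPi 1M", superpi_gap), ("CPUmark99", cpumark_gap)]:
    print(f"{name}: retail is {100 * (gap - 1):.0f}% faster per clock")
```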

Monday, June 13, 2011

Update: I'm waiting on the complete AT review (still in preview stage). He did retest the A8 with 1600 and 1866MHz RAM and it made a massive difference. On the CPU side Llano is around 3-8% faster than Deneb at the same clock, a solid improvement but nothing spectacular. For a slightly tweaked shrink it is a good result on the CPU side. Turbo may be a bit of a letdown since the max Turbo states are rarely hit due to the shared TDP, in which the GPU part is prioritized over the CPU cores. The top desktop part has no Turbo and works at a fixed 2.9GHz clock with power management p-states in between (800MHz-2.9GHz).

I'm making a chart that summarizes Llano's general performance vs Sandy Bridge parts, stay tuned.

Sunday, June 12, 2011

You all remember my original blog post about the BD ES weirdness that goes on in the Far East (and probably elsewhere). The problem is/was that those chips are gimped in many ways so that the competition is unable to figure out the true potential of the first new AMD core design since the original K7.

Intel announced yesterday the new AVX2 ISA extensions that should be introduced with Haswell in 2013. We finally get 256-bit integer AVX instructions, since AVX1 was limited to FP when it comes to 256-bit support. There are other additions like support for FMA (256b/128b but FP only), but the main one is 256b integer SIMD support.

It seems someone got hold of a retail A8 3850 part and tested it here. Thanks go to dresdenboy for the link.
The part works at a fixed 2.9GHz (correction: I originally wrote 2.6GHz with Turbo up to 2.9GHz, but this model has no Turbo). The GPU part works @ 600MHz and features 400SP, so it is on par with the discrete 6550M part.

Now on to results!

The user ran a set of Futuremark tests: 3DMark11, Vantage and 3DMark06. He also managed to OC both the CPU and GPU portions of the part to some rather high levels, just via serial bus tuning (45%). He used air cooling. Final OC speeds are 3.77GHz (30% OC) for the CPU and 870MHz (45% OC) for the GPU; DDR3 was also OCed to 2320MHz (45% OC). The memory OC is very important since the GPU still depends on memory bandwidth, and it ensures that GPU performance scales linearly with the (GPU) clock speed increase.

Thursday, June 9, 2011

Sorry in advance for the longer post :).
Disclaimer: This is just my speculation, which is probably just that, speculation. I have not signed any NDA documents, nor do I have the hardware discussed here.

Since we have all been witnesses to very strange Zambezi (and Llano) ES scores, this is my attempt to "predict" and explain what may have been going on with these scores. I will post what I expect as an end score, so in the end, when Zambezi launches, we can see how far away from the real thing I was :).

Let's start with my theory about why Zambezi X8 has such low scores. I do believe there is at least some BIOS microcode patching going on, but mostly it's something else. As dresdenboy suggested before, and I agree with him, there is some power cap pre-programmed in the ES chips we are seeing on Chinese forums. This may explain the frowning AMD motherboard partners who received the same tweaked chips for the validation process ;).
Just like in Llano's case, actual clocks are being kept really low in order to keep the CPU within the TDP spec (via the Turbo 2.0 interface) that AMD designated in the ES sampling process. This may be 35, 45, 65, 95 or 125W. From the looks of things, current BD ES chips are limited to 45 or 65W and they keep throttling down whenever the limit is crossed (measured and estimated digitally in BD).
What does this mean in practice? Just as in the case of the Llano ES in the "New Llano leaks" thread, the BD ES throttles down to an approx. 2x lower clock speed than what is shown in CPU-Z in single-threaded workloads. This also happens in MT workloads, as the limit is easily reached in that case. There seems to be a limited "Turbo" ability too, so say a 2.8GHz ES part may be able to Turbo to what I think is 2.0GHz (10x multi in reality :P) or up to the power limit, which is reached in this case.
So for example, the 2.8GHz ES (a 1.4GHz chip with 2GHz Turbo and advanced C6 power savings turned ON) scored 23.4s in SPi. When the tester disabled C6 and seemingly locked the ES @ 3.2GHz (1.6GHz effectively, while preventing cores from going into deep sleep and thus reaching the TDP limit sooner), the SuperPI result actually got worse, at 26.7s. The cores now did "Turbo" approx. one multiplier up and finished the test at 1.8GHz. This is in line with the worse SPi result.

Now that I have explained my theory and what I think is going on here, let's move on to my prediction of scores, all based on the Chinese leaks thread.

Next one is Fritz chess. This is a tricky one. The 1-core score from here is 1877pts, with C6 enabled and Turbo limited to 2GHz. The user runs the MT test with 8 cores and gets a 9454pts result. How is this possible? Well, in my opinion, the TDP limit kicked in again, limiting each core to 1.6GHz while the multithreaded (MT) test was run. We know that a module delivers 80% of the throughput of a hypothetical native Bulldozer dual-core design (as per AMD themselves), meaning a 6.4x factor instead of a perfect 8x => 0.8x8=6.4. We have: 1877 x 1.6/2 x 6.4=~9600pts, close enough, huh? The error is under 2% from the actual score ;).
What do I think the score of a 3.2GHz Zambezi 8C will be in Fritz chess? 19220pts, give or take 2%.
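The Fritz reasoning above is a one-line model: single-core score, scaled by clock, times 8 cores at 80% module scaling. Here it is as a small Python sketch, with names of my own choosing; the 1.6GHz and 2.0GHz clocks are the power-capped values my theory assumes, not measured ones.

```python
# Inputs from the leaked run as quoted in the post; names are illustrative labels.
SINGLE_CORE_SCORE = 1877.0   # 1-core Fritz score, assumed to have run at 2.0GHz
SINGLE_CORE_GHZ = 2.0
CORES = 8
MODULE_SCALING = 0.8         # AMD's quoted module throughput vs two full cores
MEASURED_MT_SCORE = 9454.0   # leaked 8-core result


def fritz_mt_estimate(clock_ghz):
    """Predicted 8-core Fritz score at a given sustained clock."""
    return SINGLE_CORE_SCORE * clock_ghz / SINGLE_CORE_GHZ * CORES * MODULE_SCALING


capped = fritz_mt_estimate(1.6)   # assumed power-capped clock in the leaked MT run
retail = fritz_mt_estimate(3.2)   # hypothetical retail clock
print(round(capped), round(retail), f"{100 * (capped / MEASURED_MT_SCORE - 1):.1f}% off")
```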

Next one is the popular Cinebench 11.5. The "gimped" BD ES scored just 4.6pts. Too low? You bet. This is in line with a Phenom X4 @ 3.5GHz, while this was supposed to be a brand new octal core from AMD running at close to 3.2GHz. Well, the explanation is easy and is again, as in the previous case, power capping.
So we have a supposedly 2.8GHz chip scoring 4.6pts in C11.5. As my theory goes, this is actually the score of a 1.4GHz or 1.6GHz 8C Zambezi which is limited via the power cap (I don't know what they set in the BIOS, 2.8GHz or 3.2GHz). What will the score of a retail 3.2GHz Zambezi be in this benchmark? My estimate is as follows: 1) worst case scenario 4.6pts x 3.2/1.6=9.2pts; 2) best case scenario 4.6 x 3.2/1.4=10.51pts. Now 10.51pts may sound too high since the 980X scores 9.2pts, but remember this slide?

This leaked AMD slide from Donanimhaber states that Zambezi 8C @ XX GHz will be approximately 1.80x faster in C11.5 than the Thuban X6 1100T, which scores 5.9pts. That works out to 5.9x1.8=10.6pts. Close enough?
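For completeness, the Cinebench range and the slide-based figure can be put side by side in a few lines of Python. Names are my own; the effective clocks are the two values my power-cap theory allows, and linear clock scaling is assumed.

```python
# Inputs as quoted in the post; names are illustrative labels only.
ES_C115_SCORE = 4.6         # leaked ES result
RETAIL_GHZ = 3.2            # hypothetical retail clock
THUBAN_C115_SCORE = 5.9     # 1100T
SLIDE_SPEEDUP = 1.8         # claimed Zambezi 8C vs 1100T on the leaked slide


def retail_c115_estimate(effective_es_ghz):
    """Scale the ES score linearly from its assumed effective clock to the retail clock."""
    return ES_C115_SCORE * RETAIL_GHZ / effective_es_ghz


worst_case = retail_c115_estimate(1.6)   # ES actually ran at 1.6GHz
best_case = retail_c115_estimate(1.4)    # ES actually ran at 1.4GHz
slide_value = THUBAN_C115_SCORE * SLIDE_SPEEDUP
print(round(worst_case, 2), round(best_case, 2), round(slide_value, 2))
```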

Following C11.5 is the famous 3DMark06. The score is also on the same link as all of the above. The "3.2GHz" Zambezi 8C supposedly scores ~4500pts... Yeah, that's right, a score just a bit better than what a Phenom X4 @ 3GHz, launched in Jan 2009, is getting. So we supposedly have a brand new core design, packing 8 cores in total, with IPC improvements, that still somehow sucks so badly that it is 2x slower per core and per clock than the ancient Phenom II X4 (just forget the X6, it's miles ahead of this poor Zambezi).
So what do I think is the real score of Zambezi in this benchmark? It should be between 7800 and 8800pts. The point is that this is one weird test that doesn't scale nicely past 6 cores, so it's a bit trickier to figure out what a "normal" Zambezi will score here. The Donanimhaber slide indicates a 50% better score for the 3+GHz (I assume) X8 vs the 1100T, which scores 5900pts. So we have, projected by AMD: 1.5x5900=~8800pts, and estimated here by me: 7800-8800pts.

So there you go, I tried my best to figure out what is going on with those gimped Bulldozer ES chips out in the (Chinese) wilderness. Not much longer to go, around 45 days or so. We shall soon know how wrong I was. Stay tuned for more.