I would have liked to post my thoughts earlier, but higher-priority tasks prevented me from doing so.

Brazos

Those Brazos previews, which appeared around mid-November, caused a lot of discussion on the net. One reason was the inclusion of an SSD instead of an HDD in the Zacate prototype system. This surely had some effect on benchmarks which depend on hard drive performance. It also influenced the measured power consumption. The benchmarks I would expect to be influenced most are office benchmarks like SYSMark, where script-controlled applications do file operations, start . Other types of benchmarks are mostly or solely CPU/memory bound, like Cinebench (surely an interesting benchmark for mobile devices), or CPU/GPU/memory bound, like games (they usually load their data into RAM before starting a benchmark).

Another question seldom answered by the previews was how Zacate's power management works. Some reviewers included AMD's statements about idle power likely going further down in final systems. The most helpful remark came with the latest paper issue of the German c't mag, where the reviewer noted that the power management itself was not the final version. This might be supported by power consumption numbers measured while running different kinds of code. For example, the PC Per review lists a power consumption of 19.1W while running CineBench 11 (heavy CPU usage, esp. FPU) and 28.8W while running Left 4 Dead 2 with both heavy CPU and GPU usage. This is a difference of 9.7W, or roughly 10W, for using 3D graphics. Some of this difference is likely caused by mem operations, which include power for the mem controller. But it's still possible that the available TDP headroom isn't fully exploited by the power management. This might be answered by an interview done by Xbit Labs, where Godfrey Cheng, director of the client technology unit at AMD, answered a lot of questions about Zacate, Ontario and Fusion in general. He states that power management, esp. frequency boosting, isn't final and is subject to change. So in case of the previews I wouldn't assume a higher CPU core clock than 1.6GHz and a higher GPU clock than 500MHz for the tested Zacate systems. Further, he said that Zacate is capable of delivering more than 90 GFLOPS of compute performance. This is in comparison to the 400-500 GFLOPS stated for Llano.

I actually wonder if the Zacate APUs used in the prototype systems are of an earlier stepping than the ones sent out in the first batch shipped to device manufacturers as mentioned by AMD's CEO Dirk Meyer at the Financial Analyst Day. However, here are the (p)reviews:

This week news came in about Apple's decision to use Sandy Bridge for their notebook series. There is still discussion going on about whether Sandy Bridge's GPU will support OpenCL or not. According to an Ars Technica article, it won't, due to the limited capabilities of the GPU. On the other hand, an Intel representative has already said the chip will support OpenCL 1.1. His statement made no distinction between GPU and CPU. However, even without OpenCL support in the GPU there would be a way to use OpenCL on the chip, as Intel has shown at this year's SIGGRAPH. Intel's OpenCL SDK can be found here (incl. videos of the SIGGRAPH talk): http://software.intel.com/en-us/articles/intel-opencl-sdk/

Upcoming server socket C2012 with three memory channels

As I already wrote, the C2012 socket for the single-die Sepang processor (likely Opteron 4300 series) will provide three memory channels. Some people wondered why the bigger G2012 socket for Terramar (likely Opteron 6300 series) will only have four channels (similar to G34 now) and not six. I think the reasons are simple. Sepang and Terramar will both have about the same max. TDP limits. In the case of Terramar processors this will likely lead to about 70% of the max. clock frequency of Sepang processors. So with half the CPU core count and about 1.4X the clock, a Sepang processor would have about 70% of the theoretical throughput of Terramar. Three memory channels would provide 75% of the memory bandwidth of a four-channel socket. The other way round, Terramar has about 1.4X the throughput of Sepang (0.7X the clock and 2X the cores). Four channels should be enough then. Other considerations for this three/four channel configuration could have been the increased number of attachable DIMMs in a C2012-based system and the power consumed by additional mem controllers for G2012. Harvesting dies with one non-functional mem controller for use in a Terramar MCM could be another option.
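The back-of-the-envelope numbers above can be checked with a few lines of Python. This is only a sketch of the reasoning in this post: it assumes throughput scales linearly with core count times clock, and the function name and variables are made up for illustration, not taken from any AMD material.

```python
# Rough model: theoretical throughput ~ cores * clock (an assumption of this sketch).
def relative_throughput(core_ratio, clock_ratio):
    """Throughput of chip A relative to chip B, given A/B core and clock ratios."""
    return core_ratio * clock_ratio

# Sepang vs. Terramar: half the cores, about 1.4X the clock.
sepang_vs_terramar = relative_throughput(0.5, 1.4)

# Terramar vs. Sepang: 2X the cores, about 0.7X the clock.
terramar_vs_sepang = relative_throughput(2.0, 0.7)

# Three channels (C2012) vs. four channels (G2012).
bandwidth_ratio = 3 / 4

# Memory bandwidth per unit of throughput, Sepang relative to Terramar.
bandwidth_per_throughput = bandwidth_ratio / sepang_vs_terramar

print(f"Sepang throughput vs. Terramar:   {sepang_vs_terramar:.2f}")
print(f"Terramar throughput vs. Sepang:   {terramar_vs_sepang:.2f}")
print(f"C2012 bandwidth vs. G2012:        {bandwidth_ratio:.2f}")
print(f"Bandwidth per throughput (ratio): {bandwidth_per_throughput:.2f}")
```

Under these assumptions the three-channel socket ends up with slightly more bandwidth per unit of throughput than the four-channel one, which fits the argument that six channels are not needed for G2012.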

Other stuff

This week Intel presented at the Barclays Capital Global Technology Conference. Apparently Otellini said there (I didn't listen to the webcast) that Ivy Bridge samples are back from the fab and functioning well.

AMD recently launched new CPU models including their new flagship processor Phenom II 1100T. You can find a collection of review links at XtremeSystems. If you look closely you might see some strange behaviour in some tests (esp. game tests at computerbase): the 1100T is sometimes more than 3% faster than a 1090T, although both the base and turbo clock speeds would suggest only about a 3% improvement. Maybe there were some differences in the test setups. But if not, this could be related to some internal changes to the CPU.
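To put the 3% figure in context, here is a small sketch of the clock ratios. The clock values are not stated in this post; they are the commonly listed specs (1090T: 3.2GHz base, 3.6GHz turbo; 1100T: 3.3GHz base, 3.7GHz turbo) and should be treated as an assumption here.

```python
# Assumed clock speeds in GHz (commonly listed specs, not taken from this post).
clocks = {
    "1090T": {"base": 3.2, "turbo": 3.6},
    "1100T": {"base": 3.3, "turbo": 3.7},
}

def clock_gain_percent(new, old):
    """Clock advantage of `new` over `old` in percent."""
    return (new / old - 1.0) * 100.0

base_gain = clock_gain_percent(clocks["1100T"]["base"], clocks["1090T"]["base"])
turbo_gain = clock_gain_percent(clocks["1100T"]["turbo"], clocks["1090T"]["turbo"])

print(f"Base clock gain:  {base_gain:.1f}%")
print(f"Turbo clock gain: {turbo_gain:.1f}%")
```

With these clocks, pure frequency scaling gives roughly 3% at best, so any test showing clearly more than that needs another explanation (test setup or internal changes).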

If you missed it: In the comments to some of the more recent blog postings there were lengthy discussions about BD's IPC, issue width, die size and further attributes.

And nearly the last one: A successor to Bulldozer (incl. enhanced BD and BD NG) might be called Steamroller. Sounds even logical. This could be the APU coming around 2014.

Considering the new name, would this be a from the ground up, all new architecture too? I'm guessing it's the one that has SIMDs more tightly integrated into the architecture, which I'm sure would require quite a rigorous overhaul at least.

My impression is that this will be a new design, maybe based on BD components, but with more changes regarding integration of shader-like compute blocks. Same for Jaguar (Bobcat successor). So we share about the same view.

More leaks about supposed Zambezi performance levels (now from donanimhaber!). They said they have seen the slides coming directly from AMD, and it is claimed that an 8 core Zambezi @ unknown clock is 50% faster than a Phenom II 6 core @ 3.3GHz and 4C/8T i7 parts such as the 950.

With 50% over the 1100T and the 950 (which are 8% apart in the chart I linked), we get 8C BD @ unknown clock to be around 11 or 20% faster than Gulftown in mixed benchmarks, depending on which CPU we use as a base to extrapolate Zambezi's performance level.
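A quick consistency check of that extrapolation, as a sketch: if both baselines get the same 1.5X factor, the two resulting estimates relative to Gulftown must differ by the same 8% that separates the baselines. The variable names are made up for illustration and the inputs are only the percentages quoted above.

```python
# Inputs taken from the comment above.
bd_gain_over_baseline = 1.50   # claimed Zambezi advantage over either baseline
baseline_gap = 1.08            # the two baselines are 8% apart in the chart
low_estimate = 1.11            # BD vs. Gulftown using one baseline
high_estimate = 1.20           # BD vs. Gulftown using the other baseline

# The ratio of the two estimates should match the gap between the baselines.
estimate_gap = high_estimate / low_estimate

# Implied Gulftown score relative to the slower baseline, derived both ways.
gulftown_from_low = bd_gain_over_baseline / low_estimate
gulftown_from_high = bd_gain_over_baseline * baseline_gap / high_estimate

print(f"Gap between the two estimates: {(estimate_gap - 1) * 100:.1f}%")
print(f"Implied Gulftown vs. slower baseline: "
      f"{gulftown_from_low:.2f} / {gulftown_from_high:.2f}")
```

Both routes imply Gulftown at roughly 1.35X the slower baseline in that chart, so the 11% and 20% figures are consistent with each other.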

I've heard about those things. While the instructions table doesn't suggest that many int ops could be executed in those units, their description suggests so (capable of doing simple logical and arithmetic ops - surely some things which go beyond LEA instructions and maybe linked to certain input patterns).

The known size of the branch prediction related tables (known since Hot Chips) doesn't tell much about the performance. But so far it sounds like BD's branch prediction will work similarly to BC's branch prediction, but improved (look at the pipeline diagram).

There are simple predictors (fast and less accurate) and more complex predictors (slower but much more accurate, which is important if you know the tremendous effect of a 1% more accurate branch prediction). The simpler ones may be overridden by the more complex ones (a few cycles later) if the prediction was wrong.

So in most cases (90%) branch prediction would have a small or no taken branch penalty, in a few % of cases there would be a penalty of a few cycles and overall only 3% of incorrectly predicted branches (costing the full branch misprediction penalty depending on case) - with maybe a few cycles shaved off because of a prefetched alternative path.
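To illustrate how such a split translates into an average cost per branch, here is a minimal sketch. Only the 90% and 3% shares come from the paragraph above; the 7% share for the "few cycles" case is simply the remainder, and the cycle counts (2 and 20) are hypothetical placeholders, not known Bulldozer figures.

```python
# (share of branches, penalty in cycles); the cycle counts are hypothetical.
cases = {
    "fast predictor correct":     (0.90, 0),   # small or no taken-branch penalty
    "overridden by slow predictor": (0.07, 2),   # assumed: remainder, a few cycles
    "mispredicted":               (0.03, 20),  # assumed full misprediction penalty
}

def expected_penalty(cases):
    """Average penalty in cycles per branch for the given case mix."""
    return sum(share * cycles for share, cycles in cases.values())

total_share = sum(share for share, _ in cases.values())
avg_penalty = expected_penalty(cases)

# Same mix with 1 percentage point fewer mispredictions (moved to the fast case).
better = dict(cases)
better["fast predictor correct"] = (0.91, 0)
better["mispredicted"] = (0.02, 20)
avg_penalty_better = expected_penalty(better)

print(f"Shares sum to {total_share:.2f}")
print(f"Average penalty per branch: {avg_penalty:.2f} cycles")
print(f"With 1% fewer mispredictions: {avg_penalty_better:.2f} cycles")
```

Even with these made-up cycle counts, the mispredicted 3% dominates the average, which is why a single percentage point of accuracy matters so much.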

Thanks Dres, I really appreciate your thoughts. Can't believe I missed the branch prediction stuff from Hot Chips; it can be a lot of info to get your head around at times. This really makes me curious about how the predictors perform in real life.

PS Love the twitter feed, even though I sorta miss the more frequent blog posts