You should also post results with clocks not artificially locked to base CPU and max GPU. One of the benefits of newer generations is improved turbo, for example. Not like it really matters, anyone could tell at a glance that this isn't what they are targeting with Kaveri. With that being said, it was still an interesting read.

On a semi-related note: What I'd really like to see is Kaveri gaming results with faster memory and an overclocked GPU. I'd bet it outscales Richland.

It's for HPC. In HPC applications, upfront cost is irrelevant; it's performance per watt that matters. The long-term costs of electricity and cooling will eclipse the upfront cost of the CPU quite rapidly. And in performance per watt Intel is miles ahead.

The question the article tries to answer, theoretically, is "how capable is Kaveri for raw number crunching compared to several alternatives". And that's exactly what this could have to do with HPC, if the numbers were better. With DP running at 1/16th the SP rate, I don't think Kaveri is going anywhere in classic HPC. It could be used in special SP applications with HSA, though.

Yes and no. There are plenty of places such as in academia where you might have computers in rack-mounts without room for dedicated GPU, but are doing HPC-like workloads.

I agree that the use-case range is small, but that was kind of the conclusion of the article. Even with a (relatively speaking) beefy semi-discrete GPU in it, Kaveri still falls short of the performance you can get out of the Haswells with Iris Pro.

Of course, apart from this being an architectural comparison, these Kaveri parts are the best AMD officially makes for the desktop, just as the i7s are for Intel. You simply can't buy anything better from AMD for the desktop. If they are not making what you, the end consumer, want to buy, then no sale means no profit for them this time around.

Kaveri are still only the best AMD APUs. A 2-module APU does not translate to the best AMD has to offer for the desktop; FX processors are still much better if you are going to get a dedicated GPU. You wouldn't call the i7 4770K the best Intel has to offer, would you? There is a whole range of socket 2011 CPUs.

Is there any indication that GT3e will trickle down to non-H/R Haswell CPUs? Or that Broadwell will expand GT3e technology (in some form) to the rest of the Intel lineup? As of right now, GT3e may as well be vaporware as you can only get it in expensive, limited configurations. Kaveri spans the whole AMD product line at significantly lower cost, but then gets its butt kicked by similarly-priced Intel+dGPU setups. I would really like it if GT3e gets a lot cheaper and more widespread while Kaveri gets a lot more potent.

Indeed. The 4770R is only available to OEMs and more or less unobtainable. Even if you could get your hands on one, you wouldn't want to. Firstly, it comes with a huge price tag; secondly, you lose 2 MB of cache, which effectively makes it a Core i5.

The key to AMD's success with Kaveri will come on budget mobile notebooks and SFF, where the lack of a dGPU would heavily tilt the gaming advantage to AMD. While Intel HD4000/4600 can game pretty well at 768p, Kaveri would steamroll it and be competent up to 900p... assuming Broadwell IGP doesn't greatly improve.

I'm not so sure just yet. I'm hoping for a strong Kaveri on laptops, but past experience with Llano, Trinity and Richland showed the clear desktop win of AMD's APUs quickly eroding on portables due to power constraints.

This year, the gap is much smaller, with strong contenders such as HD 5000 and HD 5100 in many laptops. I'm not entirely sure that Kaveri's upper hand in graphics performance will be large enough to justify the considerable loss of CPU performance and battery life (assuming that Kaveri will perform as poorly against Haswell as its older brothers did). And then, Kaveri mobile will come just months before Broadwell, which is said to improve GPUs by quite a bit.

I have an A6-1450 11.6" laptop that's supposed to be a 9W 'SoC' with a 30Wh battery. I also have an i3 Ivy Bridge 11.6" tablet that's supposed to be a 17/14W 'SoC' with a 54Wh battery.

I expected the A6 to get 60-70% of the i3's battery life based on a light-load usage pattern. I got only about 40% of the i3's life. A rough calculation ended up with around 10W average power consumption on the A6 and an average of 5W on the i3.
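For anyone who wants to redo this kind of estimate, the arithmetic can be sketched as below. It is only a sketch: it assumes the rated capacities (30 Wh and 54 Wh) are fully usable and that both machines ran the same workload, and the function names are made up for illustration.

```python
# Rough average-power estimate from battery capacity and runtime.
# Assumption: the full rated capacity (in Wh) is usable, which real
# batteries rarely allow, so treat the results as ballpark figures.

def average_power_w(capacity_wh: float, runtime_h: float) -> float:
    """Average draw in watts = energy (Wh) / time (h)."""
    return capacity_wh / runtime_h

def power_ratio(cap_a_wh: float, cap_b_wh: float, runtime_a_over_b: float) -> float:
    """P_a / P_b = (cap_a / cap_b) / (t_a / t_b)."""
    return (cap_a_wh / cap_b_wh) / runtime_a_over_b

# Figures from the comment: 30 Wh vs 54 Wh, and 40% of the runtime.
ratio = power_ratio(30.0, 54.0, 0.40)

# Runtime ratio you would see if both machines drew the same power.
equal_power_runtime_ratio = 30.0 / 54.0

print(f"A6 draw relative to i3: {ratio:.2f}x")
print(f"Runtime ratio at equal power: {equal_power_runtime_ratio:.0%}")
```

With only the capacities and the 40% runtime ratio as inputs, the relative draw comes out around 1.4x. Absolute wattages need the measured runtimes, which are not given here.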

Considering I only paid $280 for the A6 and $450 for the i3, I am still quite happy with it, but I can't help but wonder if AMD's SDP/TDP is very different from Intel's.

To my understanding TDP means the max amount of heat you need to dissipate to keep everything running smoothly. Based on that understanding and the 15W power range, you can let the CPU/APU run hotter (thus rejecting 2-3W worth of heat into surrounding pieces: package, motherboard, case, etc) with the same heatsink TDP.

Can't really draw any conclusions from that. The SoC/APU/CPU usually accounts for a very small share of the energy draw in modern laptops/tablets. The display accounts for most of the power usage, and even a small difference in brightness, or indeed in panel manufacturer, can easily account for your scenario.

I've noticed the same thing. Gaming-wise, the desktop A10 Trinity creamed Ivy Bridge. On mobile, though, the performance difference was only 18% in favor of AMD. With Haswell, Intel hits the same performance as mobile Richland A10s in games, and gets better battery life to boot. On the other hand, the performance of the 45 watt A8-7600 makes me hopeful that AMD will give us another 45 watt mobile Fusion APU that would be as fast as the desktop version.

I'm dreaming of that too. It would be a shame if mobile Kaveri took the same huge performance hit that its older brothers saw when moving from desktop to mobile.

If history repeats itself, I think Broadwell will hit Kaveri-M very hard, relegating it to the same shady spot in poor budget designs where Llano, Trinity and Richland were. I would love to see an AMD APU performing strongly in a good laptop. If Kaveri-M ever threatens Broadwell, at least for gaming-focused folk, it would have the healthy effect that competition has on Intel: lower prices, better parts.

Hopefully, but that requires design wins which they have been sorely lacking compared to Intel. And AMD seems practically non-existent in the SFF space. Where is their NUC? Hell, where are their mITX boards? Newegg shows a whopping 3 FM2+ mITX boards and 2 FM2 boards. Intel has 24 just for Haswell, and another 19 for Sandy/Ivy.

In any case, the fp64 performance is always artificially capped for consumer GPUs. If they really wanted to, they could just uncap it. Of course, they'd be shooting themselves in the foot, leaving pros no reason to buy the much more expensive workstation/compute cards.

Be careful there. 32 bits are more than enough for any straightforward calculation. Any calculation that requires multiple iterations or multiple points (especially anything nonlinear or based on boundary conditions) is going to fail badly with 32 bits.

Double is needed due to accumulated rounding errors. It has nothing to do with significant figures (just how often do you have the 7 or so [decimal] figures a float can have). Try running an audio sample through a 64k point FFT to get a good idea what can happen if you need proof.
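A quick way to see the accumulation effect without setting up an FFT is a plain running sum. The sketch below emulates a float32 accumulator in pure Python by rounding through `struct` after every addition; the step size and iteration count are arbitrary choices for illustration, not anything from the article.

```python
import struct

def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("f", struct.pack("f", x))[0]

def accumulate(step: float, n: int, single: bool) -> float:
    """Add `step` to a running total n times, optionally in emulated float32."""
    if single:
        step = to_f32(step)
    total = 0.0
    for _ in range(n):
        total += step
        if single:
            total = to_f32(total)  # round the accumulator back to float32
    return total

N = 100_000
exact = N / 10  # the mathematically exact sum of N steps of 0.1

err32 = abs(accumulate(0.1, N, single=True) - exact)
err64 = abs(accumulate(0.1, N, single=False) - exact)

print(f"float32 accumulated error: {err32:.6g}")
print(f"float64 accumulated error: {err64:.6g}")
```

Each individual float32 addition is accurate to about 7 decimal digits, yet the running total drifts by many orders of magnitude more than the double-precision one, which is the point being made about iterated calculations.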

As far as capping for consumer use, I have to wonder about that. Obviously, the 780 is capped (although how useful a Titan is for calculations that require double without the ECC of the even more expensive variety is questionable), but I have to wonder since using some of the weaker GPUs wouldn't be cost effective considering the entire cost of the motherboard slot (motherboard, RAM, CPU, power supply, some sort of boot disk...). I wouldn't be at all surprised if they are cheating a bit on the rounding of the float. Like the double, full IEEE 754 isn't remotely useful to consumers (this might barely change with more HSA apps). One of the more painful parts of 754 is that the last bit of a multiply has to be rounded from the entire 106 bits of product you get when you multiply two 53-bit mantissas together. Wimping out on float and doing doubles by the book (almost everyone who cares about rounding uses double) could easily make double 1/12 of float.

Maybe, maybe not. I suspect they aren't trying, but I wouldn't write any code that expected strict IEEE 754 rounding in single (crypto, perhaps). Strict rounding needs close to 4 times the multiplies that you would need for an unrounded multiply, so they could be wimping out there.

Personally, I'd rather have more floats that are off by a bit than strict 754 rounding on my floats, but can't see doing it as long as there are claims of "IEEE754" compatibility. Violating 754 has a *long* history (there have been plenty of -754strict compiler flags that kill performance), and there are plenty of ways to weasel a datasheet, but violating a spec is something an engineer *does* *not* *do*. When a careful engineer sees something like this, he won't go near the edge conditions (and rounding and the other 754 nastiness is about as edge condition as you can get).

I have a card with "unbound" DP performance. It is a complete brick. It says it should get 400 gigaflops, but in reality it runs Prime95 at about 24 msec/iter, while a 110W Opteron chip does it twice as fast, at about 12 msec.

All of AMD's GPU efforts turn into bricks because they fail to test their designs with real software.

Too bad AMD is not moving to 32-core Opteron chips, because that is what I need now.

By all accounts it's OK. I wish Intel would, of course, put its follow-up in their mainstream mid/high-end Core i parts and also improve its data throughput compared to the test linked above. We shall see whether it arrives or not...

That review has major errors. The AMD APU they are testing (A4-5000) is not Kaveri at all, even though they keep calling it Kaveri. The A4-5000 is actually the low-end Kabini. Kaveri is MUCH faster than Bay Trail.

The problem with Turbo is that you can't be sure which frequency will be achieved. So on what should the calculations be based? The base clock is guaranteed, and scaling the result from that number up to higher clocks is trivial.
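The scaling really is a one-liner, as the sketch below shows. It assumes throughput is purely clock-bound (no memory, power or thermal limits), and the 100 GFLOPS, 3.5 GHz and 3.9 GHz figures are placeholder inputs for illustration, not measurements from the article.

```python
def peak_gflops(flops_per_cycle: float, clock_ghz: float) -> float:
    """Theoretical peak = flops per cycle x cycles per nanosecond."""
    return flops_per_cycle * clock_ghz

def scale_to_clock(result_at_base: float, base_ghz: float, actual_ghz: float) -> float:
    """Linearly rescale a base-clock result to another clock.

    Assumption: the workload is entirely clock-bound, which only holds
    for theoretical peaks, not for real applications.
    """
    return result_at_base * actual_ghz / base_ghz

# Hypothetical example: a result quoted at a 3.5 GHz base clock,
# rescaled to a 3.9 GHz turbo clock.
scaled = scale_to_clock(100.0, base_ghz=3.5, actual_ghz=3.9)
print(f"{scaled:.1f} GFLOPS at turbo")
```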

Is it guaranteed though? Seems like if your cooling is crap, any processor might throttle. And if your cooling is good, any processor might run its turbo 100% of the time. Mine always do anyway (AMD and Intel alike).

It feels like the Kaveri execution resources have been scaled to the capacity of the memory interface considering the GPU requirements. Haswell might benefit really nicely from the four-channel DDR4 interface as well.

And you are aware that the closed-source AMD Radeon Linux driver used here is considered to be on par with the Windows driver, as they use the same code base? And did you forget that Kaveri and its slower little brothers are supposed to be found in mobile Android devices running that kernel some day, if AMD manages to get actual orders there to offset its lower Windows PC sales today?

The reason the Intel GPUs don't have fp64 under OpenCL is that the math instruction that includes intrinsics and division doesn't support fp64. See page 134 of the Intel Open Source Graphics Programmer's Reference Manual for the 2013 Intel Core Processor Family...: Volume 2b.

From what I can tell, GPUs have a larger number of intrinsics with greater numerical accuracy than AVX. Intel isn't correcting this until AVX-512 (see chapter 7.2 of the "Intel Architecture Instruction Set Extensions Programming Reference" and note the "less than 2^-23 relative error"). I believe the normal accuracy is 2^-14. AVX does not have a native fp64 rsqrt. The native log and exp for Hawaii are precise to 1 ULP (http://semiaccurate.com/2013/10/23/long-look-amds-...).

Why don't you specify that the Intel CPU fp64 numbers are for AVX2 instructions, not for AVX? This way you give Intel an unfair performance advantage! Intel's CPU fp64 has about a 2x performance advantage over AMD's only with AVX2 instructions. That's why your following statement seems quite untrue:

"As a comparison point, one core in Haswell has the same floating point performance per cycle as two modules (or four cores) in Steamroller."

You can look at the following chart - http://images.hardwarecanucks.com/image//skymtl/CP... for some comparison numbers and examples. As you can see, the FPU (VP8) results of the Haswell i3-4330 are about 2x those of the Kaveri A10-7850K. However, the results of the older Ivy Bridge i3-3225 are similar to those of the A10-7850K. That's because the new Haswell processors have AVX2 instructions, but the Ivy Bridge ones do not. You can also see that if you compare the VP8 results of the i7-4770K to those of the i7-3770K. That's why the i7-4770K has twice the performance of the i7-3770K.

From a floating point perspective, the only difference between AVX and AVX2 is that AVX2 contains FMA instructions while AVX does not. Kaveri/Steamroller do not support full AVX2 but do support FMA instructions. So, from a floating point perspective, Kaveri/Steamroller and Haswell support almost the same instruction set. If you look at the column "AVX with FMA", we already cover this case.
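For readers unfamiliar with what a fused multiply-add actually changes: it computes a*b+c as one instruction (counted as two flops in peak figures) and rounds only once. The sketch below emulates the single rounding with exact rational arithmetic; it is an illustration of the numerical behaviour, not of how the hardware is built, and the input values are chosen by hand to make the difference visible.

```python
from fractions import Fraction

def mul_add_unfused(a: float, b: float, c: float) -> float:
    """Separate multiply and add: the product is rounded, then the sum is rounded."""
    return a * b + c

def mul_add_fused(a: float, b: float, c: float) -> float:
    """Emulated FMA: compute a*b+c exactly, then round once to double."""
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# Hand-picked inputs: a*b is exactly 1 - 2**-60, which rounds to 1.0 in double.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

unfused = mul_add_unfused(a, b, c)  # the tiny term is lost in the first rounding
fused = mul_add_fused(a, b, c)      # the tiny term survives the single rounding

print("unfused:", unfused)
print("fused:  ", fused)
```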

Thank you for your clarification! But as far as I know, the Intel Haswell architecture has 256-bit FMA units, compared to Ivy Bridge, Kaveri, etc., which have 128-bit ones. That's the only big architectural advantage Haswell's FPU has over the others. That can explain the doubled performance per FPU module that we can observe in the chart I posted. And as you say, AVX2 includes FMA instructions, which is where the big performance advantage is. However, I cannot understand your table, where the regular AVX instructions have a 4x advantage over Kaveri. As we can see in the chart (http://images.hardwarecanucks.com/image//skymtl/CP...), the practical results show a different picture. Haswell's FPU advantage over Kaveri (counting the same number of FPUs) is about 50% - 60%, but not more.

Yes, well, our coverage is more about the theoretical peaks. In practical applications, differences will be smaller.

About the 4x advantage of AVX over Kaveri, the difference is that each Haswell core has two 256-bit units. Thus, quad-core Haswell has a total of eight 256-bit units.

Steamroller modules only have two 128-bit units per module. Thus, quad-core Steamroller only has four 128-bit units. Thus, Haswell has twice the number of SIMD units and each unit is double the width, hence the 4x difference.
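The unit counting above can be written out as a small sketch. The unit counts and widths are the ones stated in this reply; the class and method names are invented for illustration, and counting an FMA as two flops per lane is the usual convention for peak figures rather than something this reply spells out.

```python
from dataclasses import dataclass

@dataclass
class SimdConfig:
    name: str
    blocks: int           # cores (Haswell) or modules (Steamroller)
    units_per_block: int  # FMA-capable SIMD units in each core/module
    unit_bits: int        # width of each unit in bits

    def total_bits(self) -> int:
        """Total SIMD width available per cycle across the chip."""
        return self.blocks * self.units_per_block * self.unit_bits

    def fma_flops_per_cycle(self, element_bits: int) -> int:
        """Peak flops/cycle, counting each FMA lane as 2 flops (mul + add)."""
        lanes = self.total_bits() // element_bits
        return lanes * 2

haswell = SimdConfig("quad-core Haswell", blocks=4, units_per_block=2, unit_bits=256)
steamroller = SimdConfig("two-module Steamroller", blocks=2, units_per_block=2, unit_bits=128)

ratio = haswell.total_bits() / steamroller.total_bits()
print(f"{haswell.name}: {haswell.total_bits()} bits of SIMD per cycle")
print(f"{steamroller.name}: {steamroller.total_bits()} bits of SIMD per cycle")
print(f"ratio: {ratio:.0f}x")
```

Twice the units times twice the width gives the 4x figure, independent of whether the elements are fp32 or fp64.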

Thank you! I can absolutely agree with your calculations. However, I always thought that it is more accurate to compare the quad-core, two-module Steamroller or Piledriver with the 2-core, 4-thread i3 processors. Because, as we know, the AMD quad-core processors have only 2 FPUs and 4 integer units. So they are only dual-core with regard to the FPU and quad-core with regard to integer. I think the AMD definition of quad-core (or any other number of cores) is not quite correct. But that is another story...

And I think that your comment "About the 4x advantage of AVX over Kaveri, the difference is that each Haswell core has two 256-bit units. Thus, quad-core Haswell has total of eight 256-bit units." is only partly correct, because those units are 256-bit FMA units, and FMA instructions are part of AVX2, not AVX. That was the subject of my initial comment.

True. A non-FMA AVX op will provide one 128 bit vector to one 256-bit unit at a time. But is it possible that it can provide two different 128 bit vectors in parallel, in order to take advantage of the full 256-bit unit potential? AFAIK, it is not.

I don't see your point! It seems AMD were all over the shop while Intel made one change so far.

https://en.wikipedia.org/wiki/FMA_instruction_set

"May 2009: AMD changes the specification of their FMA instructions from the 3-operand DREX form to the 4-operand VEX form, compatible with the April 2008 Intel specification rather than the December 2008 Intel specification.[9]
October 2011: AMD Bulldozer processor supports FMA4.[10]
January 2012: AMD announces FMA3 support in future processors codenamed Trinity and Vishera; they are based on the Piledriver architecture.[11]
May 2012: AMD Piledriver processor supports both FMA3 and FMA4.[10]
June 2013: Intel Haswell processor supports FMA3.[12]
It is currently uncertain whether the 3-operand VEX coded form (here called FMA3) or the 4-operand form (FMA4) will be the dominating standard in the future."

The only thing that really matters, of course, is the fact that different compilers provide different levels of support for FMA4:

GCC supports FMA4 with -mfma4 since version 4.5.0[13] and FMA3 with -mfma since version 4.7.0.
NASM supports FMA3 instructions since version 2.03 and FMA4 instructions since 2.06.
YAsm supports FMA3 and FMA4 instructions since version 1.1.0.

The non-FMA AVX ops are currently the most widely used vector instructions in x86 applications. The newer AVX2 ones are not widely adopted, and thus have just a tiny share. The non-FMA AVX 128 bit operands are executed using the 256 bit FMA units in Haswell, but take no advantage of those 256 bits, as the 256 bit FMA unit can execute only one 128 bit operand at a time. That's why the 256 bit FMA units in Haswell give a performance advantage only for FMA AVX 256 bit ops (AVX2), but not for the widely adopted non-FMA AVX ops. That is what I think and can explain in simple terms.

AMD is shooting itself in the foot if it doesn't have a Kaveri with full GPU FP64 capability similar to the 7970. Together with HSA, it should be powerful for a new breed of applications that require FP64. It's a window of opportunity for them to popularize this product in HPC. In gaming, it also requires a "killer app" that utilizes HSA and the iGPU to assist new rendering techniques, e.g. renderings that require dependency, compute-based rendering, and interactive GPU physics, coupled with a dGPU used only for rendering.

Hardly. Building it for the target market would increase power draw by a factor of four (well two since the GPU is half the chip). That would kill mobile sales and likely limit desktop power to Intel levels. Not going to happen.

FP64 apps tend to be rare and price insensitive. Intel appears to be going there with the Knights Landing chip, and AMD would get killed trying to make a chip that could compete with that *AND* fit in laptops/tablets (it would have enough trouble competing with that on the desktop).

Regardless of how well the A10-7850 compares to Intel's offering in terms of fp64, I'm wondering what good the extra 33% of stream processors do compared to the rest of the Kaveri range, as in the 7700K and the A8-7600?

That A6 is so weak that it stays pegged at 100% for much longer periods than an i3. The i3 is able to actually enter into low power states more often. Since an i3 will churn through its tasks faster, it can even result in reduced power consumption from the storage device since more I/O operations can be clustered together.

I am a developer who frequently uses OpenCL to accelerate proprietary image processing algorithms. Their code relies on the compiler to vectorize, which, in my experience using AMD's and Intel's OpenCL SDKs, is often a mistake resulting in subpar performance.

I never really considered the fact that benchmark code would be this naive. I assumed that since its purpose was to give an objective standpoint of realizable performance that they would take all steps to ensure maximal numbers. I won't make that mistake again.

"Their code relies on compiler to vectorize": do you also rely on the compiler's ability to vectorize, or do you actually write your code as small independent modules with both assembly code and C code as a fallback, x264 code style, to maximize your data throughput?

Where can we find your OpenCL x264 image processing algorithm patches to improve that generic app for 1080p/UHD-1 encoding?

I don't understand what you are trying to say. Can you explain it more clearly?

There are two ways to vectorize execution: explicitly (and there are a few ways to do so) or by letting the compiler figure it out from vector-naive code. The source does not explicitly vectorize by using the vector data types available in OpenCL.

I'm confused about FlexFPU. Surely the idea was to allow for two SSE or one AVX instruction per cycle, and considering we're talking four units per dual module/quad core Kaveri, wouldn't that be equivalent to a Phenom II X4/Llano? The unit is supposedly designed to work in a HyperThreaded-style manner, could that be the limitation, or is it for SSE2 only?

Also, as far as I recall, K10 doesn't support fused instructions. So, it's another reason to be confused about the results.

I think there are mistakes in the table “CPU floating-point peak performance” in the column for the Ivy Bridge i7-3770K processor. The 3770K has 4 cores, each having 1 FPU with 2 128-bit FMA units. That is a total of 8 128-bit FMA units. Steamroller A10-7850 has 4 cores, each two sharing 1 FPU with 2 128-bit FMA units. That is 2 FPUs times 2 FMA units, which gives a total of 4 128-bit FMA units. Hence the i7-3770K has twice the AVX peak performance of Steamroller, Richland and Trinity. Therefore the following numbers in the table, corresponding to 4 times the performance, are wrong:
- i7-3770K, AVX fp32 (/cycle) 64. Should be 32;
- i7-3770K, AVX fp64 (/cycle) 32. Should be 16;
- i7-3770K, AVX fp32 (gflops) 224. Should be 112;
- i7-3770K, AVX fp64 (gflops) 112. Should be 56.

Or in other words, the Sandy Bridge/Ivy Bridge ALUs can execute either one 256-bit addition or one 256-bit multiplication per cycle per core, while the two 128-bit Steamroller FMA units can group together to execute the same one 256-bit addition or one 256-bit multiplication per cycle per module. Hence in most cases, 1 Steamroller module should have the same throughput as 1 Ivy Bridge core. As non-FMA AVX multiply and add operations are rarely mixed together, one could not expect many cases where both operations are performed on both 256-bit Ivy Bridge ALUs in the same cycle. In some ideal scenario, one of the Ivy Bridge hyper threads would provide a 256-bit addition and the other a 256-bit multiplication. I can agree that in those cases the CPU will reach your maximum numbers of peak performance.
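The two sets of numbers in this exchange differ only in how many 256-bit operations a core is assumed to issue per cycle. A minimal sketch, with invented function names, and with the 3.5 GHz clock inferred from the table's own 224 gflops / 64 per cycle rather than stated anywhere in the comments:

```python
def peak_flops_per_cycle(cores: int, vector_bits: int, element_bits: int,
                         ops_per_cycle: int) -> int:
    """Peak flops per cycle for a chip issuing `ops_per_cycle` vector ops per core."""
    lanes = vector_bits // element_bits
    return cores * lanes * ops_per_cycle

CLOCK_GHZ = 3.5  # assumption: implied by 224 gflops / 64 flops per cycle

# Table's model: one 256-bit add AND one 256-bit multiply issued together.
table_fp32 = peak_flops_per_cycle(4, 256, 32, ops_per_cycle=2)
table_fp64 = peak_flops_per_cycle(4, 256, 64, ops_per_cycle=2)

# Commenter's model: one 256-bit add OR one 256-bit multiply per cycle.
alt_fp32 = peak_flops_per_cycle(4, 256, 32, ops_per_cycle=1)
alt_fp64 = peak_flops_per_cycle(4, 256, 64, ops_per_cycle=1)

for label, fp32, fp64 in [("add + mul", table_fp32, table_fp64),
                          ("add or mul", alt_fp32, alt_fp64)]:
    print(f"{label}: fp32 {fp32}/cycle ({fp32 * CLOCK_GHZ:.0f} gflops), "
          f"fp64 {fp64}/cycle ({fp64 * CLOCK_GHZ:.0f} gflops)")
```

The first model reproduces the table's 64/32 per cycle and 224/112 gflops, the second the proposed 32/16 and 112/56, so the disagreement comes down to whether the peak assumes a perfectly mixed add and multiply stream.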

I'm not sure about my understanding, but maybe the FPU in Bulldozer doesn't work as a single core: "What he could tell me was that the 128-bit FP units are symmetrical, and that, on any cycle, either integer core can dispatch a 256-bit AVX instruction (assuming software compiled to support AVX). Or, both integer cores can dispatch a single 128-bit instruction at the same time."