AMD A8-7600 Kaveri APU Review - HSA Arrives

The AMD Kaveri Architecture

Kaveri: AMD’s New Flagship Processor

How big is Kaveri? We already know the die size of it, but what kind of impact will it have on the marketplace? Has AMD chosen the right path by focusing on power consumption and HSA? Starting out an article with three questions in a row is a questionable tactic for any writer, but these are the things that first come to mind when considering a product the likes of Kaveri. I am hoping we can answer a few of these questions by the end of this article, but alas it seems as though the market will have the final say as to how successful this new architecture is.

AMD has been pursuing the “Future is Fusion” line for several years, but it can be argued that Kaveri is truly the first “Fusion” product that completes the overall vision for where AMD wants to go. The previous several generations of APUs were initially not all that integrated in a functional sense, but the complexity and completeness of that integration have been improved upon with each iteration. Kaveri takes this integration to the next step, and one which fulfills the promise of a truly heterogeneous computing solution. While AMD has the hardware available, we have yet to see if the software companies are willing to leverage the compute power afforded by a robust and programmable graphics unit powered by AMD’s GCN architecture.

(Editor's Note: The following two pages were written by our own Josh Walrath, discussing the technology and architecture of AMD Kaveri. Testing and performance analysis by Ryan Shrout starts on page 3.)

Process Decisions

The first step in understanding Kaveri is taking a look at the process technology that AMD is using for this particular product. Since AMD divested itself of their manufacturing arm, they have had to rely on GLOBALFOUNDRIES to produce nearly all of their current CPUs and APUs. Bulldozer, Piledriver, Llano, Trinity, and Richland based parts were all produced on GF’s 32 nm PD-SOI process. The lower power APUs such as Brazos and Kabini have been produced by TSMC on their 40 nm and 28 nm processes respectively.

Kaveri will take a slightly different approach here. It will be produced by GLOBALFOUNDRIES, but it will forgo the SOI and utilize a bulk silicon process. 28 nm HKMG is very common around the industry, but few pure play foundries were willing to tailor their process to the direct needs of AMD and the Kaveri product. GF was able to do such a thing. APUs are a different kind of animal when it comes to fabrication, primarily because the two disparate units require different characteristics to perform at the highest efficiency. As such, compromises had to be made.

GPUs perform best using high density transistors running at lower speeds, as more parallel units can be packed into a chip. The lower clock speeds are not necessarily a hindrance to these massively parallel processors, so the focus is primarily that of maximizing transistor count to die space. CPUs on the other hand seem to work better with more spacing between transistors and being able to run at a higher clock speed without breaking any power and TDP envelopes. These are generalizations, but the truth of the matter is that CPUs and GPUs are very different beasts when it comes to design considerations at a very low level.

The 28 nm bulk/HKMG process at GF is more of a compromise that is optimized for good performance for both the GPU and CPU. It offers good enough density and good enough speed to make for a competitive product in the marketplace. It is a bit more biased towards the GPU portion, as the CPU takes a hit when it starts to run at the higher TDPs. So at 95 watts, the CPU portion of Kaveri is running as fast as it can while being constrained by TDP concerns. Even though 28 nm HKMG in theory should offer a little more headroom than the previous 32 nm PD-SOI based process, in the end Kaveri will run oh-so-slightly slower than the previous generation Richland in terms of raw CPU clockspeed. The GPU portion will run significantly slower than the previous VLIW4 based part in Richland. These are not necessarily bad things, because the efficiency improvements in both the CPU and GPU offset the clockspeed disadvantages.

Steamroller Improvements

Some years back AMD decided to go the CMT (clustered multi-threading) route for multi-threaded efficiency vs. die cost. The first product to sport these new cores was the Bulldozer based FX-8150. The results were not very positive. The part showed some real issues with power consumption, heat production, and single threaded performance. While it did very well in heavily multi-threaded apps, it was not exactly a winning formula. The next update to the architecture was Piledriver. This is found in both the Trinity/Richland line of APUs as well as the FX 8300/6300/4300 series of parts. Piledriver had some small improvements in performance per clock, but the biggest improvement was power. Piledriver did not get as hot or pull as much power per clock as did Bulldozer.

Kaveri introduces the new Steamroller architecture for the CPU portion of the APU. Steamroller is another improvement over Piledriver, especially in terms of performance per clock. Kaveri is comprised of two Steamroller modules which each contain two cores, so a two module unit can address four threads. The front end of the module was reworked in a very significant manner to improve not only single thread performance, but also multi-threaded performance as well.

The biggest improvement is the addition of another decoder. Previous iterations only had one instruction decode unit per module, so each module was limited to one thread per clock. We can see right off the bat that single threaded performance will suffer because a good portion of the execution units in each core will be waiting for instructions every clock. Multi-threading also suffers because each module only addresses half of the potential threads vs. core count.
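The effect of the second decoder can be sketched with a toy model. The Python below is only an illustration under simplifying assumptions: the function name `decode_slots` and the round-robin hand-off are hypothetical, and nothing here models AMD's real front end. It simply counts how many decode slots each of a module's two threads receives when one decoder is shared versus when each thread has its own.

```python
def decode_slots(cycles, decoders, threads=2):
    """Toy model: count decode slots each thread receives over `cycles` clocks.

    Assumption: each decoder serves exactly one thread per clock, and slots
    are handed out round-robin. This is not a model of the real hardware.
    """
    slots = [0] * threads
    next_thread = 0
    for _ in range(cycles):
        # In this toy model, extra decoders beyond the thread count go unused.
        for _ in range(min(decoders, threads)):
            slots[next_thread] += 1
            next_thread = (next_thread + 1) % threads
    return slots


# One shared decoder (pre-Steamroller) vs. two decoders (Steamroller).
shared = decode_slots(cycles=100, decoders=1)
dedicated = decode_slots(cycles=100, decoders=2)
print("one shared decoder:", shared)
print("two decoders:", dedicated)
```

In this simplified picture each thread is fed on only half of the clocks when the decoder is shared, which is the starvation the paragraph above describes.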

AMD did not just stop there. They improved essentially every piece of the front end, as well as how the D-caches handle and store data. The integer and floating point units look to be left untouched, but every other aspect of the chip was touched upon and improved by AMD’s engineers. The integer and floating point/SIMD units were seemingly fast enough for the job, but they just could not be fed data and instructions effectively and efficiently.

AMD showed us estimates of a peak 20% improvement in performance per clock. They then told us that in most real-world situations that number is likely to be 10%. Still, this is a pretty big jump in single thread performance, and it will be able to handle multi-threaded loads more efficiently as well.
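To see how a performance per clock gain trades off against a lower clock speed, a rough back-of-the-envelope calculation helps. The sketch below assumes throughput scales as performance per clock multiplied by frequency, which ignores memory and turbo behavior. The 4.0 GHz and 3.8 GHz clocks are hypothetical placeholders, not figures from AMD; only the 10% and 20% gains come from the estimates above.

```python
def relative_throughput(ipc_gain, old_clock_ghz, new_clock_ghz):
    """Throughput of the new part relative to the old (1.0 means equal).

    Assumes throughput = performance per clock * clock speed.
    """
    return (1.0 + ipc_gain) * (new_clock_ghz / old_clock_ghz)


def break_even_ipc_gain(old_clock_ghz, new_clock_ghz):
    """Per-clock gain needed to exactly offset a clock speed reduction."""
    return old_clock_ghz / new_clock_ghz - 1.0


# Hypothetical clocks chosen for illustration only.
OLD_CLOCK_GHZ = 4.0
NEW_CLOCK_GHZ = 3.8

for gain in (0.10, 0.20):  # typical and peak estimates quoted by AMD
    ratio = relative_throughput(gain, OLD_CLOCK_GHZ, NEW_CLOCK_GHZ)
    print(f"{gain:.0%} per-clock gain -> {ratio:.3f}x relative throughput")

needed = break_even_ipc_gain(OLD_CLOCK_GHZ, NEW_CLOCK_GHZ)
print(f"break-even per-clock gain: {needed:.1%}")
```

With these placeholder clocks, even the conservative 10% figure more than covers a 5% clock deficit; a larger clock gap would need a correspondingly larger per-clock gain.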

Power does not seem to be an issue with this design, though as mentioned in the process section AMD did take a hit in CPU performance in the high TDP range. With more tweaking of the process we can expect faster parts to be released down the line, but for now the A10-7850K will be the top SKU for this introduction. Also, AMD will be offering these products in the 15 watt TDP range later on this year. That is a pretty significant range of TDPs for essentially a single design. AMD did disclose all of the power saving features, but they seem to be very comparable to what was introduced with Richland.

Definition of Compute Cores

AMD is coming out with a new description for cores with Kaveri. Compute cores were bandied about during the tech day, and they actually make a bit of sense. At CES, NVIDIA came out with their “192 core” Tegra K1, but that actually seems a bit of a misnomer as compared to how AMD is defining “cores”. Those Tegra cores are more akin to SIMD units than standalone cores. My understanding is that a single SMX unit could be considered a “compute core”.

On the other hand, AMD’s GCN compute clusters can be defined as cores in a more historical sense. The top end APU has a total of 12 compute cores; 4 of them are the CPU cores in the Steamroller modules, while the other 8 are the GCN units. Each GCN unit contains 4 x 16 wide vector units (SIMD), a single scalar unit, branch and message unit, a scheduler, texture and texture fetch units, and a bunch of cache. Each GCN unit has around 146 KB of cache divided between vector registers, a scalar register, local data share, and L1 cache. It also has such basics as a program counter, which certainly fits in with their traditional definition of cores. Each GCN unit can theoretically assign new jobs/work to the CPU when needed. While you certainly can’t boot up an OS from a GCN core, it can do a significant amount of work independently from the CPU.
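The arithmetic behind the "12 compute cores" figure can be written out directly. This short Python sketch uses only the counts given above (4 CPU cores, 8 GCN units, 4 SIMD units of 16 lanes each); the function and constant names are made up for illustration.

```python
# Counts taken from the description of the top end APU above.
CPU_CORES = 4
GCN_UNITS = 8
SIMDS_PER_GCN_UNIT = 4
LANES_PER_SIMD = 16


def compute_cores(cpu_cores, gcn_units):
    """AMD's "compute core" count: CPU cores plus GCN units."""
    return cpu_cores + gcn_units


def vector_lanes(gcn_units, simds_per_unit, lanes_per_simd):
    """Total vector lanes across all GCN units."""
    return gcn_units * simds_per_unit * lanes_per_simd


print("compute cores:", compute_cores(CPU_CORES, GCN_UNITS))
print("vector lanes:", vector_lanes(GCN_UNITS, SIMDS_PER_GCN_UNIT, LANES_PER_SIMD))
```

Counting individual vector lanes rather than GCN units is closer to the way a "192 core" style figure is arrived at, which is why the two vendors' core counts are not directly comparable.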

Sources at AMD have stated that FX branded processors will be back, but AM3 is a dead end. These things point to AMD eventually releasing an FX processor on FM2+. Now, this FX processor will be an APU and not the traditional FX products we have seen so far.

Sadly, this seems true. I understand now why there won't be an FX Steamroller CPU; it's just nowhere near competitive to the Intel counterparts. As a longtime AMD enthusiast, I am saddened by this, but by the same token, if I were the CEO of AMD, it would be hard for me to make a business case to invest the engineering resources to catch up (strictly referring to integer performance). The future seems to be phablets, tablets, and convertibles.

but that is kinda the point isn't it? race to the bottom.
Most users buy the cheapest option possible, that is why there are so many people with atom (pre silvermont) and low end celeron systems, constantly complaining how shitty their laptop is.
A good enough cpu {and lets face it modern day low end cpu's are more than powerful enough for 90% of home users} with a good entry level gpu for around $600 will be the biggest sellers. It's just good business. AMD pushing the low end by making said low end systems good enough to play games at very decent quality will encourage the pc gaming ecosystem to once and for all dominate the console.

The money (profits) has never been in high end gaming, for Intel at least. Intel has always developed for the server, and mainstream market. Intel bases its chips for the enthusiast market around its server SKUs, with the server specific functionality removed, or fused off! Intel has always subsidized its gaming SKUs, with its profits from its server, and mainstream sales! AMD can not afford to do this subsidizing, and never really could, to the degree that Intel could! The whole profitable part of the market has shifted to the Mobile Tablet/Phone, and low cost laptop/chromebook markets, that is where the money is, and AMD currently can only remain viable as an ongoing concern, by shifting its resources towards the GPU/APU market where it beats Intel, and competes with Nvidia! AMD does provide Intel with plenty of competition in the LOW cost, low to midrange (with Kaveri) CPU/APU market! Intel is in deep trouble, in the low cost x86 market, and currently is not a factor in the Mobile CPU/APU market!

Loan AMD half a billion to restart its high end development, and you better have a few billion in reserve for a revolving line of funding, because that is the level of subsidizing that gaming high end development costs!

well this cements amd's mobile shift they probably have nothing to compete with intel's performance on the desktop/server side till excavator or after... I will give them credit though they were handed a bunch of lemons and made the best lemonade they could.

I've always wondered with AMD how long it would be before they try and make a push to have a GDDR5 memory slot included on Fusion Motherboards. Presumably it would give the built in GCN cores quite a boost in performance and gaming.

To be fair in the end, the tortoise won the race. Though your point still stands.

I also don't think Intel is stalling for them to catch up (though that would be nice). Intel just plain doesn't care, they consider ARM the threat now. They know AMD isn't going to be catching up for quite a while. Given the way things are going, they can just sit on new stuff until someone actually comes to compete.

Haswell-e is not due yet, Sandy-e or the newer Ivy-bridge-e are the only options. Because Ivy-e uses the same platform as Sandy-e I decided to go with a 4770k for now and I'm saving for when Haswell-e is released. From what I've read that will have a fairly decent performance improvement. Ivy-e is not much better than Sandy-e just like the Sandy/Ivy-Haswell improvements are minimal.

Please do not forget to define AMD's mobile tablet based APUs use of the Mobile/Full versions of openCL, openGL, etc, as Nvidia's Tegra K1, now supports the desktop/discrete GPU, full versions of openCL, etc., on Nvidia's new mobile Tegra K1 based platforms! AMD needs to offer Full version support of OpenCL, etc. drivers on any SKUs that compete with the Nvidia Tegra K1s! In the Future, with respect to any reviews of Mobile devices built around the AMD Kaveri mobile APUs in competition with the Nvidia Tegra K1 based mobile devices based "APU Type" CPU/GPU systems, please make sure to tell the reader if the mobile device will allow loading of a full Linux distro, and if that mobile device's CPU/GPU APU, or "APU" type(Nvidia k1, etc) system supports the full openCL, etc. versions of the drivers! I am seriously looking for the K1 based tablet devices, and their K1 based ability to run full desktop style applications via full version openCL, OpenGL driver support, to allow me to run Blender 3d(Light Mesh Modeling) and Gimp for graphics, on a mobile tablet, that can run a full Linux Distro. Full openCL, openGL driver support on a Tablet/Mobile processor(K1, YES), (Kaveri, ??), is big news, and I look forward to your device reviews.

here's the problem, One chip, one price, just like how the Nvidia didn't get the consoles.. all three.. they will not be able to beat the price/performance point. I own two NVidia cards. from a business stand point, this spells bad news for Nvidia and anyone else in the smaller factor market. lots of power, all in one, less Watts. this is not high end. I hope they release high end chips but.. I'm beginning to think its not going to happen in the economy

What did the post, that you replied to, have to say, about what you are talking about? the Poster needs full Linux capability, and full OpenCL, OpenGL from a Mobile "APU"/APU type device(tablet) Nvidia's K1, can provide FULL openCL, openGL support, and the poster hopes AMD can provide an equivalent level of support, with its competing SKUs!
The poster will buy any tablet, even if it was made by marvin the martian, if it meets the posters needs, the poster wants a tablet that runs a full linux rooted distro, that can run Blender 3d, and gimp(both require Full OpenGL, and Maybe some Full OpenCL support, and run under windows[Hell no], or Linux)! the Poster would prefer a mobile x86 AMD platform(if it has full OpenGL, openCL, etc. support like The K1), but will use the K1 if there is A blender 3d, and Gimp, Arm based build available, to run on the linux distro based ARM platform! The K1 will compete very well in its intended form factor against the AMD kaveri tablet APUs! BUT no Blender 3d, no Gimp, and No full linux rooted distro, on the tablet, NO BUY!

Good upgrade when compared to the previous generation... it matches and at times even beats the A10 5800K despite being a lower end part. Yes, it doesn't beat a core i3 in single threaded stuff but the multi-threaded performance is decent.

It seems AMD is still in the business of shooting themselves in the foot with an RPG.

After the Phenom II series of CPU, their new core module design has led to around a 50% lower IPC on each core.

For example, compare the fx8350 to the X6 1100t it beats it in overall performance by a very small margin, and largely fails against it when it comes to single threaded performance, all while having a 700MHz across 8 cores, lead over the Phenom II.

AMD is still using that flawed design, and it can barely compete with a dual core, core i3. This is a low end part and thus it will not attract high end users, and it will not attract gamers.

It is very niche because it will struggle to even attract general computer users as the most common tasks done by general/ entry level users, rely more on single threaded performance *which the core i3 offers nearly twice the single threaded performance.

Furthermore, this CPU is unlikely to attract even casual gamers, as the games they commonly play, do not even need the GPU horsepower of the APU, and if they are into running demanding games, they will likely want to run them at high settings. But it will be too much to ask to have them give up nearly half of their single threaded performance for slightly better GPU performance (keep in mind that browsers and many other common applications are still single threaded).

AT implies AMD has one more bulldozer-design before passing at a new architecture, the Excavator.

I think someone at AMD must have realized that speed demon CPU architectures aren't appropriate for nowadays' computing demands and will fall back to a wide and high IPC core. They may do as Intel with Conroe and fall back to their mobile architecture: a modified (wider and cache-beefed) Jaguar. That however takes time, so in 2016 maybe we'll see a competitive AMD CPU.

Bone stock Phenom II on 32/28nm would be a better CPU than the FX line, but that train left the station years ago.

For me I am just annoyed, I currently have a Phenom II x6 1075t overclocked to 4GHz, and to match its performance, an 8350 would have to be overclocked to nearly 5GHz

While there are some things that it does faster, overall the core module crap has led to a slower CPU because the vast majority of the computing we do, still relies heavily on single threaded performance.

This is why intel has been doing so well. virtually every new generation increased IPC. after the pentium 4 and pentium D, they made sure to never sacrifice IPC for a higher core count.

AMD on the other hand slowly increased IPC until the Phenom II, and then took a massive dive in order to make an 8 "core" CPU.

AMD's core module is the equivalent of going to the pizza shop and ordering 2 large pizzas, and the shop simply just putting an extra crust over the top of a single large pizza. sure it is more food but it is no longer as good and it is not 2 large pizzas.

I cannot see AMD as a valid choice now until they step away from this core module crap and go back to improving IPC

High core counts are of very little benefit. multiple CPU cores do not scale perfectly, 1 core at twice the IPC will perform better than 2 cores at half the IPC each.

AMD should have built upon the Phenom II and made it 28nm or smaller, and optimized it to improve IPC

According to HWbot a Phenom II x6 1075T @ 4.2Ghz gets close to stock FX8350 multi threaded performance whereas most FX 8 cores at over 4.5Ghz beat the 1075T. FX IPC changes according to how many cores are loaded if you only use 1 core the stock FX 8350 gets around 1.23(A Phenom II x4 I had got 1.25 at 4Ghz) points in single threaded Cinebench test but once you use all 8 cores it will only just get over 7 points which points to the fact that once 2 cores in a module have to work at the same time they slow down due to the resource sharing. So no your 1075T is not better than an FX8350 it's close but not better.
BTW with very repetitive parallel work loads like crunching prime numbers/video encoding an FX 8350 will catch up to and sometimes beat an i7 3930k/3960X/4930k/4960X.

Strange thing about those kaveri's 8 compute "Cores" (on the GPU side at least) is that they(the ACEs) can actually do context switching, which no prior AMD GPUs could do, according to what Charlie the D., says "Once a GPU can context switch, it is essentially a very wide heterogeneous CPU, and that is exactly the point of Kaveri." and "...just wait until the software catches up...", so there is not a lot of software out there compiled to take advantage of AMD's version of HSA, (hQ) and atomics, and such, but as soon as the SDKs, frameworks, APIs, begin to take advantage of AMD's special brand of HSA, things should look different. The Benchmarking software will have to account for this in some way (when needed), and will have to be re-run to measure this New AMD HSA tuned hardware once the changes work their way into the software ecosystem!
I wonder how this will affect the Ray tracing benchmarks, in particular, being able to have 8 of the GPU Cores/ACEs that have the ability to context switch between their own ( hQ max depths of up to 8 threads per ACE), all doing ray tracing, in addition to the 4 CPU cores, and their SIMD instructions! Kaveri is still, pending the release of truly compatible, and optimized software, very much a wait and see situation.

People may not like Charlie D., but his technical hardware analysis skills are top notch!

Im not a computer engineer (although sometimes i wish i was given the ODD decisions made by these companies...) but i have to agree that a Phenom II die shrink/tweak would have been best for AMD...at least for the FX line and left the "experimental" , module architecture for the APU line until HSA was fully implemented. A die shrunk Phenom II x4 and x6 with an updated memory controller would still have been able to compete at the time bulldozer was released if they kept the costs low enough.....it just boggles my mind that no one in-house wouldnt have seen this?? They could have even labeled it "Phenom III" instead of resurrecting the FX label. The only thing really keeping Phenoms back nowadays is a few instruction sets and the memory controller!! That architecture at 4ghz+ would have been a hell of a thing!

No the posting system would not confirm that the post was made, in fact it displayed an error message after each post attempt, and assuming the post itself was not properly posted, the poster continued to try to post!

And Now that the poster thinks about it, maybe the error message was actually concerning the Failure of the post affirmation message, and not the posting mechanism that did its job of posting after each cryptic error message!

So maybe the logic of the posting system needs to be changed to not fully complete the posting transaction, unless the affirmation message transaction completes without Throwing an error!

this post took more than one try also, this posting system breaks down after the post counts reaches past one page in length!

AMD is giving a copy of BF4 away with a purchase of the A10-7850. It has been posited that Kaveri will be "good enough" for a low cost system to play BF4 because it's coming with the processor. I've tried to temper their enthusiasm with my own experience running a GTX660 with an i5-3570k.

What is game play going to be like with just the newest APU on a 1920x1080 display?

I assume hybrid CrossFire will also be possible, but what is the max card that could be used (Some thought as high as a Radeon 7850)? I was skeptical, I wasn't sure you'd even be able to run a 77xx card and get any good out of the iGPU.

i don't think this CPU should be placed against Intel's GT2, price-wise and also platform-wise, yeah sure it makes sense... but generation-wise it definitely doesn't.. instead i think it should be compared with Intel's GT3 or even Intel's GT3e

Well, I can't say that I'm not a little disappointed in amd's latest desktop offering. That said, I'm now very excited to see how this new manufacturing process and low power tuning have worked for their new mobile kaveri. I don't expect amd will ever conquer the high end cpu market again but I think they have all the tools to dominate in tablets and convertible notebooks. It's a matter of where Read leads the company. Either way, great review josh and ryan, I look forward to mobile reviews of kaveri (I hope) by second quarter.