It’s time for AMD to take a page from Intel and dump Steamroller

AMD’s Kaveri is, in many respects, a huge step forward. The new APU’s low-power performance is excellent, its integrated graphics torpedo anything Intel offers at an equivalent price point, and it includes support for features like Mantle, HSA, and TrueAudio. Yet, despite these lauded capabilities, there’s a clear problem sitting in the middle of Kaveri like a turd in the proverbial punchbowl: Steamroller.

Despite significant improvement in the low-power segment, it remains fundamentally incapable of matching Intel clock-for-clock. HSA might one day help address the problem, but it’ll be years before HSA-compatible software is readily available.

It’s time to take a page from Intel’s book and dump the core. The good news is, much in the same way that Intel’s Pentium M core would eventually replace NetBurst, AMD already has a core that’s capable of stepping into Steamroller’s shoes — it just needs to be fine-tuned for the role.

How killing the Pentium 4 saved Intel

The Pentium M (codename: Banias) was created because Intel recognized that the Pentium 4 couldn’t address the mobile market effectively. The Pentium M design team took Intel’s older Pentium III core (Tualatin) and optimized it for high efficiency and low power.

Banias used the P4’s quad-pumped front-side bus, added support for SSE2, and inherited the sophisticated branch prediction unit that the P4 relied on to keep its 20-stage pipeline fed. Over the next few years, as it became increasingly clear that the P4 had run out of gas, Intel cross-pollinated between the two architectures. Efficiency-boosting technologies like SpeedStep and the Pentium M’s indirect branch predictor were ported to the P4 as well. In the long run, it was the Pentium M that gave Intel a path to the Core 2 Duo and Nehalem architectures — not the broken, fundamentally flawed Pentium 4.

Calculating relative efficiency between Kabini, Kaveri, and Richland

The simplest way to measure the relative efficiency of these chips is to divide each one’s benchmark score in a given application by (CPU frequency * core count). This normalizes both variables and gives us a measure of intrinsic per-core, per-clock performance. The next step is to turn each of these clock-and-core normalized figures into a percentage. In a test like Cinebench, a score of less than 100% indicates that Kabini is less efficient than its big-core rival, while a score of greater than 100% means Kabini is more efficient.

Our test data was drawn from both our own tests and test results published at other major industry sites. The second set of efficiency figures is based on results in 18 synthetic and real-world tests, while the first set compares only real-world results (10 in total). Even if we omit the synthetic tests where Kabini does quite well, the core is still extremely competitive with AMD’s “big core” architecture, with an efficiency gap of less than 10%. More importantly, there’s low-hanging fruit that would close that distance. Turning the L2 cache back up to full speed would help close the performance gap between the two, as would more aggressive branch prediction.
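The normalization described above can be sketched in a few lines of code. The scores below are made up purely for illustration (they are not our actual test data); the clocks and core counts are the stock values for the A4-5000 and A10-7850K.

```python
# Minimal sketch of the normalization used above. The Cinebench-style
# scores are hypothetical; the clocks and core counts are the stock
# values for the A4-5000 (Kabini) and A10-7850K (Kaveri).
def efficiency_pct(kabini_score, kabini_mhz, kabini_cores,
                   big_score, big_mhz, big_cores):
    """Divide each score by (frequency * core count), then express
    Kabini's per-clock, per-core result as a percentage of the big
    core's. Below 100% means Kabini is less efficient; above 100%
    means it is more efficient."""
    kabini_norm = kabini_score / (kabini_mhz * kabini_cores)
    big_norm = big_score / (big_mhz * big_cores)
    return 100.0 * kabini_norm / big_norm

# Hypothetical scores: 1.20 for the A4-5000, 3.10 for the A10-7850K.
print(round(efficiency_pct(1.20, 1500, 4, 3.10, 3700, 4), 1))
```

With those made-up numbers, Kabini lands in the mid-90% range per clock, per core, which is the shape of result the table above reports.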

Comments

Bryan Meyers

I normally don’t weigh in on articles like this, but I have to say I’m a little disappointed in the analysis in this article. Apart from not having any charts comparing Jaguar processors to Kaveri processors from a performance perspective, I think there is a general lack of understanding about AMD’s design methodology.

Bobcat and Jaguar were designed to be high-efficiency processors (performance/watt) for the embedded arena. These cores run exceptionally well for highly multi-threaded applications, but lack the single-threaded performance that many applications still need. The decision to use Jaguar for modern consoles was the result of many factors. Console developers are exceptional at squeezing performance out of embedded-class hardware. For consoles, latency is one of the largest issues. By having large numbers of cores, more threads can be handled, and more importantly: threads are switched less often. This is critical for the real-time performance needed in a console environment.

Steamroller, Piledriver, and Bulldozer were designed to provide better single threaded performance, while also maintaining multi-threaded ability. This means that these processors are able to handle many different workloads, as expected from a general-purpose processor.

But this also means that AMD has a very interesting set of opportunities in front of them. Development on Jaguar-style architectures allows them to focus on higher performance per watt, in application domains where single-threaded performance is less important than multi-threading or power consumption (Think Kyoto powered micro-servers, mobile phones, tablets, laptops, digital signage, machine tool controls, automation, etc.).

Development on Steamroller style designs allows for focus on improving other aspects of the processor. Things like memory bandwidth, cache latency, floating-point performance, and optimizations for HSA and OpenCL applications for general purpose computing. Kaveri architectures push this further by leveraging HSA and OpenCL without compromising on single-threaded performance.

All of this has been done in order to plan for the “post-pc” era. AMD is now able to start moving into mobile by leveraging low power parts like Jaguar that don’t need the performance of a full desktop system. Steamroller and Kaveri allow them to maintain their presence in the PC and server markets. But this also means that as multi-threaded applications replace single-threaded applications, that AMD will be able to leverage their development on Jaguar in order to improve core counts and efficiency in their future APUs.

But I am NOT saying that Jaguar or Steamroller is the solution to all of their problems. It will only be a combination of the two architectures that will ever meet the needs of today’s computing workloads. That is why this parallel development is important, and it is how AMD has been able to make so much progress these past few years.

Joel Hruska

Looks like Disqus ate my big reply to this. The comparisons above give efficiency rates for the A10-7850K versus the A4-5000. I can give you granular data if you’d like it on a per-test basis.

Regarding the relative efficiency of Jaguar versus BD cores, I think we have to consider the efficiency AMD got, not the efficiency it hoped to get. In a number of cases, Kabini is very nearly as fast in single-thread as Steamroller and it doesn’t suffer from the 10% multi-threading penalty that Steamroller still takes. So how did the BD family end up providing better single-thread performance? It did so by leveraging clock rates — clock rates that are now being pulled sharply downwards.

I’m not suggesting AMD should stop building low-power Kabinis in favor of a high-powered flavor — I’m saying that foundry technology and design priorities continue to favor high-efficiency low-power parts. Next-gen Excavator cores are targeting a 65W TDP envelope. If AMD improves Excavator IPC by 10% but has to cut clock rates 10% to hit that 65W target, it won’t have improved the underlying situation one iota.

If AMD can hit that 65W target with a smaller, leaner, and more efficient Jaguar-derived core, I believe it frees up their resources to burn more on the GPU side or on wider memory channels.

szatkus

1. Looking at single-thread benchmarks, Jaguar has 10-20% lower IPC (a module is not 2 cores and never will be).

2. Probably Excavator won’t need lower clock to hit 65W (shrink).

3. The Jaguar core would have to be completely redesigned and enlarged to hit 3.5-4.0 GHz.

4. Why Jaguar if they still have good, old K10?

5. Bulldozer fixed many of K10’s bottlenecks (http://www.agner.org/optimize/microarchitecture.pdf) and… introduced a lot of new ones. Maybe it’s better to fix the Bulldozer line than to go back to the old microarchitecture? Or merge some elements of Bulldozer into K10 (like Intel brought the P4’s “good parts” into the Nehalem and Sandy Bridge designs)?

Anyway I think that only AMD engineers can answer these questions.

Joel Hruska

1). Jaguar’s single-threaded IPC is between 10-20% lower depending on the application, yes.

2). TSMC and GloFo are only predicting modest improvements at 20nm of 15-20% on power and performance. That means 20% less power consumption at the same clock in the sweet spot, and 20% faster performance at the highest end. Shifting from 95W to 65W is already an enormous cut; AMD certainly *will* have to cut clock speeds to hit that target or build an all-new extremely efficient architecture.

3). I would never suggest clocking Jaguar up to 3.5 – 4GHz. More like 2.5 – 3GHz.

4). Because K10 doesn’t support AVX and wasn’t shrunk to 28nm.

5). AMD has had three years to fix Bulldozer. It obviously couldn’t.
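To put rough numbers behind point 2, here is my own back-of-envelope arithmetic. The 20% figure is the foundry’s claimed power saving at the same clock; the assumption that power scales roughly linearly with clock at a fixed voltage is mine.

```python
# Back-of-envelope check on the 95W -> 65W transition. Only the 20%
# process figure comes from the foundries; the rest is illustrative.
old_tdp = 95.0
new_tdp = 65.0
process_power_gain = 0.20  # 20% less power at the same clock

# Power after the shrink with clocks left unchanged:
shrunk_tdp = old_tdp * (1 - process_power_gain)

# Further frequency cut needed, assuming power scales roughly
# linearly with clock at a fixed voltage:
extra_clock_cut = 1 - new_tdp / shrunk_tdp

print(f"{shrunk_tdp:.0f}W after shrink; ~{extra_clock_cut:.0%} further clock cut needed")
```

In other words, the shrink alone only gets the chip to around 76W; the remaining gap has to come out of clock speed or an intrinsically leaner architecture, which is exactly the bind described above.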

john

well intel had far more time to get a decent gpu out the door that doesn’t cost a kidney to build… still they press on (after 8+ years)… people are stubborn :D

pelov lov

Intel’s Larrabee has become MIC/Xeon Phi, and it’s actually quite good. Intel realized that their ‘x86 even in GPUs’ idea was idiotic, went back to the drawing board and they introduced their new iGPU in Sandy Bridge while adapting Larrabee into a co-processor (soon to be standalone processor). Intel’s GPU problems today revolve around crappy software support/drivers and poor downward scaling to low TDP products (smartphones/tablets).

AMD has likely already grabbed the largest IPC gains from the Bulldozer architecture. Historically, the biggest gains you see from a microarchitecture come in the first 2-3 iterations, and that’s not a good sign considering we’ve already seen two. Piledriver was about a 5-6% IPC bump, while Steamroller is high-single or low-double digits. This brings it to rough parity with Thuban, and that was back in 2008/2009. Upward scaling in voltage/frequency is going to be even more painful with the 20nm and 14nm-XM nodes, and I suspect that’s why Carrizo tops out at 65W. Anything higher than that might not make sense and potentially might not even be feasible.

john

Well the cerry atop bulldozer isn’t none of the improvements or tech in any cpu till now… the crown jewel will be dinamic context switching… in lay men’s terms… automagical offloading to the gpu without opencl or needing to recompile anything. If current hsa benches are holding true we will see up to a factor increases in execution speed or even more depending on how much gpu there is… and what you’re executing…

So i would not write it off just yet… kaveri is a decent chip even a brilliant one in some aspects… excavator however will be the last piece of the puzzle… if it works amd will start being competitive at all levels if not… well then they might want to think of something else… but excavator is what bulldozer piledriver and steamroller are building up to… if we start hearing about any other chip but excavator it will mean it did not work… till now however it all looks ok

pelov lov

“Well the cherry atop Bulldozer isn’t any of the improvements or tech in any CPU till now… the crown jewel will be dynamic context switching… in layman’s terms… automagical offloading to the GPU without OpenCL or needing to recompile anything.”

This is patently false. The GPU is limited to ISAs/Languages just as any other processor. If you don’t believe me, try using CUDA on GCN. And there’s also the problem of languages not being treated equally – compare OpenCL and CUDA performance on nVidia’s GPUs. Code has to be specifically compiled and written for a GPU in mind, otherwise it’s not getting executed there. OpenCL is here today, but there will be HSAIL, Java, Python, etc., all of which will need to be written and recompiled for a GPU. For example, taking serial code written in Java and expecting it to work ‘automagically’ will net you with nothing but an error. The code has to be highly parallel AND written in a language AND a way the GPU can understand. Furthermore, HSA tools won’t be available until later this year, with Java coming in 2015 and who-knows-when (or if) for the rest of the popular languages.

Expecting HSA/OpenCL to be a cure-all for AMD’s woes is frankly nonsensical. It takes years if not decades for the software world to make such drastic transitions, and this is no different. Developers certainly aren’t going to go back and recompile/rewrite their code in order to cater to less than 1% of the market.

HSA/OpenCL is a ‘this might be really good sometime in the future’ type of thing and not a ‘this is going to solve all the problems’. HSA’s adoption will depend on what happens in the ARM camp with companies like Qualcomm, Samsung, Apple and not the likes of AMD.

this is what i was talking about… all SSE derivatives can to a large degree be executed on a GPU. Not saying that the entire x86 can! Nor that you can magically execute OpenCL on CUDA cores…

ziffster1

“none of the improvements or tech in any cpu till now… the crown jewel will be [dynamic] context switching… in lay men’s terms… [automatically] offloading to the gpu without opencl or needing to recompile anything.”

well as pelov already said, “dynamic context switching is likely coming next year with Carrizo. The toolset isn’t even available yet, but when it becomes available it will require support from software developers”

but let’s forget that for the moment, as it appears you do not actually understand what [dynamic] context switching, aka preemptive context switching, means, and your assumption that that tech has not been used already in commercial consumer devices is also wrong.

“The term preemptive multitasking is used to distinguish a multitasking operating system, which permits preemption of tasks, from a cooperative multitasking system wherein processes or tasks must be explicitly programmed to yield when they do not need system resources.

In simple terms: Preemptive multitasking involves the use of an interrupt mechanism which suspends the currently executing process and invokes a scheduler to determine which process should execute next. Therefore, all processes will get some amount of CPU time at any given time.

In preemptive multitasking, the operating system kernel can also initiate a context switch to satisfy the scheduling policy’s priority constraint, thus preempting the active task. In general, preemption means “prior seizure of”. When the high priority task at that instance seizes the currently running task, it is known as preemptive scheduling.”

In effect, AMD are finally providing a very primitive “AMP” (“Asymmetric Multiprocessing”) to preemptively multitask (although it may in fact be cooperative) all processes in the same flat address space, but relying on 3rd-party SW and half-baked APIs, with no mainstream patches to apps you might actually use today, to at least placate the AMD users that may buy this.

Of course, let’s call this AMD HSA flat address space “Chip RAM”, given that all these co-processor chips, whatever they may be, can all use that dedicated RAM, and this chip RAM is the RAM that has contention between the various multiprocessing chips and the main processor. And where there is contention, it slows the processor down.

The processor gets slowed down by contention with other things like DMA and, depending on the hardware setup, graphics refresh. (Old PC gfx boards have their own RAM that is in effect their own equivalent of Chip RAM, except that nothing else can use it, unlike the AMD “Chip RAM”, which can be shared with other things.)

“The chips in the chipsets can all be considered “AMP” “Asymmetric Multiprocessing”, with the COPPER commands as an example of communication between the radically different CPUs”

“

There is nothing that special in the old chips that makes dragable screens that much easier to do.

… Peter Kittel said:

Umm, cough, Copper “nothing special”? You see, to get draggable screens on common gfx chips, you have to copy the whole screen contents to some frame buffer, where the complete screen resides in its several parts. With Copper, such copy actions are not necessary.

I don’t know whether the alternative is possible with such gfx chips, to replace the Copper functionality by some rasterline interrupt and let the interrupt routine change the video source pointers on the fly. Does that work fast enough? I doubt this a bit. And of course you need such a rasterline interrupt in first place, which is not standard.

The Blitter[The blitter does bitwise operations on blocks of memory such as bitplanes, for instance to copy one to another, with maybe some masking done in the process. (It is a function with three arguments.) It also does line drawing and flood filling. Though it is started by a command from the processor or copper, it then operates independently of them and signals when it is finished in a fully asynchronous multi-processing manner. It thus takes a lot of load off the main processor. It is interesting that modern gfx boards are now being built with blitters, but none seems as integrated as the A****'s one has always been.]“

john

Preemptive multitasking has been done on the same core, usually x86; NEVER has it been done between multiple different types of cores… what AMD wants to do with dynamic context switching is not x86-only, it spreads across both x86 AND GCN! That is the new thing about it, and that is why HSAIL is needed in the first place: to transmit data and instruction memory pointers and to call specific cores. Never has context switching happened between 2 different types of cores, let alone dynamically. So yeah, it is very new, and I am still very skeptical of AMD actually pulling it off with Excavator!

And using mathematical coprocessors is very similar: you have a set of instructions offloaded to another core… BUT math coprocessors were slaves, like any other chipset “core”. GCN and x86 are on the same level; they can both enqueue one another! So don’t bring up the dead! There is a reason why they died, and some of those reasons are bothering me and making me nervous about what AMD is doing right now too!

Russell Barlow

x86 in a GPU is not “idiotic”. It just depends on your renderer, tho I think a shit ton of RISCs with the addition of sufficiently wide vector instructions would be better than a few CISCs for that.

In the future these new Parallella chips might be interesting for that task, tho they’re rather immature, and their lack of the SIMD goodness Intel/AMD and ARM have developed really knocks their “64 SMP cores” down a peg.

While the theoretical gflops are close, that’s only if each core is maxed out via something like OpenCL. Tho since they’re SMP, you wouldn’t be limited to OpenCL; it wouldn’t make a bad OpenMP or MPICH accelerator.

ATM Knights Ferry will still wipe the floor with them. Problem with that is I’d need to sell a kidney to buy one.

szatkus

1. Do you agree that single core performance is still important?
2. How about 14XM?
3. So we would have CPU worse than Kaveri/Carrizo…
4. But it was designed to work at <140W TDP (the hardest part; compare a Kabini die-shot with Phenom or something). Adding AVX isn’t a problem, and 28nm isn’t an option for a CPU which would be released at least 2-3 years from now.
5. Steamroller at least has similar or better single-thread IPC than K10. Is it fixed now?

Joel Hruska

I believe that single-threaded performance remains one of the most important metrics for ensuring a good consumer experience with a CPU, yes.

2). 14nm-XM should offer stronger perf, but AMD won’t have a chip taped out and ready for it until 2016.

3). A 2.5GHz clock rate + 15% IPC improvement would put Future Jaguar on par with today’s Steamroller in the 45W space — but with a smaller die.

4). K10 could push 125-140W, but the 28nm SHP process won’t ramp that high. Neither will 20nm. AMD doesn’t need a high TDP chip, they need a low TDP, high efficiency chip.

5). In some tests, Steamroller is still lagging Thuban.

Pangolin_user

For point 2:
Based on Kaveri benchmarks, the 45-watt Kaveri offers 80-90% of the performance of the 7850K (at 83% of the frequency) with only 47% of the 7850K’s TDP.
I think AMD already calculated that 20nm at GF and TSMC would be really bad for a high-frequency Excavator; that’s why, on the recently released AMD plan, the top Excavator part is a 65W part at 3.3-3.5GHz, increasing IPC but lowering TDP and frequency.
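For what it’s worth, those figures imply a large perf/watt advantage. This is just my arithmetic on the numbers above (taking the midpoint of the quoted 80-90% range), not a benchmark:

```python
# Perf/watt implied by the figures above: ~85% of the 7850K's
# performance (midpoint of 80-90%) at 47% of its TDP. Arithmetic
# only; no measurements of my own.
perf_ratio = 0.85
tdp_ratio = 0.47
perf_per_watt_gain = perf_ratio / tdp_ratio
print(f"~{perf_per_watt_gain:.1f}x the performance per watt")
```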

Pangolin_user

If you try to test Bay Trail vs. an Intel Core i7 I think you will get similar results, so do you think Intel should ditch its Core series for Atom?

Joel Hruska

You won’t see similar results. I’ve done the math on it. Haswell is significantly faster than Bay Trail, clock for clock.

MWisBest

Joel, you can’t simply say that Steamroller is getting lowered clock speeds. With the direction AMD is going with their APUs (that being HSA), the 28nm production they’re using for Kaveri at the moment is best suited for a balance between all parts of the chip, meaning the CPU takes a bit of a hit in clock speed for a better overall chip. For example, I believe Kaveri is their first chip to sport official DDR3-2133 support (previously they were all DDR3-1866), and since APUs are extremely sensitive to RAM speeds this was a good trade-off. Kaveri is also their first chip to have PCI Express 3.0 support, and their first APU to have a GCN-based GPU.

In my opinion, AMD’s idea with the Bulldozer design was to eventually supplement the reduction of FPU cores with the much better suited GPU cores. With HSA, this is really starting to become a reality, and they have plenty of big companies backing HSA, such as ARM, Qualcomm, Samsung, TI, Broadcom, Oracle, and Sony, just to name a few. HSA is definitely not going away anytime soon, and 10 to 15 years from now I can see FPU cores being phased out completely.

You keep discussing “memory channels,” but the thing is, DDR4 is getting rid of the multiple-DIMMs-per-channel approach in favor of a more GDDR-like implementation, so your argument on that is irrelevant and, in my opinion, puts a dent in any credibility you have for not knowing it. That’s news I’ve known for years, and I’m certainly not a journalist whose job it is to keep up with news like that.

Jaguar can’t simply hit frequencies of 3.5+ GHz without a different design. If it were that simple to scale architectures to higher frequencies, it’d be done already. Intel didn’t scale down their existing successful Sandy Bridge or Haswell architectures for their Atom chips, nor did they do the reverse of that. They needed a different design to target that market (low power), and so does AMD.
For example, look up high res pictures of the APU designs, from Llano to Kaveri. You’ll see that the size of the K10 cores in Llano is similar to the Piledriver and Steamroller modules of Trinity/Richland and Kaveri, however since they are using 2 modules and not 4 (you get what I’m saying here…), there’s more die space for the GPU.

Joel Hruska

1). Richland supported DDR3-2133.

2). The AM3+ chipsets have PCIe 3.0 (at least, later boards do).

3). DDR4 implements a one-channel-per-DIMM approach, yes, unless splitters are used. I’ve written about this previously.

4). Jaguar cannot hit frequencies of 3.5GHz. Jaguar doesn’t need to hit frequencies of 3.5GHz. Just as Core 2 Duo outperformed the P4 at far lower frequencies, Jaguar could be tweaked to outperform or equal Steamroller.

5). Steamroller was designed at a time when AMD assumed high-power process nodes would continue to be available. Look back at 2011, when GlobalFoundries was still talking about a 20nm high-power node, or when FD-SOI was still on the table for future AMD chips. Those options are no longer on the table.

6). Even if a future tweaked Jaguar was just as efficient as Steamroller, it could hit that point in a smaller overall die, thus giving *more* space to implement a larger, more powerful, GPU.

MWisBest

1. My apologies, you’re correct on that.

2. AM3+ chipsets themselves don’t support native PCIe 3.0. Motherboard manufacturers got tired of that, and decided to get an external solution for it, similar to what was done for USB 3.0 before chipsets and/or CPUs had native support for it, and native support is definitely better than those external chips (especially speed and cost-wise).

3. Again my apologies, I’ve honestly had a hard time finding (at least reliable) information on DDR4. That article was a good read.

4. Core 2 Duo outperformed P4 for lots of reasons. That was the dawn of multi-core CPUs, among lots of other innovations. The gap between Bulldozer and Jaguar isn’t of that kind of magnitude.

5. Well, you never know what the future holds in store. There might still be a chance of those options being available somewhere.

I would need to know where Jaguar needs to get clock-frequency-wise and IPC-wise to come even with Steamroller. If they really need to bump up the frequency, it would probably require some changes to the Jaguar design, changes that would add to the die space. Clock speed doesn’t simply scale linearly, I can see numerous places where a bottleneck could occur. Usually when the frequency gets higher and higher, there are more and more diminishing returns.

Joel Hruska

I can show you clock scaled, if you’d like to see it. At least I think I can. Slowing down Steamroller to 1.5GHz shouldn’t be impossible.

But clock-scaling is usually linear within a given space. Here’s what I mean by that: Each application may scale differently, but scaling within the application should always be consistent until you hit a bottleneck. So if increasing clock speed by 20% yields a 10% improvement in application performance, we should see that rate hold until RAM speed becomes a bottleneck.
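To put hypothetical numbers on that idea, here is a tiny model of what “linear within a given space” means. The scaling rate and bottleneck cap are illustrative values, not measurements:

```python
# Illustrative model of "linear within a given space": a fixed
# clock-to-performance scaling rate holds until a bottleneck
# (e.g. RAM bandwidth) caps further gains. All numbers hypothetical.
def app_speedup(clock_gain, scaling_rate=0.5, bottleneck_cap=0.25):
    """A 20% clock bump yielding a 10% app gain implies a rate of
    0.5; assume the rate holds until the bottleneck cap is hit."""
    return min(clock_gain * scaling_rate, bottleneck_cap)

for gain in (0.10, 0.20, 0.40, 0.60):
    print(f"+{gain:.0%} clock -> +{app_speedup(gain):.0%} app performance")
```

The flat region past the cap is where you would conclude RAM speed (or some other resource) has become the bottleneck.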

I have a Jaguar laptop here. I could run at least some simple tests against a single-channel Steamroller at 1.5GHz. I can’t do anything about the half-speed L2 cache, however.

Joel Hruska

Regarding #2: I reviewed an FX-9590 earlier this year. To the best of my knowledge, there is no bridge chip being used. The Gigabyte board I tested did not appear to have any such PLX chip (and I looked for one, because I was trying to figure out why performance characteristics looked a certain way.)

I know Asustek announced a motherboard with a PLX bridge, but I don’t think Gigabyte used one.

Joel Hruska

Check the newest comment in the story: I ran some Kabini vs. Steamroller tests.

Joel Hruska

Oh — to address two points you raise more directly:

1A). HSA is a great idea, and a worthy goal, and I support AMD’s work on the standard. That said, no one else has announced hardware backing it. Given the complexity of implementation, we’re looking at 18-24 months before other manufacturers are shipping fully HSA-compliant solutions (not just heterogeneous computing-capable, but HSA-compliant).

That means AMD can’t afford to sit back and say: “Well, we did HSA, so we’re done.” It has to continually reevaluate the relationship between old cores and new cores, including changing where it spends its transistor budget if it makes sense to do so.

1B). There’s no reason AMD can’t do a quad-channel DDR4 design. Intel’s Haswell-E, expected to debut later this year, will have such a configuration. A quad-channel interface is one way to fix the perennial bandwidth problems of the APU, and a Jaguar frees up core area to do it with.

pelov lov

You’ve neglected a few important aspects here:

– Single-threaded performance HAS NOT improved. Single-threaded FP code utilizing AVX has, but only because K10/K10.5 did not have the ISA extension. Integer hasn’t budged much, if at all, either. Thuban hit roughly the same clock speeds on a 45nm SOI process and had

– The HSA/OpenCL improvements have more to do with the memory controller and I/O than they do the core architecture – and by core architecture, I’m referring to x86 core/modules.

– The current process/fabrication side heavily favors wider cores (or going wider as in more cores) at lower frequencies than it does the Netburst-era ‘let’s hit 10 Gigahertz’ line of thought. Bulldozer is neither of these.

I think Joel should have outlined what AMD originally slated to do with Bulldozer and its derivative architecture in order to clear up some of the fuzzy logic in these replies due to misconception.

AMD wanted to ‘hold the line’ on IPC while scaling upwards in core count – specifically integer cores. The architecture was designed to take back considerable market share in servers, where AMD had then slipped to below 10%. Dirk Meyer and the engineers felt that increasing x86 integer cores (moar coars) while offering decent FP performance despite halving the 1-to-1 ratio (now with AVX) would sell well in the server segment.

But what happened?

The clock speed targets (consumer space) were mid-4GHz and the IPC was supposed to hit Thuban/K10 levels. Neither of those happened, but for different reasons. The IPC was 10-20% below early estimates (silicon-in-hand in Q1 2011, but computer simulations failed them), while GloFo fumbled the 32nm SOI process. Llano, the first chip on GloFo’s 32nm, suffered horrendous yields and was the likely culprit for the lower clock speeds. Yields didn’t pick up until about late Q2, but the process was still lackluster and Bulldozer never hit its design targets. Vishera addressed some small issues with Bulldozer’s design, but the clock speeds still never reached mid-4GHz at full load.

It’s now 2014 and the silicon world has been turned on its head since 2008, back when Bulldozer’s design started. Low power rules at both the microarchitectural level as well as at the fabrication level, and those who choose to pursue high TDP and clock speed designs are abandoning them in favor of putting their money elsewhere (Intel and GPUs/MIC) or scaling back R&D or ‘opening up’ (see IBM’s recent announcements for Power).

TL/DR – Even if the Bulldozer architecture didn’t suck, it still would be a misfit in today’s market due to very high power consumption and nowhere to build it. Server market share is sub-5% (lower than ARM) and the march towards low power and SoC-like architectures has left AMD fumbling about with now 4 separate architectures and not enough funds to cater to them (GCN, ARM, Jaguar, Kaveri).

Joel Hruska

Pelov,

Yeah, you pretty much nailed this with everything I wanted to say. Only thing I’d add is that Llano missed because of GPU problems, not anything on the CPU-side. My sources have told me that BD yields were excellent, but AMD kept respinning the core to improve it.

tgrech

Is there a comparison vs. the 65W or 45W Steamroller parts? With Richland, dropping from 100W to 45W (a 55% cut in TDP) generally only resulted in 5-15% speed decreases. I don’t think the 100/95W K-models should be used for efficiency comparisons when the whole point of them seems to be sacrificing efficiency for slight boosts in stock speed and more OC potential.

Also, it’s generally a lot harder to scale up than down, the changes required to Jaguar to reach higher clock rates to compete with the big cores, and the efficiency losses from using these higher clock rates, could stop, or at least greatly delay the time when the cat cores are truly more efficient and competitive with the big cores.

In terms of mass market appeal, for next generation I think it does make more sense to release bigger cat-core based SoCs for mainstream desktop/laptop, but that’s more because most people can do pretty much anything they want fast enough for them on current/outdated mid end hardware.

john

Why would it be a problem? Jaguar is a remote relative of K10… why do you think it wouldn’t work at high frequency, provided more cache and deeper pipelines? It’s just a matter of adjusting things, exactly like Joel stated in the article… that’s not the crux of it…

tgrech

Jaguar being a K10 relative is massively overstated; the modern Jaguar core is almost completely different from the classic K10, and even the K10-based Llano cores had trouble reaching clock speeds most people would consider high. Also, more cache can rapidly increase overall TDP, so it certainly has major drawbacks. Whereas, as we’ve seen with Richland, getting much better efficiency out of a big core just requires slight clock speed drops and presumably the use of less leaky silicon. It’s also a lot easier to fuse off less useful parts of a CPU to save power; in the case of, say, a GPU, adding more cores and then downclocking them can give much better power efficiency than decreasing core count and increasing clock rate. Everything I see points to it being a lot easier to make big cores fit lower TDPs than to make little cores perform well in big ones.

Of course, there’s also always Intel’s technique of quoting an SDP for higher-wattage parts, fitting higher-TDP chips into lower-power designs without doing anywhere near as much engineering work as it would normally take.

john

It’s because the current silicon for Kaveri is actually a GPU-oriented process, not a CPU one. That drastically improves density but sacrifices the ability to reach higher frequencies. This is why Kaveri does best in the 2-3GHz range and decays in efficiency up to 4.5GHz, above which power draw and TDP are just insane. That’s why you see this nice downward scaling: not because the CPU is built to scale down, but because the process hits its maximum efficiency somewhere around the 2GHz mark.

Yes, cache does increase TDP; both idle and peak power are very much influenced by it. But the deep pipeline needed for higher frequencies also requires a bigger cache.

If Kaveri were on a CPU process it would have much lower density but run at much higher frequencies; I’d say the GPU could hit 1.2GHz and the CPU 5GHz. The problem is that AMD wants powerful GPUs, and those do best with density rather than frequency. So it was a gamble. They improved CPU IPC by 20%, like they once did routinely; that’s the gen-to-gen improvement we all expected back in the days before the Core 2 Quad. But it was overshadowed by two facts: first, it was really a repair rather than an improvement, and second, they sacrificed frequency to the point where the IPC gain nets only 5-10% in actual performance, which is very Intel-like (as of late).

So I’m still not convinced scaling down is easier than scaling up. What if the big core has power inefficiencies that can’t be fixed without changing it completely? The other direction cuts just as deep: what if the small core has bad branch prediction, so you lose lots of cycles for nothing, and the performance gain from upscaling doesn’t increase linearly with frequency, leaving you with double the power draw for half the improvement? I still think AMD’s way is the best at this technological point, with the tools and background of this age. HSA and dynamic context switching are the future for both down- and up-scaling, because you can use the best of both worlds. They should also allow much finer-grained power management: with multiple small compute units you can shut them down and power them up as needed, much like ARM does, and scale in small steps rather than leaps.

tgrech

The good downward scaling is not merely a side effect of the process used for Kaveri; Richland and Trinity used very much CPU-focused processes and saw very similar scaling. Even with FX cores, a 10% increase in clocks (FX-8350 to FX-9370) requires an almost 85% increase in TDP (120W to 220W).
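As a back-of-the-envelope illustration (the voltage figures below are my own guesses, not AMD specs), the classic dynamic-power relation P ∝ C·V²·f shows why a small clock bump can blow up TDP:

```python
# Rough dynamic-power model: P is proportional to C * V^2 * f.
# Voltage ratios here are illustrative assumptions, not measured values.
def relative_power(freq_ratio, voltage_ratio):
    """Power scales linearly with frequency, quadratically with voltage."""
    return freq_ratio * voltage_ratio ** 2

# A 10% clock bump that needs, say, a 30% voltage increase:
print(round(relative_power(1.10, 1.30), 2))  # 1.86x power for 1.1x clocks

# The other direction: a 10% clock drop allowing a 10% voltage drop:
print(round(relative_power(0.90, 0.90), 2))  # 0.73x power for 0.9x clocks
```

That asymmetry is the whole argument: modest downclocks buy large power savings, while squeezing out the last few hundred MHz costs disproportionately.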

Joel Hruska

The 28nm bulk silicon for Steamroller is flatly incapable of scaling to a 225W TDP the way 32nm SOI did. At least, that’s my understanding.

john

Well, using 225W won’t be a problem; I think 4.4GHz would suffice. :))) You would, however, not achieve 5GHz unless LN2 cooling and a kilowatt were used, so efficiency goes out the window. Maybe the process can be improved, but I don’t expect much, as 28nm is already fully matured.

Joel Hruska

I didn’t say Jaguar is a K10. I said it scales like one.

That means it doesn’t pay the shared-core penalty that BD does.

tgrech

I didn’t think you mentioned K10 in this article; I was referring to John’s statement that “jaguar is a k10 remote relative.”

Joel Hruska

The 45W Steamroller chips are much faster than the 45W chips they replaced, but the comparison here is designed to normalize clock speed. I ran figures against the A8-7600 and the A10-7850K. The percentages came out identically in that regard as one would expect.

I expect that AMD would inevitably lose some headroom and die size advantage if they built using a Cat core instead of a BD core. But for all Steamroller’s improvements, it’s still a core that was designed with the anticipation of high clock speeds and high-end processes that would enable that scaling.

So to me, the question is this: Which is harder? Scaling up a Jaguar-class core, for better long-term perf, or trying to rebuild Bulldozer into not-Bulldozer, and get a fundamentally different balance of execution capabilities?

I’m guessing that scaling up Jaguar is easier because it starts from a place that’s better aligned with where the industry as a whole is moving.

The other thing is this: AMD knew Bulldozer had problems before the chip launched. They started work on Piledriver before BD had shipped. So if you think about the dates, AMD has had nearly three years to respin Steamroller. The chip we got is much improved in terms of IPC, but it’s not hitting where it needs to hit to be competitive with Intel.

john

Hrm… erm… well… you know AMD is working on bringing dynamic context switching, don’t you? That would effectively mean the FP unit, which is shared and slow on Kaveri and all Bulldozer derivatives, can be “moved” to the GPU, which has a much wider machine. This means no HSA “adoption” will be needed. HSA compilation will at times provide an additional bump, and OpenCL too, but with dynamic context switching the advantages will diminish drastically. Coding specifically for the GPU inside an APU will bring benefits, but nowhere near the benefits we’ve seen in the past, or see now, with the introduction of HSA. Plus, HSA-enabled OpenCL drivers will make the transition seamless: you run the same code, only much faster, because you skip the memory ping-pong. Now, with the next HSA iteration, AMD will (hopefully) make this happen for normal x86: macro-op code best suited for parallel execution will go to the GPU (the GCN cores), and sequential code (non-SSE and derivatives) will go to the integer pipelines (if those don’t get moved to sequential execution units on the GPU too, like the ones Nvidia has on its architecture).

So there is no point in ditching one architecture for the other. Decoder efficiency, preemption efficiency, and memory/cache latencies will be the defining factors; the x86 core itself will bring little to the table. You could probably improve some specific scenarios by 10-20% the way you suggested, but the way AMD is going makes far more sense, at least to me. Plus, Intel did not have a GPU or a technology to offload execution to, and the Pentium 4 was just dreadful. Steamroller is not half that bad: it has better IPC than Richland, castrated to some extent by the process used, but you can’t have it all when you’re the ten-times-smaller underdog, can you? Kaveri is still a decent chip, with enough x86 for any normal household or office task (professional and extreme-enthusiast users excluded), and the GPU is decent for 720p/1080p.

massau

OpenCL 2.0 brings full support for the context-switching model and will work in parallel with HSA, because many key members of OpenCL are also members of HSA.

But have you ever tried to use OpenCL? It has a steep learning curve, and it is very verbose just to call a function or algorithm that has to run on the GPU.

john

Well, yes, I do code in OpenCL frequently. My point is that HSA plus dynamic context switching for x86 instructions is a much better approach than the one proposed by the author. Just my own opinion.

Joel Hruska

John,

Even if you’re right, I think the small Cat cores are still a better long-term fit. If HSA and OpenCL become paramount to the future of computing, then why devote more resources to the core than you have to? Steamroller cores are still much larger than Jaguar cores, and I don’t expect that ratio to change much.

john

Just saying that at the moment, I think Kaveri with dynamic context switching would improve performance by a huge margin; just look at the current HSA benchmarks. With dynamic context switching you won’t get as much improvement, but probably 70% of it could still be achieved without changing a thing in the compiled program. If that’s the case, it doesn’t matter what “core” you use; I guess a small Bobcat would be enough. No idea; you may very well be right about it.

Joel Hruska

Let’s say AMD’s next CPU has dynamic context switching, that the market embraces HSA, and that the OpenCL drivers are fully baked. Since OpenCL 2.0 drivers from AMD aren’t expected until 2015, that means 2016 for serious software uptake and hardware use.

If AMD really has a dramatic alignment shift thanks to software support, then they don’t need to dedicate as much hardware to driving the CPU core, and Jaguar is a better fit. If I’m right, and such improvements are a long way off, they still need a CPU core that fits the demands of modern foundries and consumers better — and Jaguar is a better fit.

john

Yes, you may be right; I already said that. The only problem is I don’t know whether the Jaguar cores are ready for full HSA, or what corners they’ve cut with the decoder and macro-ops. It may be too dumbed down; no idea, just saying. But you may very well be right, as stated before.

john

I still think you don’t fully grasp dynamic context switching. At the moment it works like this: HSAIL can be used to talk between any cores on the APU, and they are already studying how to do it in a cluster. The HSAIL code acts as the inter-core glue; with it you can address cores and exchange memory pointers. Dynamic context switching means you no longer need to write and compile HSAIL code to do this; the cores do it themselves, calling external resources wherever applicable. When composing the macro-ops, instead of pushing the FP ops down the x86 cores, they defer execution to the GPU using preset HSAIL glue code that the compiler is oblivious to. This means that, if it works, the x86 core can just dump work on the GPU WITHOUT any recompile. You don’t need a special language like OpenCL or to write the HSAIL yourself; the decoder (or a specialized part of it) does it for you. Now, whether this comes with Excavator… I am skeptical. It is very complex, mostly because of the different clocks and all the memory-coherence problems. It can be done, but it is a huge technical feat. It is also a risk: if the decoder wrongly assigns the GPU to execute something it shouldn’t, performance goes to waste. And you have to account for delays in inter-core communication and such. As I said: skeptical.

However, you do not need OpenCL 2.0 for current OpenCL to get a huge boost from HSA. Most of the time wasted in OpenCL is the memory ping-pong: you tell the driver “here are your arrays,” it moves them and runs the kernel, and then the driver retrieves the result array back to main memory for your main program. Even with the current syntax, you can update the driver to use HSA and not copy the arrays all over the place, but instead just pass the memory pointer when possible.
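To make the ping-pong concrete, here’s a plain-Python analogy (no real OpenCL/HSA calls; lists stand in for buffers):

```python
# Classic OpenCL-style flow: explicit copies in and out.
data = list(range(8))
device_buf = data[:]                       # host -> device copy
device_buf = [x * 2 for x in device_buf]   # "kernel" runs on the copy
result = device_buf[:]                     # device -> host copy

# HSA-style shared memory: both sides work on one allocation.
shared = data                              # just another name for the same list
for i in range(len(shared)):
    shared[i] *= 2                         # in-place; no copies at all

print(result == data)  # True: same answer, but the copies were skipped
```

Same result either way; the difference is how many times the arrays get moved, which is exactly the overhead HSA’s shared pointers remove.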

Joel Hruska

My problem with what you’re describing is that the GPU has to be able to execute that code. Let me describe it in steps:

In order for what you’ve said to be true, the GPU has to be capable of executing native micro-ops. In your example, in Step #3, the decode units spin this code off to the GPU instead of sending it to the FPU. But that means there has to be a significant hardware block that can parallelize the code and optimize it for running on the GPU.

Since the CPU decode blocks are aimed at general compute rather than extracting tons of parallelism, I don’t see how you *avoid* the need for HSAIL and a high-level wrapper for the code.

john

Well, nobody said it would be easy or little work; I am very skeptical they can pull it off with Excavator. But since you basically have two RISC cores specialized for different tasks, you can take the SIMD instructions that are already highly parallel in nature, like the SSE ones, and move them to the GPU, since that is the best place to execute such instructions anyway. It is a huge amount of work and research, yet AMD committed to doing it when they announced dynamic context switching. Yes, the decoder would be considerably more complex, and many other things can go horribly wrong.

Yes, compiling for GPU and CPU specifically, and writing the glue code yourself, will yield much better execution paths. But if HSA plus OpenCL can give a large factor of improvement with loads of work, this would probably yield 30-50% of that optimization, provided modern (read: SSE-enabled) compilers were used in the first place. However, doing it wrong will result in the CPU and GPU waiting to sync all the time and not getting anything done; that’s why I’m very skeptical it will work in the first place. But it IS the programmer’s holy grail if they pull it off.

Joel Hruska

I assume you’re talking about hQ? Heterogeneous queuing?

My understanding is that this does not involve CPU-level instruction dispatch. It does allow the GPU to execute GPU code, and to call the CPU to execute CPU code, but it does not allow the CPU to fire native code at the GPU for execution.

What you’re describing is, I think, cost-prohibitive and rather rigid. I just don’t see us reaching a point where the front-end of the CPU can decode and optimize for both CPU-style and GPU-style code, or a place where you can easily switch from executing one or the other in exactly the same silicon.

john

Nope, what I am talking about is preemptive multitasking; search for “preemption” on Wikipedia, it explains it way better than I am able to!

AMD has committed to doing exactly this dynamically. However, they did not state which CPU it will come with; Excavator is only a guess (mine and many others’, and it might just be wishful thinking on our part). They are also studying preemption in cluster scenarios and apparently have made some inroads there.

In this particular case, GPU and CPU tasks are interwoven, and the CPU and GPU can switch execution to one another dynamically and wait for the other core(s) to finish so the preempted task can continue. This can be done at the moment only by compiling very specific code with HSAIL glue; in the future this should happen automatically for some instructions, without passing through the OS and driver stacks, or at least so we were led to believe.

massau

Yay, someone who talks about OpenCL and actually uses it.
I think I’ll have to do some more OpenCL exercises before I judge, but it will take a lot of time before I “master” it.

john

Well… it’s not THAT hard. You have kernels: compile, load, load the arrays, run the kernels, retrieve the data. Simple. The syntax is C90, so all basic stuff. I run it from Java and it is really easy; it is way harder to actually write algorithms that use it efficiently in conjunction with Java, but that is not language-dependent, now is it? :D One of the first things I did was compute arrays of geo data into pixels. Quite simple and easy to do, plus you can draw it afterwards, which is rewarding. Just take geo data for the borders of a country (pick a high-resolution set so you get a few thousand points). I had a lot of fun with this when I started learning OpenCL, and you can whip it up quite easily. Not very useful, but nice to play with and get a sense of it.
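For anyone wanting to try the same exercise, here’s a minimal CPU-side sketch of the geo-to-pixels idea (the border coordinates below are made up for illustration; a real run would use a few thousand real points):

```python
def geo_to_pixels(points, width, height):
    """Map (lon, lat) pairs into a width x height pixel grid
    using a simple linear (equirectangular-style) projection."""
    lons = [p[0] for p in points]
    lats = [p[1] for p in points]
    min_lon, max_lon = min(lons), max(lons)
    min_lat, max_lat = min(lats), max(lats)
    pixels = []
    for lon, lat in points:
        x = int((lon - min_lon) / (max_lon - min_lon) * (width - 1))
        # Flip y so north ends up at the top of the image.
        y = int((max_lat - lat) / (max_lat - min_lat) * (height - 1))
        pixels.append((x, y))
    return pixels

# Four fake border points as (lon, lat):
border = [(20.3, 43.6), (29.7, 43.6), (29.7, 48.3), (20.3, 48.3)]
print(geo_to_pixels(border, 640, 480))
```

The per-point math is independent, which is what makes it a natural first OpenCL kernel: each work-item computes one pixel.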

massau

I think I’d better start with just a histogram of a bitmap picture, maybe some detection later on, or just a Haar transform.
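A histogram is a nice first target because the serial version is only a few lines; the interesting OpenCL part is then handling concurrent updates (atomics, or per-workgroup partial histograms merged at the end). The serial reference, sketched here in Python rather than OpenCL C:

```python
def histogram(pixels, bins=256):
    """Count how many pixel values fall into each intensity bucket."""
    counts = [0] * bins
    for value in pixels:
        counts[value] += 1
    return counts

# A tiny fake 8-bit grayscale image as a flat list of intensities:
image = [0, 0, 128, 128, 128, 255]
hist = histogram(image)
print(hist[0], hist[128], hist[255])  # 2 3 1
```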

john

Hrm, those are all copy-paste from tutorials. Doing something from the ground up will help you a lot more, even if it ends up simpler; you will have learned a great deal more about it in the end.

massau

I shall see. The problem is that I don’t have a real need for it at the moment, so I just want to learn it to have a tool in my toolbox, and hopefully I can use it when the time comes.

massau

Do you still have the APU?
Maybe you should try to undervolt and underclock the CPU. I’m pretty sure the APU will shine at a lower TDP, since the bulk process is better at lower clocks and is also targeted at mobile platforms.

Also, if you do this, you would have a media-exclusive article.

john

I just got mine; I will surely try it :) both this and overclocking the GPU to 1GHz, since I’ve seen it’s possible :) My new 2.4GHz memory sticks should make this APU shine.

massau

Please keep me up to date. I think the mobile versions will be clocked at 2.5GHz to 2.8GHz base and 3GHz to 3.2GHz turbo; that would give you a good undervolting target. Power measurements would also be cool, but I don’t think you have the hardware to measure at the power line itself.

john

Funny enough, I do have the necessary equipment :D Anyway, I still have to wait for the motherboard and the rest of the components, around Monday approximately, as I’m building a new system with it. I’ll keep it in mind and test :) And do NOT ask why I have this kind of equipment… :D

massau

If you have that equipment, why not become a reviewer? :P Let me guess: you do some kind of work in hardware analysis/creation, or in high-performance computing where you have to check performance per watt? (Just wild guesses.)

massau

Any news on the underclocking/undervolting of the Kaveri chip?

Dozerman

Hmm… I think I remember something along these lines being discussed in the comments of a previous article a while back; something about big cats being scaled up to take BD’s place?

ET3D

It’s interesting that Intel is doing it now, using Silvermont for Pentiums and Celerons.

I imagine AMD does think about this option; it’s just not at a place where it feels it’s practical to switch.

szatkus

Actually, Intel is doing the complete opposite. Look at Broadwell’s TDPs.
They’ve released some models as Pentiums and Celerons because Atom has a bad reputation. It’s still the same power envelope as in older Atoms.

I think AMD needs a new CEO. They have been hopelessly stuck at Core i5-level performance for six years; in my opinion, Read is too conservative, painfully slow, and not a good decision maker, from what I have seen so far. If they released a new APU every six months, with each quick iteration increasing IPC, they would reach Core i7-level performance by sometime next year. But the new AMD, painfully slow as always, will continue releasing new cores every year even though they know they are far behind Intel, and on top of that, the yearly IPC increases are rubbish at best. I see no ambition at all; I see a fearful AMD, too scared to be the best it could be. Remember the old Athlon XP days??? AMD was a worthy contender back then, not a joke. Colette LaForce looks like a better alternative for the top spot at AMD, at least in the interim.

Joel Hruska

Read has only been CEO for about two years. Three tops.

ziffster1

He probably should just ask ARM, Samsung, and VIA to make him a new interim design :)

Actually, I’m curious how the latest VIA quad-core processors fare against these APUs in terms of power and per-clock data throughput at their 1.2+GHz clocks.

ZIff, you seem obsessed with the idea that adopting CCN is some sort of magic bullet for AMD’s woes, or that ARM doesn’t have to deal with the physics of CPU power consumption and TDP. Neither is true. CCN is not some bulletproof, automagically awesome button for high performance. It’s just an IP block.

Via hasn’t sampled CPUs in years. They weren’t able to effectively capitalize on the Nano, which is a shame. They’re still building on 40nm.

ziffster1

No, not a magic bullet, but rather a stopgap, real-life solution to the ever-present memory bottleneck. Show me something in the x86 landscape or elsewhere and I’ll listen. As you already pointed out, there’s apparently nothing in AMD’s mid-term plans as regards WIO etc., and yet you’re advocating that AMD take two or three years to make a quad-channel DDR4 controller and related improvements.

That’s the CCN-504 and that’s a 16-core design. Sure. But the CCN-504 and CCN-508 are not HSA compatible. They are designed to fit the needs of huge core installations, though they can be scaled downwards, that’s not necessarily the best use of their resources. And the 504 doesn’t offer more memory bandwidth than is currently available.

AMD is almost certainly using the 504 for its upcoming Cortex-A57 chip, but I just don’t see the 508 serving their needs more effectively than their own IP. Moving x86 to the 504 is non-trivial.

john

lol… Read has made drastic decisions in his short tenure. He respun Steamroller, which incurred a half-year delay, but the result is OK (and it was a huge gamble); I wonder how bad it would have been if they hadn’t. He has branched out in so many directions in two years it’s scary; microservers and the entire custom-chip flank are both his babies. Now you want to call Read a conservative? You’re hilarious. His decision-making turned AMD 180 degrees, from bankruptcy into the black.

Not even Intel has six-month cycles, and as of late it is struggling even with yearly cycles. So what planet are you from??

Marcus Mendez

Brilliant and insightful analysis all around, john. I often wonder if all the tech blog sites in the US are in Intel’s or Nvidia’s pockets.

John Smith

One of the best articles on ExtremeTech in a very long time.
Ever since Jaguar was released, I’ve wondered why AMD doesn’t spin off a high-performance core from it.
It’s massively more area-efficient than Bulldozer.

Jose Vasconcellos

AMD, make a Core i7-level processor fast, something that trades blows with or beats Intel (remember the Athlon XP???? So it’s possible). If you offer that kind of quality (not rubbish) to your customers, sales and therefore profits go up, and because of the halo effect, it will be easier to sell your other products, since a great CPU core (NOT RUBBISH) will win you respect in the market. Simple!

john

Yeah, we do remember. And what good did it do? The market let itself be fooled by Intel’s strategies time and again. Doing something like this from AMD’s current position would bring it nothing but grief, as the infernal Intel marketing machine would start up and tear through AMD like it did last time. AMD needs a much better foothold this time around, and it is building one slowly but surely. Whether it will be able to strike at Intel decisively remains to be seen.

Joseph Valverde

I spent several nights thinking about this. Just make the core wider (with the same 1% power / 2% performance tradeoff Intel uses), borrow key elements from its bigger brother, and later implement simultaneous multithreading, and we could have a beast of a core. It would take some time, but leveraging pieces from the big cores while carefully scaling performance up and power down could do well for AMD and the overall computing market.

gadget_hero

Let’s not forget that a large part of the superstar team Rory Read has been assembling (Charles Matar from Qualcomm; Jim Keller, Raja Koduri, and Wayne Meretsky, all from Apple; Mark Papermaster from IBM) has yet to have design influence, as most were brought on board a year ago or less. It’s likely that Piledriver and Steamroller were both set in stone, as it were, and their first major influence will be seen in Excavator.

Asdacap Cap

Hold on, people, the mobile parts are not here yet. The 45W part seems to be a major improvement, so it is likely we may see this in mobile too. By the way, I do think they are taking a page from Intel by focusing on mobile first, then desktop. The thing is, it seems they don’t have much resource to look at desktop at all, though I’m not sure about the microarchitecture. I do agree that they should stick with Piledriver for server processors; that should allow them to increase (integer) core count much more easily than with Steamroller, though I’m not sure about 32nm SOI. Saying they should abandon Steamroller in favor of Jaguar is like saying Intel should drop the Core architecture in favor of Atom. If they focus Jaguar on performance instead of efficiency, how will they compete in the tablet/low-power segment? And if they focus on power without another architecture for performance, we might lose the desktop altogether.

Joel Hruska

No. It’s like saying Intel did the right thing by dumping Pentium 4 in favor of Pentium M. And let’s keep in mind, Intel’s Core is far faster than Bay Trail, which AMD can’t claim.

David Stanley

I doubt the author has any idea what is coming.
Kaveri can out-compute anything Intel has by a very large margin.

Ken Yap

Dump Steamroller because HSA isn’t ready? How backward-looking is that! HSA would never be here if AMD hadn’t released all these APUs. Even Intel is desperately improving its graphics and investing more die space in the IGPs on its chips. Had AMD backtracked now, you can be sure Intel would once again be ahead of AMD by the time HSA is popularized. AMD fell far behind Intel because they went ahead believing HSA/APUs would be the biggest success of the future when they bought ATI. There is no reason for them not to commit 100% to what they have sacrificed so much for, now, at this stage.

Joel Hruska

Steamroller is not HSA. Steamroller is a CPU core, not the HSA execution model. Any CPU can be made HSA compatible (that’s the entire point of the HSA program).

Thus, a future version of Jaguar will support HSA. Intel, Qualcomm, and Samsung could build chips that support HSA. The HSA standard is designed to allow any manufacturer who wants to implement it to do so.

ziffster1

“A steamroller (or steam roller) is a form of road roller – a type of heavy construction machinery used for levelling surfaces, such as roads or airfields – that is powered by a steam engine. The levelling/flattening action is achieved through a combination of the size and weight of the vehicle and the rolls: the smooth wheels and the large cylinder or drum fitted in place of treaded road wheels.” :)

HSA Foundation PRM Working Group Chair, Chien-Ping Lu of MediaTek

“MediaTek is a staunch supporter of heterogeneous system architecture and very pleased with the public release of HSAIL.

Opening and standardizing the interface between CPU and GPU allows for parallel operation of these 2 key processors in mobile chipsets, and most importantly, creates portability of high-level software applications.”, said Mohit Bhushan, VP & GM, MediaTek.

No one can start writing HSA code until AMD releases the SDKs, and that’s not happening for a while yet. MediaTek can stand up and yap all it wants; show me when they announce products with implemented hardware support.

There are quite a few new libraries and SDKs out there, and there is already software using HSA-compiled code. Barely a few, but they exist, and partners will start releasing things too, chief among them Oracle with JVM 8, which will have some nice features to leverage HSA. OpenOffice is also a good software partner; if you build a $500 system, you most definitely are not paying $350-400 for Office, so it is a good bet for mainstream users. The Java angle is very good for servers, as it will help improve execution for really large businesses that buy hardware by the shipload. Steam support, and Linux support in general, should also be a priority for AMD going forward.

As to hardware, MediaTek said they would use some heterogeneous tech to make true octa-cores on ARM. Not much, I know, and very fuzzy too, but it is a start.

Joel Hruska

So, a few things:

1). Code samples aren’t the HSA SDK. That’s expected in Q2, AFAIK. You can’t *really* start serious development until that SDK is out. I know, because multiple benchmark authors who have OpenCL tests already finished have told me that SDK availability is when they can seriously start testing HSA code.

We’ll see benchmarks not long after the HSA SDK ships, but it’s going to take longer for mainstream software. I hope to see it by end-2014 or early 2015.

2). Heterogeneous computing just means OpenCL. MediaTek adopting its own HSA implementation would be a much bigger deal. To the best of my knowledge, no one has announced HSA hardware plans yet.

3) Java bring-up for HSA compatibility is, again, something I’m looking for in the back half of the year at best.

john

Well, yes, all of these are happening this year, not tomorrow.

Well, libraries are a start! Don’t be so negative :P. True, complete and polished SDKs are preferable, but you can’t say there is no support at all; there is some. This tech launched two weeks ago; cut them some slack, man! I would be happy if in H2 we have:

These four points would be sufficient for me to call 2014 a success for AMD, software-wise.

Joel Hruska

The OCL 2.0 driver isn’t coming until 2015. Mantle doesn’t use HSA.

We should get #1 and #4.

Ken Yap

The Steamroller/Bulldozer family of CPUs was designed with HSA/APUs in mind; hence the lackluster single-thread and FP performance, as those were meant to be covered by the IGP side. A good APU is one where the CPU and GPU cover each other’s weak points. Backtracking to a CPU designed for pure x86 performance might see it lose relevance when HSA is widely adopted.

Joel Hruska

There’s nothing AMD had to give up to get HSA. It’s just a bad core.

Ken Yap

A bad core for a CPU, but probably a good one for an APU running HSA applications.

Joel Hruska

No. Just a bad core.

There’s nothing in Steamroller that makes HSA intrinsically work better. There’s no specialized decode or fancy dispatch. If there was, HSA couldn’t be a standard anyone could implement.

http://www.hikingmike.com/ hikingmike

“There’s nothing AMD had to give up to get HSA. It’s just a bad core.”

How about die space, to the GPU? They gave up die space for the GPU, and the GPU is the driver of pushing HSA. Of course the GPU does graphics too, but the idea is if the GPU wasn’t on there, they’d have more die space for CPU. They might not have had nearly as much push to make APUs in the first place if they didn’t plan to use the GPU portion via something like HSA (hmm why’d they buy ATI). Now if you try to bring Intel into it again (“hey they have GPU too”), I can just bring up the huge fab difference and process node advantage. If you can control for those differences, then you might have an argument.

Oh yeah, they also gave up SOI (silicon on insulator) which was better for higher CPU clockspeeds, in favor of the bulk process which is better for GPU. I think you wrote about that, or at least I read it somewhere.

actionjksn

I have never heard anything about a CPU having to suck in order for HSA to work. Where did you get this idea? After reading about it on AMD’s site, the way they describe it, what’s different (besides software) is the way the CPU, GPU, and RAM are connected together, not the specific design of the CPU itself. It’s also supposed to work on other processors like ARM, so I see nothing indicating that the CPU must be designed like Bulldozer/Steamroller for HSA to work properly.

Joel never said anything in his article about ditching the integrated GPU they’re using. He was talking about developing and scaling up the Jaguar architecture and incorporating the same GPU as they are using alongside Steamroller. Just because the CPU has all the weak points does not mean it needs those weak points for HSA to work. No matter what you say, having an efficient CPU is good, even if you also have HSA.

Ken Yap

Whoever said HSA works only with a lousy CPU? All I am saying is that what you guys think Steamroller/Bulldozer is weak at is actually the strong part of what the GPU can do, and is made less relevant in HSA computing. Dumping Steamroller and improving its weak points doesn’t seem to be much of an improvement toward HSA; hence the backward thinking, if AMD has to drop its commitment to HSA just to improve x86 computing.

I have no idea how you came to the idea that I said a weak CPU is good for HSA. The Bulldozer family's multi-threaded performance has been very good; it is also the strong point of the design that complements an APU for HSA. No matter how efficient Jaguar is, it would definitely not be the best fit versus a design that is made to excel at higher power draws.

Joel Hruska

“Whoever said HSA works only with a lousy CPU? All I'm saying is that what you guys think Steamroller/Bulldozer is weak at is actually what the GPU is strong at, and it's made less relevant by HSA computing.”

The flaw in this argument is demonstrated by Intel’s continued improvements to FLOPS and integer performance in x86 cores without needing to concentrate on integrating an entirely new chunk of hardware + writing an all-new software model.

Bulldozer’s multi-threaded performance takes a 20% performance hit due to the shared threading model. Steamroller reduces this to a 10% penalty. What that means is that, given an application that scales at 3.9x with a conventional quad-core CPU, Bulldozer will scale at ~3.1x on a quad-core. Steamroller will scale at 3.5x. Jaguar and Intel chips will scale at 3.9x.
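A quick sketch of that module-penalty arithmetic (the 3.9x ideal scaling and the 20%/10% penalties are the figures from this comment, applied as a flat multiplier purely for illustration):

```python
# Hypothetical sketch: multi-threaded scaling after the shared-module
# penalty described above. The penalty is applied as a flat multiplier
# to the ideal quad-core scaling figure (illustrative numbers only).

def effective_scaling(ideal_scaling: float, module_penalty: float) -> float:
    """Multi-threaded scaling once the shared-module penalty is applied."""
    return ideal_scaling * (1.0 - module_penalty)

IDEAL = 3.9  # scaling of a conventional quad-core (Jaguar, Intel)

print(round(effective_scaling(IDEAL, 0.20), 1))  # Bulldozer:   ~3.1x
print(round(effective_scaling(IDEAL, 0.10), 1))  # Steamroller: ~3.5x
```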

The original point of Bulldozer was that AMD would be able to hit high clock speeds, high single-threaded performance, *and* high core counts. If, for example, AMD had delivered the ~10% improvement in efficiency over K10 and hit the 4.5GHz clock speeds it initially forecast, then a Bulldozer FX-8150 would've been quite competitive with Sandy Bridge in a smaller die footprint than we would've seen with an eight-core K10 design.

Regardless, I have not suggested that AMD drop HSA, only that AMD embrace a different chip architecture that’s better-tuned to play to their own strong points. There is no reason AMD cannot build an HSA-capable Jaguar-derived chip, and I’m certain we’ll see such a CPU in the future for mobile parts.

Ken Yap

The only problem I see is that HSA isn't ready or here to challenge the dominance of Intel in x86 computing. http://www.extremetech.com/wp-content/uploads/2014/01/HSA-LibreOffice.png
The potential of HSA isn't something a generic CPU could offer; being obstinate about x86 cores is not the way into the future. And I'm definitely not against improvements to the CPU core in an APU, but maybe not so much if the direction is to backtrack so as to compete with Intel on their ground instead of looking forward to building the chip for "better" future computing. Even Intel is already investing more and more die space in their IGP, and the CPU side of performance has stagnated for a while; more like Intel has taken a page from AMD and dumped "CPU improvements".

I believe HSA already exists in smaller cores like Beema and Mullins, so there is no doubt about HSA in mobile parts, but I'm really not sure if replacing Steamroller with smaller cores is good for desktop computing.

Joel Hruska

Actually we don’t know if Beema and Mullins are HSA-capable yet.

http://www.hikingmike.com/ hikingmike

“The flaw in this argument is demonstrated by Intel’s continued improvements to FLOPS and integer performance in x86 cores without needing to concentrate on integrating an entirely new chunk of hardware + writing an all-new software model.”

Your logic is killing me. I don't see any reasons in your comment that are valid for your argument. This might work to help demonstrate that Intel's integer and floating-point units are better than AMD's, or that AMD hasn't improved its integer and floating-point units much... but it's not an argument of Steamroller vs. Kabini, or of a module with 2 integer + 1 floating-point units vs. cores with 1 of each. That, after all, is the main concept behind Bulldozer/Steamroller.

And geez, there are more differences between AMD and Intel CPUs than just the module/core structure. Therefore you can't bring in Intel as an example of why they shouldn't do the 2 integer + 1 floating-point unit design... because Intel increases FLOPS??? The flaw you found in @ken_yap:disqus's statement doesn't pass muster. His argument still stands.

“Needing to concentrate on….” – You know, even if they had the same performance as Intel in the x86 cores (and didn’t “need” to do anything), they *might* still want to push HSA. There is a reason and it’s huge performance gains! Competitive advantage with strong GPU! They have been planning to do that since buying ATI.

Of course I agree with you about HSA not being specific to Steamroller. They could put HSA workflow into any APU. Ken never said anything to disagree. He said using HSA plays to their APUs’ strong points and that’s a good reason to keep the Bulldozer/Steamroller arch. The x86 cores have less floating point power relative to integer power, so work toward offloading to something that does floating point way better. Why the hell not? In the future if it works better almost all the time, offload it all, and beef integer power more on the x86 cores. This is specialization at work.

Now if HSA isn't ready yet, and you think that's a reason against the 2 integer + 1 floating-point unit module structure, then that's an argument you can make. But you never countered that in Ken's original comment; you just went on about how HSA isn't specific to Steamroller (which everybody already agrees with).

Now where is the argument against Steamroller?
In your article:
-“HSA might one day help address the problem, but it’ll be years before HSA-compatible software is readily available.”
-“It’s a poor fit for modern foundries”
-Core size
These are valid arguments. You could also argue pipeline length (should be shorter) vs. clock speed (like the Pentium 4 issue).

Joel Hruska

Disqus appears to have eaten my response to this.

My argument, in aggregate, is that the bets AMD made with Steamroller no longer make sense, for reasons that have little to do with HSA. HSA's presence or absence is orthogonal to the question of whether or not Steamroller should be seen as the long-term future of AMD's CPU designs.

Here’s the real difference, though: Other people continue clinging to the idea that Steamroller is the result of a master plan. In reality, the timeline looks more like this:

2009: AMD promises a new Bulldozer core that will beat K10 on IPC and frequency with a shared modular design that will help AMD compete with Intel on die size.

Late 2010 / Early 2011: Realizing that Bulldozer will miss on frequency and IPC, AMD commissions Piledriver, a short-term follow-up that will ship within nine months of Bulldozer's debut. Early Bulldozer yields are strong, but the chip's performance is abysmal. AMD spends months respinning the core. GF fabs, newly converted and ready to ramp 32nm for BD, sit idle. Low 32nm yields for Llano allow AMD to force GF into paying only for good die.

Late 2011: BD launches, badly.

Late 2011: Krishna, Wichita canceled due to fabrication problems at GF. AMD agrees to pay more than $700M in penalty to GF in exchange for permission to manufacture Jaguar/Kabini at TSMC.

Q2 2013: AMD launches Kabini. Suspends all work on 20nm projects to focus on getting consoles out the door. Rory Read later confirms in conference call that AMD will tape out 20nm designs in Q1/Q2 of 2014.

That’s a graph of the A10-7850K versus the A10-6800K at 1.2 – 4.8GHz. I’m still working to determine the low-end minimum voltages, so you can ignore that section if you want. Just look at the high-end parts. The stable red line is because I was able to skip a voltage adjustment for Richland at the upper end, so platform TDP didn’t increase much. That’s full-load power consumption at each clock speed with voltage scale-down for each lower clock.

Steamroller is not the polished design of a happy company. It’s a flawed architecture that never hit its goals. Bulldozer was supposed to at least have Piledriver’s frequency. Piledriver was supposed to be the IPC-boosting core. Steamroller was six months delayed as AMD struggled to tape it out. Krishna and Wichita had to be dumped at GF and rebuilt at TSMC.

At no point did AMD get the "big" core they actually wanted or predicted, and the scaling issues with having to yank the core speed southwards are only going to make that worse.

http://www.hikingmike.com/ hikingmike

Ok, you listed a timeline that I don't disagree with. I also don't disagree that they're not hitting their goals. For sure. I don't know how that refutes a master plan, though. Also, a plan and its resulting timeline being different doesn't mean the plan is bad; it means they anticipated the conditions and the future badly... and yeah, that's part of what the plan was based on. It's a little semantic, but I hope you get what I'm saying. I was of the understanding that Bulldozer, Piledriver, Steamroller, and Excavator were (very roughly) planned when they started on this journey with Fusion. Here is a "The Future Is Fusion" image from 2010 that mentions heterogeneous computing: http://www.xbitlabs.com/articles/other/display/news-overview-2010_10.html . I see they had those names, and 1st-4th generation modular cores, at least as early as 2012. So they had this plan, even if it started out rather badly... and maybe the original planners aren't in charge anymore.

The bad execution and circumstances don't entirely discount the plan. The plan didn't cause them to miss IPC and frequency targets, have fab problems, or have consoles delay them. And I still disagree that HSA isn't related. The ability to use the GPU for some tasks is the whole reason for having only one FPU per "dual core" module.

“So the idea that yes, Bulldozer would have a smaller FPU and rigorous GPU offload capability was valid, but that entire roadmap kicked off about two years behind schedule.”

Ok, so you agree somewhat, but believe the fact that it’s behind schedule means the plan is no good anymore?

Thanks for your responses Joel, and I appreciate your discussion!

Joel Hruska

“Ok, so you agree somewhat, but believe the fact that it’s behind schedule means the plan is no good anymore?”

This depends on what you mean by “Plan.” In AMD’s original vision, the plan was for Bulldozer to hit Piledriver clock speeds and Kaveri IPC (at least) simultaneously in 2011. Instead AMD got there in early 2014. We can agree that hitting those targets more than two years late had a dramatic impact on Bulldozer product sales and AMD’s market share.

So yes: That plan didn’t work. And I think AMD needs a new plan because BD-derived architectures aren’t going to meet the company’s future needs. Excavator is still going to happen, because roadmaps take time and CPU architectures take time and AMD has made various commitments to server and consumer partners.

Now: If by “plan” you mean “The plan to move towards heterogeneous computing,” then I have no problem with that plan. One of the reasons I think Kabini is a better fit than Steamroller is because I think Kabini could be ramped to match Steamroller’s IPC *in a smaller die size.* That means AMD still has the option of adding more cores *or* using that die space for the GPU, or doing both simultaneously.

AMD isn't going to stop building CPU cores, so the question isn't "Should AMD build a CPU at all?" The question is: "What CPU will give AMD the most bang for its buck going forward in multiple workloads and contexts?" I think a Kabini-derived chip will answer that question better than a Steamroller-derived chip. And I think the impact of having a better CPU core on an HSA-capable APU is that you have more flexibility and attractiveness to a wider audience, not a weaker overall proposition.

http://www.hikingmike.com/ hikingmike

“This depends on what you mean by “Plan.” In AMD’s original vision, the plan was for Bulldozer to hit Piledriver clock speeds and Kaveri IPC (at least) simultaneously in 2011.”

Ok, I can see the “plan” including all of those details. They are kind of like results to me, but that is what they had in mind. They certainly didn’t meet those goals.

Yes, the second "plan" definition is really what I'm getting at. I think we're understanding each other better now. The driver for the module structure was the ATI merger and the desire to move toward heterogeneous computing. They can't win the game with Intel, so that's one way to change the game. I see the module structure, GPU integration, and shifting of workloads as intrinsic to the Bulldozer/Piledriver/Steamroller/Excavator plan. I think it would be a bad idea if they dumped that idea along with Steamroller. That's how I'm reading the headline of this article, because we're talking about the APUs that are the flag bearer for this concept. Why get rid of that concept when they are only going to get better at utilizing these diverse resources in the future?

Now if you say something similar to: get rid of Steamroller, take Kabini, scale it up as needed somehow, maybe implement a similar 2 integer + 1 FPU module structure for more emphasis on integer performance in the x86 cores, plus a GPU with good heterogeneous capability built into the package, then that's great with me.

I don’t know about the having room for more cores though. Software is finally doing better on multithreading, but they probably don’t need to go beyond the 4-8 cores they’ve already been doing for a while yet. I guess 8 maybe since they only had 8 on the non-APUs, and it did benefit some workloads a lot. But maybe also scaling up another way would be more helpful, and of course continuing to beef the GPU until they take over more of the mainstream discrete GPU market.

Joel Hruska

Exactly. I’m not claiming AMD should abandon HSA or its work with that standard — not at all. The question is whether BD-derived architectures are the best way to carry the CPU side of that equation forward.

That’s why I referenced Steamroller (CPU codename) and not Kaveri (APU codename). The GPU-side is great. The HSA implementation is good. So we keep what works. :)

As for more cores, I agree with you; more than eight isn't going to be very helpful for most products. On the other hand, one Kaveri core is about 3x the size of one Jaguar core. Even if we assume that a "fat Jaguar" core would be significantly larger than the present one, you could still put about eight Jaguar cores in the same space as four Steamroller cores.

In a test like Cinebench, eight Jaguar cores at 2.5GHz would significantly outperform four Steamroller cores at 3.7GHz. That doesn't mean AMD should go build eight-core Kabinis, but there are some real options on that front.
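The die-budget arithmetic behind that estimate can be sketched with normalized areas (the 3:1 Steamroller-to-Jaguar core-size ratio is from the comment above; the 50% growth factor for a "fat Jaguar" is an assumed placeholder):

```python
# Rough die-budget sketch for the core-count claim above.
# Normalized areas: one Steamroller core = 3 units, one Jaguar core = 1.
# The 1.5x "fat Jaguar" growth factor is a hypothetical assumption.

STEAMROLLER_CORE = 3.0
JAGUAR_CORE = 1.0
FAT_JAGUAR_CORE = JAGUAR_CORE * 1.5  # assumed growth for higher clocks

budget = 4 * STEAMROLLER_CORE             # area of four Steamroller cores
fat_jaguar_count = budget // FAT_JAGUAR_CORE

print(int(fat_jaguar_count))  # eight "fat Jaguar" cores fit in that area
```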

MrMilli

This is a seriously flawed article. You're doing 'CPU Frequency * Core Count' over different architectures! You just can't do that. Obviously a core optimized for low frequencies like Kabini will look much better when you normalize the frequencies. But there are many more issues to take into account.
- Kabini just can't run higher than 2.2-2.3GHz. The pipeline doesn't allow stable operation at high frequencies.
- You are counting a module as two cores. We all know that's not fully true. It's not really a single core, but not dual either; AMD's marketing just likes to call it two cores. Other manufacturers don't do that, like the SPARC T5: each core can handle two integer threads at the same time, but Oracle still counts it as one core. If AMD had called a module a core from the start, your scores would have looked very different.
- Are you dividing by the nominal or turbo frequency?
- How many of your benchmarks are FPU-dependent? Let's not forget that each BD module has only one FP unit.
- You compare the sq mm of a chip that needs to run at 4GHz to a chip that is designed to run at 2GHz. Of course it's going to be much bigger. A core optimized for high frequencies can't be that densely packed. There's a reason why Jaguar is that small.
- Since when does performance scale linearly with frequency?
- ..... and much, much more.
I feel you seriously overshot your basic knowledge of CPUs here.

Joel Hruska

I picked CPU frequency * core count because it maps easily and well. It also shows basic efficiency per core. And Steamroller's two cores are much closer to a dual design than Piledriver's.

This is, to be sure, a rough estimate of CPU efficiency. It does not include performance per watt.

Performance will scale at a steady multiplier of frequency until the chip encounters a bottleneck. In a 3D rendering test, for example, an additional 10% frequency will earn you 8.5-9.5% additional performance. The slope of this line will be steady until another bottleneck is reached. If +10% frequency = +9% performance, then +20% frequency will be +18% performance.
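That steady slope can be written as a simple linear model (the 0.9 multiplier is the illustrative figure from the comment and only holds below the next bottleneck):

```python
# Sketch of the frequency-to-performance slope described above: each
# extra 10% of frequency yields roughly 9% extra performance until a
# bottleneck is reached. The 0.9 slope is illustrative, not measured.

def perf_gain(freq_gain: float, slope: float = 0.9) -> float:
    """Fractional performance gain for a fractional frequency gain."""
    return freq_gain * slope

print(round(perf_gain(0.10), 3))  # +10% frequency -> +9% performance
print(round(perf_gain(0.20), 3))  # +20% frequency -> +18% performance
```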

actionjksn

Ever since Bulldozer came out, I have been saying it is AMD's Pentium 4. I think it sucks and they need to choose a different course. If designing a CPU that requires a really high clock speed to achieve high performance worked, then we would have seen the Pentium 4 achieve high performance. Intel couldn't do it and neither can AMD. I agree with your article.

Harrison Ford

A module is by definition two cores, else my 486SX had no cores. A module is two x86/AMD64 cores that share an FPU. The FPU is not a conditional necessity of a core.

MWisBest

I find it interesting that you point out that the Pentium 4 and Pentium M were designed for two different purposes, and that each design would sort of reference the other, e.g., if the Pentium M could benefit from something the Pentium 4 did, it used it or something similar to it, and vice versa.

May I point out to you that the Pentium 4 was produced for nearly 8 years? The Pentium M was introduced about 2 and a half years after the Pentium 4 was released, and stopped production at the same time as the Pentium 4 as well.

Bulldozer: Introduced Late 2011.
Jaguar: Introduced Mid 2013.
That’s about a 1 and a half to 2 year long gap, not much different than the P4 and PM.

Now, you’re saying AMD should drop Bulldozer altogether, right now, yet are comparing it to the P4 and PM. If you want to compare it to the P4 and PM, you should give Bulldozer another 3 to 4 years. That’ll let AMD refine both Jaguar and Bulldozer to eventually “breed” them to come up with their next architecture, which proved successful for Intel.

If you want to try and compare Bulldozer and Jaguar to the K10, please get out. If you want to credit AVX for any performance improvement over K10, get. out. I've used both K10 (in a Llano APU, so probably a little more refined versus other K10 chips) and Piledriver (in a Trinity APU). In any sort of test or real-world usage I threw at the Piledriver chip, it was consistently better than the K10 chip, and the majority of that was WITHOUT AVX. They had the same number of cores (4) and the same CPU and GPU frequency, and also used the same RAM and same HDD, resulting in a completely fair test. I have no idea whether or not you had a controlled testing environment like that to determine any comparisons between the architectures; you share few details about it.

I also see you mentioned “Quad-Channel DDR4.” This puts a big credibility hit on you, as DDR4 got rid of the multiple DIMMs per-channel approach, something I’ve known for years.

Joel Hruska

The Willamette core launched in November 2000.
Northwood launched in January 2002.
The Pentium M launched in 2003.
Prescott launched in early 2004.
Dothan launched later in 2004.
By early 2005, Intel declared it had canceled Tejas; the Pentium M, not the P4, would drive the future of Intel products.
The first Core 2 Duo desktop parts debuted in mid-2006.

Compare this to AMD:

The first Brazos parts launched in early 2011. Bulldozer launched late in the year.
Trinity launched in 2012.
Richland and Kabini both launched in early 2013.
Steamroller launched in early 2014, 2.5 years after Bulldozer.

Any change AMD makes to adopt a high powered Jaguar now will take at least 18-24 months to execute. We’ll be deep into 2015 before AMD could bring out a different chip and it makes more sense to aim the design at 2016 and 14nm-XM.

Thus, the soonest I’d expect to see AMD shifting to a high-end, Jaguar-derived architecture would be five years after the debut of Bulldozer, or just one year shy of Intel’s own six-year record.

RoboJ1M

I remember the article that put a Dothan in that special Asus (?) motherboard and it destroyed every CPU on the planet. Shortly after that, Intel told the world they were dropping NetBurst and using the Pentium M as the basis for everything, and then along came, er, I forget the name of the core. It was the Core Solo and the like.

Joel Hruska

I did an article on using Dothan with a DFI motherboard back in the day. It didn’t destroy everything else, even overclocked, but it was competitive with the Athlon 64 in a way that the P4 wasn’t. Once Intel fixed the low speed in media encoding, that chip surged like a rocket.

RoboJ1M

Indeed.
I've googled it; it was a magic adapter (not motherboard) by Asus called the CT479 that allowed you to pair a Dothan with a full-fat desktop motherboard and chipset.
Overclocked to 2.6-2.8GHz, it matched the performance of the £1,000 Pentium EE chips and the FXs.
Shame I can't remember the article, but there are loads of similar ones about.
I never did get a Conroe, or anything Intel after that, actually.
For some reason, after they dropped NetBurst I always ended up with AMD parts; my last Intel was a Prescott.
Probably just because they were cheaper, really; it was long after the good-enough plateau was reached and it made more sense to invest in SSDs.
The only heavy lifting left for a CPU now is x264 for me, and fancy VT tech for running multiple virtual servers at home.
However, with console gaming flatlining (woo, more CoD… ¬¬), I think I'll be building a Steam box to replace my console.

I’ll be starting with a Kaveri-on-ITX box.

Joel Hruska

Ahhhh. Getting a Dothan up to 2.6GHz was beyond the board I had, which did not allow for voltage adjustments and was limited to a custom (tiny) heatsink. I still got up to maybe 2.2GHz, I think. That was close enough to be competitive with the Athlon 64, which, if memory serves, topped out at 2.4GHz back then.

Joel Hruska

I decided to put together some Steamroller versus Kabini comparisons in four separate workloads: Cinebench 11.5, POV-RAY 3.7, the x264 encoder, and Maxwell Render 3.0.0. I downclocked the Steamroller core to 1.5GHz and used 4GB of single-channel DDR3.

Cinebench 11.5: Single-Threaded

Kabini: 0.39
Steamroller: 0.39

Cinebench 11.5: Multi-Threaded:

Kabini: 1.49
Steamroller: 1.41

Kabini is about 5% faster here because it has four full cores of scaling to work with, whereas Steamroller has two modules.

Maxwell Render 3.0.0:

Steamroller: 69 minutes, 33 seconds.

Kabini: 94 minutes, 34 seconds.

Steamroller is about 36% faster in this test.

POV-RAY 3.7:

Steamroller: 3,118 seconds.

Kabini: 3,828 seconds.

Steamroller is about 23% faster in this test. Both of these applications likely rely on L2 cache speed, where Jaguar is crippled by design to save on power.

x264 Encoder:

Steamroller is 21% faster in the first pass; Kabini is slightly faster in the second.

Conclusion: The size of the gap will depend significantly on how much the data sets rely on L2. Steamroller’s L2 bandwidth is much higher than Kabini’s (I couldn’t change that for test purposes). I’m also comparing a mobile A4-5000 in a laptop against a full desktop chip, which means the A4-5000 in question isn’t sitting at full Turbo speed all of the time.

I picked one test that I knew would show flat scaling between the two cores (Cinebench), one test I knew would favor Steamroller (Maxwell Render), and two tests I didn't know the results of (POV-Ray, x264). Given that Jaguar is explicitly designed for mobile, there's no reason to think that the chip's performance cannot be further improved. Dothan suffered from similar deficits against the P4 when it launched in a handful of desktop motherboards.
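As a sanity check, the speedups quoted above can be recomputed directly from the run times (these are time-based workloads, so the speedup is the ratio of the slower time to the faster one):

```python
# Recompute the relative speedups from the run times quoted above.
# Lower time is better, so "X% faster" = (slow / fast - 1) * 100.

def pct_faster(fast_seconds: float, slow_seconds: float) -> float:
    """How much faster the quicker result is, as a percentage."""
    return (slow_seconds / fast_seconds - 1.0) * 100.0

maxwell = pct_faster(69 * 60 + 33, 94 * 60 + 34)  # Maxwell Render 3.0.0
povray = pct_faster(3118, 3828)                   # POV-RAY 3.7

print(round(maxwell))  # Steamroller ~36% faster
print(round(povray))   # Steamroller ~23% faster
```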

Tralalak Aviatik

Good job, Mr. Hruska.
I delivered VIA Nano X2 results for a diit.cz review two years ago: they ("we") tested two cores at the same clock of 1.6GHz:

The Nano was a great little core. I had sincerely hoped it would be the start of something big for VIA.

Tralalak Aviatik

I hope the true refresh of the VIA Isaiah microarchitecture (a native, monolithic single-die VIA QuadCore CN-R with L3 cache, SIMD up to AVX2, and 28nm lithography) will also be a great little core. The next chip has some tools to do computationally intensive things where the HW provides a big advantage, but I don't want to say yet what they are.

Joel Hruska

They are doing a refresh? I haven’t heard from anyone on VIA’s CPU division for years. I thought the chip was dead.

Cinebench is actually an oddity — Kaveri does usually have a performance advantage over Kabini. My tests show it at 15-20% generally.

kiko

Hey Joel, I have a noob doubt (and a bit off-topic) and I hope you can answer it:

I'm aware that the node affects the frequencies you can hit, but is it more important than that?

Does it affect performance? With this I'm asking whether a hypothetical Kaveri on a node more suitable for a CPU, running at 3.7GHz, would outperform the A10-7850K just by the node, or whether it would outperform it by being able to run faster at stock (like Richland passing 4GHz).

Or does it affect scaling? With this I'm asking whether the performance of the chip degrades faster as the frequencies increase. I would really like to test Kaveri vs Richland at sub-3GHz speeds to see it myself, but I don't have those chips xD.

By the way, very good article.

Joel Hruska

Sorry Kiko, only just now saw this. Hopefully you’ll get a ping from Disqus.

So, "node" impacts a number of things. Generally speaking, a CPU at a new node will use less power at the same clock speed; a 32nm chip will draw less power than a 45nm version of the same chip.

Scaling, however, can be very different. I explored this back when Ivy Bridge was new.

Look at the graph. Note that you can overclock Nehalem by 53% and yet its power consumption increases by just 1.5x. Ivy Bridge, in contrast, is approaching a 2x power consumption envelope by the time it reaches a 40% frequency boost.

In AMD’s case, it had two process nodes it could choose from — GlobalFoundries 32nm SOI process and GlobalFoundries 28nm bulk silicon process. The 28nm process was tuned for lower power consumption but trades off frequency to get there — a chip built on that process will use less power at certain frequencies, but will scale poorly by comparison.

If GF had built AMD’s parts on its 28nm FD-SOI process, that might not have been the case, but AMD chose to stop paying for FD-SOI research rather than shift to that product line.

TLDR: New nodes matter for power consumption, but node focus matters, too. A low-power 20nm node is a very poor fit for a high-power product.
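The node trade-off described above can be illustrated with the textbook dynamic-power approximation (power scales roughly with frequency times voltage squared); the multipliers below are hypothetical, chosen only to echo the Nehalem and Ivy Bridge overclocking figures mentioned earlier:

```python
# Illustrative model of why overclocking costs differ between nodes.
# Dynamic CPU power scales roughly as P ~ f * V^2, so a node that needs
# large voltage bumps to gain frequency pays a steep power penalty.
# All multipliers below are hypothetical examples.

def relative_power(freq_mult: float, volt_mult: float) -> float:
    """Power relative to stock, using the P ~ f * V^2 approximation."""
    return freq_mult * volt_mult ** 2

# Frequency-tuned process: +50% clock with no voltage bump needed.
print(round(relative_power(1.50, 1.00), 2))  # ~1.5x power

# Low-power-tuned process: +40% clock needs a ~20% voltage bump.
print(round(relative_power(1.40, 1.20), 2))  # ~2x power
```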

That’s not a n00b question. That’s actually a great question. Unfortunately, the answer seems to be “Nobody knows.”

Here’s the deal: The consortium behind FD-SOI continues to promote it aggressively with slides like the ones you’ve shown. I’ve also heard that it’s possible to combine FinFET and FD-SOI.

But no one appears to be using FD-SOI for next-gen process nodes, with the possible exception of IBM and ST Micro, which paid GF to build some 28nm FD-SOI.

The big question is, why not? I can only assume it’s because customers and foundries both collectively decided FinFET was the better tech for now. That doesn’t mean FD-SOI won’t happen — there have been times when the industry eventually adopted both complementary technologies over a period of time.

http://www.hikingmike.com/ hikingmike

Is there supposed to be a Kabini in the bar graph in the article? The graph titled “Kabini vs. Kaveri vs. Richland”. Ah I understand, but you might as well put Kabini in there with 100% so you can compare visually by bar graphs.

Mombasa69

I've been using an FX-8350 since release, combined with two Sapphire R9 280X Toxics on an Asus Formula Z motherboard. It's a great setup; even rubbish clunky old engines like Funcom's Dreamworld Engine in The Secret World run pretty sweet maxed out in DX11, and that only uses 1-2 cores.

The next-gen games are all optimized for 8-core CPUs, and with AMD Mantle coming out fairly soon, the CPU will become even more redundant. This is just another sad Intel love fanboi article.

Joel Hruska

1). AMD’s Mantle is already out. Covered it last week.
2). If you think that recommending AMD switch to a better CPU architecture constitutes “Intel love,” I suggest you read the article again.

The Piledriver family is moribund. AMD has published no roadmaps or plans for a consumer follow-up to the FX-8350.

Harrison Ford

No roadmaps, true; no plans? Not true. There was no point in making a Steamroller FX, since that core only improved TDP somewhat, and it's doubtful an FX Steamroller would have been worth making, even at 28nm.

AMD had no incentive to provide an update to Piledriver; it is a damn fine CPU. Look at the competition: Haswell is a flop, way too hot and nearly impossible to overclock.

Even today the best CPU for most people is the AMD FX-8350. It provides indisputably the best price/performance ratio, along with being unlocked, support for more RAM than any Intel i-series, and eight cores providing unparalleled multi-threading.

It is probably the best AMD CPU ever made.

Joel Hruska

There are multiple ways to measure the “best” core AMD ever made.

If we measure based on increased revenue, then the K8 is hands-down the best core AMD ever built. That’s the core that took them to 20% server market share and above 20% in desktop.

If we measure based on a surprise “Come from behind” victory then K7 wins — though platform problems really hurt that chip.

The PD-derived parts have single-threaded efficiency lower than the old K8 in some tests. A 32nm K10 with AVX support and eight full CPU cores would’ve outperformed any modern Bulldozer.

Harrison Ford

I don’t care much what AMD calls the CPU, just that they continue with the FX line – i.e. high powered performance CPUs. Kaveri *is* Steamroller. The Steamroller moniker is just a name for a core tech.

If the FX line needs to be an APU, then fine. But telling AMD to "dump" Steamroller and go with Kaveri, when Kaveri is Steamroller-based, makes no sense.

Joel Hruska

Kabini, not Kaveri.

Harrison Ford

As I say, I don’t care much what AMD calls the CPU, Kabini, Kaveri, Katana, just as long as they continue the amazing FX line for enthusiasts.

Joel Hruska

….I don’t think you understand.

Kabini and Kaveri are two completely different CPU cores. What I said was “AMD needs to get rid of this CPU design that isn’t working well, and substitute in its really good CPU architecture.”

What you’re saying is: “I don’t care what they call the CPU.”

I don’t care what they call it for a codename, but the codename refers to a specific implementation of technology. What you’re arguing is essentially that the distinction between a Lamborghini and a John Deere is meaningless because “you don’t care what they call it.”

Harrison Ford

…I know you don’t understand. I don’t care what they or you call it. I know the codename refers to a specific implementation of technology.

I don’t care whether they call it John Deere or Lamburghini, just as long as they make enthusiast CPUs.

Amazing that it took you 3 months to come to this conclusion. You deserve an award.

Joel Hruska

I happened to be back in the story and thought I’d check the thread.

The 8-core FX family is dead in its current form. AMD will continue to build the FX-8350 and the AM3+ platform but has no plans for future releases.

They have retained the FX brand name for quad-core mobile processors and may revive it again in 2016 when they reveal their next-generation architecture.

Harrison Ford

Great job. Thanks for not understanding what I wrote and then re-iterating it as if you were completely oblivious to anything stated above.

Your time has been well spent. You’ve told me nothing I didn’t already know.

Joel Hruska

“If the FX line needs to be an APU, then fine. But telling AMD to “dump” Steamroller and go with Kaveri, when Kaveri is Steamroller based makes no sense.”

I didn’t tell AMD to do this. I told AMD to dump Kaveri for Kabini. You were factually wrong. When called out on being wrong, you retort with: “I don’t care much what AMD calls the CPU, Kabini, Kaveri, Katana”

Except you criticized my title based on your own misperception of what it meant, then tried to cover your own mistake by claiming WELL IT DOESN’T MATTER ANYWAY.

Now, when I politely attempt to explain the difference, you still can’t admit to a simple error… three months later. Instead of admitting that simple ignorance caused you to make a mistake, you resort to snide comments.

Seriously, dude. Grow up.

Harrison Ford

You didn’t tell AMD to do anything, first of all. Easy on the self-importance.

Second, yeah I called Kabini Kaveri or whatever. I just don’t care about these names. I literally do not care.

There’s nothing “factually” wrong, I just can’t be bothered to learn those stupid names like some autistic nerd.

Because, it doesn’t matter at all what they’re called, just what they are. You think you’re making some grand point or explaining something, but you are *not*. Don’t you get that?

There was nothing polite about your “explanation”, just a whole lot of patronizing and an unnecessary addition to a long dead discussion.

Graeme Willy

I don’t know, I’m completely happy with my FX-8150 and HD 6970 GPU. I got that 8150 back in 2011 and it still serves me beautifully. I managed to overclock it from 3.6GHz to 4.8GHz. That was an incredible achievement… though I recently backed it down to 4.72GHz because, for some reason, the voltage leap from 4.7 to 4.8 was astronomical and led to too much unnecessary heat and noise (given the higher fan speed for my water pump).

For the dollar, it doesn’t get much better than an AMD. Why do you think you see AMDs in consoles and not Intels? After all, I don’t have $300+ to dump into an Intel after I’ve already spent $400 for a graphics card. I’d buy their APU in a heartbeat once they start releasing the big-boy GPUs in APU form. I’d love to be rid of large systems and the large, hot brick of a graphics card…

Russell Barlow

I have a friend still running a Phenom II with a Thuban core, and it still rips newfangled six-core FX processors, even competing with FX-8200 series chips.

I feel like this guy always posts Intel-bigoted articles. You can get a $1000 Intel CPU, that is for sure. But that is for the ‘how do I computer’ people.

I upgraded for half the money using an AMD, properly configured my DDR3 RAM, overclocked while using Turbo Core to maintain idling/max efficiency per core usage, and have a decent entry-level 750 Ti that will run anything for the next 3-5 years or so, even with its great power efficiency. That is what an ‘advanced’ user wants. An overall system of $500 with proper settings will perform the same as the $1000 Intel CPU for the people who “don’t know how to use a computer”.

I will never throw out money to go out of my way and buy an Intel CPU, which requires a unique socket (new motherboard) and expensive components, when AMD is so modular and still pushes the envelope. You can feel secure knowing there’s an option to upgrade because of the whole AM2+/AM3/AM3+ progression. They rarely change socket type, and when they do, it’s usually not even worth upgrading for at least 10 years at that point. AMD is for anyone with a brain who knows how to set up the overall computer. If you think spending over $600 more is worth it, only to be fragged by me at the same FPS in any game, and yet perhaps gain a few hundred points on a useless benchmarking test, or perhaps run a really unoptimized, horribly coded program/game on like one core ‘only slightly’ better (not even visually noticeable), then you’re clearly just a delusional Intel fanboy.

I like AMD more, even having both CPUs, because of what I’ve just stated. The price and efficiency tie in overall, and it’s more than enough to run anything. Plus you can overclock an AMD to be way faster than an Intel and essentially ‘get $300 or more for free’. That is a good saving to me when the top hardware of anything (CPU/GPU) ranges from about $300-1000 anyway. I wouldn’t waste money on what I don’t need, and Intel is a clear ‘hype machine’.

All AMD really has to do is improve Turbo Core further so that it does what Intel does and actually shuts down or significantly reduces the cores not in use, and throttles the core that is in use even more (and does so more efficiently); then Intel will lose its ‘single-threaded, better performance per core’ type of hype based on a simple logic hardware/software tweak. That is really all Intel has going for it. If you run a horribly coded game like Arma 3 that runs on one core, you tended to get an Intel because of higher clock speeds (not anymore, as of like 2009+) and performance per core (not anymore, as of like 2011+).

A lot of those dumb people will still say ‘oh, Intel is priced more, so you get more if you pay more, right?’ Yet almost all of AMD’s negative-review users just throttle at max performance while using like 1% or less of each core and then complain it’s overheating with the stock fan, or that the CPU died. Obviously an Intel won’t overheat as easily, but this is just an example of why paying $500 is just for those Intel fanboys, aka the ‘how do I computer’ type of people. Frankly, it’s dumb to invest too much money into Intel if you know how to configure an AMD and get the same for literally $300 less on average. Intel is just hype. I am a developer and advanced computer user, and AMD is the best to me. This isn’t a marketing ploy; it’s literally ‘look at the facts, people’. Intel is just overpriced and useless.

You can overclock an AMD with some cheap custom cooling that’s neck and neck with an Intel and still achieve better results. It’s a no-brainer. More for less.

Use of this site is governed by our Terms of Use and Privacy Policy. Copyright 1996-2015 Ziff Davis, LLC. PCMag Digital Group. All Rights Reserved. ExtremeTech is a registered trademark of Ziff Davis, LLC. Reproduction in whole or in part in any form or medium without express written permission of Ziff Davis, LLC is prohibited.