Brazos and Llano were both immensely successful parts for AMD. The company sold tons despite not delivering leading x86 performance. The success of these two APUs gave AMD a lot of internal confidence that it was possible to build something that didn't prioritize x86 performance but rather delivered a good balance of CPU and GPU performance.

AMD's commitment to the world was that we'd see annual updates to all of its product lines. Llano debuted last June, and today AMD gives us its successor: Trinity.

At a high level, Trinity combines 2-4 Piledriver x86 cores (1-2 Piledriver modules) with up to 384 VLIW4 Northern Islands generation Radeon cores on a single 32nm SOI die. The result is a 1.303B transistor chip (up from 1.178B in Llano) that measures 246mm^2 (compared to 228mm^2 in Llano).

Trinity Physical Comparison

| CPU | Manufacturing Process | Die Size | Transistor Count |
|-------------------------|------|--------|--------|
| AMD Llano | 32nm | 228mm² | 1.178B |
| AMD Trinity | 32nm | 246mm² | 1.303B |
| Intel Sandy Bridge (4C) | 32nm | 216mm² | 1.16B |
| Intel Ivy Bridge (4C) | 22nm | 160mm² | 1.4B |

Without a change in manufacturing process, AMD is faced with the tough job of increasing performance without ballooning die size. Die size only goes up by about 8%, but both CPU and GPU performance see double-digit increases over Llano. Power consumption is also improved over Llano, making Trinity a win across the board for AMD compared to its predecessor. If you liked Llano, you'll love Trinity.
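The deltas are easy to sanity-check; a quick sketch using the figures from the comparison table above:

```python
# Sanity-checking the quoted deltas, using figures from the table above.
llano   = {'die_mm2': 228, 'transistors_B': 1.178}
trinity = {'die_mm2': 246, 'transistors_B': 1.303}

die_growth  = (trinity['die_mm2'] / llano['die_mm2'] - 1) * 100
xtor_growth = (trinity['transistors_B'] / llano['transistors_B'] - 1) * 100
print(f"die area: +{die_growth:.1f}%, transistors: +{xtor_growth:.1f}%")
# die area: +7.9%, transistors: +10.6%
```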

The problem is what happens when you step outside of AMD's world. Llano had a difficult time competing with Sandy Bridge outside of GPU workloads. AMD's hope with Trinity is that its hardware improvements combined with more available OpenCL accelerated software will improve its standing vs. Ivy Bridge.

Piledriver: Bulldozer Tuned

While Llano featured as many as four 32nm x86 Stars cores, Trinity features up to two Piledriver modules. Given the not-so-great reception of Bulldozer late last year, we were worried about how a Bulldozer derivative would stack up in Trinity. I'm happy to say that Piledriver is a step forward from the CPU cores used in Llano, largely thanks to a lot of cleanup work on the Bulldozer foundation.

Piledriver picks up where Bulldozer left off. The fundamental architecture is unchanged; instead, it is improved in all areas. Piledriver is very much a second pass on the Bulldozer architecture, tidying everything up, capitalizing on low-hanging fruit and significantly improving power efficiency. If you were hoping for an architectural reset with Piledriver, you will be disappointed. AMD is committed to Bulldozer, and that's quite obvious if you look at Piledriver's high level block diagram:

Each Piledriver module is the same 2+1 INT/FP combination that we saw in Bulldozer. You get two integer cores, each with their own schedulers, L1 data caches, and execution units. Between the two is a shared floating point core that can handle instructions from one of two threads at a time. The single FP core shares the data caches of the dual integer cores.
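As a rough illustration, the sharing arrangement can be modeled like this (a toy sketch, not AMD's actual dispatch logic; all names are ours):

```python
# Toy model of a Piledriver module: two integer "cores" with private queues
# that both make progress every cycle, plus one shared FP unit that accepts
# work from only one thread per cycle.

class Module:
    def __init__(self):
        self.int_queues = [[], []]   # one integer queue per core
        self.fp_queue = []           # shared FP queue, tagged by thread

    def issue(self, thread, op):
        if op == 'fp':
            self.fp_queue.append(thread)
        else:
            self.int_queues[thread].append(op)

    def cycle(self):
        done = []
        for t in (0, 1):                   # both integer cores run each cycle
            if self.int_queues[t]:
                done.append((t, self.int_queues[t].pop(0)))
        if self.fp_queue:                  # FP unit serves one thread per cycle
            done.append((self.fp_queue.pop(0), 'fp'))
        return done
```

Issue an integer op and an FP op from each thread, and one cycle retires both integer ops but only one thread's FP op; the other thread's FP work waits its turn.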

Each module appears to the OS as two cores; however, you don't have as many resources as you would from two traditional AMD cores. This table from our Bulldozer review highlights part of the problem when looking at the front end:

Front End Comparison

| | AMD Phenom II | AMD FX | Intel Core i7 |
|---|---|---|---|
| Instruction Decode Width | 3-wide | 4-wide | 4-wide |
| Single Core Peak Decode Rate | 3 instructions | 4 instructions | 4 instructions |
| Dual Core Peak Decode Rate | 6 instructions | 4 instructions | 8 instructions |
| Quad Core Peak Decode Rate | 12 instructions | 8 instructions | 16 instructions |
| Six/Eight Core Peak Decode Rate | 18 instructions (6C) | 16 instructions | 24 instructions (6C) |

It's rare that you get anywhere near peak hardware utilization, so don't be too put off by these deltas, but it is a tradeoff that AMD made throughout Bulldozer. In general, AMD opted for better utilization of fewer resources (partially through increasing some data structures and other elements that feed execution units) vs. simply throwing more transistors at the problem. AMD also opted to reduce the ratio of integer to FP resources within the x86 portion of its architecture, clearly to support a move to the APU world where the GPU can be a provider of a significant amount of FP support. Piledriver doesn't fundamentally change any of these balances. The pipeline depth remains unchanged, as does the focus on pursuing higher frequencies.
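The peak decode figures in the table fall out of simple arithmetic: Phenom II and Core i7 have one decoder per core, while Bulldozer/FX shares a single 4-wide decoder between the two cores of a module. A quick sketch:

```python
# Reproducing the peak decode rates from the table above.
# Phenom II and Core i7: one decoder per core. FX: one 4-wide decoder
# shared by the two cores of each module.

def peak_decode(cores, width, cores_per_unit=1):
    units = -(-cores // cores_per_unit)   # ceiling division: decode units present
    return units * width

for cores in (1, 2, 4, 8):
    print(cores, "cores:",
          peak_decode(cores, 3),          # Phenom II: 3-wide per core
          peak_decode(cores, 4, 2),       # FX: 4-wide per 2-core module
          peak_decode(cores, 4))          # Core i7: 4-wide per core
```

This is why a dual-core (one-module) FX configuration peaks at 4 decoded instructions per clock where a dual-core Phenom II reaches 6 and a dual-core Core i7 reaches 8.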

Fundamental to Piledriver is a significant switch in the type of flip-flops used throughout the design. Flip-flops, or flops as they are commonly called, are simple pieces of logic that store some form of data or state. In a microprocessor they can be found in many places, including the start and end of a pipeline stage. Work is done prior to a flop and committed at the flop or array of flops. The output of these flops becomes the input to the next array of logic. Normally flops are hard edge elements—data is latched at the rising edge of the clock.
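For readers unfamiliar with edge-triggered logic, a hard edge D flip-flop can be modeled in a few lines (an illustrative software sketch, obviously not how the hardware is built):

```python
# A hard edge D flip-flop: the stored value q changes only on a
# rising clock edge (a 0 -> 1 transition); at all other times the
# input is ignored and q holds its state.

class DFlipFlop:
    def __init__(self):
        self.q = 0            # stored state (the flop's output)
        self._prev_clk = 0

    def tick(self, clk, d):
        if clk == 1 and self._prev_clk == 0:   # rising edge: latch d
            self.q = d
        self._prev_clk = clk
        return self.q
```

Data presented while the clock is low or held high has no effect; only the value present at the rising edge is captured, which is exactly why clock jitter around that edge matters.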

In very high frequency designs however, there can be a considerable amount of variability or jitter in the clock. You either have to spend a lot of time ensuring that your design can account for this jitter, or you can incorporate logic that's more tolerant of jitter. The former requires more effort, while the latter burns more power. Bulldozer opted for the latter.

In order to get Bulldozer to market as quickly as possible, after far too many delays, AMD opted to use soft edge flops quite often in the design. Soft edge flops are the opposite of their harder counterparts; they are designed to tolerate data that spills in slightly after the clock edge while still functioning correctly. Piledriver, on the other hand, was the result of a systematic effort to swap in smaller, hard edge flops wherever there was timing margin in the design. The result is a tangible reduction in power consumption. Across the board there's a 10% reduction in dynamic power consumption compared to Bulldozer, and some workloads are apparently even pushing a 20% reduction in active power. Given Piledriver's role in Trinity, a mostly mobile-focused product, this power reduction was well worth the effort.

At the front end, AMD put in additional work to improve IPC. The schedulers are now more aggressive about freeing up tokens. Similar to the soft vs. hard flip flop debate, it's always easier to be conservative when you retire an instruction from a queue. It eases verification as you don't have to be as concerned about conditions where you might accidentally overwrite an instruction too early. With the major effort of getting a brand new architecture off of the ground behind them, Piledriver's engineers could focus on greater refinement in the schedulers. The structures didn't get any bigger; AMD just now makes better use of them.

The execution units are also a bit beefier in Piledriver, but not by much. AMD claims significant improvements in floating point and integer divides, as well as calls and returns. For client workloads, however, these changes translate into minimal (sub-1%) gains.

Prefetching and branch prediction are both significantly improved with Piledriver. Bulldozer did a simple sequential prefetch, while Piledriver can prefetch variable lengths of data and across page boundaries in the L1 (mainly a server workload benefit). In Bulldozer, if prefetched data wasn't used (incorrectly prefetched) it would clog up the cache as it would come in as the most recently accessed data. However if prefetched data isn't immediately used, it's likely it will never be used. Piledriver now immediately tags unused prefetched data as least-recently-used, allowing the cache controller to quickly evict it if the prefetch was incorrect.
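The replacement-policy change can be sketched as follows (a simplified LRU model of the idea, not AMD's actual cache implementation; all names are ours):

```python
from collections import OrderedDict

# Demand fills enter at the most-recently-used (MRU) end as usual.
# Prefetched lines are immediately tagged LRU, so a prefetch that is
# never used is the first line evicted instead of displacing hot data.

class Cache:
    def __init__(self, capacity=4):
        self.capacity = capacity
        self.lines = OrderedDict()           # front = LRU, back = MRU

    def _evict_if_full(self):
        if len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)   # evict the LRU line

    def demand_fill(self, addr):
        self._evict_if_full()
        self.lines[addr] = 'demand'          # inserted at the MRU end

    def prefetch(self, addr):
        self._evict_if_full()
        self.lines[addr] = 'prefetch'
        self.lines.move_to_end(addr, last=False)   # tag as LRU on arrival

    def access(self, addr):
        if addr in self.lines:               # hit: promote to MRU
            self.lines.move_to_end(addr)
            return True
        return False
```

Fill the cache with demand data, prefetch one line, and the very next fill evicts the unused prefetch while the demand-fetched lines survive; under Bulldozer's policy the prefetched line would have entered at the MRU end and pushed out useful data instead.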

Another change is that Piledriver includes a perceptron branch predictor that supplements the primary branch predictor carried over from Bulldozer. The perceptron algorithm is a history based predictor that's better suited to certain classes of branches. It works in parallel with the old predictor and tags branches that it predicts well. If the two predictors disagree on a tagged branch, the perceptron's prediction is used. Improving branch prediction accuracy is a challenge, but it's necessary in highly pipelined designs. Secondary predictors like this are a must, as there's no one-size-fits-all when it comes to branch prediction.
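The perceptron predictor idea, popularized by Jimenez and Lin, can be sketched in a few lines. The parameters below are illustrative, not Piledriver's actual configuration: each branch maps to a weight vector, and the prediction is the sign of the dot product of those weights with recent branch history.

```python
# Perceptron branch predictor sketch: predict taken when the weighted sum
# over recent branch outcomes is non-negative; train weights on mispredicts
# or low-confidence predictions.

class PerceptronPredictor:
    def __init__(self, history_len=8, table_size=64):
        self.hlen = history_len
        # one bias weight plus one weight per history bit, per table entry
        self.table = [[0] * (history_len + 1) for _ in range(table_size)]
        self.history = [1] * history_len         # +1 = taken, -1 = not taken
        self.theta = int(1.93 * history_len + 14)  # training threshold heuristic

    def _output(self, pc):
        w = self.table[pc % len(self.table)]
        return w[0] + sum(wi * hi for wi, hi in zip(w[1:], self.history))

    def predict(self, pc):
        return self._output(pc) >= 0             # True = predict taken

    def update(self, pc, taken):
        w = self.table[pc % len(self.table)]
        y = self._output(pc)
        t = 1 if taken else -1
        if (y >= 0) != taken or abs(y) <= self.theta:
            w[0] += t
            for i, hi in enumerate(self.history):
                w[i + 1] += t * hi
        self.history = self.history[1:] + [t]    # shift in the new outcome
```

Trained on a repeating taken/taken/taken/not-taken loop pattern, the predictor quickly learns to key off the outcome four branches back, the kind of correlated branch a simple saturating-counter predictor handles poorly.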

Finally, Piledriver also adds new instructions to bring its ISA closer in line with Intel's: FMA3 (which Haswell will also adopt) and F16C (already supported by Ivy Bridge).
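F16C's job, converting between single and half precision floats, can be mimicked with Python's IEEE 754 binary16 struct format (the function names here are our own):

```python
import struct

# What F16C's conversion instructions do (VCVTPS2PH / VCVTPH2PS),
# modeled with Python's binary16 ('e') struct format.

def float_to_half_bits(x):
    """Round a float to binary16 and return the raw 16-bit pattern."""
    return struct.unpack('<H', struct.pack('<e', x))[0]

def half_bits_to_float(h):
    """Expand a raw binary16 bit pattern back to a Python float."""
    return struct.unpack('<e', struct.pack('<H', h))[0]

print(hex(float_to_half_bits(1.0)))                 # 0x3c00
print(half_bits_to_float(float_to_half_bits(0.1)))  # 0.0999755859375: precision lost
```

Half precision keeps only 10 fraction bits, which is why 0.1 doesn't survive the round trip exactly; the appeal is bandwidth and storage, with expansion back to single precision before arithmetic.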


271 Comments

I think it *needs* to be at $600 to sell, because SNB + GT 540M is already at $600. However, HP has hinted that their sleekbooks with Trinity will start at $600 and $700 for the 15.6" and 14" models, respectively. "Start at" and "comes with a reasonable amount of RAM and an A10 APU" are not the same thing. Until HP actually lists full specs and a price, I have to assume that the $600 price tag for the 15" model is going to be 4GB RAM, 250GB HDD, and an A6-4400 APU. Hopefully I'm wrong, but the fact is we don't know Trinity's real price yet, so in the article I'm referring to the price I think it should be at in order to provide a good value.

The CPU in Trinity is close to a 17W CPU with a 17W GPU. It performs about the same as an Intel 17W chip. Its graphics engine is far better and the CPUs should cost about the same. The only real disadvantage over 17W Sandy Bridge is that in a prototype chassis Trinity uses more power, but a few watts should be shaved on production models.

This means AMD has caught up to Intel again! Yes, AMD is going to lose spectacularly when ULV Ivy Bridge comes out, and I doubt Trinity is going to scale at higher power, but at low power AMD has caught up!

(Yes I know that Sandy Bridge includes a GPU but if you look at your benchmarks, ULV Intel with a dGPU scores similar to Trinity when transcoding [the only really CPU limited test in this review])

Something I just read at The Tech Report: when using MediaEspresso to transcode video, the result of VCE was much smaller than QuickSync or software, yet they didn't notice a difference in quality. I would like to know what your experience was. If that's really the case I'd prefer VCE over Intel's solution even if it's slower.

As far as I know, VCE is not yet supported or made available by AMD.

All those tests are due to OpenCL and not VCE, since that part cannot be reached at this point in time. (Yes, blame AMD for that one; this is already taking 6 months and still there is nothing about VCE)

Quote from Page 2: "Trinity borrows Graphics Core Next's Video Codec Engine (VCE) and is actually functional in the hardware/software we have here today. Don't get too excited though; the VCE enabled software we have today won't take advantage of the identical hardware in discrete GCN GPUs"

When you go to the Llano review, the HD 4000 gets stomped by Llano's desktop graphics offering. When you look at Trinity, the notebook version of Trinity barely beats Llano. Why is it that Intel can practically fit the full power of their IGP in notebooks (getting nearly the same performance as from the 3770K) but AMD's is drastically weaker?

Also - will we see a weaker HD 4000 in the dual core/cheaper IVB variants? I think Trinity's desktop GPU will stomp on the HD 4000 and might actually be a viable budget gaming solution as long as CPU improvements are good enough. We could see it take down quite a bit of the discrete graphics market, I think, considering the HD 4000 already can do that.

It's an odd move by Intel, perhaps, but I think it makes sense. The mobile Sandy Bridge and Ivy Bridge parts basically get the best IGP Intel makes (HD 3000/4000), and what's more the clocks are just as high and sometimes higher than the desktop parts. Yeah, how's that for crazy? The i7-3720QM laptop chips run HD 4000 at up to 1.25GHz while the desktop i7-3770/K/S/T runs the IGP at up to 1.15GHz. SNB wasn't quite so "bad" with HD 3000, as the 2600K could run HD 3000 at 1.35GHz compared to 1.3GHz on the fastest mobile chips.

Anyway, the reason I say it kind of makes sense is that nearly all desktops can easily add a discrete GPU for $50-$100, and it will offer two or even three times the performance of the best IGP right now. On a laptop, you get whatever the laptop comes with and essentially no path to upgrade.

For AMD, if you look at their clocks they have them cranked MUCH higher on desktops. The maximum Llano clocks for mobile chips are 444MHz, but the desktop parts are clocked up to 600MHz. What's even better for desktop is that Llano's GPU could be overclocked even further on many systems -- 800MHz seems to be achievable for many. So basically, AMD lets their GPU really stretch its legs on the desktop, but laptops are far more power/heat constrained. It will be interesting to see what AMD does with desktop Trinity -- I'd think 900MHz GPU core speeds would be doable.Reply