AMD has clearly taken NVIDIA’s comments on geometry performance to heart. Along with issuing their manifesto with the 6800 series, they’ve also been working on their own improvements for their geometry performance. As a result AMD’s fixed function Graphics Engine block is seeing some major improvements for Cayman.

Prior to Cypress, AMD had 1 graphics engine, which contained 1 each of the fundamental blocks: the rasterizers/hierarchical-Z units, the geometry/vertex assemblers, and the tessellator. With Cypress AMD added a 2nd rasterizer and 2nd hierarchical-Z unit, allowing them to set up 32 pixels per clock as opposed to 16 pixels per clock. However while AMD doubled part of the graphics engine, they did not double the entirety of it, meaning their primitive throughput rate was still 1 primitive/clock, a typical throughput rate even at the time.

Cypress's Graphics Engine

In 2010 with the launch of Fermi, NVIDIA raised the bar on primitive performance. With rasterization moved to NVIDIA’s GPCs, NVIDIA could theoretically push out as many primitives/clock as they had GPCs; in the case of GF100/GF110 this meant 4 primitives/clock, a simply massive improvement in geometry performance for a single generation.

With Cayman AMD is catching up with NVIDIA by increasing their own primitive throughput rate, though not by as much as NVIDIA did with Fermi. For Cayman the rest of the graphics engine is being fully duplicated – Cayman will have 2 separate graphics engines, each containing one of each fundamental block, and each capable of pushing out 1 primitive/clock. Between the two of them AMD’s maximum primitive throughput rate will now be 2 primitives/clock; half as much as NVIDIA but twice that of Cypress.
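The peak rates quoted above are simple multiplication, which a short Python sketch makes explicit. The unit counts come from the text; the function name and the assumption that every unit sustains exactly 1 primitive/clock are illustrative, and real throughput will be lower than these theoretical peaks.

```python
def peak_primitives_per_clock(units, primitives_per_unit=1):
    """Theoretical peak: each setup unit retires a fixed number of primitives per clock."""
    return units * primitives_per_unit

# Unit counts taken from the text: Cypress has 1 graphics engine,
# Cayman has 2 graphics engines, and GF100/GF110 has 4 GPCs.
designs = {"Cypress": 1, "Cayman": 2, "GF100/GF110": 4}
rates = {name: peak_primitives_per_clock(n) for name, n in designs.items()}

for name, rate in rates.items():
    ratio = rate / rates["Cypress"]
    print(f"{name}: {rate} primitive(s)/clock ({ratio:.1f}x Cypress)")
```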

Cayman's Dual Graphics Engines

As was the case for NVIDIA, splitting up rasterization and tessellation is not a straightforward and easy task. For AMD this meant teaching the graphics engine how to do tile-based load balancing so that the workload being spread among the graphics engines is being kept as balanced as possible. Furthermore AMD believes they have an edge on NVIDIA when it comes to design - AMD can scale the number of graphics engines at will, whereas NVIDIA has to work within the logical confines of their GPC/SM/SP ratios. This tidbit would seem to be particularly important for future products, when AMD looks to scale beyond 2 graphics engines.
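AMD has not described how its tile-based load balancing actually works, so the following Python sketch only illustrates the general idea: carve the screen into tiles and interleave them across engines so that any reasonably large region is split about evenly. The tile size, the interleaving rule, and all names here are assumptions for illustration, not AMD's scheme.

```python
from collections import Counter

TILE = 32  # hypothetical tile edge in pixels; the real size is not stated


def engine_for_tile(tx, ty, engines=2):
    """Interleave tiles across engines (a checkerboard when engines == 2)."""
    return (tx + ty) % engines


def tiles_per_engine(x0, y0, x1, y1, engines=2):
    """Count the tiles each engine owns inside the pixel box [x0, x1) x [y0, y1)."""
    load = Counter({e: 0 for e in range(engines)})
    for ty in range(y0 // TILE, (y1 - 1) // TILE + 1):
        for tx in range(x0 // TILE, (x1 - 1) // TILE + 1):
            load[engine_for_tile(tx, ty, engines)] += 1
    return load


# A 256x128 pixel region covers 8x4 tiles, split evenly between two engines.
load = tiles_per_engine(0, 0, 256, 128)
print(dict(load))
```

Because the interleaving rule takes the engine count as a parameter, the same sketch extends to more than two engines, which is the kind of scaling the paragraph above alludes to.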

At the end of the day all of this tinkering with the graphics engines is necessary in order for AMD to further improve their tessellation performance. AMD’s 7th generation tessellator improved their performance at lower tessellation factors where the tessellator was the bottleneck, but at higher tessellation factors the graphics engine itself is the bottleneck as the graphics engine gets swamped with more incoming primitives than it can set up in a single clock. By having two graphics engines and a 2-primitive/clock rasterization rate, AMD is shifting the burden back away from the graphics engine.

Just having two 7th generation-like tessellators goes a long way towards improving AMD’s tessellation performance. However all of that geometry can still lead to a bottleneck at times, which means it needs to be stored somewhere until it can be processed. As AMD has not changed any cache sizes for Cayman, there’s the same amount of cache for potentially twice as much geometry, so in order to keep things flowing that geometry has to go somewhere. That somewhere is the GPU’s RAM, or as AMD likes to put it, their “off-chip buffer.” Compared to cache access RAM is slow and hence this isn’t necessarily a desirable action, but it’s much, much better than stalling the pipeline entirely while the rasterizers clear out the backlog.
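The trade-off described above, spilling to slower memory instead of stalling, can be modeled as a small FIFO with an overflow area. This Python sketch is a conceptual toy: the class name, the capacity, and the refill policy are assumptions, and it says nothing about how Cayman's hardware manages its off-chip buffer.

```python
from collections import deque


class GeometryBuffer:
    """Toy FIFO: a small fast buffer that overflows into a large slow one."""

    def __init__(self, on_chip_capacity):
        self.capacity = on_chip_capacity
        self.on_chip = deque()   # stands in for on-chip cache
        self.off_chip = deque()  # stands in for GPU RAM ("off-chip buffer")
        self.spills = 0

    def push(self, primitive):
        # Once anything has spilled, later items must follow it to keep FIFO order.
        if self.off_chip or len(self.on_chip) >= self.capacity:
            self.off_chip.append(primitive)
            self.spills += 1
        else:
            self.on_chip.append(primitive)

    def pop(self):
        # Raises IndexError if the buffer is empty.
        item = self.on_chip.popleft()
        # Refill the freed on-chip slot with the oldest spilled item.
        if self.off_chip:
            self.on_chip.append(self.off_chip.popleft())
        return item


buf = GeometryBuffer(on_chip_capacity=2)
for prim in range(5):
    buf.push(prim)  # the producer never blocks; overflow is spilled instead

drained = [buf.pop() for _ in range(5)]
print(buf.spills, drained)
```

The producer pays the cost of slower storage for the spilled items, but it never has to wait for the consumer, which is the point the paragraph above makes about avoiding a pipeline stall.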

Red = 6970. Yellow = 5870

Overall, clock for clock tessellation performance is anywhere between 1.5x and 3x that of Cypress. In situations where AMD’s already improved tessellation performance at lower tessellation factors plays a part, AMD approaches 3x performance; while at around a factor of 5 the performance drops to near 1.5x. Elsewhere performance is around 2x that of Cypress, representing the doubling of graphics engines.

Tessellation also plays a factor in AMD’s other major gaming-related improvement: ROP performance. As tessellation produces many mini triangles, these triangles begin to choke the ROPs when performing MSAA. Although tessellation isn’t the only reason, it certainly plays a factor in AMD’s reasoning for improving their ROPs to improve MSAA performance.

The 32 ROPs (the same as Cypress) have been tweaked to speed up processing of certain types of values. In the case of both signed and unsigned normalized INT16s, these operations are now 2x faster. Meanwhile FP32 operations are now 2x to 4x faster depending on the scenario. Finally, similar to shader read ops for compute purposes, ROP write ops for graphics purposes can be coalesced, improving performance by requiring fewer operations.
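Write coalescing in general means merging several small writes to neighboring locations into fewer, larger operations. The Python sketch below shows that idea on plain integer addresses; the burst limit and the merge rule are hypothetical and are not taken from AMD's ROP design.

```python
def coalesce_writes(addresses, max_burst=4):
    """Merge writes to consecutive addresses into (start, length) bursts.

    max_burst is a hypothetical cap on how many writes one operation can carry.
    """
    ops = []
    for addr in sorted(set(addresses)):
        if ops:
            start, length = ops[-1]
            # Extend the current burst if this address is the next one in line.
            if addr == start + length and length < max_burst:
                ops[-1] = (start, length + 1)
                continue
        ops.append((addr, 1))
    return ops


writes = [0, 1, 2, 3, 4, 8, 9]
ops = coalesce_writes(writes)
print(f"{len(writes)} writes -> {len(ops)} operations: {ops}")
```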

167 Comments

Anand also tested with 'outdated' drivers. It is of course AMD's fault for not supplying the best drivers available at launch, though. But Anand used 10.10; reviews that use 10.11, like HardOCP, see the 6950 performing equal to or better than the GTX 570, and the 6970 trading blows with the GTX 580 but overall a little slower (though faster than the GTX 570).

And now we have to wait for the 10.12 drivers which were meant to be for 69xx series.

That said, Anand, would it be possible to change your graphs? Starting with the low quality and ending with the high quality? And also make the high quality chart for single cards only. Now it just isn't readable with SLI and crossfire numbers through it.

According to your results 6970 is > 570 and 6950~570 but only when everything is turned on, but one cannot deduce that from the current presentation.

$740 for HD6970 CrossfireX dominates GTX580 SLI costing over $1000. That's some serious ownage right there. Good pricing on these new cards and solid numbers for power/heat and noise. Seems like a good new series of cards from AMD.

By a small average amount, and for ~$250 extra. Once you get to that level, you're not really hurting for performance anyway, so for people who really just want to play games and aren't interested in having the "fastest card" just to have it, the 6970 is the best value.

True. However AMD has just about always been about value over an all out direct card horsepower war with Nvidia. Some people are willing to spend for bragging rights.

But I'm a little suspect on AT's figures with these cards. Two other tech sites (Tom's Hardware and Guru3D) show the GTX 570 and 580 solidly beating the 6950 and 6970 respectively in the same games with similar PC builds.

A lot of people were anxious to see what AMD would bring to the market with 6950/6970. And once again not much. Some minor advantages (like 5FPS in a handful of games) are nothing worth writing or screaming about. For now GTX580 is more expensive, but now with AMD unveiling new cards nVidia will get really serious about the price. That $500 price point won't live for long. I'm expecting at least $50 off that in the next 4-6 weeks.

GTX580 is the best option today for someone who is interested in a new VGA; if you own a 5850/5870/5970 right now (CF or not), don't even bother with 69[whatever].

At that price point a 580 is the best buy? Get lost. The 580 is way overpriced for the small performance increase it has above the 570-6970, not to mention the additional power consumption. Don't see any reason at all to buy that card.

Indeed no need to upgrade from a 58xx series, but neither is there a need to move to an nv based card.