AMD's Radeon HD 6850 and 6870 graphics processors

For a journalist, there's nothing better than having a good story to tell. At least, that's always been my way of thinking, and we've had no shortage of intrigue, one-upsmanship, and swings of momentum in the GPU arena over the past year or so.

AMD grabbed the lead with the debut of its DirectX 11-class Radeon HD 5000-series graphics processors last September, well ahead of long-time rival Nvidia's competing chips. These new Radeons were quite good products, with a few strokes of brilliance like the Eyefinity multi-monitor gaming feature, but those highlights were counterbalanced by a frustrating series of supply problems stretching into 2010, caused by TSMC's troubled 40-nm chip manufacturing process. That same chipmaking process was a major contributor to uncharacteristically long delays in Nvidia's DX11 GPUs, which left a frustrated AMD with a market largely all to itself, a market it couldn't fully supply. Consumers groaned as a nearly unprecedented thing happened: prices on Radeon HD 5800-series cards rose above their introductory levels, and held there.

At the very end of the first quarter of the year, the first Fermi-based GeForces finally arrived. They ran hotter and louder but not much faster than the Radeon HD 5870, not exactly a winning combination. The outlook for Nvidia looked rather dim at that point, but a funny thing happened on the way to AMD's coronation as the kings of the DX11 generation. The new GeForces' performance quietly crept upward as Nvidia tuned its drivers for this novel, unfamiliar architecture, and then, in the middle of July, the GF104 debuted. This GPU, derived from the Fermi architecture, was smaller and more tightly focused on achieving strong performance in today's games. Onboard the GeForce GTX 460, it gave the incumbent Radeons much stiffer competition. Soon, we were declaring the GeForce GTX 400 series the new kings of value and hinting strongly that AMD needed to cut Radeon prices to win our recommendation.

Oddly enough, AMD didn't budge for a while, likely because supply constraints meant the firm was selling all of the graphics chips it could secure from TSMC. But AMD had, well, another card or two up its sleeve that would allow it to challenge the GTX 460 much more directly. Those cards, we now know, are called the Radeon HD 6850 and 6870, a pair of new offerings that come as part of AMD's annual fall refresh of its GPU lineup. They are both based on a leaner, meaner new graphics chip code-named Barts, a part of AMD's "Northern Islands" series of GPUs.

Barts? Where's Homer's?
The funny thing about Barts is that it's made using the exact same 40-nm fabrication process that has caused both AMD and Nvidia no end of trouble, mostly because AMD had little choice in the matter when TSMC outright canceled its plans for a 32-nm fabrication process. Both of the major GPU makers had to adjust their plans rather abruptly at that point, focusing on improvements to their chip designs to deliver additional goodness in this next generation of products.

Yet in the midst of some real frustrations, there's good news on several fronts. AMD Graphics CTO Eric Demers told us last week that TSMC had finally gotten a handle on the problems with its 40-nm process technology over the summer. If so, the latest chips from both AMD and Nvidia should be cheaper, faster, and more plentiful. That trend should be reinforced by some choices AMD has made along the way, especially the fact that Barts is actually smaller, and thus cheaper to produce, than the Cypress chip it replaces. Barts' mission is to address the value and performance sweet spot in the middle of the market, obviously opposing the GeForce GTX 460. Although the cards based on Barts are dubbed 6850 and 6870 and promise performance fairly similar to the products they replace, they should be less expensive, draw less power, and produce less heat than their predecessors.

A block diagram of the Barts GPU. Source: AMD.

The image above maps out the major components of the Barts chip in a familiar fashion. For the most part, this is the same core GPU architecture we know from the Cypress chip behind the Radeon HD 5800 series, only scaled down slightly and tweaked in several ways. Cypress has 20 SIMD arrays, each with 16 five-ALU-wide execution units, giving it a total of 1600 arithmetic logic units, or ALUs, with which to process the various types of shaders involved in the DX11 graphics pipeline. Barts dials back the SIMD array count slightly to 14, giving it a grand total of 1120 shader ALUs. With this GPU architecture, that change has some natural implications. The texture units, for instance, are aligned with the chip's SIMD arrays, so those drop in number proportionally, as well. Here are the vitals on Barts and some of its closest friends, to give you a sense of things.

| GPU | ROP pixels/clock | Textures filtered/clock | Shader ALUs | Rasterized triangles/clock | Memory interface width (bits) | Estimated transistor count (millions) | Approximate die size (mm²) | Fabrication process node |
|---|---|---|---|---|---|---|---|---|
| GF100 | 48 | 64 | 512 | 4 | 384 | 3000 | 529* | 40 nm |
| GF104 | 32 | 64 | 384 | 2 | 256 | 1950 | 331* | 40 nm |
| RV770 | 16 | 40 | 800 | 1 | 256 | 956 | 256 | 55 nm |
| Cypress | 32 | 80 | 1600 | 1 | 256 | 2150 | 334 | 40 nm |
| Barts | 32 | 56 | 1120 | 1 | 256 | 1700 | 255 | 40 nm |

*Best published estimate; Nvidia doesn't divulge die sizes
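
For those who like to check the math, here is a minimal Python sketch of how the shader ALU and texture figures in the table fall out of the SIMD array counts. The 16 execution units per array and five ALUs per unit come from the description above; the figure of four texture units per SIMD array is our own inference from the table (80 / 20 and 56 / 14), and the function names are purely illustrative.

```python
# Layout taken from the article: 16 execution units per SIMD array,
# each unit five ALUs wide.
UNITS_PER_SIMD = 16
ALUS_PER_UNIT = 5
# Assumption: four texture units per SIMD array, inferred from the table
# (80 / 20 for Cypress, 56 / 14 for Barts), not quoted from AMD.
TEXTURE_UNITS_PER_SIMD = 4


def shader_alus(simd_arrays):
    """Total shader ALUs for a chip with the given number of SIMD arrays."""
    return simd_arrays * UNITS_PER_SIMD * ALUS_PER_UNIT


def texture_units(simd_arrays):
    """Textures filtered per clock, assuming units scale with SIMD arrays."""
    return simd_arrays * TEXTURE_UNITS_PER_SIMD


for name, simds in (("Cypress", 20), ("Barts", 14)):
    print(f"{name}: {shader_alus(simds)} ALUs, "
          f"{texture_units(simds)} textures filtered per clock")
```

Cutting six SIMD arrays therefore removes 480 ALUs and 24 texture units in one stroke, which is why the two figures drop together.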

With the GF104, Nvidia held texturing capacity steady at the GF100's rate while reducing nearly everything else: ROP rate, rasterization rate, memory interface width, and ALU count. The result was a GPU probably better tuned to the needs of current games.

With Barts, AMD has made a different set of choices, reducing shader processing and texturing capacity versus Cypress while retaining the same ROP rate and memory interface size. Oddly enough, these very different choices may also produce a GPU better tuned for the usage patterns of today's game engines, given the present state of AMD's GPU architecture. After all, Cypress doubled up on RV770's resources in nearly every way but memory bandwidth. If that left it, at times, with an excess of shader and texturing power, then Barts may well be a better balance of resources overall. That may especially be the case when high levels of antialiasing are in use, since Barts has the same ROP blending power, clock for clock, as Cypress, and as a smaller, newer chip, Barts may have a little more clock speed headroom.
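
One rough way to see that shift in balance is to divide each chip's shader and texturing resources by its ROP rate. The sketch below does so in Python using only the per-clock figures from the table; the ratio itself is our own illustrative yardstick, not a metric AMD uses, and it says nothing about clock speeds or memory bandwidth.

```python
# Per-clock figures copied from the table above:
# (ROP pixels, textures filtered, shader ALUs)
CHIPS = {
    "RV770": (16, 40, 800),
    "Cypress": (32, 80, 1600),
    "Barts": (32, 56, 1120),
}


def per_rop(chip):
    """Return (shader ALUs, textures filtered) per ROP pixel per clock."""
    rops, textures, alus = CHIPS[chip]
    return alus / rops, textures / rops


for chip in CHIPS:
    alus, textures = per_rop(chip)
    print(f"{chip}: {alus:.1f} ALUs and {textures:.2f} filtered texels "
          f"per ROP pixel")
```

By this measure, Cypress kept exactly the RV770's proportions (50 ALUs and 2.5 filtered texels per ROP pixel), while Barts tilts the mix toward the ROPs at 35 and 1.75.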

Cypress (left) versus Barts (right)

By the way, you may have noticed the presence of two "ultra-threaded dispatch processor" blocks in the diagram above, and if you're into these things, you may have recalled that the diagrams of Cypress only showed one of these blocks. Truth is, though, that this diagram of Barts is simply more detailed than the earlier one of Cypress. AMD's David Nalasco tells us both chips have dual "macro sequencers," as AMD calls them internally, to "dispatch instructions to the SIMDs." (There's also a "micro sequencer" in each SIMD.) As the diagram shows, each macro sequencer has instruction and constant caches. One bit of detail missing above is a crossbar between the two "rasterizer" blocks and the macro sequencers, so either sequencer can be fed by either rasterizer.

To take you further down the rabbit hole, the presence of two rasterizers in the diagram above may be a little bit misleading. As with Cypress, Barts has dual scan converters, but it lacks the setup and primitive interpolation rates to process more than one triangle per clock cycle. That's in contrast to the GF104, which can process two polygons per clock tick, or the GF100, whose max is four.

Although the setup rate hasn't changed in Barts, the chip's internal geometry processing throughput should be higher thanks to some selective tweaks. One of DirectX 11's key features is tessellation, in which a relatively low-polygon model is sent to the GPU, and the chip then adds additional detail by using a mathematical description of the surface's curves and, sometimes, a texture map of its bumps. Adding detail once the model is on the chip can reduce host-to-GPU communications overhead, oftentimes dramatically; it also makes much higher degrees of geometric complexity feasible. One of the challenges tessellation presents is the management of data flow. As essentially a very effective form of compression, tessellation involves a relatively small amount of input data and a much larger, sometimes daunting amount of output data. To better deal with this data flow in Barts, AMD "re-sized some queues and buffers," according to Nalasco, "to achieve significantly higher peak throughput" in certain cases. At the same time, thread management for domain shaders, which handle post-expansion geometry processing, has been improved.
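
To get a feel for why that output data can become daunting, consider a deliberately idealized Python sketch. It assumes every triangular patch is subdivided uniformly, with each edge split into `factor` segments, which yields `factor` squared triangles per patch. That is a simplification of our own: the actual DirectX 11 tessellator uses its own subdivision pattern and separate edge and interior factors, so real counts will differ. The function name and the patch count are hypothetical.

```python
def triangles_out(patches, factor):
    """Triangles produced under an idealized uniform subdivision.

    Assumption: each input patch is a triangle whose edges are each split
    into `factor` segments, giving factor ** 2 output triangles. This is a
    simplified model, not the DirectX 11 tessellation pattern.
    """
    if patches < 0 or factor < 1:
        raise ValueError("patches must be >= 0 and factor must be >= 1")
    return patches * factor ** 2


# A hypothetical 1,000-patch model at increasing tessellation factors.
for factor in (1, 4, 16):
    print(f"factor {factor:2d}: {triangles_out(1000, factor)} triangles")
```

Even in this simple model, the same small input grows 256-fold at a factor of 16, which is the kind of expansion the on-chip queues and buffers have to absorb.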

AMD claims these changes had "negligible impact" on Barts' transistor budget and power draw, yet the firm has measured tessellation throughput for Barts at up to twice that of Cypress in directed tests. The biggest gains come at lower tessellation levels, as shown in the image below. At higher levels, the chips' common setup rate likely becomes a limiting factor, and the two are separated only by Barts' slightly higher clock speed.

Barts vs. Cypress tessellation throughput. Source: AMD.

Interestingly enough, we were able to measure a substantial difference between Cypress and Barts ourselves using the hyper-tessellated Unigine Heaven demo.

Barts hasn't quite matched the GF104 and friends, with their truly parallel geometry processing capabilities, but it has narrowed the gap quite a bit.

Barts also has some image quality improvements, one in hardware and one in software, that we'll discuss shortly, but that's about it in terms of changes to the core graphics hardware. We were a little bit surprised to see Demers claiming rather large gains in performance per chip area for Barts versus Cypress, on the order of 25%, given that the two chips share the same underlying architecture and are made on the same fabrication process, but that's precisely what happened during the press event for this product. Strangely, the comparison being made was between the Radeon HD 6870 (a fully enabled Barts chip running at peak clock speeds) and the Radeon HD 5850 (a partially disabled Cypress variant with lower clocks). I also run faster than Usain Bolt if you cut off one of his legs below the knee, but that's not something I like to advertise.