A new, narrower SPU
The big adjustment in Cayman comes at such a minute level, it isn't even visible in the big block diagram. Inside of each of the chip's SIMD shader processing engines is an array of 16 execution units or stream processing units (SPUs). In every AMD GPU architecture dating back to the R600, the fundamental SPU layout has been essentially the same, with four arithmetic logic units (ALUs) of equal capability and a fifth "fat" ALU capable of handling special functions like transcendentals. These execution units play a key part in the larger GPU symphony. Instructions for the ALUs are grouped together into a single, very long instruction word, and then all 16 of the SPUs in a SIMD engine execute the same instructions on different data simultaneously.

Scheduling instructions in VLIW5 groups like that can be a challenge, since the real-time compiler in AMD's graphics drivers must ensure that one operation's output isn't needed as input for another operation in the same instruction word. If such dependencies are present, the compiler may not be able to schedule instructions on all five ALUs at once, and some ALUs may be left idle. The fact that only the one "fat" ALU can handle transcendentals further complicates matters.
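To make the scheduling problem concrete, here is a toy sketch of a greedy instruction packer. It is not AMD's compiler, and every name in it (`Op`, `pack`, `utilization`, the six-operation program) is made up for illustration. It assumes a transcendental occupies exactly one slot and that an operation can only consume results from an earlier instruction word. The "VLIW5" configuration has four general slots plus one slot that can also take transcendentals; the "VLIW4" configuration has four slots that can each take anything.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One scalar operation in a hypothetical shader program."""
    name: str
    deps: tuple = ()              # names of ops whose results this op reads
    transcendental: bool = False  # sin, cos, exp and friends


def pack(ops, general_slots, special_slots, general_can_do_special):
    """Greedily pack ops into instruction words (bundles), in program order.

    An op may join a bundle only if all of its inputs were produced by
    earlier bundles and a slot of a suitable type is still free.
    """
    done, pending, bundles = set(), list(ops), []
    while pending:
        bundle = []
        free_general, free_special = general_slots, special_slots
        for op in pending:
            if not all(d in done for d in op.deps):
                continue  # an input is not available yet
            if op.transcendental:
                if free_special > 0:
                    free_special -= 1
                elif general_can_do_special and free_general > 0:
                    free_general -= 1
                else:
                    continue  # no slot can run a transcendental this cycle
            elif free_general > 0:
                free_general -= 1
            elif free_special > 0:
                free_special -= 1  # the special slot also runs plain ops
            else:
                continue
            bundle.append(op.name)
        if not bundle:
            raise ValueError("dependency can never be satisfied")
        done.update(bundle)
        pending = [op for op in pending if op.name not in done]
        bundles.append(bundle)
    return bundles


def utilization(bundles, width):
    """Fraction of ALU slots that actually did work."""
    return sum(len(b) for b in bundles) / (len(bundles) * width)


# A made-up, transcendental-heavy snippet: c = sin(x)*cos(y); f = (x+y)*exp(z)
program = [
    Op("a", transcendental=True),
    Op("b", transcendental=True),
    Op("c", deps=("a", "b")),
    Op("d"),
    Op("e", transcendental=True),
    Op("f", deps=("d", "e")),
]

vliw5 = pack(program, general_slots=4, special_slots=1, general_can_do_special=False)
vliw4 = pack(program, general_slots=4, special_slots=0, general_can_do_special=True)

print(len(vliw5), utilization(vliw5, 5))  # 4 bundles, 0.3
print(len(vliw4), utilization(vliw4, 4))  # 2 bundles, 0.75
```

The example is deliberately stacked with transcendentals, so the gap is exaggerated. Real shader code will land somewhere much less dramatic, but the mechanism is the same: dependencies and the single special-function slot both leave lanes empty.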

Thus, Cayman introduces a new, slimmer SPU block with four ALUs. Each of those four ALUs has absorbed the capabilities of the old "fat" ALU, so they can all handle special functions. Both the symmetrical nature of the ALUs and the narrower VLIW4 instruction word should simplify compiler scheduling and allow fuller utilization of the ALUs. It should also ease register management and make performance more predictable, especially for non-graphics applications. AMD claims a 10% improvement in performance per square millimeter over the prior VLIW5 design. However, AMD Graphics CTO Eric Demers, who was chief architect on Cayman back when the project started and was also deeply involved in R600, said almost wistfully that AMD would have retained the five-wide ALU if graphics workloads were the only consideration. Obviously, GPU computing performance was a big impetus behind the change.

In fact, some of the enhancements in Cayman apply almost exclusively to GPU computing applications and may affect AMD's FireStream lineup more directly than its consumer Radeon graphics cards. Among them: double-precision floating-point throughput has improved somewhat, since DP math operations now happen at one-quarter the single-precision rate, rather than one-fifth as in prior designs. Cayman has taken another step toward the data center by incorporating ECC protection for external memories, much like Nvidia's Fermi architecture. Unfortunately, unlike Fermi, internal memories and storage aren't protected. Of course, ECC protection won't be used in consumer graphics cards, regardless.
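For a sense of scale on the double-precision change, the arithmetic can be sketched as follows. The single-precision figure below is a round placeholder, not a spec for any real chip; only the one-quarter and one-fifth ratios come from the text above.

```python
from fractions import Fraction

# Placeholder single-precision throughput in GFLOPS (an assumption, not a spec).
SP_GFLOPS = 1000

dp_old = SP_GFLOPS * Fraction(1, 5)  # prior designs: one-fifth the SP rate
dp_new = SP_GFLOPS * Fraction(1, 4)  # Cayman: one-quarter the SP rate

# Relative gain at equal single-precision throughput.
gain = dp_new / dp_old - 1

print(dp_old, dp_new, float(gain))  # 200 250 0.25
```

In other words, at the same single-precision rate, the ratio change alone is worth 25% more double-precision throughput, before any differences in ALU count or clock speed.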

Cayman's support for processing multiple compute kernels simultaneously is more robust, as well. According to Demers, Cypress could execute multiple kernels, but with only one pipe into the chip, their entry into the GPU had to be serialized. Cayman now has three entry points, with the possibility for more in future GPUs. Each kernel has its own command queue and virtual address domain, so they should be truly independent from one another.

The laundry list of compute-focused changes goes on from there, encompassing dual, bidirectional DMA engines for faster communication with the host system; the coalescing of shader read operations; and the ability to fetch data directly into the local data share attached to each SIMD. Many of these capabilities may sound familiar because Nvidia added them to its Fermi architecture. Clearly, AMD is on a similar architectural trajectory, toward making its GPU into a very competent general-purpose and data-parallel processor.

More tessellation from the, uh, tessellinator?
One of the flash points in DirectX 11 GPU architecture discussion has been the question of geometry throughput. Tessellation—the ability to take a low-polygon mesh and some additional information and transform it into a much more detailed, high-poly mesh on the GPU—is one of DX11's highest-profile features. Add the fact that Nvidia has taken a much more sweeping approach to parallelizing geometry processing, and you have the makings of a good argument or three.

The underlying issue here is that polygon throughput rates in GPUs haven't risen at nearly the rate other forms of graphics power have. There's more to it, but the fact that setup and rasterization rates didn't, for ages, eclipse one triangle per clock cycle is a good indicator of the problem. Without parallel geometry processing, the limits were fairly static. GPU makers are finally pushing past those limits, with Nvidia quite clearly in the lead. The GF100 and GF110 GPUs can rasterize up to four triangles per clock cycle, for example.

AMD created some confusion on this front when it introduced Cypress by claiming the chip had dual rasterizers. In reality, Cypress was dual core "from the rasterizers down," as a knowledgeable source put it to me recently. What Cypress had was dual scan converters—a pixel-throughput optimization for large polygons—but it lacked the setup and primitive interpolation rates to surpass one triangle per clock cycle.

Cayman's dual graphics/vertex engines. Source: AMD.

By contrast, Cayman has the ability to set up and rasterize two triangles per clock cycle. I'm not sure it quite tracks with what you're seeing in the simplified diagram above, but Cayman has two copies of the logic block that does triangle setup, backface culling, and geometry subdivision for tessellation. Load-balancing logic distributes DirectX tiles between these two vertex engines, and the processed tiles are then fed into one of Cayman's two 12-SIMD shader blocks. Interestingly, neither vertex engine is tied to a single shader block, nor vice versa. Future variants of this architecture could have a single vertex engine and dual shader blocks, or the reverse.

Of course, two triangles per clock is the theoretical maximum, and delivered performance will be a little lower. I'm told AMD has measured Cayman's throughput at between 1.6 and 1.8 triangles per clock.
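Those measured figures are easy to turn into an efficiency number and an absolute rate. The sketch below does the arithmetic; the clock speed is a placeholder chosen for round numbers, not a figure quoted here, and the function names are made up for illustration.

```python
PEAK_TRIS_PER_CLOCK = 2.0          # Cayman's theoretical peak, per the text
MEASURED_TRIS_PER_CLOCK = (1.6, 1.8)  # range AMD reportedly measured

# Placeholder core clock in Hz (an assumption; substitute the real card's clock).
EXAMPLE_CLOCK_HZ = 800_000_000


def efficiency(measured, peak):
    """Delivered throughput as a fraction of the theoretical peak."""
    return measured / peak


def triangles_per_second(tris_per_clock, clock_hz):
    """Convert a per-clock rate into an absolute rate."""
    return tris_per_clock * clock_hz


for measured in MEASURED_TRIS_PER_CLOCK:
    eff = efficiency(measured, PEAK_TRIS_PER_CLOCK)
    rate = triangles_per_second(measured, EXAMPLE_CLOCK_HZ)
    print(f"{measured} tris/clock = {eff:.0%} of peak, "
          f"{rate / 1e9:.2f} billion tris/s at the example clock")
```

So the measured range works out to roughly 80% to 90% of the two-per-clock peak.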

That's a big improvement over prior Radeons, but by comparison, Nvidia's biggest chip, the GF110, has four raster engines; 16 "PolyMorph engines" for setup, transform, and geometry expansion; and a four-triangle-per-clock theoretical peak.