Graphics Core Next in Radeon HD 7970: A look at instruction issue rates in graphics contexts

AMD's recently launched Radeon HD 7970 brought the first installment of "Graphics Core Next" in the shape of a 4.312 billion transistor processor code-named Tahiti. With Graphics Core Next (or GCN for short), AMD has taken a radically different approach to compute than with its former Very Long Instruction Word (VLIW) based micro-architectures. The emphasis, and not only in the last sentence, is on compute, because that's the area where the new architecture is set to flex its ALU muscles first and foremost. For graphics alone, AMD remained adamant, VLIW was (and is) a very efficient way of cramming highly potent circuits into a small amount of die space.

Gaming performance of the Radeon HD 7970 is somewhere between 25 and 50 percent up from the HD 6970 and somewhere from 10 to 30 percent above the former single-GPU champ Geforce GTX 580, no matter whether that card has 1.5 or 3 GiB. With that settled, let's take a look at Tahiti in its XT version and dig a little deeper.

The Graphics Core Next
The Radeon HD 7970's GPU features quite a few industry firsts. It is the first available consumer GPU to be produced in TSMC's 28nm process, it is the first to be compliant with DirectX 11.1 once Microsoft makes the corresponding Windows update available, and it is the first graphics card able to utilize the PCI-Express 3.0 interface, which has already been shown to work at speeds more than 50 percent higher than with PCIe 2.0.

At the heart of the processor is another little processor actually bearing the name Graphics Core Next. In terms of compute, it consists of four 16-wide SIMDs (or vector units) and a scalar co-processor, each with its own set of registers: 64 kiB for each vector unit and 4 kiB for the scalar co-processor. They are fed by an individual scheduler for each GCN or Compute Unit (CU). A local data share of 64 kiB facilitates rapid data exchange between the vector units if need be, and a two-level, full read/write caching system is employed with a 16 kiB first level that also doubles as texture cache. In order to save precious space, compressed textures are stored as such and, in contrast to earlier AMD processors, not in decompressed form. Attached to each CU is a texture unit that can address and fetch 16 samples per clock, thus delivering 4 bilinear filtered texels.

Shared between a set of four CUs in Tahiti are the instruction cache and the scalar data cache, as detailed in AMD's presentation (p. 15) at the Fusion Developer Summit 2011. Each L1 cache has an accumulated r/w bandwidth of 64 bytes per clock (1894 GByte/sec total with 925 MHz engine clock) and each 128 kiB partition of the L2 cache is also able to transfer 64 bytes per clock. Additionally, there's the Global Data Share that is used to facilitate synchronization between the CUs.
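For those who want to check the 1894 GByte/sec figure, here's a minimal sketch of the arithmetic. The 64 bytes per clock and the 925 MHz come from the paragraph above; the count of 32 CUs (one L1 each) is my assumption for Tahiti XT and is not stated in the text, and the function name is just made up for illustration.

```python
# Aggregate L1 bandwidth from the per-CU figures quoted above.
BYTES_PER_CLOCK_PER_L1 = 64   # from the text
ENGINE_CLOCK_HZ = 925e6       # from the text: 925 MHz
NUM_CUS = 32                  # assumption: 32 CUs in Tahiti XT, one L1 each


def aggregate_l1_bandwidth_gb_s(num_cus=NUM_CUS,
                                bytes_per_clock=BYTES_PER_CLOCK_PER_L1,
                                clock_hz=ENGINE_CLOCK_HZ):
    """Return the summed L1 bandwidth in GByte/s (1 GByte = 1e9 bytes)."""
    return num_cus * bytes_per_clock * clock_hz / 1e9


print(f"{aggregate_l1_bandwidth_gb_s():.1f} GByte/s")  # 1894.4 GByte/s
```

Under that assumption the result lines up with the quoted total.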

Similar to Cayman, and partially to Cypress, there's a twin front-end feeding two blocks of CUs, although there's a way for the data to cross over to the other block for better load balancing. Unlike Cypress though, Cayman and Tahiti have the ability to work on individual primitives in the front-end. The Raster Backends, currently at eight sets of four, are independently scalable and not tied to the number of memory controllers.

Above it all sits the graphics command processor and dual asynchronous compute engines or ACEs. They both extract wavefronts of 64 work-items from the program kernels (vertex, hull, domain, geometry and pixel) and feed them to the individual CUs. Each CU is capable of keeping up to 40 wavefronts in flight (ten private to each SIMD) until their respective completion by the assigned CU.
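To put the wavefront numbers into perspective, here's a small back-of-the-envelope sketch. All constants are taken from the text; the only thing I'm assuming is the simple model that a 16-wide SIMD works through a 64-wide wavefront 16 work-items per clock. The function names are mine.

```python
# Occupancy figures for one GCN Compute Unit, using the numbers quoted above.
WAVEFRONT_SIZE = 64        # work-items per wavefront (from the text)
SIMD_WIDTH = 16            # lanes per vector unit (from the text)
SIMDS_PER_CU = 4           # vector units per CU (from the text)
WAVEFRONTS_PER_SIMD = 10   # wavefronts private to each SIMD (from the text)


def clocks_per_wavefront_instruction():
    # Assumption: a SIMD processes one wavefront 16 work-items per clock.
    return WAVEFRONT_SIZE // SIMD_WIDTH


def wavefronts_in_flight_per_cu():
    return SIMDS_PER_CU * WAVEFRONTS_PER_SIMD


def work_items_in_flight_per_cu():
    return wavefronts_in_flight_per_cu() * WAVEFRONT_SIZE


print(clocks_per_wavefront_instruction())  # 4
print(wavefronts_in_flight_per_cu())       # 40
print(work_items_in_flight_per_cu())       # 2560
```

So, in this model, a fully occupied CU juggles 2560 work-items, and each vector instruction keeps a SIMD busy for four clocks.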

The vector units
One of the main differences between AMD's earlier processors and the current GCN is the move from a VLIW approach to a quad SIMD. Where before, the driver compiler had to rely on extracting parallelism from the instruction stream, grouping four ALU operations for all VLIW lanes together in one clause, now each SIMD executes its own instruction. This in effect marks the shift from a dependency-limited to an occupancy-limited design, which in turn means that, contrary to earlier AMD architectures, GCN should be able to achieve higher throughput in less-than-vec4 scenarios, i.e. scalar operations. Since the scheduler can only issue to one of the four 16-wide SIMDs at a time, you lose some throughput here on very short shader programs, as my friend Damien mentioned in his rather excellent HD 7970 review over at hardware.fr. Another issue has cropped up during testing, which I will share with you here. For comparison's sake, I've included my Geforce GTX 480, which was overclocked to match Nvidia's current top-of-the-line GTX 580 as closely as possible.

Important note: GPU Bench is a rather old OpenGL program and as such, it is using the graphics pipeline, not the newly created compute workflow. That means it is very much possible, if not outright likely, that what is shown here only applies to graphics, more precisely, to graphics running into a rasterizer limit. It therefore may not reflect the pure compute capabilities of Tahiti and/or Graphics Core Next, but it can provide some insight as to why Tahiti sometimes in graphics, especially games, shows much less of an improvement over Cypress than you would think going by the numbers (+40 percent), even without adding the new architecture on top of that.

In order to better visualize the effects, below is the same data formatted as a line graph. The data gathered clearly shows that for this OpenGL-based test, the HD 7970 lacks the ability to fully capitalize on scalar or vector2 instruction widths when shaders are fairly short.

Now, compared to previous architectures such as Cayman, any improvement from the move to narrower instructions is an advancement, since the VLIW-based architectures could not profit from it apart from cases where other bottlenecks like operand collection existed. Additionally, the Graphics Core Next architecture can impressively flex its muscles with longer shaders, starting from approximately 64 and 128 instructions upwards (depending on whether you're looking at Vec2 or Scalar, respectively), leaving Nvidia's GF100 chips in the dust. With longer instruction sequences, Tahiti XT can establish a lead of factor ~2.3 for scalar instruction streams as well, where it was slightly behind the overclocked GTX 480 with short, 48-instruction shaders.
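To make the VLIW-versus-SIMD argument a bit more tangible, here's a toy model of ALU utilization by vector width. It is an idealised sketch, not a description of how either chip actually schedules: it assumes a four-slot VLIW bundle as in Cayman, and the parameter co_issued_ops is a hypothetical knob for extra independent operations the compiler manages to pack into the same bundle. It ignores the short-shader issue limits measured in this article altogether.

```python
VLIW_SLOTS = 4  # Cayman-style four-slot VLIW bundle


def vliw4_slot_utilization(vector_width, co_issued_ops=0):
    """Fraction of the four VLIW slots doing useful work in one bundle.

    co_issued_ops is a hypothetical count of additional independent
    operations the compiler packs alongside the vecN operation.
    """
    if not 1 <= vector_width <= VLIW_SLOTS:
        raise ValueError("vector_width must be between 1 and 4")
    return min(VLIW_SLOTS, vector_width + co_issued_ops) / VLIW_SLOTS


def gcn_lane_utilization(vector_width):
    """Idealised GCN case: every SIMD lane handles one work-item and a vecN
    operation simply becomes N instructions, so lanes stay busy at any width."""
    if not 1 <= vector_width <= 4:
        raise ValueError("vector_width must be between 1 and 4")
    return 1.0


for width in (1, 2, 3, 4):
    print(width, vliw4_slot_utilization(width), gcn_lane_utilization(width))
```

In this model a scalar stream fills only a quarter of the VLIW slots unless the compiler finds something else to co-issue, while GCN's peak is independent of vector width. The measurements above are about how much of that peak is actually reached with short shaders.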

As for the reason for these results, I can only speculate. One of the possibilities is the driver not being properly optimized for OpenGL yet, so we may see a remedy for short shaders soon. I don't believe that the round-robin scheduling to one of the four vector units per CU and clock can be held completely responsible either. Rather, I think savings in the controller/issue logic farming out work to the vector units are part of the reason, as I've shown similar behaviour to occur with Nvidia's "economized" Gamer-Fermi GF104/b as well in an older article. There, GF104 also reaches its full potential only with longer instruction sequences.
Either way, this effect could help explain some cases of a less-than-expected performance increase over both the HD 6970 and GF100/b-based GPUs, because one of the shortcomings of Nvidia's architecture is, as you probably know, its own fillrate limitation to two pixels per Shader-Multiprocessor per clock, making maximum fillrate a function of the number of SMs rather than ROPs. Peak fillrates, as oftentimes reached in short, simple shaders, helped the performance of Cayman and Evergreen GPUs a bit. This fillrate advantage could have been expanded upon with GCN, but seems to be limited to cases where more than one or two instruction slots are used.

Important note (repeated!): GPU Bench is a rather old OpenGL program and as such, it is using the graphics pipeline, not the newly created compute workflow. That means it is very much possible, if not outright likely, that what is shown here only applies to graphics, more precisely, to graphics running into a rasterizer limit. It therefore may not reflect the pure compute capabilities of Tahiti and/or Graphics Core Next, but it can provide some insight as to why Tahiti sometimes in graphics, especially games, shows much less of an improvement over Cypress than you would think going by the numbers (+40 percent), even without adding the new architecture on top of that.

Please note that though I'm only showing the MAD graphs here, the situation is the same for all of the more common instructions based around multiply-add. To round out the picture painted here, I've run similar tests with the transcendental functions, which are handled by the vector units in GCN at quarter rate.

With Tahiti, none of them exhibits the same behaviour as the single-cycle functions above. Apart from a modest increase in the single-digit percent range, we see nothing out of the ordinary here running the tests on the Radeon HD 7970. With the Geforce though, the RCP function behaves a little bit strangely, possibly due to driver interference. Its issue rate is suspiciously high and scales very well with both vector width and length of the instruction sequence.

Finally, GPU Bench can be configured to measure the precision of some of the transcendentals, namely RSQ, RCP, SIN, COS, EX2 and LOG2. Here's a little comparison of the precision measured in the interval 0.001 to 1.5707963267949 (half-Pi) in the upper part and 1.0e-12 to 0.001 in the lower part, as per what GPU-Bench's results page suggests. Take those with a grain of salt though, because I know of at least one instance of a driver update resulting in a change in precision (which was improved from Catalyst 6.xx to 7.xx for the Radeon X1000 series). I've thrown in some older generations of cards going back to the Geforce 7 and Radeon X1k series. The Geforce table is smaller because from G80 to GF110, Nvidia did not change the precision of these functions according to the GPU-Bench results.

About GPU-Bench 1.21
This little application was designed back in the days of the last DirectX 9 cards in order to measure their ability to be abused for general calculations. Since DirectX 9 wasn't suited for those tasks and is generally not very well regarded among researchers who want to run their programs under various flavours of Linux, GPU-Bench consequently uses OpenGL. A variety of tests can be run, out of which we'll be focusing on the instruction issue rate for the GPU, which is documented here.

In short, GPU-Bench uses a stream of simple ARB 1.0 instructions such as ADD, MUL or MAD, as well as the more complex transcendental functions, operating on two sets of registers in order to remove dependencies. I have configured the test to run shaders on scalars as well as on vector widths ranging from 2 to 4, with sequence lengths from 48 up to 512 instructions.
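To illustrate what such an instruction stream might look like, here's a minimal sketch that generates an ARB fragment program with a configurable number of MAD instructions alternating between two temporary registers. This is not GPU-Bench's actual shader source, which I haven't reproduced here. The generator function, the register and constant names, the constant values and the way inputs and outputs are wired up are all my own assumptions for illustration.

```python
# Write masks select how many components each instruction operates on.
WRITE_MASKS = {1: ".x", 2: ".xy", 3: ".xyz", 4: ""}


def make_issue_rate_shader(num_instructions, vector_width):
    """Build an ARB fragment program string with num_instructions MADs.

    Hypothetical helper: alternates between r0 and r1 so that consecutive
    instructions do not depend on each other.
    """
    mask = WRITE_MASKS[vector_width]
    lines = [
        "!!ARBfp1.0",
        "TEMP r0, r1;",
        "PARAM c0 = {0.5, 0.5, 0.5, 0.5};",    # arbitrary constants
        "PARAM c1 = {0.25, 0.25, 0.25, 0.25};",
        "MOV r0, fragment.texcoord[0];",       # setup, not part of the count
        "MOV r1, fragment.texcoord[0];",
    ]
    for i in range(num_instructions):
        reg = f"r{i % 2}"                      # alternate the two register sets
        lines.append(f"MAD {reg}{mask}, {reg}, c0, c1;")
    lines.append("ADD result.color, r0, r1;")  # keep both chains live
    lines.append("END")
    return "\n".join(lines)


shader = make_issue_rate_shader(48, 2)
print(shader.count("MAD "))  # 48
```

Calling make_issue_rate_shader(512, 1) would produce the longest scalar case tested here. The final ADD is only there so that a compiler cannot discard one of the two chains as dead code.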