Memory Bandwidth and GPU Performance

Memory Insights

To understand the deviations from our model, the first step is to look more carefully at some of the unexpected data points (such as those we discussed on the prior page). Table 1 contains information about 15 different graphics cards from our data set, including the model, process technology, DirectX version, shader count and frequency, memory interface width and transfer rate, and the measured performance. The data is organized into pairs (and in one case a triplet) of cards that have relatively similar shader throughput but very different memory bandwidth and 3DMark scores. The columns on the right contain GFLOP/s, GB/s and 3DMark scores normalized to the slowest of the cards in each pair.

Looking at the data for AMD, it is clear that memory bandwidth plays a substantial role in graphics performance. We identified three pairs of graphics cards with similar GFLOP/s, but very different 3DMark scores. In each example, one of the two graphics cards has ~2X the memory bandwidth of the other and higher performance as a result. The Radeon HD 3870 and 4670 were the pair we mentioned on the earlier page. The 3870 has 2.13X the memory bandwidth of the 4670, which translates into the 36% better performance that we already observed. In a similar vein, the Radeon 4870 and 4850 achieve 14% and 27% higher 3DMark scores than their bandwidth-starved cousins. The 4870’s result actually understates the advantage of memory bandwidth, because its shaders are less capable than the 5850’s – about 12% lower in terms of GFLOP/s. Altogether, the 78% higher memory bandwidth raises the performance by 30%.
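
One way to read the 30% figure for the 4870 is that its 14% lead is divided by its relative shader throughput of roughly 0.88. This is an assumption on our part about how the two numbers combine, not a documented step, and it presumes performance would otherwise scale linearly with GFLOP/s. A minimal sketch of that arithmetic (the function name is illustrative):

```python
def shader_adjusted_gain(score_ratio: float, gflops_ratio: float) -> float:
    """Scale a 3DMark score ratio by the inverse of the GFLOP/s ratio.

    Assumes performance would scale linearly with shader throughput,
    a simplification made only for this illustration.
    """
    if gflops_ratio <= 0:
        raise ValueError("gflops_ratio must be positive")
    return score_ratio / gflops_ratio


# Figures quoted in the text: 14% higher score, about 12% lower GFLOP/s.
adjusted = shader_adjusted_gain(score_ratio=1.14, gflops_ratio=0.88)
print(f"Shader-adjusted gain: {(adjusted - 1) * 100:.1f}%")
```

Under that assumption the adjusted gain lands close to the 30% quoted above.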

Table 1 – Selected AMD and Nvidia GPUs

The Nvidia results are equally promising, but show a more varied response to memory bandwidth. The GeForce 9800M, 160M and 420M are the three cards we identified earlier – each with 192 GFLOP/s. But the GeForce 420M’s 128-bit memory interface has half the width and throughput of the other two. As a result, the two better-balanced cards have 33% and 46% higher performance on 3DMark. A similar result can be observed for the 360M and 435M graphics cards – they have 250-255 GFLOP/s, but the 360M has twice the bandwidth and 45% higher performance.

The Quadro FX 2800M and GeForce 9800M show a much stronger relationship between bandwidth and performance – nearly linear. With only 25% more bandwidth, the performance rises by about 18% – which is very surprising, given the Bytes/FLOP ratio. Both cards are around 0.2, which is in line with most of the other cards in our data set. So it is somewhat puzzling why the gain from the extra bandwidth was so strong.
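
For readers unfamiliar with the Bytes/FLOP ratio, it is simply memory bandwidth divided by shader throughput. The sketch below uses placeholder figures that are not taken from Table 1; they are chosen only so that the baseline lands at 0.2 and the second card has 25% more bandwidth.

```python
def bytes_per_flop(bandwidth_gb_s: float, throughput_gflop_s: float) -> float:
    """Memory bytes available per floating-point operation."""
    if throughput_gflop_s <= 0:
        raise ValueError("throughput must be positive")
    return bandwidth_gb_s / throughput_gflop_s


# Placeholder card, NOT a row from Table 1.
base = bytes_per_flop(bandwidth_gb_s=40.0, throughput_gflop_s=200.0)

# Same shader throughput with 25% more bandwidth.
boosted = bytes_per_flop(bandwidth_gb_s=40.0 * 1.25, throughput_gflop_s=200.0)

print(f"baseline: {base:.2f} B/FLOP, with 25% more bandwidth: {boosted:.2f} B/FLOP")
```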

The last example pair is the 335M and 4200M, which show somewhat less benefit from bandwidth. The 335M has nearly triple the bandwidth of the 4200M, identical shader throughput, and about 40% higher performance.
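
The pairs discussed in this section can be put on a common footing by asking what exponent k would satisfy performance ratio = (bandwidth ratio)^k. The sketch below uses only ratios quoted in the text. "Nearly triple" is entered as 3.0, which is our assumption, and the 4850 pair is left out because its exact bandwidth ratio is not stated. The labels and the helper name are illustrative.

```python
import math

# (label, bandwidth ratio, performance ratio), as quoted in the text.
# The 3.0 for the 335M pair is an assumed stand-in for "nearly triple".
PAIRS = [
    ("3870 vs 4670", 2.13, 1.36),
    ("4870 vs 5850", 1.78, 1.30),
    ("192 GFLOP/s card vs 420M (+33%)", 2.00, 1.33),
    ("192 GFLOP/s card vs 420M (+46%)", 2.00, 1.46),
    ("360M vs 435M", 2.00, 1.45),
    ("FX 2800M / 9800M pair", 1.25, 1.18),
    ("335M vs 4200M", 3.00, 1.40),
]


def implied_exponent(bandwidth_ratio: float, performance_ratio: float) -> float:
    """Return k such that performance_ratio == bandwidth_ratio ** k."""
    return math.log(performance_ratio) / math.log(bandwidth_ratio)


exponents = {label: implied_exponent(bw, perf) for label, bw, perf in PAIRS}

for label, k in exponents.items():
    print(f"{label:34s} k = {k:.2f}")
```

An exponent of 1.0 would mean performance tracks bandwidth linearly. All of the quoted pairs fall well below that, with the FX 2800M / 9800M pair the closest to it.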

Implications

The data is unambiguous in support of our hypothesis – increasing the available bandwidth has a tremendous impact on performance and can readily explain many of the shortcomings in our performance model.

The selected data suggests that performance scales non-linearly with memory bandwidth. The scaling itself will depend on the nature of the workload and the GPU architecture. For an architecture and workload that are extremely starved for bandwidth, the scaling is probably stronger and at a certain point, additional compute resources won’t even increase performance. For example, a game with relatively simple shaders is more likely to be bandwidth limited. The same is true for a hardware platform using DDR3, as opposed to the faster GDDR5. On the other hand, aggressively taking advantage of caching could reduce the bandwidth requirements of an application.
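
The point about compute resources no longer helping can be illustrated with a roofline-style bound. This is a standard simplification, not the performance model used in this article. The card and workload figures below are hypothetical.

```python
def attainable_gflops(peak_gflops: float, bandwidth_gb_s: float,
                      flops_per_byte: float) -> float:
    """Roofline-style bound: the lower of the compute and memory limits."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)


# Hypothetical workload: 2 FLOPs performed per byte fetched from DRAM.
intensity = 2.0

# Hypothetical bandwidth-starved card.
starved = attainable_gflops(400.0, 25.0, intensity)

# Doubling the shaders changes nothing while memory is the limit...
more_compute = attainable_gflops(800.0, 25.0, intensity)

# ...whereas doubling the bandwidth doubles the attainable rate.
more_bandwidth = attainable_gflops(400.0, 50.0, intensity)

print(starved, more_compute, more_bandwidth)
```

In this simplified picture, better caching shows up as a higher flops_per_byte value, which raises the memory-side limit without any change to the DRAM interface.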

What is intriguing about the examples above is that both AMD and Nvidia graphics cards show similar sensitivity to the memory interface. In most of the cases we analyzed, 2X higher memory bandwidth yielded ~30% better 3DMark Vantage GPU performance. A good estimate is that performance scales with the cube root of memory bandwidth, as long as the memory/computation balance is roughly intact.
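
As a quick check on how the cube-root rule relates to the ~30% observation, the sketch below computes the speedup a cube-root law predicts for doubled bandwidth, and the exponent that would reproduce exactly 30% per doubling. The function name and default are illustrative.

```python
import math


def predicted_speedup(bandwidth_ratio: float, exponent: float = 1.0 / 3.0) -> float:
    """Speedup under a power law: performance ~ bandwidth ** exponent."""
    return bandwidth_ratio ** exponent


# Cube-root prediction for 2X the bandwidth.
doubling = predicted_speedup(2.0)

# Exponent that would give exactly 30% per doubling of bandwidth.
exponent_for_30pct = math.log2(1.30)

print(f"cube root of 2X bandwidth: {(doubling - 1) * 100:.0f}% faster")
print(f"exponent matching 30% per doubling: {exponent_for_30pct:.2f}")
```

The cube-root rule is slightly conservative relative to the ~30% figure, which is consistent with calling it an estimate.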

One of the most interesting implications of our analysis applies to future generations of integrated GPUs such as AMD’s 32nm Llano and Intel’s 22nm Ivy Bridge. A key difference between integrated GPUs and discrete GPUs is that a modest graphics card like the Radeon 6670 has 64GB/s of memory bandwidth, while high-end client microprocessors of today are targeted at roughly 21GB/s. A microprocessor with integrated graphics will have to share a much more limited amount of bandwidth between both the CPU cores and GPU. Consider a hypothetical discrete GPU which is integrated into a microprocessor, reducing the available bandwidth by a factor of 4. A good guess is that the integrated version will have about 1.5-1.7X lower performance than the discrete equivalent, due to the loss in bandwidth.
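
The bandwidth figures and the slowdown estimate can be reproduced with a few lines of arithmetic. The memory configurations below are our assumptions about where 64GB/s and 21GB/s come from (a 128-bit GDDR5 interface at 4 GT/s, and dual-channel DDR3-1333); the text itself does not state them. The two slowdown figures apply the cube-root rule and the 30%-per-doubling observation to a 4X bandwidth reduction.

```python
import math


def bandwidth_gb_s(bus_width_bits: int, transfer_rate_gt_s: float) -> float:
    """Peak bandwidth: bytes per transfer times transfers per second."""
    return bus_width_bits / 8 * transfer_rate_gt_s


# Assumed configurations, not stated in the text.
discrete = bandwidth_gb_s(128, 4.0)     # 128-bit GDDR5 at 4 GT/s
shared = bandwidth_gb_s(128, 1.333)     # two 64-bit DDR3-1333 channels

reduction = 4.0  # bandwidth reduction factor used in the text

# Cube-root rule: performance ~ bandwidth ** (1/3).
slowdown_cube_root = reduction ** (1.0 / 3.0)

# Observation-based rule: 30% per doubling of bandwidth.
slowdown_30pct = 1.30 ** math.log2(reduction)

print(f"discrete: {discrete:.1f} GB/s, shared: {shared:.1f} GB/s")
print(f"estimated slowdown: {slowdown_cube_root:.2f}X to {slowdown_30pct:.2f}X")
```

Note that the shared figure is an upper bound for the GPU, since the CPU cores draw on the same interface.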

Naturally, this is one of the reasons that Intel uses Sandy Bridge’s L3 cache for the GPU and AMD’s chipsets have an optional side-port memory. The need for high bandwidth DRAM to accompany powerful GPUs is also one of the driving factors behind AMD and Intel’s efforts on 3D integration and packaging. The bottom line is that when it comes to graphics performance, memory bandwidth has a huge impact as our performance model makes clear. This raises the question though – when will integrated GPUs start shipping with high bandwidth memory interfaces, this year or the next?