CUDA, Supercomputing for the Masses: Part 21

By Rob Farber, November 19, 2010

The Fermi architecture and CUDA

In CUDA, Supercomputing for the Masses Part 20, I focused on the analysis capability of Parallel Nsight v1.0, coupled with the NVIDIA Tools Extension (NVTX) library, to illustrate asynchronous I/O, hybrid CPU/GPU computing, and the ability of primitive restart to dramatically accelerate OpenGL rendering in CUDA applications. (Note that Parallel Nsight 1.5 has since been released; it is compatible with Visual Studio 2010 and further refines the Parallel Nsight experience.)

This article will focus on Fermi and the architectural changes that significantly broaden the types of applications that map well to GPGPU computing while maintaining the performance benefits provided by previous generations of CUDA-enabled GPUs. Particular attention will be paid to how the Fermi architecture affects CUDA memory spaces. I will also discuss how the Fermi architecture, with error correction and other robustness features, moves GPU computing into mainstream 24/7 production computing.

Fermi is the internal name NVIDIA uses for the GF100 architecture, which adds many capabilities that overcome computational limitations of the previous G80 and follow-on GT200 architectures. Variants of the Fermi architecture are used in the GeForce 400 and Tesla 20-series products.

Aside from creating opportunities throughout science and industry for the developers of GPU software, this ubiquity (as illustrated by NVIDIA's claim of 250+ million CUDA-enabled GPUs sold to date) has driven the evolution of CUDA-enabled GPU architectures so they can efficiently run applications in more problem domains. Further, it has forced GPU hardware designers to harden GPGPU technology against common errors so that many GPGPUs can simultaneously be used in 24/7 production environments to reliably run applications for extended periods measured in days, weeks and months. Examples include rendering farms that create animated movies and supercomputers that run some of the largest physics simulations in the world.

Fermi's hardware enhancements include:

Dual-dispatch scheduling that allows better utilization of the SFU (Special Function Unit), integer, and other pipelines.

Hardware that accelerates small conditional branches and predication.

Improved speed and accuracy of various math operations.

All these hardware capabilities translate into a higher-performance, more generalized, CPU-like GPGPU programming experience that can efficiently support a broader range of applications. Kudos to the CUDA software development teams, who have leveraged these capabilities to further increase performance and to support on Fermi-architecture GPGPUs (a short example follows this list):

Recursive functions.

Function pointers.

(Note: because Fermi supports true function calls, not all device functions are inlined by default; in CUDA 3.2, use __forceinline__ on a function to force inlining again.)

C++ features such as:

Virtual functions.

On GPU "new" and "delete" operators for dynamic objects on the GPU.

Try/catch/throw exception handling.
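
As a small illustration of these capabilities, here is a minimal sketch of device-side recursion, assuming a Fermi-class (compute capability 2.0) GPU and compilation with nvcc -arch=sm_20. The __forceinline__ helper echoes the CUDA 3.2 note above; the function and kernel names are illustrative.

#include <cstdio>
#include <cuda_runtime.h>

// Forced inlining, per the CUDA 3.2 note above.
__device__ __forceinline__ int mul(int a, int b) { return a * b; }

// A recursive device function -- legal on compute capability 2.0 hardware.
__device__ int factorial(int n)
{
    return (n <= 1) ? 1 : mul(n, factorial(n - 1));
}

__global__ void factKernel(int *out, int n)
{
    *out = factorial(n);   // a single thread computes n! recursively
}

int main()
{
    int *d_out, h_out = 0;
    cudaMalloc(&d_out, sizeof(int));
    factKernel<<<1, 1>>>(d_out, 5);
    cudaMemcpy(&h_out, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    printf("5! computed on the GPU = %d\n", h_out);   // expect 120
    cudaFree(d_out);
    return 0;
}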

Please see the Fermi Compatibility Guide to understand how Fermi-related changes affect the nvcc compiler command-line arguments used to build Fermi CUDA applications.
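
For example, targeting Fermi's compute capability 2.0 typically means passing an explicit architecture flag to nvcc; the file and application names below are placeholders.

nvcc -arch=sm_20 myapp.cu -o myapp

# or, with the more explicit -gencode form:
nvcc -gencode arch=compute_20,code=sm_20 myapp.cu -o myapp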

Fermi architecture products include the GF100 and variants classified as the GF104/106/108 and the just-released GF110 series. Differences include:

Various features of the GF100 architecture are available only on the more expensive Tesla series of cards. On consumer products, double-precision performance is limited to a quarter of that of the "full" Fermi architecture, and error checking and correcting (ECC) memory is disabled.
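
One quick way to see what a particular card reports is to query the CUDA runtime. The minimal sketch below prints each device's compute capability and whether ECC is enabled; the output format is illustrative.

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        printf("Device %d: %s, compute capability %d.%d, ECC %s\n",
               dev, prop.name, prop.major, prop.minor,
               prop.ECCEnabled ? "enabled" : "disabled");
    }
    return 0;
}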

Vasily Volkov has an excellent set of slides, "Programming inverse memory hierarchy: case of stencils on GPUs," discussing how Fermi follows the trend toward an inverse memory hierarchy. The basic idea is that registers and on-chip local memory are fast and can scale with local processors and massive numbers of threads. This suggests that the working data of tiled/stencil algorithms on such systems should be stored in the upper, larger levels of the memory hierarchy, such as registers, instead of the traditionally used lower, smaller levels, such as caches or local stores. (His slides "Better Performance at Lower Occupancy," from the 2010 GTC conference, are also an excellent source of information.)

Fermi, along with other processors, seems to be following this trend toward an inverse memory hierarchy, as shown in the table created with data from "Programming inverse memory hierarchy: case of stencils on GPUs." Note how the gap in memory hierarchy ratios decreases as parallelism increases. Succinctly, a single thread won't see the inverse hierarchy. The inversion is caused by massive parallelism, which motivates the use of large amounts of thread-local data that can scale along with the number of processing elements. This also ties into the ILP (Instruction Level Parallelism) discussion in the "Registers and warp scheduling" section of this article; see Table 2.
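
To make the register emphasis concrete, here is a minimal sketch of a 1-D three-point stencil that slides its working values through registers instead of staging a tile in shared memory. The kernel name, the ELEMS_PER_THREAD blocking factor, and the interior-only boundary handling are illustrative assumptions, not Volkov's actual code.

#define ELEMS_PER_THREAD 4   // illustrative register-blocking factor

// Each thread produces ELEMS_PER_THREAD outputs, keeping the three
// stencil inputs in registers and shifting them along the row.
__global__ void stencil1D(const float *in, float *out, int n)
{
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * ELEMS_PER_THREAD;
    if (base == 0 || base + ELEMS_PER_THREAD >= n)
        return;   // interior points only; boundaries handled elsewhere

    float left   = in[base - 1];
    float center = in[base];
    for (int i = 0; i < ELEMS_PER_THREAD; ++i) {
        float right = in[base + i + 1];
        out[base + i] = 0.25f * left + 0.5f * center + 0.25f * right;
        left   = center;   // shift the register window
        center = right;
    }
}

Because the per-thread state is just a handful of scalars, the working set grows with the thread count, which is exactly the scaling the inverse-hierarchy argument relies on.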

