NVIDIA Fermi Next Generation GPU Architecture Overview

Fermi Architecture continued

I mentioned “warps” earlier: the groups of 32 threads that a single SM processes together.

Each SM features a pair of warp schedulers and instruction dispatch units that allow two warps to be issued and executed concurrently on the CUDA cores. Each scheduler issues one instruction from its warp to a group of 16 cores, to the 16 load/store units, or to the special function units; because the warps execute independently, the scheduler does not need to check for dependencies between them. This dual-issue model will apparently allow Fermi to reach close to its theoretical performance limits.
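As a rough mental model of the dual-issue scheme described above, here is a small Python sketch that estimates how many cycles one SM needs to issue a batch of warp instructions. The function name and the two-passes-per-warp figure are my own assumptions (32 threads spread over 16 cores), not something NVIDIA has stated, and real hardware has many more details.

```python
import math

WARP_SIZE = 32        # threads per warp (from the article)
LANES_PER_ISSUE = 16  # cores that receive one warp's instruction (from the article)


def cycles_for_warp_instructions(num_instructions: int, schedulers: int = 2) -> int:
    """Estimate cycles for one SM to work through warp instructions.

    Assumption (hypothetical model): a 32-thread warp needs
    WARP_SIZE // LANES_PER_ISSUE passes over its 16 cores, and each
    scheduler handles one warp instruction at a time.
    """
    passes_per_instruction = WARP_SIZE // LANES_PER_ISSUE  # 2 passes
    rounds = math.ceil(num_instructions / schedulers)
    return rounds * passes_per_instruction


# Eight warp instructions with two schedulers versus a single scheduler.
dual = cycles_for_warp_instructions(8, schedulers=2)
single = cycles_for_warp_instructions(8, schedulers=1)
print(dual, single)
```

In this simplified model the second scheduler halves the cycle count, which is the intuition behind the dual-issue claim.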

In our SM diagram above you can also see a 64KB block of shared memory and L1 cache. This memory is unique in that it is configurable either as 48KB of shared memory and 16KB of L1 cache or as 16KB of shared memory and 48KB of L1 cache. This option was required to guarantee 100% backwards compatibility with existing GPU-based applications but it also provides flexibility to the developer based on their programs’ needs.
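To make the two allowed splits concrete, here is a minimal Python sketch of the choice a developer faces. The names `SmMemoryConfig` and `configure_sm_memory` are hypothetical and exist only for illustration; they are not part of any NVIDIA API.

```python
from typing import NamedTuple

TOTAL_KB = 64               # combined shared memory + L1 per SM (from the article)
VALID_SHARED_KB = (16, 48)  # the two splits the article describes


class SmMemoryConfig(NamedTuple):
    shared_kb: int
    l1_kb: int


def configure_sm_memory(shared_kb: int) -> SmMemoryConfig:
    """Return the shared/L1 split for a requested shared memory size.

    Hypothetical helper: only the two splits named in the article are accepted.
    """
    if shared_kb not in VALID_SHARED_KB:
        raise ValueError(f"shared memory must be one of {VALID_SHARED_KB} KB")
    return SmMemoryConfig(shared_kb=shared_kb, l1_kb=TOTAL_KB - shared_kb)


print(configure_sm_memory(48))  # shared-memory-heavy kernels
print(configure_sm_memory(16))  # cache-heavy kernels
```

In actual CUDA code the preference is expressed per kernel through the runtime call `cudaFuncSetCacheConfig`, using `cudaFuncCachePreferShared` or `cudaFuncCachePreferL1`.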

Here you can see a specification breakdown of the new GPU architecture compared to G80 and GT200. At this time NVIDIA is not making any claims against current or upcoming AMD designs, though whether that is because NVIDIA would not compare favorably or because the company is simply taking the high ground remains to be seen.

Besides these raw compute capabilities, there are some new features that NVIDIA is hoping will help Fermi differentiate itself from the competition. The first is a new ISA (instruction set architecture) that is updated to support the most popular programming language today: C++. By including support for a unified address space, NVIDIA’s architecture can now support object-oriented programming models in which pointers can refer to any location in memory without restriction. This feature alone could draw a lot of developers into the world of CUDA and GPU computing.

NVIDIA was quick to point out that this new ISA, and the architecture in general, is completely ready for OpenCL and DirectCompute. The sharing of key abstractions like threads, blocks and grids is key to the optimization for these upcoming compute languages.

The new parallel thread execution model implements improved branching support through predication. Rather than jumping around short pieces of branching code (if-else), predication lets each instruction be enabled or skipped per thread based on a condition flag, and that should improve the performance of both gaming and GPU computing code. The name sounds very similar to the branch prediction units that AMD implemented on its GPUs a couple of generations ago, though predication avoids the branch rather than guessing its outcome.
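Predication, as the term is generally used, means guarding each instruction with a per-thread true/false flag instead of branching. The Python sketch below models that idea for a handful of threads; it is only an illustration of the general technique, the function name is hypothetical, and it says nothing about how Fermi implements it internally.

```python
def run_predicated(values):
    """Model `if x >= 0: x * 2 else: -x` across threads using predicates.

    Every thread steps through both instructions; the predicate decides
    whether each one is allowed to write its result.
    """
    pred = [x >= 0 for x in values]  # one predicate bit per thread
    out = list(values)

    # "then" instruction, enabled only where the predicate is true
    out = [x * 2 if p else o for x, p, o in zip(values, pred, out)]

    # "else" instruction, enabled only where the predicate is false
    out = [-x if not p else o for x, p, o in zip(values, pred, out)]
    return out


print(run_predicated([3, -1, 0, -5]))  # prints [6, 1, 0, 5]
```

No thread ever takes a different path through the code, so all of them stay in lockstep; the cost is that both sides of the if-else are stepped through by every thread.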

Memory Subsystem Innovations

While we have already discussed the benefits of the 64KB of shared memory/L1 cache, there are other changes that NVIDIA has made with Fermi to improve computing performance.

Applications that benefit from additional shared memory will have that option, up to 48KB, but will still have access to the L1 cache that is unique to this design. The L1 also stores temporary register spills and thus can improve overall memory access time.

NVIDIA also included a new L2 cache of 768KB that is shared and coherent across all 16 SMs in the GPU. The L2 cache improves communication between the various SMs for applications that span more than a single set of 32 CUDA cores.

NVIDIA has also taken the step to implement all major internal memories with support for ECC. While not a consumer-based issue, for very large server processing farms that have to worry about single bit-flips due to random radiation, ECC is a key component of a stable environment. The GDDR5 memory controller supports ECC as do the internal registers, L1 and L2 caches.
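To show what correcting a single bit flip involves, here is a textbook Hamming(7,4) code in Python. This is only an illustration of the general idea behind ECC; the article does not say which code Fermi uses, and this sketch should not be read as NVIDIA's scheme. The function names are my own.

```python
def hamming_encode(data):
    """Encode 4 data bits into a 7-bit Hamming(7,4) codeword."""
    d1, d2, d3, d4 = data
    p1 = d1 ^ d2 ^ d4
    p2 = d1 ^ d3 ^ d4
    p4 = d2 ^ d3 ^ d4
    # Bit positions 1..7: p1 p2 d1 p4 d2 d3 d4
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming_decode(code):
    """Correct up to one flipped bit; return (data bits, error position or 0)."""
    c = list(code)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4  # 1-based index of the bad bit, 0 if clean
    if position:
        c[position - 1] ^= 1         # flip the bad bit back
    return [c[2], c[4], c[5], c[6]], position


word = hamming_encode([1, 0, 1, 1])
word[4] ^= 1                          # simulate a stray bit flip
print(hamming_decode(word))           # prints ([1, 0, 1, 1], 5)
```

The three parity bits are enough to pinpoint which of the seven bits flipped, so the original data comes back intact.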

GigaThread Scheduler

The updated thread scheduler offers two new features with Fermi worth discussing. The first is vastly improved context switching performance – down to as little as 10-20 microseconds. Context switching is used when the GPU needs to swap between applications; for example switching between graphics rendering and PhysX processing. This could allow for developers to use more of the GPU compute power for non-graphics purposes if the penalty for doing so is reduced from a performance perspective.

The second major update is with concurrent kernel execution which I like to think of as HyperThreading for the GPU.

This allows a program that runs a number of small kernels (each using only a few SMs and CUDA cores) to better utilize the entire GPU by running several of those kernels simultaneously. For this to work the kernels need to belong to the same GPU context, so to return to the earlier example, you would not be able to run graphics and PhysX processing concurrently this way.
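A back-of-the-envelope Python model shows why this matters for small kernels. It assumes identical kernels that each occupy a fixed number of the 16 SMs for a fixed time, and it ignores launch overhead and any hardware limit on how many kernels can run at once; the function name and those simplifications are mine.

```python
import math

TOTAL_SMS = 16  # SMs in the full Fermi GPU (from the article)


def kernel_run_times(num_kernels: int, sms_per_kernel: int, ms_per_kernel: float):
    """Return (serial_ms, concurrent_ms) for identical small kernels.

    Hypothetical model: serial execution runs one kernel at a time,
    concurrent execution packs as many kernels as fit onto the SMs.
    """
    if not 1 <= sms_per_kernel <= TOTAL_SMS:
        raise ValueError("sms_per_kernel must be between 1 and 16")
    serial_ms = num_kernels * ms_per_kernel
    kernels_per_wave = TOTAL_SMS // sms_per_kernel
    concurrent_ms = math.ceil(num_kernels / kernels_per_wave) * ms_per_kernel
    return serial_ms, concurrent_ms


# Eight kernels that each need 4 SMs for 1 ms.
print(kernel_run_times(8, 4, 1.0))  # prints (8.0, 2.0)
```

Run one at a time, each of these kernels leaves 12 of the 16 SMs idle; run four at a time, the whole chip stays busy.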

Final Thoughts

NVIDIA has shown us only the first taste of its new Fermi architecture today, and it claims to have radically adjusted the GPU’s role, purpose and capability. NVIDIA did not just add new execution units to the core (though it did do that) but also chose to improve performance with a new memory hierarchy, a configurable L1 cache, a global L2 cache and ECC support. Double precision performance gets a big boost over the GT200 design, though we have yet to determine how well it will compete with AMD’s Evergreen in raw compute.

NVIDIA CEO holds up the first Fermi reference card

NVIDIA also continues to push its CUDA architecture and support for other programming models besides DirectCompute and OpenCL. It would be hard to deny that NVIDIA has had success with its proprietary CUDA architecture in the professional and academic worlds, if not with the consumer. Adding support for the C++ programming model will only further drive the NVIDIA architecture into this market.

Since this is a Tesla card, having only one video output is not a big deal.

From a gaming angle, which is obviously one of our primary targets at PC Perspective, we don’t yet know how the Fermi architecture will apply. While I am doubtful that NVIDIA will be sharing any information about new products, frequencies, etc. during the GPU Tech Conference today, if we find anything out we will be sure to share it. But even if clock rates remain the same as we currently have on the GT200, the architecture should perform damn well; after all, we moved from 240 SPs to 512 SPs and have a new GDDR5 memory bus that is 384 bits wide. Everything else at this point is up in the air.

We also don’t know how soon anyone, gamers or professionals, will actually get hardware based on the Fermi architecture. If the persistent rumors are correct we are still looking at early 2010 for hardware – does that make this new design a “paper launch”? More or less, but as a journalist and fan of technology I would rather have this type of information earlier rather than later.