Introducing Fermi

Fermi Architectural Highlights

First and foremost, with Fermi, NVIDIA is taking a significantly different direction as to where GPU technology is headed. Sure, NVIDIA assures us that the new chips will still be efficient where gaming is concerned, but with Fermi, NVIDIA is also taking aim at high performance computing (HPC). This is no surprise given that the company has spent so much time, effort and money promoting its CUDA architecture as a means of speeding up applications that are suited to parallel execution across multiple cores. For more about CUDA, we refer you to an earlier article we've written here.

Fermi represents NVIDIA's first high-end chip to be manufactured using a 40nm process, and with over 3 billion transistors onboard, it is also NVIDIA's biggest and most ambitious yet. However, it is evolutionary rather than revolutionary, seeing that it still has its roots in the original G80 chip that debuted back in 2006. We've highlighted its most significant enhancements here:

GigaThread Engine

In essence, Fermi works in very much the same way as NVIDIA's previous generation GT200 and G80 chips. Commands from the CPU are read by the GPU via the host interface, after which the GigaThread Engine copies the data from system memory to the frame buffer, and then creates and dispatches thread blocks to the various Streaming Multiprocessors (SMs). The SMs in turn schedule "warps" (groups of 32 threads) to the CUDA cores (shader processors) and other execution units. Hence, rather than radically changing the way the GPU works, NVIDIA has sought to streamline these processes and make them quicker and more efficient. The reason the GigaThread Engine exists, and the way it works, are still similar to what we saw on the G80 chips in 2006 - only it is much more capable now.
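To make the block-and-warp hierarchy described above concrete, here is a minimal Python sketch. It is not NVIDIA's actual scheduler: the round-robin hand-out of thread blocks to SMs is an assumed simplification, and the function names are hypothetical. The warp size of 32 and the count of 16 SMs come from the article itself.

```python
import math

WARP_SIZE = 32  # threads per warp, as stated in the article
NUM_SMS = 16    # number of SMs on Fermi, per the article


def warps_per_block(threads_per_block):
    """Warps needed to cover one thread block (the last warp may be partly filled)."""
    return math.ceil(threads_per_block / WARP_SIZE)


def dispatch_round_robin(num_blocks, num_sms=NUM_SMS):
    """Assumed policy: block i goes to SM (i mod num_sms).

    Real hardware hands out blocks dynamically as SMs free up resources;
    round-robin is used here only to illustrate the idea.
    """
    assignment = {sm: [] for sm in range(num_sms)}
    for block in range(num_blocks):
        assignment[block % num_sms].append(block)
    return assignment


blocks = dispatch_round_robin(num_blocks=40)
print(warps_per_block(256))  # 8
print(blocks[0])             # [0, 16, 32]
```

A block of 256 threads therefore breaks down into 8 warps, and with 40 blocks in flight, the first SM in this toy model ends up with blocks 0, 16 and 32.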

The efficiency of the GigaThread Engine is crucial to performance, and the Fermi architecture improves on it by providing greater thread throughput through faster context switching, more refined thread scheduling and, most importantly, concurrent kernel execution. The end result is greater speed and efficiency in the scheduling and dispatching of threads to the Streaming Multiprocessors (SMs), which will benefit not only GPU computing but 3D gaming as well.

Streaming Multiprocessors (SM)

Speaking of SMs, Fermi is made up of 16 of them, each with 32 shader cores, for a grand total of 512 CUDA cores - more than twice the number the GeForce GTX 285 packs (240, in case you were wondering). And just for comparison's sake, that's four times the number of shader cores on the granddaddy of the current architecture, the GeForce 8800 GTX.
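The core-count arithmetic can be checked in a few lines of Python. This is just a quick sanity check of the article's figures; the 128-core count for the GeForce 8800 GTX is not quoted directly in the text but follows from the "four times" comparison.

```python
SMS = 16                 # SMs on Fermi, per the article
CORES_PER_SM = 32        # CUDA cores per SM, per the article
GTX_285_CORES = 240      # quoted in the article
GF_8800_GTX_CORES = 128  # implied by the article's "four times" comparison

fermi_cores = SMS * CORES_PER_SM

print(fermi_cores)                            # 512
print(round(fermi_cores / GTX_285_CORES, 2))  # 2.13
print(fermi_cores / GF_8800_GTX_CORES)        # 4.0
```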

One of the most important changes to these CUDA cores is that they have been specifically engineered to perform better in double precision applications. The change was brought about by requests from NVIDIA's partners in GPU computing, where double precision performance is critical. Fermi also implements the new IEEE 754-2008 floating-point standard. NVIDIA claims that double precision performance can be as much as 4 to 8 times faster on Fermi compared to the older GT200.

Memory Subsystem

To make the SMs even more efficient, NVIDIA has also made changes to the memory subsystem, one of which is to provide a larger and more flexible L1 cache. While the older GT200 chip only had 24KB of L1 cache, Fermi offers 64KB of on-chip memory per SM, which can be configured as either 16KB of shared memory and 48KB of cache, or as 48KB of shared memory and 16KB of cache. In addition, all 16 SMs will have access to a unified 768KB L2 cache, which is markedly larger than the GT200's 256KB L2 cache. All in all, these changes help make Fermi a better performer in HPC applications.
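As a rough illustration of how the configurable split might be chosen, the Python sketch below picks whichever of the two configurations leaves the most cache while still covering a given shared memory requirement. The helper `pick_split` and its selection rule are hypothetical and only model the two splits named in the article; they are not NVIDIA's API.

```python
L1_TOTAL_KB = 64
# The two splits described in the article: (shared memory KB, L1 cache KB)
SPLITS = [(16, 48), (48, 16)]


def pick_split(shared_needed_kb):
    """Hypothetical helper: return the split with the most cache
    that still provides at least shared_needed_kb of shared memory."""
    fitting = [s for s in SPLITS if s[0] >= shared_needed_kb]
    if not fitting:
        raise ValueError("no split offers that much shared memory")
    return max(fitting, key=lambda s: s[1])


print(pick_split(12))  # (16, 48)
print(pick_split(32))  # (48, 16)
```

In other words, work that leans heavily on shared memory would favour the 48KB/16KB arrangement, while everything else can enjoy the larger cache.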

That aside, the new Fermi GPU will also have six 64-bit wide GDDR5 memory controllers, which means a total memory bus width of 384 bits. This is down from the 512-bit wide memory bus on the older GT200. However, the support for GDDR5 memory (a first for an NVIDIA high-end GPU), which delivers twice the bandwidth of GDDR3, should make up the difference. Theoretical memory bandwidth of the GeForce GTX 480 is 177.4 GB/s as compared to the GeForce GTX 285's 159 GB/s.
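The bandwidth figures follow from bus width multiplied by effective data rate, which the short Python sketch below works through. The effective data rates used (3696 MT/s for the GeForce GTX 480 and 2484 MT/s for the GeForce GTX 285) are not given in the article; they are the reference memory speeds for those cards as best we know, so treat them as assumptions.

```python
def bandwidth_gb_s(bus_width_bits, data_rate_mt_s):
    """Theoretical bandwidth: bytes per transfer x mega-transfers per second, in GB/s."""
    bytes_per_transfer = bus_width_bits / 8
    return bytes_per_transfer * data_rate_mt_s / 1000  # MB/s -> GB/s


fermi_bus_bits = 6 * 64  # six 64-bit controllers, per the article

# Assumed effective data rates (not stated in the article)
gtx_480 = bandwidth_gb_s(fermi_bus_bits, 3696)
gtx_285 = bandwidth_gb_s(512, 2484)

print(fermi_bus_bits)     # 384
print(round(gtx_480, 1))  # 177.4
print(round(gtx_285, 1))  # 159.0
```

So despite the narrower bus, the higher data rate of GDDR5 is what puts the newer card ahead.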

On top of this, the new GPU will be the first in the world to support ECC (error-correcting code) memory and will also be capable of supporting up to 6GB of memory. ECC and 6GB of memory are probably overkill for gaming, but will hold great appeal for corporations and educational institutions that rely on GPUs for high performance, precision-critical computing.

As we have seen, most of the new implementations on Fermi are geared towards HPC users, so what does it actually mean for gamers? Fret not, because gaming is still very much in NVIDIA's blood and for that, we have to turn to the next page for NVIDIA's first chip featuring Fermi – the GF100.