
Growing up in the last millennium and reading a lot of science fiction, not to mention going on the daily quests for UFOs, has the advantage that quite a few of the new names popping up in the tech world are, in fact, very old acquaintances. The Little Green Men from Mars, the Arecibo message and, finally, the Fermi paradox were things any geek had to be familiar with. You didn't even have to be a geek; high school alone already qualified.

So I have been looking at Fermi for the last 60 years and finally the little green men, this time on sabbatical in Santa Clara, came out with it. Not a sex toy this time, though undeniably sexy, it is still somewhat different from what I anticipated half a century ago – no, I am lying, I am not that old yet.

To get back to the topic at hand: we are looking at nVidia's Fermi graphics processor / general-purpose graphics processing unit (GPGPU) and, truth be told, we have been hearing about it for almost as long as about the Fermi paradox. But it is finally here.

The first substantiated rumors and semi-facts about nVidia's Fermi architecture, a.k.a. the GF100 GPU, surfaced during the summer of 2009 and, at least according to the PR machinery behind it, it was going to be like nothing that had come before. And then, silence struck. There were a few press briefings to kindle the fires while AMD released their 5000 series and unleashed unprecedented performance. And then, there was again nothing from nVidia.

Arguably, the difficulty of manufacturing ICs increases exponentially with the complexity of the design and with die size. Add a new, unproven fab process and you have a recipe for some major handicaps. It would be lopsided to claim that nVidia was the only company affected by TSMC's difficulties in delivering sufficient yields on their 40 nm process but, on the other hand, as mentioned above, the GF100 GPGPU, at a 500 mm² die size and 3 billion transistors, is just a tad larger and more complex than the RV870 Cypress chip, which sports a measly 2.15 billion transistors on an area of 334 mm².
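To put those die figures into perspective, here is a back-of-the-envelope transistor-density comparison using only the numbers quoted above; the die sizes are approximate, so the result is a rough sketch rather than a precise measurement:

```python
# Rough transistor-density comparison, GF100 vs. RV870 Cypress,
# using the (approximate) figures quoted in the text above.
gf100_transistors, gf100_area_mm2 = 3.0e9, 500
rv870_transistors, rv870_area_mm2 = 2.15e9, 334

# Millions of transistors per square millimeter of die area.
gf100_density = gf100_transistors / gf100_area_mm2 / 1e6
rv870_density = rv870_transistors / rv870_area_mm2 / 1e6

print(f"GF100: {gf100_density:.1f} M transistors/mm^2")  # ~6.0
print(f"RV870: {rv870_density:.1f} M transistors/mm^2")  # ~6.4
```

The densities on the shared 40 nm process come out comparable; what sets GF100 apart is the sheer area, roughly 50% more silicon per die, which is exactly what hurts yields on an immature process.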

Whatever the contributing factors were, Fermi was late to the show and, after it finally debuted in limited quantities in the middle of spring, there is still no full version of the GF100 taking advantage of all processing units. Instead, there are two scaled-down versions, namely the GTX 480 and the GTX 470. Before going into detail on what is missing where, let's take a quick overview of the architecture.

In short, the GF100 chip is organized into four quadrants, or graphics processing clusters (GPCs), each featuring four Fermi streaming multiprocessors (SMs) for a total of 16 SMs. The four quadrants are not obvious from the functional diagrams but can be appreciated when looking at a die shot.

Functionally, the quadrants are primarily defined by one discrete raster engine per GPC, performing edge setup, rasterization and z-culling; otherwise, we have 16 fully interchangeable SMs, each of which features 32 CUDA cores, supplemented by 16 load/store units and four special function units (SFUs).
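The per-SM figures above can be tallied up into chip-wide totals with a trivial bit of arithmetic; this is a sketch based purely on the counts given in the text (4 GPCs × 4 SMs, with 32 CUDA cores, 16 load/store units and 4 SFUs per SM):

```python
# Chip-wide execution-resource tally for a full GF100,
# assuming the per-SM figures given in the text above.
GPCS = 4          # graphics processing clusters
SMS_PER_GPC = 4   # streaming multiprocessors per GPC
CORES_PER_SM = 32 # CUDA cores per SM
LDST_PER_SM = 16  # load/store units per SM
SFUS_PER_SM = 4   # special function units per SM

sms = GPCS * SMS_PER_GPC
print(f"SMs:         {sms}")                 # 16
print(f"CUDA cores:  {sms * CORES_PER_SM}")  # 512
print(f"LD/ST units: {sms * LDST_PER_SM}")   # 256
print(f"SFUs:        {sms * SFUS_PER_SM}")   # 64
```

Note that these are the totals for the full chip; the shipping GTX 480 and GTX 470 have one or more SMs disabled, so their effective core counts are lower.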

For reference, here is a quick recap of some of the stats and numbers of the Fermi GPU in comparison to the older generations of nVidia GPUs, that is, the G80 and GT200.

GPU                                       | G80               | GT200             | GF100
Transistors                               | 681 million       | 1.4 billion       | 3.0 billion
CUDA Cores                                | 128               | 240               | 512
Double-Precision Floating-Point Capability | None              | 30 FMA ops/clock  | 256 FMA ops/clock
Single-Precision Floating-Point Capability | 128 MAD ops/clock | 240 MAD ops/clock | 512 FMA ops/clock
Special Function Units (SFUs) per SM      | 2                 | 2                 | 4
Warp Schedulers per SM                    | 1                 | 1                 | 2
Shared Memory per SM                      | 16 KB             | 16 KB             | Configurable 48 KB or 16 KB
L1 Cache per SM                           | None              | None              | Configurable 16 KB or 48 KB
L2 Cache                                  | None              | None              | 768 KB
ECC Memory Support                        | No                | No                | Yes
Concurrent Kernels                        | No                | No                | Up to 16
Load/Store Address Width                  | 32-bit            | 32-bit            | 64-bit
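One number worth pulling out of that comparison is the ratio of double-precision to single-precision throughput per clock, since it shows where Fermi's compute ambitions lie; the following sketch just re-derives the ratios from the figures listed above:

```python
# Per-clock floating-point throughput from the comparison above
# (FMA/MAD ops per clock); the DP/SP ratio shows the generational jump.
specs = {
    "G80":   {"sp": 128, "dp": 0},
    "GT200": {"sp": 240, "dp": 30},
    "GF100": {"sp": 512, "dp": 256},
}

for gpu, ops in specs.items():
    ratio = ops["dp"] / ops["sp"]
    print(f"{gpu}: DP/SP ratio = {ratio:.3f}")
# GT200 manages 1/8 of its SP rate in double precision;
# GF100 jumps to 1/2, a clear nod to the HPC market.
```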

It is a bit difficult to compare the GF100 to the older generations on numbers alone, since there are more fundamental changes that heavily impact the functionality and capabilities of the GPU. From a hierarchical cache organization to a HyperThreading equivalent and ECC extended to the local frame buffer, the architectural changes are probably the biggest since the move from the GeForce2 to the GeForce4 MX.