Now that we’ve discussed the Tesla K20 series from the big-picture perspective of performance, configurations, pricing, and the marketplace, we can finally dive into the technical underpinnings of the K20.

Announced alongside the Tesla K20 back at NVIDIA’s GTC 2012 was the GPU that would be powering it: GK110. In a reversal of their usual pattern, GK110 was to be NVIDIA’s first compute-oriented Kepler GPU (GK10X having been significantly stripped down for gaming efficiency purposes), but it would be the last Kepler GPU to launch. Whereas in the Fermi generation we saw GF100 first and could draw some conclusions about the eventual Tesla cards from that, GK110 has been a true blank slate. On the other hand, because it builds upon NVIDIA’s earlier Kepler GPUs, we can draw a clear progression from GK104 to GK110.

GK110 is NVIDIA’s obligatory big-die GPU. We don’t have a specific die size, but at 7.1 billion transistors it is now the biggest GPU ever built in terms of transistors, dwarfing both the 3.5B transistor GK104 and the 4.3B transistor Tahiti GPU from AMD. These big-die GPUs are unwieldy from a fabrication and power consumption perspective, but the end result is that performance per GPU is unrivaled, because so many tasks (both graphical and compute) are embarrassingly parallel and map well to the large arrays of streaming processors found in a GPU.

Like GF100 before it, GK110 has been built to fill multiple roles. For today’s launch we’re mostly talking about it from a compute perspective – and indeed most of the die is tied up in compute hardware – but it also has all of the graphics hardware we would expect in an NVIDIA GPU. Altogether it packs 15 SMXes and 6 ROP/L2/memory controller blocks, versus 8 SMXes and 4 ROP/L2/memory controller blocks on GK104. Not accounting for clockspeeds, this gives GK110 87% more compute performance (2,880 CUDA cores versus 1,536) and 50% more memory bandwidth than GK104. But there’s a great deal more to GK110 than just a much larger collection of functional units.

NVIDIA GPU Comparison

| | Fermi GF100 | Fermi GF104 | Kepler GK104 | Kepler GK110 |
|---|---|---|---|---|
| Compute Capability | 2.0 | 2.1 | 3.0 | 3.5 |
| Threads/Warp | 32 | 32 | 32 | 32 |
| Max Warps/SM(X) | 48 | 48 | 64 | 64 |
| Max Threads/SM(X) | 1,536 | 1,536 | 2,048 | 2,048 |
| Register File (32-bit registers) | 32,768 | 32,768 | 65,536 | 65,536 |
| Max Registers/Thread | 63 | 63 | 63 | 255 |
| Shared Mem Config | 16KB / 48KB | 16KB / 48KB | 16KB / 32KB / 48KB | 16KB / 32KB / 48KB |
| Hyper-Q | No | No | No | Yes |
| Dynamic Parallelism | No | No | No | Yes |
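Two rows in that table, Hyper-Q and Dynamic Parallelism, are new capabilities rather than larger limits, and Dynamic Parallelism in particular changes how code can be structured: a compute capability 3.5 kernel may launch other kernels itself, without a round trip through the CPU. The following is a minimal sketch of our own (not NVIDIA’s sample code), which would be built with -arch=sm_35 -rdc=true and linked against cudadevrt:

```cuda
#include <cstdio>

__global__ void child(int parent_block)
{
    // Device-side printf, supported since compute capability 2.0.
    printf("child launched by block %d, thread %d\n", parent_block, threadIdx.x);
}

__global__ void parent()
{
    // With Dynamic Parallelism (CC 3.5), a kernel can launch child
    // kernels directly from the device; here, one launch per block.
    if (threadIdx.x == 0)
        child<<<1, 4>>>(blockIdx.x);
}

int main()
{
    parent<<<2, 32>>>();
    cudaDeviceSynchronize();   // wait for parents and their children
    return 0;
}
```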

Fundamentally, GK110 is a highly enhanced, if not equally specialized, version of the Kepler architecture. The SMX, first introduced with GK104, is the basis of GK110. Each GK104 SMX contained 192 FP32 CUDA cores, 8 FP64 CUDA cores, 256KB of register file space, 64KB of L1 cache, and 48KB of uniform cache. In turn it was fed by 4 warp schedulers, each with two dispatch units, allowing GK104 to issue instructions from warps in a superscalar manner.

GK110 SMX

GK110 builds on that by keeping the same general design, but tweaking it for GK110’s compute-focused needs. The single biggest change here is that rather than 8 FP64 CUDA cores per SMX, GK110 has 64, giving it 8 times the FP64 performance of a GK104 SMX. The SMXes are otherwise very similar at a high level, featuring the same 256KB of register file space, 64KB of L1 cache, 48KB of uniform cache, and the same warp scheduler structure. This of course does not include a number of low-level changes that further set apart GK104 and GK110.
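To put those FP64 units in workload terms, a double-precision kernel as simple as the DAXPY sketch below (our own illustration, not NVIDIA code) is bound by FP64 throughput; per SMX, GK110 can retire such work at 8 times the rate of GK104, clockspeeds aside.

```cuda
// y = a*x + y in double precision. Every thread performs one FP64
// multiply-add, so throughput tracks the number of FP64 CUDA cores:
// 8 per SMX on GK104 versus 64 per SMX on GK110.
__global__ void daxpy(int n, double a, const double *x, double *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}
```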

Meanwhile this comparison gets much more jarring if we look at GK110 versus GF100, and by extension Tesla K20 versus its direct predecessors, the Fermi-based Tesla family. Compared to the GF100 SM, the GK110 SMX is nothing short of a massive change. Superficially, NVIDIA has packed many more CUDA cores into an SMX than they did into an SM, owing to the change from a shader design that ran fewer CUDA cores at a very high (double-pumped) clockspeed to a design that runs many more CUDA cores at a lower (single-pumped) clockspeed, but in the process they have also turned their warp execution model on its head.

GF100/GF110 SM

GF100 was essentially a thread-level parallelism (TLP) design, with each SM executing a single instruction from up to two warps. At the same time certain math instructions had variable latencies, so GF100 utilized a complex hardware scoreboard to do the necessary scheduling. Compared to that, GK110 introduces instruction-level parallelism (ILP) to the mix, making the GPU reliant on a mix of high TLP and high ILP to achieve maximum performance. Each SMX now issues from 4 warps, ultimately executing up to 8 instructions at once if all of those warps have ILP-suitable instructions waiting. At the same time, scheduling has been moved from hardware to software, with NVIDIA’s compiler now statically scheduling warps, thanks to the fact that every math instruction now has a fixed latency. Finally, to further improve SMX utilization, FP64 instructions can now be paired with other instructions, whereas on GF100 they had to be issued on their own.
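To make the TLP-versus-ILP point concrete, consider a sketch like the following (our own illustration): the two accumulator chains are independent, so a scheduler’s two dispatch units can issue both in the same cycle from a single warp, whereas one long dependent chain would leave the second dispatch slot empty.

```cuda
// Strided sum with two independent accumulator chains. Because acc0 and
// acc1 never depend on each other, the compiler's static schedule can
// dual-issue their instructions, exploiting ILP within each thread.
__global__ void ilp_sum(const float *x, float *out, int n)
{
    int stride = blockDim.x * gridDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float acc0 = 0.0f, acc1 = 0.0f;
    for (int j = i; j < n; j += 2 * stride) {
        acc0 += x[j];                  // chain 0
        if (j + stride < n)
            acc1 += x[j + stride];     // chain 1, independent of chain 0
    }
    out[i] = acc0 + acc1;              // one partial sum per thread
}
```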

The end result is that at an execution level NVIDIA has sacrificed some of GF100’s performance consistency by introducing superscalar execution – and ultimately becoming reliant on it for maximum performance. At the same time they have introduced a new type of consistency (and removed a level of complexity) by moving to fixed-latency instructions and compiler-based static scheduling. Thankfully most of these details are abstracted away from programmers and handled by NVIDIA’s compiler, but HPC users who are used to getting their hands dirty with low-level code are going to find that GK110 is more different than it would seem at first glance.

With that said, even beyond the significant changes to the warp execution model, GK110 brings still more changes. We can’t hope to replicate the sheer depth of NVIDIA’s own GK110 whitepaper, but there are several other low-level changes that further separate GK110 from GF100.

Space and bandwidth for both the register file and the L2 cache have been greatly increased for GK110. At the SMX level GK110 has 256KB of register file space, composed of 65,536 32-bit registers, as compared to 128KB of such space (32,768 registers) on GF100. Bandwidth to those register files has in turn been doubled, allowing GK110 to read from them faster than ever before. The L2 cache has received a very similar treatment: GK110 uses an L2 cache of up to 1.5MB, twice as big as the 768KB L2 cache of GF100/GF110, and L2 cache bandwidth has also been doubled.

What makes this all the more interesting is that while NVIDIA significantly increased the number of CUDA cores in an SM(X) – by far more than the increase in cache and register file sizes – they only marginally increased the number of threads that can actually be resident on an SMX. Each GK110 SMX can have up to 2,048 threads in flight at any time, just 1.33x the 1,536 threads of GF100. As a result GK110 works from a thread pool only slightly larger than GF100’s, which means that despite the increase in CUDA cores, performance in register-starved scenarios actually improves, as there are more registers available to each thread. This goes hand in hand with an increase in the total number of registers each thread can address, moving from 63 registers per thread on GF100 to 255 registers per thread on GK110.
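The arithmetic is worth spelling out: at the full 2,048 resident threads, GK110’s 65,536 registers come to 32 per thread, versus roughly 21 (32,768 / 1,536) on GF100. In CUDA this occupancy-versus-registers trade-off is typically steered with the __launch_bounds__ qualifier or the -maxrregcount compiler flag; a hedged sketch of our own:

```cuda
// Promise the compiler at most 256 threads per block and ask for at
// least 8 resident blocks per SMX: 8 * 256 = 2,048 threads, i.e. full
// occupancy on GK110, which caps register use at 65,536/2,048 = 32.
__global__ void __launch_bounds__(256, 8)
occupancy_bound_kernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

// A register-hungry kernel can instead run at lower occupancy and use
// up to 255 registers per thread on GK110 (versus 63 on GF100), e.g.:
//   nvcc -arch=sm_35 -maxrregcount=128 kernel.cu
```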

While we’re on the subject of caches, it’s also worth noting that NVIDIA has reworked their texture cache to be more useful for compute. On GF100 the 12KB texture cache was just that, a texture cache, only available to the texture units. As it turns out, clever programmers were using the texture cache as another data cache by mapping normal data as texture data, so NVIDIA has promoted the texture cache to a larger, more capable cache on GK110. Now measuring 48KB, in compute mode the texture cache becomes a read-only data cache, with particular specialization for unaligned memory access patterns. Furthermore, error detection capabilities have been added to make it safer for use with workloads that rely on ECC.
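In CUDA terms, the sm_35 compiler will route loads through this read-only data cache when it can prove the data is never written by the kernel (typically via const and __restrict__ qualifiers), and the __ldg() intrinsic, new in compute capability 3.5, forces the issue. A minimal sketch with names of our own:

```cuda
// 'const' plus '__restrict__' lets the compiler prove 'in' is read-only
// for the kernel's lifetime, so loads can be served from the 48KB
// read-only (former texture) cache on GK110.
__global__ void scale(const float * __restrict__ in,
                      float * __restrict__ out,
                      float k, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = k * __ldg(&in[i]);   // __ldg: explicit read-only-cache load
}
```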

Last, but certainly not least in our low-level look, NVIDIA has added a number of new instructions and operations to GK110 to further improve performance. New shuffle instructions allow threads within a warp to share (i.e. shuffle) data without going through shared memory, making the process much faster than the old load/share/store method. Meanwhile atomic operations have also been overhauled, with NVIDIA both speeding up the execution of atomic operations and adding some 64-bit operations that were previously only available for 32-bit data.
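As a concrete illustration of the shuffle instructions, here is the classic warp-level sum reduction rewritten around the Kepler-era __shfl_down() intrinsic, a sketch of our own rather than NVIDIA sample code:

```cuda
// Sum a value across the 32 threads of a warp without touching shared
// memory. __shfl_down(v, d) reads 'v' from the lane 'd' places higher
// in the same warp; after five halving steps lane 0 holds the warp
// total, replacing the old load/share/store round trip.
__device__ float warp_reduce_sum(float val)
{
    for (int offset = 16; offset > 0; offset >>= 1)
        val += __shfl_down(val, offset);   // register-to-register exchange
    return val;
}
```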

Comments

More to the point, they don't need to. GK104's performance is more or less on par with AMD's best. If you can keep up with the best your opponent has without losing money, why should you choose to lose money?

Keep in mind, they're charging $500 (and have been charging $500) for a GPU clearly built for the $200-$300 segment, while their chief opponent in the discrete GPU space can't go a month without either dropping the prices of their lines or offering up a new, even larger bundle. This is in spite of the fact that AMD has released not one but two spectacular performance driver updates while nVidia went quiet on the driver front for about six months.

Yet even still nVidia charges more for less and makes money hand over fist. Yeah, I don't think nVidia even needs to release anything based on Big Daddy Kepler when Little Sister Kepler is easily handing AMD its butt.

Your claim that Little Kepler is handing AMD its butt is absurd when it's slower and costs more. If NV's loyal consumers want a slower and more expensive card, more power to them.

Also, it's evident from how long it took NV to get volume production on K20/K20X that they used GK104 because GK100/GK110 wasn't ready. It worked out well for them, and hopefully we will get a very powerful GTX 780 next generation based on GK110 (or perhaps some other variant).

Still, your comment flies in the face of the facts, since GK104 was never built to be a $200-300 GPU; NV couldn't possibly have launched a 7B-transistor chip in volume when they are only now shipping thousands of them. Why would NV open pre-orders for K20 parts in Spring 2012 and let its key corporate customers wait until November 2012 to start getting their orders filled? This clearly doesn't add up with what you are saying.

Secondly, you make it sound like price drops on AMD's part are a sign of desperation, but you don't acknowledge that NV's cards have been overpriced since June 2012. That's a double standard all right. As a consumer, I welcome price drops from both camps. If NV drops prices, I like that. Funny how some people view price drops as some negative outcome for us consumers...

Right now, in the middle of the night, an idea sprang into my abused brain: nVidia is like Apple, and their graphics cards are like iPhones. There are always a few million people willing to buy their products no matter what, no matter what price they put up. Even if the rest of the world stopped buying nVidia and iPhones, there would always be some millions of Americans who will buy them, and their sons, and their sons' sons, and so on and so forth until the end of days. Heck, when we were chatting about computer components, one of my friends uttered the words: "So you are not a fan of nVidia? You know it has PhysX." In my mind I was like: "FAN? What the... I bought my ATI card because it was cheaper and consumed less power, so I pay less money when the bloo... electricity bill comes." And after reading all your comments I understand now what you mean by "fanboy" or "fanboi" or whatever. Typical American BS.

A consumer card would make sense if yields are relatively poor. A die this massive has to have very few fully functional chips (in fact, K20X only has 14 of 15 SMX clusters enabled). I can see a consumer card with 10 or 12 SMX clusters active, depending on yields for successful K20 and K20X dies.

It would also make sense if the yields are very good. If your yields are exceptional, you can manufacture enough GK110 dies to satisfy both corporate and consumer needs. Right now the demand for GK110 is outstripping supply. Based on what NV has said, their yields are very good; the main issue is wafer supply. I think we could reasonably see a GK110 consumer card next year. Maybe they will make a leaner gaming chip though, as a lot of the features in GK110 won't be used by gamers.

IMHO, at these prices, I won't be buying one, nor do I think the average enthusiast is going to be interested in paying perhaps one and a half to three times the price of a good performance PC for a single Tesla card. Though nVidia will probably make hoards of money from supercomputing centers, I think they are doing this while forsaking the enthusiast market.

The 600 series seriously cripples double-precision floating point performance, making a Tesla an imperative for anyone needing real DP performance; however, I won't be buying one. If one of the 600 series cards had DP performance on par with or better than the 500 series, I would have bought one rather than a 580.

I don't game much; however, I do run several BOINC projects, and at least one of those projects requires DP support. For that reason, I chose a 580 rather than a 680.