Turing Improves Performance in Today’s Games

Some enthusiasts have expressed concern that Turing-based cards don’t boast dramatically higher CUDA core counts than their previous-gen equivalents. The older boards even have higher GPU Boost frequencies. Nvidia didn’t help matters by failing to address generational improvements in today’s games at its launch event in Cologne, Germany. But the company did put a lot of effort into rearchitecting Turing for better per-core performance.

To start, Turing borrows from the Volta playbook in its support for simultaneous execution of FP32 arithmetic instructions, which constitute most shader workloads, and INT32 operations (for addressing/fetching data, floating-point min/max, compare, etc.). When you hear about Turing cores achieving better performance than Pascal at a given clock rate, this capability largely explains why.

In generations prior, a single math data path meant that dissimilar instruction types couldn’t execute at the same time, causing the floating-point pipeline to sit idle whenever non-FP operations were needed in a shader program. Volta sought to change this by creating separate pipelines. Although Nvidia eliminated the second dispatch unit assigned to each warp scheduler, it also claimed that instruction issue throughput rose.

How is that possible? It's all about the composition of each architecture's SM.

Check out the two block diagrams below. Pascal has one warp scheduler per quad, with each quad containing 32 CUDA cores. A quad's scheduler can issue one instruction pair per clock through the two dispatch units with the stipulation that both instructions come from the same 32-thread warp, and only one can be a core math instruction. Still, that's one dispatch unit per 16 CUDA cores.

In contrast, Turing packs fewer CUDA cores into an SM, and then spreads more SMs across each GPU. There's now one scheduler per 16 CUDA cores (2x Pascal), along with one dispatch unit per 16 CUDA cores (same as Pascal). Gone is the instruction-pairing constraint. And because Turing doubles up on schedulers, it only needs to issue an instruction to the CUDA cores every other cycle to keep them full (with 32 threads per warp, 16 CUDA cores take two cycles to consume them all). In between, it's free to issue a different instruction to any other unit, including the new INT32 pipeline. The new instruction can also be from any warp.

Turing's flexibility comes from having twice as many schedulers as Pascal, so that each one has less math to feed per cycle, not from a more complicated design. The schedulers still issue one instruction per clock cycle. It's just that the architecture is better able to utilize resources thanks to its improved balance throughout the SM.
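The every-other-cycle arithmetic above can be sketched in a few lines. This is a toy model, not Nvidia's scheduler logic: the function name `issue_profile` and its fields are made up for illustration, and it ignores instruction latency, dependencies, and Pascal's second dispatch unit.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def issue_profile(cores_per_scheduler: int, warp_size: int = WARP_SIZE) -> dict:
    """Toy model of one scheduler feeding its share of FP32 CUDA cores."""
    # Cycles the FP32 cores need to work through all threads of one warp instruction.
    cycles_per_warp = warp_size // cores_per_scheduler
    # The scheduler issues one instruction per clock, so any cycle not spent
    # re-feeding the FP32 cores is free for another unit (for example INT32).
    free_issue_slots = cycles_per_warp - 1
    return {"cycles_per_warp": cycles_per_warp, "free_issue_slots": free_issue_slots}


pascal = issue_profile(cores_per_scheduler=32)  # one scheduler per 32 CUDA cores
turing = issue_profile(cores_per_scheduler=16)  # one scheduler per 16 CUDA cores

print("Pascal:", pascal)
print("Turing:", turing)
```

Under these assumptions, a Turing scheduler has one spare issue slot for every FP32 warp instruction, which is where an INT32 instruction from any warp can go.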

Turing SM

Pascal SM

According to Nvidia, the potential gains are significant. In a game like Battlefield 1, for every 100 floating-point instructions, there are 50 non-FP instructions in the shader code. Other titles bias more heavily toward floating-point math. But the company claims there are an average of 36 integer pipeline instructions that would stall the floating-point pipeline for every 100 FP instructions. Those now get offloaded to the INT32 cores.
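To put those instruction mixes in perspective, here is a rough upper-bound estimate. It assumes every non-FP instruction would otherwise occupy a floating-point issue slot and that Turing overlaps the two streams perfectly. Neither assumption holds exactly in real shaders, and the helper names are hypothetical.

```python
def issue_cycles(fp: int, non_fp: int, concurrent: bool) -> int:
    """Issue slots needed for a mix of FP and non-FP instructions."""
    if concurrent:
        # Separate pipelines: the longer stream sets the pace.
        return max(fp, non_fp)
    # Single math data path: every instruction takes its own slot.
    return fp + non_fp


def ideal_speedup(fp: int, non_fp: int) -> float:
    """Best-case gain from running the two instruction streams side by side."""
    return issue_cycles(fp, non_fp, concurrent=False) / issue_cycles(fp, non_fp, concurrent=True)


average_mix = ideal_speedup(fp=100, non_fp=36)    # Nvidia's claimed average
battlefield_1 = ideal_speedup(fp=100, non_fp=50)  # Battlefield 1 example

print(f"Average mix: up to {average_mix:.2f}x")
print(f"Battlefield 1: up to {battlefield_1:.2f}x")
```

These are ceilings for the math pipeline alone. Memory stalls and dependencies between instructions would pull real-world gains below them.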

On paper, an SM in the previous-generation GP102 appears more complex, sporting twice as many CUDA cores, load/store units, SFUs, and texture units, along with just as much register file capacity and more cache. But remember that the new TU102 boasts as many as 72 SMs across the GPU, while GP102 topped out at 30 SMs. The result is a Turing-based flagship with 21% more CUDA cores and texture units than GeForce GTX 1080 Ti, but also way more SRAM for registers, shared memory, and L1 cache (not to mention 6MB of L2 cache, doubling GP102’s 3MB).
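The 21% figure can be checked with quick arithmetic. The per-SM core counts (128 for GP102, 64 for TU102) follow from the scheduler ratios described earlier. The enabled-SM counts for the shipping cards (28 on GeForce GTX 1080 Ti, 68 on GeForce RTX 2080 Ti) are not stated in this section and are taken from the cards' public specifications.

```python
# CUDA cores per SM, derived from four schedulers per SM in both designs.
CORES_PER_SM = {"GP102": 128, "TU102": 64}

# Enabled SMs on the shipping cards (assumption: public spec-sheet values).
ENABLED_SMS = {"GTX 1080 Ti": ("GP102", 28), "RTX 2080 Ti": ("TU102", 68)}


def cuda_cores(card: str) -> int:
    """Total CUDA cores = enabled SMs x cores per SM."""
    gpu, sms = ENABLED_SMS[card]
    return sms * CORES_PER_SM[gpu]


pascal_cores = cuda_cores("GTX 1080 Ti")
turing_cores = cuda_cores("RTX 2080 Ti")
uplift_pct = (turing_cores / pascal_cores - 1) * 100

print(pascal_cores, turing_cores, f"{uplift_pct:.0f}% more CUDA cores")
```

The same calculation on the full chips (72 SMs versus 30 SMs) gives 4,608 versus 3,840 cores, a 20% gap.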

That increase of on-die memory plays another critical role in improving performance, as does its hierarchical organization. Consider the three different data memories: texture cache for textures, L1 cache for load/store data, and shared memory for compute workloads. As far back as Kepler, each SM had 48KB of read-only texture cache, plus a 64KB shared memory/L1 cache. In Maxwell/Pascal, the L1 and texture caches were combined, leaving 96KB of shared memory on its own. Now, Turing combines all three into one shared and configurable 96KB pool.

The benefit of unification, of course, is that regardless of whether a workload is optimized for L1 or shared memory, on-chip storage is utilized rather than sitting idle as it may have before. Moving L1 functionality down has the additional benefit of putting it on a wider bus, doubling L1 cache bandwidth (at the TPC level, Pascal supports 64 bytes per clock of cache hit bandwidth, while Turing can do 128 bytes per clock). And because those 96KB can be configured as 64KB L1 and 32KB shared memory (or vice versa), L1 capacity can be 50% higher on a per-SM basis.

Combined, Nvidia claims that the effect of its redesigned math pipelines and memory architecture is a 50% performance uplift per CUDA core. To keep those data-hungry cores fed more effectively, Nvidia paired TU102 with GDDR6 memory and further optimized its traffic reduction technologies (like delta color compression). Pitting GeForce GTX 1080 Ti’s 11 Gb/s GDDR5X modules against RTX 2080 Ti’s 14 Gb/s GDDR6 memory, both on an aggregate 352-bit bus, you’re looking at a 27%-higher data rate/peak bandwidth figure across the board. Then, depending on the game, when RTX 2080 Ti can avoid sending data over the bus, effective throughput increases even more by double-digit percentages.
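The 27% figure follows directly from the per-pin data rates, since both cards use the same 352-bit aggregate bus. A minimal sketch of the calculation (the function name is illustrative):

```python
def peak_bandwidth_gbs(data_rate_gbps: float, bus_width_bits: int) -> float:
    """Peak memory bandwidth in GB/s: per-pin rate (Gb/s) x bus width (bits) / 8."""
    return data_rate_gbps * bus_width_bits / 8


BUS_WIDTH_BITS = 352  # aggregate bus width on both cards, per the text

gtx_1080_ti = peak_bandwidth_gbs(11, BUS_WIDTH_BITS)  # GDDR5X at 11 Gb/s
rtx_2080_ti = peak_bandwidth_gbs(14, BUS_WIDTH_BITS)  # GDDR6 at 14 Gb/s
uplift_pct = (rtx_2080_ti / gtx_1080_ti - 1) * 100

print(f"GTX 1080 Ti: {gtx_1080_ti:.0f} GB/s")
print(f"RTX 2080 Ti: {rtx_2080_ti:.0f} GB/s")
print(f"Uplift: {uplift_pct:.0f}%")
```

This covers raw peak bandwidth only. Any savings from compression come on top of it and vary by game.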

"And although veterans in the hardware field have their own opinions of what real-time ray tracing means to an immersive gaming experience, I’ve been around long enough to know that you cannot recommend hardware based only on promises of what’s to come."

So wait, do I preorder or not? (kidding)

jimmysmitty

Well done article Chris. This is why I love you. Details and logical thinking based on the facts we have.

Next up benchmarks. Can't wait to see if the improvements nVidia made come to fruition in performance worthy of the price.

Lutfij

Holding out with bated breath about performance metrics. Pricing seems to be off but the followup review should guide users as to its worth!

Krazie_Ivan

i didn't expect the 2070 to be on TU106. as noted in the article, **106 has been a mid-range ($240-ish msrp) chip for a few generations... asking $500-600 for a mid-range GPU is insanity. esp since there's no way it'll have playable fps with RT "on" if the 2080ti struggles to maintain 60. DLSS is promisingly cool, but that's still not worth the MASSIVE cost increases.

jimmysmitty

904774 said:

i didn't expect the 2070 to be on TU106. as noted in the article, **106 has been a mid-range ($240-ish msrp) chip for a few generations... asking $500-600 for a mid-range GPU is insanity. esp since there's no way it'll have playable fps with RT "on" if the 2080ti struggles to maintain 60. DLSS is promisingly cool, but that's still not worth the MASSIVE cost increases.

It is possible that they are changing their lineup scheme. 106 might have become the low high end card and they might have something lower to replace it. This happens all the time.

Lucky_SLS

turing does seem to have the ability to pump up the fps if used right with all its features. I just hope that nvidia really made a card to power up its upcoming 4k 200hz hdr g sync monitors. wow, thats a mouthful!

anthonyinsd

ooh man the jedi mind trick Nvidia played on hyperbolic gamers to get rid of their overstock is gonna be EPIC!!! and just based on facts: 12nm gddr6 awesome new voltage regulation and to GAME only processes thats a win in my book. I mean if all you care about is your rast score, then you should be on the hunt for a titan V, if it doesn't rast its trash lol. been 10 years since econ 101, but if you want to get rid of overstock you don't tell much about the new product till its out; then the people who thought they were smart getting the older product, now want to buy the new one too....

none12345

I see a lot of features that are seemingly designed to save compute resources and output lower image quality. With the promise that those savings will then be applied to increase image quality on the whole.

I'm quite dubious about this. My worry is that some of the areas of computer graphics that need the most love, are going to get even worse. We can only hope that overall image quality goes up at the same frame rate. Rather than frame rate going up, and parts of the image getting worse.

I do not long to return to the day where different graphics cards output different image quality at the same up front graphics settings. This was very annoying in the past. You had some cards that looked faster if you just looked at their fps numbers. But then you looked at the image quality and noticed that one was noticeably worse.

I worry that in the end we might end up in the age of blur. Where we have localized areas of shiny highly detailed objects/effects layered on top of an increasingly blurry background.

CaptainTom

I have to admit that since I have a high-refresh (non-Adaptive Sync) monitor, I am eyeing the 2080 Ti. DLSS would be nice if it was free in 1080p (and worked well), and I still don't need to worry about Gstink. But then again I have a sneaking suspicion that AMD is going to respond with 7nm Cards sooner than everyone expects, so we'll see.

P.S. Guys the 650 Ti was a 106 card lol. Now a xx70 is a 106 card. Can't believe the tech press is actually ignoring the fact that Nvidia is relabeling their low-end offering as a xx70, and selling it for $600 (Halo product pricing). I swear Nvidia could get away with murder...

mlee 2500

4nm is no longer considered a "Slight Density Improvement".

Hasn't been for over a decade. It's only lumped in with 16 from a marketing standpoint because it's no longer the flagship lithography (7nm).

TMTOWTSAC

In a perfect world, the non-RT models would be based off the TU architecture without any of the RT silicon, and priced accordingly. They're claiming RT is the must have feature and subsequently worth the price premium. Given those claims it's going to be very interesting to see what pricing scheme they go with for the non-RT models.

mlee 2500

Great article, very informative, thank you for taking the time to write it.

dimar

No need to waste your hard earned money. AMD Navi is around the corner. And if Navi isn't that good, RTX prices will be lower by then. With AMD you get freesync which most monitors have these days.

I hope you are editing the article that gets released here with the benchies once the NDA is lifted.

I will spend money based on that content ...

cangelini

Thanks guys.

Yes, I will be spending a long caffeine-fueled weekend with graphics cards, Excel, and Word. Let me know if there are any specific requests on comparisons you'd like to see made!

truerock

I've been running my Nvidia GeForce GTX 690 for 6 years. It does 3840 x 2160 at 30fps. The lack of HDMI 2.1 is just enough of a negative to keep me from buying a GeForce RTX 2080 Ti. I guess it is ironic that I actually don't want HDMI or DisplayPort outputs on my Nvidia cards. I want Nvidia cards that only have USB-C output ports. Oh well - maybe next year. My Nvidia GeForce GTX 690 will be 7 years old.

truerock

Chris,

Thanks for the review. It's the best I've seen on these cards so far.

I'm interested in 3840 x 2160 at 120fps. That would be with the more popular games. What settings for a specific game allow 3840 x 2160 at 120fps vs 3840 x 2160 at 60fps and 3840 x 2160 at 30fps. I'm not interested in g-sync. Does graphics quality suffer much as settings are pushed down to allow higher frame rates?

bit_user

134065 said:

Let me know if there are any specific requests on comparisons you'd like to see made!

Crysis @ 4k? ...you know someone will ask it. And Anandtech tested it on the Titan V, so we can compare.

Time's going to be tight, but I'll see if I can throw it on the test system.

Reynod

I agree ... if you still have the Original Crysis game ... then answer "But will it play Crysis?".

The original Badly coded game please?

I imagine you will also have received a couple of iterations of drivers since receiving the card, so let us know how much improvement you found with these?

Finally, when you finish can you pull the HSF off and let us know anything about the TIM you find?

bit_user

123704 said:

The original Badly coded game please?

Uh, it should be comparable to the other Crysis benchmarks, please.

Reynod

Ok then both of them ...

kyotokid

...well the 2070 sounds like the RTX stepchild. No linking capability which means no way to improve frame rate. So my thinking is who would buy this card?

Crazyjay53

So why hurry to get these cards if there aren't any games that run on RTX? It's gonna take a while for game software to add it in game. I'll stick with my 1080 Ti for a while till they get benchmarks on those RTX cards showing if they are worth it.