VeteranNewcomer

I would be shocked if anything other than this were to occur when GT104-based SKUs launch in the next 1-3 months.

Just for the sake of clarification, my theory pertains to the method by which GT104 may surpass GP102 in raw arithmetic performance, which would require something more exotic than minor process and boost algorithm optimizations would allow, in my opinion.

Worth noting that the changes in Volta went beyond minor process/boost tweaks, covering instructions per cycle, cache structure and the compiler; if an application can utilise these improvements, the gains exceed plain core scaling by a notable margin.
One example is Amber, which is between 63-75% faster on V100 than on the Tesla GP102, even though the FP32 core count increased by only 42%.
Amber was one of the applications looked at by Nvidia/devs for such acceleration improvements with Volta.
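
To put a number on "beyond scaling", here is a minimal sketch that divides the core-count increase out of the quoted Amber speedups. It uses only the 42% and 63-75% figures from this post; the function name is made up for illustration, and it assumes clocks and everything else are equal, which they are not in practice.

```python
def per_core_gain(app_speedup: float, core_scaling: float) -> float:
    """Gain left over once the core-count increase is divided out.

    Both arguments are fractional increases, e.g. 0.42 for +42%.
    Assumption: clock speed differences are ignored.
    """
    return (1.0 + app_speedup) / (1.0 + core_scaling) - 1.0


CORE_SCALING = 0.42             # FP32 core increase quoted in the post
AMBER_SPEEDUPS = (0.63, 0.75)   # Amber speedup range quoted in the post

for speedup in AMBER_SPEEDUPS:
    extra = per_core_gain(speedup, CORE_SCALING)
    print(f"Amber +{speedup:.0%} overall -> {extra:+.1%} beyond core scaling")
```

On those figures the part not explained by core count works out to roughly 15% to 23%.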

For reference, the Tesla GP102 P40 (7% more cores than the P100, but with a different cache/SM structure and GDDR5) has around the same performance as the P100 16GB SXM 300W accelerator in Amber with FP32 Solvent.
But then not every application sees such scaling with V100, and games generally fall well short of the ALU scaling due to the front end (GTC-Polymorph Engine-etc), although some do work well.

Veteran

I'm not familiar with the instruction mix of the workload you cite - does it benefit from the presence of tensor cores on Volta which are absent on Pascal? If so, I'm not so sure that it is a good analogy for Turing, which I expect to not feature tensor cores. As you say, gaming workloads tend to make use of other fixed function units of GPUs so arithmetic scaling from one GPU SKU to another does not yield linear performance gains, especially between architectures. As an example of this, AMD has maintained an FP32 (let alone FP16 or 64) lead over Nvidia for quite some time now, yet their graphics cards continually fall behind NV's in the majority of gaming workloads.

VeteranNewcomer

Amber Solvent is straight FP32 (for the 'official' benchmarks anyway) and, importantly, does not use Tensor cores. It is one of the applications that benefits from the design of Volta beyond core scaling, for the factors I briefly mentioned, and not because of Tensor cores.
The cache/register/SM design does have a benefit, as can be seen when comparing the P100 to GP102 in such applications: the Tesla GP102 with 7% more cores has performance comparable to the 16GB SXM P100.
Anyway, the gains seen with V100 go quite a bit beyond that when weighing the factors involved, even allowing for the cache architecture/simplification improvements (the L1/L2 changes in Volta).

VeteranSubscriber

People like to declare a primary bottleneck, like the front end, without proof. The reality is likely that the bottleneck shifts multiple times per frame and any areas that don't perfectly scale compound each other. Sometimes performance doesn't scale with ALU count because there's not enough bandwidth to feed the ALUs or the system can't make 100% use of the ALUs for various reasons like waiting on memory when there are some spare ALU cycles.
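
A toy sketch of that idea, with entirely made-up numbers: a frame is split into phases, each phase costs as much as its slowest unit, and only the ALU term is scaled. The phase list, unit names and timings are hypothetical and not measurements of any GPU.

```python
# Hypothetical frame: for each phase, the time (ms) each unit would need on its own.
# A phase takes as long as its slowest unit. All values are invented for illustration.
BASE_FRAME = [
    {"front_end": 3.0, "alu": 1.0, "memory": 1.5},  # geometry-heavy pass
    {"front_end": 0.5, "alu": 6.0, "memory": 3.0},  # shading-heavy pass
    {"front_end": 0.2, "alu": 2.0, "memory": 4.0},  # bandwidth-heavy post-processing
]


def scale_alu(phase, alu_scale):
    """Return a copy of the phase with only the ALU time reduced."""
    return dict(phase, alu=phase["alu"] / alu_scale)


def frame_time(phases, alu_scale=1.0):
    """Sum of per-phase times, each set by the slowest unit in that phase."""
    return sum(max(scale_alu(p, alu_scale).values()) for p in phases)


def bottlenecks(phases, alu_scale=1.0):
    """Name of the limiting unit in each phase."""
    limiting = []
    for p in phases:
        scaled = scale_alu(p, alu_scale)
        limiting.append(max(scaled, key=scaled.get))
    return limiting


base = frame_time(BASE_FRAME)
scaled = frame_time(BASE_FRAME, alu_scale=1.42)
print("limiting unit per phase:", bottlenecks(BASE_FRAME, alu_scale=1.42))
print(f"frame speedup with 1.42x ALU: {base / scaled:.2f}x")
```

In this made-up frame the limiting unit changes from phase to phase, and 1.42x the ALU throughput buys well under 1.42x the frame rate because only the ALU-bound phase gets shorter.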

VeteranNewcomer

The closest thing to proof was the testing Arun did with his geometry performance tool. Historically it showed a 1:1 relationship between SM-TPC-Polymorph Engine, including with Pascal; more recent testing with V100 indicated this ratio has now dropped. That makes sense considering how much the architecture is being scaled up while maintaining the same front end, and that was even allowing for the SM structure with 64 CUDA cores instead of the 128-core design.
Although I agree that for 100% proof it would be great if Arun could test the P100 to see how the 64 CUDA core design affects the relationship (in theory the geometry tool performance should still be a 1:2 ratio or better, but it was worse than this for V100).
Somewhere in the Pascal or Volta thread (I think it was the Volta one) you can find the discussion on this.

This is further backed up by what we see with games, where the performance gain varies from 5% to 35% with the average in the 20s, and is very rarely over 40%.
Drivers could be a factor for the very lowest results, and also the use of 64 CUDA cores per SM and all it entails (I mentioned in the Volta thread that I remember an Nvidia engineer mentioning it is not ideal for gaming for now), but the trend is still well below the 40% scaling of the architecture, even for the games that generally work well.

VeteranNewcomer

I think the whole point of 3dcgi is that there is nothing to prove, because the types of workload change multiple times per frame and thus the location of the bottleneck changes just the same.

It only makes sense to say: x% of the time, the bottleneck is here, and y% of the time is somewhere else.

There is: you can look at the ratio of SM-TPC-Polymorph Engine-Raster Engine and at actual game frame behaviour.
Historically this has been a 1:1 performance relationship (see Arun's tool), but as the SM/CUDA cores scale while other aspects remain static, more pressure falls on the front end IF one expects performance to follow the architecture scaling from, say, Pascal to Volta: a 42% increase, but the 1:1 relationship in the context of geometry is now broken.
This is further seen in games measured either with PresentMon or another time-based derivative solution, where you can see the influence on frames.

3dcgi was picking up on my post as speculation with no foundation; actually it does have a foundation and is backed up by what is seen with nearly every game so far on Titan V: instead of a 42% improvement, games average 18-25% or mostly below, with a very rare few either in the low 30s or at times over 40%.
Look at Arun's tool and what was discussed, then look at games monitored from a frame behaviour perspective.
If one wants to argue semantics, then one can say there are no bottlenecks anywhere because the workload changes for anything; the point is that the context was a response to the scaling of compute/TFLOPs/cores versus gaming (the geometry aspects, fundamental to Nvidia's architecture, can be shown to have a lower ratio than before).
And that then leads to this: by your reasoning you might as well say games are fine on Titan V and scaling well if we look at the 1% of the time it is fine over a given period, rather than the real-world view of how it behaves 98% of the time in the game. In reality games are not scaling well, and so far (no other explanation has been identified) it comes back to what Arun identified with his tool and what was discussed in that thread.

But like I mentioned, to be 100% satisfied with the results from Arun's tool we need to see the behaviour on P100 because of its SM/CUDA structure; like I said, in theory the tool should show 1:2 or better, and for V100 it is quite a lot worse than that.
Still, this gives us some indication (Arun's tool showing the front end performance ratio has dropped), combined with the game behaviour trends we are seeing on Titan V when the cores scaled by 42%.
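
One rough way to read those game numbers is through Amdahl's law: assume a fraction f of frame time scales perfectly with core count and the rest does not scale at all, then solve for f from the observed speedup. This is only a sketch under that assumption; it cannot say whether the non-scaling part is front end, bandwidth or drivers. The function name is made up, and the 18-25% and 42% figures are the ones quoted in this post.

```python
def scaling_fraction(observed_speedup: float, core_scale: float) -> float:
    """Fraction of frame time that must scale with cores under Amdahl's law.

    Amdahl: S = 1 / ((1 - f) + f / k), with S the observed speedup and
    k the core scaling factor. Solving for f gives the expression below.
    Assumption: the remaining (1 - f) of the frame does not speed up at all.
    """
    return (1.0 - 1.0 / observed_speedup) / (1.0 - 1.0 / core_scale)


CORE_SCALE = 1.42  # 42% more cores, as quoted in the post

for gain in (0.18, 0.25):
    f = scaling_fraction(1.0 + gain, CORE_SCALE)
    print(f"game gain +{gain:.0%} -> about {f:.0%} of frame time scales with cores")
```

Under that simplification, an 18-25% average gain corresponds to something like half to two thirds of the frame time scaling with the cores.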

Edit:
Worth noting as well that even with the reduced ROPs, in compute applications that need bandwidth such as Amber the Titan V still hits over a 40% performance increase, so when comparing scaling against, say, GP102 it is fair to say bandwidth is still not a limitation relative to the core scaling.
That said, it would be even higher with the full HBM2 bus width/bandwidth, but it does not limit performance to below the core scaling.

RegularNewcomer

Well it's definitely 12nm if that's true. Besides, Nvidia seems to love large dies recently, and this would explain the rumored $1k pricepoint for the lower end version.

Not sure how much money they'd expect to make off that of course, but hell maybe it's yet another non gaming focused chip and gamers are SOL again. Why bother serving them after all if the current lineup still sells and there's plenty of buyers for AI and HPC stuff?

Legend

"In addition, shipments for Nvidia's new-generation GPUs will play another driver of TSMC's revenue growth in the fourth quarter, the sources identified."
I know Digitimes sources are hit and miss, but if true that would mean holiday season at the earliest

VeteranRegular

GeForce GTX 1180 announced late August - 1180+, 1170 and 1160 to follow

More news from the NVIDIA front today. We already learned that there is NVIDIA activity during Gamescom in Germany, add to that an email proclaiming one to be from NVIDIA towards a board partner. This email talks about GeForce GTX 1180, 1170, 1180+ and 1160 by name.
...
Basically, this is what the preliminary launches could/would look like:
GeForce GTX 1180 on 30th August
GeForce GTX 1170/1180+ on 30th September
GeForce GTX 1160 on 30th October

You'll notice a 1180+. We're not sure about it, but are fairly certain that the GeForce GTX 1180: 30th August launch would be a founders edition, and the + very likely is a board partner board, AIB cards always launch later. All this info makes that email sound plausible (but really also could very well be tremendously fake). The email also mentions 21st August for a press conference call, for the partners.

VeteranRegularSubscriber

That would be my guess. If the leak has any validity, it reads like the release of the Founders Edition of the 1180 on 8/30, followed by the non-Founders boards a month later on 9/30.

Had this been launching 6 months ago in Q1, I would have definitely been buying one. Now for some odd reason, I'm not that hyped. I'll either wait for a bundle with games or see how the 1180Ti shapes up (maybe that will be a 7nm product?)

VeteranRegular

We all know that NVIDIA has something brewing in that kettle of theirs, we all know it will be the 11xx series (GTX 1160, 1170 and 1180), but no one really knows what GPU architecture it'll be based on, and kudos/cudas to NVIDIA for keeping that under wraps. It is expected to be Turing, but I will keep saying it, Turing never was listed in their roadmaps. The most logical thing for them to do would be a Pascal respin with GDDR6 memory, Turing with GDDR6 memory.

However, and realistically I think this is the case, Turing could also simply be Volta stripped of the Tensor cores, and thus GeForce GTX 11xx would be Volta based. Now Volta GPUs have been around for a while, just not in the consumer domain. This is why the entries in HWinfo pique my interest, as Malik added: Nvidia GV102 and GV104 GPUs. Expect announcements on the new card(s) in late August.
