Does the GeForce GTX 970 have a memory allocation bug? (update 3)

For a week or two now in our forums there have been allegations that users of the GeForce GTX 970 have a darn hard time addressing and filling the last 10% of their graphics memory. The 4 GB card seems to run into issues addressing the last 400 to 600 MB of memory, which is significant.

Two weeks ago, when I tested this myself to try and replicate it, some games halted at 3.5 GB while others like COD filled the 4 GB completely. These reports had been ongoing for a while and initially got dismissed. However, a new small tool helps us indicate and verify a thing or two, and there really is something going on with that last chunk of memory on the GeForce GTX 970. We have to concur with the findings: there is a problem that the 970 shows, and the 980 doesn't.

Meanwhile, an Nvidia representative here on the Guru3D forums has already stated that "they are looking into it". The tool we are talking about was made by a German programmer going by the name Nai; he wrote a small program that benchmarks VRAM performance, and with it we can see the 970's performance drop once memory utilization passes roughly 3.3 GB, while the GTX 980 does not show such behavior:

You can download the test to try it yourself; we placed it here (local Guru3D mirror). This is a customized version based on Nai's original programming, reworked by a Guru3D member. With this version you can now also specify the allocation block size and the maximum amount of memory used, as follows:

vRamBandwidthTest.exe [BlockSizeMB] [MaxAllocationMB]

BlockSizeMB: any of 16, 32, 64, 128, 256, 512, 1024

MaxAllocationMB: any number greater than or equal to BlockSizeMB

If no arguments are given, the test runs with a 128 MB block size by default and no memory limit, which corresponds exactly to the original program. Please disable Aero and preferably disconnect the monitor during the test. We are interested in hearing Nvidia's response to the new findings.
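
For those curious what such a test actually does under the hood, the general approach is straightforward, and the sketch below illustrates it. To be clear: this is our illustrative CUDA code, not Nai's or the Guru3D member's actual source, and the names and sizes are arbitrary choices. It allocates VRAM in 128 MB chunks until allocation fails, then times a simple fill kernel on each chunk and converts the elapsed time into GB/sec; on a GTX 970, any chunks that land in a slow segment should stand out immediately.

```
// vram_sweep.cu - illustrative sketch of a per-chunk VRAM bandwidth test.
// NOT the actual benchmark source; block size and kernel are arbitrary.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Writes a constant to every element of a chunk; bandwidth = bytes / time.
__global__ void fillKernel(float* p, size_t n, float v)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = v;
}

int main()
{
    const size_t chunkBytes = 128u * 1024u * 1024u;   // 128 MB, like the tool's default
    const size_t n = chunkBytes / sizeof(float);
    const unsigned grid = (unsigned)((n + 255) / 256);
    std::vector<float*> chunks;

    // Allocate chunk by chunk until the card runs out of memory.
    float* p = nullptr;
    while (cudaMalloc(&p, chunkBytes) == cudaSuccess)
        chunks.push_back(p);
    cudaGetLastError();                               // clear the expected final failure

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (size_t c = 0; c < chunks.size(); ++c) {
        fillKernel<<<grid, 256>>>(chunks[c], n, 1.0f);  // warm-up launch
        cudaEventRecord(start);
        fillKernel<<<grid, 256>>>(chunks[c], n, 2.0f);  // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        // bytes / (ms * 1e6) yields GB/s.
        printf("chunk %3zu (%5zu MB deep): %6.1f GB/s\n",
               c, c * 128, (double)chunkBytes / (ms * 1.0e6));
    }

    for (float* q : chunks) cudaFree(q);
    return 0;
}
```

Compiled with a plain `nvcc vram_sweep.cu -o vram_sweep`, this would print one line per 128 MB of VRAM; on a 970 the last few lines are where things get interesting.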

You can further discuss your findings here in our forums. Please do share your GTX 970 and GTX 980 results with us.

Meanwhile at Nvidia (a chat from a forum user):

[10:11:39 PM] NV Chat: We have our entire team working on this issue with a high priority. This will soon be fixed for sure.
[10:11:54 PM] Me: So, what is the issue?
[10:12:07 PM] Me: What needs to be fixed?
[10:12:46 PM] NV Chat: We are not sure on that. We are still yet to find the cause of this issue.
[10:12:50 PM] NV Chat: Our team is working on it.

Update #1 - Nvidia responds

NVIDIA has now responded to the findings:

The GeForce GTX 970 is equipped with 4GB of dedicated graphics memory. However the 970 has a different configuration of SMs than the 980, and fewer crossbar resources to the memory system. To optimally manage memory traffic in this configuration, we segment graphics memory into a 3.5GB section and a 0.5GB section. The GPU has higher priority access to the 3.5GB section. When a game needs less than 3.5GB of video memory per draw command then it will only access the first partition, and 3rd party applications that measure memory usage will report 3.5GB of memory in use on GTX 970, but may report more for GTX 980 if there is more memory used by other commands. When a game requires more than 3.5GB of memory then we use both segments.

We understand there have been some questions about how the GTX 970 will perform when it accesses the 0.5GB memory segment. The best way to test that is to look at game performance. Compare a GTX 980 to a 970 on a game that uses less than 3.5GB. Then turn up the settings so the game needs more than 3.5GB and compare 980 and 970 performance again.

In Shadows of Mordor, performance drops about 24% on the GTX 980 and 25% on the GTX 970, a 1% difference. In Battlefield 4, the drop is 47% on the GTX 980 and 50% on the GTX 970, a 3% difference. In CoD: AW, the drop is 41% on the GTX 980 and 44% on the GTX 970, a 3% difference. As you can see, there is very little change in the performance of the GTX 970 relative to the GTX 980 on these games when it is using the 0.5GB segment.

So removing SMMs to make the GTX 970 a lower-spec product than the GTX 980 is the main issue here: 500MB is 1/8th of the 4GB total memory capacity, just as two SMMs is 1/8th of the total SMM count. So the answer really is that the primary usable memory for the GTX 970 is a 3.5 GB partition.

Nvidia's results seem to suggest this is a non-issue; however, actual user results contradict them. I'm not quite certain how well this info will sit with GTX 970 owners, as this isn't a bug that can be fixed: the card is designed to function that way due to the cut-down SMMs.

Update #2 - A little bit of testing

On a generic note: I've been using and comparing games on both a 970 and a 980 today, and quite honestly I cannot really reproduce stutters or weird issues other than the normal stuff once you run out of graphics memory. Once you exhaust the ~3.5 GB on the 970 or the ~4 GB on the GTX 980, slowdowns or weird behavior can occur, but that goes for any graphics card that runs out of video memory. I've seen 4 GB of graphics memory usage with COD and 3.6 GB with Shadows of Mordor at widely varying settings, and simply cannot reproduce significant enough anomalies. If you really do run out of graphics memory, perhaps flick the AA mode down a notch from 8x to 4x or something. I have to state this though: the primary 3.5 GB partition on the GTX 970 with a slow 500MB secondary partition is a big miss from Nvidia, but mostly for not honestly communicating it. I find the problem to be more of a marketing miss, with a lot of aftermath due to not mentioning it.

Had Nvidia disclosed this information alongside the launch, you could have made a more informed decision. For most of you, the primary 3.5 GB of graphics memory will be more than plenty at 1920x1080 (Full HD) up to 2560x1440 (WQHD).

Update #3 - The issue that is behind the issue

New info has surfaced: Nvidia messed up quite a bit when they sent out specs to press and media like ourselves. As we now know, the GeForce GTX 970 has 56 ROPs, not 64 as listed in their reviewer's guides. Having fewer ROPs is not a massive thing in itself, but it exposes a thing or two about the memory subsystem and L2 cache. Combined with some new features in the Maxwell architecture, herein we can find the answer as to why the card's memory is split into the 3.5GB/0.5GB partitions noted above.

Look at the diagram above (and I am truly sorry to make this so complicated, as it really is just that... complicated). You'll notice that, compared to the 980, the GTX 970 has three disabled SMs, giving it 13 active SMs (clusters holding, among other things, the shader processors). The SMs shown at the top are followed by 256KB L2 cache slices, each paired with a 32-bit memory controller, located at the bottom. The crossbar is responsible for communication between the SMs, caches and memory controllers.

You will notice the greyed-out right-hand L2 unit for this GPU, right? That is a disabled L2 block, and since each L2 block is tied to ROPs, the GTX 970 does not have 2,048KB but instead 1,792KB of L2 cache. Disabling ROPs and thus L2 like that is actually new and Maxwell-exclusive; on Kepler, disabling an L2/ROP segment would disable the entire section, including a memory controller. So while that L2/ROP unit is disabled, the 8th memory controller to the right is still active and in use.

Now that we know Maxwell can disable smaller segments and keep the rest activated, it follows that all eight memory controllers and their associated DRAM can still be used, but the final 1/8th of the L2 cache is missing/disabled. As a result, the 8th DRAM controller actually needs to buddy up with the 7th L2 unit, and that is the root cause of a big performance issue. The GeForce GTX 970 has a 256-bit bus over a 4GB framebuffer with all memory controllers active and in use, but disabling the L2 segment tied to the 8th memory controller means that overall L2 performance would operate at half its normal rate.

Nvidia needed to tackle that problem and did so by splitting the total 4GB of memory into a primary 3.5GB partition (196 GB/sec) that makes use of the first seven memory controllers and associated DRAM, and a secondary 0.5GB partition (28 GB/sec) tied to the 8th memory controller. Nvidia could have and probably should have marketed the card as 3.5GB, or could even have deactivated an entire right-side quad and gone for a 192-bit memory interface tied to just 3GB of memory, but did not pursue that alternative as the chosen solution offers better performance. Nvidia claims that games hardly suffer from this design / workaround.
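
For reference, those two figures follow directly from the memory spec. With the card's 7 Gbps (effective) GDDR5, each 32-bit memory controller moves 32 × 7 / 8 = 28 GB/sec. The seven controllers behind the primary partition can be striped together for 7 × 28 = 196 GB/sec, while the lone 8th controller behind the 0.5GB partition tops out at 28 GB/sec; a card able to stripe across all eight would reach the full 8 × 28 = 224 GB/sec quoted on the box.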

In a rough, simplified explanation: the disabled L2 unit causes a challenge, a performance hit tied to one of the memory controllers. To divert that performance hit, the memory is split up into two segments, bypassing the issue at hand; a tweak to get the most out of a lesser situation. Both memory partitions are active and in use; the primary 3.5 GB partition is very fast, while the 512MB secondary partition is much slower.

Thing is, the quantifiable fact is that nobody really has massive issues; dozens and dozens of media outlets have tested the card with in-depth reviews like the ones here on my site. As for replicating the stutters and such that you see in some of the videos, to date I have not been able to reproduce them unless I do crazy stuff, and I've been on this all weekend. Overall scores are good, and sure, if you run out of memory at some point you will see performance drops. But then you drop from 8x to 4x AA, right?

Nvidia messed up badly here... no doubt about it. The ROP/L2 cache count was goofed up, slipped through the cracks and ended up in their reviewer's guides and spec sheets, and really... they should have called this a 3.5 GB card with an extra layer of L3 cache memory or something. Right now Nvidia is in full damage control; however, I will stick to my recommendations: the GeForce GTX 970 is still a card we like very much in the up-to-2560x1440 (WQHD) domain, but it probably should have been called a 3.5 GB product with an added 512MB L3 cache.

To answer my own title question: does the GeForce GTX 970 have a memory allocation bug? Nope, this was all done per design; Nvidia, however, failed to communicate this properly to the tech media and thus, in the end, to the people that buy the product.

Looking at Nai's benchmark, and at the different bandwidths the different GPUs show once the benchmark's memory allocation reaches the 2816 MiB to 3500 MiB range, I can only assume this is caused by the way the SMM units are disabled.

Allow me to elaborate on my assumption. As we know, there are four raster engines in the GTX 970 and GTX 980.
Each raster engine has four SMM units. The GTX 980 has the full set of SMM units for each raster engine, so there are 16 SMM units.

The GTX 970 is made by disabling 3 of the SMM units. What Nvidia refused to tell us is which of the raster engines have their SMM units disabled.
I found most reviewers simply modified the high-level architecture overview diagram of the GTX 980 by removing one SMM unit from each of three raster engines, with one raster engine keeping its four SMM units intact.

First scenario
What if the first (or the second, third, or fourth) raster engine has 3 of its SMM units disabled, instead of the disabled units being spread evenly across the four raster engines?

Second scenario
Or what if the first raster engine has two SMM units disabled and the second raster engine has one SMM unit disabled?

Oh, please do notice the memory controller diagram for each of the raster engines too. >.< If we follow the first scenario, that raster engine will definitely not be able to make full use of the memory controller bandwidth.

#4997861 Posted on: 01/23/2015 12:23 PM
WoW, that is a hell of a bandwidth drop.

So, nvidia was selling 3GB/208bit cards as 4GB/256bit? Oh, my...

JohnLai
Senior Member

Posts: 136
Joined: 2006-04-25

#4997863 Posted on: 01/23/2015 12:30 PM
Fresh user.....T_T....okay....
Pill monster, could you take a look at Nai's source code? See if there is any issue with the code? I admit I am not coding-literate.

demise
Junior Member

Posts: 17
Joined: 2014-11-29

#4997880 Posted on: 01/23/2015 01:11 PM
Testing methods are all over the place, so there is nothing really conclusive there. I personally won't bother testing until The Witcher 3 comes out. Most of these other games are questionable console ports from Ubisoft, or Shadow of Mordor, which I can't be bothered to re-download.

Interested to see how this concludes myself. Not too worried about it for the moment though. The 970 is still a massive improvement over the 560Ti I was using previously.

Fox2232
Senior Member

Posts: 8504
Joined: 2012-07-20

#4997886 Posted on: 01/23/2015 01:29 PM
WoW, that is a hell of a bandwidth drop.

So, nvidia was selling 3GB/208bit cards as 4GB/256bit? Oh, my...

No, they are selling a 4GB/256-bit card, as there are 4GB physically and each of the 8 chips has a 32-bit bus.
But from their assumption it is quite possible that while some parts of memory can be accessed directly, others are accessed via shared switching infrastructure, as crucial parts of the GPU are cut.

Fresh user.....T_T....okay....
Pill monster, could you take a look at Nai's source code? See if there is any issue with the code? I admit I am not coding-literate.

Get the link; while I don't do CUDA, I can check for obvious deviations.

JohnLai
Senior Member

Posts: 136
Joined: 2006-04-25

#4997889 Posted on: 01/23/2015 01:41 PM
No, they are selling a 4GB/256-bit card, as there are 4GB physically and each of the 8 chips has a 32-bit bus.
But from their assumption it is quite possible that while some parts of memory can be accessed directly, others are accessed via shared switching infrastructure, as crucial parts of the GPU are cut.

Get the link; while I don't do CUDA, I can check for obvious deviations.

Please do check, I appreciate it ^.^
This is Nai's source code.
The only problem: it is preferable to use the IGPU and set the GTX 970 in headless display mode when running the benchmark. Otherwise, the result might be inaccurate due to web browsers and Windows compositing reserving / using VRAM.

#4997932 Posted on: 01/23/2015 03:16 PM
Same card as above and can confirm as well.

Certainly odd.

Edit: Noticed someone had posted while I was testing. Started wondering and checked Afterburner, and what do you know: 4038MB of VRAM allocated during the test. Still, it's odd that the 980 didn't show it in the test.

skacikpl
Senior Member

Posts: 128
Joined: 2014-07-08

#4997937 Posted on: 01/23/2015 03:22 PM

Are you running the bench with the NVIDIA GPU in headless display mode?

No, not really, but I could do that or retry the test with as much VRAM left free as I can.

Generally I believe that there's something wrong here.

JohnLai
Senior Member

Posts: 136
Joined: 2006-04-25

#4997938 Posted on: 01/23/2015 03:23 PM
=.= Sigh, when you guys run the benchmark, please mention whether you are running it with the NVIDIA GPU put in headless display mode!
Otherwise, Windows compositing + web browsers will reserve some portion of VRAM and skew the result.

skacikpl
Senior Member

Posts: 128
Joined: 2014-07-08

#4997942 Posted on: 01/23/2015 03:29 PM
=.= Sigh, when you guys run the benchmark, please mention whether you are running it with the NVIDIA GPU put in headless display mode!
Otherwise, Windows compositing + web browsers will reserve some portion of VRAM and skew the result.
In a normal scenario (games/rendering) nobody is going to run on the IGPU; the drop in bandwidth is dramatic and I guess that something IS wrong here.

Technically somebody could try running it clean just to prove it once and for all.
Though, in normal usage, even if some drop in performance is expected - come on, DRAM bandwidth drops from ~150 to ~16/~20 GB/s, and cache bandwidth goes from ~422 to ~16/~25/~77 GB/s. That much of a performance drop is suspicious.

Also, I'm not an expert in VRAM allocation, but I doubt that Windows and normal desktop programs would impact the last gigabyte of VRAM in any way while leaving the other three untouched.

Im2bad
Senior Member

Posts: 791
Joined: 2006-02-19

#4997943 Posted on: 01/23/2015 03:32 PM
=.= Sigh, when you guys run the benchmark, please mention whether you are running it with the NVIDIA GPU put in headless display mode!
Otherwise, Windows compositing + web browsers will reserve some portion of VRAM and skew the result.

If I understand that mode correctly, what you're asking for is impossible with single-GPU configurations.

Please do check, I appreciate it ^.^
This is Nai's source code.
The only problem: it is preferable to use the IGPU and set the GTX 970 in headless display mode when running the benchmark. Otherwise, the result might be inaccurate due to web browsers and Windows compositing reserving / using VRAM.

Code looks solid, no logical/math errors.
1. The code allocates memory 128MB chunk by chunk until the card runs out of memory (the last sub-128MB block is not allocated). Therefore, if something has already allocated memory when you run this code, it simply allocates the remaining memory and the test should not be affected.
2. There were some experimental rewrites. I picked out 3 definitions; 2 of them are fixed ones while the 3rd is based on the previous two. And I am really not sure why there is no fixed value for the 3rd, as it is based on the 2 previous ones and they are not altered by the code.
3. The only question I have is

and its cache bench counterpart, as I am not sure what kind of overhead "blockDim.x * blockIdx.x + threadIdx.x" has, as those are CUDA-related allocations, and where those are held.

I would rather make something that allocates the entire block, fills it with random incompressible data, and then benchmarks some simple math operation over each chunk. Negation, for example, since it is repeatable and always has the same result.

It would be much slower and would not give bandwidth figures, but it would show whether each chunk gets processed in the same amount of time.
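
For what it's worth, the alternative test described above is easy to sketch out. The snippet below is a hypothetical take on it, not an actual rewrite of Nai's code (the chunk size, PRNG and names are our own choices): fill every chunk with pseudo-random data, then time an in-place negation over each chunk, so the comparison is per-chunk processing time rather than a bandwidth figure.

```
// negate_sweep.cu - hypothetical sketch of the per-chunk negation test
// suggested above; chunk size and PRNG are arbitrary choices.
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// xorshift-style fill so the data is effectively incompressible.
__global__ void fillRandom(unsigned* p, size_t n, unsigned seed)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    unsigned x = seed ^ (unsigned)(i * 2654435761u);
    x ^= x << 13; x ^= x >> 17; x ^= x << 5;
    p[i] = x;
}

// The simple, repeatable operation: in-place bitwise negation.
__global__ void negate(unsigned* p, size_t n)
{
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) p[i] = ~p[i];
}

int main()
{
    const size_t chunkBytes = 128u * 1024u * 1024u;
    const size_t n = chunkBytes / sizeof(unsigned);
    const unsigned grid = (unsigned)((n + 255) / 256);
    std::vector<unsigned*> chunks;

    // Allocate chunk by chunk until the card runs out of memory.
    unsigned* p = nullptr;
    while (cudaMalloc(&p, chunkBytes) == cudaSuccess)
        chunks.push_back(p);
    cudaGetLastError();                          // clear the expected final failure

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    for (size_t c = 0; c < chunks.size(); ++c) {
        fillRandom<<<grid, 256>>>(chunks[c], n, 0xC0FFEEu + (unsigned)c);
        negate<<<grid, 256>>>(chunks[c], n);     // warm-up launch
        cudaEventRecord(start);
        negate<<<grid, 256>>>(chunks[c], n);     // timed launch
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);
        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("chunk %3zu: %8.3f ms\n", c, ms); // equal times = uniform memory
    }

    for (unsigned* q : chunks) cudaFree(q);
    return 0;
}
```

If every chunk takes the same time, the memory behind it behaves uniformly; a chunk that consistently takes several times longer is sitting in a slower segment.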

Kashinoda
Member

Posts: 25
Joined: 2014-02-03

#4997945 Posted on: 01/23/2015 03:35 PM

If I understand that mode correctly, what you're asking for is impossible with single-GPU configurations.

Wouldn't most people have an IGP on their motherboard? Except for maybe older i7s. Easily tested.