NVIDIA Fermi GPU and Architecture Analysis - Page 9

Published on 23rd Oct 2010, written by Alex Voicu for Consumer Graphics - Last updated: 28th Oct 2010

ROPs, Memory Interface

A fully enabled Slimer has a 384-pin VRAM data bus, arranged as six independent partitions of 64 pins each. Each partition supports GDDR5 and exclusively owns two 32-bit DRAMs. As we've mentioned, memory requests and responses are routed through the L2. In a move meant to please farm owners everywhere, NVIDIA has added ECC protection to all levels of its memory hierarchy, going for a 1-bit fix, 2-bit detect solution (this only applies to Teslas). Enabling ECC costs both storage space (for the ECC codes) and bandwidth (for transmitting the codes). Alongside ECC there's also EDC, which generally comes naturally with the use of GDDR5.
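As a quick sanity check on the bus arithmetic above, here is a minimal Python sketch. The partition count, pins per partition and DRAM width come from the text. The function names are ours, and the per-pin data rate is deliberately left as an input, since it depends on the memory clock of a given board rather than on anything stated here.

```python
# Figures from the text: six partitions, 64 pins each, two 32-bit DRAMs per partition.
PARTITIONS = 6
PINS_PER_PARTITION = 64
DRAM_WIDTH_BITS = 32


def bus_width_bits(partitions=PARTITIONS):
    """Total data bus width in bits for a given number of enabled partitions."""
    return partitions * PINS_PER_PARTITION


def dram_count(partitions=PARTITIONS):
    """Number of DRAM devices, assuming each partition owns 64 / 32 = 2 of them."""
    return partitions * (PINS_PER_PARTITION // DRAM_WIDTH_BITS)


def peak_bandwidth_gb_s(data_rate_gbps_per_pin, partitions=PARTITIONS):
    """Peak bandwidth in GB/s; the per-pin data rate is an assumed input, not a figure from the text."""
    return bus_width_bits(partitions) * data_rate_gbps_per_pin / 8


print(bus_width_bits(), dram_count())
```

Passing a smaller `partitions` value models a part with memory partitions disabled, under the assumption that each disabled partition simply removes its 64 pins and two DRAMs.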

There are 6 ROPs on-chip, each paired with a specific memory partition. Peak rate per ROP is 8 pixels per clock (ROPs run at base clock), which jumps to 64 pixels per clock in a Z-only scenario. There's no dedicated cache for the ROPs, with the ever-helpful L2 taking up this task as well. This makes sense, since GPCs write out fragments generated in the SMs to the L2 anyhow.

It also means that there's quite a bump in cache capacity (older designs had quite small Colour/Depth caches). There aren't many improvements in the MSAA department, save for extending CSAA to compute coverage with up to 32 samples, and making it play nicely with TAA. Later on we'll show you exactly how the ROPs perform with different formats.

Now, you may be thinking: no more than 8 fragments can be rasterised per GPC per base clock, so it takes 4 base clocks to fill a fragment warp, which makes the apparent rate 8 fragments per GPC per clock, or 32 across the entire chip. Why so many ROPs, then (6 of them equate to a theoretical maximum of 48 fragments per base clock)? Two reasons, at least in our opinion: first, the memory controller-to-ROP connection is so tight that it would have been quite intrusive to remove the extra ROPs, and second, atomics.
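The mismatch between raster rate and ROP rate is easy to put into numbers. The sketch below is illustrative only: the per-GPC and per-ROP rates are the ones quoted in this article, the GPC count of 4 is implied by the 8-per-GPC and 32-chip-wide figures, and the 32-wide warp follows from the "4 clocks to fill a warp" remark. The function names are ours.

```python
GPCS = 4                      # implied: 8 fragments/GPC/clock -> 32 chip-wide
RASTER_FRAGS_PER_GPC = 8      # fragments rasterised per GPC per base clock
WARP_SIZE = 32                # fragments per warp
ROP_PARTITIONS = 6            # one ROP partition per memory partition
COLOUR_PIXELS_PER_ROP = 8     # peak pixels per clock per ROP
Z_ONLY_PIXELS_PER_ROP = 64    # peak pixels per clock per ROP, Z-only


def raster_rate(gpcs=GPCS):
    """Peak fragments rasterised per base clock across the chip."""
    return gpcs * RASTER_FRAGS_PER_GPC


def rop_rate(rops=ROP_PARTITIONS, per_rop=COLOUR_PIXELS_PER_ROP):
    """Peak pixels the ROPs can process per base clock."""
    return rops * per_rop


def clocks_to_fill_warp():
    """Base clocks one GPC's rasteriser needs to produce a full fragment warp."""
    return WARP_SIZE // RASTER_FRAGS_PER_GPC


print(raster_rate(), rop_rate(), rop_rate(per_rop=Z_ONLY_PIXELS_PER_ROP), clocks_to_fill_warp())
```

On these theoretical peaks the ROPs have 16 pixels per clock of headroom over the rasterisers for colour writes, which is the surplus the two reasons above try to account for.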

The ROPs serve as the atomic units, at least for atomics performed on addresses that map to global memory. Since raster rate has no bearing on these, atomics can benefit from the extra ROPs. Speaking of atomics, another interesting aspect is that Fermi also adds support for doing atomic ADD or XCH with FP operands (INT atomic units are cheap, FP units not so much). Finally, we believe that writes to the L2 portion that's allocated as ROP cache are serialised between GPCs, so as to prevent conflicts/contention, with each GPC writing at most 128 bytes to it in a round-robin fashion.

Furthermore, the L1 is bypassed for this, with each SM contributing a 32-byte payload to the total write. This means that a fully enabled Slimer would be stuck at 32 4-byte exported fragment data chunks per cycle anyhow, whilst the partially disabled parts get even less throughput, in proportion to the number of SMs disabled (so a 480 gets only 30 exported fragments, a 470 only 28, and so on and so forth). We'll verify this as well down the pipe.
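The export figures quoted above drop out of a simple averaging model, sketched here in Python. This is our reading of the speculation rather than a confirmed description of the hardware: we assume 4 GPCs take one-clock turns in the round-robin, that every enabled SM contributes one 32-byte payload per rotation, and that the SM counts are 16, 15 and 14 for the full chip, the 480 and the 470 respectively.

```python
EXPORT_BYTES_PER_SM = 32   # payload each SM contributes to its GPC's write
FRAGMENT_BYTES = 4         # size of one exported fragment data chunk
GPCS = 4                   # assumed number of round-robin slots


def exported_fragments_per_clock(enabled_sms, gpcs=GPCS):
    """Average 4-byte fragments exported per clock over one round-robin rotation."""
    bytes_per_rotation = enabled_sms * EXPORT_BYTES_PER_SM
    bytes_per_clock = bytes_per_rotation / gpcs  # one GPC writes per clock
    return bytes_per_clock / FRAGMENT_BYTES


for name, sms in [("full chip", 16), ("480", 15), ("470", 14)]:
    print(name, exported_fragments_per_clock(sms))
```

Under these assumptions the rate reduces to two fragments per clock per enabled SM, which reproduces the 32, 30 and 28 figures given in the text.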

Having completed the first stage of any B3D analysis, namely the wild speculation about architectural detail, it's time to move to the second and last stage, which is testing! We'll carry on in the order imposed by the logical graphics pipeline (sorry compute peeps!), and start with triangle setup.