Research, Computer Graphics and GPU

NVIDIA just launched the TU102/TU104 (GeForce RTX 2080 Ti/2080), the first GPUs based on the Turing architecture. This new architecture brings hardware ray-tracing acceleration, as well as many other new and really cool graphics features. A good high-level overview of the architecture and the new graphics features can be found in the Turing Architecture whitepaper as well as in this blog post. Most of these features are exposed through both Vulkan and OpenGL extensions, and I will quickly go through each of them in this post. A big thanks to the many people at NVIDIA who worked hard to provide us with these extensions!

Most features split into a Vulkan- or OpenGL-specific API extension (VK_*/GL_*), and a GLSL or SPIR-V shader extension (GLSL_*/SPV_*).

Ray-Tracing Acceleration

Turing brings hardware acceleration for ray-tracing through dedicated units called RT Cores. The RT Cores provide BVH traversal as well as ray-triangle intersection testing. This acceleration is exposed in Vulkan through a new ray-tracing pipeline, associated with a series of new shader stages. This programming model maps to the DXR (DirectX Ray-Tracing) model, which is quickly described in this blog post, and this blog post details the Vulkan implementation.
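To give a flavor of the new programming model, here is a minimal ray-generation shader sketch using the GLSL GL_NV_ray_tracing extension. The bindings, payload layout and ray parameters are hypothetical, and the closest-hit/miss shaders (which would write the payload) are omitted:

```glsl
#version 460
#extension GL_NV_ray_tracing : require

// Hypothetical bindings: a top-level acceleration structure and an output image.
layout(binding = 0) uniform accelerationStructureNV topLevelAS;
layout(binding = 1, rgba8) uniform image2D outImage;

layout(location = 0) rayPayloadNV vec3 hitColor;

void main()
{
    // One ray per pixel of the launch grid.
    vec2 uv = (vec2(gl_LaunchIDNV.xy) + 0.5) / vec2(gl_LaunchSizeNV.xy);
    vec3 origin    = vec3(uv * 2.0 - 1.0, -1.0);
    vec3 direction = vec3(0.0, 0.0, 1.0);

    // Trace against the BVH; the closest-hit/miss shaders fill hitColor.
    traceNV(topLevelAS, gl_RayFlagsOpaqueNV, 0xFF,
            0 /*sbtRecordOffset*/, 0 /*sbtRecordStride*/, 0 /*missIndex*/,
            origin, 0.001, direction, 1000.0, 0 /*payload location*/);

    imageStore(outImage, ivec2(gl_LaunchIDNV.xy), vec4(hitColor, 1.0));
}
```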

Mesh Shading

This is a new programmable geometry pipeline which replaces the traditional VS/HS/DS/GS pipeline with a compute-like programming model. The new pipeline is based on two shader stages, a Task Shader and a Mesh Shader (separated by an expansion stage), which are ultimately used to generate a compact mesh description called a Meshlet: a mini indexed geometry representation which is kept on-chip and fed directly to the rasterizer for consumption. This exposes a very flexible and very efficient model with Compute Shader features and generic cooperative thread groups (workgroups, shared memory, barrier synchronization...). Applications are endless: this can for instance be used to implement efficient culling or LOD schemes, or to perform procedural geometry generation.
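As a rough sketch of what the new stage looks like in GLSL (GL_NV_mesh_shader), here is a trivial mesh shader emitting a single-triangle meshlet; a real meshlet would of course be built cooperatively by the whole workgroup:

```glsl
#version 450
#extension GL_NV_mesh_shader : require

// One workgroup produces one meshlet; here a single triangle.
layout(local_size_x = 1) in;
layout(triangles, max_vertices = 3, max_primitives = 1) out;

void main()
{
    // Write the meshlet's vertices...
    gl_MeshVerticesNV[0].gl_Position = vec4(-0.5, -0.5, 0.0, 1.0);
    gl_MeshVerticesNV[1].gl_Position = vec4( 0.5, -0.5, 0.0, 1.0);
    gl_MeshVerticesNV[2].gl_Position = vec4( 0.0,  0.5, 0.0, 1.0);

    // ...its local index buffer...
    gl_PrimitiveIndicesNV[0] = 0;
    gl_PrimitiveIndicesNV[1] = 1;
    gl_PrimitiveIndicesNV[2] = 2;

    // ...and the number of primitives actually emitted.
    gl_PrimitiveCountNV = 1;
}
```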

Variable Rate Shading

This is a very powerful hardware feature which allows the application to dynamically control the number of fragment shader invocations (independently of the visibility rate) and to vary this shading rate across the framebuffer. The shading rate is controlled using a texture image (the "Shading Rate Image", 8 bits/texel) where each texel specifies an independent shading rate for a block of 16x16 pixels. The rate is actually specified indirectly: each 8-bit texel indexes into a per-viewport palette which stores the actual shading-rate flags.

Not only does this feature allow varying the MSAA shading rate per pixel (allowing 1x, 4x, 8x, and now even 16x SSAA, but with a maximum of 8x depth test and color storage), it also allows dropping the shading rate below one invocation per pixel, down to one invocation per 4x4-pixel block (through one invocation per 1x2, 2x1, 2x2, 2x4 and 4x2 pixels), and even to zero invocations.

The GLSL extension also exposes intrinsics allowing fragment shaders to read the effective fragment size in pixels (gl_FragmentSizeNV) as well as the number of fragment shader invocations for a fully covered pixel (gl_InvocationsPerPixelNV). This opens the road to many new algorithms and to more efficient implementations of optimized shading-rate techniques, like Foveated Rendering, Lens Adaptation (for VR), and Content- or Motion-Adaptive Shading.
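For illustration, a fragment shader reading these intrinsics might look like the following sketch (GL_NV_shading_rate_image; the coloring is just a debug visualization):

```glsl
#version 450
#extension GL_NV_shading_rate_image : require

layout(location = 0) out vec4 fragColor;

void main()
{
    // Size in pixels of the area covered by this invocation
    // (e.g. (2,2) when shading at one invocation per 2x2-pixel block).
    ivec2 fragSize = gl_FragmentSizeNV;

    // Number of invocations for a fully covered pixel (>1 when supersampling).
    int rate = gl_InvocationsPerPixelNV;

    // Visualize the effective shading rate.
    fragColor = vec4(vec2(fragSize) / 4.0, float(rate) / 16.0, 1.0);
}
```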

Exclusive Scissor Test

This adds a second per-viewport scissor test, which culls fragments *inside* (exclusive) the specified rectangle, unlike the standard scissor test which culls fragments *outside* it (inclusive). This can for instance be used to implement more efficient multi-resolution foveated-rendering techniques (in conjunction with Variable Rate Shading), where successive raster passes fill concentric strips of pixels by enabling both the inclusive and the exclusive scissor tests.

Texture Access Footprint

These extensions expose a set of GLSL (and SPIR-V) query functions which report the texture-space footprint of a texture lookup, i.e. data identifying the set of all texels that may be accessed in order to return a filtered result for the corresponding texture access (which can use anisotropic filtering and potentially cover a large footprint). A footprint is returned as an LOD value, an anchor point and a 64-bit bitfield where each bit represents coverage for a group of neighboring texels (in 2D, the group granularity can range from 2x2 to 256x256 texels).
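A hedged sketch of what such a query looks like in GLSL (GL_NV_shader_texture_footprint); the struct members and the meaning of the granularity parameter are from my reading of the spec, so double-check them there:

```glsl
#version 450
#extension GL_NV_shader_texture_footprint : require

layout(binding = 0) uniform sampler2D tex;
layout(location = 0) in vec2 uv;
layout(location = 0) out vec4 fragColor;

void main()
{
    gl_TextureFootprint2DNV footprint;

    // Query the set of texels a regular texture(tex, uv) lookup would touch;
    // "coarse" selects which of the two accessed LODs is reported, and the
    // granularity parameter selects the texel-group size (see the spec).
    bool singleLod = textureFootprintNV(tex, uv, 2, true, footprint);

    // footprint.mask is the 64-bit coverage bitfield (as a uvec2);
    // footprint.anchor/offset locate it in texel space, footprint.lod is the LOD.
    fragColor = vec4(vec2(footprint.anchor), float(footprint.lod),
                     singleLod ? 1.0 : 0.5);
}
```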

This is actually an important component for implementing multi-pass decoupled and texture-space shading pipelines, where a restricted set of actually visible pixels must be determined in order to efficiently perform shading in a subsequent pass.

Derivatives in Compute Shader

These extensions bring Compute even closer to Graphics by adding support for quad-based derivatives in Compute Shaders, using the x and y coordinates of the local workgroup invocation ID. This allows Compute Shaders to use built-in derivative functions like dFdx(), texture lookup functions with automatic LOD computation, and the texture level-of-detail query function (textureQueryLod()).
Two layout qualifiers allow specifying the quad arrangement, based on either a linear index or 2D indices.
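A minimal compute shader sketch using the 2D quad arrangement (GL_NV_compute_shader_derivatives); the bindings are hypothetical:

```glsl
#version 450
#extension GL_NV_compute_shader_derivatives : require

// Arrange invocations into 2x2 quads based on the x/y of the local invocation ID.
layout(derivative_group_quadsNV) in;
layout(local_size_x = 8, local_size_y = 8) in;

layout(binding = 0) uniform sampler2D tex;
layout(binding = 1, rgba8) writeonly uniform image2D dst;

void main()
{
    vec2 uv = (vec2(gl_GlobalInvocationID.xy) + 0.5) / vec2(imageSize(dst));

    // Automatic LOD selection now works in compute, since derivatives
    // of uv across the quad are well defined.
    vec4 c = texture(tex, uv);
    imageStore(dst, ivec2(gl_GlobalInvocationID.xy), c);
}
```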

Shader Subgroup Operations

These shader extensions provide a series of ballot-based partitioning and scan/reduce operations which operate on "subgroups" of shader invocations. This can be used for instance to implement clustering and de-duplication operations on sets of values distributed among different shader invocations.
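A small sketch of how a de-duplication/counting step could look with GL_NV_shader_subgroup_partitioned (the buffer layout is hypothetical):

```glsl
#version 450
#extension GL_KHR_shader_subgroup_basic : require
#extension GL_NV_shader_subgroup_partitioned : require

layout(local_size_x = 32) in;

layout(binding = 0) buffer Keys   { uint keys[];   };
layout(binding = 1) buffer Counts { uint counts[]; };

void main()
{
    uint key = keys[gl_GlobalInvocationID.x];

    // Partition the subgroup: invocations holding the same key end up
    // in the same partition (returned as a ballot mask).
    uvec4 ballot = subgroupPartitionNV(key);

    // Count how many invocations share this key (clustering/de-duplication).
    uint count = subgroupPartitionedAddNV(1u, ballot);

    counts[gl_GlobalInvocationID.x] = count;
}
```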

Fragment Shader Barycentrics

This feature exposes barycentric coordinates as a Fragment Shader input in GLSL (and SPIR-V), and provides the ability for a Fragment Shader to directly fetch the raw per-vertex values of the current primitive in order to perform manual barycentric interpolation.

A three-component vector built-in input, gl_BaryCoordNV, provides perspective-corrected barycentric coordinates (gl_BaryCoordNoPerspNV for the non-perspective-corrected version). Per-vertex inputs use the same bracketed array syntax as Tessellation and Geometry Shader inputs, and a new pervertexNV qualifier identifies input blocks and variables which read raw per-vertex values from the vertices of the original primitive.

This feature potentially allows passing data to the Fragment Shader more efficiently, using compact or compressed data formats for instance. It could also allow interpolating vertex values fetched directly from memory, user-defined interpolations, or various reconstructions and computations using the raw attributes accessed from the three vertices.
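A minimal fragment shader sketch doing manual perspective-correct interpolation with GL_NV_fragment_shader_barycentric (the per-vertex input is hypothetical):

```glsl
#version 450
#extension GL_NV_fragment_shader_barycentric : require

// Raw per-vertex values of the current primitive, no fixed-function interpolation.
layout(location = 0) pervertexNV in vec3 v_color[];

layout(location = 0) out vec4 fragColor;

void main()
{
    // Manual interpolation using the perspective-corrected barycentrics.
    vec3 c = gl_BaryCoordNV.x * v_color[0]
           + gl_BaryCoordNV.y * v_color[1]
           + gl_BaryCoordNV.z * v_color[2];
    fragColor = vec4(c, 1.0);
}
```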

Ptex Hardware Acceleration

A corner-sampled image has texels centered on integer coordinates instead of halfway between them, which allows edge sampling coordinates to filter to the exact texels on the edge of the texture. This facilitates implementing Ptex (Per-face Texture [Burley and Lacewell 2008], cf. https://developer.nvidia.com/sites/default/files/akamai/gamedev/docs/Borderless%20Ptex.pdf) texturing in real-time applications by providing proper filtering and interpolation. Ptex uses a separate texture for each face of a subdivision surface or polygon mesh, with sample locations placed at pixel corners, maintaining continuity between adjacent patches by duplicating values along shared edges.

Representative Fragment Test

This extension has been designed to allow optimizing occlusion-query techniques which rely on per-fragment recording of visible primitives. It allows the hardware to stop generating fragments and emitting fragment shader invocations for a given primitive once a single fragment has passed the early depth and stencil tests. This reduced subset of fragment shader invocations can then be used to record visible primitives in a more performant way. This is only a performance optimization: no guarantee is given on the number of discarded fragments, and consequently on the number of fragment shader invocations that will actually be executed.
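For illustration, the fragment shader of such a visibility pass can be as simple as the sketch below; the representative fragment test itself is enabled on the host side, and the SSBO-based recording scheme is hypothetical:

```glsl
#version 450

// With the representative fragment test enabled by the application
// (glEnable(GL_REPRESENTATIVE_FRAGMENT_TEST_NV)), most fragments of a
// primitive may be discarded once one has passed depth/stencil, so this
// shader runs for only a few "representative" fragments per primitive.
layout(binding = 0) buffer Visibility { uint visible[]; };

void main()
{
    // Record that this primitive is visible (hypothetical indexing scheme).
    visible[gl_PrimitiveID] = 1u;
}
```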

Multi-View Rendering

The Pascal Simultaneous Multi-Projection (SMP) and stereo-view features allowed broadcasting the same geometric data for rasterization to multiple views (up to 2) and viewports (up to 16), with very limited possibilities of per-view (and per-viewport) variation (a horizontal position offset per view, plus an xyz swizzle and viewport attributes per viewport). Turing generalizes and improves on this feature by allowing fully independent per-view attribute values (including vertex positions) to be specified, and exposes up to 4 views. No new extension is provided; the feature is exposed transparently as an optimization to the existing standardized Vulkan VK_KHR_multiview and OpenGL GL_OVR_multiview extensions.
More info on Multi-View Rendering in this blog post: https://devblogs.nvidia.com/turing-multi-view-rendering-vrworks/

NVIDIA just launched the second-generation Maxwell architecture with the GM204 GPU, which is, I believe, an incredible chip. The Maxwell 2 architecture is both highly energy-efficient (~2x the perf/watt of Kepler in games) and provides a lot of very exciting new graphics features (some of which are exposed in Direct3D). These features are exposed in the form of new OpenGL extensions in the R344 driver that was released today, and the specifications for all NVIDIA-supported GL extensions can be found here. NVIDIA also released new SDK samples using these extensions.

Quick description of the new extensions

Multisampling

This feature adds a lot of flexibility to multi-sampled rasterization. It decouples the rasterization sampling frequency (which can be set explicitly) from the actual framebuffer storage, enabling rasterization to operate at a higher sampling frequency than that of the target color buffers.
Depth and stencil testing are supported at this higher frequency, provided the corresponding depth and stencil buffers are sampled accordingly (their sample count must be a multiple of the number of samples in the color buffers).
There are still some constraints: all color buffers must have the same number of samples, and if depth or stencil testing is enabled, the raster sample count must match the depth/stencil buffer sample count and be greater than or equal to the color buffer sample count.

A new “coverage reduction stage” is introduced in the per-fragment operations (after the fragment shader in early-z mode, after the depth-test in late-z), which converts a set of covered raster/depth/stencil samples to a set of covered color samples.
There is an implementation-dependent association of raster samples to color samples. The reduced "color coverage" is computed such that the coverage bit for each color sample is 1 if any of the associated bits in the fragment's coverage is set, and 0 otherwise.
This feature can be used in conjunction with the coverage to color feature (cf. below), in order to get the FS output coverage mask automatically transformed into a color by ROP.
According to AnandTech, when rasterizing with explicit multisampling and no render-target, Maxwell allows evaluating primitive coverage at 16x MSAA.

Note that EXT_raster_multisample is equivalent to "Target-Independent Rasterization" in Direct3D 11.1, which allows using multiple raster samples with a single color sample, as long as depth and stencil tests are disabled, and it is actually a subset of NV_framebuffer_mixed_samples which is more general and exposes more flexibility.

This ("coverage to color", NV_fragment_coverage_to_color) allows using the ROP to automatically convert the post depth-/stencil-/alpha-test coverage mask into a color and write it into a color render target.
This conversion is performed before the new coverage reduction stage (cf. NV_framebuffer_mixed_samples) and can be useful for saving coverage in the context of deferred shading.

When operating in early-depth mode (layout(early_fragment_tests) in;, see here for more information), this extension (EXT_post_depth_coverage) allows the fragment shader to receive the post depth-test coverage mask of the current fragment as input (gl_SampleMaskIn[], in which only the samples passing the depth test are set), unlike the standard GL 4.5 behavior which provides the pre-depth-test coverage (the actual triangle coverage).

With standard OpenGL, the Fragment Shader output coverage mask (gl_SampleMask[]) is ANDed with the actual primitive input coverage mask before being used in subsequent pipeline stages.
This extension (NV_sample_mask_override_coverage) disables this AND operation, which allows the fragment shader to entirely override the primitive coverage and to set coverage bits that are not present in the input mask.
This is actually very nice, because it allows using the output coverage as a way to dynamically route color output values into arbitrary sample locations inside a multisampled render target.
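If I read the extension (NV_sample_mask_override_coverage) correctly, the override is requested by redeclaring gl_SampleMask[] with a layout qualifier; a hedged sketch:

```glsl
#version 450
#extension GL_NV_sample_mask_override_coverage : require

// Redeclaring gl_SampleMask with override_coverage makes the written mask
// replace the primitive's coverage instead of being ANDed with it.
layout(override_coverage) out int gl_SampleMask[];

layout(location = 0) out vec4 fragColor;

void main()
{
    // Route this fragment's color to sample 3 only, regardless of which
    // samples the primitive actually covers.
    gl_SampleMask[0] = 1 << 3;
    fragColor = vec4(1.0);
}
```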

This (NV_sample_locations) allows applications to explicitly set the location of sub-pixel samples for multisample rasterization, providing fully programmable sampling patterns. Sampling patterns can be defined within a grid of adjacent pixels, whose size depends on the number of samples. According to the provided queries, sub-pixel positions are snapped to a 16x16 sub-pixel grid.

Rasterization

This (NV_conservative_raster) is a really great feature. It allows rasterization to generate fragments for every pixel touched by a triangle, even if no sample location within the pixel is covered.
A new control is also provided to modify the window-coordinate snapping precision, in order to let the application match conservative-rasterization triangle snapping with the snapping that would have occurred at a higher resolution.
Polygons with zero area generate no fragments. Any location within a pixel may be used for interpolating attributes, potentially causing attribute extrapolation outside the triangle.
This can for instance be useful for binning purposes (using one pixel per tile).

This extension (NV_fragment_shader_interlock) exposes a hardware-accelerated critical section for the fragment shader, allowing hazard-free read-modify-write operations on a per-pixel basis.
It also allows enforcing primitive ordering for threads entering the critical section.
It provides new GLSL calls, beginInvocationInterlockNV() and endInvocationInterlockNV(), defining a critical section which is guaranteed to be executed for only one fragment at a time.
Interlock can be done on a per-pixel or per-sample basis when multi-sampled rasterization is used.
This feature is useful for algorithms that need to access per-pixel data structures via shader load and store operations while avoiding race conditions; obvious applications are OIT and programmable blending, for instance.
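A minimal sketch of such a critical section in GLSL (GL_NV_fragment_shader_interlock); the per-pixel counter image is just an example:

```glsl
#version 450
#extension GL_NV_fragment_shader_interlock : require

// Execute the critical section in primitive order, one fragment per pixel at a time.
layout(pixel_interlock_ordered) in;

layout(binding = 0, r32ui) coherent uniform uimage2D counters;

void main()
{
    beginInvocationInterlockNV();

    // Hazard-free read-modify-write on per-pixel data
    // (e.g. a fragment counter for OIT or programmable blending).
    uint c = imageLoad(counters, ivec2(gl_FragCoord.xy)).x;
    imageStore(counters, ivec2(gl_FragCoord.xy), uvec4(c + 1u));

    endInvocationInterlockNV();
}
```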

This (NV_fill_rectangle) allows rasterizing the axis-aligned screen-space bounding box of submitted triangles, disregarding the actual triangle edges.
It can be useful for drawing a full-screen quad without generating an internal edge, or for drawing user interfaces more efficiently.

Geometry processing

This extension (NV_geometry_shader_passthrough) makes geometry shaders more efficient in the pass-through case, i.e. when there is a one-to-one mapping between input and output primitives.
In this case, per-vertex attributes are simply copied from the input primitive to the output primitive, and the geometry shader is only used to set per-primitive attributes (like gl_Layer, gl_ViewportMask[], ...), which can be computed from the input vertex attributes.
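A hedged sketch of a passthrough geometry shader (GL_NV_geometry_shader_passthrough); the input block and the layer assignment are hypothetical:

```glsl
#version 450
#extension GL_NV_geometry_shader_passthrough : require

layout(triangles) in;

// Per-vertex data is copied through unchanged; note the absence of an
// output primitive declaration and of EmitVertex()/EndPrimitive() calls.
layout(passthrough) in gl_PerVertex { vec4 gl_Position; } gl_in[];
layout(passthrough) in Inputs { vec2 uv; } v_in[];

void main()
{
    // Only per-primitive outputs are written here.
    gl_Layer = gl_PrimitiveIDIn & 1; // hypothetical layer assignment
}
```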

Viewport multicast (NV_viewport_array2) allows automatically broadcasting the same primitive to multiple viewports (and/or multiple layers when using layered render targets) simultaneously, in order to rasterize it multiple times.
It is exposed through a new gl_ViewportMask[] GLSL output attribute, available in both the vertex shader and the geometry shader.
This can be especially powerful when combined with the new passthrough geometry shader. A sample using it to speed up cascaded shadow maps is available here.
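For illustration, broadcasting a primitive to two viewports directly from the vertex shader (GL_NV_viewport_array2) can be as simple as:

```glsl
#version 450
#extension GL_NV_viewport_array2 : require

layout(location = 0) in vec4 position;

void main()
{
    gl_Position = position;

    // Broadcast this primitive to viewports 0 and 1 in a single pass
    // (e.g. left/right eye for stereo rendering, or two shadow cascades).
    gl_ViewportMask[0] = 0x3;
}
```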

Texturing

This extension improves on ARB_sparse_texture, which separates the allocation of virtual address space from the physical memory of textures and provides the ability to sparsely allocate the physical backing store of 2D/3D/2DArray textures on a per-tile basis.
The new extension adds the ability to retrieve texture-access residency information from GLSL, to specify a minimum allocated LOD for texture fetches, and to return a constant zero value for lookups into unallocated pages. It also adds support for multi-sampled textures.
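A sketch of a residency-aware lookup using the ARB-style GLSL interface (sparseTextureARB/sparseTexelsResidentARB); the fallback behavior is just an example:

```glsl
#version 450
#extension GL_ARB_sparse_texture2 : require

layout(binding = 0) uniform sampler2D sparseTex;
layout(location = 0) in vec2 uv;
layout(location = 0) out vec4 fragColor;

void main()
{
    vec4 texel;

    // The returned code tells whether all accessed pages were resident.
    int code = sparseTextureARB(sparseTex, uv, texel);

    if (!sparseTexelsResidentARB(code)) {
        // Page(s) not committed: fall back (here, flag the miss in red);
        // a streaming system could also record a page request here.
        texel = vec4(1.0, 0.0, 0.0, 1.0);
    }
    fragColor = texel;
}
```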

This (EXT_texture_filter_minmax) exposes a new sampler parameter which performs a min or max reduction on the values sampled inside a texture-filtering footprint, instead of the regular linear interpolation.
It is supported for all kinds of textures, as well as with anisotropic filtering.

I gave an interview to 3DVF, a great French community website dedicated to image synthesis and digital creation in general. The interview is mainly centered around our voxel-based GI approach, as well as my former INRIA research team and computer graphics research in France.

I did not take the time to report on this before, but I defended my Ph.D. thesis on GigaVoxels at INRIA this summer (my thesis will soon be available there), and I have started a one-year postdoc at NVIDIA Research in the SF Bay Area. There, I continue working on voxel representations, of course :-D

Even though I am now an NVIDIA employee, this continues to be a personal blog: everything I publish here is entirely my own, and my opinions do not necessarily reflect those of my employer!

Just a quick post to recommend the great book "Real-Time Shadows", which details, explores and clearly explains a large number of relevant techniques for real-time shadow rendering, which we all know is one of the most prolific fields in computer graphics!

In addition, this book is written by 4 leading experts of the field: Elmar Eisemann, Michael Schwarz, Ulf Assarsson and Michael Wimmer.

I am happy to present a new approach for computing indirect illumination and ambient occlusion in real-time, using a pre-filtered geometry representation stored inside a sparse voxel octree, together with a new approximate voxel-based cone tracing. The sparse voxel octree is used as a proxy for occlusion information and indirect illumination; it supports dynamic objects and environments thanks to a new, very fast voxelization algorithm that updates the octree structure and pre-filters geometry and lighting information.

This work has been done in collaboration with Miguel Sainz and Simon Green from NVIDIA, as well as Fabrice Neyret from CNRS and Elmar Eisemann from ParisTech.

[Update 30/06/2011]: Benoit Rogez wrote a very good article about our approach on 3DVF (in French): here
[Update 10/06/2012]: More details on the algorithm and on the fast voxelization and octree construction can be found in my GTC 2012 presentation.

Here it is, CUDA 4.0 RC just got released to NVIDIA Registered developers.

Interesting stuff from the CUDA manual:

Layered textures support (GL_TEXTURE_1D/2D_ARRAY): new tex.a1d/.a2d modifiers in PTX. But unfortunately the surface instructions do not support them yet, grrr.
Layered textures are created using cudaMalloc3DArray() with the cudaArrayLayered flag. New cudaTextureType1DLayered/cudaTextureType2DLayered texture sampler types and tex1DLayered()/tex2DLayered() access intrinsics.

New .address_size PTX specifier: allows specifying the address size (32b/64b) used throughout a PTX module.

Inline PTX assembly: this feature has been present since CUDA 2.x but was not officially supported. It is now fully supported and documented :-D

Today NVIDIA announced CUDA 4.0 at GDC. It will be available to registered developers on March 4th :-D

Among the interesting novelties, there is support for layered textures (GL_TEXTURE_2D_ARRAY), which I hope will also be supported for surface access!
There is also support for direct peer-to-peer communication between GPUs, and for mapping the memory of multiple GPUs (and potentially other third-party devices like network/InfiniBand adapters) into the same address space in order to provide direct memory access (Unified Virtual Addressing, UVA). Virtual functions should also now be supported, along with the new and delete operators for dynamic memory allocation from kernels.

Update: CUDA 4.0 RC released to registered developers
Slides are available here: http://bit.ly/cuda4features
Among the interesting novelties I had not seen before, it seems inline PTX will be officially supported with this release! Also, the disassembler (cuobjdump), which was previously limited to the Tesla ISA, now supports Fermi ISA disassembly. Take a look at the manual for the list of supported instructions.