"It's no accident that capitalism has brought with it progress, not merely in production but also in knowledge. Egoism and competition are, alas, stronger forces than public spirit and sense of duty."
- Albert Einstein

This demo implements order independent translucency using a Stencil Routed A-Buffer, which is explained in this exemplarily short paper. The basic idea is to hijack MSAA for the purpose of storing multiple incoming fragments in each pixel. Each stencil sample is initially cleared to a separate value, and the geometry is then rasterized with multisampling disabled. The stencil test routes the incoming fragments one by one to each of the available samples, as long as unwritten samples remain. This allows up to 8 layers to be stored with current hardware. Once the buffer has been filled, it is sorted in back-to-front order and blended. The paper suggests using bitonic sort; however, this demo uses odd-even mergesort instead, since it requires fewer compare-and-swap operations (19 instead of 24 for eight elements).
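
For reference, Batcher's 19-comparator odd-even mergesort network for eight elements looks roughly like the sketch below. The element layout is an assumption (here color in .rgb and depth in .w); the demo's actual packing of the fetched samples may differ.

    // Compare-and-swap ordering two samples back-to-front (larger depth first).
    void CmpSwap(inout float4 a, inout float4 b)
    {
        if (a.w < b.w)
        {
            float4 t = a;
            a = b;
            b = t;
        }
    }

    // Odd-even mergesort network for 8 elements: 19 compare-and-swaps.
    void SortSamples(inout float4 s[8])
    {
        CmpSwap(s[0], s[1]); CmpSwap(s[2], s[3]); CmpSwap(s[4], s[5]); CmpSwap(s[6], s[7]);
        CmpSwap(s[0], s[2]); CmpSwap(s[1], s[3]); CmpSwap(s[4], s[6]); CmpSwap(s[5], s[7]);
        CmpSwap(s[1], s[2]); CmpSwap(s[5], s[6]);
        CmpSwap(s[0], s[4]); CmpSwap(s[1], s[5]); CmpSwap(s[2], s[6]); CmpSwap(s[3], s[7]);
        CmpSwap(s[2], s[4]); CmpSwap(s[3], s[5]);
        CmpSwap(s[1], s[2]); CmpSwap(s[3], s[4]); CmpSwap(s[5], s[6]);
    }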

This demo uses D3D10, but it could have benefited from D3D10.1 in at least two ways. First of all, multisampled depth buffers can't be used for texturing in D3D10, so this demo uses a separate render target for that purpose. As a result, the depth bits of the depth-stencil buffer go entirely unused, which is wasteful. Secondly, and probably more importantly, multisampled buffers can't be CopyResource'd in D3D10. Currently a significant chunk of the frame time is consumed just initializing the stencil buffer. A better way to handle this would be to set up a stencil clear-buffer just once, and then clear the active stencil buffer by copying that clear-buffer into it. A copy is likely a good deal faster than the 8 fullscreen passes with a sample write mask that are required now.

The advantage of this technique over depth peeling is that you don't need multiple passes. If you want up to 8 layers, depth peeling needs 8 passes, whereas this technique needs only a single pass, plus a sort pass. The disadvantage is that once the buffer is full, additional fragments are discarded. If you limit yourself to 8 layers with depth peeling, you get the 8 front-most layers, whereas with this technique fragments are dropped based on arrival order. If you're unlucky, the front-most layer could be the one that got discarded, which is a lot more disturbing than dropping layers in the back, something that may not be noticeable at all.

This demo should run on the Radeon HD 2000 series and up and the GeForce 8000 series and up.

Deep deferred shading
Monday, November 26, 2007

One of the main drawbacks of deferred shading is that it doesn't handle blended objects very well, and typical rendering scenes include at least some translucent objects. The usual solution to this problem is to render translucent objects using traditional forward rendering on top of the deferred rendered scene. This is obviously far from ideal since it forces you to write and maintain a forward rendering path in addition to the deferred shading path, and it also has negative performance implications.

This demo shows a way to solve this problem so that translucent objects can be used with deferred rendering. This is done using a deep buffer approach: instead of a single buffer containing attributes such as diffuse and normal, this technique stores several layers of these attributes in a texture array. Similarly, the depth buffer is also a texture array. In this demo three layers are used, which allows for up to two translucent layers in front of an opaque surface. More layers may be needed in more complex scenes; however, a few layers would still typically suffice. The different layers are extracted using depth peeling.
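
As a rough sketch of one peeling step (resource names and the bias value are assumptions, not the demo's actual code), each pass rejects fragments at or in front of the previously extracted layer:

    Texture2D<float> PrevDepth;  // depth of the previously peeled layer (assumed name)

    float4 PeelPS(float4 pos : SV_Position) : SV_Target
    {
        float prev = PrevDepth.Load(int3(pos.xy, 0));
        // Keep only fragments strictly behind the previous layer; the small
        // bias avoids z-fighting with the layer that was just peeled off.
        clip(pos.z - (prev + 0.00001));
        // ... write this layer's G-buffer attributes here ...
        return float4(0, 0, 0, 0);
    }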

In the deferred lighting passes the lighting contribution from each layer is computed. The cost does not multiply with the number of layers, however, but is only somewhat higher than for traditional deferred shading, since we can stop once we hit an opaque surface. Typical scenes contain far more opaque surfaces than translucent ones, so the majority of pixels only have to evaluate the first layer.
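
A minimal sketch of that early-out loop follows. The layer count matches the demo, but the texture names, attribute packing (opacity in the diffuse alpha) and the simple directional lighting are all assumptions:

    #define LAYERS 3

    Texture2DArray Diffuse;  // assumed layout: rgb = albedo, a = opacity
    Texture2DArray Normals;
    float3 LightDir;

    float4 LightPS(float4 pos : SV_Position) : SV_Target
    {
        int2 tc = int2(pos.xy);
        float3 total = 0;
        float transmit = 1.0;  // how much light still passes through to this layer

        for (int i = 0; i < LAYERS; i++)
        {
            float4 diff = Diffuse.Load(int4(tc, i, 0));
            float3 n = normalize(Normals.Load(int4(tc, i, 0)).xyz * 2 - 1);
            float3 lit = diff.rgb * saturate(dot(n, LightDir));
            total += transmit * diff.a * lit;
            // Stop at the first opaque surface; for most pixels that is layer 0.
            if (diff.a >= 1.0) break;
            transmit *= 1.0 - diff.a;
        }
        return float4(total, 1);
    }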

The drawback of this technique is naturally the memory consumption. Deferred shading already consumes a large amount of memory for the render targets, and this technique multiplies that by the number of layers. This may not be much of an issue on PC, but on consoles it could be more problematic. The cost of the light rendering passes also increases, though not too badly. The cost of filling the buffers at the start, however, increases by a larger factor.

This demo should run on the Radeon HD 2000 series and up and the GeForce 8000 series. Since it's Direct3D 10, Vista is also required.

Deferred shading
Tuesday, October 23, 2007

A technique gaining increasing attention these days is deferred shading. The main idea behind deferred shading is that you initially fill a set of buffers with common data, such as the diffuse texture, normals and various material properties. For the lighting you then just render the light extents and fetch data from these buffers for the lighting computation. The main advantage of this technique is that it decouples lighting from the geometry. In regular forward rendering you have to resubmit all the geometry within the light radius to add another light, which may mean changing a lot of states and shaders and issuing numerous draw calls. With deferred shading only a single draw call is necessary; in fact, you can apply several lights with a single draw call. This makes it scale much better than forward rendering as the number of lights increases. On the downside, it typically consumes more memory, bandwidth and shader instructions than forward rendering.
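
For illustration, a minimal G-buffer fill pass might look like the sketch below. The attribute set and names are assumptions; real implementations pack more material properties:

    Texture2D DiffuseTex;
    SamplerState Filter;

    struct PSIn
    {
        float4 pos      : SV_Position;
        float2 texCoord : TexCoord;
        float3 normal   : Normal;
    };

    struct GBuffer
    {
        float4 diffuse : SV_Target0;  // rgb = albedo
        float4 normal  : SV_Target1;  // xyz = normal packed into [0, 1]
    };

    GBuffer FillPS(PSIn In)
    {
        GBuffer Out;
        Out.diffuse = DiffuseTex.Sample(Filter, In.texCoord);
        Out.normal  = float4(normalize(In.normal) * 0.5 + 0.5, 0);
        return Out;
    }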

This demo takes deferred shading to the extreme. A particle system is generated entirely on the GPU and spews loads of particles in all directions, and every particle is a light source. Hence in this demo there are 1024 light sources active at once. Yet the performance is on the order of hundreds of frames per second.

The geometry shader is used in this demo to compute the light bounding boxes, and each visible light is drawn as a rectangle in clip-space. Note that the extents in the z direction are computed as well. This allows the pixel shader to skip computations where the stored depth value is not in range, which eliminates a lot of unlit pixels.
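
The depth-range rejection in the pixel shader could look roughly like this (the interpolant and texture names are assumptions):

    Texture2D<float> DepthTex;  // depth stored during the G-buffer pass (assumed name)

    float4 LightPS(float4 pos : SV_Position, float2 zRange : ZRange) : SV_Target
    {
        float depth = DepthTex.Load(int3(pos.xy, 0));
        // Skip pixels whose stored depth falls outside the light's z extents,
        // which were computed per light in the geometry shader.
        if (depth < zRange.x || depth > zRange.y)
            discard;
        // ... evaluate this light's contribution for the pixel ...
        return float4(1, 1, 1, 1);
    }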

This demo should run on the Radeon HD 2000 series and the GeForce 8000 series on Vista.

Inferno
Sunday, August 5, 2007

This demo uses a couple of interesting features of Direct3D10. It doesn't use any vertex or index buffers at all (except for what the framework uses for the GUI); instead everything is generated in the shader from the SV_VertexID and SV_InstanceID system-generated values. The skybox has only three vertices (a fullscreen triangle), so by generating that in the shader we avoid the API overhead of binding buffers (which is not the bottleneck of a skybox pass of course, but that's beside the point). The terrain renders instanced triangle strips which read their height from a heightmap. The heightmap is in BC4 format, or ATI1N as it was called in D3D9, which gives us a very compact geometric representation for the terrain. There are 1024 particle systems, all rendered in a single draw call using instancing. The particle systems are stateless and are generated entirely in the vertex shader from the input vertex and instance IDs. The geometry shader is used to expand the incoming points from the vertex shader into quads in screen-space. This is similar to how point sprites used to work, except it's more flexible; this demo rotates the particles, something point sprites can't do.
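
The bufferless fullscreen triangle, for example, can be generated directly from the vertex ID with the standard trick below (z = 1 places the skybox at the far plane; the real shader would also output a view direction):

    float4 SkyboxVS(uint id : SV_VertexID) : SV_Position
    {
        // ids 0, 1, 2 map to (-1,-1), (3,-1), (-1,3): a single triangle
        // whose clipped area covers the whole screen.
        float2 pos = float2((id << 1) & 2, id & 2);
        return float4(pos * 2.0 - 1.0, 1.0, 1.0);
    }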

This demo should run on the Radeon HD 2000 series and the GeForce 8000 series. Since this demo uses D3D10 it only works in Windows Vista.

Domino
Sunday, February 4, 2007

This demo is mostly for eye-candy, but contains a couple of interesting techniques too. OpenGL currently has no particular API for doing instancing. There exist some pseudo-instancing techniques for OpenGL, such as putting the per-instance data in a vertex attribute and issuing a new draw call for each instance. This of course does not cut down the number of draw calls, but should have a relatively low cost per call, making drawing lots of objects manageable compared to, for instance, setting shader constants. Another method, which this demo uses, is based on shader constants. It uses multiple copies of the model in the vertex buffer, each with a particular index. The index is used to grab the instance data from a constant array. This allows you to draw as many instances per draw call as you can fit instance data into the vertex shader constant storage. While this doesn't cut the number of draw calls down to a single one, it at least divides the number of draw calls by a fair amount, in this demo 64 (set conservatively; it could probably be increased).
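
A rough sketch of the idea follows. The demo itself uses OpenGL; D3D10-style HLSL is used here for consistency with the other sketches, and all names and the per-instance data layout are assumptions:

    #define INSTANCES 64

    float4x4 ViewProj;
    float4 InstancePos[INSTANCES];  // per-instance data uploaded as shader constants

    struct VSIn
    {
        float4 position : Position;   // model-space vertex
        float  index    : TexCoord0;  // per-copy instance index baked into the vertex buffer
    };

    float4 InstanceVS(VSIn In) : SV_Position
    {
        // Fetch this copy's placement from the constant array.
        float3 offset = InstancePos[(int)In.index].xyz;
        return mul(float4(In.position.xyz + offset, 1.0), ViewProj);
    }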

The other interesting technique used here is the wood shader, which is loosely based on the wood shader that comes with RenderMonkey. The RenderMonkey wood shader suffers from a serious aliasing problem at a distance since the wood rings are mathematically generated. To solve this problem I derive the rings from the noise texture too, which of course is mipmapped and thus doesn't suffer directly. I added a repeat factor on the returned noise, which adds back some aliasing; that I get rid of by adding a mipmap bias to the lookup.
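
A hedged sketch of the ring computation (the exact ring function, repeat factor and bias value are guesses, and D3D10-style HLSL is used for consistency with the other sketches):

    Texture3D Noise;
    SamplerState Filter;
    float RingRepeat;  // repeat factor applied to the returned noise

    float WoodRings(float3 pos)
    {
        // Read the rings from the mipmapped noise texture rather than
        // computing them mathematically; a positive mip bias counters
        // the aliasing that the repeat factor adds back.
        float n = Noise.SampleBias(Filter, pos, 1.0).x;
        return frac(n * RingRepeat);
    }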

This demo should run on Radeon 9500 and up and GeForce FX 5200 and up.

Ambient aperture lighting
Sunday, December 3, 2006

This demo implements a variation of Ambient Aperture Lighting. I went with a simpler implementation than what's in the paper, but the concept is the same. The idea behind ambient aperture lighting is that you approximate the directions from which light can reach the surface with a disc, so you store a direction vector to the centre of the disc and the size of the disc. In the more elaborate version in the paper they compute the intersection area between the light disc and the aperture disc. In my implementation I simply take the dot product between the light vector and the aperture direction, and use the difference between that and the aperture size as the shadow factor.
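
As a sketch of that simplified shadow factor (the map layout and the treatment of the aperture size as a cosine are assumptions, and any scale or remap the demo applies is omitted):

    Texture2D ApertureMap;  // assumed layout: xyz = aperture direction, w = aperture size
    SamplerState Filter;

    float ApertureShadow(float2 texCoord, float3 lightVec)
    {
        float4 ap = ApertureMap.Sample(Filter, texCoord);
        float3 dir = normalize(ap.xyz * 2 - 1);
        // The shadow factor is simply the difference between the cosine
        // of the light/aperture angle and the stored aperture size.
        return saturate(dot(normalize(lightVec), dir) - ap.w);
    }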

Ambient aperture lighting is quite a rough approximation, but one of the cases where it applies is terrains, which naturally also translates to bumpmaps. In this demo I use ambient aperture lighting to compute a cheap self-shadowing factor.

The advantage of this method over horizon mapping, which can be used for the same purpose, is that it's cheaper. Horizon mapping requires a 3D texture, plus either complex math or a cube lookup table for the angles; ambient aperture lighting requires only a 2D texture. Horizon mapping is also more prone to aliasing, and there's a texture coordinate discontinuity when accessing the horizon map, meaning that you either can't mipmap it (more aliasing) or have to use a lookup with gradients (only supported on SM3.0 hardware) to avoid artifacts. Additionally, the size of the aperture is pretty much an ambient occlusion factor and can thus be used for better-looking ambient lighting (see the outside of the castle in this demo for example). In horizon mapping's defence, though, it is a better approximation of the occluding geometry, so there are cases where it will look more natural than ambient aperture lighting.

This demo should run on Radeon 9500 and up and GeForce FX 5200 and up.

Volumetric Fogging 2
Sunday, October 8, 2006

This demo shows a method for generating volumetric fog where the fog itself has texture, which adds a lot to the realism compared to traditional uniform fog and allows the fog to be animated. The downside is that it's a lot more computationally expensive. It's implemented by marching a ray from the surface back to the camera through the fog, iteratively mixing in the fog at each sampled position. This way you can also include shadows in the computation, which gives you nice light shafts through the fog. In this demo I have simply stored the shadows in a static volume lightmap for fast lookup and good quality.
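
A minimal sketch of the march (step count, texture names, scales and the blend factor are all assumptions, in D3D10-style HLSL for consistency with the other sketches):

    #define FOG_STEPS 32

    Texture3D FogNoise;      // animated fog density (assumed name)
    Texture3D ShadowVolume;  // static volume lightmap holding the shadows
    SamplerState Filter;

    float3 FogColor;
    float NoiseScale;    // also used for the shadow volume here, for brevity
    float StepOpacity;

    float4 ApplyFog(float3 surfacePos, float3 cameraPos, float4 sceneColor)
    {
        float3 stepVec = (cameraPos - surfacePos) / FOG_STEPS;
        float3 pos = surfacePos;
        float4 color = sceneColor;

        // March from the surface back to the camera, mixing in lit fog at
        // each sample so nearer fog correctly overlays farther fog.
        for (int i = 0; i < FOG_STEPS; i++)
        {
            float density = FogNoise.SampleLevel(Filter, pos * NoiseScale, 0).x;
            float shadow  = ShadowVolume.SampleLevel(Filter, pos * NoiseScale, 0).x;
            color.rgb = lerp(color.rgb, FogColor * shadow, density * StepOpacity);
            pos += stepVec;
        }
        return color;
    }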

This demo should run on Radeon 9500 and up and GeForce FX 5200 and up.

Dynamic lightmapping
Sunday, August 20, 2006

Generating lightmaps can be quite a time-consuming task, so they are typically generated offline and shipped with the application. That's also what I've done in the past with the demos that use lightmaps. Even though my lightmap generation code could probably be optimized a fair bit, I doubt any CPU implementation could get anywhere close to the performance you can get by offloading this task to the GPU, which is what this demo does.

First a position map is generated on the CPU. The position map contains the worldspace position that each pixel in the lightmap maps to on the geometry it's used with. The position is preprocessed a bit to push it out slightly from the geometry to avoid precision problems. This is particularly important since I generate four position maps to antialias the shadow slightly, and the offset sample positions must not cut into the geometry or you'd get artifacts.

The shadows are generated with a standard cubic shadow mapping technique, except it's done in the texture space of the lightmap, with the position looked up from the position map. Generating the shadow map this way is quite fast and definitely real-time if you're doing plain hard shadows. The texture filter will then smooth the edges a bit to give somewhat soft shadows. This is slower than just doing shadow mapping directly, though, and the quality improvement is relatively small. It does give you the option to blur the shadow in lightmap space, which is cheaper than doing it per pixel with regular shadow mapping. However, in order to really differentiate this from plain shadow mapping, the demo implements real soft shadows with the light sampled at 512 positions. The shadow for each light position sample is also 4x antialiased. The antialiasing was added since it gives some extra quality, especially with a small light radius, and adds very little to the cost (generating the shadow map is the bottleneck). Generating this soft shadow is almost real-time, but not fast enough to do every frame. However, once it's been generated it can be reused forever, giving you soft shadows nearly for free.
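
A sketch of one texture-space shadow pass, rendered over the lightmap with one pixel per texel (names, the distance-based compare and the bias are assumptions, in D3D10-style HLSL for consistency with the other sketches):

    Texture2D PositionMap;         // worldspace position per lightmap texel
    TextureCube<float> ShadowMap;  // cubic shadow map storing distance to the light
    SamplerState Filter;

    float3 LightPos;
    float Bias;

    float4 ShadowPS(float4 pos : SV_Position) : SV_Target
    {
        float3 worldPos = PositionMap.Load(int3(pos.xy, 0)).xyz;
        float3 lightVec = worldPos - LightPos;
        // Standard cubic shadow mapping, just evaluated in lightmap space:
        // compare the distance to the light against the stored distance.
        float stored = ShadowMap.Sample(Filter, lightVec).x;
        float shadow = (length(lightVec) < stored + Bias) ? 1.0 : 0.0;
        return shadow.xxxx;
    }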

Typical applications of this technique could be to either generate lightmaps fast on end-user machines to reduce download size, or for semi-dynamic lights in games, where the light position is expected to remain static most of the time, such as lighting up a candle.

This demo should run on Radeon 9500 and up and GeForce FX 5200 and up.

On the first run it will generate four position maps, so it may take ten seconds or so to load. Later runs will start much quicker.