Fast Stencil Light Volumes for Deferred Shading

Hello,

A while back I posted about some deferred shading performance problems, and how using stencil light volumes actually slowed things down instead of speeding them up.
This has probably been thought up before, but I thought I'd share it here anyway in case it hasn't.
I managed to batch the stencil tests into groups of 8 lights (one per stencil bit). Using glStencilMask as an OR operation, I write which lights affect which pixels while depth-testing the front faces of the light volumes.
In a second pass over the back faces of the light volumes, I switch the depth test to GL_GREATER and set the stencil func to render each light only if its bit was set earlier, using the ANDed mask parameter.

With this system, there is no overdraw, and the stencil test is fast. Overall, the system is faster than without the stenciling.
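To make the bit logic concrete, here is a minimal CPU-side sketch of what the two passes compute. This is a Python simulation of the stencil math, not real GL code, and all the names are mine:

```python
# Sketch of the two-pass stencil batching, simulated per pixel. In the
# real renderer these are GL depth/stencil operations on the light
# volumes; here a "pixel" is just a dict. All names are illustrative.

def front_face_pass(pixels, lights):
    """Pass 1: draw each light's front faces with the normal depth test
    (GL_LESS) and glStencilMask(1 << bit), so a passing fragment ORs the
    light's bit into the stencil buffer."""
    for bit, light in enumerate(lights):
        mask = 1 << bit
        for p in pixels:
            if light["front_depth"] < p["depth"]:   # front face before scene
                p["stencil"] |= mask                # masked write acts as OR

def back_face_pass(pixels, lights):
    """Pass 2: draw the back faces with GL_GREATER, and shade a light only
    where the back face is behind the scene AND the light's stencil bit
    was set in pass 1 (the ANDed stencil-func mask)."""
    lit = []
    for bit, light in enumerate(lights):
        mask = 1 << bit
        for i, p in enumerate(pixels):
            if light["back_depth"] > p["depth"] and p["stencil"] & mask:
                lit.append((i, bit))
    return lit

# One pixel at depth 0.5; light 0 encloses it, light 1 starts behind it.
pixels = [{"depth": 0.5, "stencil": 0}]
lights = [{"front_depth": 0.3, "back_depth": 0.7},
          {"front_depth": 0.6, "back_depth": 0.9}]
front_face_pass(pixels, lights)
lit = back_face_pass(pixels, lights)   # only light 0 shades pixel 0
```

A pixel gets lit by a light exactly when the volume's front face is in front of the scene depth and the back face is behind it, i.e. the scene geometry is inside the volume.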

That's interesting. Thanks!

Say, curious question: did you happen to bench tile-based deferred with your system? That is, batching lights by screen tile, and then for each tile: read the G-buffer, apply the lights for that tile all-in-one-go, write the lighting buffer?
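In case it helps, this is roughly what the binning step of tile-based deferred looks like; a hedged Python sketch with screen-space circles standing in for projected light bounds, and the tile size and all names are my own:

```python
# Sketch of tile-based light binning: assign each light to every screen
# tile its projected bounds overlap; each tile then shades all of its
# lights in one pass over the G-buffer. Sizes/names are illustrative.
TILE = 16  # tile size in pixels

def bin_lights(lights, screen_w, screen_h):
    """lights: list of (cx, cy, radius) in screen space."""
    tiles_x = (screen_w + TILE - 1) // TILE
    tiles_y = (screen_h + TILE - 1) // TILE
    bins = [[] for _ in range(tiles_x * tiles_y)]
    for i, (cx, cy, r) in enumerate(lights):
        x0 = max(int((cx - r) // TILE), 0)
        x1 = min(int((cx + r) // TILE), tiles_x - 1)
        y0 = max(int((cy - r) // TILE), 0)
        y1 = min(int((cy + r) // TILE), tiles_y - 1)
        for ty in range(y0, y1 + 1):
            for tx in range(x0, x1 + 1):
                bins[ty * tiles_x + tx].append(i)
    return bins

# 64x32 screen -> 4x2 tiles; a small light and a large one.
bins = bin_lights([(8, 8, 4), (40, 8, 20)], 64, 32)
```

Each tile's list is then handed to one shader invocation that reads the G-buffer once and loops over just those lights, instead of touching the G-buffer once per light.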

One thing you can do is use instancing for the batched lights; that may improve things a bit. One bad thing with this implementation of deferred shading is that you talk to the API a lot, and you also clear the stencil buffer again and again.

As Dark Photon mentioned, tile-based deferred shading is probably the way to go. I've recently implemented it and the performance advantage was INSANE. Now I can render 500 point lights (limited by the UBO size at the moment) without any significant impact. Before, the bottleneck was the lighting stage; now it's by far the material stage (where you create the G-buffer). I am planning to write an article about the implementation I used, so if someone is interested I can rush things a bit.

@ Dark Photon: I tried the tiling method a while back, but it performed worse due to how I implemented it. I did it without compute shaders or OpenCL, since I wanted to support older hardware, and because I have no experience with either. All the tiling happened on the CPU, so it came out very CPU limited.
@ Godlike: I couldn't get tiled deferred to work properly myself, so I would love to see that article! I want more lights

I had a deferred shader updated with tile-based lights. For each light, I created a quad that precisely covers the light. It is positioned correctly in z to make use of depth culling; the position is at the front of the light volume, not at the z of the light source itself. There are some tricks to be aware of when the camera is inside the light (where my implementation still has some problems). The same technique can be used for a lot of nice effects, like adding spherical fogs, local colour-coded markers, etc. A small vertex shader handles the positioning. The performance increased a lot, especially for lamps further away or hidden by objects, and that is the usual case, except for a few near lamps.
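I can't reproduce the exact shader here, but the placement idea is roughly this; a hedged sketch assuming a view-space point light of a given radius, with all names mine rather than the poster's:

```python
# Sketch of positioning a light's covering quad at the front of the
# light volume so the depth test can cull it. View space, camera at the
# origin looking down -z (nearer means larger z). Names are illustrative.

def quad_depth(light_z_view, radius):
    """Place the quad at the near edge of the light's sphere of influence,
    not at the light source itself, so geometry that hides the entire
    volume depth-culls the quad."""
    return light_z_view + radius   # nearer to the camera

def quad_is_culled(light_z_view, radius, occluder_z_view):
    """The quad is rejected if an opaque occluder sits in front of it."""
    return occluder_z_view > quad_depth(light_z_view, radius)

# Light centred 10 units into the screen, radius 2 -> quad sits at z = -8.
culled = quad_is_culled(-10.0, 2.0, -5.0)    # wall in front of the volume
visible = not quad_is_culled(-10.0, 2.0, -9.0)  # wall inside the volume
```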

The system I use is... funky. It goes like this: I have a render target that is essentially RGBA_8/16/32UI, depending on how many lights I wish to support in a single call. For each light I render a "light volume", in the same fashion as one does stencil shadows, but rather than incrementing, I just flip the bit of that integer buffer (I get away with this because the light volume is essentially the "shadow" of a planar polygon). This way all lights get drawn to the integer buffer, and in the final pass, each bit of that integer buffer indicates whether a light is active.

In my system, part of the G-buffer is an offset into a range of a texture buffer object that "lists" the lights the mesh worries about (computed on the CPU), and the lighting pass iterates over that range, so only those lights whose "bounding-whatever" intersects the bounding box of a mesh are added.

The main things I get out of this: I avoid a lot of extra lookups (doing a pass per light means the G-buffer needs to be read again for each light), and I avoid blending to add the lights (and the icky choice between FP16/FP32 blending or Fixed8 blending with banding). This system also lets me detect when any particular set of lights is active on one pixel by using a mask, opening the door for drawing weird stuff (like changing a surface in a funky way if light A and light B are hitting the same thing).
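A compressed sketch of the bit-flip bookkeeping described above, with Python standing in for the integer render target and the XOR write; every name here is illustrative:

```python
# Sketch of the integer light-mask buffer: each light volume face drawn
# over a pixel XOR-flips that light's bit (valid here because the volume
# is the extrusion of a planar polygon, so faces cover an unlit pixel an
# even number of times and a lit pixel an odd number). The final pass
# walks a per-mesh light list and tests bits. Names are illustrative.

def rasterize_volumes(num_pixels, volume_faces):
    """volume_faces: list of (light_bit, covered_pixel_indices)."""
    buf = [0] * num_pixels          # the RGBA*UI render target
    for bit, covered in volume_faces:
        for p in covered:
            buf[p] ^= 1 << bit      # flip, don't increment
    return buf

def shade(buf, pixel, light_begin, light_end):
    """Final pass: gather only lights from this mesh's list whose bit
    survives in the mask (one G-buffer read instead of one per light)."""
    return [i for i in range(light_begin, light_end)
            if buf[pixel] & (1 << i)]

# Light 0's front and back faces both cover pixel 1 (outside the volume,
# flips cancel); only the front face covers pixel 0 (inside, bit stays).
buf = rasterize_volumes(2, [(0, [0, 1]), (0, [1])])
active = shade(buf, 0, 0, 8)
```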

I have a render target that is essentially RGBA_8/16/32UI, depending on how many lights I wish to support in a single call. For each light I render a "light volume", ... I just flip the bit of that integer buffer ... each bit of that integer buffer indicates if a light is active.

I've seen opening posts in forums that are a page long without spaces, and it's a wonder anyone replies. Spread the word!

I just wanted to ask, since this thread seems to have become a general discussion. I want to implement a similar shader framework and I am wondering if it counts as "deferred" or not...

It's really more about shadows than lights. In short, the static (deterministic) elements of the scene have their shadow geometry precomputed. The goal is to make accurate (soft umbra/penumbra, etc.), go-anywhere shadows that do not have jagged/saw-toothed qualities. The scene is sectioned into chunks that each have up to 4 shadow-generating lights. The shadows are then drawn to an RGBA buffer, one component per shadow, using colour masks and a depth buffer. A second (MRT) buffer probably generates a depth texture for later lookup, as I suspect copying the depth buffer into a texture would not fly.

At this point, shadows can be generated for the non-deterministic elements of the scene via some real-time algorithm; I haven't given it much thought, but it is complicated by the chunking. And shadows for blended geometry can be accumulated without writing to the depth buffer.

Then the same depth buffer is used to draw the scene as usual, discarding pixels that are behind the shadows, and the lights are modulated by the greyscale values in the shadow buffer. If a pixel is in front of (or on top of) a shadow, its depth must be compared with the saved depth texture to determine whether it is shadowed or not. Blended geometry can skip the depth comparison.
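The modulation step, as I understand it, could look something like this; a Python sketch under my own assumptions (up to four shadow terms per chunk, one per RGBA component), not the actual implementation:

```python
# Sketch of modulating up to four lights by a packed RGBA shadow buffer:
# each component holds one light's greyscale occlusion term for this
# chunk (0 = fully shadowed, 1 = unshadowed). Names are illustrative.

def shade_pixel(light_colors, shadow_rgba):
    """Sum each light's contribution scaled by its shadow component."""
    out = [0.0, 0.0, 0.0]
    for (r, g, b), occlusion in zip(light_colors, shadow_rgba):
        out[0] += r * occlusion
        out[1] += g * occlusion
        out[2] += b * occlusion
    return out

lights = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]        # red and green lights
color = shade_pixel(lights, (0.5, 1.0, 1.0, 1.0))  # red is half-shadowed
```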

Is this deferred lighting? There is no G-buffer, but it's kind of flipped around. Also, the shadows can be light instead (more technically, the inverse of shadow) if a scene is more dark than light.

Last edited by michagl; 09-17-2012 at 11:18 AM.

God have mercy on the soul that wanted hard decimal points and pure ctor conversion in GLSL.

The standard lighting shader then loops over the lights in the range [lightBegin, lightEnd); a light is considered active if the light bitmask logically ANDed with the light bitfield buffer matches the light bitmask. The use I had for this was having a light go through a portal: the face of the portal was always planar, so the light volume it cast was always safe to handle with just bit flipping. You can see demos of this pet project at: http://www.youtube.com/playlist?list=PL2322715E8A420CCD ... *sighs* it has been a long time since I have had the time to work on that project.
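The active-light test from that loop is tiny; a sketch, with every name mine, showing why a multi-bit mask also covers the "light A and light B together" effects:

```python
# Sketch of the active-light test: a light (or a combination of lights)
# is active on a pixel when ANDing its mask with the pixel's light
# bitfield gives the mask back unchanged. Names are illustrative.

def lights_active(pixel_bits, light_bitmask):
    return (pixel_bits & light_bitmask) == light_bitmask

pixel_bits = 0b0101                            # lights 0 and 2 hit this pixel
single = lights_active(pixel_bits, 0b0001)     # light 0 alone -> active
combo = lights_active(pixel_bits, 0b0101)      # lights 0 AND 2 -> active
missing = lights_active(pixel_bits, 0b0011)    # light 1 absent -> inactive
```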

With Deferred Shading, you sample your materials into a screen-sized buffer and then go back and apply lighting to it.

With Deferred Lighting, you reverse it: sample your lighting (irradiance) into a screen-sized buffer and then go back and apply materials to it.

In both cases, you're sampling at the nearest opaque fragment within each pixel (or sample, if doing MSAA). Thus the complication with translucents.

Neither of these necessarily requires that shadowing be handled for any/all light sources.

...the shadows are drawn to an RGBA buffer one component per shadow with colour masks and a depth buffer.

What this does sound like is what I've seen called "Deferred Shadows", "Shadow Collector", or "Screen-space Shadow Mask". The idea is you sample your shadow term at the nearest opaque fragment within each pixel (or sample, if doing MSAA), and then just apply the shadowing term to each pixel (or sample) in your final pass when generating the composite radiance/luminance for each pixel (or sample).
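A minimal sketch of that idea, treating the screen as a 1D row of pixels for brevity; the shadow-query function and all names here are illustrative, not from any particular engine:

```python
# Sketch of a screen-space shadow mask ("deferred shadows"): sample the
# shadow term once per pixel at the nearest opaque fragment, store it in
# a screen-sized mask, then multiply it in during the final composite.
# Names are illustrative.

def build_shadow_mask(depths, in_shadow):
    """in_shadow(x, depth) -> occlusion in [0, 1] at the opaque fragment."""
    return [in_shadow(x, d) for x, d in enumerate(depths)]

def composite(lit_colors, mask):
    """Final pass: apply the stored shadow term to each shaded pixel."""
    return [c * m for c, m in zip(lit_colors, mask)]

depths = [0.2, 0.8]   # nearest opaque depth per pixel
mask = build_shadow_mask(depths, lambda x, d: 0.0 if d > 0.5 else 1.0)
final = composite([1.0, 1.0], mask)
```

The point is that the (possibly expensive) shadow query runs once per pixel, decoupled from however many passes later consume the result.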

...discarding pixels that are behind the shadows and the lights are modulated by the greyscale values in the shadow buffer.

I'm guessing by this you don't mean behind the shadows, but behind the nearest opaque fragment, which is where the occlusion field (shadows) is sampled. (?)