An Efficient Hybrid Shadow Rendering Algorithm

Abstract

We present a hybrid algorithm for rendering hard shadows accurately
and efficiently. Our method combines the strengths of shadow maps
and shadow volumes. We first use a
shadow map to identify the pixels in the image that lie near shadow
discontinuities. Then, we perform the shadow volume algorithm only at
these pixels to ensure accurate shadow edges. This approach
simultaneously avoids the edge aliasing artifacts of standard shadow
maps and avoids the high fillrate consumption of standard shadow
volumes. The algorithm relies on a hardware mechanism for
rapidly rejecting non-silhouette pixels during rasterization. Since
current graphics hardware does not directly provide this mechanism, we
simulate it using available features related to occlusion culling and
show that dedicated hardware support requires minimal changes to
existing technology.

Files

Images

The following images show the three test scenes that we study in our
paper. We provide three different views of each scene. In the table
below, rows correspond to different viewpoints, and columns show
different renderings of a scene from a given viewpoint. Click on a
thumbnail for a full-resolution (1024 x 768) image. All images were
rendered using an NVIDIA GeForce 6800 (NV40) graphics card.

The visualization images in the last column require further
explanation. Red and blue pixels are non-silhouette pixels. In our
hybrid algorithm, shadow maps are used to determine which pixels are
in shadow (red) and which ones are not (blue). Black and green pixels
are shadow silhouette pixels. Shadow volumes are used to determine
which ones are in shadow (black) and which ones are not (green). Note
that this is the same color scheme used in our paper.

Shadow map

Shadow volumes

Hybrid algorithm

Shadow silhouette pixels

Shadow-volume polygons

Reconstruction visualization

Cubes

N/A

Tree

Dragon Cage

Code

You can download some of our Cg shader code:

The following vertex and fragment programs are used to identify shadow
silhouette pixels. These pixels are tagged by writing 1 to the color
buffer.
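The shader downloads themselves are not reproduced here. As a rough
sketch of the idea (not our actual Cg code), a pixel can be classified
as a silhouette pixel when the shadow-map depth comparisons within its
filter footprint disagree. The C simulation below makes several
illustrative assumptions: a tiny 4x4 shadow map, a 2x2 sample
footprint, and an arbitrary depth bias.

```c
#include <assert.h>

/* Sketch: classify a pixel as a shadow-silhouette pixel when the
 * shadow-map depth comparisons of nearby samples disagree.  The
 * shadow-map size, footprint, and bias are illustrative assumptions,
 * not the actual shader. */

#define SM_SIZE 4

/* 1 = sample passes the depth comparison (lit), 0 = fails (shadowed). */
static int depth_test(const float sm[SM_SIZE][SM_SIZE],
                      int x, int y, float fragment_depth, float bias)
{
    return fragment_depth <= sm[y][x] + bias;
}

/* Returns 1 (the tag written to the color buffer) if the 2x2 footprint
 * anchored at (x, y) contains both lit and shadowed samples, i.e. the
 * pixel lies on a shadow silhouette. */
static int is_silhouette_pixel(const float sm[SM_SIZE][SM_SIZE],
                               int x, int y, float fragment_depth)
{
    int lit = 0, shadowed = 0;
    for (int dy = 0; dy < 2; dy++)
        for (int dx = 0; dx < 2; dx++) {
            if (depth_test(sm, x + dx, y + dy, fragment_depth, 0.001f))
                lit++;
            else
                shadowed++;
        }
    return lit > 0 && shadowed > 0;
}
```

In the real fragment program this classification runs per pixel during
an image-space pass, and the resulting tag drives the computation mask
described below.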

The following fragment program is used to simulate the computation
mask in conjunction with the EXT_depth_bounds_test OpenGL extension.
The idea is to draw a screen-sized quad and use this shader to set the depth values of non-silhouette pixels to z =
0; the depth values of silhouette pixels are left alone. Then,
enable the depth bounds test and set the bounds to [0.001, 1]. This
tells the hardware to ignore, in all subsequent rendering, any pixel
whose framebuffer depth value is less than 0.001: these ignored
pixels are exactly the non-silhouette pixels. Now we can draw the
shadow volumes into the stencil buffer, and the hardware will perform
rasterization and updates only at the silhouette pixels.
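To make the data flow concrete, here is a small CPU-side simulation of
the two passes just described (plain C, not actual OpenGL calls); the
framebuffer layout, pixel count, and stencil update are made up for
illustration.

```c
#include <assert.h>

/* Sketch of the computation-mask trick, simulated on the CPU.
 * Pass 1: a screen-sized quad sets the depth of every non-silhouette
 * pixel to 0 and leaves silhouette depths alone.
 * Pass 2: with the depth bounds set to [0.001, 1], shadow-volume
 * rasterization updates the stencil only at silhouette pixels. */

#define NPIX 8

typedef struct {
    float depth[NPIX];
    int   stencil[NPIX];
    int   is_silhouette[NPIX];  /* tag from the silhouette pass */
} Framebuffer;

/* Pass 1: emulate the masking fragment program. */
static void apply_mask(Framebuffer *fb)
{
    for (int i = 0; i < NPIX; i++)
        if (!fb->is_silhouette[i])
            fb->depth[i] = 0.0f;   /* pushed outside the depth bounds */
}

/* Pass 2: emulate rasterizing shadow volumes with the depth bounds
 * test enabled; zmin/zmax play the role of the [0.001, 1] bounds. */
static void rasterize_shadow_volume(Framebuffer *fb,
                                    float zmin, float zmax)
{
    for (int i = 0; i < NPIX; i++) {
        if (fb->depth[i] < zmin || fb->depth[i] > zmax)
            continue;              /* rejected early: no stencil work */
        fb->stencil[i]++;          /* stand-in for a stencil update */
    }
}
```

The point of the real hardware mechanism is that the rejection in pass
2 happens early in the pipeline, before any stencil rasterization work
is spent on the masked-off pixels.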

Performance Issues

Our hybrid algorithm relies heavily on the hardware's ability to
rapidly reject non-silhouette pixels early in the pipeline. Actual
performance gains will depend on the effectiveness of hardware
culling. In our tests on NVIDIA hardware, the GeForce 6 (NV40) can
rasterize z/stencil (color writes disabled) at 32 pixels/clock, but it
can reject pixels at 64 pixels/clock. Similarly, we observed that the
GeForce FX 5950 (NV38) can rasterize z/stencil at 8 pixels/clock, but
it can reject at 16 pixels/clock. Both measurements represent peak
performance. In either case, the relative speedup is a factor of 2.

We have found that ATI's early culling mechanisms on the Radeon 9800
Pro are somewhat more effective than NVIDIA's, meaning that the
relative performance gains for rejected pixels are higher in the ATI
scenario. We have not performed detailed measurements, however. The
purpose of this discussion is not to identify which vendors
offer superior early culling performance. Instead, we simply want to
point out that culling mechanisms can vary across different
architectures; as these mechanisms improve in the
future, so will the performance of our hybrid algorithm.

We mention in the paper that our algorithm requires the hardware to
preserve early culling behavior when using a pixel shader to set depth
values. Of the various NVIDIA hardware, only the GeForce 6 (NV40)
architecture supports this feature. Previous NVIDIA architectures,
such as the GeForce 3/4 (NV2x) and GeForce FX (NV3x), do not. On those
earlier architectures, if you use a pixel shader to discard fragments
(fragment kill) or modify depth values, early culling becomes
disabled. Consequently, our hybrid algorithm is not efficient on
those architectures because shadow volumes will be completely
rasterized instead of being limited to the shadow silhouettes.

We have also discussed this issue with ATI, and they informed us that
using a pixel shader to set depth values will preserve culling
behavior for subsequent render passes on the Radeon 9700 (R300)
or newer architectures. (Note that early culling will not apply to
the current pass.) While this arrangement is suitable for our hybrid
algorithm, unfortunately ATI hardware does not support the
EXT_depth_bounds_test extension. Therefore, currently we cannot
simulate a computation mask efficiently on ATI hardware.

Is it necessary to use the EXT_depth_bounds_test extension to simulate
a computation mask? Not always. Consider the following
image-processing scenario. We want to identify a small subset
of the pixels in the image to work on; all other pixels are to be
masked off. Then we want to run an expensive image-processing shader
on the small working set of pixels. This can be accomplished as
follows:

Clear the depth buffer so that z = 1.

Draw a screen-sized quad at z = 0 and use a pixel shader to
discard the pixels in the working set. Make sure to have depth
testing enabled and set the depth compare function to LESS. The end
result is that all pixels to be masked-off will have depth values of
0, and all remaining pixels will have depth values of 1.

Draw another screen-sized quad, this time at z = 0.5, and
apply the expensive image-processing pixel shader. Continue to use
depth testing with the depth compare function set to LESS. A big
performance win will result because the masked-off pixels will fail the
depth test.
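The three steps above can be sketched as a CPU-side simulation (plain
C, no OpenGL); the working-set predicate and the "expensive shader"
below are hypothetical placeholders.

```c
#include <assert.h>

/* CPU sketch of the three-step z-buffer masking recipe.  The
 * working-set predicate and the expensive shader are placeholders. */

#define NPIX 8

static float depth[NPIX];
static int   shaded[NPIX];   /* counts runs of the expensive shader */

/* Hypothetical working set: the first three pixels. */
static int in_working_set(int i) { return i < 3; }

static void run_mask_passes(void)
{
    /* Step 1: clear the depth buffer so that z = 1. */
    for (int i = 0; i < NPIX; i++) {
        depth[i] = 1.0f;
        shaded[i] = 0;
    }

    /* Step 2: quad at z = 0 with depth func LESS; the shader discards
     * working-set pixels, so only masked-off pixels receive z = 0. */
    for (int i = 0; i < NPIX; i++)
        if (!in_working_set(i) && 0.0f < depth[i])
            depth[i] = 0.0f;

    /* Step 3: quad at z = 0.5, still with LESS; only pixels whose
     * stored depth is still 1 pass, so the expensive shader runs only
     * on the working set. */
    for (int i = 0; i < NPIX; i++)
        if (0.5f < depth[i])
            shaded[i]++;     /* placeholder for the expensive shader */
}
```

With early depth culling, the masked-off pixels in step 3 are rejected
before the expensive shader ever executes, which is the source of the
performance win.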

The above example does not use the EXT_depth_bounds_test; it only
relies on standard z-buffering. Why can't our algorithm do the same?
The reason is that we need to preserve the real z-values of the
scenes' surfaces at the silhouette pixels in order to apply the shadow
volume algorithm. Similarly, we can't do early stencil-based culling,
because we also need the stencil buffer for rasterizing shadow
volumes. These difficulties underscore the need for a
user-programmable computation mask that is semantically independent of
the existing depth and stencil buffers.

Related Research

Timo Aila
and Tomas
Akenine-Möller have written a paper titled "A Hierarchical
Shadow Volume Algorithm" that will appear in Graphics Hardware 2004.
Their idea is similar to ours: apply the shadow-volume computation
only near the shadow silhouettes. The most important difference is
that they identify tiles of pixels in object space by testing for
triangle intersections against the tiles' 3D bounding boxes,
whereas we identify shadow silhouette pixels in image space using a
shadow map. Since they do not use a discrete image representation
anywhere in their algorithm, their method is not subject to
undersampling artifacts such as missing shadows of fine geometry.
Also, their method always identifies tiles of boundary pixels with
respect to the observer's image plane, whereas our method uses a
shadow map drawn from the light's point of view. This means that, as
long as we're using only a single shadow map, our method will only
provide performance speedups for parts of the observer's image that
are covered by the shadow map. Another way of saying this is that the
shadow-map method implicitly involves some sort of directional light
source, like a spotlight, because one has to choose a field-of-view
for the shadow camera. In contrast, the hierarchical shadow volume
approach doesn't involve this shadow camera and naturally handles
omnidirectional lights. The tradeoff, however, is that their method
requires numerous changes to current hardware, whereas our method
leverages existing early culling hardware.

Brandon Lloyd and his
colleagues at UNC have written a related paper titled "CC Shadow Volumes" and
published it in the same forum as our paper: the 2004 Eurographics
Symposium on Rendering. Their method is also aimed at minimizing the
fillrate consumed by shadow volumes. The main difference between
their approach and ours is that they reduce the number and
size of the shadow-volume polygons (by culling and clamping),
whereas we send all the polygons (at full size) and rely on the early
culling hardware.

It is interesting to consider both approaches because they incur
different types of overhead. In our method, rasterization would
ideally be performed only at the shadow silhouette pixels, but in
reality there is the overhead of considering and then rejecting the
non-silhouette pixels. Another way of thinking about this is that (a)
rasterizing a quad that covers half of the viewport is faster than (b)
drawing a full-screen quad and using early culling to throw away half
of the pixels. Of course, there is also some overhead in the CC
shadow-volume method because they have to clamp the shadow-volume
polygons. The take-home message here is that the two methods are
complementary: In a highly-optimized implementation of shadow volumes,
it makes sense to spend some additional work on the CPU to perform
coarse-grained culling of shadow-volume polygons and then to rely on
the GPU for fine-grained culling. Maximizing rendering performance by
balancing overhead vs. culling benefits is a difficult (but
interesting!) problem.

Mark Harris gave a presentation
on general-purpose GPU programming at Game Developer's Conference in
March 2004. In the presentation he talks about computation masks
(though he doesn't use that term) and describes one way to implement
them by exploiting the standard depth test. As mentioned above, we
couldn't implement our mask this way because we need to preserve the
depth values of silhouette pixels.

Copyright Notice

Permission to make digital or hard copies of part or all of this work
for personal or classroom use is granted without fee provided that
copies are not made or distributed for profit or direct commercial
advantage and that copies show this notice on the first page or
initial screen of a display along with the full citation. Copyrights
for components of this work owned by others than the publisher must be
honoured and acknowledged.