We present a machine learning technique for driving 3D facial animation by audio input in real time and with low latency. Our deep neural network learns a mapping from input waveforms to the 3D vertex coordinates of a face model, and simultaneously discovers a compact, latent code that disambiguates the variations in facial expression that cannot be explained by the audio alone. During inference, the latent code can be used as an intuitive control for the emotional state of the face puppet.

We train our network with 3-5 minutes of high-quality animation data obtained using traditional, vision-based performance capture methods. Even though our primary goal is to model the speaking style of a single actor, our model yields reasonable results even when driven with audio from other speakers with different gender, accent, or language, as we demonstrate with a user study. The results are applicable to in-game dialogue, low-cost localization, virtual reality avatars, and telepresence.

We present a GPU-based ray traversal algorithm that operates on compressed wide BVHs and maintains the traversal stack in a compressed format. Our method reduces the amount of memory traffic significantly, which translates to 1.9-2.1x improvement in incoherent ray traversal performance compared to the current state of the art. Furthermore, the memory consumption of our hierarchy is 35-60% of a typical uncompressed BVH.

In addition, we present an algorithmically efficient method for converting a binary BVH into a wide BVH in a SAH-optimal fashion, and an improved method for ordering the child nodes at build time for the purposes of octant-aware fixed-order traversal.
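
To make the idea of a compressed wide BVH more concrete, the sketch below quantizes a child bounding box to 8-bit grid coordinates relative to its parent's lower corner, rounding outward so that the decoded box still encloses the original. The 8-bit width, the power-of-two grid scale, and all function names are assumptions made for illustration; the abstract does not specify the actual encoding.

```python
import math

BITS = 8                 # assumed bits per quantized coordinate
QMAX = (1 << BITS) - 1   # largest representable grid coordinate

def grid_scale(parent_lo, parent_hi):
    """Per-axis power-of-two cell size so the parent extent fits in QMAX cells."""
    scales = []
    for lo, hi in zip(parent_lo, parent_hi):
        extent = max(hi - lo, 1e-30)  # guard against flat boxes
        scales.append(2.0 ** math.ceil(math.log2(extent / QMAX)))
    return scales

def quantize_box(child_lo, child_hi, parent_lo, scales):
    """Round the lower corner down and the upper corner up (outward rounding)."""
    qlo = [math.floor((c - p) / s) for c, p, s in zip(child_lo, parent_lo, scales)]
    qhi = [math.ceil((c - p) / s) for c, p, s in zip(child_hi, parent_lo, scales)]
    return qlo, qhi

def dequantize_box(qlo, qhi, parent_lo, scales):
    """Decode grid coordinates back to world-space box corners."""
    lo = [p + q * s for q, p, s in zip(qlo, parent_lo, scales)]
    hi = [p + q * s for q, p, s in zip(qhi, parent_lo, scales)]
    return lo, hi
```

With outward rounding, a ray that misses the decoded box also misses the original one, so traversal stays correct at the cost of slightly looser bounds (up to floating-point rounding in the decode).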

We present a real-time deep learning framework for video-based facial performance capture: the dense 3D tracking of an actor's face given a monocular video. Our pipeline begins with accurately capturing a subject using a high-end production facial capture pipeline based on multi-view stereo tracking and artist-enhanced animations. With 5-10 minutes of captured footage, we train a convolutional neural network to produce high-quality output, including self-occluded regions, from a monocular video sequence of that subject. Since this 3D facial performance capture is fully automated, our system can drastically reduce the amount of labor involved in the development of modern narrative-driven video games or films involving realistic digital doubles of actors and potentially hours of animated dialogue per character. We compare our results with several state-of-the-art monocular real-time facial capture techniques and demonstrate compelling animation inference in challenging areas such as eyes and lips.

In this paper, we present a simple and efficient method for training deep neural networks in a semi-supervised setting where only a small portion of training data is labeled. We introduce self-ensembling, where we form a consensus prediction of the unknown labels using the outputs of the network-in-training on different epochs, and most importantly, under different regularization and input augmentation conditions. This ensemble prediction can be expected to be a better predictor for the unknown labels than the output of the network at the most recent training epoch, and can thus be used as a target for training. Using our method, we set new records for two standard semi-supervised learning benchmarks, reducing the (non-augmented) classification error rate from 18.44% to 7.05% in SVHN with 500 labels and from 18.63% to 16.55% in CIFAR-10 with 4000 labels, and further to 5.12% and 12.16% by enabling the standard augmentations. We additionally obtain a clear improvement in CIFAR-100 classification accuracy by using random images from the Tiny Images dataset as unlabeled extra inputs during training. Finally, we demonstrate good tolerance to incorrect labels.
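
One plausible way to form the consensus prediction described above is an exponential moving average of the network's per-sample outputs across epochs, with a correction for the zero-initialized startup. The sketch below illustrates that idea only; the averaging scheme, the `alpha` value, and the function name are assumptions, not details given in the abstract.

```python
def update_ensemble(ensemble, prediction, epoch, alpha=0.6):
    """Blend this epoch's class probabilities into a running average.

    ensemble   -- accumulated average from earlier epochs (all zeros at start)
    prediction -- network output for the same sample at this epoch
    epoch      -- 1-based epoch index, used for startup bias correction
    alpha      -- assumed momentum of the moving average
    Returns (new_ensemble, target), where target is the bias-corrected average.
    """
    new_ensemble = [alpha * e + (1.0 - alpha) * p
                    for e, p in zip(ensemble, prediction)]
    correction = 1.0 - alpha ** epoch
    target = [e / correction for e in new_ensemble]
    return new_ensemble, target
```

The returned `target` would then serve as the training target for the unlabeled sample at the next epoch, while `new_ensemble` is carried forward.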

We introduce apex point map, a simple data structure for constructing conservative bounds for rigid objects. The data structure is distilled from a dense k-DOP, and can be queried in constant time to determine a tight bounding plane with any given normal vector. Both precalculation and lookup can be implemented very efficiently on current GPUs. Applications include, e.g., finding tight world-space bounds for transformed meshes, determining per-object shadow map extents, more accurate view frustum culling, and collision detection.

We present a method for extreme occluder simplification. We take a triangle soup as input, and produce a small set of polygons with closely matching occlusion properties. In contrast to methods that optimize the original geometry, our algorithm has very few requirements for the input—specifically, the input does not need to be a watertight, two-manifold mesh. This robustness is achieved by working on a well-behaved, discretized representation of the input instead of the original, potentially badly structured geometry. We first formulate the algorithm for individual occluders, and further introduce a hierarchy for handling large, complex scenes.

We introduce a novel Metropolis rendering algorithm that directly computes image gradients, and reconstructs the final image from the gradients by solving a Poisson equation. The reconstruction is aided by a low-fidelity approximation of the image computed during gradient sampling. As an extension of path-space Metropolis light transport, our algorithm is well suited for difficult transport scenarios. We demonstrate that our method outperforms the state-of-the-art in several well-known test scenes. Additionally, we analyze the spectral properties of gradient-domain sampling, and compare it to the traditional image-domain sampling.

When programming for GPUs, simply porting a large CPU program into an equally large GPU kernel is generally not a good approach. Due to the SIMT execution model on GPUs, divergence in control flow carries substantial performance penalties, as does high register usage that lessens the latency-hiding capability that is essential for the high-latency, high-bandwidth memory system of a GPU. In this paper, we implement a path tracer on a GPU using a wavefront formulation, avoiding these pitfalls that can be especially prominent when using materials that are expensive to evaluate. We compare our performance against the traditional megakernel approach, and demonstrate that the wavefront formulation is much better suited for real-world use cases where multiple complex materials are present in the scene.

The surface area heuristic (SAH) is widely used as a predictor for ray tracing performance, and as a heuristic to guide the construction of spatial acceleration structures. We investigate how well SAH actually predicts ray tracing performance of a bounding volume hierarchy (BVH), observe that this relationship is far from perfect, and then propose two new metrics that together with SAH almost completely explain the measured performance. Our observations shed light on the increasingly common situation that a supposedly good tree construction algorithm produces trees that are slower to trace than expected. We also note that the trees constructed using greedy top-down algorithms are consistently faster to trace than SAH indicates and are also more SIMD-friendly than competing approaches.
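
For reference, the SAH cost of an existing BVH is commonly evaluated as a sum over nodes, each weighted by its surface area relative to the root. The sketch below follows that standard formulation; the dictionary-based node layout and the cost constants `c_node` and `c_tri` are assumptions for illustration, and the two additional metrics proposed in the paper are not shown.

```python
def surface_area(lo, hi):
    """Surface area of an axis-aligned box given its two corners."""
    dx, dy, dz = (h - l for l, h in zip(lo, hi))
    return 2.0 * (dx * dy + dy * dz + dz * dx)

def sah_cost(node, root_area=None, c_node=1.0, c_tri=1.0):
    """SAH cost of the subtree rooted at `node`.

    Inner nodes: {'lo': ..., 'hi': ..., 'children': [...]}
    Leaves:      {'lo': ..., 'hi': ..., 'tris': <triangle count>}
    c_node and c_tri are assumed per-node and per-triangle cost constants.
    """
    area = surface_area(node['lo'], node['hi'])
    if root_area is None:
        root_area = area
    hit_probability = area / root_area  # chance a random ray hitting the root hits this node
    if 'tris' in node:
        return c_tri * hit_probability * node['tris']
    return c_node * hit_probability + sum(
        sah_cost(child, root_area, c_node, c_tri) for child in node['children'])
```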

We present a novel approach to voxelization, based on intersecting the input primitives against intersection targets in the voxel grid. Instead of relying on geometric proximity measures, our approach is topological in nature, i.e., it builds on the connectivity and separability properties of the input and the intersection targets. We discuss voxelization of curves and surfaces in both 2D and 3D, and derive intersection targets that produce voxelizations with various connectivity, separability and thinness properties. The simplicity of our method allows for easy proofs of these properties. Our approach is directly applicable to curved primitives, and it is independent of input tessellation.

Stochastic techniques for rendering indirect illumination suffer from noise due to the variance in the integrand. In this paper, we describe a general reconstruction technique that exploits anisotropy in the light field and permits efficient reuse of input samples between pixels or world-space locations, multiplying the effective sampling rate by a large factor. Our technique introduces visibility-aware anisotropic reconstruction to indirect illumination, ambient occlusion and glossy reflections. It operates on point samples without knowledge of the scene, and can thus be seen as an advanced image filter. Our results show dramatic improvement in image quality while using very sparse input samplings.

This technical report is an addendum to the HPG 2009 paper "Understanding the Efficiency of Ray Traversal on GPUs", and provides citable performance results for Kepler and Fermi architectures. We explain how to optimize the traversal and intersection kernels for these newer platforms, and what the important architectural limiters are. We plot the relative ray tracing performance between architecture generations against the available memory bandwidth and peak FLOPS, and demonstrate that ray tracing is still, even with incoherent rays and more complex scenes, almost entirely limited by the available FLOPS. We also discuss two esoteric instructions, present in both Fermi and Kepler, and show that they can be safely used for faster acceleration structure traversal.

We present a novel method for increasing the efficiency of stochastic rasterization of motion and defocus blur. Contrary to earlier approaches, our method is efficient even with the low sampling densities commonly encountered in real-time rendering, while allowing the use of arbitrary sampling patterns for maximal image quality. Our clipless dual-space formulation avoids problems with triangles that cross the camera plane during the shutter interval. The method is also simple to plug into existing rendering systems.

Traditionally, effects that require evaluating multidimensional integrals for each pixel, such as motion blur, depth of field, and soft shadows, suffer from noise due to the variance of the high-dimensional integrand. In this paper, we describe a general reconstruction technique that exploits the anisotropy in the temporal light field and permits efficient reuse of samples between pixels, multiplying the effective sampling rate by a large factor. We show that our technique can be applied in situations that are challenging or impossible for previous anisotropic reconstruction methods, and that it can yield good results with very sparse inputs. We demonstrate our method for simultaneous motion blur, depth of field, and soft shadows.

Our previous paper on stochastic rasterization [Laine et al. 2011] presented a method for constructing time and lens bounds to accelerate stochastic rasterization by skipping the costly 5D coverage test. Although the method works for the combined case of simultaneous motion and defocus blur, its efficiency drops when significant amounts of both effects are present. In this paper, we describe a bound computation method that treats time and lens domains in a unified fashion, and yields tight bounds also for the combined case.

In this paper, we implement an efficient, completely software-based graphics pipeline on a GPU. Unlike previous approaches, we obey ordering constraints imposed by current graphics APIs, guarantee hole-free rasterization, and support multisample antialiasing. Our goal is to examine the performance implications of not exploiting the fixed-function graphics pipeline, and to discern which additional hardware support would benefit software-based graphics the most.

We present significant improvements over previous work in terms of scalability, performance, and capabilities. Our pipeline is malleable and easy to extend, and we demonstrate that in a wide variety of test cases its performance is within a factor of 2–8x of the hardware graphics pipeline on a top-of-the-line GPU.

The traditional method of rendering semi-transparent surfaces using alpha blending requires sorting the surfaces in depth order. There are several techniques for order-independent transparency, but most require either unbounded storage or can be fragile due to forced compaction of information during rendering. Stochastic transparency works in a fixed amount of storage and produces results with the correct expected value. However, carelessly chosen sampling strategies easily result in high variance of the final pixel colors, showing as noise in the image. In this paper, we describe a series of improvements to stochastic transparency that enable stratified sampling in both spatial and alpha domains. As a result, the amount of noise in the image is significantly reduced, while the result remains unbiased.
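
The difference between careless and stratified sampling in the alpha domain can be sketched as follows. In the unstratified case each sample is covered independently with probability alpha; in the stratified case the number of covered samples is fixed to one of the two integers nearest to alpha times the sample count, chosen so that the expected count is unchanged. This is a simple stratification picked for illustration, not necessarily the scheme developed in the paper, and the function name and parameters are hypothetical.

```python
import random

def coverage_mask(alpha, num_samples, rng, stratified=True):
    """Return one boolean per sample telling whether a fragment with
    opacity `alpha` covers it. `rng` is a random.Random instance."""
    if not stratified:
        # Independent coin flips: correct on average, but high variance.
        return [rng.random() < alpha for _ in range(num_samples)]
    # Cover floor(alpha*S) or ceil(alpha*S) samples; adding a uniform random
    # number before truncating keeps the expected count equal to alpha*S.
    count = min(int(alpha * num_samples + rng.random()), num_samples)
    covered = set(rng.sample(range(num_samples), count))
    return [i in covered for i in range(num_samples)]
```

Both variants have the same expected coverage, but the stratified count never deviates from alpha times the sample count by more than one sample, which is what reduces the visible noise.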

In this paper we examine the possibilities of using voxel representations as a generic way for expressing complex and feature-rich geometry on current and future GPUs. We present in detail a compact data structure for storing voxels and an efficient algorithm for performing ray casts using this structure.

We augment the voxel data with novel contour information that increases geometric resolution, allows more compact encoding of smooth surfaces, and accelerates ray casts. We also employ a novel normal compression format for storing high-precision object-space normals. Finally, we present a variable-radius post-process filtering technique for smoothing out blockiness caused by discrete sampling of shading attributes.

Based on benchmark results, we show that our voxel representation is competitive with triangle-based representations in terms of ray casting performance, while allowing tremendously greater geometric detail and unique shading information for every voxel.

Stochastic renderers produce unbiased but noisy images of scenes that include the advanced camera effects of motion and defocus blur and possibly other effects such as transparency. We present a simple algorithm that selectively adds bias in the form of image space blur to pixels that are unlikely to have high frequency content in the final image. For each pixel, we sweep once through a fixed neighborhood of samples in front to back order, using a simple accumulation scheme. We achieve good quality images with only 16 samples per pixel, making the algorithm potentially practical for interactive stochastic rendering in the near future.

A ray cast algorithm utilizing a hierarchical acceleration structure needs to perform a tree traversal in the hierarchy. In its basic form, executing the traversal requires a stack that holds the nodes that are still to be processed. In some cases, such a stack can be prohibitively expensive to maintain or access, due to storage or memory bandwidth limitations. The stack can, however, be eliminated or replaced with a fixed-size buffer using so-called stackless or short stack algorithms. These require that the traversal can be restarted from root so that the already processed part of the tree is not entered again. For kd-tree ray casts, this is accomplished easily by ray shortening, but the approach does not extend to other kinds of hierarchies such as BVHs.

In this paper, we introduce restart trail, a simple algorithmic method that makes restarts possible regardless of the type of hierarchy by storing one bit of data per level. This enables stackless and short stack traversal for BVH ray casts, where using a full stack or constraining the traversal order have so far been the only options.
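
The following is a simplified sketch of how a one-bit-per-level trail can replace the traversal stack. It makes several assumptions for brevity: 1D intervals stand in for bounding boxes, the query is an interval rather than a ray, the left child is always treated as the near child, and there is no ray shortening, so every overlapping leaf is reported. The class and function names are hypothetical.

```python
class Node:
    def __init__(self, lo, hi, left=None, right=None, name=None):
        self.lo, self.hi = lo, hi            # 1D interval standing in for a box
        self.left, self.right = left, right  # both None for leaves
        self.name = name                     # label, set for leaves only

    @property
    def is_leaf(self):
        return self.left is None

def overlaps(node, q_lo, q_hi):
    return node.lo <= q_hi and q_lo <= node.hi

def traverse_restart_trail(root, q_lo, q_hi):
    """Report every leaf overlapping [q_lo, q_hi] without using a stack."""
    found = []
    if not overlaps(root, q_lo, q_hi):
        return found
    trail = 0  # bit L set: nothing is left to visit at level L except the current branch
    while True:
        node, level = root, 0
        # Restart from the root, steering by the trail bits.
        while not node.is_leaf:
            hit_l = overlaps(node.left, q_lo, q_hi)
            hit_r = overlaps(node.right, q_lo, q_hi)
            bit = 1 << level
            if hit_l and hit_r:
                node = node.right if trail & bit else node.left
            elif hit_l or hit_r:
                trail |= bit  # single candidate child, so this level is exhausted
                node = node.left if hit_l else node.right
            else:
                break         # dead end: neither child overlaps
            level += 1
        if node.is_leaf:
            found.append(node.name)
        # Pop: find the deepest level on the path whose second child is pending.
        pop = level - 1
        while pop >= 0 and trail & (1 << pop):
            pop -= 1
        if pop < 0:
            return found
        trail |= 1 << pop              # take the second child there on the next pass
        trail &= (1 << (pop + 1)) - 1  # forget decisions made below the pop level
```

Each pass re-descends from the root, so the method trades some redundant node tests for the removal of the stack; a short stack, as mentioned above, would cut down on these restarts.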

Ambient occlusion has proven to be a useful tool for producing realistic images, both in offline rendering and interactive applications. In production rendering, ambient occlusion is typically computed by casting a large number of short shadow rays from each visible point, yielding unparalleled quality but long rendering times. Interactive applications typically use screen-space approximations which are fast but suffer from systematic errors due to missing information behind the nearest depth layer.

In this paper, we present two efficient methods for calculating ambient occlusion so that the results match those produced by a ray tracer. The first method is targeted for rasterization-based engines, and it leverages the GPU graphics pipeline for finding occlusion relations between scene triangles and the visible points. The second method is a drop-in replacement for ambient occlusion computation in offline renderers, allowing the querying of ambient occlusion for any point in the scene. Both methods are based on the principle of simultaneously computing the result of all shadow rays for a single receiver point.

This technical report extends our previous paper on sparse voxel octrees. We first discuss the benefits and drawbacks of voxel representations and how the storage space requirements behave for different kinds of content. Then, we explain in detail our compact data structure for storing voxels and an efficient ray cast algorithm that utilizes this structure, including the contributions of the original paper: additional voxel contour information, normal compression format for storing high-precision object-space normals, post-process filtering technique for smoothing out blockiness of shading, and beam optimization for accelerating ray casts.

Management of voxel data in memory and on disk is covered in more detail, as well as the construction of voxel hierarchy. We extend the results section considerably, providing detailed statistics of our test cases. Finally, we discuss the technological barriers and problems that would need to be overcome before voxels could be widely adopted as a generic content format.

In this paper we examine the possibilities of using voxel representations as a generic way for expressing complex and feature-rich geometry on current and future GPUs. We present in detail a compact data structure for storing voxels and an efficient algorithm for performing ray casts using this structure.

We augment the voxel data with novel contour information that increases geometric resolution, allows more compact encoding of smooth surfaces, and accelerates ray casts. We also employ a novel normal compression format for storing high-precision object-space normals. Finally, we present a variable-radius post-process filtering technique for smoothing out blockiness caused by discrete sampling of shading attributes.

Our benchmarks show that our voxel representation is competitive with triangle-based representations in terms of ray casting performance, while allowing tremendously greater geometric detail and unique shading information for every voxel.

We discuss the mapping of elementary ray tracing operations (acceleration structure traversal and primitive intersection) onto wide SIMD/SIMT machines. Our focus is on NVIDIA GPUs, but some of the observations should be valid for other wide machines as well. While several fast GPU tracing methods have been published, very little is actually understood about their performance. Nobody knows whether the methods are anywhere near the theoretically obtainable limits, and if not, what might be causing the discrepancy. We study this question by comparing the measurements against a simulator that gives the upper bound of performance for a given kernel. We observe that previously known methods are a factor of 1.5–2.5x off from the theoretical optimum, and most of the gap is not explained by memory bandwidth, but rather by previously unidentified inefficiencies in hardware work distribution. We then propose a simple solution that significantly narrows the gap between simulation and measurement. This results in the fastest GPU ray tracer to date. We provide results for primary, ambient occlusion and diffuse interreflection rays.

Determining early specular reflection paths is essential for room acoustics modeling. Beam tracing algorithms have been used to calculate these paths efficiently, thus allowing modeling of acoustics in real-time with a moving listener in simple, or complex but densely occluded, environments with a stationary sound source. In this paper it is shown that beam tracing algorithms can still be optimized by utilizing the spatial coherence in path validation with a moving listener. Since the precalculations required for the presented technique are relatively fast, the acoustic reflection paths can be calculated even for a moving source in simple cases. Simulations were performed to show how the accelerated algorithm compares with the basic algorithm with varying scene complexity and occlusion. Up to two orders of magnitude speed-up was achieved.

We present a method for rendering single-bounce indirect illumination in real time on currently available graphics hardware. The method is based on the instant radiosity algorithm, where virtual point lights (VPLs) are generated by casting rays from the primary light source. Hardware shadow maps are then employed for determining the indirect illumination from the VPLs. Our main contribution is an algorithm for reusing the VPLs and incrementally maintaining their good distribution. As a result, only a few shadow maps need to be rendered per frame as long as the motion of the primary light source is reasonably smooth. This yields real-time frame rates even when hundreds of VPLs are used.

We identify and analyze several performance problems in a state-of-the-art physically-based soft shadow volume algorithm, and present an improved method that alleviates these problems by replacing an overly conservative spatial acceleration structure by a more efficient one. The new technique consistently outperforms both the previous method and a ray tracing-based reference solution in several realistic situations while retaining the correctness of the solution and other desirable characteristics of the previous method. These include the unintrusiveness of the original algorithm, meaning that our method can be used as a black-box shadow solver in any offline renderer without requiring multiple passes over the image or other special accommodation. We achieve speedup factors from 1.6 to 12.3 when compared to the previous method.

Displaying a synthetic image on a computer display requires determining the colors of individual pixels. To avoid aliasing, multiple samples of the image can be taken per pixel, after which the color of a pixel may be computed as a weighted sum of the samples. The positions and weights of the samples play a major role in the resulting image quality, especially in real-time applications where usually only a handful of samples can be afforded per pixel. This paper presents a new error metric and an optimization method for antialiasing patterns used in image reconstruction. The metric is based on comparing the pattern against a given reference reconstruction filter in spatial domain and it takes into account psychovisually measured angle-specific acuities for sharp features.
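
The weighted-sum resolve mentioned above can be written out in a few lines. This is a minimal sketch; the function name is hypothetical, and normalizing by the total weight is an assumption that keeps overall brightness unchanged when the weights do not already sum to one.

```python
def resolve_pixel(samples, weights):
    """Combine per-sample colors into one pixel color.

    samples -- list of (r, g, b) tuples taken at the pattern's sample positions
    weights -- one filter weight per sample
    """
    total = sum(weights)
    return tuple(
        sum(w * s[channel] for s, w in zip(samples, weights)) / total
        for channel in range(3))
```

For example, `resolve_pixel([(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)], [3.0, 1.0])` blends three parts red with one part blue.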

Precomputing volumetric lighting allows realistic mutual shadowing and reflections between objects with little runtime cost: for example, using an irradiance volume the shadows and reflections due to a static scene can be precomputed into a three-dimensional grid and this grid can be used to shade moving objects at runtime. However, a rather low spatial resolution has to be used to keep the memory requirements acceptable. For this reason, these methods often suffer from aliasing artifacts.

In this article we introduce a new sampling algorithm for precomputing lighting into a regular three-dimensional grid. The advantage of the new method is that it dramatically reduces aliasing while adding only a small overhead for the precomputation time. Additionally, the runtime component does not have to be changed at all.
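
At runtime, a precomputed lighting grid is typically sampled at the position of the object being shaded. The sketch below assumes trilinear interpolation of a scalar value stored at the grid points; the interpolation scheme, the nested-list layout, and the function name are assumptions for illustration and are not taken from the article.

```python
import math

def sample_grid(grid, x, y, z):
    """Trilinearly interpolate grid[i][j][k] at a point given in grid units.
    Each grid dimension must hold at least two entries."""
    def cell(t, n):
        t = min(max(t, 0.0), n - 1.0)       # clamp the coordinate to the grid
        i = min(int(math.floor(t)), n - 2)  # index of the lower grid point
        return i, t - i                     # index and fractional offset
    i, fx = cell(x, len(grid))
    j, fy = cell(y, len(grid[0]))
    k, fz = cell(z, len(grid[0][0]))
    result = 0.0
    # Blend the eight grid points surrounding the query position.
    for di, wx in ((0, 1.0 - fx), (1, fx)):
        for dj, wy in ((0, 1.0 - fy), (1, fy)):
            for dk, wz in ((0, 1.0 - fz), (1, fz)):
                result += wx * wy * wz * grid[i + di][j + dj][k + dk]
    return result
```

Because the lookup only reads the stored values, any improvement in how those values are precomputed carries over without touching this runtime code.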

To improve image quality in computer graphics, antialiasing techniques such as supersampling and multisampling are used. We explore a family of inexpensive sampling schemes that cost as little as 1.25 samples per pixel and up to 2.0 samples per pixel. By placing sample points in the corners or on the edges of the pixels, sharing can occur between pixels, and this makes it possible to create inexpensive sampling schemes. Using an evaluation and optimization framework, we present optimized sampling patterns costing 1.25, 1.5, 1.75 and 2.0 samples per pixel.

We present a new, fast algorithm for rendering physically-based soft shadows in ray tracing-based renderers. Our method replaces the hundreds of shadow rays commonly used in stochastic ray tracers with a single shadow ray and a local reconstruction of the visibility function. Compared to tracing the shadow rays, our algorithm produces exactly the same image while executing one to two orders of magnitude faster in the test scenes used. Our first contribution is a two-stage method for quickly determining the silhouette edges that overlap an area light source, as seen from the point to be shaded. Secondly, we show that these partial silhouettes of occluders, along with a single shadow ray, are sufficient for reconstructing the visibility function between the point and the light source.

We present a novel algorithm for rendering physically-based soft shadows in complex scenes. Instead of casting shadow rays, we place both the points to be shaded and the samples of an area light source into separate hierarchies, and compute hierarchically the shadows caused by each occluding triangle. This yields an efficient algorithm with memory requirements independent of the complexity of the scene.

We present a novel method for rendering shadow volumes. The core idea of the method is to locally choose between Z-pass and Z-fail algorithms on a per-tile basis. The choice is made by comparing the contents of the low-resolution depth buffer against an automatically constructed split plane. We show that this reduces the number of stencil updates substantially without affecting the resulting shadows. We outline a simple and efficient hardware implementation that enables the early tile culling stages to reject considerably more pixels than with shadow volume optimizations currently available in the hardware.

Occlusion culling based on precomputed visibility information is a standard method for accelerating the rendering in real-time graphics applications. In this paper we present a new general algorithm that performs the visibility precomputation for a group of viewcells in an output-sensitive fashion. This is achieved by exploiting the directional coherence of visibility between adjacent viewcells. The algorithm is independent of the underlying from-region visibility solver and is therefore applicable to exact, conservative and aggressive visibility solvers in both 2D and 3D.

We present a novel real-time technique for computing inter-object ambient occlusion. For each occluding object, we precompute a field in the surrounding space that encodes an approximation of the occlusion caused by the object. This volumetric information is then used at run-time in a fragment program for quickly determining the shadow cast on the receiving objects. According to our results, both the computational and storage requirements are low enough for the technique to be directly applicable to computer games running on the current graphics hardware.

In this paper we abandon the regular structure of shadow maps. Instead, we transform the visible pixels P(x,y,z) from screen space to the image plane of a light source P'(x',y',z'). The (x',y') are then used as sampling points when the geometry is rasterized into the shadow map. This eliminates the resolution issues that have plagued shadow maps for decades, e.g., jagged shadow boundaries. Incorrect self-shadowing is also greatly reduced, and semi-transparent shadow casters and receivers can be supported. A hierarchical software implementation is outlined.

Theses

This research focuses on developing efficient algorithms for computing shadows in computer-generated images. A distinctive feature of the shadow algorithms presented in this thesis is that they produce correct, physically-based results, instead of giving approximations whose quality is often hard to ensure or evaluate.

Light sources that are modeled as points without any spatial extent produce hard shadows with sharp boundaries. Shadow mapping is a traditional method for rendering such shadows. A shadow map is a depth buffer computed from the scene, using a point light source as the viewpoint. The finite resolution of the shadow map requires that its contents are resampled when determining the shadows on visible surfaces. This causes various artifacts such as incorrect self-shadowing and jagged shadow boundaries. A novel method is presented that avoids the resampling step, and provides exact shadows for every point visible in the image.

The shadow volume algorithm is another commonly used algorithm for real-time rendering of hard shadows. This algorithm gives exact results and does not suffer from any resampling problems, but it tends to consume a lot of fillrate, which leads to performance problems. This thesis presents a new technique for locally choosing between two previous shadow volume algorithms with different performance characteristics. A simple criterion for making the local choices is shown to yield better performance than using either of the algorithms alone.

Light sources with nonzero spatial extent give rise to soft shadows with smooth boundaries. A novel method is presented that transposes the classical processing order for soft shadow computation in offline rendering. Instead of casting shadow rays, the algorithm first conceptually collects every ray that would need to be cast, and then processes the shadow-casting primitives one by one, hierarchically finding the rays that are blocked.

Another new soft shadow algorithm takes a different approach to computing the shadows. Only the silhouettes of the shadow casters are used for determining the shadows, and an unintrusive execution model makes the algorithm practical for production use in offline rendering.

The proposed techniques accelerate the computing of physically-based shadows in real-time and offline rendering. These improvements make it possible to use correct, physically-based shadows in a broad range of scenes that previous methods cannot handle efficiently enough.

The rendering of soft shadows is an important task in computer graphics. Soft shadows appear when the light source is not modeled as a single point but as an object with nonzero surface area. Obtaining correct physically-based shadows requires determining the amount of light that flows from the light source to a receiving point on the surface being rendered. This is generally computationally expensive, and efficient solution methods are needed for keeping the rendering times at a tolerable level.

There is usually significant coherence in shadows among nearby receiving points, and nearby parts of a light source also tend to contribute to the image in a similar fashion. Exploiting these forms of coherence is the key element of modern soft shadow algorithms.

This thesis presents a novel physically-based soft shadow algorithm that attempts to exploit the coherence as much as possible, solving the shadow relations in large chunks instead of considering single points in the emitting or receiving end. The computation of shadow relations is performed hierarchically, and an efficient representation of shadow-casting geometry is maintained incrementally. The algorithm is a generic tool for solving sets of visibility relations in polygonal scenes, and may have uses in areas other than shadow computation as well.

In addition to presenting the novel algorithm in detail, several existing physically-based shadow algorithms are analyzed and ranked according to their computational complexities. Experimental results are also presented for illustrating the applicability of the novel algorithm in different kinds of rendering situations.