We present a deep learning-based technique to infer high-quality facial
reflectance and geometry given a single unconstrained image of the subject,
which may contain partial occlusions and arbitrary illumination conditions.
The reconstructed high-resolution textures, which are generated in
only a few seconds, include skin surface reflectance maps representing
both the diffuse and specular albedo, as well as medium- and high-frequency
displacement maps, thereby allowing us to render compelling
digital avatars under novel lighting conditions. To extract this data, we train
our deep neural networks with a high-quality skin reflectance and geometry
database created with a state-of-the-art multi-view photometric stereo system
using polarized gradient illumination. Given the raw facial texture map
extracted from the input image, our neural networks synthesize complete reflectance
and displacement maps, filling in regions missing due to occlusions
in the input. Because our network architecture propagates texture
features from the visible regions, the completed textures exhibit
consistent quality throughout the face, with synthesized details that
match those in the visible areas. We describe how this highly
underconstrained problem is made tractable by dividing the full inference
into smaller tasks, which are addressed by dedicated neural networks. We
demonstrate the effectiveness of our network design with robust texture
completion from images of faces that are largely occluded. With the inferred
reflectance and geometry data, we demonstrate the rendering of high-fidelity
3D avatars from a variety of subjects captured under different lighting conditions.
In addition, we perform evaluations demonstrating that our method
can infer plausible facial reflectance and geometric details comparable to
those obtained from high-end capture devices, and that it outperforms
alternative approaches that likewise require only a single unconstrained input image.

Overview

Our system pipeline is illustrated in Fig. 2. Given a single input image
captured in unconstrained conditions, we begin by extracting the
base mesh of the face and the corresponding texture map obtained
by projecting the face in the input image onto this mesh. This
map is passed through two convolutional neural networks (CNNs)
that infer the corresponding reflectance and
displacement maps (Sec. 5). The first network infers the diffuse
albedo map, while the second infers the specular albedo as well as
the mid- and high-frequency displacement maps. However, these
maps may contain large missing regions due to occlusions in the
input image. In the next stage, we perform texture completion and
refinement to fill these regions with content that is consistent with
that found in the visible regions (Sec. 6). Finally, we perform super-resolution
to increase the pixel resolution of the completed textures
from 512 × 512 to 2048 × 2048. The resulting textures contain
natural and high-fidelity details that can be used with the base mesh
to render high-fidelity avatars in novel lighting environments.
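
To make the staged design concrete, the following is a minimal sketch of the pipeline in PyTorch-style Python. Every module and function name here (base_mesh_fitter, diffuse_net, spec_disp_net, completion_net, superres_net) is a hypothetical placeholder for illustration, not our actual implementation.

```python
# Minimal sketch of the staged pipeline described above; all module and
# function names are hypothetical placeholders, not the actual system.
import torch

def infer_avatar_assets(image, base_mesh_fitter, diffuse_net,
                        spec_disp_net, completion_net, superres_net):
    # Stage 1: fit a base mesh and unwrap the visible face pixels
    # into a 512x512 texture map.
    base_mesh, raw_texture = base_mesh_fitter(image)

    with torch.no_grad():
        # Stage 2: two CNNs infer reflectance and displacement (Sec. 5).
        diffuse = diffuse_net(raw_texture)
        specular, disp_mid, disp_high = spec_disp_net(raw_texture)

        # Stage 3: complete occluded regions with content consistent
        # with the visible ones (Sec. 6).
        maps = [completion_net(m)
                for m in (diffuse, specular, disp_mid, disp_high)]

        # Stage 4: super-resolve each completed map from 512x512 to 2048x2048.
        maps = [superres_net(m) for m in maps]

    return base_mesh, maps
```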
To obtain high-quality results, we found that it was essential to
divide the inference and completion process into these smaller objectives
so as to make the training process more tractable. Using a single
network that performs both the texture completion
and detail refinement on all of the desired output data (reflectance
and geometry maps) produces significantly worse results than our
described approach, in which the problems are decomposed into
separate stages addressed by networks trained for more specific
tasks, and in which the diffuse albedo is generated by a network
separate from the one that generates the remaining output data.
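
As an illustration of this decomposition, the sketch below uses hypothetical PyTorch layers (not our exact architecture): a dedicated network produces the diffuse albedo, while a second network with a shared encoder and per-map heads produces the specular albedo and the two displacement maps.

```python
# Hypothetical decomposition sketch; layer choices are illustrative only.
import torch.nn as nn

def conv_block(cin, cout):
    return nn.Sequential(nn.Conv2d(cin, cout, 3, padding=1), nn.ReLU())

class DiffuseNet(nn.Module):
    """Dedicated network: raw texture -> diffuse albedo map."""
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(conv_block(3, 64), conv_block(64, 64),
                                  nn.Conv2d(64, 3, 3, padding=1))

    def forward(self, x):
        return self.body(x)

class SpecDispNet(nn.Module):
    """Shared encoder with separate heads for the specular albedo and
    the mid- and high-frequency displacement maps."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(conv_block(3, 64), conv_block(64, 64))
        self.spec = nn.Conv2d(64, 1, 3, padding=1)
        self.disp_mid = nn.Conv2d(64, 1, 3, padding=1)
        self.disp_high = nn.Conv2d(64, 1, 3, padding=1)

    def forward(self, x):
        f = self.encoder(x)
        return self.spec(f), self.disp_mid(f), self.disp_high(f)
```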

Results

All our results are rendered with brute-force path tracing in
Solid Angle’s Arnold renderer [Solid Angle 2016], using physically
based specular reflection and subsurface scattering under high-dynamic-range
image-based illumination. The resulting surface and
subsurface reflectance, together with the base surface mesh and the
displacement maps, are used to produce the final render using a layered
skin reflectance model as in [The Digital Human League 2015] (see
supplemental material for more details on the rendering process).
Evaluation. We quantitatively measure the ability of our system
to faithfully recover the reflectance and geometry data from a set
of 100 test images for which we have the corresponding ground-truth
measurements.
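
As an illustration only, a per-map error such as RMSE could be computed against the ground-truth measurements as in the sketch below; the metric and the (inferred, ground-truth) pairing are assumptions made for illustration, not the exact evaluation protocol.

```python
# Illustrative evaluation sketch: average per-map RMSE against ground truth.
import numpy as np

def rmse(inferred: np.ndarray, ground_truth: np.ndarray) -> float:
    """Root-mean-square error between two texture maps of equal shape."""
    diff = inferred.astype(np.float64) - ground_truth.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))

def mean_rmse(pairs):
    """Average RMSE over the test set's (inferred, ground-truth) map pairs."""
    scores = [rmse(a, b) for a, b in pairs]
    return sum(scores) / len(scores)
```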
