Deep learning is enabling a revolution in how we interact with technology in our daily lives, from personalized healthcare to how we drive our cars. NVIDIA AI Co-Pilot combines deep learning and visual computing to enable augmented driving. Co-Pilot uses sensor data from a microphone inside the car, and cameras both inside and outside the car, to track the environment around the driver. Co-Pilot understands where you are looking while driving to determine objects you might not see—like what’s coming up along side you, ahead of you and in your blind spot. This lets the car understand the driver as well as its environment and provides suggestions, warnings and, where needed, interventions for safer and more enjoyable experience. A key component of AI Co-Pilot is the technology for continuous real-time monitoring of the driver’s posture and gaze, as Figure 1 shows.

Estimating facial features such as head pose and facial landmarks from images is key for many applications, including activity recognition, human-computer interaction, and facial motion capture. While most prior work has focused on facial feature estimation from a single image, videos provide temporal links among nearby image frames, which is essential for accurate and robust estimation. A key challenge for video-based facial analysis is to properly exploit temporal coherence.

In this post, we describe how we use recurrent neural networks (RNNs) for joint estimation and tracking of facial features in videos. As a generic and learning-based approach for time series prediction, RNNs avoid tracker engineering for tasks performed on videos, much like CNNs (Convolutional Neural Networks) avoid feature engineering for tasks performed on still images.

We recently incorporated RNN-based facial analysis into the NVIDIA AI Co-Pilot platform to improve the overall driving experience, and published this work [1] in IEEE CVPR 2017.

Bayesian Filtering vs. RNNs for Dynamic Facial Analysis

The traditional approach for analyzing facial features in videos is Bayesian Filtering. Examples include Kalman Filters and Particle Filters. These methods, however, generally requires complicated, problem-specific design and tuning. For example, tracking can be performed at different levels for different tasks [2]. To deal with drifting, many tracking methods require a failure detection and reinitialization method as a backup [3]. For complex tasks (for example, non-rigid face, hand, or body tracking), implementing a Bayesian filter is quite challenging [4]. In short, the need for such tracker engineering makes these methods cumbersome and less generic for dynamic facial analysis from videos.

In order to avoid tracker engineering, we use RNNs to directly learn the trackers from a large amount of training data. As Figure 2 shows, we found that the computation performed by a RNN resembles Bayesian filtering—both methods estimate the hidden states from the measurements and propagate the hidden states over time. The major difference is that Bayesian filtering requires the state transition model and the measurement model to be known beforehand with a problem-specific design, while RNNs learn these two models directly from training data. Please refer to our paper for a more detailed comparison.

Figure 2. The Connection between Bayesian filters (left) and RNNs (right). Bayesian filters model the dynamics between the hidden state and the measurement as a stochastic Markov process (shown as dashed red arrows). Given a sequence of noisy measurements, the goal of Bayesian filtering is to estimate the optimal states and optionally the target output (shown as solid black arrows). Similarly, a RNN learns the prediction from a sequential input to the sequential output by passing information over time via the hidden state.

A Toy Example: Tracking a 1D Cursor

We first use an expository example to illustrate the similarity and differences between Bayesian filtering and RNNs. Figure 3 shows a classic example for Bayesian filtering, in which a cursor moves in a sinusoidal pattern in one dimension. The goal is to track the position of the cursor. The input to our system is the noisy measurement of the cursor’s position. We define the hidden state to be the cursor’s position, velocity, and acceleration. We use a linear Kalman filter (KF) to estimate the cursor position over time. Figure 3 shows that the Kalman Filter can smooth the noisy measurements over time as expected, but there is a gap to the ground truth position.

Alternatively, we trained a RNN (LSTM with 10 neurons for the hidden layer). For training, the input is the noisy measurements, the output is the true position of the cursor. We sampled the noise and created a large amount of training data. At testing, we tested the trained LSTM model at different noise levels. Figure 3 shows that RNN achieved much better results than Kalman Filter at all the noise levels.

Figure 3. Results on a toy example (tracking a moving cursor in one dimension) for Kalman filter (KF) and RNNs. The RNN (an LSTM with 10 neurons for the hidden layer) results in more accurate regression than the Kalman filter, even when the testing sequence has a different noise level from the training sequences (as the lower graph shows).

From Bayesian Filtering to RNNs

We now apply RNNs to facial analysis in videos. As Figure 4 shows, we studied four variants of the proposed method for head pose estimation and facial landmark localization in videos. Figure 4a is the baseline method, where we use a convolutional neural network (CNN) to estimate facial features for each input frame separately. In this case, the temporal connection is completely ignored. Many prior work falls in this category. The second method (Figure 4b) uses a Kalman filter (KF) or a particle filter (PF) to post-process the output of the CNN per-frame estimation. As we shall see later, Kalman filter or particle filter helps to temporally smoothen the per-frame estimation but they cannot reduce the bias in the estimation. In method three (Figure 4c), which we call post-RNN, we train the RNN separately from the CNN. The input of the RNN is the estimated facial features from the CNN, and thus the RNN has a similar role of the Kalman filter or particle filter for temporal smoothing. Finally, in method four (Figure 4d), we train the CNN and RNN jointly end-to-end from videos, where the input to the RNN is the high-dimensional feature maps estimated from the CNN. As we shall see later, this joint CNN-RNN structure is able to learn the temporal connections of facial features between consecutive frames and achieves significantly better performance than the other three methods.

Figure 4. Diagrams for the four variants of our proposed method for facial analysis in videos. The preferred method is (d), in which we train the CNN (feature extraction) and the RNN (tracking) jointly end-to-end from videos.

We used FC-RNN as our choice of RNN and VGG16 trained on the imageNet dataset as our choice of CNN. FC-RNN was originally proposed in [5] to leverage the generalizability of a pre-trained CNN. FC-RNN transforms the fully connected layer of a pre-trained CNN into recurrent layers with the intention of preserving the structure of a pre-trained CNN as much as possible. We adopt FC-RNN in our experiments for dynamic facial analysis because of its simple structural design and superior or on-par performance to other more complex variants of recurrent networks. We apply a set of regularization techniques to train the FC-RNN, such as variational dropout [6] and soft gradient clipping.

Large Synthetic Datasets are Essential

Big data is critical for the success of any deep learning related study. The requirement of big data is especially prominent for head pose and landmark estimation from videos, because (1) the existing video datasets are relatively small, and (2) the ground truth labels of head pose and facial landmarks are not accurate. For example, the BIWI head pose dataset [6] has only 20 subjects and 24 videos in total, and the 300VW facial landmark dataset [7] has only 114 videos, with semi-automatically generated labels as ground truth.

These problems motivated us to create a large-scale, synthetic dataset that has sufficient coverage of head motion and accurate ground truth labels. In this vein, we created to our knowledge the first synthetic head tracking dataset, NVIDIA SynHead, that covers 10 subjects, 70 head motion tracks, 700 videos, and in total 510,960 frames. SynHead is 32 times larger than BIWI (15,678 frames) and two times larger than 300-VW (218,595 frames). We released this synthetic dataset online for research purposes.

Figure 5 demonstrates the pipeline we used to create SynHead. The 10 (5 female and 5 male) 3D head models are high resolution 3D scans. To simulate realistic head motions, we gathered head motion tracks, 24 from the BIWI and 26 from the ETH dataset. Additionally, we recorded 20 depth video sequences performed by 13 (11 male and 2 female) subjects with the Kinect and SoftKinetic sensors. We computed the raw head pose angles for these sequences and discarded any failure cases by manual inspection. Finally, we temporally smoothed these head motion tracks with a Gaussian filter. For each head pose, we rendered the 10 head models and composed the rendered images with a randomly selected background image per motion track. We find that adding the random background is an efficient way to augment the dataset and is helpful for good performance.

Figure 5. The pipeline used to create the NVIDIA SynHead Dataset (available online).

Experimental Results

We applied our methods to two dynamic facial analysis tasks: head pose estimation and facial landmark localization from videos. We implemented the CNN and RNN networks with Theano, and trained the networks on an NVIDIA Titan X GPU.

Head Pose Estimation

For head pose estimation, we evaluated our methods on real data from the BIWI dataset which contains 24 videos in all. We followed the same experimental protocol proposed in prior work. Among all the four methods that Figure 4 shows, the end-to-end RNN approach (Figure 4d) performs the best not only in terms of the average error but also the standard deviation of the estimated error, which shows that the estimation with the end-to-end RNN method is more stable over time. Compared with other prior work, our method has achieved state-of-the-art performance (with average error about 1.5 degree on BIWI), with only RGB images as input. This shows the effectiveness of the proposed method for head pose estimation with large training data obtained by image synthesis. Figure 6 shows several examples of head pose estimation on the BIWI dataset, with RNN estimation and per-frame estimation.

Figure 6. Examples of head pose estimation on the BIWI dataset with the RNN and per-frame algorithms. RNN outperforms the per-frame estimation for various head poses.

Facial Landmark Localization

For facial landmark localization, we experimented with the 300-VW benchmark dataset. As a pre-processing step, we trained a face detector with Faster R-CNN to perform face detection on every frame. For each video, the central positions of the detected facial regions are smoothed temporally with a Gaussian filter, and the maximum size of the detected bounding boxes is used to extract a face-centered sequence. This pre-processing step stabilizes face detections over time and interpolates face regions for the few frames with missed face detection.

We employed several types of data augmentation—horizontal mirroring of the images, playing the image sequences in reverse, and small random scaling and translation of the face windows. We used the same network architecture as for head pose estimation except that the output layer has 136 neurons corresponding to the locations of the 68 facial landmarks.

In our experiments the RNN-based methods, including Post-RNN and RNN, improve the performance of per-frame estimation. In addition, compared to a recent work HyperFace [8], which is a multi-tasking network for frame-wise facial analysis and has state-of-theart performance, our RNN-based method has better performance, demonstrating the effectiveness of joint estimation and tracking. Finally, we observe that the RNN significantly reduces the failure rate compared to other methods. Apart from reducing the noise on the observed measurements, neither the Kalman filter (KF) or the particle filter (PF) are able to reduce the observed error from the ground truth.

Figure 7 shows examples of facial landmark localization on the 300-VW dataset. Compared to per-frame estimation, the RNN is more robust to occlusions, profile head poses, and facial expressions. In particular, it is more advantageous when the head is moving fast, which shows the RNN’s ability to learn motion information implicitly.

Learn More at CVPR 2017

Compared with traditional Bayesian filtering methods, our RNN-based method learns to jointly estimate frame-wise measurements and to temporally track them with a single end-to-end network. This makes the RNN-based method a generic approach that can be extended to other tasks of dynamic facial analysis in videos.

About Jinwei Gu
Jinwei Gu is a Senior Research Scientist in NVIDIA Research. He was a senior researcher in the America Media Lab in Futurewei Technologies from 2013 to 2015. From 2010 to 2013, he was an assistant professor in the Center for Imaging Science at Rochester Institute of Technology. Jinwei Gu received his PhD degree from Columbia University in May 2010, and his bachelor and master degree from Tsinghua University, China in 2002 and 2005. His research interests are computer vision, computational photography, machine learning and computer graphics. His current research focuses on 3D computer vision, Visual SLAM, and augmented reality, and multi-camera systems for immersive media.

About Xiaodong Yang
Xiaodong Yang is a Research Scientist at NVIDIA. His research interests are in the areas of computer vision, machine learning, deep learning, and multimedia. He has been working on large-scale image and video understanding, gesture and action recognition, dynamic facial analysis, video surveillance event detection, multimedia search, and assistive technology. He received his Ph.D. from City University of New York (2015) and B.S. from Huazhong University of Science and Technology (2009).

About Shalini De Mello
Shalini De Mello is a Senior Research Scientist at NVIDIA Research since March 2013. Her research interests are in computer vision and machine learning for human-computer interaction and smart interfaces. Her work includes NVIDIA’s shipping products for hand gesture recognition, face detection, video stabilization and GPU-optimized libraries for the development for computer vision applications on mobile platforms. She received doctoral and master’s degrees in Electrical and Computer Engineering from the University of Texas at Austin in 2008 and 2004, respectively.

About Jan Kautz
Jan Kautz is Senior Director of Computer Vision and Machine Learning Research and leads the Visual Computing Research team at NVIDIA, working predominantly on computer vision problems (from low-level vision through geometric vision to high-level vision), as well as machine learning problems (including deep reinforcement learning, generative models, and efficient deep learning). Before joining NVIDIA in 2013, Jan was a tenured faculty member at University College London. He holds a BSc in Computer Science from University of Erlangen-Nürnberg (1999), an MMath from the University of Waterloo (1999), received his PhD from the Max-Planck-Institut für Informatik (2003), and worked as a post-doc at the Massachusetts Institute of Technology (2003-2006).