Automated understanding of facial expressions is a fundamental step toward high-level human-computer interaction. The goal of this proposal is to develop a solution to this problem by exploiting the color, depth, and temporal information provided by an RGB-D video feed. We plan to model human facial expressions by analyzing temporal variations in the activation patterns of their natural constituents: three-dimensional Action Units. As a starting point for algorithm development, we propose to build on our prior experience developing convolutional neural network architectures for fine-grained localization, RGB-D scene understanding, and video analysis.
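To make the idea of analyzing temporal variations in Action Unit activations concrete, the sketch below segments per-frame AU intensity traces into active episodes (onset to offset) by thresholding. This is an illustrative assumption, not the proposal's method: the function name `au_episodes`, the threshold value, and the toy data are all hypothetical, and in practice the per-frame intensities would come from a learned RGB-D model rather than be given directly.

```python
import numpy as np

def au_episodes(activations, threshold=0.5):
    """For each Action Unit (column), return a list of
    (onset_frame, offset_frame) intervals during which the
    activation intensity exceeds `threshold`.

    `activations` is a (num_frames, num_AUs) array of per-frame
    intensities; the threshold of 0.5 is an arbitrary assumption.
    """
    activations = np.asarray(activations, dtype=float)
    episodes = []
    for k in range(activations.shape[1]):
        active = activations[:, k] > threshold
        # Pad with False so every episode has a detectable onset and offset.
        edges = np.diff(np.concatenate(([False], active, [False])).astype(int))
        onsets = np.where(edges == 1)[0]          # rising edges
        offsets = np.where(edges == -1)[0] - 1    # last active frame
        episodes.append(list(zip(onsets.tolist(), offsets.tolist())))
    return episodes

# Toy sequence: 6 frames, 2 Action Units (made-up intensities).
acts = [[0.1, 0.9],
        [0.7, 0.8],
        [0.8, 0.2],
        [0.2, 0.1],
        [0.6, 0.1],
        [0.9, 0.4]]
print(au_episodes(acts))  # → [[(1, 2), (4, 5)], [(0, 1)]]
```

A temporal model over these episode patterns (duration, co-occurrence, ordering across AUs) is one way the proposal's "pattern of activations" could be represented downstream.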