Abstract

We present an approach that combines automatic features learned by convolutional neural networks (CNN) and handcrafted features computed by the bag-of-visual-words (BOVW) model in order to achieve state-of-the-art results in facial expression recognition. To obtain automatic features, we experiment with multiple CNN architectures, pre-trained models and training procedures, e.g. Dense-Sparse-Dense. After fusing the two types of features, we employ a local learning framework to predict the class label for each test image. The local learning framework is based on three steps. First, a k-nearest neighbors model is applied for selecting the nearest training samples for an input test image. Second, a one-versus-all Support Vector Machines (SVM) classifier is trained on the selected training samples. Finally, the SVM classifier is used for predicting the class label only for the test image it was trained for. Although we used local learning in combination with handcrafted features in our previous work, to the best of our knowledge, local learning has never been employed in combination with deep features. The experiments on the 2013 Facial Expression Recognition (FER) Challenge data set and the FER+ data set demonstrate that our approach achieves state-of-the-art results. With a top accuracy of 75.42% on the FER 2013 data set and 87.76% on the FER+ data set, we surpass all competition by more than 2% on both data sets.
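
The three steps of the local learning framework can be illustrated with a minimal NumPy sketch. It is not the implementation used in this work: the function names, the Euclidean distance, the linear kernel, the subgradient-descent optimizer (a stand-in for a proper SVM solver), the neighborhood size k and all hyperparameters are assumptions made for illustration. The rows of train_X stand for already fused CNN and BOVW feature vectors.

```python
import numpy as np


def nearest_neighbor_indices(train_X, test_x, k):
    """Step 1: indices of the k training samples closest to test_x.

    Euclidean distance is an assumption of this sketch.
    """
    distances = np.linalg.norm(train_X - test_x, axis=1)
    return np.argsort(distances)[:k]


def train_one_vs_all_svm(X, y, lam=0.01, lr=0.01, epochs=300):
    """Step 2: one linear binary SVM per class present in y.

    Each SVM is fitted by subgradient descent on the regularized hinge
    loss; the optimizer and its hyperparameters are illustrative only.
    """
    classes = np.unique(y)
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias feature
    W = np.zeros((classes.size, Xb.shape[1]))
    for row, c in enumerate(classes):
        t = np.where(y == c, 1.0, -1.0)  # current class versus all others
        for _ in range(epochs):
            violated = t * (Xb @ W[row]) < 1.0  # samples inside the margin
            hinge_grad = -(t[violated][:, None] * Xb[violated]).sum(axis=0)
            hinge_grad /= Xb.shape[0]
            W[row] -= lr * (lam * W[row] + hinge_grad)
    return classes, W


def predict_local(train_X, train_y, test_x, k=6):
    """Steps 1 to 3 for a single test sample."""
    idx = nearest_neighbor_indices(train_X, test_x, k)
    local_X, local_y = train_X[idx], train_y[idx]
    # If every neighbor has the same label, no classifier is needed.
    if np.unique(local_y).size == 1:
        return local_y[0]
    classes, W = train_one_vs_all_svm(local_X, local_y)
    # Step 3: the local model labels only the sample it was trained for.
    scores = W @ np.append(test_x, 1.0)
    return classes[np.argmax(scores)]
```

Calling predict_local once per test image trains a separate classifier for each one, so the cost of prediction grows with the number of test images, while each individual classifier sees only k training samples.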