Analysis of Emotions Through Speech Using the Combination of Multiple Input Sources with Deep Convolutional and LSTM Networks

Abstract

Understanding the emotions a person expresses in speech is fundamental to better interaction between humans and machines. Many algorithms have been developed for this problem and tested on different datasets: some were recorded by actors under ideal recording conditions, while others were collected from people’s opinions on a video streaming platform. Deep learning has shown very positive results in recent years, and the model presented here follows this approach. We propose using Fourier transforms as the input of a convolutional neural network and Mel-frequency cepstral coefficients as the input of an LSTM neural network. We then concatenate the outputs of both networks and obtain a final classification over five emotions. The model is trained on the MOSEI dataset. We also perform data augmentation using time variations and pitch changes. Our model shows significant improvements over state-of-the-art algorithms.
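
As a rough illustration of the pipeline described above, the following Python sketch computes a short-time Fourier magnitude representation (the kind of input a convolutional branch could consume) and shows the fusion step, in which the outputs of two branches are concatenated and mapped to five emotion classes. It is not the implementation used in this work: the frame length, hop size, feature dimensions, and the single dense softmax layer are assumptions made for illustration, and the function names are hypothetical. The convolutional and LSTM branches themselves are not modelled here.

```python
import numpy as np

NUM_EMOTIONS = 5  # the abstract states a five-emotion classification


def magnitude_spectrogram(signal, frame_len=256, hop=128):
    """Return short-time Fourier magnitudes, shape (num_frames, frame_len // 2 + 1).

    frame_len and hop are illustrative assumptions, not values from the paper.
    """
    signal = np.asarray(signal, dtype=np.float64)
    if signal.size < frame_len:
        raise ValueError("signal is shorter than one frame")
    num_frames = 1 + (signal.size - frame_len) // hop
    window = np.hanning(frame_len)
    # Slice the signal into overlapping frames and apply the window to each.
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(num_frames)]
    )
    # rfft keeps only the non-negative frequencies of a real-valued signal.
    return np.abs(np.fft.rfft(frames, axis=1))


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def fuse_and_classify(cnn_features, lstm_features, weights, bias):
    """Concatenate two branch outputs and apply one dense softmax layer.

    The single dense layer is an assumption; the paper's fusion head may differ.
    """
    fused = np.concatenate([cnn_features, lstm_features], axis=-1)
    return softmax(fused @ weights + bias)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    spectrogram = magnitude_spectrogram(rng.standard_normal(16000))
    print("spectrogram shape:", spectrogram.shape)

    # Random placeholders standing in for the CNN and LSTM branch outputs.
    cnn_out = rng.standard_normal((1, 64))
    lstm_out = rng.standard_normal((1, 32))
    weights = rng.standard_normal((64 + 32, NUM_EMOTIONS)) * 0.1
    bias = np.zeros(NUM_EMOTIONS)
    print("class probabilities:", fuse_and_classify(cnn_out, lstm_out, weights, bias))
```

Concatenating the two feature vectors before a shared classification layer lets the final decision draw on both the spectral and the cepstral representation at once, which is the idea behind combining the two input sources.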