
Abstract

Speech emotion recognition is an indispensable requirement for efficient human-machine interaction. Most modern automatic speech emotion recognition systems use Gaussian mixture models (GMM) and support vector machines (SVM). GMMs are known for their performance and scalability in spectral modeling, while SVMs are known for their discriminative power. A GMM supervector characterizes an emotional style by the GMM parameters (mean vectors, covariance matrices, and mixture weights). GMM-supervector SVM benefits from both the GMM and SVM frameworks. In this paper, the GMM-UBM mean interval (GUMI) kernel, based on the Bhattacharyya distance, is successfully used. CfsSubsetEval combined with best-first and greedy stepwise search was also applied to the supervector space in order to select the most important features. This framework is illustrated using Mel-frequency cepstral coefficients (MFCC) and perceptual linear prediction (PLP) features on two different emotional databases, namely the Surrey Audio-Visual Expressed Emotion (SAVEE) database and the Berlin Emotional Speech Database.
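The supervector construction above can be sketched in a few lines: the means of a universal background model (UBM) are MAP-adapted to each utterance, and the normalized mean shifts are stacked into one long vector on which a linear (GUMI-style) kernel operates. The sketch below is a minimal, hypothetical numpy illustration assuming diagonal covariances and mean-only adaptation (in which case the Bhattacharyya-based normalization reduces to scaling by the inverse standard deviation); the function names and the relevance factor `r` are illustrative, not from the paper.

```python
import numpy as np

def map_adapt_means(ubm_means, ubm_covs, ubm_weights, frames, r=16.0):
    # Posterior responsibilities of each diagonal-covariance UBM component
    diff = frames[:, None, :] - ubm_means[None, :, :]          # (T, K, D)
    log_p = -0.5 * np.sum(diff**2 / ubm_covs + np.log(2 * np.pi * ubm_covs), axis=2)
    log_p += np.log(ubm_weights)
    post = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    post /= post.sum(axis=1, keepdims=True)                    # (T, K)
    n_k = post.sum(axis=0)                                     # soft counts per component
    ex_k = post.T @ frames / np.maximum(n_k[:, None], 1e-10)   # first-order statistics
    alpha = (n_k / (n_k + r))[:, None]                         # data-dependent interpolation
    # MAP adaptation: interpolate between UBM means and utterance statistics
    return alpha * ex_k + (1 - alpha) * ubm_means

def gumi_supervector(adapted_means, ubm_means, ubm_covs, ubm_weights):
    # Mean shift from the UBM, normalized per dimension (diagonal-covariance case);
    # a linear kernel on these stacked vectors gives a GUMI-style SVM kernel
    phi = (adapted_means - ubm_means) / np.sqrt(ubm_covs)
    phi *= np.sqrt(ubm_weights)[:, None]
    return phi.ravel()
```

An utterance identical in distribution to the UBM yields a near-zero supervector, so the kernel measures how far (and in which direction) each emotional style pulls the background model.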


Introduction

Speech, the natural form of communication between humans, provides a great deal of information about the speaker, the language, and the speaker's emotions. This fact has motivated researchers to find fast and efficient methods of natural interaction between man and machine. The presence of emotions makes speech more natural, and it has given rise to a relatively new research area, speech emotion recognition (SER), defined as extracting the emotional state of a speaker from his or her speech. This challenging task has several applications in day-to-day life, such as agent-customer interactions, call-center applications (Herm, 2008), web movies, on-board car driving systems (Hu et al., 2013), medical diagnostic tools, and e-tutoring systems (Trabelsi & Bouhlel, 2016a). As in any pattern recognition problem, the performance of emotion recognition from speech depends on the labeling, organization, representation, and evaluation of the training data. A significant challenge for emotion research lies in defining what an emotion is and in finding appropriate emotional labels. Three labeling methods can be distinguished: (1) the categorical approach, (2) the dimensional approach, and (3) the appraisal-based approach (Cowie, McKeown, & Douglas-Cowie, 2012; Hudlicka, 2011). In the first, emotion is described as a discrete class that differs explicitly and mutually exclusively from one emotion to another. In the second, emotion is described as a continuous process that changes dynamically over time, using a multi-dimensional emotion model. The appraisal-based approach, in turn, introduces the role of time into the understanding of emotions (Mortillaro, Meuleman, & Scherer, 2012; De Vries, 2015). A critical research challenge in speech emotion recognition systems is how to encode the spoken emotion with suitable features (Maji et al., 2015; Saba et al., 2016). This step, called feature extraction, is of great importance in SER.
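As an illustration of the feature extraction step, the standard MFCC pipeline (one of the two feature sets used in this work) can be sketched with plain numpy: pre-emphasis, framing and windowing, power spectrum, mel filterbank, log compression, and a DCT. This is a minimal sketch with commonly used parameter values (25 ms frames, 10 ms hop at 16 kHz); the exact configuration used in the paper is not specified here.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters equally spaced on the mel scale, covering 0..sr/2
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    inv = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    pts = inv(np.linspace(mel(0.0), mel(sr / 2.0), n_filters + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        lo, c, hi = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, lo:c] = (np.arange(lo, c) - lo) / max(c - lo, 1)
        fb[i - 1, c:hi] = (hi - np.arange(c, hi)) / max(hi - c, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160, n_fft=512,
         n_filters=26, n_ceps=13):
    # 1) Pre-emphasis boosts high frequencies
    x = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # 2) Overlapping Hamming-windowed frames (25 ms / 10 ms at 16 kHz)
    n_frames = 1 + (len(x) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = x[idx] * np.hamming(frame_len)
    # 3) Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 4) Log mel filterbank energies
    logmel = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # 5) DCT-II (unnormalized) decorrelates the log energies; keep first n_ceps
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    basis = np.cos(np.pi / n_filters * (n + 0.5) * k)
    return logmel @ basis.T
```

Each utterance thus becomes a (frames x 13) matrix of cepstral coefficients, the frame-level observations from which the GMM statistics are later computed.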
However, a large number of potential features increases the complexity of the system and normally results in longer training times. Therefore, a popular approach is to start with a larger set of features and then remove irrelevant data, reducing the dimensionality of the training data and yielding a more compact and robust feature set. Another important issue in the evaluation of an emotional speech system is the choice of emotional corpus. Existing emotional databases can be divided into three classes, namely: simulated (acted), elicited (induced), and spontaneous (natural) speech databases. For a more detailed description, the reader may refer to (Koolagudi & Rao, 2012).
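The feature selection mentioned in the abstract relies on Weka's CfsSubsetEval with best-first and greedy stepwise search. As a rough, hypothetical re-implementation of the underlying idea, the sketch below computes Hall's CFS merit (subsets of features highly correlated with the class but weakly correlated with each other) and searches greedily forward; the function names and the `max_feats` cap are illustrative, and Weka's actual search heuristics differ in detail.

```python
import numpy as np

def cfs_merit(X, y, subset):
    # CFS merit: k * avg(feature-class corr) / sqrt(k + k(k-1) * avg(feature-feature corr))
    k = len(subset)
    r_cf = np.mean([abs(np.corrcoef(X[:, j], y)[0, 1]) for j in subset])
    if k == 1:
        return r_cf
    r_ff = np.mean([abs(np.corrcoef(X[:, a], X[:, b])[0, 1])
                    for i, a in enumerate(subset) for b in subset[i + 1:]])
    return k * r_cf / np.sqrt(k + k * (k - 1) * r_ff)

def greedy_forward_cfs(X, y, max_feats=10):
    # Greedy stepwise search: keep adding the feature that most improves the merit
    selected, best = [], -np.inf
    while len(selected) < max_feats:
        scores = [(cfs_merit(X, y, selected + [j]), j)
                  for j in range(X.shape[1]) if j not in selected]
        merit, j = max(scores)
        if merit <= best:
            break  # no candidate improves the subset; stop
        selected.append(j)
        best = merit
    return selected
```

Applied to the supervector space, such a filter keeps only the supervector dimensions that carry discriminative information about the emotion classes, shrinking the input to the SVM.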