Development of an Isolated Digit Speech Recognition Based on Multilayer Perceptron Model

Abstract

Automatic speech recognition (ASR) has become one of the leading areas of speech
technology. Research in ASR has long emphasized man-machine communication that
promises greater ease of use than the traditional keyboard and mouse. Recognizing
speech is simple for a human, but it is a very complex process for a machine.
Various methods have been introduced to develop efficient ASR systems. The neural
network (NN) approach is one of the best known and most widely used in this field,
and the multilayer perceptron (MLP) is a popular NN model for ASR. In this study, an
MLP with the backpropagation learning algorithm is implemented to perform isolated
digit speech recognition for the Malay language. However, one of the current problems
faced by the MLP and most NN models in ASR is the long learning time. In addition,
achieving a high recognition rate with an MLP-based isolated digit recognizer is not
trivial, yet it matters because such systems are widely used in many applications.
Thus, this study focuses on improving the learning time and recognition rate of the
MLP for a Malay isolated digit speech recognition system. Three new methods are
proposed to meet this objective. The improvements are made in
the preprocessing and recognition phases. In the preprocessing phase, a new endpoint
detection method, known as the variance method, is proposed. It is introduced to
overcome the disadvantages of the conventional method, whose threshold for silence
detection is unstable and difficult to set, which leads to a poor recognition rate.
Another
contribution in the preprocessing phase concerns normalization. Three normalization
methods are introduced to normalize the speech data before it is propagated to the
NN: exponent, hybrid I and hybrid II. These are compared with four widely used
conventional normalization methods: range I, range II, simple and variance. The
conventional methods have two limitations. First, some of them, such as the variance
and range I methods, produce a good recognition rate but are very slow in the
learning phase. Second, others, such as the simple and range II methods, are very
fast in the learning phase but produce a low recognition rate. The new normalization
methods are therefore proposed to shorten the learning time while producing a high
recognition rate. In the recognition phase, a simple novel approach is introduced to
increase the recognition rate: an adaptive sigmoid function. A typical, fixed sigmoid
function is used in the learning phase, whereas in the recognition phase the slope of
the activation function is adjusted to obtain the highest recognition rate. This
study focuses on 10 Malay words, from “sifar”
to “sembilan” (“0” to “9”). All utterances were recorded from a single male speaker,
and each word was repeated 100 times, so the data set consists of 1000 utterances of
Malay words. Four hundred utterances were used in the learning phase and the
remaining 600 in the recognition phase. The TI46 standard data set
was also used to evaluate the performance of all the proposed methods; its 10 English
words, “zero” to “nine” (“0” to “9”), are used throughout this study. Eight male and
female speakers uttered each word 8 times. The data set totals 1600 utterances across
the male and female speakers. The male and female data are trained separately: 400
male utterances are used in the learning phase and 400 are kept as test data, and the
same split is applied to the female data. Linear Predictive
Coding (LPC) is used as the feature extraction method to represent the speech data.
The experimental results show that the proposed endpoint detection method (the
variance method) produced promising results in terms of learning time and recognition
rate. The proposed normalization methods showed excellent results in all experiments,
and the adaptive sigmoid function increased the recognition rate in most of the
experiments. Overall, the highest recognition rate for the Malay data set is 99.83%
with a convergence time of 82 s. For the TI46 female and male data sets, the
convergence times are 55 s and 111 s, with recognition rates of 96.75% and 94.75%,
respectively.
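
The adaptive sigmoid idea described above can be illustrated with a minimal sketch.
The abstract does not state how the slope is adjusted, so everything here is an
assumption made for illustration: the logistic form 1 / (1 + exp(-slope * x)), a
one-hidden-layer MLP, a small grid of candidate slopes, and the hypothetical helper
names sigmoid, forward, recognition_rate and pick_slope. The weights are assumed to
have been trained already with the fixed slope of 1.0.

```python
import math


def sigmoid(x, slope=1.0):
    """Logistic sigmoid with an adjustable slope; slope=1.0 is the fixed form."""
    return 1.0 / (1.0 + math.exp(-slope * x))


def forward(features, w_hidden, w_output, slope=1.0):
    """Forward pass of a one-hidden-layer MLP.

    Each weight row holds one weight per input followed by a bias term.
    """
    hidden = [
        sigmoid(sum(w * x for w, x in zip(row, features)) + row[-1], slope)
        for row in w_hidden
    ]
    return [
        sigmoid(sum(w * h for w, h in zip(row, hidden)) + row[-1], slope)
        for row in w_output
    ]


def recognition_rate(samples, w_hidden, w_output, slope):
    """Fraction of (features, label) pairs whose largest output matches the label."""
    correct = 0
    for features, label in samples:
        outputs = forward(features, w_hidden, w_output, slope)
        if outputs.index(max(outputs)) == label:
            correct += 1
    return correct / len(samples)


def pick_slope(samples, w_hidden, w_output, candidates=(0.5, 1.0, 2.0, 4.0)):
    """Return the candidate slope with the highest recognition rate.

    The candidate grid is an illustrative assumption, not taken from the study.
    """
    return max(
        candidates,
        key=lambda s: recognition_rate(samples, w_hidden, w_output, s),
    )
```

In this sketch the weights stay fixed and only the slope changes at recognition time.
Choosing the slope on a held-out set rather than on the final test utterances is a
further assumption, made so that the reported rate is not biased by the selection.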