
Voice Recognition by a Realistic Model of Biological Neural Networks

By Efrat Barak*

Supervised by Karina Odinaev and Igal Raichelgauz

*visl.technion.ac.il/~efrat_b

Abstract

In this project, a new model for a voice recognition system is suggested. The model is based on a realistic model of neural networks, and it integrates principles from the theories of chaotic systems and the Liquid State Machine.

The model was implemented in MATLAB, and several tests were performed on it. The task of the system in those tests was to recognize the voice of a specific person (a voice that it was trained on) out of hundreds of other voices.

The Problem

The objective of this project was to design a system that can classify voices, i.e., recognize the voice of a specific person.

However, the problem of voice classification is too wide to be solved by finite state machines, since it is obviously impossible to create a state for every word that every person in the world might say. Even if it were possible, it would be impossible to save recordings of every word in the voice of every person in order to perform bit-by-bit comparison.

The Solution

The task of voice recognition is highly suitable for neural networks, since such networks can work as classifiers and distinguish the voice that they learn to identify from other voices. Such networks can learn the characteristics of the voice, and therefore using them does not require endless recordings of words.

A new model for a voice recognition system which is based on neural networks is suggested in this project.

Our approach for voice recognition integrates concepts from the theories of chaotic neural networks and Liquid State Machines. The main principle of the proposed model is that the input signal is recognized according to the current state of the model, which is a limit cycle in which the output of the network is periodic and uniquely defines that state.

We defined this state as a basin.

The model is presented in Figure 1.

Figure 1: The model of the proposed system for voice recognition.

The input signal, which represents the auditory stimulus, consists of several parallel spike trains which are transmitted simultaneously to the input neurons of the neural network (see section 4.1 for more information on the creation of the stimulus). The input neurons transmit an internal signal to the neural network, which consists of a few hundred neurons in a three-dimensional structure. The network is pushed (by the input signal) to a basin. The readout function receives the spike trains of all the neurons in the network and recognizes the basin that the network has converged to by comparing it with output patterns of basins that appear in the indicators map. It then classifies the input according to the indicator that belongs to that basin. The output signal determines the class of voices to which the input signal belongs.

The neural network consists of 135 spiking neurons in a 3x3x15 formation. The behavior of the neurons is simulated by the Leaky Integrate and Fire (LIF) model, and the neurons are connected by dynamic spiking synapses. Twenty percent of the neurons in the network are randomly chosen to be inhibitory, and the rest of the neurons are excitatory, in correspondence with the biological values. The connectivity of the network is moderate (the connectivity parameter equals 2).
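To illustrate the neuron model, the following MATLAB sketch simulates a single LIF neuron driven by a constant current. It is only a minimal illustration of the leaky integrate-and-fire dynamics; the parameter values are generic textbook choices and are not the CSIM settings that were used in the project.

    % Minimal leaky integrate-and-fire neuron (illustrative parameters only,
    % not the CSIM configuration used in this project).
    dt      = 1e-4;      % simulation time step [s]
    T       = 0.5;       % simulated duration [s]
    tau_m   = 0.03;      % membrane time constant [s]
    R       = 1e6;       % membrane resistance [Ohm]
    V_rest  = 0;         % resting potential [V]
    V_th    = 0.015;     % firing threshold [V]
    V_reset = 0;         % potential after a spike [V]
    I       = 20e-9;     % constant input current [A]

    n      = round(T / dt);
    V      = V_rest;
    spikes = zeros(1, n);
    for k = 1:n
        % The membrane potential leaks toward rest and integrates the input.
        V = V + dt * (-(V - V_rest) + R * I) / tau_m;
        if V >= V_th                 % threshold crossing: emit a spike and reset
            spikes(k) = 1;
            V = V_reset;
        end
    end
    fprintf('%d spikes in %.1f s\n', sum(spikes), T);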

An important advantage of the proposed model is that several tasks can be performed with only one network at the same time: the readout function can be trained to recognize several people by finding the current basin of the network and comparing it to several indicators maps (one for each person).

Tools

MATLAB 6.5 was used for the development of the voice recognition system and a GUI that enables full control of it. The tests were performed with two databases of recorded speech: the first one was recorded in the SIPL laboratory of the Electrical Engineering faculty of the Technion (IIT), and the second one was taken from the NIST database.

The neural network that was studied in this project was created in a new simulator for neural microcircuits, CSIM, in the MATLAB environment. Full details of the CSIM simulator can be found at http://www.lsm.tugraz.at/csim/.

Two methods were used for encoding the recorded speech signal into spike trains (simplified sketches of both encodings follow this list):



• Amplitude Encoding: In this method, a straightforward conversion is performed between the amplitude at time t and the number of neurons that would fire at that time.



• MFCC Encoding: In this method the auditory signal is represented by Mel Frequency Cepstral Coefficients (MFCCs), which are coefficients based on human perception. The auditory signal is divided into small segments, and each of them is transformed (by FFT) to the frequency domain. The frequency bands are positioned logarithmically on the mel scale, which is a scale of pitches that were judged by listeners to be equally distant from each other.

The Classification Process

The classification process consists of three main parts:

1. Training: In this part the system is trained on different auditory stimuli: some of them contain the voice that the system should learn to identify, and others contain voices of other people. Simulations of the neural network are performed for every voice segment. The system learns the basins that the network converges to in every simulation and creates an indicators map: the indicators are numbers that are related to each basin, and they indicate how well this basin represents the wanted voice.

2. Tuning: In this part the simulations are performed on another database (which includes voice segments of the wanted person and of other people), and the user tunes the classification parameters so that the indicators map best suits the person that the system should identify.

3. Testing: In this part a new stimulus is presented to the neural network. The system finds the basin that the network converged to and makes a classification decision based on the indicator of that basin. The output is an answer as to whether the stimulus is the voice of the wanted person or of another person (a simplified sketch of this indicator-based decision is given after this list).
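The sketch below illustrates the role of the indicators map in the training and testing stages. It assumes, purely for illustration, that each simulation yields a single basin index and that the indicator of a basin is the fraction of training segments converging to it that belong to the wanted voice; the actual definition of the indicators and the way the readout function identifies basins are described in the full report.

    % --- Training: build an indicators map from labelled basins (toy data) ---
    basin_id  = [3 3 1 7 3 7 1 1 3 7];   % basin reached for each training segment
    is_wanted = [1 1 0 0 1 0 0 0 1 1];   % 1 = segment belongs to the wanted voice

    n_basins   = max(basin_id);
    indicators = zeros(1, n_basins);
    for b = 1:n_basins
        idx = find(basin_id == b);
        if ~isempty(idx)
            % Indicator: how well this basin represents the wanted voice.
            indicators(b) = mean(is_wanted(idx));
        end
    end

    % --- Testing: classify a new stimulus by the indicator of its basin ------
    threshold = 0.5;                      % tuned on the tuning data set
    new_basin = 3;                        % basin the network converged to
    is_wanted_voice = indicators(new_basin) >= threshold;   % 1 = wanted person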

The different stages of the voice recognition process are depicted in Figure 2.

Figure 2: A flow chart of the classification process.

Results

Several terms were defined for the evaluation of the classification results (a sketch of how the corresponding rates are computed follows these definitions):



• Hit Segments – Voice segments of the wanted person that were classified correctly.



• Miss-Hit Segments – Voice segments of the person that the system was trained to identify, which were classified as voices of other people.



• False Alarm Segments – Voice segments of different people (not the one that the system was trained to identify), which were classified as the voice of the wanted person.
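The corresponding rates can be computed from the per-segment decisions as in the sketch below. The decision values here are random stand-ins; hit and miss-hit rates are taken relative to the number of wanted segments, and the false-alarm rate relative to the number of unwanted segments, matching the percentages reported in the tables that follow.

    % Hit, miss-hit and false-alarm rates from per-segment decisions (toy data).
    truth    = [ones(1, 100), zeros(1, 400)];              % 1 = wanted voice segment
    decision = [rand(1, 100) < 0.9, rand(1, 400) < 0.5];   % 1 = classified as wanted

    n_wanted   = sum(truth == 1);
    n_unwanted = sum(truth == 0);
    hit_rate         = sum(decision == 1 & truth == 1) / n_wanted;    % hits
    miss_hit_rate    = sum(decision == 0 & truth == 1) / n_wanted;    % miss-hits
    false_alarm_rate = sum(decision == 1 & truth == 0) / n_unwanted;  % false alarms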

Table 3 presents the database that was used for the tests.

                             Num. of Voice Segments
                             Wanted Voice    Other Voices
    Data Set 1 (Training)         30             300
    Data Set 2 (Tuning)           30              30
    Data Set 3 (Test I)          100             400
    Data Set 4 (Test II)          38              40

Table 3: The database that was used for the tests.

Results for Amplitude-Encoded Input

The stimuli that were created by amplitude encoding were examined, and no significant difference was found between the stimuli of the wanted person and those of other people. The task of the system was therefore to identify the voice of the wanted person, not to identify a certain type of stimulus.

Table 4 presents the results of two classification tests that were performed on two different (parallel) systems. In these tests, data set 1 was used for training the system, data set 2 was used for tuning it, and data set 3 was used for testing it.

    Segments    Classified as            Test Num. 1    Test Num. 2    True Classification
    1–100       Wanted (Hit)                 71%            94%              100%
                Unwanted (Miss-Hit)          29%             6%                0%
    101–492     Wanted (False Alarm)        55.9%          61.23%              0%
                Unwanted                    41.1%          38.77%            100%

Table 4: The results of classification tests number 1 and 2.

The results that are presented in Table 4 show that both of the systems identified most of the segments of the wanted voice. Obviously, the indicators map of the network that was used in test 2 was much better than that of the first network: 94% of the wanted voice segments were identified in the second test, while only 71% of them were identified in the first test. This shows that the internal structure of the neural network (which is generated at random) significantly influences its classification ability.

The difference (in percentage points) between the segments that were classified correctly (hits) and the false-alarm segments was 15% in the first test and 34% in the second test. These large differences show that the classification of any voice segment as a wanted segment is not a random process. The reason for the high false-alarm rate is that the system was designed to find most of the voice segments of the wanted voice, even at the cost of many other segments being classified as wanted.

Results for MFCC-Encoded Input

An examination of the stimuli that were generated in the MFCC method revealed that the stimuli of the wanted voice were quite similar to each other but were very different from the stimuli of the other voices. In this manner, the classification task became a task of distinguishing between two types of stimuli.

Table 5 presents the results of two classification tests that were performed on the same system. Data set 1 was used for training the system, data set 2 was used for tuning it, and data sets 3 and 4 were used for testing it.

    True Classification    Classified as            Test I                    Test II
                                                    (100 wanted,              (30 wanted,
                                                     400 unwanted segments)    30 unwanted segments)
    Wanted                 Wanted (Hit)                 87%                       86.8%
    Wanted                 Unwanted (Miss-Hit)          13%                       13.2%
    Unwanted               Wanted (False Alarm)         55.3%                     45%
    Unwanted               Unwanted                     44.7%                     55%

Table 5: The results of the classification tests.

The results detailed in Table 5 indicate that the system is quite reliable in classifying new data: most of the segments of the wanted voice and about half of the segments of the unwanted voices were classified correctly. Evidence for the consistency of the system comes from the fact that the hit rate was almost identical in the two classification tests, even though they consisted of significantly different numbers of voice segments and there was no overlap between them.

Conclusion

A new method for performing voice recognition by a realistic model of biological neural networks was presented and implemented in this project. Several systems were configured and trained by the presented method. They were tested on two types of stimuli: one that was created by amplitude encoding of recorded speech, and another that was created by the MFCC method. The amplitude encoding method was found to be efficient, while the MFCC method yielded stimuli which were very typical for each person.

Tests that were performed on stimuli of both kinds showed that the systems were efficient in identifying the voice that they were trained to find: the systems found and classified correctly most of the voice segments of the 'wanted person', even when they were given a stimulus that contained many more voice segments of other people. The conclusion is that such systems can handle a very high level of noise. Other tests have shown that the systems were consistent and stable in their classification performance.

Altogether, the tests that were carried out in this project proved that our model for a neural-network-based voice recognition system is highly suited for performing voice classification tasks.