Speech Recognition Using Context Independent Word Modeling

Abstract

Building a speech recognition system for an Indian language like Tamil is a challenging task. Speech in the Tamil language has unique inherent features such as long and short vowels, lack of aspirated stops and aspirated consonants, and many instances of allophones. Pronunciation of words and sentences is strictly governed by a set of rules. Like other Indian languages, Tamil is syllabic in nature. Stress and accent vary in the spoken language from region to region. However, in read Tamil speech, stress and accents are ignored. The objective of this paper is to build a small vocabulary, context independent, word based continuous speech recognizer for the Tamil language.

In this experiment, a word based context independent acoustic model, a dictionary and a trigram based statistical language model have been built for a small vocabulary of 341 unique words as integral components of the linguist, which is the heart of the speech recognizer. The entire vocabulary was drawn from a particular domain. The recognizer gives reasonable word accuracy for test sentences read by trained and new speakers in limited vocabulary, domain specific tasks. The results are encouraging, and this recognizer is simple, robust and more accuracy oriented since it deals with the word as the basic acoustic unit.

Introduction

Automatic Speech Recognition (ASR) deals with the automatic conversion of an acoustic signal into a text transcription of the speech utterances. Even after years of extensive research and development, accuracy in ASR remains a challenge to researchers. There are a number of well known factors which determine accuracy. The prominent factors include variations in context, speakers and noise in the environment. Therefore research in automatic speech recognition has many open issues with respect to small or large vocabulary, isolated or continuous speech, speaker dependent or independent operation, and environmental robustness.

Basically, the problem of speech recognition can be stated as follows. Given an acoustic observation X = X1X2…Xn, the goal is to find the corresponding word sequence W = w1w2…wm that has the maximum posterior probability P(W|X), expressed using Bayes' theorem as shown in equation (1).

$$\hat{W} = \arg\max_{W} P(W \mid X) = \arg\max_{W} \frac{P(W)\,P(X \mid W)}{P(X)} \qquad (1)$$

where P(W) is the probability of the word sequence W being uttered and P(X|W) is the probability of the acoustic observation X when W is uttered. P(X|W) is also known as the class conditioned probability distribution. P(X) is the average probability that the observation X will occur. It is also called the normalization factor. Since the maximization in equation (1) is done with the variable X fixed, to find the word sequence W it is enough to maximize the numerator alone.

$$\hat{W} = \arg\max_{W} P(W)\,P(X \mid W) \qquad (2)$$
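As an illustration of the decision rule, the following minimal sketch picks the candidate with the largest product P(W) P(X|W), working in the log domain. The candidate labels and all probability values are hypothetical and chosen only for the example; they do not come from the experiment.

```python
import math

def decode(candidates, log_prior, log_likelihood):
    """Return the candidate W that maximizes log P(W) + log P(X|W)."""
    return max(candidates, key=lambda w: log_prior[w] + log_likelihood[w])

# Hypothetical scores for two candidate word sequences (illustrative only).
log_prior = {"w_a": math.log(0.7), "w_b": math.log(0.3)}         # P(W)
log_likelihood = {"w_a": math.log(0.01), "w_b": math.log(0.05)}  # P(X|W)

best = decode(["w_a", "w_b"], log_prior, log_likelihood)
print(best)
```

P(X) never appears in the code because it is the same for every candidate and so cannot change which candidate wins.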

The first term in equation (2), P(W), is computed with the help of a language model. It describes the probability associated with a hypothesized sequence of words. The language model incorporates both the syntactic and semantic constraints of the language and the recognition task. Generally the language model may take the form of a formal parser, a syntax analyzer, an N-gram model or a hybrid model. Refer to [4] for more details. In this experiment, a statistical trigram language model has been built using Carnegie Mellon University's (CMU) statistical language modeling toolkit.
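To make the idea of a trigram model concrete, here is a minimal sketch that estimates P(w3 | w1, w2) by plain relative frequency. It is not the algorithm used by the CMU toolkit, which applies discounting and back-off; the function names and the toy corpus are assumptions made for illustration.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word histories over padded sentences."""
    tri, bi = Counter(), Counter()
    for sent in sentences:
        words = ["<s>", "<s>"] + sent.split() + ["</s>"]
        for i in range(2, len(words)):
            tri[tuple(words[i - 2:i + 1])] += 1
            bi[tuple(words[i - 2:i])] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3):
    """Unsmoothed maximum likelihood estimate of P(w3 | w1, w2)."""
    history = bi[(w1, w2)]
    return tri[(w1, w2, w3)] / history if history else 0.0

# Toy corpus with placeholder tokens (hypothetical data).
tri, bi = train_trigram(["a b c", "a b d", "a b c"])
p = trigram_prob(tri, bi, "a", "b", "c")
print(p)
```

An unsmoothed estimate assigns zero probability to any unseen trigram, which is why practical toolkits add smoothing.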

The second term, P(X|W), is supplied by the acoustic model. Hidden Markov Models (HMMs) have become the common structure of acoustic models because they can normalize the time variation of the speech signal and characterize it statistically, thus helping to parameterize the class conditioned probabilities. The acoustic model thus forms the core knowledge base representing the various parameters of speech in the optimal sense. At present, all state-of-the-art commercial and most laboratory speech recognition systems are based on HMMs, which give a very low Word Error Rate (WER) when tested on standard speech databases. For a detailed study refer to [12] and [3].

The Choice of Sub-word Units

The speech recognition process requires segmentation of the speech waveform into fundamental acoustic units. The phone is the preferred basic unit. Other units may be the word or the syllable. The relative merits and demerits of the different acoustic units are presented here.

Variation in context is an important issue in speech recognition. Phones are short in duration and show high variation. Phones are realized differently depending on their context. Some phones are aspirated when they occur at the beginning of a word, and the same phones are not aspirated when they occur at the end of a word. Therefore the acoustic variability of basic phonetic units due to context is sufficiently large and not well understood in many languages. Hence the entire word may be treated as the basic acoustic unit. Word units have a well defined acoustic representation. Acoustic variability occurs mainly at the beginning and end of a word, i.e. at word boundaries. Another major advantage is that no pronunciation dictionary is needed. But word based speech models have some disadvantages. The first lies in obtaining reliable whole word models from a reasonable training set. Second, for a large vocabulary the phonetic content of individual words overlaps, leading to redundancy in storing and comparing whole word patterns.

In tasks like speech driven automatic phone dialing, the digits (0 to 9) along with a few other words form the vocabulary. The acoustic model of such systems can be trained using a Context Dependent (CD) word model with a reasonable amount of training data. Context dependency means finding the likelihood of a given word (or acoustic unit) with respect to its left and right units. However, when the size of the vocabulary increases moderately, e.g. to around 500 words, CD word modeling becomes infeasible because the number of possible left and right context words grows very rapidly. It also demands a large training set, which is impractical. In such situations Context Independent (CI) word modeling simplifies the training process. CI models parameterize individual units by ignoring their contexts. The motivation for choosing the word as the acoustic unit in this paper is that small vocabulary, domain specific recognition systems can be easily realized using CI word modeling.
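A rough count shows how quickly context dependent word models multiply. The sketch below assumes, as an upper bound of our own choosing, that every word may occur with any left and any right neighbour, so a vocabulary of V words yields V × V × V context dependent models against V context independent ones.

```python
def model_counts(vocab_size):
    """Return (CI model count, upper bound on CD word model count)."""
    ci = vocab_size
    cd_upper_bound = vocab_size ** 3  # any left word x word x any right word
    return ci, cd_upper_bound

for v in (12, 341, 500):
    ci, cd = model_counts(v)
    print(f"V={v}: CI models={ci}, CD word models (upper bound)={cd}")
```

In practice only the contexts seen in the training data would be modelled, but even a small fraction of this bound needs far more speech than a small in-house corpus provides.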

When dealing with large vocabulary recognition tasks, it is more practical to train acoustic models at the phonetic level. However, at the phonetic level, detection of word boundaries in continuous speech becomes very difficult. Therefore in English, large vocabulary continuous speech recognition (LVCSR) systems have used the CD phone, or triphone, as the fundamental acoustic unit. Triphone models are powerful sub-word models because they account for the left and right phonetic contexts. Since there are only about 50 phones in English, they can be sufficiently trained with a reasonable amount of training data. Moreover, phones are vocabulary independent. Therefore one can train on one set of data and test the model on another set [8]. Triphones have been enormously successful in the acoustic modeling of LVCSR systems.

The Tamil Language

Tamil is a Dravidian language spoken predominantly in the state of Tamilnadu in India and in Sri Lanka. It is the official language of the Indian state of Tamilnadu and also has official status in Sri Lanka and Singapore. With more than 77 million speakers, Tamil is one of the widely spoken languages of the world.

Tamil alphabet

Some of the phonological features which are of interest to speech recognition research are discussed in this section. Tamil vowels are classified into short and long vowels (five of each type) and two diphthongs. Consonants are classified into three categories with six in each category: hard, soft (a.k.a. nasal), and medium. The classification is based on the place of articulation. In total there are 18 consonants. The vowels and consonants combine to form 216 compound characters. The compound characters are formed by placing dependent vowel markers on one side or both sides of the consonant. There is one more special letter, aytham (ஃ), used in classical Tamil and rarely found in modern Tamil. Summing up, there are 247 letters in the standard Tamil alphabet. In addition to the standard characters, six characters taken from the Grantha script are used in modern Tamil to represent sounds not native to Tamil, that is, in words borrowed from Sanskrit and other languages. Even though Tamil is characterized by its use of retroflex consonants, similar to the other Dravidian languages, it also uses a unique liquid zh (ழ்). Extensive research has been reported on the articulation of liquid consonants in Tamil. See [11] for more details.
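The letter counts quoted above can be cross-checked with a few lines of arithmetic. The variable names are ours; the numbers are the ones stated in the text.

```python
short_vowels, long_vowels, diphthongs = 5, 5, 2
vowels = short_vowels + long_vowels + diphthongs   # 12 vowels
consonants = 3 * 6                                 # three categories of six
compound = vowels * consonants                     # one per vowel-consonant pair
aytham = 1                                         # the special letter

total_letters = vowels + consonants + compound + aytham
print(vowels, consonants, compound, total_letters)
```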

Pronunciation in Tamil

Tamil has its own unique letter to sound rules. The number of consonant clusters is very restricted. Tamil has neither aspirated nor voiced stops as distinct sounds. Unlike most other Indian languages, Tamil does not have aspirated consonants. In addition, the voicing of stop consonants is governed by strict rules. Stop consonants are voiceless if they occur word-initially or doubled. The Tamil script does not have distinct letters for voiced and voiceless stop consonants, although both are present in the spoken language as allophones.

Generally, languages structure the utterance of words by giving greater prominence to some constituents than to others. This is true in the case of English: one or more phones stand out as more prominent than the rest. This is typically described as word stress. The same is true for higher level prosody in a sentence, where one or more constituents may bear stress or accent. As far as the Tamil language is concerned, it is assumed that there is no stress or accent at word level and that all syllables are pronounced with the same emphasis. However, there are other opinions that the position of stress in the word is by no means fixed to any syllable of an individual word. In connected speech, the stress is found more often on the initial syllable. A detailed study of pronunciation in Tamil can be found in [5] and [2]. In our experiment, stress on syllables is ignored because we are dealing with read speech.

Building CI Model for Tamil words

Building continuous speech recognizers for the Tamil language is a challenging task. This is due to the fact that Indian languages like Tamil differ from English in several aspects pertaining to orthography and phonology, pronunciation and word stress, as described in section (3.2). As a first step towards building an LVCSR system for the Tamil language, in this paper the authors have attempted to build a small vocabulary continuous speech recognizer using a CI word model based on HMMs.

The important modules in speech recognition are the acoustic model, the dictionary and the language model. A statistical trigram language model was built using the CMU Statistical Language Modeling toolkit. The language model was trained on a text corpus of 341 unique words. Since a word model is being built, the dictionary component is created by mapping every word in the vocabulary to itself.

Since speech databases are not available for Tamil, a speech corpus was created in-house. The corpus contains 12 hours of continuous read speech from 6 males and 5 females for training, and 7.5 hours of speech from 75 males and 75 females for testing. The recording was carried out in a noise free lab environment. Finally, sentence level transcriptions were done manually.

The HMM based acoustic model trainer from Carnegie Mellon University, SphinxTrain, has been employed. The input file format and the details of front-end processing are summarized in Table 1.

Table 1. Front-end Processing Details

Parameter            Value
Input File Format    Wav (Microsoft) file
Sampling Rate        8,000 Hz
Depth                16 bits
Mono/Stereo          Mono
Window Length        0.025625 s
No. of FFT Points    512
No. of Filters       31
Min. Frequency       200 Hz
Max. Frequency       3500 Hz
No. of Cepstra       13
Output               Mel Frequency Cepstral Coefficients
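The front-end settings can be sanity-checked with some simple arithmetic. The sketch below derives the window length in samples, the zero padding needed to reach the FFT size, and the FFT bin spacing from the values in Table 1; the constant and variable names are ours.

```python
SAMPLE_RATE_HZ = 8000
WINDOW_LENGTH_S = 0.025625
FFT_POINTS = 512
MIN_FREQ_HZ, MAX_FREQ_HZ = 200, 3500

# Number of samples covered by one analysis window.
window_samples = round(SAMPLE_RATE_HZ * WINDOW_LENGTH_S)

# The window is shorter than the FFT, so the remainder is zero padded.
zero_padding = FFT_POINTS - window_samples

# Highest representable frequency and spacing between FFT bins.
nyquist_hz = SAMPLE_RATE_HZ / 2
bin_width_hz = SAMPLE_RATE_HZ / FFT_POINTS

print(window_samples, zero_padding, nyquist_hz, bin_width_hz)
```

The filter bank range of 200 Hz to 3500 Hz sits inside the 4000 Hz Nyquist limit imposed by the 8 kHz sampling rate.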

The files used to create and train the acoustic model, with sample data, are as follows.

A set of feature files computed from the audio training data, one for every utterance in the training corpus. Each utterance can be transformed into a sequence of feature vectors of Mel Frequency Cepstral Coefficients (MFCC) using a front-end executable provided with SphinxTrain. Sample entries are listed below

S011F.mfc

S031M.mfc

A control file containing the list of file names of the feature sets. Examples of the entries of this file are

S011F

S031M

A transcript file in which the transcripts corresponding to the feature files are listed in exactly the same order as the feature file names in the control file. Sample entries are shown in figure 1.

Figure 1. Sample entries in transcript file

A main dictionary which has all acoustic events and words in the transcripts mapped onto the acoustic units we want to train. Here each word is mapped to the word itself since this is word based training. Examples of the entries in this file are shown in figure 2.

Figure 2. Sample entries in dictionary

A filler dictionary, which usually lists the non-speech events as "words" and maps them to user defined phones. This dictionary must at least have the entries

<s> SIL

<sil> SIL

</s> SIL

The entries stand for

<s> : beginning-utterance silence

<sil> : within-utterance silence

</s> : end-utterance silence

A phonelist, which is a list of all the acoustic units for which models are to be trained. Examples are shown in figure 3.
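The relationship between the transcripts, the main dictionary, the filler dictionary and the phonelist in word based training can be sketched as follows. This is a minimal illustration, not the script used in the experiment: the function name is ours, the transcript words are placeholders, and the exact file layout SphinxTrain expects is not reproduced here.

```python
def build_training_lists(transcripts):
    """Derive dictionary, phonelist and filler entries from transcript lines."""
    vocab = sorted({word for line in transcripts for word in line.split()})
    # Word based training: every word is its own acoustic unit.
    dictionary = [f"{word} {word}" for word in vocab]
    # The phonelist holds every unit to be trained, including silence.
    phonelist = sorted(vocab + ["SIL"])
    # Minimum filler entries named in the text.
    filler = ["<s> SIL", "<sil> SIL", "</s> SIL"]
    return dictionary, phonelist, filler

# Placeholder transcripts (hypothetical words, not from the corpus).
transcripts = ["WORDA WORDB", "WORDB WORDC"]
dictionary, phonelist, filler = build_training_lists(transcripts)
print(dictionary)
print(phonelist)
```

Because each word maps to itself, the phonelist is simply the vocabulary plus the silence unit, so it grows linearly with the vocabulary.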

Figure 3. Sample entries in phonelist

An HMM with 3 emitting states and one non-emitting state, with continuous Gaussian densities, has been used. The HMM topology is shown in figure 4.

Figure 4. HMM and its topology
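Since the figure is not reproduced here, the following sketch shows one common way to lay out the transitions of an HMM with 3 emitting states and one non-emitting exit state. It assumes a strict left-to-right topology with self loops and no skip transitions, and the probability 0.5 is only a nominal starting value; both are our assumptions rather than details confirmed by the text.

```python
def left_to_right_transitions(n_emitting=3, self_loop=0.5):
    """Build a transition matrix whose last state is a non-emitting exit."""
    n_states = n_emitting + 1
    matrix = [[0.0] * n_states for _ in range(n_states)]
    for i in range(n_emitting):
        matrix[i][i] = self_loop            # stay in the same state
        matrix[i][i + 1] = 1.0 - self_loop  # advance to the next state
    return matrix

transitions = left_to_right_transitions()
for row in transitions:
    print(row)
```

The last row is all zeros because the non-emitting state only marks the exit from the word model and has no outgoing transitions within it.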

The details of the training parameters are summarized in Table 2.

With the speech corpus and the above files as input, training was done as follows using SphinxTrain

Initial training of the full models using 15 iterations per step. This step involves generation of monophone seed models with nominal values.

Using the aligned transcripts from step (2) to train new models, with the convergence ratio set to 0.02. This resulted in around 5-7 iterations per step.

After the training is over, SphinxTrain generates the parameter files of the HMM, viz. the probability distributions and transition matrices.
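The effect of a convergence ratio on the number of training iterations can be sketched as below. The stopping rule (relative change in log likelihood between successive iterations) is our assumption about how such a ratio is typically applied, and the likelihood values are invented for the example; neither is taken from SphinxTrain output.

```python
def iterations_until_converged(log_likelihoods, ratio=0.02, max_iters=15):
    """Return the iteration at which the relative improvement drops below ratio."""
    prev = None
    for i, ll in enumerate(log_likelihoods[:max_iters], start=1):
        if prev is not None and abs((ll - prev) / prev) < ratio:
            return i
        prev = ll
    return min(len(log_likelihoods), max_iters)

# Hypothetical per-iteration log likelihoods of the training data.
history = [-1000.0, -800.0, -700.0, -650.0, -640.0, -635.0]
stopped_at = iterations_until_converged(history)
print(stopped_at)
```

With a fixed cap of 15 iterations and no ratio check, every step would run to the cap, which is the difference between the two training passes described above.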

Table 2. Training Parameters

Parameter               Value
Type of Training        Context Independent (Continuous Density)
Input Features          Mel Frequency Cepstral Coefficients
Feature Type            Cepstra, Delta and Double Delta
Dimensions              13
No. of States in HMMs   3 and one non-emitting node
No. of Gaussians        1
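To show what a cepstra, delta and double delta feature looks like, the sketch below appends first and second differences to 13-dimensional cepstral frames. The two-frame difference used here is a simplification chosen for illustration and is not necessarily the window the Sphinx front end uses; the resulting 39 components per frame follow from taking 13 for each of the three streams.

```python
def add_deltas(frames):
    """Append delta and double delta streams to each cepstral frame."""
    def diff(seq):
        out = []
        for t in range(len(seq)):
            prev = seq[max(t - 1, 0)]             # clamp at the first frame
            nxt = seq[min(t + 1, len(seq) - 1)]   # clamp at the last frame
            out.append([b - a for a, b in zip(prev, nxt)])
        return out

    delta = diff(frames)
    double_delta = diff(delta)
    return [c + d + dd for c, d, dd in zip(frames, delta, double_delta)]

# Five synthetic frames of 13 cepstra each (placeholder values).
frames = [[float(t)] * 13 for t in range(5)]
features = add_deltas(frames)
print(len(features), len(features[0]))
```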

Implementation

The CI word model based Tamil speech recognizer is implemented on Sphinx-4, which is a state-of-the-art HMM based speech recognition system. It has been developed as open source since February 2002. Sphinx-4 is the successor of Sphinx-3 and Sphinx-2, designed jointly by Carnegie Mellon University, Sun Microsystems Laboratories and Mitsubishi Electric Research Laboratories, USA. It is implemented in the Java programming language, which makes it portable across a growing number of computational platforms [10].

The Sphinx-4 framework

The Sphinx-4 framework has been designed with a high degree of flexibility and modularity. Figure 5 shows the overall architecture of the system. Each labeled element in the figure represents a module that can be easily replaced, allowing researchers to experiment with different module implementations without needing to modify other parts of the system. There are three primary modules in the Sphinx-4 framework: the Front-End, the Decoder, and the Linguist. The Linguist comprises one or more acoustic models, a dictionary and a language model. Depending upon the linguist, different modules can be plugged into the system. This is done through the Configuration Manager module.

Decoding continuous Tamil speech using the Sphinx-4 decoder

The language model, dictionary and acoustic model developed in section (4) were deployed on the Sphinx-4 decoder. Sphinx-4 was configured to operate in CI mode with the following components

Linguist: Flat Linguist

Dictionary: Full Dictionary

Search Manager: Simple Breadth First Search Manager

Flat linguist

This is a simple form of linguist. A flat linguist takes a grammar graph and generates a search graph for the grammar. The following assumptions are made

Zero or one word per grammar node

No fan-in allowed

Only unit, HMM state and pronunciation states are allowed

Only valid transitions are allowed

No tree organization of units

Full dictionary

This component creates a dictionary by reading a Sphinx-3 format dictionary. In our experiment, each line in the dictionary specifies the word, followed by a space or tab, followed by its pronunciation. In our case, the pronunciation is the word itself since we are dealing with CI word models. The full dictionary reads all the words and their pronunciations at startup. Therefore, it is suitable for small vocabulary tasks.
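The dictionary line format described above (word, then a space or tab, then the pronunciation) can be read with a few lines of code. This sketch only illustrates the format; it is not the Sphinx-4 dictionary loader, and the function name and sample entries are hypothetical.

```python
def load_dictionary(lines):
    """Map each word to its pronunciation, given 'WORD<space or tab>UNITS' lines."""
    entries = {}
    for line in lines:
        parts = line.strip().split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        pronunciation = parts[1].split() if len(parts) > 1 else []
        entries[word] = pronunciation
    return entries

# Placeholder entries: in word based modeling each word maps to itself.
sample = ["WORDA\tWORDA", "WORDB WORDB", ""]
entries = load_dictionary(sample)
print(entries)
```

Loading every entry eagerly like this keeps lookups trivial, at the cost of memory that only stays small when the vocabulary does.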

Simple breadth first search manager

With the acoustic features and the linguist as input, this module performs a simple breadth first search on the search graph rendered by the flat linguist.
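The idea of a breadth first, frame by frame search can be illustrated on a toy graph: at every frame all active paths are extended by one transition, and only the best scoring path into each state is kept. The sketch below is not the Sphinx-4 search manager; it omits pruning and the language model, and the state names and scores are invented for the example.

```python
import math

def breadth_first_decode(frame_scores, successors, start, final):
    """Return (log score, state path) of the best path ending in `final`."""
    active = {start: (frame_scores[0][start], [start])}
    for scores in frame_scores[1:]:
        expanded = {}
        # Extend every active path by one transition before moving on.
        for state, (score, path) in active.items():
            for nxt, log_trans in successors[state]:
                candidate = score + log_trans + scores[nxt]
                if nxt not in expanded or candidate > expanded[nxt][0]:
                    expanded[nxt] = (candidate, path + [nxt])
        active = expanded
    return active.get(final)

half = math.log(0.5)
successors = {
    "s1": [("s1", half), ("s2", half)],
    "s2": [("s2", half), ("s3", half)],
    "s3": [("s3", half)],
}
# Hypothetical per-frame acoustic scores for each state.
raw = [(0.9, 0.05, 0.05), (0.8, 0.1, 0.1), (0.1, 0.8, 0.1), (0.1, 0.1, 0.8)]
frame_scores = [dict(zip(("s1", "s2", "s3"), map(math.log, f))) for f in raw]

best_score, best_path = breadth_first_decode(frame_scores, successors, "s1", "s3")
print(best_path)
```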

Results

The hypothesis word sequences from the decoder are aligned with the reference sentences. The result is reported in terms of WER and word accuracy. Word errors are categorized into insertions, substitutions and deletions. Other performance measures are speed and memory footprint.
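Aligning a hypothesis with its reference and counting the three error types is usually done with a minimum edit distance computation. The sketch below is one such alignment, written for illustration; it is not the scoring tool used in the experiment, and when several optimal alignments exist the split between error types may differ from that of another tool.

```python
def align_counts(reference, hypothesis):
    """Return (substitutions, insertions, deletions) of a minimum-error alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    # cost[i][j] = (errors, subs, ins, dels) for ref[:i] against hyp[:j]
    cost = [[None] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    cost[0][0] = (0, 0, 0, 0)
    for i in range(1, len(ref) + 1):
        cost[i][0] = (i, 0, 0, i)   # all reference words deleted
    for j in range(1, len(hyp) + 1):
        cost[0][j] = (j, 0, j, 0)   # all hypothesis words inserted
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            e, s, n, d = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                match = (e, s, n, d)
            else:
                match = (e + 1, s + 1, n, d)
            e, s, n, d = cost[i][j - 1]
            insertion = (e + 1, s, n + 1, d)
            e, s, n, d = cost[i - 1][j]
            deletion = (e + 1, s, n, d + 1)
            cost[i][j] = min(match, insertion, deletion)
    _, subs, ins, dels = cost[-1][-1]
    return subs, ins, dels

# Placeholder word sequences.
counts = align_counts("a b c d", "a x d")
print(counts)
```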

The system was tested in batch mode with three trials. First, a test set of 13 utterances of the trained sentences with trained voices was applied. The results are tabulated in Table 3.

Second, a test set comprising 50 test utterances from trained voices was applied. The results are tabulated in Table 4.

Table 4. Results for CI model with trained voice

Details      Values
Words        387
Errors       254 (Sub: 35, Ins: 2, Del: 217)
Accuracy     34.9 %
Sentences    50
Time         Audio: 86.36 s, Processing: 232.79 s
Speed        2.70 × real time
Memory       Average: 21.93 MB, Max: 27.69 MB

Finally, a test set comprising 50 test utterances from new voices was applied. The results are tabulated in Table 5.

Table 5. Results for CI model with new voice

Details      Values
Words        341
Errors       297 (Sub: 39, Ins: 1, Del: 257)
Accuracy     13.2 %
Sentences    50
Time         Audio: 80.54 s, Processing: 198.26 s
Speed        2.46 × real time
Memory       Average: 22.14 MB, Max: 27.65 MB
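The figures in Tables 4 and 5 can be cross-checked from the raw counts. The sketch below assumes that word accuracy counts only substitutions and deletions against the reference words, an assumption that reproduces the reported accuracies; the WER values it prints are derived from the tabulated error counts and are not stated in the tables themselves.

```python
def summarize(words, subs, ins, dels, audio_s, processing_s):
    """Return (errors, WER %, accuracy %, real-time factor)."""
    errors = subs + ins + dels
    wer = round(100.0 * errors / words, 1)
    accuracy = round(100.0 * (words - subs - dels) / words, 1)
    speed = round(processing_s / audio_s, 2)
    return errors, wer, accuracy, speed

trained = summarize(387, 35, 2, 217, 86.36, 232.79)    # Table 4
new_voice = summarize(341, 39, 1, 257, 80.54, 198.26)  # Table 5
print(trained)
print(new_voice)
```

In both tables deletions account for well over four fifths of the errors (217 of 254 and 257 of 297).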

Discussion and Conclusion

The accuracy of the system is better for trained voices than for untrained voices. In addition, the accuracy for utterances of trained sentences with trained voices is very high. In scenarios where the vocabulary is limited, repeatability is high and the number of speakers is limited, this recognizer is highly suitable.

The word error rate is dominated by deletion errors. This is due to the small training set. The recognition process is also slow. This is due to the word level comparisons in the search graph. But the system works reasonably well for small vocabulary, domain dependent tasks. The recognition accuracy for words and sentences will improve further if the sentences are kept short.

For medium and large vocabularies, a triphone based approach is a must. CD phone (triphone) and syllable based modeling for the Tamil language are in progress. These approaches are expected to give good results. There are inherent features in the pronunciation of Tamil which could be exploited in acoustic modeling. It is believed that larger sub-word units like the syllable could improve system performance. Many attempts have been made for the English language, but they initially showed an increase in WER, as reported in [1], [7] and [6]. In English, pronunciation variation is high and syllabification is fuzzy. Even with the increase in WER, the syllable remains a primary focus of research in speech recognition. Tamil, on the contrary, has well defined syllabification and sandhi rules which could help in syllable modeling, which will in turn increase the recognition rates.