How Speech Recognition Works

Overview

This article will give you a technical overview of speech recognition so you can understand how it works, and better understand some of the capabilities and limitations of the technology.

Speech recognition fundamentally functions as a pipeline that converts PCM (Pulse Code Modulation) digital audio from a sound card into recognized speech. The elements of the pipeline are:

Transform the PCM digital audio into a better acoustic representation.

Apply a "grammar" so the speech recognizer knows what phonemes to expect. A grammar could be anything from a context-free grammar to full-blown Language.

Figure out which phonemes are spoken.

Convert the phonemes into words.

Transform the PCM digital audio

The first element of the pipeline converts digital audio coming from the sound card into a format that's more representative of what a person hears.

The wave format can vary. In other words, it may be 16 KHz 8 bit Mono/Stereo or 8 KHz 16 bit mono, and so forth. It's a wavy line that periodically repeats while the user is speaking. When in this form, the data isn't useful to speech recognition because it's too difficult to identify any patterns that correlate to what was actually said.

To make pattern recognition easier, the PCM digital audio is transformed into the "frequency domain." Transformations are done using a windowed Fast-Fourier Transform (FFT). The output is similar to what a spectrograph produces. In frequency domain, you can identify the frequency components of a sound. From the frequency components, it's possible to approximate how the human ear perceives the sound.

The FFT analyzes every 1/100th of a second and converts the audio data into the frequency domain. Each 1/100th of second's results are a graph of the amplitudes of frequency components, describing the sound heard for that 1/100th of a second. The speech recognizer has a database of several thousand such graphs (called a codebook) that identify different types of sounds the human voice can make. The sound is "identified" by matching it to its closest entry in the codebook, producing a number that describes the sound. This number is called the "feature number." (Actually, there are several feature numbers generated for every 1/100th of a second, but the process is easier to explain assuming only one.)

The input to the speech recognizer began as a stream of 16,000 PCM values per second. By using Fast-Fourier Transforms and the codebook, it is boiled down into essential information, producing 100 feature numbers per second.

Figure Out Which Phonemes Are Spoken

Start by grouping. To make the recognition process easier to understand, you first should know how the recognizer determines what phonemes were spoken and then understand the grammars.

Every time a user speaks a word, it sounds different. Users do not produce exactly the same sound for the same phoneme.

The background noise from the microphone and user's office sometimes causes the recognizer to hear a different vector than it would have if the user were in a quiet room with a high-quality microphone.

The sound of a phoneme changes depending on what phonemes surround it. The "t" in "talk" sounds different than the "t" in "attack" and "mist."

The sound produced by a phoneme changes from the beginning to the end of the phoneme, and is not constant. The beginning of a "t" will produce different feature numbers than the end of a "t."

The background noise and variability problems are solved by allowing a feature number to be used by more than just one phoneme, and using statistical models to figure out which phoneme is spoken. This can be done because a phoneme lasts for a relatively long time, 50 to 100 feature numbers, and it's likely that one or more sounds are predominant during that time. Hence, it's possible to predict what phoneme was spoken.

Actually, the approximation is a bit more complex than this. I'll explain by starting at the origin of the process. For the speech recognition to learn how a phoneme sounds, a training tool is passed hundreds of recordings of the phoneme. It analyzes each 1/100th of a second of these hundreds of recordings and produces a feature number. From these, it learns statistics about how likely it is for a particular feature number to appear in a specific phoneme. Hence, for the phoneme "h," there might be a 55% chance of feature #52 appearing in any 1/100th of a second, 30% chance of feature #189 showing up, and 15% chance of feature #53. Every 1/100th of a second of an "f" sound might have a 10% chance of feature #52, 10% chance of feature #189, and 80% chance of feature #53.

The probability analysis done during training is used during recognition. The six feature numbers that are heard during recognition might be:

52, 52, 189, 53, 52, 52

The recognizer computes the probability of the sound being an "h" and the probability of it being any other phoneme, such as "f." The probability of "h" is:

80% * 80% * 30% * 15% * 80% * 80% = 1.84%

The probability of the sound being an "f" is:

10% * 10% * 10% * 80% * 10% * 10 % = 0.0008%

You can see that, given the current data, "h" is a more likely candidate. (For those of you that are mathematical sticklers, you'll notice that the "probabilities" are no longer probabilities because they don't sum to one. From now on, I'll call them "scores" because they're un-normalized probabilities.)

The speech recognizer needs to know when one phoneme ends and the next begins. Speech recognition engines use a mathematical technique called "Hidden Markov Models" (HMMs) that figure this out. This article won't get into the mathematics of how HMMs work, but here's an intuitive feel. Assume that the recognizer heard a word with an "h" phoneme followed by an "ee" phoneme. The "ee" phoneme has a 75% chance of producing feature #82 every 1/100 th of a second, 15% of chance feature #98, and a 10% chance of feature #52. Notice that feature #52 also appears in "h." If you saw a lineup of the data, it might look like this:

52, 52, 189, 53, 52, 52, 82, 52, 82, etc.

So, where does the "h" end and the "ee" begin? From looking at the features, you can see that the 52s are grouped at the beginning, and the 82s grouped at the end. The split occurs someplace in between. Humans can eyeball this. Computers use Hidden Markov Models.

By the way, the speech recognizer figures out when speech starts and stops because it has a "silence" phoneme, and each feature number has a probability of appearing in silence, just like any other phoneme.

Now, the recognizer can recognize what phoneme was spoken if there's background noise or the user's voice had some variation. However, there's another problem. The sound of phonemes changes depending upon what phoneme came before and after. You can hear this with words such as "he" and "how". You don't speak a "h" followed by an "ee" or "ow," but the vowels intrude into the "h," so the "h" in "he" has a bit of "ee" in it, and the "h" in "how" as a bit of "ow" in it.

Speech recognition engines solve the problem by creating "tri-phones," which are phonemes in the context of surrounding phonemes. Thus, there's a tri-phone for "silence-h-ee" and one for "silence-h-ow." Because there are roughly 50 phonemes in English, you can calculate that there are 50*50*50 = 125,000 tri-phones. That's just too many for current PCs to deal with so similar sounding tri-phones are grouped together.

And now, for the last issue. The sound of a phoneme is not constant. A "t" sound is silent at first, and then produces a sudden burst high frequency of noise, which then fades to silence. Speech recognizers solve this by splitting each phoneme into several segments and generating a different segment for each segment. The recognizer figures out where each segment begins and ends in the same way it figures out where a phoneme begins and ends.

After all this work, the speech recognizer has all the mechanics necessary to recognize if a particular phoneme was spoken. An important question still needs answering: How does it determine which phoneme to look for?

A speech recognizer works by hypothesizing a number of different "states" at once. Each state contains a phoneme with a history of previous phonemes. The hypothesized state with the highest score is used as the final recognition result.

When the speech recognizer starts listening, it has one hypothesized state. It assumes the user isn't speaking and that the recognizer is hearing the "silence" phoneme. Every 1/100th of a second, it hypothesizes that the user has started speaking, and adds a new state per phoneme, creating 50 new states, each with a score associated with it. After the first 1/100th of a second, the recognizer has 51 hypothesized states.

In 1/100th of a second, another feature number comes in. The scores of the existing states are recalculated with the new feature. Then, each phoneme has a chance of transitioning to yet another phoneme, so 51 * 50 = 2550 new states are created. The score of each state is the score of the first 1/100th of a second times the score if the 2nd 1/100th of a second. After 2/100ths of a second, the recognizer has 2601 hypothesized states.

This same process is repeated every 1/100th of a second. The score of each new hypothesis is the score of its parent hypothesis times the score derived from the new 1/100th of a second. In the end, the hypothesis with the best score is what's used as the recognition result.

Of course, a few optimizations are introduced.

If the score of a hypothesis is much lower than the highest score, the hypothesis is dropped. This is called pruning. The optimization is intuitively obvious. If the recognizer is millions of times more confident that it heard "h eh l oe" than "z z z z," there's not much point in continuing the hypothesis that the recognizer heard "z z z z." However, if too much is pruned, errors can be introduced because the recognizer might make a mistake about which phoneme was spoken.

Recognizers also optimize by not hypothesizing a transition to a new phoneme ever 1/100th of a second. To do this though, the recognizer must limit what phonemes can follow other phonemes.

Reducing Computation and Increasing Accuracy

The speech recognizer now can identify what phonemes were spoken. Figuring out what words were spoken should be an easy task. If the user spoke the phonemes "h eh l oe", you know they spoke "hello." The recognizer should only have to do a comparison of all the phonemes against a lexicon of pronunciations.

It's not that simple.

The user might have pronounced "hello" as "h uh l oe," which might not be in the lexicon.

The recognizer may have made a mistake and recognized "hello" as "h uh l oe."

Where does one word end and another begin?

Even with all these optimizations, the speech recognition still requires too much CPU usage.

To reduce computation and increase accuracy, the recognizer restricts acceptable inputs from the user. On the whole, this isn't a bad assumption because:

It's unlikely that the user will speak "zwickangagang" because it's not a valid word.

The user may limit him/herself to a relatively small grammar. There are millions of words, but most people only use a few thousand of them a day, and they may need even fewer words to communicate with a computer.

When people speak, they have a specific grammar that they use. After all, users say, "Open the window," not "Window the open."

Certain word sequences are more common than others. "New York City" is more common than "New York Aardvark."

How Speech Recognition Works

Context Free Grammar

One of the techniques to reduce the computation and increase accuracy is called a "Context Free Grammar" (CFG). CFGs work by limiting the vocabulary and syntax structure of speech recognition to only those words and sentences that are applicable to the application's current state.

The application specifies the vocabulary and syntax structure in a text file. The text file might look like this:

Of course, the grammars can be much more complex than this. The important feature about the CFG is that it limits what the recognizer expects to hear to a small vocabulary and tight syntax. After the user speaks "Send," the recognizer will only listen to "mail." This significantly reduces the number of generated hypotheses.

The speech recognition gets the phonemes for each word by looking the word up in a lexicon. If the word isn't in the lexicon, it predicts the pronunciation; See the "How Text-to-Speech Works" document for an explanation of pronunciation prediction. Some words have more than one pronunciation, such as "read" which can be pronounced like "reed" or "red." The recognizer basically treats one word with multiple pronunciations the same as two "words."

CFGs slightly change the hypothesis portion of speech recognition. Rather than hypothesizing the transition to all phonemes, the recognizer merely hypothesizes the transition to the next acceptable phonemes. From the initial "silence" phoneme, the recognizer hypothesizes the "s" in send, "k" in "call," and "eh" in exit. If the recognizer hypothesizes phoneme transitions from the "s" phoneme, it will only hypothesis "eh," followed by "n," "d," "m," "ae," "l," and so on.

You can see how this significantly reduces the computation. Instead of increasing the number of hypotheses by a factor of 50 each time, the number of hypotheses stay constant within a word, and only increase a little bit on word transitions. Given a normal amount of pruning, there are no more than about 10 hypotheses around at a time.

When the user has finished speaking, the recognizer returns the hypothesis with the highest score, and the words that the user spoke are returned to the application.

Speech recognition using a CFG requires a 486/33 to run real-time.

Discrete Dictation

Context Free Grammars don't allow users to speak arbitrary text. If a user wants to dictate text into a document, the speech recognizer must work differently and consume a bit more CPU power. I'll explain how a "discrete dictation" speech recognition engine works internally.

With discrete dictation, a user can speak any word he/she wants out of a vocabulary, usually 60,000 words. However, the user must leave quarter second pauses between words, making it easy for the recognizer to tell where words begin and end. Discrete dictation systems require a Pentium 60. If the user doesn't leave pauses, it's called "continuous dictation." I'll cover that next.

The simple approach to discrete dictation is to create a Context Free Grammar that's just a list of words. This will work, but it will have a few problems:

It will be slower than necessary.

It will be inaccurate, especially for words that sound similar to each other.

It will usually get homographs, such as "to," "too," and "two," wrong.

In the end, using just a CFG will produce less than 80% word accuracy.

The speed and accuracy can be improved by knowing the grammar of the language, and how likely word combinations are. There's little point in hypothesizing phrases such as "New York Aardvark."

To learn how words adjoin one another, the recognizer adds another step to producing the engine's recognition database. In addition to using recordings of the phonemes, the training tools analyze a huge amount of raw text data, ideally 1,000,000,000 words or so, about 8 gigabytes of text. The text is segmented into words. For example, the sentence:

"At 8:00 he will walk down the street and then."

Is converted to:

at eight o'clock he will walk down the street period

Sequences of three words are then counted, producing a file that essentially indicates how common any sequence of three words is:

<beginSent> at eight 1
at eight o'clock 1
down the street 1
he will walk 1
eight o'clock he 1
o'clock he will 1
street period <endSent> 1
the street period 1
walk down the 1
will walk down 1

The real database will be generated from over a billion words of text, so it will be much more complete than what you see above. A short segment of the real table might look like this:

New York City 553
New York Sid 1
New York State 483
... other entries

The term "New York City" appeared 553 times, while "New York Sid" only occurred once. From this, the recognizer knows that if the user has spoken "New York" it's much more likely they're saying "city" than "Sid."

Now, let me expand on the simple approach of the Context Free Grammar.

The statistics can be converted to probabilities, and these probabilities can be incorporated into each of the hypotheses generated during recognition. The score of a hypothesis is equal to the acoustic score of the hypothesis times the probability that the word will appear in the given context. This is called a language model.

Example: The score of the phrase "New York City" is the acoustic score of "New York City" times 483, while the score of the phrase "New York Sid" is the acoustic score of "New York Sid" times 1. Another way of putting this is that the recognizer will only claim to have heard "Sid" if it is much more confident that it's heard the word "Sid" (following "New York") than "City."

Sometimes a word combination was never seen in the original database. Perhaps none of the text ever mentions "New York Aardvark," but the user might still speak it. The recognizer always leaves a small but finite chance that any word will follow any other word. It just has to be very sure that it heard the acoustics of the word correctly.

The language model will reduce the error rate by a factor of four, taking accuracy from 80% to 95%. Here's why:

If two or more words sound similar to the recognizer, it ends up choosing the word that's most likely given the text training. This disambiguates "City" and "Sid." Usually, this is the right thing to do, so it increases the accuracy. Of course, it hurts on the rare occasions that the user speaks "New York Sid."

It disambiguates homographs. Because the words "to," "too," and "two" all have the same expected pronunciation, the language model is the sole determiner of which one gets recognized. Most homographs can be disambiguated by the previous two words.

Recognition is sped up because unlikely word combinations are quickly discarded.

Dictation applications frequently allow the user to correct mistakes by showing the top 10 alternatives to the user and allowing him/her to click on one. The list is constructed from the top 10 losing hypotheses for the word.

Continuous Dictation

Continuous dictation allows the user to speak anything he/she wants out of a large vocabulary. This is more difficult than discrete dictation because the speech recognition engine doesn't easily know where one word ends and the next begins. For example: Speak out loud "recognize speech" and "wreck a nice beach" quickly; they both sound similar.

Continuous dictation works similarly to discrete dictation except the end of a word is not detected by silence. Rather, when a hypothesis reaches the end of a word in continuous dictation, it then produces thousands of new hypotheses and prunes those out. The language model probability helps to prune the hypothesis down a lot more in continuous dictation.

Recognizers use a lot more optimizations to optimize processing and memory in continuous dictation systems. The article won't over those here because their description doesn't help explain the underlying technology.

Adaptation

Speech recognition system "adapt" to the user's voice, vocabulary, and speaking style to improve accuracy. A system that has had time enough to adapt to an individual can have one fourth the error rate of a speaker-independent system. Adaptation works because the speech recognition is often informed (directly or indirectly) by the user if it's recognition was correct, and if not, what the correct recognition is.

The recognizer can adapt to the speaker's voice and variations of phoneme pronunciations in a number of ways. First, it can gradually adapt the codebook vectors used to calculate the acoustic feature number. Second, it can adapt the probability that a feature number will appear in a phoneme. Both of these are done by weighted averaging.

The language model also can be adapted in a number of ways. The recognizer can learn new words, and slowly increase probabilities of word sequences so that commonly used word sequences are expected. Both these techniques are useful for learning names.

Although not common, the speech recognizer can adapt word pronunciations in its lexicon. Each word in a lexicon typically has one pronunciation. The word "orange" might be pronounced like "or-anj." However, users will sometimes speak "ornj" or "or-enj." The recognizer can algorithmically generate hypothetical alternative pronunciations for a word. It then listens for all of these pronunciations during standard recognition, "or-anj," "or-enj," "or-inj," and "ornj." During the process of recognition, one of these pronunciations will be heard, although there's a fair chance that the recognizer heard a different pronunciation than what the user spoke. However, after the user has spoken the word a number of times, the recognizer will have enough examples that it can determine what pronunciation the user spoke.

Conclusion

This was a high-level overview of how speech recognition works. It's not nearly enough detail to actually write a speech recognizer, but it exposes the basic concepts. Most speech recognition engines work in a similar manner, although not all of them work this way. If you want more detail, you should purchase one of the numerous technical books on speech recognition.

Live Event Date: April 21, 2015 @ 1:00 p.m. ET / 10:00 a.m. PT
For mid-sized companies, the challenge of serving people and providing better service is forcing a re-evaluation of their contact centers to improve customer experiences. While these growing enterprises are driving much of the market's innovation, they must remain nimble in how they operate, and be very responsive to customer needs. This upcoming eSeminar reveals five ways mid-size organizations can gain needed agility and examines how delivering …