Google Learning Speech Recognition for Voice Search from MTV?

How might a voice search engine learn new words that have entered popular speech, such as “da shiznet,” and understand different pronunciations of the same word, such as those that arise from regional differences?

A method for generating a speech recognition model includes accessing a baseline speech recognition model, obtaining information related to recent language usage from search queries, and modifying the speech recognition model to revise probabilities of a portion of a sound occurrence based on the information. The portion of a sound may include a word.

Also, a method for generating a speech recognition model, includes receiving at a search engine from a remote device an audio recording and a transcript that substantially represents at least a portion of the audio recording, synchronizing the transcript with the audio recording, extracting one or more letters from the transcript and extracting the associated pronunciation of the one or more letters from the audio recording, and generating a dictionary entry in a pronunciation dictionary.
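To make the second method a little more concrete, here is a minimal Python sketch of the last step: turning a synchronized transcript into pronunciation dictionary entries. It assumes the hard part (aligning the transcript with the audio) has already been done and is handed to us as timestamps. The function name, the tuple layout, and the ARPAbet-style phoneme labels are all my own assumptions for illustration; the patent filing doesn't spell out an implementation like this.

```python
from collections import defaultdict


def build_pronunciations(word_spans, phone_spans):
    """Map each transcript word to the phoneme sequences heard in its time span.

    word_spans:  (word, start_sec, end_sec) tuples from the aligned transcript.
    phone_spans: (phoneme, start_sec, end_sec) tuples from the audio.
    Both inputs are hypothetical formats assumed for this sketch.
    """
    lexicon = defaultdict(set)
    for word, w_start, w_end in word_spans:
        # A phoneme belongs to a word if its midpoint falls inside the word's span.
        phones = tuple(
            phone
            for phone, p_start, p_end in phone_spans
            if w_start <= (p_start + p_end) / 2 < w_end
        )
        if phones:
            lexicon[word.lower()].add(phones)
    return dict(lexicon)


# Made-up alignment for a newscaster saying "da shiznet".
word_spans = [("da", 0.00, 0.20), ("shiznet", 0.20, 0.80)]
phone_spans = [
    ("D", 0.00, 0.08), ("AH", 0.08, 0.20),
    ("SH", 0.20, 0.32), ("IH", 0.32, 0.42), ("Z", 0.42, 0.52),
    ("N", 0.52, 0.60), ("EH", 0.60, 0.70), ("T", 0.70, 0.80),
]
lexicon = build_pronunciations(word_spans, phone_spans)
```

Because each word maps to a set of phoneme sequences, a second recording with a regional pronunciation of the same word would simply add another entry instead of overwriting the first.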

The patent does provide a lot of detail on how such a language model might be built and trained. The most interesting part of it, to me, is how it might look to sources like television newscasts and transcripts from them, to learn new words, define how those words sound, and understand text related to those words.

It also discusses predictive sound searches. These resemble the predictive searches you see in the Google toolbar, where a drop-down offers suggested queries based on the letters, symbols, and spaces typed into the search box, except that they listen to portions of sound instead.
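One way to picture a predictive sound search is as prefix matching over phonemes instead of letters. The Python sketch below is only my guess at the general idea, not something taken from the patent: the `suggest` function, the index layout, the phoneme labels, and the popularity counts are all invented for illustration.

```python
def suggest(partial_phones, index, limit=3):
    """Return up to `limit` queries whose pronunciation starts with the sounds heard so far.

    index maps a query string to (phoneme_tuple, popularity_count).
    This layout is an assumption made for the sketch.
    """
    prefix = tuple(partial_phones)
    matches = [
        (count, query)
        for query, (phones, count) in index.items()
        if phones[:len(prefix)] == prefix
    ]
    # Most popular first; ties broken alphabetically so the order is stable.
    matches.sort(key=lambda m: (-m[0], m[1]))
    return [query for _, query in matches[:limit]]


# Made-up index of past queries with invented popularity counts.
index = {
    "weather": (("W", "EH", "DH", "ER"), 900),
    "web mail": (("W", "EH", "B", "M", "EY", "L"), 650),
    "wedding dresses": (
        ("W", "EH", "D", "IH", "NG", "D", "R", "EH", "S", "IH", "Z"), 400),
    "winter": (("W", "IH", "N", "T", "ER"), 300),
}

early = suggest(["W", "EH"], index)
later = suggest(["W", "EH", "D"], index)
```

With only the first two sounds heard, three queries are still in play; one more sound narrows the list to "wedding dresses", much as typing one more letter narrows a toolbar drop-down.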

The patent application really does use the term “Da Shiznet” (sic) as an example:

Current speech recognition systems, however, do not translate the spoken words with complete accuracy. Sometimes the systems will translate a spoken word into text that does not correspond to the spoken word. This problem is especially apparent when the spoken word is a word that is not in a language model accessed by the speech recognition system.

The system receives the new spoken word, but incorrectly translates the word because the new spoken word does not have a corresponding textual definition in the language model. For example, the words “da shiznet” expresses a popular way, in current language, to describe something that is “the best.”

Language models, however, may not include this phrase, and the system may attempt to translate the phrase based on current words in the language model. This results in incorrect translation of the phrase “da shiznet” into other words, such as “dashes net.”
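A toy example helps show why the “dashes net” mistake happens and how recent search queries could fix it. The Python sketch below uses a simple word-count model with add-one smoothing, which is my own stand-in and almost certainly far cruder than anything in the patent. It also ignores the acoustic side entirely and only compares how likely two candidate transcriptions are as text. The class name, the counts, and the sample queries are all made up.

```python
import math
from collections import Counter


class UnigramModel:
    """Toy word-frequency language model with additive smoothing (an assumed stand-in)."""

    def __init__(self, counts, alpha=1.0):
        self.counts = Counter(counts)
        self.alpha = alpha

    def update(self, queries):
        """Fold the words from recent search queries into the counts."""
        for query in queries:
            self.counts.update(query.lower().split())

    def log_prob(self, phrase):
        total = sum(self.counts.values())
        vocab = len(self.counts) + 1  # one extra slot reserves mass for unseen words
        denom = total + self.alpha * vocab
        return sum(
            math.log((self.counts[word] + self.alpha) / denom)
            for word in phrase.lower().split()
        )

    def best(self, candidates):
        """Pick the candidate transcription the model finds most likely."""
        return max(candidates, key=self.log_prob)


# Baseline model that has never seen "da" or "shiznet" (invented counts).
model = UnigramModel({"dashes": 50, "net": 200, "the": 1000, "best": 300})
candidates = ["dashes net", "da shiznet"]

before = model.best(candidates)
model.update(["da shiznet lyrics"] * 500)  # hypothetical burst of recent queries
after = model.best(candidates)
```

Before the update the model prefers "dashes net", because both of those words are in its vocabulary. After it sees the phrase in a few hundred queries, "da shiznet" wins.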

Now, I can’t recall my local anchorman using terms like “da shiznit” on a regular basis. If Google is learning new words and pronunciations by listening to television and following along with the broadcast by looking at transcripts from those shows, I’m wondering if MTV is on their daily viewing schedule.