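[日本語] This video was shot in Tokyo. The conversation is in English, but explanatory captions have been inserted at key points. Microsoft Research Asia in Beijing is applying speech recognition technology to develop a way to search for keywords contained in digital video files. In the demo, you can watch researcher Frank run a keyword search over 2,500 hours of video. Because this search technology finds the parts of a video that match a keyword together with their position on the timeline, you can quickly locate the parts of a long video that contain the phrase you care about. For details on speech technology, see http://research.microsoft.com/speech/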

[English] This video was shot by the Channel 9 Japan team in Tokyo, Japan. The conversation is in English. In this video, we talk about the video search technology developed by the Speech group at Microsoft Research Asia. Frank demonstrates the power of the video search technology on 2,500 hours of video files. For more information about speech technology, please refer to the following URL: http://research.microsoft.com/speech/

It is very exciting to see speech recognition technology used in an application. The search was faster than I was expecting. Does it use a cache system, like a web-based search engine would? What was the memory size of all the video?

You mentioned a speech recognizer. What is the process for compiling a speech recognizer?

Can Channel 9 feature more videos like this one, on applications using speech recognition technology? I can only imagine what other things are being worked on. Answering machines, automated appliances operating on voice commands? I could turn the TV on without having to use a remote control, and navigate to my desired channel just by speaking the channel number out loud. (Or, as seen in your video, just by telling the TV the subject matter I want to search for.) Does this mean I’ll eventually be able to talk to my car as though I'm talking to a person (granted the AI is there)? For those times I lock my keys in the car, I would say “George” (because George is the name I would give my car). “George, please unlock my door!” How about navigating my computer? I would like to talk to my computer rather than type all the time. How about an application for gaming? Xbox 360? It would be great to have team-based game play where you’re able to audibly give commands to your AI teammates via headset and mic. Some of the menu displays for that stuff get so cluttered and in the way. I'm qualified to assist in developing a computer game that uses speech recognition. =) Pushing the buttons on my automatic teller machine is so out of date; the buttons are always so dirty too. “I would like to deposit 10 dollars, please,” then I would place my eye up to the scanner rather than type in my PIN. So much faster. No more having to carry a whole wallet full of cards. Do you know how uncomfortable it is to sit on a pile of magnetic cards each day at the office? Will I be able to order a pizza without the shop having to employ a person to stand by the phone and wait for me to place my order? This technology will be great for phones.

My favorite will be the international language translator: a device carried by a person that translates the words someone speaks into any language the user specifies. It would do wonders for Americans vacationing in Paris. The Tablet PC would be a nice transport device for this software. With everyone having a Tablet PC to carry around, the world powers would have no excuse for not understanding one another, and they would then all get along. I'm a very optimistic person.

It is a very exciting time for this technology. Thank you for sharing the video. I hope to see more like it and have my questions answered.

I've just asked Frank Seide to reply to your questions about the video-search technology, if possible. FYI, the demo system stored Channel 9 videos, so Frank said "I will search my video by the system" when I shot the video in Tokyo, Japan.

Anyway, I think this video-search technology is very interesting to many people. There are also many studies on speech recognition technology under way at Microsoft Research Asia.

First, yes, we use an index during the search, like a web search engine. The actual search only takes a second. The long delay in the video was because we sent the search result from the server to the TV client in a rather inefficient way (as researchers we often resort to rapid prototyping techniques which are fast to implement but not always fast in execution).

It is impossible to perform all the signal and speech processing for 2500 hours of audio in such a short time. It takes many days to do that on a server farm with tens of computers.

The memory footprint of the searcher is very small. A real web video search engine would include so much video that the index would be too big to fit into memory. Thus the search function reads all index data directly from disk during searching.
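To make the index idea concrete, here is a minimal sketch of a time-stamped keyword index that is read from disk at search time. It is only an illustration, not the demo system's actual design: the function names, the JSON-lines file layout, and the `(video_id, start_seconds)` hit format are all assumptions, and the small word-to-offset table is kept in memory here for brevity.

```python
import json
import os
import tempfile
from collections import defaultdict


def build_index(transcripts):
    """Map each word to its hits. transcripts: (video_id, [(word, start_seconds), ...])."""
    postings = defaultdict(list)
    for video_id, words in transcripts:
        for word, start in words:
            postings[word.lower()].append((video_id, start))
    return postings


def write_index(postings, path):
    """Write one JSON posting list per line and return a word -> byte offset table."""
    offsets = {}
    with open(path, "wb") as f:
        for word in sorted(postings):
            offsets[word] = f.tell()
            f.write((json.dumps(postings[word]) + "\n").encode("utf-8"))
    return offsets


def search(path, offsets, keyword):
    """Seek to the keyword's posting list and read only that line from disk."""
    offset = offsets.get(keyword.lower())
    if offset is None:
        return []
    with open(path, "rb") as f:
        f.seek(offset)
        return [tuple(hit) for hit in json.loads(f.readline())]


# Hypothetical recognizer output: words with the time at which they were spoken.
transcripts = [
    ("video_a", [("speech", 12.5), ("recognition", 13.0)]),
    ("video_b", [("search", 3.0), ("speech", 47.2)]),
]

index_path = os.path.join(tempfile.mkdtemp(), "index.jsonl")
offsets = write_index(build_index(transcripts), index_path)
print(search(index_path, offsets, "Speech"))
# prints [('video_a', 12.5), ('video_b', 47.2)]
```

The expensive part (recognizing the audio) happens once, when the index is built; a query then costs one seek and one short read, regardless of how many hours of video were indexed.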

A speech recognizer is a complicated system consisting of signal processing, machine learning, and fast pattern matching algorithms. In a nutshell, we use machine learning techniques to learn from millions of examples what each sound ("phoneme") of a language sounds like ("acoustic-phonetic model"), we add a dictionary that lists for each word of the language how it is made up of these phonemes, we also include some form of grammar ("language model"), and at recognition time a complicated process ("Viterbi algorithm") matches incoming audio against these models and outputs the most likely match.
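For readers who want to see the matching step in code, below is a toy sketch of the Viterbi algorithm on a tiny hidden Markov model. Everything in the model is made up for illustration: the three phoneme states spelling "cat", the discrete frame labels "K", "A", "T", and all probabilities are assumptions. A real recognizer works on continuous acoustic features and combines the dictionary and language model into a far larger search.

```python
import math


def to_log(p):
    """Log-probability, with log(0) represented as negative infinity."""
    return math.log(p) if p > 0 else float("-inf")


def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (log-probability, state path) of the most likely path."""
    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for s in states:
            prev = max(states, key=lambda p: best[-1][p] + log_trans[p][s])
            scores[s] = best[-1][prev] + log_trans[prev][s] + log_emit[s][obs]
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state to recover the full path.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[-1][last], path


# Toy left-to-right model for the word "cat" (all numbers are invented).
states = ["k", "ae", "t"]
start = {"k": 1.0, "ae": 0.0, "t": 0.0}
trans = {
    "k": {"k": 0.5, "ae": 0.5, "t": 0.0},
    "ae": {"k": 0.0, "ae": 0.5, "t": 0.5},
    "t": {"k": 0.0, "ae": 0.0, "t": 1.0},
}
emit = {
    "k": {"K": 0.8, "A": 0.1, "T": 0.1},
    "ae": {"K": 0.1, "A": 0.8, "T": 0.1},
    "t": {"K": 0.1, "A": 0.1, "T": 0.8},
}

log_start = {s: to_log(p) for s, p in start.items()}
log_trans = {s: {n: to_log(p) for n, p in row.items()} for s, row in trans.items()}
log_emit = {s: {o: to_log(p) for o, p in row.items()} for s, row in emit.items()}

log_prob, path = viterbi(["K", "K", "A", "T"], states, log_start, log_trans, log_emit)
print(path)
# prints ['k', 'k', 'ae', 't']
```

Working in log-probabilities avoids numeric underflow on long audio, since products of many small probabilities become sums.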

If you are a programmer and interested in trying to use speech recognition in programs, please download the Microsoft SAPI SDK. If you are interested in the scientific aspect of speech recognition, how it works inside, you can go to the HTK web site of Cambridge
University, England, where you can download code and an excellent tutorial.

Regarding the many applications you describe, pretty much all of them are already being worked on in research labs around the world. The furthest out is the AI component. The hard problems currently being looked at are the robustness of speech recognizers to background noise, accents, speaking styles, etc.

Hey, thanks for your comments, and I am happy you liked the video!

Frank Seide
