Automatic Speech Recognition

Can Automatic Speech Recognition Replace Manual Transcription?

We’ve all heard of systems like Apple’s Siri that can automatically recognize what we say and wondered whether we might use this “Automatic Speech Recognition” (ASR) technology to replace the tedious process of manually transcribing oral history interviews. As with many new technologies, the answer turns out to be both yes and no.

Let’s start out by taking a peek under the hood to see how ASR works. It’s a bit like the board game Scrabble in which you have some tiles with letters on them with which you want to make a word that fits with what’s already on the board. In the case of ASR, our tiles contain the sounds associated with a word, and we want to fit them together in a way that matches the sounds that were said. Imagine, for example, we have four tiles with the sounds for “cream,” “I,” “ice,” and “scream.” Then if we hear something that sounds like “iscreem” we might either put down the tiles for “ice cream” or “I scream.” To decide which of those would be the best choice, we need to fit the tiles we put down into what’s already there. For example, if the tiles we have already laid down correspond to “my favorite dessert is …” then “ice cream” would be the better choice. Of course, ASR systems don’t really move tiles around, and they consider many more than two choices.
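For readers who like to see the idea made concrete, the “fit with what’s already on the board” step can be sketched as a toy word-pair scoring function. This is only an illustration of the principle, not how any real ASR system is implemented, and the counts below are invented for the example.

```python
# Toy illustration: context ("what's already on the board") breaks the
# tie between two identical-sounding word sequences. The counts are
# invented; a real system would estimate them from enormous text corpora.
bigram_counts = {
    ("dessert", "is"): 50,
    ("is", "ice"): 30,
    ("ice", "cream"): 40,
    ("is", "i"): 2,
    ("i", "scream"): 1,
}

def score(words, context):
    """Multiply how often each adjacent word pair has been seen."""
    total = 1.0
    prev = context
    for w in words:
        total *= bigram_counts.get((prev, w), 0.1)  # unseen pairs get a tiny score
        prev = w
    return total

# After "my favorite dessert is ...", which reading fits better?
print(score(["ice", "cream"], context="is"))  # 1200.0
print(score(["i", "scream"], context="is"))   # 2.0 -- "ice cream" wins
```

Real systems combine a score like this (the “language model”) with a score for how well each tile matches the actual sounds (the “acoustic model”), and they search over thousands of candidate tile sequences at once.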

Thinking about ASR as if it were Scrabble helps to explain the kinds of mistakes we see those systems make. Three kinds of problems can arise. First, the ASR system simply might not have a tile for the word that was actually spoken. The system, not knowing that, simply does the best it can with the tiles it has. So if your interviewee says “anaconda snake,” an ASR system that lacks a tile for “anaconda” might produce some nonsense like “anna con the snake.” Second, if the person you are interviewing has an unusual accent, your ASR system might not have any of the right tiles. This can result in what we might charitably call “word salad,” producing long runs of nonsense that have little in common with what was actually said. Third, if your interviewee uses words in ways that the ASR system isn’t prepared for, it might make the wrong choice. For example, if a British speaker referred to their “car ticking over” an ASR system designed for American English might have no idea that “ticking over” is one of the things that a “car” can do.

This perspective also helps to explain why Apple’s Siri and similar systems work as well as they do. Siri learns by “looking over your shoulder”: as you read and write email and surf the Web, it can automatically create tiles for new words (e.g., names of people). This same evidence can also help Siri to recognize how you are most likely to use words together. Moreover, Siri knows where you are and where you have been, which helps it to make the right choice when you refer to a location. The key idea is that Siri’s access to many types of evidence about what might be said makes it possible to perform more accurate ASR.

Thinking in this way can help to illustrate how we could tailor an ASR system for a specific oral history collection. First, we might want to tell it some of the unusual words that might be found in our interviews. Think of that as creating new tiles. Then we might want to give it some examples of how we speak. We can think of that as adjusting the pronunciation(s) associated with each tile. And finally, we might want to show the system some examples of how words are used together in ways that are specific to the kinds of things it will be asked to transcribe. For example, we might want to show it some newspaper stories on similar topics. The ASR system won’t actually learn what the words mean, of course, but after seeing a few examples it will learn which combinations of words occur frequently and which are so rare that no such examples have yet turned up.
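The third customization step, learning which word combinations occur frequently, amounts to little more than counting word pairs in sample text. The sketch below shows the idea on a few invented sentences; real systems count over vastly more text and use more sophisticated statistics, but the principle is the same.

```python
from collections import Counter

# Sketch of learning word combinations from sample text on similar
# topics. The sentences are invented for the example.
sample_text = (
    "the interviewer asked about the flood of 1937 "
    "the flood of 1937 destroyed the mill "
    "the mill reopened after the flood"
)
words = sample_text.split()

# Count every adjacent word pair that appears in the sample text.
pair_counts = Counter(zip(words, words[1:]))

print(pair_counts[("the", "flood")])   # 3 -- a frequent combination
print(pair_counts[("flood", "mill")])  # 0 -- never seen together
```

An ASR system customized with such counts would then prefer transcriptions containing combinations it has seen before, which is exactly why showing it newspaper stories on similar topics helps.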

If we do all of that, and if we speak clearly in a quiet room and place the microphone very close to the speaker, an ASR system can yield useful results. For example, a commercial ASR system designed for personal dictation in which the speaker adopts a consistent cadence can easily get entire sentences right (which requires a “word error rate” of 5% or less). If you instead do no customization, place your microphone on a table between the speakers, speak naturally about some specialized topic, and interview someone with a bit of an accent, the exact same ASR system will yield word salad. Indeed, we actually tried that as our starting point in a project a few years ago,[1] and that’s what happened.
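Since “word error rate” figures appear throughout this discussion, it may help to see how the measure is defined: the number of word substitutions, insertions, and deletions needed to turn the system’s output into the reference transcript, divided by the number of words in the reference. A minimal sketch using the standard edit-distance computation:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions)
    divided by the number of words in the reference transcript,
    computed with a standard edit-distance table."""
    ref, hyp = reference.split(), hypothesis.split()
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# One wrong word out of four is a 25% word error rate.
print(word_error_rate("my favorite dessert is", "my favorite desert is"))  # 0.25
```

So “getting three words right out of every four,” as described below for a customized system, corresponds to roughly the 25% word error rate that still supports useful searching.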

The problem with this approach is not that ASR can’t do what we want, but rather that we used a system that had been designed for a different purpose. So, in that same project, Bhuvana Ramabhadran of IBM Research built a customized ASR system for our oral history collection. The result was a system that could get three words right out of every four. That’s not good enough to produce readable transcripts, nor is it good enough to make manual post-editing faster than manual transcription from scratch would be, but it is surely good enough for use by a search engine. Indeed, research has shown that a 25% word error rate from ASR often doesn’t reduce “search quality” very much (because the way people use language includes a lot of redundancy). Moreover, we achieved this on one of the most challenging oral history collections in existence, the often heavily accented Shoah Foundation interviews with elderly Holocaust survivors. If we think of ASR as a way of helping find an interview that we might want to listen to, then our experience shows that ASR works. That’s the good news.

The bad news is that building such a highly specialized system would only be cost effective for the very largest oral history collections. For that project, we randomly selected 200 hours of speech from 800 interviews and then spent a couple of thousand hours transcribing that material to “train” the ASR system. Moreover, these were not like typical oral history transcripts. Every sound—every breath, every ummm—had to be transcribed. We also hired linguists to make “tiles” (pronunciations) for unfamiliar words. And we did all of this with the very latest technology that was, at that time, only available in a small number of research labs. Our best estimate is that something similar could now be done for perhaps $100,000, and of course economies of scale might eventually cut that cost further. But the bad news doesn’t end there: if we want the best possible accuracy in every case, then the same investment would have to be made anew for each oral history collection!

The obvious alternative to such a specialized system would be to build an ASR system that works reasonably well for a broad range of oral history collections, and then to do the best we can with the higher word error rate that would result. John Hansen at the University of Texas at Dallas has done this with a system called SpeechFind.[2] Despite the higher word error rate, a little experience should convince you that SpeechFind suffices for some searches that you might want to do. SpeechFind also illustrates a second key idea: by providing speech indexing and search as a centralized service, it becomes possible for a wide range of cultural heritage institutions to participate easily by uploading their collections. Of course, such an approach raises intellectual property issues that are more easily handled in demonstration projects like SpeechFind than in large-scale “cloud services.”

However, there’s more to searching than just building a device that can find where the words you are looking for might have been spoken. Equally important is your ability to recognize what you have found. Ultimately, that requires actually listening to some interviews, and listening takes time. There has been some research on user interface design for searching speech,[3] but much remains to be done to bridge the gap between that early concept development research and deployed systems that are specialized for oral history.

Interestingly, ASR can also be used to automatically synchronize manually created transcripts with the corresponding passages in the audio. These days it is relatively easy for a transcriber to add time code marks to the transcript as they work, but many older collections were transcribed before the value of time code was recognized. Because we don’t need a time code on every word, even quite high ASR word error rates (e.g., 60%) suffice for reliable assignment of time codes to each line in a transcript. Time codes in the transcript can be helpful when preparing collections for Web-based access, and they can also serve as a basis for captioning digital video when preparing multimedia exhibits. Scott Klemmer (now at Stanford University) has even experimented with a “tangible user interface,” adding time code marks to a printed paper transcript that can then be read by a smartphone’s camera to begin replay from any desired point.[4]
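The reason even a 60% word error rate suffices for this synchronization task is that only some of the ASR words need to match the transcript for each line to be anchored in time. The sketch below illustrates the idea with hypothetical ASR word timings (including recognition errors), using Python’s standard `difflib` matcher to find the stretches where transcript and ASR output agree; real forced-alignment tools work at the acoustic level, but the intuition is the same.

```python
import difflib

# Hypothetical ASR output: recognized words with start times in seconds.
# Note the recognition errors ("anna con the" for "anaconda"); alignment
# still works because enough of the surrounding words are right.
asr_words = [("we", 0.0), ("saw", 0.4), ("an", 0.7), ("anna", 0.9),
             ("con", 1.2), ("the", 1.4), ("snake", 1.6),
             ("near", 3.0), ("the", 3.2), ("river", 3.4), ("bank", 3.8)]

transcript_lines = ["We saw an anaconda snake", "near the river bank"]

def align(lines, words):
    """Assign each transcript line the time of its first matched ASR word."""
    recognized = [w for w, _ in words]
    flat, owner = [], []  # all transcript words, and which line each came from
    for i, line in enumerate(lines):
        for w in line.lower().split():
            flat.append(w)
            owner.append(i)
    times = {}
    matcher = difflib.SequenceMatcher(a=flat, b=recognized)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            line_no = owner[block.a + k]
            if line_no not in times:          # keep the earliest match per line
                times[line_no] = words[block.b + k][1]
    return [times.get(i) for i in range(len(lines))]

print(align(transcript_lines, asr_words))  # [0.0, 3.0]
```

Even though “anaconda” was misrecognized, both lines receive usable time codes, which is all that Web replay or captioning needs.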

Substantial investments in ASR technology continue to be made, both through government-funded research seeking to advance the state of the art and, increasingly, by companies seeking to leverage research results to create profitable applications such as search engines for news broadcasts, podcasts, or personal photograph collections (if your camera records your voice when you take a photo). So it seems quite reasonable to expect continued improvement for some time in ASR accuracy under difficult conditions. Indeed, we are not aware of any theoretical limits that would prevent ASR systems from achieving accuracies on oral history recordings that are close to what OCR can today achieve on printed transcripts. We may, however, need to change some of our collection practices in order to help that process along. One thing that we could do now that would perhaps pay the greatest dividends in the long run would be to consider using a close (e.g., headset-mounted) microphone rather than a distant (e.g., desktop) microphone. Although the human ear works quite well with content recorded using a desktop microphone, differences in room acoustics make things much more difficult for current ASR systems. Another thing that we could do would be for the interviewer to make a list of uncommon words used in an interview. Such a list could be useful even without ASR as a basis for helping searchers find specific interviews, and in the future it might be used by ASR system developers as a basis for building new “tiles.” If time allows, transcribing a few minutes from each interview would also likely help future ASR systems to achieve higher accuracy on specific collections.

One language is much like another from the perspective of ASR, so nothing about the technology limits the languages to which ASR might be applied. But the development and adoption of new technologies is often driven more by market forces than by technical limitations, and market forces are most strongly focused on languages where there is money to be made. As a result, for many years to come we will likely see the most advanced technology available for languages such as English, Chinese, French, German and Japanese. We’re a long way from being able to deploy ASR technology across the many hundreds or even thousands of languages that would be of interest if we wished to capture the full richness of the human experience.

At present, our glass is both half empty and half full. The potential for centralized search services that can be provided at an affordable cost has been demonstrated, and further advances can reasonably be anticipated. But easily readable fully automatic transcription of our most challenging content is not yet here, and not yet even on the horizon. To paraphrase Ken Church and Eduard Hovy,[5] perhaps the problem is not so much that our ASR systems cannot yet do all that we might wish, but rather that we have yet to make the best use of the systems that we now know how to build.

[1] “We” refers to the MALACH project, a 2001 NSF grant to the USC Shoah Foundation Institute for Visual History and Education, Charles University, IBM Research, Johns Hopkins University, the University of Maryland, and the University of West Bohemia. See http://malach.umiacs.umd.edu for details.

[2] SpeechFind was used to index oral history collections as part of a 2004 IMLS National Leadership Grant to the Colorado Digitization Program. See http://speechfind.utdallas.edu to try out the system.