Speech Recognition - Dictation

Hardware and Software Requirements

A dictation application requires certain hardware and software on the user's computer. Not all computers have the memory, speed, microphone, or speakers required to support speech, so it is a good idea to design the application so that speech is optional.

These hardware and software requirements should be considered when designing a speech application:

Processor speed. The speech recognition and text-to-speech engines currently on the market typically require a Pentium 60 or faster processor for discrete dictation and a Pentium 200 or faster processor for continuous dictation.

Memory. On the average, speech recognition for dictation consumes 4 to 8 megabytes (MB) of random-access memory (RAM) for discrete dictation and about 32 megabytes for continuous dictation in addition to that required by the running application.

Sound card. Almost any sound card will work for speech recognition and text-to-speech, including Sound Blaster, Media Vision, and ESS Technology cards; cards that are compatible with the Microsoft Windows Sound System; and the audio hardware built into multimedia computers. A few speech recognition engines still need a DSP (digital signal processor) card.

Microphone. The user can choose between two kinds of microphones: either a close-talk or headset microphone that is held close to the mouth or a medium-distance microphone that rests on the computer 30 to 60 centimeters away from the speaker. A headset microphone is needed for noisy environments. Dictation works best with close-talk microphones.

Speech-recognition and text-to-speech engine. Speech recognition and text-to-speech software must be installed on the user's system. Many new audio-enabled computers and sound cards are bundled with speech recognition and text-to-speech engines. As an alternative, many engine vendors offer retail packages for speech recognition or text-to-speech, and some license copies of their engines.

Limitations

Even the most sophisticated speech recognition engine has limitations that affect what it can recognize and how accurate the recognition will be. The following list illustrates many of the limitations found today. The limitations do pose some problems, but they do not prevent the design and development of savvy applications that use dictation.

Microphones and sound cards

The microphone is the largest source of problems that speech recognition encounters. Microphones inherently have the following problems:

Not every user has a sound card. Over time more and more PCs will bundle a sound card.

Not every user has a microphone. Over time more and more PCs will bundle a microphone.

Because sound card jacks are usually on the back of the computer, it is not easy for users to plug in a microphone.

Most microphones that come with computers are cheap, and they don't do as well as more expensive microphones that retail for $50 to $100. Furthermore, many of the cheap microphones that are designed to be worn are uncomfortable. A user will not use a microphone if it is uncomfortable.

Users don't know how to use a microphone. If the microphone is worn on the head, they often wear it incorrectly; if it sits on the desktop, they lean toward it to speak even though the microphone is designed to pick up speech from a normal sitting position.

Most applications can do little about the microphone. One way that vendors can deal with this is to test and verify the user's microphone setup as part of the installation of any speech component software. Software to test a user's microphone can be delivered along with other components to ensure that the user can periodically test and adjust the microphone and configuration.
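As a sketch of what such a microphone check might do, the following function classifies the input level of a captured audio buffer. The function name and thresholds are illustrative assumptions, not part of any particular speech SDK:

```python
import math

def classify_mic_level(samples, full_scale=32768.0):
    """Classify microphone input level from a buffer of 16-bit PCM samples.

    Returns 'silent', 'too quiet', 'ok', or 'clipping'. Thresholds are
    illustrative; a real setup wizard would tune them per sound card.
    """
    if not samples:
        return "silent"
    rms = math.sqrt(sum(s * s for s in samples) / len(samples)) / full_scale
    peak = max(abs(s) for s in samples) / full_scale
    if peak >= 0.99:
        return "clipping"   # user is too close or gain is too high
    if rms < 0.02:
        return "too quiet"  # mic muted, unplugged, or too far away
    return "ok"
```

A setup wizard would run a check like this while the user reads a test sentence, then tell the user to adjust the microphone or the gain accordingly.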

Most users of dictation will wear close-talk microphones for maximum accuracy. Close-talk microphones have the best characteristics for speech recognition; they alleviate many of the microphone-related problems that affect both command-and-control and dictation applications.

Speech Recognizers make mistakes

Speech recognizers make mistakes, and they always will. What is changing is the error rate: roughly every two years, recognizers make half as many mistakes as they did before. But no matter how good a recognizer becomes, it will still make mistakes.

To make matters worse, dictation engines make misrecognitions that are correctly spelled and often grammatically correct, but mean nothing. Unfortunately, the misrecognitions sometimes mean something completely different than the user intended. These sorts of errors serve to illustrate some of the complexity of speech communication, particularly in that people are not accustomed to attributing strange wording to speech errors.

To minimize some of the misrecognitions, an application can:

Make it as easy as possible for users to correct mistakes.

Provide easy access to the "Correction Window" so the user can correct mistakes that the recognizer made.

Allow the user to train the speech recognition system to recognize his or her voice.
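As an illustration of the correction-window idea, the sketch below ranks a recognizer's alternate hypotheses by edit distance to text the user has begun retyping, so the most plausible correction is offered first. The function names and the ranking heuristic are hypothetical, not taken from any engine's API:

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings (standard dynamic program)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def rank_alternates(alternates, typed_text):
    """Order the engine's alternate hypotheses so likely corrections come first."""
    return sorted(alternates,
                  key=lambda alt: edit_distance(alt.lower(), typed_text.lower()))
```

A correction window built this way lets the user pick the intended phrase with one click instead of retyping it in full.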

Is it a Command?

When speech recognition is listening for dictation, users will often want to interject commands such as "cross-out" to delete the previous word or "capitalize-that" to capitalize it. Applications should make sure that:

If a command is just one word, it does not replace a word that people like to dictate.

If a command is multiple words, it can't be a phrase that people like to dictate.
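These two checks can be sketched as a simple vocabulary-conflict test. The word list here is a tiny illustrative sample, not real dictation statistics:

```python
# Illustrative sample of words users frequently dictate; a real
# application would derive this list from dictation frequency data.
COMMON_DICTATION_WORDS = {"the", "that", "out", "cross", "select", "delete"}

def command_conflicts(command, common_words=COMMON_DICTATION_WORDS):
    """Return True if a spoken command could be confused with ordinary dictation.

    A one-word command must not be a word users dictate; a multi-word
    command must not be a phrase built entirely from such words.
    """
    words = command.lower().split()
    if len(words) == 1:
        return words[0] in common_words
    return all(w in common_words for w in words)
```

This is why hyphenated forms like "cross-out" work well as commands: spoken as a single token, they do not collide with the individual words "cross" and "out" that users dictate.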

Finite Number of Words

Speech recognizers listen for a vocabulary of 20,000 to 100,000 words. Because of this, roughly one out of every fifty words a user speaks is not recognized, because it is not among the 20,000 to 100,000 words supported by the engine.

Applications can reduce the error rate of an engine by telling the engine which words to expect.
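A rough illustration of why this helps: the out-of-vocabulary rate drops when the application registers domain terms with the engine before dictation begins. The vocabularies below are tiny stand-ins for a real 20,000 to 100,000 word lexicon:

```python
def out_of_vocabulary_rate(dictated_words, vocabulary):
    """Fraction of dictated words that fall outside the engine's vocabulary."""
    misses = sum(1 for w in dictated_words if w.lower() not in vocabulary)
    return misses / len(dictated_words)

# Base vocabulary (stand-in for the engine's built-in lexicon):
base_vocab = {"the", "patient", "has", "a", "mild", "fever"}
# Domain terms the application tells the engine to expect:
domain_terms = {"tachycardia", "hypertension"}

utterance = ["the", "patient", "has", "tachycardia", "and", "hypertension"]
```

With only the base vocabulary, half the words in this utterance are out of vocabulary; registering the two domain terms cuts the rate to one word in six.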

Other Problems

Some other problems crop up:

Having a user spell out words is a bad idea, since most recognizers are too inaccurate at recognizing individual letters.

An engine also cannot tell who is speaking, although some engines may be able to detect a change in the speaker. Voice-recognition algorithms exist that can be used to identify a speaker, but currently they cannot also determine what the speaker is saying.

An engine cannot detect multiple speakers talking over each other in the same digital-audio stream. This means that a dictation system used to transcribe a meeting will not perform accurately during times when two or more people are talking at once.

Unlike a human being, an engine cannot hear a new word and guess its spelling.

Localization of a speech recognition engine is time-consuming and expensive, requiring extensive amounts of speech data and the skills of a trained linguist. If a language has strong dialects that each represent sizable markets, it is also necessary to localize the engine for each dialect. Consequently, most engines support only five or ten major languages: for example, European languages and Japanese, or possibly Korean.

Speakers with accents, or those speaking in nonstandard dialects, can expect more misrecognitions until they train the engine to recognize their speech, and even then, the engine accuracy will not be as high as it would be for someone with the expected accent or dialect. An engine can be designed to recognize different accents or dialects, but this requires almost as much effort as porting the engine to a new language.

Application Design Considerations

Here are some design considerations for applications using speech recognition for dictation.

Design Speech Recognition in From the Start

Don't make the mistake of implementing speech recognition in your application as an afterthought. An application designed only for the mouse and keyboard gets little benefit from speech recognition added later. The speech interface is at a point similar to where the mouse interface was when applications were designed for keyboard input only: not until applications were deliberately designed for mousing did the mouse prove generally effective for user input.

Do Not Replace the Keyboard and Mouse

Most dictation systems provide discrete dictation, allowing users to speak up to 50 words per minute. While this is faster than hunt-and-peck typing, touch typists can type at least 70 words per minute, so they are unlikely to use discrete dictation. Continuous dictation allows up to 120 words per minute.
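The arithmetic behind these rates, for a hypothetical 1,000-word document:

```python
def minutes_to_enter(word_count, words_per_minute):
    """Time in minutes to enter a document at a sustained input rate."""
    return word_count / words_per_minute

# For a 1,000-word document, at the rates cited above:
#   discrete dictation,    50 wpm -> 20 minutes
#   touch typing,          70 wpm -> about 14.3 minutes
#   continuous dictation, 120 wpm -> about 8.3 minutes
```

This is why discrete dictation appeals mainly to users who cannot touch-type, while continuous dictation can be faster than typing for everyone.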

Communicate Speech Awareness

Since most applications today do not include speech recognition, users will find speech recognition a new technology. They probably won't assume that your application has it, and won't know how to use it.

When you design a speech recognition application, it is important to communicate to the user that your application is speech-aware and to provide him or her with the commands it understands. It is also important to provide command sets that are consistent and complete.

Manage User Expectations

Users will often have the expectation that speech-enabled applications will provide a level of comprehension and interaction comparable to the futuristic speech-enabled computers of Star Trek and 2001: A Space Odyssey. Some users will expect the computer to correctly transcribe every word that they speak, understand it, and then act upon it in an intelligent manner.

You should convey as clearly as possible exactly what an application can and cannot do and emphasize that the user should speak clearly, using words the application understands.

Where the Engine Comes From

If an application implements speech recognition, it can work on an end user's PC only if the system has a speech recognition engine installed on it. The application has two choices:

The application can bundle in and install a speech recognition engine. This strategy guarantees that speech recognition will be installed and also guarantees a known level of quality from the speech recognizer. However, if an application does this, royalties will need to be paid to the engine vendor.

Alternatively, an application can assume that the speech recognition engine is already on the PC or that the user will purchase one if they wish to use speech recognition. The user may already have speech recognition because many PCs and sound cards will come bundled with an engine. Or, the user may have purchased another application that included an engine. If the user has no speech recognition engine installed, the application can tell the user that they need to purchase a speech recognition engine and install it. Several engine vendors offer retail versions of their engines.