Voice input processing for automotive speech recognition systems

In a quiet, controlled environment, today’s speech recognition engines have become quite effective. Whether dictating through a quality headset in a quiet office or speaking search phrases into a smartphone in a silent room, users now commonly see hit rates close to 100 percent. However, even a few disturbances tend to degrade performance quickly.

The automobile environment is one of the most challenging in this respect. A variety of noise sources both outside of the car (passing cars, honking horns) and inside (multiple passengers talking, the air conditioning fan, the radio) along with audio reverberations off the hard surfaces result in the lackluster performance with which many car owners are familiar.

Further, in order to avoid false triggers, the driver of the car needs to push a button to trigger the speech command system. This is not just a nuisance but also a safety hazard.

Yet few applications could benefit more from speech recognition for voice command operation than the automobile. It would therefore be of great value if technology could make speech recognition more effective in cars, detecting commands reliably in the presence of all these disturbances and without button presses. Although this is fundamentally a speech recognition problem, the performance improvements will come primarily from processing the voice input signal to remove noise and disturbances.

In recent years, Voice Input Processing (VIP) has been one of the key areas in which Conexant has applied its vast experience in audio technology. Through careful design that starts at the microphone interface, with clean bias signals, low-noise pre-amplification and gain control, and extends to complex digital signal processing algorithms running on its high-performance yet low-power DSPs, Conexant has been able to deliver VIP devices for a number of applications including TVs, home appliances and automobiles. Within those applications, one of the primary advantages of the Conexant solution is improved performance of speech recognition engines: the solution has been optimized for many of the common speech recognition algorithms for use in challenging environments.

To achieve superior performance, several algorithms are employed to enhance the desired input signal and suppress noise sources in a coordinated manner. Conexant’s Selective Source Pickup (SSP) algorithm is uniquely able to separate the desired signal from the noise sources by analyzing statistical and spatial information in the signal.

The interference coming from the local loudspeakers is cancelled with Conexant’s advanced Multi-channel Acoustic Echo Canceller (MAEC), reverberation is suppressed with a novel de-reverberation algorithm, and the remaining environmental noise is attenuated by a Non-Stationary Noise Reduction (NSNR) algorithm. Tuning these algorithms together, particularly when they are tuned for a specific speech recognition engine, can vastly improve the word hit rate without any changes to the speech recognition system.
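To make the echo-cancellation step more concrete, here is a minimal single-channel sketch in Python using the textbook normalized LMS (NLMS) adaptive filter. It is not Conexant's MAEC, whose design the article does not describe. The function name `nlms_echo_cancel`, the tap count, the step size and the 3-tap echo path are all assumptions chosen for illustration.

```python
import random

def nlms_echo_cancel(far_end, mic, taps=8, mu=0.5, eps=1e-8):
    """Subtract an adaptive estimate of the loudspeaker echo from the mic signal.

    far_end: samples sent to the loudspeaker (e.g. the radio)
    mic:     samples picked up by the microphone
    Illustrative NLMS filter only; all parameters are assumed values.
    """
    w = [0.0] * taps        # current estimate of the echo path
    x_hist = [0.0] * taps   # most recent loudspeaker samples, newest first
    out = []
    for x, d in zip(far_end, mic):
        x_hist = [x] + x_hist[:-1]
        echo_est = sum(wi * xi for wi, xi in zip(w, x_hist))
        e = d - echo_est    # residual: near-end sound plus uncancelled echo
        norm = sum(xi * xi for xi in x_hist) + eps
        w = [wi + mu * e * xi / norm for wi, xi in zip(w, x_hist)]
        out.append(e)
    return out

def power(signal):
    return sum(s * s for s in signal) / len(signal)

# Synthetic check: a white-noise "radio" signal passed through a
# hypothetical 3-tap echo path, with nobody speaking in the car.
random.seed(0)
far = [random.uniform(-1.0, 1.0) for _ in range(4000)]
path = [0.6, -0.3, 0.1]
mic = [sum(path[k] * far[n - k] for k in range(len(path)) if n - k >= 0)
       for n in range(len(far))]

residual = nlms_echo_cancel(far, mic)
echo_power = power(mic[-500:])           # echo level before cancellation
residual_power = power(residual[-500:])  # what is left after adaptation
```

Once the filter has adapted, `residual_power` should be a small fraction of `echo_power`. A real car adds near-end speech, several loudspeaker channels and a much longer echo path, which is where a multi-channel design and careful tuning come in.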

Figure 1. Disturbances in automobile environment

Selective source pickup (SSP)

Independent Component Analysis (ICA) is an emerging area of research within audio technology that attempts to separate or extract different voice or noise sources. Established in the early 90s, it is based on the idea that the underlying sources of a mixed signal are statistically independent. Using prior knowledge of the statistics of certain types of signals combined with measured correlation parameters, adaptive techniques can in fact separate or “de-mix” the combined signal to extract one or more of the underlying sources. Typically, ICA algorithms require an extreme amount of processing power and memory, which makes them impractical for implementation in embedded real-time systems.
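The de-mixing idea can be illustrated with a toy two-microphone example in Python. This is a sketch of a generic ICA recipe (whiten the two channels, then search for the rotation that makes the outputs least Gaussian), not Conexant's SSP. It assumes instantaneous mixing with no delays or reverberation, which a real car does not provide. The names `whiten`, `excess_kurtosis` and `demix`, the mixing weights and the synthetic sources are all made up for the illustration.

```python
import math
import random

def whiten(x1, x2):
    """Center both channels, decorrelate them, and scale each to unit variance."""
    n = len(x1)
    m1, m2 = sum(x1) / n, sum(x2) / n
    x1 = [v - m1 for v in x1]
    x2 = [v - m2 for v in x2]
    a = sum(v * v for v in x1) / n
    c = sum(v * v for v in x2) / n
    b = sum(u * v for u, v in zip(x1, x2)) / n
    phi = 0.5 * math.atan2(2.0 * b, a - c)  # principal-axis angle of the covariance
    cs, sn = math.cos(phi), math.sin(phi)
    p1 = [cs * u + sn * v for u, v in zip(x1, x2)]
    p2 = [-sn * u + cs * v for u, v in zip(x1, x2)]
    s1 = math.sqrt(sum(v * v for v in p1) / n)
    s2 = math.sqrt(sum(v * v for v in p2) / n)
    return [v / s1 for v in p1], [v / s2 for v in p2]

def excess_kurtosis(y):
    """Zero for a Gaussian signal; used here as a simple non-Gaussianity measure."""
    n = len(y)
    m2 = sum(v * v for v in y) / n
    m4 = sum(v ** 4 for v in y) / n
    return m4 / (m2 * m2) - 3.0

def demix(x1, x2, steps=90):
    """Grid-search the rotation of the whitened pair that maximizes non-Gaussianity."""
    z1, z2 = whiten(x1, x2)
    best = None
    for k in range(steps):
        t = (math.pi / 2.0) * k / steps
        cs, sn = math.cos(t), math.sin(t)
        y1 = [cs * u + sn * v for u, v in zip(z1, z2)]
        y2 = [-sn * u + cs * v for u, v in zip(z1, z2)]
        score = excess_kurtosis(y1) ** 2 + excess_kurtosis(y2) ** 2
        if best is None or score > best[0]:
            best = (score, y1, y2)
    return best[1], best[2]

def correlation(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    va = sum((u - ma) ** 2 for u in a)
    vb = sum((v - mb) ** 2 for v in b)
    return cov / math.sqrt(va * vb)

# Two synthetic, statistically independent sources (stand-ins, not real audio).
random.seed(1)
n = 2000
talker = [random.choice((-1, 1)) * random.expovariate(1.0) for _ in range(n)]
fan = [random.uniform(-1.7, 1.7) for _ in range(n)]

# Each microphone hears a different weighted sum of the two sources.
mic1 = [0.6 * s + 0.8 * v for s, v in zip(talker, fan)]
mic2 = [0.3 * s + 0.9 * v for s, v in zip(talker, fan)]

out1, out2 = demix(mic1, mic2)
talker_match = max(abs(correlation(talker, out1)), abs(correlation(talker, out2)))
fan_match = max(abs(correlation(fan, out1)), abs(correlation(fan, out2)))
```

The outputs come back in arbitrary order and with arbitrary sign, so the check takes the best absolute correlation against each original source. Even this two-channel toy scans every sample once per candidate rotation. Real recordings involve delays and reverberation, so the de-mixing has to be done with filters rather than single weights, which gives a feel for why full ICA is costly on an embedded DSP.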

Conexant’s SSP algorithm utilizes some of the fundamental ideas from ICA, reduces these requirements to a practical level and yet delivers on the promise of separating one talker from another talker or from the environmental noise using only two microphones. The decision of which source to extract can be made in real time. The algorithm can simply extract the dominant talker or use the position of the talker with respect to the microphones to decide what signal to extract. In effect, this allows the VIP to zoom in on a single talker in a room or car filled with interference from other sources, which can be extremely useful for a speech recognition application in an automobile environment.
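One common way to use talker position with two microphones is to estimate which microphone the sound reaches first. The Python sketch below does this with a brute-force cross-correlation over a small range of lags. It is offered only as an illustration of the general idea: the article does not say how SSP uses position, and `estimate_delay`, `pick_side`, the 3-sample delay and the decision rule are all hypothetical.

```python
import random

def estimate_delay(mic_a, mic_b, max_lag):
    """Return the lag (in samples) at which mic_b best lines up with mic_a.

    A positive result means the sound reached mic_a first.
    """
    n = len(mic_a)
    best_lag, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i in range(n):
            j = i + lag
            if 0 <= j < n:
                score += mic_a[i] * mic_b[j]
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag

def pick_side(delay, margin=1):
    """Hypothetical rule: the talker is nearer the microphone that hears them first."""
    if delay >= margin:
        return "near mic A"
    if delay <= -margin:
        return "near mic B"
    return "centre"

# Synthetic check: the same signal arrives at mic B three samples after mic A.
random.seed(2)
voice = [random.uniform(-1.0, 1.0) for _ in range(1000)]
delay_samples = 3
mic_a = voice
mic_b = [0.0] * delay_samples + voice[:-delay_samples]

lag = estimate_delay(mic_a, mic_b, max_lag=8)
side = pick_side(lag)
```

In this synthetic case the estimated lag matches the inserted delay, so the talker is placed on the mic A side. With real speech, competing talkers and reverberation, the correlation peak is far less clean, and a practical system would need smoothing over time and some weighting of the correlation.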

Sverrir, as impressive as Conexant's SSP research appears to be, I'm afraid that this is not going to cut it in the end. The market doesn't want noise-tolerant speech recognition systems that make preprogrammed assumptions about the environment in which they are used. The market wants speech technologies that are as noise-tolerant and versatile as the human brain or better. Personally, I want to be able to talk to my car even when I'm standing next to it. I want to tell it to open, close or lock the doors, any door; or open the trunk. Any company that can deliver this technology will make a killing.

It has been obvious for some time that the human brain does not use anything like Bayesian statistics (HMM) to process sounds or anything else. This is the fundamental problem with machine recognition. What is needed is a revolution in our understanding of how humans and animals recognize sounds. Everybody in AI has jumped on the Bayesian bandwagon just as they once jumped on the symbolic bandwagon in the 1950s only to be proven wrong half a century later.

In spite of its current success, the hidden Markov model is not it. SR researchers need to start thinking outside the box in my opinion. I say, get off the bandwagon because it's going nowhere. There's a better way, the correct way, and, as we all know, there's a fabulous prize waiting at the end of the road.