Implementing voice processing for smart home apps

26 Feb 2015 | Vineet Ganju, Trausti Thormundsson


Speech control applications that must keep listening while the device is playing audio need truly full-duplex acoustic echo cancellation (AEC). For an AEC to work well, it needs access to the signal being played from the device, i.e. the echo reference. The AEC uses the echo reference to build a linear model of the acoustics of the echo path in the room. In real systems, however, the echo path often contains considerable non-linearities that degrade performance, such as when the device is trying to generate loud playback from small loudspeakers. Another example occurs when non-linear post-processing is applied to the playback signal after it has been sent to the AEC as the echo reference. This is the case in a speech-controlled set-top box (STB): the AEC runs in the STB and the echo reference is obtained there, but the TV will most likely add some unknown delay and post-processing to the audio before playing it out. A conventional AEC performs poorly under these conditions.
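As a rough illustration of what "linearly modelling the echo path" means, the sketch below uses a normalized least-mean-squares (NLMS) adaptive filter, a common textbook choice for a conventional AEC. It is not the authors' implementation; the function name, tap count, step size and the 3-tap toy echo path are all assumptions made for the example.

```python
import random


def nlms_echo_cancel(reference, microphone, taps=8, step=0.5, eps=1e-8):
    """Subtract a linear estimate of the echo from the microphone signal.

    reference  -- samples sent to the loudspeaker (the echo reference)
    microphone -- samples captured by the microphone
    Returns the residual (echo-cancelled) signal and the learned filter.
    """
    weights = [0.0] * taps   # linear model of the echo path
    history = [0.0] * taps   # most recent reference samples, newest first
    residual = []
    for x, d in zip(reference, microphone):
        history = [x] + history[:-1]
        echo_estimate = sum(w * h for w, h in zip(weights, history))
        e = d - echo_estimate                      # what is left after cancellation
        energy = sum(h * h for h in history) + eps
        gain = step * e / energy                   # normalized update step
        weights = [w + gain * h for w, h in zip(weights, history)]
        residual.append(e)
    return residual, weights


# Toy scenario (assumption): a purely linear 3-tap echo path, no near-end speech.
rng = random.Random(0)
echo_path = [0.6, 0.3, -0.1]
reference = [rng.uniform(-1.0, 1.0) for _ in range(4000)]
microphone = [
    sum(h * reference[n - k] for k, h in enumerate(echo_path) if n - k >= 0)
    for n in range(len(reference))
]
residual, weights = nlms_echo_cancel(reference, microphone)
```

In this idealized case the filter weights approach the toy echo path and the residual decays toward zero. Clipping in the loudspeaker or post-processing in a TV breaks the linear assumption, which is the failure mode the paragraph above describes.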

This problem can be solved by connecting the AEC to the noise reduction technology described in the previous section. As long as the AEC can distinguish between far-end, near-end and double-talk activity, this information can be used as the activity detection input to the USF. This approach provides truly full-duplex AEC performance in systems that have non-linearities and/or an impaired echo reference.
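The three activity states can be pictured with a deliberately naive, energy-based frame classifier. This is only a sketch under stated assumptions: the thresholds, the frame length and the idea of comparing reference energy against post-AEC residual energy are illustrative choices, not the detection logic used by the authors, and a residual that still contains non-linear echo would fool a detector this simple.

```python
def frame_energy(frame):
    """Mean squared amplitude of one frame of samples."""
    return sum(s * s for s in frame) / len(frame)


def classify_activity(reference_frame, residual_frame,
                      far_threshold=1e-4, near_threshold=1e-4):
    """Label a frame as far-end, near-end, double-talk or silence.

    reference_frame -- echo reference samples (playback) for this frame
    residual_frame  -- microphone samples after linear echo cancellation
    Both thresholds are hypothetical tuning values.
    """
    far_active = frame_energy(reference_frame) > far_threshold
    near_active = frame_energy(residual_frame) > near_threshold
    if far_active and near_active:
        return "double-talk"
    if far_active:
        return "far-end"
    if near_active:
        return "near-end"
    return "silence"


# Hypothetical frames: constant-amplitude stand-ins for playback and speech.
playback = [0.1] * 160
speech = [0.05] * 160
quiet = [0.0] * 160
labels = [
    classify_activity(playback, quiet),
    classify_activity(quiet, speech),
    classify_activity(playback, speech),
    classify_activity(quiet, quiet),
]
```

A label of this kind is the sort of activity information the text says can be passed on to the noise reduction stage.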

Additionally, this new AEC technology should include a delay estimation algorithm that aligns the echo reference with the microphone signal to account for the unknown delay in the echo path, as in the STB case.
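One simple way to estimate such a delay is to find the lag that maximizes the cross-correlation between the echo reference and the microphone signal. The sketch below shows that idea only; the article does not say which algorithm is used, and the function names, the search range and the 37-sample toy delay are assumptions.

```python
import random


def estimate_delay(reference, microphone, max_delay):
    """Return the lag (in samples) that maximizes the cross-correlation
    between the echo reference and the microphone signal."""
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_delay + 1):
        score = sum(
            reference[n - lag] * microphone[n]
            for n in range(lag, len(microphone))
        )
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


def align_reference(reference, lag):
    """Delay the reference by `lag` samples so it lines up with the echo."""
    return [0.0] * lag + reference[: len(reference) - lag]


# Toy scenario (assumption): the echo is the reference, attenuated and delayed.
rng = random.Random(1)
reference = [rng.uniform(-1.0, 1.0) for _ in range(2000)]
true_delay = 37  # hypothetical delay added downstream, e.g. by a TV
microphone = [0.0] * true_delay + [0.5 * x for x in reference[:-true_delay]]

lag = estimate_delay(reference, microphone, max_delay=100)
aligned = align_reference(reference, lag)
```

The aligned reference can then be handed to the echo canceller in place of the raw one. A production system would typically search in a more efficient domain and track the delay over time, which this brute-force loop does not attempt.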

Figures 8 and 9 show the performance of an STB system. The user is 3 m from the TV, and a microphone module sits on top of the TV and is connected to the STB. The user gives natural language commands to the STB. At the microphone module, the sound pressure level (SPL) of the desired speech is 60 dB and the SPL of the echo from the TV playback content is 72 dB. The top part of Figure 8 shows the unprocessed microphone signal; the bottom part shows the processed microphone signal. Figure 9 shows the spectral content of the residual echo before and after processing. In this case the word error rate (WER) was 100% before processing and 8% after processing.

Figure 8: The top part of this graph shows the unprocessed microphone signal, and the bottom part shows the processed microphone signal.

Figure 9: This plot shows the spectral content of the residual echo before and after processing.
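To put the SPL figures from this test into perspective, the short calculation below converts the 60 dB speech level and 72 dB echo level into a speech-to-echo ratio. The variable and function names are illustrative; only the two SPL values come from the article.

```python
def db_to_amplitude_ratio(db):
    """Convert a level difference in dB to a sound-pressure (amplitude) ratio."""
    return 10 ** (db / 20)


def db_to_power_ratio(db):
    """Convert a level difference in dB to a power ratio."""
    return 10 ** (db / 10)


speech_spl_db = 60.0   # desired speech at the microphone module
echo_spl_db = 72.0     # TV playback echo at the microphone module

speech_to_echo_db = speech_spl_db - echo_spl_db                          # -12 dB
echo_over_speech_amplitude = db_to_amplitude_ratio(-speech_to_echo_db)   # roughly 4x
echo_over_speech_power = db_to_power_ratio(-speech_to_echo_db)           # roughly 16x
```

In other words, the echo reaching the microphone has about four times the sound pressure of the user's speech, which is consistent with the unprocessed signal being unusable for recognition.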

Conclusion

Conventional beamforming speech enhancement methods often fall short in providing an acceptable solution in smart home far-field conditions. It therefore becomes imperative to look at other systems that can successfully address and resolve these far-field challenges. For example, Conexant has developed cost effective, highly integrated solutions like the one described in this article with high dynamic range ADCs, excellent far-field noise/interference reduction in conditions with low SNR, low DRR and no knowledge of the direction of speech and noise, and truly full duplex acoustic echo cancellation even when the echo signal is not completely known. These solutions have been deployed by Conexant on many production platforms, from smart home devices to tablets, PCs, and wearables – all with excellent performance results.

Conventional methods such as beamforming require significant microphone cost, platform-specific tuning and many constraints on microphone location, matching and directionality of the speech and noise. The robustness of the alternative solutions described translates directly into better performance and significant cost savings during the development and manufacturing of new smart home products.

About the authors

Vineet Ganju is the Executive Marketing Director of the audio business unit at Conexant. Vineet has spent over 17 years in the semiconductor industry with most of that time spent in the consumer and automotive audio segments. Vineet's experience spans audio DSPs, mixed-signal products, amplifiers and algorithms.

Trausti Thormundsson is the Audio Chief Technology Officer at Conexant. He has worked for over 17 years in the semiconductor industry in the field of digital communication and audio/voice processing, gaining significant experience in research and development, project management and customer support. Trausti has authored and co-authored more than 20 published or pending patents. He serves on the board of directors at Controlant. He has an MSc degree in Electrical Engineering from Stanford University in California.


Copyright © 2016 EDN Asia Ltd. All rights reserved.