The machine learning techniques that give the whisper to Alexa’s speech

In October 2018, Amazon introduced Whisper Mode to some of its products. Just over a year later, all users of Alexa devices can now whisper back and forth with them. Details on how this process works have now been revealed.

Maintaining clarity

Amazon shared a paper from the January 2020 issue of the journal IEEE Signal Processing Letters. In it, the company describes the machine learning techniques used to implement Whisper Mode.

The goal was to convert normal speech into whispered speech while keeping Alexa's voice sounding natural. To that end, the company explored three different techniques to perform this conversion.

The first was a handcrafted digital-signal-processing (DSP) system based on an analysis of the acoustics of whispered speech. The other two were machine learning systems: one uses Gaussian mixture models (GMMs), while the other uses deep neural networks (DNNs).

These techniques were all evaluated through listener studies using the multiple stimuli with hidden reference and anchor (MUSHRA) methodology. Amazon concluded that the machine learning systems were highly effective, but that the DNN model generalized better to multiple and unfamiliar speakers.
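To give a feel for how MUSHRA-style results are summarized, here is a minimal Python sketch. In such a test, listeners rate each stimulus on a 0 to 100 scale, with an unlabeled copy of the reference and a deliberately degraded anchor mixed in among the systems under test. The ratings, the system names, and the helper functions below are all hypothetical and are not taken from Amazon's paper.

```python
from statistics import mean

# Hypothetical ratings (0-100) for a single utterance. All values are made up
# for illustration and do not come from Amazon's study.
RATINGS = {
    "listener_1": {"reference": 100, "anchor": 20, "system_a": 60, "system_b": 75},
    "listener_2": {"reference": 95, "anchor": 30, "system_a": 55, "system_b": 80},
    "listener_3": {"reference": 98, "anchor": 25, "system_a": 65, "system_b": 70},
}


def mean_scores(ratings):
    """Average each stimulus's rating across all listeners."""
    stimuli = next(iter(ratings.values())).keys()
    return {s: mean(r[s] for r in ratings.values()) for s in stimuli}


def rank_stimuli(ratings):
    """Return stimulus names ordered from highest to lowest mean score."""
    scores = mean_scores(ratings)
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    for name in rank_stimuli(RATINGS):
        print(name, round(mean_scores(RATINGS)[name], 1))
```

In a well-behaved test the hidden reference should land at the top and the anchor at the bottom, which gives a sanity check on the listeners before the systems in between are compared.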

Those with Amazon devices on their bedside table have found Whisper Mode to be useful at night. Photo: Amazon

Training data

VentureBeat reports that the GMMs tried to identify a range of values for each output feature corresponding to a related distribution of input values. Meanwhile, the DNNs adjusted their internal settings through a training process in which the networks attempted to predict the outputs associated with particular inputs.
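As a rough illustration of the GMM idea, the Python sketch below uses the classic joint-density formulation of GMM-based voice conversion on a single one-dimensional feature. It is an assumption made for illustration, not Amazon's implementation: the `Component` class, the `gmm_convert` function, and every parameter value are hypothetical. A real system would work on multi-dimensional acoustic features, with parameters learned from paired full-voice and whispered utterances.

```python
import math
from dataclasses import dataclass


@dataclass
class Component:
    """One Gaussian of a joint model over (input feature x, output feature y)."""
    weight: float  # mixture weight
    mean_x: float  # mean of the input (full-voice) feature
    mean_y: float  # mean of the output (whispered) feature
    var_x: float   # variance of the input feature
    cov_xy: float  # covariance between input and output features


def gaussian_pdf(x, mean, var):
    """Density of a one-dimensional Gaussian at x."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2 * math.pi * var)


def gmm_convert(x, components):
    """Expected output feature given input x, i.e. E[y | x] under the mixture."""
    likelihoods = [c.weight * gaussian_pdf(x, c.mean_x, c.var_x) for c in components]
    total = sum(likelihoods)
    if total == 0.0:
        raise ValueError("input is too far from every component")
    y = 0.0
    for c, lik in zip(components, likelihoods):
        posterior = lik / total  # how responsible this component is for x
        # Per-component linear regression from x to y.
        y += posterior * (c.mean_y + c.cov_xy / c.var_x * (x - c.mean_x))
    return y


# Hypothetical, hand-picked parameters; a real model would be fitted to data.
COMPONENTS = [
    Component(weight=0.5, mean_x=0.0, mean_y=1.0, var_x=1.0, cov_xy=0.5),
    Component(weight=0.5, mean_x=10.0, mean_y=-1.0, var_x=1.0, cov_xy=0.5),
]

if __name__ == "__main__":
    for x in (0.0, 5.0, 10.0):
        print(x, gmm_convert(x, COMPONENTS))
```

Each component contributes a simple linear mapping, and the input decides how much weight each mapping gets. A DNN replaces this explicit mixture with weights that are adjusted during training until the network's predictions match the target outputs.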

“We used two different data sets to train our voice conversion systems, one that we produced ourselves using professional voice talent and one that is a standard benchmark in the field. Both data sets include pairs of utterances — one in full voice, one whispered — from many speakers,” Amazon said, as per its blog post.

“Like most neural text-to-speech systems, ours passes the acoustic-feature representation to a vocoder, which converts it into a continuous signal.”

Voice actors from five different countries were consulted to help with the whispered speech. Photo: Amazon

The experiments continue

To evaluate its voice conversion systems, Amazon compared their outputs to both recordings of natural speech and recordings of natural speech fed through a vocoder called WORLD.

The group used two sets of data to train its conversion systems, one of which it produced using five professional voice actors from Australia, Canada, Germany, India, and the US. It then compared the systems' outputs to recordings of natural speech and recordings of speech fed through a vocoder.

In their preliminary experiments, they trained the voice conversion systems on data from individual speakers and tested them on data from the same speakers.

The MUSHRA scores for the naturalness of recorded speech (Rec), vocoded recorded speech (Oracle), and Amazon’s three experimental systems. Photo: Amazon

The finished product

They found that, while the raw recordings sounded most natural, whispers synthesized by the models sounded more natural than “vocoded” human speech. This then allowed the company to analyze how well the voice conversion process was performing.

The version of the whispers used in all Alexa devices is passed through Amazon’s state-of-the-art neural vocoder, which enhances the speech quality further.

Altogether, Amazon is set to have another strong decade ahead following ten years of unprecedented growth. By continuing to look at modern technology to improve its products, it will continue to stay ahead within its markets.

What are your thoughts on these machine learning techniques? Let us know what you think in the comment section.
