Princeton/Adobe technology will let you edit voices like text

Are you ready for fake audio (and maybe video) news?

May 19, 2017

Technology developed by Princeton University computer scientists may do for audio recordings of the human voice what word processing software did for the written word and Adobe Photoshop did for images.

“VoCo” software, still in the research stage, makes it easy to add or replace a word in an audio recording of a human voice by simply editing a text transcript of the recording. New words are automatically synthesized in the speaker’s voice — even if they don’t appear anywhere else in the recording.

The system uses a sophisticated algorithm to learn and recreate the sound of a particular voice. It could one day make editing podcasts and video narration much easier, or create personalized robotic voices that sound natural, according to co-developer Adam Finkelstein, a professor of computer science at Princeton. People who have lost their voices to injury or disease might also be able to recreate them through such a system, one that sounds natural rather than robotic.

An earlier version of VoCo was announced in November 2016. A paper describing the current VoCo development will be published in the July issue of the journal ACM Transactions on Graphics (an open-access preprint is available).

How it works (technical description)

VoCo allows people to edit audio recordings with the ease of changing words on a computer screen. The system inserts new words in the same voice as the rest of the recording. (credit: Professor Adam Finkelstein)

VoCo’s user interface looks similar to other audio editing software such as the podcast editing program Audacity, with a waveform of the audio track and cut, copy and paste tools for editing. But VoCo also augments the waveform with a text transcript of the track and allows the user to replace or insert new words that don’t already exist in the track by simply typing in the transcript. When the user types the new word, VoCo updates the audio track, automatically synthesizing the new word by stitching together snippets of audio from elsewhere in the narration.
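The transcript-driven part of this workflow can be illustrated with a minimal sketch (my own illustration, not Adobe's code, assuming each transcript word carries a start/end time alignment): deleting a word in the text removes the corresponding span of samples from the waveform.

SAMPLE_RATE = 44_100  # assumed sample rate of the narration track

def delete_word(samples, alignment, word_index):
    """samples: list of audio samples; alignment: list of
    (word, start_sec, end_sec) tuples; word_index: which word to remove.
    Returns the edited samples and the updated alignment."""
    word, start, end = alignment[word_index]
    s, e = int(start * SAMPLE_RATE), int(end * SAMPLE_RATE)
    edited = samples[:s] + samples[e:]
    removed = end - start
    # shift the timing of every later word back by the removed duration
    new_alignment = (
        alignment[:word_index]
        + [(w, st - removed, en - removed) for w, st, en in alignment[word_index + 1:]]
    )
    return edited, new_alignment

Inserting a word that does not already exist in the track is the hard part, and is what the synthesis machinery described below handles.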

VoCo is based on an optimization algorithm that searches the voice recording and chooses the best possible combinations of phonemes (partial word sounds) to build new words in the speaker’s voice. To do this, it needs to find individual phonemes, and sequences of them, that stitch together without abrupt transitions. The new word also needs to fit into the existing sentence so that it blends in seamlessly: words are pronounced with different emphasis and intonation depending on where they fall in a sentence, so context is important.
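As a rough sketch of this kind of search (a generic unit-selection dynamic program, not the authors' actual algorithm; all costs and names below are illustrative assumptions), one can pick one recorded snippet per target phoneme so that the sum of a target cost (how well a snippet matches a reference rendering of the word) and a concatenation cost (how smoothly adjacent snippets join) is minimized:

from dataclasses import dataclass

@dataclass
class Snippet:
    phoneme: str
    pitch: float       # mean pitch of the snippet (Hz)
    duration: float    # seconds
    end_energy: float  # loudness at the snippet boundary, used for the join cost

def target_cost(snippet, ref_pitch, ref_duration):
    """How far the snippet is from the reference rendering of the phoneme."""
    return abs(snippet.pitch - ref_pitch) / 100.0 + abs(snippet.duration - ref_duration)

def join_cost(prev, cur):
    """Penalty for an abrupt transition between consecutive snippets."""
    return abs(prev.end_energy - cur.end_energy) + abs(prev.pitch - cur.pitch) / 100.0

def select_units(candidates, reference):
    """candidates: one list of candidate Snippets per target phoneme,
    found elsewhere in the narration.
    reference: one (pitch, duration) target per phoneme.
    Returns the lowest-cost sequence of snippets (Viterbi-style DP)."""
    best = [(target_cost(s, *reference[0]), [s]) for s in candidates[0]]
    for i in range(1, len(candidates)):
        new_best = []
        for s in candidates[i]:
            tc = target_cost(s, *reference[i])
            prev_cost, prev_path = min(
                ((c + join_cost(p[-1], s), p) for c, p in best),
                key=lambda t: t[0],
            )
            new_best.append((prev_cost + tc, prev_path + [s]))
        best = new_best
    return min(best, key=lambda x: x[0])[1]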

Advanced VoCo editors can manually adjust pitch profile, amplitude and snippet duration. Novice users can choose from a predefined set of pitch profiles (bottom), or record their own voice as an exemplar to control pitch and timing (top). (credit: Professor Adam Finkelstein)

For clues about this context, VoCo looks to an audio track of the sentence that is automatically synthesized from the text transcript in an artificial voice — one that sounds robotic to human ears. This recording serves as a point of reference in building the new word. VoCo then selects pieces of sound from the real human voice recording and fits them to the word in the synthesized track — a technique known as “voice conversion,” which inspired the project name, VoCo.
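A toy illustration of using the synthetic rendering as a reference (my own assumption of how such matching could work, not the Princeton/Adobe implementation): dynamic time warping aligns frames of a real-voice snippet to the robotic reference, so the snippet can borrow the reference's timing and intonation when spliced into the sentence.

def dtw_path(ref_frames, snippet_frames):
    """Classic DTW over 1-D frame features (e.g., per-frame pitch in Hz).
    Returns the frame-to-frame alignment as a list of (ref_idx, snip_idx)."""
    n, m = len(ref_frames), len(snippet_frames)
    INF = float("inf")
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    back = {}
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(ref_frames[i - 1] - snippet_frames[j - 1])
            steps = [(cost[i - 1][j - 1], (i - 1, j - 1)),
                     (cost[i - 1][j],     (i - 1, j)),
                     (cost[i][j - 1],     (i, j - 1))]
            prev_cost, prev = min(steps)
            cost[i][j] = d + prev_cost
            back[(i, j)] = prev
    # walk back from the end to recover the alignment
    path, node = [], (n, m)
    while node != (0, 0):
        path.append((node[0] - 1, node[1] - 1))
        node = back.get(node, (0, 0))
    return list(reversed(path))

# Example: align a snippet's pitch contour to the synthetic reference contour.
reference = [110, 115, 130, 140, 135, 120]   # robotic TTS word, per-frame pitch
snippet   = [105, 112, 128, 133, 118]        # speaker's snippet, per-frame pitch
print(dtw_path(reference, snippet))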

In case the synthesized word isn’t quite right, VoCo offers users several versions of the word to choose from. The system also provides an advanced editor to modify pitch and duration, allowing expert users to further polish the track.

To test how effective their system was at producing authentic-sounding edits, the researchers asked people to listen to a set of audio tracks, some of which had been edited with VoCo and others that were completely natural. The fully automated versions were mistaken for real recordings more than 60 percent of the time.

The Princeton researchers are currently refining the VoCo algorithm to improve the system’s ability to integrate synthesized words more smoothly into audio tracks. They are also working to expand the system’s capabilities to create longer phrases or even entire sentences synthesized from a narrator’s voice.

A key use for VoCo might be in intelligent personal assistants like Apple’s Siri, Google Assistant, Amazon’s Alexa, and Microsoft’s Cortana, or for using movie actors’ voices from old films in new ones, Finkelstein suggests.

But there are obvious concerns about fraud. It might even be possible to create a convincing fake video. Video clips with different facial expressions and lip movements (using Disney Research’s FaceDirector, for example) could be edited in and matched to associated fake words and other audio (such as background noise and talking), along with green screen to create fake backgrounds.

With billions of people now getting their news online and unfiltered, augmented reality on the way, and hacking way out of control, things may get even weirder. …

Zeyu Jin, a Princeton graduate student advised by Finkelstein, will present the work at the Association for Computing Machinery SIGGRAPH conference in July. The work at Princeton was funded by the Project X Fund, which provides seed funding to engineers for pursuing speculative projects. The Princeton researchers collaborated with scientists Gautham Mysore, Stephen DiVerdi, and Jingwan Lu at Adobe Research. Adobe has not announced availability of a commercial version of VoCo, or plans to integrate VoCo into Adobe Premiere Pro (or FaceDirector).

Abstract of VoCo: Text-based Insertion and Replacement in Audio Narration

Editing audio narration using conventional software typically involves many painstaking low-level manipulations. Some state of the art systems allow the editor to work in a text transcript of the narration, and perform select, cut, copy and paste operations directly in the transcript; these operations are then automatically applied to the waveform in a straightforward manner. However, an obvious gap in the text-based interface is the ability to type new words not appearing in the transcript, for example inserting a new word for emphasis or replacing a misspoken word. While high-quality voice synthesizers exist today, the challenge is to synthesize the new word in a voice that matches the rest of the narration. This paper presents a system that can synthesize a new word or short phrase such that it blends seamlessly in the context of the existing narration. Our approach is to use a text to speech synthesizer to say the word in a generic voice, and then use voice conversion to convert it into a voice that matches the narration. Offering a range of degrees of control to the editor, our interface supports fully automatic synthesis, selection among a candidate set of alternative pronunciations, fine control over edit placements and pitch profiles, and even guidance by the editor’s own voice. The paper presents studies showing that the output of our method is preferred over baseline methods and often indistinguishable from the original voice.

Comments (11)

I’m very concerned by the almost certain likelihood of fake news, especially on blogs: news that foments fear and hatred and supports confirmation bias.

I agree with GreenDocNowCIv. Hopefully, a “stamp” or other indication of change can be inserted within the stream, or easily discovered, to invalidate it. Still, it would require some due diligence on the part of the hearer/viewer to validate the content.

Based on the studies presented in the video, this approach seems to be an excellent advance in voice-editing technology. It seems to act in near real time, as opposed to the long times necessary for word synthesis in the currently available commercial software.
Almost certainly, once the program knows the voice’s carrier frequencies and the nature of the phonemes, an extension program should be able to synthesize any sentence of words in the chosen voice, without any further speech for training.
It might be interesting to see if, using similar techniques, the machine could detect various accents in a language and reproduce them in any person’s speech. For example, in Great Britain there are many different types of accents spoken. Trained individuals can detect where people come from and even the accent learned in specific schools. Persons of different classes speak with very different accents. The ability to perform such sound recognition opens up, by analogy, the possibility of some unusual feats. It is only a small step to being able to synthesize various musical instruments. With the capacity to discern subtle differences, one might be able to take a standard violin and make it sound like a Stradivarius.

Taking this approach even further, you might be able to program the ability to create a singer’s voice with all its distinctive tones and inflections. It is then only a small step to program the voice of any singer into a song that was never sung: for example, Sinatra singing, in his own voice, a ballad created after his death, or Caruso singing a new opera written long after he died.
Also, it is not unthinkable that one could identify what qualities in the sound make for a “beautiful” voice as opposed to a poor one, and fix the poor voice to sound “beautiful” while using its own improved carrier frequencies.

Unfortunately, some individuals have their voice box removed. If their speech sound is saved so this program can speak anything in the “lost” voice, then the possibility exists to create a true voice-box prosthesis that is bionic.
If my reasoning is correct, the loss of the voice box means that a person no longer has the carrier frequencies created by the box. Exactly how this carrier varies is determined by the air pressure exerted on the box by the lungs. Assuming this is still intact after surgery, a prosthesis could be designed to measure this pressure and its changes. If the prosthesis is programmed to produce the person’s carrier wave, and can be modified automatically by the pressure gauges, then placing the artificial box where the original box was should allow the mouth and tongue to form normal speech. If such a bionic voice box were implanted in the correct position, it would need electric energy to function. Conceivably, the power could be provided without wires by magnetically charging a storage battery under the closed skin.

I would agree with you, if I understand what you mean, about using audio mapping to modulate the voice to form speech. Unless the person still had the voice box, you would not be able to recreate the unique modulation each person creates using the mouth, teeth, tongue and air pressure from the lungs. One would be forced to use mapping from other people to create a general modulation that is impressed on the artificial voice-box carrier. The person has already learned how to create phonemes and words; it is the carrier sound from the voice box that is missing. Your technique cannot easily be used to convert the actions of the tongue, etc., into speech. It becomes very difficult to translate these muscular actions into the correct remembered audio mapping (in a computer) that produces modulated sounds. If you are manufacturing sounds such as songs by a dead singer, then your approach may work quite well. Here we are programming everything, which is then recreated by a sound system.

Let’s take the blackbird species as an example. I take one of their songs from Ireland or the UK, as the intonation can vary around Europe, then I recreate it and map it back to a blackbird in Spain, as their patterns are similar but more robust, just like Spanish intonation. It took just over six months to implant the initial pattern, but once they had mastered it by uploading, after practically four years all of the population in the area studied apply that song. First of all, you have to have a minimal knowledge of the language of birds to go deeper into this topic; it is simple to write about, but practicing it is another matter, and difficult to translate. It depends on the level of knowledge; we are on a completely different wavelength, with all due respect.

It reminds me of the other article on photo style insertion, whereby an image was taken and reinvented in Photoshop. Here the application seems similar, only it’s audio paste. So you could take copyrighted audio and reinvent it, thereby undoing its copyright? Or create fake news. Any ideas?

A variation on this method could be used for ultra-low-bandwidth voice communications. The two parties have a (very large) shared corpus that is n-gram indexed, with those n-grams sorted by frequency of use so that the most frequent utterances have the shortest code lengths. When speaker one speaks a sentence, it is converted to text and then to the compressed n-gram codes; when this data is received at the other end of the digital communications channel, it is synthesised back into a voice using the parameters of the original speaker’s voice characteristics. The system can be adaptive, so that emphasis codes can be sent when more bandwidth is available, allowing the digital voice proxy to be a more effective actor and thereby enriching the interaction between the people using the communications link. AI can be used to track the emotive flow of the dialogue and insert emphasis when bandwidth is reduced.
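A rough sketch of the shared code-table idea the commenter describes (my own interpretation and assumptions, not a real protocol): both ends index a shared corpus by word n-grams, sort the n-grams by frequency so common phrases get the cheapest codes, and transmit the codes instead of audio; the receiver expands them back to text for a synthesizer tuned to the speaker's voice.

from collections import Counter

def build_codebook(corpus_sentences, n=3):
    """Map the most frequent n-grams (with single words as fallback) to small ints."""
    counts = Counter()
    for sent in corpus_sentences:
        words = sent.lower().split()
        for k in (n, 1):  # count trigrams and single words
            for i in range(len(words) - k + 1):
                counts[tuple(words[i:i + k])] += 1
    # most frequent entries get the smallest (cheapest to send) codes
    ordered = [gram for gram, _ in counts.most_common()]
    return {gram: code for code, gram in enumerate(ordered)}

def encode(sentence, codebook, n=3):
    """Greedily replace n-grams, then single words, with their codes
    (assumes every word already appears in the shared corpus)."""
    words, codes, i = sentence.lower().split(), [], 0
    while i < len(words):
        gram = tuple(words[i:i + n])
        if gram in codebook:
            codes.append(codebook[gram]); i += n
        else:
            codes.append(codebook[tuple(words[i:i + 1])]); i += 1
    return codes

corpus = ["see you at the office", "meet me at the office", "see you soon"]
book = build_codebook(corpus)
decoder = {v: k for k, v in book.items()}
codes = encode("see you at the office", book)
print(codes, "->", " ".join(w for c in codes for w in decoder[c]))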