It is time to design a unique digital voice for your apps, bots and other connected devices. It’s a logical step in our journey to establish deeper emotional connections with our customers. We can’t ignore the fact that a branded voice adds a lot of new value to a premium voice experience.

How do you optimize the digital voice in conversational experiences? The following is a short introduction to custom voice implementation, starting with three key terms:

Voice Prompts

A recorded message that is played by auto attendants.

TTS (Text To Speech) Voice

A form of speech synthesis that converts text into spoken voice output.

SSML (Speech Synthesis Markup Language)

An XML-based markup language for speech synthesis applications.

Technology has gotten pretty good, but it’s still pretty easy to tell the difference between a TTS voice and a real human voice. That can be both a good thing and a bad thing, because designing a digital voice that sounds as human as possible isn’t always the primary objective. In some cases you want your customers to be fully aware that they’re speaking to a bot instead of a real person.

Voice Prompts

The human voice naturally gives you the most control because of its wide range in frequency, intensity, speech rate and more. Strategists often brief and coach professional voice talents to alter their intonation, rate, intensity etc. in specific ways to evoke the right emotions and influence the way consumers perceive their voice.

Pros and cons:

Fidelity

A high-quality recorded voice prompt is the most faithful representation of the original human voice. It’s by far the best solution if your main objective is the highest-fidelity VUI (Voice User Interface) possible.

Control

Professional voice actors can easily alter their way of speaking to come across differently. Recorded voice prompts can also be edited after the recording sessions, for example to adjust the audio quality or extend pauses.

Elision & Prosody

Speech is the most natural way of communicating. As humans, we elide sounds and syllables in order to smooth the transitions between words. If you decide to use branded voice prompts to enhance your conversational experiences, you capture more of these natural transitions of spoken language than any other option does.

TTS (Text To Speech) Voice

You can create your own branded TTS voice easily with Microsoft’s custom voice.

According to Microsoft’s cris.ai, “you can build a highly natural voice without a single line of code, starting from just a few minutes of audio”. In practice, though, high-quality TTS voices seem to need a lot more input. Some of their examples are available to listen to online.

Pros and cons:

Fidelity

The quality of TTS voice technology keeps improving. With 8 hours of high-quality voice recordings, you can train a highly natural voice model that can be used in various scenarios. It’s still pretty easy to recognize a TTS voice, but in more and more cases this is not because of the quality of the audio.

Elision & Prosody

Vocal affect, or emotion in speech, is much harder to capture in a TTS voice than in voice prompts. When it comes to voice branding these elements are very important because they influence the way people perceive your brand. There are many approaches to assigning prosody in TTS systems, and they keep improving.

Scalability

Creating a custom TTS voice is an investment. The threshold is higher because you need at least 8 hours of premium-quality audio recordings to create a high-quality TTS voice. In the long run, however, it becomes cheaper than using voice prompts, because every new or changed voice prompt brings additional recording and implementation expenses.

Costs: TTS vs Voice Prompts

To produce voice prompts you need to work with voice talent, preferably in a professional recording studio. This sounds expensive, but up to a fairly high number of interactions it’s much cheaper than creating a custom TTS voice. On the other hand, a custom TTS voice is much more scalable in the long run. Also, if you look at Microsoft’s custom voice pricing structure, you’ll see that they are offering scalable solutions with lower thresholds. Contact us at Voicebranding.ai for more information.
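The trade-off above can be sketched as a simple break-even calculation. This is only a rough illustration: it assumes a one-time setup cost for the custom TTS voice, a small cost per synthesized prompt, and a fixed recording cost per voice prompt. The function name and all figures are hypothetical placeholders, not real prices.

```python
import math


def break_even_prompts(tts_setup_cost, tts_cost_per_prompt, recording_cost_per_prompt):
    """Smallest number of prompts at which custom TTS costs no more than recorded prompts.

    Returns None when TTS never catches up, i.e. when a recorded prompt
    is not more expensive than a synthesized one.
    """
    saving_per_prompt = recording_cost_per_prompt - tts_cost_per_prompt
    if saving_per_prompt <= 0:
        return None
    return math.ceil(tts_setup_cost / saving_per_prompt)


# Hypothetical placeholder figures, for illustration only.
print(break_even_prompts(tts_setup_cost=20000,
                         tts_cost_per_prompt=1,
                         recording_cost_per_prompt=26))  # prints 800
```

Below the break-even number, recorded voice prompts are the cheaper option; above it, the custom TTS voice is.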

SSML (Speech Synthesis Markup Language)

SSML might be the easiest way to create a better conversational voice experience. It provides only limited control, but it allows you to make simple alterations that can be extremely valuable. For example, think about using SSML to add appropriate pause lengths between empathic words to add clarity and optimize turn-taking. A TTS voice can be controlled in the same way; it requires similar additional markup.
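As a minimal sketch of the pause example, the Python snippet below joins phrases with the standard SSML break element inside a speak root element. The helper name build_ssml, the 600 ms pause and the sample phrases are assumptions made for illustration. Which SSML features a given TTS engine supports varies, so check the output against your platform.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Namespace defined by the W3C SSML specification.
SSML_NS = "http://www.w3.org/2001/10/synthesis"


def build_ssml(phrases, pause_ms=400, lang="en-US"):
    """Join phrases with a fixed-length SSML <break/> between them (hypothetical helper)."""
    if pause_ms < 0:
        raise ValueError("pause_ms must be non-negative")
    pause = f'<break time="{pause_ms}ms"/>'
    # Escape &, < and > so the phrase text cannot break the markup.
    body = pause.join(escape(p) for p in phrases)
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" xml:lang="{lang}">'
        f"{body}</speak>"
    )


ssml = build_ssml(["I'm sorry to hear that.", "Let's fix it together."], pause_ms=600)
ET.fromstring(ssml)  # raises ParseError if the markup is not well-formed XML
print(ssml)
```

Longer pauses after an empathic phrase give the listener time to take it in before the next sentence starts; the right length is a design decision that is best tuned by ear.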