Speech I/O for Embedded Applications

Is the world ready for speech-enabled embedded devices? Now the technology is here for usable speech recognition and synthesis. See how you can use it in your own embedded applications.

Speech user interfaces are something of a holy grail for computing. We talk to each
other to communicate, and sci-fi stories, from HAL in 2001: A Space Odyssey
to the ship's computer in Star Trek, point to talking computers as the
inevitable future. But creating speech interfaces that are natural
and that people actually will use has proven to be difficult. Too often speech
technology is provided, or even preinstalled (as with Microsoft Windows
Speech Recognition), and never used. Still, there are glimmers of hope. The
technology to do “decent” speech recognition and speech
synthesis has existed for a while now, and users are trying it out,
at least in some application categories.

It feels like the time is ripe for someone to get the speech
interface right. Maybe you're the one to invent a speech interface that
makes your embedded application as cool and unique as the iPhone touch
interface was when it first came out.

In some ways, embedded applications are particularly well suited for
speech. An embedded device often is physically small and may not have
a rich user interface. Almost by definition, embedded applications are
not general-purpose, so it's okay if a speech interface has a limited
vocabulary. Speech may be the only user interface provided, or it may
augment a display and keyboard.

Mobile phones are one class of embedded applications where speech works
as a user interface. Voice dialing (“dial home”) is an almost
trivial interface that works very well on phones. If you're driving
and want to send a text message, it's difficult (and in many places
illegal) to use the phone's soft keyboard to enter the message and its
destination. Speech recognition is now good enough, and mobile phones are
powerful enough computers, that sending text messages by voice is a
practical use case that people are starting to adopt.

In this article, I examine technologies for speech synthesis and
recognition and see how they fit with today's embedded devices. As
an example application, and in step with the rediscovery of checklists
as productivity tools (thanks to Atul Gawande's best-seller The Checklist
Manifesto), we'll build a simple vocal
checklist that you can use the next time you do surgery, like Dr. Gawande
(kids, don't try this at home).

Speech Technologies

As with any other user interface, a speech interface has two components:
input and output (or recognition and synthesis). The two
technologies are closely related, sharing techniques, algorithms
and data models. Speech has long been a popular computing
research topic, and I can't cover all the work here, but I take
a quick look at some different approaches, investigate some open-source
implementations and settle on input and output packages that seem well
suited for embedded applications. You don't have to be an expert in
computer speech (I certainly don't claim to be) to speech-enable your
embedded application.

Speech Synthesis or Text-to-Speech (TTS)

Naïvely, you might think “What's so hard about speech
synthesis?” You
envision a hashmap with English words as the keys and speech utterances
as the values. But, it's not that easy. Any nontrivial TTS system needs
to be able to understand things like dates and numbers that are embedded
in the text and utter them properly. And, as any first-grader can tell
you, English is full of words whose pronunciation is context-dependent
(should “lead” be pronounced as rhyming with “reed” or
“red”?). We also vary the pitch of our
voices as we come to the end of a sentence or question, and we pause between
clauses and sentences. Together, these variations in pitch and timing are
called the prosody of the speech.
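
To see where the hashmap idea breaks down, here is a minimal sketch in Python. The table and its .wav labels are made up for illustration; no real TTS package works this way. The lookup has no way to tell that “lead” in “lead pipe” should rhyme with “red”, and it has nothing at all to say about a bare number.

```python
# Hypothetical word-to-sound table; the "sounds" are just labels here.
UTTERANCES = {
    "take": "<take.wav>",
    "the": "<the.wav>",
    "lead": "<lead-rhymes-with-reed.wav>",  # a hashmap can hold only one value
    "pipe": "<pipe.wav>",
    "out": "<out.wav>",
}

def naive_tts(text):
    """Look up each whitespace-separated word; None marks a miss."""
    return [UTTERANCES.get(word.lower().strip(".,?!")) for word in text.split()]

# Every word is found, but "lead" gets the wrong pronunciation for "lead pipe".
print(naive_tts("Take the lead pipe out."))

# The digit has no entry, so the table returns None for it.
print(naive_tts("Take 2 out."))  # ['<take.wav>', None, '<out.wav>']
```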

A lot of smart people have thought this over and have come up with a basic
architecture for TTS:

A front end to analyze the text, replace dates, numbers and abbreviations
with words, and emit a stream of phonemes and prosodic units that describes
the utterance.

A back end, or synthesizer, that takes the utterance stream and converts it
to sounds.
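
The split between the two halves can be sketched as a pair of Python functions that share one data type. Everything here is an assumption made for illustration: the Unit fields, the two-word lexicon, the phoneme symbols and the crude prosody rule (drop the pitch and lengthen the pause on the last word). The back end is only a stand-in that describes each unit rather than producing audio.

```python
from dataclasses import dataclass

@dataclass
class Unit:
    phoneme: str       # symbol from a hypothetical phoneme set; "_" is a pause
    duration_ms: int   # prosody: how long to hold the sound
    pitch_hz: float    # prosody: target pitch (0 for a pause)

# Hypothetical two-word lexicon; a real front end needs far more than this.
LEXICON = {"check": ["CH", "EH", "K"], "list": ["L", "IH", "S", "T"]}

def front_end(text):
    """Turn text into a stream of phoneme and prosody units."""
    units = []
    words = text.lower().rstrip(".").split()
    for i, word in enumerate(words):
        last_word = i == len(words) - 1
        for phoneme in LEXICON[word]:  # raises KeyError for unknown words
            units.append(Unit(phoneme, 80, 100.0 if last_word else 120.0))
        # Short pause between words, longer pause at the end of the sentence.
        units.append(Unit("_", 200 if last_word else 40, 0.0))
    return units

def back_end(units):
    """Stand-in synthesizer: describe each unit instead of producing audio."""
    return [f"{u.phoneme}:{u.duration_ms}ms@{u.pitch_hz:.0f}Hz" for u in units]

print(back_end(front_end("Check list.")))
```

The point of the shared Unit stream is that either half can be swapped out without touching the other.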

Much of the front end's work is text normalization, which is not an easy
problem. It's one of those pattern-recognition tasks that humans do easily and
computers have a difficult time mimicking. The algorithms used vary
from simple (word substitution) to complex (statistical hidden Markov
models). For applications where the text to be spoken is relatively
fixed (like our checklist), most TTS systems provide a way of marking
up the text to give the normalizer hints about how it should be spoken
(and there is a W3C standard, the Speech Synthesis Markup Language, or
SSML, for doing so; see Resources).
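
As a rough illustration of the simple end of that range, here is a word-substitution normalizer in Python, followed by a helper that wraps checklist items in SSML. The abbreviation table, the 0 to 99 number limit and the function names are my own assumptions. The speak, s and break elements come from the W3C SSML specification, but check which elements your TTS engine actually honors before relying on them.

```python
import re
from xml.sax.saxutils import escape

# Illustrative table; note "Dr." could also mean "Drive", which simple
# substitution cannot resolve.
ABBREVIATIONS = {"dr.": "doctor", "no.": "number", "approx.": "approximately"}
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty",
        "sixty", "seventy", "eighty", "ninety"]

def number_to_words(n):
    """Spell out an integer from 0 to 99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("-" + ONES[ones] if ones else "")

def normalize(text):
    """Replace known abbreviations and one- or two-digit numbers with words."""
    words = []
    for token in text.split():
        if token.lower() in ABBREVIATIONS:
            words.append(ABBREVIATIONS[token.lower()])
        elif re.fullmatch(r"\d{1,2}", token):
            words.append(number_to_words(int(token)))
        else:
            words.append(token)  # anything else passes through untouched
    return " ".join(words)

def to_ssml(items, pause="500ms"):
    """Wrap each item in a sentence element, with a pause after each one."""
    body = "".join(f'<s>{escape(item)}</s><break time="{pause}"/>'
                   for item in items)
    return ('<speak version="1.0" xmlns="http://www.w3.org/2001/10/synthesis" '
            f'xml:lang="en-US">{body}</speak>')

print(normalize("Dr. Gawande counts 42 sponges"))
# doctor Gawande counts forty-two sponges
print(to_ssml(["Confirm patient identity", "Count sponges"]))
```

Dates, currency, a number followed by punctuation and anything else outside the table all slip through unchanged, which is exactly why real normalizers grow so complicated.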

A variety of schemes have been developed to build speech synthesizers. The
two most popular seem to be formant synthesis and concatenation.

Formant synthesizers can be quite small, because they don't actually store
any digitized voice. Instead, they model speech with a set of rules and
store time-based parameters for models of each phoneme. The prosodic
aspects of speech are relatively easy to introduce into the models, so
formant synthesizers are noted for their ability to mimic emotions. They
also are noted for sounding “robotic” while remaining very
intelligible. For our chosen application, intelligibility is more important
than “naturalness”.
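
To make the idea concrete, here is a minimal sketch of a formant-style vowel in Python: a train of pulses at the voice pitch is passed through a cascade of second-order resonators, one per formant, so nothing is stored except a few numbers. This is not how any particular package is implemented. The formant frequencies are approximate textbook values for an adult male “ah”, and the bandwidths, sample rate and function names are assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 8000  # samples per second (assumed; enough for three formants)

def resonator(signal, freq_hz, bandwidth_hz):
    """Second-order digital resonator: y[n] = a*x[n] + b*y[n-1] + c*y[n-2]."""
    t = 1.0 / SAMPLE_RATE
    c = -math.exp(-2.0 * math.pi * bandwidth_hz * t)
    b = (2.0 * math.exp(-math.pi * bandwidth_hz * t)
         * math.cos(2.0 * math.pi * freq_hz * t))
    a = 1.0 - b - c
    out, y1, y2 = [], 0.0, 0.0
    for x in signal:
        y = a * x + b * y1 + c * y2
        out.append(y)
        y1, y2 = y, y1
    return out

def synthesize_vowel(formants, pitch_hz=110, duration_ms=300):
    """Excite a cascade of formant resonators with a pulse train."""
    n = SAMPLE_RATE * duration_ms // 1000
    period = SAMPLE_RATE // pitch_hz  # samples between glottal pulses
    signal = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    for freq_hz, bandwidth_hz in formants:
        signal = resonator(signal, freq_hz, bandwidth_hz)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]  # normalize to the range -1.0..1.0

# (frequency, bandwidth) pairs in Hz: approximate, illustrative values.
AH = [(730, 90), (1090, 110), (2440, 170)]
samples = synthesize_vowel(AH)
print(len(samples))  # 2400
```

Changing pitch_hz over the course of the utterance is all it takes to add a rising or falling intonation, which is why prosody is comparatively easy with this approach. To hear the result, the samples could be scaled to 16-bit integers and written out with Python's standard wave module.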

Concatenative synthesizers have a database of speech snippets that are
strung together to create the final sound stream. The snippets can be
anything from a single phoneme to a complete sentence. Concatenative
synthesizers are known for natural-sounding speech, although the technique
can produce speech with distracting glitches where snippets join, which can
interfere with intelligibility, particularly at higher speeds. They also are
typically larger than formant synthesizers, due to the large database
required for a large vocabulary. The database can be minimized if the TTS
is for a domain-specific application, but, of course, that limits the
synthesizer's usefulness outside that domain.
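
A toy version of the concatenative approach might look like the following Python sketch. The snippet database here holds short numeric ramps instead of digitized speech, and the linear crossfade at each join is one simple, assumed way of softening the glitches mentioned above; real systems pick and smooth their units far more carefully.

```python
# Hypothetical snippet database: unit name -> recorded samples.
# Real systems store digitized speech; short ramps stand in for it here.
SNIPPETS = {
    "check": [0.1 * i for i in range(10)],
    "list": [1.0 - 0.1 * i for i in range(10)],
}

def concatenate(units, overlap=4):
    """String snippets together, crossfading `overlap` samples at each join."""
    out = []
    for name in units:
        snippet = SNIPPETS[name]
        n = min(overlap, len(out), len(snippet))
        start = len(out) - n
        for i in range(n):
            w = (i + 1) / (n + 1)  # weight ramps toward the new snippet
            out[start + i] = (1 - w) * out[start + i] + w * snippet[i]
        out.extend(snippet[n:])
    return out

speech = concatenate(["check", "list"])
print(len(speech))  # 16: two 10-sample snippets sharing a 4-sample overlap
```

Even in this toy, the storage cost is plain: every word the application can say needs its own entry in SNIPPETS, which is why a small, fixed vocabulary keeps the database manageable.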

Rick Rogers has been a professional embedded developer for more than 30 years and now specializes in mobile application software. When Rick isn't writing software for a living, he's writing books and magazine articles like this one.