Are You Talking to Me? Speech on Mac OS X

Editor's Note -- Apple's recent announcement of Spoken
Interface has moved speech recognition to the forefront. However,
Mac OS X has included speech recognition and synthesis technologies
for quite some time, and in this article we delve into the often misunderstood
world of talking to your Mac.

The Early Speech Days

The documentation provided by Apple states that the Speech Manager --
the component that takes care of piping the text into the Speech Synthesizer
-- was first introduced in 1993. Once again, this shows how innovative
Apple can be. Computers of that time were very different from what we
know today, and adding speech capabilities to a consumer product --
even thinking about it -- was a real breakthrough, if not a somewhat crazy idea.

The Rebirth of Cool

However, Speech was truly born again with the introduction of Mac OS
X, and especially with the two latest releases, Jaguar and Panther. The
new audio capabilities of Mac OS X, along with Apple's renewed commitment
to this amazing technology, have combined to produce what is widely
considered the most convenient and advanced speech technology in the field.

Therefore, if you have tried and abandoned Speech during the last century
-- the Mac OS 9 days, in other words -- give it another try.

Users who got used to the voice verification feature in Mac OS 9 (the vocal
password) should not despair. It is not currently built into Mac OS
X, but nothing prevents Apple from adding it again if it is widely
requested. The feature actually worked well and can be considered quite
secure since, even if attackers know your pass phrase, they cannot "borrow"
your voice. And no, recordings of your voice won't fool the system.

The Goals of Speech

When Apple began to build speech into the Mac OS, it formed a team
composed of some of the world's leading speech and language scientists,
aiming to bring the user-computer interaction mechanisms to a whole
other level.

The Speech technology is in fact built from two parts: a speech synthesizer
that your Mac can use to communicate with you -- to read text on demand,
but also to keep you informed about the status of a process -- and a
speech-recognition technology that allows you to talk to your Mac and
send it the commands you usually issue with a keyboard and mouse.

Since Speech is built in right at the core of Mac OS X, there is no need
to install special applications or devices to make it work. Although
the details of how a specific application reacts to spoken commands are
up to the developer, any Mac OS X application can, to a certain extent,
be controlled by voice.

This tight integration is mainly due to the development tools and
interface elements that Apple provides to developers. Because Apple builds
speech-control capabilities into the standard elements produced by
Interface Builder, for example, all the applications built with that
tool can be controlled using standard commands.

Of course, you can at any time add your very own commands to the speech
recognition engine -- more on that later. There is, however, no need
to worry: Apple ships Mac OS X with a predefined set that will allow
you to perform the most common tasks -- browse the Web, check your
email, etc. -- right out of the box.
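To make the idea of a command set concrete, here is a minimal sketch -- in Python rather than the C API Apple actually exposes -- of how recognized phrases might be mapped to actions. Every name and handler here is invented for illustration.

```python
# Hypothetical sketch of a speakable-command registry: spoken phrase -> action.
# None of these names correspond to real Apple APIs.

def get_mail():
    # Stand-in for telling a mail client to fetch messages.
    return "checking mail"

def open_browser():
    # Stand-in for launching a web browser.
    return "opening browser"

# The predefined command set: recognized phrase -> handler function.
commands = {
    "get my mail": get_mail,
    "open my browser": open_browser,
}

def handle_utterance(utterance):
    """Dispatch a recognized utterance to its handler, if one is registered."""
    handler = commands.get(utterance.lower().strip())
    return handler() if handler else None
```

A real command set would trigger Apple events or scripts rather than return strings, but the lookup-and-dispatch shape is the same, and adding your own command amounts to adding an entry to the table.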

What Can it Do for Me?

With a bit of practice, you will be able to forget about your keyboard
and mouse and do much of what you already do while leaning back in your
chair, reducing the risk of repetitive strain injuries. In fact, several
HR departments now encourage people to use these features whenever
possible for exactly this reason.

Speech can also, when used along with more traditional input devices,
make your computing experience more productive and enjoyable. If
you want to check your mail while working on an important report for
your boss, you do not need to stop what you're doing. Simply say "Get
my mail" and let your Mac do the work for you.

Of course, Speech is also very handy for users with disabilities since
it allows them to interact with their computer without having to ask
for assistance. Thanks to Speech, the Mac has become the computer of
choice for visually impaired users who can enjoy quality voices and
excellent voice recognition. Indeed, the feedback provided by Speech
can allow a user who does not see the screen to determine whether the command he gave to the computer took effect or not and what
the status of the request is.

There is also, let's face it, a "coolness" factor that will convince
many users to turn Speech on. But before doing so, you should be warned
that Speech is highly addictive!

Why Isn't Speech More Successful?

When asked, most Mac users will tell you that they have tried Speech,
asked the computer to give them the time, then turned it off because
they did not see it as valuable. Usually they thought the voice recognition
was unreliable or the voices used by the computer weren't pleasant.

As with any technology, there is a learning curve to climb before you can
really master it and feel comfortable speaking with your computer. After
all, this is a brand new way of interacting with a machine, and you may
need a few hours before you can relax and speak normally.

Voices are also computationally expensive and, up until recently, many
computers couldn't deal with extremely complex, natural-sounding voices.
The good news is that the incredible computing power packed in the
latest Macs allows the Speech team to release increasingly natural-sounding
voices and speech synthesizers, making the interaction with a computer
even more pleasant. This will be especially noticeable for Panther users.

No synthetic voice sounds perfectly natural. Keep in mind that the specialized
speech-synthesis technologies on which some phone systems rely are heavily
trained for a narrow domain. Ask your virtual reception desk to pronounce
the word "asteroids" and you will probably hear the most unnatural voice
ever. Your Mac, by contrast, can pronounce any word you give it in a natural
way. Developers can go even further and use the various tools Apple
puts at their disposal to fine-tune the speech synthesis in their
applications. Few take the time to do that right now, but when they
do, the results are striking.

Overall, the quality of voices has increased over time. For example,
Vicki, the new default Panther voice, weighs in at 27.6 MB, compared
to the roughly 1.5 MB that older voices used to occupy.

The Speech Synthesizer has also evolved a lot and is now able to distinguish
common abbreviations and to add emphasis to long sentences and paragraphs
automatically, making speech sound much more natural. This is especially
noticeable when it reads long text documents aloud. The voice is now much
livelier and more lifelike, since it better duplicates the emphasis a
real-life speaker would put on different parts of the text.
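The abbreviation handling just described can be pictured as a text-normalization pass that runs before any sound is generated. The toy sketch below is not Apple's rule set; the table and function name are invented for illustration.

```python
# Toy text-normalization pass of the kind a synthesizer front end performs:
# expand common abbreviations before the text is converted to phonemes.
# This table is illustrative, not Apple's actual rule set.
ABBREVIATIONS = {
    "Dr.": "Doctor",
    "St.": "Street",
    "etc.": "et cetera",
}

def normalize(text):
    """Replace known abbreviations with their spoken form."""
    for abbr, full in ABBREVIATIONS.items():
        text = text.replace(abbr, full)
    return text
```

A real front end also weighs context -- "Dr." can mean Doctor or Drive, "St." Street or Saint -- which a flat lookup table like this cannot capture.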

How Does Speech Work?

Understanding how Speech works can provide you with valuable information
to better take advantage of this technology. In this part, I will try
to provide you with an in-depth look at the speech recognition engine
as well as answer a few basic structural questions.

Speech Managers and Synthesizers

The process that converts a text string into the sound that comes out
of your speakers can be roughly divided into four steps:

1. The application passes a string or buffer of text to the Speech
Manager. The developer may choose to give additional instructions
along with the text to alter the way it is pronounced and make it
even more natural.

2. The Speech Manager pipes the strings that need to be read into the
Speech Synthesizer. In itself, it does not do any sound-processing
work; instead, it provides developers with an easy way to interact
with Speech.

3. The Speech Synthesizer, sometimes referred to as the "speech engine,"
accepts the data from the Speech Manager and takes care of converting
it into audible speech. To do that, it relies on dictionaries, sets of
rules and exceptions, and an understanding of the context in which the
phrase to speak is placed. It also alters the way it generates the sound
according to the commands that were passed along with the text.

4. The information is then passed to the Sound Manager, which takes
care of communicating with the audio hardware so that you can hear
the voice.
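The four steps above can be sketched as a toy pipeline. This is a Python simulation for illustration only: the real Speech Manager is a C API, and the "samples" here are fake numbers standing in for audio.

```python
# Toy model of the four-stage pipeline. All class and method names are
# illustrative; they do not correspond to real Apple APIs.

class SpeechSynthesizer:
    """Stage 3: turns text into 'audio' (here, just a list of fake samples)."""
    def synthesize(self, text, rate=1.0):
        # A real engine applies dictionaries, rules, and context here.
        return [ord(c) * rate for c in text]

class SoundManager:
    """Stage 4: hands samples to the audio hardware (here, just collects them)."""
    def __init__(self):
        self.played = []
    def play(self, samples):
        self.played.extend(samples)

class SpeechManager:
    """Stage 2: the thin layer applications talk to; does no audio work itself."""
    def __init__(self, synthesizer, sound_manager):
        self.synthesizer = synthesizer
        self.sound_manager = sound_manager
    def speak_string(self, text, **options):
        samples = self.synthesizer.synthesize(text, **options)
        self.sound_manager.play(samples)

# Stage 1: the application passes a string, plus optional instructions.
sound = SoundManager()
manager = SpeechManager(SpeechSynthesizer(), sound)
manager.speak_string("Hi", rate=1.0)
```

The division of labor matters: applications only ever see the manager's simple interface, which is why any synthesizer engine can be swapped in behind it.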

As a general rule, the more RAM and processing power your computer has,
the better the voice will sound. This is because a Speech Synthesizer
heavily relies on your computer's resources to perform its calculations.
Of course, any modern Mac is able to speak perfectly, but do not expect
your old Performa to do as well as a PowerMac G5.

What's a Voice Anyway?

A voice is a set of characteristics, defined by parameters, that specify
a particular quality of speech. Voices are like natural voices -- each
is different and, from its characteristics, you can guess the age and
sex of the speaker. Voices can talk slower or faster, but in the end
you cannot change their base characteristics -- just as you can alter
your own voice but never entirely change it. Panther comes with 22
voices, but you can theoretically add more if you like.

Indeed, the Speech architecture is very flexible and, as years go by,
more and more developers have created add-ons for it, to extend its
capabilities and provide even more natural-sounding voices.

What Makes Speech Better than the Competition?

Do you remember when, at the beginning of this article, I told you that
you may have in fact used Speech for months without knowing it? That's
because the work going on with the Speech group at Apple doesn't stop
at making voices and recognizing what you say. Far from it!

Indeed, for the Apple team speech cannot be distinguished from
language and there can be no reliable speech technology without a good
understanding of the language that is spoken. That's why Apple developed
a complex set of rules that allows your Mac to truly analyze the text
before it is spoken. Of course, Speech relies on a 121,000-word dictionary
that tells it how the most common words are pronounced ... but what about
the others? What about the context in which these words are placed?
While some other technologies don't care, your Mac does.
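The dictionary-plus-rules approach can be illustrated with a toy lookup: check a pronunciation dictionary first, then fall back to letter-to-sound rules for unknown words. The entries and phoneme codes below are invented for the example; Apple's 121,000-word dictionary and rule set are far richer.

```python
# Toy pronunciation lookup: dictionary first, crude fallback rules second.
# Entries and phoneme codes are invented for illustration.
DICTIONARY = {
    "colonel": "K ER N AH L",   # a word no naive rule would get right
    "cache": "K AE SH",
}

# Naive one-letter-one-sound fallback (real rules consider letter groups
# and context, which is the hard part).
LETTER_SOUNDS = {"c": "K", "a": "AE", "t": "T"}

def pronounce(word):
    """Return a phoneme string for a word, via dictionary or fallback rules."""
    word = word.lower()
    if word in DICTIONARY:
        return DICTIONARY[word]
    return " ".join(LETTER_SOUNDS.get(ch, ch.upper()) for ch in word)
```

The example shows why both pieces are needed: "colonel" defeats any letter-by-letter rule, while no dictionary can list every word you might type.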

This very same technology allows Mail's Junk Mail feature to reach 98%
accuracy when properly trained, and serves as the basis for the Japanese
input method. If, like millions of Mac users, you have wondered how Mail
does its magic, keep in mind the phrase "adaptive latent semantic analysis."
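For the curious, here is a generic latent semantic analysis sketch -- not Apple's adaptive variant -- showing the core idea behind that phrase: factor a term-document matrix with SVD and compare documents in a reduced "concept" space. It assumes NumPy is available.

```python
import numpy as np

# Generic LSA sketch (not Apple's adaptive variant). Example messages are
# invented for illustration.
docs = [
    "cheap pills buy now",        # junk
    "buy cheap pills online",     # junk
    "meeting agenda for monday",  # legitimate
]
vocab = sorted({w for d in docs for w in d.split()})

# Term-document matrix: rows are terms, columns are documents.
A = np.array([[d.split().count(t) for d in docs] for t in vocab], float)

# Factor with SVD and keep only the k strongest "concepts".
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
doc_vecs = (np.diag(s[:k]) @ Vt[:k]).T   # one reduced vector per document

def similarity(i, j):
    """Cosine similarity between documents i and j in concept space."""
    a, b = doc_vecs[i], doc_vecs[j]
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```

In this space, the two junk messages land close together while the meeting note lands elsewhere, which is the essence of how a trained filter separates junk from legitimate mail.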

However, although this attention to the context in which a word or phrase
is spoken is essential, the end user is more likely to notice something
even more appealing: the speech recognition is speaker independent and
does not require any training.

In other words, you do not need to read predefined text for hours to
allow Speech to get used to you or your environment. It also means you
can switch between Macs freely, since there is no system to retrain.

In much the same way, Speech can adapt itself to very diverse environments
and is able to cancel out background noise. There is therefore no need
to pay much attention to your surroundings, as long as the background
noise stays constant -- think of a restaurant where all the background
conversations mix into a relatively constant hum.

Thanks to this flexibility, Speech does not require any additional hardware.
Of course, Speech addicts may wish to purchase a headset to further
increase recognition accuracy in hostile environments, but this really
isn't needed as long as you plan to use Speech in regular conditions,
such as a room of reasonable size with no strong echo -- an office,
your living room, or the company cafeteria, as opposed to an empty
lecture hall, an underground cave, or an acoustic rock concert. It also
means you do not have to wear the special noise-canceling headsets some
other vendors provide; these are nice, but in many cases they are
impractical and provide little real help.

However, to truly understand what makes the difference, we need to get
a bit geeky and see in-depth how Speech works.