According to his Web site, Christopher Schmandt loves nature. Maybe that's
why he's been so deeply committed to natural forms of communication. Schmandt,
the principal research scientist and director of the Speech Interface Group
at the renowned MIT Media Lab for more than 15 years, remembers when dedicated
speech recognition systems cost $70,000 and understood about 120 words. We recently
asked him whether speech technologies finally have reached a commercially viable
point, what exactly they're good for, and whether we'll ever use our voices
to surf the Web on mobile phones.

What kinds of applications are ideal for speech recognition technology?
Speech recognition is great for situations in which people, interacting over
the telephone, have a limited number of options and a pretty good idea of what
they want to say. For example, United Airlines uses a speech recognition system
to help customers submit lost-baggage claims, check flight status, and perform
other basic customer service tasks.
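
(As a rough illustration of that kind of constrained interaction, here is a
minimal Python sketch; the phrases and intents are invented for the example
and are not United's actual grammar.)

    # Toy constrained-vocabulary phone menu: recognition works here because
    # callers can only say a few known things. Phrases/intents are hypothetical.
    INTENTS = {
        "lost baggage": "file_baggage_claim",
        "flight status": "check_flight_status",
        "agent": "transfer_to_agent",
    }

    def resolve(utterance):
        """Map recognized speech to an intent, or None if out of grammar."""
        text = utterance.lower()
        for phrase, intent in INTENTS.items():
            if phrase in text:
                return intent
        return None  # out-of-grammar input; the caller should be reprompted

    print(resolve("I need to check my flight status"))  # check_flight_status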

It seems as if speech recognition is always just on the verge of commercial
viability. What do you think is behind that perception?
It's partly because vendors make claims that are overblown. And there are simply
lots of misconceptions associated with speech recognition. People think it means,
"Oh, I can say anything," which is not the case. Computers just aren't
that smart. Whenever I'm asked whether speech recognition is commercially viable,
I ask, "For what application? Tell me what you want to do with speech recognition
and how you want to do it?" The technology has been as good [or bad] as
it is now for at least five years, but the turning point for its commercial
acceptance has been the rise of the mobile phone. Voice dialing, telephony,
and accessing Web information with a cell phone are all driving the market.

That sounds very different from the notions of talking to your PC to run
applications by voice command or dictating a letter. Are those ideas passé now?
More or less. Fewer people are computer-naive, which is what some of that work
was designed to address. The Web and e-mail have had a big impact there. But
that is different from dictation products, some of which actually do a fairly
passable job. These products are good for people who have difficulty typing,
or in certain niches such as medical transcription. But the mobile
phone is really where the action is for speech recognition.

Aren't there lots of inherent challenges to using speech recognition over
a cell phone?
There are two big challenges with the telephone: First, you're missing some
of the cues that make it easier for us to understand each other when we're
face to face. Second, with mobile phones, you have lousier networks that either
drop signals or compress calls. The combination of the mobile-phone boom and
the Web, however, has led to work by vendors such as SpeechWorks, a company
that is getting more speech recognition out there so customers can deploy applications
over the phone, as United has done.

But how are United Airlines' and other companies' systems different from
plain old Touch-tone or IVR systems?
It's actually the same thing; you're just using speech instead of Touch-tone,
which has its own problems, by the way. First, there's the question of whether
the person has a Touch-tone phone, which some people actually still don't have.
A Touch-tone system is also hard to use with a mobile phone, because it's tough
to listen and press buttons at the same time. But voice is definitely tougher
to use in a noisy environment.

What is a speech interface and how do you build a good one?
Simply put, a speech interface lets people access applications and information
using just their voices. Much of building a good interface involves looking
at mistakes that can happen in the recognition of words. And it's hard. It's
even harder than building a PC interface, because PC interfaces often look and
work alike. But what's the speech equivalent of a Windows dialog box?

Are there any guidelines?
There are. For example, you assume people are going to make errors, and you
have to look at how long the transaction takes to perform, because people get
impatient. You can lose track of these kinds of things when you're building
the applications. Also, you have to consider what happens when users say things
you don't expect. It's hard to know what words people are going to use, so guiding
them with prompts is part of designing the interface.
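
(Those guidelines translate fairly directly into code. A minimal sketch of
such a dialog loop, with invented retry and timeout values:)

    import time

    MAX_RETRIES = 2    # assume callers tolerate only a couple of errors
    MAX_SECONDS = 30   # bound the whole transaction; people get impatient

    def ask(prompt, recognize, valid_options):
        """Prompt the caller, tolerate misrecognitions, reprompt with guidance."""
        start = time.monotonic()
        for _ in range(MAX_RETRIES + 1):
            if time.monotonic() - start > MAX_SECONDS:
                return None              # e.g., hand off to a human operator
            heard = recognize(prompt)    # recognizer returns text or None
            if heard in valid_options:
                return heard
            # Guide users who said something unexpected: list what they CAN say.
            prompt = "Sorry. You can say: " + ", ".join(valid_options)
        return None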

Some people say that building a top-notch speech interface requires lots of
people to use it and test it. Is this true?
Of course. If you're planning to field-test a system, you have people use it,
and then you listen in on phone calls and analyze how the system responds versus
what people actually say. This is a necessary step, because the people using
it are not the same people who are developing it.
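
(That analysis can be as simple as comparing the recognizer's logged output
against human transcripts of the same calls. A toy version, with an invented
log format:)

    # Compare what the system heard with what callers actually said.
    calls = [
        {"recognized": "flight status", "actual": "flight status"},
        {"recognized": "lost baggage",  "actual": "lost bag"},
        {"recognized": None,            "actual": "talk to a person"},
    ]

    matches = sum(1 for c in calls if c["recognized"] == c["actual"])
    print("recognition accuracy: %d/%d" % (matches, len(calls)))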

What do you think of the concept of a voice Web, where you use your voice to
surf from site to site?
There's just no way. Now, if you have a particular page that you go to all the
time (I go to CNN for news a lot), maybe your voice browser could ask if you want
to go to CNN. What I'm saying, and I think the existence of companies like TellMe
and others proves this, is that limited speech recognition vocabularies, where
the user knows what to say, actually do work. But what's going to drive this
into the mainstream market? Will it be people voice dialing from their address
books, or will it be people looking for random information on the Web? Personally,
I think it's going to be the former.

So what do the next couple of years have in store for speech recognition?
You will see people talking to machines more, and you'll see more voice dialing
on mobile phones. For years, the public has been told that these services are
just around the corner. But now with companies like Sprint rolling out voice
dialing, you're actually going to see people warming up to the idea.