Introduction to VoiceXML and Voice Services

Get a gentle introduction to the world of computerized telephone applications. This chapter from "Definitive VoiceXML" will show the motivation behind VoiceXML as a clean, enterprise-computing-friendly, open standards technology that unifies the many aspects of the voice application world.

This chapter is from the book

VoiceXML is an XML document type for authoring voice-driven audio
applications. VoiceXML is being used in applications such as voice portals,
where automated voice services can be accessed over the phone, and in
non-phone-based voice applications, which are emerging in embedded home
appliances and automobiles.
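
To make "an XML document type" concrete, here is a minimal sketch of what a VoiceXML document can look like. It assumes VoiceXML 2.0 syntax (the `vxml`, `form`, `block`, and `prompt` elements); the greeting text and the helper name `list_prompts` are invented for illustration. The document is held in a Python string only to show that any ordinary XML parser can read it.

```python
import xml.etree.ElementTree as ET
from typing import List

# A minimal VoiceXML 2.0-style document. The prompt text is illustrative only.
VXML_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<vxml version="2.0" xmlns="http://www.w3.org/2001/vxml">
  <form>
    <block>
      <prompt>Welcome to the flight information line.</prompt>
    </block>
  </form>
</vxml>
"""

# Namespace prefix in the form ElementTree uses for qualified tag names.
VXML_NS = "{http://www.w3.org/2001/vxml}"


def list_prompts(document: str) -> List[str]:
    """Parse the document as ordinary XML and return the text of each prompt."""
    root = ET.fromstring(document.encode("utf-8"))
    return [p.text.strip() for p in root.iter(VXML_NS + "prompt") if p.text]


if __name__ == "__main__":
    print(list_prompts(VXML_DOC))
```

A real voice platform would interpret the document and speak the prompt to the caller; the parser here only demonstrates that the dialog is plain, inspectable XML.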

One of the most common voice applications, and one familiar to many
of us, is the Interactive Voice Response (IVR) application: an automated
service that you can call to check your bank accounts, the status of a
package, or the estimated arrival time of a flight, all without speaking to
a human. VoiceXML is changing the way we interact with these and other voice
applications.

The remainder of this chapter will gently introduce you to the world
of computerized telephone applications. It will show the motivation behind
VoiceXML as a clean, enterprise-computing-friendly, open standards technology
that unifies the many aspects of the voice application world.

1.1 Voice services and applications

Before diving into VoiceXML, we first need to understand the current
"state of the art" in voice application technology: the
"pre-VoiceXML" voice application landscape.

1.1.1 Traditional applications

Voice applications have traditionally been hosted on a machine called an IVR.
This machine is typically a special purpose computer outfitted with telephony
hardware, possibly speech processing hardware and software, and some sort of a
dialog engine that executes the call-flow logic.

Pre-VoiceXML dialog engines may or may not be particularly
programmable. For example, a voice-mail system is a simple IVR application, but
its call flow is more or less "hard-wired," only permitting callers to
leave voice messages and users to check voice messages. You do not need to
program your voice-mail system to do anything more than this.

As businesses started to realize the potential of providing phone
access to customer data, the programmable IVR system was born. This is an IVR
system where the call flow is completely programmable. There would typically be
some high-level scripting language for defining a call flow, and often a
low-level API so that application programmers could write more complex
applications, for example applications that perform database lookups.
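
As a rough illustration of what "programmable call flow" means, the sketch below models a tiny balance-inquiry dialog as a state machine with a back-end lookup. It is not the scripting language of any actual IVR product; the state names, the prompts, and the `ACCOUNTS` table (standing in for a database) are all hypothetical.

```python
# Hypothetical account data standing in for a back-end database.
ACCOUNTS = {"1234": 250.75, "5678": 12.00}


def run_call_flow(keypad_inputs):
    """Drive a tiny balance-inquiry call flow from a list of caller inputs.

    Returns the prompts the caller would hear, in order.
    """
    inputs = iter(keypad_inputs)
    heard = []
    account = None
    state = "ask_account"
    while state != "done":
        if state == "ask_account":
            heard.append("Please enter your account number.")
            account = next(inputs, None)
            if account is None:
                state = "goodbye"      # no more input: treat as a hang-up
            elif account in ACCOUNTS:
                state = "say_balance"  # the "database lookup" succeeded
            else:
                state = "retry"
        elif state == "retry":
            heard.append("Sorry, that account was not found.")
            state = "ask_account"
        elif state == "say_balance":
            heard.append(f"Your balance is {ACCOUNTS[account]:.2f} dollars.")
            state = "goodbye"
        elif state == "goodbye":
            heard.append("Goodbye.")
            state = "done"
    return heard


if __name__ == "__main__":
    for prompt in run_call_flow(["9999", "1234"]):
        print(prompt)
```

Each vendor's scripting language expressed this kind of logic in its own way, which is part of why applications were hard to move between platforms.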

Programmable IVRs had their problems, however. For one, they were
very difficult to program. Since the call flow scripting languages
were usually vendor-specific, each vendor had to reinvent the wheel in
designing the language and the tools to help voice application developers write
applications in this language. Developing software to interact with these
programmable IVRs was even more difficult, as the APIs were often very low-level
or arcane.

Another serious problem was the proprietary nature of these systems.
This made it nearly impossible to move an application from one IVR vendor
platform to another. This culture of "platform lock-in" tended to put
the equipment, application development, and deployment costs into a range where
only the largest call centers could afford to consider sophisticated IVR
services.

1.1.2 Emerging voice services

As IVR platforms evolved, businesses increasingly incorporated this sort
of technology into their customer relationship management (CRM) efforts. This
trend can be understood if you consider:

Since the 1980s almost every business has implemented its information
system using computers.

While the popularity of the Internet as a way for customers to interact with
businesses exploded in the 1990s, call center services remained of paramount
importance. Consider that there are at least three times as many phone users
worldwide as PC users.

If services that are appearing on the Web, such as information retrieval,
stock quotes, product purchases, transactions, booking, bidding, and brokering,
can be performed over the phone with the same ease and reliability, then
interacting with the phone through voice is clearly the simplest interface:
it requires no software downloads to the handset!

Also, as devices become smaller with increasing chip-technology
densities, display and keypad real estate is increasingly constrained.
Voice-based interfaces offer a solution.

Unfortunately, not all applications are well suited for voice
interfaces. Large documents and multiple views are some of the features that are
difficult to achieve with voice. On the other hand, there are certain tasks that
voice is better suited for, such as saying the name of the restaurant you are
looking for rather than typing it in. The challenge is to make new applications
that utilize voice dialogs.

As of the writing of this book, we see rapid growth in the
voice-service industry. Most airlines, banks, shipping companies, and other
time-sensitive businesses provide some sort of automated telephone support or
information service. In addition, new breeds of services are emerging, such as
"voice portals," which provide a service analogous to that of a Web
portal. Also, natural dialog systems are starting to make inroads into the area
of customer support. The fact that these services are accessible from wired
phones, mobile phones, and other emerging wireless devices makes them
ubiquitous.

1.1.3 Enabling technologies

While IVR technology has evolved slowly over the past couple of decades, the
voice-application world has seen accelerated growth, mostly due to the coming of
age of some core enabling technologies. These include:

Automatic Speech Recognition (ASR)

In the past couple of years, improvements in speech-recognition algorithms,
combined with the explosion in available computing power, have made speech
recognition a realistic, deployable technology. Early problems with speech
recognition, including speaker dependence and small vocabularies, have largely
been solved.

Text-To-Speech (TTS)

Just as the technology for recognizing human speech has taken a quantum leap,
so has the technology for synthesizing human speech. TTS technologies now sound
much more life-like, increasing their understandability and their acceptance in
the mass-consumer market. The emergence of TTS technologies has enabled the
development of much more dynamic voice applications because, unlike traditional
IVR applications where every audio response must be pre-recorded, a TTS system
can generate speech responses on the fly.
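
The difference between pre-recorded and generated responses can be sketched as follows. This is only an illustration under stated assumptions: the file names, message IDs, and both function names are hypothetical, and no actual speech engine is invoked. The second function merely composes the text that would be handed to a TTS system.

```python
# Hypothetical catalog of pre-recorded audio files (traditional IVR style).
RECORDED_PROMPTS = {
    "welcome": "welcome.wav",
    "flight_on_time": "flight_on_time.wav",
}


def recorded_response(message_id):
    """Return the audio file for a fixed message.

    Raises KeyError for any message that was never recorded.
    """
    return RECORDED_PROMPTS[message_id]


def tts_response_text(flight, minutes_late):
    """Compose, at call time, the text a TTS engine would be asked to speak."""
    if minutes_late == 0:
        return f"Flight {flight} is on time."
    return f"Flight {flight} is delayed by {minutes_late} minutes."


if __name__ == "__main__":
    print(recorded_response("welcome"))
    print(tts_response_text("212", 35))
```

With recordings alone, every flight number and delay would need its own audio or a set of stitched-together fragments; with TTS, the response is assembled from live data.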

Enterprise Software Integration Technologies

The birth of the Web ushered in a whole new breed of flexible, scalable
enterprise application servers. This has made business logic more
easily accessible over network connections, facilitating the integration of
voice systems with back-end data systems.