The Java Speech API, Part 1

The idea of machines that speak and understand human speech has long been a
fascination of application users and application builders. With advances in
speech technology, this concept has now become a reality. Research projects
have evolved and refined speech technology, making it feasible to develop
applications that use speech technology to enhance the user's experience. There
are two main speech technology concepts: speech synthesis and speech recognition.

Speech synthesis is the process of generating human speech from written text
in a specific language. Speech recognition is the process of converting human
speech into words or commands, which the application can then use or interpret
in different ways.

Over the course of two articles, we will explore the use of the Java Speech
API to write applications that have speech synthesis and speech recognition
capabilities. In addition, we will look at the application areas where we can
effectively use speech technology.

Speech Technology Support in Java

A speech-enabled application does not interact directly with the audio
hardware of the machine on which it runs. Instead, a common component, termed
the speech engine, provides speech capability and mediates between the audio
hardware and the speech-enabled application, as shown in Figure 1.

Figure 1. Speech engine

Each vendor's speech engine exposes speech capabilities in a vendor-specific
way. To let speech applications use this functionality uniformly, vendors
design speech engines that expose their services through a commonly defined
and agreed-upon Application Programming Interface.

Java Speech API

This is where the Java Speech API (JSAPI) steps into the picture. The Java
Speech API brings to the table all of the platform- and vendor-independent features
commonly associated with any Java API. The Java Speech API enables speech
applications to interact with speech engines in a common, standardized, and
implementation-independent manner. Speech engines from different vendors can be
accessed using the Java Speech API, as long as they are JSAPI-compliant.

With JSAPI, speech applications can use speech engine functionality such as
selecting a specific language or voice and acquiring any required audio
resources. JSAPI provides an API for both speech synthesis and speech
recognition.
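To make this concrete, here is a minimal sketch of a JSAPI client. It assumes a JSAPI-compliant speech engine (and its javax.speech implementation jar) is installed on the machine; without one, Central.createSynthesizer() returns null.

```java
import java.util.Locale;
import javax.speech.Central;
import javax.speech.synthesis.Synthesizer;
import javax.speech.synthesis.SynthesizerModeDesc;

public class HelloSynthesizer {
    public static void main(String[] args) throws Exception {
        // Ask the Central class for any installed JSAPI-compliant
        // synthesizer that supports US English.
        SynthesizerModeDesc desc = new SynthesizerModeDesc(Locale.US);
        Synthesizer synthesizer = Central.createSynthesizer(desc);
        if (synthesizer == null) {
            System.err.println("No matching JSAPI synthesizer is installed.");
            return;
        }
        synthesizer.allocate();   // acquire engine and audio resources
        synthesizer.resume();     // leave the initial paused state
        synthesizer.speakPlainText("Hello from the Java Speech API.", null);
        synthesizer.waitEngineState(Synthesizer.QUEUE_EMPTY); // block until done
        synthesizer.deallocate(); // release engine resources
    }
}
```

The allocate/speak/deallocate sequence shown here is the basic life cycle every JSAPI synthesizer client follows; we will flesh it out in the VoicePad application later in this article.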

Figure 2. The Java Speech API stack

Figure 2 shows the Java Speech API stack. At the bottom of the stack, the
speech engine interacts with the audio hardware. On top of it sits the Java
Speech API, which provides a standard, consistent way to access the speech
synthesis and speech recognition functionality of the speech engine. Java
applications that need to incorporate speech functionality use the Java
Speech API to access the speech engine.

Several speech engines, both commercial and open source, are JSAPI-compliant.
Among the open source engines, the Festival speech synthesis system is one of
the more popular ones that expose their services through JSAPI. Many
commercial speech engines also support JSAPI; you can find a comprehensive
list of these on the Java Speech API web site.

Java Speech API: Important Classes and Interfaces

The different classes and interfaces that form the JSAPI are grouped into
three packages:

javax.speech contains classes and interfaces for a generic
speech engine.

javax.speech.synthesis contains classes and interfaces for speech
synthesis.

javax.speech.recognition contains classes and interfaces for speech
recognition.

Before we proceed with writing an application that uses JSAPI, let's explore
a few important classes and interfaces in each of these packages.

Figure 3. JSAPI speech engine interfaces and classes

Central

The Central class acts as a factory class that all JSAPI
applications use. It provides static methods that give applications access
to speech synthesis and speech recognition engines.

Engine

The Engine interface encapsulates the generic operations that a
JSAPI-compliant speech engine should provide for speech applications.
Primarily, speech applications can use methods to perform actions such as
retrieving the properties and state of the speech engine and allocating and
deallocating resources for a speech engine. In addition, the Engine
interface exposes mechanisms to pause and resume the audio stream generated or
processed by the speech engine. The Engine interface is extended
by the Synthesizer and Recognizer interfaces, which
define additional speech synthesis and speech recognition functionality.

JSAPI's event handling is modeled on that of AWT components, so events
generated by the speech engine can be identified and handled as required.
There are two ways to handle speech engine events: by implementing
the EngineListener interface or by subclassing the
EngineAdapter class.
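As a sketch of the second approach, the hypothetical listener below subclasses EngineAdapter so that it only has to override the engine events it cares about, rather than implementing every method of EngineListener.

```java
import javax.speech.EngineAdapter;
import javax.speech.EngineEvent;

// Logs a few engine life-cycle events; all other EngineListener
// methods fall through to EngineAdapter's empty implementations.
public class LoggingEngineListener extends EngineAdapter {
    public void engineAllocated(EngineEvent e) {
        System.out.println("Engine allocated");
    }

    public void enginePaused(EngineEvent e) {
        System.out.println("Engine paused");
    }

    public void engineResumed(EngineEvent e) {
        System.out.println("Engine resumed");
    }
}
```

An application registers the listener on any Engine (synthesizer or recognizer) with engine.addEngineListener(new LoggingEngineListener()).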

Next, let's examine some of the important classes and interfaces of the
javax.speech.synthesis package. These will be used quite
frequently in speech applications.

Figure 4. JSAPI speech synthesis interfaces and classes

Synthesizer

The Synthesizer interface encapsulates the operations that a
JSAPI-compliant speech synthesis engine should provide for speech applications.
Primarily, speech applications can perform actions such as producing speech
output (given text input) or stopping speech-synthesis processing. Other
related operations are inherited from the Engine interface. The
Synthesizer interface accepts text input from several sources,
ranging from a plain String, to a URL, to a special-purpose markup language
called the Java Speech Markup Language (JSML, discussed in the next article).
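The snippet below sketches those input forms side by side. It assumes an already allocated and resumed Synthesizer; the JSML string is only a placeholder, since JSML itself is covered in the next article.

```java
import javax.speech.synthesis.Speakable;
import javax.speech.synthesis.Synthesizer;

public class SynthesizerInputs {
    // Demonstrates the different text sources a Synthesizer accepts.
    static void speakInputs(Synthesizer synthesizer) throws Exception {
        // 1. Plain text: spoken verbatim, no markup interpretation.
        synthesizer.speakPlainText("Plain text input", null);

        // 2. A JSML string: markup is interpreted by the engine.
        synthesizer.speak("<EMP>Marked-up</EMP> input", null);

        // 3. A Speakable object: any class whose getJSML() method
        //    returns the JSML text to speak.
        synthesizer.speak(new Speakable() {
            public String getJSML() {
                return "Input from a Speakable object";
            }
        }, null);
    }
}
```

A fourth overload of speak() accepts a java.net.URL pointing at a JSML document, which is useful when the text to be spoken lives outside the application.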

SynthesizerProperties

The operations in the SynthesizerProperties interface define
runtime properties of the Synthesizer object, such as the voice,
volume, and pitch used for speech synthesis.
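As a brief sketch, the method below adjusts a few of these runtime properties on an assumed, already allocated Synthesizer. The specific values are illustrative only; each setter can throw PropertyVetoException if the engine rejects the value.

```java
import java.beans.PropertyVetoException;
import javax.speech.synthesis.Synthesizer;
import javax.speech.synthesis.SynthesizerProperties;

public class VoiceTuning {
    // Adjusts volume, pitch, and speaking rate on a live synthesizer.
    static void tune(Synthesizer synthesizer) throws PropertyVetoException {
        SynthesizerProperties props = synthesizer.getSynthesizerProperties();
        props.setVolume(0.8f);         // 0.0 (silent) to 1.0 (loudest)
        props.setPitch(150.0f);        // baseline pitch, in hertz
        props.setSpeakingRate(140.0f); // speaking rate, in words per minute
    }
}
```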

Voice

The Voice class represents the voice that the
Synthesizer object uses to render the speech output. The
Voice class also provides methods to obtain metadata about the
voice used for speech synthesis by the Synthesizer object,
including its name, age, and gender.
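Putting Voice and SynthesizerProperties together, the sketch below asks the engine's mode descriptor for its available voices and selects the first female one, if any. It assumes an allocated Synthesizer; which voices exist depends entirely on the installed engine.

```java
import java.beans.PropertyVetoException;
import javax.speech.synthesis.Synthesizer;
import javax.speech.synthesis.SynthesizerModeDesc;
import javax.speech.synthesis.Voice;

public class VoiceSelection {
    // Selects the first female voice offered by the engine, if any.
    static void chooseFemaleVoice(Synthesizer synthesizer)
            throws PropertyVetoException {
        SynthesizerModeDesc desc =
                (SynthesizerModeDesc) synthesizer.getEngineModeDesc();
        Voice[] voices = desc.getVoices();
        for (int i = 0; i < voices.length; i++) {
            if (voices[i].getGender() == Voice.GENDER_FEMALE) {
                System.out.println("Selecting voice: " + voices[i].getName());
                synthesizer.getSynthesizerProperties().setVoice(voices[i]);
                break;
            }
        }
    }
}
```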

As with the Engine interface, events generated during speech
synthesis can be identified and handled either by implementing the methods
of the SpeakableListener interface or by subclassing the
SpeakableAdapter class.

We will explore the classes and interfaces of the
javax.speech.recognition package in the next article.

"Can you hear me now?" asks the Duke

In order to understand the JSAPI better, let's write a simple application
that uses it to provide speech synthesis capability. We will build a simple
text editor, called VoicePad, using the Java Swing API and add speech
capability to the editor using JSAPI, enabling the application to speak the
contents of a file.
What speech synthesis capabilities do we add to the speech-enabled text editor?
As we saw in the previous section, the speech synthesis engine provides
different features, such as producing speech output from text, pausing or
resuming the speech output, or ending the speech output generation. We can add
the following capabilities to the VoicePad editor:

Play: speak the contents of the text editor.

Pause: pause the playing of the speech output.

Resume: resume the playing of the speech output from the last
pause.

Cancel: stop the speech output.

The user can invoke any of these speech capabilities by clicking the
relevant menu item (“Play,” “Pause,”
“Resume,” or “Cancel”) in the Speech menu.
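Each menu item maps onto one Synthesizer call. The hypothetical handler methods below sketch that mapping; they assume a VoicePad class with an allocated Synthesizer field named synthesizer and a JTextArea field named textArea, both of which we define later.

```java
import javax.speech.AudioException;
import javax.speech.synthesis.Synthesizer;
import javax.swing.JTextArea;

public class SpeechMenuHandlers {
    private Synthesizer synthesizer; // assumed allocated elsewhere
    private JTextArea textArea;      // the editor's text area

    // Play: queue the editor's contents for speech output.
    void onPlay() throws Exception {
        synthesizer.resume();
        synthesizer.speakPlainText(textArea.getText(), null);
    }

    // Pause: suspend the audio output mid-stream.
    void onPause() {
        synthesizer.pause();
    }

    // Resume: continue speaking from the point of the last pause.
    void onResume() throws AudioException {
        synthesizer.resume();
    }

    // Cancel: discard everything remaining in the output queue.
    void onCancel() {
        synthesizer.cancelAll();
    }
}
```

Note that pause() and resume() are inherited from the Engine interface, while cancelAll() is specific to Synthesizer.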

To build the speech-enabled text editor, first we will define the user
interface (UI) elements that we will need. We can use the text area element as
the text editor for our application. For navigation and user interaction, we
will define a menu bar with menus and menu options. Since our application's
functionality consists of two parts, text editing and speech, we will define
two sets of menus for the VoicePad application:

A file menu that supports file operations: creating a new file, opening an
existing file, saving the contents of an edited file, and closing a file.

A speech menu that supports speech synthesis operations: speaking the
contents of the text editor, pausing and resuming the speech synthesis
operations, and canceling a speech operation in progress.

Figure 5. Class diagram of the VoicePad application

Now that we know what is required, let's put together the Java Swing pieces
for the application. The primary class of the application is the
VoicePad class, which extends the JFrame class.
As shown in the class diagram in Figure 5, the VoicePad class will
contain all of the methods required for both text editing and speech
functionality.

The constructor for the VoicePad application is responsible for initializing
the application's elements. It invokes the init() method, which
initializes the user interface elements and the speech engine. The
JTextArea UI element will be the text editor for our
application.
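A minimal skeleton of this structure might look as follows. It is a sketch, not the article's full listing: it wires a JTextArea into the frame and allocates a synthesizer in init(), but omits the menus and handlers described above, and it assumes a JSAPI-compliant engine for the default locale is installed.

```java
import java.util.Locale;
import javax.speech.Central;
import javax.speech.synthesis.Synthesizer;
import javax.speech.synthesis.SynthesizerModeDesc;
import javax.swing.JFrame;
import javax.swing.JScrollPane;
import javax.swing.JTextArea;

public class VoicePad extends JFrame {
    private JTextArea textArea;      // the editor surface
    private Synthesizer synthesizer; // the JSAPI synthesis engine

    public VoicePad() throws Exception {
        super("VoicePad");
        init();
    }

    private void init() throws Exception {
        // User interface initialization.
        textArea = new JTextArea(25, 80);
        getContentPane().add(new JScrollPane(textArea));

        // Speech engine initialization: create and allocate a
        // synthesizer for the default locale.
        synthesizer = Central.createSynthesizer(
                new SynthesizerModeDesc(Locale.getDefault()));
        synthesizer.allocate();

        pack();
        setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
    }
}
```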