This project is concerned with the development of methods for
the automatic summarization of spoken language utilizing
prosodic features such as energy, duration, pitch and pauses.
The main components of the proposed research are an
investigation into which prosodic features add useful
information for speech summarization and the development of
methods, based on statistical pattern recognition, relating the
prosodic features present in a message to the content of that
message. The expected results of the project will be a prototype
system for the summarization of voicemail messages, and an
evaluation of the prospects of the proposed approach for more
general information extraction tasks.

Background

Speech is a very rich communication medium and recently there
have been efforts to find ways of incorporating prosodic cues in
order to extend the capabilities of spoken dialogue and audio
browsing/retrieval systems. An important aspect of this approach
is the combination of prosodic, acoustic and language
information to achieve results that are more robust than those
of single sources. Humans use prosody to disambiguate similar
words, to group words into meaningful phrases, and to mark the
importance of words or phrases. The acoustic correlates of
prosody are among the cues least affected by noise, so it is
likely that human listeners use prosody as a redundant cue to
help them correctly recognize speech in noisy environments.
Spontaneous and read speech differ in regard to prosodic
structure, with the former having shorter prosodic units.

In this project, we are concerned with speech summarization, in
particular the generation of short text summaries of a user's
incoming voicemail messages. This is a potentially important
component of integrated voice/data communication, and we have
applied such a facility in a Short Message Service (SMS) based
system. SMS has several unique features that can be summarized
as message storage if the recipient is not available,
confirmation of delivery to the sender and simultaneous
transmission with voice, data and fax services. Voicemail
summarization differs from conventional text summarization or
abstracting, since it does not assume a perfect transcription
and is concerned with summarizing brief spoken messages (average
duration about 40s) into terse summaries (140 characters in the
case of SMS transmission). Given this level of compression,
"document flow" is less important compared with the need to
transmit the principal content words in the message. We describe
the system's ability to generate summaries of two test sets,
having trained and validated using messages from the IBM
Voicemail corpus.

In many applications, such as speech summarization, the cost of
different types of errors is not known at the time of designing
the system. Additionally the costs may change over time.
Finally, some costs cannot be specified quantitatively: in
speech summarization such costs include coherence degradation,
readability deterioration and topical under-representation.
Thus, we resort to specifying the classifier in the form of an
adjustable threshold and a receiver operating characteristic
(ROC) curve obtained by setting the threshold to various
possible values (Provost and Fawcett, 2001).
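
As a minimal sketch of how such a curve is traced, the Python below sweeps a threshold over per-word classifier scores and records one (false-positive rate, true-positive rate) pair per setting. The function name `roc_points` and the toy scores and labels are assumptions made for illustration; they are not taken from the project's system or data.

```python
def roc_points(scores, labels):
    """Return (fpr, tpr) pairs, one per candidate threshold.

    scores: classifier outputs, higher means "more likely important".
    labels: 1 for important words, 0 otherwise.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    # Start at (0, 0): a threshold above every score selects nothing.
    points = [(0.0, 0.0)]
    # Lower the threshold step by step through every distinct score.
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


# Hypothetical scores for five words, two of which are important.
scores = [0.9, 0.8, 0.6, 0.4, 0.2]
labels = [1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
```

Choosing an operating point then amounts to picking one threshold along this curve once the relative costs of the two error types are known.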

Many tens of lexical and prosodic features can be identified and
computed as inputs to classifiers. It is desirable
to select a subset of such features and to discard the
remainder. This can be useful if there are features which carry
little useful information for the particular task, or if there
are very strong correlations between sets of inputs so that the
same information is repeated in several features. Furthermore,
one might wish to reduce the dimensionality simply in order to
make the classification calculations quicker, to save storage
space or to permit rapid feature extraction.
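
To illustrate the redundancy argument only, the sketch below drops any feature that is almost perfectly correlated with one already kept. This is a simple correlation filter, not the Parcel algorithm used in this project; the feature names, values and the 0.95 cut-off are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant feature has no linear correlation
    return cov / (sx * sy)


def drop_redundant(features, max_abs_corr=0.95):
    """Greedily keep features not strongly correlated with any kept one.

    features: dict mapping a feature name to its list of per-word values.
    """
    kept = []
    for name, values in features.items():
        if all(abs(pearson(values, features[k])) < max_abs_corr for k in kept):
            kept.append(name)
    return kept


# Hypothetical per-word feature values, for illustration only.
features = {
    "duration": [0.10, 0.30, 0.20, 0.50],
    "duration_ms": [100, 300, 200, 500],  # same information, different unit
    "energy": [0.7, 0.2, 0.9, 0.4],
}
selected = drop_redundant(features)
```

Here `duration_ms` repeats the information in `duration` and is discarded, while `energy` is retained.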

Classifiers may be combined by random switching to achieve any
operating point on the convex hull of their ROC curves. Such a
combination is referred to as the Maximum Realizable ROC (MRROC)
classifier. Scott, Niranjan and Prager derived the Parcel
algorithm that sequentially selects features and
classifiers to maximize the MRROC. This implies that different
trade-offs in the ROC curve require different optimal feature
sets and classifiers. It is the objective of Parcel to produce a
MRROC that has the largest possible area underneath it, i.e., to
maximize the Wilcoxon statistic associated with the
classification system defined by the MRROC. This is achieved by
searching for, and retaining, those features and classifiers
that extend the convex hull defined by the MRROC. The Parcel
algorithm seeks not to select a single best feature subset, but
rather to select as many different subsets as are necessary
to produce satisfactory performance across all costs.
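
The geometric part of this idea can be sketched as follows: pool the ROC points of several classifiers, take their upper convex hull, measure the area under it, and compute the switching probability needed to reach a point between two hull vertices. This is not an implementation of Parcel itself (the sequential search over features and classifiers is omitted), and all function names and operating points are assumptions for illustration.

```python
def _cross(o, a, b):
    """Z-component of the cross product of vectors o->a and o->b."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def roc_convex_hull(points):
    """Upper convex hull of ROC points, from (0, 0) to (1, 1)."""
    pts = sorted(set(points) | {(0.0, 0.0), (1.0, 1.0)})
    hull = []
    for p in pts:
        # Drop the last vertex while it lies on or below the line joining
        # its predecessor to the new point.
        while len(hull) >= 2 and _cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    return hull


def area_under(hull):
    """Trapezoidal area under a piecewise-linear ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(hull, hull[1:]))


def switching_probability(a, b, target_fpr):
    """Probability of using classifier b rather than a so that the
    expected false-positive rate equals target_fpr (a and b are
    adjacent hull vertices)."""
    return (target_fpr - a[0]) / (b[0] - a[0])


# Hypothetical operating points of two classifiers.
classifier_a = [(0.1, 0.6), (0.3, 0.7)]
classifier_b = [(0.2, 0.5), (0.4, 0.9)]
hull = roc_convex_hull(classifier_a + classifier_b)
area = area_under(hull)
p_b = switching_probability((0.1, 0.6), (0.4, 0.9), target_fpr=0.25)
```

In this toy case only one point from each classifier lies on the hull, which mirrors the observation that different trade-offs favour different classifiers.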

We have applied the Parcel feature subset selection algorithm
to evaluate which of the several, often correlated, lexical and
prosodic features are potentially optimal as classifier inputs
for voicemail summarization. The architecture of our system is
shown in the following figure:

Results

Two rates can be calculated for any series of classifications:
the true-positive (sensitivity) and the false-positive
(1-specificity) rates. A true-positive has occurred when an
important word is correctly included in the summary, and a
false-positive when a non-important word is incorrectly included
in the summary.
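
Written out, with TP and FN counting important words that are included in or omitted from the summary, and FP and TN counting non-important words that are included or omitted (standard notation, not taken from the original text), the two rates are:

```latex
\mathrm{TPR} = \frac{TP}{TP + FN},
\qquad
\mathrm{FPR} = \frac{FP}{FP + TN} = 1 - \text{specificity}
```

As a hypothetical example, a message with 10 important and 40 non-important words whose summary contains 8 important and 4 non-important words gives TPR = 8/10 = 0.8 and FPR = 4/40 = 0.1.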

The left figure depicts the ROC curves produced using single
features with respect to the validation set. For simplicity, only
the best (potentially optimal) types of features are shown:
collection frequency, NE scoring, duration, energy, pitch onset,
pitch amplitude and pitch range offer maximum discrimination.

The right figure depicts the MRROC curves produced by Parcel on
the validation set using lexical features only, prosodic features
only, and a combination of lexical and prosodic features. Lexical
features as classifier inputs clearly dominate prosodic features
across all threshold intervals. The combination of lexical and
prosodic features outperforms any single constituent
classification system.

Results measuring the quality of summary artifacts using a
weighted Slot Error Rate (SER) metric show that combined lexical
and prosodic features are at least as robust as combined lexical
features alone across all operating conditions.
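
For reference, the unweighted Slot Error Rate is commonly defined from the number of correct (C), substituted (S), deleted (D) and inserted (I) slots. The weighted form shown next to it is only an assumed general shape: the weights w_S, w_D and w_I are hypothetical symbols, and the actual weighting used in this project is not stated here.

```latex
\mathrm{SER} = \frac{S + D + I}{C + S + D},
\qquad
\mathrm{SER}_w = \frac{w_S S + w_D D + w_I I}{C + S + D}
```

The denominator C + S + D is the number of slots in the reference summary, so insertions can push the rate above 1.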

Examples/demonstrator

The architecture of the proposed system encompasses three
distinct phases of processing:

transcription of voicemail messages

construction of transcription summaries by selecting
important terms according to their weights

formation of summaries and delivery via the WAP Push Service
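
The second phase can be sketched as follows, assuming each transcribed word already carries a weight from the classifier: the highest-weighted words are picked until the character budget is exhausted, then emitted in their spoken order. The function name, the weights and the 40-character budget in the example are hypothetical; the budget defaults to the 140-character limit mentioned earlier.

```python
def summarize(words, weights, max_chars=140):
    """Pick the highest-weighted words that fit in max_chars,
    then return them joined in their original (spoken) order."""
    ranked = sorted(range(len(words)), key=lambda i: weights[i], reverse=True)
    chosen = []
    length = 0
    for i in ranked:
        extra = len(words[i]) + (1 if chosen else 0)  # +1 for the joining space
        if length + extra <= max_chars:
            chosen.append(i)
            length += extra
    return " ".join(words[i] for i in sorted(chosen))


# Hypothetical weights for part of an automatic transcription.
words = ["GARY", "IT'S", "MARYANN", "DEFENSIVE", "DRIVING",
         "COURSE", "TOMORROW", "CALL"]
weights = [0.6, 0.1, 0.9, 0.8, 0.8, 0.7, 0.5, 0.4]
summary = summarize(words, weights, max_chars=40)
```

Keeping the spoken order is a design choice made for readability in this sketch; with so little space, the priority remains transmitting the principal content words.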

The spoken messages collected by the voicemail system are
forwarded to the Content Server where they are automatically
transcribed and summarized. There is no restriction on where the
voicemail system is located, and it will most likely not be
located geographically close to the Content Server. This allows
access to answering services other than the one provided by the
network operator. The Push Initiator contacts the Push
Proxy gateway over the Internet and delivers the messages. The
Push Proxy gateway examines the message and performs the required
encoding and transformation for the WAP domain. The messages are
then transmitted hop-by-hop in the mobile network to the mobile
client. The Push Initiator is then notified by the gateway about
the final outcome of the Push operation.

The following table shows the human transcription, the automatic
transcription and the automatic summarization of the spoken
message vm1dev26.

Human transcription:

HI MARY IT'S MARYANN SHAW
I JUST HAVE A QUICK QUESTION FOR YOU THE DEFENSIVE DRIVING COURSE THAT IS TOMORROW AND THURSDAY
CAN YOU LET ME KNOW IF THAT'S IN THE HAWTHORNE ONE OR HAWTHORNE TWO JUST WANTED TO MAKE SURE I'M
NOT SURE THE WAY THEY YOU KNOW SETUP
THEIR ROOMS SO IF YOU COULD GIVE ME A CALL I'M ON TIELINE EIGHT TWO SIX SIXTEEN OH TWO BYE

Automatic transcription:

HEY GARY IT'S MARYANN
SURE I SHOULD HAVE A QUICK QUESTION FOR YOU
THE DEFENSIVE DRIVING COURSE IT IS TOMORROW
AND THURSDAY CAN YOU LET ME KNOW THANKS IN THE HAWTHORNE
WONDERFUL POINT TWO JUST WANTED TO MAKE SURE
I'M NOT SURE THE WAY YOU KNOW SETUP
THERE SO IF YOU COULD GIVE ME A CALL ON TIELINE EIGHT TWO SIX SIXTEEN OH TWO
BYE

The following figures show retrieval of the summary of vm1dev26
on the display of a WAP phone. An optional connection to the
voicemail system, in order to listen to the particular message,
can be provided by the Wireless Telephony Application (WTA).

K. Koumpis and S. Renals. The Role of Prosody in a Voicemail
Summarization System. Proc. ISCA Workshop on Prosody in Speech
Recognition and Understanding, pp. 87-92, Red Bank, NJ, USA,
Oct. 2001.