Audio-Visual Speech Recognition

Overview

We are working on audio-visual speech recognition, the process of
automatically recognizing speech by combining acoustic information
with visual cues from lipreading. Supplementing traditional acoustic
features with visual speech information can lead to robust automatic
speech recognizers that can be deployed in very challenging acoustic
environments, such as airplane cockpits, car cabins, or loud clubs,
which are far beyond the reach of current audio-only speech
recognition systems. We have worked on many aspects of the problem,
ranging from adaptive fusion of asynchronous multimodal features, to
compact facial representations of visual speech, to efficient
implementations for building real-time prototypes.

Multimodal Fusion by Adaptive Compensation of Feature Uncertainty

Our goal is to develop multimodal fusion rules that automatically
adapt to changing environmental conditions. In the context of
audio-visual speech recognition, such fusion rules should
automatically discount the contribution of degraded features to the
decision process. For example, during acoustic noise bursts the
audio information should be discounted, and the visual information
should likewise be discounted when the speaker's face is occluded.

Figure 2. Multimodal uncertainty compensation leads to adaptive
fusion: as the y_2 stream is contaminated by more noise, the
classification decision is influenced mostly by the cleaner y_1
stream.

We have developed a conceptually elegant framework based on
uncertainty compensation which addresses this problem. The idea is
to explicitly model feature measurement uncertainty and to study how
multimodal classification and learning rules should be adjusted to
compensate for its effects. We show that this approach to multimodal
fusion through uncertainty compensation leads to highly adaptive
fusion rules which are easy and efficient to implement. Moreover,
the technique is widely applicable and can be transparently
integrated with either synchronous or asynchronous multimodal
sequence integration architectures, such as Product-HMMs. The
adaptive fusion effect of uncertainty compensation is illustrated in
Figure 2. For further information on the method we refer to
Papandreou et al. (2009). We have also extended the popular Hidden
Markov Model Toolkit (HTK) to support training and decoding under
uncertainty compensation for multi-stream HMMs; see the software
section of this page.
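
In the simplest Gaussian setting, uncertainty compensation takes a
particularly clean form: if an observed feature equals the clean
feature plus zero-mean Gaussian noise of known variance, the class
likelihood is evaluated with the noise variance added to the model
variance, i.e. N(y; mu, var + var_noise). The Python sketch below
illustrates the resulting adaptive fusion for two feature streams;
the function names and the diagonal-covariance Gaussian class models
are illustrative assumptions, not code from our HTK extension.

    # Illustrative sketch: uncertainty-compensated fusion of two
    # feature streams with diagonal-covariance Gaussian class models.
    import numpy as np

    def stream_loglik(y, mu, var, var_noise):
        # Uncertainty compensation: add the estimated measurement
        # noise variance to the model variance,
        # i.e. evaluate N(y; mu, var + var_noise).
        v = var + var_noise
        return -0.5 * np.sum(np.log(2.0 * np.pi * v) + (y - mu) ** 2 / v)

    def classify(y1, y2, models, noise1, noise2):
        # models[c] = ((mu1, var1), (mu2, var2)) for class c. A stream
        # with large noise variance scores all classes nearly equally,
        # so the decision is driven by the cleaner stream.
        scores = [stream_loglik(y1, mu1, var1, noise1)
                  + stream_loglik(y2, mu2, var2, noise2)
                  for (mu1, var1), (mu2, var2) in models]
        return int(np.argmax(scores))

As noise2 grows, the second stream's log-likelihoods flatten across
classes and the decision converges to what the first stream alone
would give, which is precisely the adaptive behavior of Figure 2.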

Active Appearance Model Features for Visual Speech Description

While the extraction of acoustic speech features is well studied,
the extraction of visual speech descriptors is much less explored.
Our goal has been to capture visual speech information in
low-dimensional descriptors which are compact, relatively invariant
to speaker identity and scene illumination, and fast enough to
compute for real-time processing.

For visual feature extraction we build on the active appearance
model (AAM) framework. Active appearance models analyze the shape
and appearance variability of human faces in low-dimensional linear
spaces learnt from training data; see Figure 3. In our visual
front-end the AAM mask is initialized using a face detector, leading
to fully automatic visual feature extraction. By adapting the AAM's
mean shape and appearance vectors, we achieve improved invariance to
inter-speaker variability; the resulting visemic AAM thus
concentrates its modeling capacity on speech-induced variability. To
allow for real-time performance, we have developed improved
algorithms for fast and accurate AAM fitting. For more information,
we refer to Papandreou et al. (2009) and Papandreou and Maragos
(2008). Our AAM software is further described on the AAMtools page.
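
To make the descriptor computation concrete, the sketch below shows
how visual speech features can be read off a fitted AAM by
projecting onto the learned shape and appearance bases; the variable
names are illustrative placeholders and do not correspond to the
AAMtools API.

    # Illustrative sketch: AAM-based visual speech descriptors,
    # assuming orthonormal PCA bases learnt from training data.
    import numpy as np

    def aam_descriptor(landmarks, warped_face,
                       mean_shape, shape_basis, mean_app, app_basis):
        # Shape parameters p: project the fitted landmark coordinates
        # onto the shape eigenvectors
        # (the model is s = mean_shape + shape_basis @ p).
        p = shape_basis.T @ (landmarks - mean_shape)
        # Appearance parameters lam: project the shape-normalized
        # face pixels onto the appearance eigenvectors.
        lam = app_basis.T @ (warped_face.ravel() - mean_app)
        # The concatenated parameters form a compact visual speech
        # feature vector for the audio-visual recognizer.
        return np.concatenate([p, lam])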

Real-time Audio-Visual Speech Recognition Prototype

Towards practically deployable audio-visual automatic speech
recognition (AV-ASR), we have built a proof-of-concept laptop-based
AV-ASR prototype which: (i) uses a consumer microphone and camera to
capture the speaker; (ii) performs visual and audio feature
extraction, as well as speech recognition, on the laptop in real
time; (iii) is robust to failures of a single modality, such as
visual occlusion of the speaker's face; and (iv) automatically
adapts to changing acoustic noise levels. The prototype's interface
has been designed and developed by our TSI-TUC collaborators
(M. Perakakis and A. Potamianos).
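
As a rough illustration of the prototype's data flow, a real-time
loop might look as follows; every component here is a toy stand-in,
since the actual capture, front-end, and decoder code is not
reproduced on this page.

    # Schematic real-time AV-ASR loop with toy stand-in components;
    # the real prototype replaces these stubs with actual capture,
    # acoustic and AAM-based visual features, and multi-stream HMM
    # decoding under uncertainty compensation.
    import numpy as np

    def grab_frames():
        # Stand-in for synchronized microphone/camera capture.
        return np.random.randn(160), np.random.randn(64, 64)

    def decode_step(a_feat, a_unc, v_feat, v_unc):
        # Stand-in for uncertainty-compensated decoding: a stream
        # with high estimated uncertainty (noise burst, occluded
        # face) gets a low weight instead of corrupting the result.
        w_a, w_v = 1.0 / (1.0 + a_unc), 1.0 / (1.0 + v_unc)
        return (w_a * a_feat.sum() + w_v * v_feat.sum()) / (w_a + w_v)

    for _ in range(5):  # 'while True' in the live system
        audio, video = grab_frames()
        a_feat, a_unc = audio[:13], float(np.var(audio))  # toy features
        v_feat, v_unc = video.mean(axis=0)[:10], 0.1      # toy features
        print(decode_step(a_feat, a_unc, v_feat, v_unc))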

Software

HucTK: Software which extends HTK to support training and decoding
under uncertainty compensation for multi-stream HMMs. It is released
as patch code to HTK; note that once the patch is applied, the
resulting code is subject to the HTK license. Coming soon.

AAMtools: A toolbox for building Active Appearance Models and
fitting them to still and moving images. See the AAMtools page.

Acknowledgments

Financial support for this work has been provided by the FP6
European Network of Excellence MUSCLE, under contract
no. IST-FP6-507752, and is gratefully acknowledged.

Last modified: Sunday, 03 May 2009 | Created by Nassos Katsamanis and George Papandreou