Visual features for the MOCHA database
as provided by the
Computer Vision, Speech Communication & Signal Processing Group
at the National Technical University of Athens, Greece
CVSP group web site:
http://cvsp.cs.ntua.gr
Overview:
--------------------
The MOCHA (Multichannel Articulatory) database has been compiled by the
Department of Speech and Language Sciences at Queen Margaret University
College and the Department of Linguistics at the University of Edinburgh.
It can be found online at
http://www.cstr.ed.ac.uk/research/projects/artic/mocha.html
The database currently comprises sound, EMA (Electromagnetic
Articulography), laryngograph, electropalatograph and video recordings of
two speakers (one male and one female) uttering a set of 460 sentences in
English. For details on the database please refer to (Wrench and Hardcastle
2000). While the other recorded modalities have been exploited in various
contexts (Richmond et al. 2003, Toda et al. 2008, Toutios and Margaritis
2008), the video footage had so far remained unused.
To exploit the video as well, for audiovisual speech inversion
(Katsamanis et al. 2008, 2009), we processed the raw recordings of the
female speaker (fsew0) and extracted visual features based on face active
appearance modeling (Cootes et al. 2001). The video was first segmented
and labeled by automatically aligning the pre-transcribed audio data
(available with the MOCHA database) with the audio tracks extracted from
the video files. Shape and texture features were then extracted for each
video frame (at 25 Hz), after face detection, tracking, and active
appearance model fitting with the algorithm described in (Papandreou and
Maragos, 2008). Please check the AAMtools webpage
(http://cvsp.cs.ntua.gr/software/AAMtools) for further information
related to the modeling process and software. In total, 12 shape and 27
texture features were extracted from each frame. Synchronization with the
EMA features was verified both visually and quantitatively using
canonical correlation analysis (Katsamanis et al. 2009).
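
As a rough illustration of the quantitative check, the MATLAB fragment
below computes canonical correlations between the two feature streams.
The variable names (visualFeats, emaFeats) are hypothetical, the EMA
trajectories are assumed to have been resampled to the 25 Hz video rate,
and canoncorr requires the Statistics Toolbox; this is a sketch of the
idea, not the exact procedure of (Katsamanis et al. 2009).

    % Assumed (hypothetical) workspace variables:
    %   visualFeats : nFrames x 39 matrix of AAM features (12 shape + 27 texture)
    %   emaFeats    : nFrames x nChannels EMA trajectories, resampled to 25 Hz
    [~, ~, r] = canoncorr(visualFeats, emaFeats);   % Statistics Toolbox
    fprintf('Leading canonical correlation: %.3f\n', r(1));
    % High leading correlations suggest the streams are time-aligned;
    % shifting one stream by a few frames should lower them noticeably.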
Practical details:
-------------------------------
Features are stored in binary HTK format (Young et al. 2000), in a
separate file per utterance. The file names follow the MOCHA naming
conventions; the suffix is .aam.
The accompanying script demonstrates how to import the MOCHA visual
features into the MATLAB environment.
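
For reference, a minimal MATLAB sketch of an HTK parameter-file reader is
given below. The function name read_htk_aam is our own illustrative
choice, and the sketch assumes uncompressed, big-endian HTK files with
4-byte floating-point samples and no checksum, i.e. the standard layout
for such feature files.

    function [feats, frameRate] = read_htk_aam(filename)
    % READ_HTK_AAM  Load an HTK-format feature file into a matrix.
    %   feats     : nFrames x nDims matrix (one row per video frame)
    %   frameRate : frame rate in Hz, derived from the HTK sample period
    fid = fopen(filename, 'r', 'ieee-be');   % HTK files are big-endian
    if fid < 0
        error('Cannot open %s', filename);
    end
    nSamples   = fread(fid, 1, 'int32');     % number of frames
    sampPeriod = fread(fid, 1, 'int32');     % sample period, in 100 ns units
    sampSize   = fread(fid, 1, 'int16');     % bytes per frame
    parmKind   = fread(fid, 1, 'int16');     %#ok<NASGU> parameter kind code
    nDims = sampSize / 4;                    % 4 bytes per float32 coefficient
    feats = fread(fid, [nDims, nSamples], 'float32')';
    fclose(fid);
    frameRate = 1e7 / sampPeriod;            % e.g. 400000 -> 25 Hz
    end

For instance, feats = read_htk_aam('fsew0_001.aam') (hypothetical file
name) should return an nFrames x 39 matrix holding the 12 shape and 27
texture features per frame, with frameRate equal to 25.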
Please check http://cvsp.cs.ntua.gr/research/inversion for a further
description of our related research and for updates on papers, datasets,
and software.
References:
-----------------------
A. Wrench and W. Hardcastle, "A multichannel articulatory speech database
and its application for automatic speech recognition," in Proc. 5th
Seminar on Speech Production, Kloster Seeon, Bavaria, 2000, pp. 305-308.
[Online]. Available: http://www.cstr.ed.ac.uk/artic
K. Richmond, S. King, and P. Taylor, "Modelling the uncertainty in
recovering articulation from acoustics," Computer Speech and Language,
vol. 17, pp. 153-172, 2003.
T. Toda, A. W. Black, and K. Tokuda, "Statistical mapping between
articulatory movements and acoustic spectrum using a Gaussian mixture
model," Speech Communication, vol. 50, pp. 215-227, 2008.
A. Toutios and K. Margaritis, "Estimating Electropalatographic Patterns
from the Speech Signal," Computer Speech and Language, vol. 22, no. 4,
pp. 346-359, October 2008.
A. Katsamanis, G. Papandreou, and P. Maragos, "Face Active Appearance
Modeling and Speech Acoustic Information to Recover Articulation," IEEE
Transactions on Audio, Speech and Language Processing, vol. 17, no. 3,
pp. 411-422, March 2009.
A. Katsamanis, G. Papandreou, and P. Maragos, "Audiovisual-to-Articulatory
Speech Inversion Using Active Appearance Models for the Face and Hidden
Markov Models for the Dynamics," Proc. IEEE Int'l Conference on Acoustics,
Speech, and Signal Processing (ICASSP-2008), Las Vegas, NV, U.S.A.,
Mar.-Apr. 2008.
T. Cootes, G. Edwards, and C. Taylor, "Active Appearance Models," IEEE
Transactions on Pattern Analysis and Machine Intelligence, vol. 23,
no. 6, pp. 681-685, June 2001.
G. Papandreou and P. Maragos, "Adaptive and Constrained Algorithms for
Inverse Compositional Active Appearance Model Fitting", Proc. IEEE
Int'l Conf. on Computer Vision and Pattern Recognition (CVPR-2008),
Anchorage, AK, June 2008.
S. Young et al., The HTK Book (for HTK Version 3.0), University of
Cambridge, 2000. [Online]. Available: http://htk.eng.cam.ac.uk/docs/docs.shtml
For reprints of our papers, visit the CVSP group web site
URL: http://cvsp.cs.ntua.gr