MPhil Thesis Defence
Title: "Practical Improvements to Automatic Visual Speech Recognition"
By
Mr. Ho Long FUNG
Abstract
Visual speech recognition (also known as lipreading) is the task of
recognizing speech solely from the visual movements of the mouth. In this
work, we propose several feasible and practical strategies and demonstrate
significant improvements over established competitive baselines in both
low-resource and resource-sufficient scenarios.
On the one hand, a main challenge in practical automatic lipreading is
dealing with the diverse facial viewpoints in the available video data.
The recently proposed spatial transformer enhances the spatial invariance
of convolutional neural networks to their input, and it has achieved
promising results across a broad spectrum of areas, including face
recognition, facial alignment and gesture recognition, by virtue of the
increased model robustness to viewpoint variations in the data. We study
the effectiveness of the learned spatial transformation in our model
through quantitative and qualitative analysis with visualizations, and by
incorporating a spatial transformer we attain an absolute accuracy gain of
0.92% over our data-augmented baseline on the resource-sufficient Lip
Reading in the Wild (LRW) continuous word recognition task.
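For illustration only, a minimal spatial transformer front-end in PyTorch
might look like the sketch below. The layer sizes and the 64x64 grayscale
mouth-crop input are assumptions, not the exact configuration used in the
thesis.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialTransformer(nn.Module):
    """Sketch of a spatial transformer (Jaderberg et al., 2015):
    predicts an affine transform per frame and resamples the input
    before the recognition CNN. Hypothetical layer sizes."""
    def __init__(self):
        super().__init__()
        # Localization network: regresses 6 affine parameters per frame.
        self.loc = nn.Sequential(
            nn.Conv2d(1, 8, kernel_size=7), nn.MaxPool2d(2), nn.ReLU(),
            nn.Conv2d(8, 10, kernel_size=5), nn.MaxPool2d(2), nn.ReLU(),
        )
        self.fc = nn.Sequential(
            nn.Linear(10 * 12 * 12, 32), nn.ReLU(),
            nn.Linear(32, 6),
        )
        # Initialize to the identity transform so training starts stable.
        self.fc[-1].weight.data.zero_()
        self.fc[-1].bias.data.copy_(
            torch.tensor([1, 0, 0, 0, 1, 0], dtype=torch.float))

    def forward(self, x):  # x: (batch, 1, 64, 64) grayscale mouth crops
        theta = self.fc(self.loc(x).flatten(1)).view(-1, 2, 3)
        grid = F.affine_grid(theta, x.size(), align_corners=False)
        return F.grid_sample(x, grid, align_corners=False)

The warped output can then be fed to any recognition backbone; because the
transform is differentiable, the localization network is trained end-to-end
with the recognition loss.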
On the other hand, we explore the effectiveness of the convolutional
neural network (CNN) and the long short-term memory (LSTM) recurrent
neural network in lipreading under a low-resource scenario that has not
been explored before. We propose an end-to-end deep learning model that
fuses a conventional CNN and a bidirectional LSTM (BLSTM) together with
maxout activation units (maxout-CNN-BLSTM) and dropout. It attains a word
accuracy of 87.6% on the low-resource OuluVS2 corpus, an absolute
improvement of 3.1% over the auto-encoder-BLSTM model that was the state
of the art at the time. We emphasize that our lipreading system requires
neither a separate feature extraction stage nor a pre-training phase with
external data resources.
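As a rough sketch of how such a fusion can be wired up, the following
PyTorch model combines a maxout CNN with a BLSTM and dropout. The layer
widths, the 64x64 input resolution, and the number of output classes are
assumptions for illustration, not the thesis's exact architecture.

class Maxout(nn.Module):
    """Maxout unit (Goodfellow et al., 2013): max over k feature maps."""
    def __init__(self, k=2):
        super().__init__()
        self.k = k

    def forward(self, x):  # x: (batch, channels, H, W), channels % k == 0
        b, c, h, w = x.shape
        return x.view(b, c // self.k, self.k, h, w).max(dim=2).values

class MaxoutCNNBLSTM(nn.Module):
    """Illustrative maxout-CNN-BLSTM word classifier; n_classes and all
    sizes are hypothetical."""
    def __init__(self, n_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), Maxout(2), nn.MaxPool2d(2),
            nn.Conv2d(16, 64, 3, padding=1), Maxout(2), nn.MaxPool2d(2),
            nn.Dropout(0.5),
        )
        self.blstm = nn.LSTM(32 * 16 * 16, 128,
                             batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 128, n_classes)

    def forward(self, x):  # x: (batch, time, 1, 64, 64) frame sequence
        b, t = x.shape[:2]
        # Apply the CNN per frame, then restore the time dimension.
        feats = self.cnn(x.flatten(0, 1)).flatten(1).view(b, t, -1)
        h, _ = self.blstm(feats)
        # Average-pool BLSTM states over time, then classify the utterance.
        return self.out(h.mean(dim=1))

# Example: logits = MaxoutCNNBLSTM()(torch.randn(2, 30, 1, 64, 64))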
Date: Wednesday, 12 December 2018
Time: 2:30pm - 4:30pm
Venue: Room 2131C (Lift 19)
Committee Members: Dr. Brian Mak (Supervisor)
Prof. Dit-Yan Yeung (Chairperson)
Dr. Raymond Wong
**** ALL are Welcome ****