Abstract

We present a system capable of producing video-realistic
video of a speaker from audio alone. The audio input
signal requires no phonetic labelling and is speaker-independent.
The system requires only a small training set of
video to achieve convincing realistic facial synthesis. The
system learns the natural mouth and face dynamics of a
speaker to allow new facial poses, unseen in the training
video, to be synthesised. To achieve this, we have developed
a novel approach that utilises a hierarchical,
non-linear PCA model coupling speech and appearance.
We show that the model is capable of synthesising
videos of a speaker using new audio segments from
both previously heard and unheard speakers. The model
is highly compact, making it suitable for a wide range of
real-time applications in multimedia and telecommunications
using standard hardware.
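
To make the speech-appearance coupling concrete, the following is a minimal linear sketch of the idea only, not the paper's hierarchical non-linear model: separate PCA models reduce acoustic and facial appearance features, a second-level PCA couples their parameters, and appearance is recovered from audio alone. All data shapes, dimensionalities, and the least-squares recovery step are illustrative assumptions.

import numpy as np
from sklearn.decomposition import PCA

# Hypothetical training data (not from the paper): one row per video frame.
rng = np.random.default_rng(0)
audio_feats = rng.normal(size=(500, 40))    # e.g. per-frame acoustic features
appear_feats = rng.normal(size=(500, 120))  # e.g. per-frame facial appearance vectors

# First level: independent PCA models for each modality.
audio_pca = PCA(n_components=10).fit(audio_feats)
appear_pca = PCA(n_components=20).fit(appear_feats)
a = audio_pca.transform(audio_feats)        # audio parameters
b = appear_pca.transform(appear_feats)      # appearance parameters

# Second level: a joint PCA couples the two parameter sets.
joint_pca = PCA(n_components=15).fit(np.hstack([a, b]))
C = joint_pca.components_                   # shape (15, 30)
C_audio, C_appear = C[:, :10], C[:, 10:]

def synthesise_appearance(new_audio):
    """Estimate appearance parameters for unseen audio frames."""
    a_new = audio_pca.transform(new_audio) - joint_pca.mean_[:10]
    # Solve for joint coefficients from the audio block only,
    # then reconstruct the appearance block from those coefficients.
    coeffs, *_ = np.linalg.lstsq(C_audio.T, a_new.T, rcond=None)
    b_new = coeffs.T @ C_appear + joint_pca.mean_[10:]
    return appear_pca.inverse_transform(b_new)

frames = synthesise_appearance(rng.normal(size=(5, 40)))
print(frames.shape)                         # (5, 120): one appearance vector per audio frame

In this sketch the coupled model is a single linear subspace over concatenated parameters; the paper's hierarchical, non-linear formulation would replace the joint PCA stage with its own model while keeping the same overall audio-to-appearance mapping.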