Abstract

The objective of this paper is to separate a target speaker's speech from a mixture of two speakers using a deep audio-visual speech separation network. Unlike previous works that used lip movements in video clips or pre-enrolled speaker information as an auxiliary conditional feature, we use a single face image of the target speaker. In this task, the conditional feature is obtained from facial appearance via a cross-modal biometric task, in which audio and visual identity representations are shared in a latent space. The identities learnt from face images force the network to isolate the matched speaker and extract the corresponding voice from the mixed speech. This resolves the permutation problem of swapped channel outputs, which frequently occurs in speech separation tasks. The proposed method is far more practical than video-based speech separation, since user profile images are readily available on many platforms. Moreover, unlike speaker-aware separation methods, it is applicable to separation with unseen speakers who have never been enrolled before. We show strong qualitative and quantitative results on challenging real-world examples.