Analysis of Audio-Visual Features for Unsupervised Speech Recognition

Jennifer Drexler, James Glass

Research on “zero resource” speech processing focuses on learning linguistic information from unannotated, or raw, speech data, bypassing the expensive annotations required by current speech recognition systems. While most recent zero-resource work has used only speech recordings, we investigate visual information as a source of weak supervision, to see whether grounding speech in a visual context provides additional benefit for language learning. Specifically, we use a dataset of paired images and audio captions to supervise the learning of low-level speech features that can then be used for further “unsupervised” processing of any speech data. We analyze these features and evaluate their performance on the Zero Resource Challenge 2015 evaluation metrics, as well as on standard keyword spotting and speech recognition tasks. We show that features generated with a joint audio-visual model contain more discriminative linguistic information and are less speaker-dependent than traditional speech features. Our results show that visual grounding can improve speech representations for a variety of zero-resource tasks.