Abstract: We propose an unsupervised method for reference resolution in instructional
videos, where the goal is to temporally link an entity (e.g., "dressing") to
the action (e.g., "mix yogurt") that produced it. The key challenge lies in
the inevitable visual-linguistic ambiguities arising from changes in both the
visual appearance and the referring expression of an entity over the course
of the video. This
challenge is amplified by the fact that we aim to resolve references with no
supervision. We address these challenges by learning a joint visual-linguistic
model, where linguistic cues can help resolve visual ambiguities and vice
versa. We verify our approach by training our model, without supervision, on
more than two thousand unstructured cooking videos from YouTube, and show
that our visual-linguistic model substantially improves upon a
state-of-the-art linguistic-only model for reference resolution in
instructional videos.