Matching Words and Pictures

Abstract

We present a new approach for modeling multi-modal data sets,
focusing on the specific case of segmented images with associated
text. Learning the joint distribution of image regions and words
has many applications. We consider two in detail: predicting words
associated with whole images (auto-annotation) and predicting words
corresponding to particular image regions (region naming). Auto-annotation might
help organize and access large collections of images. Region
naming is a model of object recognition as a process of
translating image regions to words, much as one might translate
from one language to another. Learning the relationships between
image regions and semantic correlates (words) is an interesting
example of multi-modal data mining, particularly because it is
typically hard to apply data mining techniques to collections of
images. We develop a number of models for the joint distribution
of image regions and words, including several which explicitly
learn the correspondence between regions and words. We study
multi-modal and correspondence extensions to Hofmann's
hierarchical clustering/aspect model, a translation model adapted
from statistical machine translation (Brown et al.), and a
multi-modal extension to mixture of latent Dirichlet allocation
(MoM-LDA). All models are assessed using a large collection of
annotated images of real scenes. We study in depth the difficult
problem of measuring performance. For the annotation task, we look
at prediction performance on held-out data. We present three
alternative measures, oriented toward different types of task.
Measuring the performance of correspondence methods is harder,
because one must determine whether a word has been placed on the
right region of an image. We can use annotation performance as a
proxy measure, but accurate measurement requires hand-labeled
data, and thus must occur on a smaller scale. We show results
using both an annotation proxy and manually labeled data.