Improving Multi-Modal Representations Using Image Dispersion: Why Less is Sometimes More

Models that learn semantic representations from both linguistic and perceptual input outperform text-only models in many contexts and better reflect human concept acquisition. However, experiments suggest that while the inclusion of perceptual input improves representations of certain concepts, it degrades the representations of others. We propose an unsupervised method to determine whether to include perceptual input for a concept, and show that it significantly improves the ability of multi-modal models to learn and represent word meanings. The method relies solely on image data, and can be applied to a variety of other NLP tasks.
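As a rough illustration only (not the authors' released implementation), one plausible reading of the image dispersion criterion named in the title is the average pairwise cosine distance among the image vectors collected for a concept: concepts whose images cluster tightly would receive perceptual input, while highly dispersed concepts would fall back to text alone. The sketch below assumes NumPy feature arrays and a hypothetical threshold-based decision rule.

```python
# Minimal sketch, assuming image features are rows of a NumPy array.
# The dispersion measure and the threshold rule are illustrative assumptions,
# not the paper's exact procedure.
import numpy as np

def image_dispersion(image_vectors: np.ndarray) -> float:
    """Mean pairwise cosine distance among a concept's image vectors.

    image_vectors: array of shape (n_images, dim), one row per image.
    """
    # Normalise each vector so that dot products become cosine similarities.
    normed = image_vectors / np.linalg.norm(image_vectors, axis=1, keepdims=True)
    sims = normed @ normed.T                  # pairwise cosine similarities
    iu = np.triu_indices(len(normed), k=1)    # unique unordered pairs
    return float(np.mean(1.0 - sims[iu]))     # distance = 1 - similarity

def use_perceptual_input(image_vectors: np.ndarray, threshold: float) -> bool:
    # Hypothetical decision rule: include image features only for concepts
    # whose images are sufficiently similar to one another (low dispersion).
    return image_dispersion(image_vectors) < threshold
```

The threshold here is a free parameter of the sketch; how the cut-off between "concrete" and "abstract" concepts is actually set is a detail of the method itself rather than the abstract.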