Crossmodal Semantic Representations

Recently at CCRi, we have been doing a lot of research on reduced-dimensional semantic embedding models: models in which semantically similar objects have similar representations. In a previous post, Nick discussed how such a model can be learned for relational data; since relationships between entities are explicitly provided in that data, the resulting representations capture those relationships as well. There are also models that learn representations from plain text; the most popular is word2vec by Tomas Mikolov et al., which trains quickly on immensely large corpora and produces state-of-the-art results [1,2]. At a basic level, word2vec learns the representation of a word from the representations of the words around it. Remarkably, although relationships are never explicitly stated in a plain text corpus, the model discovers latent relationships in the text and captures them in the learned vectors. This capability is traditionally demonstrated through “analogy completion”, answering questions of the form “Paris is to France as Berlin is to _” (which can be thought of as the representations effectively learning to capture the “capital of” predicate).
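Analogy completion reduces to simple vector arithmetic: the answer to “a is to b as c is to _” is the word whose vector is closest to b − a + c. The sketch below illustrates the idea with tiny hand-crafted vectors (a real word2vec model would learn hundreds of dimensions from a corpus; these toy vectors and words are ours, chosen purely for illustration):

```python
import numpy as np

# Toy 4-dimensional "embeddings", hand-crafted so the analogy works out.
# A trained model would learn these dimensions automatically from text.
vectors = {
    "paris":   np.array([1.0, 0.0, 1.0, 0.0]),
    "france":  np.array([1.0, 0.0, 0.0, 0.0]),
    "berlin":  np.array([0.0, 1.0, 1.0, 0.0]),
    "germany": np.array([0.0, 1.0, 0.0, 0.0]),
    "london":  np.array([0.0, 0.0, 1.0, 1.0]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def complete_analogy(a, b, c):
    """Answer 'a is to b as c is to _' via vector arithmetic: b - a + c."""
    target = vectors[b] - vectors[a] + vectors[c]
    # Rank the remaining vocabulary by cosine similarity to the target point.
    candidates = {w: v for w, v in vectors.items() if w not in (a, b, c)}
    return max(candidates, key=lambda w: cosine(candidates[w], target))

print(complete_analogy("france", "paris", "germany"))  # -> berlin
```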

word2vec CBOW architecture example (from [3])

One of our recent goals was to develop an architecture that allows for crossmodal semantic representations: representations of words, paragraphs, documents, and images in the same semantic space. Such a model would enable semantic search across modalities (e.g. searching for “puppy” would return words, pictures, and paragraphs all relating to puppies). To learn joint embeddings for words, paragraphs, and documents, we extended the gensim implementation of word2vec to additionally learn vector representations for text snippets of arbitrary length, as in [3]. Loosely speaking, this is done by adding paragraph and document vectors to the contexts used in the more traditional word2vec algorithm.
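Concretely, in the distributed-memory (DM) architecture of [3], every training context for a word also contains a token standing for the enclosing paragraph, so the paragraph vector is trained alongside the word vectors. A minimal sketch of how those (context, target) pairs could be generated (the function name, document ID, and window convention here are our own illustrative choices, not the gensim internals):

```python
def dm_training_pairs(doc_id, words, window=2):
    """Yield (context, target) pairs in the spirit of the doc2vec DM
    architecture: each context is the surrounding words within the window,
    plus a token for the paragraph itself, so the paragraph vector gets
    updated by every prediction made inside that paragraph."""
    for i, target in enumerate(words):
        left = words[max(0, i - window):i]
        right = words[i + 1:i + 1 + window]
        yield ([doc_id] + left + right, target)

pairs = list(dm_training_pairs("DOC_42", ["the", "quick", "brown", "fox"], window=1))
# First pair: the paragraph token plus "quick" predicts "the".
print(pairs[0])
```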

doc2vec DM architecture example (from [3])

Finally, in order to add images to this semantic space, we applied standard image feature extraction techniques and learned a mapping that preserves semantics across modalities (sending pictures of cats near the word “cat”, for instance). This technique has been explored in the literature with impressive results [4,5,6]. However, learning a mapping in this way requires a large volume of labeled training data. For natural imagery, the obvious choice is ImageNet, which aims to provide an average of 1000 images for every noun in WordNet. For aerial imagery, no similar corpus exists (the UC Merced Aerial Land Use data set is probably the closest, with only 21 classes of 100 images each), so we needed to build our own. We developed a tool that automatically constructs training data from GIS layers, allowing us to quickly generate new data or label existing data using GIS layers we already have.
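One common way to learn such a cross-modal mapping, used in several of the cited works, is a linear map fit by regularized least squares from image feature vectors to the word vectors of their labels. A minimal sketch with synthetic stand-in data (the dimensions, noise level, and regularization constant are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins: rows of X are image feature vectors, rows of Y are
# the word vectors of each image's label (e.g. the vector for "cat").
n, d_img, d_word = 200, 64, 16
W_true = rng.normal(size=(d_img, d_word))          # hidden "ground truth" map
X = rng.normal(size=(n, d_img))
Y = X @ W_true + 0.01 * rng.normal(size=(n, d_word))

# Ridge-regularized least squares: W = argmin ||XW - Y||^2 + lam * ||W||^2,
# solved in closed form via the normal equations.
lam = 1e-3
W = np.linalg.solve(X.T @ X + lam * np.eye(d_img), X.T @ Y)

# A new image's features can now be projected into the word-embedding space
# and compared against word (or paragraph) vectors by cosine similarity.
x_new = rng.normal(size=(d_img,))
v = x_new @ W  # embedding of the image in the joint semantic space
```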

Here are some samples of the results, with the embeddings projected into 2D using t-SNE [7]. (Images property of DigitalGlobe.)

Water tiles grouping together
Semantically similar phrases grouping together
Pictures of marshes grouping next to marsh-related words
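For readers who want to produce similar visualizations, a t-SNE projection like the ones above can be computed with scikit-learn in a few lines (the random embedding matrix here is a placeholder for a real set of learned vectors):

```python
import numpy as np
from sklearn.manifold import TSNE

# Placeholder for a real embedding matrix: 100 vectors in a
# 128-dimensional semantic space.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(100, 128))

# Project to 2-D for plotting; perplexity must be smaller than the
# number of samples.
coords = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
print(coords.shape)  # one (x, y) point per embedding
```

The resulting `coords` array can be handed directly to a scatter plot, with each point annotated by its word, paragraph, or image thumbnail.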