We present a model that generates natural language descriptions of images and their regions. Our approach leverages datasets of images and their sentence descriptions to learn about the inter-modal correspondences between language and visual data. Our alignment model is based on a novel combination of Convolutional Neural Networks over image regions, bidirectional Recurrent Neural Networks over sentences, and a structured objective that aligns the two modalities through a multimodal embedding. We then describe a Multimodal Recurrent Neural Network architecture that uses the inferred alignments to learn to generate novel descriptions of image regions. We demonstrate that our alignment model produces state of the art results in retrieval experiments on Flickr8K, Flickr30K and MSCOCO datasets. We then show that the generated descriptions significantly outperform retrieval baselines on both full images and on a new dataset of region-level annotations.
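
The structured objective described above pushes each matching image-sentence pair above all mismatched pairs in the shared embedding space. A minimal sketch of such a bidirectional max-margin ranking loss (the margin value and function names are illustrative assumptions, not the paper's exact objective):

```python
import numpy as np

def ranking_loss(img, sent, margin=1.0):
    """Bidirectional max-margin ranking loss over a batch of matching
    (image, sentence) embedding pairs.
    img, sent: (n, d) L2-normalised embeddings; row i matches row i."""
    scores = img @ sent.T                # cosine similarities, shape (n, n)
    pos = np.diag(scores)                # scores of the matching pairs
    # hinge on every mismatched pair, in both retrieval directions
    cost_s = np.maximum(0.0, margin + scores - pos[:, None])  # image -> sentence
    cost_i = np.maximum(0.0, margin + scores - pos[None, :])  # sentence -> image
    np.fill_diagonal(cost_s, 0.0)
    np.fill_diagonal(cost_i, 0.0)
    return cost_s.sum() + cost_i.sum()

# perfectly aligned orthonormal embeddings already satisfy the margin
print(ranking_loss(np.eye(3), np.eye(3)))  # 0.0
```

Minimizing this loss simultaneously improves both retrieval directions reported in the experiments (image annotation and image search).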

In recent years, the problem of associating a sentence with an image has gained a lot of attention. This work continues to push the envelope and makes further progress in the performance of the image annotation and image search by a sentence tasks. In this work, we use the Fisher Vector as a sentence representation by pooling the word2vec embedding of each word in the sentence. The Fisher Vector is typically taken as the gradients of the log-likelihood of descriptors, with respect to the parameters of a Gaussian Mixture Model (GMM). In this work we present two other Mixture Models and derive their Expectation-Maximization and Fisher Vector expressions. The first is a Laplacian Mixture Model (LMM), which is based on the Laplacian distribution. The second Mixture Model presented is a Hybrid Gaussian-Laplacian Mixture Model (HGLMM) which is based on a weighted geometric mean of the Gaussian and Laplacian distributions. Finally, by using the new Fisher Vectors derived from HGLMMs to represent sentences, we achieve state-of-the-art results for both the image annotation and the image search by a sentence tasks on four benchmarks.
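
The Fisher Vector pooling described here can be sketched for the mean parameters of a diagonal-covariance GMM: each word vector contributes its responsibility-weighted, variance-normalized deviation from every component mean, yielding a fixed-length sentence representation. A minimal sketch (only the mean-gradient part of the full Fisher Vector; the function name and details are illustrative):

```python
import numpy as np

def fisher_vector_means(X, weights, means, sigmas):
    """Mean-part of the Fisher Vector of descriptors X (n, d) under a
    diagonal-covariance GMM with K components:
    weights (K,), means (K, d), sigmas (K, d)."""
    n = X.shape[0]
    # log joint densities log(w_k * N(x | mu_k, sigma_k^2)), shape (n, K)
    log_p = np.stack([
        -0.5 * np.sum(((X - m) / s) ** 2 + np.log(2 * np.pi * s ** 2), axis=1)
        for m, s in zip(means, sigmas)], axis=1)
    log_p += np.log(weights)
    # posterior responsibilities gamma[t, k], computed stably
    gamma = np.exp(log_p - log_p.max(axis=1, keepdims=True))
    gamma /= gamma.sum(axis=1, keepdims=True)
    # responsibility-weighted normalized deviations, one block per component
    fv = [(gamma[:, k:k+1] * (X - means[k]) / sigmas[k]).sum(axis=0)
          / (n * np.sqrt(weights[k])) for k in range(len(weights))]
    return np.concatenate(fv)

# pooling 7 toy "word2vec" vectors into one fixed-length (K*d = 6) vector
words = np.random.default_rng(0).normal(size=(7, 3))
sent_repr = fisher_vector_means(words, np.array([0.5, 0.5]),
                                np.zeros((2, 3)), np.ones((2, 3)))
```

The representation has dimension K*d regardless of sentence length, which is what makes it usable as a sentence descriptor.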

Books are a rich source of both fine-grained information, how a character, an object or a scene looks, as well as high-level semantics, what someone is thinking, feeling and how these states evolve through a story. This paper aims to align books to their movie releases in order to provide rich descriptive explanations for visual content that go semantically far beyond the captions available in current datasets. To align movies and books we exploit a neural sentence embedding that is trained in an unsupervised way from a large corpus of books, as well as a video-text neural embedding for computing similarities between movie clips and sentences in the book. We propose a context-aware CNN to combine information from multiple sources. We demonstrate good quantitative performance for movie/book alignment and show several qualitative examples that showcase the diversity of tasks our model can be used for.
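
The first alignment step can be illustrated in a toy form: once clips and sentences live in a shared embedding space, each clip is matched to its most similar sentence. The paper's learned embeddings and context-aware CNN are replaced here by plain cosine similarity; all names are assumptions:

```python
import numpy as np

def align_clips_to_sentences(clip_emb, sent_emb):
    """Match each movie clip to the book sentence with the highest
    cosine similarity in a shared embedding space. clip_emb: (n_clips, d),
    sent_emb: (n_sents, d). Returns one sentence index per clip."""
    c = clip_emb / np.linalg.norm(clip_emb, axis=1, keepdims=True)
    s = sent_emb / np.linalg.norm(sent_emb, axis=1, keepdims=True)
    sim = c @ s.T                    # similarity matrix (n_clips, n_sents)
    return sim.argmax(axis=1)        # best sentence per clip
```

In the paper this raw similarity matrix is only the starting point; the context-aware model then enforces consistency across neighboring clips.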

...ere is a wide variety of video applications based on the description, ranging from editing, indexing, search, to sharing. However, the problem itself has been taken as a grand challenge for decades in the research communities, as the description generation model should be powerful enough not only to recognize key objects from visual content, but also discover their spatio-temporal relationships and the dynamics expressed in a natural language. Despite the difficulty of the problem, there have been a few attempts to address video description generation [5, 30, 34], and image caption generation [6, 13, 16, 31], which are mainly inspired by recent advances in machine translation using RNN [1]. Among these successful attempts, most of them use Long Short-Term Memory (LSTM) [9], a variant of RNN, which can capture long-term temporal information by mapping sequences to sequences. Thus, we follow this elegant recipe and use LSTM as our RNN model to generate the video sentence in this paper. However, existing approaches to video description generation mainly optimize the next word given the input video and previous words locally, while leaving the relationship between the semantics of the entire sentence...
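
The word-by-word generation the snippet describes can be illustrated with a greedy decoding loop: at each step the model scores the vocabulary given the video and the words emitted so far, and the arg-max word is appended until an end token appears. The scoring function below is a hypothetical stand-in for one trained LSTM step, not a real model:

```python
import numpy as np

vocab = ["<bos>", "a", "man", "is", "cooking", "<eos>"]

def next_word_scores(video_feat, prev_words):
    # hypothetical stand-in: a real system runs one LSTM step here,
    # conditioned on the video feature and the word history
    order = ["a", "man", "is", "cooking", "<eos>"]
    t = len(prev_words) - 1                  # steps taken after <bos>
    scores = np.full(len(vocab), -1.0)
    scores[vocab.index(order[min(t, len(order) - 1)])] = 1.0
    return scores

def greedy_decode(video_feat, max_len=10):
    """Emit the locally best next word until <eos> or max_len."""
    words = ["<bos>"]
    while words[-1] != "<eos>" and len(words) < max_len:
        words.append(vocab[int(np.argmax(next_word_scores(video_feat, words)))])
    return words[1:-1]                       # drop <bos>/<eos>

print(greedy_decode(None))  # ['a', 'man', 'is', 'cooking']
```

This is exactly the local, next-word-at-a-time optimization the snippet criticizes: nothing in the loop looks at the semantics of the whole sentence.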

In the traditional object recognition pipeline, descriptors are densely sampled over an image, pooled into a high dimensional non-linear representation and then passed to a classifier. In recent years, Fisher Vectors have proven empirically to be the leading representation for a large variety of applications. The Fisher Vector is typically taken as the gradients of the log-likelihood of descriptors, with respect to the parameters of a Gaussian Mixture Model (GMM). Motivated by the assumption that different distributions should be applied for different datasets, we present two other Mixture Models and derive their Expectation-Maximization and Fisher Vector expressions. The first is a Laplacian Mixture Model (LMM), which is based on the Laplacian distribution. The second Mixture Model presented is a Hybrid Gaussian-Laplacian Mixture Model (HGLMM) which is based on a weighted geometric mean of the Gaussian and Laplacian distributions. An interesting property of the Expectation-Maximization algorithm for the latter is that in the maximization step, each dimension in each component is chosen to be either a Gaussian or a Laplacian. Finally, by using the new Fisher Vectors derived from HGLMMs, we achieve state-of-the-art results for both the image annotation and the image search by a sentence tasks.
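
The per-dimension Gaussian-or-Laplacian choice in the maximization step can be illustrated in a single-component setting: fit both distributions to each dimension by maximum likelihood and keep whichever attains the higher log-likelihood. This is a simplification of the full EM step (no mixture, no responsibilities); the function name and tolerances are assumptions:

```python
import numpy as np

def choose_dim_distributions(X):
    """For each column of X (n, d), compare the ML Gaussian fit
    (mean/variance) against the ML Laplacian fit (median/mean absolute
    deviation) and report which family fits better."""
    n, d = X.shape
    choices = []
    for j in range(d):
        x = X[:, j]
        var = x.var() + 1e-12
        ll_gauss = -0.5 * n * (np.log(2 * np.pi * var) + 1)   # ML Gaussian log-lik
        b = np.abs(x - np.median(x)).mean() + 1e-12
        ll_lap = -n * (np.log(2 * b) + 1)                     # ML Laplacian log-lik
        choices.append("laplacian" if ll_lap > ll_gauss else "gaussian")
    return choices

rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=4000), rng.laplace(size=4000)])
print(choose_dim_distributions(X))
```

With enough samples the test recovers the true generating family of each dimension, which is the intuition behind letting each HGLMM dimension pick its own distribution.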

...er can generate novel descriptions from scratch. They use long short-term memory (LSTM) to encode sentences, and the VGG [40] deep convolutional neural network (CNN) to represent images. Vinyals et al. [48] also describe a method of image description generation. Their work was inspired by recent advances in machine translation, where the task is to transform a sentence S written in a source language, in...

Abstract The chain-structured long short-term memory (LSTM) has been shown to be effective in a wide range of problems such as speech recognition and machine translation. In this paper, we propose to extend it to tree structures, in which a memory cell can reflect the history memories of multiple child cells or multiple descendant cells in a recursive process. We call the model S-LSTM, which provides a principled way of considering long-distance interaction over hierarchies, e.g., language or image parse structures. We leverage the models for semantic composition to understand the meaning of text, a fundamental problem in natural language understanding, and show that it outperforms a state-of-the-art recursive model by replacing its composition layers with the S-LSTM memory blocks. We also show that utilizing the given structures is helpful in achieving a performance better than that without considering the structures.
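
The composition over child cells can be sketched in the style of a child-sum tree-structured LSTM node. This follows the general tree-LSTM recipe rather than the paper's exact S-LSTM equations; all weight names and shapes are illustrative:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def tree_lstm_node(children_h, children_c, x, W, U, b):
    """Compose one tree node from its children (child-sum style).
    W, U: (4d, d) input/recurrent weights; b: (4d,) bias.
    Gate blocks are stacked as [input, output, update, forget]."""
    d = x.shape[0]
    h_sum = np.sum(children_h, axis=0)           # summed child hidden states
    z = W @ x + U @ h_sum + b
    i, o = sigmoid(z[:d]), sigmoid(z[d:2*d])
    u = np.tanh(z[2*d:3*d])
    c = i * u                                    # fresh content for this node
    for h_k, c_k in zip(children_h, children_c):
        # a separate forget gate per child lets distinct subtree
        # histories survive into the parent memory cell
        f_k = sigmoid(W[3*d:] @ x + U[3*d:] @ h_k + b[3*d:])
        c += f_k * c_k
    h = o * np.tanh(c)
    return h, c

# with all-zero weights every gate is 0.5, so the parent cell is the
# half-sum of its children's cells
d = 2
zeros = (np.zeros((4 * d, d)), np.zeros((4 * d, d)), np.zeros(4 * d))
h, c = tree_lstm_node([np.ones(d), np.ones(d)], [np.ones(d), np.ones(d)],
                      np.zeros(d), *zeros)
```

Applying this node bottom-up over a parse tree yields the recursive, full-order summarization of history that the abstract describes.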

...show that it outperforms a state-of-the-art recursive model by replacing its composition layers with the S-LSTM memory blocks. We also show that utilizing the given structures is helpful in achieving a performance better than that without considering the structures. 1. Introduction Recent years have seen a revival of the long short-term memory (LSTM) (Hochreiter & Schmidhuber, 1997), with its effectiveness being demonstrated on a wide range of problems such as speech recognition (Graves et al., 2013), machine translation (Sutskever et al., 2014; Cho et al., 2014), and image-to-text conversion (Vinyals et al., 2014), among many others, in which history is summarized and coded in the memory blocks in a full-order fashion. [Proceedings of the 32nd International Conference on Machine Learning, Lille, France, 2015. JMLR: W&CP volume 37. Copyright 2015 by the author(s).] Recursion is a fundamental process associated with many problems—a recursive process and the structure it forms are common in different modalities. For example, the semantics of sentences in human languages is arguably carried not merely by a linear concatenation of words; instead, sentences often have structures (Manning & Schutze, 1999). I...

Abstract—We propose the task of free-form and open-ended Visual Question Answering (VQA). Given an image and a natural language question about the image, the task is to provide an accurate natural language answer. Mirroring many real-world scenarios, such as helping the visually impaired, both the questions and answers are open-ended. Visual questions selectively target different areas of an image, including background details and underlying context. As a result, a system that succeeds at VQA typically needs a more detailed understanding of the image and complex reasoning than a system producing generic image captions. Moreover, VQA is amenable to automatic evaluation, since many open-ended answers contain only a few words or a closed set of answers that can be provided in a multiple-choice format. We provide a dataset containing 100,000s of images and questions and discuss the information it provides. Numerous baselines for VQA are provided and compared with human performance.
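
The automatic evaluation mentioned above is commonly implemented as a consensus metric over the ten human answers collected per question. A simplified sketch of that accuracy rule (the official metric averages over annotator subsets; this is the usual shortcut form, and the function name is an assumption):

```python
def vqa_accuracy(predicted, human_answers):
    """Consensus accuracy for open-ended VQA: an answer is fully
    correct if at least 3 of the human annotators gave it, and
    partially correct in proportion to the agreement otherwise."""
    matches = sum(a == predicted for a in human_answers)
    return min(matches / 3.0, 1.0)

humans = ["2"] * 6 + ["two"] * 3 + ["3"]
print(vqa_accuracy("2", humans))    # 1.0  (6 annotators agree)
print(vqa_accuracy("two", humans))  # 1.0  (exactly 3 agree)
print(vqa_accuracy("3", humans))    # 1/3  (only 1 agrees)
```

The soft scoring tolerates the natural spread of human answers ("2" vs. "two") without requiring any single ground-truth string.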

...ing that combines Computer Vision (CV), Natural Language Processing (NLP), and Knowledge Representation & Reasoning (KR) has dramatically increased in the past year [14], [7], [10], [31], [22], [20], [42]. Part of this excitement stems from a belief that multi-discipline tasks like image captioning are a step towards solving AI. However, the current state of the art demonstrates that a coarse scene-le...

This work aims to address the problem of image-based question-answering (QA) with new models and datasets. In our work, we propose to use neural networks and visual semantic embeddings, without intermediate stages such as object detection and image segmentation, to predict answers to simple questions about images. Our model performs 1.8 times better than the only published results on an existing image QA dataset. We also present a question generation algorithm that converts image descriptions, which are widely available, into QA form. We used this algorithm to produce an order-of-magnitude larger dataset, with more evenly distributed answers. A suite of baseline results on this new dataset is also presented.
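
The description-to-QA conversion can be illustrated with a single toy hand-written rule. The paper's algorithm is far more general; the regex, word list, and function name here are illustrative assumptions, not the published method:

```python
import re

# one illustrative rule: "<article> <color> <noun>" becomes a color question
COLORS = {"red", "blue", "green", "yellow", "white", "black"}

def caption_to_qa(caption):
    """Turn a caption into a (question, answer) pair when the toy
    color rule applies; return None otherwise."""
    for m in re.finditer(r"\b(a|an|the)\s+(\w+)\s+(\w+)", caption.lower()):
        _, adj, noun = m.groups()
        if adj in COLORS:
            return (f"What color is the {noun}?", adj)
    return None

print(caption_to_qa("A man stands next to a red car."))
# → ('What color is the car?', 'red')
```

Running such rules over a large caption corpus is what lets a description dataset be recast as an order-of-magnitude larger QA dataset.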

Generating a novel textual description of an image is an interesting problem that connects computer vision and natural language processing. In this paper, we present a simple model that is able to generate descriptive sentences given a sample image. This model has a strong focus on the syntax of the descriptions. We train a purely bilinear model that learns a metric between an image representation (generated from a previously trained Convolutional Neural Network) and phrases that are used to describe it. The system is then able to infer phrases from a given image sample. Based on caption syntax statistics, we propose a simple language model that can produce relevant descriptions for a given test image using the phrases inferred. Our approach, which is considerably simpler than state-of-the-art models, achieves comparable results in two popular datasets for the task: Flickr30k and the recently proposed Microsoft COCO.
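
The bilinear metric between image representations and phrase embeddings amounts to a compatibility score s(x, y) = x^T W y, with phrases ranked by their score against a given image. A minimal sketch with toy dimensions (all shapes and names are illustrative assumptions):

```python
import numpy as np

def bilinear_scores(image_vec, phrase_vecs, W):
    """Bilinear compatibility s(x, y) = x^T W y between one image
    representation (d_img,) and a bank of phrase embeddings
    (n_phrases, d_phr); W has shape (d_img, d_phr)."""
    return phrase_vecs @ (W.T @ image_vec)   # one score per phrase

rng = np.random.default_rng(1)
x = rng.normal(size=4)                 # CNN image feature (toy size)
phrases = rng.normal(size=(3, 5))      # 3 candidate phrase embeddings
W = rng.normal(size=(4, 5))            # the learned bilinear metric
s = bilinear_scores(x, phrases, W)
best = int(np.argmax(s))               # index of the best-matching phrase
```

At inference time, the top-scoring phrases are handed to the syntax-based language model to assemble a full sentence.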

...ation is to segment video sequences into smaller clips that contain subactions, using a hierarchical approach (Pirsiavash and Ramanan, 2014). The generation of short descriptions from video sequences (Vinyals et al., 2015) based on convolutional neural networks (CNN) (Ciresan et al., 2011) was also used for activity recognition (Donahue et al., 2015). Intermediate semantic features representation for recognizing unsee...