This paper presents a well-designed system that automatically learns language models for describing images with textual descriptions. The system uses a step-by-step pipeline: first it identifies object/"stuff" regions relevant to the entire image, then it applies attribute classifiers to assign probabilities of descriptive qualities of those objects, i.e. adjectives. Next, the system uses its pre-trained language model to qualify objects/stuff with positional prepositions such as "[Object a] is near [Object b]". The system then constructs a Conditional Random Field (CRF) model comprising three kinds of nodes: (1) objects, (2) attributes, and (3) prepositions. A preposition node typically forms a 3-clique with two object nodes to quantify their relation, while an attribute node singly qualifies a property of an object node, as in "furry [Object a]". The authors further describe the unary, pairwise, and trinary potential functions in the energy function over labelings L of the input image I. While the details are outlined in the paper, one should note that the authors used Flickr image descriptions to learn the attribute classifiers and prepositional keywords, which means the system is indeed entirely automatic.
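To make the energy decomposition concrete, here is a minimal sketch (not the authors' code; all function names, tables, and numeric values are made-up illustrations) of how an energy over a labeling might sum unary, pairwise, and trinary potentials:

```python
# Illustrative sketch of a Baby-Talk-style energy over a labeling L:
# unary scores for objects and attributes, pairwise attribute-object
# scores, and trinary scores for (object, preposition, object) 3-cliques.
# All potential tables below are invented numbers, not learned values.

def energy(objects, attributes, prepositions,
           unary_obj, unary_attr, pair_attr_obj, tri_prep):
    """Sum the potentials contributed by a candidate labeling."""
    e = 0.0
    for o in objects:                          # unary object terms
        e += unary_obj[o]
    for o, a in attributes:                    # attribute a modifies object o
        e += unary_attr[a] + pair_attr_obj[(a, o)]
    for o1, p, o2 in prepositions:             # 3-clique: (obj1, prep, obj2)
        e += tri_prep[(o1, p, o2)]
    return e

unary_obj = {"cow": -1.2, "sky": -0.8}
unary_attr = {"brown": -0.5}
pair_attr_obj = {("brown", "cow"): -0.3}
tri_prep = {("cow", "by", "sky"): -0.7}

E = energy(["cow", "sky"], [("cow", "brown")], [("cow", "by", "sky")],
           unary_obj, unary_attr, pair_attr_obj, tri_prep)
print(round(E, 1))  # -3.5
```

In the actual system the labeling with the best energy is found by CRF inference over all candidate detections, not by evaluating a single hand-picked labeling as above.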

The authors present interesting (and sometimes funny!) textual descriptions. Even the descriptions that "work well" aren't always right: for example, "the cow is BY the sky". It is surprising that the system doesn't learn the preposition "UNDER" during the text-mining step. Further, in cases where the system doesn't work well, the authors attribute the mistakes to a missed object detection, a detection proposing an incorrect object category, or an incorrect attribute prediction. However, this explanation seems insufficient, given the lack of reasoning about why these failures happened and the lack of a good quantitative metric. The authors could have constructed their own metric: ideally, they should have written textual descriptions by hand (themselves) and compared the generated descriptions against them. Constructing such a dataset would also have contributed more to the computer vision research community.

I think the authors built the whole system, including the CRF model and the language model, on the assumption that their detectors/classifiers work well, so it seems quite reasonable to me that the system fails when a mis-detection or mis-classification occurs. I don't really understand why you said this explanation is insufficient.

As for the image caption dataset, I think Xinlei is currently working on one. I just found this on arXiv: the Microsoft COCO Captions dataset (http://arxiv.org/pdf/1504.00325v2.pdf).

I read the second paper, "Im2Text: Describing Images Using 1 Million Captioned Photographs". This work tackles the problem of generating descriptions for web-scale image collections. The authors filtered the web-scale data so that the descriptions are more likely to refer to the visual content of the image. When re-ranking the retrieved images, they combined estimates from several visual categories with a tf-idf score to compute similarity to the query image. This approach is quite content-based and works well. I wonder, though, whether a method could reason about the structure between the visual contents, since the description of an image should be a "sentence" like one a human would write rather than a set of words.
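As a rough illustration of the re-ranking idea (this is my own sketch, not the Im2Text implementation; the weighting scheme, `alpha`, and the toy document frequencies are all assumptions), one can combine a visual-distance term with a tf-idf score on the candidate captions:

```python
# Sketch of combining visual distance with caption tf-idf for re-ranking.
# Lower visual distance and higher tf-idf (more content words) are better.
import math
from collections import Counter

def tfidf_score(caption, doc_freq, n_docs):
    """Sum of tf-idf weights of the caption's words."""
    tf = Counter(caption.lower().split())
    return sum(c * math.log(n_docs / (1 + doc_freq.get(w, 0)))
               for w, c in tf.items())

def rerank(candidates, doc_freq, n_docs, alpha=0.5):
    """candidates: list of (visual_distance, caption) pairs.
    Combine both terms into one score and sort ascending (best first)."""
    scored = [(alpha * dist - (1 - alpha) * tfidf_score(cap, doc_freq, n_docs),
               cap) for dist, cap in candidates]
    return [cap for _, cap in sorted(scored)]

# Toy document frequencies over a pretend 1000-caption corpus.
doc_freq = {"a": 900, "the": 950, "dog": 40, "beach": 25, "on": 800}
ranked = rerank([(0.9, "a dog on the beach"), (0.4, "the the the")],
                doc_freq, n_docs=1000)
print(ranked[0])  # the content-bearing caption wins despite larger distance
```

The point of the tf-idf term is exactly the filtering the paper describes: captions full of stop words score near zero, so content-bearing captions are preferred even when their images are slightly less similar.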

A very interesting read. I am amazed by the accuracy of the resulting phrases; even the results that were slightly off capture some meaning of the image and make some sense. What I am curious about, and what the paper does not fully explain, is how they determine the grammar for grouping the descriptive phrases, since the captions in the training sets were not very grammatically organized.

Also, I would like some opinions on why the contribution of this paper would be useful. How would one apply it in real-life use cases?

It is kind of remarkable that these methods (Im2Text excluded) seem to learn anything at all given so little data, especially since many state-of-the-art NLP approaches use heaps and heaps of data. Although the Im2Text method uses more data, it relies on retrieval, which feels a bit like cheating. I wonder if the fact that nearest-neighbor-type approaches perform comparably means there is not a lot of learning going on.

I would be very interested in seeing the "Baby Talk" paper redone with deep learning. Their work seems like a solid basis for further research. Admittedly, it is pretty limited in vocabulary (24 objects, 21 attributes, etc.), and the potentials are somewhat hand-crafted (using pretty solid engineering, though; I like how they queried Google and Flickr in all the right places). I guess their N-gram language model is outdated today? Maybe with more data, a better language model, and deep learning, this could give really good results.

I read the main paper, "Learning a Recurrent Visual Representation for Image Caption Generation". The main contribution over previously proposed RNN structures is a latent variable that captures the visual features of the generated words. This allows the model to remember which visual features were represented by which word, simulating long-term memory. My question is: although this introduces a memory and allows for bidirectional inference, doesn't it just end up adding a constraint? It sort of forces the model to keep learning what it has already learned, similar to how the authors say that connecting the visual layer to only half of the "s" layer allows for better specialization. It does seem to do better with more data: the gap in both sentence and image recall grows as the dataset size increases, from the PASCAL results to the Flickr 8k and 30k results. That makes sense, as we've learned that neural networks scale much better with more data.
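My mental model of the recurrence is something like the toy sketch below (the shapes, weight names, and update rules are my own assumptions, not the paper's exact equations): the hidden state s sees the current word, the static image features v at every step, and a visual memory u that accumulates what has been said so far.

```python
# Toy sketch of a recurrent step with a "visual memory" u alongside the
# hidden state s.  Static image features v are fed in at every step; u is
# updated from each word so it accumulates visual concepts over time.
import numpy as np

rng = np.random.default_rng(0)
D_w, D_v, D_s = 8, 6, 10          # word, visual, hidden dimensions (invented)
W_w = rng.normal(size=(D_s, D_w)) * 0.1
W_v = rng.normal(size=(D_s, D_v)) * 0.1
W_u = rng.normal(size=(D_s, D_v)) * 0.1
W_s = rng.normal(size=(D_s, D_s)) * 0.1
U_w = rng.normal(size=(D_v, D_w)) * 0.1   # maps words into visual memory

def step(w, v, s, u):
    """One recurrent step: update hidden state and visual memory."""
    s_next = np.tanh(W_w @ w + W_v @ v + W_u @ u + W_s @ s)
    u_next = u + U_w @ w                   # accumulate per-word visual concepts
    return s_next, u_next

v = rng.normal(size=D_v)                   # static image features
s = np.zeros(D_s)
u = np.zeros(D_v)                          # long-term visual memory starts empty
for t in range(5):                         # five word steps
    w = rng.normal(size=D_w)               # stand-in for a word embedding
    s, u = step(w, v, s, u)
print(s.shape, u.shape)
```

The interesting part relative to a vanilla RNN is the `u_next` line: the memory is additive across steps rather than squashed through the hidden state, which is roughly how I read the paper's claim about remembering visual concepts long-term.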

I read the paper on using RCNNs and image features for image caption generation. As I read it, it seemed to me that there is an inherent (perhaps significant?) bias in retaining words from previous time steps. It is also evident that the generation of a particular word depends heavily on the detection of a particular object in the image. If an incorrect object is detected (which is sometimes the case) and an incorrect caption word for that object is generated, retaining that word in memory and generating visual features from it in later time steps would seem particularly disruptive to the overall model. Positive reinforcement of these incorrect words, as the paper says, would propagate from one time step to the next. Although the idea is fascinating, I feel that significant filtering of incorrect words might be needed for certain images. For example, figure 3 in the paper shows a few examples where the generated captions are incorrect. Exploring the spatial relationships of objects detected in images using deep networks, as the paper suggests, might be very beneficial for pruning out incorrect captions.

Today I'll be presenting the main paper, "Learning a Recurrent Visual Representation for Image Caption Generation". This paper discusses a neural network that models the bidirectional mapping between images and their sentence-based descriptions using a recurrent neural network. The interesting thing about this network is that it has the structure of both a recurrent neural network, used to generate a new word at each time step, and an auto-encoder, which enables reconstruction of the image feature vector. I hope to have a good discussion with all of you today in class.

I read the main paper and found it to be very interesting. I have the same question as Tejas: how does the system handle incorrect detections at the early stages?

The other aspect of the problem I found interesting was the ambiguity in comparing and benchmarking results. The paper reports that 19.8% of the time, the annotations generated by the algorithm were preferred over the human-generated annotations. I find this quite ambiguous, as what counts as an acceptable or better annotation is not straightforward to define. Also, how was evaluation done during development? It would have been infeasible to involve human subjects in evaluating results during development. The authors also used metrics like perplexity, but I am not sure how one can define a single ground-truth annotation for a given image.
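For context on the perplexity metric: it does not require a single ground-truth caption, only the model's probability for each word of a reference caption. A quick sketch (the per-word probabilities below are made up) of the standard definition, exponentiated average negative log-probability per word:

```python
# Perplexity of a caption under a language model: exp of the average
# negative log-probability per word.  Lower is better (model is less
# "surprised" by the caption).  Probabilities here are invented.
import math

def perplexity(word_probs):
    """word_probs: the model's probability of each word in the caption."""
    n = len(word_probs)
    return math.exp(-sum(math.log(p) for p in word_probs) / n)

# A caption whose words the model finds likely has low perplexity...
low = perplexity([0.5, 0.4, 0.6, 0.5])
# ...and an unlikely one has high perplexity.
high = perplexity([0.01, 0.05, 0.02, 0.01])
print(low < high)  # True
```

So perplexity sidesteps the "which annotation is the ground truth" problem by scoring each reference caption independently, though it still says nothing about whether a caption is visually accurate.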

I read the main paper and found it interesting. What I understand is that a correlation between the detected words and the visual words seen so far is used to predict the sentence, but as Venkatesh and Tejas pointed out, if the object detection is faulty, the error will propagate throughout. http://cs.stanford.edu/people/karpathy/nips2014.pdf performs a similar task but uses a sentence corpus and detects fragments of the sentence in the image; intuitively this seems more correct. In the evaluation, though, the results look better compared to Baby Talk and Midge, yet judging from the figures, the Midge captions at least relate to something in the image, whereas this paper gives some faulty results describing things that are not in the image. It would be nice to know the reasons for this behavior.

1. "Im2Text: Describing Images Using 1 Million Captioned Photographs". This paper tries to directly transfer the caption of a visually similar image to the query image. Because the dataset is large enough, a visually descriptive and relevant caption can be expected to be found in it. Instead of transferring the whole sentence, some later work along this line fits sentence fragments into templates to create more flexible captions. However, automatic language generation is still missing in this kind of work.

2. "Baby Talk: Understanding and Generating Image Descriptions". This paper models the visual content of an image as modified objects and their spatial relationships using a CRF. Every (attributed object, preposition, attributed object) triple is developed into a sentence with some gluing words. However, the resulting captions are verbose and non-humanlike; this method is unable to express the salient content or the gist of the image in a rich, variable description.

3. "Learning a Recurrent Visual Representation for Image Caption Generation". By considering image and language as two representations of the same semantic space, a lot of work has translated images to captions using CNN+RNN models. Unlike those approaches, which focus on one-directional translation, this paper proposes a model that learns the bidirectional mapping between images and sentences. To achieve this, a symmetric vision-sentence-vision structure is used to remember the spoken words and the observed visual content at the same time. Note that the visual memory U is updated at each time step to help remember visual concepts over the long term; this provides a new way to address the weakness of RNNs in maintaining long context. Another interesting point is that instead of feeding the image features into the network only at the first time step, as in [1] and [2], the static visual features V are constantly shown to the word memory s, which helps error propagation back to the CNN.

This paper addresses the problem of bi-directional mapping between images and captions. Unlike previous approaches that map both images and sentences to a common space, it takes a novel recurrent neural network approach that allows the generation of novel captions that did not appear in the training set.

A recurrent visual memory lets the model retain long-term visual concepts. This memory represents what the model expects (i.e., the prior probability of visual features) and is updated over time depending on the words that appear in the caption. In addition, this network structure also allows predicting visual features from sentences, not just the other way around.

Models like this that are based on recurrent neural networks seem to depend heavily on sentence structure, but we know that the same sentence can be expressed in many equivalent ways with different word orderings (e.g., passive vs. active voice). It would be interesting to see an experiment comparing the effect of word ordering (maybe the network learns to relate these equivalent sentences?), or to incorporate part-of-speech tag information to lessen the reliance on deterministic "temporal" structure.

I think I understand the approach taken in the paper, and I am also curious as to the bias being propagated from early predictions.

Something I don't understand is how human-readable output is generated, as I am not familiar with NLP. Most of the paper refers to individual words, not to the non-descriptive function words (like "the", "is", "to", etc.). Is there an explicit generation or cleanup step for the final output, or does the RNN handle everything?
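For what it's worth, in most RNN captioners I'm aware of there is no separate cleanup pass: function words are ordinary vocabulary entries, and the network emits them like any other word at each step. A toy sketch of greedy decoding (the transition table here is invented, standing in for the RNN's softmax over the full vocabulary):

```python
# Greedy decoding from a toy next-word distribution.  "the" and "is" are
# ordinary vocabulary entries, so they are generated directly; no separate
# grammar-cleanup step is needed.  The probability table is made up.
next_word = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cow": 0.7, "sky": 0.3},
    "cow": {"is": 0.8, "</s>": 0.2},
    "is":  {"brown": 0.9, "</s>": 0.1},
    "brown": {"</s>": 1.0},
}

def greedy_decode(start="<s>", max_len=10):
    words, w = [], start
    for _ in range(max_len):
        w = max(next_word[w], key=next_word[w].get)  # most probable next word
        if w == "</s>":                              # end-of-sentence token
            break
        words.append(w)
    return " ".join(words)

print(greedy_decode())  # the cow is brown
```

So the answer to "does the RNN handle everything?" is, as far as I understand, essentially yes: grammaticality comes from the learned word-transition statistics, not from an explicit post-processing step.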

I read the paper "Learning a Recurrent Visual Representation for Image Caption Generation". As Tejas and others have pointed out, the estimation bias in the sequence depends heavily on the performance of the object detection, which may be incorrect and introduce larger errors as the word-generation sequence goes on. I am curious how this is dealt with, since the issue can arise not only here but in any other RNN-based structure. I also wonder how the generated visual features ("v bar" in the paper) help with caption generation.