14 comments:

Regarding Xinlei's paper. Obviously a great paper. (Disclaimer: I am in Abhinav's lab) This paper demonstrates a serious engineering effort that shows how the idea of "macro-vision" can give remarkable results. (Just imagine where this would be with deep learning!) I was curious what you all thought about the relationships they learn. Does this seem like a sufficiently flexible framework for learning relationships? (Obj-Obj, Obj-Atr, Sc-Obj, Sc-Atr). Are there any that are missing? Also, I would be interested in hearing Xinlei's description of whether the knowledge base has exhibited significant drift since NEIL's inception. How many of the current relationships are correct?

This paper proposes a semi-supervised method to automatically extract and learn commonsense visual concepts from the web. Starting from a small amount of seed data, NEIL trains seed classifiers and extracts relationships from the data, then retrains on the data in a semi-supervised manner. To avoid semantic drift, NEIL uses not only the trained classifiers but also the learned relationships to select high-confidence data. They show pretty impressive results, including 2700 learned concepts and nearly 9000 visual models.
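To make the bootstrap loop concrete, here is a minimal, hypothetical sketch in Python. The 1-D "features", the mean-based stand-in classifier, and the crude relationship prior (context tags of related concepts) are all illustrative toys of my own, not NEIL's actual models or scoring:

```python
# Toy sketch of NEIL-style bootstrapping: train on current labels, then
# promote only instances whose classifier score AND relationship prior agree.

def train(examples):
    # stand-in "classifier": the mean of the positive examples
    return sum(examples) / len(examples)

def confidence(model, x):
    # higher when x is close to the class mean
    return 1.0 / (1.0 + abs(model - x))

def bootstrap(seed, web_pool, related, rounds=3, thresh=0.4):
    """seed: concept -> list of seed features.
    web_pool: list of (feature, context_tags) pulled from the web.
    related: concept -> set of concepts believed to co-occur with it."""
    labeled = {c: list(xs) for c, xs in seed.items()}
    pool = list(web_pool)
    for _ in range(rounds):
        models = {c: train(xs) for c, xs in labeled.items()}
        remaining = []
        for feat, tags in pool:
            scores = {}
            for c, m in models.items():
                rel = related.get(c, set())
                # relationship prior: does the instance's context contain
                # concepts believed to co-occur with c? (crude stand-in
                # for NEIL's relationship constraints)
                prior = (1 + len(rel & tags)) / (1 + len(rel))
                scores[c] = confidence(m, feat) * prior
            best = max(scores, key=scores.get)
            if scores[best] > thresh:
                labeled[best].append(feat)   # promote high-confidence instance
            else:
                remaining.append((feat, tags))  # keep for later rounds
        pool = remaining
    return labeled
```

The relationship prior is what keeps a borderline instance from drifting into the wrong concept: classifier confidence alone is not enough to get promoted.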

Strengths
-------------
* It's the very first system to apply the never-ending learning framework to images.
* The system is very complex. I believe a lot of solid engineering was needed to make it work.
* I like the idea of using relationships to prevent semantic drift.

Improvement
-----------------
This is a very good paper. I only have some suggestions for future improvement.

* The method starts from a pre-defined vocabulary, so the system is limited by the vocabulary's size. Future work could combine NELL and NEIL to learn visual and textual knowledge jointly.

* The relationship set also seems to be hard-coded. Is it possible to learn the relationship set automatically?

Divvala et al. describe an automated method of learning wide variations of any concept in their paper titled “Learning Everything about Anything: Webly-Supervised Visual Concept Learning”. Right off the bat, the authors describe the elusive problem of attempting to learn everything about any concept, where “everything” refers to all possible variations of a given concept and “any” refers to the span of different concepts. Clearly, this sounds like a very daunting task. The authors first describe various issues with using explicit human supervision to tackle this problem, such as biases in a manually defined vocabulary and the inability to generalize effectively. To alleviate these issues, the authors introduce a “webly-supervised” approach which uses Google Books Ngrams to generate the vocabulary. A variety of procedures are performed to reduce the number of ngrams considered and to group them by similarity into a total of around 250 superngrams. The image dataset is generated by performing image searches on the ngrams that constitute these superngrams, and models are trained from the downloaded images. Object detection experiments show that the authors' method surpasses two weakly-supervised state-of-the-art methods, but not the fully supervised state of the art.
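The ngram-grouping step can be sketched with a toy greedy procedure. The word-overlap similarity below is my own stand-in for illustration (the paper groups ngrams by how similarly their trained models behave, not by string overlap):

```python
# Illustrative sketch of grouping ngrams into "superngrams".

def word_overlap(a, b):
    # toy similarity: Jaccard overlap of the words in two ngrams
    wa, wb = set(a.split()), set(b.split())
    return len(wa & wb) / len(wa | wb)

def group_ngrams(ngrams, similar, thresh=0.3):
    """Greedy grouping: each ngram joins the first existing group whose
    representative it resembles enough, otherwise it starts a new group."""
    groups = []
    for ng in ngrams:
        for g in groups:
            if similar(ng, g[0]) >= thresh:
                g.append(ng)
                break
        else:
            groups.append([ng])
    return groups
```

Each resulting group would then drive its own batch of image-search queries, one per member ngram.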

Downloading and processing all of the images pulled from the ngram-based search queries seems like it would take a significant amount of time, given that the authors mention they have annotated more than 10 million images. On the system's webpage (http://levan.cs.washington.edu/), it shows 100 million images processed. It would be interesting to see some more details on how the authors crawl and process these images from a systems perspective. Did the authors utilize a cluster of machines to accomplish this task? If so, how?

I read the NEIL paper. It is an immense system, and very innovative. I was wondering how much the initial seed affects the system; for example, if we start with bad seeds, would the system eventually converge and correct for them? Also, as Gunnar mentioned, the idea of introducing flexible relationships is interesting. It seems that the relationships themselves could be learned in an unsupervised way, since there is a lot of information in the semantic labels. In fact, I think that instead of jointly learning from text and images, you could bootstrap the images with the semantics already learned by NELL. Even though there are probably more relationship types to be learned, there could be an iterative process between the language relationship learning and the image relationship learning.

I will be presenting the NEIL and LEVAN papers today. They are interesting because they are not just tackling research problems but have produced engineering systems. We are in an interesting phase where more and more computer vision algorithms are being promoted from the lab to production.

While NEIL focuses on deriving semantic relationships between concepts, LEVAN focuses on acquiring every possible visual subcategory for any given category. We will also watch a short video on Robobrain (http://robobrain.me/#/) which is similar to NEIL but is tailored to providing intelligence to robots.

The authors put forward a semi-supervised learning framework that jointly labels instances from the web and discovers relationships. To deal with the large number of instances that are either unlabeled or labeled incorrectly, they use a bootstrap-like method. In addition to the traditional bootstrap setup, where only categories/attributes are considered, the relationships between categories are taken into account. Semantic drift is reduced with this additional information, so better results are achieved compared with the traditional bootstrap method. I'm wondering whether other additional information, such as comparative attributes, could help reduce semantic drift even further and boost the performance.

I read the paper 'NEIL'. It's an interesting and rather important paper in this era of large-scale data availability. For each kind of relation, a classifier or a detector is trained, a matrix is formed, more instances are added, and the process of training continues. Images from SUN and other datasets were used initially. How are attributes identified, and will the number of attributes increase with iterations? With data at this scale, would using deep learning instead of the proposed features further improve accuracy? Can we also learn about the engineering behind NEIL, and how it scales up with data? When the initial seed images are added via a text search, how are new object categories and scenes trained? How is NELL used in NEIL? Though I got the overall idea, I was unable to work out some of the details mentioned above; it would be nice if the presenters could discuss these too.

It is a really interesting and bold attempt to actually deal with real data. One important factor in this project is the speed of learning. I am wondering whether learning stays at a constant pace, or whether there is any way to improve it, either by re-seeding after each time period or by consulting previously trained models and constantly adding to them when training new relations.

I read the NEIL paper. The authors use a lot of different ideas from the field of recognition. They leverage the increasing amount of data on the internet and iteratively train a classifier/detector and learner which not only detects objects and classifies scenes but also learns common-sense relationships by representing them via attributes. The authors use the ideas from semi-supervised learning presented last week to avoid semantic drift in attribute learning. They also use exemplar methods for finding latent visual subcategories. NEIL gets better after every iteration as it adds more examples to its dataset. I like the simple yet effective way of statistically learning relationships using an affinity matrix and co-occurrences. I think this learning should generalize very well because the web-scale data is both large and diverse. In general, I think it's a very bold and fresh idea. I think there is a lot of scope for future work: for example, some context information could be mined when mining images from the web (hashtags?! :) ). Moreover, I would love to see how this could be combined with CNNs or deep learning approaches, if at all.
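The co-occurrence idea can be sketched very simply: count how often pairs of concepts appear in the same images and propose a relationship when the pair covers a large fraction of the rarer concept's occurrences. The normalization and threshold below are my own illustrative choices, not the paper's exact scoring:

```python
# Toy co-occurrence-based relationship discovery over labeled images.
from collections import Counter
from itertools import combinations

def discover_relationships(image_labels, min_score=0.3):
    """image_labels: iterable of label sets, one set per image.
    Returns (a, b, score) triples, strongest first."""
    concept_count = Counter()
    pair_count = Counter()
    for labels in image_labels:
        labels = set(labels)
        concept_count.update(labels)
        for a, b in combinations(sorted(labels), 2):
            pair_count[(a, b)] += 1
    rels = []
    for (a, b), n in pair_count.items():
        # normalize by the rarer concept so "wheel always comes with car"
        # scores highly even though cars appear without visible wheels
        score = n / min(concept_count[a], concept_count[b])
        if score >= min_score:
            rels.append((a, b, score))
    return sorted(rels, key=lambda r: -r[2])
```

At web scale the same counting, done over NEIL's own (noisy) labels rather than ground truth, is what makes the statistics robust to individual labeling mistakes.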

I read the NEIL paper. Given the limited space, the paper does a good job of introducing the system and its layout.

One thing I do not understand completely is the attributes:

Are they learned or just initialized? If they are learned, does the learning approach differ? One of the constraints of learning was to try to reduce polysemy, which seems like it would be hard to enforce for a query such as "round".

A well-written paper dealing with so much online data. The system is thoroughly designed to discover relationships between object parts and whole objects, and between different object categories.

An important question is how NEIL maintains the correctness and effectiveness of its system and learnt relationships. How does one ensure that NEIL doesn't learn incorrect relationships, given that the data on the internet has a LOT of variety, including a lot of not-so-good images of many objects? (e.g. a text image stating "Corolla" vs. an actual photo of the car)