PhD Thesis Proposal

February

Abstract:
Our visual world is highly structured and the visual data is highly redundant. In recent years, the computer vision field has been transformed by the success of Convolutional Neural Networks (ConvNets). However, the structure and redundancy in visual data has not been well explored in deep learning. The benefits of exploring data redundancy are two-fold: (i) Redundancy can provide supervision signals for training ConvNets, avoiding large amounts of manual labeling; (ii) By modeling the structure in visual data and relationships between redundant visual patches, our recognition system can be significantly improved.

Visual data is redundant, we can easily predict some portion of the data based on the rest part of the data. For example, given an RGBD image, we can predict the depth channel from the RGB channels. Videos are also highly redundant, we can predict one frame based on its previous frames since the objects and the scene in two consecutive frames are likely to be the same. We exploit these redundancies and design self-supervised methods to train ConvNets. We show the effectiveness of our learned representations on different tasks including image classification, object detection and 3D geometry estimation. Going beyond pixel space, redundancy also appears in the semantic space of our visual world. For instance, dogs look more similar to cats than cars. Instead of learning each of these visual classifiers independently, we propose to borrow the information learned in an existing classifier to a new one (e.g., training a cat classifier by borrowing information from a dog classifier).

Redundancy is not only useful for providing supervision signals to train ConvNets and classifiers, it can also be utilized in modeling and inference. We present non-local operation to model the structured and repeating patterns in the visual world. Our operation can be readily plugged into existing networks, and capture the long-range pairwise relationships between visual patches. By incorporating these relationships in the model, we obtain significant improvements in video classification and object recognition tasks in still images. The surprising results of non-local networks drives me to further organize the pairwise relationships in a graph structure and perform graph convolutions for visual recognition.