Despite this encouraging progress, there is no clear understanding of why these models perform so well, or how they might be improved. There is still little insight into the internal operation and behavior of these complex models. From a scientific standpoint, this is deeply unsatisfactory: without a clear understanding of how and why they work, the development of better models is reduced to trial and error.

Addressed problem:

Explore why large Convolutional Networks perform so well, and how they might be improved.

Novelty:

Introduce a novel visualization technique that gives insight into the function of intermediate feature layers and the operation of the classifier. The new visualization technique:

Reveals the input stimuli that excite individual feature maps at any layer in the model.

Allows us to observe the evolution of features during training and to diagnose potential problems with the model.

Used in a diagnostic role, these visualizations allow us to find model architectures that outperform Krizhevsky et al. on the ImageNet classification benchmark.

Perform an ablation study to discover the performance contribution from different model layers.

The proposed ImageNet model generalizes well to other datasets: when the softmax classifier is retrained, it convincingly beats the current state-of-the-art results on Caltech-101 and Caltech-256 datasets.

Perform a sensitivity analysis of the classifier output by occluding portions of the input image, revealing which parts of the scene are important for classification.
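The occlusion experiment can be sketched as follows: slide a gray square over the image and record the classifier's score for the true class at each position; positions where the score drops are important for the prediction. The patch size, stride, and `score_fn` below are illustrative assumptions, not the paper's exact settings (a toy brightness-based scorer stands in for a trained convnet):

```python
import numpy as np

def occlusion_sensitivity(image, score_fn, patch=32, stride=16, fill=0.5):
    """Slide a gray occluder over the image and record the classifier
    score at each position; low heatmap values mark important regions."""
    H, W, _ = image.shape
    heat = []
    for y in range(0, H - patch + 1, stride):
        row = []
        for x in range(0, W - patch + 1, stride):
            occluded = image.copy()
            occluded[y:y+patch, x:x+patch, :] = fill  # gray square
            row.append(score_fn(occluded))
        heat.append(row)
    return np.array(heat)

# Toy score function: "confidence" proportional to the brightness of the
# image center, standing in for a trained convnet's class probability.
def toy_score(img):
    return float(img[96:160, 96:160].mean())

img = np.ones((224, 224, 3), dtype=np.float32)
heat = occlusion_sensitivity(img, toy_score)
```

Occluders that cover the (decisive) center region lower the score, so the heatmap dips exactly where the "evidence" sits.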

Previous works:

Since their introduction by LeCun et al. [20] in the early 1990’s, Convolutional Networks (convnets) have demonstrated excellent performance at tasks such as hand-written digit classification and face detection.

In the last 18 months, several papers have shown that they can also deliver outstanding performance on more challenging visual classification tasks.

Several factors are responsible for this dramatic improvement in performance:

The availability of much larger training sets, with millions of labeled examples;

Powerful GPU implementations, making the training of very large models practical

Better model regularization strategies, such as Dropout.

Key Ideas:

Visualization:

Visualizing features to gain intuition about the network is common practice, but mostly limited to the 1st layer where projections to pixel space are possible. In higher layers alternate methods must be used.

The proposed visualization technique uses a multi-layered Deconvolutional Network (deconvnet), as proposed by Zeiler et al. [29], to project the feature activations back to the input pixel space.

A deconvnet can be thought of as a convnet model that uses the same components (filtering, pooling) but in reverse: instead of mapping pixels to features, it maps features back to pixels. In Zeiler et al. [29], deconvnets were proposed as a way of performing unsupervised learning. Here, they are not used in any learning capacity, just as a probe of an already trained convnet.
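One non-invertible step the deconvnet must handle is max pooling: the forward pass records "switch" locations (the argmax positions), and the reverse pass places each activation back at its recorded position, leaving the rest zero. A minimal numpy sketch of this unpooling idea (single-channel, 2x2 pooling; the real model also reverses filtering and rectification):

```python
import numpy as np

def maxpool_with_switches(x, k=2):
    """2x2 max pooling that also records the 'switch' locations
    (argmax positions), which the deconvnet needs for unpooling."""
    H, W = x.shape
    pooled = np.zeros((H // k, W // k))
    switches = np.zeros((H // k, W // k, 2), dtype=int)
    for i in range(H // k):
        for j in range(W // k):
            block = x[i*k:(i+1)*k, j*k:(j+1)*k]
            r, c = np.unravel_index(block.argmax(), block.shape)
            pooled[i, j] = block[r, c]
            switches[i, j] = (i*k + r, j*k + c)
    return pooled, switches

def unpool(pooled, switches, shape):
    """Deconvnet-style unpooling: place each pooled activation back at
    the position its switch recorded; everywhere else stays zero."""
    out = np.zeros(shape)
    H, W = pooled.shape
    for i in range(H):
        for j in range(W):
            r, c = switches[i, j]
            out[r, c] = pooled[i, j]
    return out
```

The switches make the reconstruction specific to the particular input image, which is why the projections show image-specific structure rather than a generic filter pattern.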

[8] find the optimal stimulus for each unit by performing gradient descent in image space to maximize the unit’s activation.
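The idea in [8] can be illustrated as gradient ascent in image space: start from some image and repeatedly step in the direction that increases a unit's activation. The sketch below uses a toy differentiable "unit" whose activation peaks at a fixed preferred pattern; `activation_grad`, `preferred`, and the hyperparameters are illustrative assumptions, not the setup of [8]:

```python
import numpy as np

def find_optimal_stimulus(activation_grad, shape, steps=100, lr=0.1):
    """Gradient ascent in image space to maximize a unit's activation.
    activation_grad(img) is assumed to return the gradient of the
    unit's activation with respect to the image."""
    img = np.zeros(shape)
    for _ in range(steps):
        img += lr * activation_grad(img)  # step uphill in pixel space
    return img

# Toy unit: activation = -0.5 * ||img - preferred||^2, so its gradient
# is (preferred - img) and ascent converges to the preferred pattern.
preferred = np.full((8, 8), 0.5)
grad = lambda img: preferred - img
stimulus = find_optimal_stimulus(grad, (8, 8))
```

Unlike the deconvnet projections, this produces a single synthesized image per unit and requires a careful initialization in the non-convex case.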

The proposed approach in the paper is similar to contemporary work by Simonyan et al. [23], who demonstrate how saliency maps can be obtained from a convnet by projecting back from the fully connected layers of the network, rather than from the convolutional features used here.

Girshick et al. [10] show visualizations that identify patches within a dataset that are responsible for strong activations at higher layers in the model. The proposed visualizations differ in that they are not just crops of input images, but rather top-down projections that reveal structures within each patch that stimulate a particular feature map.

Generalization:

The demonstration of the generalization ability of convnet features is also explored in concurrent work by Donahue et al. [7] and Girshick et al. [10].

Analysis steps:

Start with the architecture of Krizhevsky et al. [18] and explore different architectures, discovering ones that outperform their results on ImageNet.

Explore the generalization ability of the model to other datasets, retraining only the softmax classifier on top. As such, this is a form of supervised pre-training, which contrasts with the unsupervised pre-training methods popularized by Hinton et al. [13] and others [1, 26].

Each RGB image was preprocessed by resizing the smallest dimension to 256, cropping the center 256×256 region, subtracting the per-pixel mean (across all images), and then taking 10 different sub-crops of size 224×224 (the four corners plus the center, with and without horizontal flips).
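The 10-crop step can be sketched in a few lines: given the 256×256 mean-subtracted image, take the four corner crops and the center crop, each with its horizontal flip. This is a minimal numpy sketch of that scheme (the resize and mean subtraction are assumed done beforehand):

```python
import numpy as np

def ten_crop(img, size=224):
    """From a 256x256 (mean-subtracted) image, extract the four corner
    crops and the center crop, each followed by its horizontal flip."""
    H, W, _ = img.shape
    c = ((H - size) // 2, (W - size) // 2)
    origins = [(0, 0), (0, W - size), (H - size, 0), (H - size, W - size), c]
    crops = []
    for y, x in origins:
        crop = img[y:y+size, x:x+size]
        crops.append(crop)
        crops.append(crop[:, ::-1])  # horizontal flip (reverse width axis)
    return np.stack(crops)          # shape (10, 224, 224, 3)
```

At test time the network's predictions over the 10 crops are typically averaged, which smooths out sensitivity to object position.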

Learning algorithm:

Uses a large set of labeled images, where the label is a discrete variable indicating the true class. A cross-entropy loss function, suitable for image classification, is used to compare the output and the target.
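For a single example, the loss is the negative log-probability that the softmax assigns to the true class. A minimal numpy sketch (the max-subtraction is the standard numerical-stability trick, not something the notes specify):

```python
import numpy as np

def softmax_cross_entropy(logits, true_class):
    """Cross-entropy loss for one example: softmax over the class
    scores, then negative log-probability of the true (discrete) label."""
    z = logits - logits.max()             # stabilize the exponentials
    probs = np.exp(z) / np.exp(z).sum()
    return -np.log(probs[true_class])
```

With two equal logits the model is maximally uncertain between the classes, so the loss equals log 2.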

The parameters of the network (filters in the convolutional layers, weight matrices in the fully-connected layers, and biases) are trained by back-propagating the derivative of the loss with respect to the parameters throughout the network, and updating the parameters via stochastic gradient descent.

Stochastic gradient descent with a mini-batch size of 128 was used to update the parameters, starting with a learning rate of 0.01, in conjunction with a momentum term of 0.9.

Anneal the learning rate throughout training manually when the validation error plateaus. Dropout [14] is used in the fully connected layers (6 and 7) with a rate of 0.5. All weights are initialized to 0.01 and biases are set to 0.

Stopped training after 70 epochs.

Training time: 12 days on a single GTX580 GPU, using an implementation based on [18].
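The reported optimizer settings (learning rate 0.01, momentum 0.9) amount to the standard SGD-with-momentum update. A minimal sketch of one update step, assuming the common velocity formulation (not the authors' actual code):

```python
import numpy as np

def sgd_momentum_step(params, grads, velocities, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update over parallel lists of numpy
    arrays, using the reported settings (lr 0.01, momentum 0.9)."""
    for p, g, v in zip(params, grads, velocities):
        v *= momentum        # decay the accumulated velocity
        v -= lr * g          # add the (scaled) current gradient
        p += v               # take the parameter step, in place
    return params, velocities
```

In training, this step runs once per mini-batch of 128 images, and the learning rate passed in is reduced manually when validation error plateaus.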

Insights:

The paper explored large convolutional neural network models, trained for image classification, in a number of ways.

The paper presented a novel way to visualize the activity within the model. This reveals the features to be far from random, uninterpretable patterns. Rather, they show many intuitively desirable properties such as compositionality, increasing invariance and class discrimination as we ascend the layers.

The paper also shows how these visualizations can be used to identify problems with the model and so obtain better results, for example improving on Krizhevsky et al.'s impressive ImageNet 2012 result.

The paper demonstrated through a series of occlusion experiments that the model, while trained for classification, is highly sensitive to local structure in the image and is not just using broad scene context.

An ablation study on the model revealed that having a minimum depth to the network, rather than any individual section, is vital to the model’s performance.

The ImageNet trained model can generalize well to other datasets. For Caltech-101 and Caltech-256, the datasets are similar enough that the model can beat the best reported results, in the latter case by a significant margin.

The proposed convnet model generalized less well to the PASCAL data, perhaps suffering from dataset bias [25], although it was still within 3.2% of the best reported result, despite no tuning for the task.

My notes and review:

This paper is interesting because it visualizes the features learned inside deep convolutional neural networks via a deconvolutional network. Such visualizations can hardly be found in other CNN papers, and they bring more understanding of why convolutional neural networks perform so well on visual recognition tasks.

Looking at the visualizations of features in a fully trained model, which show the top activations in a random subset of feature maps across the validation data, projected down to pixel space using the deconvolutional network, we can confirm that the features visually correspond to structures in the input.

I am still curious whether our brain performs convolution and deconvolution. The answer might be found by looking back at Kunihiko Fukushima's work on the neocognitron, which was inspired by the model proposed by Hubel & Wiesel in 1959.