CNNs are very good at classifying scrambled images but humans are not.

In this post I will show why state-of-the-art Deep Neural Networks can still recognise scrambled images perfectly well and how this helps to uncover a puzzlingy simple strategy that DNNs seem to use to classify natural images. These findings, published at ICLR 2019, have a number of ramifications: first, they show that solving ImageNet is much simpler than many have thought. Second, the findings allow us to build much more interpretable and transparent image classification pipelines. Third, they explain a number of phenomena observed in modern CNNs like their bias towards texture (see our other paper at ICLR 2019 and our corresponding blog post) and their neglect of the spatial ordering of object parts.

Good ol’ bag-of-features models

In the old days, before Deep Learning, object recognition in natural images used to be fairly simple: define a set of key visual features (“words”), recognize how often each visual feature is present in an image (“bag”) and then classify the image based on these numbers. These models are therefore called “bag-of-features” models (BoF models). For illustration, say we have only two visual features, a human eye and a feather, and we want to classify images into “human” and “bird” class. The simplest BoF model would work as follows: for each eye in the image it increases evidence for “human” by +1. Vice versa, for each feather in the image it will increase evidence for “bird” by +1. Whatever class accumulates the most evidence across the image will be the predicted one.

A nice property of this simplest BoF model is its interpretability and transparent decision making: we can check exactly which image features carry evidence for a given class, the spatial integration of evidence is super simple (in contrast to the deep non-linear feature integration in deep neural networks) and so it is quite straight-forward to understand how the model reaches its decisions.

Traditional BoF models have been extremely popular and state-of-the-art before the onset of Deep Learning but quickly fell out of favour due to their subpar classification performance. But are we sure that Deep Neural Networks really use a fundamentally different decision-strategy as BoF-models?

A deep but interpretable bag-of-feature network (BagNet)

To test this we combine the interpretability and transparency of BoF models with the performance of DNNs. The high-level strategy is as follows:

Split the image into small q x q image patches

Pass patches through a DNN to get class evidences (logits) for each patch.

Sum the evidence over all patches to reach an image-level decision.

Classification strategy of BagNets: for each patch we extract class evidences (logits) using a DNN and sum up the total class evidences over all patches.

To implement this strategy in the simplest and most efficient way we take a standard ResNet-50 architecture and replace most (but not all) 3x3 convolutions with 1x1 convolutions. In this case the hidden units in the last convolutional layer each only “see” a small part of the image (i.e. their receptive field is much smaller than the size of the image). This avoids an explicit partitioning of the image and is as close as possible to standard CNNs while still implementing the outlined strategy. We call the resulting model architecture BagNet-q where q stands for the receptive field size of the top-most layer (we test q = 9, 17 and 33). The runtime of BagNet-q is roughly 2,5 the runtime of a ResNet-50.

Performance of BagNets with different patch sizes on ImageNet.

The performance of BagNets on ImageNet is impressive even for very small patch sizes: image features of size 17 x 17 pixels are enough to reach AlexNet-level performance while features of size 33 x 33 pixels are sufficient to reach around 87% top-5 accuracy. Higher performance values might be achievable with a more careful placement of the 3 x 3 convolutions and additional hyperparameter tuning.

That’s our first main result: you can solve ImageNet using only a collection of small image features. Long-range spatial relationships like object shape or the relation between object parts can be completely neglected and are unnecessary to solve the task.

A great feature of the BagNets is their transparent decision-making. For example, we can now look which image features are most predictive for a given class (see below). For example, a tench (a very big fish) is typically recognized by fingers on top of a greenish background. Why? Because most images in this category feature a fisherman holding up the tench like a trophy. Whenever the BagNet wrongly classifies an image as a tench it’s often because there are some fingers on top of a greenish background somewhere in the image.

Image features with the most class evidence. We show both features that correctly predicted the class (top rows) and distracting features that predicted the wrong class (bottom rows).

Similarly, we also get a precisely defined heatmap that shows which parts of the image contributed to a certain decision.

Heatmaps from BagNets showing exactly which image parts contributed to the decision. The heatmaps are not approximated but show the true contribution of each image part.

ResNet-50 is surprisingly similar to BagNets

BagNets show that one can reach high accuracy on ImageNet just based on the weak statistical correlations between local image features and the object category. If that’s enough, why would standard deep nets like ResNet-50 learn anything fundamentally different? Why should a ResNet-50 learn about complex large-scale relationships like object shape if the abundance of local image features is sufficient to solve the task?

To test the hypothesis that modern DNNs follow a similar strategy as simple bag-of-feature networks we test different ResNets, DenseNets and VGGs on the following “signatures” of BagNets:

Decisions are invariant against spatial shuffling of image features (could only be tested on VGG models).

Modifications of different image parts should be independent (in terms of their effect on the total class evidence).

Errors made by standard CNNs and BagNets should be similar.

Standard CNNs and BagNets should be sensitive to similar features.

In all four experiments we find a strikingly similar behaviour between CNNs and BagNets. As an example, in the last experiment we show that those image parts to which the BagNets are most sensitive to (e.g. if you occlude those parts) are basically the same to which CNNs are most sensitive to. In fact, the heatmaps (the spatial map of sensitivity) of BagNets are better predictors of the sensitivity of DenseNet-169 than heatmaps generated by attribution methods like DeepLift (which compute heatmaps directly for DenseNet-169). Of course, DNNs do not perfectly resemble bag-of-feature models but do show some deviations. In particular, we find increased features sizes and more long-range dependencies the deeper the networks get. Hence, deeper neural networks do improve over simpler bag-of-feature models but I don’t think that the core classification strategy has really changed.

Going beyond bag-of-features classification

Viewing the decision-making of CNNs as a bag-of-feature strategy could explain several weird observations about CNNs. First, it would explain why CNNs have such a strong texture-bias. Second, it would explain why CNNs are so insensitive to the shuffling of image parts. It might even explain the existence of adversarial stickers and adversarial perturbations in general: one can place misleading signals anywhere in the image and the CNN will still reliably pick up that signal whether or not these signal fit to the rest of the image.

At its core our work shows that CNNs use the many weak statistical regularities present in natural images for classification and don’t make the jump towards object-level integration of image parts like humans. The same is likely true for other tasks and sensory modalities.

We have to think hard about how to build our architectures, tasks and learning methods to counteract this tendency towards weak statistical correlations. One angle would be to improve the inductive biases of CNNs away from small local features towards more global features. Another angle would be to remove or replace those features onto which the networks should not rely, which is exactly what we did in another ICLR 2019 publication using a style transfer preprocessing to remove the natural object texture.

One of the biggest problems, however, is certainly the task of image classification itself: if local image features are sufficient to solve the task, there is no incentive to learn the true “physics” of the natural world. We will have to restructure the task itself in a way that pushes models to learn the physical nature of objects. This likely has to go beyond purely observational learning of correlations between input and output features in order to allow models to extract causal dependencies.

Taken together, our results suggest that CNNs might follow an extremely simple classification strategy. The fact that such a discovery can still be made in the year 2019 highlights how little we have yet understood about the inner workings of deep neural networks. This lack of understanding prevents us from developing fundamentally better models and architectures that close the gap between human and machine perception. Deepening our understanding will allow us to uncover ways to close this gap. This can be extremely fruitful: as we tried to bias CNNs towards more physical properties of objects we suddenly reached human-like noise robustness. I expect many more exciting results on our way towards CNNs that truly learn the physical and causal nature of our world.

I am regularly posting about my research and professional work via Twitter at @wielandbr or via linkedin.