PINBOARD SUMMARY

To understand neural networks better, researchers have recently made extensive use of information theory.

One popular approach is to train a neural network conventionally and then evaluate neuron outputs using information-theoretic cost functions. For example, it has been found that neurons in deeper layers "learn" about the class label during training (the mutual information between layer and class increases) and that, in some cases, layers "forget" irrelevant aspects of the input (the mutual information between layer and input decreases). Similarly, it was observed that, during training, the information contained in a layer passes through a period in which it is highly redundant, and ends up in a stage in which individual neurons are informative about different classes. More generally, neurons in deeper layers tend to become "cat neurons", i.e., they help in distinguishing exactly one class from the rest. Neurons in early layers, in contrast, mainly collect general features and thus appear to matter for classification not in isolation, but mainly together with other neurons. Surprisingly, the existence of such "cat neurons" has been linked to poor generalization performance, i.e., to the performance on data that was not used during training.
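The kind of measurement described above can be sketched in a few lines. The following is a minimal illustration, not the setup of any particular study: it uses a simple histogram (binning) estimator of mutual information, which is one common (if crude) choice in this literature, and a made-up "label-tracking" neuron as toy data.

```python
import numpy as np

def mutual_information(x, y, n_bins=10):
    """Estimate I(X; Y) in bits between a real-valued activation x and
    discrete labels y, using a histogram (binning) estimator."""
    # Discretize the activation into bins, then count the joint histogram.
    x_binned = np.digitize(x, np.linspace(x.min(), x.max(), n_bins))
    joint = np.zeros((n_bins + 2, int(y.max()) + 1))
    for xb, yb in zip(x_binned, y):
        joint[xb, yb] += 1
    joint /= joint.sum()
    px = joint.sum(axis=1, keepdims=True)  # marginal of the binned activation
    py = joint.sum(axis=0, keepdims=True)  # marginal of the label
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())

# Toy example (assumed data): a neuron whose sign tracks a binary label
# carries close to one bit about it; an unrelated neuron carries almost none.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=5000)
tracking_neuron = labels * 2.0 - 1.0 + 0.1 * rng.normal(size=5000)
random_neuron = rng.normal(size=5000)
print(mutual_information(tracking_neuron, labels))
print(mutual_information(random_neuron, labels))
```

Note that binning estimators are sensitive to the number of bins, which is one reason (discussed below) why published results can disagree.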

The information bottleneck principle encapsulates exactly what one would wish for a neural network, or for any classification system in general: that the network preserves all information in the input that is relevant for classification, but forgets everything that is irrelevant. Researchers have thus tried to train neural networks using the information bottleneck principle. Since it can be shown that -- taken as it is -- this principle is inadequate as a cost function for training, researchers have replaced it by cost functions that are similar in spirit, with remarkable success: the trained networks were more successful in compressing the input information to what is relevant for classification, and they were more robust to adversarial examples.
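The trade-off behind the principle can be made concrete on a toy discrete problem. The sketch below (an illustration under assumed toy distributions, not the cost function of any specific paper) evaluates the information bottleneck objective I(X;T) - beta * I(T;Y) for a representation T produced by an encoder p(t|x): a "compressing" encoder that merges inputs with the same label keeps all label information while discarding irrelevant input detail, and therefore scores better than the identity encoder.

```python
import numpy as np

def mi_from_joint(p):
    """Mutual information in bits from a joint distribution table p."""
    px = p.sum(axis=1, keepdims=True)
    py = p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log2(p[nz] / (px * py)[nz])).sum())

def ib_objective(p_xy, p_t_given_x, beta):
    """Information bottleneck objective I(X;T) - beta * I(T;Y).
    Training would minimize this over the encoder p(t|x); here we only
    evaluate it, assuming the Markov chain T - X - Y."""
    p_x = p_xy.sum(axis=1)
    p_xt = p_x[:, None] * p_t_given_x  # joint of input X and representation T
    p_ty = p_t_given_x.T @ p_xy        # joint of representation T and label Y
    return mi_from_joint(p_xt) - beta * mi_from_joint(p_ty)

# Toy joint p(x, y): four inputs, two classes; x in {0,1} maps to y=0,
# x in {2,3} maps to y=1, all inputs equally likely.
p_xy = np.zeros((4, 2))
p_xy[0, 0] = p_xy[1, 0] = p_xy[2, 1] = p_xy[3, 1] = 0.25

identity = np.eye(4)                                   # T = X, no compression
compress = np.array([[1., 0.], [1., 0.], [0., 1.], [0., 1.]])  # merge same-label inputs

print(ib_objective(p_xy, identity, beta=2.0))
print(ib_objective(p_xy, compress, beta=2.0))  # lower, i.e. better
```

The compressing encoder wins because it halves I(X;T) (from two bits to one) without losing any of the one bit of label information I(T;Y).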

However, it seems as if even qualitative results depend a lot on how the network is constructed: Is the activation function sigmoidal or a ReLU? How are the information-theoretic quantities estimated? What is the influence of the number of layers? Nevertheless, even though general trends cannot yet be claimed with certainty, information theory holds some promise to "open the black box of deep learning" -- what are you waiting for?