Normalization in Deep Learning

A few days ago (Jun 2017), a ~100 page paper on Self-Normalizing Networks appeared. An amazing piece of theoretical work, it claims to have solved the problem of building very large Feed Forward Networks (FNNs).

It builds upon Batch Normalization (BN), introduced in 2015, which is now the de facto standard for all CNNs and RNNs. But BN is not so useful for FNNs.

What makes normalization so special? It makes very Deep Networks easier to train, by damping out oscillations in the distribution of activations.

To see this, the diagram below uses data from Figure 1 (from the BN paper) to depict how the distribution of a typical node's outputs in the last hidden layer of a typical network evolves during training:

node outputs, with and without Batch Normalization

Very Deep nets can be trained faster and generalize better when the distribution of activations is kept normalized during BackProp.
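For reference, BN standardizes each activation over the current mini-batch $B$ and then restores expressiveness with two learned parameters $\gamma$ and $\beta$:

$$ \hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta $$

where $\mu_B$ and $\sigma_B^2$ are the mean and variance of the activation over the mini-batch.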

We regularly see Ultra-Deep ConvNets like Inception, Highway Networks, and ResNet. And giant RNNs for speech recognition, machine translation, etc. But we don’t see powerful Feedforward Neural Nets (FNNs) with more than 4 layers. Until now.

Batch Normalization is great for CNNs and RNNs.

But we still cannot build deep MLPs.

This new method, Self-Normalization, has been proposed for building very deep Multilayer Perceptrons (MLPs) and other Feed Forward Nets (FNNs).

The idea is just to tweak the Exponential Linear Unit (ELU) activation function to obtain a Scaled ELU (SELU):
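$$ \mathrm{selu}(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha\,e^{x} - \alpha & \text{if } x \le 0 \end{cases} $$

with the fixed constants $\lambda \approx 1.0507$ and $\alpha \approx 1.6733$ derived in the paper.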

With this new SELU activation function and a new alpha-Dropout method, it appears we can now build very deep MLPs. And this opens the door for Deep Learning applications on very general data sets. That would be great!

The paper is, however, ~100 pages of pure math! Fun stuff, but a summary is in order.

The problem is that during SGD training, the distribution of weights W and/or the outputs x can vary widely from iteration to iteration. These large variations lead to instabilities in training that require small learning rates. In particular, if the layer weights W or inputs u blow up, the activations can become saturated:

$$ z = g(\mathbf{W}u + b), \qquad g(x) = \frac{1}{1 + e^{-x}}, \qquad g'(x) \to 0 \ \text{ as } \ |x| \to \infty, $$

leading to vanishing gradients. Traditionally, this was avoided in MLPs by using smaller learning rates and/or early stopping.

One solution is better activation functions, such as the Rectified Linear Unit (ReLU), $\mathrm{relu}(x) = \max(0, x)$, which does not saturate for positive inputs.

Note: ReLUs only help Deep CNNs and RNNs

SGD training introduces perturbations that propagate through the net, causing large variations in weights and activations. For FNNs, this is a huge problem. But for CNNs and RNNs, not so much. Why?

CNNs and RNNs are less distorted by the SGD perturbations, presumably because of their weight-sharing architectures.

Moreover, Dropout (a stochastic regularizer) works very well with ReLUs in CNNs and RNNs, but not so much for MLPs and other FNNs.

And very Deep Nets, like ResNet, which have > 150 layers, use skip connections to help propagate the internal residuals.

It has been said that no real theoretical progress has been made in deep nets in 30 years. That is absurd. 30 years ago we did not have ReLUs or ELUs. In fact, up until Batch Normalization, we were still using SVM-style regularization techniques for Deep Nets. It is clear now that we need to rethink generalization in deep learning.

Self-normalization requires an activation function with:

saturation regions to dampen the variance if it is too large in the lower layer,

a slope > 1 to increase the variance if it is too small in the lower layer, and

a continuous curve, which ensures a fixed point.
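To make this concrete, here is a minimal NumPy sketch of the SELU (the function name and shape are mine; the constants are the paper's):

```python
import numpy as np

# Fixed SELU constants derived in the paper
LAM, ALPHA = 1.0507, 1.6733

def selu(x):
    # Slope LAM > 1 for x > 0; saturates toward -LAM * ALPHA for x << 0
    return LAM * np.where(x > 0, x, ALPHA * (np.exp(x) - 1.0))
```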

Amazingly, the implicit self-normalizing properties are actually proved–in only about 100 pages–using the Banach Fixed Point Theorem.

They show that, for an FNN using selu(x) activations, there exists a unique, attracting, and stable fixed point for the mean and variance. (Curiously, this resembles the argument that Deep Learning, at least for RBMs, implements a Variational Renormalization Group (VRG) transform.)

There are, of course, conditions on the weights–things can’t get too crazy. This is hopefully satisfied by selecting initial weights with zero mean and unit variance.

$$ \omega = \sum_{i=1}^{N} w_i = 0, \qquad \tau = \sum_{i=1}^{N} w_i^2 = 1 $$

(depending how we define terms).
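We can check the fixed point numerically. In this quick sketch (layer sizes are illustrative; it reuses selu from above), weights drawn with zero mean and variance 1/N approximately satisfy these conditions, and a stack of random SELU layers pulls the activation statistics toward (0, 1):

```python
rng = np.random.default_rng(0)

N = 256
x = rng.normal(2.0, 3.0, size=(1000, N))  # start far from mean 0, variance 1

for _ in range(32):  # a stack of randomly initialized SELU layers
    W = rng.normal(0.0, np.sqrt(1.0 / N), size=(N, N))  # mean 0, variance 1/N
    x = selu(x @ W)

print(x.mean(), x.var())  # both drift toward the fixed point (0, 1)
```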

To apply SELUs, we need a special initialization procedure and a modified version of Dropout, alpha-Dropout.

SELU Initialization

We select initial weights from a Gaussian distribution with mean 0 and variance $1/N$, where $N$ is the number of incoming weights:

TensorFlow implementation of Weight Initialization
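A minimal sketch of this in TensorFlow (the layer sizes here are illustrative):

```python
import tensorflow as tf

n_in, n_out = 784, 256  # illustrative fan-in / fan-out

# Draw initial weights from N(0, 1/n_in), i.e. stddev = sqrt(1/n_in)
W = tf.Variable(
    tf.random.normal([n_in, n_out], mean=0.0, stddev=(1.0 / n_in) ** 0.5))
```

Keras also ships this scheme as the lecun_normal initializer, e.g. tf.keras.layers.Dense(n_out, activation='selu', kernel_initializer='lecun_normal').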

In Statistical Mechanics, the Temperature is proportional to the variance of the Energy and therefore sets the Energy scale. Since E ~ W,

SELU Weight initialization is similar in spirit to fixing T=1.

Alpha Dropout

Note that to apply Dropout with an SELU, we want the mean and variance to remain invariant.

We set randomly chosen inputs to the saturated negative value of the SELU, $-\lambda\alpha$, and then apply an affine transformation, with parameters computed from the dropout rate, to restore the original mean and variance.
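A minimal NumPy sketch of alpha-Dropout, reusing the constants from the SELU sketch above (the interface is mine; the affine parameters a and b follow the paper so that zero mean and unit variance are preserved):

```python
ALPHA_PRIME = -LAM * ALPHA  # saturated negative value of the SELU, ~ -1.7581

def alpha_dropout(x, rate):
    # Drop units to ALPHA_PRIME instead of 0, then rescale with an
    # affine map a*x + b chosen to restore mean 0 and variance 1.
    keep = 1.0 - rate
    a = (keep + ALPHA_PRIME**2 * keep * (1.0 - keep)) ** -0.5
    b = -a * ALPHA_PRIME * (1.0 - keep)
    mask = np.random.rand(*x.shape) < keep
    return a * np.where(mask, x, ALPHA_PRIME) + b
```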

Summary

We have reviewed several variants of normalization in deep nets, including

Max Norm weight constraints

Batch Normalization, and

Self-Normalizing Deep Feed Forward Nets.

Along the way, I have tried to convince you that recent developments in the normalization of Deep Nets represent the culmination of over 30 years of research into Neural Network theory, and that early ideas about finite Temperature methods from Statistical Mechanics have evolved into, and are deeply related to, the Normalization methods employed today to create very Deep Neural Networks.

Appendix:

Temperature Control in Neural Networks

Very early research in Neural Networks lifted ideas from Statistical Mechanics. Early work by Hinton formulated AutoEncoders and the principle of Minimum Description Length (MDL) as minimizing a Helmholtz Free Energy:
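In its standard Statistical Mechanics form, the Helmholtz Free Energy trades an expected Energy against an Entropy term at Temperature T:

$$ F = U - TS = \langle E \rangle - T\,S $$

so minimizing F balances a low expected Energy against a high Entropy.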