Paper summary (davidstutz)
Ba et al. propose layer normalization, which normalizes the activations of a layer by their mean and standard deviation. In contrast to batch normalization, this scheme does not depend on the current batch; thus, it performs the same computation at training and test time. The general scheme, however, is very similar. Given the $l$-th layer of a multi-layer perceptron,
$a_i^l = (w_i^l)^T h^l$ and $h_i^{l + 1} = f(a_i^l + b_i^l)$
with $w_i^l$ being the $i$-th row of the weight matrix $W^l$, the activations $a_i^l$ are normalized by mean $\mu_i^l$ and standard deviation $\sigma_i^l$. For batch normalization, these are expectations over the data distribution, estimated in practice over the current mini-batch:
$\mu_i^l = \mathbb{E}_{p(x)} [a_i^l]$ and $\sigma_i^l = \sqrt{\mathbb{E}_{p(x)} [(a_i^l - \mu_i^l)^2]}$.
However, this estimation depends heavily on the batch size; additionally, the computation differs between training and test time (at test time, the statistics are estimated over the training set). For layer normalization, these statistics are instead computed over the activations within the same layer:
$\mu^l = \frac{1}{H}\sum_{i = 1}^H a_i^l$ and $\sigma^l = \sqrt{\frac{1}{H}\sum_{i = 1}^H (a_i^l - \mu^l)^2}$, where $H$ is the number of hidden units in the layer.
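The layer statistics above can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function names `layer_statistics` and `layer_norm`, the constant `eps`, and the toy activation values are all assumptions made for this example.

```python
import math

def layer_statistics(a):
    """Mean and standard deviation over the H units of one training case."""
    H = len(a)
    mu = sum(a) / H
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / H)
    return mu, sigma

def layer_norm(a, eps=1e-5):
    """Normalize one activation vector by its own statistics.

    eps is an assumed small constant that avoids division by zero.
    """
    mu, sigma = layer_statistics(a)
    return [(x - mu) / (sigma + eps) for x in a]

# Toy activations of one layer for a single training case (made-up values).
a = [1.0, 2.0, 3.0, 4.0]
mu, sigma = layer_statistics(a)
normalized = layer_norm(a)
```

Only the vector `a` of a single training case enters the computation, so the result is the same whether that case is processed alone or as part of a larger batch.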
Thus, the normalization no longer depends on the batch size. Additionally, layer normalization is invariant to scaling and shifts of the weight matrix (for batch normalization, this only holds for the columns of the matrix). In experiments, the approach is shown to work well for a variety of tasks, including models with attention mechanisms and recurrent neural networks. For convolutional neural networks, the authors state that layer normalization does not outperform batch normalization but still performs better than no normalization at all.
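The invariance to scaling and shifts of the weight matrix can be checked numerically. The sketch below assumes the transformation $W' = \delta W + \mathbf{1}\gamma^T$ with $\delta > 0$, that is, every row is scaled by the same factor and shifted by the same vector; biases are left out for simplicity. The names `matvec`, `delta`, `gamma`, and all numeric values are hypothetical choices for this example.

```python
import math

def layer_norm(a):
    """Normalize one activation vector by its own mean and standard deviation."""
    H = len(a)
    mu = sum(a) / H
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / H)
    return [(x - mu) / sigma for x in a]

def matvec(W, h):
    """Compute the summed inputs a_i = w_i^T h for every row w_i of W."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

# Made-up weights and inputs for a layer with 3 units.
W = [[0.5, -1.0, 2.0], [1.5, 0.25, -0.5], [-2.0, 1.0, 1.0]]
h = [1.0, 2.0, -1.0]

# Scale every row by delta and shift every row by the same vector gamma.
delta = 3.0
gamma = [0.1, -0.2, 0.3]
W_transformed = [[delta * w + g for w, g in zip(row, gamma)] for row in W]

original = layer_norm(matvec(W, h))
transformed = layer_norm(matvec(W_transformed, h))
max_diff = max(abs(x - y) for x, y in zip(original, transformed))
```

The shift adds the same constant $\gamma^T h$ to every summed input, which the mean subtraction removes, and the factor $\delta$ cancels against the standard deviation, so `max_diff` is zero up to floating-point error.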
Also find this summary at [davidstutz.de](https://davidstutz.de/category/reading/).

First published: 2016/07/21
Abstract: Training state-of-the-art, deep neural networks is computationally expensive.
One way to reduce the training time is to normalize the activities of the
neurons. A recently introduced technique called batch normalization uses the
distribution of the summed input to a neuron over a mini-batch of training
cases to compute a mean and variance which are then used to normalize the
summed input to that neuron on each training case. This significantly reduces
the training time in feed-forward neural networks. However, the effect of batch
normalization is dependent on the mini-batch size and it is not obvious how to
apply it to recurrent neural networks. In this paper, we transpose batch
normalization into layer normalization by computing the mean and variance used
for normalization from all of the summed inputs to the neurons in a layer on a
single training case. Like batch normalization, we also give each neuron its
own adaptive bias and gain which are applied after the normalization but before
the non-linearity. Unlike batch normalization, layer normalization performs
exactly the same computation at training and test times. It is also
straightforward to apply to recurrent neural networks by computing the
normalization statistics separately at each time step. Layer normalization is
very effective at stabilizing the hidden state dynamics in recurrent networks.
Empirically, we show that layer normalization can substantially reduce the
training time compared with previously published techniques.
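
As a rough sketch of how the abstract's description might look in a recurrent network, the following plain-Python example applies a per-unit gain and bias after the normalization and before the non-linearity, and recomputes the statistics at every time step. The choice of `tanh`, the helper names `matvec` and `rnn_step`, the constant `eps`, and all parameter values are assumptions for illustration only.

```python
import math

def layer_norm(a, gain, bias, eps=1e-5):
    """Normalize the summed inputs of one case, then apply per-unit gain and bias."""
    H = len(a)
    mu = sum(a) / H
    sigma = math.sqrt(sum((x - mu) ** 2 for x in a) / H)
    return [g * (x - mu) / (sigma + eps) + b for x, g, b in zip(a, gain, bias)]

def matvec(W, v):
    """Matrix-vector product with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def rnn_step(W_hh, W_xh, h_prev, x, gain, bias):
    """One recurrent step: summed inputs -> layer norm -> tanh (assumed non-linearity)."""
    a = [r + i for r, i in zip(matvec(W_hh, h_prev), matvec(W_xh, x))]
    return [math.tanh(v) for v in layer_norm(a, gain, bias)]

# Hypothetical toy parameters: 3 hidden units, 2 input features.
W_hh = [[0.1, -0.2, 0.3], [0.0, 0.5, -0.4], [0.2, 0.1, 0.1]]
W_xh = [[1.0, -1.0], [0.5, 0.5], [-0.3, 0.8]]
gain = [1.0, 1.0, 1.0]
bias = [0.0, 0.0, 0.0]

h = [0.0, 0.0, 0.0]
sequence = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
for x in sequence:
    # The statistics are recomputed from the current time step only.
    h = rnn_step(W_hh, W_xh, h, x, gain, bias)
```

Since no statistics are carried across time steps or training cases, the same code handles sequences of any length.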

TLDR; The authors propose a new normalization scheme called "Layer Normalization" that works especially well for recurrent networks. Layer Normalization is similar to Batch Normalization but depends only on a single training case. As such, it is well suited for variable-length sequences or small batches. In Layer Normalization, all hidden units in a layer share the same normalization terms. The authors show through experiments that Layer Normalization converges faster, and sometimes to better solutions, than batch-normalized or unnormalized RNNs. Batch Normalization still performs better for CNNs.