My Notes - Weight Normalization

Goodfellow et al. (2016) define Deep Learning as a sub-field of machine learning that consists in learning models that are wholly or partially specified by a class of flexible differentiable functions.

This study proposes three main methods: Weight Normalization, a new data-dependent initialization method, and Mean-Only Batch Normalization.

Weight normalization is formalized as w = (g / ||v||) v. Each weight vector w is decoupled into its norm g and its direction v / ||v||, so that ||w|| = g regardless of v. The authors argue that running SGD on g and v instead of directly on w gives faster convergence.
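The decoupling described above can be sketched in a few lines of plain Python for a single neuron. This is only an illustration under my own assumptions: the function names (weight_norm, neuron_output) and the toy values are made up for these notes and are not taken from the paper's code.

```python
import math


def weight_norm(v, g):
    """Return w = g * v / ||v||, so that the length of w is set by g alone."""
    v_norm = math.sqrt(sum(x * x for x in v))
    return [g * x / v_norm for x in v]


def neuron_output(x, v, g, b):
    """Pre-activation w . x + b of one neuron with a weight-normalized w."""
    w = weight_norm(v, g)
    return sum(wi * xi for wi, xi in zip(w, x)) + b


# Toy values (hypothetical): v only sets the direction, g only sets the norm.
v = [3.0, 4.0]
g = 2.0
w = weight_norm(v, g)
w_norm = math.sqrt(sum(x * x for x in w))

# Rescaling v leaves w unchanged, because only the direction of v matters.
w_rescaled = weight_norm([10.0 * x for x in v], g)
```

Here w_norm equals g, and w_rescaled matches w up to floating point error, which is the point of the reparameterization: the optimizer can change the scale (through g) and the direction (through v) separately.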

They compare Weight Normalization (WN) with Batch Normalization (BN). The main disadvantage of BN, they argue, is the stochasticity it introduces because the statistics vary from one data batch to the next. An additional difference is that WN has a lower computational burden than BN.

The second perk is data-dependent initialization of the network. They first feed an initial minibatch to the network and compute the mean and standard deviation (std) of the activations per layer. Then, given initial weight values v sampled from a distribution with mean 0 and std 0.05, they set g = 1 / std and b = - mean / std.
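A minimal sketch of this initialization for one layer, in plain Python. I am assuming here that the statistics are measured on the pre-activations v . x / ||v|| (that is, with g = 1 and b = 0), and that the initial weights are drawn from a Gaussian; the helper names and the toy batch are hypothetical.

```python
import math
import random


def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def norm(v):
    return math.sqrt(dot(v, v))


def data_dependent_init(batch, n_units, rng):
    """Initialize v, g, b for each unit of one layer from a single minibatch."""
    n_in = len(batch[0])
    layer = []
    for _ in range(n_units):
        # Initial direction: samples with mean 0 and std 0.05.
        v = [rng.gauss(0.0, 0.05) for _ in range(n_in)]
        # Pre-activations on the initial batch, measured with g = 1 and b = 0.
        t = [dot(v, x) / norm(v) for x in batch]
        mean = sum(t) / len(t)
        std = math.sqrt(sum((ti - mean) ** 2 for ti in t) / len(t))
        # Assumes std > 0, i.e. the batch is not degenerate for this unit.
        layer.append({"v": v, "g": 1.0 / std, "b": -mean / std})
    return layer


def pre_activation(x, unit):
    """g * (v . x) / ||v|| + b for one unit."""
    return unit["g"] * dot(unit["v"], x) / norm(unit["v"]) + unit["b"]


# Toy initial minibatch (hypothetical): 32 examples with 4 features each.
rng = random.Random(0)
init_batch = [[rng.gauss(1.0, 2.0) for _ in range(4)] for _ in range(32)]
layer = data_dependent_init(init_batch, n_units=3, rng=rng)
```

With g and b set this way, each unit's pre-activations have mean 0 and std 1 on the initial minibatch. That holds only for this batch, not for later ones.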

One downside is that, since this scheme depends on a single batch, it might suffer on later batches with different data statistics. However, they say that the scheme works well in practice.

The third perk is Mean Only Batch Normalization.

This is a lighter operation because it avoids variance normalization: the minibatch mean is subtracted from the pre-activations, but they are not divided by the minibatch std. We can skip variance normalization because the initialization scheme has already applied it. Another upside is that avoiding variance normalization gives less noisy gradient feedback and therefore better learning.
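For one unit, the operation can be sketched as follows in plain Python. The function name, the bias argument b and the toy numbers are my own choices for illustration, not the paper's implementation.

```python
def mean_only_batch_norm(t, b=0.0):
    """Center one unit's pre-activations over the minibatch, then add a bias.

    Unlike full batch normalization, there is no division by the minibatch std.
    """
    mean = sum(t) / len(t)
    return [ti - mean + b for ti in t]


# Toy pre-activations of one unit over a minibatch of 3 examples.
t = [1.0, 2.0, 6.0]
centered = mean_only_batch_norm(t)        # [-2.0, -1.0, 3.0]
shifted = mean_only_batch_norm(t, b=0.5)  # [-1.5, -0.5, 3.5]
```

The spread of the values (here 6.0 - 1.0 = 5.0) is left untouched; only the mean moves. In this scheme the scale is controlled by g from weight normalization instead.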

On the experimental side, they note that batch normalization is 16% slower than weight normalization, whereas BN makes better progress, especially in the initial iterations. As a final remark, they report a 7.31% test error on CIFAR-10, which is the state of the art among published works to my knowledge (not better than my best network :)). They also experiment with other settings such as RNNs, reinforcement learning and others, but please refer to the paper for more.