You’d think that initializing the weights and biases in a neural network wouldn’t be very difficult or interesting. Not so.

The simplest way to initialize weights and biases is to set them to small (perhaps -0.01 to +0.01) uniform random values. This works well for NNs with a single hidden layer. But this simple approach doesn't always work well with deep NNs, especially those that use ReLU (rectified linear unit) activation.
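A minimal sketch of this simple scheme, using NumPy (the layer sizes here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

# Illustrative layer sizes: 4 input nodes, 5 hidden nodes.
n_in, n_hid = 4, 5

# Small uniform random weights in [-0.01, +0.01].
weights = rng.uniform(low=-0.01, high=0.01, size=(n_in, n_hid))

# Biases are often just set to zero.
biases = np.zeros(n_hid)
```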

One common initialization scheme for deep NNs is called Glorot (also known as Xavier) Initialization. The idea is to initialize each weight with a small Gaussian value with mean = 0.0 and variance based on the fan-in and fan-out of the weight.

For example, each weight that connects an input node to a hidden node has a fan-in equal to the number of input nodes and a fan-out equal to the number of hidden nodes. Each such weight is drawn from a Gaussian distribution with mean = 0.0 and variance = 2.0 / (fan-in + fan-out).
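The Gaussian version of the scheme can be sketched as follows (the function name and layer sizes are illustrative, not from the original paper):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def glorot_normal(fan_in, fan_out, rng):
    # Gaussian with mean 0.0 and variance 2 / (fan_in + fan_out),
    # so the standard deviation is the square root of that variance.
    sd = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(loc=0.0, scale=sd, size=(fan_in, fan_out))

# Input-to-hidden weights for a net with 4 inputs and 5 hidden nodes:
# each weight has fan-in = 4 and fan-out = 5.
w_ih = glorot_normal(4, 5, rng)
```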

Instead of using variance = 2.0 / (fan-in + fan-out) with a Gaussian distribution, you can also use a Uniform distribution on the interval [-sqrt(6) / sqrt(fan-in + fan-out), +sqrt(6) / sqrt(fan-in + fan-out)]. As a result, the term "Glorot Initialization" is ambiguous because it can refer to two somewhat different algorithms.
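The Uniform variant can be sketched the same way (again, the function name is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def glorot_uniform(fan_in, fan_out, rng):
    # Uniform on [-limit, +limit] where
    # limit = sqrt(6) / sqrt(fan_in + fan_out).
    limit = np.sqrt(6.0) / np.sqrt(fan_in + fan_out)
    return rng.uniform(low=-limit, high=limit, size=(fan_in, fan_out))

# Same illustrative 4-input, 5-hidden-node layer as before.
w_ih = glorot_uniform(4, 5, rng)
```

Both variants are designed so that the variance of the weights is roughly the same; a Uniform distribution on [-limit, +limit] has variance limit^2 / 3, which with this limit works out to 2 / (fan-in + fan-out), matching the Gaussian version.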

If you want to read the original research paper, do a Web search for “Understanding the Difficulty of Training Deep Feedforward Neural Networks”.