I gave a talk about neural network back-propagation recently. In my introduction, I said that you can make a strong argument that back-propagation is the most important algorithm in all of machine learning.

Explaining back-propagation is difficult because there are many interrelated ideas. Each idea by itself is relatively simple, but when you compound dozens of ideas the overall concept gets very murky. The challenge when explaining back-propagation is to know the audience and then explain just enough, but not too much, theory.

Here’s my blitz summary:

1. The goal of back-propagation is to adjust weight and bias values so that with training data, the computed output values closely match the known, correct output values.

2. Each weight is adjusted by a weight delta term which is a small learning rate (like 0.05) times the gradient of the weight.

3. So, each weight has an associated gradient, which is the Calculus partial derivative of whatever error function you are using (usually mean squared error or cross entropy error) with respect to that weight. By the chain rule, this works out to a per-node term times the weight's associated input value.

4. The gradient of a hidden-to-output weight is the “signal” of the pointed-to node (an output node) times the associated input (a hidden node value).

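5. With mean squared error, the signal of an output node is (target – computed) times the Calculus derivative of the output layer activation function (usually softmax or logistic sigmoid). With softmax and cross entropy error, the derivative term cancels and the signal is just (target – computed).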

6. The gradient of an input-to-hidden weight is the “signal” of the pointed-to node (a hidden node) times the associated input (an input node value).

7. The signal of a hidden node is the sum of the signals of its associated output nodes, each multiplied by the connecting hidden-to-output weight, times the Calculus derivative of the hidden layer activation function (usually tanh, logistic sigmoid, or ReLU).
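
To make the seven steps a bit more concrete, here is a minimal sketch in plain Python. It is only an illustration, not a definitive implementation, and it rests on several assumptions of mine: a tiny 2-3-2 network, tanh hidden nodes, logistic sigmoid output nodes, mean squared error, and one training item at a time. The function names (forward, train_step, squared_error), the starting weights, and the training item are all made up for the demo. In the hidden signal, each output signal is multiplied by its connecting hidden-to-output weight before summing. Because the signals are built from (target – computed), each weight delta is added to its weight rather than subtracted.

```python
import math

def forward(x, w_ih, b_h, w_ho, b_o):
    """Compute hidden values (tanh) and output values (logistic sigmoid)."""
    hidden = []
    for j in range(len(b_h)):
        total = b_h[j] + sum(x[i] * w_ih[i][j] for i in range(len(x)))
        hidden.append(math.tanh(total))
    output = []
    for k in range(len(b_o)):
        total = b_o[k] + sum(hidden[j] * w_ho[j][k] for j in range(len(hidden)))
        output.append(1.0 / (1.0 + math.exp(-total)))
    return hidden, output

def train_step(x, target, w_ih, b_h, w_ho, b_o, lr=0.05):
    """One back-propagation update for a single training item (in place)."""
    hidden, output = forward(x, w_ih, b_h, w_ho, b_o)

    # Step 5: output signal = (target - computed) * sigmoid derivative.
    o_sig = [(target[k] - output[k]) * output[k] * (1.0 - output[k])
             for k in range(len(output))]

    # Step 7: hidden signal = weighted sum of output signals * tanh derivative.
    # Computed before any weight changes, so it uses the old weights.
    h_sig = [sum(o_sig[k] * w_ho[j][k] for k in range(len(o_sig)))
             * (1.0 - hidden[j] ** 2)
             for j in range(len(hidden))]

    # Steps 2 and 4: hidden-to-output gradient = output signal * hidden value.
    for j in range(len(hidden)):
        for k in range(len(output)):
            w_ho[j][k] += lr * o_sig[k] * hidden[j]
    for k in range(len(output)):
        b_o[k] += lr * o_sig[k]  # a bias acts like a weight whose input is 1

    # Steps 2 and 6: input-to-hidden gradient = hidden signal * input value.
    for i in range(len(x)):
        for j in range(len(hidden)):
            w_ih[i][j] += lr * h_sig[j] * x[i]
    for j in range(len(hidden)):
        b_h[j] += lr * h_sig[j]

def squared_error(target, output):
    """Sum of squared differences between target and computed outputs."""
    return sum((t - o) ** 2 for t, o in zip(target, output))

# Demo with made-up values: one training item, arbitrary starting weights.
x = [1.0, 2.0]
target = [1.0, 0.0]
w_ih = [[0.1, -0.2, 0.3], [0.4, 0.5, -0.6]]   # 2 inputs x 3 hidden
b_h = [0.0, 0.0, 0.0]
w_ho = [[0.1, -0.1], [0.2, -0.2], [0.3, 0.3]]  # 3 hidden x 2 outputs
b_o = [0.0, 0.0]

before = squared_error(target, forward(x, w_ih, b_h, w_ho, b_o)[1])
for _ in range(200):
    train_step(x, target, w_ih, b_h, w_ho, b_o)
after = squared_error(target, forward(x, w_ih, b_h, w_ho, b_o)[1])
print("error before:", before, "error after:", after)
```

The one ordering detail worth noticing is that both sets of signals are computed before any weight is changed, so the hidden signals see the old hidden-to-output weights. A real training loop would also iterate over many training items, usually in shuffled order.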

Whoa! Each part is relatively simple, but learning each of these parts and how they fit together took me several months.

I believe I know as much about back-propagation as anyone (well, the engineering aspects of it anyway), but explaining back-propagation is very difficult.