Building a Neural Network from Scratch: Part 2

In this post we’ll improve our training algorithm from the previous post. When we’re done, we’ll be able to reach 98% accuracy on the MNIST data set after just 9 epochs of training, which takes only about 30 seconds to run on my laptop.

For comparison, last time we only reached 92% accuracy after 2,000 epochs of training, which took over an hour!

The main driver of this improvement is the switch from batch gradient descent to mini-batch gradient descent. But we’ll also make two other, smaller improvements: we’ll add momentum to our descent algorithm, and we’ll smarten up the initialization of our network’s weights.

We’ll also reorganize our code a bit while we’re at it, making things more modular.

But first we need to import and massage our data. These steps are the same as in the previous post.
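The loading code itself isn’t reproduced here, but the massaging boils down to two operations: scaling pixel values into [0, 1] and one-hot encoding the labels. Here is a minimal sketch of those two steps, using a tiny fake batch in place of the real MNIST download (the `preprocess` helper is my name for it, not necessarily the post’s):

```python
import numpy as np

def preprocess(X, y, n_classes=10):
    """Scale pixels to [0, 1] and one-hot encode the labels."""
    X = X.astype(np.float64) / 255.0      # raw pixel values are 0-255
    Y = np.eye(n_classes)[y.astype(int)]  # one row of the identity per label
    return X, Y

# Tiny fake batch standing in for the real MNIST data:
X_raw = np.array([[0, 128, 255], [255, 0, 128]])
y_raw = np.array([3, 7])
X, Y = preprocess(X_raw, y_raw)
```

For the real data set, `X_raw` would be the 70,000 × 784 array of flattened MNIST images.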

Then we’ll define our key functions. Only the last two are new, and they just put the steps of forward and backward propagation into their own functions. This tidies up the training code to follow, so that we can focus on the novel elements, especially mini-batch descent and momentum.

Notice that in the process we introduce three dictionaries: params, cache, and grads. These are for conveniently passing information back and forth between the forward and backward passes.
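To illustrate how those three dictionaries fit together, here is a hedged sketch of forward and backward passes for a one-hidden-layer network with a sigmoid hidden layer and a softmax output; the exact layer sizes and activations in the post may differ:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max(axis=0, keepdims=True))  # shift for numerical stability
    return e / e.sum(axis=0, keepdims=True)

def forward(X, params):
    """Forward pass; returns the output plus a cache for backprop."""
    Z1 = params["W1"] @ X + params["b1"]
    A1 = sigmoid(Z1)
    Z2 = params["W2"] @ A1 + params["b2"]
    A2 = softmax(Z2)
    return A2, {"Z1": Z1, "A1": A1, "Z2": Z2, "A2": A2}

def backward(X, Y, params, cache):
    """Cross-entropy gradients with respect to each parameter."""
    m = X.shape[1]                  # examples are stored as columns
    dZ2 = cache["A2"] - Y           # softmax + cross-entropy simplification
    grads = {"dW2": dZ2 @ cache["A1"].T / m,
             "db2": dZ2.sum(axis=1, keepdims=True) / m}
    dZ1 = (params["W2"].T @ dZ2) * cache["A1"] * (1 - cache["A1"])
    grads["dW1"] = dZ1 @ X.T / m
    grads["db1"] = dZ1.sum(axis=1, keepdims=True) / m
    return grads

# Smoke test on random data: 4 inputs, 5 hidden units, 3 classes, 7 examples.
rng = np.random.default_rng(0)
params = {"W1": rng.standard_normal((5, 4)) * 0.1, "b1": np.zeros((5, 1)),
          "W2": rng.standard_normal((3, 5)) * 0.1, "b2": np.zeros((3, 1))}
X = rng.standard_normal((4, 7))
Y = np.eye(3)[:, rng.integers(0, 3, 7)]   # one-hot columns
A2, cache = forward(X, params)
grads = backward(X, Y, params, cache)
```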

To switch to mini-batch descent, we add another for loop inside the loop over epochs. At the start of each epoch we randomly shuffle the training set, then iterate through it in chunks of batch_size, which we’ll arbitrarily set to 128. We’ll see the code for all this in a moment.
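A sketch of what that shuffle-and-chunk step might look like, assuming examples are stored as columns (the `minibatches` helper name is illustrative, not from the post):

```python
import numpy as np

def minibatches(X, Y, batch_size=128, rng=None):
    """Shuffle the examples (columns), then yield successive chunks."""
    if rng is None:
        rng = np.random.default_rng()
    m = X.shape[1]
    perm = rng.permutation(m)          # fresh random order each epoch
    X, Y = X[:, perm], Y[:, perm]
    for start in range(0, m, batch_size):
        yield X[:, start:start + batch_size], Y[:, start:start + batch_size]

# Six examples with batch_size=4 gives one full batch and one leftover batch.
X = np.arange(12, dtype=float).reshape(2, 6)
Y = np.eye(2)[:, [0, 1, 0, 1, 0, 1]]
batches = list(minibatches(X, Y, batch_size=4, rng=np.random.default_rng(0)))
```

Note that the last batch is simply smaller when the batch size doesn’t divide the training-set size evenly.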

Next, to add momentum, we keep an exponentially weighted moving average of our gradients. So instead of updating each parameter directly with its raw gradient, e.g. W1 -= learning_rate * dW1, we first fold the gradient into a running velocity for that parameter, and then step along the velocity.
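One common way to implement this is the following sketch; the decay rate beta = 0.9 and the helper name are my choices, not necessarily the post’s:

```python
import numpy as np

def update_with_momentum(params, grads, velocity, learning_rate=0.1, beta=0.9):
    """Blend each gradient into a running velocity, then step along it."""
    for key in params:
        v = velocity.get("d" + key, 0.0)   # velocities start at zero
        velocity["d" + key] = beta * v + (1 - beta) * grads["d" + key]
        params[key] = params[key] - learning_rate * velocity["d" + key]
    return params, velocity

# One update step: velocity = 0.9*0 + 0.1*10 = 1.0, so W1 = 1.0 - 0.1*1.0 = 0.9
params = {"W1": np.array([1.0])}
grads = {"dW1": np.array([10.0])}
params, velocity = update_with_momentum(params, grads, velocity={})
```

Because the velocity averages over recent gradients, steps in a consistent direction accelerate while oscillating components tend to cancel out.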

Finally, to smarten up our initialization, we shrink the variance of the weights in each layer. Following this nice video by Andrew Ng (whose excellent Coursera materials I’ve been relying on heavily in these posts), we’ll set the variance for each layer to $1/n$, where $n$ is the number of inputs feeding into that layer.

We’ve been using the np.random.randn() function to get our initial weights, and this function draws from the standard normal distribution, which has variance 1. So to adjust the variance to $1/n$, we just divide by $\sqrt{n}$. In code this means that instead of doing e.g. np.random.randn(n_h, n_x), we do np.random.randn(n_h, n_x) * np.sqrt(1. / n_x).
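Putting that together, an initialization routine for a one-hidden-layer network might look like this sketch (the function name and the specific layer sizes below are illustrative):

```python
import numpy as np

def init_params(n_x, n_h, n_y):
    """Initialize weights with variance 1/fan-in; biases start at zero."""
    return {
        "W1": np.random.randn(n_h, n_x) * np.sqrt(1. / n_x),
        "b1": np.zeros((n_h, 1)),
        "W2": np.random.randn(n_y, n_h) * np.sqrt(1. / n_h),
        "b2": np.zeros((n_y, 1)),
    }

# MNIST-shaped example: 784 inputs, 64 hidden units, 10 output classes.
params = init_params(784, 64, 10)
```

The empirical standard deviation of W1 should come out near $\sqrt{1/784} \approx 0.036$, rather than the 1.0 we’d get from the unscaled draw.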