Building a Neural Network from Scratch: Part 1

In this post we’re going to build a neural network from scratch. We’ll train it to recognize hand-written digits, using the famous MNIST data set.

We’ll use just basic Python with NumPy to build our network (no high-level stuff like Keras or TensorFlow). We will dip into scikit-learn, but only to get the MNIST data and to assess our model once it’s built.

We’ll start with the simplest possible “network”: a single node that recognizes just the digit 0. This is actually just an implementation of logistic regression, which may seem kind of silly. But it’ll help us get some key components working before things get more complicated.

Then we’ll extend that into a network with one hidden layer, still recognizing just 0. Then we’ll add a softmax for recognizing all the digits 0 through 9. That’ll give us a 92% accurate digit-recognizer, bringing us up to the cutting edge of 1985 technology.

In a followup post we’ll bring that up into the high nineties by making sundry improvements: better optimization, more hidden layers, and smarter initialization.

1. Hello, MNIST

MNIST contains 70,000 images of hand-written digits, each 28 x 28 pixels, in greyscale with pixel values from 0 to 255. We could download and preprocess the data ourselves. But the makers of scikit-learn already did that for us. Since it would be rude to neglect their efforts, we’ll just import it:
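Something like this does the trick (a minimal sketch using scikit-learn's fetch_openml; I'm also scaling the pixel values down to the 0 to 1 range, a standard preprocessing step):

```python
import numpy as np
from sklearn.datasets import fetch_openml

# Fetch all 70,000 MNIST images (one flattened 784-pixel row per image)
# and their labels, as plain NumPy arrays.
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)

# Scale pixel values from 0-255 down to 0-1.
X = X / 255.0

print(X.shape)  # (70000, 784)
print(y.shape)  # (70000,)
```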

The default MNIST labels record 7 for an image of a seven, 4 for an image of a four, etc. But we’re just building a zero-classifier for now. So we want our labels to say 1 when we have a zero, and 0 otherwise (intuitive, I know). So we’ll overwrite the labels to make that happen:
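A sketch of that relabeling (fetch_openml hands the labels back as strings, so we compare against '0'):

```python
# Binary labels for the zero-detector: 1 if the image is a zero, 0 otherwise.
y = (y == '0').astype(int)

# Sanity check: roughly 10% of MNIST images are zeros.
print(y.mean())
```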

Now we can make our train/test split. The MNIST images are pre-arranged so that the first 60,000 can be used for training, and the last 10,000 for testing. We’ll also transform the data into the shape we want, with each example in a column (instead of a row):
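Something along these lines (the split is purely positional, the transposes give us one example per column, and I've thrown in a shuffle of the training set, which we'll refer back to later):

```python
# First 60,000 examples for training, last 10,000 for testing.
# Transpose so that each example sits in a column: X_train is (784, 60000).
X_train, X_test = X[:60000].T, X[60000:].T
y_train, y_test = y[:60000].reshape(1, -1), y[60000:].reshape(1, -1)

# Shuffle the training columns so the digits aren't in any particular order.
shuffle = np.random.permutation(60000)
X_train, y_train = X_train[:, shuffle], y_train[:, shuffle]

print(X_train.shape, y_train.shape)  # (784, 60000) (1, 60000)
print(X_test.shape, y_test.shape)    # (784, 10000) (1, 10000)
```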

We’ll vectorize by stacking examples side-by-side, so that our input matrix $X$ has an example in each column. The vectorized form of the forward pass is then:
$$ \hat{y} = \sigma(w^T X + b). $$
Note that $\hat{y}$ is now a vector, not a scalar as it was in the previous equation.

In our code we’ll compute this in two stages: Z = np.matmul(W.T, X) + b and then A = sigmoid(Z). (A for Activation.) Breaking things up into stages like this is just for tidiness—it’ll make our forward propagation computations mirror the steps in our backward propagation computations.
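As a sketch, with W a (784, 1) column of weights, b a scalar bias, and the forward wrapper just an illustrative helper:

```python
def sigmoid(z):
    # Squash z into (0, 1): 1 / (1 + e^(-z)).
    return 1 / (1 + np.exp(-z))

def forward(W, b, X):
    # Stage 1: the linear part, one z-value per example (shape (1, m)).
    Z = np.matmul(W.T, X) + b
    # Stage 2: the activation, our predicted probabilities y-hat.
    A = sigmoid(Z)
    return Z, A
```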

So, now that we’ve got a working model and optimization algorithm, let’s enrich it.

3. One Hidden Layer

Let’s add a hidden layer now, with 64 units (a mostly arbitrary choice). I won’t go through the derivations of all the formulas for the forward and backward passes this time; they’re a pretty direct extension of the work we did earlier. Instead let’s just dive right in and build the model:
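Here's a sketch of what that looks like, reusing the sigmoid helper from before (the 0.01 initialization scale, learning rate, and iteration count are illustrative choices, not tuned values):

```python
m = X_train.shape[1]          # number of training examples
alpha = 1.0                   # learning rate

# Small random weights, zero biases: 784 -> 64 -> 1.
rng = np.random.default_rng(0)
W1 = rng.standard_normal((784, 64)) * 0.01
b1 = np.zeros((64, 1))
W2 = rng.standard_normal((64, 1)) * 0.01
b2 = np.zeros((1, 1))

for i in range(100):
    # Forward pass.
    Z1 = np.matmul(W1.T, X_train) + b1    # (64, m)
    A1 = sigmoid(Z1)
    Z2 = np.matmul(W2.T, A1) + b2         # (1, m)
    A2 = sigmoid(Z2)

    # Backward pass for the cross-entropy loss.
    dZ2 = A2 - y_train
    dW2 = np.matmul(A1, dZ2.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.matmul(W2, dZ2) * A1 * (1 - A1)   # sigmoid derivative
    dW1 = np.matmul(X_train, dZ1.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    # Gradient descent step.
    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2
```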

Hmm, not bad, but about the same as our one-neuron model did. We could do more training and add more nodes/layers. But it’ll be slow going until we improve our optimization algorithm, which we’ll do in a followup post.

So for now let’s turn to recognizing all ten digits.

4. Upgrading to Multiclass

4.1 Labels

First we need to redo our labels. We’ll re-import everything, so that we don’t have to go back and coordinate with our earlier shuffling:
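A sketch of the re-import, plus a one-hot encoding so that each label becomes a 10-row column with a 1 in the slot for its digit (the one-hot layout is an assumption that pairs naturally with the softmax below):

```python
# Re-fetch so the labels are the original digits 0-9 again.
X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X / 255.0
y = y.astype(int)

# One-hot encode: Y has shape (10, 70000), with a 1 in row d for digit d.
Y = np.zeros((10, y.size))
Y[y, np.arange(y.size)] = 1

# Same positional split as before, one example per column.
X_train, X_test = X[:60000].T, X[60000:].T
Y_train, Y_test = Y[:, :60000], Y[:, 60000:]

print(Y_train[:, 0])  # the first training label as a one-hot column
```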

Looks good, so let’s consider what changes we need to make to the model itself.

4.2 Forward Propagation

Only the last layer of our network is changing. To add the softmax, we have to replace our lone, final node with a 10-unit layer. Its final activations are the exponentials of its $z$-values, normalized across all ten such exponentials. So instead of just computing $\sigma(z)$, we compute the activation for each unit $i$:
$$ \frac{e^{z_i}}{\sum_{j=0}^{9} e^{z_j}}.$$
So, in our vectorized code, the last line of forward propagation will be A2 = np.exp(Z2) / np.sum(np.exp(Z2), axis=0).
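Wrapped up as a helper, that looks like this (subtracting the per-column max before exponentiating is a standard guard against overflow and doesn't change the result, since the softmax is shift-invariant):

```python
def softmax(Z):
    # Exponentiate and normalize each column so it sums to 1.
    e = np.exp(Z - Z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)
```

With that in place, the last line of forward propagation is just A2 = softmax(Z2).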

4.4 Backprop

Luckily it turns out that backprop isn’t really affected by the switch to a softmax. The softmax generalizes the sigmoid activation we’ve been using, in such a way that the code we wrote earlier still works. We could verify this by deriving:
$$\frac{\partial L}{\partial z_i} = \hat{y}_i - y_i.$$
But I won’t walk through the steps here. Let’s just go ahead and build our final network.
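As a sketch, putting the pieces together (the same structure as the hidden-layer model, just with a 10-unit softmax output and the one-hot labels; the hyperparameters are again illustrative):

```python
m = X_train.shape[1]
alpha = 1.0

rng = np.random.default_rng(0)
W1 = rng.standard_normal((784, 64)) * 0.01
b1 = np.zeros((64, 1))
W2 = rng.standard_normal((64, 10)) * 0.01
b2 = np.zeros((10, 1))

for i in range(100):
    # Forward pass: sigmoid hidden layer, softmax output.
    Z1 = np.matmul(W1.T, X_train) + b1
    A1 = sigmoid(Z1)
    Z2 = np.matmul(W2.T, A1) + b2
    A2 = softmax(Z2)                      # (10, m)

    # Backward pass: dL/dz = y-hat - y, exactly as in the binary case.
    dZ2 = A2 - Y_train
    dW2 = np.matmul(A1, dZ2.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.matmul(W2, dZ2) * A1 * (1 - A1)
    dW1 = np.matmul(X_train, dZ1.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    W1 -= alpha * dW1; b1 -= alpha * db1
    W2 -= alpha * dW2; b2 -= alpha * db2

# Predict by taking the most probable digit in each column.
A1_test = sigmoid(np.matmul(W1.T, X_test) + b1)
A2_test = softmax(np.matmul(W2.T, A1_test) + b2)
predictions = np.argmax(A2_test, axis=0)
print(np.mean(predictions == np.argmax(Y_test, axis=0)))  # test accuracy
```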