Give me your digits

Let us first dispense with the unpleasantries. While our neural network shall
run on the base install, to read the input data we must venture slightly
further because the database files are gzip-compressed.
Accordingly, we install a package that can decompress gzip data, such as zlib.
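
A minimal sketch of reading one of the compressed database files, assuming the
zlib package; the file name is the standard MNIST one, and loadTrainingImages
is a name introduced here for illustration:

import qualified Codec.Compression.GZip as GZip
import qualified Data.ByteString.Lazy as BS

-- Decompress a gzip'd MNIST database file into a lazy ByteString.
-- (Illustrative helper, not part of the original program listing.)
loadTrainingImages :: IO BS.ByteString
loadTrainingImages = GZip.decompress <$> BS.readFile "train-images-idx3-ubyte.gz"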

Ample Samples

Later on, we will need to sample from a normal distribution.
Packages exist for this, but we stay within core Haskell by implementing the
Box-Muller
transform. Though our version is inefficient, it’s good enough for our
purposes.
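
A minimal sketch, assuming System.Random from the random package for the
uniform samples; the name gauss is introduced here:

import System.Random (randomIO)

-- Box-Muller: turn two uniform samples from [0,1) into one sample from a
-- normal distribution with mean 0 and the given standard deviation.
-- Wasteful: the transform yields a second normal that we simply discard.
gauss :: Float -> IO Float
gauss stdev = do
  x <- randomIO
  y <- randomIO
  return $ stdev * sqrt (-2 * log x) * cos (2 * pi * y)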

Weight Training

A single neuron takes in a bunch of floats from other neurons or
directly from the input data, multiplies each by a weight, and sums them.
It adds this total to a bias, passes it through the activation function,
and outputs the result.

This is easier to state with code. Let ws be the weights, b the bias,
as the inputs, and f the activation function. Then the output is:

f (sum (zipWith (*) ws as) + b)

We choose the
rectifier
(ReLU) to be our activation function. This differs from Nielsen, who uses
the sigmoid function, and it means we should initialize our biases differently.

relu = max 0

Like Nielsen, our network has an input layer of 28² = 784 raw numbers which
feed into a hidden layer of 30 neurons, which in turn feeds into an output
layer of 10 neurons. Each pair of adjacent layers has full mesh topology,
namely, every node of one layer is connected to every node of the next.

The input values lie in [0,1] and indicate the darkness of a pixel.

The output neurons give nonnegative values: we’ll train the nth neuron to
output at least 1 if the digit is n, and 0 otherwise.

This scheme results in a peculiar cost function. If we had also chosen the
sigmoid function, then our outputs would be bounded above by 1 and we could
train our network to aim for exactly 1. But we’ve chosen relu, whose outputs
can be arbitrarily large. Intuitively, I feel it makes sense to interpret
anything that is at least 1 as a strong positive indication, and we should
train for 0 otherwise. I don’t know what the literature recommends.
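
Made explicit, one cost consistent with this interpretation is half the
squared error, waived once a neuron that should output 1 has reached 1; its
derivative is the dCost rule used later. A sketch (the name cost is introduced
here, and training only ever needs the derivative):

-- Half squared error, except no penalty once an output that should be 1
-- has reached 1. Shown for exposition only.
cost :: Float -> Float -> Float
cost a y | y == 1 && a >= 1 = 0
         | otherwise        = (a - y)^2 / 2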

For each layer of neurons, we store the biases in a list of floats [Float]:
the ith float is the bias for the ith neuron. Similarly, we store the weights
in a two-dimensional matrix of floats [[Float]]: the ith row holds the
weights of the inputs to the ith node. Since our neural network is a bunch of
layers, we store it as a list of biases and weights for each layer:
[([Float], [[Float]])].
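
For a concrete picture of this representation, here is a toy network with 3
inputs, a hidden layer of 2 neurons, and 1 output neuron; the values are
purely illustrative:

-- [(biases, weight rows)]: one pair per layer, one weight row per neuron.
tiny :: [([Float], [[Float]])]
tiny = [ ([1, 1], [[0.1, -0.2, 0.3], [0.0, 0.4, -0.1]])  -- hidden layer: 2 neurons, 3 inputs each
       , ([1],    [[0.5, -0.5]])                         -- output layer: 1 neuron, 2 inputs
       ]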

We initialize all the biases to 1, and the weights to come from a normal
distribution with mean 0 and standard deviation 0.01.
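
A sketch of this initialization, reusing the gauss sampler from earlier; the
name newBrain is introduced here, and its argument lists the layer sizes, so
our network would be built with newBrain [784, 30, 10]:

import Control.Monad (replicateM, zipWithM)

-- Every bias starts at 1; every weight is drawn from a normal distribution
-- with mean 0 and standard deviation 0.01.
newBrain :: [Int] -> IO [([Float], [[Float]])]
newBrain szs@(_:ts) = zip (flip replicate 1 <$> ts)
  <$> zipWithM (\m n -> replicateM n $ replicateM m $ gauss 0.01) szs ts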

We’ll want to hang on to the values before they are fed to the
activation functions; these are called the weighted inputs. We compute an
entire layer of weighted inputs from the previous layer as follows:
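
A sketch of that computation; the name zLayer and its argument shape are taken
from its use in revaz further down:

-- Weighted inputs of one layer: for each neuron, dot the previous layer's
-- activations as with that neuron's weight row, then add its bias.
zLayer :: [Float] -> ([Float], [[Float]]) -> [Float]
zLayer as (bs, wms) = zipWith (+) bs $ sum . zipWith (*) as <$> wms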

Lesson Plan

We train the weights and biases using gradient descent, so we need the
derivative of the activation function (with its discontinuity filled in):

relu' x | x < 0     = 0
        | otherwise = 1

We use the same desired output vector format as Nielsen, that is,
if a given training example is labeled with the digit i, then we set the ith
component to 1, and the others to 0.
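
For example, a sketch of building such a vector; the name desired is
introduced here for illustration:

-- Desired output for a labeled digit, e.g. desired 3 == [0,0,0,1,0,0,0,0,0,0].
desired :: Int -> [Float]
desired d = [ if i == d then 1 else 0 | i <- [0 .. 9] ]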

However, since we consider any output of at least 1 to play the same role as
a 1 in our output vectors, our derivative of the cost with respect to the
output value is:

dCost a y | y == 1 && a >= y = 0
          | otherwise        = a - y

Now, backpropagation requires the weighted inputs and activations of every
neuron, that is, the values just before and after we run the activation
function during feedforward. Because backpropagation goes through the layers
in reverse order, it helps to have these values in reverse order, which
leads to the following awkwardly-named utility function:

-- xv: vector of inputs
-- Returns (activations, weighted inputs) of each layer,
-- from last layer to first.
revaz :: [Float] -> [([Float], [[Float]])] -> ([[Float]], [[Float]])
revaz xv = foldl' (\(avs@(av:_), zs) (bs, wms) ->
  let zs' = zLayer av (bs, wms)
  in ((relu <$> zs') : avs, zs' : zs)) ([xv], [])

From these values we compute the output error delta0, and backpropagate:

-- xv: vector of inputs
-- yv: vector of desired outputs
-- Returns list of (activations, deltas) of each layer in order.
deltas :: [Float] -> [Float] -> [([Float], [[Float]])] -> ([[Float]], [[Float]])
deltas xv yv layers =
  let (avs@(av:_), zv:zvs) = revaz xv layers
      delta0 = zipWith (*) (zipWith dCost av yv) (relu' <$> zv)
  in (reverse avs, f (transpose . snd <$> reverse layers) zvs [delta0])
  where
    f _ [] dvs = dvs
    f (wm:wms) (zv:zvs) dvs@(dv:_) = f wms zvs $ (: dvs) $
      zipWith (*) [sum $ zipWith (*) row dv | row <- wm] (relu' <$> zv)

We opt for online learning; in other words, we modify our weights and biases
immediately after processing each case, rather than batching the changes.
Our learning rate is 0.002:
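
A sketch of one online update built on deltas; the names eta, descend, and
learn are introduced here:

eta :: Float
eta = 0.002

-- Step a vector a small distance against its gradient.
descend :: [Float] -> [Float] -> [Float]
descend av dv = zipWith (-) av ((eta *) <$> dv)

-- Process one training case (inputs xv, desired outputs yv) and
-- immediately update every bias and weight.
learn :: [Float] -> [Float] -> [([Float], [[Float]])] -> [([Float], [[Float]])]
learn xv yv layers =
  let (avs, dvs) = deltas xv yv layers
  in zip (zipWith descend (fst <$> layers) dvs) $
       zipWith3 (\wvs av dv ->
           zipWith (\wv d -> descend wv ((d *) <$> av)) wvs dv)
         (snd <$> layers) avs dvs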

When trained on the first 10000 cases, our neural network correctly classifies
about 80% of the test cases.
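
Reading off a classification amounts to a full feedforward pass followed by
taking the strongest output; a sketch, where feed and bestOf are names
introduced here:

import Data.List (foldl')

-- Run an input vector through every layer.
feed :: [Float] -> [([Float], [[Float]])] -> [Float]
feed = foldl' (\av layer -> relu <$> zLayer av layer)

-- Index of the largest output, i.e. the predicted digit.
bestOf :: [Float] -> Int
bestOf xs = snd $ maximum $ zip xs [0 ..]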

The Numbers Game

The following program assumes the four input MNIST database files reside in the
current directory. We randomly select a test case as an example and print it.
We then show how training pays off with crude graphs that represent the neural
network’s response to our example. Finally, we try all 10000 test cases and
count the number we get right.
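
The MNIST IDX layout is simple: image files begin with a 16-byte header
followed by 784 bytes per image, and label files begin with an 8-byte header
followed by one byte per label. A sketch of pulling out one case, scaling
pixels to the unit interval; getImage and getLabel are names introduced here:

import qualified Data.ByteString.Lazy as BS

-- n-th image as 784 floats in [0,1), and the n-th label as a digit.
getImage :: BS.ByteString -> Int -> [Float]
getImage imgs n =
  [ fromIntegral (BS.index imgs (fromIntegral (16 + n * 784 + p))) / 256
  | p <- [0 .. 783] ]

getLabel :: BS.ByteString -> Int -> Int
getLabel lbls n = fromIntegral $ BS.index lbls (fromIntegral (8 + n))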

The neural network takes about 40 lines and compiles on a base Haskell
system. The code to read the data, train the network, and print results at
certain checkpoints takes about another 40 lines.

Beginner’s Luck

This project was guided by expedience. I wanted to code a neural network
from scratch, but I also wanted it running as soon as possible, so I
cut corners.

I chose online learning so I could avoid code for batching deltas. My code also
trains on the data in the order of appearance.

I chose ReLU because I heard it can converge faster than the sigmoid. This
meant I was unable to mimic Nielsen’s code for picking weights and biases,
because a ReLU neuron that is initially likely to output zero will be hard to
train. Luckily, a brief web search revealed that we can set all the initial
biases to 1. As for the weights, the normal distribution around 0 with standard
deviation 0.01 felt about right.

I stored matrices in Haskell lists: great for economy of expression, but not so
great for performance. To compensate, the code only trains for one session, and
on only one sixth of the data. To further alleviate my impatience, my program
gives updates on its progress.

Despite so many compromises, the above reliably produces neural networks that
correctly classify roughly 75% to 85% of the test cases, taking about a minute
to run on my laptop.

Haste Makes Sloth

I obtained better results by randomizing the order of the training cases and
feeding them all to the network, though sometimes the results were worse.
One run produced a neural net scoring 9202 out of 10000.

I had hoped to code in Haskell exclusively, but the Haste
compiler introduced too much overhead. I resorted to moving both the neural
net feedforward routine and the sample image data directly into JavaScript.
I wonder if Fay can help.