Vanilla Neural Network

I. Objective:

II. Linear Model:

A simple linear model with a softmax layer on top. The main difference from a neural network is the lack of a non-linear activation function (ReLU, tanh, etc.). Thanks to Karpathy for the data and code structure, but we will break down the math behind the lines for better understanding. You can check out the code for loading the data on the GitHub repo, but here we will focus on the main model operations.
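As a minimal sketch of that forward pass (NumPy, with stand-in data and my own variable names `W` and `b`; the real data loading and training loop live in the repo):

```python
import numpy as np

def softmax(scores):
    # Subtract the row-wise max for numerical stability before exponentiating.
    exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
    return exp_scores / exp_scores.sum(axis=1, keepdims=True)

# Toy stand-in for the spiral data: N points, D features, C classes.
np.random.seed(0)
N, D, C = 300, 2, 3
X = np.random.randn(N, D)

# Linear model: one weight matrix and bias, no activation function.
W = 0.01 * np.random.randn(D, C)
b = np.zeros((1, C))

scores = X.dot(W) + b    # (N, D) -> (N, C)
probs = softmax(scores)  # each row is a probability distribution over classes
```

Because the scores are a purely linear function of the input, the resulting decision boundaries between classes are straight lines.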

We can see that the decision boundary of our classifier is linear and cannot adapt to the non-linear contortions of the data.

III. Neural Network:

Now we introduce a neural net with a softmax on the last layer for class probabilities. We use a ReLU unit to introduce non-linearity. Our network will have two layers, where the shape of the input will be manipulated as follows:
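A sketch of that two-layer forward pass (NumPy; the hidden size `H` here is my own choice, and the exact shapes in the post come from its figure and repo code):

```python
import numpy as np

np.random.seed(0)
N, D, H, C = 300, 2, 100, 3   # points, input dim, hidden units, classes
X = np.random.randn(N, D)

# Layer 1: linear transform followed by the ReLU non-linearity.
W1 = 0.01 * np.random.randn(D, H)
b1 = np.zeros((1, H))
hidden = np.maximum(0, X.dot(W1) + b1)   # (N, D) -> (N, H)

# Layer 2: linear transform to class scores, softmax for probabilities.
W2 = 0.01 * np.random.randn(H, C)
b2 = np.zeros((1, C))
scores = hidden.dot(W2) + b2             # (N, H) -> (N, C)

exp_scores = np.exp(scores - scores.max(axis=1, keepdims=True))
probs = exp_scores / exp_scores.sum(axis=1, keepdims=True)
```

The ReLU between the two linear transforms is what lets the model bend its decision boundary around non-linear data.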

The resulting decision boundary is able to classify the non-linear data really well.

IV. Tensorflow Implementation:

We will start by setting up our TensorFlow model, but we will have an extra function called summarize() which will store the progress as we train through the epochs. We will decide which values to store with tf.scalar_summary() so we can see the changes later.

Extras (Dropout and DropConnect):

There are many add-on techniques to this vanilla neural network that work to increase optimization, robustness and overall performance. We will be covering many of them in future posts, but I will briefly talk about a very common regularization technique: dropout.

What is it? Dropout is a regularization technique that sets the outputs of certain neurons to zero. This is effectively the same as those neurons not existing in the network. We do this for p% of the total neurons in each layer, and for each batch, a new p% of the neurons in each layer are “dropped”.

Why do we do this? It works out to be a great regularization technique because for each input batch, we are sampling from a different neural net since a whole new set of neurons are dropped. By repeating this, we are preventing the units from co-adapting too much to the data. The original paper describes each iteration as a “thinned” network because p% of the neurons are dropped. Note: Dropout is only for training time. At test time, we will not be dropping any neurons.

In the image above, the layer has p=0.5, which means half of its units are dropped. In another iteration, a different half of the neurons will be dropped. Let’s take a look at the masking code to really understand what’s happening.

We use a Bernoulli distribution to generate 0s and 1s, where each unit is 0 with probability p. We apply this mask to the outputs from our layer. The units that are multiplied by zero are our “dropped” neurons, since they will yield an output of 0 when multiplied by the next set of weights.
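A minimal version of that masking (NumPy; the post's exact code isn't shown here, so the shapes and names are illustrative):

```python
import numpy as np

np.random.seed(0)
p = 0.5                          # drop probability
hidden = np.random.randn(4, 6)   # example layer outputs for one batch

# Bernoulli mask: each unit is 0 with probability p, 1 otherwise.
mask = (np.random.rand(*hidden.shape) >= p).astype(hidden.dtype)

# "Dropped" neurons output exactly 0 for this batch.
dropped = hidden * mask
```

A common refinement ("inverted dropout") also scales the surviving activations by 1/(1-p) at train time, so that expected activations match test time and no rescaling is needed when dropout is turned off.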

Another regularization method, which is an extension of dropout, is DropConnect. It uses a similar mechanism but applies the mask to the weights instead.

Notice that here, a set of weights are dropped instead of the neurons.

We apply a similar Bernoulli mask to the weights and use those masked weights in the layer. Any inputs dotted with the zeroed weights contribute 0 to the output. You can see the similarity with dropout, and empirically both techniques offer similar results. DropConnect was proposed because there are always more weights than neurons, so there are more ways to create “thinned” models, which should result in more robust training. In practice, however, you will mostly see dropout being used and very rarely DropConnect, since the results are similar.
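The weight-masking above can be sketched like so (NumPy; shapes and names are again illustrative, not the post's exact code):

```python
import numpy as np

np.random.seed(0)
p = 0.5
X = np.random.randn(4, 6)   # inputs to the layer
W = np.random.randn(6, 3)   # layer weights
b = np.zeros((1, 3))

# DropConnect: mask individual weights rather than whole neuron outputs.
weight_mask = (np.random.rand(*W.shape) >= p).astype(W.dtype)
W_thinned = W * weight_mask

# Inputs dotted with the zeroed weights contribute nothing to the output.
out = X.dot(W_thinned) + b
```

Note the contrast with dropout: here an output unit can still be active, but only a random subset of its incoming connections participate in the dot product.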