Month: September 2015

Recalling the perceptron

In the previous post we have seen how an artificial neuron works to classify simple functions and we have also seen its limitations when the space function to be classified is not linear, e.g. it fails to compute correctly the XOR function.

We have also seen that the way to overcome this limitation is to combine more neurons together in a network but one thing we did not yet seen was how to compute the “weights” of the network (they were hard-coded in the previous example).

As a reminder, here is the perceptron able to compute an OR function:

The input values are multiplied with specific weights and combined together into an activation function which outputs the OR function.
The “artificial neural network” (ANN) is really just this matrix of weights. Here the matrix has only one column (the OR function is very simple) but we may have “layers” and layers of neurons and their weights.
The values calculated are just transient values based on the input dataset. We don’t save them, they are forgotten.
All of the learning is stored in the weight matrix.

Like similar supervised models, the ANNs can be trained to learn the weights using some labeled training data (the “supervised” data) and once the weights matrix is learnt, we can use it to compute the function that we trained for.

Sigmoid neurons

Before seeing how the weights can be learned, we will slightly tune the activation function (the red circle in the figure above) in order to make the learning phase easier.
In the perceptron example, this activation function was a simple threshold one, outputting 0 if the input was below a threshold and 1 otherwise. Instead, we will use a sigmoid function which has the advantage that small changes in the weights and bias cause only a small change in their output. The sigmoid function – often called the logistic function – is defined by:

The sigmoid activation function is always positive, bounded between 0 and 1 and strictly increasing.

The sigmoid function can be thought as a “squashing function”, because it takes the input and squashes it to be between zero and one: very negative values are squashed towards zero and positive values get squashed towards one.
For example we have:

sigmoid(-5) = 0.007; sigmoid(5) = 0.993

The output of a sigmoid neuron

The output of a sigmoid neuron is obviously not 0 or 1 but is a real number between 0 and 1. This can be useful, for example, if we want to use the output value to represent the intensity of the pixels in an image input to a neural network. But sometimes it is not, in fact to model an OR gate we need to have as output 0 or 1, as it was in the perceptron.
In practice we can use a convention to deal with this, for example, by deciding to interpret any output of at least 0.5 as indicating a “1”, and any output less than 0.5 as indicating a “0”.

We can quickly verify that the original perceptron simulating the OR function is still working with this new activation function:

As you notice, the weights in this OR example above are hard-coded because we can get them manually very quick since the example is pretty straightforward. But this is obviously not scalable as the complexity of a function is growing.
Therefore we will see now how we can find the appropriate weights to model a simple function like the OR.

Training the network

What we need is an algorithm that will find the appropriate weights (remember: they ARE the artificial neural network) programmatically for us so that the output from the network approximates the expected y(x) for all training inputs x.

Cost function

To quantify how well we’re achieving this goal we will define a cost function, sometimes referred to as a loss or objective function.
We have already seen examples of cost functions in the regression models and it can be a complex function, depending on the domain – a common choice is the categorical cross-entropy loss – but for this simple example we just stick to the difference between the network output and the training output:

error = actual output(label) – predicted output

Learning the weights

The output of the ANN is:

output = h(W * X + w0)

where X is the input, w0 is the bias, W are the weights and h is the non-linear activation function.

Learning the parameters for our network means finding weights that minimise the error on our training data. By finding parameters that minimise the error we maximise the likelihood of our training data.
Designing and training a neural network is not much diﬀerent from training any other machine learning model with gradient descent.

Sigmoid derivative

The gradient descent works by using the derivative function. The derivatives of the sigmoid with respect to the inputs and the weights are very simple:

Now we have everything to explore the learning algorithm for artificial neural network.

The learning algorithm

The general idea is to forward-feed the input dataset to an initial set of weights to compute the output and the error, then back-propagate the error using the gradient (this is the “core” part!) and update the weights until the error is acceptable:

Training data and bias

The input dataset is a numpy matrix. Each row is a single “training example”. Each column corresponds to one of our input nodes.
In our OR example above, we have 2 input nodes (X1 and X2) and 4 training examples.
Similarly the number of nodes in the output layer (the “labels” of our dataset) is determined by the number of classes, in this case we have only one output node for two classes: 0 and 1. That is the number of column for the y matrix while the rows are determined again by the number of training examples.

This will add a “bias” neuron, i.e. an additional column of weights. It’s always a good idea to add a bias to a neural network.
The bias node in a neural network is a node that is always ‘on’. That is, its value is set to 1. It is analogous to the intercept in a regression model, and serves the same function: allows you to shift the activation function to the left or right, which may be critical for successful learning.

For example, assume that you want a neuron to fire y≈1 when all the inputs are x≈0. If there is no bias no matter what weights W you have, given the equation y = σ(WX) the neuron will always fire y≈0.5.

Initial weights

This creates our weight matrix for the neural network.
And it initialises the weights to a random number, using the numpy function random() which returns a random float number or array.
The size of the output is controlled by the parameter. In our example it’s a tuple (n_inputs+1, n_outputs) which means it will return a matrix.

We can choose the dimensionality of the hidden layer.
The more nodes we put into the hidden layer the more complex functions the neural network will be able to model but higher dimensionality means more computation is required to learn the network weights.

Since we only have one layer (the input values) we only need a matrix of weights with one column, to connect them. We keep its dimension here minimal: (3,1) because we have two inputs plus the bias and one output.

Results from the random() function are from the “continuous uniform” distribution over the half-open interval [0.0, 1.0). It’s a best practice to sample the weights from a uniform distribution with mean zero: to sample Unif[-1,1) we just have to multiply the output of random() by 1-(-1)=2 and add -1.

Iterate

This is the loop to train the network weights.
It iterates over the training code until we have reached the wished error (converged) or until we have reached the maximum number of iterations.
Both the max_iterations and the error tolerance are optional inputs of this function.

Feed-forward

This is the forward propagation prediction step: the network predicts the output given the input and the current weights by matrix multiplication and applying the activation function.

y = σ(Xw)

Note that X contains all training examples (input) and we process all of them at the same time in this implementation. This is known as “full batch” training.

The X matrix is multiplied with the current weights matrix w, using the numpy function dot() and its result is then going through the sigmoid squashing function, our activation function.
The result will be also a matrix, in this case of 4 rows (the four training examples) and one column (the predicted output):

Each output corresponds with the network’s prediction for a given training input. This would work independently on the number of training examples, i.e. if I train the network with more examples, the matrix will be bigger but no changes in code are required.

Loss function (the error)

# Compute the error
error = y - pred_y

We can now compare how well it did by subtracting the true answer (y from the training dataset) from the prediction (pred_y).
The error is an array: the difference between actual and prediction for each training example.
Again: this is a super-simple loss function just for illustration. For more complex examples you will use e.g. a quadratic function like we have seen in regression algorithms.

Back-propagation

One popular approach is an algorithm called back-propagation that has similarities to the gradient descent algorithm we looked at earlier. The back-propagation algorithm (Rumelhart et al., 1986), often simply called backprop, allows the error information to ﬂow backwards through the network, in order to compute the gradient. Once we have the gradient of this error as a function of the neuron’s weights, we can propagate these output errors backward to infer errors for the other layers and adjust its weights in the direction that most decreases the error.

The term back-propagation is often misunderstood as meaning the whole learning algorithm for neural networks but actually, it refers only to the method for computing the gradient, while another algorithm, such as stochastic or batch gradient descent, is used to perform learning using this gradient.

We use the gradient to find the minimum; this means we first need to compute the gradient. The gradient is just made up of the derivatives of all the inputs concatenated in a vector (remember that a derivative of a function measures the sensitivity to change):

for all i,j,l

where w are the weights and e(w) is the error on the example X,y.

Back-propagation is the key algorithm that makes training deep models computationally tractable, basically is just a clever “trick” to efficiently calculate the gradients starting from the output. I am not going into details here but it’s an old trick and you can find the generic implementation under the name “reverse-mode differentiation”:

Classical methods are slow at computing the partial derivatives of a function with respect to many inputs, as is needed for gradient-based optimization algorithms. Automatic differentiation solves all of these problems […]
In reverse accumulation, one first fixes the dependent variable to be differentiated and computes the derivative with respect to each sub-expression recursively. In a pen-and-paper calculation, one can perform the equivalent by repeatedly substituting the derivative of the outer functions in the chain rule:

The chain rule of calculus is used to compute the derivatives of functions formed by composing other functions whose derivatives are known. The chain rule very simply states that the right thing to do is to simply multiply the gradients together to chain them.
Reverse-mode differentiation is an algorithm that computes the chain rule, with a speciﬁc order of operations that is highly eﬃcient.

I won’t go into more details here but there are many excellent explanations in the web, such as the Andrew Ng or Yaser Abu-Mostafa courses.

Bottom line: back-propagation tracks how every neuron affects one output.

# backpropagation: multiply how much we missed by the
# slope of the sigmoid at the values in pred_y
delta = error * sigmoidDeriv(pred_y)

The first thing we need is a function that computes the gradient of the sigmoid function we created earlier. This is sigmoidDeriv(). The derivative will give back the slope of the predictions.

The variable error is the total error across the whole data, calculated by running the data plus current parameters through the “network” (the forward-propagate function) and comparing the output to the true labels. This is the part we discussed earlier as the cost function.

When we multiply the “slopes” by the error, we are reducing the error of high confidence predictions.
The multiplication is essentially answering the question “how can I adjust my parameters to reduce the error the next time I run through the network”?
It does this by computing the contributions at each layer to the total error and adjusting appropriately by coming up with a gradient matrix (or, how much to change each parameter and in what direction).

Update the weights

# update weights
w += np.dot(X.T, delta)

We are now ready to update our network! This line computes the weight updates for each weight for each training example, sums them, and updates the weights, all in a simple line. A small error and a small slope means a VERY small update. However, because we’re using a “full batch” configuration, we’re doing the above step on all four training examples.

There are several combinations of weights that can bring to the same result and there are other equivalent solutions that gradient descent could also ﬁnd.
The convergence point of gradient descent depends on the initial values of the parameters: the initial set of weights (that is random chosen), the gradient descent meta-parameters (not used in this simple example) and in minor part also on the error tolerance defined.

It is also possible to demonstrate that multiplying the resulting weights for a constant C will keep the ANN to output the same result but this is out of the scope for this post. You can try it yourself if you are curious.

Note that the training set X_train is now a matrix with 3 columns for the 3 inputs and that you do not need to list ALL combinations as training examples.
For an OR function with 3 operators there would be a total of 2^3 = 8 possible combinations but here we train the network with only 4 examples.
Choosing the right training dataset (large and representative enough) is a science for itself but I will not dig more here. You can refer to the previous posts about overfitting and cross-validations.

It works perfectly for all 8 combinations and convergence has been reached after 169 iterations, pretty fast.

Non-linear example: XOR

The real proof of truth is to model a non-linear function, such as the XOR function, a sort of Hello World for Neural Networks.
We have seen in the previous post that a simple perceptron (one neuron with only one single layer of input) cannot model it. But it’s possible by combining together perceptrons that model each one different functions, such as OR, AND and NOT:

x1 XOR x2 = (x1 AND NOT x2) OR (NOT x1 AND x2)

The resulting multi-layer network of perceptrons is basically an artificial neural network with one additional layer – often called the hidden layer:

To model such a network we need to add one layer of weights, i.e. one column to the weights matrix and to compute the hidden layer weights in the training algorithm.

XOR with two inputs

Note: To keep the example more clear I will keep the weights of the two layers separately.

The mainly difference is that we have now two arrays of weights w1 and w2, one storing the weights for the input layer and one for the hidden layer.
Then accordingly the code is changed to take in considerations both sets of weights.

It works but note that it takes now 967 iterations to converge.
Adding layers to a network will normally make it slower and harder to train. Normally deeper neural network will be more optimised, e.g. will use better gradient descent algorithms, regularisations and so on.

Conclusion

The perceptron can model linear functions but cannot model non-linear functions, unless you combine perceptrons together. Nevertheless the perceptron convergence procedure works by ensuring that every time the weights change, they get closer to every “generously feasible” set of weights. This type of guarantee cannot be extended to more complex networks in which the average of two good solutions may be a bad solution (non-convex problems). Therefore the perceptron learning procedure cannot be generalised to hidden layers.

Instead of showing the weights get closer to a good set of weights, the Artificial Neural Networks learn the weights so that the actual output values get closer the target values. The aim of learning is to minimise the error summed over all training cases.
Sigmoid neurons give a real-valued output that is a smooth and bounded function of their total input; they have nice derivatives which make learning easy.

Such an Artificial Neural Network can represent non-linear functions.
The Universal Approximation theorem states that a simple feed-forward neural network with a single hidden layer is enough to represent a wide variety of continuous functions.

This is a bare example. Deep neural network have dozens of layers and use many additional optimisations to improve the learning: different activation functions, regularisations to prevent overfitting ((L1 and L2 regularization, dropout), better choice of cost function (squared difference, cross-entropy), tuned weight initialisation and choice of hyper-parameters (learning rate, number of layers and neurons, …), sophisticated gradient descent algorithms. But the principle is the same.

ANNs can represent efficiently almost any kind of function and studies are constantly progressing.

Advertisements

Welcome!

This is my personal blog, where I write about what I learned, mostly about software, project management and machine learning.
Why this name? The blog should help me to navigate into the future using (and not forgetting) the past experiences.
From Europe to the world.