How Does the Human Brain Learn?

Much is still unknown about how the brain trains itself to process information, so theories abound. In the human brain, a
typical neuron collects signals from others through a host of fine structures called dendrites. The neuron sends out
spikes of electrical activity through a long, thin strand known as an axon, which splits into thousands of branches. At the
end of each branch, a structure called a synapse converts the activity from the axon into electrical effects that inhibit or
excite activity in the connected neurones. When a
neuron receives excitatory input that is sufficiently large compared with its inhibitory input, it sends a spike of electrical
activity down its axon. Learning occurs by changing the effectiveness of the synapses so that the influence of one neuron
on another changes.

[Figure: Components of a neuron]

[Figure: The synapse]

From Human Neurones to Artificial Neurones.

We construct these neural networks by first trying to deduce the essential features of neurones and their interconnections.
We then typically program a computer to simulate these features. However, because our knowledge of neurones is
incomplete and our computing power is limited, our models are necessarily gross idealisations of real networks of
neurones.

[Figure: The neuron model]

How Do Neural Networks Learn?

Artificial neural networks are typically composed of interconnected "units", which serve as model neurones. The
function of the synapse is modelled by a modifiable weight, which is associated with each connection. Each unit
converts the pattern of incoming activities that it receives into a single outgoing activity that it broadcasts to other units. It
performs this conversion in two stages:

First, it multiplies each incoming activity by the weight on the connection and adds together all these weighted inputs
to get a quantity called the total input.

Second, it uses an input-output function that transforms the total input into the outgoing activity.
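As a minimal sketch, the two-stage conversion might look like this; the activities, weights, and the choice of a linear transfer function are illustrative assumptions, not values from the text:

```python
def unit_output(incoming_activities, weights, transfer):
    # Stage 1: weight each incoming activity and sum -> the "total input"
    total_input = sum(a * w for a, w in zip(incoming_activities, weights))
    # Stage 2: pass the total input through the unit's input-output function
    return transfer(total_input)

# Illustrative numbers only: three incoming activities and their weights
activities = [0.5, 1.0, 0.2]
weights = [0.4, -0.1, 0.9]
output = unit_output(activities, weights, transfer=lambda x: x)  # a linear unit
```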

The behaviour of an ANN (Artificial Neural Network) depends on both the weights and the input-output function
(transfer function) that is specified for the units. This function typically falls into one of three categories:

linear

threshold

sigmoid

For linear units, the output activity is proportional to the total weighted input.

For threshold units, the output is set at one of two levels, depending on whether the total input is greater than or less
than some threshold value.

For sigmoid units, the output varies continuously but not linearly as the input changes. Sigmoid units bear a greater
resemblance to real neurones than do linear or threshold units, but all three must be considered rough approximations.
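The three categories can be written out as simple Python functions; the threshold level of zero and the logistic form of the sigmoid are common conventions assumed here, not specified by the text:

```python
import math

def linear(total_input):
    # output proportional to the total input (proportionality constant 1 here)
    return total_input

def threshold(total_input, theta=0.0):
    # output set at one of two levels, depending on which side of theta the input falls
    return 1.0 if total_input > theta else 0.0

def sigmoid(total_input):
    # output varies continuously but not linearly as the input changes
    return 1.0 / (1.0 + math.exp(-total_input))
```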

To make a neural network that performs some specific task, we must choose how the units are connected to one
another, and we must set the weights on the connections appropriately. The connections determine whether it is
possible for one unit to influence another. The weights specify the strength of the influence.

The commonest type of artificial neural network consists of three groups, or layers, of units: a layer of "input" units is
connected to a layer of "hidden" units, which is connected to a layer of "output" units.

The activity of the input units represents the raw information that is fed into the network.

The activity of each hidden unit is determined by the activities of the input units and the weights on the
connections between the input and the hidden units.

The behaviour of the output units depends on the activity of the hidden units and the weights between the hidden
and output units.

This simple type of network is interesting because the hidden units are free to construct their own representations of the
input. The weights between the input and hidden units determine when each hidden unit is active, and so by modifying
these weights, a hidden unit can choose what it represents.
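A minimal sketch of the forward pass through such a three-layer network, assuming sigmoid units throughout and arbitrary illustrative weights:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def layer_forward(activities, weight_matrix):
    # one outgoing activity per row of weights: sigmoid of the weighted sum
    return [sigmoid(sum(a * w for a, w in zip(activities, row)))
            for row in weight_matrix]

def forward(input_activities, w_input_hidden, w_hidden_output):
    hidden = layer_forward(input_activities, w_input_hidden)
    return layer_forward(hidden, w_hidden_output)

# Arbitrary illustrative weights: 3 input units -> 2 hidden units -> 1 output unit
w_ih = [[0.2, -0.4, 0.1],
        [0.7,  0.3, -0.6]]
w_ho = [[0.5, -0.8]]
out = forward([1.0, 0.0, 1.0], w_ih, w_ho)
```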

We can teach a three-layer network to perform a particular task by using the following procedure:

We present the network with training examples, which consist of a pattern of activities for the input units together
with the desired pattern of activities for the output units.

We determine how closely the actual output of the network matches the desired output.

We change the weight of each connection so that the network produces a better approximation of the desired
output.
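The three steps above can be sketched for the simplest possible case: a single linear unit trained on one repeated example with a fixed learning rate (both simplifications of my own, not part of the procedure as stated):

```python
def train_step(weights, inputs, desired, learning_rate=0.1):
    # 1. present the example: compute the unit's actual output
    actual = sum(w * x for w, x in zip(weights, inputs))
    # 2. determine how closely the actual output matches the desired output
    error = actual - desired
    # 3. change each weight so the output better approximates the desired value
    return [w - learning_rate * error * x for w, x in zip(weights, inputs)]

weights = [0.0, 0.0]
for _ in range(100):
    weights = train_step(weights, [1.0, 2.0], 3.0)  # one repeated training example
```

After enough repetitions the unit's output on this example converges to the desired value of 3.0.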

An Example to illustrate the above teaching procedure:

Assume that we want a network to recognise hand-written digits. We might use an array of, say, 256 sensors, each
recording the presence or absence of ink in a small area of a single digit. The network would therefore need 256 input
units (one for each sensor), 10 output units (one for each kind of digit) and a number of hidden units.

For each kind of digit recorded by the sensors, the network should produce high activity in the appropriate output unit
and low activity in the other output units.

To train the network, we present an image of a digit and compare the actual activity of the 10 output units with the
desired activity. We then calculate the error, which is defined as the square of the difference between the actual and the
desired activities. Next we change the weight of each connection so as to reduce the error. We repeat this training
process for many different images of each kind of digit until the network classifies every image correctly.
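The error calculation for one training image might be sketched as follows; the activity values are invented for illustration:

```python
def squared_error(actual, desired):
    # sum over the 10 output units of the squared difference
    # between actual and desired activity
    return sum((a - d) ** 2 for a, d in zip(actual, desired))

# Invented activities: the network is shown a "3", so output unit 3 should be high
desired = [0.0] * 10
desired[3] = 1.0
actual = [0.1, 0.0, 0.2, 0.7, 0.1, 0.0, 0.0, 0.1, 0.0, 0.0]
error = squared_error(actual, desired)
```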

To implement this procedure we need to calculate the error derivative for the weight (EW) in order to change the
weight by an amount that is proportional to the rate at which the error changes as the weight is changed. One way to
calculate the EW is to perturb a weight slightly and observe how the error changes. But that method is inefficient
because it requires a separate perturbation for each of the many weights.
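The perturbation method can be sketched on a toy error function of two weights (the function itself is an invented example); note that each of the network's many weights would need its own perturbation and its own re-evaluation of the error:

```python
def perturbation_ew(error_fn, weights, index, epsilon=1e-6):
    # Nudge one weight slightly, re-measure the error,
    # and take the finite difference as an estimate of EW
    perturbed = list(weights)
    perturbed[index] += epsilon
    return (error_fn(perturbed) - error_fn(weights)) / epsilon

# Toy error surface: E(w) = (w0 + 2*w1 - 1)^2, so dE/dw0 = 2*(w0 + 2*w1 - 1)
def toy_error(w):
    return (w[0] + 2 * w[1] - 1.0) ** 2

ew0 = perturbation_ew(toy_error, [0.5, 0.5], 0)  # analytic value is 1.0
```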

Another way to calculate the EW is to use the back-propagation algorithm, described below, which has since become
one of the most important tools for training neural networks. It was developed independently by two teams, one
(Fogelman-Soulie, Gallinari and Le Cun) in France, the other (Rumelhart, Hinton and Williams) in the U.S.

A description of the Back Propagation Algorithm.

To train a neural network to perform some task, we must adjust the weights of each unit in such a way that the error
between the desired output and the actual output is reduced. This process requires that the neural network compute the
error derivative of the weights (EW). In other words, it must calculate how the error changes as each weight is
increased or decreased slightly. The back propagation algorithm is the most widely used method for determining the
EW.

The back-propagation algorithm is easiest to understand if all the units in the network are linear. The algorithm
computes each EW by first computing the EA, the rate at which the error changes as the activity level of a unit is
changed. For output units, the EA is simply the difference between the actual and the desired output. To compute the
EA for a hidden unit in the layer just before the output layer, we first identify all the weights between that hidden unit
and the output units to which it is connected. We then multiply those weights by the EAs of those output units and add
the products. This sum equals the EA for the chosen hidden unit. After calculating all the EAs in the hidden layer just
before the output layer, we can compute in like fashion the EAs for other layers, moving from layer to layer in a
direction opposite to the way activities propagate through the network. This is what gives back propagation its name.
Once the EA has been computed for a unit, it is straightforward to compute the EW for each incoming connection of
the unit. The EW is the product of the EA and the activity through the incoming connection.
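For a network of linear units, the EA and EW computations described above can be sketched as follows (the weight-matrix layout, one row per output unit, is a convention of this sketch):

```python
def output_ea(actual, desired):
    # for output units, EA is simply the difference between actual and desired output
    return [a - d for a, d in zip(actual, desired)]

def hidden_ea(w_hidden_output, output_eas):
    # for each hidden unit: multiply its outgoing weights by the EAs of the
    # output units it connects to, and add the products
    n_hidden = len(w_hidden_output[0])
    return [sum(w_hidden_output[k][j] * output_eas[k]
                for k in range(len(output_eas)))
            for j in range(n_hidden)]

def ew(ea_of_unit, incoming_activity):
    # EW = EA of the unit times the activity through the incoming connection
    return ea_of_unit * incoming_activity
```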

Note that for non-linear units, the back-propagation algorithm includes an extra step. Before back-propagating, the EA
must be converted into the EI, the rate at which the error changes as the total input received by a unit is changed.
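Assuming the usual logistic sigmoid, whose derivative with respect to the total input is y*(1 - y) where y is the unit's output activity, the extra step might look like:

```python
def sigmoid_ei(ea, output_activity):
    # For a logistic sigmoid unit, dy/dx = y * (1 - y),
    # so EI = EA * y * (1 - y)
    return ea * output_activity * (1.0 - output_activity)
```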

A Back-Propagation Network Example.

In this example a back-propagation network is used to solve a specific problem, that of an XOR logic gate. This
means that input patterns of (0,0) or (1,1) should produce a value close to zero in the output node, and input
patterns of (1,0) or (0,1) should produce a value near one in the output node.

Finding a set of connection weights for this task is not easy; it requires application of the back-propagation algorithm for
several thousand iterations to achieve a good set of connection weights and neuron thresholds.

The basic architecture for this problem has two input nodes, two hidden nodes, and a single output node as shown
above. This structure has variable thresholds on the two hidden and one output node (unit). This means that there are a
total of 9 variables in the system:

4 weights connecting the input to the hidden nodes

2 weights connecting the hidden to the output node

3 thresholds

Suppose we put in a pattern, say (0,1). That means that there is 0 activation in the left-hand neuron of the first layer and
an activation of 1 in the neuron on the right.

Now we move our attention to the next layer up. For each neuron in this layer, we calculate an input which is the
weighted sum of all the activations from the first layer. The weighted sum is obtained by multiplying the
activations in the first layer by a "connection matrix". In our case we get a value of 0*(-11.62) + 1*(10.99) = 10.99 for
the neuron on the left in the second layer, and 0*(12.88) + 1*(-13.13) = -13.13 for the neuron on the right.

These are not the activations of these neurones, though. To obtain the activations, we add a "threshold" value (which is
found for each neuron using the back-propagation rule), and apply an input-output (transfer) function. The transfer
function is defined for each different network. In our case it is a sigmoid: f(x) = 1/(1 + e^(-x)).

The activation of the neuron on the left side of the hidden (middle) layer is the transfer
function applied to the difference (10.99 - 6.06) = 4.93. Applying the transfer function yields an activation value close to
1. The activation of the neuron on the right is the transfer function applied to (-13.13 + 7.19) = -5.94. Applying the
transfer function yields a value close to 0.

Approximating the next step, we use a value of 1 for the activation of the neuron on the left, and 0 for the neuron on the
right, multiply each activation by its appropriate connection weight, and sum the values as input to the topmost neuron.
This is approximately 1*(13.34) + 0*(13.13) = 13.34. We add the threshold of -6.56 to obtain a value of 6.78.
Applying the transfer function yields a value close to 1 (0.946), which is the desired result. Using the other 3
binary input patterns, we can similarly show that this network yields the desired classification within an acceptable
tolerance.
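The numeric weights quoted above do not quite reproduce XOR on all four input patterns as printed (some signs were likely lost in transcription from the original figure), so the sketch below uses a different, classic hand-picked weight set of my own choosing to illustrate the same 2-2-1 architecture:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def xor_net(x1, x2):
    # Hand-picked weights (an illustrative choice, not the values in the text):
    # hidden unit A fires when at least one input is on (OR-like)
    h_a = sigmoid(20 * x1 + 20 * x2 - 10)
    # hidden unit B fires unless both inputs are on (NAND-like)
    h_b = sigmoid(-20 * x1 - 20 * x2 + 30)
    # output fires only when both hidden units fire: OR AND NAND = XOR
    return sigmoid(20 * h_a + 20 * h_b - 30)
```

The thresholds here play the same role as the three variable thresholds in the text: they shift each weighted sum before the sigmoid is applied.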