In neural networks, an activation function is
the function that describes the output behaviour of a neuron. Most network
architectures start by computing the weighted sum of the inputs (that is,
the sum of the product of each input with the weight
associated with that input). This quantity, the
total net input, is then usually transformed in some way, using
what is sometimes called a squashing function.
The simplest
squashing function is a step function: if the total net input is less than 0
(or more generally, less than some threshold T)
then the output of the neuron is 0, otherwise it is 1. A common
squashing function is the logistic function.

In summary, the activation of a neuron is the result of applying a squashing
function to the total net input.
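As a minimal sketch of the two squashing functions mentioned above (the values and weights are made up for illustration):

```python
import math

def step(net, threshold=0.0):
    """Step squashing function: 0 below the threshold, 1 otherwise."""
    return 1 if net >= threshold else 0

def logistic(net):
    """Logistic squashing function: smooth, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-net))

# Activation of a neuron: squash the weighted sum of its inputs.
inputs = [0.5, -1.0, 0.25]
weights = [0.4, 0.3, -0.8]
net = sum(x * w for x, w in zip(inputs, weights))  # total net input
print(step(net), round(logistic(net), 4))
```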

In the asynchronous case, if the yellow node fires first, then it uses
the then-current value of its input from the red node to determine its
output in time step 2, and the red node, if it fires next, will use
the updated output from the yellow node to compute its new output in
time step 3.
In summary, the output values of the red and yellow nodes
in time step 3 depend on the outputs of the yellow and red nodes in time
steps 2 and 1, respectively.

In the synchronous case, each node obtains the current output of
the other node at the same time, and uses the value obtained to
compute its new output (in time step 2).
In summary, the output values of the red and yellow nodes
in time step 2 depend on the outputs of the yellow and red nodes in time
step 1.
This can produce a different result from the asynchronous method.

Some neural network algorithms are firmly tied to synchronous updates,
and some can be operated in either mode. Biological neurons normally
fire asynchronously.
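The difference between the two update modes can be seen in a tiny hypothetical two-node net (mutual inhibition with weight –1; the weights and the step activation are illustrative, not from the text):

```python
def fire(x):
    # step squashing function
    return 1 if x >= 0 else 0

# Hypothetical two-node net: each node's net input is the other's
# output times a weight of -1 (mutual inhibition).
w = -1
red, yellow = 1, 1  # outputs at time step 1

# Synchronous update: both nodes read the step-1 outputs.
red_sync, yellow_sync = fire(w * yellow), fire(w * red)

# Asynchronous update: yellow fires first (time step 2), and red
# then uses yellow's updated output (time step 3).
yellow_async = fire(w * red)
red_async = fire(w * yellow_async)

print((red_sync, yellow_sync), (red_async, yellow_async))
```

Here the synchronous update drives both nodes to 0, while the asynchronous order leaves red at 1: the two methods can produce different results.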

An attribute is a property of an instance
that may be used to determine its classification.
For example, when classifying objects into different types in a robotic
vision task, the size and shape of an instance may be appropriate
attributes. Determining useful attributes that can be reasonably calculated
may be a difficult job - for example, what attributes of an arbitrary
chess end-game position would you use to decide who can win the game?
This particular attribute selection problem has been solved, but with
considerable effort and difficulty.

The axon is the "output" part of a biological neuron.
When a neuron fires, a pulse of electrical activity flows along the
axon. Towards its end, or ends, the axon splits into a tree. The ends
of the axon come into close contact with the dendrites
of other neurons. These junctions are termed
synapses. Axons may be short (a couple of millimetres) or long (e.g.
the axons of the nerves that run down the legs of a reasonably large animal).

In decision tree pruning one of the
issues in deciding whether to prune a branch of the tree is
whether the estimated error in classification is greater if the branch
is present or pruned. To estimate the error if the branch is present,
one takes the estimated errors associated with the children of the
branch nodes (which of course must have been previously computed),
multiplies them by the estimated frequencies that the current branch
will classify data to each child node, and adds up the resulting products.
The frequencies are estimated from the numbers of training data
instances that are classified as belonging to each child node. This sum
is called the backed-up error estimate for the branch node. (The
concept of a backed-up error estimate does not make sense for a
leaf node.)
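The backed-up error estimate described above can be sketched as follows (the error estimates and instance counts are hypothetical):

```python
def backed_up_error(child_errors, child_counts):
    """Backed-up error estimate at a branch node: the error estimates
    of the children, each weighted by the estimated frequency with
    which the branch classifies data to that child."""
    n = sum(child_counts)
    return sum(e * c / n for e, c in zip(child_errors, child_counts))

# Hypothetical branch with three children: estimated errors, and the
# number of training instances classified to each child.
errors = [0.1, 0.3, 0.2]
counts = [50, 30, 20]
print(backed_up_error(errors, counts))  # weighted sum, about 0.18
```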

The backward pass starts at the output layer of the
feedforward network, and updates the incoming weights to
units in that layer using the delta rule. Then it works backward,
starting with the penultimate layer (the last
hidden layer), updating the incoming weights to the units in each layer.

Statistics collected during the forward pass
are used during the backward pass in updating the weights.

In a classification task in machine learning, the task is to
take each instance and assign it to a
particular class. For example, in a machine vision application,
the task might involve analysing images of objects on a conveyor
belt, and classifying them as nuts, bolts, or other
components of some object being assembled. In an optical
character recognition task, the task would involve taking
instances representing images of characters, and classifying
them according to which character they are. Frequently in examples,
for the sake of simplicity if nothing else, just two classes,
sometimes called positive and negative, are
used.

A decision tree is a tree in which
each non-leaf node is labelled with an
attribute or
a question of some sort, and in which the branches at that node
correspond to the possible values of the attribute, or answers to the
question. For example, if the attribute was shape, then there
would be branches below that node for the possible values of
shape, say square, round and
triangular. Leaf nodes are labelled with a
class. Decision trees are used for classifying
instances - one starts at the root of the tree, and, taking appropriate
branches according to the attribute or question asked about at each
branch node, one eventually comes to a leaf node. The label on that
leaf node is the class for that instance.

The delta rule in error backpropagation learning
specifies the update to be made to each weight
during backprop learning. Roughly speaking, it states that the change
to the weight from node i to node j
should be proportional to the output of node i and also proportional to
the "local gradient" at node j.

The local gradient, for an output node, is
the product of the derivative of the squashing
function evaluated at the total net input to node j, and
the error signal (i.e. the
difference between the target output and the actual output). In the
case of a hidden node, the local gradient is
the product of the derivative of the squashing function (as above) and
the weighted sum of the local gradients of the nodes to which node j
is connected in subsequent layers of the net. Got it?
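As a concrete (hypothetical) illustration, here is the delta rule update for a single weight into an output node, assuming a logistic squashing function and made-up values:

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def delta_rule_output(w_ij, y_i, net_j, target_j, eta=0.5):
    """Update for the weight from node i to output node j.
    Local gradient = squashing derivative at net_j x error signal."""
    y_j = logistic(net_j)
    phi_prime = y_j * (1.0 - y_j)            # derivative of the logistic
    delta_j = phi_prime * (target_j - y_j)   # local gradient at node j
    # Change is proportional to the output of node i and to delta_j.
    return w_ij + eta * delta_j * y_i

# Hypothetical values: output of node i, net input to node j, target 1.
print(round(delta_rule_output(w_ij=0.2, y_i=0.8, net_j=0.4, target_j=1.0), 4))
```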

The rationale for this is as follows: –log2(p)
is the amount of information in bits associated with an event
of probability p - for example, with an event of
probability ½, like flipping a fair coin, –log2(p)
is –log2(½) = 1, so there is one bit of information. This
should coincide with our intuition of what a bit means (if we
have one). If there is a range of possible outcomes with
associated probabilities, then to work out the average number
of bits, we need to multiply the number of bits for each
outcome (–log2(p)) by the probability p and
sum over all the outcomes. This is where the formula comes
from.
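The formula can be checked with a short script (the fair-coin example is the one from the text; the function name is ours):

```python
import math

def entropy(probs):
    """Average information in bits: sum over outcomes of p x (-log2 p)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit on average
print(entropy([1.0]))        # a certain outcome carries no information
```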

In training a neural net, the term
epoch is used to describe a complete pass through all of the
training patterns. The
weights in the neural net may be
updated after each pattern is presented to the net, or they may
be updated just once at the end of the epoch. Frequently used
as a measure of speed of learning - as in "training was complete
after x epochs".

Initialization: the weights of the network
are initialized to small random values.

Forward pass:
The inputs of each training pattern are
presented to the network. The outputs are computed using the
inputs and the current weights of the
network. Certain statistics are kept from this computation, and
used in the next phase. The target outputs
from each training pattern are compared
with the actual activation levels of
the output units - the difference
between the two is termed the error. Training may be
pattern-by-pattern or epoch-by-epoch. With pattern-by-pattern
training, the pattern error is provided directly to the backward
pass. With epoch-by-epoch training, the pattern errors are
summed across all training patterns, and the total error is
provided to the backward pass.
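A minimal sketch of one forward pass, assuming a tiny 2-2-1 feedforward net with logistic units and no biases (the architecture, weights and pattern are hypothetical; the net inputs are the "statistics" kept for the backward pass):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward_pass(x, w_hidden, w_out):
    """Forward pass through a 2-2-1 net. Returns the output plus the
    net inputs, which are kept for use in the backward pass."""
    net_h = [sum(wi * xi for wi, xi in zip(row, x)) for row in w_hidden]
    y_h = [logistic(n) for n in net_h]
    net_o = sum(w * y for w, y in zip(w_out, y_h))
    return logistic(net_o), net_h, net_o

# Hypothetical training pattern and small initial weights.
output, net_h, net_o = forward_pass([1.0, 0.0],
                                    w_hidden=[[0.1, -0.2], [0.4, 0.3]],
                                    w_out=[0.5, -0.5])
target = 1.0
error = target - output  # pattern error, passed to the backward pass
print(round(output, 4), round(error, 4))
```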

When the total error of a
backpropagation-trained
neural network
is expressed as a function of the weights,
and graphed (to the extent that this is possible with a large
number of weights), the result is a surface termed the error
surface. The course of learning can be traced on the error
surface: as learning is supposed to reduce error, when the learning
algorithm causes the weights to change, the current point on
the error surface should descend into a valley of the error
surface.

The "point" defined by the current set of weights is termed
a point in weight space. Thus weight space is the
set of all possible values of the weights.

In practice, the nodes of most feedforward nets are partitioned into
layers - that is, sets of nodes, and the layers may be numbered
in such a way that the
nodes in each layer are connected only to nodes in the
next layer - that is, the layer with the next higher number.
Commonly successive layers are totally interconnected - each
node in the earlier layer is connected to every node in the next
layer.

The first layer has no input connections, so consists of
input units and is termed the input
layer (yellow nodes in the diagram below).

The last layer has no output connections, so consists of
output units and is termed the output
layer (maroon nodes in the diagram below).

The layers in between the input and output layers are termed
hidden layers, and consist of
hidden units (light blue nodes and brown nodes in the diagram below).

Feedforward network. All connections (arrows) are in one direction; there are
no cycles of activation flow (cyclic subgraphs). Each colour identifies
a different layer in the network. Layers 1 and 2 are fully interconnected,
and so are layers 3 and 4. Layers 2 and 3 are only partly interconnected.

In a biological neural network: neurons
in a biological neural network fire when and if they receive enough
stimulus via their (input) synapses. This means
that an electrical impulse is propagated along the neuron's
axon and transmitted to other neurons via
the output synaptic connections of the neuron. The firing rate
of a neuron is the frequency with which it fires (cf.
activation in an artificial neural network).

Learning in backprop seems
to operate by first of all getting a rough set of
weights which fit the training patterns
in a general sort of way, and then working progressively towards
a set of weights that fit the training patterns exactly. If learning
goes too far down this path, one may reach a set of weights that
fits the idiosyncrasies of the particular set of patterns very well,
but does not interpolate (i.e. generalize) well.

Moreover, with large complex sets of training patterns, it is
likely that some errors may occur, either in the inputs or in
the outputs. In that case, and again particularly in the later
parts of the learning process, it is likely that backprop will be
contorting the weights so as to fit precisely around training
patterns that are actually erroneous! This phenomenon is
known as over-fitting.

This problem can to some extent be avoided by stopping learning
early. How does one tell when to stop? One method is to
partition the training patterns into two sets (assuming that there
are enough of them). The larger part of the training patterns,
say 80% of them, chosen at random, form the training set,
and the remaining 20% are referred to as the test set.
Every now and again during training, one measures the
performance of the current set of weights on the test set.
One normally finds that the error on the training set drops
monotonically (that's what a gradient descent algorithm is
supposed to do, after all). However, error on the test set (which
will be larger, per pattern, than the error on the training set)
will fall at first, then start to rise as the algorithm begins to
overtrain. Best generalization performance is gained by stopping
the algorithm at the point where error on the test set starts to
rise.
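The stopping criterion can be sketched with synthetic per-epoch error curves standing in for real training and test set measurements (all the numbers below are made up):

```python
# Synthetic error curves: training error falls monotonically, while
# test error falls, then rises as the algorithm begins to overtrain.
train_err = [1.0, 0.7, 0.5, 0.35, 0.25, 0.18, 0.13, 0.10]
test_err  = [1.1, 0.8, 0.6, 0.50, 0.45, 0.47, 0.52, 0.60]

best_epoch = 0
for epoch in range(1, len(test_err)):
    if test_err[epoch] >= test_err[best_epoch]:
        # Test error has started to rise: stop here and keep the
        # weights from the best epoch so far.
        break
    best_epoch = epoch

print(best_epoch, test_err[best_epoch])
```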

Δwji(n) = αΔwji(n–1) + ηδj(n)yi(n),
in the notation of Haykin's text (Neural networks - a
comprehensive foundation). The constant α
is termed the momentum constant and can be adjusted to
achieve the best effect. The second summand corresponds
to the standard delta rule, while the first summand says "add
α × the previous change to this weight."

This new rule is called the generalized delta rule. The effect
is that if the basic delta rule would be consistently pushing
a weight in the same direction, then it gradually gathers
"momentum" in that direction.

When an artificial neural network
learning algorithm causes the weights
of the net to change, it will do so in such a way that the
current point on the error surface will descend into a valley
of the error surface, in a direction that corresponds to the
steepest (downhill) gradient or slope at the current
point on the error surface. For this reason,
backprop is said to be a gradient descent method,
and to perform gradient descent in weight space.

Term used in analysing machine learning
methods. The hypothesis language refers to the notation used
by the learning method to represent what it has learned so far.
For example, in ID3, the hypothesis language
would be the notation used to represent the decision tree (including
partial descriptions of incomplete decision trees). In
backprop, the hypothesis language would
be the notation used to represent the current set of
weights. In Aq, the hypothesis language
would be the notation used to represent the class descriptions (e.g.

In the case of
supervised learning,
in order to construct q', one needs a set
of inputs xi and corresponding target outputs zi
(i.e. you want P(xi | q) = zi
when learning is complete). The new state function L
is computed as:

When an artificial neural network
learning algorithm causes the total error
of the net to descend into a valley of the error surface, that
valley may or may not lead to the lowest point on the entire
error surface. If it does not, the minimum into which the total
error will eventually fall is termed a local minimum. The
learning algorithm is sometimes referred to in this case as
"trapped in a local minimum."

In such cases, it usually helps to restart the algorithm with
a new, randomly chosen initial set of
weights - i.e. at a new random point in weight space.
As this means a new starting point on the
error surface, it is likely to lead into a different valley, and
hopefully this one will lead to the true (absolute) minimum
error, or at least a better minimum error.

A related function, also sometimes used in backprop-trained
networks, is 2φ(x)–1, which can also be expressed
as tanh(x/2). tanh(x/2) is, of course, a smoothed version
of the step function which jumps from –1 to 1 at x = 0, i.e.
the function which = –1 if x < 0, and = 1 if x ≥ 0.

A simple model of a biological neuron
used in neural networks to perform a small part of some overall
computational problem. It has inputs from other neurons, with
each of which is associated a weight - that
is, a number which indicates the degree of importance which this
neuron attaches to that input. It also has an activation
function, and a bias. The bias acts like
a threshold in a perceptron.
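A minimal sketch of such a neuron, assuming a logistic activation function (the weights and bias are hypothetical):

```python
import math

def neuron(inputs, weights, bias):
    """A simple artificial neuron: weighted sum of inputs plus bias,
    passed through a logistic activation function."""
    net = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-net))

# The bias acts like a (negated) threshold: with bias -1.0, the
# weighted input sum must exceed 1.0 for the output to exceed 0.5.
print(round(neuron([1.0, 1.0], [0.8, 0.4], bias=-1.0), 4))
```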

Term used in analysing machine learning
methods. The observation language refers to the notation used
by the learning method to represent the data it uses for training.
For example, in ID3, the observation language
would be the notation used to represent the training
instances, including attributes and their
allowable values, and the way instances
are described using attributes. In
backprop, the observation language would
be the notation used to represent the
training patterns. In Aq, the observation language
would again be the notation used to represent the instances,
much as in ID3.

Perceptrons were originally used as pattern classifiers,
where the term pattern is here used not in the sense of
training pattern, but just in the
sense of an input pattern that is to be put into one of
several classes. Perceptual pattern classifiers of this sort
(not based on perceptrons!) occur in simple animal
visual systems, which can distinguish between prey,
predators, and neutral environmental objects.

If, for example, a node of the tree contains, say, 99 items
in class C1 and 1 in class C2, it is plausible that the 1 item
in class C2 is there because of an error either of classification
or of feature value. There can thus be an argument for regarding
this node as a leaf node of class C1. This is termed pruning
the decision tree.

The algorithm given in lectures for deciding when to prune is as
follows:
At a branch node that is a candidate for pruning:

A recurrent connection is one that is part of a directed
cycle, although the term is sometimes reserved for a connection
which is clearly going in the "wrong" direction in an
otherwise feedforward network.

Recurrent networks include fully recurrent networks
in which each neuron is connected to every other
neuron, and partly recurrent networks in which
greater or lesser numbers of recurrent connections exist.
See also simple recurrent network.

This article is included for general interest - recurrent
networks are not part of the syllabus of COMP9414 Artificial
Intelligence.

Feedforward networks are
fine for classifying objects, but their units
(as distinct from their weights) have no
memory of previous inputs. Consequently they are unable to
cope with sequence prediction tasks - tasks like predicting, given
a sequence of sunspot activity counts, what the sunspot activity
for the next time period will be, and financial prediction tasks
(e.g. given share prices for the last n days, and presumably
other economic data, etc., predict tomorrow's share price).

A simple recurrent network is like a
feedforward network with an input layer, an output layer,
and a single hidden layer, except that
there is a further group of units called
state units or context units. There is one state
unit for each hidden unit. The activation
function
of the state unit is as follows: the activation of a state unit in time
step n is the same as that of the corresponding hidden unit
in time step n–1. That is, the state unit activations are copies
of the hidden unit activations from the previous time step. Each
state unit is also connected to each hidden unit by a
trainable weight - the direction of this
connection is from the state unit to the hidden unit.
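One time step of such a network can be sketched as follows (the weights, sizes and logistic activation are hypothetical; the point is that the state units hold copies of the previous step's hidden activations):

```python
import math

def logistic(x):
    return 1.0 / (1.0 + math.exp(-x))

def srn_step(x, state, w_in, w_state, w_out):
    """One time step of a toy simple recurrent (Elman-style) network:
    each hidden unit sees the input AND the state units, which hold
    the previous time step's hidden activations."""
    hidden = [logistic(w_in[h] * x +
                       sum(w * s for w, s in zip(w_state[h], state)))
              for h in range(len(w_state))]
    output = logistic(sum(w * h for w, h in zip(w_out, hidden)))
    return output, hidden  # hidden becomes the next step's state

# Feed a short input sequence, carrying the state along between steps.
state = [0.0, 0.0]
for x in [1.0, 0.0, 1.0]:
    out, state = srn_step(x, state,
                          w_in=[0.5, -0.5],
                          w_state=[[0.3, 0.1], [0.2, 0.4]],
                          w_out=[1.0, -1.0])
    print(round(out, 4))
```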

if all the instances belong to a single class, there is
nothing to do (except create a
leaf node labelled with the name of that class).

otherwise, for each attribute that has not already been
used, calculate the information gain that would be obtained
by using that attribute on the particular set of instances
classified to this branch node.

use the attribute with the greatest information gain.

This leaves the question of how to calculate the information
gain associated with using a particular attribute A. Suppose
that there are k classes C1,
C2, ..., Ck,
and that of the N instances
classified to this node,
I1 belong to class C1,
I2 belong to class C2, ..., and
Ik belong to class Ck.
Let p1 = I1/N,
p2 = I2/N, ..., and
pk = Ik/N.
The initial entropy E at this node is:

–p1log2(p1)
–p2log2(p2) ...
–pklog2(pk).

Now split the instances on each value of the chosen attribute
A. Suppose that there are r attribute values
for A, namely a1, a2, ..., ar.
For a particular value aj, say, suppose that there are
Jj,1 instances in class C1,
Jj,2 instances in class C2, ..., and
Jj,k instances in class Ck,
for a total of Jj instances having attribute value
aj.
Let qj,1 = Jj,1/Jj,
qj,2 = Jj,2/Jj, ..., and
qj,k = Jj,k/Jj.
The entropy Ej associated with this attribute value
aj at this node is:

–qj,1log2(qj,1)
–qj,2log2(qj,2) ...
–qj,klog2(qj,k).

Now compute:

E – ((J1/N).E1
+ (J2/N).E2 + ... + (Jr/N).Er).

This is the information gain for attribute A.
Note that Jj/N is the estimated probability
that an instance classified to this node will have value
aj for attribute A. Thus we are weighting
the entropy estimates Ej by the estimated
probability that an instance has the associated attribute value.

In terms of the example
used in the lecture notes, (see also
calculations
in lecture notes),
k = 2 since
there are just two classes, positive and negative.
I1 = 4 and I2 = 3, and N = 7,
and so p1 = 4/7 and p2 = 3/7, and E =
–p1log2(p1)
–p2log2(p2) = –(4/7)×log2(4/7) – (3/7)×log2(3/7).
In the example, the first attribute
A considered is size, and the first value of
size considered,
large, corresponds to a1,
in the example in the lecture notes, so J1,1 = 2 =
J1,2, and J1 = 4.
Thus q1,1 = J1,1/J1 = 2/4 = ½, and
q1,2 = J1,2/J1 = 2/4 = ½, and
E1 =
–q1,1log2(q1,1)
–q1,2log2(q1,2) = –½×log2(½) – ½×log2(½)
= 1.
Similarly E2 = 1 and J2 = 2
(size = small),
and E3 = 0 and J3 = 1
(size = medium)
so the final information gain is
E – ((J1/N)×E1 + (J2/N)×E2 + (J3/N)×E3) = E – 6/7 ≈ 0.128.

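The worked example can be checked with a short script (the class counts per split are those given above: large = (2, 2), small = (1, 1), medium = (1, 0)):

```python
import math

def entropy(counts):
    """Entropy in bits of a list of per-class instance counts."""
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c > 0)

# 4 positive and 3 negative instances, split by the attribute size.
E = entropy([4, 3])
splits = [[2, 2], [1, 1], [1, 0]]   # large, small, medium
N = 7
gain = E - sum(sum(s) / N * entropy(s) for s in splits)
print(round(E, 4), round(gain, 4))
```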
Combinations of the two (e.g. whichever of the two occurs
first) and other stopping conditions are possible. See the reference
by Haykin (Neural networks: a comprehensive foundation
p. 153) for more details.

A synapse, in a biological neuron,
is a connection between the axon of one
neuron and the dendrite of another.
It corresponds to a weight in an artificial
neuron. Synapses have varying strengths
and can be changed by learning (like weights). The operation
mechanism of synapses is biochemical in nature, with transmitter
substances crossing the tiny "synaptic gap" between axon and
dendrite when the axon fires.

Machine learning algorithms are trained on a collection of
instances or patterns.
Once training is complete, it is usual to test the trained system
on a collection of test instances or test patterns which were not
used when training the system, in order to find out to what degree
the system is able to generalise beyond its training data.

Given a training pattern,
its squared error is obtained by squaring the difference
between the target output of an output neuron and the actual output.
The sum-squared error, or pattern sum-squared error (PSS), is
obtained by adding up the squared errors for each output neuron.
The total sum-squared error is obtained by adding up the PSS for each
training pattern.
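These definitions can be sketched directly (the targets and outputs are hypothetical):

```python
def pattern_sum_squared_error(targets, outputs):
    """PSS for one training pattern: sum over output neurons of the
    squared difference between target and actual output."""
    return sum((t - y) ** 2 for t, y in zip(targets, outputs))

# Two hypothetical patterns, each with two output neurons.
patterns = [([1.0, 0.0], [0.9, 0.2]),
            ([0.0, 1.0], [0.1, 0.7])]
pss = [pattern_sum_squared_error(t, y) for t, y in patterns]
total = sum(pss)   # total sum-squared error: sum of PSS over patterns
print(pss, round(total, 4))
```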

This article describes the basic tree induction algorithm used by
ID3 and successors. The basic idea is to
pick an attribute A with values
a1, a2, ..., ar, split the training
instances into subsets Sa1,
Sa2, ..., Sar consisting of those instances
that have the corresponding attribute value. Then if a subset
has only instances in a single class, that part of the tree stops
with a leaf node labelled with the single class. If not, then the
subset is split again, recursively, using a different attribute.

This leaves the question of how to choose the best
attribute to split on at any branch node. This issue is handled
in the article on splitting criterion in ID3.

A weight, in an artificial neural network,
is a parameter associated with a connection from one
neuron, M, to another neuron N.
It corresponds to a synapse in a
biological neuron, and it determines how much
notice the neuron N pays to the activation it receives from neuron
M. If the weight is positive, the connection is called excitatory,
while if the weight is negative, the connection is called inhibitory.
See also neuron.

Without windowing, such an algorithm can be really slow, as it
needs to do its information gain calculations (see
tree induction algorithms) over huge amounts of data.

With windowing, training is done on a relatively small sample
of the data, and then checked against the full set of training data.

Here is the windowing algorithm:

Select a sample S of the training instances at random - say 10%
of them. The actual proportion chosen would need to be small
enough that ID3 could run fairly fast on them, but large enough
to be representative of the whole set of examples.

Run the ID3 algorithm on the sample S of training instances to
obtain a decision tree.

Check the decision tree on the full data set, to obtain a set
E of training instances that are misclassified by the tree obtained.