Optimization of a Neural Network

Many discussions of neural networks unnecessarily introduce
a large vocabulary of specialized terms. In fact, neural networks
are a simple system of recursive equations that can be optimized
by conventional means. Any good book on optimization
provides all the necessary tools [1].
We need only unambiguously write down the equations we are using,
identify the objective function, and calculate a few derivatives.

Neural-network equations map vectors of known data
to corresponding known vectors of parameters , indexed by .
The vector of parameters might be actual physical properties or
statistical probabilities. Data can be physical measurements
or numbers that index possible events. The equations
contain unknown coefficients, or weights, that the user
adjusts until known parameters are predicted accurately.
Then it is hoped that unknown parameters can
be estimated from new measurements of data.

Let's look at how simple these equations are. Intermediate vectors
are expressed recursively
as a function of other vectors
,
usually with a different dimensionality.
The first vector
corresponds to data , and the
final vector
corresponds to the parameters
to be estimated.
First, a pre-established non-linear transform is applied to
each element of a vector. Then linear transforms calculate
each sample of the output vector as a weighted sum
of samples from the input vector.

(1)

(2)

(3)

(4)

Usually the non-linear functions apply independently to each
element of the vector. Functions are expected
to be monotonic over the range
for values
.
The derivative should also be continuous.
Usually, the shape is a sigmoid
or linear. Because the notation presents
no difficulty, I write this non-linear scalar function as a more general
vector function.

I could also have reversed the order of linear and non-linear operators.
A reversed order is equivalent to setting and to
identity operations. This form (1)
is somewhat easier to manipulate.

The best weights are assumed to minimize the following least-squares
objective function , summed over all known pairs of data measurements
and model parameters .

(6)

A perturbation of the weights
will result in
a linearized perturbation of the objective function:

(7)

(8)

(9)

(10)

Unperturbed variables retain their original reference values from
the original non-linear forward transform (1).

The perturbations (9) of the vector elements
are
expressed as a linear function of perturbed weights
.
We can abbreviate the recursive equations (9)
as a single linear transform
.

(11)

(12)

Gradient optimization also requires the adjoint of this linearization.
If the linearized forward transform is expressed as a matrix,
then the adjoint transform is just the transpose of this matrix.
A matrix would be unwieldy, but the recursive version of the
adjoint equations is not.

(13)

(14)

(15)

(16)

(17)

These perturbations do not equal those of the forward linearization.
The adjoint recursion can also be abbreviated with the
linear transform
.

(18)

(19)

If the linear transform
were written as a single
large matrix, then indeed the matrix would be identical
in the forward and adjoint equations (11) and (18).
For optimization, fortunately, we need not construct this matrix
explicitly (as would be required by singular-value decomposition).
Instead, we need only be able to multiply this linear transform
or its adjoint by specific perturbations, using the recursions
(9) and (16).

The simplest neural network ``training'' algorithm adjusts the previous
choice of weights by a scaled gradient. This recursive algorithm
is called back-propagation.

Initialize each weight matrix .

Calculate
.

Replace each by
.

Return to step 2.

To reduce the objective function, the perturbation reverses the
sign of the gradient. The small scale factor is sometimes
fixed a priori and never changed.
If the scale factor is too small, then many consecutive steps may
move in the same direction. If the scale factor is too large,
the perturbations may increase rather than decrease the objective
function and may even diverge. Although this algorithm is most
commonly cited, we can easily do better. Let us turn to
standard non-linear optimization methods.

Many steepest-descent algorithms would replace step 3
by a line search to find an optimum scale factor.

Initialize each weight matrix .

Calculate
as before.

Find to minimize

Replace each by
.

Return to step 2.

This revised algorithm is guaranteed to find a local minimum where the
gradient is zero. Step sizes are large. The line search can use
combinations of parabolic fitting and golden sections for speed
and robustness.

A good estimate of the scale factor can be estimated without
an explicit line search.

Initialize each .

Calculate
as before.

Calculate
.

Find to minimize
,
or equivalently

Replace each by
.

Return to step 2.

When the linearization is a good approximation (particularly in
the vicinity of a minimum), the scale factor will be very accurate.
Each iteration will be much more efficient than a line search.
If the scaled perturbation increases the objective function in early
non-linear iterations, then a line search can be used until the
linearization improves.

The convergence of these steepest descent algorithms should be improved
by a PARTAN or Fletcher-Reeves algorithm, which retains and combines
consecutive gradients. Neural network references use the word ``momentum''
to describe similar strategies, usually with invariable scale factors.

An even more efficient alternative is to solve the linearized least-squares
problem fully in each iteration -- a Gauss-Newton approach:

Initialize each .

Find the perturbations
that minimize
.

Replace each by
.

Return to step 2.

Since the objective function in step 2 is a fully quadratic function
of the perturbations , we can use iterative least-squares
methods like conjugate gradients.

If a steepest descent algorithm were applied with an unlimited
number of infinitesimal perturbations, then the changing weights
would continuously decrease the objective function until a local
minimum was reached. Visualize following a path that is directly
downhill at every point. The path cannot ascend, no matter what
may be beyond the next rise.
Practical descent algorithms take large step sizes and may step over
a small local increase in the objective function, to find another
minimum beyond.

To increase the chances of finding the global minimum with
the pseudo-quadratic objective function (6), we could
initialize with different initial weights and discard suboptimum
local minima. If different weights consistently lead to the same
solution, then the objective function is probably convex and
has only one global minimum.