DERIVATION OF THE ALGORITHM

Outline.
We are interested in weights representing nets
with tolerable error but flat outputs (see section 2 and appendix A.1).
To find nets with flat outputs,
two conditions will be defined
to specify $B(w, x^p)$ for each training input $x^p$
and, as a consequence, $B(w, X)$ (see section 3).
The first condition ensures flatness.
The second condition enforces ``equal flatness'' in all
weight space directions, to obtain low variance of the net
functions induced by weights within a box. The second
condition will be
justified using an MDL-based argument.
In both cases, linear approximations will be made
(to be justified in A.2).

Formal details.
We are interested in weights causing tolerable error (see
``acceptable minima'' in section 2) that
can be perturbed without causing significant
output changes, thus indicating the presence of
many neighboring weights leading to the same net function.
By searching for the boxes $M_w$ from
section 2, we are actually
searching for low-error weights whose perturbation
does not significantly change the net function.

In what follows we treat the input $x^p$ as fixed: for
convenience, we suppress $x^p$, i.e. we abbreviate
$o^k(w, x^p)$ by $o^k(w)$.
Perturbing the weights $w$ by $\delta w$
(with components $\delta w_{ij}$),
we obtain
$E_D(w, \delta w) := \sum_k \left( o^k(w + \delta w) - o^k(w) \right)^2$,
where $o^k(w)$ expresses $o^k$'s
dependence on $w$ (in what follows,
however, $w$ often will be suppressed for
convenience, i.e. we abbreviate $o^k(w)$ by $o^k$).
Linear approximation (justified in A.2)
gives us ``Flatness Condition 1'':

$E_D(w, \delta w) \approx
\sum_k \left( \sum_{i,j} \frac{\partial o^k}{\partial w_{ij}}
\, \delta w_{ij} \right)^2 \leq \epsilon ,$   (3)

where $\epsilon > 0$ defines the tolerable output change
within a box and is small enough to justify the linear
approximation.

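To make the linear approximation in (3) concrete, the
following minimal numerical sketch (ours, not part of the
original method; the toy two-layer tanh net, its sizes, and
all names are invented for the example) compares the exact
perturbation error $E_D$ with its linearized version for a
small random $\delta w$:

    import numpy as np

    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(4, 3))          # toy hidden-layer weights
    W2 = rng.normal(size=(2, 4))          # toy output-layer weights
    x = rng.normal(size=3)                # fixed input x^p

    def outputs(W1, W2):
        """Net outputs o^k(w) for the fixed input."""
        return W2 @ np.tanh(W1 @ x)

    def E_D(dW1, dW2):
        """Exact perturbation error sum_k (o^k(w + dw) - o^k(w))^2."""
        return ((outputs(W1 + dW1, W2 + dW2) - outputs(W1, W2)) ** 2).sum()

    def grad_matrix(h=1e-6):
        """Numerical output gradients G[k, l] = d o^k / d w_l."""
        w = np.concatenate([W1.ravel(), W2.ravel()])
        o0 = outputs(W1, W2)
        G = np.empty((o0.size, w.size))
        for l in range(w.size):
            wp = w.copy()
            wp[l] += h
            G[:, l] = (outputs(wp[:12].reshape(4, 3),
                               wp[12:].reshape(2, 4)) - o0) / h
        return G

    dw = 1e-3 * rng.normal(size=20)       # small perturbation delta w
    exact = E_D(dw[:12].reshape(4, 3), dw[12:].reshape(2, 4))
    approx = ((grad_matrix() @ dw) ** 2).sum()   # left-hand side of (3)
    print(exact, approx)                  # nearly equal for small dw
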
Many boxes $M_w$ satisfy flatness condition 1.
To select a particular, very flat $M_w$,
the following
``Flatness Condition 2''
uses up degrees of freedom
left by inequality (3):

$\forall\, i,j,\; u,v: \quad
(\delta w_{ij})^2 \sum_k \left( \frac{\partial o^k}{\partial w_{ij}} \right)^2 =
(\delta w_{uv})^2 \sum_k \left( \frac{\partial o^k}{\partial w_{uv}} \right)^2 .$   (4)

Flatness condition 2
enforces equal ``directed errors''

$E_{D,ij}(w, \delta w_{ij}) :=
\sum_k \left( o^k(w_{ij} + \delta w_{ij}) - o^k(w_{ij}) \right)^2
\approx \sum_k \left( \frac{\partial o^k}{\partial w_{ij}}
\, \delta w_{ij} \right)^2 ,$

where $o^k(w_{ij})$ has the obvious meaning,
and $\delta w_{ij}$ is the $(i,j)$-th component of $\delta w$.
Linear approximation is justified by
the choice of $\epsilon$ in inequality (3).
As will be seen in the MDL-justification to be presented below,
flatness condition 2 favors the box which
minimizes the mean perturbation error within the box.
This corresponds to minimizing the variance of the
net functions induced by weights within the box
(recall that $E_D$ is quadratic in $\delta w$).
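
The link between the mean perturbation error and condition
(4) can be made explicit by a short calculation (a sketch
under the linear approximation, with $\delta w$ drawn
uniformly from the box; the shorthand
$g^k_{ij} := \partial o^k / \partial w_{ij}$ is ours):

$\mathrm{E}\left[ E_D(w, \delta w) \right] \approx
\sum_k \sum_{i,j} (g^k_{ij})^2 \,
\mathrm{E}\left[ (\delta w_{ij})^2 \right]
= \frac{1}{3} \sum_{i,j} (\Delta w_{ij})^2 \sum_k (g^k_{ij})^2 ,$

since the components are independent with mean zero (so the
cross terms vanish), and
$\mathrm{E}[(\delta w_{ij})^2] = (\Delta w_{ij})^2 / 3$
for $\delta w_{ij}$ uniform on $[-\Delta w_{ij}, \Delta w_{ij}]$.
Condition (4) equalizes the summands of the outer sum, i.e.,
the directed errors at the box boundary.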

How to derive the algorithm from flatness conditions 1 and 2.
We first solve equation (4) for $\delta w_{ij}$
(fixing $\delta w_{uv}$ for all $u, v$).
Then we insert the $\delta w_{ij}$
(with fixed $\delta w_{uv}$) into inequality
(3)
(replacing the second relation in (3), ``$\leq$'', by ``$=$'', since
we search for the box with maximal volume).
This gives us an equation for the $\delta w_{ij}$
(which depend on
$w$, but this is notationally suppressed):

$\delta w_{ij} = \sqrt{\epsilon} \left(
\sqrt{\sum_k \left( \frac{\partial o^k}{\partial w_{ij}} \right)^2}
\;
\sqrt{\sum_k \left( \sum_{u,v}
\frac{\left| \partial o^k / \partial w_{uv} \right|}
{\sqrt{\sum_{k'} \left( \partial o^{k'} / \partial w_{uv} \right)^2}}
\right)^2}
\right)^{-1} .$   (5)
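
Spelled out, the two steps read as follows (a sketch; we
again abbreviate $g^k_{ij} := \partial o^k / \partial w_{ij}$
and take the worst case within the box, i.e., perturbation
signs aligned with the gradient signs). Equation (4) states
that $(\delta w_{ij})^2 \sum_k (g^k_{ij})^2$ equals some
constant $c^2$, hence

$\delta w_{ij} = c \left( \sum_k (g^k_{ij})^2 \right)^{-1/2} .$

Inserting this into (3), taken with equality, gives

$c^2 \sum_k \left( \sum_{i,j}
\frac{|g^k_{ij}|}{\sqrt{\sum_{k'} (g^{k'}_{ij})^2}} \right)^2
= \epsilon ,$

and solving for $c$ yields equation (5).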

The $\Delta w_{ij}$
($\delta$ is replaced by $\Delta$) approximate
the $\Delta w_{ij}$ from section 2. The box $M_w$ is approximated
by $A_{\Delta w}(w)$, the box with center $w$ and edge lengths
$2 \Delta w_{ij}$.
$M_w$'s volume is approximated by $A_{\Delta w}(w)$'s box volume
$V(\Delta w) := 2^L \prod_{i,j} \Delta w_{ij}$,
where $L$ is the number of weights.
Thus,
$B(w, x^p)$ (see section 3) can be
approximated by
$\tilde{B}(w, x^p) := - \sum_{i,j} \log \Delta w_{ij}$.
This immediately leads to the algorithm given by
equation (1).
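
As an illustration of how (5) and $\tilde{B}$ might be
computed in practice, here is a minimal numerical sketch
(ours, not the original implementation; the gradient matrix
G, the tolerance value, and the function names are invented
for the example):

    import numpy as np

    def delta_w(G, eps=1e-3):
        """Edge lengths per equation (5).

        G[k, l] holds the output gradients d o^k / d w_l for one
        fixed input, the weight index (i, j) flattened into l.
        """
        col = np.sqrt((G ** 2).sum(axis=0))   # sqrt(sum_k (g^k_l)^2) per weight
        norm = np.sqrt((((np.abs(G) / col).sum(axis=1)) ** 2).sum())
        return np.sqrt(eps) / (col * norm)

    def box_loss(G, eps=1e-3):
        """Approximation B~(w, x^p) = -sum_l log(Delta w_l)."""
        return -np.log(delta_w(G, eps)).sum()

    # Toy example: 2 output units, 3 weights, made-up gradients.
    G = np.array([[1.0, 0.1, 0.5],
                  [0.8, 0.2, 0.3]])
    print(delta_w(G))   # largest for the weight with the smallest gradients
    print(box_loss(G))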

How can the above approximations be justified?
The learning process itself enforces their validity (see A.2).
Initially, the conditions above are
valid only within a very small neighborhood of
an ``initial'' acceptable minimum. But during the search
for new acceptable minima with larger associated box volume,
the corresponding neighborhoods are enlarged.
Appendix A.2 will prove this for feedforward nets
(experiments indicate that it holds
for recurrent nets as well).

Comments.
Flatness condition 2 influences
the algorithm as follows: (1) The algorithm prefers to
increase the $\Delta w_{ij}$'s of weights whose current
contributions are not important
for computing the target output.
(2) The algorithm enforces equal sensitivity of all
output units with respect to weights of connections to
hidden units. Hence, output units tend to share hidden
units, i.e., different hidden units tend to
contribute equally to the computation of the target. The
contributions of a particular hidden unit
to different output unit activations tend to be equal, too.
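
Comment (1) can also be read off equation (5) directly
(again with the shorthand $g^k_{ij}$ from above):

$\Delta w_{ij} \propto \left( \sum_k (g^k_{ij})^2 \right)^{-1/2} ,$

so a weight whose output gradients are small, i.e., whose
contribution matters little for the target output, receives
a large edge length $\Delta w_{ij}$ and thus a small
contribution $- \log \Delta w_{ij}$ to $\tilde{B}$.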

Flatness condition 2 is essential:
flatness condition 1 by itself
corresponds to nothing more than first-order derivative
reduction (ordinary sensitivity reduction).
However, as mentioned above, what we really want is to
minimize the variance of the net functions induced by weights
near the actual weight vector.

The algorithm automatically treats units and weights
in different layers differently, and takes the nature of
the activation functions into account.