LIMITATIONS / FINAL REMARKS / FUTURE RESEARCH

How to adjust $\lambda$?
Given recent trends in neural computing
(see, e.g., MacKay, 1992a, 1992b),
it may seem like a step backward that $\lambda$ is
adapted using an
ad hoc heuristic from Weigend et al. (1991).
However, for determining $\lambda$ in MacKay's style,
one would have to compute the Hessian of the cost function.
Since our term $B(w, X_0)$ includes
first-order derivatives, adjusting $\lambda$ would require
the computation of
third-order derivatives. This is impracticable.
Also, to optimize the regularizing parameter $\lambda$ (see MacKay, 1992b),
we need to compute the evidence for $\lambda$,
but it is not obvious how:
the ``quick and dirty version'' (MacKay, 1992a) cannot deal with
the unknown constant in $B(w, X_0)$.
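
The third-order-derivative argument can be made explicit with a schematic regularizer built from first derivatives of the network output $o$ with respect to the weights $w_k$; here $g$ is merely a placeholder for whatever scalar function is applied, so this is a sketch, not the paper's exact term:

```latex
% Schematic regularizer built from first-order derivatives:
B(w) = \sum_k g\!\left(\frac{\partial o}{\partial w_k}\right)
\quad\Longrightarrow\quad
\frac{\partial^2 B}{\partial w_i \, \partial w_j}
= \sum_k \left[
    g''\!\left(\frac{\partial o}{\partial w_k}\right)
    \frac{\partial^2 o}{\partial w_i \, \partial w_k}
    \frac{\partial^2 o}{\partial w_j \, \partial w_k}
  + g'\!\left(\frac{\partial o}{\partial w_k}\right)
    \frac{\partial^3 o}{\partial w_i \, \partial w_j \, \partial w_k}
  \right]
```

Any Hessian-based scheme for the regularizing parameter therefore inherits third-order derivatives of $o$ with respect to the weights.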

Future work will investigate how to adjust $\lambda$
without too much computational effort.
In fact, as will be seen in appendix A.1,
the choices of $\lambda$ and $E_{tol}$
are correlated: the optimal choice of $\lambda$
may indeed correspond to the optimal choice of $E_{tol}$.
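
To illustrate what an ad hoc schedule of this kind looks like, here is a minimal sketch. The function `update_lambda`, its thresholds, and the error trajectory are all illustrative inventions, not the actual rule of Weigend et al. (1991):

```python
# Illustrative sketch (NOT the exact rule of Weigend et al., 1991):
# after each epoch, nudge the regularization weight lam upward while
# the training error is acceptable or improving, and back off otherwise.

def update_lambda(lam, error, prev_error, target_error, delta=1e-4):
    """Hypothetical ad hoc schedule for the regularization weight."""
    if error <= target_error:        # error acceptable: regularize harder
        return lam + delta
    if error < prev_error:           # still improving: small increase
        return lam + delta / 10
    return lam * 0.9                 # error got worse: back off

lam = 0.0
prev = float("inf")
# fake error trajectory standing in for per-epoch training errors
for err in [1.0, 0.8, 0.5, 0.2, 0.05, 0.04]:
    lam = update_lambda(lam, err, prev, target_error=0.1)
    prev = err
print(round(lam, 6))                 # prints 0.00024
```

The point of such schedules is precisely that they avoid any Hessian computation, at the price of extra free parameters (`delta`, the back-off factor) that must themselves be chosen ad hoc.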

Generalized boxes?
The boxes found by the current version of FMS are axis-aligned.
This may cause an underestimate of flat minimum volume.
Although our experiments indicate that box search works very well,
it will be interesting to compare alternative
approximations of flat minimum volumes.
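
A hypothetical 2-D example (the region and the constants `eps`, `L` are made up for illustration) shows how badly an axis-aligned box can underestimate the volume of a flat region that is not axis-aligned:

```python
# Hypothetical 2-D example: the flat region is a rectangle rotated 45
# degrees, {(x, y): |x + y| <= eps, |x - y| <= L}.  Its true area is
# 2 * eps * L (side lengths sqrt(2)*eps and sqrt(2)*L).  The largest
# axis-aligned box [-a, a] x [-b, b] inside it must satisfy
# a + b <= min(eps, L); for L >= eps the best choice is a = b = eps / 2,
# giving area eps**2 -- far smaller than the true flat area when L >> eps.

eps, L = 0.1, 10.0

true_area = 2 * eps * L            # area of the rotated rectangle
box_area = (2 * (eps / 2)) ** 2    # best centered axis-aligned box

print(round(true_area, 6), round(box_area, 6))   # prints 2.0 0.01
```

Here the axis-aligned box captures only a 1/200 fraction of the flat volume; a rotated-box (or ellipsoid) approximation would capture it all, which is what motivates comparing alternative volume approximations.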

Multiple initializations?
First, consider this FMS ``alternative'':
run conventional backprop starting with several
random initial guesses, and pick the flattest minimum with largest
volume. This does not work: conventional backprop changes the weights
according to steepest descent -- it runs away from flat
ranges in weight space!
Using an ``FMS committee'' (multiple runs with different
initializations), however, would
lead to a better approximation of the posterior. This is left for
future work.
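
The committee idea amounts to generic model averaging over independent runs. A minimal sketch, where `train_predictor` is a hypothetical stand-in for one complete FMS training run:

```python
# Minimal committee sketch (generic ensembling, not FMS itself):
# train several predictors from different random initializations and
# average their outputs; the average approximates an expectation over
# the posterior better than any single run does.

import random

def train_predictor(seed):
    """Stand-in for one training run: returns a slightly different model."""
    rng = random.Random(seed)
    slope = 2.0 + rng.uniform(-0.1, 0.1)   # each run lands near slope 2
    return lambda x: slope * x

committee = [train_predictor(seed) for seed in range(10)]

def committee_predict(x):
    return sum(p(x) for p in committee) / len(committee)

print(round(committee_predict(1.0), 2))    # close to 2.0
```

Each member's idiosyncratic error partially cancels in the average, which is why the committee output is a better posterior approximation than any single member.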

Notes on generalization error.
If the prior distribution of targets (see appendix A.1)
is uniform (or if the distribution of prior distributions is uniform),
no algorithm can achieve a lower expected generalization
error than algorithms that merely reduce the training error
(see, e.g., Wolpert, 1994b).
Typical target distributions in the real world are
not uniform, however: the real world appears to
favor problem solutions with low algorithmic complexity
(see, e.g., Schmidhuber, 1994a).
MacKay (1992a) suggests searching for alternative priors if
the generalization error indicates a ``poor regulariser''.
He also points out that,
given a ``good'' approximation of the nonuniform prior,
more probable posterior hypotheses do not necessarily have
a lower generalization error. For instance,
there may be noise on the test set, or two hypotheses representing
the same function may have different posterior values,
and the expected generalization error ought to be computed
over the whole posterior and not for a single solution.
Schmidhuber (1994b) proposes a general, ``self-improving''
system whose entire life is viewed as a single
training sequence and
which continually attempts to incrementally modify its
priors based on experience with previous problems -- see
also Schmidhuber (1996).
It remains to be seen, however, whether this will lead to
practicable algorithms.

Ongoing work on low-complexity coding.
FMS can also be useful for unsupervised learning.
In recent work,
we postulate that a ``generally useful'' code of given
input data fulfills three MDL-inspired criteria:
(1) It conveys information about the input data.
(2) It can be computed from the data by a low-complexity mapping.
(3) The data can be computed from the code by a low-complexity mapping.
To obtain such codes,
we simply train an auto-associator with FMS (after
training, codes are represented across the hidden units).
In initial experiments,
depending on data and architecture,
this always led to well-known kinds of codes considered
useful in previous work by numerous researchers:
we sometimes obtained factorial codes,
sometimes local codes, and sometimes sparse codes.
In most cases, the codes were of the low-redundancy, binary kind.
Initial experiments with a speech data benchmark problem
(vowel recognition) already demonstrated the usefulness of
codes obtained by FMS: feeding the codes into standard, supervised,
overfitting backprop classifiers, we obtained much
better generalization performance than with competing approaches.
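
The auto-associator setup can be sketched as follows. This is plain reconstruction training by gradient descent on toy rank-1 data; the FMS flatness term is omitted, and all architecture choices and constants are illustrative:

```python
# Minimal linear auto-associator sketch (plain reconstruction training,
# WITHOUT the FMS flatness term): 2-D inputs on the line x2 = 2*x1 are
# squeezed through a single hidden "code" unit and reconstructed.

data = [(x, 2.0 * x) for x in (-1.0, -0.5, 0.5, 1.0)]

w = [0.3, 0.1]   # encoder weights: code = w[0]*x1 + w[1]*x2
v = [0.2, 0.4]   # decoder weights: reconstruction = (v[0]*code, v[1]*code)
lr = 0.02

def loss_and_grads():
    L = 0.0
    gw = [0.0, 0.0]
    gv = [0.0, 0.0]
    for x1, x2 in data:
        c = w[0] * x1 + w[1] * x2          # hidden code for this input
        e1, e2 = v[0] * c - x1, v[1] * c - x2
        L += e1 * e1 + e2 * e2             # squared reconstruction error
        # backprop through the two linear layers
        gc = 2 * (e1 * v[0] + e2 * v[1])
        gv[0] += 2 * e1 * c
        gv[1] += 2 * e2 * c
        gw[0] += gc * x1
        gw[1] += gc * x2
    return L, gw, gv

for _ in range(2000):
    L, gw, gv = loss_and_grads()
    for i in range(2):
        w[i] -= lr * gw[i]
        v[i] -= lr * gv[i]

print(L < 1e-6)   # prints True: the 1-D code captures the 1-D data
```

After training, the hidden activation `c` is the code of each input; FMS would add its flatness regularizer to the reconstruction loss, pushing this code toward the low-redundancy forms described above.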