Using the principle of history compression, we can build a self-organizing hierarchical neural `chunking' system. The system detects causal dependencies in the temporal input stream and learns to attend to unexpected inputs instead of focusing on every input. It learns to reflect both the relatively local and the relatively global temporal regularities contained in the input stream.

The basic task can be formulated as a prediction task. At a given time step, the goal is to predict the next input from previous inputs. If there are external target vectors at certain time steps, they are simply treated as another part of the input to be predicted.
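The prediction task above can be sketched in a few lines. This is a minimal illustration with hypothetical names (`prediction_errors`, `repeat_last`); any trainable sequence model could stand in for the predictor.

```python
# Minimal sketch of the prediction task: at each step, guess the next
# input from the history of previous inputs and count the failures.
def prediction_errors(sequence, predict):
    """Count the steps at which the predictor fails to guess the next input."""
    errors = 0
    for t in range(len(sequence) - 1):
        history = sequence[:t + 1]        # all inputs up to and including time t
        if predict(history) != sequence[t + 1]:
            errors += 1                   # the next input was unexpected
    return errors

# Toy predictor: always guess that the last symbol repeats.
repeat_last = lambda history: history[-1]

print(prediction_errors("aaabbbccc", repeat_last))  # 2: only the transitions are unexpected
```

For this toy sequence only the two chunk boundaries are mispredicted; it is exactly these unexpected inputs that the chunking hierarchy described below attends to.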

The architecture is a hierarchy of predictors; the input to each level of the hierarchy comes from the previous level. P_i denotes the i-th level network, which is trained to predict its own next input from its previous inputs. We take P_i to be one of the conventional dynamic recurrent neural networks mentioned in the introduction; however, it might be some other adaptive sequence processing device as well.

At each time step the input of the lowest-level recurrent predictor P_0 is the current external input. We create a new higher-level adaptive predictor P_{i+1} whenever the adaptive predictor at the previous level, P_i, stops improving its predictions. When this happens, the weight-changing mechanism of P_i is switched off (to exclude potential instabilities caused by ongoing modifications of the lower-level predictors). If at a given time step P_i (i >= 0) fails to predict its next input (or if we are at the beginning of a training sequence, which usually is not predictable either), then P_{i+1} will receive as input the concatenation of this next input of P_i plus a unique representation of the corresponding time step; the activations of P_{i+1}'s hidden and output units will be updated. Otherwise P_{i+1} will not perform an activation update. This procedure ensures that P_{i+1} is fed with an unambiguous reduced description of the input sequence observed by P_i. This is theoretically justified by the principle of history compression.

In general, P_{i+1} will receive fewer inputs over time than P_i. With existing learning algorithms, the higher-level predictor should have less difficulty in learning to predict the critical inputs than the lower-level predictor. This is because P_{i+1}'s `credit assignment paths' will often be short compared to those of P_i. This will happen if the incoming inputs carry global temporal structure which has not yet been discovered by P_i.
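The shortening of credit assignment paths can be made concrete with a back-of-the-envelope calculation (the numbers here are illustrative assumptions, not from the source):

```python
# If the low-level predictor fails only at chunk boundaries, the high
# level receives one input per chunk, so the temporal gap it must bridge
# between two distant events shrinks by the chunk length.
T = 1000                         # length of the input sequence
chunk = 10                       # assumed chunk length: one failure per 10 symbols
low_level_path = T - 1           # steps the low level must bridge
high_level_path = T // chunk     # unexpected events forwarded upward
print(low_level_path, high_level_path)  # 999 vs 100
```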

This method
is a simplification and an
improvement of the recent chunking method described
by [19].

A multi-level predictor hierarchy is a rather safe way of learning to deal with sequences with multi-level temporal structure (e.g. speech). Experiments have shown that multi-level predictors can quickly learn tasks which are practically unlearnable by conventional recurrent networks; see, e.g., [5].