The system described below
is the on-line version of one representative
of a number of variations of the basic
principle described in Section 4.1.
See [Schmidhuber, 1991c] for
various modifications.

Table 1 gives an overview of various time-dependent activation vectors
relevant for the description of the algorithm. Additional notation:
`$\circ$' is the concatenation operator;
A's input is $i_A(t) = x(t) \circ d(t)$,
where $d(t)$ is the target vector
if the teacher provides one at time $t$.
If no target is provided at time $t$,
then $d(t)$ takes on some default value,
e.g. the zero vector.
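For concreteness, this convention can be sketched as follows (illustrative NumPy code; the names `a_input`, `x_t`, `d_t`, and the constant `N_D` are mine, not part of the original notation):

    import numpy as np

    N_D = 3  # dimensionality n_d of the target vector (assumed for illustration)

    def a_input(x_t, d_t=None):
        """A's input i_A(t) = x(t) o d(t); when the teacher provides no
        target at time t, d(t) takes its default value, the zero vector."""
        if d_t is None:
            d_t = np.zeros(N_D)
        return np.concatenate([x_t, d_t])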


Table 1: Definitions of symbols representing time-dependent activation vectors.
`$\circ$' is the concatenation operator.
All prediction vectors referring to time $t$ (the $p$ symbols below) are based on
previous inputs and are computed without knowledge about
$x(t)$ and $d(t)$.

    vector            description (referring to time $t$)          dimension

    $x(t)$            `normal' environmental input                  $n_x$
    $d(t)$            teacher-defined target                        $n_d$
    $i_A(t)$          A's input                                     $n_x + n_d$
    $h_A(t)$          A's hidden activations                        $n_{h_A}$
    $p^d_A(t)$        A's prediction of $d(t)$                      $n_d$
    $p^x_A(t)$        A's prediction of $x(t)$                      $n_x$
    $time(t)$         unique representation of $t$                  $n_{time}$
    $h_C(t)$          C's hidden activations                        $n_{h_C}$
    $p^d_C(t)$        C's prediction of A's next target input       $n_d$
    $p^x_C(t)$        C's prediction of A's next `normal' input     $n_x$
    $p^{time}_C(t)$   C's prediction of C's next `time' input       $n_{time}$
    $p^s_A(t)$        A's prediction of $s(t)$, the state of C      $n_{h_C} + n_d + n_x + n_{time}$

A has $n_x + n_d$ input units,
$n_{h_A}$ hidden units, and
$n_d + n_x + n_s$ output units (see Table 1),
where $n_s = n_{h_C} + n_d + n_x + n_{time}$
is the dimensionality of C's state $s(t)$.
With pure prediction tasks
$n_d = 0$.
C has $n_{h_C}$
hidden units, and
$n_d + n_x + n_{time}$ output units.
All of A's input and hidden units
have directed connections to
all of A's hidden and output units.
All input units
of A have directed connections to
all hidden and output units of C. This is because A's
input units
serve as input units for C
at certain time steps. There are additional
input units for C for providing unique representations
of the current time step.
These additional input units also have directed connections to
all hidden and output units of C.
All hidden units
of C have directed connections to
all hidden and output units of C.
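To make the unit counts and the connectivity concrete, here is a minimal NumPy sketch. All sizes and variable names are assumptions chosen for illustration; each net's weight matrix maps the current activations of its input and hidden units to new activations of its hidden and output units, exactly as described above:

    import numpy as np

    n_x, n_d, n_time = 4, 2, 8          # assumed dimensionalities
    n_hA, n_hC = 12, 12                 # hidden units of A and C
    n_s = n_hC + n_d + n_x + n_time     # C's state: hidden plus output activations

    n_in_A  = n_x + n_d                 # A's input units
    n_out_A = n_d + n_x + n_s           # predictions of d(t), x(t), and s(t)
    n_in_C  = n_in_A + n_time           # A's input units plus the time units
    n_out_C = n_d + n_x + n_time        # C predicts its own next input

    rng = np.random.default_rng(0)
    # one weight matrix per net: rows = hidden + output units,
    # columns = input + hidden units (every input and hidden unit
    # feeds every hidden and output unit, as stated in the text)
    W_A = rng.normal(0.0, 0.1, size=(n_hA + n_out_A, n_in_A + n_hA))
    W_C = rng.normal(0.0, 0.1, size=(n_hC + n_out_C, n_in_C + n_hC))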

A will try to make
$p^d_A(t)$ equal to
$d(t)$ if the teacher provides a target vector at time $t$, and
it will try to make
$p^x_A(t)$ equal to $x(t)$, thus trying to predict $x(t)$.
Here again the target prediction problem is defined as a special case
of an input prediction problem.
C will try to make
$p^d_C(t)$ equal to the externally provided teaching vector
$d(t)$ if there is one at time $t$ and if A failed to emit $d(t)$.
Furthermore, it will always try to make
$p^x_C(t) \circ p^{time}_C(t)$ equal to
the next non-teaching input to be processed
by C. This input may be many time steps ahead.
Finally, and most importantly, A will try to make
$p^s_A(t)$ equal to
$s(t) = h_C(t) \circ p^d_C(t) \circ p^x_C(t) \circ p^{time}_C(t)$,
thus trying to predict the
state of C. The activations of C's
output units are considered
as part of its state.
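The error terms just listed can be summarized as follows (a sketch under my own naming; `p_A_d`, `p_A_x`, `p_A_s`, and `p_C_d` stand for the prediction vectors of Table 1, `d_t is None` encodes a step without teacher-defined target, and the acceptance tolerance is assumed). C's remaining target, its own next non-teaching input, only becomes available when that input actually arrives and is therefore omitted here:

    import numpy as np

    def injected_errors(p_A_d, p_A_x, p_A_s, p_C_d, x_t, s_t, d_t=None):
        """Squared-error terms injected at time t, as described in the text."""
        e = [((p_A_x - x_t) ** 2).sum(),              # A predicts the `normal' input x(t)
             ((p_A_s - s_t) ** 2).sum()]              # A predicts C's state s(t)
        if d_t is not None:                           # a teacher-defined target exists
            e.append(((p_A_d - d_t) ** 2).sum())      # A predicts the target d(t)
            if not np.allclose(p_A_d, d_t, atol=0.1): # A failed to emit d(t) (tolerance assumed)
                e.append(((p_C_d - d_t) ** 2).sum())  # ... so C must predict it as well
        return sum(e)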

Both C and A are
trained simultaneously by a conventional
algorithm for recurrent networks in an on-line fashion. Both
the IID-Algorithm
and BPTT
are appropriate.
In particular, computationally inexpensive
variants of BPTT [Williams and Peng, 1990]
are interesting: there are tasks with hierarchical
temporal structure where only a few iterations
of `back-propagation back into time' per time step are
in principle sufficient to bridge arbitrary time lags (see Section 5).
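To illustrate why such variants are computationally inexpensive, here is a generic truncated-BPTT sketch for a toy tanh recurrent net (my own minimal formulation, not the precise algorithm of [Williams and Peng, 1990], and not the unit layout of A or C): activations of only the last $k$ steps are stored, and each step's output error is propagated back through at most $k$ steps, so the per-step cost is bounded by $k$ independently of sequence length:

    import numpy as np

    def tbptt_grads(W_xh, W_hh, W_hy, xs, ds, k=2):
        """Truncated BPTT for a toy tanh net: per step, backpropagate the
        current output's squared error through at most the last k steps."""
        gx, gh, gy = (np.zeros_like(W) for W in (W_xh, W_hh, W_hy))
        h_prev, cache = np.zeros(W_hh.shape[0]), []
        for x_t, d_t in zip(xs, ds):
            h_t = np.tanh(W_xh @ x_t + W_hh @ h_prev)
            cache = (cache + [(x_t, h_prev, h_t)])[-k:]   # keep last k steps only
            dy = 2.0 * (W_hy @ h_t - d_t)                 # squared-error gradient
            gy += np.outer(dy, h_t)
            dh = W_hy.T @ dy
            for x_s, h_sm1, h_s in reversed(cache):       # <= k backward steps
                dz = dh * (1.0 - h_s ** 2)                # tanh derivative
                gx += np.outer(dz, x_s)
                gh += np.outer(dz, h_sm1)
                dh = W_hh.T @ dz                          # error one step earlier
            h_prev = h_t
        return gx, gh, gy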

I now describe the (quite familiar) procedure for
updating activations in a net (A or C).

Repeat for a constant number of iterations (typically one or two):
every hidden and output unit computes its new activation as a function
of the current activations of all units connected to it.
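A sketch of this update, again in NumPy under the weight layout assumed earlier (logistic units assumed; `net` is a hypothetical container bundling a weight matrix with input, hidden, and output activations):

    import numpy as np
    from types import SimpleNamespace

    def update_activations(net, n_iter=2):
        """Repeat, for a constant number of iterations, the parallel update
        of all hidden and output units from the current activations of the
        input and hidden units feeding them."""
        for _ in range(n_iter):
            pre = net.W @ np.concatenate([net.inp, net.hid])
            act = 1.0 / (1.0 + np.exp(-pre))      # logistic units (assumed)
            net.hid, net.out = act[:net.n_hid], act[net.n_hid:]

    # toy usage with assumed sizes: 6 input, 4 hidden, 5 output units
    net = SimpleNamespace(n_hid=4,
                          W=np.zeros((4 + 5, 6 + 4)),
                          inp=np.zeros(6), hid=np.zeros(4), out=np.zeros(5))
    update_activations(net)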

I now specify the
input-output behavior of the chunker and the automatizer as well as
the details of error injection: