A training sequence with discrete time steps (called an episode)
consists
of ordered pairs
,
.
At time of episode a learning system receives as
an input and produces the output . The goal of the learning system
is to minimize

where is the th of the components of ,
and is the th of the components of .

In general, this task requires
storage of input events in a short-term memory.
Previous solutions to this problem
have employed gradient-based dynamic recurrent nets
(e.g.,
[Robinson and Fallside, 1987],
[Pearlmutter, 1989],
[Williams and Zipser, 1989]).
In the next section an alternative gradient-based
approach is described. For convenience,
we drop the indices which stand for the various episodes.

The gradient of the error over
all episodes is equal to the sum of the gradients for each episode.
Thus we only require a method for
minimizing the error observed during one particular episode:

where
. (In the
practical on-line version of the algorithm below there will be
no episode boundaries; one episode will 'blend' into the next
[Williams and Zipser, 1989].)