I will keep the architecture and the objective function from section 1,
but I will modify the system dynamics.
Recall that unquantized
variables are assumed to take on their maximal range.
For our single training sequence with discrete time steps, the
system dynamics (explanation follows below)
are defined by

(3)

(4)

Here, one function is differentiable (e.g. for limiting
the weight on a connection
to a given interval),
and two others are differentiable monotonic functions (the `threshold
approximators', to be explained below).

Equation (3) is just the conventional recurrent net update rule (1).
Unlike with conventional recurrent nets, however, the weights do
not remain constant during sequence processing:
Equation (4) says that connections between units active at
successive time steps
are immediately strengthened or weakened essentially in proportion
to pre-synaptic
and post-synaptic activity.
These intra-sequence weight changes are
modulated by the two non-linear threshold approximators
and may be negative (anti-Hebb-like)
or zero as well as positive.
Let us assume that all input vectors and all activation functions are such
that all units can take on only activations between 0 and 1.
The two threshold approximators
are meant to specify the upper and lower thresholds that determine
how strongly units have to be excited or inhibited to contribute to
intra-sequence weight changes. A reasonable
choice is one where both functions
are strongly negative only if their argument is close to 0
and strongly positive only if their argument is close to 1.
Both should return values close to 0 for arguments from
the largest part of the interval between 0 and 1. This implies
hardly any intra-sequence weight changes for connections
between units that have non-extreme activations during successive time steps.
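
A minimal sketch of one function with the shape just described may help. The odd-power form, the name `threshold_approximator`, and the exponent 7 are illustrative assumptions and are not taken from the text; any differentiable monotonic function with the same qualitative shape would serve.

```python
def threshold_approximator(a, power=7):
    """Hypothetical threshold approximator for activations a in [0, 1].

    Returns (2a - 1) ** power. With an odd power the function is monotonic
    and differentiable, equals -1 at a = 0 and +1 at a = 1, and stays close
    to 0 over most of the interval in between.
    """
    if power % 2 == 0:
        raise ValueError("power must be odd to keep the function monotonic")
    return (2.0 * a - 1.0) ** power


# Tabulate the function at a few activations (illustrative values only).
for a in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(f"a={a:.2f}  value={threshold_approximator(a):+.4f}")
```

With this choice, activations of 0.25 or 0.75 already yield magnitudes below 0.01, so only near-extreme activations could contribute noticeably to a weight change.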

The overall effect is that only connections between units
that are exceptionally active or exceptionally inactive during
successive time steps can be significantly modified.
Intra-sequence weight changes
essentially occur only if the
network `pays a lot of attention' to certain units by
strongly exciting them or strongly inhibiting them.
Weights to units that are not `illuminated by adaptive internal
spotlights of attention'
essentially remain invariant and participate only in `automatic
processing' as opposed to `active intra-sequence learning'.
The remainder of this paper derives an exact gradient-based algorithm
designed to adjust the system (via inter-sequence weight
changes) such that it creates
appropriate intra-sequence weight changes at appropriate time steps.