The basic idea is to use a slowly learning feed-forward network
(with a set of randomly initialized weights )
whose input at time is the vector and whose
output is transformed into immediate
weight changes for
a second `fast-weight' network .
The input to at time is also ,
its -dimensional output is
, and its set of weight variables is . serves as
a short term memory: At different time steps, the same input event
may be processed in different ways depending on the time-varying state of .

The standard method for processing temporal sequences is to employ
a recurrent net with feedback connections. The feedback connections
allow for a short-term memory of information earlier in a sequence.
The present work suggests a novel approach to building a short-term
memory by employing fast weights that can be set and reset by the
'memory controller' . Fast weights can hold on to information
over time because they remain essentially invariant unless they are
explicitly modified.

One potential advantage of the method over the more conventional
recurrent net algorithms is that it does not necessarily
require full-fledged units - experiencing some sort of feedback -
for storing temporal information. A single weight may be sufficient.
Because there are many more weights than units in most networks,
this property represents a potential
for storage efficiency. For related reasons,
the novel representation of past inputs
is well-suited for solving certain problems involving
temporary variable binding
in a natural manner:
's current input may be viewed as a representation of
the addresses of a set of variables;
's current output may be viewed as the representation of
the current contents of this set of variables.
In contrast with recurrent nets, temporary bindings can be established very
naturally by temporary connectivity patterns instead of temporary
activation patterns
(see section 3.2 for an illustrative experiment).

For initialization reasons we introduce an additional time step 0
at the beginning of an episode.
At time step
each weight variable
of a directed connection from unit to unit is set to
(a function of 's outputs as described below).
At time step , the are used to compute the output of
according to the usual activation spreading rules for
back-propagation networks (e.g. [Werbos, 1974]).
After this, each weight variable
is altered
according to

(1)

where (e.g. a sum-and-squash function)
is differentiable with respect to all its parameters
and where the activations of 's output units
(again computed according to the usual activation spreading rules for
back-propagation networks)
serve to compute
by a mechanism specified below.
is 's contribution to the modification
of at
time step .

Equation (1) is essentially identical to Möller and Thrun's
equation (1) in [Möller and Thrun, 1990]. Unlike
[Möller and Thrun, 1990], however, the current paper derives an exact
gradient descent algorithm for time-varying inputs and outputs
for this kind of architecture.

For all weights
(from unit to unit )
we are interested in the increment

(2)

Here is a constant learning rate.
At each time step , the factor

can be computed by conventional back-propagation (e.g. [Werbos, 1974]).
For we obtain the recursion

depends on the interface between and .
With a given interface (two possibilities are given below)
an appropriate back-propagation procedure
for each
gives us
for all
. After having updated the variables,
(2) can be computed using the formula

A simple interface between and
would provide one output unit for each weight variable
, where

(4)

being the output unit's activation at time .

A disadvantage of (4) is that the
number of output units in
grows in proportion to the number of weights in .
An alternative is the following:
Provide an output unit in for each unit in from which
at least one fast weight originates. Call the set of
these output units FROM.
Provide an output unit in for each unit in to which
at least one fast weight leads. Call the set of
these output units TO.
For each weight variable
we now have a unit
FROM and a unit
TO. At time , define
, where
is differentiable with respect to all its parameters. As a
representative example
we will focus on the special case
of being the multiplication operator:

(5)

Here the fast weights in are manipulated by the outputs
of in a Hebb-like manner, assuming that is just a
sum-and-squash function as employed in the experiments described below.

One way to interpret the FROM/TO architecture is to view
as a device for creating temorary associations by giving two parameters
to the short term memory: The first parameter is an activation
pattern over FROM representing a key to
a temporary association pair, the second
parameter is an activation
pattern over TO representing the corresponding entry.
Note that both
key and entry may involve hidden units.

(4) and (5) differ in the way that error signals are
obtained at 's output units:
If (4) is employed, then we use conventional back-propagation to
compute
in (3).
If (5) is employed, note that

(6)

Conventional back-propagation can be used to compute
for each output
unit and for all . The results
can be kept in
variables.
This makes it easy to solve (6) in a second pass.

The algorithm is local in time, its
update-complexity per time step is
.
However, it is not local in space
(see [Schmidhuber, 1990b] for a definition of
locality in space and time).