Dynamics and RNNs

Consider the recurrent network illustrated below. A single input unit
is connected to each of the three "hidden" units. Each hidden unit in
turn is connected to itself and to the other hidden units. As in the
RTRL derivation, here we do not distinguish between hidden and output
units. Any activation which enters the network through the input unit
can flow around from one unit to another, potentially forever.
Weights with magnitude less than 1.0 will exponentially reduce the
activation, while weights with magnitude greater than 1.0 will cause
it to grow. The non-linear
activation functions of the hidden units will hopefully prevent it
from growing without bound.
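
To make this concrete, here is a minimal simulation sketch in NumPy
(the three-unit network above, with made-up weights; the scale of 0.3
is an assumption chosen so that the activation decays). A single
input pulse enters at t = 0 and then circulates through the recurrent
connections, dying away because the recurrent weights are small:

    import numpy as np

    rng = np.random.default_rng(0)

    # Hypothetical weights: W_rec connects each of the three hidden
    # units to itself and to the other two; w_in connects the input
    # unit to all three.
    W_rec = 0.3 * rng.normal(size=(3, 3))  # small weights -> decay
    w_in = rng.normal(size=3)

    h = np.zeros(3)    # hidden state
    x = 1.0            # a single input pulse at t = 0

    for t in range(20):
        h = np.tanh(W_rec @ h + w_in * x)
        x = 0.0        # no further input; the pulse just circulates
        print(t, np.linalg.norm(h))   # the norm shrinks toward zero

Rescaling W_rec by, say, 4.0 typically shows the other regime: the
activation no longer dies away, but the tanh squashing keeps it
bounded.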

As we have three hidden units, their activations at any given time
t describe a point in a 3-dimensional state space. We can
visualize the temporal evolution of the network by plotting the
trajectory this point traces out over time.
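
One way to draw this trajectory (a sketch, assuming matplotlib; the
weights are the same illustrative ones as in the simulation above) is
to record the state at every step and plot it in three dimensions:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(0)
    W_rec = 0.3 * rng.normal(size=(3, 3))  # illustrative weights
    w_in = rng.normal(size=3)

    h = np.zeros(3)
    states = []
    for t in range(200):
        h = np.tanh(W_rec @ h + w_in * 1.0)   # steady input of 1.0
        states.append(h)
    states = np.array(states)

    ax = plt.figure().add_subplot(projection='3d')
    ax.plot(states[:, 0], states[:, 1], states[:, 2])
    ax.set_xlabel('unit 1')
    ax.set_ylabel('unit 2')
    ax.set_zlabel('unit 3')
    plt.show()

With these small weights and a steady input the trajectory is short:
the state moves from the origin to a single point and stays there,
which anticipates the fixed point attractors discussed next.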

In the absence of input, or in the presence of a steady-state input, a
network will usually approach a fixed point attractor. Other
behaviors are possible, however. Networks can be trained to oscillate
in a regular fashion, and chaotic behavior has also been observed. The
development of architectures and algorithms to generate specific forms
of dynamic behavior is still an active research area.
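
The fixed point case is easy to check numerically. The sketch below
(the same illustrative three-unit network as above) iterates the state
update under a steady input and then verifies that the state has
stopped moving, i.e. that it satisfies h = tanh(W_rec h + w_in x):

    import numpy as np

    rng = np.random.default_rng(0)
    W_rec = 0.3 * rng.normal(size=(3, 3))
    w_in = rng.normal(size=3)

    h = np.zeros(3)
    for t in range(500):
        h = np.tanh(W_rec @ h + w_in * 1.0)   # steady input of 1.0

    # At a fixed point, one more update leaves the state unchanged.
    residual = np.linalg.norm(h - np.tanh(W_rec @ h + w_in * 1.0))
    print(residual)   # effectively zero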

Some limitations of gradient methods and RNNs

The simple recurrent network computed a gradient based on the present
state of the network and its state one time step ago. Using
backpropagation through time (BPTT), we could compute a gradient
based on some finite number n of time steps of network
operation. RTRL provided a way of
computing the true gradient based on the complete network history from
time 0 to the present. Is this perfection?
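
Before answering, it helps to see what this gradient looks like. In a
sketch using our own notation (y_t for the vector of unit activations
at time t, E_t for the error at time t, W for the weights), the
gradient of the error at time t decomposes over all the earlier time
steps at which the weights were applied:

    \frac{\partial E_t}{\partial W}
      = \sum_{k=1}^{t}
        \frac{\partial E_t}{\partial y_t}
        \left( \prod_{j=k+1}^{t}
               \frac{\partial y_j}{\partial y_{j-1}} \right)
        \frac{\partial y_k}{\partial W}

where the last factor is the immediate derivative of the state at
step k with respect to the weights. Truncated BPTT keeps only the
terms from the last n time steps; RTRL accumulates the full sum
recursively as the network runs. The troublesome part is the product
of Jacobians linking time k to time t, as we will now see.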

Unfortunately not. With feedforward networks which have a large
number of layers, the weights which are closest to the output are the
easiest to train. This is no surprise, as their contribution to the
network error is direct and easily measurable. Every time we back
propagate an error one layer further back, however, our estimate of
the contribution of a particular weight to the observed error becomes
more indirect. You can think of error flowing in at the top of the
network in distinct streams. Each backpropagation step dilutes the
error, mixing error from distinct sources, until, far back in the
network, it becomes virtually impossible to tell who is responsible
for what. The error signal has become completely diluted.
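
The dilution is easy to demonstrate numerically. The sketch below is
illustrative only: it pushes an error vector backwards through a
stack of random tanh layers of width 10 (the weight scale of 0.9 is
an assumption) and prints its norm, which collapses within a few tens
of layers:

    import numpy as np

    rng = np.random.default_rng(0)
    width, depth = 10, 30
    grad = np.ones(width)           # error arriving at the output

    for layer in range(depth):
        W = 0.9 * rng.normal(size=(width, width)) / np.sqrt(width)
        a = rng.normal(size=width)  # stand-in pre-activations
        d = 1.0 - np.tanh(a) ** 2   # tanh derivative, always <= 1
        grad = W.T @ (d * grad)     # one step of backpropagation
        print(layer, np.linalg.norm(grad))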

With RTRL and BPTT we face a similar problem. Error is now propagated
back in time, but each time step is exactly equivalent to propagating
through an additional layer of a feedforward network. The result, of
course, is that it becomes very difficult to assess the importance of
the network state at times which lie far back in the past. Typically,
gradient-based networks cannot reliably use information which lies
more than about 10 time steps in the past. If you now imagine an
attempt to use a recurrent neural network in a real-life situation,
e.g. monitoring an industrial process, where data are presented as a
time series at some realistic sampling rate (say 100 Hz), it becomes
clear that these networks are of limited use.
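
The same collapse appears in the recurrent setting, where one
backward step per unit of time replaces one backward step per layer,
and the same weight matrix is reused at every step. A sketch
(illustrative: a random tanh network of 10 units, weight scale 0.9 as
before):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 10
    W = 0.9 * rng.normal(size=(n, n)) / np.sqrt(n)  # one fixed matrix
    grad = np.ones(n)          # error injected at the current step

    for step in range(1, 21):
        a = rng.normal(size=n)   # stand-in pre-activations at step t
        grad = W.T @ ((1.0 - np.tanh(a) ** 2) * grad)  # back one step
        print(step, np.linalg.norm(grad))  # shrinks exponentially

The next section shows a recent model which tries to address this
problem.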