Echo States as Bases and Projections
ESN Dynamics as a Combination of Linear Systems
Average State Entropy as a Measure of the Richness of ESN Reservoir

1-4 Echo state network (ESN). ESN is composed of two parts: a fixed-weight (W) recurrent network and a linear readout.

2-1 Performance of ESNs for different realizations of W with the same weight distribution. Results show that the spectral radius of W is not the unique parameter that determines the performance of an ESN.

2-2 Pole tracks of the linearized ESN when the input goes through a cycle.

2-3 Echo states and state entropy for different ESNs.

2-4 The ASE values obtained from 50 realizations of ESNs with the same spectral radius.

3-1 Comparison of ASE values obtained for ASE-ESN with U-ESN and randomly generated ESNs with different sparseness.

PEs (echo states) are fed to a memoryless but adaptive readout network (generally linear) that reads the reservoir and produces the network output. An interesting property of the ESN is that only the memoryless readout is trained, whereas the recurrent topology (W) has fixed connection weights. This reduces the complexity of RNN training to simple linear regression while preserving a recurrent topology, but it obviously places important constraints on the overall architecture that have not yet been fully studied.

Figure 1-4. Diagram of an echo state network (ESN). ESN is composed of two parts: a fixed-
weight (W) recurrent network and a linear readout. The recurrent network is a
reservoir of highly interconnected dynamical components, states of which are called
echo states. The memoryless linear readout is trained to produce the output.

has the same form as Equation 2-3, where the φ_i(u(t))'s and the a_i's are replaced by the echo states and the readout weights, respectively. These weights are adapted on the training data, which means that the ESN is able to find the optimal projection in the projection space, just like the RNN or the TDNN.

It is interesting that a similar perspective of bases and projections for information processing in biological networks has been proposed by Pouget and Sejnowski (1997). They explored the possibility that the responses of single neurons in parietal cortex serve as basis functions for the transformations from the sensory input to the motor responses. They proposed that "the role of spatial representations is to code the sensory inputs and posture signals in a format that simplifies subsequent computation, particularly in the generation of motor commands" (Pouget and Sejnowski, 1997).

The central issue in ESN design is exactly the nonadaptive nature of the basis set. Parameter sets in the reservoir that provide linearly independent states and possess a given spectral radius may define drastically different projection spaces, because the correlation among the bases is not constrained. A simple experiment was designed to demonstrate that selecting the echo state parameters by constraining the spectral radius alone is not the most suitable approach for function approximation. Consider an echo state network of 100 units where the input signal is sin(2πn/(10π)). Mimicking (Jaeger, 2001), the goal is to let the ESN generate the seventh power of the input signal. Different realizations of a randomly connected 100-unit ESN were constructed where the entries of W are set to 0.4, -0.4 and 0 with probabilities of 0.025, 0.025 and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1 or -1 with equal probabilities, and Wback is set to zero. The input is applied for 300 time steps and


Figure 2-1. Performance of ESNs for different realizations of W with the same weight distribution. The weight values are set to 0.4, -0.4 and 0 with probabilities of 0.025, 0.025 and 0.95. All realizations have the same spectral radius of 0.88. In the 50 realizations, MSEs vary from 5.9×10⁻⁹ to 8.9×10⁻. Results show that the spectral radius of W is not the unique parameter that determines the performance of an ESN.

the echo states are calculated using Equation 2-1. The next step is to train the linear readout when a desired training signal, d(n), is available. One method to determine the optimal output

is 8.9×10⁻. This experiment demonstrates that a design strategy based solely on the spectral radius is not sufficient to specify the system architecture for function approximation. For each set of random weights that provides the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.
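The experiment above can be sketched in NumPy. This is an illustrative reconstruction, not the thesis code: the input sin(2πn/(10π)), the weight distribution, and the least-squares readout follow the text, while the helper names, the seed, and the number of trials (5 instead of 50, for speed) are choices made here.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 300

def random_reservoir(rng, n=N):
    # Entries 0.4, -0.4, 0 with probabilities 0.025, 0.025, 0.95,
    # which yields a spectral radius of about 0.88 on average.
    return rng.choice([0.4, -0.4, 0.0], size=(n, n), p=[0.025, 0.025, 0.95])

def run_esn(W, w_in, u):
    """Echo states x(n+1) = tanh(W_in u(n+1) + W x(n)), no output feedback."""
    x = np.zeros(W.shape[0])
    states = np.empty((len(u), W.shape[0]))
    for t in range(len(u)):
        x = np.tanh(w_in * u[t] + W @ x)
        states[t] = x
    return states

n = np.arange(T)
u = np.sin(2 * np.pi * n / (10 * np.pi))   # input signal
d = u ** 7                                  # desired: seventh power of the input

mses = []
for trial in range(5):                      # 50 realizations in the text
    W = random_reservoir(rng)
    w_in = rng.choice([1.0, -1.0], size=N)
    X = run_esn(W, w_in, u)
    # Least-squares linear readout (Equation 2-4 in the text).
    w_out, *_ = np.linalg.lstsq(X, d, rcond=None)
    mses.append(float(np.mean((X @ w_out - d) ** 2)))

# Realizations share the same weight distribution (hence nearly the same
# spectral radius) yet produce different MSEs.
```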

ESN Dynamics as a Combination of Linear Systems

It is well known that the dynamics of a nonlinear system can be approximated by that of a

linear system in a small neighborhood of an equilibrium point (Kuznetsov, 1998). Here, we will

perform the analysis with hyperbolic tangent PEs and approximate the ESN dynamics by the

dynamics of the linearized system in the neighborhood of the current system state. Hence, when

the system operating point varies over time, the linear system approximating the ESN dynamics

will change. We are particularly interested in the movement of the poles of the linearized ESN.

Consider the update equation for the ESN without output feedback given by

$$\mathbf{x}(n+1) = \mathbf{f}\left(\mathbf{W}^{in}\mathbf{u}(n+1) + \mathbf{W}\mathbf{x}(n)\right)$$

Linearizing the system around the current state x(n), one obtains the Jacobian matrix, J(n+1), defined by

$$
\mathbf{J}(n+1) =
\begin{bmatrix}
\dot f(net_1(n))\,w_{11} & \dot f(net_1(n))\,w_{12} & \cdots & \dot f(net_1(n))\,w_{1N} \\
\vdots & \vdots & \ddots & \vdots \\
\dot f(net_N(n))\,w_{N1} & \dot f(net_N(n))\,w_{N2} & \cdots & \dot f(net_N(n))\,w_{NN}
\end{bmatrix}
=
\begin{bmatrix}
\dot f(net_1(n)) & 0 & \cdots & 0 \\
0 & \dot f(net_2(n)) & \cdots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
0 & 0 & \cdots & \dot f(net_N(n))
\end{bmatrix}
\mathbf{W} = \mathbf{F}(n)\,\mathbf{W} \qquad (2\text{-}5)
$$

Here, net_i(n) is the ith entry of the vector (W^in u(n+1) + W x(n)) and w_ij denotes the (i, j)th entry of W. The poles of the linearized system at time n+1 are given by the eigenvalues of the

Jacobian matrix J(n+1). When the amplitude of each PE input changes, the local slope changes,

and so the poles of the linearized system are time varying, although the parameters of ESN are

fixed.

In order to visualize the movement of the poles, consider an ESN with 100 states. The entries of the internal weight matrix are chosen to be 0, 0.4 and -0.4 with probabilities 0.9, 0.05 and 0.05. W is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1 or -1 with equal probabilities. A sinusoidal signal with a period of 100 is fed to the system and the echo states are computed according to Equation 2-1. Then, the Jacobian matrix and its eigenvalues are calculated using Equation 2-5. Figure 2-2 shows the pole tracks of the linearized ESN for different input values. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants that modulate the richness of the reservoir of dynamics as a function of input amplitude. Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreasing the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. Compared to its linear counterpart, an ESN with the same number of states provides a more detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems. Similar results can be obtained using signals of different shapes at the ESN input.
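The pole-tracking procedure can be sketched as follows, assuming tanh PEs so that f'(net) = 1 − tanh²(net); variable names and the seed are illustrative choices, and the state update is taken to be the one without output feedback.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100

# Reservoir entries 0, 0.4, -0.4 with probabilities 0.9, 0.05, 0.05,
# rescaled to a spectral radius of exactly 0.95.
W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.9, 0.05, 0.05])
W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))
w_in = rng.choice([1.0, -1.0], size=N)

u = np.sin(2 * np.pi * np.arange(200) / 100)  # sinusoid, period 100

x = np.zeros(N)
radii = []
for n in range(len(u)):
    x = np.tanh(w_in * u[n] + W @ x)
    # Equation 2-5: J(n+1) = F(n) W, where F(n) = diag(f'(net_i(n)))
    # and f'(net) = 1 - tanh(net)^2 = 1 - x^2 for tanh PEs.
    J = np.diag(1.0 - x ** 2) @ W
    radii.append(float(np.max(np.abs(np.linalg.eigvals(J)))))

# Poles shrink toward the origin when |u| is large (saturation) and sit
# at the design radius 0.95 when the state and input are near zero.
```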


Figure 2-2. Pole tracks of the linearized ESN when the input goes through a cycle. A single ESN
with fixed parameters implements a combination of many linear systems with varying
pole locations. A) One cycle of sinusoidal signal with a period of 100. B)-E) show the
positions of poles of the linearized systems when the input values are at B, C, D, and
E in panel A. F) Cumulative pole locations show the movement of the poles as the
input changes.

Theorem: The eigenvalues of the linearized system have the largest radius when the

Figure 2-3. Echo states and state entropy for different ESNs. A) Outputs of echo states (100 PEs)
produced by systems with spectral radius of 0.2, 0.5 and 0.8, from up to bottom,
respectively. B) Instantaneous state entropy is calculated using Equation 2-6 for echo
states in panel A.


Figure 2-4. The ASE values obtained from 50 realizations of ESNs with the same spectral radius

Maximizing ASE means that the diversity of the states over time is largest and should provide a basis set that is as uncorrelated as possible. Unfortunately, this condition does not guarantee that the ESN so designed will perform best, because the basis set is independent of the desired response, and the application may require a small spectral radius. However, when the desired response is not available for designing the ESN bases, or when the same reservoir is to be used for a number of problems, the default strategy should be to maximize the ASE of the state vector.
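Equation 2-6 is not reproduced in this excerpt; a common estimator of the quadratic Renyi entropy uses a Gaussian Parzen window, for which the integral of the squared density reduces to a double sum of pairwise kernel evaluations. The sketch below assumes that estimator, and the kernel size σ is an arbitrary choice made here.

```python
import numpy as np

def quadratic_renyi_entropy(x, sigma=0.3):
    """H2 = -log((1/N^2) * sum_ij G(x_i - x_j; sqrt(2)*sigma)) for a 1-D
    sample x: the Parzen estimate of -log(integral of p^2)."""
    x = np.asarray(x, dtype=float)
    d = x[:, None] - x[None, :]            # pairwise differences
    s2 = 2.0 * sigma ** 2                  # variance of the convolved kernel
    g = np.exp(-d ** 2 / (2.0 * s2)) / np.sqrt(2.0 * np.pi * s2)
    return -np.log(np.mean(g))

def average_state_entropy(states, sigma=0.3):
    """ASE: instantaneous entropy over the N PE outputs at each time
    step, averaged over the T time steps (states is T x N)."""
    return float(np.mean([quadratic_renyi_entropy(row, sigma)
                          for row in states]))
```

Diverse, spread-out echo states yield a larger ASE than states that are nearly identical across PEs, which is the quantity the design criterion maximizes.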

CHAPTER 3
DESIGN OF THE DYNAMICAL RESERVOIR

In ESNs, reservoir weights are selected randomly under the constraint of the echo state condition (Jaeger, 2001). However, such a random selection scheme is not optimal, as demonstrated in chapter 2. In this chapter, we propose a design scheme for the reservoir weights that leads to small approximation errors for a variety of tasks. For optimal approximation with a finite number of basis functionals, both the projection space (bases) and the projector (readout) require knowledge of the input and desired responses. However, in ESNs, the projection space is determined solely by the architecture of the ESN reservoir and the input signal (since W is fixed), without any knowledge of the space spanned by the desired target signal. The selection of basis functions with knowledge of the input signal only is an ill-posed problem. However, the selection of the reservoir weights must still be done using some rule, and here we hypothesize that a good design strategy is to let the ESN states cover the projection space with equal resolution, to anticipate any possible mapping requirement (dictated by the unknown desired response).

Design of the Recurrent Connections

According to the interpretation of ESNs as coupled linear systems, the design of the internal connection matrix, W, will be based on the distribution of the poles of the system linearized around the zero state. Our proposal is to design the ESN such that the linearized system has a uniform pole distribution inside the unit circle of the z-plane. With this design scenario, the system dynamics will include a uniform coverage of time constants arising from the uniform distribution of the poles, which also decorrelates the basis functionals as much as possible. This principle was chosen by analogy to the identification of linear systems using Kautz filters (Kautz, 1954), which shows that the best approximation of a given transfer function by a linear system of finite order is achieved when the poles are placed in the neighborhood of the spectral resonances. When no information is available about the desired response, or when one would like to use the same reservoir for a variety of tasks, we should spread the poles uniformly to anticipate good approximation to arbitrary mappings.

We again use a maximum entropy principle to distribute the poles uniformly inside the unit circle. The constraints of a circular boundary for discrete linear systems and of complex-conjugate pole locations are easy to include in the pole distribution (Thogula, 2003). The poles are first initialized at random locations; the quadratic Renyi's entropy is calculated using Equation 2-6, and the poles are moved such that the entropy of the new distribution increases over iterations (Erdogmus et al., 2003). This method efficiently finds a uniform coverage of the unit circle with an arbitrary number of poles. However, any other method can be used to find pole locations with uniform coverage. Notice that this operation has to be done only once for a given number of poles, which is equal to the number of hidden-unit PEs.
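The entropy-ascent iteration itself is not reproduced here, but, as the text notes, any method yielding uniform coverage works. One simple alternative, sketched below under our own choices (area-uniform sampling of the disk, a real block-diagonal realization of the conjugate pole pairs, and a random orthogonal change of basis that preserves the eigenvalues), constructs a real reservoir matrix with prescribed poles:

```python
import numpy as np

def uniform_disk_poles(n_pairs, rng):
    # Area-uniform sampling of the unit disk: r = sqrt(U) corrects for
    # the polar-coordinate Jacobian; upper half-plane angles, since the
    # conjugates are added implicitly by the 2x2 real blocks below.
    r = np.sqrt(rng.uniform(0.0, 1.0, n_pairs))
    theta = rng.uniform(0.0, np.pi, n_pairs)
    return r * np.exp(1j * theta)

def reservoir_from_poles(poles, rng):
    # Real block-diagonal matrix with a 2x2 block [[a, -b], [b, a]] per
    # conjugate pair a +/- jb, then a random orthogonal similarity
    # transform to couple the states (eigenvalues are unchanged).
    n = 2 * len(poles)
    W = np.zeros((n, n))
    for i, p in enumerate(poles):
        a, b = p.real, p.imag
        W[2 * i:2 * i + 2, 2 * i:2 * i + 2] = [[a, -b], [b, a]]
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    return Q @ W @ Q.T

rng = np.random.default_rng(2)
poles = uniform_disk_poles(50, rng)        # 100 PEs -> 50 conjugate pairs
W = reservoir_from_poles(poles, rng)
```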

The system with the uniform pole locations can be interpreted using linear system theory.

The poles that are close to the unit circle correspond to many sharp band pass filters specialized

Figure 3-1. Comparison of ASE values obtained for ASE-ESN with U-ESN and randomly
generated ESNs with different sparseness. As observed from the figure, the ASE-ESN
with uniform pole distribution generates higher ASE on the average for all spectral
radii compared to ESNs with random connections.

Figure 3-1 compares the ASE values obtained using different ESNs over 1000 realizations. As observed from the figure, the ASE-ESN with uniform pole distribution generates higher ASE on average for all spectral radii compared to ESNs with sparse and uniform random connections. This approach is conceptually similar to Jeffreys' maximum entropy prior (Jeffreys, 1946): it provides a consistently good response for the largest class of problems. Concentrating the poles of the linearized system in certain regions of the space provides good performance only if the desired response has energy in that part of the space, as is well known from the theory of Kautz filters (Kautz, 1954).

Design of the Adaptive Bias

In conventional ESNs, only the output weights are trained, optimizing the projections of the

ESN are 13.09, 13.55, 16.70 and 16.90, respectively. First of all, ESNs with uniform pole

distribution (ASE-ESN and BASE-ESN) have MCs that are much larger than that of the randomly generated ESN given in (Jaeger, 2002), in spite of all having the same spectral radius. In fact, the STM capacity of ASE-ESN is close to the theoretical maximum value of N=20. A closer look at the figure shows that R-ESN performs slightly better than ASE-ESN for delays less than 9. Indeed, for small k, large ASE degrades the performance because the tasks do not need long memory depth. However, the drawback of high ASE for small k is recovered in BASE-ESN, which reduces the ASE to the appropriate level required for the task. Overall, the addition of the bias to the ASE-ESN increases the STM capacity from 16.70 to 16.90. On the other hand, U-ESN has slightly better STM than R-ESN, even though R-ESN uses only 3 distinct weight values whereas U-ESN has many more. It is also significant to note that the MC will be very poor for an ESN with a smaller spectral radius, even with an adaptive bias, since the problem requires large ASE and the bias can only reduce ASE. This experiment demonstrates the need to maximize ASE in ESN design.

Binary Parity Check

The effect of the adaptive bias was marginal in the previous experiment since the nature of the problem required large ASE values and long short-term memory. However, there are tasks in which the optimal solutions require smaller ASE values and a smaller spectral radius. Those are the tasks where the adaptive bias becomes a crucial design parameter in our design methodology.

Consider an ESN with 100 internal units and a single input unit. The ESN is driven by a binary input signal, u(n), that assumes the values 0 or 1. The goal is to train the ESN to generate the m-bit parity, where m is 3, ..., 8. Similar to the previous experiments, we used the R-ESN, ASE-ESN and BASE-ESN topologies. R-ESN is a randomly connected ESN where the entries of the W matrix are set to 0, 0.06, -0.06 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN are designed with a spectral radius of 0.9. The input weights are set to 1 or -1 with equal probability, and direct connections from the input to the output are allowed, whereas Wback is set to 0. The echo states are calculated using Equation 2-1 for 1000 samples of the input signal, and the first 100 samples corresponding to the initial transient are eliminated. Then, the output weight matrix is calculated using Equation 2-4. For the ESN with adaptive bias, the bias is trained for each task. The binary decision is made by a threshold detector that compares the output of the ESN to 0.5. Figure 3-3 shows the number of wrong decisions (averaged over 100 different realizations) made by each ESN for m=3, ..., 8.
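A minimal sketch of this parity experiment for the R-ESN case (m = 3): the weight distribution, the direct input-output connection, the 100-sample transient removal, the least-squares readout, and the 0.5 threshold follow the text, while the seed and variable names are choices made here; the adaptive bias of BASE-ESN is not included.

```python
import numpy as np

rng = np.random.default_rng(3)
N, T, m = 100, 1000, 3

# R-ESN-style reservoir: entries 0, 0.06, -0.06 with probabilities
# 0.8, 0.1, 0.1 (about 20% connectivity, small spectral radius).
W = rng.choice([0.0, 0.06, -0.06], size=(N, N), p=[0.8, 0.1, 0.1])
w_in = rng.choice([1.0, -1.0], size=N)

u = rng.integers(0, 2, size=T).astype(float)
# m-bit parity target: XOR of the last m input bits.
d = np.array([int(u[max(0, n - m + 1):n + 1].sum()) % 2 for n in range(T)])

x = np.zeros(N)
X = np.empty((T, N + 1))
for n in range(T):
    x = np.tanh(w_in * u[n] + W @ x)
    X[n] = np.concatenate([x, [u[n]]])   # direct input-to-output connection

# Discard the first 100 transient samples, solve for the readout by
# least squares, and threshold the output at 0.5 for the decision.
w_out, *_ = np.linalg.lstsq(X[100:], d[100:].astype(float), rcond=None)
wrong = int(np.sum((X[100:] @ w_out > 0.5).astype(int) != d[100:]))
```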


Figure 3-3. Number of wrong decisions made by each ESN for m = 3, .. ,8 in the binary parity
check problem. The total number of wrong decisions for m = 3, .. ,8 of R-ESN,
ASE-ESN and BASE-ESN are given by 722, 862 and 699.

The total number of wrong decisions for m=3, ..., 8 of R-ESN, ASE-ESN and BASE-ESN are 722, 862 and 699, respectively. ASE-ESN performs poorly since the nature of the problem requires a short time constant for fast response, but ASE-ESN has a large spectral radius. For 5-bit

vector with the value of the echo state at time n for the ith training sequence.

3. The discrete Fourier transform of x_i is denoted by X_i. Define the overall d×K training matrix in the frequency domain by X = [X_1, X_2, ..., X_K].

4. The optimal coefficients of the LAM for the class are computed using Equation 4-1 in the frequency domain, and the corresponding MACE filter weights h^p are obtained by the inverse discrete Fourier transform. The output of the MACE for the ith training input pattern of the pth class can be obtained by x_i^T h^p.
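Equation 4-1 is not reproduced in this excerpt. The standard frequency-domain MACE solution (Mahalanobis, Vijaya Kumar and Casasent, 1987) is h = D⁻¹X(X⁺D⁻¹X)⁻¹c, where the columns of X are the DFTs of the training images, D is the diagonal average power spectrum, and c holds the prescribed correlation-peak values; the sketch below assumes this form, and the small regularizer `eps` is our own addition to guard against the near-singular spectra mentioned later in the text.

```python
import numpy as np

def mace_filter(patterns, peak=1.0, eps=1e-8):
    """Frequency-domain MACE filter: h = D^-1 X (X^H D^-1 X)^-1 c.
    `patterns` is a list of equal-shape 2-D training images."""
    X = np.stack([np.fft.fft2(p).ravel() for p in patterns], axis=1)  # d x K
    D = np.mean(np.abs(X) ** 2, axis=1) + eps     # average power spectrum
    Xd = X / D[:, None]                           # D^-1 X
    c = np.full(X.shape[1], peak, dtype=complex)  # constrained peak values
    return Xd @ np.linalg.solve(X.conj().T @ Xd, c)  # d-vector, freq domain

def mace_output(h, pattern):
    # Correlation peak at zero lag, evaluated as an inner product in the
    # frequency domain (proportional, by Parseval, to the space-domain
    # correlation of the image-plane filter with the pattern).
    return float(np.vdot(np.fft.fft2(pattern).ravel(), h).real)
```

By construction the filter reproduces the prescribed peak (here 1.0) exactly on each training pattern while minimizing the average correlation-plane energy.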


Figure 4-1. Interpretation of echo states as a 2-d image. A) The triangular input signal. B) The
echo states of a 100 unit ESN. C) The echo state image where a point in state index
and time is interpreted similar to a pixel in an image.

The same procedure is repeated for the training sequences of the other classes to obtain an optimal filter for each class.

During testing, the input signal can be either a continuous stream (asynchronous operation) or a series of frames (synchronous operation). Depending on the mode of operation, the choice of the initial conditions for the ESN in step 1 varies. In synchronous operation, the timing of the signal and the frames of interest are already known in both training and testing. Therefore, the ESN is initialized with the same zero initial condition in both training and testing. The zero initial condition is chosen so that the system dynamics is not biased and the system states are controlled by the input. During testing, the MACE output is calculated for each frame by correlating the filter coefficients with the echo states generated by the frame at zero lag and with a zero initial condition.

When the signal timing is unknown (asynchronous), the MACE filter generates an output at each time instant n by correlating the filter coefficients with the echo states between time instants n-T+1 and n, and the maximum in time above a certain prespecified threshold is picked to represent the occurrence of the pattern. In the case of a continuous stream, the initial conditions for each frame are dictated by the history of the input signal during testing; therefore they are not necessarily zero. In order to mimic the test conditions while training the MACE, the initial conditions of the ESN are not set to zero but determined by the history of the input. In other words, the whole input signal is used to create echo states, and then the resulting N×T dimensional echo state images are extracted and used to train the MACE filter using the above algorithm.
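The asynchronous mode described above, correlating the filter with the echo states between n−T+1 and n and thresholding, can be sketched as follows; the function name and the test arrangement are illustrative:

```python
import numpy as np

def asynchronous_detect(filt, states, threshold):
    """Slide an N x T image-plane filter over an N x L echo-state
    stream: at each time n >= T-1, correlate the filter with the states
    between n-T+1 and n (zero lag) and report the times where the score
    exceeds a prespecified threshold."""
    T = filt.shape[1]
    L = states.shape[1]
    scores = np.array([np.sum(filt * states[:, n - T + 1:n + 1])
                       for n in range(T - 1, L)])
    hits = np.flatnonzero(scores > threshold) + T - 1  # absolute times
    return scores, hits
```

When the filter pattern itself occurs in the stream, the score peaks at the window that exactly aligns with it, since partial overlaps cannot exceed the zero-lag energy.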

The Design of ESNs for Dynamical Pattern Recognition

One of the drawbacks of LAMs is that the storage capacity is limited by the input space dimension (Haykin, 1998). However, the ESN/LSM with a LAM readout becomes very powerful since the echo/liquid states provide the LAM with a user-defined high-dimensional input space, increasing the number of patterns that can be stored in the LAM with minimal crosstalk. In ESNs, the reservoir states can be interpreted as a set of basis functionals (representation space) constructed dynamically and nonlinearly from the input (Ozturk et al., 2007). Due to the nonlinearity, it is easy to get states that are linearly independent of each other (Ito, 1996; Ozturk et al., 2007).

Figure 4-2. Performance comparisons of ESNs for the classification of signals with unknown peak values. ESNs with the MACE filter readout perform better at all noise levels compared to discriminative and predictive ESNs. The classification accuracy of ASE-ESN-MACE is almost perfect up to a noise standard deviation of 0.3. ASE-ESN-MACE performs better than randomly connected ESNs with a MACE readout since it distributes the echo states uniformly and increases the separation between states corresponding to different classes.

To train the MACE filter readout for the ESNs, 100 different realizations of each class of signals are generated and the echo states are calculated using Equation 2-2 with zero initial conditions. This creates an image of dimension 30×20, where 30 is the number of PEs and 20 is the signal length. One MACE filter is trained for each class using only data from the corresponding class. Output correlation peak amplitudes are assigned to be 1.0 for the training data. The two MACE filters are synthesized in the frequency domain using Equation 4-1, and the corresponding image-plane filters are obtained by the inverse discrete Fourier transform. For a given test signal, each filter output is calculated by correlating the echo states of dimension 30×20 with

compared to the linear readout for all λ values. This is due to the fact that the linear readout tries to approximate a constant signal independent of the liquid state values. This creates difficulties for the simple linear readout, especially when the input spike train is sparse. In fact, under closer analysis, the output of the linear readout fluctuates. Therefore, the decision made by the linear readout has to be carefully averaged over time in order to reach a final decision about the class of the input spike train. For example, if the decision is based solely on the output of the readout at the final time instant, the classification rate is very close to the theoretical minimum of 0.5. On the other hand, the MACE filter readout provides the ability to operate in the spike domain without the need to low-pass filter the liquid-state spike output. However, we note that the MACE filter computation in Equation 4-1 requires extra caution since the liquid state has many zero values, mostly leading to singular matrices.


Figure 4-5. Comparison of the correct classification rates of LSM-MACE and LSM with linear readout trained in the spike train classification problem as the parameter λ varies. The MACE filter readout gives slightly better accuracy compared to the linear readout for all λ values. This is due to the fact that the linear readout tries to approximate a constant signal independent of the liquid state values. Moreover, the MACE readout operates in the spike domain, eliminating the need to convert spike trains to continuous-valued signals by low-pass filtering.

This experiment demonstrates the use of MACE filter as a readout for LSMs. A more

detailed study has to be carried out to explore the advantages and disadvantages of MACE

readout compared to the standard techniques.

CHAPTER 5
IMPLICATIONS OF ESN ON THE DESIGN OF OTHER NETWORKS

Echo state networks introduced a new type of computation that has proven to be very powerful. In this chapter, we study how the ESN idea can be utilized for the design of other networks. First, we investigate the echo state condition (|W|<1) and approach it in terms of the effect of system stability on the computational power of a dynamical system. In particular, we relax the echo state condition and allow some of the poles to be larger than 1. This introduces a new dynamical regime, called "transiently stable computation", where function approximation with a readout is still possible even though the system is not globally stable. Second, we investigate a biologically plausible model of the olfactory cortex, the Freeman Model (FM), and propose to use a readout for the FM so that it can be used for useful computation. Without the proposed readout, the use of the FM is limited to simple digit recognition. An interesting property of the FM is the nonlinear function used in the model. The FM nonlinearity does not have its largest slope at the zero operating point, unlike the sigmoidal nonlinearity used in ESNs. We will demonstrate with experiments that the FM coupled with a readout can process continuous-valued signals.

Transiently Stable Computation

Linear filters are the simplest dynamical systems employed to process signals with

temporal structure as discussed in chapter 1. They can be classified according to the impulse

response, which fully characterizes the filter, as finite impulse response (FIR) and infinite

Figure 5-1. Demonstration of a typical response of ESN with a spectral radius of 0.9 and
application to function approximation. A) The input signal. B) 100 echo states of
ESN. C) The overlay plot of the ESN output and the desired signal.


Figure 5-1 B depicts all 100 echo states on top of each other over the given time interval. As seen from the figure, during the first 100 steps the system state converges to zero (after an initial transient of about 30 steps caused by the random initial condition), since the input to the system is zero and the spectral radius is less than 1. The echo states are then driven by the ramp signal between time steps 100 and 200. When the input is removed, the states again converge to zero due to the echo state condition. The resulting echo states can be used to generate any desired response related to the input signal. Here, the desired signal is chosen to be the seventh power of the input signal, and the readout weights are calculated using

systems do not have fixed-point dynamics but rather a wide variety of collective behaviors in the form of oscillations and even chaos (Freeman, 1975; Yao and Freeman, 1990). Inspired by principles of biological computation, we would like to explore the network's response without the restriction on the spectral radius of W. Hence, we will remove the echo state condition and let the spectral radius be slightly greater than 1. Obviously, this introduces instability for the autonomous system, whereas the response of the system with an applied input is not yet obvious.


Figure 5-2. Demonstration of a typical response of transiently stable ESN with a spectral radius
of 1.1 and application to function approximation. A) 100 echo states of the transiently
stable ESN. B) The overlay plot of the ESN output and the desired signal.

In order to observe the ESN response without the echo state condition, we use an experimental setup similar to that of the previous section. We use the same W matrix, scaled to obtain a spectral radius of 1.1. The same input signal (Figure 5-1 A) is fed to the ESN, which is initialized to a random initial condition. We plot the resulting echo states in Figure 5-2 A.

There are a few observations to be made at this point. First of all, we see that the system does not

converge to zero during the first 100 steps although the input is zero. Instead, the echo states

exhibit a nonconvergent dynamical response that differs from the ESN with the echo state

condition. In fact, this was the expected response, since the system is designed to be unstable

(spectral radius is greater than one) around zero equilibrium. A similar response can also be

observed during the last 100 time steps when the input is again zero. The second and more

important observation is the response of the system between time steps 100 and 200 where the

ramp signal is applied as the input. The echo states during this interval become more regular

(after some initial transient between time steps 100 and 130) compared to the states when there is

no input. In fact, after the transient, the echo states look similar to the ones in Figure 5-1 B where

the ESN satisfies the echo state condition. Thirdly, there is a transition period between time steps

100 and 130 where the states switch from a disordered mode to a more regular mode. In

summary, the system responds according to its own dynamics when there is no input. As the

input amplitude gradually increases, the system response is determined by a competition between

the system dynamics and the input amplitude. When the input amplitude is sufficiently large, the

system dynamics becomes "transiently stable" and is determined by the input signal. We would

like to find out if we can utilize the transiently stable ESN for the same function approximation

problem we had in the previous section. The weights of the linear readout network are again

computed using Equation 2-4, and the corresponding output is shown in Figure 5-2 B. As seen from the figure, we obtain a good match between the output and the desired signal. Similar results can be obtained even if the system is initialized to a different initial condition, or if the time instant at which the ramp signal is applied changes (this changes the state value at the onset of the ramp). This experiment clearly demonstrates that a "transiently stable" ESN can do useful computation.
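A sketch of this demonstration, using the same weight distribution as the earlier experiments rescaled to a spectral radius of 1.1, a zero/ramp/zero input, and a random initial condition (the seed, segment lengths, and initial-condition range are choices made here):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 100
W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.9, 0.05, 0.05])
W *= 1.1 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 1.1
w_in = rng.choice([1.0, -1.0], size=N)

# Zero input for 100 steps, a ramp for 100 steps, then zero again.
u = np.concatenate([np.zeros(100), np.linspace(0.0, 1.0, 100), np.zeros(100)])

x = rng.uniform(-0.1, 0.1, N)      # random initial condition
states = np.empty((len(u), N))
for n in range(len(u)):
    x = np.tanh(w_in * u[n] + W @ x)
    states[n] = x

# Without the echo state condition the autonomous system (zero input)
# does not settle at zero; a sufficiently large input regularizes the
# states into the "transiently stable" regime.
```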

Understanding Transiently Stable Computation

We will utilize the linearization analysis proposed in chapter 2.2 in order to quantify the

local dynamics of the system. The pole movement of the ESN with a spectral radius of 0.9 is illustrated in Figure 5-3. According to the figure, the poles always stay inside the unit circle even as the input signal changes. As the input signal strengthens, the poles move toward the origin of the z-plane (decreasing the spectral radius), which results in a more stable system. This is due to the fact that higher input amplitudes saturate the nonlinear function, reducing the slope at which the system is operating. Similar observations were already made in chapter 2.2.

learning rule (a stable version of the Hebbian learning rule) while two binary patterns to be

stored are presented to the system successively. During testing, a noise corrupted version of one

of the two stored patterns is presented to the system. Each RKII set creates limit cycle

oscillations of low or high energy depending on the stored and the input pattern applied. Then,

the energy (averaged over a time interval) of the mitral KO set of each RKII set is computed and
compared to a threshold to decide the binary system output. We see two maj or shortcomings
with the energy based readout for FM. First, utilizing just the energy is wasteful since it ignores
the information embedded in the dynamics of the system state. Second, it does not exploit the
global interdependencies among the RKII sets since it is a local method computing the energy of
individual processing elements. Hence, the energy based readout can not be optimal and can not
be used for continuous dynamic patterns.
It is clear that a more principled approach is necessary in order to use FM to process
signals that are not necessarily binary or static. By analogy with the LSM/ESN framework,
we propose to use a readout which is a linear or nonlinear projection of all the RKII states. The
role of the readout is to extract the specified desired information from the FM states.


Figure 5-7. Freeman Model as an associative memory. A) Stored patterns. B) Recall of the
	corrupted patterns: the first, second, and third columns are the input to the KII
	network, the output before the threshold detector, and the output after the threshold
	detector, respectively.

We will also demonstrate that Freeman's K sets provide a complete system; that is to say,

the readout can also be chosen from the Freeman hierarchy, namely a K0 set or a KI network,

depending on the output layer dimension. When a linear readout is used, the adaptation of the

overall system reduces to a linear regression problem, which can be solved either offline by the

Wiener filter equation or online by a gradient-based learning rule. We will use the KI network and KII

network architectures analogous to the liquid in the LSM or the reservoir in the ESN. The use of the KIII

network as the dynamical reservoir is outside the scope of this study. In summary, with the

adoption of the LSM/ESN framework, the states of the FM are dynamical basis functionals

(representation space) created from nonlinear combinations of the input, and the readout simply

finds the best projection of the desired response in this representation space.
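Either training route for the linear readout can be sketched in a few lines. The snippet below uses synthetic reservoir states standing in for the RKII (or ESN) state traces; `X`, `d`, and all sizes are illustrative assumptions, not the dissertation's data. It solves the Wiener (least-squares) equation offline and runs an LMS-style gradient rule online toward the same solution.

```python
import numpy as np

rng = np.random.default_rng(2)
T, N = 500, 10
X = rng.standard_normal((T, N))          # stand-in for T samples of N reservoir states
w_true = rng.standard_normal(N)
d = X @ w_true + 0.01 * rng.standard_normal(T)   # synthetic desired response

# Offline: Wiener solution w = R^-1 p, with R the state autocorrelation
# matrix and p the cross-correlation between states and desired response.
R = X.T @ X / T
p = X.T @ d / T
w_wiener = np.linalg.solve(R, p)

# Online: LMS gradient descent converging toward the same readout weights.
w_lms = np.zeros(N)
eta = 0.05
for n in range(T):
    e = d[n] - X[n] @ w_lms              # instantaneous error
    w_lms += eta * e * X[n]              # stochastic gradient step
```

Both routes adapt only the readout weights; the dynamical reservoir generating `X` is never touched.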

We will provide two theorems for the echo state property of the Freeman RKII network.

The first theorem states sufficient conditions on the parameter values of an RKII set that result

in echo states. In the second theorem, we give a sufficient condition for the nonexistence of

echo states for all Freeman networks. Unfortunately, we do not have a general theoretical result

for sufficient conditions resulting in echo states for all Freeman networks in the standard form

given by Equation 5-3.

Theorem 4.1: Consider a reduced KII set with the governing equations given in the standard form, and let the

matrix W satisfy real(λmax) > 0, where λmax is the eigenvalue of W with the largest real part. Then the

network has an asymptotically unstable null state. This implies that it has no echo states for any

input set containing 0.

Proof: Consider a Freeman network with governing equations in standard form given by

Equation 5-4.

In state-space form, Equation 5-3 can be expressed as

x1' = x2
x2' = -(a + b) x2 - ab x1 + ab (W Q(x1) + W_in u)    (5-9)

where x1 and x2 are both N-dimensional vectors. For zero input, Equation 5-9 has the trivial null

solution. Now, consider the Jacobian matrix obtained by linearizing Equation 5-9 around the origin,
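As a sketch (assuming the nonlinearity Q acts elementwise with slope Q'(0) at the origin, consistent with Equation 5-9), the block Jacobian at the null state is

```latex
J = \begin{bmatrix} 0 & I \\ ab\left(Q'(0)\,W - I\right) & -(a+b)\,I \end{bmatrix}
```

The eigenvalues of J determine the local stability of the null state; an eigenvalue with positive real part makes the null state asymptotically unstable, as the theorem asserts.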

DYNAMICAL COMPUTATION WITH ECHO STATE NETWORKS

By

MUSTAFA CAN OZTURK

A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL OF THE UNIVERSITY OF FLORIDA IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

UNIVERSITY OF FLORIDA

2007

2007 Mustafa Can Ozturk

To my parents, Necmi Ozturk and Saliha Ozturk, who made everything I have achieved possible, for their enduring love, support and generosity.

ACKNOWLEDGEMENTS

First and foremost, I would like to thank my supervisory committee chair, Dr. Jose Principe, for his inspiring ideas, constant support, and guidance. His enthusiasm for life and imaginative way of understanding concepts inspired me to learn different aspects of science and engineering. Without him, this dissertation would not have been possible.

Secondly, I would like to thank Dr. John Harris, for serving on my supervisory committee and offering wise insights into academia and other aspects of life. I also wish to thank the other members of my supervisory committee, Dr. Arunava Benerjee and Dr. Thomas Demarse, for their insightful comments and suggestions.

During my studies, I gained much from the interactive environment at the Computational NeuroEngineering Laboratory. All of my colleagues helped me by sharing their knowledge and inspirational ideas. I especially thank Dongming Xu for his help and long hours of productive discussions. Most of this dissertation was a result of collaboration with Aysegul Gunduz, Nicholas Dedual, Rodrigo Sachhi, Sohan Seth, Johan Nyqvist, Kyu-Hwa Jeong, Ismail Uysal, Harsha Sathyendra, and Mark Skowronski. I would like to thank each one of them for their expertise and insightful comments.

I would like to specifically thank my wonderful friends, Jianwu Xu, Rui Yan, and Anant Hegde. Deep philosophical conversations and sleepless nights for projects are unforgettable. They were certainly more than friends to me. I will always remember them with great respect.

I would like to thank my big family for their constant support and trust in me. Everything I have achieved is possible only with their love, encouragement, and guidance.

LIST OF FIGURES

1-1 Delay line memory of order N
1-2 Dynamic network with a short-term memory structure followed by a static mapper
1-3 Generic structure for universal myopic mapping theorem, bank of filters followed by a static network
1-4 Echo state network (ESN). ESN is composed of two parts: a fixed-weight (W) recurrent network and a linear readout
2-1 Performance of ESNs for different realizations of W with the same weight distribution. Results show that spectral radius of W is not the unique parameter that determines the performance of an ESN
2-2 Pole tracks of the linearized ESN when the input goes through a cycle
2-3 Echo states and state entropy for different ESNs
2-4 The ASE values obtained from 50 realizations of ESNs with the same spectral radius
3-1 Comparison of ASE values obtained for ASE-ESN with U-ESN and randomly generated ESNs with different sparseness
3-2 The k-delay STM capacity of each ESN for delays 1, ..., 40 computed using the test signal
3-3 Number of wrong decisions made by each ESN for m = 3, ..., 8 in the binary parity check problem
4-1 Interpretation of echo states as a 2-D image
4-2 Performance comparisons of ESNs for the classification of signals with unknown peak values
4-3 Time-series representing the response pattern of the 32 electronic nose sensors exposed to rosemary
4-4 Comparison of the ESN-MACE filter output and MACE filter output trained on the input space for the rosemary class
4-5 Comparison of the correct classification rates of LSM-MACE and LSM with linear readout trained in the spike train classification problem as the parameter varies
5-1 Demonstration of a typical response of ESN with a spectral radius of 0.9 and application to function approximation

Abstract of Dissertation Presented to the Graduate School of the University of Florida in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

DYNAMICAL COMPUTATION WITH ECHO STATE NETWORKS

By

Mustafa Can Ozturk

May 2007

Chair: Jose C. Principe
Major: Electrical and Computer Engineering

Echo state networks (ESN) were recently proposed as a new recurrent neural network (RNN) paradigm. ESN couples dynamics with computation in novel ways by conceptually separating the RNN into two parts: a recurrent topology of nonlinear PEs that constitutes a reservoir of rich dynamics and an instantaneous linear readout. The interesting property of ESN is that only the memoryless readout is trained, whereas the recurrent topology has fixed connection weights. This reduces the complexity of RNN training to simple linear regression while preserving a recurrent topology, but obviously places important constraints on the overall architecture that have not yet been fully studied.

Present design of the fixed parameters of ESN relies on the selection of the maximum eigenvalue of the linearized system around zero (spectral radius). However, this procedure does not quantify in a systematic manner the performance of the ESN in terms of approximation error. In this study, we proposed a functional space approximation framework to better understand the operation of ESNs, and proposed an information-theoretic metric (the average entropy of echo states) to assess the richness of the ESN dynamics. We also provided an interpretation of the ESN dynamics rooted in system theory as families of coupled linearized systems whose poles move according to the input signal dynamics. With this interpretation, we put forward a design

methodology for functional approximation where ESNs are designed with uniform pole distributions covering the frequency spectrum to abide by the richness metric, irrespective of the spectral radius. A single bias parameter at the ESN input, adapted with the modeling error, configures the ESN spectral radius to the input-output joint space. Function approximation examples compare the proposed design methodology versus the conventional design.

On further investigating the use of ESNs for dynamical pattern recognition, we postulated that ESNs are particularly well suited for dynamical pattern recognition, and we proposed a linear associative memory (LAM) as a novel readout for ESNs. From the class of LAMs, we adopted the minimum average correlation energy (MACE) filter because of its high rejection characteristics that allow its use as a detector in the automatic pattern recognition literature. In the ESN application, the MACE interprets the states of the ESN as a two-dimensional image: one dimension is time and the other dimension is the processing element index (space). An optimal template image for each class, which associates ESN states with the class label, can be analytically computed using training data. During testing, ESN states were correlated with each template image, and the class label of the template with the highest correlation is assigned to the input pattern. The ESN-MACE combination leads to a nonlinear template matcher with robust noise performance, as needed in non-Gaussian, nonlinear digital communication channels. We used a real-world data experiment for chemical sensing with an electronic nose to demonstrate the power of this approach. The proposed readout can also be used with liquid state machines, eliminating the need to convert spike trains into continuous signals by binning or low-pass filtering.
We applied ESN to interesting real-world problems such as brain machine interface design, water inflow prediction, detection of action potentials in neural recordings, matched

filtering in digital communications, and channel equalization of a nonlinear channel, and compared its performance to other standard techniques. We proposed ESNs for signal processing in the complex domain. The use of ESNs in the complex domain is very convenient since system training is equivalent to simple linear regression, which is trivial in the complex domain. The derivatives of the nonlinear activation functions are never necessary since the recurrent part is fixed a priori.

We showed that the Freeman model of the olfactory cortex can be considered in the same framework as ESN. This work provided two important contributions to Freeman networks. First, it emphasized the need for optimal readouts, and showed how to adaptively derive them. Second, it showed that the Freeman model is able to process continuous signals with temporal structure.

We also investigated the dynamics of ESNs without the echo state condition. This investigation led to a novel computational mode for nonlinear systems with sigmoidal nonlinearity, which does not require global stability. In this mode, although the autonomous system is unstable, the input signal forces the system dynamics to become transiently stable. Function approximation experiments showed that the transiently stable ESN without the echo state condition is still capable of useful computation.

CHAPTER 1
INTRODUCTION

Literature Review of Dynamic Computation

Static networks are those whose response depends only on the present stimulus. Multilayer perceptrons (MLPs), radial basis function (RBF), and self-organizing map (SOM) networks are examples of static neural network models that deal with static input patterns. These models have been well studied in function spaces, for regression and classification, using statistical principles (Haykin, 1998). On the other hand, time is an essential ingredient in many real-world problems, from cognitive tasks (such as vision and speech processing) to engineering problems (such as system identification, noise cancellation, and channel equalization). These real-world problems require the extraction of information embedded in the temporal structure of signals. To incorporate time into the operation of a network, memory must be built into it. To be precise, memory here stands for short-term memory, the ability to remember the recent past, which provides the contextual information from the history of the time series to unambiguously identify the current stimulus.

Short-term Memory Structures

The general linear discrete-time memory structure of order N (Figure 1-1 A) is also known as the generalized tapped delay line memory: each block has the transfer function G(z) or impulse response g(n) (Haykin, 2001). This is a single-input multiple-output system, and the taps of the system are filtered versions of the input signal. Basically, the memory reconstructs the state of the system that created the time series in a sufficiently large state space where time is implicit, creating a trajectory that has a one-to-one correspondence (with a unique inverse) to the time series.

Figure 1-1. Delay line memory of order N. A) Generalized tapped delay line. B) Tapped delay line.

The quality of a memory structure is measured by two quantities: depth (D) and resolution (R) (Principe et al., 1993). Memory depth measures how long the memory can hold information about the past of a time series. Memory resolution is the amount of detail in the information the memory holds. Formally, memory depth is defined as

D = Σ_{n=0}^{∞} n g_N(n)

where g_N(n) is the overall impulse response of the memory, obtained from N successive convolutions of g(n). Resolution is defined as the number of taps in the memory structure per unit time. For a linear memory structure of fixed taps, the product of memory depth and resolution is

a constant equal to the number of taps N. This creates a tradeoff between memory depth and resolution, and the choice of G(z) determines the values of R and D in this tradeoff (Principe et al., 1993).

The time delay embedding (Takens, 1981), which can be implemented by a simple delay line, is the most commonly used form of short-term memory (Figure 1-1 B). The use of time delays for memory is also biologically motivated, since signal delays are ubiquitous in the brain and essential in biological information processing (Braitenberg, 1990). The memory depth and resolution for the tapped delay line are N and 1, respectively.

This restriction is relaxed in the gamma filter, which makes use of local feedback loops to create short-term memory (Principe et al., 1993; de Vries, 1991). Each section of the gamma memory has the transfer function

G(z) = μ z^{-1} / (1 - (1 - μ) z^{-1})

where μ is the adjustable feedback parameter. Note that the gamma memory reduces to the tapped delay line for μ = 1. With the introduction of the feedback loop, stability becomes an issue. However, the overall system stability is easily guaranteed when each local feedback is stable (i.e., 0 < μ < 2). The memory depth for the gamma memory is N/μ and the resolution is μ. This means that the memory depth can be improved by choosing the feedback parameter less than unity. However, this also reduces the resolution. Hence, in the gamma memory, unlike the tapped delay line, the memory depth can be improved without increasing the number of taps, by sacrificing resolution (Principe et al., 1993; de Vries, 1991).

Although the tapped delay line and the gamma filter are the most common forms of short-term memory structures, other structures such as the leaky integrator memory (context nodes or memory neurons) and the Laguerre filter are also available in the literature. The time alignment filter, which is

obtained from the gamma memory by allowing nonhomogeneous μ values in different taps, extends the structure in Figure 1-1 A by allowing tap-dependent modulations of memory depth and resolution (de Vries, 1991).

Network Architectures for Dynamical Computation

There are basically two ways of creating a dynamic network with short-term memory. The first one is to stimulate a static network (e.g., linear mapper, MLP) via a memory structure at its input, such as a tapped delay line or gamma memory (Haykin, 1998). This type of dynamical network is commonly used in the literature. However, there are several practical drawbacks to this approach, which uses a spatial metaphor for time, especially when the dynamics of the time series are rich and time varying. The alternative approach for creating memory in an implicit manner is through the use of feedback. Feedback can be local at the single neuronal level or can be global, covering the whole network. In the neural networks literature, networks with global feedback loops are referred to as recurrent neural networks (RNN). The use of feedback provides very powerful systems with rich dynamical behavior, whereas it also brings in practical problems such as stability and training complexity (Haykin, 1998).

Figure 1-2. Dynamic network with a short-term memory (tapped delay line, gamma memory, etc.) structure followed by a static mapper (linear mapper, MLP, RBF, etc.)
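The gamma-memory depth–resolution tradeoff described earlier is easy to verify numerically. The sketch below (our illustration, with arbitrary N and μ) builds the overall impulse response of a cascade of N gamma stages G(z) = μz⁻¹/(1 − (1 − μ)z⁻¹) and computes the depth D = Σₙ n·g_N(n); for μ = 1 the cascade degenerates to a tapped delay line with D = N, while μ = 0.5 doubles the depth to N/μ without adding taps.

```python
import numpy as np

def gamma_impulse_response(N, mu, T=400):
    """Overall impulse response g_N(n) of N cascaded gamma stages,
    each realizing y(n) = (1 - mu) * y(n-1) + mu * x(n-1)."""
    sig = np.zeros(T)
    sig[0] = 1.0                          # unit impulse into the first stage
    for _ in range(N):
        out = np.zeros(T)
        for n in range(1, T):
            out[n] = (1 - mu) * out[n - 1] + mu * sig[n - 1]
        sig = out
    return sig

def memory_depth(g):
    """Mean of the impulse response: D = sum_n n*g(n) / sum_n g(n)."""
    return np.sum(np.arange(len(g)) * g) / np.sum(g)

N = 4
depths = {mu: memory_depth(gamma_impulse_response(N, mu)) for mu in (1.0, 0.5)}
# depths[1.0] recovers N (tapped delay line); depths[0.5] recovers N/0.5 = 2N
```

The resolution drops to μ in exchange, exactly the tradeoff stated in the text.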

First we will consider the explicit representation of time, where a static network (linear or nonlinear) is transformed into a dynamic network by a preprocessor which is a short-term memory structure. In these focused structures, the memory is restricted to the input layer of the network (Sandberg and Xu, 1997; Haykin, 1998). There is a clear separation of responsibilities in this approach of building a dynamical system, where the memory represents time and the static network accounts for the mapping (Figure 1-2). Basically, the memory reconstructs the state of the system that created the time series in a sufficiently large state space, and the static network maps the reconstructed space to the desired signal space. Moreover, the well-developed techniques of static network training, such as least mean squares (LMS) or the static back-propagation algorithm, can be applied to the dynamic network.

The simplest structure for temporal processing in the form of Figure 1-2 is the finite impulse response (FIR) filter, where the memory is a tapped delay line and the static network is a linear mapper. It has been widely used in adaptive signal processing applications due to its simplicity and the existence of effective learning algorithms (Haykin, 2001). The gamma filter is obtained when the memory structure in the FIR is replaced by the gamma memory (Principe et al., 1993; de Vries, 1991). Chains of first order integrators are interesting because they effectively decrease the number of delays necessary to create embeddings. The extra degree of freedom on the memory depth, gained from local feedback, has been fruitfully utilized in applications like system identification, echo cancellation, and speech classification (de Vries, 1991).

The first successful demonstration of a neural network for temporal processing was NETtalk, devised by Sejnowski and Rosenberg in 1987 (Sejnowski and Rosenberg, 1997). They used a system based on an MLP to convert English speech to phonemes.
A more popular network is the time delay neural network (TDNN) that replaces the linear mapper in the FIR with

a more powerful nonlinear network, an MLP (Haykin, 1998). Lang and Hinton successfully applied the TDNN to the recognition of 4 isolated words: bee, dee, ee, and vee (Lang and Hinton, 1998). The TDNN has become one of the most popular neural network architectures for temporal processing and has been successfully applied to time series prediction, system identification, control, and speech processing.

The structure of the TDNN can be generalized as shown in Figure 1-3. The first block is a bank of linear filters operating in parallel on the input stimulus, whereas the second block implements a static nonlinear network. This structure is a universal dynamic mapper (Sandberg and Xu, 1997, page 477) according to the universal myopic mapping theorem: any shift-invariant myopic dynamic map can be uniformly approximated arbitrarily well by a structure consisting of two functional blocks: a bank of linear filters feeding a static network. The TDNN is a special case of the structure described in this theorem and is also a universal mapper.

However, there are several drawbacks to focused structures which convert time series to spatial signals (Elman, 1990). First of all, such a structure requires an interface to the external world which buffers the input for further processing. Secondly, the selection of the length of the buffer is not trivial, and once selected it imposes a firm limit on the memory depth. Moreover, buffering requires all the patterns to have the same length, which is especially problematic in areas like speech processing where speech signals of different lengths have to be processed (Elman, 1990). Finally, two instances of the same time series that are very similar can be very dissimilar in the reconstructed space (spatially distant) (Elman, 1990).

Figure 1-3. Generic structure for the universal myopic mapping theorem: a bank of filters followed by a static network.

An alternative approach for creating memory is by means of recurrent connections within the network. In this approach, the system itself is dynamic, and time, integrated into the system, is represented by the effect it has on processing. Such networks are particularly powerful for dealing with time-varying patterns and are practical for problems where the dynamics of the considered process are complex. Various RNN architectures are available in the literature. One of the earlier architectures, proposed by Jordan, uses memory units that are fed from the system output to provide the contextual information (Jordan, 1986). The Elman network is a modification of Jordan's, where the context units keep a copy of the hidden layer activations for the next update (Elman, 1990). A fully connected recurrent neural network contains several feedback loops between the processing elements (PEs) in the hidden layer (Figure 1-4). RNNs can exhibit rich dynamical responses such as oscillations and even chaos. They have been widely used in many applications such as system identification and control of dynamical systems (Delgado et al., 1995; Feldkamp et al., 1998; Kechriotis et al., 1994; Principe, 2001). The computational power of fully connected RNNs has been mathematically proven by Siegelman and Sontag (Siegelman and Sontag, 1991): all Turing machines may be simulated by fully connected recurrent networks built on neurons with sigmoid activation functions. However, the main problem with RNNs is the difficulty of adapting the system weights. Various algorithms, such as back propagation through time (Werbos,

1990) and real time recurrent learning (Williams and Zipser, 1989), have been proposed to train recurrent systems. However, these algorithms suffer from a variety of problems: computational complexity resulting in slow training, complex performance surfaces, the possibility of instability, and the decay of gradients through the topology and time (Haykin, 1998). The problem of decaying gradients has been addressed with special PEs (Hochreiter and Schmidhuber, 1997). Alternative second order training methods based on extended Kalman filtering (Singhal and Wu, 1989; Puskorius and Feldkamp, 1996) and the multi-streaming training approach (Feldkamp et al., 1998) provide more reliable performance and have enabled practical applications.

Literature Review of Echo State Networks

Recently, a new recurrent network paradigm has been proposed by Jaeger under the name of echo state networks (ESN) (Jaeger, 2001; Jaeger, 2002; Jaeger and Haas, 2004). ESNs aim at addressing the problems with RNN training by separating the RNN architecture into two parts: a fixed recurrent topology acting as a dynamical reservoir and an adaptive memoryless readout network. The input signal is fed to the dynamical reservoir, which contains information about the history of the input and/or output patterns when properly dimensioned (Figure 1-4). The outputs of the internal PEs (echo states) are fed to a memoryless but adaptive readout network (generally linear) that reads the reservoir and produces the network output. The interesting property of ESN is that only the memoryless readout is trained, whereas the recurrent topology (W) has fixed connection weights. This reduces the complexity of RNN training to simple linear regression while preserving a recurrent topology, but obviously places important constraints on the overall architecture that have not yet been fully studied.
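The two-part ESN architecture can be summarized in a short sketch (ours, not Jaeger's reference code): a fixed sparse random reservoir scaled to a chosen spectral radius, followed by a linear readout fitted by least squares. Reservoir size, sparsity, scaling, and the delayed-input target task are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
N, T = 50, 1000

# Fixed reservoir: sparse random weights rescaled to spectral radius 0.9.
W = rng.standard_normal((N, N)) * (rng.random((N, N)) < 0.2)
W *= 0.9 / max(abs(np.linalg.eigvals(W)))
W_in = rng.uniform(-0.5, 0.5, N)

# Run the reservoir: x(n) = tanh(W x(n-1) + W_in u(n)).
u = rng.uniform(-1, 1, T)
states = np.zeros((T, N))
x = np.zeros(N)
for n in range(T):
    x = np.tanh(W @ x + W_in * u[n])
    states[n] = x

# Train only the readout (Equation 1-2, y = W_out x) by linear regression,
# discarding an initial washout where the initial state still matters.
d = np.roll(u, 2)                 # toy target: recall the input from 2 steps ago
washout = 50
X, y = states[washout:], d[washout:]
W_out = np.linalg.lstsq(X, y, rcond=None)[0]
mse = np.mean((X @ W_out - y) ** 2)
```

Only `W_out` is adapted; `W` and `W_in` stay fixed, which is exactly why ESN training reduces to least squares.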

The readout is mostly a simple linear regressor network whose output is computed according to

y(n+1) = W_out x(n+1)    (1-2)

When direct connections from the input units to the output units are used, the state vector is concatenated with the input vector to calculate the output.

A basic necessary property for the ESN reservoir is the echo state property, which states that for the ESN learning principle to work, the reservoir must asymptotically forget the input history (input forgetting property). It has been shown in (Jaeger, 2001) that the input forgetting property is equivalent to state forgetting; that is, the reservoir must forget its initial state after a sufficiently long time. The echo state condition can be linked to the spectral radius (the largest among the absolute values of the eigenvalues of a matrix, denoted by ||.||) of the reservoir's weight matrix (||W|| < 1). In fact, this condition states that the dynamics of the ESN is uniquely controlled by the input and the effect of the initial states vanishes.

The design of the fixed connections of an ESN is based on the selection of the spectral radius. The reservoir is randomly and sparsely connected with a spectral radius suitable for the given problem. Sparse connectivity allows less coupling between the PEs, encouraging the development of individual dynamics and increasing diversity in the overall reservoir dynamics.

Thesis Goals

ESNs, although recently proposed, have proven very successful in applications such as system identification, chaotic time series prediction, and channel equalization. However, the state of the art is still immature. We see a few major shortcomings with the current ESN approach. First, the characterization of the reservoir properties is poorly understood. It is a mystery how a network with mostly random connections can be successful in solving difficult problems. A better understanding of the operation of the dynamical reservoir is vital for improving

ESN theory. Second, imposing a constraint only on the spectral radius for ESN design is a weak condition to properly set the parameters of the reservoir, as experiments show (different randomizations with the same spectral radius perform differently on the same problem). Third, there is a fundamental problem with the way ESNs operate. The impact of fixed reservoir parameters for function approximation means that the information about the desired response is only conveyed to the output projection. This is not optimal, and strategies to select different reservoirs for different applications have not been devised.

We aim to address these problems by proposing a framework, a metric, and a design principle for ESNs. We first deal with the analysis of ESNs and explain the proposed framework and metric. The framework is a signal processing interpretation of bases and projections in functional spaces to describe and understand the ESN architecture. According to this interpretation, the reservoir states implement a set of basis functionals (representation space) constructed dynamically by the input, while the readout simply projects the desired response onto this representation space. The metric to describe the richness of the ESN dynamics is an information theoretic quantity, the average state entropy (ASE). Entropy measures the amount of information contained in a given random variable (Shannon, 1948). Here, the random variable is the instantaneous echo state, from which the entropy for the overall state (vector) is estimated. Due to the time dependency of the states, the state entropy averaged over time (ASE) will be used as the quantifier to measure the richness of the reservoir dynamics. We also interpret the ESN dynamics as a combination of time varying linear systems obtained from the linearization of the ESN nonlinear PE in a small local neighborhood of the current state.

We then propose a design methodology for the ESN reservoir in the light of the analysis tools developed for ESNs. According to this design principle, one should consider independently
We then propose a design methodology for the ESN reservoir in the li ght of the analysis tools developed for ESNs. According to this desi gn principle, one should consider independently

the correlation among the bases and the spectral radius. In the absence of any information about the desired response, the ESN reservoir states should be designed with the highest ASE, independent of the spectral radius. The poles of the linearized ESN reservoir should have uniform pole distributions to generate echo states with the most diverse pole locations (which correspond to the uniformity of time constants). Effectively, this will create the least correlated bases for a given spectral radius, which corresponds to the largest volume spanned by the basis set. When the designer has no other information about the desired response to set the bases, this principle distributes the system's degrees of freedom uniformly in space. It approximates for ESNs the well known property of orthogonal bases. The unresolved issue that ASE does not quantify is how to set the spectral radius, which depends again upon the desired mapping. A simple adaptive bias is added at the ESN input to control the spectral radius, integrating the information from the input-output joint space into the ESN bases. For sigmoidal PEs, the bias adjusts the operating points of the reservoir PEs, which has the net effect of adjusting the volume of the state manifold as required to approximate the desired response with a small error. We show that ESNs designed with this strategy obtain systematically better results in a set of experiments when compared with the conventional ESN design.

We discuss the readouts for ESNs for function approximation and pattern recognition. The standard linear readout is used for function approximation tasks such as system identification, time series prediction, and channel equalization. We thoroughly investigate the use of ESN for temporal pattern recognition. The standard linear readout can be used for pattern recognition with ESNs by minimizing the mean-square error (MSE) with respect to a label.
Classification of time series differs from classification of static patterns, since there is structure between the samples of the time series over time. The dynamical classification problem can be reduced to
a static one by treating each sample of the time series individually. In this case, the time series is not treated as a whole and the temporal information between samples is ignored. Moreover, one class label is associated with each time instant; therefore, post-classification such as majority voting has to be applied. An alternative to this approach is to embed the time series in order to populate a short-term history of the pattern; again, a static mapper can then be used to classify the embedded pattern. Another major difficulty in dynamical pattern recognition is how to design the label, which should also be a time series. One straightforward method is to use +1 or -1 as the desired response throughout the application of the input pattern. However, the difficulty with this method is that the desired response is forced to be constant, independent of the input signal dynamics. An alternative, powerful technique is to create a one-step predictor for each class in order to capture the dynamics of the class generating the time series (Zahalka and Principe, 1993). During testing, the test pattern is presented to all predictors, and the label of the predictor with the least prediction error is assigned as the class of the input pattern.

We also propose an alternative readout for ESNs to be used in classification tasks that does not require a desired response. The goal is to design an ESN/LSM readout that will recognize a class of inputs that differ by some quantity (e.g., amplitude or shape considered as a distortion parameter). The proposed readout, called the minimum average correlation energy (MACE) filter, is adopted from the optical pattern recognition literature, where it is used for the recognition of a given object in 2-dimensional images in the presence of noise, geometric distortions, and pose changes (Mahalanobis et al., 1987).
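The one-step-predictor labeling scheme described above can be sketched as follows. This is a toy illustration with plain linear predictors and sinusoidal class prototypes of our own choosing (none of the names or signals come from the dissertation); a test series is assigned to the class whose predictor yields the least prediction error.

```python
import numpy as np

def fit_predictor(series, order=3):
    """Least-squares linear one-step predictor of the given order."""
    A = np.column_stack([series[i:len(series) - order + i] for i in range(order)])
    b = series[order:]
    w, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w

def prediction_error(series, w):
    """Mean-square one-step prediction error of predictor w on a series."""
    order = len(w)
    A = np.column_stack([series[i:len(series) - order + i] for i in range(order)])
    return np.mean((A @ w - series[order:]) ** 2)

t = np.arange(200)
classes = {0: np.sin(0.1 * t), 1: np.sin(0.3 * t)}     # toy class prototypes
models = {c: fit_predictor(s) for c, s in classes.items()}

# A phase-shifted example of class 1: its dynamics (frequency) match model 1,
# so model 1 predicts it with near-zero error while model 0 does not.
test = np.sin(0.3 * t + 0.5)
label = min(models, key=lambda c: prediction_error(test, models[c]))
```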
Instead of using a desired response, the MACE filter creates a template from the echo state responses corresponding to the training patterns of a given class. A MACE is trained for each class. During testing, the unknown time signal is fed to the ESN and
the states are correlated with each template image, and the class label of the template with the highest correlation is assigned to the input pattern.

We then investigate the implications of the ESN idea for other networks. First, we investigate the echo state condition (|W| < 1) in a system-theoretic framework, to understand the effect of system stability on the power of dynamical systems for computation. For this purpose, we relax the echo state condition by allowing it to be slightly larger than 1. This introduces a new dynamical regime, called transiently stable computation, where the system is autonomously unstable but is stabilized by an input signal of sufficient power. In this regime, function approximation with a linear readout is still possible even though the system is not globally stable. Secondly, we investigate a biologically plausible model of the olfactory cortex, the Freeman Model (FM), and propose a readout for FM so that it can be used as a universal computer. Without the proposed readout, the use of FM is limited to simple digit recognition, where the patterns are static and binary. We demonstrate with experiments that FM coupled with a readout can process continuous-valued signals. An interesting property of FM is the nonlinear function used in the model: unlike the sigmoidal nonlinearity used in ESNs, the FM nonlinearity does not have its largest slope at the zero operating point.

Various applications of ESNs to real-world problems are investigated. We first utilize ESNs for signal processing in the complex domain, and compare the complex ESN with other models for complex signal processing on a nonlinear channel equalization problem in a digital communication channel. Secondly, we propose ESNs to model brain-machine interfaces and compare them with the linear Wiener filter. Brain-machine interfaces aim at predicting the hand position of a primate using brain signals.
Thirdly, we apply ESNs to predict the water inflow at a hydroelectric power plant using the previous values of the inflow levels. Then, we tackle two
temporal pattern recognition problems using ESNs with the MACE filter readout. In the spike detection problem, the goal is to detect action potentials in a noisy neural recording. The second problem deals with the design of a detector similar to a matched filter in a communication channel. We compare the ESN-MACE filter to the matched filter in a baseband additive noise channel under different noise distributions.


CHAPTER 2
ANALYSIS OF ECHO STATE NETWORKS

ESN is a recurrent network paradigm described by a recurrently connected reservoir network stimulated by an input signal, and an instantaneous, adaptive readout that combines the reservoir states to form the desired signal. ESNs have been shown to be very powerful in many applications, such as chaotic time-series prediction, channel equalization, and speech recognition. However, it remains a mystery how a network with mostly random connections can be so successful in solving difficult problems. This chapter aims at elucidating this basic mystery by analyzing the operation of ESNs. The first subsection proposes a framework, a signal processing interpretation of basis functions and projections in functional spaces, to describe and understand the ESN architecture. In the second subsection, we interpret the ESN dynamics as a combination of linear systems; according to this interpretation, as the system operating point varies over time under the influence of the input signal, the local ESN dynamics change. The third subsection introduces an information-theoretic metric, called average state entropy (ASE), to quantify the computational power of a dynamical network for function approximation. The ASE metric, combined with the idea of bases and projections, leads to a design procedure for ESNs that will be discussed in the next chapter.

Echo States as Bases and Projections

Let us revisit the recursive update equation of an ESN. The activation of the internal PEs is updated according to

x(n+1) = f( W^in u(n+1) + W x(n) + W^back y(n) )    (2-1)

where f = (f_1, f_2, ..., f_N) are the activation functions of the internal units. The output of the linear readout network is computed according to

y(n+1) = W^out x(n+1)    (2-2)
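Equations 2-1 and 2-2 can be sketched in a few lines of NumPy. This is a minimal illustration with our own variable names and seeding; the weight values mimic an experiment later in this chapter, the output feedback term W^back y(n) of Equation 2-1 is dropped for simplicity, and the cubic target is an arbitrary stand-in.

```python
import numpy as np

rng = np.random.default_rng(0)
N, T = 100, 300                      # reservoir size, simulation length

# Fixed, randomly generated weights (never trained in an ESN)
W_in = rng.choice([-1.0, 1.0], size=N)
W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.95, 0.025, 0.025])

u = np.sin(2 * np.pi * np.arange(T) / 10)   # an arbitrary test input
X = np.zeros((T, N))                         # echo states, one row per time step
x = np.zeros(N)
for n in range(T):
    # Equation 2-1 with f = tanh and no output feedback:
    # x(n+1) = f(W^in u(n+1) + W x(n))
    x = np.tanh(W_in * u[n] + W @ x)
    X[n] = x

# Equation 2-2: the (trainable) linear readout y(n+1) = W^out x(n+1),
# here fit by least squares to a stand-in target d(n) = u(n)^3.
d = u ** 3
w_out, *_ = np.linalg.lstsq(X, d, rcond=None)
y = X @ w_out
```

Only `w_out` is adapted; `W_in` and `W` stay fixed, which is the defining property of the ESN paradigm.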


ESNs resemble the RNN architecture proposed in (Puskorius and Feldkamp, 1994) and also used by (Sanchez, 2004) in brain-machine interfaces. The critical difference is the dimensionality of the hidden recurrent PE layer and the adaptation of the recurrent weights. We submit that the ideas of approximation theory in functional spaces (bases and projections), so useful in optimal signal processing (Principe, 2001), should be utilized to understand the ESN architecture.

Let h(u(t)) be a real-valued function of a real-valued vector u(t) = [u_1(t), u_2(t), ..., u_M(t)]. In functional approximation, the goal is to estimate the behavior of h(u(t)) as a combination of simpler functions φ_i(t), called the basis functionals, such that its approximant ĥ(u(t)) is given by

ĥ(u(t)) = Σ_{i=1}^{N} a_i φ_i(t)

Here, the a_i are the projections of h(u(t)) onto each basis functional. One of the central questions in practical functional approximation is how to choose the set of bases to approximate a given desired signal. In signal processing, the choice normally goes to a complete set of orthogonal bases, independent of the input. When the basis set is complete and can be made as large as required, fixed bases work wonders (e.g., Fourier decompositions). In neural computing, the basic idea is instead to derive the set of bases from the input signal through a multilayered architecture. Consider a single-hidden-layer TDNN with N PEs and a linear output. The hidden-layer PE outputs can be considered a set of non-orthogonal basis functionals dependent upon the input,

φ_i(u(t)) = g( Σ_j b_ij u_j(t) )

Here, the b_ij are the input-layer weights and g is the PE nonlinearity. The approximation produced by the TDNN is then
ĥ(u(t)) = Σ_{i=1}^{N} a_i φ_i(u(t))    (2-3)

where the a_i are the weights of the output layer. Notice that the b_ij adapt the bases and the a_i adapt the projection in the projection space (readout). Here the goal is to restrict the number of bases (the number of hidden-layer PEs), because their number is coupled with the number of parameters to adapt, which impacts generalization, training-set size, etc. Usually, since all of the parameters of the network are adapted, the best bases in the joint (input and desired) space as well as the best projection can be achieved, representing the optimal solution. The output of the TDNN is a linear combination of its internal representations, but to achieve a basis set (even if non-orthogonal), linear independence among the φ_i(u(t)) must be enforced. The effect of the nonlinear transformation of the hidden PEs on the input signal has been investigated by many authors. Oh has shown that the correlation among the weighted sums decreases after they pass through the sigmoid nonlinearity, which can be approximated by piecewise linear functions (Oh and Lee, 1994). Shah and Poon showed that there exist weight values for an MLP such that the outputs of its hidden-layer PEs are linearly independent and hence form a complete set of basis functions (Shah and Poon, 1999).

The ESN (and the RNN) architecture can also be studied in this framework. The reservoir states of Equation 2-1 correspond to the basis set, which is recursively computed from the input, output, and previous states through W^in, W, and W^back. Notice, however, that none of these weight matrices is adapted; that is, the functional bases in the ESN are uniquely defined by the input and the initial selection of weights. In a sense, ESNs trade the adaptive connections of the RNN hidden layer for a brute-force approach of creating fixed, diversified dynamics in the hidden layer.


For an ESN with a linear readout network, the output equation (y(n+1) = W^out x(n+1)) has the same form as Equation 2-3, where the φ_i(u(t)) and the a_i are replaced by the echo states and the readout weights, respectively. These weights are adapted on the training data, which means that the ESN is able to find the optimal projection in the projection space, just like the RNN or the TDNN.

It is interesting that a similar perspective of bases and projections for information processing in biological networks has been proposed by (Pouget and Sejnowski, 1997). They explored the possibility that the responses of single neurons in parietal cortex serve as basis functions for the transformations from the sensory input to the motor responses. They proposed that "the role of spatial representations is to code the sensory inputs and posture signals in a format that simplifies subsequent computation, particularly in the generation of motor commands" (Pouget and Sejnowski, 1997).

The central issue in ESN design is exactly the non-adaptive nature of the basis set. Parameter sets in the reservoir that provide linearly independent states and possess a given spectral radius may define drastically different projection spaces, because the correlation among the bases is not constrained. A simple experiment was designed to demonstrate that selecting the echo state parameters by constraining only the spectral radius is not the most suitable strategy for function approximation. Consider an echo state network of 100 units where the input signal is sin(2πn/(10π)). Mimicking (Jaeger, 2001), the goal is to let the ESN generate the seventh power of the input signal. Different realizations of a randomly connected 100-unit ESN were constructed, where the entries of W are set to 0.4, -0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, respectively. This corresponds to a spectral radius of 0.88. Input weights are set to +1 or -1 with equal probability, and W^back is set to zero.
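This experiment can be sketched as follows. It is a rough reproduction under our own naming and seeding, not the original code: the readout is trained with the least-squares (Wiener) solution, with a tiny ridge term added purely for numerical conditioning, and the variability of the MSE across realizations of W is what the experiment demonstrates.

```python
import numpy as np

def run_trial(seed, N=100, T=300):
    """One realization: sparse random W, sinusoidal input, 7th-power target."""
    rng = np.random.default_rng(seed)
    W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.95, 0.025, 0.025])
    W_in = rng.choice([-1.0, 1.0], size=N)
    u = np.sin(np.arange(T) / 5.0)        # sin(2*pi*n/(10*pi)) = sin(n/5)
    d = u ** 7
    X = np.zeros((T, N))
    x = np.zeros(N)
    for n in range(T):
        x = np.tanh(W_in * u[n] + W @ x)  # Equation 2-1, no output feedback
        X[n] = x
    # Least-squares readout: W_out = E[x x^T]^{-1} E[x d] (sample estimates),
    # with a tiny ridge term for conditioning
    R = X.T @ X / T
    p = X.T @ d / T
    w_out = np.linalg.solve(R + 1e-8 * np.eye(N), p)
    return np.mean((X @ w_out - d) ** 2)

mses = [run_trial(s) for s in range(10)]
# Same weight distribution, yet the MSE varies across realizations of W
spread = max(mses) / min(mses)
```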
The input is applied for 300 time steps and
the echo states are calculated using Equation 2-1. The next step is to train the linear readout when a desired training signal d(n) is available. One method of determining the output weight matrix W^out that is optimal in the mean-square-error sense, where

MSE = (1/2) E[ (d(n) - y(n))^T (d(n) - y(n)) ],

is to use the Wiener solution (Haykin, 2001)

W^out = E[x x^T]^{-1} E[x d] ≈ ( (1/T) Σ_{n=0}^{T} x(n) x(n)^H )^{-1} ( (1/T) Σ_{n=0}^{T} x(n) d(n) )    (2-4)

Here, E[.], T, and (.)^H denote the expected-value operator, the length of the training sequence, and the conjugate transpose of a complex vector, respectively.

Figure 2-1. Performance of ESNs for different realizations of W with the same weight distribution. The weight values are set to 0.4, -0.4, and 0 with probabilities of 0.025, 0.025, and 0.95. All realizations have the same spectral radius of 0.88. Over the 50 realizations, the MSEs vary from 5.9x10^-9 to 8.9x10^-5. The results show that the spectral radius of W is not the only parameter that determines the performance of an ESN.

Figure 2-1 depicts the MSE values for 50 different realizations of the ESNs. As observed, even though each ESN has the same sparseness and spectral radius, the MSE values obtained vary greatly among the different realizations. The
minimum MSE value obtained among the 50 realizations is 5.9x10^-9, whereas the maximum MSE is 8.9x10^-5. This experiment demonstrates that a design strategy based solely on the spectral radius is not sufficient to specify the system architecture for function approximation: for each set of random weights that provides the same spectral radius, the correlation or degree of redundancy among the bases will change, and different performances are encountered in practice.

ESN Dynamics as a Combination of Linear Systems

It is well known that the dynamics of a nonlinear system can be approximated by those of a linear system in a small neighborhood of an equilibrium point (Kuznetsov, 1998). Here, we perform the analysis with hyperbolic tangent PEs and approximate the ESN dynamics by the dynamics of the linearized system in the neighborhood of the current system state. Hence, as the system operating point varies over time, the linear system approximating the ESN dynamics changes. We are particularly interested in the movement of the poles of the linearized ESN. Consider the update equation for the ESN without output feedback,

x(n+1) = f( W^in u(n+1) + W x(n) )

Linearizing the system around the current state x(n), one obtains the Jacobian matrix

J(n+1) = F(n) W,  where F(n) = diag( f'(net_1(n)), f'(net_2(n)), ..., f'(net_N(n)) )    (2-5)
Here, net_i(n) is the ith entry of the vector W^in u(n+1) + W x(n), and w_ij denotes the (i,j)th entry of W. The poles of the linearized system at time n+1 are given by the eigenvalues of the Jacobian matrix J(n+1). As the amplitude of each PE input changes, the local slope changes, and so the poles of the linearized system are time-varying, even though the parameters of the ESN are fixed.

In order to visualize the movement of the poles, consider an ESN with 100 states. The entries of the internal weight matrix are chosen to be 0, 0.4, and -0.4 with probabilities 0.9, 0.05, and 0.05, and W is scaled such that a spectral radius of 0.95 is obtained. Input weights are set to +1 or -1 with equal probability. A sinusoidal signal with a period of 100 is fed to the system, and the echo states are computed according to Equation 2-1. Then, the Jacobian matrix and its eigenvalues are calculated using Equation 2-5. Figure 2-2 shows the pole tracks of the linearized ESN for different input values. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants, which modulates the richness of the reservoir of dynamics as a function of the input amplitude. Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreasing the spectral radius), which results in a system with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. Compared to its linear counterpart, an ESN with the same number of states provides a more detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems. Similar results can be obtained using signals of different shapes at the ESN input.
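The pole-track computation can be sketched as follows. For f = tanh, the local slope is f'(net_i) = 1 - tanh^2(net_i) = 1 - x_i^2, so the linearized system at each step is J = diag(1 - x_i^2) W (Equation 2-5). The setup mirrors the text; the random seed is ours.

```python
import numpy as np

rng = np.random.default_rng(1)
N = 100
W = rng.choice([0.0, 0.4, -0.4], size=(N, N), p=[0.9, 0.05, 0.05])
W *= 0.95 / np.max(np.abs(np.linalg.eigvals(W)))   # scale to spectral radius 0.95
W_in = rng.choice([-1.0, 1.0], size=N)

u = np.sin(2 * np.pi * np.arange(200) / 100)       # sinusoid of period 100
x = np.zeros(N)
radii = []                                         # spectral radius of J per step
for n in range(200):
    x = np.tanh(W_in * u[n] + W @ x)
    # Equation 2-5: J(n+1) = F(n) W, with F(n) = diag(1 - x_i(n)^2) for tanh
    J = (1.0 - x ** 2)[:, None] * W
    radii.append(np.max(np.abs(np.linalg.eigvals(J))))
```

Plotting the eigenvalues of `J` over the input cycle reproduces pole tracks of the kind shown in Figure 2-2: `radii` shrinks where the input drives the PEs into saturation.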


Figure 2-2. Pole tracks of the linearized ESN when the input goes through a cycle. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations. A) One cycle of a sinusoidal signal with a period of 100. B)-E) The positions of the poles of the linearized systems when the input values are at points B, C, D, and E in panel A. F) Cumulative pole locations show the movement of the poles as the input changes.

Theorem: The eigenvalues of the linearized system have the largest radius when the system state is zero.


Proof: Assume W has nondegenerate eigenvalues and correspondingly linearly independent eigenvectors. Consider the eigendecomposition W = P D P^{-1}, where P is the eigenvector matrix and D is the diagonal matrix of the eigenvalues D_ii of W. Since F(n) and D are diagonal,

J(n+1) = F(n) W = F(n) (P D P^{-1}) = P (F(n) D) P^{-1}

is the eigendecomposition of J(n+1). Here, each entry f'(net_i(n)) D_ii of F(n)D is an eigenvalue of J(n+1), and |f'(net_i(n)) D_ii| ≤ |D_ii| since f'(net_i(n)) ≤ f'(0) for the hyperbolic tangent.

A key corollary of the above analysis is that the spectral radius of an ESN can be adjusted with a constant bias signal at the ESN input, without changing the recurrent connection matrix W. The application of a nonzero constant bias will move the operating point to regions of the sigmoid function closer to saturation and will always decrease the spectral radius, due to the shape of the nonlinearity. This property will be exploited in the design of ESNs, discussed in the next chapter.

Average State Entropy as a Measure of the Richness of ESN Reservoir

The concept of a rich dynamical reservoir (Jaeger, 2001) has not been quantified with a well-defined metric. Our framework of bases and projections leads to a new metric to quantify the richness of the ESN reservoir. Here, we propose the instantaneous state entropy to quantify the distribution of instantaneous amplitudes across the ESN reservoir states. Entropy of the instantaneous ESN states is appropriate for quantifying performance in function approximation, because the ESN output is a mere weighted combination of the values of the ESN states. If the instantaneous amplitudes of the echo states are concentrated on only a few values across the ESN state dynamic range, the ability to approximate an arbitrary desired response by weighting the states is limited (and wasteful, due to redundancy between the different states), and performance will
suffer. On the other hand, if the ESN states provide a diversity of instantaneous amplitudes, it is much easier to achieve the desired mapping. Hence, the instantaneous entropy of the states appears to be a good measure of the "richness" of dynamics with instantaneous mappers. Due to the time structure of signals, the average state entropy (ASE), defined as the state entropy averaged over time, will be the parameter used to quantify the diversity in the dynamical reservoir of the ESN. Moreover, entropy has been proposed as an appropriate measure of the volume of the signal manifold (Cox, 1946; Amari, 1990). Here, ASE measures the volume of the echo state manifold spanned by the trajectories.

Renyi's quadratic entropy is employed here because it is a global measure of information, and an efficient nonparametric estimator that avoids explicit pdf estimation has been developed (Principe et al., 2000; Erdogmus et al., 2003; Erdogmus and Principe, 2002). Renyi's entropy with parameter α for a random variable X with pdf f_X(x) is given by (Principe et al., 2000)

H_α(X) = (1/(1-α)) log E[ f_X^{α-1}(X) ]

Renyi's quadratic entropy is obtained for α = 2, and Shannon's entropy is obtained in the limit α → 1. Given N samples {x_1, x_2, ..., x_N} drawn from the unknown pdf to be estimated, Parzen windowing approximates the underlying pdf by

f_X(x) ≈ (1/N) Σ_{i=1}^{N} K_σ(x - x_i)

where K_σ is the kernel function with kernel size σ. Then Renyi's quadratic entropy can be estimated by (Principe et al., 2000; Erdogmus et al., 2003)

H_2(X) = -log( (1/N^2) Σ_j Σ_i K_σ(x_j - x_i) )    (2-6)
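Equation 2-6 is straightforward to implement. The sketch below (with toy data of our own choosing, not reservoir states) illustrates the point made above: a diverse set of instantaneous amplitudes yields a higher quadratic-entropy estimate than a concentrated one.

```python
import numpy as np

def renyi_quadratic_entropy(samples, sigma):
    """Parzen-window estimate of Renyi's quadratic entropy (Equation 2-6)
    with a Gaussian kernel of size sigma."""
    x = np.asarray(samples, dtype=float)
    d = x[:, None] - x[None, :]                      # all pairwise differences
    K = np.exp(-d ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    return -np.log(K.mean())                         # -log((1/N^2) sum_ij K)

rng = np.random.default_rng(0)
spread_out = rng.uniform(-1, 1, 100)                 # diverse amplitudes
clustered = 0.01 * rng.standard_normal(100)          # concentrated amplitudes

# Kernel size set to 0.3 of the sample standard deviation, as in the text
h_rich = renyi_quadratic_entropy(spread_out, 0.3 * spread_out.std())
h_poor = renyi_quadratic_entropy(clustered, 0.3 * clustered.std())
```

Applied to the entries of the state vector x(n) at each time step and averaged over n, this estimator yields the ASE.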


The instantaneous state entropy is estimated using Equation 2-6, where the samples are the entries x_1(n), x_2(n), ..., x_N(n) of the state vector x(n) of an ESN with N internal PEs. Results will be shown with a Gaussian kernel, with the kernel size chosen as 0.3 of the standard deviation of the entries of the state vector. We will show that ASE is a sensitive parameter for quantifying the approximation properties of ESNs by experimentally demonstrating that ESNs with the same spectral radius display different ASEs.

Let us consider the same 100-unit ESN used in the previous section, built with three different spectral radii, 0.2, 0.5, and 0.8, and an input signal of sin(2πn/20). Figure 2-3A depicts the echo states over 200 time ticks. The instantaneous state entropy is calculated at each time step using Equation 2-6 and plotted in Figure 2-3B. First, note that the instantaneous state entropy changes over time with the distribution of the echo states, as we would expect, since the state entropy depends on the input signal, which also changes in this case. Second, as the spectral radius increases in the simulation, the diversity of the echo states increases. For the spectral radius of 0.2, the instantaneous amplitudes of the echo states are concentrated on only a few values, which is wasteful due to the redundancy between the different states. In practice, to quantify the overall representation ability over time, we use ASE, which takes the values -0.735, -0.007, and 0.335 for the spectral radii of 0.2, 0.5, and 0.8, respectively. Hence, ASE is affected by the spectral radius of the W matrix, as we would expect. Moreover, even for the same spectral radius of 0.5, several ASEs are possible. Figure 2-4 shows the ASEs from 50 different realizations of ESNs with the same spectral radius of 0.5, which means that ASE is a finer descriptor of the dynamics of the reservoir.
Although we have presented an experiment with a sinusoidal signal, similar results are obtained for other inputs, as long as the input dynamic range is properly selected.

Figure 2-4. ASE values obtained from 50 realizations of ESNs with the same spectral radius.

Maximizing ASE means that the diversity of the states over time is largest, and it should provide a basis set that is as uncorrelated as possible. Unfortunately, this condition does not guarantee that the ESN so designed will perform best, because the basis set is independent of the desired response, and the application may require a small spectral radius. However, when the desired response is not accessible for the design of the ESN bases, or when the same reservoir is to be used for a number of problems, the default strategy should be to maximize the ASE of the state vector.


CHAPTER 3
DESIGN OF THE DYNAMICAL RESERVOIR

In ESNs, the reservoir weights are selected randomly under the constraint of the echo state condition (Jaeger, 2001). However, such a random selection scheme is not optimal, as demonstrated in Chapter 2. In this chapter, we propose a scheme for the design of the reservoir weights that leads to small approximation errors for a variety of tasks. For optimal approximation with a finite number of basis functionals, both the projection space (bases) and the projector (readout) require knowledge of the input and desired responses. In ESNs, however, the projection space is determined solely by the architecture of the ESN reservoir and the input signal (since W is fixed), without any knowledge of the space spanned by the desired target signal. The selection of basis functions with knowledge of the input signal only is an ill-posed problem. Nevertheless, the selection of the reservoir weights must be done using some rule, and here we hypothesize that a good design strategy is to let the ESN states cover the projection space with equal resolution, in order to anticipate any possible mapping requirement (dictated by the unknown desired response).

Design of the Recurrent Connections

According to the interpretation of ESNs as coupled linear systems, the design of the internal connection matrix W will be based on the distribution of the poles of the system linearized around the zero state. Our proposal is to design the ESN such that the linearized system has a uniform pole distribution inside the unit circle of the z-plane. With this design, the system dynamics will include a uniform coverage of time constants, arising from the uniform distribution of the poles, which also decorrelates the basis functionals as much as possible. This principle was chosen by analogy to the identification of linear systems using Kautz filters (Kautz, 1954), which shows that the best approximation of a given transfer function by a linear
system of finite order is achieved when the poles are placed in the neighborhood of the spectral resonances. When no information is available about the desired response, or when one would like to use the same reservoir for a variety of tasks, we should spread the poles uniformly to anticipate a good approximation of arbitrary mappings.

We again use a maximum entropy principle to distribute the poles uniformly inside the unit circle. The constraints of the circle as a boundary condition for discrete linear systems and of complex conjugate pole locations are easy to include in the pole distribution (Thogula, 2003). The poles are first initialized at random locations; the quadratic Renyi entropy is calculated using Equation 2-6, and the poles are moved such that the entropy of the new distribution increases over the iterations (Erdogmus et al., 2003). This method efficiently finds a uniform coverage of the unit circle with an arbitrary number of poles; however, any other method can be used to find pole locations with uniform coverage. Notice that this operation has to be done only once for a given number of poles, which is equal to the number of hidden PEs.

The system with uniform pole locations can be interpreted using linear system theory. The poles close to the unit circle correspond to many sharp bandpass filters specialized in different frequency regions, whereas the inner poles realize filters of larger frequency support. Moreover, different orientations (angles) of the poles create filters with different center frequencies.

Now, our problem is to construct a weight matrix from the pole locations. This is equivalent to designing W when its eigenvalues are known. In principle, we would like to create a sparse matrix, so we started with the sparsest invertible matrix, which is the direct canonical structure given by (Kailath, 1980)
W = [ -a_1  -a_2  ...  -a_{N-1}  -a_N
        1     0   ...     0       0
        0     1   ...     0       0
       ...   ...   ...   ...     ...
        0     0   ...     1       0 ]    (3-1)

The characteristic polynomial of W is

l(s) = det(sI - W) = s^N + a_1 s^{N-1} + a_2 s^{N-2} + ... + a_N = (s - p_1)(s - p_2)...(s - p_N)    (3-2)

where the p_i are the eigenvalues and the a_i are the coefficients of the characteristic polynomial of W. Here, we know the pole locations of the linear system obtained from the linearization of the ESN, so using Equation 3-2 we can obtain the characteristic polynomial and construct the W matrix in canonical form using Equation 3-1. We will call the ESN constructed according to the uniform pole principle the ASE-ESN. All other possible solutions with the same eigenvalues can be obtained as Q^{-1}WQ, where Q is any nonsingular matrix.

To corroborate our hypothesis, we would like to show that the linearized ESN whose internal connection weight matrix has eigenvalues uniformly distributed inside the unit circle creates higher ASE values for a given spectral radius than other ESNs with random internal connection weight matrices. We consider an ESN with 30 states and use our procedure to create the uniformly distributed linearized ASE-ESN matrix for different spectral radii in [0.1, 0.95]. Similarly, we constructed ESNs with sparse random W matrices under different sparseness constraints. This corresponds to a weight distribution taking the values 0, c, and -c with probabilities p_1, (1-p_1)/2, and (1-p_1)/2, where p_1 defines the sparseness of W and c is a constant whose value depends on the desired spectral radius. We also created W matrices with values uniformly distributed between -1 and 1 (U-ESN), scaled to obtain a given spectral radius (Jaeger and Haas, 2004). Then, for different W^in matrices, we ran the ESNs with a sinusoidal input of period 20 and calculated the ASE.
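The construction of W from a prescribed pole set (Equations 3-1 and 3-2) can be sketched as follows. The pole set shown is a hypothetical stand-in, uniform in angle only on a single circle, not the entropy-optimized set described above; it is there only to exercise the round trip from poles to W and back.

```python
import numpy as np

def reservoir_from_poles(poles):
    """Build W in direct canonical (companion) form, Equation 3-1, whose
    eigenvalues are the prescribed pole locations (Equation 3-2)."""
    # np.poly returns the characteristic polynomial [1, a1, ..., aN];
    # with the poles in conjugate pairs, the coefficients are real.
    coeffs = np.real(np.poly(poles))
    N = len(poles)
    W = np.zeros((N, N))
    W[0, :] = -coeffs[1:]            # first row: -a1, -a2, ..., -aN
    W[1:, :-1] = np.eye(N - 1)       # shifted-identity block below
    return W

# Hypothetical example: 7 conjugate pole pairs spread uniformly in angle
# on a circle of radius 0.9, so that W is real
angles = np.pi * np.arange(1, 8) / 8
poles = np.concatenate([0.9 * np.exp(1j * angles), 0.9 * np.exp(-1j * angles)])
W = reservoir_from_poles(poles)
```

Any similarity transform Q^{-1} W Q of this matrix shares the same eigenvalues, as noted in the text.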


Figure 3-1. Comparison of ASE values obtained for the ASE-ESN, the U-ESN, and randomly generated ESNs with sparseness 0.07, 0.1, and 0.2, over a range of spectral radii. The ASE-ESN with uniform pole distribution generates a higher ASE on average, for all spectral radii, than the ESNs with random connections.

Figure 3-1 compares the ASE values obtained using the different ESNs over 1000 realizations. As observed from the figure, the ASE-ESN with uniform pole distribution generates a higher ASE on average, for all spectral radii, than the ESNs with sparse and uniform random connections. This approach is conceptually similar to Jeffreys' maximum entropy prior (Jeffreys, 1946): it provides a consistently good response for the largest class of problems. Concentrating the poles of the linearized system in certain regions of the space provides good performance only if the desired response has energy in that part of the space, as is well known from the theory of Kautz filters (Kautz, 1954).


Design of the Adaptive Bias

In conventional ESNs, only the output weights are trained, optimizing the projections of the desired response onto the basis functions (echo states). Since the dynamical reservoir is fixed, the basis functions are input-dependent only. However, since function approximation is a problem in the joint space of the input and desired signals, a penalty in performance will be incurred. Motivated by the linearization analysis, which showed the crucial importance of the operating point of the PE nonlinearity in defining the echo state dynamics, we propose to use a single external adaptive bias to adjust the effective spectral radius of an ESN. Notice that, according to the linearization analysis with the tanh nonlinearity, the bias can only reduce the spectral radius. The information used to adapt the bias is the training MSE, which modulates the spectral radius of the system with information derived from the approximation error. With this simple mechanism, some information from the input-output joint space is incorporated into the definition of the projection space of the ESN. The beauty of this method is that the spectral radius can be adjusted by a single parameter that is external to the system, without changing the reservoir weights.

The training of the bias can be easily accomplished. Indeed, since the parameter space is only one-dimensional, a simple line search method can be employed to optimize the bias efficiently. Among the different line search algorithms, we use a search based on Fibonacci numbers for the selection of the points to be evaluated (Wilde, 1964). The Fibonacci search method minimizes the maximum number of evaluations needed to reduce the interval of uncertainty to within a prescribed length. In our problem, a bias value is picked according to the Fibonacci search; for each value of the bias, the training data are applied to the ESN and the echo states are calculated.
Then, the corresponding optimal output weights and the objective function (MSE) are evaluated to pick the next bias value.
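The Fibonacci line search over the one-dimensional bias can be sketched as follows. This is a minimal illustration: the quadratic objective is a hypothetical stand-in for the actual procedure of training the readout at a candidate bias and measuring the MSE, and the search interval and evaluation count are assumed values, not the dissertation's settings.

```python
def fib(n):
    """Fibonacci number F(n) with the convention F(0) = F(1) = 1."""
    a, b = 1, 1
    for _ in range(n):
        a, b = b, a + b
    return a

def fibonacci_search(f, a, b, n=25):
    """Minimize a unimodal f on [a, b] using n Fibonacci-ratio stages."""
    # Two interior points placed at Fibonacci ratios of the interval.
    x1 = a + fib(n - 2) / fib(n) * (b - a)
    x2 = a + fib(n - 1) / fib(n) * (b - a)
    f1, f2 = f(x1), f(x2)
    for k in range(n, 2, -1):
        if f1 <= f2:                 # minimum lies in [a, x2]
            b, x2, f2 = x2, x1, f1   # reuse the surviving evaluation
            x1 = a + fib(k - 3) / fib(k - 1) * (b - a)
            f1 = f(x1)
        else:                        # minimum lies in [x1, b]
            a, x1, f1 = x1, x2, f2
            x2 = a + fib(k - 2) / fib(k - 1) * (b - a)
            f2 = f(x2)
    return 0.5 * (a + b)

# Toy objective standing in for "train ESN at bias b, return training MSE".
b_opt = fibonacci_search(lambda b: (b - 2.8) ** 2, 0.0, 5.0)
```

Each stage shrinks the interval of uncertainty by a ratio of consecutive Fibonacci numbers and reuses one of the two previous function evaluations, which is what makes the method evaluation-optimal for a fixed budget.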


Alternatively, gradient-based methods can be utilized to optimize the bias, due to their simplicity and low computational cost. The system update equation with an external bias signal b is given by

x(n+1) = f(Win u(n+1) + W x(n) + Win b)

The update equation for b is given by

b(n+1) = b(n) + γ e(n+1) Wout ( f'(net(n+1)) ∘ Win )

Here, O is the MSE defined previously, γ is the step size, e(n+1) is the output error, f' denotes the derivative of the PE nonlinearity evaluated at the net input, and ∘ denotes the element-wise product. This algorithm may suffer from problems similar to those observed in gradient-based training of recurrent networks. However, we observed that the performance surface is rather simple. Moreover, since the search parameter is one dimensional, the gradient can only assume one of two directions. Hence, imprecision in the gradient estimation should affect the speed of convergence but normally not change the correct gradient direction.

Experiments

This section presents a variety of experiments in order to test the validity of the ESN design scheme proposed in the previous section.

Short-term Memory Structures

Modeling a dynamical system with input-output data requires access to a sufficiently long input history, as the output of the dynamical system depends not only on the current value of the input signal but also on its history. In signal processing, the common approach to overcome this difficulty is to embed the input signal using a tapped delay line (as in an FIR filter or TDNN). RNNs


provide a different type of embedding through the recurrent connections between the PEs. However, the length of the memory achieved is not as clear as with a tapped delay line.

This experiment compares the short-term memory (STM) capacity of ESNs with the same spectral radius, using the framework presented in (Jaeger, 2002). Consider an ESN with a single input signal, u(n), optimally trained with the desired signal u(n-k), for a given delay k. Denoting the optimal output signal yk(n), the k-delay STM capacity of a network, MCk, is defined as the squared correlation coefficient between u(n-k) and yk(n) (Jaeger, 2002). The STM capacity, MC, of the network is defined as MC = Σ_k MCk, summed over all delays k = 1, 2, …. STM capacity measures how accurately the delayed versions of the input signal are recovered with optimally trained output units. It has been shown in (Jaeger, 2002) that the memory capacity for recalling an independent identically distributed (i.i.d.) input with an N-unit RNN with linear output units is bounded by N.

Consider an ESN with 20 PEs and a single input unit. The ESN is driven by an i.i.d. random input signal, u(n), that is uniformly distributed over [-0.5, 0.5]. The goal is to train the ESN to generate the delayed versions of the input, u(n-1), u(n-2), …, u(n-40). We used four different ESNs: R-ESN, U-ESN, ASE-ESN and BASE-ESN. R-ESN is the randomly connected ESN used in (Jaeger, 2002), where the entries of the W matrix are set to 0, 0.47, -0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.9. For U-ESN, the entries of W are uniformly distributed over [-1, 1] and scaled to obtain a spectral radius of 0.9. ASE-ESN also has a spectral radius of 0.9 and is designed with uniform poles. BASE-ESN has the same recurrent weight matrix as ASE-ESN and an adaptive bias at its input.
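The k-delay capacity MCk defined above can be computed as a squared correlation coefficient between the delayed input and a least-squares readout. The sketch below uses a generic random tanh reservoir; the reservoir sizes, scalings, and washout length are illustrative assumptions, not the experiment's exact settings.

```python
import numpy as np

# Generic random reservoir driven by i.i.d. uniform input (assumed sizes).
rng = np.random.default_rng(1)
N, T = 20, 1000
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.9
W_in = rng.choice([-0.1, 0.1], size=N)

u = rng.uniform(-0.5, 0.5, size=T)
X = np.zeros((T, N))
x = np.zeros(N)
for n in range(T):
    x = np.tanh(W_in * u[n] + W @ x)
    X[n] = x

def mc_k(k, washout=100):
    """MC_k: squared correlation between u(n-k) and the trained output."""
    S = X[washout:]                     # states x(n), n = washout..T-1
    d = u[washout - k:T - k]            # delayed input u(n-k), aligned
    w_out = np.linalg.lstsq(S, d, rcond=None)[0]
    y = S @ w_out
    return np.corrcoef(d, y)[0, 1] ** 2

mc = sum(mc_k(k) for k in range(1, 41))  # STM capacity over delays 1..40
```

Each MCk lies in [0, 1], and the sum over delays approximates the network's total STM capacity.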
In each ESN, the input weights are set to 0.1 or -0.1 with equal probability, and direct connections from the input to the output are allowed, whereas Wback is set to 0 (Jaeger, 2002). The echo states are calculated using Equation 2-1 for 200 samples of the input signal, and the first 100 samples corresponding to


initial transient are eliminated. Then, the output weight matrix is calculated using Equation 2-4. For the BASE-ESN, the bias is trained for each task. All networks are run with a test input signal and the corresponding output and MCk are calculated.

Figure 3-2. The k-delay STM capacity of each ESN for delays 1, …, 40, computed using the test signal. The STM capacities of R-ESN, U-ESN, ASE-ESN and BASE-ESN are 13.09, 13.55, 16.70 and 16.90, respectively.

Figure 3-2 shows the k-delay STM capacity (averaged over 100 trials) of each ESN for delays 1 to 40 for the test signal. The STM capacities of R-ESN, U-ESN, ASE-ESN and BASE-ESN are 13.09, 13.55, 16.70 and 16.90, respectively. First of all, the ESNs with uniform pole distribution (ASE-ESN and BASE-ESN) have MCs that are much longer than that of the randomly generated ESN given in (Jaeger, 2002), in spite of all having the same spectral radius. In fact, the STM capacity of ASE-ESN is close to the theoretical maximum value of N = 20. A closer look at the figure shows that R-ESN performs slightly better than ASE-ESN for delays less than 9.


Indeed, for small k, large ASE degrades the performance because these tasks do not need long memory depth. However, the drawback of high ASE for small k is recovered in BASE-ESN, which reduces the ASE to the appropriate level required for the task. Overall, the addition of the bias to the ASE-ESN increases the STM capacity from 16.70 to 16.90. On the other hand, U-ESN has slightly better STM than R-ESN, even though R-ESN uses only 3 distinct weight values whereas U-ESN has many more. It is also significant to note that the MC will be very poor for an ESN with a smaller spectral radius, even with an adaptive bias, since the problem requires large ASE and the bias can only reduce ASE. This experiment demonstrates the need for maximizing ASE in ESN design.

Binary Parity Check

The effect of the adaptive bias was marginal in the previous experiment, since the nature of the problem required large ASE values and a long short-term memory. However, there are tasks in which the optimal solutions require smaller ASE values and a smaller spectral radius. These are the tasks where the adaptive bias becomes a crucial design parameter in our design methodology. Consider an ESN with 100 internal units and a single input unit. The ESN is driven by a binary input signal, u(n), that assumes the values 0 or 1. The goal is to train the ESN to generate the m-bit parity, where m is 3, …, 8. Similar to the previous experiments, we used the R-ESN, ASE-ESN and BASE-ESN topologies. R-ESN is a randomly connected ESN where the entries of the W matrix are set to 0, 0.06, -0.06 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.3. ASE-ESN and BASE-ESN are designed with a spectral radius of 0.9. The input weights are set to 1 or -1 with equal probability, and direct connections from the input to the output are allowed, whereas Wback is set to 0.
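The R-ESN construction just described can be sketched as follows. For this recipe the measured spectral radius typically comes out near the quoted value of 0.3; the check below is illustrative, not part of the dissertation's code.

```python
import numpy as np

# Sparse ternary reservoir: entries 0, +w, -w with probabilities
# 0.8, 0.1, 0.1 (w = 0.06 as in the 100-unit parity experiment).
rng = np.random.default_rng(2)
N, w = 100, 0.06
W = rng.choice([0.0, w, -w], size=(N, N), p=[0.8, 0.1, 0.1])

rho = np.max(np.abs(np.linalg.eigvals(W)))   # measured spectral radius
```

For a random sparse matrix of this kind the spectral radius concentrates around w·sqrt(N·p_nonzero), which for w = 0.06, N = 100 and 20% connectivity is about 0.27, consistent with the stated 0.3.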
The echo states are calculated using Equation 2-1 for 1000 samples of the input signal, and the first 100 samples corresponding to the initial transient are eliminated. Then, the output weight matrix is calculated


using Equation 2-4. For the ESN with adaptive bias, the bias is trained for each task. The binary decision is made by a threshold detector that compares the output of the ESN to 0.5. Figure 3-3 shows the number of wrong decisions (averaged over 100 different realizations) made by each ESN for m = 3, …, 8.

Figure 3-3. Number of wrong decisions made by each ESN for m = 3, …, 8 in the binary parity check problem. The total numbers of wrong decisions for m = 3, …, 8 of R-ESN, ASE-ESN and BASE-ESN are 722, 862 and 699, respectively.

The total numbers of wrong decisions for m = 3, …, 8 of R-ESN, ASE-ESN and BASE-ESN are 722, 862 and 699, respectively. ASE-ESN performs poorly since the nature of the problem requires a short time constant for fast response, but ASE-ESN has a large spectral radius. For 5-bit parity, the R-ESN has no wrong decisions whereas ASE-ESN has 47 wrong decisions. BASE-ESN performs much better than ASE-ESN and slightly better than the R-ESN, since the adaptive bias reduces the spectral radius effectively. Note that for m = 7 and 8, the ASE-ESN performs


similar to the R-ESN, since the task requires access to a longer input history, which compromises the need for fast response. Indeed, the bias in the BASE-ESN takes effect when there are errors (m > 4) and when the task benefits from a smaller spectral radius. The optimal bias values are approximately 3.2, 2.8, 2.6 and 2.7 for m = 3, 4, 5, and 6, respectively. For m = 7 or 8, there is a wide range of bias values that result in similar MSE values (between 0 and 3). In summary, this experiment clearly demonstrates the power of the bias signal to configure the ESN reservoir according to the mapping task.

System Identification

This section presents a function approximation task where the aim is to identify a nonlinear dynamical system. The unknown system is defined by the difference equation

y(n+1) = 0.3 y(n) + 0.6 y(n-1) + f(u(n))

where

f(u) = 0.6 sin(πu) + 0.3 sin(3πu) + 0.1 sin(5πu)

The input to the system is chosen to be sin(2πn/25). We used an ESN with 30 internal units and a single input unit. Again, we used three different ESNs: R-ESN, ASE-ESN and BASE-ESN. The W matrices of all ESNs are generated as described in section 3.3.1, except that they are scaled to have a spectral radius of 0.95. In each ESN, the input weights are set to 1 or -1 with equal probability, and direct connections from the input to the output are allowed, whereas Wback is set to 0. The optimal output signal is calculated for each ESN as described in section 3.3.1. The MSE values (averaged over 100 realizations) for R-ESN and ASE-ESN are 1.23x10-5 and 1.83x10-6, respectively. The addition of the adaptive bias to the ASE-ESN reduces the MSE value from 1.83x10-6 to 3.27x10-9. This experiment illustrates the superiority of the proposed design scheme for function approximation.
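Generating input-output data from this plant can be sketched as below. The difference equation is reconstructed from the garbled source text; it matches the well-known Narendra-Parthasarathy identification benchmark, and the simulation length and zero initial conditions are assumptions.

```python
import numpy as np

def f(u):
    """Static nonlinearity of the unknown plant."""
    return (0.6 * np.sin(np.pi * u)
            + 0.3 * np.sin(3 * np.pi * u)
            + 0.1 * np.sin(5 * np.pi * u))

T = 500
u = np.sin(2 * np.pi * np.arange(T) / 25)     # sinusoidal excitation
y = np.zeros(T + 1)                           # zero initial conditions
for n in range(1, T):
    y[n + 1] = 0.3 * y[n] + 0.6 * y[n - 1] + f(u[n])
```

The pairs (u(n), y(n+1)) then serve as the input and desired signals for training the ESN readout. The linear part is stable (its poles lie inside the unit circle), so the generated output stays bounded.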


CHAPTER 4
DESIGN OF THE READOUT

In the previous chapter, we discussed the design of the ESN reservoir and the bias. This chapter deals with the design of the readouts for ESNs. The problem is studied separately for function approximation and for pattern recognition, and different solutions are proposed for each case.

Design of the Readout for Function Approximation

The usual readout is the linear regression network, and throughout this dissertation we will use the linear readout. The equation for the linear readout is given by

y(n+1) = Wout x(n+1)

Alternative readouts are nonlinear readouts, such as a multilayer perceptron or local nonlinear modeling; this issue is outside the scope of this dissertation.

Design of the Readout for Dynamical Pattern Recognition

The applications of ESNs to problems such as system identification, control of dynamical systems, and time-series prediction have been extensively studied both in this work and in the literature (Jaeger, 2001; Jaeger and Haas, 2004; Ozturk et al., 2007; Prokhorov, 2005). However, the use of ESNs for dynamical pattern recognition has been limited. In the literature, RNNs and other neural architectures, such as the time-delay neural network, have been utilized for dynamical pattern recognition tasks by minimizing the mean-square error (MSE) with respect to a label (Haykin, 1998; Bishop, 1995; Iso and Watanabe, 1991; Tebelskis, 1995). If the pattern to be classified is a time signal, the classification process is different from the classification of a static pattern. The dynamical classification problem can be reduced to a static one by treating each sample of the time series individually. In this case, the time series is not treated as a whole and the temporal information between samples is ignored. Moreover, one class label is associated with


each time instant; therefore, post-classification such as majority voting has to be applied. An alternative to this approach is embedding the time series to populate a short-term history of the pattern. Again, a static mapper can be used to classify the embedded pattern. In this approach, classification becomes extremely hard if the dimensionality of the input space becomes too large. Alternatively, temporal processing of the time series can be done inside the classifier. In this approach, the classifier, which has a built-in memory, can accept single elements of the time series. Although the classifier does not require external temporal processing, the design of the classifier may be complicated. Another major difficulty in dynamical pattern recognition is how to design the label, which should also be a time series. The obvious method is to enforce a class label (+1 or -1) as the desired response throughout the presentation of the input pattern. However, the difficulty with this method is that the desired response is forced to be constant, independent of the input signal dynamics. An alternative powerful technique is to create a one-step predictor for each class in order to capture the dynamics of the class generating the time series (Zahalka and Principe, 1993). Then, during testing, the test pattern is presented to all predictors, and the label of the predictor with the least prediction error is assigned as the class of the input pattern. This method is a smart way of converting the pattern recognition problem into a regression problem, where many techniques are available. This method can be easily applied to ESNs with an adaptive linear readout and used for the classification of temporal patterns, but it is still dependent upon input normalization and it is sensitive to changes in pattern duration.
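The predictor-per-class scheme can be illustrated with a deliberately simple toy: one linear one-step predictor (here an AR(2) model fit by least squares, standing in for the ESN predictor) per class, with the test pattern assigned the label of the predictor with the least prediction error. The signals and model order are hypothetical choices for illustration only.

```python
import numpy as np

def fit_ar2(signals):
    """Least-squares AR(2) fit pooled over all training signals of a class."""
    A = np.vstack([np.column_stack([s[1:-1], s[:-2]]) for s in signals])
    b = np.concatenate([s[2:] for s in signals])
    return np.linalg.lstsq(A, b, rcond=None)[0]

def pred_err(w, s):
    """Mean squared one-step prediction error of AR(2) weights w on s."""
    return np.mean((s[2:] - (w[0] * s[1:-1] + w[1] * s[:-2])) ** 2)

n = np.arange(200)
class_a = [np.sin(2 * np.pi * n / 20) for _ in range(3)]  # slow oscillation
class_b = [np.sin(2 * np.pi * n / 7) for _ in range(3)]   # fast oscillation
w_a, w_b = fit_ar2(class_a), fit_ar2(class_b)

test = np.sin(2 * np.pi * n / 20 + 0.3)   # unseen phase, class-a dynamics
label = 'a' if pred_err(w_a, test) < pred_err(w_b, test) else 'b'
```

Because the AR(2) relation of a sinusoid is phase-independent, the class-a predictor generalizes to the phase-shifted test pattern, which is exactly the property the predictor-per-class method exploits.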
In this chapter, we propose an alternative readout for ESNs for temporal pattern recognition and compare it with the standard linear readout of ESNs and with other conventional techniques used in the literature. The goal is to design an ESN/LSM readout that will recognize a class of inputs that differ by some quantity (e.g., amplitude or shape, considered as a distortion


parameter). The proposed readout is based on a biologically plausible linear associative memory (LAM) that implements a correlator specially trained for high specificity. The readout, called the minimum average correlation energy (MACE) filter, is adopted from the optical pattern recognition literature, where it is used for the recognition of a given object in 2-D images in the presence of noise, geometric distortions and pose changes (Mahalanobis et al., 1987). The MACE filter is a correlator whose weights are determined in closed form by solving a minimization problem with a constraint, and it has been shown that the method is equivalent to a cascade of a pre-whitening filter for the image class followed by a LAM (Fisher, 1997). The recognition of a time pattern in a single time series is achieved by feeding the time series to an ESN/LSM and utilizing a time window of states, which can be interpreted as a 2-D image, one dimension being time and the other the processing element number (space). Several MACEs are trained, one for each class. During testing, the unknown time signal is fed to the ESN, the states are correlated with each template image, and the class label of the template with the highest correlation is assigned to the input pattern. An interesting implication of the MACE is that it can also be used for LSMs, with proper regularization of the MACE solution. In LSMs, the proposed readout operates directly in the spike domain, without converting the liquid state spike train outputs into continuous signals by binning or low-pass filtering. In this chapter, we first present a brief background on linear associative memories, followed by a detailed explanation of the MACE filter. Then, the use of the MACE filter as the readout for ESN/LSMs is discussed. Finally, experiments are presented to compare the proposed readout with the conventional methods.
Linear Associative Memories

A linear associative memory is a multiple-input multiple-output linear system that implements an association (memory) between inputs and desired outputs (Hinton and Anderson,


1981). The output is automatically created by the presentation of the proper input, which is referred to as content addressability. LAMs are quite different from the concept of memory used in digital computers, which uses a memory location whose contents are accessible by its address. Unlike digital memory, LAMs store information in a more global way, by having several PEs instead of local bit storage. Therefore, they are also robust to noise (Haykin, 1998; Hinton and Anderson, 1981). LAMs recall by association; hence, the pattern that is closest to the input is recalled. LAMs are biologically plausible, just like the ESN/LSM, and they stand as the most likely model for human memory (Principe et al., 2000).

The forced Hebbian rule can be used to train a LAM that associates a P-dimensional input p to a Q-dimensional desired output q. The output of the LAM is given by y = WLAM p, where the Q x P matrix WLAM can be computed using the outer-product rule, WLAM = q p^T (Principe et al., 2000; Haykin, 1998). Multiple input-output vector pairs (pk, qk), for k = 1, …, K, can also be stored in the LAM by repeated presentation of each input. Using the superposition principle for the linear network, the final weight matrix is given by

WLAM = Σ_{k=1..K} qk pk^T

For an input vector pl, the output of the LAM is given by

yl = WLAM pl = ql (pl^T pl) + Σ_{k≠l} qk (pk^T pl)

The first term in the LAM output is the desired output, and the second term, called crosstalk, is the interference of the other stored patterns with the true pattern. If the crosstalk term is small, the LAM can retrieve the pattern corresponding to the input. Notice that if the stored patterns are orthogonal to each other, the crosstalk term will be zero, resulting in perfect recall. Therefore, the number of patterns that can be stored perfectly in a LAM is limited by the size of


the input space, which is one of the limitations of LAMs for information processing (Principe et al., 2000; Haykin, 1998).

The output patterns qk for the pairs (pk, qk), k = 1, …, K, can be interpreted as the desired response for the LAM. The existence of a desired response allows the application of supervised learning to train LAMs using unconstrained optimization techniques such as least squares. For instance, the weights can be modified using the least-mean-squares (LMS) algorithm by an amount given by

ΔWLAM = η ek pk^T = η (qk - yk) pk^T

where η is the step size. This equation is a combination of the desired forced Hebbian term (qk pk^T) and an anti-Hebbian term (-yk pk^T) that decorrelates the present output yk from the input pk. The anti-Hebbian term reduces the crosstalk at each iteration, improving the performance of LAMs in terms of crosstalk. A LAM trained with LMS is called an optimal linear associative memory (OLAM) (Principe et al., 2000).

One question follows naturally: what is the difference between a LAM and a linear regressor? The linear regressor passes a single hyperplane through all the desired samples, so we normally want more patterns than the size of the input dimension in order to generalize, but it loses information about the covariance of the data space (Principe et al., 2000). In the LAM, on the other hand, the output is as close as possible to each of the training samples. This different behavior is controlled by the number of data points used for training, which must be less than the input dimensionality to minimize crosstalk. The MACE filter that we will be discussing next is a cascade of a preprocessor (a whitening filter over the class of training data) followed by a LAM (Fisher, 1997).
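The outer-product storage and recall described above can be sketched in a few lines. The sizes are illustrative; the keys are made orthonormal on purpose, so the crosstalk term vanishes and recall is exact, as the text predicts.

```python
import numpy as np

rng = np.random.default_rng(3)
P, Q, K = 8, 4, 3

# K orthonormal key vectors of dimension P (rows of p), via QR.
p = np.linalg.qr(rng.normal(size=(P, K)))[0].T
q = rng.normal(size=(K, Q))              # associated output patterns

# Outer-product (forced Hebbian) storage: W = sum_k q_k p_k^T.
W_lam = sum(np.outer(q[k], p[k]) for k in range(K))

y = W_lam @ p[0]                         # recall the first stored pair
```

Because p[k]^T p[0] = 0 for k ≠ 0 and p[0]^T p[0] = 1 here, the crosstalk sum is exactly zero and y reproduces q[0]; with non-orthogonal keys the same recall would be contaminated by the crosstalk term.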


The MACE Filter

The technique of matched spatial filters (MSF), or correlators, for optical pattern recognition has been well studied and used for the recognition of 2-D objects (Vanderlugt, 1964; Kumar, 1986). The goal is to design an optimal correlation filter that represents an object class under noise and/or distortions, using a set of training images. The reason it is possible to match a single template to a class of objects is the huge number of weights (degrees of freedom) that the correlator has in 2-D (N^2 weights for an N x N input image). Moreover, the MACE filter can incorporate the covariance information of the class in its weights. The MACE filter is the most widely used MSF due to its superior discrimination properties (Mahalanobis, 1987; Casasent and Ravichandran, 1992).

The formulation of the MACE will be presented in the frequency domain for convenience and ease of implementation (Mahalanobis, 1987). Consider a set of training images for a single object class. The ith training image (i = 1, …, K) is described as a 1-D discrete sequence (obtained by lexicographically ordering the columns of the image), denoted by xi = [xi(1), xi(2), …, xi(d)]^T, where d and K are the number of pixels in the image and the number of training images for the class, respectively. The discrete Fourier transform of xi is denoted by Xi. Define the overall d x K training matrix by X = [X1, …, XK]. The d x 1 column vector h denotes the coefficients of a filter in the space domain, and H its Fourier transform. The correlation function of the ith image with the filter h is given by gi = xi ⊗ h. The filter output gi defines a 3-dimensional space, where the x and y axes are the indices (lags) of the 2-dimensional correlation function and the z axis is the value of the correlation at the corresponding indices. We will call it the correlation space. The energy of the ith correlation plane in the frequency domain is Ei = H† Di H, where † denotes the conjugate transpose and Di is a d x d diagonal


matrix whose elements are the magnitude squares of the associated elements of Xi. The average correlation plane energy over the training images is defined as

Eav = (1/K) Σ_{i=1..K} Ei = H† [ (1/K) Σ_{i=1..K} Di ] H = H† D H

where D = (1/K) Σ Di (Mahalanobis, 1987). The MACE filter solution designs a correlation filter that minimizes the output energy while providing predetermined values c = [c1, …, cK]^T at the origin of the correlation plane. Each ci is the ith peak correlation value, obtained by correlating the ith image with the filter H at zero lag. More specifically, the problem is to solve the constrained optimization problem (Mahalanobis, 1987):

min_H H† D H subject to X† H = c

The solution to this problem can be found analytically using the method of Lagrange multipliers (Mahalanobis, 1987) and is given by

H = D^{-1} X (X† D^{-1} X)^{-1} c    (4-1)

The selection of the exact value of c is somewhat arbitrary, since it simply scales the peak correlation amplitude. Usually, the entries of c are chosen to be 1 for in-class training images and a smaller value (around 0.1) for out-of-class images, if used at all (Mahalanobis, 1987). The MACE filter solution achieves superior discrimination properties when compared to other existing correlation filters, such as synthetic discriminant function filters (Mahalanobis, 1987; Casasent and Ravichandran, 1992).

The ESN/LSM MACE for Dynamical Pattern Recognition

Pattern recognition is normally formulated as the design of a nonlinear static system trained discriminantly with the information of a set of labels, with data that are vectors in R^n. An alternative to this approach is obtained by creating models for each class, which are only trained


with the class data (nondiscriminant training). When the number of classes to be classified is unknown at training time (open set classification), the latter provides robust results. The correlator, or matched filter, is a very well known example of a system that recognizes single-class data (Helstrom, 1995; Proakis, 2001), and it can be easily extended to multiple dimensions. Dynamical pattern recognition can also be formulated in a similar manner by using dynamical systems as classifiers and training them discriminantly with an appropriate desired signal. Instead, in this study, we propose to use the MACE filter as a readout for ESN/LSMs for dynamical pattern recognition tasks, nondiscriminantly trained for each class. In the ESN/LSM, the proposed associative memory considers a time window of states, which can be interpreted as a 2-D image, one dimension being time and the other the space of the states (see Figure 4-1).

Let us explain the operation of the ESN-MACE in mathematical terms. Assume that we have P different classes of temporal patterns. The goal is to compute one MACE filter, hp, for each class. Assume also that for each class, the training set consists of K input patterns, each with a particular temporal shape of length T. The procedure to compute each hp, for p = 1, …, P, is as follows.

1. The ith training input pattern for the pth class, ui = [ui(1), ui(2), …, ui(T)], of dimension M x T, is used to calculate the echo states using Equation 1-1.

2. The resulting N x T matrix of echo states forms the equivalent of a 2-D image in section 2.2 (Figure 4-1). The echo states are then lexicographically ordered by the columns to get the 1-D column vector xi = [xi(1)^T, xi(2)^T, …, xi(T)^T]^T with N x T elements. Here each xi(n) is an N x 1 vector with the values of the echo states at time n for the ith training sequence.


3. The discrete Fourier transform of xi is denoted by Xi. Define the overall d x K training matrix in the frequency domain by X = [X1, …, XK].

4. The optimal coefficients of the LAM for the class are computed using Equation 4-1 in the frequency domain, and the corresponding MACE filter weights hp are obtained by the inverse discrete Fourier transform. The output of the MACE for the ith training input pattern of the pth class can be obtained by hp^T xi.

Figure 4-1. Interpretation of echo states as a 2-D image. A) The triangular input signal. B) The echo states of a 100-unit ESN. C) The echo state image, where a point in state index and time is interpreted similarly to a pixel in an image.

The same procedure is repeated for the training sequences of the other classes to obtain an optimal filter for each class.
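Steps 1-4 can be sketched end to end for one class. The reservoir size, pattern length, and the family of scaled triangular patterns are illustrative assumptions; the MACE solve follows Equation 4-1, exploiting the fact that D is diagonal so its inverse is element-wise.

```python
import numpy as np

rng = np.random.default_rng(5)
N, T, K = 10, 20, 3
W = rng.normal(size=(N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 0.9
W_in = rng.choice([-1.0, 1.0], size=N)

def state_image(u):
    """Steps 1-2: N x T echo-state image from a zero initial condition."""
    x, cols = np.zeros(N), []
    for un in u:
        x = np.tanh(W_in * un + W @ x)
        cols.append(x)
    return np.array(cols).T                        # N x T

tri = np.concatenate([np.linspace(0, 1, T // 2), np.linspace(1, 0, T - T // 2)])
patterns = [a * tri for a in (0.5, 0.75, 1.0)]     # K training patterns

# Steps 2-3: lexicographic (column-major) ordering, then the DFT.
X = np.stack([np.fft.fft(state_image(u).flatten(order='F'))
              for u in patterns], axis=1)          # d x K, d = N*T

# Step 4: MACE solution H = D^{-1} X (X^H D^{-1} X)^{-1} c.
Dinv = 1.0 / np.mean(np.abs(X) ** 2, axis=1)       # D diagonal -> invert entries
c = np.ones(K)                                     # in-class peak values
A = (X.conj().T * Dinv) @ X                        # X^H D^{-1} X  (K x K)
H = (Dinv[:, None] * X) @ np.linalg.solve(A, c)
h = np.real(np.fft.ifft(H))                        # space-domain MACE weights
```

By construction the solution satisfies the constraint X† H = c, i.e., every training state image produces the prescribed correlation peak at zero lag.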


During testing, the input signal can be either a continuous stream (asynchronous operation) or a series of frames (synchronous operation). Depending on the mode of operation, the choice of the initial conditions for the ESN in step 1 varies. In synchronous operation, the timing of the signal and the frames of interest are already known, both in training and in testing. Therefore, the ESN is initialized with the same zero initial condition in training and testing. The zero initial condition is chosen so that the system dynamics are not biased and the system states are controlled by the input. During testing, the MACE output is calculated for each frame by correlating the filter coefficients with the echo states generated by the frame, at zero lag and with a zero initial condition.

When the signal timing is unknown (asynchronous operation), the MACE filter generates an output at each time instant n by correlating the filter coefficients with the echo states between time instants n - T + 1 and n, and the maximum in time above a prespecified threshold is picked to represent the occurrence of the pattern. In the case of a continuous stream, the initial conditions for each frame are dictated by the history of the input signal during testing; therefore, they are not necessarily zero. In order to mimic the test conditions while training the MACE, the initial conditions of the ESN are not set to zero but are determined by the history of the input. In other words, the whole input signal is used to create echo states, and then the resulting N x T dimensional echo state images are extracted and used to train the MACE filter using the above algorithm.

The Design of ESNs for Dynamical Pattern Recognition

One of the drawbacks of LAMs is their storage capacity, which is limited by the input space dimension (Haykin, 1998).
However, the ESN/LSM with a LAM readout becomes very powerful, since the echo/liquid states provide the LAM with a user-defined high-dimensional input space, increasing the number of patterns that can be stored in the LAM with minimal crosstalk. In


ESNs, the reservoir states can be interpreted as a set of basis functionals (representation space) constructed dynamically and nonlinearly from the input (Ozturk et al., 2007). Due to the nonlinearity, it is easy to get states that are linearly independent of each other (Ito, 1996; Ozturk et al., 2007). Therefore, the matrices in the MACE filter solution are typically full rank, allowing the computation of the inverses. Although linearly independent, the states are not orthogonal to each other, since they are connected through predefined weight values. An information-theoretic metric, called average state entropy (ASE), has recently been proposed to describe the richness of ESN dynamics (Ozturk et al., 2007). ASE quantifies the distribution of the state values across the dynamic range. When ASE is maximized, the states of the system are evenly distributed across the dynamic range instead of being populated only at a few values. The ASE measure is related to the system design by uniformly distributing the poles of the linearized system around the origin (eigenvalues of W) (Ozturk et al., 2007). We propose to use the same design methodology to construct the representation layer of ESNs with the LAM readout. This approximates for ESNs the well-known property of orthogonal bases and eases the LAM training. In the experiments performed for this study, we have observed that the inverted matrices are numerically well-conditioned for ESNs.

For LSMs, likewise, the value of each liquid state at each time instant is interpreted by the MACE as 1 or 0, depending on the existence or absence of a spike. This creates a very sparse state matrix, which likely leads to singular matrices in the MACE filter computation. Therefore, the matrices have to be regularized in order to compute appropriate MACE filter weights with the outlined formulation.
The regularization can be done by adding zero-mean Gaussian noise with a small variance to the states before computing the MACE solution.
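The noise regularization can be sketched as below: jittering a sparse binary (spike) state matrix before forming the frequency-domain matrices guarantees that the diagonal matrix D has no zero entries to invert. The sparsity level and noise variance are assumed values for illustration.

```python
import numpy as np

rng = np.random.default_rng(7)
K, d = 3, 50
spikes = (rng.uniform(size=(K, d)) < 0.05).astype(float)   # sparse 0/1 states

# Zero-mean Gaussian jitter with small variance regularizes the states.
states = spikes + rng.normal(scale=0.01, size=spikes.shape)

X = np.fft.fft(states, axis=1).T              # d x K frequency-domain matrix
D = np.mean(np.abs(X) ** 2, axis=1)           # diagonal of D, now strictly > 0
```

Without the jitter, frequency bins at which every training pattern has zero energy would make D singular and Equation 4-1 uncomputable.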


Experiments

This section presents a variety of experiments to compare the MACE filter readout proposed for the ESN/LSM with conventional techniques for dynamical pattern recognition.

Classification of temporal signals with unknown peak values

In this experiment, the aim is to classify a temporal signal into one of two possible classes. The two classes are represented by a triangular and a step function of 20 samples. The interesting property of the signals in this experiment is that the peak values are unknown and can assume any value between 0.5 and 1. In other words, each class is not a single signal but a family of signals. The signals are also corrupted with additive, zero-mean, white Gaussian noise.

We compare the performance of three different ESNs with the MACE filter readout, denoted by R-ESN1-MACE, R-ESN2-MACE and ASE-ESN-MACE. All ESNs have 30 PEs and are constructed using different techniques. R-ESN1-MACE is a randomly connected ESN where the entries of the W matrix are set to 0, 0.47, -0.47 with probabilities 0.8, 0.1, 0.1, respectively. This corresponds to a sparse connectivity of 20% and a spectral radius of 0.9. R-ESN2-MACE is a randomly connected ESN where the entries of the W matrix are chosen from a zero-mean Gaussian distribution. The weights are scaled such that the spectral radius is 0.9. The internal weight matrix, W, of the ASE-ESN-MACE is designed such that the eigenvalues of W (the poles of the system linearized around the origin) are uniformly distributed inside the unit circle. This principle was introduced in (Ozturk et al., 2007), and it distributes the states of the system evenly across the dynamic range instead of populating them only at a few locations. The entries of Win are set to 1 or -1 with equal probability for all three ESNs.
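One plausible way to realize the "uniform poles inside a disk" design described above is sketched below: sample conjugate eigenvalue pairs uniformly over the disk, assemble them into real 2x2 rotation-scaling blocks, and hide the block structure with a random orthogonal similarity transform. This is an illustrative construction under stated assumptions; the dissertation's exact recipe may differ.

```python
import numpy as np

rng = np.random.default_rng(6)
N, rho = 30, 0.9

# Conjugate pole pairs uniform over the disk of radius rho:
# sqrt-sampling of the radius makes the distribution uniform in area.
r = rho * np.sqrt(rng.uniform(size=N // 2))
th = rng.uniform(0, np.pi, size=N // 2)

# Each pair r*exp(+-j*th) becomes a real 2x2 scaled-rotation block.
blocks = [r_ * np.array([[np.cos(t), -np.sin(t)],
                         [np.sin(t),  np.cos(t)]]) for r_, t in zip(r, th)]
W = np.zeros((N, N))
for i, B in enumerate(blocks):
    W[2 * i:2 * i + 2, 2 * i:2 * i + 2] = B

# Random orthogonal similarity: same eigenvalues, dense-looking W.
Q = np.linalg.qr(rng.normal(size=(N, N)))[0]
W = Q @ W @ Q.T
```

The similarity transform preserves the spectrum exactly, so all poles remain inside the radius-0.9 disk while the final W has no visible block structure.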


Figure 4-2. Performance comparisons of ESNs for the classification of signals with unknown peak values. ESNs with the MACE filter readout perform better at all noise levels compared to discriminative and predictive ESNs. The classification accuracy of ASE-ESN-MACE is almost perfect up to a noise standard deviation of 0.3. ASE-ESN-MACE performs better than randomly connected ESNs with the MACE readout since it distributes the echo states uniformly and increases the separation between states corresponding to different classes.

To train the MACE filter readout for the ESNs, 100 different realizations of each class of signals are generated and the echo states are calculated using Equation 2-2 with zero initial conditions. This creates an image of dimension 30x20, where 30 is the number of PEs and 20 is the signal length. One MACE filter is trained for each class using only data from the corresponding class. Output correlation peak amplitudes are assigned to be 1.0 for the training data. The two MACE filters are synthesized in the frequency domain using Equation 4-1 and the


corresponding image plane filters are obtained by inverse discrete Fourier transform. For a given test signal, each filter output is calculated by correlating the echo states of dimension 30x20 with the filter coefficients at zero lag. Figure 4-2 shows the correct classification rate of the three methods as the standard deviation of the noise signal varies. The results are averaged over 100 realizations of each ESN.

We also compare the performance of the ESNs with the MACE filter readout to an ASE-ESN utilizing the conventional linear readout. When the conventional readout is used for dynamical pattern recognition, the problem is the design of the desired signal. One solution is to train a discriminative readout with a constant target signal of amplitude 1 for one of the two classes and a constant signal of amplitude -1 for the other class. The duration of the target signals is the same as the duration of the input signal. During testing, the echo states, the output signal, and the corresponding error signals, one for each label, are computed for a given input signal. Then, the two MSE values, integrated over the duration of the input pattern, are compared, and the class with the lower MSE is assigned to the input pattern. In this solution, the readout has to generate a constant signal independent of the input signal dynamics. An alternative, also tested, is to train two independent predictive readouts, one for each class, to predict the next sample of the input signal. During testing, the echo states, the output signal, and the error signal are computed for each readout. Then, the label of the readout with the lower MSE value, integrated over the duration of the input pattern, is assigned as the class of the input signal. The predictive readout seems a more natural choice since the desired response is not independent of the input dynamics. The same training data used for the MACE filter is employed to calculate the readout weights for the discriminative and predictive ESNs.
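The discriminative decision rule described above, assigning the class whose constant target yields the lower integrated MSE, reduces to a few lines; the helper name is ours:

```python
import numpy as np

def classify_by_mse(output, targets=(+1.0, -1.0)):
    """Sketch of the discriminative-readout decision rule.

    `output` is the readout signal over the duration of the input
    pattern. The class whose constant target (+1 or -1) gives the
    lower MSE, integrated over the pattern, is assigned.
    """
    output = np.asarray(output, dtype=float)
    mse = [np.mean((output - t) ** 2) for t in targets]
    return int(np.argmin(mse))  # 0 -> target +1, 1 -> target -1
```

The predictive variant replaces the constant targets with each readout's one-step prediction error, but the comparison of integrated MSEs is the same.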


Figure 4-2 shows the correct classification rate as the standard deviation of the noise signal varies. The results are averaged over 100 trials. As observed from the figure, ESNs with the MACE filter readout perform better at all noise levels compared to the ESNs with conventional or predictive readouts. The classification accuracy of ASE-ESN-MACE is almost perfect up to a noise standard deviation of 0.3. The correct classification rate of the discriminative ESN for the no-noise case is only 0.92, while for the predictive ESN it is 0.95, slightly better than the discriminative ESN. Since all three systems have the same reservoir, the better performance of the MACE readout is attributed to the fact that the problem requires modeling a class of functions all at once in a single template, and the MACE filter is able to build such a template containing covariance information about the full training set. The conventional and predictive readouts train for the best compromise over the peak amplitudes of the input classes in the training set. Being a nonlinear system, the ESN has dynamics coupled to the input signal. Therefore, the response of the system to input signals with varying amplitudes can be quite different, making the task of projecting the states to a constant value very challenging. The wave shape is also challenging, since for the pulse (a sequence of zero values followed by a sequence of ones) the linear readout has to transform the highly irregular states generated by this pattern into a constant value. Similarly, for the predictive ESN the variation in the input amplitudes affects the ESN dynamics, and the performance drops quickly as the noise standard deviation is increased. Moreover, the MACE filter readout is expected to be more robust with respect to the selection of W and Win in the ESN design compared to the conventional readout design methodology, where the states have to be carefully selected in order to estimate the desired signal with a linear readout. In contrast, the LAM readout does not try to estimate a desired response but rather generates a template from the states.


Among the ESNs with the MACE readout, ASE-ESN performs best compared to randomly generated ESNs. This was an expected result, since ASE-ESN distributes the echo states uniformly and increases the separation between states corresponding to different classes. Therefore, ASE-ESN should be the preferred design for ESNs with the MACE filter readout.

Robust signal detection in a digital communication channel

We applied ESN-MACE to the detection of a known waveform transmitted through a digital communication channel. Please refer to Chapter 6 for details of this study.

Classification of odors with an electronic nose

In this experiment, the goal is to classify odors using signals collected with a Cyranose 320 electronic nose. The Cyranose 320 has an array of 32 sensors which change their resistance in response to odors and produce a unique response pattern for different odors (Figure 4-3). For our experiment, the spices basil, cinnamon, ginger, and rosemary form the 4 classes of odors to be classified. Each of these spices is mixed with air and exposed to the electronic nose in 4 different concentrations: 25%, 50%, 75% and 100%. For each concentration of each spice, data corresponding to 10 different trials is collected. There are a total of 160 trials, 40 for each class of spice.

We present two approaches to solve the problem. The first approach is based on static features, namely the steady-state values of all 32 sensors as given by the Cyranose. We obtain a 32x160 matrix containing 32 features for all 160 trials of the 4 classes. From the feature matrix, we extract the part corresponding to 40 trials (10 trials for each class, with representatives from the 4 different concentrations) as the training set to train a linear classifier with 4 outputs, each representing one of the 4 classes. The linear classifier is used due to the availability of an analytical solution. Each output of the linear classifier is trained to output 1 when the input feature vector is from the corresponding class and 0 otherwise. The 4x32 weight matrix of the linear network is computed using least squares. During testing, the 32x1 feature vector is


Figure 4-3. Time series representing the response pattern of the 32 electronic nose sensors exposed to rosemary.

used to compute the 4 outputs, and the label of the output with the maximum value is assigned as the class of the input vector. The classification accuracy of the linear classifier evaluated on the 160 feature vectors is found to be 78.7%, and the confusion matrix is

24 16  0  0
 0 40  0  0
 0 14 22  4
 0  0  0 40

The entry aij in row i and column j of the confusion matrix is the number of odors that should have been classified in class i but were classified as j. The correct decisions are given on


the main diagonal of the confusion matrix. The 4 classes are presented in the order basil, cinnamon, ginger, and rosemary.

The second approach is based on the proposed ESN-MACE. The ESN exploits the dynamical response of the sensors, which also contains information about the odor class. We create a template MACE filter for each class from the states of the ESN, representing the variations among different spice concentrations. Again, 40 trials (10 trials for each class, with representatives from the 4 different concentrations) are used as the training set. Unlike the static case, each training trial consists of a 32-dimensional time series representing the response of the electronic nose. We use an ESN with 100 internal units and 32 inputs. The internal weight matrix, W, of the ESN is designed with a uniform eigenvalue distribution and a spectral radius of 0.9. The entries of Win are set to 5 or -5 with equal probability. In the previous two experiments, the selection of the length of the patterns was dictated by the known signal lengths. However, in this experiment, we do not have pre-defined signals, making the pattern length T a design parameter (Figure 4-3). Here, we choose T to be 80 so that the template pattern contains the rising and falling edges of the sensor response. This selection uses the transient sensor dynamics of the particular odor in the design of the MACE readout. For each input pattern, the echo states are calculated using Equation 2-2 with zero initial conditions. This creates an image of dimension 100x80, where 100 is the number of PEs and 80 is the signal length. 10 in-class patterns are used to train each of the MACE filters corresponding to the 4 classes. Output correlation peaks are assigned to be 1.0. During testing, echo states are calculated for each pattern and the 4 MACE filter outputs are computed. The label of the MACE filter with the highest value is assigned as the class of the input pattern. The classification accuracy of ESN-MACE evaluated on the 160 input patterns is found to be 63.1%, and the confusion matrix is


28 11  1  0
 0 17 16  7
 0  0 35  5
 0  1 18 21

The results are, indeed, worse than those of the linear classifier trained on static features. When examined closely, the MACE filter gives correlation values around 1.0 for the in-class data. However, the response of the MACE filter corresponding to the 2nd class (cinnamon) for out-of-class data is greater than 1 for most of the patterns, which results in the false classifications. A MACE filter trained non-discriminatively does not constrain the response of the filter to out-of-class data. One solution to this problem is to use out-of-class data in the MACE filter training, as is done in (Mahalanobis et al., 1987). Differently from the previous experiments, both in-class and out-of-class training data are used to train each of the 4 MACE filters corresponding to the 4 classes. For each MACE filter, 10 in-class and 30 out-of-class (10 for each other class) echo state images are used for discriminative training. Output correlation peak amplitudes are assigned to be 1.0 for the in-class and 0.1 for the out-of-class training data.

Using the training data, four MACE filters are synthesized in the frequency domain using Equation 4-1 and the corresponding image plane filters are obtained by inverse discrete Fourier transform. For a given test signal, the output of each MACE filter is calculated by correlating the echo states of dimension 100x80 with the filter coefficients. The label of the filter with the largest correlation is assigned as the class of the input pattern. Out of the 160 trials, the correct classification rate is found to be 93.1%. Note that this is much better than the performance of the classifier based on static features. The confusion matrix is


39  1  0  0
 0 35  5  0
 0  3 37  0
 0  0  2 38

The addition of out-of-class data in the training of the MACE improves the rejection of out-of-class patterns and hence decreases the false alarm rate. However, it also converts the problem to a closed-set classification problem, where the number of classes has to be known a priori. It is desirable, whenever possible, to avoid the use of out-of-class data in the training set for the detection of signals, where the out-of-class data is either not available a priori or can be any possible signal shape excluding the in-class signal shape (the entire signal space excluding the in-class signal space).

Alternatively, we trained MACE filters in the input space instead of the state space of the ESN. In the input space, the images are of dimension 32x80. The procedure of training MACE filters for each class is repeated in the 32-dimensional input space. Out of the 160 trials, the correct classification rate is found to be 85.6%. The confusion matrix of the MACE filter is

39  1  0  0
 0 32  7  1
 0  5 32  3
 0  1  5 34

In order to understand the increase in classification rate achieved by the ESN, let us compare the outputs of the ESN-MACE filter with those of the MACE filter trained directly in the input space for the rosemary class (Figure 4-4). As observed from the figures, the ESN-MACE filter output corresponding to the rosemary class consistently displays values around one for the in-class data, except for the 37th testing sample.

The ESN-MACE filter outputs corresponding to the other classes are very close to 0.1, as specified by the training procedure. However, the MACE filter trained directly in the input space has very inconsistent outputs for the out-of-class samples. The margin between the in-class and out-of-class outputs of the MACE filter is increased by the ESN reservoir. There are mainly two reasons for the increase in the distance between the images with the addition of the ESN. The first is the projection of the signals to a higher dimension by the ESN, which increases the separation between the images. The second is that the ESN functions as a nonlinear filter that helps separate the responses to different input classes.

We also compared the results obtained to a TDNN trained in discriminative mode. 4 TDNNs, one for each class, are used, where the desired target signal is +1 for in-class data and 0 otherwise. Each TDNN has a delay line of 81 samples (equal to the signal duration) for each of the 32 sensors, 3 hidden-layer PEs, and a single output unit. The same data (10 in-class and 30 out-of-class) as for the discriminative ESN-MACE is used to train the weights of each TDNN using the Levenberg-Marquardt algorithm with a step size of 0.01. Training is stopped after 100 epochs. A cross-validation set cannot be used due to the limited number of training data. During testing, an input pattern is fed to each of the 4 TDNNs and the label of the TDNN with the largest output is assigned as the class of the input pattern. Out of the 160 trials, the correct classification rate is found to be 65.7%. The confusion matrix is

30  2  2  6
 6 21 10  3
 1  7 29  3
 0  2 13 25

The performance of the TDNN is very poor compared to the discriminative ESN-MACE and even the static classifier. The main reason for the poor performance is the lack of sufficient data


for training the TDNN. The number of parameters of each TDNN is 32*81*3+3=7779, whereas only 40 training patterns are available.

Spike train classification with LSM

The associative memory readout presented for the ESN can in fact also be utilized as a readout for an LSM. Similar to the ESN case, the liquid states are interpreted as 2-D images, one dimension time and the other space. An optimal MACE filter, which associates LSM states with class labels, can be analytically computed for each class using training data. The value of the liquid state is either 1 or 0, depending on the existence or absence of a spike. Note that the MACE filter readout does not require low-pass filtering the spike trains in order to obtain continuous-amplitude signals for further processing.

We use a spike train classification experiment, where the goal is to classify a spike train presented to the LSM as one of two classes (Maass et al., 2002). The two classes are represented by randomly generated Poisson-distributed spike trains over a 100 ms time interval. The other members of each class are generated by moving each spike of the representative spike train by an amount drawn from a zero-mean Gaussian distribution with a variance of 4 ms.

A randomly connected LSM with 50 integrate-and-fire neurons with 20% inhibitory (I) and 80% excitatory (E) connections is used. The membrane time constant, reset voltage, threshold voltage, background current, and input resistance of the neurons are chosen to be 30 ms, 13.5 mV, 15 mV, 13.5 nA, and 1 MOhm, respectively. Absolute refractory periods for excitatory and inhibitory neurons are 3 ms and 2 ms, respectively (Maass et al., 2002). The probability of a synaptic connection between neurons a and b is given by C e^(-(D(a,b)/λ)^2), where D(a,b) denotes the distance between the two neurons and λ is a parameter which controls both the average number of connections and the average distance between neurons that are synaptically connected. The


neurons are placed on a 10x5x1 grid. C is set to 0.3 (EE), 0.2 (EI), 0.4 (IE), 0.1 (II). Dynamic synapses are used based on the model proposed in (Markram et al., 1998), with the synaptic parameters U (use), D (depression time constant), F (facilitation time constant) randomly chosen from Gaussian distributions. The mean values of U, D, F are set to 0.5, 1.1, 0.05 (EE); 0.05, 0.125, 1.2 (EI); 0.25, 0.7, 0.02 (IE); 0.32, 0.144, 0.06 (II), and the SD of each parameter is chosen to be 50% of its mean. The mean of the scaling parameter W (in nA) is chosen from a gamma distribution with mean values 30 nA (EE), 60 nA (EI), -19 nA (IE), -19 nA (II) (Maass et al., 2002). For input synapses, the parameter W has a value of 18 nA or 9.0 nA for excitatory and inhibitory destination neurons, respectively. The SD of W is 100% of its mean. The postsynaptic current is modeled as an exponential decay exp(-t/t_s), where t_s is 3 ms and 6 ms for excitatory and inhibitory synapses, respectively. The transmission delays between neurons are 1.5 ms (EE) and 0.8 ms for the other connections. The initial membrane voltage of each neuron at the start of each simulation is drawn from a uniform distribution between 13.5 mV and 15.0 mV (Maass et al., 2002).

Fifty different realizations of spike trains for each class are generated and the liquid states are calculated for each realization. One readout MACE filter is trained for each class using only data from the corresponding class. The output correlation peak amplitude is assigned to be 1.0 for the training data. Notice that the state matrix is very sparse, resulting in singular matrices in the MACE filter computation. Therefore, the matrices have to be regularized in order to compute inverses. Here the regularization is done by adding zero-mean Gaussian noise with a variance of 0.02 to the states before computing the MACE solution.
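The distance-dependent connectivity rule above can be sketched as follows. The grid layout and the C values follow the text; which neurons are designated inhibitory is an assumption here (the first 20%, whereas Maass et al. assign types randomly), and the function name is ours.

```python
import numpy as np

def lsm_connectivity(lam=2.0, rng=None):
    """Sketch of the distance-dependent random LSM connectivity.

    Neurons sit on a 10x5x1 grid; a synapse from a to b exists with
    probability C * exp(-(D(a,b)/lam)**2), with C per (pre, post)
    type: EE 0.3, EI 0.2, IE 0.4, II 0.1.
    """
    rng = np.random.default_rng(rng)
    pos = np.array([(x, y, 0) for x in range(10) for y in range(5)], float)
    n = len(pos)                                   # 50 neurons
    inhib = np.zeros(n, bool)
    inhib[: n // 5] = True                         # 20% inhibitory (assumption)
    C = {(False, False): 0.3, (False, True): 0.2,  # keys: (pre_inhib, post_inhib)
         (True, False): 0.4, (True, True): 0.1}
    conn = np.zeros((n, n), bool)
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            dist = np.linalg.norm(pos[a] - pos[b])
            p = C[(bool(inhib[a]), bool(inhib[b]))] * np.exp(-(dist / lam) ** 2)
            conn[a, b] = rng.random() < p
    return conn, inhib
```

Increasing lam raises both the number of connections and the typical distance between connected neurons, which is the parameter swept in Figure 4-5.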
The two MACE filters are synthesized in the frequency domain using Equation 4-1, and the corresponding image plane filters are obtained by an inverse discrete Fourier transform. For a given test signal, the output of each filter is calculated by correlating the liquid states with the filter coefficients.


Spike train classification with LSMs is conventionally done by first low-pass filtering the spike output of the liquid and using a linear regressor network (Maass et al., 2002). Here, we used a low-pass filter with a time constant of 30 ms. A linear readout is trained with a constant target signal of amplitude +1 for one of the classes and a constant signal of amplitude -1 for the other class, using the same data as for the MACE filter training. During testing, the echo states, the output signal, and two error signals, one between the output and the constant signal of amplitude +1 and the other between the output and the constant signal of amplitude -1, are computed for a given input signal. Then, the two MSE values, integrated over the duration of the input pattern, are compared to assign the class of the input pattern.

Figure 4-5 shows the correct classification rate for both methods as the parameter λ, which controls the synaptic connections in the LSM, varies from 0.2 to 4. The results are averaged over 100 trials. As observed from the figure, the MACE filter readout gives slightly better accuracy compared to the linear readout for all λ values. This is due to the fact that the linear readout tries to approximate a constant signal independent of the liquid state values. This creates difficulties for the simple linear readout, especially when the input spike train is sparse. In fact, under closer analysis, the output of the linear readout fluctuates. Therefore, the decision made by the linear readout has to be carefully averaged over time in order to reach a final decision about the class of the input spike train. For example, if the decision is based solely on the output of the readout at the final time instant, the classification rate is very close to the theoretical minimum of 0.5. On the other hand, the MACE filter readout provides the ability to operate in the spike domain without the need to low-pass filter the liquid state spike output. However, we note that the MACE filter computation in Equation 4-1 requires extra caution, since the liquid state has many zero values, mostly leading to singular matrices.
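The conventional pre-processing mentioned above, low-pass filtering the binary spike trains with a 30 ms time constant before applying the linear readout, can be sketched as a first-order exponential filter. The 1 ms simulation step and the function name are assumptions for illustration.

```python
import numpy as np

def lowpass_spikes(spikes, tau_ms=30.0, dt_ms=1.0):
    """Sketch: convolve each binary spike train with a decaying
    exponential of time constant tau (30 ms here) to obtain a
    continuous-amplitude signal for the linear readout.

    spikes : array (neurons, time) of 0/1 values.
    """
    decay = np.exp(-dt_ms / tau_ms)
    out = np.zeros_like(spikes, dtype=float)
    acc = np.zeros(spikes.shape[0], dtype=float)  # one filter state per neuron
    for t in range(spikes.shape[1]):
        acc = decay * acc + spikes[:, t]          # leaky integration of spikes
        out[:, t] = acc
    return out
```

The MACE readout skips this step entirely, operating on the raw 0/1 state image.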


Figure 4-5. Comparison of the correct classification rates of the LSM-MACE and the LSM with a linear readout on the spike train classification problem, as the parameter λ varies. The MACE filter readout gives slightly better accuracy compared to the linear readout for all λ values. This is due to the fact that the linear readout tries to approximate a constant signal independent of the liquid state values. Moreover, the MACE readout operates in the spike domain, eliminating the need to convert spike trains to continuous-valued signals by low-pass filtering.

This experiment demonstrates the use of the MACE filter as a readout for LSMs. A more detailed study has to be carried out to explore the advantages and disadvantages of the MACE readout compared to the standard techniques.


CHAPTER 5
IMPLICATIONS OF ESN ON THE DESIGN OF OTHER NETWORKS

Echo state networks introduced a new type of computation that has been proven to be very powerful. In this chapter, we study how the ESN idea can be utilized for the design of other networks. First, we investigate the echo state condition (spectral radius of W less than 1) and approach it in terms of the effect of system stability on the power of a dynamical system for computation. In particular, we relax the echo state condition and allow some of the poles to be larger than 1 in magnitude. This introduces a new dynamical regime, called transiently stable computation, where function approximation with a readout is still possible even though the system is not globally stable. Second, we investigate a biologically plausible model of the olfactory cortex, the Freeman Model (FM), and propose to use a readout for the FM in order to employ it for useful computation. Without the proposed readout, the use of the FM is limited to simple digit recognition. An interesting property of the FM is the nonlinear function used in the model: the FM nonlinearity does not have its largest slope at the zero operating point, unlike the sigmoidal nonlinearity used in ESNs. We will demonstrate with experiments that the FM coupled with a readout can process continuous-valued signals.

Transiently Stable Computation

Linear filters are the simplest dynamical systems employed to process signals with temporal structure, as discussed in Chapter 1. They can be classified according to the impulse response, which fully characterizes the filter, as finite impulse response (FIR) and infinite impulse response (IIR) filters (Haykin, 2001). For the IIR case, stability becomes an essential constraint in the design. Without stability, the system response will diverge independently of the input signal. Global asymptotic stability of linear filters can simply be guaranteed by selecting the poles of the system in the proper region of the frequency domain: the open left half plane for


continuous-time systems and inside the unit circle for discrete-time systems (Kailath, 1980). TDNNs and RNNs can be considered the nonlinear counterparts of FIR and IIR filters, respectively. Similar to the linear case, stability becomes an issue for RNNs, although the definition of stability is no longer BIBO (bounded-input bounded-output) stability. In ESNs, the stability issue is addressed with the echo state condition (Jaeger, 2001), which has been investigated in detail in Chapter 2. The echo state condition is derived from the stability of the linearization of the system around the zero equilibrium and guarantees that the states are strongly coupled with the input signal, allowing the system to compute an input-output map. In this chapter, we introduce a new computational mode that emerges when we relax the echo state condition and allow the spectral radius to be slightly greater than 1. First, let us consider an ESN that obeys the echo state condition.

Conventional Echo State Networks

A simple demonstration of the typical response of an ESN satisfying the echo state condition is given in Figure 5-1. A randomly connected 100-unit ESN without output feedback connections is constructed. The entries of the internal connection weight matrix W are set to 0.4, -0.4, and 0 with probabilities of 0.025, 0.025, and 0.95, resulting in a spectral radius of 0.9. The input weight matrix, Win, has values +1 or -1 with equal probability. The input to the system (Figure 5-1 A) has 3 different regions: a zero signal for the first 100 steps, a ramp signal for the following 100 steps, and again a zero signal of length 100. The system is initialized randomly and run with the given input.
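The setup just described can be reproduced schematically. The sketch below assumes the standard ESN update x(n+1) = tanh(W x(n) + Win u(n)) for Equation 2-2 and a least-squares readout for Equation 2-4; it is illustrative rather than the exact code behind Figure 5-1 (in particular, the zero initial state and random seed are ours).

```python
import numpy as np

def run_esn(W, Win, u, x0=None):
    """Sketch of the ESN state update (standard tanh form assumed
    for Equation 2-2); u is a scalar input sequence."""
    n_units = W.shape[0]
    x = np.zeros(n_units) if x0 is None else x0.copy()
    states = np.empty((len(u), n_units))
    for n, un in enumerate(u):
        x = np.tanh(W @ x + Win * un)
        states[n] = x
    return states

def train_readout(states, desired):
    """Least-squares linear readout (assumed form of Equation 2-4)."""
    wout, *_ = np.linalg.lstsq(states, desired, rcond=None)
    return wout

# Reproducing the setup above: 100 units, sparse weights of +-0.4,
# spectral radius scaled to 0.9, Win entries +-1, ramp input, and
# desired output equal to the seventh power of the input.
rng = np.random.default_rng(0)
W = rng.choice([0.4, -0.4, 0.0], p=[0.025, 0.025, 0.95], size=(100, 100))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))
Win = rng.choice([1.0, -1.0], size=100)
u = np.concatenate([np.zeros(100), np.linspace(0, 1, 100), np.zeros(100)])
states = run_esn(W, Win, u)
y = states @ train_readout(states, u ** 7)
```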

Figure 5-1 B depicts all 100 echo states on top of each other over the given time interval. As seen from the figure, during the first 100 steps the system state converges to zero (after an initial transient of about 30 steps caused by the random initial condition), since the input to the system is zero and the spectral radius is less than 1. Then, the echo states are driven by the ramp signal between time steps 100 and 200. The states again converge to zero, due to the echo state condition, when the input is removed. The resulting echo states can be used to generate any desired response related to the input signal. Here, the desired signal is chosen to be the seventh power of the input signal, and the readout weights are calculated using Equation 2-4. Figure 5-1 C displays the overlay plot of the optimal ESN output and the desired signal. As observed from the figure, the system output estimates the desired signal very accurately.

Transiently Stable Computation with Echo State Networks

Engineering systems are mostly designed with a global stability restriction. We have seen a similar constraint, the echo state condition, in the design of echo state networks. We have also demonstrated with a function approximation experiment that ESNs with the echo state condition can do useful computation. On the other hand, biological computation may not possess a similar stability constraint (Freeman, 1975; Freeman, 1993; Yao and Freeman, 1990). Indeed, biological systems do not have fixed-point dynamics but rather a wide variety of collective behaviors in the form of oscillations and even chaos (Freeman, 1975; Yao and Freeman, 1990). Inspired by principles of biological computation, we would like to explore the network's response without the restriction on the spectral radius of W. Hence, we will remove the echo state condition and let the spectral radius be slightly greater than 1. Obviously, this will introduce instability for the autonomous system, whereas the response of the system with an applied input is not yet obvious.
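Relaxing the echo state condition amounts to a single scaling step: since the eigenvalues of cW are c times those of W, dividing by the current spectral radius and multiplying by the target value sets any desired radius exactly. A sketch (the helper name is ours):

```python
import numpy as np

def scale_spectral_radius(W, rho):
    """Rescale a reservoir matrix so its spectral radius equals rho.
    rho < 1 satisfies the echo state condition; rho slightly above 1
    (e.g. 1.1) yields the transiently stable regime discussed here."""
    return W * (rho / np.max(np.abs(np.linalg.eigvals(W))))
```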


Figure 5-2. Demonstration of a typical response of a transiently stable ESN with a spectral radius of 1.1 and its application to function approximation. A) The 100 echo states of the transiently stable ESN. B) The overlay plot of the ESN output and the desired signal.

In order to observe the ESN response without the echo state condition, we use an experimental setup similar to that of the previous section. We use the same W matrix, scaled to obtain a spectral radius of 1.1. The same input signal (Figure 5-1 A) is fed to the ESN


that is initialized to a random initial condition. We plot the resulting echo states in Figure 5-2 A. There are a few observations to be made at this point. First of all, we see that the system does not converge to zero during the first 100 steps, although the input is zero. Instead, the echo states exhibit a nonconvergent dynamical response that differs from that of the ESN with the echo state condition. In fact, this was the expected response, since the system is designed to be unstable (spectral radius greater than one) around the zero equilibrium. A similar response can also be observed during the last 100 time steps, when the input is again zero. The second and more important observation concerns the response of the system between time steps 100 and 200, where the ramp signal is applied as the input. The echo states during this interval become more regular (after some initial transient between time steps 100 and 130) compared to the states when there is no input. In fact, after the transient, the echo states look similar to the ones in Figure 5-1 B, where the ESN satisfies the echo state condition. Thirdly, there is a transition period between time steps 100 and 130 where the states switch from a disordered mode to a more regular mode. In summary, the system responds according to its own dynamics when there is no input. As the input amplitude gradually increases, the system response is determined by a competition between the system dynamics and the input amplitude. When the input amplitude is sufficiently large, the system dynamics becomes transiently stable and is determined by the input signal.

We would like to find out if we can utilize the transiently stable ESN for the same function approximation problem we had in the previous section. The weights of the linear readout network are again computed using Equation 2-4 and the corresponding output is shown in Figure 5-2 B. As seen from the figure, we obtain a good match between the output and the desired signal. Similar results can be obtained even if the system is initialized to a different initial condition or the time instant at which the ramp signal is applied is changed (this will change the state value when the ramp signal is

Now, let us examine the movement of the poles of the transiently stable ESN of the previous section. Figure 5-4 shows the movement of the poles for the same input signal. First of all, the poles of the system move in and out of the unit circle during the first and last 100 time steps (hence the zero state is not asymptotically stable). This explains the complex echo state signal when there is no input to the system (Figure 5-2 A). The poles of the ESN with spectral radius 0.9 stay inside the unit circle during the same time interval.

Figure 5-4. Movement of the poles of the ESN with spectral radius 1.1, plotted in panels A, B, C, D, when the input is at the points labeled A, B, C, D in Figure 5-1 A, respectively.

Secondly, when the ramp input is applied, the poles start to shrink towards the unit circle (after some transient), as is observed in the stable ESN. This phenomenon is again due to the sigmoidal nature of the nonlinearity, which saturates with large input signals, reducing the slope at the operating point in Equation 2-5. With the movement of the poles towards the origin, the system is stabilized transiently, allowing the system state to be controlled by the input. This is critical for the use of the system for


computation (function approximation). Finally, we observe a transient time, when the ramp signal is applied, before the poles start shrinking. In this period, there is a competition between the system dynamics and the input signal.

A few extra comments about the selection of the spectral radius are in order at this point. First, the transient response time after the ramp is applied is determined by two factors: the spectral radius and the slope of the ramp signal. The transient time increases if the spectral radius of W is increased or the slope of the ramp signal is reduced. This is because, as the spectral radius is increased, the force required to bring the poles inside the unit circle increases, hence a larger signal is required. Therefore, there is a balance between the system dynamics determined by the spectral radius and the input signal required to stabilize the system. Secondly, it is important to quantify the unstable behavior of the signals present in the transiently stable ESN when the input is absent. We have observed that when the spectral radius is slightly greater than 1, the signals are periodic oscillations, whereas for larger values of the spectral radius, the signals are more irregular and even chaotic.

Freeman's K Sets for Dynamical Computation

Understanding information processing in brains remains a challenge. Many different hypotheses on how the brain might process the massive bombardment of information brought by the sensory systems have been advanced, ranging from artificial intelligence and neural networks to, more recently, machine learning and statistics (Freeman, 1975; Yao and Freeman, 1990; Chua and Yang, 1988; Wang, 1999). Walter Freeman developed a biologically realistic mathematical model of the olfactory cortex that captures, at the cell assembly (mesoscopic) level, some of the physiological properties of the olfactory system.
The Freeman model (FM) is based on nonlinear recurrent dynamical interactions representing the response of neural assemblies of thousands of cells, instead of single neuron responses. In order to derive his model, Freeman used neurophysiological recordings, neuroscience concepts and the K set hierarchy (Freeman,


1975) to build a model that mimics the behavior of olfaction (Freeman, 1975). This model is a drastic departure from other neural network models such as Hopfield's, because the system response is produced by interactions of second-order nonlinear sets.

Freeman's major drive is to understand how the olfactory system works; here, in contrast, we aim at using the FM as a computational framework for the interaction of animats with the real world. Real worlds are complex dynamical environments that unfold unpredictably, with lots of redundancy but also with many unrelated cues. In this setup, the fundamental operations are the recognition of prior situations that will induce prototypical responses dependent upon the animat's goals. The constraints of the experimental setting are also rather important. The operation is intrinsically real-time in the sense that only the present value of the continuous stream of information from the sensors is available and must be processed at each time step, while the full understanding of the situation is most often related to the input history. Hence we favor signal processing methodologies over statistics, where the behavior in time is always rather constrained. The animat may also need to learn to recognize new situations when they become important for its function; therefore it must have internal representational structures that anticipate outcomes and change behaviors. Brains process information for the sake of the animal's survival, while animats have external goals preprogrammed by humans, and as such they must possess an extra level of functionality that translates internal representations back and forth to the outside, i.e., it is necessary to define inputs and outputs (in the animal, the brain's only output is the motor system). Therefore we have to emphasize this translation between distributed dynamical representations and outputs in the form of design principles for optimal readouts.
The framework we are pursuing is based on distributed dynamics instead of machine learning. Since Hopfield, the role of dynamics in associative memories has been well understood


(dynamical systems with fixed point attractors). In fact, the FM has also been utilized as a dynamical auto-associative memory (Tavares, 2001; Ozturk et al., 2004; Xu et al., 2004). The FM weights are trained with Hebbian learning while two binary patterns to be stored are presented to the system successively. During testing, a noise-corrupted version of one of the two stored patterns is presented to the system. Then, the energy of the oscillatory system response is computed and compared to a threshold to choose the correct stored pattern. We see two major shortcomings with the energy-based readout. First, utilizing just the energy is wasteful, since it ignores the information embedded in the dynamics of the signals. Second, it does not exploit the global information in the states, since it is a local method computing the energy of individual processing elements.

In this chapter, we demonstrate that the FM can be considered in the same framework as the ESN/LSM. In fact, the KI and KII networks of the FM are conceptually similar to the reservoir/liquid in an ESN/LSM with a proper selection of the parameters. The big difference is that the KII layer is highly structured, with a coupled set of nonlinear oscillators. We will derive conditions on the system parameters of the FM similar to the echo state property of ESNs. With the proposed framework, it becomes evident that the OB layer alone lacks a readout in order to be used as a universal computing machine for time series. Therefore, we propose to use an adaptive linear network to implement the readout from the OB layer. We present experimental results to show the power of this framework for the FM of the olfactory cortex. In particular, we show that it is not necessary to drive Freeman's model with 0s and 1s, as is presently done.

An Overview of the Freeman Model

The Freeman Model is a biologically plausible mathematical model of the olfactory cortex.
The dynamical behavior of the FM mimics the physiological signals obtained from the cortex. It is a


hierarchical model of different levels (K0, KI, KII and KIII), where simpler structures at the bottom of the hierarchy combine to form the more complex structures at the top of the hierarchy. The basic building block of the model is the K0 set, which is depicted in Figure 5-5 A. Every K0 unit can accept several spatial inputs that are weighted and summed, and then convolved with a linear time-invariant system defined by second-order dynamics whose transfer function H(s) is given by

H(s) = ab / ((s + a)(s + b))     (5-1)

where 1/a and 1/b are real time constants determined experimentally (Freeman, 1975). The output of the linear dynamics is then shaped by the nonlinear sigmoid function, which is experimentally determined to be

Q(x) = Qm (1 - exp(-(e^x - 1)/Qm))  if x > x0,   Q(x) = -1  if x <= x0     (5-2)

where x0 = ln(1 - Qm ln(1 + 1/Qm)). The sign of the connection strength from a K0 set defines the type of the K0 set (excitatory for positive weights and inhibitory for negative weights). The second level in the hierarchy is the KI network. K0 sets with a common sign (either excitatory or inhibitory) are connected through forward lateral feedback to construct a KI network (Figure 5-5 B). No autofeedback is allowed in the network. The KI network is generally used as an input layer for the higher levels of the Freeman model. It preprocesses the input to transform it into a space that is compatible with the OB dynamics. In this study, we will use the KI network as an echo state network containing a representation of the input history.
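The asymmetric sigmoid of Equation 5-2, as reconstructed above, is easy to evaluate directly. The sketch below assumes the reconstructed form with Qm = 5 as an illustrative value; the function name is ours.

```python
import numpy as np

def freeman_sigmoid(x, Qm=5.0):
    """Freeman's asymmetric sigmoid (Equation 5-2, as reconstructed here):
    Q(x) = Qm * (1 - exp(-(e^x - 1)/Qm)) for x > x0, and -1 otherwise,
    with x0 = ln(1 - Qm * ln(1 + 1/Qm)). Qm = 5 is an illustrative value.
    """
    x = np.asarray(x, dtype=float)
    x0 = np.log(1.0 - Qm * np.log(1.0 + 1.0 / Qm))
    out = Qm * (1.0 - np.exp(-(np.exp(x) - 1.0) / Qm))
    return np.where(x > x0, out, -1.0)
```

Note that Q(0) = 0, the function saturates at Qm for large inputs, and it is continuous at the cutoff x0, where the upper branch also evaluates to -1.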

The third level in the hierarchy, the KII set (Figure 5-5 C), is the most interesting and important building block of the olfactory system, since it is an oscillator controlled by the input. The response of the KII set to an impulse input is a damped oscillation, whereas with a sustained input the output oscillation is maintained as long as the input remains. The architecture of the KII set is shown in Figure 5-5 C, where the circles denote K0 sets and the sign indicates the type of connection. Figure 5-5 D shows a reduced KII (RKII) set, where the two K0 units are denoted by M (for the mitral cell) and G (for the granular cell). In this model, the mitral cell takes the external input P(t), and the coupling strengths between M and G are controlled by the two weights Kmg > 0 (excitatory) and Kgm < 0 (inhibitory). If the coupling weights are selected properly, the RKII set is, like the KII set, an oscillator controlled by the input (Freeman, 1975).

A KII network can be formed by interconnecting a number of KII sets with excitatory connections between the excitatory cells (top circles in Figure 5-6) and inhibitory connections between the inhibitory cells (bottom circles in Figure 5-6). An RKII network is formed similarly by connecting a number of RKII sets. We will use the RKII network in our simulations instead of the KII network, since various analytic tools are available for the RKII and it has been demonstrated that the RKII is functionally very similar to the KII (Freeman, 1975; Xu et al., 2004). This interconnected structure represents a key stage of learning and memory in the olfactory system. Input patterns applied through the M cells are mapped into spatially distributed outputs. Excitatory and inhibitory interconnections enable cooperative and competitive behaviors, respectively, in this network. The RKII network functions as an associative memory (Tavares, 2001; Xu et al., 2004; Principe et al., 2001; Ozturk et al., 2004).


Figure 5-6. A full KII network.

The final Katchalsky level is the KIII network, which represents the olfactory cortex. In a KIII network, two or three KII sets and a KII network are tightly coupled through dispersive connections, mimicking the different lengths and thicknesses of axons. Since the intrinsic oscillating frequencies of the KII sets in different layers are incommensurate among themselves, this network of coupled oscillators will present chaotic behavior. For a detailed description of the KIII network, we refer the reader to (Freeman, 1975; Xu et al., 2004).

Now, we will formulate a standard form for the state equations (the outputs of the K0 sets) of the Freeman model, valid for all hierarchical sets of the FM. Consider a Freeman network with M input units, N K0 sets and L output units. The state equations for this Freeman model can be expressed as

x''(t) + (a + b) x'(t) + ab x(t) = ab (W Q(x(t)) + Win u(t))     (5-3)

Here, x(t) = (x1(t), x2(t), ..., xN(t))^T is the state vector, where each entry is the output of a K0 set at time t, and u(t) = (u1(t), u2(t), ..., uM(t))^T is the input vector. W is an N x N matrix defining the interconnection weight values between the K0 sets, and Win is an N x M matrix defining the interconnection weights between the inputs and the K0 sets. For example, consider the RKII set depicted in Figure 5-5 D. The governing equations for the RKII set are given by


m''(t) + (a + b) m'(t) + ab m(t) = ab (Kgm Q(g(t)) + p(t))
g''(t) + (a + b) g'(t) + ab g(t) = ab Kmg Q(m(t))     (5-4)

In an RKII set, the number of K0 sets is two and the number of inputs is one. Hence, in the standard form, M is one and N is two. With this information, Equation 5-4 can be restated in the standard form of Equation 5-3 with the state vector x(t) = [m(t) g(t)]^T, the input vector u(t) = [p(t)], the input weight vector Win = [1, 0]^T and the weight matrix

W = [ 0    Kgm ]
    [ Kmg  0   ]

Dynamical Computation with the Freeman Model with a Readout

In order to be able to compute with a dynamical system, one has to be aware of how the information is represented in the system states. Once the form of representation is known, the desired information can be extracted from the system states with an appropriate readout mechanism. For instance, in DeLiang Wang's network (Wang, 1999), the phase difference between the coupled oscillators is the source of information. For the FM, it has been argued that the information is encoded as amplitude-modulated signals, and hence the energy of the K0 output has been used as the relevant information from the FM in practical applications. For example, the RKII network has been proposed as an auto-associative memory that can store static binary patterns (Tavares, 2001; Xu et al., 2004; Ozturk et al., 2004). In (Xu et al., 2004), an RKII network of 64 reduced KII sets is constructed with fixed Kmg, Kgm and Kii values (determined experimentally from biological data). The excitatory weights of the RKII network are trained with Oja's unsupervised learning rule (a stable version of the Hebbian learning rule) while two binary patterns to be stored are presented to the system successively. During testing, a noise-corrupted version of one of the two stored patterns is presented to the system. Each RKII set creates limit cycle oscillations of low or high energy depending on the stored pattern and the input pattern applied. Then,


the energy (averaged over a time interval) of the mitral K0 set of each RKII set is computed and compared to a threshold to decide the binary system output. We see two major shortcomings with the energy-based readout for the FM. First, utilizing just the energy is wasteful, since it ignores the information embedded in the dynamics of the system state. Second, it does not exploit the global interdependencies among the RKII sets, since it is a local method computing the energy of individual processing elements. Hence, the energy-based readout cannot be optimal and cannot be used for continuous dynamic patterns. It is clear that a more principled approach is necessary in order to use the FM to process signals that are not necessarily binary or static. By analogy with the LSM/ESN framework, we propose to use a readout which is a linear or nonlinear projection of all the RKII states. The role of the readout is to extract the specified desired information from the FM states.
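To make the RKII dynamics behind these states concrete, the pair of second-order equations in Equation 5-4 can be integrated numerically. The sketch below uses semi-implicit Euler integration with np.tanh standing in for Freeman's sigmoid Q; the constants a, b and dt and all names are illustrative assumptions (only Kmg = 4 and Kgm = -1 match values used later in the chapter).

```python
import numpy as np

def simulate_rkii(p, Kmg=4.0, Kgm=-1.0, a=0.22, b=0.72, dt=0.01):
    """Integrate the RKII equations (Equation 5-4):
        m'' + (a+b) m' + ab m = ab (Kgm Q(g) + p(t))
        g'' + (a+b) g' + ab g = ab  Kmg Q(m)
    p: (T,) input sequence. Returns a (T, 2) array of [m, g]."""
    Q = np.tanh                       # stand-in for Freeman's sigmoid
    m = dm = g = dg = 0.0
    out = np.empty((len(p), 2))
    for n, pn in enumerate(p):
        ddm = -(a + b) * dm - a * b * m + a * b * (Kgm * Q(g) + pn)
        ddg = -(a + b) * dg - a * b * g + a * b * Kmg * Q(m)
        dm += dt * ddm
        m += dt * dm
        dg += dt * ddg
        g += dt * dg
        out[n] = (m, g)
    return out

# impulse-like input followed by zero input: a damped oscillatory response
traj = simulate_rkii(np.r_[np.ones(50), np.zeros(5000)])
```

With these parameter values the coupling satisfies the stability regime discussed below, so the response to a brief input decays rather than sustaining an oscillation.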


Figure 5-7. Freeman Model as an associative memory. A) Stored patterns. B) Recall of the corrupted patterns: the first, second and third columns are the input to the KII network, the output before the threshold detector, and the output after the threshold detector, respectively.

We will also demonstrate that Freeman's K sets provide a complete system; that is to say, the readout can also be chosen from the Freeman hierarchy, namely a K0 set or a KI network depending on the output layer dimension. When a linear readout is used, the adaptation of the overall system reduces to a linear regression problem, which can be solved either offline by the Wiener filter equation or online by a gradient-based learning rule. We will use the KI network and KII network architectures analogous to the liquid in the LSM or the reservoir in the ESN. The use of the KIII network as the dynamical reservoir is outside the scope of this study. In summary, with the


adoption of the LSM/ESN framework, the states of the FM are dynamical basis functionals (a representation space) created from nonlinear combinations of the input, and the readout simply finds the best projection of the desired response in this representation space.

We will provide two theorems for the echo state property of the Freeman RKII network. The first theorem states sufficient conditions on the parameter values of an RKII set resulting in echo states. In the second theorem, we give a sufficient condition for the nonexistence of echo states for all Freeman networks. Unfortunately, we do not have a general theoretical result for the sufficient conditions resulting in echo states for all Freeman networks in the standard form given by Equation 5-3.

Theorem 4.1: Consider a reduced KII set with the governing equations given by Equation 5-4. Let the parameter values Kmg and Kgm satisfy

|Kmg Kgm| < (a + b)^2 / (ab (Q'max)^2)

where Q'max denotes the maximum value of the derivative of the nonlinear function Q, and a and b are the filter coefficients. Then the network has echo states for all inputs u and for all states x. The proof follows from the linear stability analysis of the RKII set, which has been extensively studied in (Xu and Principe, 2005).

Proof: Consider the governing equations for the RKII set given in Equation 5-4. An equilibrium point (m0, g0) of the nonlinear differential Equation 5-4 satisfies

m0 = Kgm Q(g0) + p
g0 = Kmg Q(m0)     (5-5)

Using the nullclines defined by Equation 5-5, we can see that for a given set of parameters and input, the system has a unique equilibrium point (Figure 5-8). Now consider the state space representation given by Equation 5-4.
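The sufficient condition of Theorem 4.1 amounts to a one-line parameter check. The sketch below uses the condition as reconstructed here, with illustrative values for a, b and the maximum sigmoid slope; the function name is ours.

```python
def rkii_has_echo_states(Kmg, Kgm, a=0.22, b=0.72, Q_max_slope=1.0):
    """Sufficient condition of Theorem 4.1 (as reconstructed here):
    |Kmg * Kgm| < (a + b)**2 / (a * b * Q_max_slope**2)
    guarantees a unique stable equilibrium, hence echo states.
    a, b and Q_max_slope are illustrative values."""
    return abs(Kmg * Kgm) < (a + b) ** 2 / (a * b * Q_max_slope ** 2)

# With a = 0.22, b = 0.72 the bound is (a+b)^2/(ab) ~ 5.58, so the coupling
# values used later in the chapter (Kmg = 4, Kgm = -1) fall inside it.
ok = rkii_has_echo_states(4.0, -1.0)
```

A usage note: driving |Kmg Kgm| above the bound is exactly the regime where the equilibrium loses stability and sustained oscillations appear.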

|Kmg Kgm| < (a + b)^2 / (ab Q'(m0) Q'(g0))

The system does have a bifurcation point if

|Kmg Kgm| >= (a + b)^2 / (ab Q'(m0) Q'(g0))

(Xu and Principe, 2005). Now, assume that the parameters Kmg and Kgm satisfy

|Kmg Kgm| < (a + b)^2 / (ab (Q'max)^2)

Then the unique equilibrium point of the system will be stable. This means that, independent of the initial conditions, the system will approach the same trajectory for a given set of parameter values and input sequence. Moreover, the states will converge to zero when there is no input.

Theorem 4.2: Consider a Freeman network containing N K0 sets with the standard form given in Equation 5-3. Define the effective weight matrix

W~ = [ 0           I         ]
     [ ab(W - I)   -(a+b) I  ]     (5-8)

where I denotes the N x N identity matrix and 0 denotes the N x N zero matrix. Let the effective weight matrix W~ satisfy real(lambda_max) > 0, where lambda_max is the eigenvalue of W~ with the largest real part. Then the network has an asymptotically unstable null state. This implies that it has no echo states for any input set containing 0.

Proof: Consider a Freeman network with governing equations in the standard form given by Equation 5-3. In state-space form, Equation 5-3 can be expressed as

x1' = x2
x2' = -(a + b) x2 - ab x1 + ab (W Q(x1) + Win u)     (5-9)

where x1 and x2 are both N-dimensional vectors. For zero input, Equation 5-9 has the trivial null solution. Now, consider the Jacobian matrix of Equation 5-9 obtained from linearization around the origin,


W~ = [ 0           I         ]
     [ ab(W - I)   -(a+b) I  ]     (5-10)

If the matrix W~ has an eigenvalue with real part greater than zero, the linearized system is not asymptotically stable. This implies the existence of another solution of Equation 5-9 besides the null solution, and hence that the network has no echo states for any input set containing 0.

This theorem, which is valid for all Freeman networks in standard form, provides a heuristic way to obtain a useful weight matrix W in practical applications. Whenever we create an interconnection weight matrix W, we will check the eigenvalues of the effective weight matrix W~. If the real parts of all eigenvalues of W~ are less than zero, we will assume the states are controlled by the input.

Experiments

This section presents a variety of experiments in order to demonstrate the proposed framework for Freeman's K sets. In all the experiments, we use the discrete-time Freeman model developed in (Tavares, 2001; Ozturk et al., 2003). The discretization exploits the fact that the K0 set is composed of a separate linear dynamical part and a static nonlinear part. The continuous-time linear filter of the K0 set is discretized by the impulse invariance transformation, resulting in an equivalent discrete-time filter.

Function approximation

It has been demonstrated that the FM can be used as an auto-associative memory (Xu et al., 2004; Tavares, 2001). However, the stored patterns are binary (0s and 1s) and static. We would first like to show that the FM can handle continuous-valued signals within the proposed framework, using a function approximation experiment.

Consider an RKII network of 50 RKII sets and a single input channel connected to all excitatory cells with unity weights. The RKII network is driven by the input signal sin(2*pi*n/1000)


and the target signal is 0.5 sin^7(2*pi*n/1000). The weight values Kmg and Kgm of all RKII sets and all inhibitory weights Kii are set to 4, -1 and -0.002, respectively. These values were experimentally determined from biological data, and the same values are used in (Xu et al., 2004). The excitatory weight values are chosen from a Gaussian distribution with zero mean and unit variance.

Figure 5-9. Function approximation with the Freeman Model. A) States of the Freeman model. B) The overlay plot of the output and desired signals.


The network is initialized to a random initial condition and the states of the RKII network (the K0 outputs) are calculated. At this point, it is interesting to observe the echo states depicted in Figure 5-9 A. This plot shows the signals observed at the outputs of some of the K0 sets. As seen from the figure, due to the nonhomogeneous nature of the system, there is significant diversity in the signals at the nodes of the KII network, creating the rich dynamics required in the echo states. The next step is to construct the desired signal from the echo states using a memoryless readout network. For this example, we do not have direct connections from the input to the output. We will use a simple linear readout whose optimal weights can be calculated using Equation 2-4. The output of the optimal linear network is depicted in Figure 5-9 B along with the desired response. As seen from the figure, we get an almost perfect match between the desired response and the output of the Freeman model with the linear readout.

K0 set as the readout for function approximation

An alternative to the linear network as the FM readout is to use one of Freeman's K sets. For systems with only one output, we will use the K0 set as the readout. In this section, we will derive the gradient-based update equations for the K0 readout using the MSE as the cost function. This will allow us to use one of the components of the Freeman network, indicating that the Freeman K sets contain all the ingredients necessary for computation in the echo state framework. In the K0 set, the linear filter and the static nonlinear function are fixed, and the input weights are adaptable. Note that we do not have an analytic solution for the optimal weight vector, since the K0 set comprises a linear dynamical part, which is an infinite impulse response filter, and also a static nonlinearity. Consider the K0 set depicted in Figure 5-5 A. Let us first fix the notation. The input to the K0 set, the input weight vector, the signal at the input of the


linear part, the signal at the output of the linear part, and the output of the K0 set will be denoted by x(n), wout(n), z(n), l(n) and y(n), respectively. Then the governing equations are

z(n) = wout(n)^T x(n)
l(n) = c1 l(n-1) + c2 l(n-2) + c3 z(n)
y(n) = Q(l(n))
e(n) = d(n) - y(n)

The constants c1, c2 and c3 are the coefficients of the discrete-time filter obtained from the impulse invariance transformation of the original continuous-time filter. The details of the derivation can be found in (Ozturk et al., 2003). The instantaneous cost function is defined as e^2(n) = (d(n) - y(n))^2. Then the weights are updated according to

wout(n+1) = wout(n) - eta * d(e^2(n))/d(wout)

In order to calculate the sensitivity of the cost with respect to the weights, we use backpropagation over the topology, i.e.,

d(e^2)/d(wout) = (d(e^2)/dy)(dy/dl)(dl/dz)(dz/d(wout)) = -2 e(n) Q'(l(n)) c3 x(n)

Note that, although the K0 set includes a recurrent filter, the sensitivity of the cost with respect to the weights can be calculated instantaneously. This is because we do not need to update the filter coefficients; we only need the sensitivity of the filter output with respect to the filter input, which is just a constant, c3. With this derivation, the K0 set can be employed as the readout for single-output systems. For multi-output systems, we will use the KI network, where the connection weights between the K0 sets of the KI network are fixed and the weights between the echo states and the KI network are adapted as above. The extension of the gradient update rule to the multi-output case is trivial.
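One step of the gradient update derived above can be sketched as follows. This is an illustration of the update rule, not the experimental configuration: np.tanh stands in for Q, and the filter coefficients c = (c1, c2, c3), the step size eta and all names are illustrative assumptions.

```python
import numpy as np

def k0_readout_step(w, l_hist, x, d, eta=0.01, c=(1.2, -0.36, 0.1)):
    """One instantaneous gradient step for the K0 readout weights.
    w: input weight vector, l_hist: [l(n-1), l(n-2)], x: echo state vector,
    d: desired output. Returns (w_new, l, y, e)."""
    z = w @ x                                          # z(n) = w^T x(n)
    l = c[0] * l_hist[0] + c[1] * l_hist[1] + c[2] * z # second-order IIR filter
    y = np.tanh(l)                                     # static nonlinearity Q
    e = d - y                                          # instantaneous error
    dQ = 1.0 - y ** 2                                  # Q'(l) for tanh
    w_new = w + 2.0 * eta * e * dQ * c[2] * x          # w <- w - eta * d(e^2)/dw
    return w_new, l, y, e

rng = np.random.default_rng(1)
w = rng.normal(0.0, 0.1, 5)
x = rng.normal(0.0, 1.0, 5)
w2, l1, y1, e1 = k0_readout_step(w, [0.0, 0.0], x, d=0.3)
```

Repeating the step with the same input and filter history moves the output toward the target, which is the instantaneous-gradient behavior the derivation predicts.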


Figure 5-10. Overlay plot of the desired signal and the output signal of the Freeman Model with the K0 readout.

To demonstrate the idea of using the K0 set as the readout network, we repeat the same regression experiment with the same parameters as in the previous section, now using the K0 set as the readout instead of the linear network and adapting it as described above. Figure 5-10 shows the overlay plot of the desired signal and the output of the K0 set. As seen from the plot, we get a good match between the desired signal and the output of the Freeman network. However, the results are not as good as with the simple linear readout, since we do not have an optimal solution for the K0 readout.

Autonomous sequence generation with output feedback

This experiment demonstrates the ability of the system to autonomously generate a sequence of numbers. We will use a reduced KII network without any input units but with feedback from the output of the readout. The desired signal to be autonomously generated by the


system is 0.5 sin^3(2*pi*n/1000). An RKII network of 50 K0 sets without input units but with output feedback connections is constructed. The weight values Kmg and Kgm of all RKII sets and all inhibitory weights Kii are the same as in the previous experiment. The excitatory weight values are chosen from a Gaussian distribution with zero mean and unit variance. The output feedback connections from the linear readout to the KII network are chosen to be 0.1 and -0.1 with equal probability. During training, the states of the RKII network are driven by the desired response via output feedback.

Figure 5-11. Autonomous signal generation with the Freeman Model.

The network is initialized to a random initial condition and the states of the RKII network are calculated for 2000 samples of the desired response. The first 1000 samples, corresponding to the initial transient, are eliminated, and the optimal output weights are calculated using Equation 2-4.


The MSE value obtained during training is 1.4350e-7. During testing, the system is first run for 1000 steps with the correct desired signal. After 1000 steps, the system is let run freely. Figure 5-11 shows the desired and output signals for the first 3000 steps of testing. As observed from the plot, there is a transient time interval during which the system tries to lock onto the desired signal. But once it is locked, the system continues generating the signal autonomously. The MSE value for the last 1000 steps of testing is 7.934e-7.

Binary parity check

This experiment investigates the use of the Freeman model for binary logic computation. In particular, a parity check bit calculation is carried out using a reduced KII network of 100 K0 sets. We attach one input unit and a linear readout with one output unit to the reduced KII network. The input is a sequence of binary values, 0s and 1s, and the desired signal is the parity bit corresponding to the last 3 binary input values. More precisely, the desired signal is the sum of the last 3 bit values modulo 2. In our experiment, we represent each bit by a sequence of 100 samples; in other words, a 0 is represented by a sequence of a hundred 0s, and similarly for a 1. We use a reduced KII network of 100 K0 sets. The weight values Kmg and Kgm of all RKII sets and all inhibitory weights Kii are set to 0.5, -0.4 and -0.002, respectively. The Kee values are set to 0.1, 0.2 or 0 with probabilities 0.05, 0.05 and 0.90, respectively. We generate a sequence of 300 random binary numbers as the input and the corresponding parity bit sequence as the training data. This corresponds to a training sequence of 30000 data points, since each bit comprises 100 samples. Again, the weights of the linear readout are calculated using the Wiener solution after discarding the first 1000 samples.
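The input/target encoding described above — each bit held for 100 samples, with the target being the mod-2 sum of the last 3 bits — can be generated as follows. This is a sketch of the data construction only; the function name and the seed are ours.

```python
import numpy as np

def parity_dataset(n_bits=300, samples_per_bit=100, order=3, seed=0):
    """Input/desired sequences for the parity-check experiment: each random
    bit is held for samples_per_bit samples, and the target is the parity
    (mod-2 sum) of the last `order` bits."""
    rng = np.random.default_rng(seed)
    bits = rng.integers(0, 2, n_bits)
    parity = np.array([bits[max(0, i - order + 1): i + 1].sum() % 2
                       for i in range(n_bits)])
    u = np.repeat(bits, samples_per_bit).astype(float)
    d = np.repeat(parity, samples_per_bit).astype(float)
    return u, d

u, d = parity_dataset()   # 300 bits * 100 samples = 30000 training points
```

Holding each bit over many samples gives the network time to settle, so the readout can be thresholded near the end of each bit interval, as done in the experiment.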
In order to test the system, a random sequence of 300 binary numbers is generated and fed to the FM. Figure 5-12 shows a segment of the desired and output signals during testing. As we observe from the plots, the system output converges to


the desired response for all input bits. If the linear readout output is compared to a threshold value of 0.5 to make binary decisions, the system converges to the true binary value for the whole testing data without an error.

Figure 5-12. Binary parity check with the Freeman Model.

Multi-attractor learning

In this experiment, we train the readout to switch between multiple point attractors depending on the input (Jaeger, 2001). The network has 20 input units and 20 output units attached. The system has 20 point attractors that are controlled by the input signal. The input signals are in the form of rare pulses (with values 0 or 1; see Figure 5-13). At any given time, only one of the input channels can have a nonzero value (at most one pulse at a time). Each output unit is associated with an input unit, and when a pulse arrives at an input unit, the corresponding output unit value becomes 0.5, whereas the other output units take the value -0.5. The values of the output units remain the same until a new pulse arrives at another input unit. At


the arrival of this new pulse at an input unit, the output unit corresponding to that input unit activates to 0.5 and the remaining units to -0.5. Figure 5-13 shows a sample input signal and the corresponding output signal for a two-input, two-output system.

Figure 5-13. Sample input-output signals for multi-attractor learning.

We use a KI network of 100 K0 sets with 20 input units. The 100 K0 sets are randomly connected with weight values of 0.2, 0.1 and 0 with probabilities of 0.025, 0.025 and 0.95, respectively. This corresponds to a sparse connectivity of 5%. The input units are connected to the K0 sets with connection strengths of 5 or -5 with equal probability. The output units are connected to the states via output feedback with weight values of 1, -1 and 0 with probabilities of 0.1, 0.1 and 0.8. We generated a sequence of 8100 input samples and the corresponding output sequence for training. Starting from time step one, a pulse is assigned to input channels 1 to 20 in order, one every 400 steps. At time step 8001, a pulse exists in input channel 1. The pulse width is


chosen to be 100 samples. The output sequence is assigned to 0.5 or -0.5 according to the rule specified above. The system is run with the generated input data while the output is teacher-forced by the generated output signal. The states are calculated for 8100 time steps using the discretization. Then, the output weights are calculated using Equation 2-4 after discarding the first 1000 samples corresponding to the transient signal.

Figure 5-14 shows the output and desired response for 4 channels during training. The system is tested with an input-output sequence of 80000 samples. The system is initialized with the first 500 samples of the desired signal forced as the output. Figure 5-15 shows the output and desired response for four channels during testing. The Freeman model is able to model all the switches without any errors during testing.

Figure 5-14. Desired and output signals during training for multi-attractor learning with the Freeman Model.

CHAPTER 6
APPLICATIONS

This chapter explores the use of ESNs in various real-world applications.

Complex Echo State Networks

Most signal processing tools have been developed to operate on real-valued signals. However, there are areas where signal processing in the complex domain is required. For example, in digital communication, the symbols sent through a communication channel are usually implemented as points in the complex domain. In phase shift keying (PSK), binary information is conveyed by changes in the phase of the reference signal (Proakis, 2001). In quadrature PSK, four different phase values are used to encode two bits of data. In quadrature amplitude modulation (QAM), both the amplitude and the phase of the reference signal are utilized to encode information more efficiently with an increased data transfer rate. The most commonly used QAM schemes are 32QAM, 64QAM, 128QAM and 256QAM, where 5 to 8 bits of data are transferred per symbol. Phase plays an important role in these systems, and the symbols are represented by complex numbers. The straightforward method to deal with complex numbers is to process the real and imaginary parts of the signal separately. However, this is suboptimal because it neglects the interdependency between the real and imaginary parts of the signal (Kim and Adali, 2002). Alternatively, existing signal processing tools, originally designed to deal with real signals, have been effectively modified to operate in the complex domain.

In the neural networks literature, the popular backpropagation algorithm (BP) for the multilayer perceptron (MLP) was originally developed in the real domain, and in the past two decades it has been effectively used in many diverse areas (Widrow and Lehr, 1990; Haykin, 1998). Kim and Adali have rigorously developed the complex backpropagation (CBP) algorithm for the complex MLP (CMLP) (Kim and Adali, 2002). CMLP uses fully complex activation functions instead of


split complex functions that treat the real and imaginary parts of the signal separately, and achieves consistently better performance. However, the difficulty with MLP training in the complex domain arises mostly from the selection of activation functions that can operate in the complex domain. The nonlinear functions have to be both analytic and bounded, since the backpropagation algorithm requires derivatives of the nonlinearities to calculate the weight updates. Moreover, MLPs are trained iteratively, which is usually slow and computationally intensive. Model parameters such as the input dimension and the number of hidden units must be selected through exhaustive testing, as they play a crucial role in fast convergence and minimal training error.

In this section, we propose ESNs for complex signal processing. The use of ESNs in the complex domain is very convenient since system training is equivalent to simple linear regression, which is trivial in the complex domain. The derivatives of the nonlinear activation functions are never necessary since the recurrent part is fixed a priori. The nonlinear activation functions have to be modified similarly to (Kim and Adali, 2002) to ensure bounded output. However, analyticity is no longer required since the ESN does not require the computation of derivatives for training. We compare CESNs with the CMLP and a linear network trained with the complex least mean squares (CLMS) algorithm on a complex channel equalization experiment. We demonstrate that the CESN can achieve lower symbol error rates with simple and fast learning.

Channel Equalization

The performance of the CESN is tested on two benchmark nonlinear channel equalization problems. Figure 6-1 depicts the block diagram of an additive-noise nonlinear communication channel. The modulated signal corrupted by additive noise is passed through a nonlinear channel with memory.
An equalizer is used to remove the effects of the channel and the noise before the signal is demodulated. The equalizer ideally behaves like an inverse system of the nonlinear channel. The


noise is considered to be additive white Gaussian. The goal is to implement the equalizer using adaptive signal processing techniques. Here, both the signals and the channel are complex valued.

Figure 6-1. Block diagram of a communication system with a nonlinear dispersive channel (blocks: Input, Nonlinear Channel, AWGN, Equalizer, Decision).

The first problem is from the satellite communication domain and was also used by Kim and Adali (2002). The nonlinear channel is modeled by a Volterra series expansion given by

    x_k = Σ_n1 H_n1(1) a_(k-n1)
        + Σ_n1 Σ_n2 Σ_n3 H_n1n2n3(3) a_(k-n1) a_(k-n2) a*_(k-n3)
        + Σ_n1 ... Σ_n5 H_n1...n5(5) a_(k-n1) a_(k-n2) a_(k-n3) a*_(k-n4) a*_(k-n5) + ν_k

where ν_k is the down-link noise, Gaussian in nature, a_n is the nth information symbol, and H is the Volterra series coefficient that describes the effect of the channel nonlinearity. We used a channel described by the following Volterra series coefficients (Benedetto and Biglieri, 1982; Kim and Adali, 2002).

Linear coefficients: H0(1) = 1.22 + i0.646, H1(1) = 0.063 - i0.001, H2(1) = 0.024 - i0.014, H3(1) = 0.036 + i0.031
3rd-order coefficients: H002(3) = 0.039 - i0.022, H330(3) = 0.018 - i0.018, H001(3) = 0.035 - i0.035, H003(3) = -0.04 - i0.009, H110(3) = -0.01 - i0.017
5th-order coefficient: H00011(5) = 0.039 - i0.022.

We use a rectangular 16QAM modulation scheme where the 16 symbols are equally spaced in the complex domain and the farthest point from the origin has a magnitude of 1.
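The Volterra channel above can be simulated directly. This is a sketch under stated assumptions: the coefficient values and signs are transcribed from the text (some minus signs appear to have been lost in printing), and the placement of the conjugates in the fifth-order term follows the usual odd-order passband Volterra convention rather than anything stated explicitly in the source.

```python
import numpy as np

# Volterra coefficients as listed in the text (signs as transcribed)
H1 = {0: 1.22 + 0.646j, 1: 0.063 - 0.001j, 2: 0.024 - 0.014j, 3: 0.036 + 0.031j}
H3 = {(0, 0, 2): 0.039 - 0.022j, (3, 3, 0): 0.018 - 0.018j,
      (0, 0, 1): 0.035 - 0.035j, (0, 0, 3): -0.04 - 0.009j,
      (1, 1, 0): -0.01 - 0.017j}
H5 = {(0, 0, 0, 1, 1): 0.039 - 0.022j}

def volterra_channel(a, noise_std=0.0, rng=None):
    """Truncated Volterra expansion of the satellite channel:
    x_k = sum H1[n] a_{k-n}
        + sum H3[n1,n2,n3] a_{k-n1} a_{k-n2} conj(a_{k-n3})
        + sum H5[...] (three unconjugated, two conjugated terms)
        + complex Gaussian down-link noise."""
    rng = rng or np.random.default_rng(0)
    a = np.asarray(a, dtype=complex)
    K = len(a)
    ap = np.concatenate([np.zeros(5, complex), a])  # zero initial history
    x = np.zeros(K, complex)
    for k in range(K):
        s = ap[k + 5 :: -1]  # s[n] = a_{k-n}
        x[k] += sum(h * s[n] for n, h in H1.items())
        x[k] += sum(h * s[n1] * s[n2] * np.conj(s[n3])
                    for (n1, n2, n3), h in H3.items())
        x[k] += sum(h * s[n1] * s[n2] * s[n3] * np.conj(s[n4]) * np.conj(s[n5])
                    for (n1, n2, n3, n4, n5), h in H5.items())
    if noise_std > 0:
        x += noise_std * (rng.standard_normal(K)
                          + 1j * rng.standard_normal(K)) / np.sqrt(2)
    return x
```

For a unit symbol with no history, only H0(1) contributes, so the first output sample equals 1.22 + i0.646.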

Figures 6-2A and 6-2B show the constellation diagrams of the input symbols (under 12dB signal-to-noise ratio [SNR]) and the output symbols of the nonlinear channel, respectively. The symbol energy of the 16QAM system is given by 2.5d^2, where d is the horizontal or vertical distance between two adjacent points. The bit energy is calculated by dividing the symbol energy by 4. The SNR is calculated by dividing the bit energy by twice the noise variance.

We compared the performance of three different networks: a linear filter, a CMLP and a CESN. All networks are trained with 20000 data points generated from the 16QAM constellation with equal probability. The optimal weights of the 10-tap linear filter are computed using Equation 2-4. We used a CMLP with 10 input units, 5 hidden units and a single output unit, with arcsinh as the activation function; the derivative of arcsinh(z) is given by (1+z^2)^(-1/2). The weights of the CMLP are trained using the complex backpropagation algorithm derived in (Kim and Adali, 2002). We used a 30-unit CESN. The input weights are selected from the values -0.1 and +0.1 with equal probability. The internal connection weight matrix is designed with the uniform pole distribution principle (Ozturk et al., 2007). The ESN designed with this principle is denoted by ASE-CESN, while the ESN initialized with random weights is denoted by CESN. The activation function of the internal units is the complex sigmoidal function tanh(z). For the CESN and CMLP, the network is trained 10 times for every SNR value and the best performance is recorded. All three networks are tested with a 10^6-point input sequence, where each point is selected from the 16 symbols with equal probability. We trained the CESN with only 1000 data samples.

Figure 6-3 shows the variation of the symbol error rate (SER) for different SNR values for the four networks. As observed from the figure, ASE-CESN shows the best performance compared to the other networks.
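The 16QAM geometry and noise scaling described above can be checked numerically. A small sketch, assuming only the conventions stated in the text (farthest point at unit magnitude, Es = 2.5d^2, Eb = Es/4, SNR = Eb / 2σ^2); the function names are illustrative:

```python
import numpy as np

def qam16_constellation():
    """Rectangular 16QAM with the farthest point at unit magnitude.
    With per-axis levels +/-d/2, +/-3d/2, the corner point has
    magnitude 3d/sqrt(2); setting this to 1 gives d = sqrt(2)/3."""
    d = np.sqrt(2) / 3
    levels = np.array([-1.5, -0.5, 0.5, 1.5]) * d
    return np.array([re + 1j * im for re in levels for im in levels])

def noise_std_for_snr(snr_db):
    """Per-dimension noise std from the text's convention:
    Es = 2.5 d^2, Eb = Es / 4, SNR = Eb / (2 sigma^2)."""
    d = np.sqrt(2) / 3
    Eb = 2.5 * d ** 2 / 4
    sigma2 = Eb / (2 * 10 ** (snr_db / 10))
    return np.sqrt(sigma2)

C = qam16_constellation()
print(len(C), round(np.max(np.abs(C)), 3))  # 16 1.0
```

The average symbol energy works out to 2·1.25·d^2 = 2.5d^2, matching the expression in the text.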
The main reason is that the ESN avoids backpropagation through the complex nonlinear functions. The CLMS performs very similarly to the CMLP since the CLMS also reaches the


optimal solution and the problem is only mildly nonlinear. The SER of the ASE-CESN drops to zero after 12dB SNR, while the SER for the CLMS and CMLP still decreases slowly. The CESN, which used far fewer training points than the CLMS and CMLP, can still perform as well due to its simple training.

Figure 6-3. SER vs. SNR plot for the four networks.

The second problem is another channel equalization problem from the digital communication domain and was also used in (Cha and Kassam, 1995). Here, the nonlinear channel is described by the following equations:

    x(n) = (0.34 - i0.27) u(n) + (0.87 + i0.43) u(n-1) + (0.34 - i0.21) u(n-2)
    y(n) = x(n) + 0.1 x^2(n) + 0.05 x^3(n) + v(n)

where v(n) is independent identically distributed white Gaussian noise. We use a rectangular QPSK modulation scheme where the 4 symbols are equally spaced in the complex domain and the farthest point from the origin has a magnitude of 1. Figures 6-4A and 6-4B show the constellation diagrams of the input symbols (under 12dB signal-to-noise ratio [SNR]) and the output symbols of the nonlinear channel, respectively. The symbol energy of the QPSK system is given by 0.5d^2, where d is the horizontal or vertical distance between two
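The second channel is short enough to sketch in a few lines. This follows the channel equations as reconstructed here from the garbled printing (the linear taps and the 0.1/0.05 polynomial terms); treat the exact coefficients as a transcription:

```python
import numpy as np

def cha_kassam_channel(u, noise_std=0.0, rng=None):
    """Nonlinear channel of the second experiment (as reconstructed):
    x(n) = (0.34-0.27j) u(n) + (0.87+0.43j) u(n-1) + (0.34-0.21j) u(n-2)
    y(n) = x(n) + 0.1 x(n)^2 + 0.05 x(n)^3 + v(n)"""
    rng = rng or np.random.default_rng(0)
    u = np.asarray(u, dtype=complex)
    h = np.array([0.34 - 0.27j, 0.87 + 0.43j, 0.34 - 0.21j])
    x = np.convolve(u, h)[: len(u)]          # linear FIR part with memory 2
    y = x + 0.1 * x ** 2 + 0.05 * x ** 3     # static polynomial nonlinearity
    if noise_std > 0:
        y += noise_std * (rng.standard_normal(len(u))
                          + 1j * rng.standard_normal(len(u))) / np.sqrt(2)
    return y
```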

We compared the performance of three different networks: a linear filter, a CMLP and a CESN. All networks are trained with 10000 data points generated from the QPSK constellation with equal probability. The optimal weights of the 10-tap linear filter are computed using Equation 2-4. We used a CMLP with 3 input units, 10 hidden units and a single output unit. The same ESN models used in the previous example are utilized. Here, the CESN uses 500 training samples.

Figure 6-5 shows the variation of the symbol error rate (SER) for different SNR values for the networks. Similar to the previous experiment, ASE-CESN shows the best performance. The CESN still performs very close to the CLMS and CMLP even though it utilizes a small number of training points.

Figure 6-5. SER vs. SNR plot for the four networks.

Brain Machine Interfaces

This section of the dissertation is the result of a joint study with Aysegul Gunduz (Gunduz et al., 2007). We provided the expertise in ESNs while they provided the data and expertise in brain machine interfaces (BMIs). The results of the study have been accepted for publication (Gunduz et al., 2007).


In BMIs, the goal is to model the relationship between a kinematic variable, such as the hand position of a primate, and the electrophysiological data collected from its brain. BMIs provide new rehabilitation options to patients who have lost their motor functions due to damage to the nervous system. Therefore, much effort has been put into research on neuroprosthetics for motor control and BMIs.

In order to model the relationship between brain signals and hand position in BMIs, simultaneous recordings have been collected from the brain and the primate's hand position. The brain signals can be acquired from microarray recordings (action potentials and local field potentials), on the cortex via subdural grids (electrocorticogram [ECoG]), and on the scalp (electroencephalogram [EEG]). The levels of abstraction of these signals are referred to as microscopic, mesoscopic, and macroscopic, respectively.

In the BMI literature, most studies focus on modeling the hand position of a primate from the activities of individual neurons collected from various sites in the sensorimotor cortex (Chapin et al., 1999; Serruya et al., 2002). However, the main difficulty in the practical use of these systems is the crude invasiveness of microelectrodes, which prevents their use in human subjects. On the other hand, non-invasive scalp EEG contains a summation of the dendritic and axonal currents of millions of neurons over the cortical impedance (Sanchez et al., 2007). The macroscopic EEG lacks the spatial resolution of microscopic recordings and has a very low signal-to-noise ratio. In this chapter, our focus will be on the mesoscopic-level recording called the electrocorticogram (ECoG). Compared with EEG, ECoG has the great advantage of being more spatially specific (smaller electrodes) and much less contaminated by noise, and it can display better frequency resolution because of its closer proximity to the neural tissue (Sanchez et al., 2007).


Current BMIs based on ECoG recordings either require movement-related averages of trials for the extraction of features (Mehring et al., 2004; Rickert et al., 2005) or the imagery association of another task with the kinematic behavior (e.g., protruding the tongue) (Leuthardt et al., 2004). Movement-related potentials, though a valuable analysis tool, are not suitable for building real-time BMIs. In our experimental paradigms, we undertake the prediction of the kinematic behavior of the patients without employing any other task.

The parameters that can be extracted from ECoG to predict hand trajectories are still under investigation (Gunduz et al., 2007). Still, since amplitude modulation plays a key role in both neuronal activation and rate coding, this study seeks to explore preprocessing modalities that emphasize amplitude modulation in the ECoG above the level of noise and background fluctuations in order to derive the commands for complex motor control tasks (reaching and grasping) (Sanchez et al., 2006).

Mapping amplitude-modulated control features to motor behavior requires models of low order that are easy to train. Simple linear filters, such as Wiener filters, are very easy to train but have limited accuracy. Time-delay neural networks (TDNNs) require very high model orders, especially when the system is a multiple-input multiple-output system, as in BMIs. Recurrent neural networks (RNNs), which have high modeling accuracy, are very hard to train. In this chapter, we study and compare mappings from amplitude-modulated ECoG features to hand movements involving reaching and pointing tasks using linear filters, echo state networks and leaky echo state networks.

The data for this study were obtained from a patient undergoing extraoperative subdural grid electroencephalographic monitoring for the treatment of intractable complex partial epilepsy at Shands Hospital, University of Florida. The experimental paradigms were approved by the


University of Florida Institutional Review Boards. The patient, a right-handed 15-year-old female, underwent a presurgical work-up that included scalp EEG, formal neuropsychological testing, and MRI. The patient's IQ and motor functions were verified to be nonfocal by the absence of motor or sensory deficits on neurological examination (Sanchez et al., 2007).

The patient was implanted with subdural grid electrodes according to established protocols (Lesser et al., 1990). The grids consisted of a 1.5mm-thick silastic sheet embedded with platinum-iridium electrodes (4mm diameter with 2.3mm diameter exposed surface) spaced at 1cm center-to-center distances. The anatomical location of the grids was based upon the medical team's recommendation for the epilepsy evaluation. The primary motor cortex was determined by evoked potentials and direct electrical stimulation of the subdural grids (Jasper and Penfield, 1954) and was found to be far from the seizure focus. The 32-channel electrode grid covered the premotor (PMA), primary motor (M1), and somatosensory (S1) cortices based on the patient's cytoarchitecture (Sanchez et al., 2007).

Experimental Setup

Extraction of control features from ECoG recordings for neuroprosthesis is facilitated by continuously time-synchronizing the neuronal modulation with a well-defined behavioral paradigm (Sanchez et al., 2006). The behavioral tasks used in this neuroprosthetic design focus on arm reaching and pointing toward a visual input (Sanchez et al., 2007). The patient was cued to follow with her index finger a predefined cursor trajectory presented on an LCD screen with an active area of 20 x 30cm while neuronal modulations from the implanted ECoG electrodes were simultaneously being recorded (Sanchez et al., 2007). The trajectory consisted of a widely performed center-out task (Georgopoulos et al., 1982) and a target selection task (Desmurget et al., 1998).
This behavior mimics a computer user's movement to select an icon on the screen. In


a single session, the patient was required to repeat the entire task six times, and this was repeated for each trial (Sanchez et al., 2006).

Spectrally preprocessed cortical recordings have been shown to correlate with a variety of visual, auditory, and motor tasks in frequency bands comprised of slow potentials (1-60Hz) (Pfurtscheller et al., 2003), the gamma band (60-100Hz) (Crone et al., 1998), the fast gamma band (100-300Hz) (Sinai et al., 2005) and ensemble depolarization (300Hz-6kHz) (Engel et al., 2005). Therefore, we define the band-specific amplitude-modulated control features u(t) as the integrated power of the ECoG voltage V(t) in non-overlapping bins of 100 msec, given by

    u_i(t_n) = Σ_{t = t_n}^{t_n + 100ms} V_i^2(t) Δt    (6-1)

where t_{n+1} = t_n + 100 ms.

Methods

The simplest method to map the neuronal activity to hand movement is the linear Wiener filter topology. There are 32 input and 2 output channels. For each input channel, a tap delay line with 25 taps is used. The filter order was optimized by scanning tap delays from 5 to 30. At each time sample, the filter computes the outputs using 2.5 seconds of past inputs. The model was trained on 3 minutes of recordings and tested on a data segment of 1.5 minutes.

We use an ESN with 500 internal units. The 32x500 input weight matrix Win is fully connected with values of +1 or -1 with equal probability. A spectral radius of 0.9 is chosen.

In the BMI problem, the neuronal activity that is input to the system is a fast-changing signal, whereas the hand position changes at a relatively slow pace. Therefore, we alternatively used a leaky ESN with a built-in low-pass filter to further increase the memory depth of the system and slow down the input signal. The update equation of the leaky ESN is given by

    x(n+1) = (1 - Ca) x(n) + C f(W x(n) + Win u(n+1))


Here, C and a are the time constant and the decay factor, respectively; together they define the amount of low-pass filtering the system introduces. The number of processing elements for the leaky ESN was experimentally determined to be 200. The spectral radius was kept at 0.9, and the leakage parameters C and a were experimentally determined for each band. In both architectures, one second of data was discarded as the transient.

Averaged correlation coefficients (CC) between the actual hand trajectory and the filter outputs, computed over non-overlapping windows of 5 seconds, are presented in Table 6-1. Figure 6-6 shows the variation of the CC over time. The performance of each model drops when the patient was switching between tasks. The performance along the y-direction is evidently better for all topologies and all frequency bands compared to the x-direction. Across spectral bands, the most accurate reconstruction was observed in the ensemble activity (300Hz-6kHz). Overall, the leaky ESN performed better than the basic ESN and the Wiener filter, increasing the CC performance of the Wiener filter by 15% in the y-direction in the 300Hz-6kHz band. The improvement of the ESNs over the linear filter decreased as the input spectral activity slowed down. This may be an indication that the slower frequency bands contribute less to the modulation of hand kinematics.

Table 6-1. Averaged correlation coefficients between the actual hand trajectory and the model outputs computed over non-overlapping windows of 5 seconds
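The leaky-integrator update used here can be sketched in a few lines. The reservoir sizes and spectral radius below follow the text; the particular values of C and a are hypothetical placeholders (the text determines them experimentally per band), as is the random input:

```python
import numpy as np

def leaky_esn_states(u, W, W_in, C, a, x0=None):
    """Leaky-integrator ESN update, as reconstructed from the text:
    x(n+1) = (1 - C*a) x(n) + C * f(W x(n) + W_in u(n+1)), with f = tanh."""
    N = W.shape[0]
    x = np.zeros(N) if x0 is None else x0
    X = []
    for un in u:
        x = (1 - C * a) * x + C * np.tanh(W @ x + W_in @ un)
        X.append(x)
    return np.array(X)

rng = np.random.default_rng(0)
N, n_in = 200, 32
W = rng.standard_normal((N, N))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))  # spectral radius 0.9
W_in = rng.choice([-1.0, 1.0], size=(N, n_in))   # +/-1 input weights
X = leaky_esn_states(rng.standard_normal((50, n_in)), W, W_in, C=0.4, a=0.9)
print(X.shape)  # (50, 200)
```

With 0 < Ca < 1 the state is a low-pass filtered version of the instantaneous reservoir response, which is what slows the dynamics toward the pace of the hand trajectory.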

Forecasting Water Inflow Using the Echo State Network

This section of the dissertation is the result of a joint study with Rodrigo Sacchi (Sacchi et al., 2007). We provided the expertise in ESNs while they provided the data and expertise in water inflow forecasting. The results of the study have been accepted for publication (Sacchi et al., 2007).

We use the ESN for the prediction of hydropower plant reservoir water inflow. Being able to predict the water inflow is fundamental for planning the operation of the hydrothermal power system. In this study, we used a database of average monthly water inflows at the Furnas plant, one of the Brazilian hydropower plants. The goal is to predict the value of the water inflow for the next month using previous data. The performance of the ESN is compared with the SONARX network, the RBF network and the ANFIS model.

Water Inflow Problem

The greatest difficulty in forecasting water inflow arises from the nonstationary, non-Gaussian nature of the time series, due to changing weather conditions (wet and dry seasons) in different periods of the year across years.

Forecasting models based on the Box & Jenkins methodology have been widely used in streamflow forecasting problems (Salas et al., 1982; McLeod, 1994). However, those parametric auto-regressive models assume a linear relationship (Box et al., 1994). This is a major limitation since the nature of the water inflow problem requires nonlinear modeling. Alternatively, ANNs have been proposed for time series analysis due to their ability to deal with nonlinear input-output relationships (Haykin, 1998). Multilayer perceptrons (MLPs) with the backpropagation algorithm, the most popular of them, have been applied to forecast streamflow with promising results (Atiya, 1999). Examples of the successful application of ANN architectures to water inflow prediction are (Coulibaly and Anctil, 1999; Valenca et al., 2005; Sacchi et al., 2004).


The main problem with ANNs is the number of parameters, which is usually large due to the difficulty of the problem. Moreover, a tapped delay line has to be used as a preprocessor to provide short-term memory for static architectures such as the MLP or RBF. The selection of the optimal embedding, which defines the most important previously measured samples to be used as input data for the predictor, is not trivial.

ESNs provide nonlinearity with training reduced to simple linear regression. Moreover, since the system has internal built-in memory resulting from the feedback connections, it is not necessary to embed the input signal before further processing. The training of the linear readout weights can be done analytically, and hence very quickly. In this section, we use the ESN as a one-step-ahead predictor of the monthly average hydropower plant reservoir water inflow. The performance of the ESN is evaluated and compared to the Self-Organizing Nonlinear Auto-Regressive model with eXogenous input (SONARX) (Barreto and Araujo, 2001), its radial basis function variant (SONARX-RBF) (Sacchi et al., 2004) and the Adaptive Neuro-Fuzzy Inference System (ANFIS) model (Jang, 2003).

Results

In this case study, the data collected from the Furnas hydroelectric power plant (HPP) was used, since it is free of operative influence due to its relative position in the cascade. Monthly water inflow data from 1931 to 1994 are available. The five-year period from 1972 to 1976 is selected as the testing data. The remaining data are used to train the models.

Since this is a dynamical modeling problem, most of the models require embedding a short-term history of the time series by using a tapped delay line.
The states are formed by delayed versions of the input signal, i.e., x1(t) = u(t), x2(t) = u(t-1), x3(t) = u(t-2), where t denotes the current month and u(t) is the value of the water inflow sample for the month t and x1(t), x2(t), x3(t) are the states which are the inputs of the predictor model. Alternatively, we can use two
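The two embeddings just described are simple slicing operations. A sketch with a toy monthly series (the `embed` helper is illustrative, not from the source):

```python
import numpy as np

def embed(u, lags):
    """Build predictor states from delayed samples of u,
    e.g. lags=(0, 1, 2) gives x1(t)=u(t), x2(t)=u(t-1), x3(t)=u(t-2)."""
    m = max(lags)
    return np.column_stack([u[m - k : len(u) - k] for k in lags])

u = np.arange(20, dtype=float)                # toy monthly inflow series
X_recent = embed(u, (0, 1, 2))                # immediate past only
X_seasonal = embed(u, (0, 1, 2, 10, 11, 12))  # plus previous-year samples
print(X_recent.shape, X_seasonal.shape)  # (18, 3) (8, 6)
```

Note the seasonal embedding loses more leading rows (here 12) because the oldest lag must be available before a state vector can be formed.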


short periods, one from the immediate past and the other from the previous year, i.e., x1(t) = u(t), x2(t) = u(t-1), x3(t) = u(t-2), x4(t) = u(t-10), x5(t) = u(t-11), x6(t) = u(t-12). This allows the use of not only the recent monthly information but also seasonal information from the previous year. The desired response at month t is defined as u(t+1), the value of the water inflow for the next month. The performances of the proposed methods are compared with respect to four different metrics: MSE, root MSE (RMSE), mean absolute deviation (MAD), and mean percentage error (MPE). MAD and MPE are defined as

    MAD = (1/T) Σ_{t=1}^{T} |y(t) - d(t)|

    MPE = (100/T) Σ_{t=1}^{T} |y(t) - d(t)| / |d(t)|

where T is the total number of samples and y(t) and d(t) are the model output and desired signals, respectively. Table 6-2 shows the performances of the models (ESN, SONARX, SONARX-RBF and ANFIS) with respect to the error criteria specified above.

Table 6-2. Performance comparison of the models for water inflow prediction

Model        Embedding                  MSE (x10^4)  RMSE    MAD     MPE (%)
ESN          built-in                   5.97         244.45  177.41  21.04
SONARX       t,t-1,t-2                  21.78        466.73  320.13  37.94
SONARX       t,t-1,t-2,t-3,t-4,t-5      12.20        349.30  253.18  28.78
SONARX       t,t-1,t-2,t-10,t-11,t-12   13.54        367.90  261.04  28.32
SONARX-RBF   t,t-1,t-2                  9.24         303.94  233.24  29.35
SONARX-RBF   t,t-1,t-2,t-3,t-4,t-5      6.35         252.02  187.53  22.41
SONARX-RBF   t,t-1,t-2,t-10,t-11,t-12   7.13         266.94  202.56  24.35
ANFIS        t,t-1,t-2                  9.45         307.47  235.67  29.89
ANFIS        t,t-1,t-2,t-3,t-4,t-5      7.07         265.97  198.52  24.27
ANFIS        t,t-1,t-2,t-10,t-11,t-12   6.72         259.25  205.99  25.02
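The four comparison metrics can be computed with a small helper; this is a sketch (the function name is illustrative), implementing exactly the MAD and MPE definitions above together with MSE and RMSE:

```python
import numpy as np

def forecast_metrics(y, d):
    """MSE, RMSE, MAD = (1/T) sum |y(t)-d(t)|,
    and MPE = (100/T) sum |y(t)-d(t)| / |d(t)|."""
    y, d = np.asarray(y, float), np.asarray(d, float)
    e = y - d
    mse = np.mean(e ** 2)
    return {"MSE": mse, "RMSE": np.sqrt(mse),
            "MAD": np.mean(np.abs(e)),
            "MPE": 100 * np.mean(np.abs(e) / np.abs(d))}

m = forecast_metrics([110, 90], [100, 100])
print(m["MSE"], m["RMSE"], m["MAD"], m["MPE"])  # 100.0 10.0 10.0 10.0
```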


We can see that the ESN, SONARX-RBF and ANFIS performed significantly better than the SONARX model, which has an intrinsic error since it is based on the Vector-Quantized Temporal Associative Memory (VQTAM).

The ESN performs slightly better than all the other models, even though it has a significantly simpler and faster training algorithm. SONARX-RBF with 5 taps performs very close to the ESN, whereas with the other selections of taps the ESN performs significantly better. This shows the importance of the selection of the number of taps for performance.

The prediction results for the water inflow during training are shown in Figure 6-7. The blue line refers to the desired time series, and the red one refers to the forecasted water inflow values. Based on the figure, the predictor model was able to learn and capture most of the behavioral variability present in this time series. Figure 6-8 shows the water inflow forecast during the testing period of 1972-1976. The red line, representing the water inflow forecast, shows behavior very close to the blue one, the desired time series.

Figure 6-7. Training performance of the ESN for water inflow prediction at the Furnas hydroelectric power plant.


Figure 6-8. Testing performance of the ESN for water inflow prediction at the Furnas hydroelectric power plant.

Spike Detection with Echo State Networks

In this section, we propose to use the ESN-MACE for detecting neural action potentials in neural recordings. The detection of action potentials is a very challenging task due to the great variability in spike shapes (both in the amplitude and the width of the spikes), as well as the intrinsic noise, which is nonstationary and most likely non-Gaussian, in neural recordings. The sorting of spikes from single neurons is classically defined as a two-step process. First, the spikes must be detected using either a threshold detector or power analysis (Obeid and Wolf, 2004). Once the spikes are detected, a variety of algorithms, such as template matching with a linear matched filter or principal components analysis (PCA), can be used to isolate the waveforms of individual distinct neurons (Lewicki, 1998; Obeid and Wolf, 2004). However, the performance of this serial process is greatly dependent upon the accuracy of the first step. In threshold detection, the background noise, which can instantaneously reach a high value, is confused with spikes, leading to false alarms. In


addition, spikes with low power are hard to detect. The main difficulty with the matched filtering method is the absence of a well-defined template signal that can represent the spike class, due to the great variability in signal shapes. Furthermore, the noise statistics in neural recordings are non-Gaussian and nonstationary, impairing the optimality of the matched filter.

The proposed ESN-MACE aims at addressing the difficulties posed by the spike detection problem by utilizing the high-dimensional states of a nonlinear dynamical system to train an associative memory that can represent a class under noise and distortions with a single filter. The advantage of the LAM over the conventional linear readout is that it explicitly utilizes the covariance information of the input training set, unlike linear readouts (Ozturk and Principe, 2007). It also simplifies the training of the ESN readout since there is no need to specify a desired response for the LAM. The ESN-MACE is particularly powerful when the problem requires modeling a class of functions, since the ESN-MACE filter is able to build a single template containing covariance information about the full training set. Such a system improves the ability to correctly identify neural action potentials because it explicitly utilizes the variability seen during the training phase.

Various experiments using in vivo neural recordings were used to compare the performance of the ESN-MACE with threshold and matched filter detectors. The performance metric is the receiver operating characteristic (ROC) curve, a standard measure for evaluating the performance of a detector that demonstrates the trade-off between the probability of detection and the probability of false alarm (Helstrom, 1995). The ROC curve for each filter is obtained by varying the threshold value and calculating the detection rate and false alarm rate for each value of the threshold.
We demonstrate that the ESN-MACE is able to detect action potentials with lower false alarm rates compared to established methods of spike detection.
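The threshold-sweeping procedure that generates an ROC curve can be sketched directly. This toy version uses synthetic detector scores and labels (the `roc_curve` helper and the 50-point sweep are illustrative choices, not from the source):

```python
import numpy as np

def roc_curve(scores, labels, n_points=50):
    """Sweep a threshold over the detector output and record
    (false-alarm rate, detection rate) pairs, one per threshold."""
    scores = np.asarray(scores, float)
    labels = np.asarray(labels, bool)
    pts = []
    for th in np.linspace(scores.min(), scores.max(), n_points):
        det = scores >= th
        pd = np.mean(det[labels])    # detection rate on true spikes
        pfa = np.mean(det[~labels])  # false-alarm rate on noise
        pts.append((pfa, pd))
    return np.array(pts)

rng = np.random.default_rng(0)
labels = rng.random(1000) < 0.1                    # ~10% true spikes
scores = rng.standard_normal(1000) + 3.0 * labels  # separable toy scores
roc = roc_curve(scores, labels)
print(roc.shape)  # (50, 2)
```

The lowest threshold detects everything (both rates equal 1), and raising the threshold trades detections for fewer false alarms, tracing out the curve.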


Problems in Spike Detection

A neural recording cannot be characterized as a deterministic signal, not only because of the intrinsic noise present in the recording instruments, but also because of the many sources of uncertainty that make a neuron fire under in vivo experimental conditions (Sörnmo and Laguna, 2005). The extracellular spikes recorded from a microelectrode array have amplitudes ranging from a few microvolts up to a few hundred microvolts. The amplitudes of the spikes, as well as their varying shapes, make the process of distinguishing the spikes from instrumentation noise difficult. As such, accurate identification of spikes has been problematic (Lewicki, 1998; Sörnmo and Laguna, 2005). However, the wave shape registered in a conventional extracellular neural recording is basically determined by the relative distance and position of the cell body with respect to the microelectrode. We briefly review below several spike detection algorithms.

Threshold detection

The most commonly used technique to detect the presence of spike activity is the use of a threshold. This system decides that a spike exists whenever the amplitude of the neural recording exceeds a predetermined threshold (Lewicki, 1998). The beauty of this method is that it provides a quick and inexpensive way to identify and discern neural activity. However, threshold detection is ideally suited only for spikes that are easily discernible from the background noise (Lewicki, 1998). Threshold detection is also prone to missing spikes (false negatives) that are present in a dataset but do not exceed the established threshold. While it is possible to lower the threshold in order to capture all spikes present, doing so increases the probability of falsely misidentifying noise as an action potential (false positives), thus reducing the overall accuracy of the system.


Matched filter

Another approach to spike detection is the use of filters, in particular the matched filter (MF) (Lewicki, 1998). Matched filters are widely used at the receiving end of digital communication systems to detect binary signals. The matched filter provides optimum filtering, in terms of minimum bit error rate, for an additive white Gaussian noise channel (Helstrom, 1995).

The fundamental problem with the matched filter for spike detection is that it assumes there is a template that represents the spike class. Matched filters were originally developed for radar and communication applications, where the signals of interest were designed by humans and the difficulty was simply one of discovering them in high-noise backgrounds. In these applications the template is known a priori, so it is trivial to design the matched filter. Moreover, it is known that the MF is optimum only under the very restrictive conditions of a white noise background, uncorrelated with the signal, whose statistics are well characterized as Gaussian (Proakis, 2001).

In neural spike detection, the use of a matched filter is very different, due to the fact that the spike amplitude and shape are variable, the noise is not stationary, and its statistics are most likely not Gaussian. But the determining shortcoming is really the variability in spike shapes. Figure 6-9 shows a very regular spike recording from a single neuron under high signal-to-noise ratio (SNR) conditions. Figure 6-10 shows a more natural recording of activity collected by a single electrode in the rat motor cortex, where several neurons are being recorded. The waveforms are aligned by their maximum slope and show the normal variability of neural recordings.

The first step in the design of a matched filter for spike detection is the design of the spike template. Each dataset was initially spike sorted using a program designed by Cambridge


Electronic Design, called Spike 2 (version 5), in order to generate markers for the instants when each spike is first detected, commonly known as spike times. Using these spike times, each spike was extracted from the original dataset and sorted according to its variance. Sorting the spikes by variance allowed us to determine how much variability was present across all spikes, and thus to select a number of spikes that encompassed the variability present in the dataset. Numerous experiments with different datasets showed that we were able to accurately describe the variability of all spikes present by randomly selecting a minimum of 50 spikes. This is the training set for all the experiments presented in this section.

Once the necessary number of spikes has been extracted, an average of the selected spikes is generated. In order to ensure uniformity within the template, all spikes were aligned according to their initialization time, as recorded in the dataset's spike-times file. By doing this, a template is generated that characterizes the spike waveform most predominant in the dataset. This approach is best when the detected neural spikes are easily discernible from the background noise and the spikes vary insignificantly from each other. However, when the dataset possesses spikes with high variability, the average proves to be inaccurate, reducing the overall accuracy of the system.

Spike Detection with ESN-MACE

The high-dimensional internal states of the ESN architecture coupled with the MACE filter create an excellent medium for dynamical pattern recognition. It has been demonstrated in (Ozturk and Principe, 2007) that ESN-MACE is particularly powerful when the problem requires modeling a class of functions, since the ESN-MACE filter is able to build a single template containing covariance information about the full training set.
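The matched-filter baseline described in the previous subsection, an averaged template correlated against the recording, can be sketched as follows. The function names are hypothetical, and the snippets are assumed to have been aligned beforehand (e.g., on the recorded spike times):

```python
import numpy as np

def build_template(aligned_spikes):
    """Average a set of pre-aligned spike snippets into one template."""
    return np.mean(aligned_spikes, axis=0)

def matched_filter_output(signal, template):
    """Correlate the recording with the template; the matched filter's
    impulse response is the time-reversed template s(T - n)."""
    return np.correlate(signal, template, mode="same")

# Usage: a Gaussian-pulse "spike" of length 21 hidden in a recording.
t = np.exp(-0.5 * ((np.arange(21) - 10) / 3.0) ** 2)
spikes = np.stack([t, 1.1 * t, 0.9 * t])   # toy training snippets
tmpl = build_template(spikes)
sig = np.zeros(200)
sig[80:101] += t
out = matched_filter_output(sig, tmpl)
print(int(np.argmax(out)))                  # peaks at the spike center
```

As the text notes, this single averaged template is exactly what breaks down when the dataset contains spikes with high shape variability.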
In this chapter, we propose to use the ESN-MACE for the spike detection problem, where the spikes can be considered a family of signals due to the great variability in spike amplitude and width (see Figure 6-10).


An ESN with 28 internal units and a single input unit is used for the simulations in this work. The W matrix is designed according to the ASE metric presented in chapter 3 and scaled so that its spectral radius is 0.9. The entries of Win are set to -1 or 1 with equal probability.

Two distinct neural recordings were used in the experiments. The first was acquired from a single neuron with high SNR. The second dataset was collected from a rat motor cortex neuron while the animal performed a lever-pressing task in order to achieve a water reward. Both datasets were sampled at 24414.1 Hz and band-pass pre-filtered with a 300 Hz lower cutoff. Both datasets were normalized and spike sorted in order to generate their respective spike-times files.

Fifty representative spike signals were selected as the training data for each of the neural recordings. Figure 6-9 depicts the selected spike signals for the high-SNR neural recording. It is evident from the figure that the spike shapes have limited variability, since they are all generated by a single neuron. On the other hand, the differences in amplitude and width of the depolarization/repolarization phases of the spike shapes in Figure 6-10 clearly show that the rat dataset contains more than one type of neuron.

For each dataset, the algorithm described in chapter 4 is used to train the ESN-MACE filter. Fifty representative spike signals (each of length 21) are fed to the ESN and the echo states are calculated using equation 1-1 with zero initial conditions. This creates an image of dimension 28x21, where 28 is the number of PEs and 21 is the signal length. Output correlation peak amplitudes are assigned to be 1.0 for the training data. The MACE filter is synthesized in the frequency domain using equation 4-1, and the corresponding image-plane filters are obtained by the inverse discrete Fourier transform.
For a given test signal, the MACE filter output is calculated by correlating the echo states of dimension 28x21 with the filter coefficients at zero lag. The MACE


filter output is then compared to a threshold to decide the absence or presence of a spike. Different threshold values are used to obtain the ROC figure.

Figure 6-9. Spike shapes recorded with high SNR.

Figure 6-10. Spike shapes in the rat dataset. Numerous distinct spike waveforms are present in this dataset.

Figure 6-11 shows a section of the rat dataset where we used the ESN-MACE filter to identify several spikes. As the ESN-MACE filter is designed to provide a maximum response after a spike has been convolved with


the filter, we shifted the system response by the width of the filter, in order to use the spike times as a means of identifying whether a spike was properly detected, as seen in Figure 6-11.

The results obtained with ESN-MACE are compared to threshold and matched filter detectors. The matched filter is obtained from the same set of fifty spike signals according to the method explained in the previous section.

In the first dataset, the ESN-MACE filter, the threshold detector and the matched filter all correctly identify the 300 spikes present in 100,000 samples with no missed identifications, demonstrating that the ESN-MACE filter performs as well as both the matched filter and the threshold detector. This was an expected result, since the spikes present in this dataset are easily discernible from the background noise and come from a single class. In fact, this is a very simple dataset created by a single neuron with very low noise levels.

On the other hand, the results for the more interesting and more challenging rat dataset are not as perfect, since the variability in the spike signals is tremendous. Figure 6-12 depicts the ROC curves for the ESN-MACE, the threshold detector and the matched filter. As observed from the figure, ESN-MACE gives the best results of the three methods. The ESN-MACE filter identifies 273 spikes within 100,000 data samples of the rat dataset, corresponding to approximately 4 seconds of recorded data. In order to correctly identify all 273 spikes, the ESN-MACE filter incurs a cost of 42 falsely identified spikes, whereas the matched filter and the threshold detector incur costs of 262 and 210 falsely identified spikes, respectively, to achieve the same level of identification. Obviously, the variability in the rat dataset makes detection much more challenging compared to the first dataset.
The ESN-MACE filter handles the variability in the spike shapes better than the simple threshold detector and the matched filter detector.
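A minimal sketch of the state-image construction and zero-lag correlation used above, assuming a tanh reservoir with the update x(n+1) = tanh(W x(n) + Win u(n)). The function names are illustrative, and the usage below draws W at random and rescales it, whereas the W in the text is designed with the ASE metric:

```python
import numpy as np

def echo_state_image(u, W, W_in):
    """Drive the reservoir with signal u (zero initial state) and stack
    the echo states into an (N_PEs x len(u)) image."""
    x = np.zeros(W.shape[0])
    states = []
    for sample in u:
        x = np.tanh(W @ x + W_in * sample)   # ESN state update
        states.append(x)
    return np.stack(states, axis=1)

def mace_response(state_image, filter_image):
    """Zero-lag 2-D correlation of the state image with the MACE
    template; the result is compared to a threshold for detection."""
    return float(np.sum(state_image * filter_image))

# Usage: a 28-PE reservoir, spectral radius 0.9, +/-1 input weights,
# driven by a length-21 snippet, yielding a 28x21 state image.
rng = np.random.default_rng(1)
W = rng.standard_normal((28, 28))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))
W_in = rng.choice([-1.0, 1.0], size=28)
img = echo_state_image(rng.standard_normal(21), W, W_in)
print(img.shape)
```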


Figure 6-11. Segment of a neural recording from the rat's brain.

Figure 6-12. Comparison of ROC curves for the rat dataset. Each grid division corresponds to 50 spikes. The leftmost dotted curve corresponds to our ESN-MACE filter system, the solid line corresponds to a matched filter system, and the dashed curve corresponds to a threshold detector system.


Robust signal detection in a digital communication channel

The detection of a known waveform transmitted through a channel in a noisy environment is a critical problem in signal processing, with applications in areas such as digital communication, radar, pattern recognition and biomedical engineering (Helstrom, 1995; Proakis, 2001). Linear matched filtering is the popular solution to this problem, since it is optimal in terms of maximizing the signal-to-noise ratio (SNR) at its output under a linear additive white Gaussian noise (AWGN) channel assumption (Proakis, 2001). However, matched filters lose their optimality if the noise is non-Gaussian or if the channel introduces nonlinear distortions on the transmitted signal. For instance, in the case of impulsive noise (with infinite variance) in a linear channel, the noise power at the output of the matched filter is still infinite, which creates erratic performance given the finite sample size produced by the filter impulse response.

In mathematical terms, the problem of detecting a known deterministic signal s(n) corrupted by a zero-mean additive white noise v(n) is equivalent to the binary hypothesis testing problem (Helstrom, 1995):

H0: u(n) = s(n) + v(n),  n = 1, ..., T
H1: u(n) = v(n),  n = 1, ..., T

The matched filter is defined by the impulse response s(T - n), and the filter output is obtained by convolving the received signal u(n) with the filter impulse response. The SNR is maximized if the filter output is sampled at lag T, since the correlation between s(n) and the filter impulse response is maximal at this lag. The SNR is defined as the ratio of the total signal energy to the noise variance (Helstrom, 1995):

SNR = [ sum_{n=1}^{T} s^2(n) ] / var(v(n))


The output of the matched filter is sampled at the optimal time instant T and compared to a threshold to detect the presence or absence of the signal s(n).

The performance of the matched filter is compared with that of the ESN-MACE and the discriminative TDNN for a linear channel under Gaussian and impulsive noise distributions. The performance metric is the receiver operating characteristic (ROC) curve, a standard measure for evaluating the performance of a detector by demonstrating the trade-off between probability of detection and probability of false alarm (Helstrom, 1995; Proakis, 2001). The ROC curve for each filter is obtained by varying the threshold value and calculating the detection rate and false alarm rate for each value of the threshold.

The transmitted signal is chosen to be a Gaussian pulse s(n) = exp(-n^2), where n is varied between -5 and 5 with a time step of 0.25 units. The noise distribution is chosen to be either Gaussian or impulsive. Impulsive noise is generated using a Gaussian mixture model with probability density function

p(n) = (1 - eps) N(0, sigma1^2) + eps N(0, sigma2^2)

In our simulations, we used eps = 0.15, sigma2^2 = 50 sigma1^2, and sigma1^2 = 0.07.

The ASE-ESN with 30 internal units and a single input unit from section 3.1 is used. The TDNN uses a window of 41 samples (equal to the signal duration), 5 hidden-layer PEs and a single output unit.

In a communication system, one of two types of operation is possible. We consider the two cases separately.

a. Synchronous operation

In synchronous operation, the timing of the received signal and the optimal sampling time are known during testing. Therefore, the frames are already available for further processing.
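The transmitted pulse and the impulsive-noise model above can be reproduced in a few lines. Note that the mixture parameters (eps = 0.15, sigma2^2 = 50 sigma1^2, sigma1^2 = 0.07) are reconstructed from a garbled equation in the source and should be treated as an assumption:

```python
import numpy as np

def gaussian_pulse(step=0.25):
    """Transmitted signal s(n) = exp(-n^2) for n in [-5, 5]: 41 samples."""
    n = np.arange(-5, 5 + step, step)
    return np.exp(-n ** 2)

def impulsive_noise(size, eps=0.15, var1=0.07, rng=None):
    """Two-component Gaussian mixture: with probability 1 - eps draw from
    N(0, var1), with probability eps from the wide component N(0, 50*var1).
    Parameter values follow the (reconstructed) figures quoted in the text."""
    rng = rng or np.random.default_rng()
    wide = rng.random(size) < eps            # which samples are impulses
    std = np.where(wide, np.sqrt(50 * var1), np.sqrt(var1))
    return std * rng.standard_normal(size)

# Usage: one noisy received frame u(n) = s(n) + v(n).
s = gaussian_pulse()
u = s + impulsive_noise(len(s))
```

The mixture's overall variance is (1 - eps)*var1 + eps*50*var1, so occasional samples are drawn from a component with 50 times the nominal variance, which is what defeats the matched filter's Gaussian-noise optimality.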


In order to calculate the MACE filter weights for synchronous operation, 100 realizations of the received signal (s(n) + v(n)) are generated and fed to the ESN one sample at a time. For each realization, the echo states are calculated using Equation 2-1 with the initial state set to zero. The echo states create an image of dimension 30x41, where 30 is the number of PEs and 41 is the signal length. Output correlation peak amplitudes are assigned to be 1.0 for the training data. The MACE filter is computed in the frequency domain using Equation 4-1, and the corresponding correlator weights are obtained by the inverse discrete Fourier transform.

During testing, since the optimal decision time is known, the received signal u(n) corresponding to a transmitted symbol can be processed in a batch. The received signal of length 41 is fed to the ESN with zero initial conditions (as in training), the echo states are computed, and the decision is made once at sample 41. Here, the echo states create an image of dimension 30x41. The output of the MACE filter is computed by 2-D correlation of the echo states with the filter coefficients.

The TDNN is used in discriminative mode, where the desired target signal is +1 when the received signal is (s(n) + v(n)) and 0 when the received signal is v(n). 500 realizations of the received signal (250 from each class) are generated to train the weights of the TDNN using the Levenberg-Marquardt algorithm with a step size of 0.01. A cross-validation set of 200 realizations is used to stop the training. Notice that we use more data and an extra cross-validation set for TDNN training compared to ESN-MACE training. During testing, a received signal of length 41 is fed to the TDNN, and the output obtained from the TDNN is compared to different threshold values to plot the ROC curve. We used the neural networks toolbox of Matlab to simulate and train the TDNN.
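Equation 4-1 is not reproduced in this excerpt, but the standard frequency-domain MACE synthesis, h = D^-1 X (X^H D^-1 X)^-1 u, where the columns of X are the 2-D FFTs of the training state images, D is the diagonal of their average power spectrum, and u holds the desired correlation peaks (all 1.0 here), can be sketched as follows. This is a sketch of the textbook formula, assumed to match the text's Equation 4-1, not the author's exact code:

```python
import numpy as np

def mace_filter(train_images, peaks=None):
    """Synthesize the MACE filter in the frequency domain and return
    the image-plane filter via the inverse DFT."""
    X = np.stack([np.fft.fft2(im).ravel() for im in train_images], axis=1)
    if peaks is None:
        peaks = np.ones(X.shape[1])            # unit correlation peaks
    d = np.mean(np.abs(X) ** 2, axis=1)        # diagonal of D
    Dinv_X = X / d[:, None]                    # D^-1 X
    A = X.conj().T @ Dinv_X                    # X^H D^-1 X (small matrix)
    h = Dinv_X @ np.linalg.solve(A, peaks.astype(complex))
    return np.fft.ifft2(h.reshape(train_images[0].shape))
```

By construction the filter satisfies the peak constraints exactly: correlating it with any training image gives the prescribed output (1.0) at zero lag, while the D^-1 weighting minimizes the average correlation-plane energy.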


For each noise distribution and a 0.5 probability of signal transmission, 10,000 Monte Carlo simulations are performed to compute the ROC curves of the matched filter, ESN-MACE and TDNN. Figure 6-13A shows the ROC curves under Gaussian and impulsive noise with the SNR set to 10 dB for synchronous operation. Under Gaussian noise, the matched filter is the optimal detection filter and hence results in the optimal ROC. However, the ROC curves of the ESN-MACE and the TDNN are still close to that of the matched filter. The performance of the matched filter degrades tremendously under impulsive noise, whereas the ESN-MACE performance is robust to impulsive noise. The TDNN performance under impulsive noise is better than the matched filter but worse than the ESN-MACE, even though the TDNN requires more data samples and a cross-validation data set for training. If the training of the TDNN is done without a cross-validation set and stopping is done after 50 epochs, the generalization performance suffers tremendously for both noise distributions. We have also tested a TDNN trained in predictive mode, where the desired target signal is the next value of the input. The results are very poor, mostly due to the noise level involved in the experiment.

b. Asynchronous operation

In asynchronous operation, the timing of the received signal (and the optimal sampling time) is unknown during testing. Therefore, in asynchronous operation, the frames should be constructed for time n by extracting the echo states from time n-40 to n. Notice that the echo states from time n-40 to n will also be affected by the value of the state at time n-40 (the initial state). In synchronous operation we did not have this problem, since the timing of the received signal is known and the frames can be processed separately using a zero initial state.

To mimic the testing conditions of asynchronous operation while training the MACE filter, the initial conditions of the ESN are not set to zero but are determined by the history of the input. 100 different realizations of the received signal are generated and inserted into a noise signal of 10,000 samples at random time instants. The resulting signal is fed to the ESN and the echo states are calculated. The echo states during the application of the received signal are extracted. The echo states create an image of dimension 30x41, where 30 is the number of PEs and 41 is the signal length. Output correlation peak amplitudes are assigned to be 1.0 for the training data. The MACE filter is computed in the frequency domain using Equation 4-1, and the corresponding correlator weights are obtained by the inverse discrete Fourier transform.

During testing, the received signal is fed to the ESN and the echo states are computed. Then, at every time instant n, the echo states from time n-40 to time n are fed to the MACE filter and their 2-D correlation with the MACE is computed. The correlation result is compared to various threshold values to obtain the ROC curve.

We also trained a TDNN with the specifications from section 3.1.a. We created a data set of 500 realizations of the received signal (250 for each class). The generated data is used to train the weights of the TDNN using the Levenberg-Marquardt algorithm with a step size of 0.01. A cross-validation set of 200 realizations is used to stop the training. The TDNN is trained in discriminative mode with the same target signal as in section 3.2.a. During testing, the received signal is fed to the TDNN. At every time instant n, the input signal from time n-40 to time n is used to produce the TDNN output, which is then compared to various threshold values to obtain the ROC curve.
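The ROC construction used throughout this section, sweeping a threshold over detector outputs for signal-present and noise-only trials, amounts to the following sketch (function and variable names are illustrative):

```python
import numpy as np

def roc_curve(scores_signal, scores_noise, thresholds):
    """Return (false-alarm rate, detection rate) pairs, one per threshold.

    scores_signal: detector outputs on trials where the signal was sent.
    scores_noise:  detector outputs on noise-only trials.
    """
    pts = []
    for th in thresholds:
        pd = np.mean(scores_signal >= th)   # probability of detection
        pfa = np.mean(scores_noise >= th)   # probability of false alarm
        pts.append((pfa, pd))
    return pts

# Usage: perfectly separable toy scores trace out an ideal ROC corner.
print(roc_curve(np.array([0.9, 0.8, 0.7]), np.array([0.1, 0.2]), [0.5, 0.0]))
```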
Figure 6-13B shows the ROC curves for the matched filter, the ESN-MACE and the TDNN under Gaussian and impulsive noise with the SNR set to 10 dB. Similar to the


synchronous case, the matched filter yields the superior ROC curve under Gaussian noise, whereas the ESN-MACE and TDNN performance is still robust to impulsive noise. ESN-MACE performs best under impulsive noise, as in synchronous operation. The performance of all filters in asynchronous operation degrades slightly compared to synchronous operation, since neither the timing of the received signal nor the initial condition is available in asynchronous operation.


CHAPTER 7
CONCLUSIONS

The great appeal of echo state networks (ESNs) is their ability to construct arbitrary mappings of signals with rich and time-varying temporal structure without requiring adaptation of the free parameters of the recurrent layer. The echo state condition allows the recurrent connections to be fixed, and only a linear output layer needs to be trained. In a way, the difficulty associated with RNN training is avoided while preserving the well-known power of the recurrent topology. However, fixing the recurrent connections imposes important constraints on the overall architecture that have not yet been fully studied. In this dissertation, we addressed some of the issues with the state of the art by proposing a signal processing framework to understand the operation of ESNs, a system-theoretic approach to quantify ESN dynamics, and a metric to quantify the richness of the reservoir for computation. The proposed framework led to a systematic design procedure for the fixed reservoir weights of the ESN. We further investigated the use of ESNs for temporal pattern recognition and compared several approaches to tackle the problem. A novel associative memory readout adopted from the image processing literature, the MACE filter, was proposed for ESNs to be used in pattern recognition problems. ESNs with the linear readout were utilized to solve various real-world problems such as brain-machine interface design, prediction of water inflow, and channel equalization. The ESN combined with the MACE readout led to a detector for action potentials in neural recordings that outperforms standard techniques such as the threshold detector and the matched filter. A nonlinear matched filter that can detect signals in a digital communication channel under different noise conditions was also proposed. The ESN idea has implications for the understanding and design of other architectures. We investigated the use of readouts for the Freeman model of the olfactory cortex.
With the proposed readouts, we showed that the Freeman model can process continuous-valued, time-varying signals.


In addition, we investigated the effect of the echo state condition on computation. By allowing the spectral radius to be slightly greater than 1, we showed that a system without a global stability constraint can still perform useful computation. The detailed conclusions for each of the topics are given below.

Understanding Echo State Networks

In this dissertation, we first proposed a framework for ESNs, a signal processing interpretation of bases and projections in functional spaces, to describe and understand the ESN architecture. According to this interpretation, echo states implement a set of functional bases formed by fixed nonlinear combinations of the input. The linear readout at the output stage simply computes the projection of the desired output space onto this representation space. We further introduced an information-theoretic criterion, the average state entropy (ASE), for better understanding and evaluating the capability of a given ESN to construct such a representation layer. The entropy of the distribution of the echo states at each time step is calculated and averaged over time to obtain the ASE measure. ASE quantifies the volume of the state manifold. As such, this volume should be as large as possible to achieve the smallest correlation among the bases and to cope with arbitrary mappings.

We also interpreted the ESN dynamics as coupled linear systems obtained from the linearization of the ESN nonlinear PE in a small local neighborhood of the current state. This system-theoretic approach allowed the visualization of the movement of the system dynamics in response to the input signal. A single ESN with fixed parameters implements a combination of many linear systems with varying pole locations, hence many different time constants, that modulate the richness of the reservoir of dynamics as a function of input amplitude.
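For tanh PEs, the linearization just described has Jacobian diag(1 - x^2) W, since d tanh(net)/d net = 1 - tanh(net)^2, so the instantaneous poles are its eigenvalues. The helper below is an illustrative sketch of that computation:

```python
import numpy as np

def linearized_poles(W, x):
    """Poles of the ESN linearized at operating point x (the post-tanh
    state): eigenvalues of diag(1 - x^2) W."""
    return np.linalg.eigvals((1.0 - x ** 2)[:, None] * W)

# Usage: compare poles near the origin of state space (input close to
# zero) against poles in saturation (high-amplitude input).
rng = np.random.default_rng(3)
W = rng.standard_normal((20, 20))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))     # spectral radius 0.9
p_zero = linearized_poles(W, np.zeros(20))    # radius up to 0.9
p_sat = linearized_poles(W, 0.99 * np.ones(20))  # shrunk toward origin
```

At x = 0 the Jacobian reduces to W itself, so the poles sit at the designed spectral radius; near saturation the factor (1 - x^2) pulls every pole toward the origin of the z-plane, which is exactly the amplitude-dependent behavior described in the text.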
Higher-amplitude portions of the signal tend to saturate the nonlinear function and cause the poles to shrink toward the origin of the z-plane (decreasing the spectral radius), which results in a system


with a large stability margin. When the input is close to zero, the poles of the linearized ESN are close to the maximal spectral radius chosen, decreasing the stability margin. When compared to its linear counterpart, an ESN with the same number of states results in a detailed coverage of the z-plane dynamics, which illustrates the power of nonlinear systems. Moreover, the spectral radius of an ESN can be adjusted using a constant bias signal at the ESN input without changing the recurrent connection matrix W. The application of a nonzero constant bias moves the operating point to regions of the sigmoid function closer to saturation and always decreases the spectral radius due to the shape of the nonlinearity.

Designing Echo State Networks

We proposed a systematic design procedure for ESNs. When the desired response is not accessible for the design of the ESN bases, or when the same reservoir is to be used for a number of problems, the default strategy should be to maximize the ASE of the state vector. However, not all function approximation problems require the same memory depth, which is coupled to the spectral radius. The effective spectral radius of an ESN can be optimized for the given problem with the help of an external bias signal that is adapted using the joint input-output space information. The interesting property of this method, when applied to ESNs built from sigmoidal nonlinearities, is that it allows the fine tuning of the system dynamics for a given problem with a single external adaptive bias input and without changing internal system parameters. Moreover, the bias can easily be trained either with a line search method or a gradient-based method, since it is one dimensional. We illustrated experimentally that the design of the ESN using the maximization of ASE, with the adaptation of the spectral radius by the bias, provided consistently better performance across tasks that require different memory depths.
This two-parameter design methodology is therefore preferable to the spectral radius criterion proposed by Jaeger, and it is still easily incorporated in the ESN design.
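The ASE measure itself can be estimated nonparametrically. The sketch below uses a Parzen-window estimate of Renyi's quadratic entropy applied to the state values at each time step (each column of the N_PEs x T state matrix) and averages over time; the choice of this particular estimator and the kernel size are assumptions for illustration:

```python
import numpy as np

def renyi_quadratic_entropy(samples, sigma=0.1):
    """Parzen estimate of Renyi's quadratic entropy:
    H2 = -log( (1/N^2) sum_ij G(x_i - x_j; 2 sigma^2) )."""
    d = samples[:, None] - samples[None, :]
    k = np.exp(-d ** 2 / (4 * sigma ** 2)) / np.sqrt(4 * np.pi * sigma ** 2)
    return -np.log(np.mean(k))

def average_state_entropy(states):
    """ASE: entropy of the echo-state distribution at each time step
    (columns of the state matrix), averaged over time."""
    return float(np.mean([renyi_quadratic_entropy(states[:, t])
                          for t in range(states.shape[1])]))
```

States that spread over the whole [-1, 1] range yield a higher ASE than states collapsed near a point, matching the "volume of the state manifold" interpretation above.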


Computation at the Edge of Chaos

Experiments demonstrated that the ASE for an ESN with uniform linearized poles is maximized when the spectral radius of the recurrent weight matrix approaches one (instability). It is interesting to relate this observation with the computational properties found in dynamical systems at the edge of chaos (Packard, 1988; Langton, 1990; Mitchell et al., 1993; Bertschinger and Natschläger, 2004). Langton stated that when cellular automata rules are evolved to perform a complex computation, evolution will tend to select rules with critical parameter values, which correlate with a phase transition between ordered and chaotic regimes. Recently, similar conclusions were suggested for liquid state machines (LSMs) (Bertschinger and Natschläger, 2004). Langton's interpretation of the edge of chaos was questioned by Mitchell et al. (1993). Here, we provided a system-theoretic view and explained the computational behavior by the diversity of dynamics achieved with linearizations that have poles close to the unit circle, exactly at the edge of chaos. According to our results, the spectral radius of the optimal ESN in function approximation is problem dependent, and in general it is impossible to forecast the computational performance as the system approaches instability (as the spectral radius of the recurrent weight matrix approaches one). However, allowing the system to modulate the spectral radius, either by the output or by internal biasing, may allow a system at the edge of chaos to solve various problems requiring different spectral radii.

Transiently Stable Computation

We introduced a new computational mode, transiently stable computation, that can be observed in nonlinear systems. Linear system theory shows clearly that stability is necessary to obtain useful responses from linear systems. However, this argument does not necessarily apply to nonlinear systems, as observed in transiently stable computation.
A nonlinear system can be unstable and still have dynamics that are controlled by the input while it is applied.


We showed that transiently stable systems can still compute input-output mappings. From a computation point of view, this regime is interesting because it toggles between internal dynamics and external dynamics. This makes transiently stable computation biologically more realistic than conventional echo states, since biological systems do not have fixed-point dynamics but rather a wide variety of collective behavior in the form of oscillations and even chaos.

Echo State Networks for Temporal Pattern Recognition

We argued that the high-dimensional internal states of the ESN/LSM architecture create an excellent medium for dynamical pattern recognition. We approached the problem without the use of a desired label, as conventionally done, but instead by designing a readout for each class based on a specially trained architecture, the linear associative memory (LAM). Unlike computer memory, LAMs store information globally in the system weights and access memory by its content. Among the family of LAMs, we employed the minimum average correlation energy (MACE) filter, which is widely used in the recognition of objects in 2-D images due to its much higher rejection of outliers when compared with conventional LAM training. The MACE interprets the states of the ESN/LSM as a two-dimensional image, one dimension being time and the other space. An optimal template image for each class can be analytically computed using training data. During testing, the ESN states are correlated with each template image, and the class label of the template with the highest correlation is assigned to the input pattern. A single MACE readout is able to represent a class of temporal signals because it incorporates explicitly the covariance of the training data class.
If the conventional linear readout is utilized, it can only represent the mean of the posterior density of the class given the data, as is well known both in classification (Bishop, 1995) and in regression/prediction (Haykin, 2001). This was clearly demonstrated in


the classification of triangular versus square waves of different amplitudes (i.e., a class of signals).

The ESN-MACE combination is a nonlinear template matcher that is poised to replace the linear matched filter in digital communication systems, with more robust performance under different noise distributions. The detection performance of the linear matched filter, which is optimal under Gaussian noise, degrades tremendously under impulsive noise, whereas the ESN-MACE filter provided consistent results under both Gaussian and impulsive noise. We applied the ESN-MACE filter to the classification of odors from an electronic nose and compared it to the MACE filter trained on the input space. The ESN increases the margin between the in-class and out-of-class outputs of the MACE filter due to the user-defined high dimensionality provided by the ESN reservoir.

We also provided a preliminary experiment that extends the proposed MACE filter readout to LSMs. The MACE readout does not require the conversion of the liquid state outputs (spike trains) into continuous-valued signals by filtering. Instead, the value of the liquid state at each time instant is interpreted by the MACE as 1 or 0, depending on the existence or absence of a spike. The liquid state is usually very sparse, leading to singular matrices in the MACE filter computation. Therefore, proper regularization of the singular matrices should be done before computing the LSM/MACE filter weights in order to avoid numerical errors. Detailed studies have to be performed to better understand the advantages/disadvantages of a MACE readout for LSMs.

Channel Equalization with Complex Echo State Networks

We proposed the ESN for signal processing with complex numbers. ESNs are very suitable for complex-domain signal processing, since training is linear regression, which is very simple in both the real and complex domains. ESNs do not require backpropagation of weights through complex


nonlinear functions, which is problematic in the complex domain since it requires the estimation of complex gradients. We compared the complex ESN with the complex least mean squares algorithm and the complex multilayer perceptron in a complex-domain channel equalization problem and achieved superior results in terms of bit error rate.

Freeman Model

We demonstrated that the Freeman model (FM) can be considered in the same framework as the ESN/LSM. In fact, the KI and KII networks of the FM are conceptually similar to the reservoir/liquid in an ESN/LSM with a proper selection of the parameters. The big difference is that the KII layer is highly structured, with a coupled set of nonlinear oscillators. We derived conditions on the system parameters of the FM similar to the echo state property of ESNs. With the proposed framework, it became evident that the OB layer alone lacks a readout in order to be used as a universal computing machine for time series. Therefore, we proposed to use an adaptive linear network to implement the readout from the OB layer. We presented experimental results to show the power of this framework for the FM of the olfactory cortex. With the proposed framework, we showed that it is not necessary to drive Freeman's model with 0s and 1s, as is conventionally done.

Applications

We have also utilized ESNs for various real-world engineering problems.

Brain Machine Interfaces

We used ESNs to model brain-machine interfaces. In BMIs, the goal is to predict a kinematic variable, such as the hand position, from electrophysiological recordings that are acquired through microarrays from the brain. Mesoscopic-level ECoG signals and the simultaneous hand position of a human patient are used to build the models. We compared the performances of a Wiener filter, an ESN and a leaky ESN in terms of modeling accuracy using the correlation coefficient metric. Both the ESN and the leaky ESN achieved superior performance compared to the


Wiener solution, since the task requires nonlinear modeling. The leaky ESN achieved the best performance, since the input signals change rapidly whereas the hand position changes at a much slower rate. In problems where the desired signal is at a slower rate than the input signal, the built-in low-pass filter of the leaky ESN is very useful for matching the signal rates.

Prediction of Water Inflow

We utilized ESNs to forecast the water inflow of a hydropower plant reservoir. Water inflow information is fundamental to the planning of hydrothermal power system operation. A database of average monthly water inflows of the Furnas plant, one of the Brazilian hydropower plants, was used as the source of training and test data. The results of the ESN forecasting were compared with other nonlinear models. We have shown that the ESN performed significantly better than the other nonlinear models, with the advantage of a significantly simpler and faster training algorithm.

Spike Detection

Spike detection is a challenging task due to the variability in neural recordings and the nonstationary, non-Gaussian noise embedded in the signal. Conventional techniques such as the threshold detector or the matched filter fail especially when the neural signal comprises recordings from multiple neurons spaced at different distances relative to the microelectrode. We proposed the ESN-MACE filter to tackle the variability in the spike detection problem. We showed on challenging neural data recorded from a rat's brain that the ESN-MACE combination results in lower false alarm rates for perfect detection. The better results of the ESN-MACE are attributed to the fact that the MACE filter can model a class of signals by incorporating the covariance information of the class, whereas the ESN provides a rich representation space as the input to the MACE.
Matched Filtering

The ESN-MACE combination is a nonlinear template matcher that is poised to replace the linear matched filter in digital communication systems with more robust performance under different noise distributions. The detection performance of the linear matched filter, which is optimal under Gaussian noise, degrades tremendously under impulsive noise, whereas the ESN-MACE filter provided consistent results under both Gaussian and impulsive noise.

Possible Future Directions

Recurrent neural networks train all the weights in the network using algorithms such as backpropagation through time or real-time recurrent learning, whereas echo state networks train only the readout network weights. A possibility in the recurrent network learning paradigm is to find an intermediate step where the internal connection weights are cleverly initialized with a simple algorithm. This initialization algorithm would find a good choice of weights for the internal connection weight matrix using both the input and the desired response, and would optimally compute the output matrix. This intermediate network would address a big drawback in the state of the art of ESN design, which ignores the desired response when designing the internal weight matrix.

A better understanding of the effect of dynamics in terms of computational power has to be established. We have shown that a nonlinear system without a global stability constraint can still do useful computation in terms of function approximation. Similarly, the KII network of the Freeman Model can be considered in the same framework. However, we have not yet observed a concrete advantage of this regime compared to the conventional fully stable regime. A thorough study that links stability with computational power could be very interesting from both an engineering and a biological systems perspective.

The use of multiple linear readouts in the spirit of local linear modeling can considerably improve the performance of ESNs. In our opinion, the main difficulty associated with this
approach would be to divide the input space, since the signals under consideration are not static but time series. A practical solution to this problem, either by embedding the input time series or, better yet, by using the instantaneous value of the echo states, has to be developed.
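One possible realization of this multiple-readout idea can be sketched as follows. Here a simple k-means partition of the instantaneous echo states selects which local linear readout is used; the partitioning method and all dimensions are our own illustrative choices, and random vectors stand in for echo states collected from a driven reservoir:

```python
import numpy as np

rng = np.random.default_rng(1)

# Stand-ins for T collected echo states (T x N) and a desired signal d.
T, N, K = 400, 10, 3
X = rng.standard_normal((T, N))
d = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(T)

# Partition the state space with a small k-means (illustrative choice).
centers = X[rng.choice(T, K, replace=False)]
for _ in range(20):
    labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
    for k in range(K):
        if np.any(labels == k):
            centers[k] = X[labels == k].mean(axis=0)
labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)

# One linear readout (with bias) per cluster, fit by least squares.
readouts = []
for k in range(K):
    mask = labels == k
    Xa = np.column_stack([X[mask], np.ones(mask.sum())])
    w, *_ = np.linalg.lstsq(Xa, d[mask], rcond=None)
    readouts.append(w)

def predict(x):
    """Gate by the nearest cluster center, then apply its local readout."""
    k = np.argmin(((centers - x) ** 2).sum(-1))
    return np.append(x, 1.0) @ readouts[k]

y = np.array([predict(x) for x in X])
mse_local = np.mean((y - d) ** 2)

# Single global linear readout for comparison.
Xa_g = np.column_stack([X, np.ones(T)])
w_g, *_ = np.linalg.lstsq(Xa_g, d, rcond=None)
mse_global = np.mean((Xa_g @ w_g - d) ** 2)
```

For time-series inputs, the open question raised above is precisely how this partition should be formed: clustering the instantaneous states, as done here, is the simpler of the two options mentioned.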

BIOGRAPHICAL SKETCH

Mustafa Can Ozturk was born in Antalya, Turkey, on November 23, 1980. He received his B.S. in electrical and electronics engineering and his B.S. in mathematics, both from Middle East Technical University, Ankara, Turkey, in 2002. He has been pursuing his graduate studies in the electrical and computer engineering department at the University of Florida under the supervision of Dr. Jose Principe. He received his master's degree in May 2004 and his Ph.D. degree in May 2007. His research interests broadly include dynamical neural networks, adaptive systems, and signal processing.