This work is particularly motivated by the following facts.
* Adaptive filtering has wide application in the digital signal processing, communication,
and control fields. A finite impulse response (FIR) filter [1, 2] is a simple structure
for adaptive filtering and has been extensively developed. Recently researchers have
attempted to use IIR structures because they perform better than FIR structures
with the same number of coefficients. However, some major drawbacks inherent
to adaptive IIR structures are slow convergence, possible convergence to biased or
unacceptable suboptimal solutions, and the need for stability monitoring.

* Stochastic approximation methods [3] have the property of converging to the global
optimum with a probability of one, as the number of iterations tends to infinity.
These methods are based on a random perturbation to find the absolute optimum
of the cost function. In particular, the method of stochastic approximation with
convolution smoothing has been successful in several applications. It has been
empirically proven to be efficient in converging to the global optimum in terms of
computation and accuracy. The convolution smoothing function "smooths out"
a nonconvex objective function by convolving it with a suitable probability density
function (pdf). In the beginning of adaptation, the variance of the pdf is set to a
sufficiently large value, so that convolution smoothing turns the nonconvex objective
function into a convex one. The variance is then slowly reduced to zero, whereby the
smoothed objective function returns to the original objective function as the algorithm
converges to the global optimum (see the sketch after this list). This variance is
determined by a cooling schedule parameter. The cooling schedule is a critical factor
in global optimization, because it affects the global search capability.

* Convolution smoothing has been used exclusively with the mean square error (MSE)
criterion. MSE has been used extensively in the theory of adaptive systems because
of its analytical simplicity and the common assumption of Gaussian distributed
error. However, more sophisticated applications (such as independent component
analysis and blind source separation) require a criterion that considers higher-order
statistics for the training of adaptive systems. The computational neural engineering
laboratory studied the entropy cost function [4]. Shannon first introduced the
entropy of a given probability distribution function, which provides a measure of the
average information in that distribution. By using the Parzen window estimator,
we can estimate the pdf directly from a set of samples.
to apply the entropy criterion to the system identification framework [5]. As shown
in this thesis, the kernel size of the Parzen window estimator becomes an important
parameter in the global optimization procedure. Deniz et al. [6] conjectured that
for a sufficiently large kernel size, the local minima of the error entropy criterion
can be eliminated. It was suggested that by starting with a large kernel size, and then
slowly decreasing this parameter to a predetermined suitable value, the training
algorithm can converge to the global minimum of the cost function. The error entropy
criterion considered by Deniz et al. [6], however, does not consider the mean of the
error signal, since entropy is invariant to translation. Here we modify the criterion
and study the reason why annealing the kernel size produces global optimization
algorithms.
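To make the convolution smoothing idea concrete, the following is a minimal numerical sketch (not from the original text): a toy one-dimensional nonconvex cost is smoothed by convolution with a Gaussian pdf whose standard deviation beta is annealed toward zero, so the smoothed cost starts out nearly convex and gradually returns to the original cost. The cost function f and all parameter values are illustrative assumptions.

```python
import numpy as np

# Hypothetical toy nonconvex cost; stands in for the adaptive filter's
# objective function (illustrative assumption, not from the thesis).
def f(theta):
    return theta**2 + 2.0 * np.sin(3.0 * theta)

def smoothed_f(theta, beta, n_samples=2000, seed=0):
    # Monte Carlo estimate of the smoothed functional
    # f_beta(theta) = E[f(theta - beta * eta)], eta ~ N(0, 1):
    # the convolution of f with a Gaussian pdf of standard deviation beta.
    eta = np.random.default_rng(seed).standard_normal(n_samples)
    return np.mean(f(theta - beta * eta))

# Cooling schedule: a large beta smooths out local minima; beta -> 0
# recovers the original cost as the search narrows on the global optimum.
for beta in [2.0, 1.0, 0.5, 0.0]:
    vals = [smoothed_f(t, beta) for t in np.linspace(-3.0, 3.0, 7)]
    print(f"beta = {beta:3.1f}:", np.round(vals, 2))
```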

1.2 Literature Survey

We surveyed the literature in the areas of adaptive filtering, optimization methods,
and the mathematics used in the analysis of the algorithm.

1.2.1 Adaptive Filtering

Numerous adaptive filtering algorithms have been proposed in the literature [7, 8],
especially for system identification [9, 10]. Some valuable general papers on the topic
describe the characteristics of the most commonly used algorithms for adaptive IIR
filtering in a simple and unified framework. Recently a full book was published on IIR
filters [15].

The major goal of an adaptive filtering algorithm is to adjust the adaptive filter

coefficients in order to minimize a given performance criterion. Literature about

adaptive filtering can be classified into three categories: adaptive filter structures,

adaptive algorithms, and applications.
* Adaptive filter structure. The choice of the adaptive filter structure affects the
computational complexity and the convergence speed. Basically, there are two kinds of
adaptive filter structures.
-Adaptive FIR filter structure. The most commonly used adaptive FIR filter
structure is the transversal filter, which implements an all-zero filter in canonic
direct form (without any feedback). For this adaptive FIR filter structure, the
MSE performance surface is quadratic in the coefficients, with a unique global minimum.

-Adaptive IIR filter structure. White [21] first presented an implementation
of an adaptive IIR filter structure. Later, many articles were published in this
area. For simple implementation and easy analysis, most adaptive IIR filter
structures use the canonic direct form realization. Some other realizations have
been presented to overcome drawbacks of the canonic direct form, such as
slow convergence and the need for stability monitoring [22]. Commonly used
realizations are cascade [23, 24], lattice [25, 26], and parallel [27, 28] realizations.
Other realizations have also been presented recently by Shynk et al. [29] and Jenkin
et al. [30].

* Algorithm. An algorithm is a procedure used to adjust the adaptive filter coefficients
in order to minimize the cost function. The algorithm determines several important
features of the whole adaptive procedure, such as computational complexity,
convergence to suboptimal solutions, biased solutions, objective cost function,
and error signal. Early local adaptive filter algorithms were the Newton method,
the quasi-Newton method, and the gradient method. Newton's method seeks the minimum
of a second-order approximation of the objective cost function. The quasi-Newton method is a
simplified version of the Newton method using a recursively calculated estimate of the
inverse of the second-order matrix. The gradient method searches for the minimum of
the objective cost function by tracking the opposite direction of the gradient vector
of the objective function [31]. It is well known that the step size controls stability,
convergence speed, and misadjustment [1]. For FIR adaptive filtering, local methods
were sufficient since the optimization was linear in the weights. However, in IIR
adaptive filtering this is no longer the case. The most commonly known approaches
for adaptive IIR filtering are the equation error algorithm [32], the output error algorithm
[12, 11], and composite algorithms [33, 34] such as the Steiglitz-McBride algorithm
[35].
-The main characteristics of the equation error algorithm are unimodality of the
mean-square-error (MSE) performance surface, due to the linear relationship between
the signal and the adaptive filter coefficients, good convergence, and guaranteed
stability. However, it yields a biased solution in the presence of noise.

-The main characteristics of the output-error algorithm are the possible existence of
multiple local minima, which affect the convergence speed, an unbiased globally
optimal solution even in the presence of noise, and the requirement of stability
checking during the adaptive processing.

-The composite error algorithm attempts to combine the good individual
characteristics of both the output error and the equation error algorithms
[36]. Consequently, many papers have been written to overcome the problems mentioned
above.

* Application. Adaptive filtering has been successful in many applications, such
as echo cancellation, noise cancellation, signal detection, system identification,
channel equalization, and control. Some useful information about adaptive filtering
applications appears in the literature [1, 2, 43].

In this dissertation, we focus on adaptive IIR filter algorithms for system

identification.

1.2.2 Optimization Method

There are two adaptation methodologies for IIR filters: gradient descent and global

optimization. The most commonly used method is the gradient descent method, such

as least mean square (LMS) [1]. These methods are well established for the adaptation

of FIR filters and have the advantage of being less computationally expensive. The

problem with gradient descent methods is that they might converge to any local
minimum, and local minima normally imply poor performance. This problem can be
overcome by global optimization methods.

In a strict sense, MSE is a theoretical value that is not easily estimated. In practice,
it can be approximated by the other two objective functions. In general, ISV is easily
implemented but is heavily affected by perturbation noise. Later we present the
entropy of the error as another objective function, but first we must discuss MSE.
The adaptive algorithm attempts to minimize the mean square value of the output
error signal, where the output error is the difference between the unknown system
output and the adaptive filter output. That is,

\[
E(n) = \left[\frac{B(z^{-1})}{A(z^{-1})} - \frac{\hat{B}(z^{-1})}{\hat{A}(z^{-1})}\right] x(n) + v(n) \tag{2-17}
\]

The gradient of the objective function estimate with respect to the adaptive filter
coefficients is given as

\[
\nabla_\theta[E^2(n)] = 2E(n)\,\nabla_\theta[E(n)] = -2E(n)\,\nabla_\theta[y(n)] \tag{2-18}
\]

with

\[
\nabla_\theta[y(n)] :
\begin{cases}
\dfrac{\partial y(n)}{\partial a_i} = y(n-i) + \displaystyle\sum_{k=1}^{n_a} a_k \dfrac{\partial y(n-k)}{\partial a_i} \\[2ex]
\dfrac{\partial y(n)}{\partial b_j} = x(n-j) + \displaystyle\sum_{k=1}^{n_a} a_k \dfrac{\partial y(n-k)}{\partial b_j}
\end{cases} \tag{2-19}
\]

where θ is the adaptive filter coefficient vector.
This equation requires a relatively large memory allocation to store data. In
practice, a small-step approximation that considers the adaptive filter coefficients
to be slowly varying can overcome this problem [64]. Therefore, using the small-step
approximation, the adaptive algorithm is described as

\[
\theta(n+1) = \theta(n) + \mu\, E(n)\, \varphi(n) \tag{2-20}
\]

where φ(n) = {y(n-i), x(n-j)}ᵀ for i = 1, …, nₐ, j = 1, …, n_b, and μ is a small step size.
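As a concrete illustration of Equations (2-18)-(2-20), the following is a minimal sketch (not from the original text) of the small-step output-error gradient update for a direct form IIR adaptive filter, with the regressor φ(n) standing in for the exact recursive gradient of Equation (2-19) under the slow-adaptation assumption; the function name and step size are illustrative.

```python
import numpy as np

# Minimal sketch of the small-step output-error gradient algorithm,
# assuming the IIR model y(n) = sum_k a_k y(n-k) + sum_j b_j x(n-j).
def output_error_gradient(x, d, na, nb, mu=1e-3):
    a = np.zeros(na)                  # feedback coefficients a_1..a_na
    b = np.zeros(nb)                  # feedforward coefficients b_1..b_nb
    y = np.zeros(len(x))
    for n in range(max(na, nb), len(x)):
        y_past = y[n - na:n][::-1]    # y(n-1), ..., y(n-na)
        x_past = x[n - nb:n][::-1]    # x(n-1), ..., x(n-nb)
        y[n] = a @ y_past + b @ x_past
        err = d[n] - y[n]             # output error E(n)
        a += mu * err * y_past        # theta(n+1) = theta(n) + mu E(n) phi(n)
        b += mu * err * x_past
    return a, b
```

The adaptive algorithm is characterized by the following properties.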

Property 1 [65]. The Euclidean square-norm of the error parameter vector, defined by
‖θ̃(n)‖² = ‖θ̂(n) − θ(n)‖², is convergent if μ satisfies

\[
0 < \mu < \cdots \tag{2-21}
\]

Property 2 [31, 66, 67]. The stationary points of the MSE performance surface are given
by

In practice, only the stable stationary points, the so-called equilibria, are of interest,
and these points are classified as follows.

Degenerated points: the equilibrium points where

\[
\begin{cases}
B(z^{-1}, n) = 0 & : n_b < n_a \\
B(z^{-1}, n) = L(z^{-1})\, A(z^{-1}, n) & : n_b \ge n_a
\end{cases} \tag{2-24}
\]

where \(L(z^{-1}) = \sum_{k=0}^{n_b - n_a} l_k z^{-k}\).
Nondegenerated points: all the equilibria that are not degenerated points.
The equilibrium points that influence the form of the error performance surface
have the following property.
Property 3 [12]. If n* > 0, all global minima of the MSE performance surface are given
by

\[
\begin{aligned}
A^*(z^{-1}) &= A(z^{-1})\,C(z^{-1}) \\
B^*(z^{-1}) &= B(z^{-1})\,C(z^{-1})
\end{aligned} \tag{2-25}
\]

where \(C(z^{-1}) = \sum_{k=0}^{n^*} c_k z^{-k}\). This means that all global minimum solutions include
the polynomials describing the unknown system plus a common factor C(z⁻¹) present in the
numerator and denominator polynomials of the adaptive filter.

Property 7 [67]. All degenerated equilibrium points are saddle points, and their existence
implies multimodality (existence of stable local minima) of the performance surface if
either nₐ > n_b = 0 or nₐ = 1.

This property is also valid for the insufficient order cases.

In 1981, Stearns [70] conjectured that if n* > 0 and the input signal x(n) is white
noise, then the performance surface defined by the MSE objective function is unimodal.
This conjecture was believed valid until Fan offered numerical counterexamples for it in 1989
[71].

The most important characteristic of IIR adaptation is the possible existence
of multiple local minima, which can affect the overall convergence. Moreover, the global
minimum solution is unbiased by the presence of zero-mean perturbation noise in the
unknown system output signal. Another important characteristic of IIR adaptation
is the requirement for stability checking during the adaptive process. This stability
checking requirement can be simplified by choosing an appropriate adaptive filter
realization.

2.3 System Identification with Kautz Filter

One of the major drawbacks in adaptive IIR filtering is the stability issue. Since
the filter parameters change during adaptation, a practical approach is to use
cascades of first and second order ARMA sections, where stability can still be checked
simply and locally. A principled way to achieve the expansion of general ARMA
systems is through orthogonal filter structures [72]. Here we use Kautz filters, because
they are very versatile (cascades of second order sections with complex poles, but
still with a reasonable number of parameters). The Kautz filter, which can be traced
back to the original work of Kautz [73], is based on the discrete time Kautz basis
functions. The Kautz filter is a generalized feedforward filter which produces an output
y(n) = φ(n, ζ)ᵀθ, where θ is a set of weights and the entries of φ(n, ζ) are the outputs of
first order IIR filters with a complex pole at ζ [74]. Stability of the Kautz filter is easily
guaranteed if the pole is located within the unit circle (that is, |ζ| < 1). Although the
adaptation is linear in θ, it is nonlinear in the poles, yielding a nonconvex optimization
problem with local minima.
The continuous time Kautz basis functions are the Laplace transforms of continuous
time orthonormal exponential functions, which can be traced back to the original works
of Kautz [73]. The discrete time Kautz basis functions are the Z-transforms of discrete
time orthonormal exponential functions [74]. The discrete time Kautz basis functions
are described as

\[
\Psi_{2k}(z, \zeta_k) = \frac{\sqrt{\tfrac{1}{2}(1-\zeta_k\zeta_k^*)}\;|1+\zeta_k|\;(z^{-1}-1)}{(1-\zeta_k z^{-1})(1-\zeta_k^* z^{-1})}\,
\prod_{l=0}^{k-1} \frac{(z^{-1}-\zeta_l)(z^{-1}-\zeta_l^*)}{(1-\zeta_l z^{-1})(1-\zeta_l^* z^{-1})} \tag{2-28}
\]

\[
\Psi_{2k+1}(z, \zeta_k) = \frac{\sqrt{\tfrac{1}{2}(1-\zeta_k\zeta_k^*)}\;|1-\zeta_k|\;(z^{-1}+1)}{(1-\zeta_k z^{-1})(1-\zeta_k^* z^{-1})}\,
\prod_{l=0}^{k-1} \frac{(z^{-1}-\zeta_l)(z^{-1}-\zeta_l^*)}{(1-\zeta_l z^{-1})(1-\zeta_l^* z^{-1})} \tag{2-29}
\]

where ζₖ = αₖ + jβₖ, (ζₖ, ζₖ*) are the kth pair of complex conjugate poles, |ζₖ| < 1
for stability, and the total number of basis functions is always even.
The orthonormality of the discrete time Kautz basis functions is expressed as

\[
\frac{1}{2\pi j}\oint \Psi_p(z, \zeta)\, \Psi_q(1/z, \zeta)\, \frac{dz}{z} = \delta_{p,q} \tag{2-30}
\]

where the integration contour is the unit circle and Ψ_q(1/z, ζ) is analytic in the
exterior of the circle.
All pairs of complex conjugate poles can be combined into real second order sections
to reduce the degrees of freedom; the resulting basis functions are the discrete-time
2-pole Kautz basis functions. The discrete-time Kautz filter can then be drawn as in
Figure 2-3, where

\[
y(n) = \varphi(n)^T \theta \tag{2-31}
\]

\[
\varphi(n) = [k_0(n),\, k_1(n),\, \dots]^T \tag{2-32}
\]

\[
K_{2k}(z, \zeta) = K_{2k-2}(z, \zeta)\, A(z, \zeta) \tag{2-33}
\]

\[
K_{2k+1}(z, \zeta) = K_{2k-1}(z, \zeta)\, A(z, \zeta) \tag{2-34}
\]

where kᵢ(n) denotes the output of the ith section Kᵢ.

Figure 2-3: Kautz filter model.

\[
K_0(z, \zeta) = \kappa_0\, \frac{z^{-1} - 1}{(1 - \zeta z^{-1})(1 - \zeta^* z^{-1})} \tag{2-35}
\]

\[
K_1(z, \zeta) = \kappa_1\, \frac{z^{-1} + 1}{(1 - \zeta z^{-1})(1 - \zeta^* z^{-1})} \tag{2-36}
\]

\[
A(z, \zeta) = \frac{(z^{-1} - \zeta)(z^{-1} - \zeta^*)}{(1 - \zeta z^{-1})(1 - \zeta^* z^{-1})} \tag{2-37}
\]

\[
\kappa_0 = \sqrt{\frac{1 - \zeta\zeta^*}{2}}\; |1 + \zeta| \tag{2-38}
\]

\[
\kappa_1 = \sqrt{\frac{1 - \zeta\zeta^*}{2}}\; |1 - \zeta| \tag{2-39}
\]

Here ζ is a complex pole (that is, ζ = α + jβ).
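A minimal sketch of the Kautz filter just described, assuming a single complex pole ζ shared by all sections and using scipy for the filtering; the section coefficients follow Equations (2-35)-(2-39) and the cascade recursion (2-33)-(2-34). Function names and the test signal are illustrative assumptions.

```python
import numpy as np
from scipy.signal import lfilter

def kautz_outputs(x, zeta, n_sections):
    alpha, r2 = zeta.real, abs(zeta) ** 2
    den = [1.0, -2.0 * alpha, r2]                  # (1 - zeta z^-1)(1 - zeta* z^-1)
    k0 = np.sqrt((1 - r2) / 2) * abs(1 + zeta)     # kappa_0, Eq. (2-38)
    k1 = np.sqrt((1 - r2) / 2) * abs(1 - zeta)     # kappa_1, Eq. (2-39)
    even = lfilter(k0 * np.array([-1.0, 1.0]), den, x)  # K_0: kappa_0 (z^-1 - 1)/den
    odd = lfilter(k1 * np.array([1.0, 1.0]), den, x)    # K_1: kappa_1 (z^-1 + 1)/den
    phi = [even, odd]
    for _ in range(n_sections - 1):                # K_{2k} = K_{2k-2} A, Eq. (2-33)
        even = lfilter([r2, -2.0 * alpha, 1.0], den, even)  # all-pass A(z, zeta)
        odd = lfilter([r2, -2.0 * alpha, 1.0], den, odd)
        phi += [even, odd]
    return np.array(phi)                           # rows are basis outputs

# Output y(n) = phi(n)^T theta: linear in theta, nonlinear in the pole zeta.
x = np.random.default_rng(0).standard_normal(256)
phi = kautz_outputs(x, zeta=0.6 + 0.3j, n_sections=3)
theta = np.ones(phi.shape[0])
y = theta @ phi
```

Stability only requires |ζ| < 1, which is trivial to monitor, in contrast to the direct form realizations discussed above.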

CHAPTER 3
STOCHASTIC APPROXIMATION WITH CONVOLUTION SMOOTHING

3.1 Introduction

Adaptive filtering has become a major research area in digital signal processing,
communication, and control, with many applications such as adaptive noise cancellation,
echo cancellation, adaptive equalization, and system identification [1, 2]. For
simplicity, finite impulse response (FIR) structures are used for adaptive filtering and
have many mature practical implementations. However, infinite impulse response
(IIR) structures can achieve better performance with the same number of coefficients,
at the cost of the difficulties outlined in Chapter 1.
We now state the crucial condition on {θₜ, a < t < b} that makes the derivation of
the diffusion equation possible. Define, for a positive ε,

\[
M_k(\theta, t; \epsilon, \Delta) = \int_{|y-\theta|<\epsilon} (y-\theta)^k\, dP(y, t+\Delta \mid \theta, t), \quad k = 0, 1, 2 \tag{3-29}
\]

\[
\bar{M}_3(\theta, t; \epsilon, \Delta) = \int_{|y-\theta|<\epsilon} |y-\theta|^3\, dP(y, t+\Delta \mid \theta, t) \tag{3-30}
\]

We assume that the Markov process {θₜ, a < t < b} satisfies the following conditions
as Δ ↓ 0:

\[
\frac{1}{\Delta}\left[1 - M_0(\theta, t; \epsilon, \Delta)\right] \to 0 \tag{3-31}
\]

\[
\frac{1}{\Delta}\, M_1(\theta, t; \epsilon, \Delta) \to m(\theta, t) \tag{3-32}
\]

\[
\frac{1}{\Delta}\, M_2(\theta, t; \epsilon, \Delta) \to \sigma^2(\theta, t) \tag{3-33}
\]

\[
\frac{1}{\Delta}\, \bar{M}_3(\theta, t; \epsilon, \Delta) \to 0 \tag{3-34}
\]

It is clear that if (1/Δ)[1 − M₀(θ, t; ε, Δ)] → 0, then by dominated convergence,

\[
\frac{1}{\Delta}\, P\!\left(|\theta_{t+\Delta} - \theta_t| \ge \epsilon\right)
= \frac{1}{\Delta}\int_{-\infty}^{\infty} \left[1 - M_0(\theta, t; \epsilon, \Delta)\right] dP(\theta, t) \to 0 \tag{3-35}
\]

In addition, suppose that the transition function P(θ, t | θ₀, t₀) satisfies the following
condition:

Assumption. For each (θ, t), P(θ, t | θ₀, t₀) is once differentiable in t₀ and
three-times differentiable in θ₀, and the derivatives are continuous and bounded at
(θ₀, t₀).
Kolmogorov [79] derived the Fokker-Planck equation

\[
\frac{\partial}{\partial t}\, p(\theta, t \mid \theta_0, t_0)
= \frac{1}{2}\frac{\partial^2}{\partial \theta^2}\left[\sigma^2(\theta, t)\, p(\theta, t \mid \theta_0, t_0)\right]
- \frac{\partial}{\partial \theta}\left[m(\theta, t)\, p(\theta, t \mid \theta_0, t_0)\right], \quad b \ge t > t_0 \ge a \tag{3-36}
\]

The initial condition to be imposed is

\[
\lim_{t \downarrow t_0} \int_{-\infty}^{\infty} f(\theta)\, p(\theta, t \mid \theta_0, t_0)\, d\theta = f(\theta_0), \quad \forall f \in S \tag{3-37}
\]

that is, p(θ, t | θ₀, t₀) → δ(θ − θ₀). Substituting Equation (3-24) into the Fokker-Planck
equation, we get

\[
\frac{\partial}{\partial t}\, p(\theta, t)
= \frac{1}{2}\frac{\partial^2}{\partial \theta^2}\left[\mu(t)\,\sigma^2(\theta)\, p(\theta, t)\right]
+ \frac{\partial}{\partial \theta}\left[\mu(t)\, V'(\theta)\, p(\theta, t)\right] \tag{3-38}
\]

If p(θ, t) is a product p(θ, t) = g(t)W(θ)φ(θ), reflecting the independence among the
quantities, then we have

\[
W(\theta)\varphi(\theta)\, \frac{dg(t)}{dt}
= g(t)\mu(t)\left( \frac{1}{2}\frac{d}{d\theta}\left\{\frac{d}{d\theta}\left[\sigma^2(\theta)\, W(\theta)\, \varphi(\theta)\right]\right\}
+ \frac{d}{d\theta}\left[V'(\theta)\, W(\theta)\, \varphi(\theta)\right] \right) \tag{3-39}
\]

Let W(θ) be any positive solution of the equation

\[
\frac{1}{2}\frac{d}{d\theta}\left[\sigma^2(\theta)\, W(\theta)\right] = -V'(\theta)\, W(\theta) \tag{3-40}
\]

then

\[
W(\theta)\varphi(\theta)\, \frac{dg(t)}{dt}
= g(t)\mu(t)\, \frac{d}{d\theta}\left[\frac{1}{2}\sigma^2(\theta)\, W(\theta)\, \frac{d\varphi(\theta)}{d\theta}\right] \tag{3-41}
\]

Therefore

\[
\frac{1}{g(t)\mu(t)}\frac{dg(t)}{dt}
= \frac{1}{W(\theta)\varphi(\theta)}\, \frac{d}{d\theta}\left[\frac{1}{2}\sigma^2(\theta)\, W(\theta)\, \frac{d\varphi(\theta)}{d\theta}\right] \tag{3-42}
\]

The two sides, being functions of different variables, must be constant in order for the
equality to hold. Set this constant as −λ; then

\[
\frac{1}{g(t)\mu(t)}\frac{dg(t)}{dt} = -\lambda \tag{3-43}
\]

\[
g(t) = e^{-\lambda \int \mu(s)\, ds} \tag{3-44}
\]

\[
p(\theta, t) = e^{-\lambda \int \mu(s)\, ds}\; W(\theta)\, \varphi_\lambda(\theta) \tag{3-45}
\]

where φ_λ(θ) satisfies the Sturm-Liouville equation

\[
\frac{d}{d\theta}\left[\frac{1}{2}\sigma^2(\theta)\, W(\theta)\, \frac{d\varphi_\lambda(\theta)}{d\theta}\right]
+ \lambda\, W(\theta)\, \varphi_\lambda(\theta) = 0 \tag{3-46}
\]

Under rather general conditions, it can be shown that every solution p(θ, t) can be
represented as a linear combination of such products. Since p(θ, t | θ₀, t₀) is a function of
t, t₀, θ, θ₀, it must have the form

\[
p(\theta, t \mid \theta_0, t_0) = W(\theta) \int e^{-\lambda \int_{t_0}^{t} \mu(s)\, ds}\;
\varphi_\lambda(\theta)\, \bar{\varphi}_\lambda(\theta_0)\, d\lambda \tag{3-47}
\]

where φ̄_λ(θ₀) is the complex conjugate of φ_λ(θ₀). Here we want to know the transition
probability of the process escaping from a steady-state solution θ*, at which V'(θ*) =
0. From Equations (3-40) and (3-47), we get the transition probability of the process
escaping out of the valley as

\[
p(\theta, t \mid \theta^*, t_0)
= \frac{1}{\sqrt{2\pi\, E(\theta^*) \int_{t_0}^{t} \mu(s)\, ds}}
\exp\!\left(-\frac{(\theta - \theta^*)^2}{2\, E(\theta^*) \int_{t_0}^{t} \mu(s)\, ds}\right)
= G\!\left(\theta - \theta^*,\; E(\theta^*) \int_{t_0}^{t} \mu(s)\, ds\right) \tag{3-52}
\]

where G(θ, σ²) is a Gaussian function with zero mean and variance σ².
Summary. Equation (3-52) is the final transition probability of the process
escaping from the steady state θ*. The conditional p(θ, t | θ*, t₀) is determined by
θ − θ*, μ(n), and E(θ*). Because we use a monotonically decreasing μ(n), the algorithm
will decrease the probability of the process jumping out of the valley over the iterations.

From Equation (3-52), the transition probability of the process escaping from a
local minimum θ*_L is larger than that from the global minimum θ*_G because
|E(θ*_G)| < |E(θ*_L)|. Thus, the algorithm will spend most of its time near the global valley
and will eventually converge to the global minimum. Equation (3-52) also shows that
the larger θ − θ* is, i.e., the larger the valley around the steady state point θ*, the
less probable it is for the process to escape from this steady state point θ*.
Returning to the original case, in which θ is a vector, we must solve the following
Fokker-Planck equation:

\[
\frac{\partial}{\partial t}\, p(\theta, t)
= \frac{1}{2}\nabla_\theta^2\left[\mu(t)\,\sigma^2(\theta)\, p(\theta, t)\right]
+ \nabla_\theta \cdot \left[\mu(t)\, \nabla_\theta V(\theta, t)\, p(\theta, t)\right] \tag{3-53}
\]

Similarly, we want to know the transition probability of escaping from a steady-state
solution θ*, at which ∇V(θ*) = 0. Equation (3-53) then becomes

\[
\frac{\partial}{\partial t}\, p(\theta, t) = \frac{1}{2}\nabla_\theta^2\left[\mu(t)\, E(\theta^*)\, p(\theta, t)\right] \tag{3-54}
\]

Imposing the strict constraint that p(θ, t) is a product

\[
p(\theta, t) = g(t)\varphi(\theta) = g(t)\,\varphi_1(\theta_1)\,\varphi_2(\theta_2)\cdots\varphi_{N+M-1}(\theta_{N+M-1}) \tag{3-55}
\]

we have

\[
\frac{1}{g(t)\mu(t)}\frac{dg(t)}{dt} = \frac{E(\theta^*)}{2}\, \frac{\nabla_\theta^2 \varphi(\theta)}{\varphi(\theta)} \tag{3-56}
\]

The two sides, being functions of different variables, must be constant; set this constant
as −λ. Then
Substituting back the constraint of Equation (3-62) into Equation (3-67), we obtain

\[
\lambda = \frac{\left[d(n) - \theta^T(n)\,\nabla_\theta y(n)\right]^2}{2\,\|\nabla_\theta y(n)\|^2} \tag{3-68}
\]

Define the error E(n) = d(n) − θᵀ(n)∇_θ y(n). We further simplify λ as

\[
\lambda = \frac{E^2(n)}{2\,\|\nabla_\theta y(n)\|^2} \tag{3-69}
\]

Substituting the above equation into Equation (3-66), we obtain

\[
\delta\theta_k(n+1) = \frac{\mu}{\|\nabla_\theta y(n)\|^2}\, \nabla_{\theta_k} y(n)\, E(n), \quad k = 0, 1, \dots, N \tag{3-70}
\]

For adaptive IIR filtering, the above equation can be formulated as

\[
\delta\theta(n+1) = \frac{\mu}{\|\nabla_\theta y(n)\|^2}\, \nabla_\theta y(n)\, E(n) \tag{3-71}
\]

or, equivalently,

\[
\theta(n+1) = \theta(n) + \frac{\mu}{\|\nabla_\theta y(n)\|^2}\, \nabla_\theta y(n)\, E(n) \tag{3-72}
\]

This is the so-called NLMS algorithm, summarized in Table 3-1, where the initial
conditions are randomly chosen.
Computational complexity. The computational complexity of the NLMS
algorithm is (N + M)(M + 3). Compared to the computational complexity of the
original LMS algorithm, which is (M + N)(N + 2), the NLMS algorithm is almost as
simple; it requires only a little extra computation.
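For concreteness, the following is a minimal sketch (not from the original text) of the NLMS update of Equation (3-72) in the same output-error IIR setting as Chapter 2, again with the small-step regressor standing in for ∇_θ y(n); the eps regularizer and parameter values are illustrative assumptions.

```python
import numpy as np

# Minimal sketch of NLMS for output-error IIR adaptation: the gradient
# estimate is normalized by its squared norm, so the effective perturbation
# near a stationary point no longer vanishes with the gradient.
def nlms_iir(x, d, na, nb, mu=0.05, eps=1e-8):
    a, b = np.zeros(na), np.zeros(nb)
    y = np.zeros(len(x))
    for n in range(max(na, nb), len(x)):
        y_past = y[n - na:n][::-1]        # y(n-1), ..., y(n-na)
        x_past = x[n - nb:n][::-1]        # x(n-1), ..., x(n-nb)
        y[n] = a @ y_past + b @ x_past
        err = d[n] - y[n]                 # output error E(n)
        phi2 = y_past @ y_past + x_past @ x_past + eps  # ||grad y||^2 (eps guards /0)
        a += (mu / phi2) * err * y_past   # Eq. (3-72), feedback part
        b += (mu / phi2) * err * x_past   # Eq. (3-72), feedforward part
    return a, b
```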

3.7 Relationship between LMS-SAS and NLMS Algorithms
In this section, we show that the behavior of the NLMS algorithm is similar to that
of the LMS-SAS algorithm from a global optimization perspective. Here we follow the
lines of Widrow et al. [1] and assume that the algorithm will converge to the vicinity of
a steady-state point.
From Equation (3-18), we know that the estimated gradient vector is

\[
\hat{\nabla}(\theta(n)) = -E(n)\,\nabla_\theta y(n) \tag{3-73}
\]

Define N(n) as the vector of gradient estimation noise at the nth iteration and
∇(θ(n)) as the true gradient vector. Thus

\[
\hat{\nabla}(\theta(n)) = \nabla(\theta(n)) + N(n), \qquad
N(n) = \hat{\nabla}(\theta(n)) - \nabla(\theta(n)) \tag{3-74}
\]

If we assume that the NLMS algorithm has converged to the vicinity of a local
steady-state point θ*, then ∇(θ(n)) will be close to zero. Therefore the gradient
estimation noise will be

\[
N(n) \approx \hat{\nabla}(\theta(n)) = -E(n)\,\nabla_\theta y(n) \tag{3-75}
\]

The covariance of the noise is given by

\[
\mathrm{cov}[N(n)] = E[N(n)N^T(n)] = E\!\left[E^2(n)\,\nabla_\theta y(n)\,\nabla_\theta y^T(n)\right] \tag{3-76}
\]

We assume that E²(n) is approximately uncorrelated with ∇_θ y(n) (the same assumption
as [1]); thus near the local minimum

\[
\mathrm{cov}[N(n)] \approx E[E^2(n)]\, A
\]

where A is a unit norm matrix. Thus, near any local or global minimum, the NLMS
algorithm has the variance of the perturbing random noise determined solely by both μ(n)

and E(n). This behavior is very different from the conventional LMS algorithm with
monotonically decreasing step size, where the perturbation noise is determined by μ(n), E(n),
and ∇_θ y(n). Therefore, in the LMS algorithm the variance near a steady state point
is small because ∇_θ y(n) ≈ 0. Hence the LMS algorithm has a small probability of
escaping from any local minimum because of the small variance of the noisy gradient.

On the other hand, notice that the variance of the perturbing random noise
in the LMS-SAS algorithm is μ(n)E(n)β, which is also independent of the gradient
and controlled by both μ(n) and E(n). Therefore, we can anticipate that the global

optimization behavior of the NLMS algorithm near local minima is similar to that of the

LMS-SAS algorithm. Far away from local minima, the behavior of LMS-SAS and NLMS
is expected to be rather different from each other.
3.8 Simulation Results

In this section, we compare the performances of the LMS, LMS-SAS, and NLMS
algorithms in terms of their capability to seek the global optimum of IIR filters in

a system identification framework. According to the properties of adaptive algorithms
discussed in Chapter 2, we set up a system identification example whose MSE
performance surface has one local and one global minimum. In this example, the
main goal is to determine the values of the coefficients {a, b} of the above equation

such that the MSE is minimized to the global minimum. The excitation signal is

chosen to be random Gaussian noise with zero mean and unit variance. There exist

two minima of the MSE criterion performance surface with the local minimum at

{a, b} = {-0.519,0.114} and the global minimum at {a, b} = {0.906, -0.311}. Here we
use three types of annealing schedule for the step size (see Figure 3-2; one is linear,
one is sublinear, and the other is supralinear):

\[
\begin{aligned}
\mu_1(n) &= 0.1\cos(n\pi / 2n_{\max}) \\
\mu_2(n) &= 0.1 - 0.1\, n / n_{\max}, \quad n \le n_{\max} = 20000 \\
\mu_3(n) &= 2\mu_2(n) - \mu_1(n)
\end{aligned} \tag{3-84}
\]

The cooling schedule parameter for the GLMS algorithm is a linearly decreasing
function β(n) = 100/n.
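The three schedules of Equation (3-84) and the GLMS cooling schedule can be generated directly; a small sketch (plotting omitted), using the n_max value from the text:

```python
import numpy as np

# Step-size annealing schedules of Equation (3-84).
n_max = 20000
n = np.arange(1, n_max + 1)
mu1 = 0.1 * np.cos(n * np.pi / (2 * n_max))   # stays above the linear schedule
mu2 = 0.1 - 0.1 * n / n_max                   # linear decay from 0.1 to 0
mu3 = 2 * mu2 - mu1                           # mirror of mu1, below the linear one
beta = 100.0 / n                              # GLMS cooling schedule beta(n)
```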

Table 3-2 shows the comparison of the number of global and local minimum hits by

various algorithms. The results are given by 100 Monte Carlo simulations with random


Figure 3-2: Step size μ(n) for the SAS algorithm.

initial conditions of θ at each run. The convergence of θ toward the global minimum
for the GLMS, LMS-SAS, and NLMS algorithms is shown in Figures 3-3, 3-4, and 3-5,
respectively. The adaptation process with θ approaching the local minimum for the
LMS, GLMS, and LMS-SAS algorithms is also depicted in Figures 3-6, 3-7, and 3-8,
respectively, where θ is initialized to a point near the local minimum. Based on the
simulation results, we can summarize performance as follows:

* Figure 3-6 and rows 1, 2 in Table 3-2 show that the LMS algorithm is likely to
converge to the local minimum.

* Figures 3-3, 3-7 and row 3 in Table 3-2 show that the GLMS algorithm might jump to
the global minimum valley and converge to the global minimum, but it can also jump
back to the local minimum valley and then converge to the local minimum. Srinivasan
[56] claims that the GLMS algorithm can converge to the global minimum w.p.1
by carefully choosing the cooling schedule β(n). The cooling schedule is a crucial
parameter, but it is difficult to determine it such that global optimization is
guaranteed.

* Figures 3-4, 3-8 and rows 4, 5 in Table 3-2 show that the LMS-SAS algorithm is likely
to converge to the global minimum with a proper step size. Even though the LMS-SAS
algorithm spends most of its time near the global minimum, it still has some probability
of converging to the local minimum.

* Figure 3-5 and rows 6, 7, 8 in Table 3-2 show that the NLMS algorithm with a proper
step size, similarly to the LMS-SAS algorithm, can converge to the global minimum.
Figures 3-4 and 3-5 also show that the NLMS algorithm remains much longer in the
global minimum valley than the other algorithms. These figures also show that the
step size of the NLMS algorithm does not play as crucial a role as the cooling schedule
of the GLMS algorithm.

3.9 Comparison of LMS-SAS and NLMS Algorithm

Recall that the LMS-SAS algorithm is described as

\[
\theta(n+1) = \theta(n) - \mu(n) E(n)\, \nabla_\theta E(n, \theta) + \mu(n) E(n)\, \beta\, \eta(n) \tag{3-85}
\]

where η(n) is the random smoothing perturbation.

Figure 3-5: Global convergence of θ in the NLMS algorithm. A) Weight θ; B) Contour of θ.

Figure 3-6: Local convergence of θ in the LMS algorithm. A) Weight θ; B) Contour of θ.

H(Y₁, Y₂), which is not easily estimated from samples. An alternative estimate of the
divergence between two probability density functions (pdfs) f(x) and g(x) is the
Kullback-Leibler (KL) divergence [86], defined as

\[
K(f, g) = \int f(x)\, \log\frac{f(x)}{g(x)}\, dx \tag{4-13}
\]

Similarly, Renyi's divergence measure with order α for two pdfs f(x) and g(x) is

\[
H_{R_\alpha}(f, g) = \frac{1}{\alpha - 1}\, \log \int f(x) \left(\frac{f(x)}{g(x)}\right)^{\alpha-1} dx \tag{4-14}
\]

The relation between the KL divergence and Renyi's divergence measure is

\[
\lim_{\alpha \to 1} H_{R_\alpha}(f, g) = K(f, g) \tag{4-15}
\]

The KL divergence between two random variables Y₁ and Y₂ essentially
estimates the divergence between the joint pdf and the marginal pdfs. That is,

\[
I_S(Y_1, Y_2) = KL\!\left(f_{Y_1 Y_2}(z_1, z_2)\,\|\, f_{Y_1}(z_1) f_{Y_2}(z_2)\right)
= \iint f_{Y_1 Y_2}(z_1, z_2)\, \log \frac{f_{Y_1 Y_2}(z_1, z_2)}{f_{Y_1}(z_1)\, f_{Y_2}(z_2)}\, dz_1\, dz_2 \tag{4-16}
\]

where f_{Y₁Y₂}(z₁, z₂) is the joint pdf, and f_{Y₁}(z₁) and f_{Y₂}(z₂) are the marginal
pdfs. Because the divergence measures mentioned above are non-quadratic in the pdfs, they
cannot easily be estimated with the information potential. The following distance measures
between two pdfs, which contain only quadratic terms of the pdfs, are more practical.

* The distance measure based on the Euclidean difference of vectors inequality is

\[
\|x\|^2 + \|y\|^2 - 2x^T y \ge 0 \tag{4-17}
\]

* The distance measure based on the Cauchy-Schwartz inequality is

\[
\log \frac{\|x\|^2\, \|y\|^2}{(x^T y)^2} \ge 0 \tag{4-18}
\]

Using the Cauchy-Schwartz inequality, the distance measure between two pdfs f(x) and
g(x) is

\[
I_{CS}(f, g) = \log \frac{\int f^2(x)\, dx \int g^2(x)\, dx}{\left(\int f(x)\, g(x)\, dx\right)^2} \tag{4-19}
\]

It is obvious that I_CS(f, g) ≥ 0, with equality if and only if f(x) = g(x).
Similarly, using the Euclidean distance, the distance measure between two pdfs f(x)
and g(x) is

\[
I_{ED}(f, g) = \int \left(f(x) - g(x)\right)^2 dx
= \int f^2(x)\, dx + \int g^2(x)\, dx - 2\int f(x)\, g(x)\, dx \tag{4-20}
\]

It is also obvious that I_ED(f, g) ≥ 0, with equality if and only if f(x) = g(x).
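Because I_ED contains only quadratic terms in the pdfs, it can be estimated directly from samples with Parzen windows. The following is a minimal sketch (not from the original text), assuming Gaussian kernels of size sigma; it uses the standard fact that the convolution of two Gaussian kernels of size sigma is a Gaussian kernel of size sqrt(2)*sigma, so each integral reduces to a double sum over sample pairs (the information potential). Function names and parameters are illustrative.

```python
import numpy as np

def gauss(u, s):
    # Gaussian kernel of size s evaluated at u.
    return np.exp(-u**2 / (2 * s**2)) / (np.sqrt(2 * np.pi) * s)

def cross_potential(a, b, sigma):
    # Estimates integral of f_hat * g_hat over x as the mean of
    # G_{sqrt(2) sigma}(a_i - b_j) over all sample pairs.
    diff = a[:, None] - b[None, :]
    return gauss(diff, np.sqrt(2) * sigma).mean()

def i_ed(a, b, sigma):
    # I_ED(f, g) = int f^2 + int g^2 - 2 int f g, all from samples (Eq. 4-20).
    return (cross_potential(a, a, sigma) + cross_potential(b, b, sigma)
            - 2 * cross_potential(a, b, sigma))

rng = np.random.default_rng(0)
print(i_ed(rng.standard_normal(200), rng.standard_normal(200) + 1.0, sigma=0.5))
```

Note that the kernel size sigma plays the role of the annealed parameter discussed above: a large sigma smooths the estimated criterion, and shrinking it recovers the underlying distance.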
4.3 Adaptive IIR Filter with Euclidean Distance Criterion

The system identification scheme of the adaptive IIR filter is as in Figure 2-1. The output

\[
\sigma^2(\theta, t) = \mu(t)\, N(0)\left(f_e(E) - \delta(E)\right)^2 \tag{4-36}
\]

With a derivation similar to that of Equation (3-52) for the LMS-SAS algorithm, we obtain
the transition probability of the ITL algorithm escaping from a local minimum, for the
scalar θ case, as

\[
p(\theta, t \mid \theta^*, t_0) = G\!\left(\theta - \theta^*,\; N(0)\left(f_e(E) - \delta(E)\right)^2 \int_{t_0}^{t} \mu(s)\, ds\right) \tag{4-37}
\]
Remark. Equation (4-37) is the transition probability of the ITL algorithm
escaping from the steady state θ*. The conditional p(θ, t | θ*, t₀) is determined by (θ − θ*), μ(n),
N(0), and (f_e(E) − δ(E))². Because we use an annealed kernel size, i.e., an annealed
variance of N(0), the ITL algorithm will decrease the probability of jumping out of the
valley over the iterations. The transition probability of the process escaping from the
global minimum is smaller than that from a local minimum because (f_e(E) − δ(E))² is
smallest at the global minimum. Thus, the algorithm will spend most of its
time near the global valley and will eventually converge to the global minimum.

4.6 Contour of Euclidean Distance Criterion

The Euclidean distance criterion for adaptive IIR filters is defined as

\[
I_{ED}(f_e) = \int_{-\infty}^{\infty} \left(f_e(E) - \delta(E)\right)^2 dE
= \int_{-\infty}^{\infty} f_e^2(E)\, dE - 2 f_e(0) + c \tag{4-38}
\]
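Since all signals in the example below are Gaussian, the terms of Equation (4-38) can be checked in closed form. As a worked aside (standard Gaussian integrals, with μ_e and σ_e² denoting the mean and variance of the error pdf f_e):

\[
\int_{-\infty}^{\infty} f_e^2(E)\, dE = \frac{1}{2\sigma_e\sqrt{\pi}}, \qquad
f_e(0) = \frac{1}{\sqrt{2\pi}\,\sigma_e}\, e^{-\mu_e^2 / 2\sigma_e^2}
\]

so that

\[
I_{ED}(f_e) = \frac{1}{2\sigma_e\sqrt{\pi}} - \frac{2}{\sqrt{2\pi}\,\sigma_e}\, e^{-\mu_e^2 / 2\sigma_e^2} + c
\]

which, unlike entropy alone, explicitly penalizes a nonzero error mean through the exponential term.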

If the input signal has a Gaussian distribution, N(μ_x, σ_x²), then the desired
signal will also be Gaussian, N(μ_d, σ_d²), and the output signal of the adaptive filter
will be Gaussian as well, N(μ_y, σ_y²). Here we want to calculate the analytical expression
of the Euclidean distance in the simulation example of the system identification framework,
for the unknown system

\[
H(z) = \frac{b_1 + b_2 z^{-1}}{1 - a_1 z^{-1} - a_2 z^{-2}} \tag{4-39}
\]

identified by the reduced order adaptive filter

\[
\hat{H}(z) = \frac{b}{1 - a z^{-1}} \tag{4-40}
\]

Here the desired output signal is realized as

\[
d(i) = b_1 x(i) + b_2 x(i-1) + a_1 d(i-1) + a_2 d(i-2) \tag{4-41}
\]

Then

\[
\mu_d = \frac{b_1 + b_2}{1 - a_1 - a_2}\, \mu_x \tag{4-42}
\]

Taking the variance of both sides of Equation (4-41), we obtain

\[
R_d(0) = (b_1^2 + b_2^2 + 2 b_1 b_2 a_1)\, R_x(0) + (a_1^2 + a_2^2)\, R_d(0) + 2 a_1 a_2\, R_d(1) \tag{4-43}
\]

where R_d(·) and R_x(·) are the autocorrelation functions of the desired output signal
and the input signal, respectively. Shifting Equation (4-41) right by one unit, we obtain

criterion and kernel annealing approach allowed stable adaptation of the poles to their
globally optimal values. We have also investigated the performance of the proposed
criterion and the associated steepest descent algorithm in IIR filter adaptation. We
have designed an information theoretic learning algorithm which is shown to
converge to the global minimum of the performance surface. The proposed algorithm
successfully adapted the filter poles, avoiding local minima 100% of the time and
without causing instability.

The performance of this ITL algorithm was compared with the more traditional
LMS variants, which were shown in the previous chapter to exhibit an improved
probability of avoiding local minima. Nevertheless, none of them were as successful as
ITL in achieving the global solution. An interesting observation was that the ITL
criterion yields a

Figure 5-9: Average BER for a nonlinear equalizer. A) over the whole 100 Monte Carlo
runs; B) over the 10 best solutions of MSE; C) over the 10 medial solutions of MSE; D)
over the 10 worst solutions of MSE.

smaller error norm between the impulse responses of the adaptive and the reference