Application of Information Theory to Blind Source Separation

BSS

Let
$$\mathbf{s}(t) = [s_1(t), s_2(t), \ldots, s_m(t)]^T$$
be a set of statistically independent signals. We will later examine some other assumptions, but for now assume simply that they are independent. The signals are processed according to
$$\mathbf{x} = A\mathbf{s}.$$
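To make the model concrete, here is a minimal numerical sketch. The Laplacian sources and the particular $A$ are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
m, T = 2, 2000   # number of sources, number of samples

# Independent source signals; the Laplacian is a convenient super-Gaussian choice.
s = rng.laplace(size=(m, T))

# An example mixing matrix A (unknown in practice).
A = np.array([[1.0, 0.5],
              [0.5, 1.0]])

# Observed mixtures: each row of x is a weighted combination of the sources.
x = A @ s
```

The rows of `s` are (empirically) nearly uncorrelated, while the rows of `x` are strongly correlated because of the mixing.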
Now, not knowing either $\mathbf{s}$ or $A$, we desire to determine a matrix $W$ so that
$$\mathbf{u} = W\mathbf{x}$$
recovers $\mathbf{s}$ as fully as possible. Each output is then passed through a monotonically increasing nonlinearity, $y_i = g(u_i)$. Let us take as a criterion the mutual information at the output: $I(y_1, y_2, \ldots, y_m)$. (Q: how did they know to try this? A: It seemed plausible, they tried it, and it worked! Moral: think about the implications of ideas, then see if it works.) Then, as shown in the exercises,
$$H(\mathbf{y}) = \sum_i H(y_i) - I(y_1, y_2, \ldots, y_m).$$
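The decomposition $H(\mathbf{y}) = \sum_i H(y_i) - I(y_1, \ldots, y_m)$ is easy to check numerically. The following sketch verifies it for a small made-up discrete joint distribution (illustrative only, not from the text):

```python
import numpy as np

# Toy joint distribution p(y1, y2) over two binary variables.
p = np.array([[0.3, 0.2],
              [0.1, 0.4]])

p1 = p.sum(axis=1)   # marginal of y1
p2 = p.sum(axis=0)   # marginal of y2

H_joint = -np.sum(p * np.log(p))
H1 = -np.sum(p1 * np.log(p1))
H2 = -np.sum(p2 * np.log(p2))

# Mutual information from its definition as a KL divergence
# between the joint and the product of the marginals.
I = np.sum(p * np.log(p / np.outer(p1, p2)))

print(H_joint, H1 + H2 - I)  # these agree
```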
If we maximize $H(\mathbf{y})$, we should (1) maximize each $H(y_i)$ and (2) minimize $I(y_1, \ldots, y_m)$. As mentioned before, the $H(y_i)$ are maximized when (and if) the outputs are uniformly distributed. The mutual information is minimized when they are all independent! Achieving both of these exactly requires that $g$ have the form of the CDF of $s_i$. So we might contemplate modifying $W$, and also modifying $g$. Or we might (as Bell and Sejnowski do) fix $g$ and not worry about this. This corresponds to the assumption that $p(s_i)$ is super-Gaussian (heavier tails than a Gaussian has). We can write
$$H(y_i) = -E\left[\ln p_{y_i}(y_i)\right],$$
where we have
$$p_{y_i}(y_i) = \frac{p_{u_i}(u_i)}{g_i'(u_i)}$$
(since $y_i = g_i(u_i)$ with $g_i$ monotonically increasing), so that
$$H(y_i) = -E\left[\ln \frac{p_{u_i}(u_i)}{g_i'(u_i)}\right] = -\int p_{u_i}(u) \ln \frac{p_{u_i}(u)}{g_i'(u)}\, du.$$
Thus
$$H(y_i) = -D(p_{u_i} \,\|\, g_i'),$$
the negative of the Kullback–Leibler divergence between $p_{u_i}$ and $g_i'$ (treating $g_i'$ as a density). Then
$$H(\mathbf{y}) = -I(y_1, \ldots, y_m) - \sum_i D(p_{u_i} \,\|\, g_i').$$
In the case that $g_i' = p_{u_i}$ for each $i$, the last stuff goes away. In other words, we ideally want $g_i$ to be the CDF of the $u_i$. When this is not exactly the case (there is a mismatch), then the last term exists and may interfere with the minimization of $I(y_1, \ldots, y_m)$. We call the term $\sum_i D(p_{u_i} \,\|\, g_i')$ an "error term." Now we note that
$$H(\mathbf{y}) = H(\mathbf{x}) + \ln|\det W| + E\left[\sum_i \ln g'(u_i)\right],$$
which follows from $p_{\mathbf{y}}(\mathbf{y}) = p_{\mathbf{x}}(\mathbf{x}) \big/ \left(|\det W| \prod_i g'(u_i)\right)$. The term $H(\mathbf{x})$ does not depend upon $W$, so we obtain the equivalent objective
$$\max_W J(W), \qquad J(W) = \ln|\det W| + E\left[\sum_i \ln g'(u_i)\right].$$
Now we come to an important concept: we would like to compute the derivative, but we can't compute the expectation. We make the stochastic gradient approximation:
$$\hat J(W) = \ln|\det W| + \sum_k \ln g'(u_k).$$
We just throw the expectation away! Does it work? On average! Now it becomes a matter of grinding through the calculus to take the appropriate partial derivative. Since
$$u_k = \sum_l w_{kl}\, x_l,$$
we will consider the elements:
$$\frac{\partial u_k}{\partial w_{ij}} = \begin{cases} x_j, & k = i, \\ 0, & k \neq i, \end{cases}$$
since $w_{ij}$ appears only in the equation for $u_i$, and $y_i = g(u_i)$. Because of this connection, the partial $\partial y_k / \partial w_{ij}$ is nonzero only when $k = i$. Combining these facts (and using $\frac{d}{du}\ln g'(u) = g''(u)/g'(u)$), we find
$$\frac{\partial \hat J}{\partial w_{ij}} = \frac{\partial}{\partial w_{ij}} \ln|\det W| + \frac{g''(u_i)}{g'(u_i)}\, x_j.$$
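The element-wise partial can be sanity-checked against a finite difference. Here $g$ is taken to be the logistic sigmoid (an assumption for illustration; for it, $g''(u)/g'(u) = 1 - 2g(u)$):

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def second_term(W, x):
    # sum_k ln g'(u_k) for the logistic sigmoid, where ln g'(u) = ln y(1-y).
    y = sigmoid(W @ x)
    return np.sum(np.log(y * (1.0 - y)))

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))
x = rng.normal(size=3)

y = sigmoid(W @ x)
i, j = 1, 2
analytic = (1.0 - 2.0 * y[i]) * x[j]   # (g''(u_i)/g'(u_i)) x_j for the logistic

# Central finite difference in the single entry w_ij.
eps = 1e-6
Wp = W.copy(); Wp[i, j] += eps
Wm = W.copy(); Wm[i, j] -= eps
numeric = (second_term(Wp, x) - second_term(Wm, x)) / (2 * eps)
print(abs(analytic - numeric))  # small
```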
Thus, for the first term,
$$\frac{\partial}{\partial W}\ln|\det W| = (W^{-1})^T = W^{-T}.$$
(See appdx E of Moon and Stirling.) Looking at the second term, the $(i,j)$ element is
$$\frac{\partial}{\partial w_{ij}} \sum_k \ln g'(u_k) = \frac{g''(u_i)}{g'(u_i)}\, x_j,$$
since $\partial u_i/\partial w_{ij} = x_j$. Let us write
$$\tilde p(u) = g'(u).$$
This looks like a density, and ideally would be so, as discussed above. But we can think of this as simply a function. We thus find, stacking all the results,
$$\frac{\partial \hat J}{\partial W} = W^{-T} + \mathbf{z}\mathbf{x}^T, \qquad z_i = \frac{g''(u_i)}{g'(u_i)} = \frac{\tilde p'(u_i)}{\tilde p(u_i)}.$$
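The stacked matrix result can likewise be checked numerically, again assuming the logistic sigmoid for $g$:

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def J_hat(W, x):
    # Single-sample objective: ln|det W| + sum_k ln g'(u_k), logistic g assumed.
    y = sigmoid(W @ x)
    return np.log(abs(np.linalg.det(W))) + np.sum(np.log(y * (1.0 - y)))

rng = np.random.default_rng(1)
W = rng.normal(size=(3, 3))
x = rng.normal(size=3)

y = sigmoid(W @ x)
z = 1.0 - 2.0 * y                          # z_i = g''(u_i)/g'(u_i) for the logistic
analytic = np.linalg.inv(W).T + np.outer(z, x)

# Entry-by-entry central finite differences.
eps = 1e-6
numeric = np.zeros_like(W)
for i in range(3):
    for j in range(3):
        Wp = W.copy(); Wp[i, j] += eps
        Wm = W.copy(); Wm[i, j] -= eps
        numeric[i, j] = (J_hat(Wp, x) - J_hat(Wm, x)) / (2 * eps)

print(np.max(np.abs(analytic - numeric)))  # small
```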
This gives us the learning rule:
$$W^{[k+1]} = W^{[k]} + \mu\left(W^{-T} + \mathbf{z}\mathbf{x}^T\right).$$
We will let
$$\varphi(u) = -\frac{\tilde p'(u)}{\tilde p(u)} = -\frac{g''(u)}{g'(u)}$$
be the learning nonlinearity, also called in the literature the score function. Then
$$W^{[k+1]} = W^{[k]} + \mu\left(W^{-T} - \varphi(\mathbf{u})\mathbf{x}^T\right),$$
where $\varphi$ is applied elementwise to $\mathbf{u}$. This approach can only separate super-Gaussian distributions (heavy tails).
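As a concrete end-to-end illustration (a sketch, not Bell and Sejnowski's original code): take $g$ to be the logistic sigmoid, for which $g''(u)/g'(u) = 1 - 2y$, generate Laplacian (super-Gaussian) sources, and ascend the batch average of the stochastic gradient:

```python
import numpy as np

rng = np.random.default_rng(0)
m, T = 2, 2000

s = rng.laplace(size=(m, T))            # independent super-Gaussian sources
A = np.array([[1.0, 0.6],
              [0.3, 1.0]])              # example mixing matrix (unknown in practice)
x = A @ s

W = np.eye(m)
mu = 0.05
for _ in range(8000):
    u = W @ x
    y = 1.0 / (1.0 + np.exp(-u))        # logistic g
    z = 1.0 - 2.0 * y                   # z_i = g''(u_i)/g'(u_i) for the logistic
    # Batch average of the stochastic gradient W^{-T} + z x^T:
    grad = np.linalg.inv(W).T + (z @ x.T) / T
    W += mu * grad

# If separation worked, W A is approximately a scaled permutation matrix:
# each recovered signal matches one source up to sign and amplitude.
P = W @ A
print(np.round(P, 2))
```

With a fixed $g$ the sources are recovered only up to permutation and scale, which is why we inspect $WA$ rather than compare $W$ to $A^{-1}$ directly.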