Autoencoders and Sparsity

So far, we have described the application of neural networks to supervised learning, in which we have labeled
training examples. Now suppose we have only a set of unlabeled training examples $\{x^{(1)}, x^{(2)}, x^{(3)}, \ldots\}$,
where $x^{(i)} \in \Re^{n}$. An
autoencoder neural network is an unsupervised learning algorithm that applies backpropagation,
setting the target values to be equal to the inputs. I.e., it uses $y^{(i)} = x^{(i)}$.

Here is an autoencoder:

[Figure: autoencoder network diagram]

The autoencoder tries to learn a function $h_{W,b}(x) \approx x$. In other
words, it is trying to learn an approximation to the identity function, so as
to output $\hat{x}$ that is similar to $x$. The identity function seems a
particularly trivial function to be trying to learn; but by placing constraints
on the network, such as by limiting the number of hidden units, we can discover
interesting structure about the data. As a concrete example, suppose the
inputs $x$ are the pixel intensity values from a $10 \times 10$ image (100
pixels) so $n = 100$, and there are $s_2 = 50$ hidden units in layer $L_2$. Note that
we also have $y \in \Re^{100}$. Since there are only 50 hidden units, the
network is forced to learn a compressed representation of the input.
I.e., given only the vector of hidden unit activations $a^{(2)} \in \Re^{50}$,
it must try to reconstruct the 100-pixel input $x$. If the input were completely
random (say, each $x_i$ is drawn from an IID Gaussian, independent of the other
features), then this compression task would be very difficult. But if there is
structure in the data, for example, if some of the input features are correlated,
then this algorithm will be able to discover some of those correlations. In fact,
this simple autoencoder often ends up learning a low-dimensional representation very similar
to the one found by PCA.
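The 100-50-100 bottleneck described above can be sketched as a single forward pass. This is a minimal illustration, not the tutorial's own code: it assumes NumPy, sigmoid activations in both layers, and small random initial weights, and the names `W1`, `b1`, `W2`, `b2`, and `forward` are chosen here for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Forward pass of a one-hidden-layer autoencoder (assumed sigmoid units)."""
    a2 = sigmoid(W1 @ x + b1)      # compressed code, shape (s2,)
    x_hat = sigmoid(W2 @ a2 + b2)  # reconstruction of the input, shape (n,)
    return a2, x_hat

rng = np.random.default_rng(0)
n, s2 = 100, 50                    # 10x10 image, 50 hidden units
W1 = rng.normal(scale=0.01, size=(s2, n))
b1 = np.zeros(s2)
W2 = rng.normal(scale=0.01, size=(n, s2))
b2 = np.zeros(n)

x = rng.random(n)                  # stand-in for pixel intensities in [0, 1)
a2, x_hat = forward(x, W1, b1, W2, b2)

# The target is the input itself, so the per-example error compares x_hat to x.
reconstruction_error = 0.5 * np.sum((x_hat - x) ** 2)
```

Training would adjust the weights to drive `reconstruction_error` down across the training set; the network can only succeed by packing the information in `x` into the 50 numbers in `a2`.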

Our argument above relied on the number of hidden units being small. But
even when the number of hidden units is large (perhaps even greater than the
number of input pixels), we can still discover interesting structure in the
data by imposing other constraints on the network. In particular, we can
impose a sparsity constraint on the hidden units.

Informally, we will think of a neuron as being "active" (or as "firing") if
its output value is close to 1, or as being "inactive" if its output value is
close to 0. We would like to constrain the neurons to be inactive most of the
time. This discussion assumes a sigmoid activation function. If you are
using a tanh activation function, then we think of a neuron as being inactive
when it outputs values close to -1.

Recall that $a^{(2)}_j$ denotes the activation of hidden unit $j$ in the
autoencoder. However, this notation doesn't make explicit which input $x$
led to that activation. Thus, we will write $a^{(2)}_j(x)$ to denote the activation
of this hidden unit when the network is given a specific input $x$. Further, let

$$\hat\rho_j = \frac{1}{m} \sum_{i=1}^{m} \left[ a^{(2)}_j(x^{(i)}) \right]$$

be the average activation of hidden unit $j$ (averaged over the $m$ examples in the training set).
We would like to (approximately) enforce the constraint

$$\hat\rho_j = \rho,$$

where $\rho$ is a sparsity parameter, typically a small value close to zero
(say $\rho = 0.05$). In other words, we would like the average activation
of each hidden neuron $j$ to be close to 0.05 (say). To satisfy this
constraint, the hidden unit's activations must mostly be near 0.
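As a rough sketch of the average activation defined above, the following computes one value per hidden unit by averaging sigmoid activations over all training examples. It assumes NumPy, one example per row of `X`, and a sigmoid hidden layer; the function name `average_activation` and the toy weights are illustrative only.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def average_activation(X, W1, b1):
    """Mean activation of each hidden unit over the m rows of X.

    X: (m, n) inputs, W1: (s2, n) weights, b1: (s2,) biases.
    Returns a vector of shape (s2,), one average per hidden unit.
    """
    A2 = sigmoid(X @ W1.T + b1)  # (m, s2): activation of unit j on example i
    return A2.mean(axis=0)

# Toy check with zero weights, so every activation equals sigmoid(bias).
X = np.ones((5, 4))
W1 = np.zeros((3, 4))
rho_hat_half = average_activation(X, W1, np.zeros(3))        # sigmoid(0) for every unit
rho_hat_small = average_activation(X, W1, np.full(3, -3.0))  # sigmoid(-3) for every unit
```

With a bias of -3 the units sit near 0.047 on every example, which is roughly the regime that a target of 0.05 asks for.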

To achieve this, we will add an extra penalty term to our optimization objective that
penalizes $\hat\rho_j$ deviating significantly from $\rho$. Many choices of the penalty
term will give reasonable results. We will choose the following:

$$\sum_{j=1}^{s_2} \left[ \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j} \right].$$

Here, $s_2$ is the number of neurons in the hidden layer, and the index $j$ runs
over the hidden units in our network. If you are
familiar with the concept of KL divergence, this penalty term is based on
it, and can also be written

$$\sum_{j=1}^{s_2} {\rm KL}(\rho \,\|\, \hat\rho_j),$$

where
${\rm KL}(\rho \,\|\, \hat\rho_j) = \rho \log \frac{\rho}{\hat\rho_j} + (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j}$
is the Kullback-Leibler (KL) divergence between
a Bernoulli random variable with mean $\rho$ and a Bernoulli random variable with mean $\hat\rho_j$.
KL-divergence is a standard function for measuring how different two
distributions are. (If you've not seen KL-divergence before, don't worry about
it; everything you need to know about it is contained in these notes.)
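The penalty above is short to write down in code. The sketch below assumes NumPy, natural logarithms, and average activations strictly between 0 and 1; `kl_bernoulli` and `sparsity_penalty` are illustrative names, not functions from any library.

```python
import numpy as np

def kl_bernoulli(rho, rho_hat):
    """KL divergence between Bernoulli(rho) and Bernoulli(rho_hat), elementwise."""
    rho_hat = np.asarray(rho_hat, dtype=float)
    return (rho * np.log(rho / rho_hat)
            + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat)))

def sparsity_penalty(rho, rho_hat):
    """Sum of the per-unit KL terms over all hidden units."""
    return float(np.sum(kl_bernoulli(rho, rho_hat)))

rho = 0.05
on_target = sparsity_penalty(rho, [0.05, 0.05, 0.05])  # every unit meets the target
one_dense = sparsity_penalty(rho, [0.05, 0.05, 0.5])   # one unit is active half the time
```

Only the third unit contributes to `one_dense`, since units whose average activation equals the target add nothing to the sum.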

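This penalty function has the property that ${\rm KL}(\rho \,\|\, \hat\rho_j) = 0$ if $\hat\rho_j = \rho$,
and otherwise it increases monotonically as $\hat\rho_j$ diverges from $\rho$. For example, in the
figure below, we have set $\rho = 0.2$, and plotted ${\rm KL}(\rho \,\|\, \hat\rho_j)$
for a range of values of $\hat\rho_j$:

[Figure: ${\rm KL}(\rho \,\|\, \hat\rho_j)$ plotted against $\hat\rho_j$ for $\rho = 0.2$]

We see that the KL-divergence reaches its minimum of 0 at
$\hat\rho_j = \rho$, and blows up (it actually approaches $\infty$) as $\hat\rho_j$
approaches 0 or 1. Thus, minimizing
this penalty term has the effect of causing $\hat\rho_j$ to be close to $\rho$.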
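The shape of this penalty can be checked with a short worked derivation. It assumes natural logarithms and that both the target and the average activation lie strictly between 0 and 1.

```latex
\begin{align}
\frac{\partial}{\partial \hat\rho_j} \mathrm{KL}(\rho \,\|\, \hat\rho_j)
  &= -\frac{\rho}{\hat\rho_j} + \frac{1-\rho}{1-\hat\rho_j}
   = \frac{\hat\rho_j - \rho}{\hat\rho_j \, (1 - \hat\rho_j)}, \\
\lim_{\hat\rho_j \to 0^+} \rho \log \frac{\rho}{\hat\rho_j} &= \infty, \qquad
\lim_{\hat\rho_j \to 1^-} (1-\rho) \log \frac{1-\rho}{1-\hat\rho_j} = \infty .
\end{align}
```

The derivative is negative when the average activation is below the target, zero when the two are equal, and positive when it is above, so the penalty has a single minimum at the target and grows monotonically on either side. The two limits show why it becomes unbounded at 0 and at 1.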

Our overall cost function is now

$$J_{\rm sparse}(W,b) = J(W,b) + \beta \sum_{j=1}^{s_2} {\rm KL}(\rho \,\|\, \hat\rho_j),$$

where $J(W,b)$ is as defined previously, and $\beta$ controls the weight of
the sparsity penalty term. The term $\hat\rho_j$ (implicitly) depends on $W,b$ also,
because it is the average activation of hidden unit $j$, and the activation of a hidden
unit depends on the parameters $W,b$.

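To incorporate the KL-divergence term into your derivative calculation, there is a simple-to-implement
trick involving only a small change to your code. Specifically, where previously for
the second layer ($l = 2$), during backpropagation you would have computed

$$\delta^{(2)}_i = \left( \sum_{j=1}^{s_3} W^{(2)}_{ji} \delta^{(3)}_j \right) f'(z^{(2)}_i),$$

now instead compute

$$\delta^{(2)}_i = \left( \left( \sum_{j=1}^{s_3} W^{(2)}_{ji} \delta^{(3)}_j \right) + \beta \left( - \frac{\rho}{\hat\rho_i} + \frac{1-\rho}{1-\hat\rho_i} \right) \right) f'(z^{(2)}_i).$$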
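The modified hidden-layer error term amounts to adding one extra vector before multiplying by the activation derivative. The sketch below assumes NumPy and a sigmoid hidden layer, so that the derivative can be written as `a2 * (1 - a2)`; the function name and argument names are illustrative.

```python
import numpy as np

def hidden_delta_sparse(W2, delta3, a2, rho_hat, rho, beta):
    """Error term for the hidden layer of one example, with sparsity penalty.

    W2: (s3, s2) weights from hidden to output layer
    delta3: (s3,) output-layer error term for this example
    a2: (s2,) hidden activations for this example (assumed sigmoid)
    rho_hat: (s2,) average activations over the whole training set
    """
    # Derivative of the KL penalty with respect to each average activation.
    sparsity_term = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat))
    # Usual backpropagated error plus the sparsity term, times f'(z2).
    return (W2.T @ delta3 + sparsity_term) * a2 * (1.0 - a2)

rng = np.random.default_rng(0)
W2 = rng.normal(size=(3, 2))
delta3 = rng.normal(size=3)
a2 = np.array([0.3, 0.7])
plain = (W2.T @ delta3) * a2 * (1.0 - a2)  # error term without any penalty

at_target = hidden_delta_sparse(W2, delta3, a2, np.full(2, 0.05), rho=0.05, beta=3.0)
too_dense = hidden_delta_sparse(W2, delta3, a2, np.full(2, 0.40), rho=0.05, beta=3.0)
```

When every average activation already equals the target, the extra term vanishes and `at_target` matches `plain`. When units are too active, the term is positive, so `too_dense` exceeds `plain` and gradient descent pushes those activations down.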

One subtlety is that you'll need to know $\hat\rho_i$ to compute this term. Thus, you'll need
to compute a forward pass on all the training examples first to compute the average
activations on the training set, before computing backpropagation on any example. If your
training set is small enough to fit comfortably in computer memory (this will be the case for the programming
assignment), you can compute forward passes on all your examples, keep the resulting activations
in memory, and compute the $\hat\rho_i$s. Then you can use your precomputed activations to
perform backpropagation on all your examples. If your data is too large to fit in memory, you
may have to scan through your examples computing a forward pass on each to accumulate (sum up) the
activations and compute $\hat\rho_i$ (discarding the result of each forward pass after you
have taken its activations $a^{(2)}_i$ into account for computing $\hat\rho_i$). Then after
having computed $\hat\rho_i$, you'd have to redo the forward pass for each example so that you
can do backpropagation on that example. In this latter case, you would end up computing a forward
pass twice on each example in your training set, making it computationally less efficient.
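The two-pass procedure for data that does not fit in memory might look like the sketch below. It makes several assumptions that the text does not fix: NumPy, sigmoid units in both layers, a squared-error reconstruction cost averaged over examples, no weight decay, and a hypothetical `batches` callable that returns a fresh iterator over chunks of the training set each time it is called.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def two_pass_gradients(batches, W1, b1, W2, b2, rho, beta):
    """Gradients of the sparse cost, scanning the data twice.

    batches: zero-argument callable returning an iterator over (k, n) arrays.
    """
    # Pass 1: accumulate hidden activations only, discarding each chunk's result.
    act_sum = np.zeros(W1.shape[0])
    m = 0
    for X in batches():
        act_sum += sigmoid(X @ W1.T + b1).sum(axis=0)
        m += X.shape[0]
    rho_hat = act_sum / m
    sparsity_term = beta * (-rho / rho_hat + (1.0 - rho) / (1.0 - rho_hat))

    # Pass 2: redo the forward pass, then backpropagate with the sparsity term.
    gW1, gb1 = np.zeros_like(W1), np.zeros_like(b1)
    gW2, gb2 = np.zeros_like(W2), np.zeros_like(b2)
    for X in batches():
        A2 = sigmoid(X @ W1.T + b1)
        A3 = sigmoid(A2 @ W2.T + b2)
        D3 = (A3 - X) * A3 * (1.0 - A3)                    # output error terms
        D2 = (D3 @ W2 + sparsity_term) * A2 * (1.0 - A2)   # hidden error terms
        gW1 += D2.T @ X
        gb1 += D2.sum(axis=0)
        gW2 += D3.T @ A2
        gb2 += D3.sum(axis=0)
    return rho_hat, gW1 / m, gb1 / m, gW2 / m, gb2 / m

rng = np.random.default_rng(0)
data = rng.random((10, 6))
W1 = rng.normal(scale=0.5, size=(4, 6))
b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(6, 4))
b2 = np.zeros(6)

def chunked():
    # Stand-in for reading the data from disk three rows at a time.
    return (data[i:i + 3] for i in range(0, len(data), 3))

def whole():
    return iter([data])

out_chunked = two_pass_gradients(chunked, W1, b1, W2, b2, rho=0.05, beta=3.0)
out_whole = two_pass_gradients(whole, W1, b1, W2, b2, rho=0.05, beta=3.0)
```

Both calls give the same averages and gradients, since the chunk size only affects how much is held in memory at once. The cost of the streaming version is the second forward pass over every example.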

The full derivation showing that the algorithm above results in gradient descent is beyond the scope
of these notes. But if you implement the autoencoder using backpropagation modified this way,
you will be performing gradient descent exactly on the objective
$J_{\rm sparse}(W,b)$. Using the derivative checking method, you will be able to verify
this for yourself as well.
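A derivative check of this kind can be sketched as follows: compute the gradient with the modified backpropagation, then compare it against centered finite differences of the cost. The sketch assumes NumPy, sigmoid units, a squared-error cost averaged over examples, and no weight decay term; the parameter packing in `unpack` and all function names are illustrative choices.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def unpack(theta, n, s2):
    """Split a flat parameter vector into W1 (s2,n), b1 (s2,), W2 (n,s2), b2 (n,)."""
    i = 0
    W1 = theta[i:i + s2 * n].reshape(s2, n); i += s2 * n
    b1 = theta[i:i + s2]; i += s2
    W2 = theta[i:i + n * s2].reshape(n, s2); i += n * s2
    b2 = theta[i:i + n]
    return W1, b1, W2, b2

def cost_and_grad(theta, X, n, s2, rho, beta):
    """Sparse autoencoder cost and its gradient via modified backpropagation."""
    W1, b1, W2, b2 = unpack(theta, n, s2)
    m = X.shape[0]
    A2 = sigmoid(X @ W1.T + b1)
    A3 = sigmoid(A2 @ W2.T + b2)
    rho_hat = A2.mean(axis=0)
    kl = rho * np.log(rho / rho_hat) + (1 - rho) * np.log((1 - rho) / (1 - rho_hat))
    cost = 0.5 * np.sum((A3 - X) ** 2) / m + beta * np.sum(kl)

    D3 = (A3 - X) * A3 * (1 - A3)
    sparsity_term = beta * (-rho / rho_hat + (1 - rho) / (1 - rho_hat))
    D2 = (D3 @ W2 + sparsity_term) * A2 * (1 - A2)
    grad = np.concatenate([(D2.T @ X / m).ravel(), D2.mean(axis=0),
                           (D3.T @ A2 / m).ravel(), D3.mean(axis=0)])
    return cost, grad

def numerical_grad(f, theta, eps=1e-4):
    """Centered finite-difference approximation of the gradient of f at theta."""
    g = np.zeros_like(theta)
    for k in range(theta.size):
        step = np.zeros_like(theta)
        step[k] = eps
        g[k] = (f(theta + step) - f(theta - step)) / (2 * eps)
    return g

rng = np.random.default_rng(0)
n, s2, m = 6, 4, 10                      # deliberately tiny so the check is fast
X = rng.random((m, n))
theta = rng.normal(scale=0.5, size=s2 * n + s2 + n * s2 + n)
rho, beta = 0.05, 3.0

cost, grad = cost_and_grad(theta, X, n, s2, rho, beta)
num = numerical_grad(lambda t: cost_and_grad(t, X, n, s2, rho, beta)[0], theta)
rel_err = np.linalg.norm(grad - num) / np.linalg.norm(grad + num)
```

A very small `rel_err` indicates that the analytic and numerical gradients agree. Dropping the `sparsity_term` from `D2` while keeping the KL term in `cost` should make the check fail, which is a quick way to confirm that the check is sensitive to the modification.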