
Revision as of 06:44, 30 May 2011


== Introduction ==

In the section on the [[Backpropagation Algorithm | backpropagation algorithm]], you were briefly introduced to backpropagation as a means of deriving gradients for learning in the sparse autoencoder. It turns out that together with matrix calculus, this provides a powerful method and intuition for deriving gradients for more complex matrix functions (functions from matrices to the reals, or symbolically, functions from <math>\mathbb{R}^{r \times c}</math> to <math>\mathbb{R}</math>).

First, recall the backpropagation idea, which we present in a modified form appropriate for our purposes below:

# For each output unit <math>i</math> in layer <math>n_l</math> (the final layer), set
#: <math>\delta^{(n_l)}_i = \frac{\partial}{\partial z^{(n_l)}_i} J(z^{(n_l)})</math>
#: where <math>J(z)</math> is our "objective function" (explained below).
# For <math>l = n_l - 1, n_l - 2, \ldots, 1</math>,
#: For each node <math>i</math> in layer <math>l</math>, set
#: <math>\delta^{(l)}_i = \left( \sum_j W^{(l)}_{ji} \delta^{(l+1)}_j \right) f^{(l)\prime}(z^{(l)}_i)</math>
# Compute the desired partial derivatives,
#: <math>\nabla_{W^{(l)}} J(W,b;x,y) = \delta^{(l+1)} \left( a^{(l)} \right)^T, \quad \nabla_{b^{(l)}} J(W,b;x,y) = \delta^{(l+1)}.</math>

Quick notation recap:
* <math>n_l</math> is the number of layers in the neural network
* <math>s_l</math> is the number of neurons in the <math>l</math>th layer
* <math>W^{(l)}_{ji}</math> is the weight from the <math>i</math>th unit in the <math>l</math>th layer to the <math>j</math>th unit in the <math>(l+1)</math>th layer
* <math>z^{(l)}_i</math> is the input to the <math>i</math>th unit in the <math>l</math>th layer
* <math>a^{(l)}_i</math> is the activation of the <math>i</math>th unit in the <math>l</math>th layer
* <math>A \bullet B</math> is the Hadamard or element-wise product, which for <math>m \times n</math> matrices <math>A</math> and <math>B</math> yields the <math>m \times n</math> matrix <math>C = A \bullet B</math> such that <math>C_{r,c} = A_{r,c} \cdot B_{r,c}</math>
* <math>f^{(l)}</math> is the activation function for units in the <math>l</math>th layer
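As a concreteness check, the recap above can be sketched in a few lines of NumPy (this is our own illustration, not part of the original tutorial; the network shapes, random weights, and the choice of summing squared outputs as <math>J</math> are arbitrary stand-ins):

```python
import numpy as np

# Minimal sketch of the modified backpropagation recap above. Ws[l] maps
# layer l+1 to layer l+2, fs[l]/fps[l] are the activation of layer l+2 and
# its derivative, and the objective J sums the activations of the final layer.
def forward(x, Ws, fs):
    zs, a = [], x
    for W, f in zip(Ws, fs):
        z = W @ a                 # input z to the next layer
        zs.append(z)
        a = f(z)                  # activation of the next layer
    return zs, a

def grad_wrt_input(x, Ws, fs, fps):
    zs, _ = forward(x, Ws, fs)
    delta = fps[-1](zs[-1])       # step 1: delta at the output layer
    for W, fp, z in zip(Ws[:0:-1], fps[-2::-1], zs[-2::-1]):
        delta = (W.T @ delta) * fp(z)   # step 2: propagate deltas backwards
    return Ws[0].T @ delta        # gradient with respect to the input x

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((2, 3)), rng.standard_normal((4, 2))
x = rng.standard_normal(3)
fs  = [lambda z: z, np.square]                       # identity, then square
fps = [lambda z: np.ones_like(z), lambda z: 2 * z]   # their derivatives
g = grad_wrt_input(x, [W1, W2], fs, fps)

# Check against central finite differences of J(x) = sum((W2 @ W1 @ x) ** 2).
def J(x):
    return np.sum(forward(x, [W1, W2], fs)[1])

eps = 1e-6
num = np.array([(J(x + eps * e) - J(x - eps * e)) / (2 * eps)
                for e in np.eye(3)])
assert np.allclose(g, num, atol=1e-5)
```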

Let's say we have a function <math>F</math> that takes a matrix <math>X</math> and yields a real number. We would like to use the backpropagation idea to compute the gradient with respect to <math>X</math> of <math>F</math>, that is <math>\nabla_X F</math>. The general idea is to see the function <math>F</math> as a multi-layer neural network, and to derive the gradients using the backpropagation idea.

To do this, we will set our "objective function" to be the function <math>J(z)</math> that when applied to the outputs of the neurons in the last layer yields the value <math>F(X)</math>. For the intermediate layers, we will also choose our activation functions <math>f^{(l)}</math> to this end.

Using this method, we can easily compute derivatives with respect to the inputs <math>X</math>, as well as derivatives with respect to any of the weights in the network, as we shall see later.

== Examples ==

To illustrate the use of the backpropagation idea to compute derivatives with respect to the inputs, we will use two functions from the section on [[Sparse Coding: Autoencoder Interpretation | sparse coding]], in examples 1 and 2. In example 3, we use a function from [[Independent Component Analysis | independent component analysis]] to illustrate the use of this idea to compute derivatives with respect to weights, and in this specific case, what to do in the case of tied or repeated weights.

=== Example 1: Objective for weight matrix in sparse coding ===

Recall for sparse coding, the objective function for the weight matrix <math>A</math>, given the feature matrix <math>s</math>:

:<math>F(A) = \lVert As - x \rVert_2^2 + \gamma \lVert A \rVert_2^2</math>

We would like to find the gradient of <math>F</math> with respect to <math>A</math>, or in symbols, <math>\nabla_A F(A)</math>. Since the objective function is a sum of two terms in <math>A</math>, the gradient is the sum of gradients of each of the individual terms. The gradient of the second term is trivial, so we will consider the gradient of the first term instead.

The first term, <math>\lVert As - x \rVert_2^2</math>, can be seen as an instantiation of a neural network taking <math>s</math> as an input, and proceeding in four steps, as described below:

# Apply <math>A</math> as the weights from the first layer to the second layer.
# Subtract <math>x</math> from the activation of the second layer, which uses the identity activation function.
# Pass this unchanged to the third layer, via identity weights. Use the square function as the activation function for the third layer.
# Sum all the activations of the third layer.
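The four steps above can be written out directly in NumPy (our own illustration; <code>A</code>, <code>s</code>, and <code>x</code> are random stand-ins). The result recovers <math>\lVert As - x \rVert_2^2</math>:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))   # weights from the first to the second layer
s = rng.standard_normal(3)        # input to the network
x = rng.standard_normal(4)

z2 = A @ s                   # step 1: apply A as the weights
a2 = z2 - x                  # step 2: subtract x (identity activation)
a3 = (np.eye(4) @ a2) ** 2   # step 3: identity weights, square activation
F  = np.sum(a3)              # step 4: sum the activations of the third layer

assert np.isclose(F, np.sum((A @ s - x) ** 2))
```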

The weights and activation functions of this network are as follows:

<table>
<tr><th>Layer</th><th>Weight</th><th>Activation function <math>f</math></th></tr>
<tr><td>1</td><td><math>A</math></td><td><math>f(z_i) = z_i</math> (identity)</td></tr>
<tr><td>2</td><td><math>I</math> (identity)</td><td><math>f(z_i) = z_i - x_i</math></td></tr>
<tr><td>3</td><td>N/A</td><td><math>f(z_i) = z_i^2</math></td></tr>
</table>

To have <math>J(z^{(3)}) = F(x)</math>, we can set <math>J(z^{(3)}) = \sum_k \left( z^{(3)}_k \right)^2</math>.

Once we see <math>F</math> as a neural network, the gradient becomes easy to compute - applying backpropagation yields:

:<math>
\begin{align}
\nabla_A \lVert As - x \rVert_2^2 & = \delta^{(2)} a^{(1)T} \\
& = 2(As - x)s^T
\end{align}
</math>

where <math>\delta^{(2)} = \left( I^T \delta^{(3)} \right) \bullet 1 = 2(As - x)</math> and <math>a^{(1)} = s</math>. Adding the trivial gradient of the second term gives

:<math>\nabla_A F = 2(As - x)s^T + 2\gamma A.</math>
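This gradient can be sanity-checked numerically (our own check, not part of the original tutorial; we read <math>\lVert A \rVert_2^2</math> as the sum of squared entries, and the shapes, random values, and <code>gamma</code> are arbitrary stand-ins):

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 3))
s = rng.standard_normal(3)
x = rng.standard_normal(4)
gamma = 0.1

def F(A):
    # Objective: ||As - x||^2 + gamma * ||A||^2 (sum of squared entries)
    return np.sum((A @ s - x) ** 2) + gamma * np.sum(A ** 2)

# Gradient derived via backpropagation: 2(As - x)s^T + 2*gamma*A
grad = 2 * np.outer(A @ s - x, s) + 2 * gamma * A

# Compare against central finite differences, entry by entry.
eps = 1e-6
num = np.zeros_like(A)
for i in range(A.shape[0]):
    for j in range(A.shape[1]):
        E = np.zeros_like(A)
        E[i, j] = eps
        num[i, j] = (F(A + E) - F(A - E)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-5)
```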

=== Example 3: ICA reconstruction cost ===

Recall the ICA reconstruction cost term <math>\lVert W^TWx - x \rVert_2^2</math>, where <math>W</math> is the weight matrix and <math>x</math> is the input. We would like to find <math>\nabla_W \lVert W^TWx - x \rVert_2^2</math> - the derivative of the term with respect to the '''weight matrix''', rather than the '''input''' as in the earlier examples. We will still proceed similarly though, seeing this term as an instantiation of a neural network:

The weights and activation functions of this network are as follows:

<table>
<tr><th>Layer</th><th>Weight</th><th>Activation function <math>f</math></th></tr>
<tr><td>1</td><td><math>W</math></td><td><math>f(z_i) = z_i</math></td></tr>
<tr><td>2</td><td><math>W^T</math></td><td><math>f(z_i) = z_i</math></td></tr>
<tr><td>3</td><td><math>I</math></td><td><math>f(z_i) = z_i - x_i</math></td></tr>
<tr><td>4</td><td>N/A</td><td><math>f(z_i) = z_i^2</math></td></tr>
</table>

To have <math>J(z^{(4)}) = F(x)</math>, we can set <math>J(z^{(4)}) = \sum_k \left( z^{(4)}_k \right)^2</math>.
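A layer-by-layer forward pass through this network, following the table above, indeed recovers <math>F(x) = \lVert W^TWx - x \rVert_2^2</math> (our own illustration; <code>W</code> and <code>x</code> are random stand-ins):

```python
import numpy as np

rng = np.random.default_rng(3)
W = rng.standard_normal((5, 3))
x = rng.standard_normal(3)

z2 = W @ x              # layer 1 -> 2: weight W, identity activation
a2 = z2
z3 = W.T @ a2           # layer 2 -> 3: weight W^T, identity activation
a3 = z3 - x             # layer 3 activation: f(z) = z - x
z4 = np.eye(3) @ a3     # layer 3 -> 4: identity weights
a4 = z4 ** 2            # layer 4 activation: f(z) = z^2
J = np.sum(a4)          # J(z^(4)): sum of final activations

assert np.isclose(J, np.sum((W.T @ W @ x - x) ** 2))
```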

Now that we can see <math>F</math> as a neural network, we can try to compute the gradient <math>\nabla_W F</math>. However, we now face the difficulty that <math>W</math> appears twice in the network. Fortunately, it turns out that if <math>W</math> appears multiple times in the network, the gradient with respect to <math>W</math> is simply the sum of gradients for each <math>W</math> in the network (you may wish to work out a formal proof of this fact to convince yourself). With this in mind, we can proceed to work out the deltas first:

<table>
<tr><th>Layer</th><th>Derivative of activation function <math>f'</math></th><th width="200px">Delta</th><th>Input <math>z</math> to this layer</th></tr>
<tr><td>4</td><td><math>f'(z_i) = 2z_i</math></td><td><math>\delta^{(4)} = 2z^{(4)}</math></td><td><math>(W^TWx - x)</math></td></tr>
<tr><td>3</td><td><math>f'(z_i) = 1</math></td><td><math>\left( I^T \delta^{(4)} \right) \bullet 1</math></td><td><math>W^TWx</math></td></tr>
<tr><td>2</td><td><math>f'(z_i) = 1</math></td><td><math>\left( (W^T)^T \delta^{(3)} \right) \bullet 1</math></td><td><math>Wx</math></td></tr>
<tr><td>1</td><td><math>f'(z_i) = 1</math></td><td><math>\left( W^T \delta^{(2)} \right) \bullet 1</math></td><td><math>x</math></td></tr>
</table>

First we find the gradients with respect to each appearance of <math>W</math> in the network.

With respect to <math>W^T</math>:

:<math>
\begin{align}
\nabla_{W^T} F & = \delta^{(3)} a^{(2)T} \\
& = 2(W^TWx - x) (Wx)^T
\end{align}
</math>

With respect to <math>W</math>:

:<math>
\begin{align}
\nabla_{W} F & = \delta^{(2)} a^{(1)T} \\
& = 2W(W^TWx - x) x^T
\end{align}
</math>

Taking sums, noting that we need to transpose the gradient with respect to <math>W^T</math> to get the gradient with respect to <math>W</math>, yields the final gradient with respect to <math>W</math> (pardon the slight abuse of notation here):

:<math>
\begin{align}
\nabla_{W} F & = \nabla_{W} F + (\nabla_{W^T} F)^T \\
& = 2W(W^TWx - x) x^T + 2(Wx)(W^TWx - x)^T
\end{align}
</math>
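The final tied-weight gradient can be verified against finite differences (our own sanity check, not part of the original tutorial; <code>W</code> and <code>x</code> are random stand-ins). The two terms are exactly the gradients contributed by the two appearances of <math>W</math> in the network:

```python
import numpy as np

rng = np.random.default_rng(4)
W = rng.standard_normal((5, 3))
x = rng.standard_normal(3)

def F(W):
    # ICA reconstruction cost: ||W^T W x - x||^2
    return np.sum((W.T @ W @ x - x) ** 2)

# Final gradient: 2 W (W^T W x - x) x^T + 2 (W x)(W^T W x - x)^T
r = W.T @ W @ x - x
grad = 2 * W @ np.outer(r, x) + 2 * np.outer(W @ x, r)

# Compare against central finite differences, entry by entry.
eps = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        num[i, j] = (F(W + E) - F(W - E)) / (2 * eps)

assert np.allclose(grad, num, atol=1e-5)
```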