I am taking Stanford's online machine learning class, CS 229. It has a section on deep learning and back-propagation in deep networks.

The network looks like:

The forward propagation can be defined as:

where g is the activation function.

The dimensions of each variable can also be given as:

Now, for back-propagation, using the chain rule, we can get:

To match up with the dimensions, we have:

I know that after applying the chain rule, the normal approach is to compute the generalized Jacobian matrices and multiply them. However, the dimensions of the parts in the chain rule above do not match what generalized Jacobian matrices would give us. For example, for the last term in the chain rule, the dimension from the generalized Jacobian should be $(2 \times 1) \times (2 \times 3)$. However, the course notes say it is $1 \times 3$.

1 Answer

You're right that that doesn't make sense as the Jacobian. Furthermore, if multiplying Jacobians were really how autodiff worked, any pointwise function applied to a vector of length $n$ would create a huge $n \times n$ Jacobian. That is not what happens in any competent autodiff implementation.

In reality, it is not necessary to compute the Jacobian in order to perform backpropagation. All that is needed is the "vector-Jacobian product", or VJP.

If you have a function $f : \mathbb{R}^n \rightarrow \mathbb{R}^m$, then $\text{VJP} : \mathbb{R}^m \times \mathbb{R}^n \rightarrow \mathbb{R}^n$ is the function that computes $\text{VJP}(g,x) = J_f(x)^T g$, where $g$ is the incoming gradient vector $\frac{\partial \mathcal{L}}{\partial f}$ and $J_f(x)$ is the Jacobian of $f$. As written, $J_f(x)^T g$ looks like a Jacobian-vector product, but it is just the transpose of the row vector $g^T J_f(x)$, so calling it a VJP rather than a JVP is purely a matter of convention.
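To make the definition concrete, here is a minimal NumPy sketch for a toy $f : \mathbb{R}^3 \rightarrow \mathbb{R}^2$ (the function and the values are hypothetical, chosen only to illustrate the shapes):

```python
import numpy as np

# Toy function f : R^3 -> R^2 (a hypothetical example, not from the notes):
# f(x) = [x0 * x1, x1 + x2].
def f(x):
    return np.array([x[0] * x[1], x[1] + x[2]])

def jacobian_f(x):
    # The explicit 2x3 Jacobian, built here only so we can check the VJP.
    return np.array([[x[1], x[0], 0.0],
                     [0.0,  1.0,  1.0]])

def vjp_f(g, x):
    # VJP(g, x) = J_f(x)^T g: maps the incoming gradient in R^2 back to R^3.
    return jacobian_f(x).T @ g

x = np.array([1.0, 2.0, 3.0])
g = np.array([0.5, -1.0])        # incoming gradient dL/df
print(vjp_f(g, x))               # dL/dx, a length-3 vector
```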

The key point is that although one way to implement the VJP is to explicitly form the Jacobian and then perform this matrix-vector product, any method that computes the same result without materializing the Jacobian is also perfectly fine.

For example, the VJP of the pointwise $\sin(x)$ is just $\text{VJP}(g,x) = g \circ \cos(x)$, with no $n \times n$ matrix in sight. The VJP of $f(W, x) = Wx$ with respect to $x$ is simply $\text{VJP}(g, W, x) = W^Tg$, and the VJP with respect to $W$ is $\text{VJP}(g, W, x) = gx^T$.
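As a sanity check, here is a small NumPy sketch (the shapes and random values are my own choices, and I flatten $W$ row-major when forming its explicit Jacobian) verifying these rules against materialized Jacobians:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m = 3, 2                      # small shapes, chosen only for illustration
x = rng.normal(size=n)
W = rng.normal(size=(m, n))
g_sin = rng.normal(size=n)       # incoming gradient for the pointwise sin
g = rng.normal(size=m)           # incoming gradient for f(W, x) = Wx

# Pointwise sin: the full Jacobian is diag(cos(x)), but the VJP never builds it.
assert np.allclose(g_sin * np.cos(x), np.diag(np.cos(x)).T @ g_sin)

# VJP w.r.t. W: the Jacobian w.r.t. the row-major vec(W) is the m x (m*n)
# matrix with entries d(Wx)_i / dW_jk = delta_ij * x_k, i.e. kron(I_m, x).
J_W = np.kron(np.eye(m), x)
assert np.allclose(np.outer(g, x).ravel(), J_W.T @ g)   # g x^T, flattened

# VJP w.r.t. x needs no check: the Jacobian of Wx in x is W itself,
# so J^T g is literally W^T g.
print("VJP rules match the explicit Jacobians")
```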

Returning to your question: the expression in equation 3.30 is actually just computing $\text{VJP}(g, W, x) = gx^T$, where all of the terms on the RHS except the right-most one together form $g$, and the last term is $x^T$.
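With the dimensions from your question (so $g$ is $2 \times 1$ and $x^T$ is $1 \times 3$; the ones below are just placeholder values), the shapes work out exactly as the notes claim:

```python
import numpy as np

g = np.ones((2, 1))   # the product of every RHS term except the last: (2 x 1)
xT = np.ones((1, 3))  # the right-most term, x^T: the 1 x 3 from the notes

dW = g @ xT           # g x^T: (2 x 1) @ (1 x 3) -> (2 x 3), the shape of W
print(dW.shape)       # (2, 3): no (2 x 1) x (2 x 3) generalized Jacobian needed
```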