Motivation

Suppose you have some loss function \(\mathcal{L}(\beta) : \mathbb{R}^n \to \mathbb{R}\) that you want to minimize with respect to some model parameters \(\beta\). You understand how gradient descent works and you have a correct implementation of \(\mathcal{L}\), but you aren’t sure whether you derived the gradient correctly or implemented it correctly in code.

Solution

We can compare our implementation of the gradient of \(\mathcal{L}\) to a finite difference approximation of the gradient. Recall that the derivative of \(\mathcal{L}\) in a direction \(d \in \mathbb{R}^n\) at a point \(x \in \mathbb{R}^n\), written \(\nabla_d \mathcal{L}(x)\), is defined as

\[\nabla_d \mathcal{L}(x) = \lim_{\epsilon \to 0} \frac{\mathcal{L}(x + \epsilon d) - \mathcal{L}(x)}{\epsilon}\]

If we take \(\epsilon\) to be fixed and small, we can use this formula to approximate the derivative in any direction. By approximating it along each standard basis direction \(e_1, \dots, e_n\), we construct an approximation of the full gradient of \(\mathcal{L}\) at a particular point \(x\), one coordinate at a time.
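The coordinate-by-coordinate idea can be sketched in a few lines of Python. This is only an illustration: the function name `finite_difference_gradient`, the default step `eps=1e-6`, and the toy loss are assumptions made for the example, and it uses the forward difference \((\mathcal{L}(x + \epsilon d) - \mathcal{L}(x)) / \epsilon\) along each standard basis direction.

```python
import numpy as np

def finite_difference_gradient(loss, x, eps=1e-6):
    """Approximate the gradient of `loss` at `x`, one coordinate at a time."""
    x = np.asarray(x, dtype=float)
    grad = np.zeros_like(x)
    base = loss(x)  # L(x), reused for every direction
    for i in range(x.size):
        d = np.zeros_like(x)
        d[i] = 1.0  # i-th standard basis direction
        # Forward difference with a fixed, small step eps
        grad[i] = (loss(x + eps * d) - base) / eps
    return grad

# Toy check (hypothetical loss): the gradient of sum(x**2) is 2 * x.
def toy_loss(x):
    return np.sum(x ** 2)

x0 = np.array([1.0, -2.0, 0.5])
approx = finite_difference_gradient(toy_loss, x0)
print(np.allclose(approx, 2 * x0, atol=1e-4))
```

The step size is a trade-off: too large and the approximation error grows, too small and floating point cancellation dominates, so the comparison should use a tolerance rather than exact equality.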

Example: Checking the gradient of linear regression

Suppose that we have \(n = 20\) data points in \(\mathbb{R}^2\), each with a response \(y_i \in \mathbb{R}\) (here \(n\) counts observations rather than parameters). Linear regression assumes the response vector \(y\) is related linearly to the data matrix \(X\) via the equation

\[y = X \beta + \epsilon\]

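We want to find an estimate \(\hat \beta\) that minimizes the sum of squared errors of the predicted values \(\hat y = X \hat \beta\):

\[\begin{aligned}
\mathcal{L}(\beta) &= \frac{1}{2n} \sum_{i=1}^n (y_i - \hat y_i)^2 \\
&= \frac{1}{2n} \sum_{i=1}^n (y_i - x_i^T \beta)^2 \\
&= \frac{1}{2n} (y - X \beta)^T (y - X \beta)
\end{aligned}\]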

In the final step above we recognize that the sum of squared residuals can be written as a dot product. Next we’d like to take the gradient of this dot product. There’s a beautiful explanation of how to take the gradient of a quadratic form here. The gradient (in matrix notation) is

\[\nabla_\mathcal{L}(\beta) = -\frac{1}{n} (y - X \beta)^T X\]
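For readers who want the intermediate step, here is a short sketch of how this gradient arises. It assumes the loss carries a \(\frac{1}{2n}\) scaling, which is what the stated gradient implies, and introduces the residual vector \(r = y - X \beta\) purely as shorthand. Applying the chain rule through \(r\):

\[\begin{aligned}
\mathcal{L}(\beta) &= \frac{1}{2n} r^T r, \qquad r = y - X \beta \\
\frac{\partial \mathcal{L}}{\partial r} &= \frac{1}{n} r^T, \qquad \frac{\partial r}{\partial \beta} = -X \\
\nabla_\mathcal{L}(\beta) &= \frac{1}{n} r^T (-X) = -\frac{1}{n} (y - X \beta)^T X
\end{aligned}\]

Note that in this layout the gradient is a row vector with one entry per coefficient, so an implementation may need a transpose before comparing it with a column-shaped approximation.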

We can now implement an analytical version of \(\nabla_\mathcal{L}(\beta)\) and compare it to a finite difference approximation. First we simulate and visualize some data: