We see that in the 2nd equation, regularization is simply adding $\lambda$ to the diagonal of $\boldsymbol{X}^T\boldsymbol{X}$, which is done to improve the numerical stability of matrix inversion.

My current 'crude' understanding of numerical stability is that if a function becomes more 'numerically stable' then its output will be less significantly affected by the noise in its inputs. I am having difficulties relating this concept of improved numerical stability to the bigger picture of how it avoids/reduces the problem of overfitting.

I have tried looking at Wikipedia and a few other university websites, but they don't go deep into explaining why this is so.

2 Answers

In the linear model $Y=X\beta + \epsilon$, assuming uncorrelated errors with mean zero and $X$ having full column rank, the least squares estimator $(X^TX)^{-1}X^TY$ is an unbiased estimator for the parameter $\beta$. However, this estimator can have high variance, for example when two of the columns of $X$ are highly correlated.

The penalty parameter $\lambda$ makes $\hat{w}$ a biased estimator of $\beta$, but it decreases its variance. Also, $\hat{w}$ is the posterior expectation of $\beta$ in a Bayesian regression with a $N(0,\frac{1}{\lambda}I)$ prior on $\beta$. In that sense, we include some information into the analysis that says the components of $\beta$ ought not be too far from zero. Again, this leads us to a biased point estimate of $\beta$ but reduces the variance of the estimate.
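The bias-variance trade-off described above is easy to see in simulation. The following sketch (with illustrative numbers: $n = 50$, $\lambda = 10$, a fixed random seed, and a nearly collinear second column) repeatedly draws data and compares the sampling variance of the OLS and ridge estimates:

```python
# Sketch: compare the variance of OLS and ridge estimates when two
# columns of X are highly correlated. All constants are illustrative.
import numpy as np

rng = np.random.default_rng(0)
n, lam = 50, 10.0
beta = np.array([1.0, 1.0])

ols_estimates, ridge_estimates = [], []
for _ in range(500):
    x1 = rng.normal(size=n)
    x2 = x1 + 0.01 * rng.normal(size=n)   # nearly collinear with x1
    X = np.column_stack([x1, x2])
    y = X @ beta + rng.normal(size=n)
    # OLS: (X'X)^{-1} X'y ; ridge: (X'X + lam I)^{-1} X'y
    ols_estimates.append(np.linalg.solve(X.T @ X, X.T @ y))
    ridge_estimates.append(np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y))

print(np.var(ols_estimates, axis=0))    # large: collinearity inflates OLS variance
print(np.var(ridge_estimates, axis=0))  # far smaller, at the cost of some bias
```

The ridge estimates are biased towards zero, but their spread across repeated samples is orders of magnitude smaller than that of the OLS estimates.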

In a setting where $X$ is high dimensional, say $N \approx p$, the least squares fit will match the data almost perfectly. Although unbiased, this estimate will be highly sensitive to fluctuations in the data because in such high dimensions, there will be many points with high leverage. In such situations the sign of some components of $\hat{\beta}$ can be determined by a single observation. The penalty term has the effect of shrinking these estimates towards zero, which can reduce the MSE of the estimator by reducing the variance.

Problems?

1. For small samples, our sample estimates of $\mathrm{E}[\mathbf{x}\mathbf{x}']$ and $\mathrm{E}[\mathbf{x}y]$ may be poor.

2. If columns of $X$ are collinear (either due to inherent collinearity or small sample size), the problem will have a continuum of solutions: the solution is not unique. This occurs if $\mathrm{E}[\mathbf{x}\mathbf{x}']$ is rank deficient, and it also occurs if $X'X$ is rank deficient due to a small sample size relative to the number of regressors.

Problem (1) can lead to overfitting as the estimate $\hat{\mathbf{b}}$ starts reflecting patterns in the sample that aren't there in the underlying population. The estimate may reflect patterns in $\frac{1}{n}X'X$ and $\frac{1}{n}X'\mathbf{y}$ that don't actually exist in $\mathrm{E}[\mathbf{x}\mathbf{x}']$ and $\mathrm{E}[\mathbf{x}y]$.
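To see how poor the small-sample moment estimates can be, here is a minimal sketch (the sample sizes and seed are arbitrary): for standard normal $\mathbf{x}$, the population moment $\mathrm{E}[\mathbf{x}\mathbf{x}']$ is the identity, so we can measure directly how far $\frac{1}{n}X'X$ strays from it.

```python
# Sketch: the sample second moment (1/n) X'X vs. the population E[xx'].
# For standard normal x, E[xx'] = I, so the estimation error is observable.
import numpy as np

rng = np.random.default_rng(2)
true_moment = np.eye(2)

errs = {}
for n in (10, 100, 10000):
    X = rng.normal(size=(n, 2))
    sample_moment = X.T @ X / n
    errs[n] = np.abs(sample_moment - true_moment).max()
    print(n, errs[n])  # the error shrinks as n grows
```

With $n = 10$ the sample moment can be badly off, which is exactly the noise that $\hat{\mathbf{b}}$ ends up fitting.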

Problem (2) means a solution isn't unique. Imagine we're trying to estimate the price of individual shoes, but pairs of shoes are always sold together. This is an ill-posed problem, but let's say we're doing it anyway. We may believe the left shoe price plus the right shoe price equals \$50, but how can we come up with individual prices? Is setting the left shoe price $p_l = 45$ and the right shoe price $p_r = 5$ ok? How can we choose from all the possibilities?

Introducing an $L_2$ penalty:

This may help us with both types of problems. The $L_2$ penalty pushes our estimate of $\mathbf{b}$ towards zero. It functions effectively as a Bayesian prior that the distribution over coefficient values is centered around $\mathbf{0}$. That helps with overfitting: our estimate will reflect both the data and our initial beliefs that $\mathbf{b}$ is near zero.

$L_2$ regularization also allows us to find a unique solution to ill-posed problems. If we know the prices of the left and right shoes total $\$50$, the solution that also minimizes the $L_2$ norm is to choose $p_l = p_r = 25$.
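The shoe example can be written as a one-equation, two-unknown ridge problem. With a tiny penalty (the value $\lambda = 10^{-6}$ below is just an illustrative stand-in for "small"), the ridge solution approximates the minimum-norm solution and splits the price evenly:

```python
# Sketch of the shoe example: one equation p_l + p_r = 50, two unknowns.
# Among all consistent price pairs, ridge picks the one closest to zero.
import numpy as np

X = np.array([[1.0, 1.0]])   # a pair is always sold together
y = np.array([50.0])

lam = 1e-6                   # tiny penalty: approximates the min-norm solution
b = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ y)
print(b)                     # both prices come out very close to 25
```

Without the penalty, $X'X$ here is singular and `np.linalg.solve` has no unique answer to give; the $\lambda I$ term is exactly what makes the inversion well-posed.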

Is this magic? No. Regularization isn't the same as adding data that would actually allow us to answer the question. $L_2$ regularization in some sense adopts the view that if you lack data, choose estimates closer to $0$.