Bias-variance decomposition

Regression Decomposition

In regression analysis, it is common to decompose the observed value $y$ as follows:

$$y = f(x) + \epsilon \tag{0}$$

where the true regression $f(x)$ is regarded as a constant given $x$.
The error (or data noise) $\epsilon$ is independent of $f(x)$,
with the assumption that $\epsilon$ follows a Gaussian distribution with mean 0 and variance $\sigma^2$.
Note that (0) is only a description of the data.
When we replace $f(x)$ with its estimate $\hat{f}(x)$, (0) turns into a more practical form:

$$y = \hat{f}(x) + r \tag{1}$$

where the estimate $\hat{f}(x)$ of the true regression is regarded as non-constant given $x$ (it depends on the training sample used to fit it),
and the residual $r$ describes the gap between $y$ and $\hat{f}(x)$.
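Based on (0) and (1), we can make the following observations:

$$r = y - \hat{f}(x) = f(x) - \hat{f}(x) + \epsilon \tag{2}$$

$$E[\epsilon] = 0, \qquad E[\epsilon^2] = \mathrm{Var}[\epsilon] = \sigma^2 \tag{3}$$
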
Bias-variance decomposition

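Using (2) and (3), we can show why minimizing the mean squared error is useful for regression problems.
For the derivation, we need a few more identities related to expectation and variance.
Given any two independent random variables $X$ and $Y$ and a constant $c$, we have:

$$E[c] = c, \qquad \mathrm{Var}[c - X] = \mathrm{Var}[X], \qquad E[X^2] = \mathrm{Var}[X] + E[X]^2, \qquad E[XY] = E[X]\,E[Y]$$
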
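
As a quick sanity check, identities of this kind can be verified on simulated samples. The sketch below assumes the identities in question are the standard ones ($E[X^2] = \mathrm{Var}[X] + E[X]^2$, $E[XY] = E[X]E[Y]$ for independent $X$ and $Y$, and $\mathrm{Var}[c - X] = \mathrm{Var}[X]$); the distributions, sample size, and constant are arbitrary choices made for illustration.

```python
import random
import statistics

rng = random.Random(0)          # fixed seed so the run is repeatable
n = 50_000
c = 3.0                         # an arbitrary constant

# Two independent samples (the distributions are illustrative assumptions).
xs = [rng.gauss(1.0, 2.0) for _ in range(n)]
ys = [rng.uniform(0.0, 4.0) for _ in range(n)]

# E[X^2] = Var[X] + E[X]^2 (holds exactly for sample moments too).
second_moment = statistics.fmean([x * x for x in xs])
var_plus_mean_sq = statistics.pvariance(xs) + statistics.fmean(xs) ** 2

# E[XY] = E[X] E[Y] for independent X and Y (holds up to sampling error).
mean_of_product = statistics.fmean([x * y for x, y in zip(xs, ys)])
product_of_means = statistics.fmean(xs) * statistics.fmean(ys)

# Var[c - X] = Var[X]: shifting and flipping the sign leaves the variance unchanged.
var_shifted = statistics.pvariance([c - x for x in xs])
var_original = statistics.pvariance(xs)

print(second_moment, var_plus_mean_sq)
print(mean_of_product, product_of_means)
print(var_shifted, var_original)
```

The first and third pairs should match up to floating-point rounding, while the second pair only matches approximately because independence is a property of the distributions, not of a finite sample.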
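Beginning with the definition of the mean squared error, we can rewrite it as an expected value and substitute $y = f(x) + \epsilon$:

$$\mathrm{MSE} = E\big[(y - \hat{f}(x))^2\big] = E\big[(f(x) - \hat{f}(x) + \epsilon)^2\big]$$

By expanding $(f(x) - \hat{f}(x) + \epsilon)^2$, we get

$$\mathrm{MSE} = E\big[(f(x) - \hat{f}(x))^2\big] + 2\,E\big[\epsilon\,(f(x) - \hat{f}(x))\big] + E[\epsilon^2]$$

Note that $E[\epsilon\,(f(x) - \hat{f}(x))] = E[\epsilon]\,E[f(x) - \hat{f}(x)] = 0$
because $\epsilon$ is independent of $f(x)$ and $\hat{f}(x)$, and $E[\epsilon] = 0$.
For the remaining two terms, $E[\epsilon^2] = \mathrm{Var}[\epsilon] + E[\epsilon]^2 = \sigma^2$, and, since $f(x)$ is a constant,
$E[(f(x) - \hat{f}(x))^2] = \mathrm{Var}[\hat{f}(x)] + (f(x) - E[\hat{f}(x)])^2$. Therefore

$$\mathrm{MSE} = \sigma^2 + \mathrm{Var}[\hat{f}(x)] + \big(f(x) - E[\hat{f}(x)]\big)^2 \tag{7}$$

We thus reach our final form (7), which is the sum of the data noise variance $\sigma^2$,
the prediction variance $\mathrm{Var}[\hat{f}(x)]$,
and the squared prediction bias $(f(x) - E[\hat{f}(x)])^2$.
This result is the bias-variance decomposition.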
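
The decomposition can also be checked numerically. The sketch below is a minimal Monte Carlo simulation and not part of the derivation itself: the true regression `true_f`, the deliberately misspecified no-intercept estimator `fit_slope_through_origin`, the training inputs, and the noise level are all assumptions chosen for illustration. It repeatedly draws a training set, fits the estimator, predicts at a fixed point `x0`, and compares the empirical mean squared error on fresh observations with the sum of noise variance, prediction variance, and squared bias.

```python
import random
import statistics


def true_f(x):
    """Assumed true regression (illustrative only)."""
    return 2.0 * x + 1.0


def fit_slope_through_origin(xs, ys):
    """Least-squares fit of y ~ b * x with no intercept, so the model is biased."""
    return sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)


def simulate(n_trials=20_000, sigma=0.5, x0=0.5, seed=0):
    rng = random.Random(seed)
    train_xs = [0.0, 0.5, 1.0, 1.5, 2.0]
    preds, sq_errors = [], []
    for _ in range(n_trials):
        # A fresh noisy training set gives a fresh estimate of the regression.
        train_ys = [true_f(x) + rng.gauss(0.0, sigma) for x in train_xs]
        pred = fit_slope_through_origin(train_xs, train_ys) * x0
        # A fresh observation at x0, with noise independent of the training set.
        y0 = true_f(x0) + rng.gauss(0.0, sigma)
        preds.append(pred)
        sq_errors.append((y0 - pred) ** 2)
    mse = statistics.fmean(sq_errors)
    noise = sigma ** 2
    variance = statistics.pvariance(preds)
    bias_sq = (true_f(x0) - statistics.fmean(preds)) ** 2
    return mse, noise, variance, bias_sq


mse, noise, variance, bias_sq = simulate()
print("empirical MSE          :", mse)
print("noise + variance + bias:", noise + variance + bias_sq)
```

The two printed values should agree up to Monte Carlo error. The no-intercept model is used only so that the bias term is visibly non-zero.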

Why does variance matter in regression?

Lowering the prediction bias certainly gives the model higher accuracy on the training dataset;
however, to obtain similar performance outside the training dataset,
we want to prevent the model from overfitting the training dataset.
Given that the true regression $f(x)$ has zero variance,
a robust model should have a prediction variance $\mathrm{Var}[\hat{f}(x)]$ that is as small as possible,
and this is consistent with the objective of minimizing the mean squared error, since the prediction variance is one of the terms in (7).
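
To make the role of variance concrete, the following sketch compares two hypothetical estimators at a single point `x0`. One reproduces the noisy training value observed at `x0` (zero training error, high variance); the other averages all training values (a small bias, much lower variance). The quadratic `true_f`, the training inputs, and the noise level are assumptions made purely for illustration.

```python
import random
import statistics


def true_f(x):
    """Assumed true regression (illustrative only)."""
    return x * x


def compare(n_trials=20_000, sigma=0.5, x0=0.5, seed=1):
    rng = random.Random(seed)
    train_xs = [0.3, 0.4, 0.5, 0.6, 0.7]
    idx0 = train_xs.index(x0)  # position of x0 among the training inputs
    runs = {"interpolate": [], "average": []}
    for _ in range(n_trials):
        train_ys = [true_f(x) + rng.gauss(0.0, sigma) for x in train_xs]
        y0 = true_f(x0) + rng.gauss(0.0, sigma)  # fresh test observation
        # Estimator 1: repeat the training value seen at x0 (fits it perfectly).
        runs["interpolate"].append((train_ys[idx0], y0))
        # Estimator 2: average all training values (smoother, slightly biased).
        runs["average"].append((statistics.fmean(train_ys), y0))
    summary = {}
    for name, pairs in runs.items():
        preds = [p for p, _ in pairs]
        summary[name] = {
            "bias_sq": (true_f(x0) - statistics.fmean(preds)) ** 2,
            "variance": statistics.pvariance(preds),
            "test_mse": statistics.fmean([(y - p) ** 2 for p, y in pairs]),
        }
    return summary


summary = compare()
for name, stats in summary.items():
    print(name, stats)
```

Under these assumptions, the interpolating estimator is nearly unbiased but carries the full noise variance into its prediction, so its error on fresh observations is larger than that of the averaging estimator despite its perfect fit to the training value.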