When doing forward feature selection for linear regression, a well-known trick for picking the next feature is to compute the covariance of each candidate feature with the current residuals and select the candidate with the largest absolute value.

Intuitively, this makes sense to me, but I haven't been able to find or derive a rigorous proof that this technique is equivalent to the naive approach: adding each candidate feature one at a time, fitting coefficients and computing the squared error for each expanded model, and then choosing the feature that yields the minimum squared error.
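For concreteness, here is a small numerical sketch of the two procedures (my own illustration, not from any reference). I use an orthonormal design, where the equivalence is exact: when the candidate column is orthogonal to the columns already in the model, refitting leaves the old coefficients unchanged and the RSS reduction from adding column $x_j$ is exactly $(x_j^\top r)^2$, so ranking by $|x_j^\top r|$ (proportional to the covariance when columns are centered) matches ranking by RSS. With general correlated features the two criteria need not agree, which is essentially what my question is about.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 8

# Orthonormal columns (QR of a random matrix) so the equivalence
# between the two selection criteria is exact.
X, _ = np.linalg.qr(rng.standard_normal((n, p)))
beta = rng.standard_normal(p)
y = X @ beta + 0.1 * rng.standard_normal(n)

current = [0, 1]  # features already in the model
candidates = [j for j in range(p) if j not in current]

# Residuals of the current fit.
coef, *_ = np.linalg.lstsq(X[:, current], y, rcond=None)
r = y - X[:, current] @ coef

# Criterion 1: maximize |x_j . r| over candidates
# (proportional to the covariance for centered columns).
corr_pick = max(candidates, key=lambda j: abs(X[:, j] @ r))

# Criterion 2: refit with each candidate added and minimize RSS.
def rss_with(j):
    cols = current + [j]
    c, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
    return np.sum((y - X[:, cols] @ c) ** 2)

rss_pick = min(candidates, key=rss_with)

print(corr_pick == rss_pick)  # same feature chosen by both criteria
```

Running this prints `True`: under orthonormality both criteria pick the same feature. A proof for the general case (or the precise conditions under which it holds) is what I'm after.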