Revision

In the previous post, we learned that regression is simply
a conditional expectation of the outcome Y given by input variables X=x (E(Y|X)) and least squares fitting can be used to find the coefficients of the linear regression. The linear model either assumes that the regression function E(Y|X) is linear or that the linear model is a reasonable approximation.

As you may remember, what we are trying to do is to find such betas so as to minimize the RSS:

From the previous post, we know that the betas that minimize the RSS can be found by:

For those who are familiar with linear algebra, does it remind you of something which is widely used in linear algebra? Yes, it is orthogonal projectionwhich is something extremely important.

Here is a brief overview of what orthogonal projection is and how it works:
As it can be clearly seen from the picture above, the least squares estimate of the betas is the coefficients requred for the orthogonal projection of response vector y onto the columns space of the input(feature) matrix. Then all we have to do is to multiply input matrix with our vector of estimated betas to find the least squares estimation of the responses:

X*(X^t*X)^-1*X^t part of the equation is called the projection matrix because it projects y onto the columns space of X and produces the orthogonal projections. The least squares estimates are essentially the orthogonal projections because the orthogonal distance between actual value and predicted value is the minimum distance between them. Think: what happens to projection matrix when the input matrix X is singular?

The Gauss-Markov Theorem

This theorem states that the LSE(least squares estimates) of the betas have the smallest variance among all linear unbiased estimates. But do the unbiased estimates always the best? As we will see later, there may be some biased estimates of betas that give a smaller mean squared error overall. The book gives the following equations:

This estimate is unbiased which means its expectation equals to original betas:

The proof is as below:

As we always keep in mind our assumptions, the proofs are easy as it can be seen. Then the book states that if we have any other linear estimator c’y (‘ means transpose)that is unbiased for then the variance of this estimates cannot be smaller than the variance of , the estimates of ordinay least squares.

Let’s try to prove it.

First: setting the equation:

The expectation of the linear estimator can be calculated as below:

Then the variance of it can be calculated as following:

For the ones who wonder the variance of OLS beta, here is how we calculate it:

As it can be seen, the variance of another linear estimator for the is greater than or equal to the variance of .

Then let’s try to find how well we estimate the betas by calculating the MSE of it:Correction: variance decomposition should be expectation of the second moment minusthe squared expectation, not plus.

As it is clear from the picture above, the MSE of our estimates of betas are linearly related to its variance and squared bias.Variance of the estimate is minimum when OLS is used. The Gauss-Markov theorem implies that the least squares estimator has the smallest mean squared error of all linear estimators with no bias: squared bias term is the same for all unbiased estimators but the variance part is the minimum for OLS thus cause OLS to have minimum MSE among all unbiased estimators.But sometimes biased estimates can cause a greater reduction in variance than they cause an increase in squared bias. Thus, it is possible to have a smaller MSE with biased estimates. This is bias-variance tradeof. This (along with preventing overfitting) is actually the main motivation behind biased estimates such as Ridge Regression and the LASSO.

Multiple Regression and Gram-Schmidt Procedure

So far we have seen univariate regression (regression with one input variable). Now we will see what happens when we have many of them which we call multivariate regression. In this case it is very likely to have input variables that are correlated with each other. But correlated variables are bad for regression. Why?

When we have input variables that are not correlated with each other, then it means input variables are orthogonal to each other. Thus, each input variable does not have any effect on the coefficients of other input variables. That means since they are orthogonal to each other, the effect of an input variable cannot be captured by the any other input variables. But when they are not orthogonal (some correlation), then the coefficients of correlated variables become shaky and unstable.

In the picture above, the left plot shows two centered (the mean of the vector is substracted from each element of it) orthogonal vectors. Any change along the blue vector axis cannot be captured by the orange vector axis and vice versa: they might change independently. Orthogonality does not necessarily imply uncorrelatedness or independence. It implies uncorrelatedness when the centered vectors have zero inner product because when they are centered their inner product gives the covariance of them. Independence is something related to their joint probability distribution. For example you can fix a point on the orange line and go freely along with the blue line. The right plot, however, shows two correlated vectors (no orthogonality). In this case the orange vector captures the some of the information provided by the blue line because they go in the similar direction. That means a change along the direction of orange line is somewhat captured by the blue line. So the two vectors say the same thing to some extend.

So we have to orthogonalize input variables with respect to each other so that they do not provide the same information. The book gives the following algorithm for this:

Here, z(i) is the residual vector of x(i) after x(i) is orthogonalized with respect to z(0), z(1), …, z(i-1). We have seen how orthogonalization works in the beginning. Suppose we have the input matrix:

This algorithm says that orthogonalize x1 with respect to x0 which is the first columns of 1s. To do it, we have to find orthogonal projection of x1 onto x0, then substract it from x1. To visualize it:

Here x1 is orthogonally projected onto x2, thereby production green vector x1′. Then when we substract x1′ from the blue vector x1, here is the residual vector z1. In other words: z1 is the orthogonalized vector of x1 with respect to x2. Then betas are simply calculated as follows:

In this example, the beta of x1 is the orthogonal projection of y(responses) onto the z1 (orthogonalized(or “adjusted”) version of x1 with respect to x2). When the residual vector zp is small, then it means variable p is correlated with some of the other variables. This causes its beta to be large and any small change in the residual would cause large changes in the beta, making the beta unstable and shaky. That is why we try to avoid correlation.

This entire procedure is called Gram-Schmidtprocedurefor multiple regression. The book says that we can represent the step 2 of the algorithm in matrix form:

where X is the original input matrix, Z has as columns z(j) and the other part is the upper triangular matrix with the coefficients of the residual vectors in the step 2 of the algorithm. Lets see how we can represent this algorithm in this way:

First: matrix Z is constructed like this:

Second: when we left x(s) alone, the equations and the resulting matrix X look like

So we can represent these equations in the matrix form as below:

A simple matrix calculation yields the same equations in the second part. This decomposition looks like QR decomposition, where Z is orthogonal input vectors and the other matrix is an upper triangular matrix. But in order for it to be a real QR decomposition, Z must be orthonormal: vectors must be orthogonal(ortho) and their length must equalt to 1 (normal). So in order to do this, we must divide each column of Z by its corresponding length (a.k.a norm). Here is what normal vector is:

So dividing each column of Z by its corresponding length means multiplying Z with some diagonal matrix:

Let D denote the diagonal matrix on the picture above. Its inverse is simply the same matrix with the diagonal elements 1/a, 1/b, and so on. If the jth diagonal corresponds to the length of z(j) then the following equation gives the QR decomposition:

where Q is orthonormal matrix and R is an upper triangular matrix. If we replace X with QR in our calculation of beta, the equations look like:

This is how they are derived:

Beta estimates are much easier to calculate in this form compared to original algorithm. That is why QR decomposition is used very commonly.