Current Economics Research, Teaching Materials, and Side Projects

Why do we need n-2? An example in R

Below is a simple example showing why we may want the as our estimates of , when our naive intuition may suggest we only want the simple average of squared errors .

To show this in no uncertain terms, I have coded a linear regression by hand in R. Also embedded in the work below are several rules I follow about writing code. They are rules 0-6. There are many other rules, since code writing is an art.

####Coding in R
#### Rule 1: Always comment on every few lines of code. It is not unheard of to comment every single line, particularly for new coders, or complex code.
#### You will need to reference your work at a later date, and after about 3 months, the purpose is lost. Also, I need to read it.

#### Rule 2: Define your variables first. Luckily these names are shared for us.
#### For your projects, use names which are clear for your research: (y=crime in Williamsburg, VA, X= Number of jello puddings consumed)

#### Rule 3: After definitions begin your second stage of work. Probably trimming existing data, etc. Do these in the order they were added.
hat_b_1<-sum( (x-mean(x)) * (y-mean(y)) ) / sum( (x-mean(x))^2 ) #Spaces between any parenthesized section of operations. We need to be able to see which parentheses are which.
hat_b_1 # Rule 4: Indent work which is conceptually subordinate. Indent more as needed. Four spaces=1 tab.
hat_b_0<-mean(y)-hat_b_1*mean(x)
hat_b_0 # Rule 5: Check your work as you go along. For our example, I got 9.89

abline(a=hat_b_0, b=hat_b_1, col=”red”) #let’s add a red line of best fit. And we must see how our plot looks. Repeat rule 0.

#### Rule 6: Keep your final analysis as punchy and short as possible without sacrificing clarity.
#### The mean sum of the squared errors (usually unknown to us as researchers)
sigma_sq<-sum(u^2)/n #this is the value we’re trying to estimate
sigma_sq_naive<-sum(hat_u^2)/n #this is a naive estimation of it
sigma_sq_hat<-sum(hat_u^2)/(n-2) #this turns out to be more accurate, particularly in small samples. If n->infinity this goes away. Try it for yourself!

#R, is this assessment true? Is sig_sq_hat a better estimator of sig_sq than our naive estimator? Is it true we need the (-2)?
(sigma_sq-sigma_sq_naive) > (sigma_sq-sigma_sq_hat)

Here is one of several plots made by this code, showing a nice linear regression over the data:

Please don’t forget the derivation of why this is true! This is simply some supportive evidence that it might be true.