0:01 Many of the research papers we talked about, and many others we will talk about later, use a technique called linear regression to find out the relationship between two variables. For example, to find out whether higher wage levels reduce crime, we collect data on wages and crime, and run a linear regression to examine how the two variables are related. In this video and the following exercise, we will talk about the basic intuition behind linear regression and how we can actually run it using simple data. I usually use a computer program called Stata to run linear regressions and other empirical analyses, but there are many other software packages, such as MATLAB, R, and Microsoft Excel, that allow you to run linear regression.

0:48 For now, we will just run a very simple linear regression, and for this task, Excel works just fine. But please feel free to check out other online courses and textbooks on econometrics to learn more details about linear regression. I think the easiest way to motivate linear regression is to see it as a "find the best-fitting line" exercise. Here is the data on larceny and unemployment rates from the 200 largest U.S. counties in 2000. From last week, we know we can find the official crime data for the United States on the FBI UCR website. I also obtained the county-level unemployment data from the U.S. Census website. Both data sets are publicly available online.

1:37 Now let's use Excel's scatterplot function to visualize the data. The x-axis represents the unemployment rate and the y-axis represents the larceny rate. To go back to my main question, I want to find out the relationship between unemployment and larceny rates. From this figure, I would say that the relationship is positive. In areas with high unemployment rates, larceny rates tend to be high as well. But we want more details. We want to quantify the relationship, and be able to say things like, "When unemployment goes up by X%, larceny will go up by Y%." How can we do this? For now, let's assume that the true relationship between unemployment and larceny is linear.
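As a rough sketch of what such a quantification looks like outside Excel, here is a short Python example using NumPy's polyfit. The unemployment and larceny numbers below are made up for illustration; they are not the actual FBI UCR or Census figures used in the video.

```python
import numpy as np

# Hypothetical county-level data (made up for illustration):
# unemployment rate (%) and larceny rate (per 100,000 residents).
unemployment = np.array([3.1, 4.5, 5.2, 6.8, 7.9, 9.3])
larceny = np.array([1800, 2100, 2400, 2600, 3100, 3300])

# Fit a degree-1 polynomial, i.e. a straight line y = a + b*x.
# polyfit returns coefficients from highest degree down: [slope, intercept].
b, a = np.polyfit(unemployment, larceny, 1)
print(f"slope b = {b:.1f}, intercept a = {a:.1f}")
```

A positive slope b would mean that each one-point rise in the unemployment rate is associated with roughly b more larcenies per 100,000 residents in this made-up sample.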

2:36 We know that a line can be represented by a linear function: y = a + bx. In this case, we want to recover the linear function that gives us the best fit of our data on unemployment and larceny. So how do we find a line that fits our data points the best? Different people may have different ideas about how to define the best-fitting line, but the most widely used approach is to choose the line where the sum of squared errors is minimized. To give a concrete example, let's look at this figure, which has five dots. I drew a line to fit these five dots.

3:19 But it is clear that no matter how hard I try, there is no way I can perfectly fit these five dots on a straight line. No matter how I draw the line, there will always be some error, meaning that there will be some difference between the actual value of my data point and the value predicted by the fitted line. In this figure, the five blue dots represent the actual values of my data points, and the five green dots represent the values predicted by the fitted line. The differences between the actual values and the predicted values are the errors associated with my fitted line.
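To make these errors concrete, here is a small Python sketch with five hypothetical dots and a hand-drawn candidate line (both made up for illustration, not the dots from the figure). It computes each predicted value, the error (actual minus predicted), and the sum of squared errors for the candidate line.

```python
# Five hypothetical dots (x, y) and a hand-drawn candidate line y = a + b*x.
# Both the dots and the a, b values are made up for illustration.
dots = [(1, 2.2), (2, 2.8), (3, 4.1), (4, 4.5), (5, 6.0)]
a, b = 1.0, 1.0  # intercept and slope of the candidate line

sse = 0.0
for x, y in dots:
    predicted = a + b * x   # value on the fitted line (a "green dot")
    error = y - predicted   # actual minus predicted (the gap to the "blue dot")
    sse += error ** 2
    print(f"x={x}: actual={y}, predicted={predicted:.1f}, error={error:+.1f}")
print(f"sum of squared errors = {sse:.2f}")
```

A better-fitting choice of a and b would give a smaller sum of squared errors; linear regression searches for the values that make it as small as possible.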

4:04 Intuitively, if we want to have a line that fits these data points well, we want to draw the line so that the errors are as small as possible. To be more exact, for a linear function y = a + bx, we want to choose the values of a and b

4:22 that will minimize the sum of squared errors: (y1 - a - bx1)^2 + (y2 - a - bx2)^2, and so on. Here x1 and y1 refer to the first dot, x2 and y2 to the second dot, and so on. Why do we want to choose the values of a and b that minimize the sum of squared errors rather than just the sum of errors? That's because, if we tried to minimize the sum of errors, we would run into an obvious problem. In the graph we just saw, some errors were positive and some errors were negative.

5:04 And if we just add up these errors, the errors with opposite signs will cancel each other out, and the sum of errors will be pretty small, even though the fitted line is not really doing a great job of predicting the values of my data points. On the other hand, when we add the squares of the error terms, because all squared terms are positive, we won't have to worry about such a problem. When we have just a few data points, it is actually not that hard to do the linear regression by hand using simple calculus. But when we have hundreds or thousands of data points, it's probably better to let a computer do the computation.
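For readers curious about the by-hand calculus result mentioned here, this Python sketch applies the standard least-squares formulas to five hypothetical dots (made up for illustration). Setting the derivatives of the sum of squared errors with respect to a and b to zero gives b = sum((x - x̄)(y - ȳ)) / sum((x - x̄)²) and a = ȳ - b·x̄. The sketch also checks that the raw errors of this line cancel out to roughly zero, which is exactly the problem with minimizing the plain sum of errors.

```python
# Least-squares formulas from simple calculus, applied to hypothetical dots:
#   b = sum((x - xbar)*(y - ybar)) / sum((x - xbar)**2)
#   a = ybar - b * xbar
dots = [(1, 2.2), (2, 2.8), (3, 4.1), (4, 4.5), (5, 6.0)]
n = len(dots)
xbar = sum(x for x, _ in dots) / n
ybar = sum(y for _, y in dots) / n

b = sum((x - xbar) * (y - ybar) for x, y in dots) / \
    sum((x - xbar) ** 2 for x, _ in dots)
a = ybar - b * xbar

# For the least-squares line, the positive and negative errors cancel,
# so the plain sum of errors is (numerically) zero even though the fit
# is not perfect -- this is why we minimize squared errors instead.
raw_error_sum = sum(y - (a + b * x) for x, y in dots)
print(f"a = {a:.3f}, b = {b:.3f}, sum of raw errors = {raw_error_sum:.1e}")
```

With hundreds or thousands of data points, a statistical package applies exactly these formulas; the arithmetic is just too tedious to do by hand.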