Tuesday, September 04, 2012

Multicollinearity

In baseball, a run is worth about a tenth of a win. So, if I did a regression to predict wins from runs scored (RS), I'd probably get an equation something like

Wins = 0.1*RS + 11

Bill James' "Runs Created" (RC) is a statistic that's a reasonably unbiased estimate of runs scored. So, if I did a regression to predict wins from Runs Created, I'd get roughly the same thing:

Wins = 0.1*RC + 11

(Actually, the coefficient might be a little lower, because of random prediction error, but never mind.)

Now, what if I try to do a regression to predict wins from *both* RS and RC? What would happen?

When I asked myself this question, my first reaction was that RS would still be worth 0.1, and RC would be worth close to zero, and not significant. Because, who needs RC when you have RS? So, maybe it would be something like this:

Wins = 0.1*RS + 0.007*RC + 11

And then it occurred to me: how would the regression "know" that RS is more important than RC? Might it not keep the 0.1 coefficient on the RC, instead of the RS? Like this:

Wins = 0.02*RS + 0.1*RC + 11

Or, why wouldn't it just give half credit to each?

Wins = 0.05*RS + 0.05*RC + 11

And, when I thought about it, I realized that any combination of coefficients that adds up to 0.1 is possible, like

Wins = 0.03*RS + 0.07*RC + 11
Wins = 0.54*RS - 0.44*RC + 11

And so on. So, I wondered, which one is correct? And why?

The answer, it turns out, is that you can't really predict what will happen. The coefficients are very heavily dependent on the data, and the random error in the data.

That's because RC and RS are very highly correlated. It's a well-known problem that when two of your explanatory variables are highly correlated, their individual coefficients become unstable and hard to interpret. That's called "multicollinearity".
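To see the instability without any real baseball data, here's a rough simulation sketch. Everything in it is an assumption made for illustration: the 24 teams per season, the spread of runs scored, the size of the RC estimation error, and the noise in wins are all made-up numbers. Only the "true" equation, Wins = 0.1*RS + 11, comes from the discussion above.

```python
import numpy as np

def simulate_season(rng, n_teams=24, rc_noise_sd=15.0, win_noise_sd=4.0):
    """Fit Wins ~ RS + RC on one synthetic season; return (coef_RS, coef_RC)."""
    rs = rng.normal(700.0, 60.0, n_teams)             # runs scored (made-up spread)
    rc = rs + rng.normal(0.0, rc_noise_sd, n_teams)   # RC = RS plus estimation error
    wins = 0.1 * rs + 11.0 + rng.normal(0.0, win_noise_sd, n_teams)
    design = np.column_stack([rs, rc, np.ones(n_teams)])
    coefs, *_ = np.linalg.lstsq(design, wins, rcond=None)
    return coefs[0], coefs[1]

rng = np.random.default_rng(0)
fits = np.array([simulate_season(rng) for _ in range(2000)])
rs_coefs, rc_coefs = fits[:, 0], fits[:, 1]
coef_sums = rs_coefs + rc_coefs

print("SD of RS coefficient:  ", rs_coefs.std())
print("SD of RC coefficient:  ", rc_coefs.std())
print("SD of their sum:       ", coef_sums.std())
print("mean of their sum:     ", coef_sums.mean())
```

Under these assumptions, the individual coefficients should swing from one synthetic season to the next much more than their sum does, and the sum should sit near 0.1.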

Here's the example for real. I ran the RS/RC regression for three different MLB seasons, 1973, 1974, and 1975. Here are the three equations I got:

The coefficients jump around a lot, and the standard errors are large. For the 1975 regression, none of the coefficients are statistically significant. (But if I take out RC and just use RS, I get significance at 6 SD.)

As I said, this is established knowledge. But I still wanted to understand, intuitively, why it happens. This is how I explained it to myself.

-------

Suppose I run a regression where I expect Y to be equal to X -- maybe I'm trying to predict the percentage of red balls drawn from an urn with replacement, based on the percentage of red balls actually in the urn. I create a dataset that looks like this:

Y   X
------
56  51
72  70
44  45
67  82
90  93
...

Maybe the dataset has 1,000 rows. I run the regression, and I get the best fit equation,

Y = 1*X + 0

Well, I don't get exactly that, because of random variation, but close.

Now, maybe I had two people independently counting the contents of the urns, and I expect the counts to be slightly different. I can try to reconcile the differences, but I figure, hey, why not let the regression do it? So I just add a second independent variable, for the second person's count.

As it turns out, there were no errors, and the counts are exactly the same. So the dataset looks like:

Y   X1  X2
----------
56  51  51
72  70  70
44  45  45
67  82  82
90  93  93
...

When I run the regression, the software tells me it can't do it: the regression matrix is "singular," because my independent variables are perfectly correlated (which, obviously, they are).

What that means, in English, is that there are an infinite number of regression equations that work. For instance, I can just use X1:

Y = 1*X1 + 0

Or I can just use X2:

Y = 1*X2 + 0

Or I can use half of each:

Y = 0.5*X1 + 0.5*X2

In fact, I can use any combination in which the coefficients add to 1. Because I have perfect "multicollinearity," I have infinitely many solutions.
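Here's a small sketch of what "singular" means in practice, using just the five sample rows from the dataset above and assuming NumPy. The list of candidate coefficient pairs is my own choice for illustration; any pair that sums to 1 would do.

```python
import numpy as np

# The five sample rows, with both counts identical.
y = np.array([56.0, 72.0, 44.0, 67.0, 90.0])
x1 = np.array([51.0, 70.0, 45.0, 82.0, 93.0])
x2 = x1.copy()

# Two columns, but only one column's worth of information.
design = np.column_stack([x1, x2])
rank = np.linalg.matrix_rank(design)
print("rank of [X1, X2]:", rank)

# Any pair of coefficients that sums to 1 gives exactly the same predictions,
# and therefore exactly the same sum of squared errors.
candidates = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5), (6.0, -5.0)]
predictions = [a * x1 + b * x2 for a, b in candidates]
for (a, b), pred in zip(candidates, predictions):
    sse = float(np.sum((y - pred) ** 2))
    print(f"Y = {a}*X1 + {b}*X2  ->  sum of squared errors = {sse}")
```

Since every candidate fits equally well, least squares has no basis for preferring one over another, which is why the software gives up.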

Now, let's make one small change -- suppose there was one difference in the count, in the first row:

Y   X1  X2
----------
56  51  50
72  70  70
44  45  45
67  82  82
90  93  93
...

Now, what happens? Well, X1 and X2 are no longer absolutely perfectly correlated, so the regression can come up with an answer. This is it:

Y = 6*X1 - 5*X2

Why that one? Because it meets the criterion that, for the other 999 lines, the coefficients have to add to 1. And it minimizes the squared error for the first line, at 0, because it works out exactly (since 6 * 51 - 5 * 50 equals 56).

But the random error could have happened at any line. What if it had been the second line? And, what if it was the same error, just off by one, like this:

Y   X1  X2
----------
56  51  51
72  70  69
44  45  45
67  82  82
90  93  93
...

You'd expect the regression equation to be roughly the same, right? It's still the same dataset in 998 lines out of 1000, and the other two lines just changed by 1. But it's not. The coefficients have shrunk:

Y = 3*X1 - 2*X2
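You can reproduce this behavior with simulated urns. The sketch below is not the dataset from this post: it assumes 1,000 made-up urns, 100 draws each, and a second counter who is off by one in a single row. It also fits an intercept, which the equations above leave out. The function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000
x1 = rng.integers(10, 91, size=n).astype(float)   # percent red in each urn
y = rng.binomial(100, x1 / 100.0).astype(float)   # percent red in 100 draws

def fit_with_miscount(row):
    """Fit Y = a*X1 + b*X2 + c when X2 equals X1 except X2 = X1 - 1 in `row`."""
    x2 = x1.copy()
    x2[row] -= 1.0
    design = np.column_stack([x1, x2, np.ones(n)])
    (a, b, c), *_ = np.linalg.lstsq(design, y, rcond=None)
    return a, b, c

def gap_in_row(row):
    """How far Y in `row` sits from the line fitted to the other 999 rows."""
    keep = np.arange(n) != row
    design = np.column_stack([x1[keep], np.ones(n - 1)])
    (slope, intercept), *_ = np.linalg.lstsq(design, y[keep], rcond=None)
    return y[row] - (slope * x1[row] + intercept)

results = {}
for row in (0, 1):
    a, b, c = fit_with_miscount(row)
    results[row] = (a, b, gap_in_row(row))
    print(f"miscount in row {row}: X1 coef = {a:.3f}, X2 coef = {b:.3f}, "
          f"sum = {a + b:.3f}, gap in that row = {results[row][2]:.3f}")
```

In this setup the X2 coefficient comes out as minus the gap in the miscounted row, while the two coefficients still sum to roughly 1. Move the miscount to a different row and both coefficients move with that row's gap.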

The question is: why did the coefficients vary so much?

The answer, as I see it, is that what's really important is not X2 -- it's the *difference* between X1 and X2. And we don't have a variable for that. So, when the estimate for (X1 - X2) changes, it has to drag the coefficients of both X1 and X2 along with it.

Suppose we did have that variable. Then the two regressions would come out as

Y = X1 + 2*(X1 - X2) for this case, and
Y = X1 + 5*(X1 - X2) for the other case.

That's a much easier way to understand it, and it keeps the X1 coefficient constant, at 1.00. Except ... we don't have a variable for X1 - X2. So the regression does the expansion, and gets

Y = 3*X1 - 2*X2 for this case, and
Y = 6*X1 - 5*X2 for the other case.

It's the same equation, just arranged differently, and therefore harder to interpret.

The obvious way around this is to use (X1-X2) as the second variable. Call it X3:

Y   X1  X3
----------
56  51  0
72  70  1
44  45  0
67  82  0
90  93  0
...

Now, the independent variables (X1 and X3) are no longer highly correlated. And, so, we get the "correct" coefficient for X1:

Y = X1 + 2*X3 for this case, and
Y = X1 + 5*X3 for the previous case.

What happens is, this time, you get the wide confidence interval for X3, but a narrow one for X1, which is really how you want to look at it.
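To put numbers on those confidence intervals, here's a sketch that fits both versions of the regression to the same simulated data and compares standard errors. Again, the data are made up (1,000 urns, 100 draws each, one miscount in the second row), and `ols_with_se` is a hypothetical helper that uses the textbook formula for OLS standard errors, not any particular package's routine.

```python
import numpy as np

def ols_with_se(design, y):
    """OLS coefficients and standard errors from sigma^2 * inverse(X'X)."""
    coefs, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coefs
    dof = design.shape[0] - design.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(design.T @ design)
    return coefs, np.sqrt(np.diag(cov))

rng = np.random.default_rng(2)
n = 1000
x1 = rng.integers(10, 91, size=n).astype(float)   # percent red in each urn
y = rng.binomial(100, x1 / 100.0).astype(float)   # percent red in 100 draws
x2 = x1.copy()
x2[1] -= 1.0          # the single off-by-one miscount
x3 = x1 - x2          # 1 in the miscounted row, 0 everywhere else
ones = np.ones(n)

coefs_12, se_12 = ols_with_se(np.column_stack([x1, x2, ones]), y)
coefs_13, se_13 = ols_with_se(np.column_stack([x1, x3, ones]), y)

print("using X1 and X2: X1 coef =", coefs_12[0], " SE =", se_12[0])
print("using X1 and X3: X1 coef =", coefs_13[0], " SE =", se_13[0])
print("using X1 and X3: X3 coef =", coefs_13[1], " SE =", se_13[1])
```

Both fits make identical predictions. The difference is only in where the uncertainty shows up: with X3 in the model, nearly all of it lands on X3, and the standard error for X1 shrinks dramatically.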

-------

If that's not intuitive enough, try this:

Every line except one has X1 and X2 the same. In those lines, the regression gets no more information from X2 than it already had from X1. Therefore, for X2, the regression is *completely* dependent on that one line in which they're different. In other words, the regression will assume that the difference between X1 and X2 is *completely* responsible for the error in that line.

If the error in that line is 5 (for instance, Y=60 and X1=55), and the difference between X1 and X2 is 1, the regression will see that you need 5 differences to fill the gap. Therefore, you need 5(X1-X2). Therefore, the coefficient of X2 has to be -5.

If the error in that line were 1 (Y=77, X1=76), the regression will see you need the coefficient of X2 to be -1.

If the error in that line were 10 (Y=62, X1=52), the regression will see that the coefficient of X2 has to be -10.

That's why the coefficient of X2 varies so much -- it has to vary with the size of the error on the row where it's different from X1. That error can be large: if you draw 100 balls, with replacement, from an urn that's 50 percent red, the SD of the number of red balls drawn is 5. That means the SD of (Y-X1) is 5. Therefore, the SD of the coefficient of X2 is 5.

And, since the coefficients of X1 and X2 have to add to 1, the coefficient of X1 has to vary just as much as the coefficient of X2, with an SD of about 5.
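The SD of 5 is just the binomial formula, sqrt(n*p*(1-p)), with n = 100 and p = 0.5. Here's a quick check by simulation, assuming NumPy. The 100,000 repetitions and the variable names are arbitrary choices of mine.

```python
import numpy as np

rng = np.random.default_rng(3)
n_draws = 100      # balls drawn from the urn, with replacement
pct_red = 50       # the urn is 50 percent red, so X1 = 50
p = pct_red / 100.0

# Y - X1 for the one odd row, over many hypothetical repetitions.
gaps = rng.binomial(n_draws, p, size=100_000) - pct_red

theoretical_sd = np.sqrt(n_draws * p * (1 - p))   # binomial SD
print("theoretical SD of (Y - X1):", theoretical_sd)
print("simulated SD of (Y - X1):  ", gaps.std())
```

For urns that aren't close to 50 percent red, the SD would be somewhat smaller, so "about 5" is the worst case here.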

--------

More formally: in the example that I've been using, where only one line has X1 and X2 differing (with X1 larger by 1), you can do some algebra and figure that the coefficient of X1 has to be equal to

1 + (Y - X1)

where Y and X1 are the values for that line. (Check: in the first example, 1 + (56 - 51) = 6.) The SD of 1 + (Y - X1), for any given line, is about 5. When the "real" value of the coefficient is 1, but it has an SD of 5 ... that's why the coefficient jumps around a lot, and that's why it's often not statistically significant.

-------

Do any of these explanations make sense? I'm finding this harder to explain smoothly than other stuff, but I hope you get the idea.