I am looking at some simple regression models using both R and Python's statsmodels package. I've found that, when computing the coefficient of determination, statsmodels uses the following formula for $R^2$:
$$
R^2 = 1 - \frac{SSR}{TSS} \qquad \left(\text{centered: } TSS = \sum_i (y_i - \bar{y})^2\right)
$$
where $SSR$ is the sum of squared residuals and $TSS$ is the total sum of squares of the model. ("Centered" means that the mean has been subtracted from the series before squaring.) However, the same calculation in R yields a different result for $R^2$: R appears to be computing
$$
R^2 = 1 - \frac{SSR}{TSS} \qquad \left(\text{uncentered: } TSS = \sum_i y_i^2\right)
$$
So, what gives? Presumably there's some reason to prefer one over the other in certain situations. I haven't been able to find any information online about the cases where one of the above formulae should be preferred.
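To make the difference concrete with numbers, here is a minimal NumPy sketch (data and seed are made up for illustration): the same fit produces one $SSR$, but the two denominators give two different values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 3.0 + 0.5 * x + rng.normal(size=100)

# Fit the two-parameter model y = a0 + a1*x, i.e. what R's lm(y ~ x) fits.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ssr = np.sum((y - X @ beta) ** 2)

# Same SSR, two different denominators:
r2_centered = 1 - ssr / np.sum((y - y.mean()) ** 2)
r2_uncentered = 1 - ssr / np.sum(y ** 2)
print(r2_centered, r2_uncentered)  # the uncentered value is larger here
```

Because the uncentered denominator $\sum_i y_i^2$ is never smaller than the centered one, the uncentered $R^2$ is at least as large whenever $\bar{y} \neq 0$.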


OK, I thought I'd follow up on this. I've been struggling with the answers here a bit, and I've come to a better understanding of the problem. For posterity, I also think a full explanation of *why* there are two different forms of this equation for $R^2$ would benefit anyone who stumbles upon this thread. I don't know if this is common knowledge or not, but no one seems to explain why there are two forms (possibly a lot of people just don't know, or possibly it's so basic that everyone is expected to "just know"). That includes several sets of lecture notes by professors at major universities, so perhaps I'm just not looking in the right places.

The reason for the two different equations above is that you're comparing the model against the null hypothesis, which here is "there is no relationship between the dependent and independent variables." That means taking the slope to be zero. Another way to say this is that you're comparing the regression model you build to a nested model with one fewer parameter.

Now, suppose we have a set of data with one independent variable (x) and one dependent variable (y). We have two choices:

We choose to model the relationship between x and y with a one-parameter linear model, namely $y_i = a_1 x_i + \epsilon_i$. The null hypothesis is that there is no relationship between x and y, so the corresponding null model is $y_i = \epsilon_i$. In other words, the null model is just white noise. Under this null, $\mathbb{E}(y) = 0$, so the correct form of $R^2$ is
$$
R^2 = 1- \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i y_i^2}
$$

A good way to think about this is the following: suppose the null hypothesis were exactly correct, and there truly were no relationship between x and y. What would we expect? If anything's fair, the answer is "We expect $R^2 = 0$."

In the case where we choose a two-parameter model, $y_i = a_0 + a_1 x_i + \epsilon_i$, the null model is $y_i = a_0 + \epsilon_i$, and under that null we expect $\bar{y} = \hat{y}_i = a_0$. If this isn't obvious, try drawing the picture with the fitted value under the null hypothesis $\hat{y}_i$, the data point $y_i$, and the average $\bar{y}$. If the null model is correct, then as the number of data points goes to infinity you should be able to see graphically that $\bar{y} = \hat{y}_i = a_0$.

Conversely, using the same picture as above, in the one-parameter case the null hypothesis gives $\hat{y}_i = \bar{y} = 0$. There's a slight rub here, because you have to worry about how these quantities go to zero; a L'Hôpital-style argument shows that, in this case at least, the $0/0$ limit is $0$, and everything is OK.
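The bookkeeping behind the two denominators becomes clearer if you expand the uncentered total sum of squares around the mean:
$$
\sum_i y_i^2 = \sum_i (y_i - \bar{y})^2 + n\bar{y}^2
$$
so the centered and uncentered denominators (and hence the two forms of $R^2$) agree exactly when $\bar{y} = 0$, which is precisely what the one-parameter null model asserts.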

You can see why funny things happen with $R^2$ (like negative values) if you use the wrong form of the equation. I noticed this first because the statsmodels package in Python does one thing, and R does another. It pains me to say it, but R is right and statsmodels is wrong. (Well, not really "pains"...)
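A minimal NumPy sketch of both effects (my own made-up data): when the one-parameter null is true, the uncentered $R^2$ lands near zero as argued above; when $\bar{y}$ is far from zero and you score a no-intercept fit with the centered formula, you get the "funny" negative values.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)

# Case 1: the one-parameter null is true -- y is zero-mean white noise.
# The uncentered R^2 of a slope-only fit comes out near zero.
y0 = rng.normal(size=1000)
a1 = x @ y0 / (x @ x)                  # least-squares slope, no intercept
ssr = np.sum((y0 - a1 * x) ** 2)
r2_null = 1 - ssr / np.sum(y0 ** 2)

# Case 2: y has mean ~5 but is still unrelated to x.  Scoring the same
# slope-only fit with the *centered* formula produces a negative R^2;
# the uncentered formula stays in [0, 1).
y5 = 5.0 + rng.normal(size=1000)
a1 = x @ y5 / (x @ x)
ssr = np.sum((y5 - a1 * x) ** 2)
r2_wrong_form = 1 - ssr / np.sum((y5 - y5.mean()) ** 2)
r2_right_form = 1 - ssr / np.sum(y5 ** 2)

print(r2_null, r2_wrong_form, r2_right_form)
```

In case 2 the residuals are essentially all of $y$, so $SSR$ dwarfs the centered denominator $\sum_i (y_i - \bar{y})^2$ and the centered formula goes far below zero.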

I would love some feedback on this intuition. I have only found one reference where this is explained explicitly: see Section 5.3.6 of this pdf file (download here). Additionally, the other linked answer on Stack Exchange alludes to this fact, but the reasoning wasn't completely clear to me (no offense to the person who answered; it's a very well-written response, and I can be dense at times!).

Again, please correct my reasoning in the comments, and I will amend the post until it is acceptable.

I guess that, as Stephane hinted in the comment, all the difference is in how you specify your model. If you specify the same model, the $R^2$ will be the same in both cases.

I will post some Python code to show that afterward, but first a word of caution: statsmodels' OLS function does not add the intercept automatically, while R's formula interface does, so this may be the origin of your difference. If this makes you uncomfortable, try using the new formula syntax.

With the following code you can verify that they both give you the same value (statsmodels is actually tested to give the same results as R).
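(The code block itself seems to have been lost from the post. Here is a minimal sketch of the kind of check meant, with my own made-up data: fit the with-intercept model by hand and confirm the centered $R^2$ equals the squared Pearson correlation, which is the number both R's `summary(lm(y ~ x))` and statsmodels' `sm.OLS(y, sm.add_constant(x)).fit().rsquared` report for this model.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=300)
y = 1.0 + 2.0 * x + rng.normal(size=300)

# Fit the with-intercept model by hand, as R's lm(y ~ x) does.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
ssr = np.sum((y - X @ beta) ** 2)
r2_centered = 1 - ssr / np.sum((y - y.mean()) ** 2)

# For simple regression with an intercept, the centered R^2 equals the
# squared Pearson correlation -- a package-free cross-check.
r2_corr = np.corrcoef(x, y)[0, 1] ** 2
print(r2_centered, r2_corr)  # the two agree
```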

It was indeed the source of the difference. I talked a bit with one of the statsmodels devs, and I think the updated version (I can't remember version numbers by now!) behaves as R does, i.e., it calculates the correct $R^2$ given the model.
– BenDundee, Mar 14 '13 at 20:45

Yes, I guess that at the time of the question they could actually have behaved differently. I just chose to answer in order to give an update on the problem. Statsmodels is IMHO a good Python alternative for statistics and deserves to be promoted a little :) BTW, compliments on your answer, it was really educational.
– EnricoGiampieri, Mar 14 '13 at 21:13