Tuesday, August 22, 2006

On correlation, r, and r-squared

The ballpark is ten miles away, but a friend gives you a ride for the first five miles. You’re halfway there, right? Nope, you’re actually only one quarter of the way there.

That’s according to traditional regression analysis, which bases some of its conclusions on the square of the distance, not the distance itself. You had ten times ten, or 100 miles squared to go – your buddy gave you a ride of five times five, or 25 miles squared. So you’re really only 25% of the way there.

This makes no sense in real life, but, if this were a regression, the "r-squared" (which is sometimes called the "coefficient of determination") would indeed be 0.25, and statisticians would say the ride "explains 25% of the variance." There are good mathematical reasons why they say this, but they mean "explains" in the mathematical sense, not in the real-life sense.

For real life, you can use "r" instead. That’s the correlation coefficient, which is the square root of 0.25, or 0.5. In this example, the r of 0.5 is obviously the value that makes the most sense in the context of getting to the ballpark. Because you really are, in the real-life sense, halfway there.

r is usually the value you use to draw real life conclusions from a regression. According to "The Hidden Game of Baseball," if you regress Runs Scored against Winning Percentage, you get an r of .737, which is an r-squared of .543. A statistician might use the r-squared to say that runs "explains 54.3% of the variation in winning percentage." Which is true if you are concerned with the sums of the squares of the differences – and only a statistician cares about those.

What real people are concerned about is what conclusions we can draw about baseball. And those conclusions are based on the "r", the 0.737. What that tells us is that (a) if a team ranks one standard deviation above average in runs scored, then (b) on average, it will rank 0.737 standard deviations above average in winning percentage. The 73.7% is useful information about the value of runs to winning ballgames. But the 54.3% figure doesn’t tell you anything you need to know.
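The claim in (a) and (b) is just the regression slope expressed in standardized units: if you put both variables in standard-deviation units, the least-squares slope is exactly r. Here's a quick Python sketch of that fact; the data is invented for illustration (only the .737 above comes from the book):

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "runs scored" and "winning percentage" with a built-in
# linear relationship plus noise (illustrative numbers only).
runs = rng.normal(700, 70, size=1000)
wpct = 0.5 + 0.0004 * (runs - 700) + rng.normal(0, 0.03, size=1000)

r = np.corrcoef(runs, wpct)[0, 1]

# Standardize both variables; the least-squares slope through the
# origin in these units is exactly the correlation coefficient.
z_runs = (runs - runs.mean()) / runs.std()
z_wpct = (wpct - wpct.mean()) / wpct.std()
slope = (z_runs @ z_wpct) / (z_runs @ z_runs)

print(round(r, 4), round(slope, 4))  # the two printed numbers match
```

So "one SD above average in runs predicts r SDs above average in winning percentage" is not an extra assumption; it's what the fitted line says once both axes are in SD units.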

I made this point in my review of "The Wages of Wins," where the authors found that payroll "explains only 18%" of wins. They were using r-squared. The r is the square root of .18, which is about .42. Every SD of increased salary leads to an increase of 0.42 SD in wins. In real life, salary explains 42% of wins – although a statistician would probably never put it that way.

Sometimes, the correlation coefficient is used not to predict anything, but just to give you an idea of the relationship between variables. Everyone knows that +1 is a perfect positive relationship, -1 is a perfect negative relationship, and 0 is no relationship at all. And the higher the absolute value of the number, the stronger the relationship. So an r of 0.1 is a weak relationship, but -0.9 is a very strong relationship.

But a "very strong relationship" depends on the context. Sean Forman reports that the year-to-year correlation of players’ batting averages is 0.45. That’s pretty high. But if the game-to-game correlation were 0.45, that would be enormous! It would indicate a huge "hot hand" effect. It would mean that if a player was two hits above average one night – say, he went 3-for-4 instead of 1-for-4 – he would be, on average, 0.9 hits above average the next night. That would mean that a .250 hitter turns into a .475 hitter after a 3-for-4 game!
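The arithmetic in that example, spelled out (this assumes, as a simplification, that next-night expectation regresses toward the mean at exactly the rate r):

```python
# Arithmetic behind the 3-for-4 example, assuming next-night
# expectation = average + r * (tonight's hits above average).
r_game = 0.45               # hypothetical game-to-game correlation
avg_hits = 0.250 * 4        # a .250 hitter averages 1 hit per 4 AB
hits_above = 3 - avg_hits   # a 3-for-4 night is 2 hits above average

predicted = avg_hits + r_game * hits_above   # 1.9 expected hits
print(round(predicted / 4, 3))               # 0.475, the ".475 hitter"
```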

Obviously, if you really did the experiment of computing game-to-game correlations, you’d get a very small number. I’m guessing, but, for the sake of argument, let’s say it might be 0.04.

Now, these two correlations are measuring the same ability – hitting for average. But because of context, an 0.45 can be pretty high in the season case, but earth-shattering in the game case. Conversely, 0.04 is meaningful in the game case, but, in the season case, it would show that batting average is barely a repeatable skill at all.

And so, finding correlations on the order of 90% between his football metrics and the Football Outsiders’ metrics, the author of "The Wages of Wins" writes, "this exercise reveals that there is a great deal of consistency between the Football Outsiders metrics and the metrics we report in The Wages of Wins."

With which I disagree. The interpretation of correlation coefficient depends, again, upon the context. If you were completely ignorant about football statistics, then, yes, a 90% correlation would indicate that you’re measuring roughly the same thing. But given the vast amount of sabermetric knowledge we have about football, 90% could mean the statistics are very different at the margins of knowledge.

For instance, I’d bet that, in baseball, Total Average and Runs Created might correlate on the order of 90%. But, given our knowledge of baseball, we know that Total Average is unsatisfactory in many ways, and the differences are significant at the level of detail that we need for future research. 90% is enough to put Babe Ruth on the top and Mario Mendoza on the bottom. But it’s not good enough to tell the productive base stealers from the unproductive, or give us reliable information about the relative value of hits, or even to distinguish the 55th percentile player from the 45th.

To sum up: in one example, a 0.45 correlation was huge; in another example, a 0.9 correlation was mediocre. If your analysis starts and stops with the correlation coefficient, you really haven’t proven anything at all.

23 Comments:

There was an excellent academic article related to this topic -- using baseball actually -- published in Psychological Bulletin in 1985 (Vol. 97) by Robert Abelson, entitled:

A Variance Explanation Paradox: When a Little is a Lot

Through calculations and simulations, Abelson concludes that, for any single at-bat, the variance in outcome (hit/non-hit) explained by batting skill is only about 1%.

Abelson also notes, however, that "good teams usually win." In other words, even before the season starts, teams whose hitters look good on paper actually tend to do well over the long haul of the season.

Hence the paradox: Batting skill accounts for seemingly microscopic variance in any at-bat, yet teams with hitters identifiable a priori as "good" usually win.

Abelson resolves the conundrum with the idea of cumulativity:

"...a team scores runs by conjunctions of hits, so a team with many high-average batters is more likely to stage rallies than a team with many low-average batters."

For any single PA, it's either safe or out. That makes our var(random) = .474^2.

var(observed) therefore will be .475^2.

Your regression toward the mean therefore will be over 99%.

That's for one PA. But, there are 80 PAs per game, more or less (hitters and pitchers). The var(random) drops down to .053^2.
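As a check on those figures: the .474 is consistent with the binomial SD of a single safe-or-out PA at an on-base rate around .34 (my assumption; the comment doesn't state the rate), and the per-game number is that SD shrunk by the square root of 80:

```python
import math

# One PA is safe/out: a Bernoulli trial. At an assumed on-base
# rate of .34, the SD of a single PA is sqrt(p * (1 - p)).
p = 0.34
sd_one_pa = math.sqrt(p * (1 - p))
print(round(sd_one_pa, 3))            # 0.474

# Averaging over ~80 PAs per game shrinks the random SD by sqrt(80).
sd_per_game = sd_one_pa / math.sqrt(80)
print(round(sd_per_game, 3))          # 0.053
```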

So, it all depends on the number of "trials". In football, you probably have around 150 possessions? Basketball is what, 200? Hockey likely in the 100+ neighborhood? Tennis, 4 matches x 9 games x 6 to 8 points = 250?

The fewer the trials, and the closer the var(true) is to zero, the more luck plays a role. My guess is that tennis has far fewer upsets simply because the trials are so high, and the spread in talent is so much wider.

On the subject of tennis upsets, I think there was a mathematical treatment in "Game, Set, Math" by Ian Stewart. I'll look for it next time I'm at the library.

The conclusion, if I remember correctly, was that the number of trials is so large that the probability that the better player wins is very close to 100%, even if one player is only a bit better than the other.

Another interesting and relevant article is by Dan Ozer, 1985, Psychological Bulletin. Here he shows that there are really two models for understanding the correlation between two variables. The first, the one referred to in this comment as the regression approach, is called the variance decomposition model. Here, the predictor represents some component of a larger outcome criterion. For example, if we were interested in knowing the relationships between gender and voting preference (e.g., republican vs democrat). There are many factors that go into voting preference, and gender is just one. In this case, to understand the amount of overlapping variance, the correlation would be squared. Thus if gender correlated .40 with voting preference, it would be said to explain 16% of the variance. This is the common model presented in stat books.

However, there is another model that is commonly used but not much acknowledged. This is what Ozer refers to as the Common Elements model. Here, the two variables are believed to share a common cause. In this case, the correlation coefficient DOES GIVE THE AMOUNT OF SHARED VARIANCE. We see this model operating most clearly in a test-retest reliability paradigm. Here, the same instrument is given to the same people at two different points in time. Then, scores on the two administrations of the test are correlated. The correlation IS the retest reliability coefficient (as you know, a reliability coefficient is a variance estimate). So if the retest correlation is .80, we say that 80% of the variance is reliable.

Thus, in looking at the baseball example, it needs to be determined whether the predictor is a component of a larger criterion, or if the two variables are themselves reflections of a common, underlying factor. So, one would need to specify the conceptual model that underlies the relationship between the two variables in order to determine whether to square r or not.

The "close to 100%" was in theory for a very simplified model ... it may have assumed each player had the same chance of winning any given point. It may have taken service into account; I'm not sure.

Or, actually, it's more likely that I'm wrong about the "a bit better than the other" part. I remember there was one example where one player randomly won 60% of the points, which is a huge difference in ability.

Ah, 60% is huge! I guess the simple question is: if a guy who wins 60% of the points faces a guy who wins 40% of the points, how often will the first guy win more than 50% of the points over 250 trials? I get 99.9%.
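That 99.9% is easy to check exactly with the binomial distribution, assuming (as above) independent points won at a fixed 60% rate:

```python
from math import comb

# Exact binomial check: if the better player wins each of 250
# independent points with probability 0.6, how often does he win
# the majority (126 or more) of them?
n, p = 250, 0.60
p_majority = sum(comb(n, k) * p**k * (1 - p)**(n - k)
                 for k in range(126, n + 1))
print(round(p_majority, 3))   # 0.999 -- matching the figure above
```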

If the probability we expected was simply 51% to 49% for any single point, the better guy will win 62% of the time. If, let's say, this were the Sampras/Agassi head-to-head record, it shows you how very close they are, and it's only the setup in tennis that allows Sampras to stand out much more.

It could even be that early in the sport, it was realized that players are very close in ability, and they deliberately chose a long-match format so that you could find out who the better players are.

That is, maybe the "length" of the game is deliberately chosen to produce an aesthetically-pleasing frequency of upsets -- not too many, not too few.

If this is true, then maybe a prediction would be that sports in which there was a large variation in team or player ability (when the rules were being made or changed) would have a "shorter" game, and sports in which the variation was smaller would have a "longer" game.

For tennis, this is likely the case, with women. The spread in talent in women's tennis is likely far wider than in men's tennis. To ensure that the same women don't always win, you need fewer games per match.

As for baseball, var(true) for a baseball team is about .060 (which can be calculated in many ways).

var(random) reaches .060, when the number of games played is 69. That is, after 69 games, the "r" is .50.

I don't know what the var(true) for a football team is. I'm sure it's quite a bit higher. Just taking a quick stab at it now, let's say var(true) is .150 for football. To get var(random) to be .150, you need 11 games. That is, after 69 baseball games, you'll know about as much about the true talent of teams as you would after 11 NFL games.

Hmmm... should have checked first. var(obs) in the NHL is .100^2, making var(true) = .083^2. To match var(rand) of .083^2, you need to play 36 games.

So, 12 NFL games, 36 NHL games, and 69 MLB games are equivalent.
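The arithmetic behind those equivalences can be reproduced directly: the random SD of a .500 team's record over n games is sqrt(.25 / n), and setting that equal to the true-talent SD quoted above and solving for n gives the game counts (the formula gives 11 for the NFL's .150, which the thread rounds to 11 or 12):

```python
# Solve sqrt(0.25 / n) = true_sd for n, the number of games at
# which random variation equals true-talent variation.
def games_to_match(true_sd):
    return 0.25 / true_sd**2

# True-talent SDs are the ones quoted in the comments above.
for sport, sd in [("MLB", 0.060), ("NFL", 0.150), ("NHL", 0.083)]:
    print(sport, round(games_to_match(sd)))   # 69, 11, 36
```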

In the NFL, with only 16 games, luck plays a huge role. In the NHL and MLB, both those numbers of games are 43-44% of their respective seasons. There's no "true talent" reason for the NHL to have all those playoff games.

I apologize for the cross-posting here and on my site, but here's another thought.

Which makes me think about the home field advantage. We all know in basketball it's way high. I always figured it was because of travel and fatigue. But, maybe it's something similar here. Let's say that all athletes get a 1% boost by playing at home. In basketball, because of the way the game is laid out (100 possessions per team as opposed to 40 for baseball), they get to keep piling up on that. That is, if basketball were only played for one quarter (25 possessions each), and you looked at the home record, I'm sure it wouldn't be .620. Likely, it'd be something like .530.
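That compounding idea can be sketched with a toy binomial model (my own simplification, not the commenter's calculation): give the home team a 51% chance of winning each possession exchange, award the game to whichever side wins the majority (ties split evenly), and compare a full game's worth of exchanges with one quarter's worth:

```python
from math import comb

# Toy model: the home team "wins" each possession exchange with
# probability p; it wins the game if it wins a majority of the n
# exchanges. Ties (even n) are split 50/50. A small per-possession
# edge compounds as n grows.
def home_win_prob(p, n):
    win = sum(comb(n, k) * p**k * (1 - p)**(n - k)
              for k in range(n // 2 + 1, n + 1))
    if n % 2 == 0:
        tie = comb(n, n // 2) * p**(n // 2) * (1 - p)**(n // 2)
        win += tie / 2
    return win

print(round(home_win_prob(0.51, 100), 3))  # full game: bigger edge
print(round(home_win_prob(0.51, 25), 3))   # one quarter: smaller edge
```

The exact numbers depend on the made-up 51%, but the direction matches the comment: the same per-possession boost produces a larger home winning percentage over more possessions.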

This relationship between r-squared and the coefficient of correlation only holds in a univariate regression setting. And the discussion of "real life" use of statistics sounds like you mean "people who can't be bothered to take the time to learn statistics."

Very interesting article and comments. I have one comment about the use of r versus r^2. I'm always looking for simple explanations of statistics and I think r^2 can actually have a practical and intuitive interpretation in some cases. Suppose, you are looking at the correlation between team winning percentage and runs scored. Here is how I would explain r^2 in this case: The winning percentages vary a lot. Why are they different? One of the reasons is differences in runs scored. How much of the variance in winning percentages is explained by runs scored? About 33%. How much of the variance is explained by runs and runs allowed together? About 83%.
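For what it's worth, that univariate-versus-multivariate R-squared comparison is easy to reproduce on synthetic data (the numbers below are invented; the 33% and 83% above come from actual team data). Adding runs allowed as a second predictor can only raise R-squared:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic seasons (assumed numbers, for illustration): winning
# percentage driven by runs scored and runs allowed, plus noise.
n = 500
runs_for = rng.normal(700, 70, n)
runs_against = rng.normal(700, 70, n)
wpct = (0.5 + 0.0004 * (runs_for - 700)
        - 0.0004 * (runs_against - 700)
        + rng.normal(0, 0.02, n))

def r_squared(X, y):
    # OLS with an intercept; R^2 = 1 - SS_res / SS_tot.
    X = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    return 1 - resid.var() / y.var()

r2_one = r_squared(runs_for, wpct)
r2_both = r_squared(np.column_stack([runs_for, runs_against]), wpct)
print(round(r2_one, 2))    # runs scored alone
print(round(r2_both, 2))   # runs scored and runs allowed: higher
```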

>"How much of the variance in winning percentages is explained by runs scored? About 33%."

I'd argue that "variance" is not an intuitive real-life concept. Standard deviation is, but not variance. If adult male height has a standard deviation of 4 inches, that makes sense. But if the variance is 16 inches squared, what the heck does *that* mean to the non-statistician?

These data are interesting – I’d wondered about them for a long time, especially as an Australian used to (Australian Rules) football and to cricket matches lasting over two innings and five days.

In (Australian Rules) football, each team gets around three hundred possessions per game, which means that even the short 22-game season (necessitated by wear upon legs and knees from contact with Australia’s hard, dry turf and ancient soils) is plentifully adequate to determine beyond doubt which teams are best. In fact, five Australian Football League games would, at the possession rate of football noted in the last sentence, reveal as much skill as the 12 NFL games, 14 NBA games, 35 NHL games and 69 MLB games noted as equivalent in the post above. In first-class cricket, where the most skilled batsmen can even in practice receive up to six hundred balls of bowling, one or two games might be enough to reveal as much skill as a full season of baseball. (In the more popular modern forms of limited-overs and especially 20/20 cricket, batsmen can only receive as many balls as in baseball, so skill revealed is vastly less).

It seems as if there is an “inverse square” law between number of possessions and amount of skill revealed: in the NBA each team gets just over twice as many possessions per game as baseball, yet one fifth as many games reveals equivalent skill. The notion that one game of (Australian Rules) football reveals the same amount of skill as fourteen baseball games would fit in with this “inverse square” theory if each team in football obtains three-and-a-half to four times as many possessions per game in (Australian Rules) football as in baseball. This “inverse square” law may relate to the relationship between standard deviation (result) and variance (skill) but such a hypothesis would need testing.