Sabermetric Research

Phil Birnbaum

Tuesday, May 19, 2009

Don't always blindly insist on statistical significance

Suppose you run a regression, and the input you're investigating appears to have a real-life relationship to the output. But it also turns out that, despite being significant in the real-life sense, the relationship is not statistically significant. What do you do?

David Berri argues (scroll down to the second half of his post) that once you realize the variable is statistically insignificant, you stop dead:

We do not say (and this point should be emphasized) the “coefficient is insignificant” and then proceed to tell additional stories about the link between these two variables.

One of my co-authors puts it this way to her students.

“When I teach econometrics I tell my students that a sentence that begins by stating a coefficient is statistically insignificant ends with a period.” She tells her students that she never wants to see “The coefficient was insignificant, but…”

Well, I don't think that's always right. I explained why in a post two weeks ago, called "Low statistical significance doesn't necessarily mean no effect." My argument was that, if you already have some reason to believe there is a correlation between your input and your output, the result of your regression can help confirm your belief, even if it doesn't rise to statistical significance.

Here's an example with real data. I took all 30 major league teams for 2007, and I ran a regression to see if there was a relationship between the team's triples and its runs scored. It turned out that there was no statistically-significant relationship: the p-value was 0.23, far above the 0.05 that's normally regarded as the threshold.

Berri would now say that we should stop. As he writes,

"Even though we have questions, at this point it would be inappropriate to talk about the coefficient we have estimated ... as being anything else than statistically insignificant."

And maybe that would be the case if we didn't know anything about baseball. But, as baseball fans, we know that triples are good things, and we know that a triple does help teams score runs. That's why we cheer our team's players when they hit them. There is strong reason to believe there's a connection between triples and runs.

So I don't think it's inappropriate at all to look at our coefficient. It turns out that the coefficient is 1.88. On average, every additional triple a team hit was associated with an increase of 1.88 runs scored.

Of course, there's a large variance associated with that 1.88 estimate -- as you'd expect, since it wasn't statistically significantly different from zero. The standard error of the estimate was 1.53. That means a 95% confidence interval is approximately (-1.18, 4.94). Not only is the 1.88 not significantly different from zero, it's also not significantly different from -1, or from almost +5!
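For anyone who wants to check the arithmetic, here's a minimal sketch of that interval in Python. It assumes the usual rough rule of plus or minus two standard errors rather than the exact t-based multiplier, and the function name is just mine, for illustration.

```python
def approx_ci(estimate, std_error, multiplier=2.0):
    """Rough 95% confidence interval: estimate +/- multiplier * standard error."""
    half_width = multiplier * std_error
    return estimate - half_width, estimate + half_width


# Triples coefficient and its standard error, from the regression above.
low, high = approx_ci(1.88, 1.53)
print(f"approximate 95% interval: ({low:.2f}, {high:.2f})")

# Any value inside the interval is "not significantly different" from 1.88.
for candidate in (0.0, -1.0, 4.9):
    print(candidate, "inside interval:", low <= candidate <= high)
```

Zero, -1, and 4.9 all land inside the interval, which is the point of the paragraph above.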

But why can't we say that? Why shouldn't we write that we found a coefficient of 1.88 with a standard error of 1.53? Why can't we discuss these numbers and the size of the real effect, if any?

Berri and his co-author would argue that it's because we have no good evidence that the effect is different from zero. But what makes zero special? We also have no good evidence that the effect is different from 1.88, or 4.1, or -0.6. Why is it necessary to proceed as if the "real" value of the coefficient is zero, when zero is just one special case?

As I argued before, zero is considered special because, most of the time, there's no reason to believe there's any connection between the input and the output. Do you think rubbing chocolate on your leg can cure cancer? Do you think red cars go faster than black cars just by virtue of their color? Do you think standing on your head makes you smarter?

In all three of these examples, I'd recommend following Berri's advice, because there's overwhelming logic that says the relationship "should" be zero. There's no scientific reason that red makes cars go faster. If you took a thousand similarly absurd hypotheses, you'd expect at least 999 of them to be zero. So if you get something positive but not statistically significant, the odds are overwhelming that the non-zero point estimate got that way just because of random luck.

But, for triples vs. runs, that's not the case. Our prior expectation should be that the result will turn out positive. How positive? Well, suppose we had never studied the issue, or read Bill James or Pete Palmer. Then, we might naively figure, the average triple scores a runner and a half on base, and there's a 70% chance of scoring the batter eventually. That's 2.2 runs. Maybe half the runners on base would score eventually even without the triple, so subtract off .75, to give us that the triple is worth 1.45 runs. (I know these numbers are wrong, but they're reasonable for what I might have guessed pre-Bill James.)

If our best estimate going in was that a triple should be worth 1.45 runs, and the regression gave us something close to that (and not statistically significantly different), then why should we be using zero as a basis for our decision for whether to consider this valid evidence?
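Here's a rough sketch of that comparison. The inputs are just the naive guesses from the paragraph above (a runner and a half on base, a 70% chance the batter scores, half the runners scoring anyway), and the variable names are mine. It measures how far the regression estimate is from each baseline, in standard errors.

```python
# Naive pre-Bill-James guess at the run value of a triple.
runners_on_base = 1.5        # runners the triple drives in
batter_scores_prob = 0.70    # chance the batter eventually scores from third
would_score_anyway = 0.5     # share of runners who'd have scored without the triple

prior_value = (runners_on_base + batter_scores_prob
               - would_score_anyway * runners_on_base)

estimate, std_error = 1.88, 1.53   # from the triples regression


def distance_in_se(baseline):
    """How many standard errors the estimate sits from a given baseline."""
    return abs(estimate - baseline) / std_error


print(f"naive prior: {prior_value:.2f} runs per triple")
print(f"distance from zero:  {distance_in_se(0.0):.2f} standard errors")
print(f"distance from prior: {distance_in_se(prior_value):.2f} standard errors")
```

The estimate sits much closer to the naive prior than it does to zero, which is why zero seems like a strange baseline here.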

Rather than end the discussion with a period, as Berri's colleague would have us do, I would suggest we do this:

-- give the regression's estimate of 1.88, along with the standard error of 1.53 and the confidence interval (-1.18, 4.94).

-- state that the estimate of 1.88 is significant in the baseball sense.

-- admit that it's not significantly different from zero.

-- BUT: argue that there's reason to think that the 1.88 is in the neighborhood of what theory predicts.

If I were writing a paper, that's exactly what I'd say. And I'd also admit that the confidence interval is huge, and we really should repeat this analysis with more years' worth of data, to reduce the standard error. But I'd argue that, even without statistical significance, the results actually SUPPORT the hypothesis that triples are associated with runs scored.

You've got to use common sense. If you got these results for a relationship between rubbing chocolate on your leg and cancer, it would be perfectly appropriate to assume that the relationship is zero. But if you get these results for a relationship between height and weight, zero is not a good option.

And, in any case: if you get results that are significant in the real world, but not statistically significant, it's a sign that your dataset is too small. Just get some more data, and run your regression again.

------

Here's another example of how you have to contort your logic if you want to blindly assume that statistical insignificance equals no effect.

I'm going to run the same regression, on the 2007 MLB teams, but I'm going to use doubles instead of triples. This time, the results are indeed statistically significant:

-- p=.0012 (significant at 99.88%)

-- each double is associated with an additional 1.50 runs scored

-- the standard error is 0.417, so a 95% confidence interval is (0.67, 2.33)

Everyone would agree that there is a connection between hitting doubles and scoring runs.

But now, Berri and his colleague are in a strange situation. They have to argue that:

-- there is a connection between doubles and runs, but

-- there is NO connection between triples and runs!

If that's your position, and you have traditional beliefs about how doubles lead to more runs (by scoring baserunners and putting the batter on second base), those two statements are mutually contradictory. It's obvious to any baseball fan that, on the margin, a triple will lead to at least as many runs scoring as a double. It's just not possible that a double is worth 1.5 runs, but the act of stretching it into a triple makes it worth 0.0 runs instead. But if you follow Berri's rule, that's what you have to do! Your paper can't even argue against it, because "the coefficient was insignificant, but ..." is not allowed!

Now, in fairness, it's not logically impossible for doubles to be worth 1.5 runs in a regression but triples 0.0 runs. Maybe doubles are worth only 0.1 runs in current run value, but they come in at 1.5 because they're associated with power-hitting teams. Triples, on the other hand, might be associated with fast singles-hitting teams who are always below average.

In the absence of other evidence, that would be a valid possibility. But, unlike the chocolate-cures-cancer case, I don't think it's a very likely possibility. If you do think it's likely, then you still have to make the argument using other evidence. You can't just fall back on the "not significantly different from zero."

Using zero as your baseline for significance is not a law in the field of statistical analysis. It's a consequence of how things work in your actual field of study, an implementation of Carl Sagan's rule that "extraordinary claims require extraordinary evidence." For silly cancer cures, for red cars going faster than black cars, saying there's a non-zero effect is an extraordinary claim. And so you need statistical significance. (Indeed, silly cancer cures are so unlikely that you could argue that 95% significance is not enough, because that would allow too many false cures (2.5%) to get through.)

But for triples being worth about the same as doubles ... well, that's not extraordinary. Actually, it's the reverse that's extraordinary. Triples being worth zero while doubles are worth 1.5 runs? Are you kidding? I'd argue that if you want to say triples are worth less than doubles, the burden is reversed. It's not enough to show that the confidence interval includes zero. You have to show that the confidence interval does NOT include anything higher than the value of the double.
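To make the reversed burden concrete, here's a small sketch using the numbers from the two regressions in this post. It uses the rough two-standard-error interval for triples and treats the doubles estimate of 1.50 as a fixed benchmark (ignoring its own uncertainty, which is a simplification on my part).

```python
triples_estimate, triples_se = 1.88, 1.53
doubles_estimate = 1.50

# Rough 95% interval for the triples coefficient.
low = triples_estimate - 2 * triples_se
high = triples_estimate + 2 * triples_se

# The usual test: can we rule out "triples are worth nothing"?
cannot_rule_out_zero = low <= 0.0 <= high

# The reversed burden: can we rule out "triples are worth at least a double"?
cannot_rule_out_double = high >= doubles_estimate

print("zero is inside the interval:", cannot_rule_out_zero)
print("interval reaches the value of a double:", cannot_rule_out_double)
```

Both checks come out true. The data can't rule out zero, but they also can't rule out a triple being worth as much as a double, or more.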

According to David Berri, the rule of thumb in econometrics is, "if you don't have significance, ignore any effect you found." But that rule of thumb has certain hidden assumptions. One of those assumptions is that on your prior beliefs, the effect is likely to be zero. That's true for a lot of things in econometrics -- but not for doubles creating runs.

-----

This doubles/triples comparison is one I just made up. But there's a real life example, one I talked about a couple of years ago.

In that one, Cade Massey and Richard Thaler did a study (.pdf) of the NFL draft. As you would expect, they found that the earlier the draft pick, the more likely the player was to make an NFL roster. Earlier choices were also more likely to play more games, and more likely to make the Pro-Bowl. Draft choice was statistically significant for all three factors.

Then, the authors attempted to predict salary. Again as you'd expect, the more games you played, and the more you were selected to the Pro Bowl, the higher your salary. And, again, all these were statistically significant.

Finally, the authors held all these constant, and looked at whether draft position influenced salary over and above these factors. It did, but this factor did not reach statistical significance. Higher picks earned more money, but the effect was only somewhere between 1 and 2 SDs from zero.

From the lack of significance, the authors wrote:

" ... we find that draft order is not a significant explanatory variable after controlling for [certain aspects of] prior performance."

I disagree. Because for that to be true, you have to argue that

-- higher draft choices are more likely to make the team

-- higher draft choices are more likely to play more games

-- higher draft choices are more likely to make the Pro-Bowl

but that

-- higher draft choices are NOT more likely to be better players in other ways than that.

That makes no sense. You have two offensive linemen on two different teams -- good enough to play every game for five years, but not good enough for the Pro Bowl. One was drafted in the first round; one was drafted in the third round. What Massey and Thaler are saying is that, despite the fact that the first round guy makes, on average, more money than the third round guy, that's likely to be random coincidence. That flies in the face of the evidence. Not statistically significant evidence, but good evidence nonetheless -- a coefficient that goes in the right direction, is significant in the football sense, and is actually not that far below the 2 SD cutoff.

That isn't logical. You've shown, with statistical significance, that higher picks perform better than lower picks in terms of playing time and stardom. The obvious explanation, which you accept, is that the higher picks are just better players. So why would you conclude that higher picks are exactly the same quality as lower picks in the aspects of the game that you chose not to measure, when the data don't actually show that?

In this case, it's not only acceptable, but required, to say "the coefficient was insignificant, but ..."

How many runs are created by good baserunning?

Baumer set out to quantify baserunning skill, in terms of runs. Specifically, he considered these seven skills:

-- advancing first to third on a single

-- advancing first to home on a double

-- advancing second to home on a single

-- beating out a DP attempt on a ground out

-- stealing second

-- stealing third

-- tagging up on a fly ball when on second or third

He created a (Markov) simulation using 2005-2007 league-average results for each of the seven skills, and proved that his model came close to actual league runs scored.

Then, he substituted actual team lineups, and, for every player, used their actual baserunning percentages for each of the seven situations. There were two probabilities for each situation: the probability of trying for an extra base (for double plays, this is the probability of there being a force play on the runner on first with less than two outs), and the probability of success given that an attempt was made.

For each team, he then ran the same simulation, but using league-average baserunning. The difference is an estimate of how many runs the team's players gained (or lost) with their baserunning.

The top three and bottom three:

+21.1 Mets
+18.0 Yankees
+14.7 Rockies

-12.3 Marlins
-12.7 Red Sox
-20.8 White Sox

Baumer concludes that most teams should be within 25 runs of average baserunning.

But now he wants to figure out, in theory, how many runs a really great baserunning team would gain, and how much a really bad team would lose. He tries a bunch of different selection criteria for "best" and "worst." The results, simplified a bit:

As I said, I really like this paper; it asks an interesting and well-defined question and answers it well. Moreover, it's written for readers who know baseball a bit. It does use more mathematical notation than is necessary for sabermetricians, but given that it's an academic paper, and given that the notation is not overdone and clearly explained, I'd have to say that it's very well done.

The one criticism I have is that, as far as I can tell, Baumer used actual raw success rates and didn't regress to the mean at all. That means that while the results wind up accurate in terms of what the actual run contribution was, they are exaggerated estimates of the actual skill of the players involved. If you're thinking about 2010, there's probably no way to estimate, in advance, what any given set of baserunners will do. While the "combination" group added 68.4 runs a season from 2005 to 2007, they'd regress to the mean in 2009-2011 by some amount. What's that amount? We don't really know.

Oh, and one useful point that I'll use in future: for leagues that score 0.531 runs per inning, the variance of runs per inning is 1.125. I've always used 1.000 as an estimate, based on some research I did on the 1988 AL a long time ago, but I think that league scored only .5 runs/inning. Also of note: a simulation that assumes average pitching and an average lineup has a variance a bit smaller: around 1.1 runs instead of 1.125. That's obviously because the pitching doesn't vary in the simulation, only the hitting.

Friday, May 08, 2009

The regression equation versus r-squared

OK, I hope I'm not beating a dead horse here, but here's another way to think of the difference between r-squared and the regression equation.

The r-squared comes from the standpoint of stepping back and looking at the distribution of wins among teams in your dataset. Some teams have over 60 wins, some teams have under 20 wins, and some teams are in the middle. If you look at the standings, and ask yourself, "how important are differences in salary to how we got this way?", then you're asking about r-squared.

The regression equation matters more if you're interested in the future, if you care about how much you can influence wins by increasing payroll. If you ask yourself, "how much do I have to spend to get a few extra wins?", then you want the regression equation.

The r-squared looks at the past, and asks, "was salary important to how we got to this variance in wins?". The regression equation looks to the future, and says, "can we use salary to influence wins?"

It's very possible, and very easy, to have two different answers to these two questions. Here's an example.

Suppose you're trying to see what activities 25-year-olds partake in that affect their life expectancy. You might discover that the average 25-year-old lives to 80, but you want to try to figure out what factors influence that. You run a multiple regression, and you figure out that if the person smokes at 25, it appears to cut five years off his life expectancy. If he eats healthy, it adds four years. If he commits suicide at 25, it cuts off 55 years (since he dies at 25 instead of 80).

We should all agree that committing suicide has a big effect on life expectancy, right?

Now, let's look at the r-squared. To do that, look at all the 25-year-olds in the sample (which might be several thousand). You'll see a few that live to 25, some that live to 45, a bunch that live to 65, a larger bunch that live to 80, and some that live to 100. The distribution is probably bell-shaped.

For the r-squared, ask yourself: how much did suicide contribute to the curve looking like this? The answer: very little. There are probably very few suicides at 25, and even if you adjusted for those, by taking those points out of the left side of the curve and moving them to the peak, the curve would still look roughly the same. Suicide is not a very big factor in making the curve look like it does.

And so, you get a very low r-squared for suicide. Maybe it would be .01, or even less.

See the apparent contradiction?

-- suicide has a HUGE effect on lifespan.

-- r-squared for suicide vs. lifespan is very low

And, again, that's because:

-- the regression equation tells you what effect the input has on the output;

-- the r-squared tells you how important that input was in creating the distribution you see.

The regression equations tell you that having a piano drop on your head is very dangerous. The low r-squared tells you that pianos haven't historically been a major source of death.
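Here's a sketch of the suicide example with a made-up population, to show the two numbers pulling apart. Everything about the data is an assumption for illustration: 9,990 people who don't commit suicide (a third each living to 65, 80, and 95) and 10 who do, dying at 25. The regression is plain one-variable least squares, written out by hand.

```python
# Made-up population (assumed numbers, not real data).
lifespans = [65, 80, 95] * 3330 + [25] * 10
suicide = [0] * 9990 + [1] * 10      # 1 = committed suicide at 25


def simple_regression(x, y):
    """One-input least squares: returns (slope, intercept, r_squared)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


slope, intercept, r_squared = simple_regression(suicide, lifespans)
print(f"lifespan = {intercept:.1f} + ({slope:.1f}) * suicide")
print(f"r-squared = {r_squared:.4f}")
```

With these invented numbers, the coefficient says suicide costs 55 years, while the r-squared comes out at only about .02. Huge effect, tiny r-squared.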

----

Here's a different way to explain this, which might make more sense to gamblers:

Suppose that you had to predict the lifespan of a random 25-year-old. Obviously, the more information you have, the more accurate your estimate will be. And, imagine the amount you lose is the square of the error in your guess. So if you guess 80, and the random person dies at 60, you lose $400 (the square of 80 minus 60).

Without any information, your best strategy is to guess the average, which we said was 80. Your average loss will be the variance, which is the square of the SD. Suppose that SD is 15. Then, your average loss would be $225.

Now, how valuable is knowing whether or not the guy committed suicide? It's probably not that valuable. Most of the time, the answer will be "no", and you're only slightly better off than when you started (maybe you guess 80.05 now instead of 80). A tiny, tiny proportion of the time, the answer will be "yes," and you can safely guess 25 and be right on. On balance, you're a little better off, but not much.

On average, how much less will you lose given the extra information? The answer is given by the r-squared. If the r-squared of the suicide vs. lifespan regression is .01, as estimated above, then your loss will be reduced by 1%. Instead of losing $225, on average, you'll lose only about $222.75.
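The betting game can be sketched directly. This uses a made-up population (9,990 non-suicides living to 65, 80, or 95 in equal numbers, plus 10 suicides at 25; all assumed for illustration) and compares the average squared loss when you always guess the overall mean against the loss when you're told the suicide answer first and guess that group's mean.

```python
# Made-up population (assumed numbers, not real data).
lifespans = [65, 80, 95] * 3330 + [25] * 10
suicide = [False] * 9990 + [True] * 10


def mean(values):
    return sum(values) / len(values)


def average_loss(guesses, outcomes):
    """Average of the squared errors, i.e. dollars lost per bet."""
    return sum((g - o) ** 2 for g, o in zip(guesses, outcomes)) / len(outcomes)


# No information: always guess the overall average.
overall = mean(lifespans)
blind_loss = average_loss([overall] * len(lifespans), lifespans)

# With the suicide answer: guess the average of the matching group.
guess_if = {
    flag: mean([y for y, s in zip(lifespans, suicide) if s == flag])
    for flag in (False, True)
}
informed_loss = average_loss([guess_if[s] for s in suicide], lifespans)

reduction = 1 - informed_loss / blind_loss
print(f"blind loss:    ${blind_loss:.2f}")
print(f"informed loss: ${informed_loss:.2f}")
print(f"reduction:     {reduction:.2%}")

# The same arithmetic with the numbers in the text: SD 15, r-squared .01.
text_loss = 15 ** 2 * (1 - 0.01)
print(f"text example:  ${text_loss:.2f}")
```

For a single yes/no input like this one, the percentage reduction in loss is the same number as the r-squared of the regression. With these invented counts it's about 2%.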

Again: the r-squared doesn't tell you that suicide is dangerous. It just tells you that, because of *some combination of dangerousness of suicide and historical frequency of suicide*, you can shave 1% off your error by taking it into account.

If you took a bet where you had to guess a random team's wins, and had to pay the square of the difference, you'd pick "41" and, on average, owe $199. But let's suppose someone tells you the team's payroll. Now, you can adjust your guess, to predict higher if the team has a high payroll, or lower if the team has a low payroll. If you adjust your guess optimally -- by using the results of the regression equation -- you'll cut your average loss by 25.61%. So, on average, you'd lose only 74.39% as much as before. That works out to $148.11.

What Berri, Brook and Schmidt are saying, in "The Wages of Wins," is, "look, if you can only cut your losses by 25.61% by knowing salary, then money can't be that important in buying wins." But that's wrong. What they should conclude is that "how important money is, combined with how often it's been used to buy wins," isn't that important.

And, really, if you look at the full results of the regression, it turns out that money IS important in buying wins, but that not too many teams took advantage of that fact in 2008-09.

The equation shows that every $1.6 million in additional salary will buy you a win -- so if you want to go 61-21, it should only cost you $32 million more than the league-average payroll of $68.5 million.

That's pretty important, and so the low r-squared must mean that not a lot of teams varied much in salary. If you look at the salary chart, there's a huge group bunched near the average: there are 18 teams between $62mm and $75mm, within $6.5 million of the average. Those teams are so close together that there's not much difference in their expected wins.

If you have to bet, and the random team you pick turns out to be the lowest-spending in the league, you'll reduce your estimate. You would have lost a lot of money guessing 41, so the information that you picked a low-spending team will cut your losses a lot. If it turns out to be one of the highest-spending in the league, same thing. But if it turns out to be one of the 18 teams in the middle, the salary information won't help you much. And that's why the r-squared is only about 25% -- for many of the teams in the sample, knowing the salary doesn't help you cut your losses much.

What if we take out those 18 teams, and regress only on the remaining 12? Well, the regression equation stays almost the same -- $1.5 million per win instead of $1.6. But the r-squared increases to .4586. Why does the r-squared increase? Because salary is much more significant a factor for those 12 teams than for the ones in the middle. Before, knowing the salary might not do you much good for your estimate if it's one of the teams bunched in the middle. But, now, those teams are gone. Your random team is much more likely to be the Cavaliers or the Clippers, so knowing the salary is a much bigger help, and it lets you cut your betting losses by almost half.

----

One last summary:

1. The regression equation tells you how powerful the input is in affecting output -- is it a nuclear weapon, or a pea-shooter?

2. The r-squared tells you how powerful the input is, "multiplied by" how extensively the input was historically used. That is: a nuclear weapon used once might give you the same r-squared as a pea-shooter used a billion times.
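The nuclear-weapon-versus-pea-shooter point can be put in numbers. For a single yes/no input, the variance the input explains is the effect squared times p(1-p), where p is how often the input happens, and the r-squared is that explained variance divided by the total. The sketch below uses assumed values (an SD of 15 for everything else, a 55-year effect that happens one time in a thousand) and solves for the small effect that gives the same r-squared when it happens half the time.

```python
import math


def r_squared_binary(effect, frequency, residual_sd):
    """r-squared for one yes/no input: explained variance / total variance."""
    explained = effect ** 2 * frequency * (1 - frequency)
    return explained / (explained + residual_sd ** 2)


RESIDUAL_SD = 15.0   # assumed spread from everything else

# "Nuclear weapon": huge effect, almost never used.
nuke_effect, nuke_freq = 55.0, 0.001
r2_nuke = r_squared_binary(nuke_effect, nuke_freq, RESIDUAL_SD)

# "Pea-shooter": used half the time; find the effect with the same r-squared.
pea_freq = 0.5
pea_effect = nuke_effect * math.sqrt(
    nuke_freq * (1 - nuke_freq) / (pea_freq * (1 - pea_freq))
)
r2_pea = r_squared_binary(pea_effect, pea_freq, RESIDUAL_SD)

print(f"nuke: effect {nuke_effect:.1f} years, r-squared {r2_nuke:.4f}")
print(f"pea:  effect {pea_effect:.1f} years, r-squared {r2_pea:.4f}")
```

Same r-squared, wildly different effects. You can't tell which one you have without looking at the regression equation.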

So a low r-squared might mean

-- an input that doesn't have much effect on the output (e.g., shoe size probably doesn't affect lifespan much);

-- an input that has a big effect on output but doesn't happen much (e.g., suicide curtails 100% of lifespan but happens rarely); or

Why r-squared doesn't tell you much, revisited

In a blog post I wrote about yesterday, "Wages of Wins" author Stacey Brook ran a regression to try to figure out what kind of relationship there is between an NBA team's payroll and its success on the court.

The regression gives you several pieces of information. Which ones should you use to best explain the relationship?

Brook says it's the r-squared. He writes,

"We use R2 since we are interested in the proportion of variance that is in common between NBA team payroll and NBA team performance."

But is that truly what we're interested in? I don't think so.

I do agree with Brook when he says that R-squared gives you "the proportion of variance that is in common between NBA team payroll and NBA team performance." But what does that mean? Almost nothing, unless you're a statistician.

When you do research like this, there's a question that you want to answer. In this case, if your question is "what proportion of variance is in common between NBA team payroll and NBA team performance?," well, then, there's your answer. But that's not the question. It's not even Brook's real question. His real question is implied by the first paragraph of his post:

"I have to disagree that NBA (or for that matter NHL, MLB or NFL) teams that have high payrolls result in higher winning percentages; nor am I the first to say this."

The question is: do teams with higher payrolls do better on the court? And that question is different from "what proportion of variance is in common between NBA team payroll and NBA team performance?"

If you want to see what payroll does to performance, what you want to see is the regression equation. The way regression works, of course, is to plot all the datapoints on a graph, then draw the best fit straight line among those points. That line represents the best-fit relationship between payroll and wins.

If you do that for the 2008-09 NBA teams, you get

Wins = 0.61 (millions of $ spent) - 0.76

This, basically, answers your question, in several ways:

-- every extra million dollars you spend on salaries gives you three-fifths of a win.

-- every extra $1.64 million you spend gives you an extra win.

-- if you spend $100 million, like the Knicks, you should win about 60 games.

-- if you spend only $45 million, like the Grizzlies, you should win only about 27 games.

Not that complicated, right? If you want to know about the direct relationship between salary and wins, the regression equation does it.
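As a quick sketch, here's the regression line above as a function, so you can plug in any payroll. The coefficients are the ones from the equation in this post; the function name is just mine.

```python
SLOPE = 0.61        # wins per million dollars of payroll
INTERCEPT = -0.76


def predicted_wins(payroll_millions):
    """Expected wins from the 2008-09 NBA regression line."""
    return SLOPE * payroll_millions + INTERCEPT


print(f"cost of one extra win: ${1 / SLOPE:.2f} million")
for team, payroll in (("Knicks", 100), ("Grizzlies", 45)):
    print(f"{team}: ${payroll} million -> {predicted_wins(payroll):.1f} wins")
```

This is the "price in dollars" version of the answer: it reads the same no matter how much variance there happens to be in the league.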

Of course, you want to check the statistical significance; it's possible that while the best-fit straight line says $1.64 million per win, that might not be significantly different from zero. (As it turns out, it IS significant, at the 99.5% level. In fairness to Brook, it appears his data source had incorrect information, and because of that, his results were not, in fact, significant.)

I think we can all agree, from these results, that it certainly does appear that spending leads to winning. When the highest-spending team is expected to go 60-22, and the lowest-spending team is expected to go 27-55, you can't really claim that payroll is irrelevant. (Again, in fairness to Brook, he didn't get results this extreme. With the incorrect data, the regression suggests the highest-spending team should only be 45-37.)

So if the regression equation is the gold standard for making these kinds of calculations, what's with the r-squared? Well, the r-squared answers a different question.

Let's suppose that you had no idea what makes teams win basketball games. You see the Cavs go 66-16, and you see the Clippers go 19-63, and you think, what causes the difference?

What you could do is list as many plausible things as you could think of. Payroll would be one of them. Maybe average days of rest. Maybe whether they're an offensive or defensive team. Maybe average age. Maybe pace of play. Just list them all, as many as you want. Then, run a regression, and look at the r-squared.

What the r-squared will do is tell you, in a certain mathematical sense, after correcting for all those variables, what percentage of all the variation in wins have you explained? What you're trying to do is get as close to 100% as you can. The closer you get, the more you've explained what makes teams win and what makes teams lose. Maybe, if you actually ran this regression, you'd get to something like 40%. If you adjusted team wins for all those variables, as best you could, your variance would decrease by 40%.

In this particular case, our regression didn't include all that other stuff, like pace of play or average age. We only had one variable, payroll. And it turned out that the r-squared was .256, which means that 25.6% of the variation is "explained" by payroll.

It doesn't sound like a lot. In "The Wages of Wins," Brook (and co-authors David Berri and Martin Schmidt) did that for MLB, and came up with only 18%. That doesn't sound like a very big number either, and those authors decide that means that payroll isn't very important.

But that doesn't follow.

The r-squared, the seemingly-low 25.6% number, does NOT tell you about the relationship between payroll and wins. It just tells you that payroll is 25.6% of the total variance, and other factors are 74.4%. But, if the total variance is large, 25.6% of it would be substantial.

When you go into the car dealership and ask for a price, you want the amount in dollars. If you ask "how much for that Camry," and the salesman says, "it's 700% of your monthly pay," it may sound like a lot. If he says, "it's 9.5% of your net worth," it may sound cheaper. And if he says, "it's less than 0.01% of Bill Gates' disposable income for the week," it may sound cheaper still. But those all represent the same number of dollars. The fact that one percentage is a large number, and one percentage is a small number, doesn't change that fact.

It's the same thing for r-squared. The size of the percentage number depends on what it's a percentage of -- which happens to be the total variance of wins in the league. Do you know, intuitively, what that variance is? I don't. But I know that a lot of it is random chance. And random variation depends on sample size. You could have exactly the same relationship between salary and wins, but, in one case, the r-squared is .25, and in another case, it's .04, and in another case, it's .5.

Want to see how you can use the same data to get a larger r-squared? Easy. I'm going to take the actual data for the 30 teams, but group them into threes according to payroll. So instead of the three data points "$100 million, 32 wins" (Knicks), "$90.1 million, 66 wins" (Cavs), and "$86 million, 50 wins" (Mavericks), I'm going to add them all up into the one data point "$276.1 million, 148 wins". Then I'm going to repeat for the other 27 teams, until I have 10 sums of three teams. Then, I'm going to run a regression on those 10 data points.

What happens? The r-squared now goes up to .497 -- almost double what it was!

But while I was able to arbitrarily double the r-squared, the regression line stayed almost the same -- which makes sense, since the actual relationship between salary and wins shouldn't change just because we arranged the data differently. Using all 30 teams, we got 0.61 wins per million dollars. Using the 10 groups of three teams, we get 0.68 wins per million dollars. Pretty close.
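I don't have the team-by-team data in this post, so here's a simulation sketch of the same grouping trick instead. Everything in it is assumed: 30 hypothetical payrolls evenly spaced from $45 million to $100 million, a true line of 0.61 wins per million, and random noise with an SD of 12 wins. It fits the line on all 30 teams, then on 10 sums of three payroll-neighbors, and averages over 500 simulated leagues.

```python
import random


def fit_line(x, y):
    """One-input least squares: returns (slope, r_squared)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy / sxx, sxy ** 2 / (sxx * syy)


def one_league(rng, true_slope=0.61, noise_sd=12.0):
    # Hypothetical payrolls, already sorted from highest to lowest.
    payrolls = [100 - 55 * i / 29 for i in range(30)]
    wins = [true_slope * p - 0.76 + rng.gauss(0, noise_sd) for p in payrolls]
    # Add up each run of three payroll-neighbors into one data point.
    group_pay = [sum(payrolls[i:i + 3]) for i in range(0, 30, 3)]
    group_wins = [sum(wins[i:i + 3]) for i in range(0, 30, 3)]
    return fit_line(payrolls, wins), fit_line(group_pay, group_wins)


def mean(values):
    return sum(values) / len(values)


rng = random.Random(2009)
results = [one_league(rng) for _ in range(500)]

slope_all = mean([all_teams[0] for all_teams, _ in results])
r2_all = mean([all_teams[1] for all_teams, _ in results])
slope_grouped = mean([grouped[0] for _, grouped in results])
r2_grouped = mean([grouped[1] for _, grouped in results])

print(f"all 30 teams:  slope {slope_all:.2f}, r-squared {r2_all:.2f}")
print(f"groups of 3:   slope {slope_grouped:.2f}, r-squared {r2_grouped:.2f}")
```

In the simulated leagues, the slope comes out about the same either way, while the r-squared is noticeably higher for the grouped version. That's the same pattern as in the real data: grouping changes the denominator, not the relationship.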

If Stacey Brook did the analysis his way, using all 30 teams, he'd say "salary explains 25.6% of the variance in wins." If I do the analysis my way, using groups of three teams, I'd say "salary explains 49.7% of the variance in wins." Which one of us would be right? Both of us! Because we are using different denominators, different variances. The same Toyota Camry can be a smaller percentage of Brook's salary than of my salary, because our salaries are different.

And so saying "payroll explains 25.6% of the variance of wins" is like saying "a Camry costs 35% of salary." Whose salary, and how much does he earn? Unless you know that, the "35%" figure is useless.

But, again, despite the fact that Brook and I did our regression differently, the equation should come out very similar. It won't come out exactly the same, because of random fluctuation, but you should *expect* it to come out the same, in the same sense as you expect a coin to come up heads 50% of the time. 0.61 wins per $million and 0.68 wins per $million are pretty close.

The regression equation is meaningful, it requires less information to interpret, and its expected value is the same regardless of your sample size. Most importantly, it answers the exact question that you want to know.

The r-squared, on the other hand, is unintuitive, can be made to come out to almost anything you like by tweaking the sample size to get a different total variance, and requires you to know how the study was done in order to interpret what it means. In terms of answering real-life questions, it's not very useful at all.

Thursday, May 07, 2009

USA Today's NBA salary data is flawed

My previous post pointed to a blog entry by sports economist Stacey Brook, in which Brook found a low correlation between team payroll and wins. Specifically, he found that for the 30 teams in the 2008-09 NBA, the r-squared was only .0410.

I think Brook used incorrect data. The article he pointed to in turn pointed to the USA Today basketball salary page. However, the USA Today database misallocates salaries. When a player was with more than one team in 2008-09, it counts his entire salary for only one of those teams. That throws everything off.

For instance, mid-season, the Raptors traded Jermaine O'Neal and Jamario Moon to the Heat for Marcus Banks and Shawn Marion. All four of those players are listed on the Raptors page. This (and probably other similar situations) causes the Raptors payroll to come out to $95.3 million, compared to only $67.4 million at other sources (like this one). And since all four of those players are absent from the Heat page, Miami comes out with a payroll of only $50 million instead of $68.6 million.

If you use the more standard numbers, you wind up with a solid positive relationship between salary and wins, an r-squared of .2561 instead of .0410. (That's an r of .5061.)

This doesn't affect any of the comments I made in the last post (or plan to make in future posts), but I thought I should report it anyway.

Low statistical significance doesn't necessarily mean no effect

He says there is none. Seriously. Not that the relationship is weak, not that money doesn't help much. Brook seems to honestly believe that salary doesn't buy wins at all. Read the full post to see if I'm interpreting him correctly, but here's a quote:

"So not only the proportion of variance that is common between the two tiny, but here I am able to show that the correlation coefficient between the two populations (NBA payroll and NBA performance) for the 2008-2009 season is statistically zero."

I have several problems with this analysis. The first one is not unique to Brook, and it drives me nuts. It's the idea that if you do a regression, and the significance level is less than 95%, it's OK to claim that there is no relationship between the variables.

That's not always right. It's often right; I suppose you could even say it's *usually* right. But this is one of those exceptions where it's not right at all.

Let's suppose that somehow you get it into your head that rubbing chocolate on your legs can help cure cancer. So you set up a double-blind experiment, where one set of patients gets the chocolate rub, and the other set gets a rub with fake chocolate. It turns out that the first group actually improves more than the second group -- by a small amount, maybe 1%. But the result is not statistically significant. Maybe, instead of the 95% you were looking for, you only have 80% significance.

In this case, I agree with Brook -- it would be wrong to argue that the 1% improvement you saw was real. It's probably just random chance, and you'd be justified in saying that there's no reason to believe that a chocolate rub has any therapeutic value at all.

But, now, let's turn to salary and wins. Suppose you study actual NBA payrolls and records, and you find a similar small effect: every $1 million gives you 0.1 extra wins. Again, suppose that's significant at only the 80% level.

In this case, can you draw the same conclusion, that money has no effect on wins at all? No, you can't. In this case, it's likely that the effect is real, despite the low significance level.

Why the difference? Because in the first case, there was absolutely no reason to believe that chocolate can have any effect on cancer. There's no previous scientific evidence for it, and there isn't a plausible mechanism for how the effect might work.

Suppose that, going in to the study, you (generously) thought there couldn't be more than a one in a million chance that chocolate helps treat cancer. So imagine a million different universes where you run the experiment. One time, you'll get a real effect. 200,000 times, you'll get 80% significance just by chance. So the chance that the chocolate actually works in this universe is roughly 1 in 200,001. That's still no reason to believe.

But the salary case is very different. There's no basis to believe that chocolate can cure cancer, but there's very good reason to believe that spending money buys better players and leads to more wins. In fact, every serious basketball fan in the world (except maybe Stacey Brook) believes that you can buy wins. When the Celtics pay Kevin Garnett some $25 million, does anyone really believe that the signing won't help the team? That if the Celtics instead paid $500,000 for some mediocre guy, they'd be doing just as well?

In the salary case, when you run regressions and get only 80% significance, the calculation works out differently. Suppose that going into the study, you figured there was a 99% chance that money helped buy performance (which is again conservative). Then, in a million different universes, you'd get 2,000 where the 80% significance came up just by chance; and you'd get 990,000 universes where the effect is real. The chance, then, that salary actually does buy wins in this particular universe is 99.8% (990,000 divided by 992,000). The effect that Brook found is probably a real one.

(The above argument can be put into more formal mathematics using Bayesian probability, but I won't bother -- first, because it makes more sense to explain it in plain English, and, second, because I don't remember all the terminology and notation from the one Bayesian course I took in 1996.)
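For readers who would rather see the universe-counting as arithmetic, here is a minimal Python sketch. The function name and its parameters are mine, and it bakes in the same simplifying assumption the counting above makes: if the effect is real, the study always shows it (`p_result_if_real=1.0`). A more careful version would plug in something lower there.

```python
def chance_effect_is_real(prior, p_result_if_no_effect, p_result_if_real=1.0):
    """Share of universes showing the result in which the effect is real.

    prior: belief, before the study, that the effect exists.
    p_result_if_no_effect: chance of seeing the result by luck alone
        (0.2 for a result at "80% significance").
    p_result_if_real: chance of seeing the result when the effect exists
        (assumed to be 1.0 here, matching the counting argument).
    """
    real = prior * p_result_if_real
    fluke = (1 - prior) * p_result_if_no_effect
    return real / (real + fluke)


chocolate = chance_effect_is_real(prior=1e-6, p_result_if_no_effect=0.2)
salary = chance_effect_is_real(prior=0.99, p_result_if_no_effect=0.2)
print(f"chocolate: about 1 in {1 / chocolate:,.0f}")
print(f"salary: {salary:.1%}")
```

The same 80% significance level gives wildly different answers depending on what you believed going in, which is the whole point.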

-----

Here's another way to look at it, if you don't like the "multiple universes" approach.

There are two possible reasons you might get a non-significant correlation between two variables:

1. There really is no relationship between the variables; or

2. There *is* a relationship, but you haven't looked at enough data to get a high enough significance level.

Almost any relationship, no matter how strong, will give you low significance if your sample size is too small. If you look at one random Ted Williams game, and one random Mario Mendoza game, what kind of significance level will you get? Pretty low. Even if Ted goes 2-for-5, and Mario goes 1-for-5 -- both of which are more extreme than their career averages -- you won't find the difference to be significant at the 95% level. One game is just not enough.

That doesn't mean this particular experiment is useless. You can still show the effect that you found, and invite further investigation. In this case, the difference between Williams and Mendoza is huge in the baseball sense -- .400 vs. .200. As a general rule, when you find an effect that's significant in the real-life sense, but not in the statistical sense, that's an indication that you might need more data. If the observed effect does have real-life importance, you are NOT entitled to conclude that there is no relationship between the variables. You are only entitled to conclude that you need more data.
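To put a rough number on the Williams/Mendoza example, here is a small Python sketch using a one-sided Fisher exact test. That choice of test is my assumption (nothing above depends on it), and the function name is made up. It asks: if both hitters really had the same ability, how often would the first one get at least this many of the combined hits?

```python
from math import comb


def fisher_one_sided(hits_a, ab_a, hits_b, ab_b):
    """Chance that hitter A gets hits_a or more of the combined hits,
    assuming both hitters share one true batting average."""
    total_hits = hits_a + hits_b
    total_ab = ab_a + ab_b
    # Hypergeometric tail: ways to deal k of the hits into A's at-bats.
    tail = sum(
        comb(total_hits, k) * comb(total_ab - total_hits, ab_a - k)
        for k in range(hits_a, min(total_hits, ab_a) + 1)
    )
    return tail / comb(total_ab, ab_a)


one_game = fisher_one_sided(2, 5, 1, 5)            # 2-for-5 vs. 1-for-5
twenty_games = fisher_one_sided(40, 100, 20, 100)  # same rates, 100 AB each
print(f"one game: p = {one_game:.3f}")
print(f"twenty games: p = {twenty_games:.4f}")
```

With one game apiece the p-value is nowhere near 0.05, even though the observed gap is .400 vs. .200. Keep the same rates over 100 at-bats each and it drops well below 0.05. The effect didn't change; the amount of data did.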

And, in my opinion, you MUST show the size of the effect you found, not just the significance level. Brook doesn't do that in his blog post. He gives us significance levels, and r, and r-squared, but the purpose of the study was to estimate the relationship between payroll and wins. Is it $5 million per win? $10 million per win? $15 million per win? Because, regardless of the significance level, the slope of the best-fit line is still the best estimate of that relationship. And I suspect that the results are reasonable, very close to what other analysts have estimated as the rate at which you can buy wins.

I suspect that if we were able to look more closely at Brook's study, we'd find that:

-- he got an estimate of wins per dollar that's close to conventional wisdom;

-- but he didn't have enough data to get statistical significance;

-- so he claims that the proper estimate of wins per dollar is zero.

That ain't right.

-----

P.S. Probably more on this topic in the next post -- for a preview, this is why I think Brook got such low correlation.

UPDATE: Actually, I think Brook got a low correlation because the data was flawed. Details in my next post here.

Monday, May 04, 2009

NBA's debunking of referee bias flawed, says researcher

A couple of years ago, Joseph Price and Justin Wolfers came out with a study (.pdf) that found a bit of racial bias among NBA referees. The more white referees on the court, the more fouls were called against black players (relative to white players). And vice-versa: more black referees meant relatively more fouls against whites. (The vice-versa has to be true by definition, since the white referees can only be judged relative to their black peers.)

At the time, I summarized the Price/Wolfers study here, here, and here.

When that study came out, the NBA wasn't pleased, and David Stern commissioned a counter-study to refute it. I haven't seen that NBA study, but David Berri has, and he recently wrote about it on his blog. Apparently, it's very amateurish: according to Berri, the researchers estimated the results twice, once with a dummy variable for black refs, and again with a dummy variable for white refs. But, as Berri points out, that will just give the same results -- it doesn't matter which way you define the dummy variable, and "that suggests that the person doing the work for the NBA didn't understand dummy variables."

I wish I had a copy of the NBA study ... I haven't been able to find it online, and I think I read somewhere that it was only distributed to a select group of readers. One of them was Joseph Price, the author of the original study. Berri's post was based on a visit Price made to Berri's class. In addition to the dummy variable issue, Berri reports that Price said "that was just the beginning" of the problems with the NBA study, but gives no further details.

Anyone know where the NBA response can be found? I'd love to take a look at it firsthand.

P.S. Berri gets in a shot at non-academic researchers:

"Unfortunately, the quality of work offered by the consulting firm [the NBA hired] was consistent with what you sometimes see in on-line studies. In other words, it wasn’t very good. In fact, much of it consisted of mistakes you would not expect an undergraduate in econometrics to make."