Sabermetric Research

Tuesday, June 19, 2012

Significance testing and contradictory conclusions

A bank is robbed. There are witnesses and video cameras. There are two suspects -- identical twin brothers. The police investigate, and are unable to determine which brother is the criminal.

The police call a press conference. They are familiar with the standards of statistical inference, where you don't have evidence until p < .05. And, of course, the police have only p = .5 -- a fifty/fifty chance.

"There is no evidence that Al did the crime," they say, accordingly. "And, also, there is no evidence that Bob did the crime."

But if the newspaper says, "police have no evidence pointing to who did it," that seems wrong. There is strong evidence that *one* of the brothers did it.

------

"Bad Science," by Ben Goldacre, is an excellent debunking of some bad research and media reporting, especially in the field of medicine. A lot of the book talks about the kinds of things that we sabermetricians are concerned about -- bad logic, misunderstandings by reporters, untested conventional wisdom, and so forth.

I just discovered that Goldacre has a blog, and I found this post on significance, which brings up the twins issue.

Let me paraphrase his example. You have two types of cells -- normal, and mutant. You give them both a treatment. In the normal cells, firing increased by 15 percent, but that's not statistically significant. In the mutant cells, firing increased by 30 percent, and that *is* significant. (I'm going to call these cases the "first treatment" and the "second treatment," even though it's the cells that changed, and not really the treatment.)

So, what do you conclude?

Under the normal standards of statistical inference, you can say the following:

1. There is no evidence that the first treatment works.

2. There is significant evidence the second treatment works.

Seeing that, a lay reader would obviously conclude that the researcher found a difference between the two treatments.

Goldacre objects to this conclusion. Because, after all, the difference between the two treatments was only 15 percent. That's probably not statistically significant, since that same 15 percent number was judged insignificant when applied to the normal cell case. So, given that insignificance, why should we be claiming there's a difference?
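To put rough numbers on that, here is a minimal Python sketch. The example as paraphrased above gives no standard errors, so the 10-point standard error below is purely an assumption, chosen so that 15 percent comes out insignificant and 30 percent significant. It also assumes the two estimates are independent and roughly normal.

```python
import math

# Hypothetical standard error (in percentage points) for each estimate.
# Not given in the original example; assumed here for illustration only.
ASSUMED_SE = 10.0


def z_score(estimate, se):
    """Number of standard errors the estimate lies from zero."""
    return estimate / se


def is_significant(z, critical=1.96):
    """Two-sided test at roughly the .05 level (normal approximation)."""
    return abs(z) > critical


normal_effect = 15.0  # increase in firing, normal cells
mutant_effect = 30.0  # increase in firing, mutant cells

z_normal = z_score(normal_effect, ASSUMED_SE)
z_mutant = z_score(mutant_effect, ASSUMED_SE)

# Standard error of the difference between two independent estimates.
se_diff = math.sqrt(ASSUMED_SE**2 + ASSUMED_SE**2)
z_diff = z_score(mutant_effect - normal_effect, se_diff)

for label, z in [("normal", z_normal), ("mutant", z_mutant), ("difference", z_diff)]:
    print(f"{label}: z = {z:.2f}, significant = {is_significant(z)}")
```

Under those assumptions, the 15-point difference is even further from significance than the 15 percent effect was, because the difference between two noisy estimates is noisier than either one alone.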

What Goldacre would have us do, I think, is to say that we don't have evidence that there's a difference. So we'd say this:

1. There is no evidence that the first treatment works.

2. There is significant evidence the second treatment works.

3. There is no evidence that the treatments are different.

And now we have the twin situation. Because even though there's no evidence for #1, and there's no evidence for #3, there is evidence that one of the two must be true. Either the first treatment has an effect, or the two treatments are different. They can't both be false. At least one of the twins must be guilty.

You have to be especially careful, more careful than usual, that you don't assume that absence of evidence is evidence of absence. Otherwise, if you insist that both coefficients should be treated as zero, you're claiming a logical impossibility.

----

Even if you just assume that one of the two coefficients is zero ... well, how do you know which one? If you choose the first one, you assume the effect is zero, when the observation was 15 percent. If you choose the second one, you assume the effect is 30 percent, when the observation was 15 percent.

And it gets worse. Imagine that instead of 15 percent for the first treatment, the result was 27 percent, which was significant. Now, you can say there is evidence for the first treatment, and there is also evidence for the second treatment. And, you can also say that there is no evidence that the two treatments are different.

That's all good so far. But, now, you head to your regression, and you start computing estimates. And, what do you do? You probably use 27 percent for the first treatment, and 30 percent for the second treatment. But you just finished saying there's no evidence they're different! Shouldn't you be using 27 for both, or 30 for both, or 28.5 for both? Shouldn't it be a problem that you assume one thing on one page, and the opposite on the very next page?

If you're going to say, "there's no evidence for a difference but we're going to assume it anyway," why is that better than saying (in the previous example) "there's no evidence that the first treatment works, but we're going to assume it anyway"?

Tuesday, June 05, 2012

Privileging the null hypothesis

The fallacy of "privileging the hypothesis" occurs when you concentrate on one particular possibility, without having any logically correct reason for preferring it to all the others.

The general idea is: if you want to put forth a hypothesis for consideration, it's incumbent on you to explain why that particular hypothesis is worthy of attention, and not any other ones. The "Less Wrong" website gives an example:

"Privileging the hypothesis" is the fallacy committed by a creationist who points out a purported flaw in standard evolutionary theory, and brings forward the Trinity to fill the gap - rather than a billion other deities or a trillion other naturalistic hypotheses. Actually, without evidence already in hand that points to the Trinity specifically, one cannot justify raising that particular hypothesis to explicit attention rather than a trillion others.

I think this is right. And, therefore, I think that many statistical studies -- the traditional, frequentist, academic kind -- are committing this fallacy on a regular basis.

Specifically: most of these studies test whether a certain variable is significantly different from zero. If it isn't, they assume that zero is the correct value.

That's privileging the "zero" hypothesis the same way the example privileges the "Trinity" hypothesis, isn't it? The study comes up with a confidence interval, which includes zero, but, by definition, includes an uncountable infinity of other values. And then it says, "since zero is possible, that's what we're going to assume."

That's not the right thing to do -- unless you can explain why the "zero" hypothesis is particularly worthy.

Often, it obviously IS worthy. Carl Sagan wrote that "extraordinary claims require extraordinary evidence." If a non-zero coefficient is an extraordinary claim, then, of course, there's no problem.

For instance, suppose a subject in an ESP study gets 3 percent more correct guesses than you'd expect by chance. That's not significantly different than zero percent. In that case, you're absolutely justified in assuming the real effect is zero.

"ESP exists" is an extraordinary claim. A p-value of, say, .25 is not extraordinary evidence. So, you're not "privileging the hypothesis," in the sense of giving it undeserved consideration. You do have a logically correct reason for preferring it.

------

But ... that's not always the case. Suppose you want to test whether scouts can effectively evaluate NHL draft prospects. So you find 50 scouts, and you randomly choose two prospects, and ask each scout which one is more likely to succeed in the NHL. If scouting were random, you'd expect 25 out of the 50 scouts to be correct. Suppose it turns out that 27 are correct, which, again, isn't statistically significant.

Should you now conclude that scouts' picks are no better than random chance -- that scouts are worthless?

I don't think you should.

Because, why not start with a different null hypothesis, one that says that scouts are always 54.3 percent right? If you do that, you'll again fail to find statistical significance. Then, just like in the "zero" case, you say, "there's no evidence that 54.3 percent is wrong, so we will assume it's right."
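Here is a rough Python sketch of that comparison, using an exact two-sided binomial test on the 27-of-50 result. The definition of "two-sided" used here (add up every outcome no more likely than the observed one) is one common convention, and it is my choice for the illustration; the function names are hypothetical.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def two_sided_p_value(observed, n, p_null):
    """Exact two-sided binomial p-value: the total probability of every
    outcome that is no more likely than the observed one under the null."""
    p_obs = binom_pmf(observed, n, p_null)
    tolerance = 1 + 1e-9  # guard against floating-point ties
    return sum(
        binom_pmf(k, n, p_null)
        for k in range(n + 1)
        if binom_pmf(k, n, p_null) <= p_obs * tolerance
    )


N_SCOUTS = 50
N_CORRECT = 27

p_random = two_sided_p_value(N_CORRECT, N_SCOUTS, 0.5)     # "scouts are worthless"
p_skilled = two_sided_p_value(N_CORRECT, N_SCOUTS, 0.543)  # "scouts are 54.3% right"

print(f"null = 50.0% correct: p = {p_random:.3f}")
print(f"null = 54.3% correct: p = {p_skilled:.3f}")
```

Both p-values come out far above .05, so the same data "fails to reject" either null. The test alone gives no reason to prefer one over the other.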

That second one sounds silly, doesn't it? It's obvious that a null hypothesis "scouts are good for exactly 4.3 percent" is arbitrary. But, "Scouts are no good at all" seems ... better, somehow.

Why should we favor one over the other? Specifically: why do we judge that this null hypothesis is good, but that other null hypothesis is bad?

It's not just the number zero. Because, obviously, we can easily set up this study so that the null hypothesis is 50 (percent), or 25 (out of fifty), and we'll still think that's a better hypothesis than 54.3 percent.

Also, you can set up any hypothesis you want, to make the null zero. Suppose I want to "prove" that third basemen live exactly 6.3 percent longer than second basemen. I say, "John Smith believes third basemen live 6.3 percent longer. So I built that into the model, and added another parameter for how much John is off by. And, look, the other parameter isn't significantly different from zero. So, while others might suggest that the other parameter should be negative 6.3 percent, there's no proof of that. So we should assume that it's zero, and therefore that third basemen live 6.3 percent longer than second basemen."

That should make us very uncomfortable.

So if it's not the *number* zero, is it, perhaps, the hypothesis of zero *effect*? That is, the hypothesis that a certain variable doesn't matter, regardless of whether we represent that with a zero or not.

I don't think that's it either. Suppose I find a bunch of random people, and use regression to predict the amount of money in their pocket based on the number of nickels, dimes, and quarters they're carrying. And the estimate for nickels works out to 4 cents, but with an SD of 5 cents, so it's not statistically significantly different from zero.

In this case, nobody would assume the zero is true. Nobody would say, "nickels do not appear to influence the amount of money someone is carrying."

It would be obvious that, in this case, the null hypothesis of "zero effect" isn't appropriate.
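A quick sketch of the nickel arithmetic, in Python. It treats the "SD of 5 cents" as the standard error of the coefficient and uses a normal approximation for the interval; both are assumptions made for illustration.

```python
def confidence_interval(estimate, sd, z=1.96):
    """Approximate 95% interval under a normal approximation."""
    return (estimate - z * sd, estimate + z * sd)


def fails_to_reject(null_value, estimate, sd, z=1.96):
    """True when the null value falls inside the interval."""
    low, high = confidence_interval(estimate, sd, z)
    return low <= null_value <= high


NICKEL_ESTIMATE = 4.0  # cents, from the regression in the example
NICKEL_SD = 5.0        # cents

low, high = confidence_interval(NICKEL_ESTIMATE, NICKEL_SD)
print(f"interval: ({low:.1f}, {high:.1f}) cents")

# Compare two candidate nulls: "nickels are worth nothing" and "nickels are worth 5 cents".
for null_value in (0.0, 5.0):
    verdict = fails_to_reject(null_value, NICKEL_ESTIMATE, NICKEL_SD)
    print(f"null = {null_value:.0f} cents: fail to reject = {verdict}")
```

The interval contains both zero and five cents. The data can't tell them apart; only what we already know about nickels can.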

-----

So what is it? Well, as I've argued before, it's common sense. The null hypothesis has to be the one that human experience thinks is very much more likely to be true. And that's often zero.

If you're testing a possible cancer drug, chances are it's not going to work; even after research, there are hundreds of useless compounds for every useful one. So, the chance that this one will work is small, and zero is reasonable.

If people had ESP, we'd see a lot of quadrillionaires in the world, so common sense suggests a high probability that ESP is zero.

But what about scouting? Why does it seem OK to use the null hypothesis of zero?

Perhaps it's because zero just seems like it should be more likely than any other single value. It might still be a longshot, that scouts don't know what they're doing -- maybe you consider it a 5 percent chance. That's still higher than the chance that scouts are worth exactly 2.156833 percentage points. Zero is the value that's both more likely, and less arbitrary.

But ... still, it depends on what you think is common sense, which is based on your experience. If you're an economist who's just finished reading reams of studies that show nobody can beat the stock market, you might think it reasonable that maybe scouts can't evaluate players very well either.

On the other hand, if you're an NHL executive, you feel, from experience, that you absolutely *know* that scouts can often see through the statistics and tell who's better. To you, the null hypothesis that scouts are worth zero will seem as absurd as the null hypothesis that nickels are worth zero.

What happens, then, when a study comes up with a confidence interval of, say, (-5 percent, 30 percent)? Well, if the null hypothesis were zero, the researcher might say, "scouts do not appear to have any effect." And the GM will say, "That's silly. You should have used the null hypothesis of around 10 percent, which all us GMs believe from experience and common sense. Your confidence interval actually fails to reject our null hypothesis too."

Which is another reason I say: you have to make an argument. You can't just formulaically decide to use zero, and ignore common sense. Rather, you have to argue that zero is an appropriate default hypothesis -- that, in your study, zero has *earned* its status.

But ... for scouting, I don't think you can do that. There are strong, plausible reasons to assume that scouts have value. If you ignore that, and insist on a null hypothesis of zero, you're begging the question. You're committing the fallacy of privileging your null hypothesis.

Friday, June 01, 2012

Defending my "walk year" study

Over at Slate, Dan Engber talks about the hypothesized "walk year" effect, in which players put out extra effort when they're about to negotiate a new contract (or, correspondingly, put out less effort after signing a long-term deal). He discusses various studies (including one of mine) that look at player statistics to try to see whether this happens. Some of those papers found an effect, and some (like mine) did not.

I agree with Dan that the hardest thing about these kinds of studies is coming up with a way to estimate what the player "should have" done under neutral circumstances. For instance, suppose a player OPSes .800 and .750 in the third-last and second-last year of his contract, but then hits .780 in his walk year. How do you evaluate that? Is it evidence that he's trying harder, since he improved from the year before? Or is it evidence against, since he hit .800 two years ago and hits only .780 now?

What you need is to have a way of benchmarking. You need to say, for instance, "if a player hits .800 and .750 in his two previous years, he's expected to hit .770 this year. So, if he hits more, that's confirmatory evidence of a walk year effect, and if he hits less, that's evidence against."

But how do you come up with that number, the .770? That's critical to your conclusions.

The various academic studies Dan cites come up with various methods. Most of them do it by some kind of regression. They try to predict performance by various combinations of variables: age, contract status, salary, experience, number of years with the same team ... stuff like that.

The problem with that approach, as Dan points out, is that you really have to get everything right if you're going to avoid confounding effects. For instance, suppose your regression includes a variable for age, but you incorrectly assume that the relationship is linear (it actually rises to a peak in a player's 20s, then falls). In that case, you're probably going to overestimate older and younger players, but underestimate players in their peak ages. And so, if players in the last year of a contract tend to be those in their prime, you'll underestimate them, and therefore appear to find a walk year effect when, actually, you're really just seeing an artifact of age. If you find an effect, you don't know if it's real, or if it's just that you're consistently biased in your expectations for the players.
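A toy illustration of that age artifact, in Python. The aging curve here is made up (a parabola peaking at age 27, with no noise and no walk year effect at all); it's only meant to show which way a straight-line fit errs, not to describe any real aging pattern or any of the studies discussed.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


PEAK_AGE = 27  # hypothetical peak age


def true_ops(age):
    """Made-up aging curve: rises to a peak at 27, then falls."""
    return 0.800 - 0.002 * (age - PEAK_AGE) ** 2


ages = list(range(22, 37))
actual = [true_ops(age) for age in ages]
slope, intercept = fit_line(ages, actual)

# Residual = actual minus what the (wrong) linear model predicts.
residuals = {age: true_ops(age) - (intercept + slope * age) for age in ages}

for age in (22, 27, 36):
    print(f"age {age}: actual minus predicted = {residuals[age]:+.3f}")
```

The peak-age player beats the linear prediction, and the youngest and oldest fall short of it, even though nobody in this toy data is trying any harder than anyone else.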

----

How, then, do I defend my own study? Why do I think it works better than the other studies Dan mentions?

There's one big reason. Most of the other studies try to predict what the player should have done based on his past performance. My study tries to predict based on his past *and future* performance.

That is: the other studies ask, what comes next in the series ".750, .800"? My study adds the two years *after* the walk year, and asks, what's the missing number in the series ".750, .800, ???, .900, .950"?

Obviously, the second question is easier to answer than the first. The first way, if the player hits .850 the next year, it looks like he must be trying harder. The second way, it's obvious that .850 is exactly what you'd expect.

And it's not just the fact that you have twice as much data. It's also that age matters a lot less now. If you only look at the "before" years, you have to figure out whether the current year should be higher or lower based on age. But if you look at "before" and "after", it doesn't matter as much, because the subsequent years tell you which way the player is headed. Regardless of whether he's getting better or worse, the average of the four years should still be roughly what you'd expect in the middle.

The only time you'd have to significantly adjust for age is when a player is near his peak. For instance -- and I'll switch to small numbers to make it simpler -- if a young guy goes 4, 6, ?, 4, 6, you'll probably guess the middle should be around 5. And if an old player does the same thing, again you'll guess the middle should be around 5.

But if the "?" year is at the peak, when the player is, say, 27, maybe you want to bump that to 5.3 or 5.4, to reflect that most players are a bit better at 27 than in the average of the surrounding years. Still, that's a much simpler age adjustment than trying to predict age-related changes for every player, at any age.
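The idea can be sketched in a few lines of Python. This is a simplified illustration of the before-and-after benchmark as described in this post, not the actual calculation from the study; the function name and the `peak_bump` parameter are hypothetical.

```python
def expected_from_neighbors(before, after, peak_bump=0.0):
    """Benchmark the missing year as the plain average of the surrounding
    seasons, plus an optional bump for a player near his peak age.
    Simplified: equal weights, no adjustment for playing time."""
    surrounding = list(before) + list(after)
    return sum(surrounding) / len(surrounding) + peak_bump


# The ".750, .800, ???, .900, .950" series from above.
ops_benchmark = expected_from_neighbors([0.750, 0.800], [0.900, 0.950])
print(f"expected walk-year OPS: {ops_benchmark:.3f}")

# The small-number example: 4, 6, ?, 4, 6.
plain = expected_from_neighbors([4, 6], [4, 6])
at_peak = expected_from_neighbors([4, 6], [4, 6], peak_bump=0.3)
print(f"young or old player: {plain:.1f}")
print(f"player at his peak:  {at_peak:.1f}")
```

The only age-related input is the single bump for players near their peak, which is the point: the surrounding years already carry most of the information about which way the player is headed.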

------

One more example, to maybe make things clearer.

1. At the end of 1980, the Dow-Jones Industrial Average was about 825. What should you have expected it to be at the end of 1981?

2. At the end of 1980, the Dow-Jones Industrial Average was about 825. At the end of 1982, it was about 1046. What should you have expected it to be at the end of 1981?

Much easier to get an answer to the second question than the first, right? Sure, for any given year, you might be off, but, overall, knowing the subsequent year will make you much, much closer, on average.