Sabermetric Research

Wednesday, November 30, 2011

Why it's hard to estimate small effects

Here's a great 2009 paper (.pdf) by Andrew Gelman and David Weakliem (whom I'll call "G/W"), on the difficulty of finding small effects in a research study.

I'll translate it to baseball to start.

-----

Let's suppose you have someone who claims to be a clutch hitter. He's a .300 hitter, but, with runners on base, he claims to be a bit better.

So, you say, show us! You watch his 2012 season, and see how well he hits in the clutch. You decide in advance that if it's statistically significantly different from .300, that will be evidence he's a clutch hitter.

Will that work? No, it won't.

Over 100 AB, the standard deviation of batting average is about 46 points. To find statistical significance, you want 2 SD. That means to convince you, the player would have to hit .392 in the clutch.
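That 46-point figure is just binomial sampling error. A minimal check, treating each at-bat as an independent coin flip with a .300 hit probability (a simplification, but the standard one):

```python
import math

# Standard deviation of batting average over n at-bats, treating each AB
# as a Bernoulli trial with hit probability p
def ba_sd(p, n):
    return math.sqrt(p * (1 - p) / n)

sd = ba_sd(0.300, 100)
print(round(sd, 4))               # 0.0458 -- about 46 points
print(round(0.300 + 2 * sd, 3))   # the 2-SD significance bar: 0.392
```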

The problem is, he's not a .392 hitter! He himself is only claiming to be a little bit better than .300. So, in your study, the only evidence you're willing to accept is evidence that you *know* can't be taken at face value.

Let's say the batter actually does meet your requirement. In fact, let's suppose he exceeds it, and hits .420. What can you conclude?

Well, suppose you didn't know in advance that you were looking for a small effect. Suppose you were just doing a "normal" paper. You'd say, "look, he beat his expectation by 2.6 SD, which is statistically significant. Therefore, we conclude he's a clutch hitter." And then you write a "conclusions" section with all the implications of having a .420 clutch hitter in your lineup.

But, in this case, that would be wrong, because you KNOW he's not a .420 clutch hitter, even though that's what he hit and you found statistical significance. He's .310 at best, maybe .320, if you stretch it. You KNOW that the .420 was mostly due to luck.

Still ... even if you can't conclude that the guy is truly a .420 clutch hitter, you SHOULD be able to at least conclude that he's better than .300, right? Because you did get that statistical significance.

Well ... not really, I don't think. Because, the same evidence that purports to show he's not a .300 hitter ALSO shows he's not a .320 hitter! That is, .420 is also more than 2 standard deviations from .320, which is the best he possibly could be.

What you CAN do, perhaps, is compare the two discrepancies. .420 is 2.6 SDs from .300, but only 2.2 SDs from .320. That does appear to make .320 more likely than .300. In fact, the probability of a .320 hitter going 42-for-100 is almost three times as high as the probability of a .300 hitter going 42-for-100.

But, first, that's still only about 3 in 4 odds in favor of .320. Second, that ignores the fact that there are a lot more .300 hitters than .320 hitters, which you have to take into account.
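That likelihood comparison can be checked exactly with the binomial formula; the combinatorial term cancels out of the ratio, so only the p and (1-p) parts matter:

```python
from math import comb

# P(k hits in n at-bats) for a hitter with true average p
def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

lik_300 = binom_pmf(42, 100, 0.300)  # chance a .300 hitter goes 42-for-100
lik_320 = binom_pmf(42, 100, 0.320)  # chance a .320 hitter does

print(round(lik_320 / lik_300, 2))   # about 2.8
```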

So, all things considered, you should know in advance that you won't be able to conclude much from this study. The sample size is too small.

-------

That's Gelman and Weakliem's point: if you're looking for a very small effect, and you don't have much data, you're ALWAYS going to have this problem. If you're looking for the difference between .300 and .320, that's a difference of 20 points. If the standard error of your experiment is a lot more than 20 points ... how are you ever going to prove anything? Your instrument is just too blunt.

In our example, the standard error is 46 points. To find statistical significance, you'd have to observe an effect of at least 92 points! And so, if you're pretty sure clutch hitting talent is less than 92 points, why do the experiment at all?

But what if you don't know if clutch hitting talent is less than 92 points? Well, fine. But you're still never going to find an effect less than 92 points. And so, your experiment is biased, in a way: it's set up to only find effects of 92 points or more.

That means that if the effect is small, no matter how many scientists you have independently searching for it, they'll never find it at its true size. Moreover, the ones who do reach statistical significance will be reporting a LARGE effect.

No matter what happens, the estimate will either be wrong on the high side, or wrong on the low side. It is impossible for it to be accurate for a small effect. The only way to find a small effect is to increase the sample size. But even then, that doesn't eliminate the problem: it just reduces it. No matter what your experiment, and how big your sample size, if the effect you're looking for is smaller than 2 SDs of your estimate, you'll never find it.
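The bias is easy to quantify under the same binomial model. Suppose the hitter genuinely IS 20 points better in the clutch, a true .320. He clears the .392 significance bar (40 or more hits in 100 AB) only rarely, and on the rare occasions he does, the measured effect is at least 92 points:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# How often does a TRUE .320 clutch hitter hit .392+ over 100 AB,
# i.e., get "detected" against a .300 null at the 2-SD level?
p_detect = sum(binom_pmf(k, 100, 0.320) for k in range(40, 101))
print(round(p_detect, 3))  # only a few percent of the time
```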

That's G/W's criticism. It's a good one.

-------

G/W's example, of course, is not about clutch hitting. It's about a previously-published paper, which found that good-looking people are more likely to produce female offspring than male offspring. That study found an 8 percentage point difference between the nicest-looking parents and the worst-looking parents -- 52 percent girls vs. 44 percent girls.

And what G/W are saying is, that 8 point difference is HUGE. How do they know? Well, it's huge as compared to a wide range of other results in the field. Based on the history of studies on birth sex bias, two or three points is about the limit. Eight points, on the other hand, is unheard of.

Therefore, they argue, this study suffers from the "can't find the real effect" problem. The standard error of the study was over 4 points. How can you find an effect of less than 3 points, if your standard error is 4 points? Any reasonable confidence interval will cover so much of the plausible territory, that you can't really conclude anything at all.

Gelman and Weakliem don't say so explicitly, but this is a Bayesian argument. In order to make it, you have to argue that the plausible effect is small, compared to the standard error. How do you know the plausible effect is small? Because of your subject matter expertise. In Bayesian terms, you know, from your prior, that the effect is most likely in the 0-3 range, so any study that can only find an 8-point difference must be biased.

Every study has its own limits of how the standard error compares to the expected "small" effect. You need to know what "small" is. If a clutch hitting study was only accurate to within .0000001 points of batting average ... well, that would be just fine, because we know, from prior experience, that a clutch effect of .0000002 is relatively plausible. On the other hand, if it's only accurate to within .046, that's too big -- because a clutch effect of .092 is much too large to be plausible.

It's our prior that tells us that. As I've argued, interpreting the conclusions of your study is an informal Bayesian process. G/W's paper is one example of how that kind of argument works.

Monday, November 28, 2011

Why p-value isn't enough, reiterated

Question 1:

People are routinely tested for disease X, which 1 in 1000 people have overall. It is known that if the person has the disease, the test is correct 99% of the time. If the person does not have the disease, the test is also correct 99% of the time.

A patient goes to his doctor for the test. It comes out positive.

What is the probability that the patient has the disease?

Question 2:

Researchers routinely run studies to test unexpected hypotheses (such as: can outside prayer help cure disease?), of which 1 in 1000 tend to be true overall. It is known that if a hypothesis is true, a study correctly finds statistical significance 99% of the time. If the hypothesis is false, the study correctly finds NO statistical significance 99% of the time.

A researcher tests one such unexpected hypothesis. He finds statistical significance.

What is the probability that the hypothesis is true?
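Both questions reduce to the same Bayes computation, sketched here; the answer, perhaps surprisingly, is only about 1 in 11:

```python
# "prior" is the base rate (1 in 1000); the test/study is right
# 99% of the time in both directions.
def posterior(prior, true_pos_rate, false_pos_rate):
    num = prior * true_pos_rate                 # true positives
    den = num + (1 - prior) * false_pos_rate    # ... plus false positives
    return num / den

print(round(posterior(0.001, 0.99, 0.01), 3))   # 0.09 -- about 1 in 11
```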

Wednesday, November 23, 2011

Research conclusions *have* to be Bayesian

The last couple of posts here have been about interpreting the results of statistical studies. I argued that the statistical method itself might be just fine, but the *interpretation* of what it means, the conclusions you draw about real life, requires an argument. That is, you can get the regression right, but the conclusions wrong, because the conclusions call for argument and judgment.

Or, as some commenters have put it, "intuition" and "subjectivity". Those are negative words in academic circles. Objectivity is the ideal, and the idea that the reliability of a work of scholarship depends on a subjective evaluation of the author's judgment doesn't seem to be something that people like.

But, I think it absolutely has to follow. If you find a connection between A and B, how do you know if it's A that causes B, or B that causes A, or if it's all just random? That's something no statistical analysis can tell you. By definition, it calls for judgment, doesn't it? At least a little bit. Recall the recent (contrived) study that showed that listening to kids' music is linked to being physically older. Nobody would conclude that the music MAKES you older, right? But that's not a result of the statistical analysis -- it's a judgment based on outside knowledge. An easy, obvious judgment, but a judgment nonetheless.

It occurred to me that this judgment, the one that takes you from regression results to conclusions, is really an informal Bayesian inference. I don't think this is a particularly novel insight, but it helps to make the issue clearer. My argument is this: first, even if you do a completely normal ("frequentist") experiment, the step from the results to the conclusions HAS to be Bayesian. Second, and more importantly, because Bayesian techniques sometimes require judgment, and are therefore not completely objective, the convention has been to avoid such judgment in academic papers. Therefore, these studies have locked themselves into a situation in which they have to suspend judgment and use strict rules, which sometimes lead to wrong -- or seemingly absurd -- answers.

OK, let me start by explaining Bayesianism, as I understand it, first intuitively, then in a baseball context. As always, real statisticians should correct me where I got it wrong.

----------

Generally, Bayesian inference is a process by which you refine your probability estimate. You start out with whatever evidence you have, which leads you to a "prior" estimate for how things are. Then, you get more evidence. You add that to the pile, and refine your estimate by combining the evidence. That gives you a new, "posterior" estimate for how things are.

You're a juror at a trial. At the beginning of the trial, you have no idea whether the guy is guilty or not. You might think it's 50/50 -- not necessarily explicitly, but just intuitively. Then, a witness comes up that says he saw the crime happen, and he's "pretty sure" this is the guy. Combining that with the 50/50, you might now think it's 80/20.

Then, the defense calls the guy's boss, who said he was at work when the crime happened. Hmmm, you say, that sounds like he couldn't have done it. But there's still the eyewitness. Maybe, then, it's now 40/60.

And so on, as the other evidence unfolds.

That's how Bayesian reasoning works. You start out with your "prior" estimate, based on all the evidence to date: 50/50. Then, you see some new evidence: there's an eyewitness, but the boss provides an alibi. You combine that new evidence with the prior, and you adjust your estimate accordingly. So your new best estimate, your "posterior," is now 40/60.

---------

That's an intuitive example, but there is a formal mathematical way this works. There's one famous example, which goes like this:

People are routinely tested for disease X, which 1 in 1000 people have overall. It is known that if the person has the disease, the test is correct 100% of the time. If the person does not have the disease, the test is correct 99% of the time.

A patient goes to his doctor for the test. It comes out positive. What is the probability that the patient has the disease?

If you've never seen this problem before, you might think the chance is pretty high. After all, the test is correct at least 99% of the time! But that's not right, because you're ignoring all the "prior" evidence, which is that only 1 in 1000 people have the disease to begin with. Therefore, there's still a strong chance that the test is a false positive, despite the 99 percent accuracy.

The answer turns out to be about 1 in 11. The (non-rigorous) explanation goes like this: 1000 people see the doctor. One has the disease and tests positive. Of the other 999 who don't have the disease, about 10 test positive. So the positive tests comprise 10 people who don't have the disease, and 1 person who does. So the chance of having the disease is 1 in 11.

Phrasing the answer in terms of Bayesian analysis: The "prior" estimate, before the evidence of the test, is 0.001 (1 in 1000). The new evidence, though, is very significant, which means it changes things a fair bit. So, when we combine the new evidence with the prior, we get a "posterior" of 0.091 (1 in 11).
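That counting argument can be written out directly:

```python
# Mirror the counting argument: send 1000 people through the test.
population = 1000
sick = population * 0.001               # 1 person has the disease
healthy = population - sick             # 999 don't

true_positives = sick * 1.00            # test catches every real case
false_positives = healthy * 0.01        # 1% of the healthy test positive anyway

p_disease = true_positives / (true_positives + false_positives)
print(round(p_disease, 3))              # 0.091 -- about 1 in 11
```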

If that still seems counterintuitive to you, think of it this way: if the test is 99% accurate, that's a 1 in 100 chance that it's wrong. Those are low odds, which makes you think the test is probably right! But ... the original chance of having the disease is only 1 in 1000. Those are even worse odds. The prior of 1/1000 competes with the new evidence of 1/100. Because the new event (the test being wrong) is more likely than the old one (having the disease), the balance tilts toward the test being wrong: the odds are about 10:1 that it's a false positive rather than a real case of the disease.

Another way to put it: the less likely the disease was to start with, the more evidence you need to overcome those low odds. 1/100 isn't enough to completely overcome 1/1000.

(Perhaps you can see where this will be going, which is: if a research study's hypothesis is extremely unlikely in the first place, even a .01 significance level shouldn't be enough to overcome your skepticism. But I'm getting ahead of myself here.)

---------

Let's do an oversimplified baseball example. At the beginning of the 2011 baseball season, you (unrealistically) know there's a 50% chance that Albert Pujols' batting average talent will be .300 for the season, and a 50% chance that his batting average talent will be .330. Then, in April, he goes 26 for 106 (.245). What is your revised estimate of the chance that he's actually a .300 hitter?

You start with your "prior" -- a 50% chance he's a .300 hitter. Then, you add the new evidence: 26 for 106. Doing some calculation, you get your "posterior." I won't do the math here, but if I've got it right, the answer is that now the chance is about 74% that Pujols is actually a .300 hitter and not a .330 hitter.
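Here's the math, sketched with Bayes' rule and binomial likelihoods (it uses only the 106 April at-bats, and ignores everything else):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

prior_300, prior_330 = 0.5, 0.5          # 50/50 before April
lik_300 = binom_pmf(26, 106, 0.300)      # chance of 26-for-106 at each talent level
lik_330 = binom_pmf(26, 106, 0.330)

posterior_300 = (prior_300 * lik_300) / (prior_300 * lik_300 + prior_330 * lik_330)
print(round(posterior_300, 2))           # about 0.74
```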

That should be in line with your intuition. Before, you thought there was a good chance he was a .330 hitter. After, you think there's still a chance, but less of a chance.

We started thinking Pujols was still awesome. Then he hit .245 in April. We thought, "Geez, he probably isn't really a .245 hitter, because we have many years of prior evidence that he's great! But, on the other hand, maybe there's something wrong, because he just hit .245. Or maybe it's just luck, but still ... he's probably not as good as I thought."

That's how Bayesian thinking works. We start with an estimate based on previous evidence, and we update that estimate based on the new evidence we add to the pile.

--------

Now, for the good part, where we talk about academic regression studies.

You want to figure out whether using product X causes cancer. You do a study, and you find statistical significance at p=0.02, and the coefficient says that using product X is linked with a 1% increased chance of cancer. You are excited about your new discovery. What do you put in the "conclusions" section of your paper?

Well, maybe you say "this study has found evidence consistent with X causing cancer." But that isn't helpful, is it? I mean, you also found evidence that's consistent with X *not* causing cancer -- because, after all, it could have just been random luck. (A significance level of .02 would happen by chance 1 out of 50 times.)

Can you say, "this is strong evidence that X causes cancer?" Well, if you do, it's subjective. "Strong" is an imprecise, subjective word. And what makes the evidence "strong"? You'd better have a good argument about why it's strong and not weak, or moderate. The .02 isn't enough. As we saw in the disease example, a positive test -- which is equivalent to a significance level of .01, since a positive test for a healthy person happens only 1 in 100 times -- was absolutely NOT strong evidence of having the disease. (It meant only a 1 in 11 chance.)

Similarly, you can't say "weak" evidence, because how do you know? You can't say anything, can you?

It turns out that ANY conclusion about what this study means in real life has to be Bayesian, based not just on the result, but on your PRIOR information about the link between cancer and X. There is no conclusion you can draw otherwise.

Why? Well, it's because the study has it backwards.

What we want to know is, "assuming the data came up the way it did, what is the chance that X causes cancer?"

But the study only tells us the converse: "assuming X does not cause cancer, what is the chance that the data would come up the way it did?"

The p=0.02 is the answer to the second question only. It is NOT the answer to the first question, which is what we really want to know. There is a step of logic required to go from the second question to the first question. In fact, Bayes' Theorem gives us the equation for finding the answer to the first question given the second. That equation requires us to know the prior.

What the study is asking is, "given that we got p=0.02 in this experiment, what's the chance that X causes cancer?" Bayes' Theorem tells us the question is unanswerable. All we can answer is, "given that we got p=0.02 in this experiment, what is the chance that X causes cancer, given our prior estimate before this experiment?"

That is: you CANNOT make a conclusion about the likelihood of "X causes cancer" after the experiment, unless you had a reliable estimate of the likelihood of "X causes cancer" BEFORE the experiment. (In mathematical terms, to calculate P(A|B) from P(B|A), you need to know P(B) and P(A).)

Does this sound wrong? Do you think you can get a good intuitive estimate just from this experiment alone? Do you feel like the .02 we got is enough to be convincing?

Well, then, let me ask you this: what's your answer? What do you think the chance is that X causes cancer?

If you don't agree with me that there's no answer, then figure out what you think the answer is. You may assume the experiment is perfectly designed, the sample size is adequate, and so on. If you don't have a number -- you probably don't -- think of a description, at least. Like, "X probably causes cancer." Or, "I doubt that X causes cancer." Or, "by the precautionary principle, I think everyone should avoid X." Or, "I don't know, but I'd sure keep my kids away from X until there's evidence that it's safe!" Go ahead. I'll leave some white space for you. Get a good intuitive idea of what your answer is.

Now, suppose I tell you what X is: X is a bible. Did your answer depend on that? It should. Your conclusion about the dangers of X should absolutely depend on what X is -- more specifically, what you knew about X before. That is, your PRIOR. Your prior, I hope, had a probability of close to 0% that a bible can cause cancer. That's not just a wild-ass intuition. There are very good, rational, objective reasons to believe it. Indeed, there is no evidence that the information content of a book can cause cancer, and there is no evidence or logic that would lead you to believe that bibles are more carcinogenic than, say, copies of the 1983 Bill James Baseball Abstract.

Call this "intuition" or "subjectivity" if you want. But if you decide not to use your own subjective judgment, what are you going to do? Are you going to argue that bibles cause cancer just to avoid having to take a stand?

I suppose you can stop at saying, "this study shows a statistically significant relationship between bible use and cancer." That's objectively true, but not very useful. Because the whole point of the study is: do bibles cause cancer? What good is the study if you can't apply the evidence to the question?

--------

You could do the Bayesian approach more formally. That's what researchers usually mean when they talk about "Bayesian methods" -- they mean formal statistical algorithms.

To do a Bayesian analysis, you need a prior. You could just take something you think is reasonable. "Well, we don't believe there's much of a chance bibles cause cancer, so we're going to assume a prior 99.9999% probability that there's no effect, and we'll split the remaining 0.0001 percent over a range between -2% and +2%." Now, you do the study, and recalculate your posterior distribution, to see if you now have enough evidence to conclude there's a danger.

If you did that, you'd find that your posterior distribution -- your conclusion -- was that the probability of no effect went down, but only from 99.9999% to 99.995%, or something. That would make your conclusion easy: "the evidence should increase our worry that bibles cause cancer, but only from 1 in a million to 1 in 20,000."
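Here's what that spike-and-slab update looks like numerically. All the numbers are hypothetical (an observed effect of +1% with a standard error of 0.43%, which is roughly what a two-tailed p of .02 implies), but they show why a prior spike that big barely moves:

```python
import math

def normal_pdf(x, mu, sd):
    return math.exp(-((x - mu) / sd) ** 2 / 2) / (sd * math.sqrt(2 * math.pi))

# Hypothetical study result: observed effect +1%, standard error 0.43%
observed, se = 1.0, 0.43

# Prior: 99.9999% spike at "no effect", the remaining one-in-a-million
# spread evenly over effects from -2% to +2%
spike = 0.999999
grid = [-2 + 0.1 * i for i in range(41)]
slab_weight = (1 - spike) / len(grid)

ev_null = spike * normal_pdf(observed, 0.0, se)
ev_effect = sum(slab_weight * normal_pdf(observed, mu, se) for mu in grid)

posterior_null = ev_null / (ev_null + ev_effect)
print(posterior_null)  # still extremely close to 1
```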

But, that Bayesian technique is not really welcome in academic studies. Why? Because that prior distribution is SUBJECTIVE. The author can choose any distribution he wants, really. I chose 99.9999%, but why not 99.99999% (which is probably more realistic)? The rule is that academic papers are required to be objective. If you allow the author to choose any prior he wants, based on his own intuition or judgment, then, first, the paper is no longer objective, and second, there is the fear that the author could get any conclusion he wanted just by choosing the appropriate prior.

So papers don't want to assume a prior. So instead of arguing about the chance the effect is real, the paper just assumes it's real, and takes it at face value. If X appears to increase cancer by 1%, and it's statistically significant, then the conclusion will assume that X actually *does* increase cancer by 1%.

That sounds like it's not Bayesian. But, in a sense, it is. It's exactly the result you'd get from a Bayesian analysis with a prior that assumes every result is equally likely. Yes, it's objective, because you're always using the same prior. But it's the *wrong* prior. You're using a fixed assumption, instead of the best assumption you can, just because the best assumption is a matter of discretion. You're saying, "Look, I don't want to make any subjective assumptions, because then I'm not an objective scientist. So I'm going to assume that bibles are just as likely to cause 1% more cancers as they are to cause 0% more cancers."

That's obviously silly in the bible case, and, when it's that obvious, it looks "objective" enough that the study can acknowledge it. But most of the time, it's not obvious. In those cases, the studies will just take their results at face value, *as if theirs is the only evidence*. That way, they don't have to decide if their result is plausible or not, in terms of real-life considerations.

Suppose you have two baseball studies. One says that certain batters can hit .375 when the pitcher throws lots of curve balls. Another says that batters gain 100 points on their batting average after the manager yells at them in the dugout. Both studies find exactly the same size effect, with exactly the same significance level of, say, .04.

Of the two conclusions, which one is more likely to be true? The curve ball study, of course. We know that some batters hit curve balls better than others, and we know some batters hit well over .300 in talent. It's fairly plausible that someone might actually have .375 talent against curve balls.

But the "manager yells at them" study? No way. We have a strong reason to believe it's very, very unlikely that batters would improve by that much just because they were yelled at. We have clutch hitting studies that barely find an effect even when the game is on the line. We have lots of other studies that, even when they do find an effect, like platooning, find it to be much, much less than 100 points. Our prior for the "manager yelling is worth 100 points" hypothesis is so low that a .04 will barely move it.

Still ... I guarantee you that if these two studies were published, the two "conclusions" sections would not give the reader any indication of the relative real-life likelihood of the conclusions being correct, except by reference to the .04. In their desire to be objective, the two studies would not only fail to give shadings of their hypotheses' overall plausibility, but they'd probably just treat both conclusions as if they were almost certainly true. That's the general academic standard: if you have statistical significance, you're entitled to just go ahead and assume the null hypothesis is false. To do anything else would be "subjective."

But while that eliminates subjectivity, it also eliminates truth, doesn't it? What you're doing, when you use a significance level instead of an argument, is that you're choosing what's most objective, instead of what's most likely to be right. You're saying, "I refuse to make a judgment, and so I'm going to go by rote and not consider that I might be wrong." That's something that sounds silly in all other aspects of life. Doesn't it also sound silly here?

--------

So, am I arguing that academics need to start doing explicit Bayesian analysis, with formal mathematical priors? No, absolutely not. I disagree with that approach for the same reasons other critics do: it's too subjective, and too subject to manipulation. As opponents argue, how do you know you have the right prior? And how can you trust the conclusions if you don't?

So, that's why I actually prefer the informal, "casual Bayesian" approach, where you use common sense and make an informal argument. You take everything you already know about the subject -- which is your prior -- and discuss it informally. Then, you add the new evidence from your study. Then, finally, you conclude about your evaluation of the real-life implications of what you found.

You say, "Well, the study found that reading the bible is associated with a 1% increase in cancer. But, that just sounds so implausible, based on our existing [prior] knowledge of how cancer works, that it would be silly to believe it."

Or, you say, "Yes, the study found that batters hit 100 points better after being yelled at by their manager. But, if that were true, it would be very surprising, given the hundreds of other [prior] studies that never found any kind of psychological effect even 1/20 that big. So, take it with a grain of salt, and wait for more studies."

Or, you say, "We found that using this new artificial sweetener is linked to one extra case of cancer per 1,000,000 users. That's not much different from what was found in [prior] studies with chemicals in the same family. So, we think there's a good chance the effect is real, and advise caution until other evidence makes the answer clearer."

That's what I meant, two posts ago, where I said "you have to make an argument." If you want to go from "I found a statistically significant 4% connection between cancer and X," to "There is a good chance X causes cancer," you can't do that, logically or mathematically, without a prior. The p value is NEVER enough information.

The argument is where you informally think about your prior, even if you don't use that word explicitly. The argument is where you say that it's implausible that bibles cause cancer, but more plausible that artificial sweeteners cause cancer. It's where you say that it's implausible that songs make you older, but not that the effect is just random. It's where you say that there's so much existing evidence that triples are a good thing, that the fact that this one correlation is negative is not enough to change your mind about that, and there must be some other explanation.

You always, always, have to make that argument. If you disagree, fine. But don't blame me. Blame Bayes' Theorem.

Monday, November 21, 2011

"Statisticians can prove almost anything"

Sometimes, when you look for statistical significance, you'll find it even if the effect isn't real -- in other words, a false positive. With a 5% significance level, you'll find that one out of 20 times.

However, experimenters don't do just one analysis one time. They'll try a bunch of different variables, and a bunch of different datasets. If they try enough things, they have a much better than 5% chance of coming up with a positive. How much better? Well, there's no real way to tell, since the tests aren't independent (adding one dependent variable to a regression isn't really a whole new regression). But, intuitively: if, by coincidence, your first experiment winds up at (say) p=0.15, it seems like it should be possible to get it down to 0.05 if you try a few things.

That's exactly what Joseph P. Simmons, Leif D. Nelson, and Uri Simonsohn did in a new academic paper (reported on in today's National Post). They wanted to prove the hypothesis that listening to children's music makes you older. (Not makes you *feel* older, but actually makes your date of birth earlier.) Obviously, that hypothesis is false.

Still, the authors managed to find statistical significance. It turned out that subjects who were randomly selected to listen to "When I'm Sixty Four" had an average (adjusted) age of 20.1 years, but those who listened to the children's song "Kalimba" had an adjusted age of 21.5 years. That was significant at p=.04.

How? Well, they gave the subjects three songs to listen to, but only put two in the regression. They asked the subjects 12 questions, but used only one in the regression. And, they kept testing subjects 10 at a time until they got significance, then stopped.

In other words, they tried a large number of permutations, but only reported the one that led to statistical significance.

One thing I found interesting was that one variable -- father's age -- made the biggest difference, dropping the p-value from .33 to .04. That makes sense, because father's age is very much related to subject's age. If your father is 40, you're unlikely to be 35. You could actually make a case that father's age *helps* the logic, not hurts it, even though it was arbitrarily selected because it gave the desired result.

-----

In this case, all the permutations meant that statistical significance was extremely likely. Suppose that, before any regressions, the two groups had about the same age. Then, you start adjusting for things, one at a time. What you're looking for is a significant difference in that one respect. The chance of that is 5%. But, the things the researchers adjusted for are independent: how much the subjects would enjoy eating at a diner, their political orientation, which of four Canadian quarterbacks they believed won an award ... and so on. With ten independent thingies, the chance at least one would be significant is about 0.4.
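That "about 0.4" is just the complement rule, assuming the ten adjustments really are independent:

```python
# Chance of at least one "significant" result from k independent tests,
# each with a 5% false-positive rate
def any_significant(k, alpha=0.05):
    return 1 - (1 - alpha) ** k

print(round(any_significant(1), 2))    # 0.05
print(round(any_significant(10), 2))   # 0.4
```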

Add to that the possibility of continuing the experiment until significance was found, and the possibility of combining factors, and you're well over 0.5.

Plus, if the researchers hadn't found significance, they would have kept adjusting the experiment until they did!

-----

The authors make recommendations for how to avoid this problem. They say that researchers should be forced to decide in advance when to stop collecting data. And they should be forced to list all variables and all conditions, allowing the referees and the readers to see all the "failed" options.

These are all good things. One thing I might add: you have to repeat the *exact same study* with a second dataset. If the original result was the product of manipulation, it will have only a 5% chance of standing up to an exact replication. This might create more false negatives, but I think it'd be worth it.

-----

One point I'd add is that this study reinforces my point, last post, that the interpretation of the study is just as important as the regression. For one thing, looking at all the "failed" iterations of the study is necessary to decide how to describe the conclusions. But, mostly, this study shows an extreme example of how you have to use insight to figure out what's going on.

Even if this study wasn't manipulated, the conclusion "listening to children's music makes you older" would be ludicrous. But, the regression doesn't tell you that. Only an intelligent analysis of the problem tells you that.

In this case, it's obvious, and you don't need much insight. In other cases, it's more subtle.

-----

Finally, let me take exception to the headline of the National Post article: "Statisticians can prove almost anything, a new study finds." Boo!

First of all, the Post makes the same mistake I argued against last post: the statistics don't prove anything: the statistics *plus the argument* make the case. Saying "statistics prove a hypothesis" is like saying "subtraction proves socialism works" or "the hammer built the birdhouse."

Second, a psychologist who uses statistics should not be described as a statistician, any more than an insurance salesman should be described as an actuary.

Third, any statistician would tell you, in seconds, that if you allow yourself to try multiple attempts, the .05 goes out the window. It's the sciences that have chosen to ignore that fact.

The true moral of the story, I'd argue, is that the traditional academic standard is wrong -- the standard that once you find statistical significance, you're entitled to conclude your effect is real.

Friday, November 18, 2011

A research study is just a peer-reviewed argument

To make your case in court, you need two things: first, some evidence; and, second, an argument about what the evidence shows.

The same thing is true in sabermetrics, or any other science. You have your data, and your analysis; that's the evidence. Then you have an argument about what it means.

But, most of the time, the "argument" part gets short shrift. Pick up a typical academic paper, and you'll see that most of the pages are devoted to explaining a regression, and listing the results and the coefficients and the corrections and the tests. Then, the author will just make an unstated assumption about what that means in real life, as if the regression has proven the case all by itself.

That's not right. The regression is important, but it's just the gathering of the evidence. You still have to look at that evidence, and explain what you think it means. You have to make an argument. The regression, by itself, is not an argument. The *interpretation* of the regression is the argument.

For instance: suppose you do a simple regression on exercise and lifespan, and you get the result that every extra mile jogged is associated with an increased lifespan of, say, 10 minutes. What does that mean in practical terms? Probably, the researcher will say that if you want Americans' lifespan to increase by a day, we should consider getting each of them to jog 144 more miles than they would otherwise. That would seem reasonable to most of us.
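The arithmetic behind that 144 figure, using the made-up 10-minutes-per-mile coefficient from the hypothetical regression:

```python
# Invented coefficient from the hypothetical exercise/lifespan regression:
minutes_gained_per_mile = 10
minutes_in_a_day = 24 * 60        # 1440

miles_for_one_extra_day = minutes_in_a_day / minutes_gained_per_mile
print(miles_for_one_extra_day)    # 144.0
```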

Suppose, now, another study looks at pro sports, and finds that every year spent as a starting MLB shortstop is associated with an extra $2 million in lifetime earnings. Will the researcher now say that if we want everyone to earn an extra $2 million, we should expand MLB so that everyone in the USA can be a starting shortstop? That would be silly.

Still another researcher does a regression to use triples to predict runs scored. That one finds a negative relationship. Should the study conclude that teams should stop trying to hit triples, because triples are just hurting them? Again, that would be the wrong conclusion.

All three of these regressions have exactly the same structure. The math is the same, the computer software is the same, the testing for heteroskedasticity is the same ... everything about the regressions themselves is the same. The difference is in the *interpretation* of what the regressions mean. The same interpretation, the same argument, makes sense in the first case, but is obviously ludicrous in the other two cases. And even the third case is very different from the second case.

The regression is just data, just evidence. It's the *interpretation* that's crucial, the argument about what that evidence means.

Why, then, do so many academic papers spend pages and pages on the details of the regression, but only a line or two justifying their conclusions? I don't know for sure, but I'd guess it's because regression looks mathematical and scholarly and intellectual and high-status, while arguments sound subjective and imprecise and unscientific and low-status.

Nonetheless, I think the academic world has it backwards. Regressions are easy -- shove some numbers into a computer and see what comes out. Interpretations -- especially correct interpretations -- are the hard part.

-----

If you think my examples are silly because they're too obvious, here's a real-life example that's more subtle: the relationship between salary and wins in baseball, a topic that's been discussed quite a bit over the last few years. If you do a regression on 2009 data, you'll get that

-- the correlation coefficient is .48
-- the r-squared = .23
-- the value of the coefficient is .16 of a win per $1 million spent
-- the coefficient is statistically significant (as compared to the null hypothesis of zero).

That's all evidence. But, evidence of what? So far, it's just numbers. What do they actually *mean*, in terms of actual knowledge about baseball?

To get from the raw numbers to a conclusion, you have to interpret what the regression says. You have to make an argument. You have to use logic and reason.

So you look at the coefficient of .16. From that, you can say, in 2009, every extra $6 million spent resulted, on average, in one extra win. I'm happy calling that a "fact" -- it's exactly what the data shows. But, almost anywhere you go from there now becomes interpretation. What does that *mean*, that every extra $6 million resulted in an extra win? What are the practical implications?
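The $6 million is just the reciprocal of the coefficient, rounded:

```python
# 2009 regression coefficient: wins per extra $1 million of payroll.
wins_per_million = 0.16

dollars_per_win = 1 / wins_per_million   # millions of dollars per extra win
print(round(dollars_per_win, 2))         # 6.25, i.e. roughly $6 million
```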

For instance, suppose you're a GM and want to gain an extra win next year. How much extra money do you have to spend on free agents? If you want to convince me that you know the answer, you have to take the evidence provided by the regression, and *make an argument* for why you're right.

A naive interpretation might be to just use that $6 million figure, and say, that's it! Spend an extra $6 million, and get an extra win. It seems obvious from the regression, but it would be wrong.

Why is it wrong? It's wrong because there are other causes of winning than spending money on free agents. There's also spending money on "slaves," and spending money on "arbs". Those are much cheaper than free agents. Effectively, some teams get wins almost for free, by having good young players. The teams that don't have that have to spend double, as it were: they have to buy a free agent just to catch up to the team with the cheap guys, and then they have to buy another one to surpass him.

For instance, team A has 80 wins for "free". Team B has 70 wins for "free" and buys another 20 on the free-agent market. The regression doesn't know free wins from bought wins. It sees that team B has 10 more wins, but spent an extra 20X dollars, where X is the actual cost of a free-agent win. Therefore, it spits out that it took 2X dollars to buy each extra win, even though it really took only X.

That is: the coefficient of dollars per win from the regression is twice what it actually costs to buy one. The coefficient doesn't measure what a naive researcher might think it does.
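Here's a toy simulation of that double-counting. The numbers are invented, in the spirit of the example above: every team "should" have 80 free wins, buys two wins on the market for each free win it's short, and a free-agent win really costs X = $4.5 million:

```python
import random

X = 4.5   # assumed true cost of one free-agent win, in $ millions

payrolls, wins = [], []
for _ in range(30):
    free_wins = random.uniform(70, 80)    # wins from cheap young players
    bought_wins = 2 * (80 - free_wins)    # short teams buy double to surpass
    payrolls.append(X * bought_wins)      # only bought wins cost money
    wins.append(free_wins + bought_wins)

# Ordinary least-squares slope of wins on payroll.
n = len(wins)
mp, mw = sum(payrolls) / n, sum(wins) / n
slope = (sum((p - mp) * (w - mw) for p, w in zip(payrolls, wins))
         / sum((p - mp) ** 2 for p in payrolls))

print(round(1 / slope, 1))  # implied $ millions per win: 9.0, twice X
```

The regression sees only total wins and total payroll, so it attributes all of team B's catching-up spending to its 10-win edge, and the implied cost per win comes out at double the true X.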

My numbers are artificial, but I chose numbers that actually come fairly close to real life. Various sabermetric studies have shown that a free agent win actually costs $4.5 million. But regressions for 2008, 2009, and 2010 show figures of $8.9, $6.2, and $12.6 million, respectively -- about twice as much.

Again, the issue is interpretation. If you're just showing the regression results, and saying, "here, figure out what this means," then, fine. But if your paper has a section called "discussion," or "conclusions," that means you're interpreting the results. And that's the part where it's easy to go wrong, and where you have to be careful.

----

Which brings me, finally, to the point that I'm trying to make: we should stop treating academic studies as objective scientific findings, and start treating them as arguments. Sure, we can remember that academic papers are written by experts, and peer reviewed, and that much of the time, there's no political slant behind them. If we want, we can consider them as generally well-reasoned arguments by experts of presumably above-average judgment.

But they're still arguments.

So when an interesting study is published, and the media report on it, they should treat it as an argument. And we should hold it to the same standards of skepticism to which we hold other arguments. A research paper is like an extended op-ed. The fact that there's math, and a review process, doesn't make it any less argument-like. The New York Times wouldn't present Paul Krugman's column as fact just because he used regressions and peer review, would they?

I googled the phrase "a new study shows." I got 55 million results. "A new study claims" gives only 4 million. "A new study argues" gives only 300,000.

But, really, it should be the other way around. New studies normally don't "show" anything but the regression results. Their conclusions are always "claimed" or "argued".

-----

The word "show" should be used only when the writer wants to indicate that the claim is true, or that it has been widely accepted in the field. At the time his original Baseball Abstract came out, you'd have to say Bill James was "arguing" that the Pythagorean Projection is a good estimator of team wins. But now that we know it's right, we say he "showed" it.

"Show" implies that you accept the conclusion. "Argue" or "claim" implies that you're not making a judgment.

The interesting thing is that the media seem to understand this. Sure, 90 percent of the time, they say "show". But when they don't, it's for a reason. The "claims" and "argues" are saved for controversial or frivolous cases, ones that the reporter doesn't want to imply are true. For instance, "New study claims gun-control laws have no effect on Canadian murder rate." And, "a new study argues that poker is a game of skill, not chance."

It's as if the reporters want to pretend scientific papers are always right, unless they conclude something that the reporter or editor doesn't agree with. But it's not the reporter's job to be implying the correctness of a conclusion, unless the reporter has analyzed the paper, and is writing the article as an opinion piece.

Ninety-nine percent of the time, a research paper does not "show" anything -- it only argues it. Because, correct conclusions don't just pop out of a regression. They only show up when you support that regression with a good, solid argument.

Thursday, November 10, 2011

The main economic benefit of baseball: we love it

From "The Sports Economist," here's commenter "bobby" with a very good comment:

I find it mildly distressing that almost all of the discussion about economic impacts of sporting events is about rectangles with rarely if ever a discussion of triangles. I was always trained that welfare was measured by consumer and producer surplus, not expenditures, but then what do I know?

...

I guess the idea that people are happier with a baseball game than a movie doesn’t mean much anymore, and its downright silly to suggest that a baseball game makes a place better off because people could have gone to a movie instead.

What he's saying is that if you're trying to measure the benefit of something, you measure it by "consumer surplus." That's the economic term for the difference between what you have to pay for something, and the maximum price you'd be willing to pay for it.

A lot of things have a huge consumer surplus. Take, for instance, a headache pill. An ibuprofen tablet costs only a few pennies -- a dime, tops. How much would you be willing to pay to make your headache go away? It's at least a dollar -- probably more, but at least a dollar. That means that every time you buy an Advil for ten cents, you're making a "profit" of at least 90 cents.
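In code, the surplus on that dime's worth of ibuprofen (numbers invented, as above):

```python
price_cents = 10             # what the ibuprofen tablet costs you
willing_to_pay_cents = 100   # what curing the headache is worth to you, at minimum

consumer_surplus = willing_to_pay_cents - price_cents
print(consumer_surplus)      # 90 cents of "profit" on every tablet
```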

An easier way of looking at it is this: if the product you're using didn't exist, how much worse off would you be? That's consumer surplus.

There's consumer surplus in almost everything you pay for. That's because, if you didn't buy it, you'd have to buy something else you liked less. If there were no Tim Hortons, I'd have to buy Starbucks coffee, which I don't like as much. If my favorite restaurant closed, I'd have to go to my second favorite, which is still good, but not as good as my favorite. And so on.

Now, for sports: how much consumer surplus do you get from sports? How much worse off would you be if there were no baseball, or hockey? For most of you reading this, your answer is probably -- a lot worse. You'd have more money, because you wouldn't be spending on trading cards and tickets and Bill James Baseball Abstracts, but, that barely matters, compared to how much less interesting your life would be without baseball. The same is probably true for your favorite team, if you have one. My life would be a lot worse without the Toronto Maple Leafs, even if all the other teams were still around.

So Bobby's argument is quite correct. The most important economic consideration, when it comes to pro sports, is how much better off people are because of it. Why, then, in almost every discussion of sports and economics, is this not a consideration?

One reason, as Victor Matheson says here, is that economists are often reacting to questions from non-economists, or politicians, who are more concerned about GDP and job creation than about the intrinsic value of sports as entertainment. When a candidate for office talks about building a new stadium to attract a team, the economic arguments are about the monetary values of the transactions it will create, rather than how happy the fans will be. And so, that's what the economists have to respond to.

Most of the time, though, the answer is that a new sports team creates essentially zero jobs on net, and little increase in GDP. Because, after all, if the money didn't get spent on sports, it would get spent on something else. If you live in Montreal and don't have the Expos any more, you'll go to a movie, or go out to dinner instead.

That's true for almost anything. If medicines were made illegal, GDP wouldn't change much in the long term -- you'll take the dime you would have spent on an Advil, and spend it on something else instead. The main reason a headache pill is a good thing is not that it adds 10 cents to total output, but that its benefit is way, way higher than its cost.

What I'd like to see, in economic analysis of sports, is some kind of estimate of how much it improves people's lives. It's a lot. Matheson says in his post that there have been some attempts to quantify consumer surplus, but his example is only among people who pay for tickets. But what about everything else? Most of us benefit from baseball far, far more than just our ticket purchases. We watch games on TV, write blogs about them, analyze them, talk about them at the water cooler. Sports are a big part of the fabric of most of our lives, and having a team in our city to root for is a huge unmeasured happy benefit.

I have argued before that if it makes sense for government to subsidize things like public broadcasting (CBC, BBC, NPR, etc.), it should also make sense for it to subsidize a hockey team, on a cost/benefit basis. But I can't show you any evidence (other than back-of-the-envelope) that that's the case, unless and until the economists start listening to bobby, and get to work on showing the size of the benefit.

Wednesday, November 02, 2011

"Baseball Analyst" archives now available

In 1982, Bill James created the "Baseball Analyst," a bimonthly amateur sabermetrics journal that relied on contributions from readers. It ran 40 issues, dying in early 1989.

Last weekend, with Bill's permission, I scanned all my issues and sent them to Jacob Pomrenke of SABR. Stephen Roney contributed some pages I was missing. Jacob reformatted everything. Rob Neyer, who was responsible for the Analyst's last few issues, wrote an introduction.

Finally, Jacob put it all online at the SABR website. All 40 issues are now publicly available for download.