Sabermetric Research

Phil Birnbaum

Wednesday, February 29, 2012

Are early NFL draft picks no better than late draft picks? Part II

This is about the Dave Berri/Rob Simmons paper that concludes that QBs who are high draft choices aren't much better than QBs who are low draft choices. You probably want to read Part I, if you haven't already.

--------

If you want to look for a connection between draft choice and performance, wouldn't you just run a regression to predict performance from draft choice? The Berri/Simmons paper doesn't. The closest they come is their last analysis of many, the one that starts at the end of page 47.

Here's what the authors do. First, they calculate an "expected" draft position, based on a QB's college stats, "combine stats" (height, body mass index, 40 yard dash time, Wonderlic score), and whether he went to a Division I-A school. That's based on another regression earlier in the paper. I'm not sure why they use that estimate -- it seems like it would make more sense to use the real draft position, since they actually have it, instead of their weaker (r-squared = 0.2) estimate.

In any case, they use that expected draft position, and run a regression to predict performance for every NFL season in which a QB had at least 100 plays. They also include terms for experience (a quadratic, to get a curve that rises, then falls).

It turns out, in that regression, that the coefficient for draft position is not statistically significant.

And, so, Berri and Simmons conclude,

"Draft pick is not a significant predictor of NFL performance. ... Quarterbacks taken higher do not appear to perform any better."

I disagree. Two reasons.

1. Significance

As the authors point out, the coefficient for draft position wasn't nearly significant -- it was only 0.52 SD from zero.

But, it's of a reasonable size, and it goes in the right direction. If it turns out to be non-significant, isn't that just because the authors didn't use enough data?

Suppose someone tells me that candy bars cost $1 at the local Kwik-E-Mart. I don't believe him. I hang out at the store for a couple of hours, and, for every sale, I mark down the number of candy bars bought, and the total sale.

I do a regression. The coefficient comes out to $0.856 more per bar, but it's only 0.5 SD.

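Suppose I then declared that, since the coefficient isn't significantly different from zero, there's no evidence that candy bars cost anything at all. That would be silly, wouldn't it? But that's what Berri and Simmons are doing.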

Imagine two quarterbacks with five years' NFL experience. One was drafted 50th. The other was drafted 150th. How much different would you expect them to be in QB rating? If you don't know QB rating, think about it in terms of rankings. How much higher on the list would you expect the 50th choice to be, compared to the 150th? Remember, they both have 5 years' experience and they both had at least 100 plays that year.

Well, the coefficient would say the early one should be 1.9 points better. I calculate that to be about 15 percent of the standard deviation for full-time quarterbacks. It'll move you up in the rankings two or three positions.

Is that about what you thought? It's around what I would have thought. Actually, to be honest, maybe a bit lower. But well within the bounds of conventional wisdom.

So, if you do a study to disprove conventional wisdom, and your point estimate is actually close to conventional wisdom ... how can you say you've disproven it?

That's especially true because the confidence interval is so wide. If we add 2 SD to the point estimate, we find that the effect of draft choice could be as high as -0.091. That means that 100 draft positions is worth 9.1 points. That's a huge difference between quarterbacks. Nine points would move you up at least 6 or 7 positions -- just because you were drafted earlier. It's almost 75 percent of a standard deviation.

Basically, the confidence interval is so wide that it includes any plausible value ... and many implausible values too!
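For anyone who wants to check that interval, here is a minimal sketch in Python. It uses only the two numbers quoted above (1.9 points per 100 draft slots, and a coefficient 0.52 SD from zero) and backs out the standard error from them. The variable names are illustrative, and because 1.9 is itself rounded, the result agrees with the -0.091 figure only approximately.

```python
# Numbers quoted in the post; variable names are illustrative only.
points_per_100_picks = 1.9   # point estimate, in QB rating points
t_ratio = 0.52               # the estimate is 0.52 SD away from zero

coef = -points_per_100_picks / 100   # rating change per draft slot
std_err = abs(coef) / t_ratio        # standard error implied by the t-ratio
low, high = coef - 2 * std_err, coef + 2 * std_err

print(f"coefficient: {coef:.3f}, implied standard error: {std_err:.3f}")
print(f"2-SD interval: {low:.3f} to {high:.3f}")
print(f"largest plausible gap per 100 picks: {-low * 100:.1f} points")
```

The other end of the interval is positive, which is why the regression can't reject zero. That shows how wide the interval is. It doesn't show that the effect is small.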

That regression doesn't disprove anything at all. It's a clear case of "absence of evidence is not evidence of absence."

2. Attrition

In Part I, I promised an argument that doesn't require the assumption that QBs who never play are worse than QBs who do. However, we can all agree, can't we, that if a QB plays, but then he doesn't play any more because it's obvious he's not good enough ... in *that* case, we can say he's worse than the others, right? I can't see Berri and Simmons claiming that Ryan Leaf would have been a star if only his coaches gave him more playing time.

If we agree on that, then I can show you that the regression doesn't work -- that the coefficient for draft choice doesn't accurately measure the differences.

Why not? Again, because of attrition. The worse players tend to drop out of the NFL earlier. That means they'll be underweighted in the regression (which has one row for each season). So, if those worse players tend to be later draft choices, as you'd expect, the regression would underestimate how bad those later choices are.

Here, let me give you a simple example.

Suppose you rate QBs from 1 to 5. And suppose the rating also happens to be the number of seasons the QB plays.

Let's say the first round gives you QBs of talent 5, 4, 4, and 3, which is an average of 4. The second round gives you 4, 2, 1 and 1, which averages 2.

Therefore, what we want is for the regression to give us a coefficient of 4 minus 2, which is 2. That would confirm that the first round is 2 better than the second round.

But it won't. Why not? Because of attrition.

-- The first year, everyone's playing. So, those years do in fact give us a difference of 2.

-- The second year, the two "1" guys are gone. The second round survivors are the "4" and "2", so their average is now 3. That means the difference between the rounds has dropped down to 1.

-- The third year, the "2" guy is gone, leaving the second round with only a "4". Both rounds now average 4, so they look equal!

-- The fourth year, the "3" drops out of the first round pool, so the difference becomes 0.33 in favor of the first round.

-- The fifth year ... there's nothing. Even though the first round still has a guy playing, the second round doesn't, so the results aren't affected.

So, see what happens? Only the first year difference is correct and unbiased. Then, because of attrition, the observed difference starts dropping.

Because of that, if you actually do the regression, you'll find that the coefficient comes up 1.15, instead of 2.00. It's understated by almost half!

This will almost always happen. Try it with some other numbers and assumptions if you like, but I think you'll find that the result will almost never be right. The exact error depends on the distribution and attrition rate.
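If you do want to try other numbers, here is a minimal sketch of the toy example in Python. It assumes the regression has an intercept, a first-round dummy, experience, and experience squared, with one row per QB-season. That specification is a guess at the setup described above, not the paper's actual model, and the helper names `season_rows` and `ols` are made up for this illustration. With the talent levels from the example, it comes out to roughly 1.15.

```python
from fractions import Fraction

def season_rows(first_round, second_round):
    """Build one row per QB-season: a QB rated t plays seasons 1..t."""
    X, y = [], []
    for is_first, ratings in ((1, first_round), (0, second_round)):
        for rating in ratings:
            for season in range(1, rating + 1):
                # Columns: intercept, first-round dummy, experience, experience^2.
                X.append([1, is_first, season, season ** 2])
                y.append(rating)
    return X, y

def ols(X, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y], kept as exact fractions.
    m = [[Fraction(sum(r[i] * r[j] for r in X)) for j in range(k)]
         + [Fraction(sum(r[i] * v for r, v in zip(X, y)))]
         for i in range(k)]
    for col in range(k):  # Gauss-Jordan elimination
        pivot = next(r for r in range(col, k) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(k):
            if r != col:
                factor = m[r][col]
                m[r] = [v - factor * p for v, p in zip(m[r], m[col])]
    return [m[i][k] for i in range(k)]

X, y = season_rows(first_round=[5, 4, 4, 3], second_round=[4, 2, 1, 1])
round_coef = float(ols(X, y)[1])
print(f"first-round coefficient: {round_coef:.2f} (true talent gap: 2.00)")
```

Changing the two lists passed to `season_rows` lets you see how the understatement moves with the talent distribution and the attrition pattern.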

Want a more extreme case? Suppose the first round is four 4s (average 4), and the second round is a 7 and three 1s (average 2.5). The first round "wins" the first year, but then the "1"s disappear, and the second round starts "winning" by a score of 7-4.

In truth, the first round players are 1.5 better than the second round. But if you do the Berri/Simmons regression, the coefficient comes out negative, saying that the first round is actually 0.861 *worse*!

So, basically, this regression doesn't really measure what we're trying to measure. The number that comes out isn't very meaningful.

------

Choose whichever of these two arguments you like ... or both.

I'll revisit some of the paper's other analyses in a future post, if anyone's still interested.

The study was in the news a couple of years ago, gaining a little bit of fame in the mainstream media when bestselling author Malcolm Gladwell debated it with Steven Pinker, the noted author and evolutionary psychologist. Gladwell wrote:

"... Berri and Simmons found no connection between where a quarterback was taken in the draft -- that is, how highly he was rated on the basis of his college performance -- and how well he played in the pros."

Pinker wrote:

"It is simply not true that a quarterback’s rank in the draft is uncorrelated with his success in the pros."

Gladwell wrote to Pinker, asking for evidence that would contradict Berri and Simmons' peer-reviewed published study. Pinker referred Gladwell to some internet analyses, one of which was from Steve Sailer. Gladwell was not convinced, but responded mostly with ad hominem attacks and deferrals to the credentialized:

"Sailer, for the uninitiated, is a California blogger with a marketing background who is best known for his belief that black people are intellectually inferior to white people. Sailer’s “proof” of the connection between draft position and performance is, I’m sure Pinker would agree, crude: his key variable is how many times a player has been named to the Pro Bowl. Pinker’s second source was a blog post, based on four years of data, written by someone who runs a pre-employment testing company, who also failed to appreciate—as far as I can tell (the key part of the blog post is only a paragraph long)—the distinction between aggregate and per-play performance. Pinker’s third source was an article in the Columbia Journalism Review, prompted by my essay, that made an argument partly based on a link to a blog called “Niners Nation." I have enormous respect for Professor Pinker, and his description of me as “minor genius” made even my mother blush. But maybe on the question of subjects like quarterbacks, we should agree that our differences owe less to what can be found in the scientific literature than they do to what can be found on Google."

Pinker replied:

"Gladwell is right, of course, to privilege peer-reviewed articles over blogs. But sports is a topic in which any academic must answer to an army of statistics-savvy amateurs, and in this instance, I judged, the bloggers were correct."

And, yes, the bloggers *were* correct. They pointed out a huge, huge problem with the Berri/Simmons study. It ignored QBs who didn't play.

----

As you'd expect, the early draft choices got a lot more playing time than the later ones. Even disregarding seasons where they didn't play at all, and even *games* where they didn't play at all, the late choices were only involved in 1/4 as many plays as the early choices. Berri and Simmons don't think that's a problem. They argue -- as does Gladwell -- that we should just assume the guys who played less, or didn't play at all, are just as good as the guys who did play. We should just disregard the opinions of the coaches, who decided they weren't good enough.

That's silly, isn't it? I mean, it's not logically impossible, but it defies common sense. At least you should need some evidence for it, instead of just blithely accepting it as a given.

And, in any case, there's an obvious, reasonable alternative model that doesn't force you to second-guess the professionals quite as much. That is: maybe early draft choices aren't taken because they're expected to be *better* superstars, but because they're expected to be *more likely* to be superstars.

Suppose there is a two-round draft, and a bunch of lottery tickets. Half the tickets have a 20% chance of winning $10, and the other half have a 5% chance of winning $10. If the scouts are good at identifying the better tickets, everyone will get a 20% ticket in the first round, and a 5% ticket in the second round.

Obviously, the first round is better than the second round. It has four times as many winners. But, just as obviously, if you look at only the tickets that win, they look equal -- they were worth $10 each.
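The lottery example is small enough to write out. This is a minimal sketch using the probabilities and the $10 prize from the paragraph above; the dictionary keys are just labels chosen for the illustration.

```python
PRIZE = 10.0  # every winning ticket pays the same amount

# Win probabilities from the example: first-round vs. second-round tickets.
win_prob = {"first round": 0.20, "second round": 0.05}

summary = {}
for draft_round, p in win_prob.items():
    summary[draft_round] = {
        "expected value per ticket": p * PRIZE,   # counts the losers too
        "average payout among winners": PRIZE,    # looks only at winners
    }

for draft_round, stats in summary.items():
    print(draft_round, stats)
```

Per ticket, the first round is worth four times as much. Conditional on winning, the two rounds are identical, and that conditional number is the only one a winners-only sample can see.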

Similarly for quarterbacks. Suppose, in the first round, you get 5 superstar quarterbacks and 5 good ones. In the last round, you get only one of each. By Berri's logic, the first round is no better than the last round! Because, the 10 guys from the first round had exactly the same aggregate statistics, per play, as the 2 guys from the last round.

I don't see why Gladwell doesn't get it, that the results are tainted by the selective sampling.

Anyway, others have written about this better than I have. Brian Burke, for instance, has a nice summary.

"We should certainly expect that if [Andrew] Luck and [Robert] Griffin III are taken in the first few picks of the draft, they will get to play more than those taken later. But when we consider per-play performance (or when we control for the added playing time top picks receive), where a quarterback is drafted doesn’t seem to predict future performance."

What Berri is saying is that Andrew Luck, who is widely considered to be the best QB prospect in the world, is not likely to perform much better than a last-round QB pick, if only you gave that last pick some playing time.

Presumably, Berri would jump at the chance to trade Luck for two last-round picks. That's the logical consequence of what he's arguing.

-----

Anyway, I actually hadn't looked at Berri's paper (.PDF) until a couple of days ago, when that Freakonomics post came out. Now that I've looked at the data, I see there are other arguments to be made. That is: even if, against your better judgment, you accept that the unknowns who never got to play are just as good as the ones who did ... well, even then, Berri and Simmons's data STILL don't show that late picks are as good as early picks.

Tuesday, February 21, 2012

Why is the SD of the sum proportional to the square root?

If you take the sum of two identical independent variables, the SD of the sum is *not* two times the SD of each variable. It’s only *the square root of two* times the SD.

There’s a mathematical proof of that, but I’ve always wondered if there was an intuitive understanding of why that is.

First, you can get a range without doing any math at all.

It seems obvious that when you add up two variables, you’re going to get a wider spread than when you just look at one. For instance, you can see that it’s easier to go (say) 10 games over .500 over two years, than over just one year. You can see that if you roll one die, it’s hard to go two points over or under the average (you have to roll a 1 or 6). But if you roll 100 dice, it’s easier to go two points over or under the average (you can roll anything except 349, 350, or 351).

So, the spread of the sum is wider than the spread of the original. In other words, it's more than 1.00 times the original.

Now, if you just doubled everything, it’s obvious that the multiplier would be 2.00, that the curve would be twice as wide. A team that goes +10 over one season will go +20 over two seasons. The team that goes -4 will go -8. And so on. So, if you do that, it's exactly 2.00 times the original.

But, in real life, you don't just double everything -- there’s regression to the mean. The team that goes +10 one random season probably will go a lot less than +10 the second random season. And if you roll 6 on the first die, you're probably going to roll less than 6 the second die. So the curve will be less stretched out than if you just doubled everything.

That means the multiplier has to be less than 2.00.

So, the answer has to be something between 1.00 and 2.00. The square root of two is 1.41, which fits right in. It seems reasonable. But why exactly the square root of two? Why not 1.5, or 1.3, or 1.76, or Pi divided by 2?

I’m looking for an intuitive way to explain why it’s the square root of two. I’ve come up with two different ways, but I’m not really happy with either. They’re both ways in which you can see how the square root comes into it, but I don’t think you really *feel* it.

Here they are anyway. Let me know if you have improvements, or you know of any others. I’m not looking for a mathematical proof -- there are lots of those around -- I’m just looking for an explanation that lets you say, “ah, I get it!”

-------

Explanation 1:

First, I’m going to cheat a bit and use something simpler than a normal distribution. I’m going to use a normal six-sided die. That’s because of my limited graphics skills.

So, here’s the distribution of a single die. Think of it as a bar graph, but using balls instead of bars.

Part of my cheating is that I’m going to use the shortcut that the SD of a distribution is proportional to its horizontal width. That’s not true for normal distributions, but if you pretend it is, you’ll still get the intuitive idea.

Now, since we’re adding two dice, I’m going to prepare a little addition table with one die on the X axis, and another on the Y axis. The sums are in white:

Now, I’m going to take away the axes, and just leave the sums:

The balls represent the distribution of the sum of the dice. We want the standard deviation of this distribution. That is, we want to somehow measure its spread.

We can’t do it just like this, because the sums seem scattered around, instead of organized into the graph of a distribution. But we can fix that, just by turning it 45 degrees. I’ll also add some color, to make it easier to see:

See? Now the distribution is in a more familiar format. All the 2s are in a vertical line, and all the 3s, 4s, and so on. (Well, they should be exactly vertical, but they’re a bit off … my graphics abilities are pretty mediocre, so I couldn’t get that square to be exactly square. But you know what I mean.)

It’s like the usual bar graph you see of the distribution of the sum of two dice, except that the bar extends above and below the main axis, instead of just above. (If you want, imagine that the column of 7s is sitting on the floor. Then let gravity drop all the other columns down to also rest on the floor. That will give you the more standard bar graph.)

Now, in the above diagram, look at the main horizontal axis, the one that goes 2-4-6-8-10-12. The length of that axis is the spread of the graph, the one that we’re using to represent the standard deviation. What’s that length?

Well, it’s the hypotenuse of a right triangle, where the two sides are the spread of the original die.

By the Pythagorean theorem -- the real one, not the baseball one -- the diagonal must be exactly the square root of two times the original.

As I said, I’m not thrilled with this, but it kind of illustrates where the square root comes from.
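The picture uses width as a stand-in for SD, but the actual SDs can be checked by brute force. This is a minimal sketch that enumerates the same 6-by-6 addition table and compares the SD of the sums with the SD of a single die. It uses only the Python standard library, and the variable names are illustrative.

```python
from collections import Counter
from itertools import product
from statistics import pstdev

faces = list(range(1, 7))

# The addition table: every (first die, second die) pair and its sum.
sums = [a + b for a, b in product(faces, repeat=2)]

# The "rotated" view: how many cells of the table land on each sum.
distribution = Counter(sums)
for total in sorted(distribution):
    print(f"{total:2d} {'o' * distribution[total]}")

sd_one_die = pstdev(faces)
sd_two_dice = pstdev(sums)
ratio = sd_two_dice / sd_one_die
print(f"SD of one die: {sd_one_die:.4f}")
print(f"SD of the sum: {sd_two_dice:.4f}")
print(f"ratio: {ratio:.4f}")
```

The ratio is the square root of two, even though the range of possible sums (2 to 12) is exactly twice as wide as the range of one die.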

-----

Explanation 2:

If I just take one die and double it, I get twice the SD. It looks like this:

The blue and green are the two SDs of 1. The pink line just goes from beginning to end, and its length represents the SD of the sum. Obviously, that SD is 2.

Now, suppose I take one die, but, instead of just doubling it for the sum, I add the amount on the bottom of the die. I always get 7 (because that’s how dice are designed). That means the bottom is perfectly negatively correlated with the top. The variance of the top is 1, the variance of the bottom is 1, but the variance of the sum is zero (since the sum is always the same). That looks like this, with the "second die" arrow going exactly the opposite direction of the first. The pink line isn't a line at all -- which is to say, it's a line of length zero, since the beginning is the same as the end.

Now, what if I take the one die and roll it again? Then, the second die is completely independent of the first die. It doesn’t go right, and it doesn't go left. It has to go in a direction that’s independent of the first direction. Like, straight up:

Now, the distance from beginning to end is the hypotenuse of the triangle, which is the square root of 2! Which is what we were trying to show.
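The three arrow pictures correspond to the standard formula for the variance of a sum, Var(A+B) = Var(A) + Var(B) + 2 * corr * SD(A) * SD(B), which has the same shape as the law of cosines. This is a minimal sketch of it; the function name `sd_of_sum` is made up for the illustration, and each SD is set to 1 as in the diagrams.

```python
import math

def sd_of_sum(sd_a, sd_b, corr):
    """SD of A + B, given each SD and the correlation between A and B."""
    variance = sd_a ** 2 + sd_b ** 2 + 2 * corr * sd_a * sd_b
    return math.sqrt(max(variance, 0.0))  # guard against tiny negative rounding

cases = {
    "same die doubled (corr = +1)": 1.0,
    "top plus bottom of one die (corr = -1)": -1.0,
    "two independent rolls (corr = 0)": 0.0,
}
for label, corr in cases.items():
    print(f"{label}: SD of sum = {sd_of_sum(1.0, 1.0, corr):.3f}")
```

Correlations of +1, -1 and 0 play the role of arrows pointing the same way, opposite ways, and at right angles.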

----------

As I said, I’m not thrilled with these explanations. Are there better ones?

Wednesday, February 15, 2012

Absence of evidence vs. evidence of absence

People tell me that Albert Pujols is a better hitter than John Buck. So I did a study. I watched all their at-bats in August, 2011. I observed that Pujols hit .298, and Buck hit .254.

Yes, Pujols' batting average was better than Buck's, but the difference wasn't statistically significant. In fact, it wasn't even close: it was less than 1 standard deviation!

So, clearly, August's performance shows no evidence that Pujols and Buck are different in ability.

Does that sound wrong? It's right, I think, at least as I understand how things work in the usual statistical studies. If you fail to reject the null hypothesis, you are entitled to use the words "no evidence."

Which is a little weird, because, of course, it *is* evidence, although perhaps *weak* evidence. I suppose they could have chosen to say "not enough" evidence, or "insufficient" evidence, but that carries with it an implication that the null hypothesis is correct. If I say, "the study found no evidence that whites are smarter than blacks," that sounds fine. But if I say, "the study found insufficient evidence that whites are smarter than blacks," that sounds racist.

The problem is, if you don't really know what "no evidence" really means, you might get the wrong impression. You might have 25 different studies testing whether Pujols is better than Buck, each of them using a different month. They all fail to reject the hypothesis that they're equal, and they all say they found "no evidence". (That's not unlikely: to be significant at .01 for a single month, you'd have to find Pujols outhitting Buck by about 200 points.)

And you think, hey, "25 studies all failed to find any evidence. That, in itself, is pretty good evidence that there's nothing there."

But, the truth is, they all found a little bit of evidence, not *no* evidence. If you multiply *no* evidence by 25, you still have *no* evidence. But if you multiply a little bit of evidence by 25, now you have *enough* evidence.
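To put rough numbers on that, here is a minimal sketch. It assumes, purely for illustration, that each hitter gets 100 at-bats a month and that every one of the 25 months looks exactly like August (.298 vs. .254). Neither assumption comes from real data, and `z_score` is a hypothetical helper using the usual binomial standard error for a difference of two proportions.

```python
import math

def z_score(avg_a, avg_b, at_bats):
    """How many SDs apart two batting averages are, given at-bats for each hitter."""
    std_err = math.sqrt(avg_a * (1 - avg_a) / at_bats
                        + avg_b * (1 - avg_b) / at_bats)
    return (avg_a - avg_b) / std_err

AT_BATS_PER_MONTH = 100   # assumed, not taken from actual 2011 data
MONTHS = 25

one_month = z_score(0.298, 0.254, AT_BATS_PER_MONTH)
pooled = z_score(0.298, 0.254, AT_BATS_PER_MONTH * MONTHS)

print(f"one month:  {one_month:.2f} SD")
print(f"{MONTHS} months:  {pooled:.2f} SD")
```

Under those assumptions, each month on its own is well under 1 SD, but the same 44-point gap pooled over 25 months is more than 3 SD. Twenty-five "no evidence" results add up to quite a lot of evidence.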

------

There's an old saying, "absence of evidence is not evidence of absence." The idea is, just because I look around my office and don't see any proctologists or asteroids, it doesn't mean proctologists or asteroids don't exist. I may just not be looking in the right place, or looking hard enough. Similarly, if I look at only one month of Pujols/Buck, and I don't see a difference, it doesn't mean the difference isn't there. It might just mean that I'm not looking hard enough.

This is the point Bill James was making in his "Underestimating the Fog." We looked for clutch hitting, and we didn't find it. And so we concluded that it didn't exist. But ... maybe we just need to look harder, or in different places.

What Bill was asking is: we have the absence of evidence, but do we have the evidence of absence?

------

Specifically, what *would* constitute evidence of absence? The technically-correct answer: nothing. In normal statistical inference, there's actually no evidence that can support absence.

Suppose I do a study of clutch hitting, and I find it's not significantly different from zero. But ... my parameter estimate is NOT zero. It's something else, maybe (and I'm making this up), .003. And maybe the SD is .004.

If I think clutch hitting is zero, and you think it's .003, we can both point to this study as confirming our hypotheses. I say, "look, it's not statistically significantly different from zero." And you say, "yeah, but it's not statistically significantly different from .003 either. Moreover, the estimate actually IS .003! So the evidence supports .003 at least as much as zero."

That leaves me speechless (unless I want to make a Bayesian argument, which let's assume I don't). After all, it's my own fault. I didn't have enough data. My study was incapable of noticing a difference between .000 and .003.

So I go back to the drawing board, and use a lot more data. And, this time, I come up with an estimate of .001, with an SD of .002.

And we have the same conversation! I say, "look, it's not different from zero." And you say, "it's not different from .001, either. I still think clutch hitting exists at .001."

So I go and try again. And, every time, I don't have an infinite amount of data, so, every time, my point estimate is something other than zero. And every time, you point to it and say, "See? Your study is completely consistent with my hypothesis that clutch hitting exists. It's only a matter of how much."
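The stalemate is easy to see in code. This is a minimal sketch using the two made-up studies above (.003 with an SD of .004, then .001 with an SD of .002); the helper names are illustrative. It checks whether each side's hypothesis falls inside the 2-SD interval.

```python
def two_sd_interval(estimate, sd):
    """Rough 95% interval: the estimate plus or minus two SDs."""
    return estimate - 2 * sd, estimate + 2 * sd

def consistent(hypothesis, estimate, sd):
    """True if the hypothesized value lies inside the 2-SD interval."""
    low, high = two_sd_interval(estimate, sd)
    return low <= hypothesis <= high

# (point estimate, SD) for the two hypothetical clutch studies in the text.
studies = [(0.003, 0.004), (0.001, 0.002)]

for estimate, sd in studies:
    low, high = two_sd_interval(estimate, sd)
    print(f"estimate {estimate:.3f}: interval {low:+.3f} to {high:+.3f}")
    print(f"  consistent with zero?          {consistent(0.0, estimate, sd)}")
    print(f"  consistent with the estimate?  {consistent(estimate, estimate, sd)}")
```

Both checks come back true for both studies. Collecting more data narrows the interval without settling the argument.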

------

What's the way out of this? The way out of this is to realize that you can't use statistics to prove a point estimate. The question, "does clutch hitting exist?" is the same as the question "is clutch hitting exactly zero?". And, no statistical technique can ever give you an exact number. There will always be a standard error, and a confidence interval, so it will always be possible that the answer is not zero.

You can never "prove" a hypothesis about a single point. You can only "disprove" it. So, you can never use statistical techniques to demonstrate that something does not exist.

What we should be talking about is not existence, but size. We can't find evidence of absence, but we can certainly find evidence of smallness. When an announcer argues for the importance of being able to step up when the game is on the line, we can't say, "we studied it and there's no such thing". But we *can* say, "we studied it, and even under the most optimistic assumptions, the best clutch hitter in the league is only going to hit maybe .020 better in the clutch ... and there's no way to tell who he is."

Or, the short form -- "we studied it, and the differences between players are so small that they're not worth worrying about."

------

But aren't there issues where it's important to actually be able to disprove a hypothesis? Take, for instance, ESP. Some people believe they can do better than chance at guessing which card is drawn from an ESP deck.

If we do a study, and the subject guesses exactly what you'd expect by chance, you'd think that would qualify as a failure to find ESP. But when you calculate the confidence interval, centered on zero, you might have to say, "our experiment suggests that if ESP exists, its maximum level is one extra correct guess in 10,000."

And, of course, the subject will hold it up, and triumphantly say, "look, the scientists say that I might have a small amount of ESP!!"

What's the solution there? It's to be common-sense Bayesian. It's to say, "going into the study, we have a great deal of "evidence of absence" that ESP doesn't exist -- not from statistical tests, but from the world's scientific knowledge and history. If you want to challenge that, you need an equal amount of evidence."

That makes sense for ESP, but not for clutch hitting. Don't we actually *know* that clutch hitting talent must exist, even at a very small level? Every human being is different in how they respond to pressure. Some batters may try to zone out, trying to forget about the situation and hit from instinct. Some may decide to concentrate more. Some may decide to watch the pitcher's face between pitches, instead of adjusting their batting glove.

Any of those things will necessarily change the results a tiny bit, in one direction or the other. Maybe concentration makes things worse, maybe it makes it better. Maybe it's even different for different hitters.

But we *know* something has to be different. It would be much, much too coincidental if every batter did something different, but the overall effect is exactly .0000000.

Clutch hitting talent *must* exist, although it might be very, very small.

So why are we so fixated on zero? It doesn't make sense. We know, by logical argument, that clutch hitting can't be exactly zero. We also know, by logical argument, that even if it *were* exactly zero, it's impossible to have enough evidence of that.

When we say "clutch hitting doesn't exist," we're using it as a short form for, "clutch hitting is so small that, for all intents and purposes, it might as well not exist."

------

When the effect is small, like clutch hitting, it's not a big deal. But when the effect might be big, it's a serious issue.

A lot of formal studies -- not just clutch hitting or baseball -- will find they can't reject the null hypothesis. They usually say, "we found no evidence," and then they go on to assume that what they're looking for doesn't exist.

They'll do a study on, I don't know, whether an announcer is right that playing a day game after a night game affects you as a hitter. And they'll get an estimate that says that batters are 40 points of OPS worse the day after. But it's not statistically significant. And they say, "See? Baseball guys don't know what they're talking about. There's no evidence of an effect!"

But that's wrong. Because, unlike clutch hitting, the confidence interval does NOT show an effect that "for all intents and purposes, might as well not exist." The confidence interval is compatible with a *large* effect, of at least 80 points. (That is, since 2 SD is enough to drop from 40 points to zero on one side, it's also enough to rise from 40 points to 80 points on the other side.)

So it's not that there's evidence of absence. There's just absence of evidence.

And that's because of the way they did their study. It was just too small to find any evidence -- just like my office is too small to find any asteroids.
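One way to keep the two cases apart is to ask what the whole interval says, not just whether it covers zero. This is a minimal sketch of that rule. The `negligible` thresholds (10 points of OPS, .010 of clutch batting average) are hypothetical cutoffs picked for the illustration, the SD of 20 for the day-game study is the borderline value implied above, and the clutch numbers are the made-up ones from earlier in the post.

```python
def verdict(estimate, sd, negligible):
    """Classify a result by its 2-SD interval, not just its significance."""
    low, high = estimate - 2 * sd, estimate + 2 * sd
    if low > 0 or high < 0:
        return "evidence of an effect"
    if max(abs(low), abs(high)) <= negligible:
        return "evidence of smallness"   # even the extremes don't matter
    return "absence of evidence"         # the study can't rule out a big effect

# Day game after night game: 40 points of OPS, SD of 20 (borderline case).
day_game = verdict(estimate=40, sd=20, negligible=10)

# Clutch hitting: .001 with an SD of .002 (made-up numbers from the text).
clutch = verdict(estimate=0.001, sd=0.002, negligible=0.010)

print("day game after night game:", day_game)
print("clutch hitting:", clutch)
```

Neither result is significant, but only the clutch interval stays inside the "might as well not exist" range.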

Tuesday, February 14, 2012

Two new "Moneyball"-type possibilities

I'm usually doubtful that significant "Moneyball"-type inefficiencies still exist in sports. But, recently, two possibilities came up that got me wondering.

First, in a discussion about baseball player aging, commenter Guy suggested that there are lots of good young players kept in the minors when they're good enough to be playing full-time in the majors. He mentions Wade Boggs, whom the Red Sox held back in the early 80s in favor of Carney Lansford.

It's certainly a possibility, especially when you consider the Jeremy Lin story. Of course, baseball and hockey are different from basketball and football, because they have minor leagues in which players get to show their stuff. But, still.

That means the difference between first and second place was almost twice the difference between second and twentieth place. Dustin Brown is exceptionally good at getting his team a power play.

Desjardins writes,

"Incidentally, 380 non-coincidental penalties is worth roughly $33M in 2012 dollars relative to the league average, and quite a bit more relative to replacement level. ... Dustin Brown has made roughly $15M so far in his career, making him one of the biggest deals in the entire league."

Wow. If you had tried to convince me that you could find an official NHL stat that would uncover $33 million worth of hidden value, I wouldn't have believed you. But there it is.

Wednesday, February 08, 2012

A research study is just a peer-reviewed argument (part II)

I've always said that a regression doesn't speak for itself. A regression is just manipulated data. To support a hypothesis, you need more than just data: you need an argument about why that data matters.

I wrote about that here, when I said that a research paper is just a peer-reviewed argument. Some commenters disagreed. They argued that science is, and has to be, objective -- whereas, arguments are always subjective.

Having thought about it further, I don't understand how it isn't more obvious that there's always a subjective argument involved. At the very least, if you find a significant association between X and Y, you have to suggest whether X causes Y, whether Y causes X, or whether something else causes both.

So, I don't get it. For those of you who don't believe that studies need to argue subjectively, what is it you're thinking?

---

Here's an example to let you be specific. It's an imaginary regression, where A, B, and C are used to predict X. I'm assuming .05 is the threshold for significance, but if you prefer a different level, feel free to change the p-values accordingly.

Here are the independent variables, the coefficients, and the significance levels. An asterisk means the coefficient is significantly different from zero.

A  +0.15  p=0.05  *
B  +0.13  p=0.08
C  +0.16  p=0.04  *

What can you conclude?

Sure, you can say, "a unit increase in A was associated with a 0.15 increase in the dependent variable X, and that was statistically significantly different from zero." But that's not really a conclusion, that's just reading the results right off the regression. Papers wouldn't have a "conclusions" section if that was all they contained.

So, now, let me ask you: what would you write in your conclusions that's not subjective?

Tuesday, February 07, 2012

A Don Cherry / Darryl Sittler tracer

Today is the 36th anniversary of Darryl Sittler's 10-point night, and a few websites (like this one) are linking to a Don Cherry clip where he talks about the game.

Cherry says that the Leafs "showed no mercy" on the Bruins, and says you should never embarrass another team, because it might come back to haunt you. He says that the Leafs gave Sittler a silver tea set after the game, "for murdering us like that. I got the paper, I cut out the picture, and every time we came to Maple Leaf Gardens after that, I put this picture up ..."

"For the next three years, we never lost to the Leafs at the Gardens."

So as not to be accused of picking on Ken Dryden, I looked it up.

The presentation of the tea set was Friday, April 9, 1976, before a playoff game against the Penguins (according to the next day's Globe and Mail). The next two games in Toronto against the Bruins were actually Leaf wins:

12/09/76: Leafs 7, Bruins 5
11/27/76: Leafs 4, Bruins 2

But after that, the Leafs didn't beat the Bruins at home until the 1982-83 season:

So, if Cherry remembers the events correctly, it must be that he started posting the photo only after the game of 11/27/76. And, although it was at least five years until the Leafs won again, rather than just three, Cherry left the Bruins after the 1978-79 season, so he might just be referring to his own tenure there.

Friday, February 03, 2012

Bettors don't regress to the mean enough, investment firm claims

An investment firm has a Super Bowl prediction method that they think can beat the spread, according to this Bloomberg story.

The firm, Analytic Investors LLC, does this: for each of the NFL's 32 teams, they figure out how much money you would have made betting that team during the regular season (betting them outright, it appears, not betting them against the spread).

In 2011, betting the 49ers would have generated an investment return of 52.9 percent (I don't know how that's calculated, but that's all they say), making them the team with the highest "alpha". The Colts were the worst, at minus 57.6 percent. The Giants were +32.3 percent, and the Patriots +16.1 percent.

For the Super Bowl, the method says, you should bet on the team that returned the least during the regular season. The idea is to avoid the team with the higher return, because bettors who made more money off that team are "overreacting to information."

"[The Giants] have been the hotter team. They are like the cocktail party stock that everyone's talking about, that some people have made a lot of money on."

OK, fair enough. But ... is there evidence that this works? The article doesn't give any, except to say that it has beaten the spread for the last eight consecutive Super Bowls. That doesn't mean much, of course, since nobody's claiming the system is accurate enough for eight in a row to be expected.

(Suppose the method predicts with a 60% success rate, which seems way optimistic. Then the chance of 8 in a row is around 1 in 60. At 50%, the chance is 1 in 256.)
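If you want to check that arithmetic, here's a minimal sketch. It assumes each year's pick is independent of the others and has the same success rate; the function name streak_odds is just something I made up for illustration.

```python
def streak_odds(p: float, n: int = 8) -> float:
    """Return N such that the chance of n straight correct picks is 1 in N.

    Assumes each pick is independent and succeeds with probability p.
    """
    return 1 / p ** n


# The two success rates mentioned in the text
for p in (0.6, 0.5):
    print(f"{p:.0%} success rate: about 1 in {streak_odds(p):.0f}")
```

The point is that even an unrealistically good system would rarely go eight for eight, so the streak by itself doesn't tell you much about how accurate the method is.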

This is probably just a publicity stunt to get some exposure for the firm. But it seems like an interesting hypothesis to check out. If a team outperforms expectations during the regular season, it probably did so by luck. And, it seems reasonable to suggest that maybe bettors misinterpret that luck as skill, and overweight the team's future chances.

You'd need a bigger study, of course. Suppose every year you looked at the last three games of the season, for the top 6 and bottom 6 teams in terms of "alpha" in the earlier weeks. That would give you maybe 25 games a year, 500 games over twenty years, 250 games for each group. Worth a shot. I don't have NFL data or betting line data, but I'm sure someone out there does.
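To get a rough feel for how much 250 games per group could tell you, here's a sketch using the usual binomial standard error. It assumes every bet is an independent coin flip under the "no effect" hypothesis, and the 55 percent win rate is a made-up number for illustration, not anything from the article. The function names are mine.

```python
import math


def se_win_rate(n_games: int, p: float = 0.5) -> float:
    """Standard error of an observed win rate over n_games independent bets."""
    return math.sqrt(p * (1 - p) / n_games)


def z_score(observed: float, n_games: int, p_null: float = 0.5) -> float:
    """How many standard errors the observed win rate sits from p_null."""
    return (observed - p_null) / se_win_rate(n_games, p_null)


n = 250            # games per group, from the back-of-envelope count above
hypothetical = 0.55  # assumed win rate, purely illustrative

print(f"SE over {n} games: {se_win_rate(n):.4f}")
print(f"z for a {hypothetical:.0%} win rate: {z_score(hypothetical, n):.2f}")
```

Under those assumptions, the standard error is a bit over three percentage points, so a 55 percent win rate would be around 1.6 SD from a coin flip. That's suggestive, not conclusive, which is why you'd want as many seasons as you can get.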