Friday, December 22, 2006

Academic peer review can't be counted on

Berri, of course, is one of the economist co-authors of "The Wages of Wins," a book that has had its share of rave reviews, and also its share of criticism.

On the positive side, writer Malcolm Gladwell famously lauded the book in a New Yorker review, and fellow sports economist J.C. Bradbury was similarly praiseful in an academic review. But there were critical reviews from Roland Beech and myself, and critiques of the book's methodology and conclusions appeared in (among other places) an APBRmetrics forum and King Kaufman's column in Salon.

And that's where peer review comes in.

In a recent post on his blog, Berri makes the specific point that his critics have not been peer-reviewed, which is why he is skeptical of the points they make.

He writes,

"Ultimately it is the research in academic forums that we take seriously, and we often are quite skeptical of findings that have not been exposed to this peer review process ... The route those who disagree must follow is ultimately the same academic route as everyone else. He or she will have to demonstrate that they have empirical evidence that comes to a different conclusion. And this empirical evidence would be submitted to a peer review process before it could be published in an academic forum."

"Had [Beech's] review simply appeared on his website ... we would have been inclined to either ignore his comments or respond on our own website ...

"In the end it is easy to sit back and make claims on a website. There is no peer review process. No one will refuse to publish your work because you misstate facts or fail to provide any evidence at all or because your evidence does not support your claims. In an academic setting, one expects a higher standard."

But, is academic peer review really a higher standard? I'm not so sure. Certainly academia is well-versed in the complex statistical techniques some of these studies use. But many of the academic papers I've reviewed in this blog over the last few months nonetheless have serious flaws, flaws large enough to cast doubt over the studies' conclusions. These papers were all peer-reviewed, and all made it to respected journals, without those flaws being spotted.

And sometimes they're obvious flaws. In "The Wages of Wins," the authors quote a study (co-authored by Berri) that checks whether basketball players "rise to the occasion" by playing better in the playoffs. After a regression on a bunch of factors, the study finds that players' playoff statistics actually fall relative to regular season performance. "The very best stars ... tended to perform worse when the games mattered most."

But what they failed to recognize was the obvious fact that, in the playoffs, players are facing only the best opponents. So, of course their aggregate performance should be expected to drop.
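A quick toy simulation (all numbers invented, nothing here is real basketball data) makes the selection effect concrete: a player of fixed skill posts worse numbers in the playoffs simply because playoff opponents are a selected, stronger subset of the league:

```python
import random

random.seed(42)

def observed(skill, defense):
    """Toy box score: player skill minus opponent defense, plus game-to-game noise."""
    return skill - defense + random.gauss(0, 1)

SKILL = 10.0
league = [random.gauss(5, 1) for _ in range(30)]     # defensive strength of 30 teams
playoff_field = sorted(league, reverse=True)[:8]     # playoffs: only the 8 best defenses

regular_avg = sum(observed(SKILL, d) for d in league for _ in range(20)) / (30 * 20)
playoff_avg = sum(observed(SKILL, d) for d in playoff_field for _ in range(20)) / (8 * 20)

# Same player, same skill: the playoff average comes out lower purely
# because the opposition is a stronger, selected pool.
```

Nothing about the player changes between the two samples; only the opponent pool does.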

I looked up the original study (I got it free from my library, but here's a pay link). It's a nine-page peer-reviewed paper, published in a respected journal. It's got 34 references, acknowledgements of help from three colleagues, and it was presented to a room full of economists at an academic conference.

And nobody caught the most obvious reason for the findings. I'd bet that if Berri had posted his findings to any decent amateur sabermetrics website, it would have been pointed out to him pretty quickly.

Another example: a few years back, three economists found that overall league HBP rates were a few percent higher in the AL than the NL. They wrote a paper about it, and concluded that NL pitchers were less likely to hit batters because they would come to bat later and face retribution.

It's an intriguing conclusion, but wrong. It turned out that HBP rates for non-pitchers were roughly the same in both leagues, and the difference was that because NL pitchers hit so poorly, they seldom got plunked.
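The arithmetic is a simple composition effect, easy to sketch with made-up rates (these are illustrative numbers, not actual MLB figures): even if non-pitchers get hit at identical rates in both leagues, the NL aggregate comes out lower just because pitchers, who rarely get plunked, take a share of its plate appearances:

```python
# Assumed illustrative rates (HBP per plate appearance), not real MLB data:
hbp_rate_position = 0.010   # non-pitchers, identical in both leagues
hbp_rate_pitchers = 0.002   # pitchers get hit far less often

# AL: with the DH, essentially all plate appearances go to position players
al_rate = hbp_rate_position

# NL: pitchers take roughly one of every nine plate appearances
nl_rate = (8 / 9) * hbp_rate_position + (1 / 9) * hbp_rate_pitchers

# Per-group rates are identical across leagues, yet the league
# aggregates differ -- the gap is entirely the DH.
```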

Think about it. The difference between the AL and NL turned out to be the DH – but no peer reviewer thought of the possibility! I think it's fair to say that wouldn’t happen in the sabermetric community. Again, if you were to post a summary of that paper at, say, Baseball Think Factory, the flaw would be uncovered in, literally, about five minutes.

The point of this is not to criticize these authors for making mistakes – all of us can produce a flawed analysis, or overlook something obvious. (I know I have, many times.) The point is that if peer review can't pick up those obvious flaws, it's not doing its job.

So why is academic peer review so poor? As commenter "Guy" writes in a comment to the previous post about a flawed basketball study:

"But this raises a larger issue that we've discussed before, which is the failure of peer review in sports economics. This paper was published in The Journal of Labor Economics, and Berri says it is "One of the best recent articles written in the field of sports economics." Yet the error you describe [in the post] is so large and so fundamental that we can have no confidence at all in the paper's main finding.... How does this paper get published and cited favorably by economists?"

It's a very good question – but if I were an academic sports economist, I wouldn't wait for an answer. If I cared about the quality of my work, I'd continue to consult colleagues before submitting it -- but I'd also make sure I got my paper looked at by as many good amateur sabermetricians as I could find. It's good to get published, but it's more important to get it right.

Although certainly better than the earlier work you cite on hit batters, the Bradbury/Drinen work on the "moral hazard" explanation for higher HBP rates in the AL is also seriously flawed.

The analysis fails to account for the fact that a very few players account for a huge proportion of HBP. If you want to know why the AL had more HBP for many years, a big part of the explanation is "Don Baylor". If you want to know why the NL has closed the gap, Biggio and Craig Wilson (and Kendall until recently) explain much of the change. This is no exaggeration. Once you adjust for the fact that AL teams have more 'real hitters' in the lineup, the league difference we're talking about is only about 50 HBPs a year. If Don Baylor had been traded to the NL, it would have erased the entire difference between the leagues for much of the 1980s. If Biggio had been developed by an AL team, the AL would still have a higher rate than the NL (the AL 'advantage' has disappeared in recent years).

The authors also assert that the higher AL rate began in 1973 with the arrival of the DH. But the AL rate was already 9% higher than the NL's in the 5 years prior to the DH rule, and increased only to 13% over the next five years -- a very small change. And ALL of that increase can be accounted for by the fact that NLer Ron Hunt coincidentally had his last big HBP year in 1973.

If you simply remove the players in the top 5% of HBP from the analysis, or look at medians instead of means, I think you'll find there was never any there there. Yet all these academic analyses, as best I can tell, never even considered the implications of dealing with a very skewed distribution.
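The skewness point is easy to demonstrate with toy counts (invented numbers, not real HBP totals): a single Don Baylor-type outlier moves a league's mean dramatically while barely budging its median:

```python
import statistics

# Toy season HBP totals for ten regulars (invented for illustration):
league_a = [2, 3, 1, 4, 2, 3, 2, 1, 3, 2]   # no outlier
league_b = [2, 3, 1, 4, 2, 3, 2, 1, 3, 30]  # one extreme HBP magnet

mean_a, mean_b = statistics.mean(league_a), statistics.mean(league_b)
median_a, median_b = statistics.median(league_a), statistics.median(league_b)

# One player more than doubles league B's mean,
# while the median barely moves -- a mean-based
# comparison attributes the gap to the whole league.
```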

Berri's point about not responding to critiques which are not peer reviewed is incoherent.

Note that in the peer review system, the reviewer is presumably credentialled, but the reviewer's comments are themselves not peer-reviewed. And yet the author may have to effectively respond to these criticisms from the reviewer if he hopes to see publication.

It's a defensible position in general to "ignore" non-peer-reviewed research purporting to demonstrate a positive result. This is a strategy to maximize the benefit from your investment in reading the research [in mature disciplines, there tends to be a hierarchy among peer-reviewed journals too, and research in the more prestigious journals gets a lot more attention for this reason]. It's another thing entirely to ignore what is simply criticism of your model or assumptions or interpretation or data. Berri's blurring of this distinction is, at its root, nothing more than the credentialism he chooses to attack at one point. He does respond to credentialled criticism, in various un-peer-reviewed forms (journal review; informal review of drafts by academic colleagues; presentations to academic colleagues). Peer-reviewed criticism in published articles is only a subset of the criticism he seeks or must acknowledge.

The peer review procedure is a quality control measure and a historically fairly recent invention, and it is likely to continue to decline in importance with the continued transition to new technologies of publication.

The quality of the peer review process depends largely on the quality of the referees used; at best, it increases the likelihood that flawed research will be corrected or rejected. There's never anything close to a guarantee that flawed work has been weeded out.

Finally, I think it's fair to question the efficacy of academic peer review in a discipline ("sports economics") in its infancy. The requisite body of expertise and experts don't appear to have been established yet. But let's try to stay away from general attacks on "academics" and "academic peer review." You've been offended by a fairly small subset of academics.

My remarks were directed at academic peer review of *sabermetric-related* work (such as sports economics) only, and not at peer review in general. I have no reason to doubt the efficacy of peer review in other fields, and didn't mean to imply otherwise. Nor do I have any beef with academics in general.

And while it's true that sports economics might be in its infancy as an academic subject, the field of *sabermetrics* is reasonably well advanced. In sabermetrics, there *is* a body of experts and expertise. Almost all of it, however, is non-academic.

I'm sorry; I know the focus of your complaint (and Guy's, in other posts) has been these self-styled experts. But sometimes the language employed against them has been a little general in its reference.

The power of the online community is that it unites in dialogue people with very different experience and analytic toolsets at their disposal. Sabermetrics essentially is a very interdisciplinary project. There are a number of contributors to the dialogue who are routinely insightful and 'know' a lot, and I certainly count you among them.

I try to avoid overly subjective judgments, but I do feel that there is still a lot of "talking past each other" about methods and results taking place; I think "sabermetrics" is past its infancy but still far from mature. The flood of great data to analyse is still very recent, as well as widespread internet access to unite analysts; I think we'll have a lot more agreement several years from now about core truths and methods.

Joe: Like Phil, my observations were solely about sports economics, and more narrowly, that portion of sports economics that's about the "sport" rather than just the economics of the sport (e.g. stadium subsidies). I honestly have no opinion on how well the peer review process works elsewhere. But let me press you a little on the 'talking past each other' observation. What would you identify as the most important contributions from economists or other academic researchers thus far, in terms of advancing our knowledge of the game of baseball? To me, it appears they are generally years or decades behind the 'amateurs.' But I certainly do not stay on top of the academic literature. Where do you feel the academics have made a contribution? (Not a rhetorical question.)

I disagree with you somewhat about the level of maturity of the field of sabermetrics.

You say, "the flood of great data to analyse is still very recent." But many of the core results in sabermetrics need only the evidence of traditional stats. And, in any case, I had Project Scoresheet data in my hands, in electronic format, in 1988.

More importantly, you say, "I think we'll have a lot more agreement several years from now about core truths and methods."

Actually, we DO have a lot of agreement on core truths and methods. Most of the basic results in the field -- marginal values of events, the Pythagorean theorem, etc. -- have been around since the mid-80s (thanks to Bill James and Pete Palmer).

In every field, you'll have researchers arguing at the margins of existing knowledge, and it's sometimes easy to forget there's a core body of results that pretty much everyone agrees on. For instance, there's apparently a debate about string theory in the theoretical physics community, but that doesn't mean that the core results of physics are in question.

And while the level of organization of the sabermetrics community may look low, it isn't as immature as it appears. Economists know about the phenomenon known as "spontaneous order," where organization seems to be created out of the ether, with no formal outside source of rules or planning. That, I would argue, is what happens in our field. Without journals, without peer review, without any formal mechanism for disseminating results, the community not only produces some excellent work, but also is able to discriminate between the better work and the less-good work, and move forward from there.

The bottom line, as Guy notes, is that almost all the important results in sabermetrics (Guy might leave out the "almost") have come from this apparent anarchy.

Right now, the amateurs are making valuable discoveries. The academics aren't. That's the elephant in the room that Berri isn't acknowledging.

I don't think I said anything previously which committed me to the proposition that academics (via academic publications) have made valuable contributions, or indeed any contribution at all. Of course I do believe they have made unmatched contributions in some domains (e.g. Charles Alexander and Reed Browning in history; Robert Adair in physics; Andrew Zimbalist on the business of baseball) and even for strategy/evaluation types of problems, I'd point to George Lindsay in the 1960s, whose work influenced Pete Palmer, and more recently "Curve Ball" [2001] by Albert and Bennett (reviewed in By the Numbers 12.1), as genuine contributions. Consultation of Google Scholar shows a number of articles in academic journals on such subjects as streakiness, stolen base strategy, batting order, Markov chain analysis of various questions, and salary studies. Whether some of them are better than or equivalent to non-academic studies, I can't say because I haven't seen them. This leaves aside any non-academically published work by academics - such as Stephen Jay Gould's essay on the decline of the .400 hitter, which I think could fairly be described as influential especially for its argument about the evolution of the quality of competition. Do you wish to count that on the ledger sheet for the non-academics? At any rate, it should be remembered that 1) a much greater volume of work has been published in non-academic contexts, 2) much of the "sabermetric" academic work has been published in journals not routinely accessible to the non-academic audience, and 3) the failure of the non-academic community to credit that work might not be a reliable indicator of its quality.

Phil: I think you and I may tend to agree on the facts and disagree more on their interpretation. In my view, perhaps nothing at all has been established as a matter of core agreement since the mid '80s, and the reason Palmer and James' work constituted that core is because their views were disseminated in successful mass market books. They had limited competition for "mindshare." I absolutely agree that excellent work has been done since then (replacement level for players; fielding metrics; defense independent pitching; relating salary to player value) but I don't agree that specific results are becoming widely known or widely accepted through the process of internet debate; I'd argue that the environment in which Palmer and James produced their work was not like today's "anarchy" - now there are many more voices, and none nearly as authoritative. This means good work doesn't take hold like it should.

I don't think your analogy with spontaneous order holds as you intend it. That concerns the emergence of "rules"; I don't see how that maps into emergence of research methods or research results, though perhaps at least the necessary infrastructure is springing up "spontaneously."

Anyway, I don't think we want to debate the state of the field as mature vs immature. Fundamentally this labelling amounts to a prediction of what will be coming next in the field compared to what came before. I think we've entered the beginning of a period of further advance; apart from what I mentioned before, I think Moneyball and the revitalization of sabermetric book publishing will be another significant reason for the next steps forward. Likewise, I think "The Book" will aid that advance as well, moreso for its role as an exemplar of method than for its specific results.

I agree with you that good work today doesn't take hold like good work from James or Palmer used to. But I disagree that good work isn't taking hold *at all*. It just takes a little longer, as you'd expect. Would that be different if more sabermetrics was done in academia instead of "anarchy"?

The way a result "takes hold" in the marketplace of ideas is not all that dependent on where the idea originated -- whether the idea first appeared in an academic journal, or the internet. Valid work takes hold, and invalid or unimportant work doesn't.

One advantage of the journal is that it's a formal meeting place where many can see the idea at the same time. One of my points is that informal substitute "meeting places" are slowly emerging on the internet. You're right in pointing out that "spontaneous order" is not the best metaphor for this process, but I do think it's happening.

Another advantage of the journal is that, in theory, peer review raises the expected level of the research. This increases its value as a "meeting place." My argument is that peer review is failing this task so far, and the internet community's informal process of winnowing does work reasonably well as an alternative.