Monday, June 07, 2010

For the second year in a row I have some thoughts to share on the BP annual, and I am not going to write them up in such a way as to feel comfortable calling it a "book review" (although I did give the post that label). It's not nearly formal enough, nor timely enough, nor extensive enough, to qualify for such a description. It is also a list of quibbles rather than a balanced look that praises the strengths of the book. I assume that anyone reading my blog is familiar with BP and doesn't need me to restate the table of contents.

Last year, I along with many others decried the lack of a player index in the book. This, happily, has been rectified, and now I'm just left to fume that Eric Fryer was not given a comment. On the other hand, the book contains no extra essays, for the first time since 1998 at least. Though the essays have been of uneven quality the last few editions, and have drifted away from sabermetric topics, they were always one of my favorite parts of the books. There have been some really good ones through the years, like Keith Woolner's piece on replacement level and Michael Wolverton's on pennants added.

Ditching the essays allows BP to devote more room to player comments (and an index!), and put essay content on their website, but I like baseball annuals that give you something to go back to in the years to come. In 2020, no one is really going to care what BP or anyone else thought of the Orioles or Yovani Gallardo in 2010 (let alone Yuniesky Betancourt). Essays about sabermetrics or the game in general are what give the Abstract or the Hardball Times Annual a shelf-life past June each year. However, I realize that BP is not attempting to be another Abstract (and their cover blurbs finally seem to recognize it too), so to some extent I am wishcasting the kind of book I prefer upon them.

The biggest quibble I have with the BP annual can be boiled down to confusion and contradiction with respect to the statistics used. Go to their website and take a gander at their glossary. It is next to impossible to use: it's disorganized, and it is largely devoid of formulas--even for metrics whose formulas have been published previously.

The disarray of the glossary is mirrored by the haphazard use of statistics in the book. Written comments often diverge from the statistics listed just above. This is not new to BP 2010, which only makes it more annoying. Here are four ways in which the broad problem manifests itself, three of which I'll illustrate with examples drawn from the Reds team chapter alone:

1. Using both VORP and WARP, which measure the same thing (value above replacement). Setting aside the respective offensive metrics each is built on, the major differences between them are the unit (runs for VORP, wins for WARP) and fielding (ignored by VORP, included by WARP). This requires clarification of how players rank relative to replacement level, as in the case of Paul Janish:

As poor a hitter as Janish is, he's that good with the glove, enough so that he was able to keep his head above replacement level last year (note the difference between his VORP and WARP totals above, as WARP includes defense and VORP doesn't).

There is something to be said for displaying hitting and fielding value in separate columns when using an uber-metric due to the wider differences in estimates of fielding value between systems, but instead BP lists two separate metrics with two different scales.

2. Comments that use untranslated numbers, juxtaposed with the translated numbers just above them. See Willy Taveras:

Taveras's -14.3 VORP in 2009 was the third-worst in the majors.

Taveras's VORP is listed as -8.0 just seven lines above.

3. Citing metrics that measure the same thing as those listed in the book, but which are not themselves listed. See the Joey Votto comment:

Despite all that, August was his only poor month, and he finished the year fourth among all qualified major leaguers with a marginal lineup value per game of .397

MLV/G was previously listed by BP; it was taken out this year. The elimination of MLV/G cleaned up an issue with overall offensive rate duplication analogous to the doubling-up of VORP and WARP discussed above, as EqA is a measure of the same thing. There are significant methodological differences and significant unit differences, but unless there are special circumstances in which those differences are important, there's no need to use both (and simply pointing out that Votto was one of the most effective major league hitters is not one of those cases, as EqA surely concurred with that assessment.)

4. Ignoring BP fielding metrics in favor of other fielding metrics like UZR and Plus/Minus.

I think the authors are well within the mainstream of the sabermetric community if they trust UZR and Plus/Minus more than the BP fielding system, and they should be applauded for being willing to cite metrics published outside of BP. On the other hand, though, if you have so little faith in your own metric that you don't even want to quote it, why include it at all? Why pollute WARP with its presence?

The statistical introduction to the book is not immune from confusion. This year they do acknowledge that they now use Pythagenpat rather than Pythagenport, but the editor still missed a slip-up within just a couple of sentences. That was presumably unintentional, but the issue is further confused by the use of the term generically, without specifying whether what BP calls first-, second-, or third-order inputs are used.
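For readers who haven't followed the distinction, the two methods differ only in how the Pythagorean exponent is computed from the run environment. A minimal sketch, using commonly published versions of the formulas (the 0.287 Pythagenpat exponent is one published value; BP's exact constants may differ):

```python
import math

def pythagenport_exponent(rpg):
    # Pythagenport: x = 1.50 * log10(RPG) + 0.45
    return 1.50 * math.log10(rpg) + 0.45

def pythagenpat_exponent(rpg):
    # Pythagenpat: x = RPG ** 0.287 (0.287 is one published value)
    return rpg ** 0.287

def expected_wpct(runs_scored, runs_allowed, games, exponent_fn):
    # Pythagorean expectation with a variable exponent
    rpg = (runs_scored + runs_allowed) / games
    x = exponent_fn(rpg)
    return runs_scored ** x / (runs_scored ** x + runs_allowed ** x)

# Hypothetical team: 700 runs scored, 750 allowed over 162 games
wpct_pat = expected_wpct(700, 750, 162, pythagenpat_exponent)
wpct_port = expected_wpct(700, 750, 162, pythagenport_exponent)
```

At normal scoring levels the two agree very closely; they diverge mainly in extreme run environments, which is where Pythagenpat behaves better.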

I have to quibble with their choice of ERA-style metrics listed, although this complaint lies squarely in the realm of opinion and not methodological error. Included are Matt Swartz and Eric Seidman's new SIERA, which is based on batted ball inputs, and DERA, which adjusts actual ERA for team defense. That leaves an ERA estimator based on component statistics (H, W, HR, etc.) out of the annual for the first time in many years, as they previously included Peripheral ERA. I would like to see PERA included alongside SIERA, at the expense of DERA if necessary.

The most puzzling glossary comment comes on page xii:

We've transitioned this year to using a measure of VORP that is based on EqA; this has appeared for years on the BP Web site under the label of "RARP".

If true, I would heartily applaud this, as the old VORP is fueled by MLV, which is based on the flawed OBA*SLG*AB model of basic Runs Created. EqA is essentially linear weights-based, and thus a much more robust metric when applied to individuals. However, it doesn't seem as if BP contributors are at all clear on whether this change was actually made for the annual, and the VORP report on the website still appears to be MLV-based.
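To illustrate why the OBA*SLG*AB model is flawed for individuals: basic Runs Created is multiplicative, so an extreme hitter's on-base and power skills are credited with interacting with each other, while a linear-weights approach values each event at a fixed run value. A sketch with illustrative coefficients (approximate, era-dependent values, not BP's actual EqA internals):

```python
def basic_rc(h, bb, tb, ab):
    # Basic Runs Created: (H + BB) * TB / (AB + BB),
    # which is algebraically identical to OBA * SLG * AB
    return (h + bb) * tb / (ab + bb)

def linear_weights_runs(singles, doubles, triples, hr, bb, outs):
    # Illustrative linear weights; coefficients are rough
    # approximations, not BP's actual values
    return (0.47 * singles + 0.78 * doubles + 1.09 * triples
            + 1.40 * hr + 0.33 * bb - 0.25 * outs)

# An extreme (hypothetical) hitter: 500 AB, 200 H (98 1B, 40 2B,
# 2 3B, 60 HR), 150 BB
tb = 98 + 2 * 40 + 3 * 2 + 4 * 60                            # 424 total bases
rc_est = basic_rc(200, 150, tb, 500)                         # ~228 runs
lw_est = linear_weights_runs(98, 40, 2, 60, 150, 500 - 200)  # ~138 runs
```

The multiplicative model hands this hitter roughly 90 runs more than linear weights does, which is the sense in which RC-style constructs overrate extreme individual seasons.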

If in fact VORP listed in the annual is EqA-based, it makes the listing of both VORP and WARP even more curious, as they would be identical except for the inclusion of fielding and the conversion to wins. If both are going to be displayed, it would make all the sense in the world to display them in directly comparable units (be it runs or wins).
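For instance, a runs-denominated VORP can be put directly on WARP's wins scale with a runs-per-win conversion (roughly 10 runs per win is the usual rule of thumb; BP's actual conversion is run-environment dependent, so this constant is an assumption for illustration, as are the player figures):

```python
# Rule-of-thumb conversion: roughly 10 runs per win
RUNS_PER_WIN = 10.0

def runs_to_wins(runs_above_replacement):
    return runs_above_replacement / RUNS_PER_WIN

# Hypothetical player: VORP of 25 runs, WARP of 3.1 wins
vorp_runs = 25.0
warp_wins = 3.1
vorp_wins = runs_to_wins(vorp_runs)       # 2.5 wins, offense only
implied_fielding = warp_wins - vorp_wins  # ~0.6 wins from the glove
```

Presented that way, the reader can see at a glance how much of the gap between the two figures is fielding, rather than having to mentally reconcile runs against wins.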

I am a fairly hard-core sabermetrician, and I am bewildered by the array of metrics used, so I can only assume that the average reader of BP has absolutely no idea what the difference between VORP and RARP is, let alone how to calculate them. While that may not be necessary to understand the implications of the results, it is necessary to enable people who are interested (like myself) to understand the differences.

Sometime after the book was published, BP changed the name of "Equivalent Average" to "True Average". The timing makes it seem as if this was a spur-of-the-moment choice, as the new annual presumably would have presented a perfect opportunity to change names. In any event, I'm not particularly fond of this change. I don't really care, as others do, that EqA has a long history--if something can be improved, I don't think there should be a statute of limitations, even when it comes to a name. I just don't think that "True Average" is a very good name.

For one thing, they abbreviate it as "TAv" (the other obvious choice would be "TA"). If I see a sabermetric stat with an abbreviation in that vein, the first one that pops into my head is Tom Boswell's Total Average. While EqA is obviously a better-constructed and more useful metric than TA, TA has a much longer history, dating back about thirty years. While I have no problem with changing the name of one's own metric, I'm not crazy about doing so in a manner that potentially infringes on someone else's.

Secondly, "Equivalent Average" was a great name for what the statistic is. It is a measure of the rate of overall offensive production designed to look like an equally impressive batting average--an equivalent batting average. To call it "True Average" misses the mark for me on two counts:

1. A "True Average" should be expressed in meaningful baseball units--not units that have been assigned great meaning by custom (as in the case of BA), but units that actually have a great deal of meaning. Examples that work for me would be runs per out, runs per game, runs per PA, even OBA.

2. Batting average, for all its flaws, is straightforward. It truly is hits per at bat. To call another metric "True Average" implies (to me at least) that it is a truer measure of BA. Of course, it's not attempting to replace BA in the role of a measure of base hit frequency, but just in BA's role as a measure of offensive production.

Ultimately, the name of the statistic really doesn't matter, and this is admittedly a petty quibble. I still think "Equivalent Average" is a much better name.

Again, I'd like to reiterate that this is not a review of BP 2010 as a complete work, just a comment on the performance metrics employed, and the accompanying confusion. As usual, I'm glad I bought the book--I just wish I had a better understanding of where the numbers came from.

6 comments:

For the second year in a row I have some thoughts to share on the BP annual, and I am not going to write them up in such a way as to feel comfortable calling it a "book review" (although I did give the post that label).

People don't do review works like these often enough, so this is welcome. A few random thoughts:

(T)he book contains no extra essays, for the first time since 1998 at least. Though the essays have been of uneven quality the last few editions, and have drifted away from sabermetric topics, they were always one of my favorite parts of the books. There have been some really good ones through the years, like Keith Woolner's piece on replacement level and Michael Wolverton's on pennants added.

Agreed. That's one of the things that made the books worthwhile. But I'm more interested in the theory behind sabermetrics and the history of the game. The decision-making that interests me isn't at the GM level. I'm more interested in pitch selection than who to tender or for how much.

There is something to be said for displaying hitting and fielding value in separate columns when using an uber-metric due to the wider differences in estimates of fielding value between systems, but instead BP lists two separate metrics with two different scales.

I've sometimes felt that a player's value should be expressed as a range instead of a hard number. Colin Wyers noted in a THT article last year (regarding MVP candidates) that there's uncertainty even in batting stats. Unfortunately, I don't recall any discussion of this afterwards. Maybe I missed it.

Ignoring BP fielding metrics in favor of other fielding metrics like UZR and Plus/Minus.

I'm guessing that they give their authors freedom to use different metrics. I like that better than Fangraphs, where it seems that xFIP is the only way to evaluate a pitcher.

Secondly, "Equivalent Average" was a great name for what the statistic is. It is a measure of the rate of overall offensive production designed to look like an equally impressive batting average--an equivalent batting average.

I'm 42, so when I was growing up, I knew more about batting average than OBA. I still have a better feel for what a good EQA is than what a good wOBA is. Personally, though, I think that wRC is superior to either of these rate stats.

I've sometimes felt that a player's value should be expressed as a range instead of a hard number. Colin Wyers noted in a THT article last year (regarding MVP candidates) that there's uncertainty even in batting stats. Unfortunately, I don't recall any discussion of this afterwards. Maybe I missed it.

I think most serious practitioners recognize this, and kind of take it for granted. That results in more casual saber-fans not worrying about error at all.

I'm guessing that they give their authors freedom to use different metrics. I like that better than Fangraphs, where it seems that xFIP is the only way to evaluate a pitcher.

That's an interesting way to look at it, and I agree that it's good that they're allowed to cite outside metrics. What I was trying to get at is that hardly anyone ever cites BP's own fielding metrics, which makes one wonder why page space should be spent on them.

I'm 42, so when I was growing up, I knew more about batting average than OBA. I still have a better feel for what a good EQA is than what a good wOBA is. Personally, though, I think that wRC is superior to either of these rate stats.

What do you mean by wRC being superior? That you prefer the total to the rate, or something else? Because wOBA is just wRC as a rate (albeit one no longer expressed in terms of runs).
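The conversion described here can be sketched roughly as follows; the general shape is the published FanGraphs formula, but the league constants below are placeholders for illustration, not actual season values:

```python
def wrc(woba, pa, lg_woba=0.320, woba_scale=1.15, lg_r_per_pa=0.12):
    # FanGraphs-style wRC: scale the hitter's wOBA gap over league
    # average back into runs per PA, add league-average run
    # production, and multiply by playing time.
    # League constants here are placeholders, not actual values.
    return ((woba - lg_woba) / woba_scale + lg_r_per_pa) * pa
```

With these placeholder constants, a league-average hitter (wOBA of .320) over 600 PA comes out to exactly league-average run production (72 runs), which is the sense in which wRC is just the wOBA rate re-expressed as a runs total.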

For what it's worth, re: essays, specific stat nitpicking, etc., I think it's pretty clear that BP's main audience for the annual are casual-to-serious fantasy baseball players who care more about the result stats (RBI, ERA, etc.) that PECOTA spits out than anything else.