Sunday, July 08, 2007

Charlie Pavitt: A long-winded rant concerning the evaluation of fielding

By Charlie Pavitt

I start with Branch Rickey’s famous quote from his (and Allan Roth’s) ground-breaking Life Magazine statistical-analytic article “Goodby to Some Old Baseball Ideas” (August 2, 1954): “there’s nothing anybody can do with fielding” (this and subsequent quotes all from page 83). Rickey/Roth realized that fielding averages were “utterly worthless as a yardstick” because they say nothing about fielding range. But their conclusion that “fielding could not be measured” is surprising given their insights into evaluating batting and pitching. There’s quite a bit we can do with fielding. But we’ve not gotten as far in this regard as we have with batting and pitching; and when I see many of the evaluation methods used by some of our otherwise-best analysts, I am disappointed. My goal in this blog is to rant long-windedly about how I think we ought to be thinking about fielding evaluation in the first place.

Let’s start with the basics. We make progress in evaluating an aspect of baseball when we can successfully break the aspect into its component skills, measure each skill, and then combine these measurements in a meaningful common metric. Take batting prowess. There are four major component skills: the ability to get base hits, the ability to hit for power, the ability to coax bases on balls, and the ability to steal bases without being caught. One factor common among the four that suggests a metric is that successful performance occurs when bases are gained without the loss of outs. Each can be measured fairly easily in that way, and the measurements combined, resulting in tools such as OPS (sans the steals) and total average, both of which work fairly well despite their simplicity. Another common metric is number of runs gained, leading to runs created and a large set of regression-based methods.
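To make the "bases gained versus outs lost" idea concrete, here is a minimal sketch in Python. The `BattingLine` class and its field names are illustrative assumptions, not taken from any published system. OPS is computed in its standard form (on-base percentage plus slugging average); `bases_per_out` is only a simplified ratio in the spirit of total average, not necessarily the exact published formula.

```python
from dataclasses import dataclass


@dataclass
class BattingLine:
    """Season totals for one batter (field names are illustrative)."""
    ab: int       # at-bats
    h: int        # hits of all kinds
    doubles: int
    triples: int
    hr: int
    bb: int       # bases on balls
    hbp: int      # hit by pitch
    sf: int       # sacrifice flies
    sb: int       # stolen bases
    cs: int       # caught stealing


def total_bases(b: BattingLine) -> int:
    # Every hit is worth one base; extra-base hits add the remainder.
    return b.h + b.doubles + 2 * b.triples + 3 * b.hr


def obp(b: BattingLine) -> float:
    # On-base percentage: times on base per plate appearance counted.
    return (b.h + b.bb + b.hbp) / (b.ab + b.bb + b.hbp + b.sf)


def slg(b: BattingLine) -> float:
    # Slugging average: total bases per at-bat.
    return total_bases(b) / b.ab


def ops(b: BattingLine) -> float:
    # OPS ignores steals, as noted in the text.
    return obp(b) + slg(b)


def bases_per_out(b: BattingLine) -> float:
    # Simplified bases-gained / outs-made ratio (an assumption, in the
    # spirit of total average): steals count as bases, caught stealing as outs.
    bases = total_bases(b) + b.bb + b.hbp + b.sb
    outs = (b.ab - b.h) + b.cs
    return bases / outs


if __name__ == "__main__":
    line = BattingLine(ab=500, h=150, doubles=30, triples=5, hr=20,
                       bb=60, hbp=5, sf=5, sb=10, cs=5)
    print(round(ops(line), 3), round(bases_per_out(line), 3))
```

All four component skills feed one ratio here, which is the property a fielding metric would need to imitate.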

Defense as a whole works analogously; it is relatively simple to concoct measures of bases or runs given up relative to either outs gained or opportunity (e.g., innings played). The trick is to distinguish the pitching and fielding parts of the equation. Recent work starting with Voros McCracken’s insight implies that pitchers are best evaluated through examination of events that fielders cannot influence. This means that pitching prowess has the following three components: the ability to hit the strike zone (as measured by walks and hit-by-pitches), the ability to miss bats (as measured by strikeouts), and the ability to keep batted balls in play (as measured by home runs allowed). Measures based only on these skills would follow; the Baseball Prospectus people have proposed just that (DIPS ERA) in their recent Baseball Between the Numbers book. (I should add that pitchers do have some influence on batted balls in play, which implies a fourth skill; but this influence is far less than we thought pre-McCracken, and I would agree with those recommending that we ignore it for the time being.)

This leaves batted balls in play as the responsibility of the fielder, so that the evaluation of what fielders do with these is the issue at hand. Can we distinguish the component skills? (We need to exclude catcher defense here, which is a whole ‘nother matter.) Surely they include range and sure-handedness on batted balls and throwing ability. For middle infielders, one must consider adeptness as the middle man in a double play; for first basemen, the knack of picking up throws in the dirt. The trick lies in measuring each of them and then combining them into a common metric. This is a challenging project. Baseball differs from football, basketball, soccer, etc. in that it is an individual sport in a team context; i.e. its outcomes are primarily due to the pitcher versus hitter matchup. However, with fielding, particularly infielding, team coordination matters. As a whole it makes sense to credit fielders with the out when they successfully field a ball in their territory. But what do we do with the second out in a double play? Should the fielder get full credit, or the middleman, or should they split it? What do we do with assists from the outfield, especially when cut-off men are involved?

In order to make this rant particularly long-winded, I shall continue with a bit of history. Back in the March 1976 issue of "Baseball Digest", Bill James proposed the seemingly-novel idea that we measure infielding by the number of putouts and assists the infielder makes per game (in so doing reinventing, a hundred years later, an identical but soon-forgotten measure credited to Al Wright). Range factor was clearly an advance over fielding percentage, but it was laced with problems. First, it intermixes different skills without previous reflection concerning each. Infielders amass putouts and assists both through fielding batted balls and through participation in double plays and force outs, but these are reflective of different skills, only the first of which is directly relevant to range. I did a couple of studies in which I attempted to solve this problem by measuring infielding purely by assists, under the assumption that they were a purer measure of range than putouts. Bill published both, respectively in issues 24 (June 1986) and 31 (August 1987) of the "Baseball Analyst", although I believe that he disagreed with my method given the display of range that can be shown when infielders catch pop-ups far from their position. And I knew full well that many assists are racked up as a double-play middleman. Second, and these were the issues that my two studies were really about, range factor ignored two pitching-staff biases. Pitching staffs with high strikeout totals limit infielder opportunities to field balls. And pitching staffs with a high proportion of innings taken by lefthanded pitchers will face a preponderance of righthanded batters, leading to proportionally more grounders to the third baseman and shortstop and fewer to the second and first basemen, when compared to pitching staffs with few lefty innings.
I presented this material at a SABR convention near Washington D.C., if I remember correctly in 1986; during my presentation, an audience member noted that pitching staff groundball vs. fly ball tendencies have analogous implications. Interestingly enough, John Dewan assumed the pitching-handedness bias and presented fielding measures adjusted for this problem at the same SABR convention; for Dewan, this began a concern with the issue that has continued to this day.
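The strikeout bias is easy to see with a small sketch. Range factor follows the definition given above (putouts plus assists per game). The second function is a hypothetical opportunity-adjusted rate, assists per 100 balls put in play against the fielder's team, and is only an illustration of the idea, not the method of the two "Baseball Analyst" studies. All numbers are made up.

```python
def range_factor(putouts: int, assists: int, games: int) -> float:
    """Putouts plus assists per game played."""
    return (putouts + assists) / games


def assists_per_100_bip(assists: int, balls_in_play: int) -> float:
    """Hypothetical adjusted rate: assists per 100 balls in play.

    balls_in_play is assumed to exclude strikeouts, walks, hit batters,
    and home runs, so a high-strikeout staff yields a smaller denominator.
    """
    return 100.0 * assists / balls_in_play


if __name__ == "__main__":
    # Two invented shortstops with identical raw totals.
    # Shortstop B plays behind a staff that strikes out more batters,
    # so fewer balls were put in play while he was on the field.
    rf_a = range_factor(putouts=250, assists=450, games=150)
    rf_b = range_factor(putouts=250, assists=450, games=150)
    adj_a = assists_per_100_bip(assists=450, balls_in_play=4300)
    adj_b = assists_per_100_bip(assists=450, balls_in_play=4000)
    print(f"range factor: A={rf_a:.2f}  B={rf_b:.2f}")
    print(f"assists per 100 BIP: A={adj_a:.2f}  B={adj_b:.2f}")
```

The two range factors are equal, yet B converted a larger share of his chances. A handedness or groundball adjustment would work the same way, by changing what goes in the denominator.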

It was obvious that if we wanted to measure fielding plays made on batted balls independently of participation in double plays and free of biases due to pitching staff tendencies, we would have to go beyond the standard statistical measures of fielding and use play-by-play data to measure the proportion of balls hit into the portion of the ballpark for which each position is responsible that are successfully fielded. Fortunately, at about this time Project Scoresheet was beginning to supply the needed data, and analysts started using it for this purpose. The earliest effort of which I am aware was Pete DeCoursey’s work on what he called defensive average, first published in the March 1989 issue of a (sadly) short-lived publication called the "Philadelphia Baseball File." I believe others among “amateur” statistical analysts continued in this vein, and would be happy to hear from readers who have information on anybody doing this work during the 1990s. As for the “professionals,” and probably thanks to Dewan, the STATS annual Player Profiles books during the 1990s included a measure called zone rating, which unfortunately gave credit for two plays for fielded balls turned into double plays, in so doing conflating two different skills.
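A zone-based measure of this kind can be sketched in a few lines. Everything here is an assumption made for illustration: the zone labels, the mapping of zones to positions, and the record layout are invented, not the Project Scoresheet or STATS formats. Each batted ball counts as one opportunity and at most one play made, so a ball turned into a double play is not counted twice.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class BattedBall:
    zone: str        # hypothetical hit-location label
    out_made: bool   # True if the ball was fielded for at least one out


# Hypothetical assignment of zones to the responsible position.
ZONE_OWNER = {
    "Z1": "3B",
    "Z2": "SS", "Z3": "SS", "Z4": "SS",
    "Z5": "2B", "Z6": "2B",
    "Z7": "1B",
}


def defensive_average(balls: Iterable[BattedBall],
                      position: str) -> Optional[float]:
    """Share of balls hit into a position's zones that became outs.

    Returns None when no ball was hit into that position's zones.
    """
    opportunities = 0
    plays_made = 0
    for ball in balls:
        if ZONE_OWNER.get(ball.zone) == position:
            opportunities += 1
            if ball.out_made:
                plays_made += 1   # one credit, even on a double play
    if opportunities == 0:
        return None
    return plays_made / opportunities


if __name__ == "__main__":
    sample = [
        BattedBall("Z3", True), BattedBall("Z3", True),
        BattedBall("Z2", False), BattedBall("Z4", True),
        BattedBall("Z5", True), BattedBall("Z6", False),
    ]
    for pos in ("SS", "2B", "3B"):
        print(pos, defensive_average(sample, pos))
```

Because the denominator is balls actually hit toward the fielder, strikeout, handedness, and groundball tendencies of the pitching staff drop out without any separate adjustment.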

What are the lessons I think we should take home from all this? Let’s start with two do nots. First, do not use the standard indices, because no matter how well they are massaged they do not provide valid information. An example of this is the Defense ratings appearing in the Baseball Prospectus group’s annual. They are not always clear about their methods; from a description in "Baseball Between The Numbers" (page 97), it seems that Clay Davenport’s version of fielding runs begins with the standard measures and then adjusts them for park factor and the pitching staff tendencies mentioned above. As far as I can tell, they do not take the double play problem into account, but otherwise these adjustments are right-headed. But the method doesn’t seem to work. If you glance through their books, their Defense ratings for players differ wildly from year to year, at least by eye-ball analysis, far more than random factors would allow. And they don’t trust their own numbers, regularly making verbal comments clearly inconsistent with their own calculations. For two examples from the 2007 book: on page 418 they ask whether Chris Duncan is “the single worst defensive outfielder in modern memory,” but his 2006 ratings are slightly above average (+1) in both left and right field; on page 381, they wittily call Pat Burrell “the Zeno’s Paradox Outfielder, in that no matter how close he seems to be to catching the ball he’s only halfway there,” but his 2006 rating (-2) isn’t all that bad. An interesting case, of course, is the normally-maligned Derek Jeter. According to Davenport’s numbers, after years of futility (-12 in 1999, -22 in 2000, -17 in 2001, -19 in 2002, -15 in 2003) Jeter improved to -4 in 2004 and became a good shortstop the past two seasons (+12 in 2005, +7 in 2006). Chapter 3-1 of "Baseball Between The Numbers" takes this change as its main theme and attributes it to having Alex Rodriguez next to him.
As I will describe below, we have good evidence that Jeter’s defense has not improved, and, while I like most of what BP does, I don’t trust their fielding numbers for a second.

Second, if you are going to combine indices for the different skills involved in fielding, do not do so arbitrarily. The example here is Bill James. I admire what he attempted to do in his Win Shares book, but much of it is based on what seem to be arbitrary decisions that make no sense to me. To begin, pitching is given 67.5 percent of the credit for defense and fielding the remaining 32.5 percent; the reader is never told where these numbers come from. The division of this 32.5 across positions is performed according to criteria that the author himself admits to be arbitrary. Ratings for each position are made in the context of their different skills; here is the method for infielders:

There is no indication of where these numbers come from: why double plays are the most important part of second base play, why putouts are irrelevant to third base and so low for the other positions; could this be a late recognition that I was correct more than fifteen years before the book was written about removing putouts from range factor? Unless and until we get a convincing rationale for these proportions, as with the BP work I don’t trust any of Bill’s ratings for a second.

What would I like to see? First and foremost, I would like to see all measurements of range and sure-handedness based on play-by-play data. Dewan has continued work in this regard with his Baseball Info Solutions; his book The Fielding Bible is a gold mine of valuable data on defense. I might add that Dewan’s work makes plain Jeter’s continued defensive shortcomings; in 2005, he ranked 31st in Dewan’s metric among 32 rated shortstops. David Pinto’s Probabilistic Model of Range and Mitchel Lichtman’s Ultimate Zone Rating are basically identical with Dewan’s work in this regard.

But more generally, I think it is possible to come up with a fielding metric that does a fairly good job of evaluating most aspects of fielding (catchers excluded) in the context of either bases or runs. Beginning with range and sure-handedness, turning a measure such as Dewan’s or Pinto’s into either a base or run measure should be easy; Lichtman already calculates Ultimate Zone Rating in run metrics. As for the other aspects of fielding: in Volume 10 Number 3 of SABR’s Baseball by the Numbers, Clem Comly proposed a nice method (Average Arm Equivalent Method, or ARM) for evaluating the number of runs outfielders either save or cost their team based on their number of assists relative to their number of opportunities to throw out baserunners. As ARM is already in a run metric, it would merely have to be summed with Ultimate Zone Rating for outfielders. We do need to come up with a good method for dividing up responsibility for the second out on double plays; are there any out there of which I am unaware? I know that Pinto has recently put some attention to the double-play problem. I admit that the first baseman’s ability to turn errant throws into outs gets shortchanged here; I’m not sure whether any of the currently available play-by-play data provides enough detail for us to enter that into the equation. It may not be perfect, but contra Rickey/Roth there’s quite a bit we can do with fielding.
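The summing step described above might look like the following sketch. The function names and the `runs_per_play` conversion factor are assumptions; the value used in the example is a placeholder, not an estimate taken from Dewan, Pinto, Lichtman, or Comly. The double-play term is left as an optional input because no agreed method for it exists yet.

```python
def range_runs(plays_made: float, expected_plays: float,
               runs_per_play: float) -> float:
    """Convert plays above or below average into runs.

    expected_plays is the number of plays an average fielder would have
    made on the same batted balls; runs_per_play is an assumed average
    run value of turning a would-be hit into an out.
    """
    return (plays_made - expected_plays) * runs_per_play


def total_fielding_runs(range_r: float, arm_r: float = 0.0,
                        double_play_r: float = 0.0) -> float:
    """Sum components that are already expressed in runs."""
    return range_r + arm_r + double_play_r


if __name__ == "__main__":
    # Invented outfielder: 11.5 plays above average, plus an arm rating
    # already expressed in runs (as an ARM-style measure would be).
    rr = range_runs(plays_made=310, expected_plays=298.5,
                    runs_per_play=0.8)   # 0.8 is a placeholder value
    print(round(total_fielding_runs(rr, arm_r=2.5), 1))
```

The design point is only that once every component is in the same currency, combining them is addition rather than an arbitrary weighting.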

"Dewan’s work makes plain Jeter’s continued defensive shortcomings... David Pinto’s Probabilistic Model of Range and Mitchel Lichtman’s Ultimate Zone Rating are basically identical with Dewan’s work in this regard." [emphasis added]

I took Charlie to be saying these systems agreed with Dewan's in their negative evaluation of Jeter, not that Charlie was saying these systems in general were "identical" to Dewan's.

I don't know (directly) the intellectual history of the late 80s/early 90s fielding systems, but the "Great American Baseball Stat Book" based on Project Scoresheet data and published after the 1987 season looks like a step on the way to Pete DeCoursey's work, and apparently was published a year before his. That book contained a system credited to Gary Gillette (and Dave Nichols). It presented a form of adjusted range factor which adjusted for batter handedness and total balls in play while the fielder was in the field. I don't know DeCoursey's work from its original publication in Philadelphia Baseball File, but rather from an article he contributed to Brock Hanke's 1990 Baseball Sabermetric after the '89 season. In that article he credits Sherri and David Nichols as co-creators. The advance over the Gillette/Nichols system was to measure opportunity less imprecisely by making use of the fairly general hit location data available in the '88 Project Scoresheet data - instead of any ball in play on the field being an opportunity, balls hit to or through the area near the fielder were counted. Performance on ground and "air" balls was subtotalled for infielders, but no adjustment was made in weighting them for the overall "defensive average." So opportunities were accounted for rather directly but no direct effort was made to adjust for the difficulty of those opportunities. At least in the Baseball Sabermetric article, further information on extra base hits allowed and double plays initiated was also presented and discussed, but was not part of the defensive average itself.

In the early '90s in his Baseball Player and Team Ratings, Mike Gimbel used STATS data and a linear weights-style system for hits and errors to construct a plus/minus defensive run value. He also attempted park adjustments. Because STATS measured hit location in a more granular way than Project Scoresheet, Gimbel had more precise zones, reducing some of the uncertainty about a fielder's real opportunities. In the infield, Gimbel counted only ground balls, finessing the problem of accounting for the variant difficulty of opportunity from different hit types.

If I recall properly, MGL has stated that he was independently developing his system about this same time, using the same granular data from STATS as Gimbel, and that "defensive average" was an inspiration. His advance over Gimbel was to exploit the granular data more fully (measuring difficulty in "subzones"), to use more elaborate park adjustment, and to add adjustments that recognize the impact of contextual factors on fielder position and therefore difficulty (base-out situation and batter-handedness).

Gimbel is mainly infamous for his braggadocio as Dan Duquette's statistical consultant with the Red Sox in the mid-90s, but as far as I know his fielding system was the (published) state of the art in the early 90s. Unfortunately for him, his books were published by a small press in the pre-internet age, and his place in fielding-system-history seems to be forgotten.