Well, allow me to make amends there. I don’t pretend to have the problem solved. I’m not sure any of us will ever see it truly solved. But I think—or at least, hope—this can point us in the right direction.

The Two Problems

We can really subdivide our problems neatly into two. One is the issue of bias, the other the issue of uncertainty.

Let us start with the latter. What we are trying to do here is measure, and then compare, two things:

How many plays a player has made, and

How many plays we think an average player at that position would have made, given the same chances.

The first we think we can measure directly: given the record, we can readily come up with a total. We may have some disagreement over what to count, but once we settle that, we can agree on the total. The second is an estimate and, as such, is subject to error. Over time, the error in our estimate should come down (as a proportion of our estimate, that is).

Now, what modern defensive metrics (ones based on observational data, like batted-ball types, hit locations, etc.) are trying to do is cut down on the effects of measurement error on our estimate of plays made by an average player.

By attempting to reduce measurement error, those metrics have introduced the potential for bias into their estimates, however. The two key ones are:

Park-scorer biases. To the extent that a park influences the scoring of batted balls, that has an impact on our estimates. It could have to do with the identity of the scorer in different parks. It could relate to the vantage point of the scorers in each park. Regardless of the source, it distorts the estimates of a fielder’s chances.

Range biases. To the extent that a fielder’s range (or the range of his teammates) influences the scoring of batted balls (either by type or location), that also distorts the picture of a fielder’s abilities. The most obvious possible effect is that a good fielder will raise the number of estimated chances he gets by getting to more balls (or at least getting closer to them)—and vice versa for a poor fielder. This would both artificially compress the observed spread of fielding performance, and systemically underestimate fielders with good range (and overestimate fielders with poor range).

So what we have is some presumption of increased accuracy, in exchange for additional bias. What we do not know, as of yet, is how much accuracy we are gaining, at the expense of how much bias. And I think that’s an important thing to know—if your gain in accuracy is less than the amount of bias you’re introducing, you haven’t actually gotten better, you’ve gotten worse.

And we know how to solve the accuracy problem—get more data! Over a long enough timeline, the estimates will improve on their own. Adding more data doesn’t make bias any better, though—in fact, over time, the effect of bias becomes more powerful.

Just the Facts, Ma’am

So let’s take a different approach. Let’s try to design a fielding metric with no bias—or, at least, attempt to minimize the effect of bias. What we can do is:

Restrict ourselves to looking at only factual data—data we can validate objectively. That means no batted-ball data, no hit location data, etc.

For estimating the number of plays an average player at that position would have made, ignore data about the outcomes of batted balls whenever possible.

Err on the side of caution when deciding whether or not to adjust—in other words, make as few adjustments as possible. We can allow the data to be expressive by getting the metric out of its way whenever we can.

Over time, the potential inaccuracies of our data should wash out, and because we think we are minimizing our potential for bias, over a long period of time we should be able to be confident of our measure of a fielder’s ability.

Figuring Plays Made

Looking at play-by-play data available from Retrosheet, we can start off with counting the plays a player actually made on the field.

Ideally, what we would do is separate the fielding of balls hit on the ground (OK, OK—ground balls) from balls hit in the air (pop-ups, liners, and fly balls). But we’ve already committed to not using that sort of data. Is there anything we can do, simply looking at facts, to determine what sort of plays a player made?

For outfielders, it’s a simple matter. We just count an outfielder’s unassisted putouts as his plays made. (His assists we can examine separately at a later date.)

For an infielder, how are we to determine whether he caught the ball on the fly or fielded it on the ground, without resorting to batted-ball categorizations? It’s simple (if a bit messy for first basemen and pitchers):

An assist by the infielder who first fielded the ball counts as a play made on a “ground ball.” (This is not always the case: a fielder who deflects a ball that is then fielded by another player for an out is credited with an assist. But this is rare enough that over time we can ignore it, and in the short run we can do little about it.)

An unassisted putout of a baserunner, other than the batter, by an infielder is a play made on a “ground ball.” For catchers, second basemen, third basemen, and shortstops, an unassisted putout of the batter is a play made on an “air ball.” There are rare occasions, mostly for second basemen, where this isn’t the case, but again over time we shouldn’t have to worry about this.

For first basemen, an unassisted putout of the batter is a “ground-ball” out when it was either on a bunt attempt or hit by a left-handed batter. For pitchers, an unassisted putout of the batter is a “ground-ball” out on a bunt attempt only. All others are classified as “air-ball” outs. This is probably the least-confident part of the system, but for now we’ll leave it as it is.
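The classification rules above can be sketched in code. This is my own illustrative rendering, not the author's implementation; the record fields (position, assist flag, and so on) are hypothetical stand-ins for what you would derive from Retrosheet event files:

```python
# Hypothetical, simplified play records; real Retrosheet events need parsing.
def classify_play(pos, is_assist, putout_of_batter, is_bunt, batter_hand):
    """Classify an infielder's play as 'ground' or 'air' per the rules above."""
    if is_assist:
        # Assist by the fielder who first fielded the ball: ground ball.
        return "ground"
    if not putout_of_batter:
        # Unassisted putout of a baserunner other than the batter: ground ball.
        return "ground"
    # Unassisted putout of the batter:
    if pos == "1B":
        return "ground" if (is_bunt or batter_hand == "L") else "air"
    if pos == "P":
        return "ground" if is_bunt else "air"
    return "air"  # C, 2B, 3B, SS

print(classify_play("SS", True, False, False, "R"))   # ground
print(classify_play("1B", False, True, False, "L"))   # ground
print(classify_play("2B", False, True, False, "R"))   # air
```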

So this gives us, at the team level, outs on the ground versus outs in the air. And what we see is a strong negative relationship between ground plays and air plays, with a correlation of -0.77. So when a team makes a lot of ground-ball plays, the most likely explanation is that they saw a lot of ground balls.
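As a sketch of that team-level check, here is a Pearson correlation between ground-ball and air-ball plays on a handful of made-up team totals (the -0.77 figure above comes from the real data, not from these numbers):

```python
import statistics

# Made-up team totals for illustration; the article's r = -0.77 is from
# actual team-season data.
ground = [1750, 1810, 1690, 1880, 1720]
air    = [1260, 1205, 1310, 1150, 1290]

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

print(round(pearson_r(ground, air), 2))   # strongly negative for this toy data
```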

So, let’s adjust for that. What we can do is look at how many plays a team made in total, compared to the average team, and then look at how many ground-ball plays a team made compared to how many air-ball plays they made. A team with superior ground-ball fielders will not only have more ground-ball plays but likely more plays made overall.

So for a team that’s above-average on making ground-ball plays but below-average in making total plays, we “shift” the responsibility toward the ground-ball plays (in other words, inflate the number of ground-ball plays we think the team should have made, but deflate the number of air-ball plays we think the team should have made), while keeping the total number of plays we think the team should have made constant.

This is, for lack of a better term, our “ground-ball rate” adjustment. It’s a bit of a misnomer, because we ignore any scorer data on the number of ground balls a defense saw. And it is possible that including that scorer data could improve the process here as well. But for now, let’s err on the side of excluding that data.
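Here is a sketch of that shift, holding the expected total constant. The shift fraction `alpha` is a made-up tuning parameter for illustration; the article doesn't specify how far responsibility is shifted:

```python
# Sketch of the "ground-ball rate" adjustment: move expected plays between
# ground and air to match the observed mix, holding the expected total fixed.
# alpha is a hypothetical tuning parameter, not a value from the article.

def adjust_expected(exp_ground, exp_air, obs_ground, obs_air, alpha=0.5):
    """Shift expected plays toward the observed ground/air mix; total is constant."""
    exp_total = exp_ground + exp_air
    obs_share = obs_ground / (obs_ground + obs_air)   # observed ground share
    exp_share = exp_ground / exp_total                # expected ground share
    new_share = exp_share + alpha * (obs_share - exp_share)
    return exp_total * new_share, exp_total * (1 - new_share)

# A team above average on ground-ball plays but below average overall:
g, a = adjust_expected(exp_ground=1750, exp_air=1250, obs_ground=1800, obs_air=1100)
print(round(g, 1), round(a, 1))   # expected ground plays rise, air plays fall
```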

Breaking Down the Fielders

What we do now is apply the process from above to individual fielders. As we did for teams, we break down outfielder plays, infielder plays on the ground, and infielder plays in the air. That tells us how many plays each fielder made.

Then we look at each batted ball and estimate the likelihood that each fielder makes a play on it. The only data we are considering right now is the handedness of the batter who hit the ball. (For first basemen, we’re also considering whether or not they had to hold a runner at first.) We aren’t considering who eventually fielded the ball, whether or not the ball was an out or a hit, etc. Why? Because the outcome of the batted ball is a potential source of bias. By giving up some accuracy in the short run, we allow truly great fielders to look truly great—otherwise, we artificially compact the spread of the impact of top fielders over time.
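As a sketch, the chance accounting reduces to summing per-BIP probabilities. The rates below are the 2009 shortstop figures quoted later in the article, used here purely for illustration; the function and dict names are my own:

```python
# Illustrative per-BIP play probabilities for a shortstop, keyed only on
# batter handedness (roughly the 2009 rates cited in the article).
SS_PLAY_RATE = {"R": 0.12, "L": 0.06}

def expected_plays(batted_balls):
    """Sum per-BIP play probabilities; batted_balls is a list of batter hands."""
    return sum(SS_PLAY_RATE[hand] for hand in batted_balls)

# 300 BIP vs. right-handers and 200 vs. left-handers:
print(round(expected_plays(["R"] * 300 + ["L"] * 200), 1))   # 48.0
```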

So we have our measure of plays made, and our estimate of chances. We can leave off there, at least for infielders. (Outfielders will require a bit more work, I’m afraid, and that will have to wait for another day.) But we discussed uncertainty: can we at least try to measure it?

Ignore, at least for now, uncertainty about actual plays made. For first basemen and pitchers especially we do have some, but little enough that we can afford to set it aside for a while. But for our estimate of how many plays a fielder should have made, we know there is a margin of error. What we can do is calculate the uncertainty of our estimate per ball in play, and use that to figure our total uncertainty for any given player.

What I did is figure the root mean square error between the average number of plays made and the actual plays made, on an individual basis.

For example: In 2009, with a right-handed hitter batting, a shortstop will make a play on a ball in play roughly 12 percent of the time. (For a left-handed batter, a shortstop will make a play on a ball in play roughly six percent of the time.) But the margin of error around our estimate of how often a shortstop will make any single play is about 30 percent. (Notably the error is asymmetrical—obviously there is no chance of a shortstop making a negative play, even if in exasperation I may have accused Alex Gonzalez of it during the ’03 playoffs.)
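As a sketch (with simulated data, not the real Retrosheet sample), the per-BIP RMSE calculation looks like this; with an average play rate of 12 percent it lands near the ~30 percent figure quoted above:

```python
import math

# Root mean square error between the average play rate and the actual 0/1
# outcome of each ball in play. Simulated data, not the real sample.

def rmse_per_bip(p_avg, outcomes):
    """RMSE of 0/1 outcomes around the average play probability p_avg."""
    return math.sqrt(sum((o - p_avg) ** 2 for o in outcomes) / len(outcomes))

# 100 BIP vs. right-handed batters, 12 of them converted into plays (12%):
outcomes = [1] * 12 + [0] * 88
print(round(rmse_per_bip(0.12, outcomes), 3))   # 0.325
```

(When the assumed rate matches the sample rate, this is just the binomial spread, sqrt(p(1-p)).)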

To attribute that margin of error over a number of chances, we take:
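(The formula itself seems to have dropped out of the text here, likely an image in the original. Judging from the numbers that follow, a per-play error of roughly 30 percent shrinking to about three percent per play after 100 BIP, it is presumably the standard square-root scaling for independent chances, where sigma is the per-BIP RMSE:)

```latex
% Presumed reconstruction of the missing formula, assuming independent
% chances with per-BIP standard error \sigma (about 0.30 for shortstops):
\mathrm{MOE}_{\text{total}} = \sigma\sqrt{n},
\qquad
\mathrm{MOE}_{\text{per BIP}} = \frac{\sigma\sqrt{n}}{n} = \frac{\sigma}{\sqrt{n}}
```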

What’s interesting about this is that the margin of error per BIP drops, the more BIP we observe. So, after 100 BIP, the margin of error for any one play drops all the way to three percent.

(That’s why, to me, uncertainty is preferable to bias—with enough statistical power, we can plow through uncertainty readily. Without an accounting of what the bias is, we’re essentially powerless against it.)

Some Examples

After taking you all this way, surely I wouldn’t leave you without something to look at, would I? Here are the top 10 seasons by a shortstop since 1950, according to our new fielding metric:

Name              Year  Chances  Plays  AvgPlays   +/-   MOE  +/-R  MOE_R
Guillen, Ozzie    1988     4480    515     442.3  72.7  19.5  55.9   15.0
Ryan, Brendan     2009     2507    325     259.0  66.0  14.5  53.7   11.8
Fermin, Felix     1989     4217    480     411.0  69.0  19.0  53.3   14.7
Belanger, Mark    1975     3996    467     403.9  63.1  18.9  49.2   14.7
Tulowitzki, Troy  2007     4294    490     432.0  58.0  19.1  48.4   15.9
Sanchez, Rey      1999     3666    391     336.9  54.1  17.5  46.4   15.1
Thon, Dickie      1983     4271    481     423.0  58.0  19.3  45.5   15.1
Smith, Ozzie      1980     4618    570     512.0  58.0  20.2  45.2   15.7
Martinez, Felix   2000     2818    318     265.8  52.2  15.4  45.2   13.4
Sanchez, Rey      2000     3785    394     342.8  51.2  17.9  44.3   15.5

I’ve provided a tentative conversion of plays to runs (the +/-R column), although it still needs a little work. Note, for instance, that Ozzie Guillen is being credited with about 73 plays above the average shortstop for 1988. That’s pretty impressive. It’s also pretty imprecise, with a margin of error around 20 plays.

What’s important to note is that the error is not symmetrical—we think there’s practically no chance that Guillen really made over 90 plays above average, for instance.

So, on a single-season level, we see some quizzical results. (Brendan Ryan? Really?) The important thing to remember is—we aren’t very confident in those results! Our confidence increases as we move to the career level, though:

Name              +/-R   MOE_R
Smith, Ozzie      322.1   61.4
Belanger, Mark    237.2   50.1
Sanchez, Rey      217.7   37.6
Russell, Bill     190.6   49.7
Valentin, Jose    177.4   43.3
Guillen, Ozzie    168.3   52.3
Templeton, Garry  150.3   53.2
Groat, Dick       144.0   49.0
Maxvill, Dal      139.3   37.5
Gagne, Greg       130.4   49.8

That isn’t to say there’s no uncertainty. We can say, given the statistical evidence we have at hand, there’s a small (but not impossible) chance that Mark Belanger saved more runs compared to the average shortstop than Ozzie Smith did. And after that, well, nobody else is in the running.

Well, obviously I have to produce outfielder measurements as well. And there are probably still some tweaks to be made to this system that could improve it.

But past that—these values cannot simply be used in place of FRAA to calculate WARP the way we’re doing it now. We have this measure of uncertainty. We can similarly compute uncertainty for our offensive metrics (and it’s quite a bit smaller on a per-play basis). We cannot, in coming up with a single value to express a player’s season, add defense to offense as though we are equally certain of both.

So we’re going to be revising WARP to account for this uncertainty. Along the way, we’ll be adding some other enhancements to WARP as well. And we’ll be looking at pitching—after all, a lot of what we’ve always thought was pitching is fielding, isn’t it? And so any uncertainty we’ve had in measuring fielding spills over into pitching as well.

So, consider this a beginning, not an end.

Notes and Asides

I should give a nod to Bill James’ Fielding Win Shares, which served as an inspiration for some of my efforts here. I should also give a nod to the work Sean Smith has done on TotalZone, which was also something I spent a lot of time thinking about.

For some discussion on the spread of defense, I cannot recommend these enough:

Thanks, for the great work. I'd love to see the complete list for infielders, but I just have to know what your system thinks of Nick Punto. Is Ron Gardenhire's deep and abiding love for Punto's getafteritness warranted?

Great piece (again). Every time I see 'Colin Wyers' on the main page nowadays, I make a beeline for that article.

A commonly-referenced rule of thumb for the other defensive metrics out there is that three seasons' worth of data are required before a person can start to draw some meaningful conclusions. As you stated in the article, estimates should improve over time using this approach. Would you consider three years to be a reasonable timeframe to start drawing meaningful conclusions?

As Colin mentioned, the margin of error is proportional to the square root of the number of balls in play. The margin of error as a percentage, then, is proportional to the inverse of the square root of the number of balls in play.

So if you multiply your balls in play by three (i.e., three seasons instead of one), your percentage error will drop by 1.7x. (The square root of 3 is about 1.7.) If you multiply balls in play by 10, your percentage error will drop by about 3x.
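A quick check of the square-root arithmetic in this comment (the factor for a tenfold sample is closer to 3.2):

```python
import math

# Percentage error scales as 1 / sqrt(balls in play), so multiplying the
# sample by k cuts the percentage error by a factor of sqrt(k).
for k in (3, 10):
    print(k, round(math.sqrt(k), 1))   # 3 -> 1.7, 10 -> 3.2
```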

I'm trying to digest this and see if there is any way to squeeze any more adjustments into it without adding too much bias.

Particularly, I'm wondering if there's a way to adjust for pitcher tendencies without looking to the results of the plays made. The only ones I can think of where we have historical data are pitcher handedness and the groundball-airball split (as you defined it) by pitcher.

Colin, as I mentioned on Twitter, can you use these numbers to estimate the magnitude of range bias for various advanced fielding systems (and at various positions)? Over a large sample of players, the park-scorer bias should become much less important.

If the ~70 run difference for Ozzie Smith is due to range bias, and 1 play = 0.8 runs, and Ozzie played about the equivalent of 17 seasons, then 70 / 0.8 / 17 = about 5 runs per season due to range bias.

If we apply the same method to a large group of players, we ought to be able to estimate the range bias.

Then, I wonder--thinking out loud here--if the range bias for large samples of players is known, could you then turn around and estimate the park bias?
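The back-of-the-envelope arithmetic from this comment (70 runs, 0.8 runs per play, roughly 17 seasons) can be reproduced directly; all three inputs are the commenter's rough assumptions, not measured values:

```python
# Reproducing the commenter's rough range-bias estimate for Ozzie Smith.
runs_gap = 70          # run difference attributed to range bias (assumed)
runs_per_play = 0.8    # rough runs-per-play conversion (assumed)
seasons = 17           # approximate full seasons played (assumed)

per_season = runs_gap / runs_per_play / seasons
print(round(per_season, 1))   # about 5 runs per season
```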

For example, placing 95th percentile ranges around such stats as MORP would be informative. Hypothetically, Ben Sheets may have a $6M MORP with a WIDE range while, oh... Mike Pelfrey may have a similar MORP with low variance.

Perhaps I'm misunderstanding the methodology, but I don't understand why one should expect that "calculating the uncertainty of our estimate per ball in play, and use that to figure our total uncertainty for any given player" will give you the correct distribution of the error. Unless you assume that, in the sample size of five or ten years or the length of a career, the error will begin converging to an easy-to-play-with distribution (i.e. one without too much kurtosis), your assertions about the error term don't have a very strong foundation.

Normally, when sabermetrics parameterizes a model or tests a hypothesis, these questions are mostly irrelevant. But when the argument is based on the idea that the estimator is consistent or unbiased, they become central. Perhaps your methodology allows the bias to disappear asymptotically, but "asymptotically" can mean "after 10,000 seasons of a player with unchanging defensive ability" not "fifteen during which the player ages and/or learns". Intuitively, I would prefer to introduce some measure of bias if in the pre-asymptotics real world it on average gives us more information than it takes away.

Do you have evidence that the estimate converges fast enough for individual players to outweigh the additional information current methodologies provide?

Well, it's an interesting question - are there persistent factors that would keep a fielder's chances from converging over time?

They would have to be external factors, of course - we don't suspect that a fielder's very presence on the field changes the distribution of batted balls (and if we do, we probably want to measure that as part of a fielder's skill). So what persistent effects are there that could prevent the estimate from converging over time?

I wasn't saying it's not going to converge eventually, but we don't know when that eventually is. At a sample size of infinity, a method like yours will dominate methods that require biases and judgment calls. Before infinity, it is completely possible (probable) that making adjustments which can introduce range or park effect biases will give us better estimators.

I would be more comfortable if you could say what the pre-asymptotic distribution of the errors might look like, because that would give a much more genuine basis for evaluating its accuracy against what we have now. The error bar that you suggested sounds like a good rule of thumb after having established the errors are (for instance) normally distributed, but by itself I don't think it provides a basis for evaluating it for individual players.

I have a stupid question and/or comment. With the "ground ball adjustment," we're hurting above-average infielders who are paired with below-average outfielders, right? And vice versa?

I think the reply is that the system is worse without that adjustment, and I would intuitively agree. But it seems that it ought to be noted that this is a possible problem with the system. It would also seem to set up the next question: can we make the system better by using simple batted ball types, or do you feel there is too much systematic bias in even that data? What's the tradeoff?

And yes, there's potential for batted ball data, used in a coarse sense, to improve upon what I have right now. What I want to do first is finish the system the way it is for outfielders, and then examine that issue more closely.

It's one standard deviation, which is about 68% of potential outcomes, right?

Dave, you might enjoy the thread that Tango started in response to Colin's article at the Book blog where similar questions are being discussed:
http://www.insidethebook.com/ee/index.php/site/comments/reducing_bias_in_fielding_metrics/#comments

Well, you can state the minimum amount included within a given number of standard deviations of the mean using Chebyshev's inequality. It has the benefit of applying even to non-normal distributions. However, it's pretty conservative and usually a loose underestimate, so it's of limited practical value.
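As an illustration of the Chebyshev bound mentioned in this comment:

```python
# Chebyshev's inequality: at least 1 - 1/k^2 of ANY distribution lies
# within k standard deviations of the mean, regardless of its shape.

def chebyshev_lower_bound(k):
    """Minimum fraction of outcomes within k standard deviations (k > 1)."""
    return 1.0 - 1.0 / (k * k)

for k in (1.5, 2, 3):
    print(k, round(chebyshev_lower_bound(k), 3))
```

At k = 2, Chebyshev guarantees only 75 percent, versus roughly 95 percent for a normal distribution, which is why the bound is so conservative.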