Skewed Left

Saberizing the Gold Gloves


So we won this weekend. At least I think we won. At least I think they told me we won.

It was announced that the Gold Glove Awards will add a metric component to the traditional voting of major-league managers and coaches, a presumed victory for everyone who prefers the analytical and objective over the judgment of the human eye.

So why no celebration in this virtual household, which stands for just that?

First of all, the release didn’t give much information about the metric itself. Here’s a portion of the release from the Society for American Baseball Research:

As part of the multi-year collaboration beginning with the 2013 season, SABR will develop an expanded statistical resource guide that will accompany the Rawlings Gold Glove Award ballots sent to major league-level managers and coaches each year. In addition, SABR will immediately establish a new Fielding Research Committee tasked to develop a proprietary new defensive analytic called the SABR Defensive Index™, or SDI™. The SDI will serve as an “apples-to-apples” metric to help determine the best defensive players in baseball exclusively for the Rawlings Gold Glove Award and Rawlings Platinum Glove Award selection processes. …
Beginning in 2013, the managers/coaches vote will constitute a majority of the Rawlings Gold Glove Award winners’ selection tally, with the new SDI comprising the remainder of the overall total. The exact breakdown of the selection criteria will be announced once the SDI is created later this summer.

In other words, they’re working on it.

But this isn’t a critique of SABR’s motives, which are absolutely in the right place—taking a step toward greater likelihood of getting it “right” and spreading knowledge of defensive statistics. Nor is it really even about the uncertainty of what will come out of the SABR conclave.

It’s about the certainty of the ugly process that will ensue.

1. There will be a fight when the metric comes out.
The problem with giving this committee the task of inventing a metric by modifying/splicing existing metrics is that it’s virtually impossible. It’s not like the analytics community hasn’t been trying.

Using data provided by Baseball Info Solutions, Mitchel Lichtman pioneered Ultimate Zone Rating, which measures a fielder’s ability to make plays both inside and outside a given zone. We at Baseball Prospectus use Fielding Runs Above Average as the defensive component of our value statistics; it focuses more on total plays made, adjusted for conditions such as the pitcher’s ground-ball rate, the batter’s handedness, the ballpark, and the base-out situation.

Both are exhaustively researched and justified, and yet the choice of using one or the other in Gold Glove voting would lead to totally different results. At the positions where UZR is calculated and available at Fangraphs.com, battery not included, here is the breakdown of the top players from 2012 who played the whole season in the same league.

Of 14 positions, UZR and FRAA agree on the Gold Glover in exactly three of them, which is a huge issue for the mainstream public appeal of this vote. (Not even mentioning what would happen in the discourse if a part-time player like Dyson or Bourjos were ranked first at a position.)

It will be a very hard sell for the analysts out there to peddle this idea when it creates division within the sabermetric media and surely within SABR’s membership itself.

But that will only be the first step.

2. There will be a fight when the first vote comes out.
There are already gripes, and legitimate ones, when the award comes out. Remember Rafael Palmeiro winning as a DH in 1999? Derek Jeter winning all five of those awards? Adding the statistical component saves us from having designated hitters win, and that is certainly a good thing.

But instead of uniting the two camps in some awkward arranged marriage, this has the potential to pit the traditionalists against the statistical analysts. When an award comes out where the vote doesn’t match the SDI, that will become a binary-outcome referendum on both parties.

The coaches got it wrong or the numbers got it wrong. One of them had to get it wrong, and dammit, we need to know who it was.

Instead of a celebration of the winner, it’s an examination of the process, which will become very tiresome very quickly. We don’t need any more “WAR, What Is It Good For” columns. Even if SDI has never been in a song lyric, let’s not take that chance over this. There are plenty of more worthy fights for the importance of analytical thinking.

3. We’ll argue over whether stats should be applied directly to more awards.
This is a tough one, because if statistics are going to be applied to any award voting, Gold Gloves might be the worst ones to start with. One-year defensive metrics are notoriously unstable—Alfonso Soriano’s 7.9 FRAA in 2012 came after a -6.4 in 2011 and a -8.0 in 2010. And the differences between metrics mentioned above make the Gold Glove possibly the worst award to add a statistical component to right now. (Okay, Manager of the Year is worse.)

When the statistical community puts its stamp on this award, it has to be prepared to stand behind it. The stats say Soriano, always thought to be a poor defender, was the most accomplished left fielder in the league last year. The stats say a part-time player was the most accomplished center fielder in the league last year.

Is a statistic that research says can take three years to stabilize really the one we want imprinted on a single-season award? There’s an argument to be made that adding a WARP/WAR component to the Baseball Writers Association of America’s MVP awards—or one of the lesser-known MVP-type honors, or maybe even the Hall of Fame—would be a better step.

But SABR doesn’t control any of the awards, and the BBWAA has not been looking to cede any control over its awards (disclosure: the author is a member of both organizations). So this isn’t a knock on SABR, which is doing what it can to advance the discussion, just an unfortunate case of who came calling and had some ground to make a deal.

We hate the process of most awards, yet we haven’t come up with a much better one either here or in the case of the BBWAA awards. If coaches have proven to be the worst voters of any electorate—and it’s really neck-and-neck between them and fans—then change the electorate. Have the people in front offices paid to assess value hand out awards for defensive value.

Perfecting the Gold Glove and other awards is a noble pursuit. This is a small step toward that goal that might be missed in the very predictable reactions to every part of the process.

So from the defensive metric numbers from last year we are to conclude that Alfonso Soriano is a better defensive player than Carlos Gonzalez. And not just a little better--a whole lot better.
What a complete joke.
While there is some merit to what you all do with the defensive metrics, when a result like the above comes out, you need to take a serious look at what you are doing.

So what would you do when the metrics don't agree with your subjective observations? Veto the numbers? Have a subjective fudge factor? Invalidate all metrics that have outliers and idiosyncrasies?

Nobody believes Darin Erstad was a .355 hitter, so while there's some merit in what people are trying to do with offensive metrics, when a result like that comes out someone needs to take a serious look at what they're doing.

Batting average is a very simple stat that measures exactly one thing. What it says about Erstad's merits as a player is up for debate, but it is a factual statement that for one season, 240 of his 676 at-bats ended with a hit.
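That simplicity is the point: the whole computation fits in one line, using only the two counts given above.

```python
# Batting average is just hits divided by at-bats -- the Erstad
# figures cited above (240 hits, 676 at-bats).
hits, at_bats = 240, 676
batting_average = hits / at_bats
print(f"{batting_average:.3f}")  # prints 0.355
```

There is nothing proprietary to trust here, which is exactly the contrast being drawn with the fielding metrics below.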

Advanced fielding metrics incorporate a lot of things and reflect various weightings and interpretations. I can't compute UZR without a spreadsheet -- come to think of it, I can't compute it AT ALL because it includes proprietary data.

Bingo. If just one of these two methodologies said that Soriano was the best LF, I would have just thought it was some bizarre systemic anomaly and ignored it. But when both of them say it, I have to wonder what caused this to happen - I mean Alfonso Soriano...really? The problem is that I can't see the numbers; I can't see if there is a flaw they have in common, or if they are really meaningful. I just have to trust that these things work. I don't trust FRAA, so I am dubious that their inclusion in the Gold Glove (or Platinum Glove, whatever that is) voting will be useful.

That reliance on trust is the ultimate problem. And it ties in with Russell Carleton's piece from Monday. If Rafael Palmeiro 1999 is a damning indictment of the current voting structure, shouldn't we consider Alfonso Soriano 2012 a serious shortcoming for FRAA and UZR?

I don't think so. Palmeiro won a popularity poll which he shouldn't have even qualified for; Soriano's case is based on statistical evidence, which carries a degree of objectivity.
Soriano was the leader in those two statistical measures; that is just a fact. The question is how worthwhile a fact that is, which depends on how much faith you have in those respective statistics as a measure of defensive value.

By the way, the voting system in these comments needs fixing. Just because bluesman98's comment received a certain number of negative votes, it was grayed out. Given that it was only a -4, that seems extreme to me, but worse, since it started the thread, graying it out took this whole thread with it. That is asinine, as this is a fairly interesting discussion. I voted bluesman back up, in an attempt to restore the thread. Please fix this.

"One-year defensive metrics are notoriously unstable—Alfonso Soriano’s 7.9 FRAA in 2012 came after a -6.4 in 2011 and a -8.0 in 2012."

...

"The stats say Soriano, always thought to be a poor defender, was the most accomplished left fielder in the league last year. The stats say a part-time player was the most accomplished center fielder in the league last year.

Is a statistic that research says can take three years to stabilize really the one we want imprinted on a single-season award?"

As I understand it, the exact use of the fielding stats won't be known and will be one component of the voting. If I were to do it, I'd use some sort of agglomeration of the statistical fielding metrics for that component of the voting.

We can do that, and have done that (see this article by Colin Wyers, for instance). It makes sense. The potential problem with doing that is that you run the risk of confusing people who expect to see one number, and it's also sort of a pain from a display standpoint.

You could order the players at each position by UZR and FRAA, weight by rank order and use the sum of the weighted scores or something like that. My big problem with any of the defensive metrics is that they jump around so much season to season. That might just be due to nagging injuries, or it could be that they are just fundamentally flawed.
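The rank-order scheme described above can be sketched in a few lines. This is a hypothetical illustration, not anything SABR has announced: the player labels and metric values are made up, and it uses a simple Borda-style count where the top-ranked player in each metric earns the most points.

```python
# Hypothetical Borda-style aggregation of two fielding metrics.
# Player names and metric values are illustrative only.

def borda_scores(metric):
    """Rank players by a metric (higher is better); the best rank
    earns n-1 points, the worst earns 0."""
    ranked = sorted(metric, key=metric.get, reverse=True)
    n = len(ranked)
    return {player: n - 1 - i for i, player in enumerate(ranked)}

uzr = {"Player A": 12.3, "Player B": 9.1, "Player C": 4.0}
fraa = {"Player A": 3.2, "Player B": 10.5, "Player C": 7.7}

# Sum each player's rank points across both metrics.
combined = {p: borda_scores(uzr)[p] + borda_scores(fraa)[p] for p in uzr}
winner = max(combined, key=combined.get)
print(winner)  # prints Player B
```

Note the design choice this bakes in: rank order throws away the magnitudes, so a metric that rates a player far ahead of the field counts no more than one that rates him barely ahead. Weighting by the raw runs values instead would behave differently.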

I hear you 100%. The tendency for these things to generate ratings which run the gamut is hard to grab ahold of as a fan with limited sabermetric understanding (meaning myself). Taking it two steps further and urging casual fans and old-school managers to adopt it is going to lead to fireworks. Boring, boring fireworks.

That said, the idea that this year-to-year instability in the metrics is a definitively bad result strikes me as arbitrary and sorta silly (not directed at you, rweiler - hope that's obvious). Why is it that we can't handle volatility among our star defensive players? We see it in Cy Young and MVP ballots, so why do we expect fielding to produce different results? I am more than willing to believe that defensive ability is not simply a flat, innate talent. Given that variables like health, park effects, weather, opponents' lineups, etc. all factor into every single play, it would be funny to expect consistent annual results. And this is even before the bias of the observer (e.g. the bias toward the flashy play instead of the smooth play) is recognized as playing a larger role in how we judge defense vs. hitting and pitching. I'd think we should be MORE skeptical of an award which features the same names year in and year out in this instance. But that's not how we're trained to see our star athletes, so it'll be a hard sell all the same.

The problem is that we do see considerable year to year stability in pitching and hitting metrics, though pitching metrics are substantially more variable due to injuries. That doesn't seem to be the case with fielding metrics.

This.
Why can't we just accept that fielding events took place and that, whatever else happened, those events may not accurately gauge a player's innate fielding ability? They're just something that he did - like driving in a baserunner.

I suppose the philosophical question at the heart of the debate is what is the Gold Glove award *for*? If it's to say "this guy had the best defensive performance in a given year" then the fact that a guy is "usually" not a great defender is irrelevant if that year he gets the best "score" by some metric.

In the ideal world, we'd have a metric that correlated to defensive ability, but in the real world maybe defensive *performance* is simply volatile season to season.