Measuring the umpire’s effect on the game

We all know umpires make mistakes, especially when calling balls and strikes. While some people will argue that those mistakes are part of the game, few are able to make a convincing argument to support that. On the other hand, the blogosphere is littered with arguments against umpires and for a computerized zone.

I’m not here to take sides (I’m actually firmly in the camp for human umpires). Instead, I wanted to take a look at how much of an impact umpires actually have. First, I wanted to see how accurate umpires are. To do that, I went into my Pitch f/x data and grabbed all pitches that were called strikes and all pitches that were called balls. Then, I mapped them out onto an approximation of a major league strike zone.

I used the average bottom and top hitter zones provided by Gameday, 1.6 and 3.4 feet above ground respectively, as the vertical ends of the strike zone. Then I used the official major league horizontal zone (17 inches) and added two inches of leeway to each side. I also normalized the vertical position of each pitch to batter height.
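For anyone who wants to play along, the zone test described above might look roughly like the sketch below. The names are made up for illustration, and it assumes `px` is the horizontal distance from the center of the plate in feet and `pz` is the height-normalized vertical position in feet. It also ignores the width of the ball, which is a simplification on my part.

```python
# Zone limits taken from the article; constant and function names are illustrative.
ZONE_BOTTOM_FT = 1.6
ZONE_TOP_FT = 3.4
PLATE_WIDTH_IN = 17.0
LEEWAY_IN = 2.0

# Half the plate plus the leeway on one side, converted to feet (10.5 in = 0.875 ft).
HALF_WIDTH_FT = (PLATE_WIDTH_IN / 2 + LEEWAY_IN) / 12.0


def in_zone(px, pz):
    """Return True if the pitch falls inside the lenient rectangular zone.

    px: horizontal offset from the center of the plate, in feet (assumed).
    pz: vertical position after normalizing to batter height, in feet (assumed).
    """
    return abs(px) <= HALF_WIDTH_FT and ZONE_BOTTOM_FT <= pz <= ZONE_TOP_FT
```

A pitch at the center of the plate and 2.5 feet high counts as inside, while one a full foot off center does not.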

Here is what I got:

Remember that this is from the catcher’s point of view.

As you can see, there is significant overlap. While umpires are pretty good at judging the high end of the strike zone, they are absolutely dreadful at judging the bottom and the sides, especially the third base side. Overall, 9.1% of pitches that were called balls were inside of the strike zone, and 21.7% of pitches that were called strikes were outside of the strike zone. I find that second figure outstanding, especially given that I am already giving the umps a pretty lenient strike zone.

If I change the perimeters of the zone to 2 feet both ways, then those percentages become 16.5% and 11.6% respectively. So, assuming that the umpires have no bias towards hitters or pitchers, the “real” zone is likely somewhere in between that.
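The two percentages quoted above are simple tallies. Here is a minimal sketch of how they could be computed, assuming each pitch has already been reduced to an umpire call and an inside/outside flag for whichever zone you pick. The call labels and the function name are placeholders, not actual Pitch f/x codes.

```python
def miss_rates(pitches):
    """Return (pct of called balls inside the zone, pct of called strikes outside it).

    pitches: iterable of (call, inside) pairs, where call is "ball" or
    "called_strike" (placeholder labels) and inside is a bool.
    """
    balls = balls_inside = strikes = strikes_outside = 0
    for call, inside in pitches:
        if call == "ball":
            balls += 1
            balls_inside += int(inside)
        elif call == "called_strike":
            strikes += 1
            strikes_outside += int(not inside)
    # Guard against empty groups so the function never divides by zero.
    ball_miss = 100.0 * balls_inside / balls if balls else 0.0
    strike_miss = 100.0 * strikes_outside / strikes if strikes else 0.0
    return ball_miss, strike_miss
```

Rerunning the same tally with a different zone definition is how you would get the second pair of percentages.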

John Walsh already did some great work a couple of years ago on figuring out the “real” strike zone, and I may try to update that later having the benefit of more accurate Pitch f/x data. However, for now, I wanted to take a look at this from another angle.

Those percentages I quoted above are huge numbers. Any way you swing it, it appears that the umpires are only about 85% accurate, at least this year. That leaves a lot of room for random variation among players. How much? Well, let’s find out.

I queried all pitchers who have thrown at least 500 pitches this year, 329 in total, and sorted each pitcher by the number of pitches called strikes that were outside of the strike zone minus the number of pitches called balls that were inside of the strike zone. Then I divided by total pitches to get a rate stat. Then I multiplied that by 100 pitches, or roughly one game, and named that “Gift Rate”. Here are the results shown graphically:
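Spelled out, the Gift Rate arithmetic is just (gifts minus squeezes) divided by total pitches, times 100. A rough sketch follows, with hypothetical function names and input layout, not the actual query used for the article:

```python
MIN_PITCHES = 500  # qualification cutoff from the article


def gift_rate(gifts, squeezes, total_pitches):
    """Net favorable calls per 100 pitches.

    gifts: called strikes that were outside the zone.
    squeezes: called balls that were inside the zone.
    """
    return 100.0 * (gifts - squeezes) / total_pitches


def rank_pitchers(rows):
    """Return (pitcher_id, gift_rate) pairs sorted from most helped to most hurt.

    rows: iterable of (pitcher_id, gifts, squeezes, total_pitches) tuples
    (an assumed layout). Pitchers under MIN_PITCHES are dropped.
    """
    qualified = [
        (pid, gift_rate(g, s, n))
        for pid, g, s, n in rows
        if n >= MIN_PITCHES
    ]
    return sorted(qualified, key=lambda row: row[1], reverse=True)
```

For example, a pitcher with 30 gifts and 10 squeezes over 1,000 pitches would come out at 2.0 gifts per 100 pitches.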

In case it isn’t clear, the x axis is all pitchers who’ve thrown at least 500 pitches this year.

You can read that as the number of “gifts” minus the number of “squeezes” each pitcher receives per game. You can see that despite the old adage, it does not all even out. Some of that may be due to measurement error, as I don’t profess my strike zone to be very thorough and there still may be problems with the Pitch f/x data (namely park effects), and there may be some sampling error as well; however, it’s clear that umpires affect some pitchers more than others.

The standard deviation of Gift Rate among pitchers this year is about 1.6, which means that 68% of pitchers will have up to a 1.6% difference in their strike rate based on umpires alone. That may not sound like a lot, but consider that, based off of this year’s data alone, there is an R^2 of about .62 on strike% vs. BB/9. The average difference in walk rate among guys with a 1.6% difference in their strike% is about .4, which is pretty significant.

Going by the FIP formula, if you added or subtracted .4 walks per 9 for a league average pitcher, their FIP would rise or fall by about .20 points. Obviously this doesn’t consider how strike% affects K rate, and other factors. So in order to get more actionable numbers, a more rigorous study needs to be applied. However, it serves as a reasonable illustration of the impact that umpires can have.

Now, here are the pitchers who have been getting the biggest help this year:

It’s hard to see any sort of bias in those lists. Among the leaders, you have two of the best pitchers in baseball (Mariano and Vazquez) and two of the worst (Hernandez and Weathers). The trailers are filled with guys with abysmal control, like Willis, and guys with good control, like League. For those who want it, here is the complete list (pitchers are labeled by their Elias ID and my SQL is acting wonky right now, so you’ll have to do some translating).

The next step, along with creating a more accurate strike zone, is finding how much of an impact those missed calls have. We all know that certain missed calls are more significant than others; however, as I showed earlier, the potential impact of a lost or gained strike may be pretty significant in itself.

We use FIP, tRA and other such metrics to eliminate defense and other kinds of luck from pitcher ability. However, it’s possible that umpires themselves may have as big an effect, if not a bigger one.

Comments

Is there any trend from year-to-year for the same guys to be helped or hurt by their calls? For example, maybe Mariano Rivera has sharp enough control that he can consistently stretch his personal strike zone a bit.

I can make sense of why umpires would miss more on the low pitches. What doesn’t make sense to me are the inside pitches to right-handed batters, since any umpire clinic I’ve been to has taught me to set up on the INSIDE part of the plate. I would guess that even if the data is split by RHB/LHB, there’s still a really high number of pitches being called strikes that are too far inside according to Gameday. Could this be umpire hubris, miscalibrated cameras, or something I’m not even thinking of? I’m sure many will lean towards the first option, but I’m not sold on that.

The problem with this data is the strike zone. You simply cannot use an approx SZ or an average SZ; it just doesn’t work. SZ sizes can change dramatically from batter to batter. To get an accurate feel of an ump’s impact on a game based on his strike/ball calls you have to really break it down batter-per-batter.

Dave – I adjusted for individual batter height by normalizing the pz values of each pitch to the league average strike zone. MLB fortunately gives us an approximation of each top and bottom zone of each hitter.

So, for example, if MLB gives me a bottom zone of 2.0 and a top zone of 3.8, and the pitch thrown was at a vertical position of 2.5, to adjust that to the league average zone of 1.6 and 3.4, I would use this formula…

(2.5 + (((3.4 – 3.8) + (1.6 – 2.0))/2))

… so my adjusted vertical position would be 2.1. Unfortunately, there isn’t much way to tell the “right” horizontal position, besides splitting it up by handedness.
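As a rough sketch, the formula above could be wrapped up like this. The function and argument names are just illustrative, with `sz_bot` and `sz_top` standing in for the per-batter bottom and top zone values MLB provides.

```python
# League-average zone limits used in the article, in feet.
LEAGUE_BOTTOM_FT = 1.6
LEAGUE_TOP_FT = 3.4


def normalize_pz(pz, sz_bot, sz_top):
    """Shift a pitch height onto the league-average zone.

    Averages the top and bottom offsets between the league zone and the
    batter's own zone, then adds that shift to the pitch height.
    """
    shift = ((LEAGUE_TOP_FT - sz_top) + (LEAGUE_BOTTOM_FT - sz_bot)) / 2.0
    return pz + shift


# Worked example from the comment above: 2.5 with a 2.0 to 3.8 zone.
adjusted = normalize_pz(2.5, 2.0, 3.8)
print(round(adjusted, 2))
```

Note that this slides the pitch up or down by the difference in zone midpoints; it does not stretch or shrink anything for batters whose zones are taller or shorter than average.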

His FIP might be just under 6, but his (StatCorner) tRA while with Cincy was 4.37. FIP ignores a whole lot of stuff, and given that it’s nowhere near any of his previous years, I’ll go with that as the outlier. His FIP and FanGraphs tRA are all over the place year-to-year, while his StatCorner tRA is much more consistent, and therefore (I would posit) more indicative of his true talent level.

JR- I just checked Statcorner. He really has done a good job of avoiding Line Drives!

For what it’s worth, FanGraphs has his tRA with the Reds at 5.36 this year, using different batted ball classifications (one of the reasons that tRA remains somewhat unreliable in small sample sizes).

At any rate, he is 40 years old, and ZIPS projects a 4.50 FIP going forward, which is the definition of replacement level for a reliever. He may not be one of the worst pitchers in baseball, but he’s not good either.

Anyways, that line was a complete throwaway line made to compare him to Javier Vazquez and Mariano Rivera. I didn’t know David Weathers had so many fans at THT.

Nice work.
One comment though: you say they are especially bad on the third base side. It may make sense to distinguish between LHP and RHP and LHB and RHB. In other words, are they always bad on the 3rd base side or are they bad when a pitcher is pitching inside?

Recent correlation studies show tRA and FIP to have approximately the same amount of error in determining pitching performance. Given sample size issues on batted ball data, I’m much more willing to trust FIP in these instances, which has shown David Weathers to be far, far worse.

Additionally, tRA shows Weathers to be essentially a replacement-level pitcher for every year of his career sans 2003. In any case, I don’t think this is a snub of Blyleven-level proportions.

Due to the batted ball data tRA relies upon, I’m inclined to believe it’s less reliable in a small sample size than FIP is. Strikeouts, walks, and home runs all stabilize much more quickly than LD%, FB%, etc.

Re John in the first comment, there are a nontrivial number of stringer errors in the data where the pitch type (Ball, Called Strike, In Play, etc.) is not matched with the correct set of PITCHf/x data, particularly if you download the data the night after the game was played.

Good points about the small sample size issue re tRA. However, I was under the impression that tRA* was supposed to regress all the random stuff that pitchers, as a whole, have no control over to the league average. Weathers’s tRA* of 4.27 is actually a shade better than his tRA which suggests that it’s not necessarily fluky. His xFIP is also significantly better than his straight FIP, suggesting the same.

If you want a contrary example of an umpire just missing a call, look at the 1-0 pitch to Orlando Cabrera in the first inning of the July 30, 2009 game. It’s right down the middle, and the umpire calls it a ball. Oddly, neither Lester nor the catcher object at all. Cabrera offers bunt and then pulls the bat back; perhaps that confused the umpire.

Another example of a bad miss by an umpire is the 1-1 pitch to Hunter Pence in the top of the first inning May 14, 2009. It’s right down the middle but called a ball, but again nobody appears to argue, and again Pence made a check swing, which perhaps confused the ump?

I know it’s a bit of a nightmare, but don’t we (did you?) have to make some sort of correction for curves bending ‘around’ the plate? I think I remember a Josh Kalk article on THT saying that could account for up to two inches of seemingly bad calls from umps.

That makes sense Jonothan. I didn’t make a correction for curves, I just adjusted for batter height and threw up a rough approximation of a strike zone.

Right now I’m working on gauging the “correct” strike zone. Splitting it up by pitch type and batter hand would certainly help. Then I could make corrections on each pitch from there. And then there is the bad data that Mike mentioned. Ugh… this is gonna take a while.