Bayesian Umpires: The coolest sports-statistics idea since the hot hand!

Baseball fans have long known, or at least suspected, that umpires call balls and strikes differently as the count changes. At 0-2, it seems that almost any taken pitch that is not right down the middle will be called a ball, while at 3-0 it feels like pitchers invariably get the benefit of the doubt. One of the earliest discoveries made possible by PITCHf/x data was the validation of this perception: Researchers confirmed that the effective size of the strike zone at 0-2 is only about two-thirds as large as in a 3-0 count.

One common explanation offered for this pattern is that umpires don’t want to decide the outcome of a plate appearance. Preferring to let the players play, this argument goes, umpires will only call “strike three” or “ball four” if there is no ambiguity about the call.

But Molyneux has another theory that he claims better fits the data. The theory is that umpires are using Bayesian reasoning. I love it!

The argument goes as follows:

– It’s 3-0. You know and I know and the umpire knows that the pitcher’s probably going to throw a strike. Now a pitch comes in that’s a close call. The “base rate” (in Tversky and Kahneman terms) is that most of those 3-0 pitches are strikes. The Bayesian thing to do is to multiply the likelihood ratio by the ratio of base rates, and the result is that, as the umpire, you should call these close ones as strikes.

– Or, it’s 0-2. We know the pitcher is likely to throw this one away. The base rate is that the pitch is likely to be a ball. Now you take the exact same close call as before—the same “likelihood ratio,” in statistics jargon—but now you multiply it by this new ratio of base rates, and it’s rational to call this one a ball.
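The arithmetic behind those two bullets is just Bayes’ rule in odds form. Here’s a minimal sketch; the base rates (70% strikes at 3-0, 30% at 0-2) and the likelihood ratio of 1 for a perfectly ambiguous pitch are made-up illustrative numbers, not Molyneux’s estimates:

```python
# Sketch of the Bayesian-umpire calculation; all numbers are illustrative.
def posterior_odds_strike(prior_strike, likelihood_ratio):
    """Posterior odds that the pitch is a strike: the likelihood ratio
    (evidence for strike vs. ball from what the umpire saw) times the
    prior odds implied by the count's base rate."""
    prior_odds = prior_strike / (1 - prior_strike)
    return likelihood_ratio * prior_odds

# The same genuinely ambiguous pitch in both counts: likelihood ratio = 1.
lr = 1.0

# Hypothetical base rates: pitchers pound the zone at 3-0, waste one at 0-2.
odds_3_0 = posterior_odds_strike(0.70, lr)  # ~2.33 > 1: call it a strike
odds_0_2 = posterior_odds_strike(0.30, lr)  # ~0.43 < 1: call it a ball
```

Identical evidence, opposite calls: only the prior odds changed.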

It’s statistical discrimination, baseball-style. Rational in each individual case, with a predictable bias in the aggregate.

This is really a beautiful argument by Molyneux. I’ve never seen it before, but it makes so much sense. And that’s why I say it’s the coolest sports-statistics idea since Miller and Sanjurjo’s work on the hot hand.

Comments

Interesting. But shouldn’t the “base rate” depend on how close the earlier pitches were and how well the batter swung? Wouldn’t the Bayesian argument for the 3-0 case be stronger if the earlier pitches were closer to the strike zone? Otherwise, a drunken pitcher would be assumed to be as good as Nolan Ryan. Similarly, at 0-2 wouldn’t the pitcher be more likely to try to throw one away if the batter had fouled off the other two pitches? Otherwise, why not just bring the heat if you know the batter will whiff anyway?

Opposing players argued that umpires tended to give batter Ted Williams, with his famous eye for the strike zone, a break. Supposedly when one catcher complained to the umpire about a ball call, he was told, “When your pitcher throws a strike, Mr. Williams will let you know by swinging at it.”

But that sounds more like something Jeeves would say than an umpire would say, so I don’t know if it’s true.

There is evidence that better, more established pitchers do, in fact, get the benefit of the doubt on calls. Papers below each find evidence of this. As Steve notes below, there are also apparent effects of better/longer tenured batters getting the benefit of the doubt, too.

No mention of the batter’s stance on that page. Without accounting for the batter changing their stance depending on the count, I don’t think this means much. It seems like this would be in that PITCHf/x data, or else I don’t see how it can accurately determine the strike zone.

Batters generally don’t change their stances within at bats or even from one at bat to the next. PITCHf/x includes an estimate of the height of the top and bottom of the strike zone based on the batter’s stance, but it often only estimates this once per batter-game. You can also see that the size of the strike zone changes both vertically and horizontally.

This surprises me. Interestingly, it looks like the strike zone used to be determined by the batter’s “usual stance”, but not anymore:

1988 – “The Strike Zone is that area over home plate the upper limit of which is a horizontal line at the midpoint between the top of the shoulders and the top of the uniform pants, and the lower level is a line at the top of the knees. The Strike Zone shall be determined from the batter’s stance as the batter is prepared to swing at a pitched ball.”

1969 – “The Strike Zone is that space over home plate which is between the batter’s armpits and the top of his knees when he assumes a natural stance. The umpire shall determine the Strike Zone according to the batter’s usual stance when he swings at a pitch.”

It’s interesting to me how much the strike zone definition changed over the years, and how that invalidates naive comparisons across different eras of baseball.

But, I’m not much of a pro sports person (I’d much rather play amateur sports than watch professional), and especially not a baseball person (soccer yes!, I’d even rather watch golf than baseball), so I guess maybe this is all well known to people who follow baseball.

I provided the data for Guy on the project, and it uses a fixed zone I define in some of my previous work. I’ve also done some analysis using a *fixed* zone and a height-adjusted zone. We get pretty similar results.

Unfortunately, the PITCHf/x data on the top and bottom of the zone (drawn in by stringers/scorers for each at bat) is not very reliable and is quite noisy. It’s been getting better, but then we mistake improvements in stringer adjustments over time for improvements in umpire calls, so I tend to prefer the fixed version (or the height-adjusted fixed zone, which doesn’t account for stance).

Stance matters, it’s just not very easy to integrate reliably into the data.

My biggest concern with the argument is that the analysis considers only called pitches. That means missed swings and batted balls are being treated as missing at random given the count. Molyneux explicitly notes changes in hitters’ strategy as the count changes. Shouldn’t at least some of the observed effects be attributed to batters swinging at more/fewer of the borderline pitches that would have been more likely than average to be called inaccurately?

Good point, I think there’s data on the locations of pitches that were swung at also, so they should investigate this. But wouldn’t protecting the plate with 2 strikes (which is what batters do) lead to fewer borderline pitches?

Another possibility is that pitchers might tend to throw pitches with more movement on 0-2 counts, and those might tend to be called balls incorrectly more often because they’re out of the strike zone by the time the catcher catches them.

David: You raise a good point. The proportion of strikes among called pitches in a given count will not be the same as the proportion for all pitches thrown in that count. At 0-2, for example, the hitter will swing at anything close to the strike zone, so while called pitches are almost all located outside the strike zone, the distribution for *all* 0-2 pitches is probably not quite as lopsided. My assumption was that umpires’ inclination to guess ball or strike on close pitches would be influenced by the distribution of pitches they actually have to call. But I suppose it could be influenced instead by the likelihood that the next pitch will be a strike, regardless of whether the batter swings. My guess is that including pitches the hitter swung at would not change the story very much, but it would be interesting to look at.
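The selection effect at issue can be put in numbers. In this toy sketch, all three probabilities are invented for illustration; the point is only that conditioning on “pitch was taken” can push the base rate among called pitches far away from the base rate among pitches thrown:

```python
# Toy illustration of the selection effect: swing probability depends on
# location, so the strike rate among *called* (taken) pitches differs from
# the rate among all pitches thrown. All numbers are made up.
p_strike_thrown = 0.5     # share of 0-2 pitches actually in the zone
p_swing_if_strike = 0.8   # hitter protects the plate on in-zone pitches
p_swing_if_ball = 0.3     # hitter chases some out-of-zone pitches too

p_taken_strike = p_strike_thrown * (1 - p_swing_if_strike)        # 0.10
p_taken_ball = (1 - p_strike_thrown) * (1 - p_swing_if_ball)      # 0.35

base_rate_called = p_taken_strike / (p_taken_strike + p_taken_ball)
# ~0.22: far below the 0.5 strike rate among pitches thrown
```

So an umpire calibrated to the pitches he actually has to call would use a quite different base rate than one calibrated to all pitches thrown in that count.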

I have to study the articles more closely, but given what I know about pitch distributions and baseball, I would say there is an unnoticed confounder that is confusing the whole story.

On 3-0 counts, taken pitches that are in the strike zone are much more likely to be in the center of the strike zone, and on 0-2 counts, taken pitches that are in the strike zone are much more likely to be on the edge of the strike zone. Thus the accuracy is going to be nearly 100% for 3-0 pitches and much closer to 50% for 0-2 pitches. You could even imagine a Simpson’s paradox whereby the umpires are actually much more likely to call a strike on an 0-2 count conditional on location, but when aggregated, the probability of a strike call is much smaller than on a 3-0 count. I made a scatter plot (https://dl.dropboxusercontent.com/u/46739455/count_locations.png) of the data which pretty much proves my first point: Taken pitches on a 3-0 count (right frame) cluster around the center of the strike zone, so they are easy calls to make. On 0-2 counts the taken pitches are arranged in a donut around the strike zone, so conditional on being in the zone they are really hard to get right since they sit on the boundary.
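A small simulation makes the aggregation effect concrete. Every number here is invented: I assume down-the-middle pitches are called strikes 99% of the time in either count, that edge pitches are actually called strikes slightly *more* often at 0-2 than at 3-0 (55% vs. 50%), and that the counts differ only in what fraction of taken in-zone pitches sit on the edge:

```python
# Simpson's-paradox sketch: per-location strike probability favors 0-2,
# yet the aggregate in-zone strike rate favors 3-0. Numbers are invented.
import random

random.seed(1)

def called_strike(on_edge, count):
    if not on_edge:
        return random.random() < 0.99  # middle pitches: near-automatic strikes
    p = 0.55 if count == "0-2" else 0.50  # edge pitches: 0-2 slightly HIGHER
    return random.random() < p

def strike_rate(count, frac_edge, n=100_000):
    # frac_edge: share of taken in-zone pitches located on the zone's edge
    calls = [called_strike(random.random() < frac_edge, count) for _ in range(n)]
    return sum(calls) / n

# 3-0 taken in-zone pitches cluster in the middle; 0-2 ones form a donut.
r_3_0 = strike_rate("3-0", frac_edge=0.2)  # expected ~0.89
r_0_2 = strike_rate("0-2", frac_edge=0.9)  # expected ~0.59
```

Aggregated, `r_0_2` comes out well below `r_3_0` even though the umpire is more strike-friendly at 0-2 at every location, which is exactly the confounding pattern described above.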

Ultimately, while I think you are right to be skeptical of the aggregate “in zone” and “out of zone” numbers for the reasons you mention, we luckily have the location of the pitches and can evaluate strike probability conditional on more granular location measurements.

I think this is only an issue if the location measurement data are noisy (and they are).

There’s been some work looking at the variability in the measurement of pitch location, finding errors of roughly ±0.5 inches around the reported spot.

So, there is *some* selection bias at the edges of the plate related to this (a pitch measured as in the zone when it was actually outside it, the batter doesn’t swing, and the umpire calls it a strike). Given the size of the differences between 0-2 and 3-0, this could likely account for only some of the effects we see in the data. But it shouldn’t be completely ignored.