Monday, January 26, 2015

Are umpires biased in favor of star pitchers? Part II

Last post, I talked about the study (.pdf) that found umpires grant more favorable calls to All-Stars because the umps unconsciously defer to their "high status." I suggested alternative explanations that seemed more plausible than "status bias."

Here are a few more possibilities, based on the actual coefficient estimates from the regression itself.

(For this post, I'll mostly be talking about the "balls mistakenly called as strikes" coefficients, the ones in Table 3 of the paper.)

---

1. The coefficient for "right-handed batter" seems way too high: -0.532. That's so big, I wondered whether it was a typo, but apparently it's not.

How big? Well, to suffer as few bad calls as his right-handed teammate, a left-handed batter would have to be facing a pitcher with 11 All-Star appearances.

The likely explanation seems to be: umpires don't call strikes by the PITCHf/x (rulebook) standard, and the differences are bigger for lefty batters than righties. Mike Fast wrote, in 2010,

"Many analysts have shown that the average strike zone called by umpires extends a couple of inches outside the rulebook zone to right-handed hitters and several more inches beyond that to left-handed hitters."

That's consistent with the study's findings in a couple of ways. First, in the other regression, for "strikes mistakenly called as balls", the equivalent coefficient is less than a tenth the size, at -0.047. Which makes sense: if the umpires' strike zone is "too big", it will affect undeserved strikes more than undeserved balls.

Second: the two coefficients go in the same direction. You wouldn't expect that, right? You'd expect that if lefty batters get more undeserved strikes, they'd also get fewer undeserved balls. But this coefficient is negative in both cases. That suggests something external and constant, like the PITCHf/x strike zone overestimating the real one.

And, of course, if the problem is umpires not matching the rulebook, the entire effect could just be that control pitchers are more often hitting the "illicit" part of the zone. Which is plausible, since that's the part that's closest to the real zone.

---

2. The "All-Star" coefficient drops when it's interacted with control. Moreover, it drops further for pitchers with poor control than pitchers with good control.

Perhaps, if there *is* a "status" effect, it's only for the very best pitchers, the ones with the best control. Otherwise, you have to believe that umpires are very sensitive to "status" differences between marginal pitchers' control rates.

For instance, going into the 2009 season, say, J.C. Romero had a career 12.5% BB/PA rate, while Warner Madrigal's was 9.1%. According to the regression model, you'd expect umpires to credit Madrigal with 37% more undeserved strikes than Romero. Are umpires really that well calibrated?

Suppose I'm right, and all the differences in error rates really accrue to only the very best control pitchers. Since the model assumes the effect is linear all the way down the line, the regression will underestimate the best and worst control pitchers, and overestimate the average ones. (That's what happens when you fit a straight line to a curve; you can see an example in the pictures here.)

Since the best control pitchers are underestimated, the regression tries to compensate by jiggling one of the other coefficients, something that correlates with only those pitchers with the very best control. The candidate it settles on: All-Star appearances. Which would explain why the All-Star coefficient is high, and why it's high mostly for pitchers with good control.

---

3. The pitch's location, as you would expect, makes a big difference. The further outside the strike zone, the lower the chance that it will be mistakenly called a strike.

The "decay rate" is huge. A pitch that's 0.1 feet outside the zone (1.2 inches) has only 43 percent of the odds of being called a strike that a pitch right on the border (0 feet) has. A pitch 0.2 feet outside has only 18 percent of the odds (43 percent squared). And so on.*

(* Actually, the authors used a quadratic to estimate the effect -- which makes sense, since you'd expect the decay rate to increase. If the error rate at 0.1 feet is, say, 10 percent, you wouldn't expect the rate for 1 foot to be 1 percent. It would be much closer to zero. But the quadratic term isn't that big, it turns out, so I'll ignore it for simplicity. That just renders this argument more conservative.)

The regression coefficient, per foot outside, was 8.292. The coefficient for a single All-Star appearance was 0.047. So an All-Star appearance is worth 1/176 of a foot -- which is a bit more than 1/15 of an inch.

That's the main regression. For the one with the lower value for All-Star appearances, it's only an eighteenth of an inch.

Isn't it more plausible to think that the good pitchers are deceptive enough to fool the umpire by 1/15 of an inch per pitch, rather than that the umpire is responding to their status? Or, isn't it more likely that the good pitchers are hitting the "extra" parts of the umpires' inflated strike zone, at an increased rate of one inch per 15 balls?

---
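For anyone who wants to check the arithmetic in points 1 and 3, here is a minimal Python sketch. It assumes the coefficients act on log-odds (which is how the "percent of the odds" language reads), it ignores the quadratic distance term just as the text does, and the function and constant names are mine, not the paper's.

```python
import math

# Coefficient magnitudes quoted in this post (Table 3 of the paper).
# Assumption: they are log-odds coefficients; signs are handled explicitly below.
COEF_PER_FOOT_OUTSIDE = 8.292  # drop in log-odds per foot outside the zone
COEF_ALL_STAR = 0.047          # gain in log-odds per All-Star appearance
COEF_RIGHT_HANDED = 0.532      # size of the right-handed-batter coefficient


def odds_ratio_at(distance_ft: float) -> float:
    """Odds of a mistaken strike relative to a pitch right on the border.

    Uses only the linear distance term (the quadratic term is ignored).
    """
    return math.exp(-COEF_PER_FOOT_OUTSIDE * distance_ft)


def all_star_equivalent_inches() -> float:
    """Distance, in inches, that moves the log-odds as much as one All-Star appearance."""
    return COEF_ALL_STAR / COEF_PER_FOOT_OUTSIDE * 12.0


def all_stars_to_offset_handedness() -> float:
    """All-Star appearances needed to cancel the batter-handedness coefficient."""
    return COEF_RIGHT_HANDED / COEF_ALL_STAR


if __name__ == "__main__":
    print("odds ratio at 0.1 ft:", round(odds_ratio_at(0.1), 3))
    print("odds ratio at 0.2 ft:", round(odds_ratio_at(0.2), 3))
    print("one All-Star appearance, in inches:", round(all_star_equivalent_inches(), 4))
    print("All-Star appearances to offset handedness:", round(all_stars_to_offset_handedness(), 1))
```

Under those assumptions, the odds ratios come out at about 0.44 and 0.19, in the same ballpark as the rounded 43 and 18 percent above. One All-Star appearance works out to a bit more than 1/15 of an inch, and the handedness gap to a little over 11 All-Star appearances.

---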

4. The distance from the edge of the strike zone is, I assume, "as the crow flies." So, a high pitch down the middle of the plate is treated as the same distance as a high pitch that's just on the inside edge. But, you'd think that the "down the middle" pitch has a better chance of being mistakenly called a strike than the "almost outside" pitch. And isn't it also plausible that control pitchers will have a different ratio of the two types than those with poor control?

Also, a pitch that's 1 inch high and 1 inch outside registers as the same distance as a pitch over the plate that's 1.4 inches high. Might umpires not be evaluating two-dimensional balls differently than one-dimensional balls?

And, of course: umpires might be calling low balls differently than high balls, and outside pitches differently from inside pitches. If pitchers with poor control throw to the inside part of the plate more than All-Stars (say), and the umpires seldom err on balls inside because of the batter's reaction, that alone could explain the results.

------

All these explanations may strike you as speculative. But, are they really more speculative than the "status bias" explanation? They're all based on exactly the same data, and the study's authors don't provide any additional evidence other than citations that status bias exists.

I'd say that there are several different possibilities, all consistent with the data:

1. Good pitchers get the benefit of umpires' "status bias" in their favor.

2. Good pitchers hit the catcher's glove better, and that's what biases the umpires.

3. Good pitchers have more deceptive movement, and the umpire gets fooled just as the batter does.

4. Different umpires have different strike zones, and good pitchers are better able to exploit the differences.

5. PITCHf/x significantly underestimates what umpires consider to be a strike. Since good pitchers are closer to the strike zone more often, they wind up with more umpire strikes that are PITCHf/x balls. The difference only has to be the equivalent of one-fifteenth of an inch per ball.

6. Umpires are "deliberately" biased. They know that when they're not sure about a pitch, considering the identity of the pitcher gives them a better chance of getting the call right. So that's what they do.

7. All-Star pitchers have a positive coefficient to compensate for real-life non-linearity in the linear regression model.

8. Not all pitches the same distance from the strike zone are the same. Better pitchers might err mostly (say) high or outside, and worse pitchers high *and* outside. If umpires are less likely to be fooled in two dimensions than one, that would explain the results.

------

To my gut, #1, unconscious status bias, is the least plausible of the eight. I'd be willing to bet on any of the remaining seven, that they all are contributing to the results to some extent (possibly negatively). But I'd bet on #5 being the biggest factor, at least if the differences between umpires and the rulebook really *are* as big as reported.

As always, your gut may be more accurate than mine.