Monday, April 07, 2008

The Hamermesh umpire/race study revisited -- part I

Back in August, a study came out finding that umpires appear to have racial preferences when calling balls and strikes. It found that when the umpire was of the same race (white, black, asian or hispanic) as the pitcher, he was more likely to call a strike.

Since then, a few things have happened. First, a new version of the study came out. Second, I've been thinking about things a bit more. And, third, I'll be presenting some comments on the study at the SABR convention in Cleveland, and this is a good chance to get some feedback in advance. So here we go.

Hamermesh et al looked at all major-league called (not swung on) pitches from the 2004-2006 seasons. They found what percentage of those were called strikes, and what percentage were called balls. They then collated the results according to the race of the umpire and the race of the pitcher. Here is that data. I'm going to call it the "Table 2 matrix," since it's taken from Table 2 of the paper.

(I've left out Asian pitchers, as did the study itself. There are no Asian umpires.)

There are a few important things to note about the data.

First, all umpires call more strikes for white pitchers than hispanic pitchers, and also for hispanic pitchers over black pitchers. It seems safe to assume, then, that the white pitchers as a group are better than the hispanics, who are in turn better than the blacks.

Second, it's not obvious from above, but white umpires called more strikes than hispanic umpires, and hispanic umpires called slightly more strikes than black umpires. (It's just a coincidence that the umpires happen to land in the same race ranking as the pitchers.)

Third, just from glancing at the table, there does seem to be a slight tendency for umpires to call more strikes for pitchers of their own race. (When the umpire matches the pitcher in race, the study calls this a "UPM," for "umpire-pitcher match".) White pitchers get the most strike calls from white umps; hispanic pitchers get the most strikes called from hispanic umpires, and black pitchers get *almost* the most strikes called from black umpires.

Fourth, there are a LOT more white pitchers and umpires than any other race. 87 percent of umpires are white, and 71 percent of pitchers. In fact, there are only 3 hispanic and 5 black umpires out of 93 total. So (for instance) a black umpire calling a black pitcher is rare; there were only 1,765 such called pitches in the three years of the study. On the other extreme, there were 741,729 white-on-white pitches. Full data (in hundreds of pitches):

There are 877 times as many pitches in the white/white cell as in the black pitcher/Hispanic umpire cell.

----

Okay, so: how can we tell if there's discrimination going on here? The study's authors used a regression, which I'll get to. But, for now, let's try to figure this intuitively (in a rehash of what I did back in my August posts). What would the table look like if there were no racial bias?

Perhaps all the umpires would see all pitches the same. That would look something like this (call this matrix 1):

Obviously, there are no racial preferences showing up here. But this is an extreme case – there's no need to make the assumption that all the umpires are so perfectly identical. After all, every ump has his own strike zone, and, even if there's no racial component to strike-zone judgement, the races would be different just by random chance.

So an unbiased set of umpires might instead look like this (call this matrix 2):

Pitcher ------ White Hspnc Black
--------------------------------
White Umpire-- 32.06 31.47 30.61
Hspnc Umpire-- 31.91 31.32 30.46
Black Umpire-- 31.93 31.34 30.48

Now, every umpire calls 0.59 percentage points more strikes for white pitchers than for hispanic, and 0.86 percentage points more for hispanic than for black. White umpires have a bigger strike zone than minority umps, but they're consistent with the other umpires in how they treat the races.

So how does the Table 2 matrix vary from this "ideal" one? By this much:

Here, too, it looks like there might be a bit of bias: all the positives (actual higher than expected) are for UPMs [umpire/pitcher racial matches], and the negatives are from non-UPMs.

But the differences aren't that big. Let me convert them to actual pitches. Start with the black/black cell. In that cell, there were 1,765 pitches. 0.28 percent of that is 5 strike calls too many, or +5.
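The conversion from a gap in strike rate to a number of pitches is a single multiplication. Here is a minimal sketch in Python, using only the black/black figures quoted above. The function name is mine, not the study's.

```python
def excess_strikes(called_pitches, gap_pct_points):
    """Convert a gap in strike rate (in percentage points) into a count of pitches."""
    return called_pitches * gap_pct_points / 100.0


# Black umpire / black pitcher cell: 1,765 called pitches,
# strike rate 0.28 points above the "unbiased" matrix.
black_black = excess_strikes(1765, 0.28)
print(f"{black_black:.2f} extra strikes, or about {round(black_black)}")
```

The same function applies to every other cell, given that cell's pitch count and its difference from the unbiased matrix.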

So, over three seasons, and over a million pitches, it turns out that changing only 108 total pitches would lead to a completely unbiased result. Intuitively, it does seem that there's little evidence of serious racism here.

Looked at another way: the same-race umpires called 40 more strikes than expected. The mixed-race umpires called 62 fewer strikes than expected. The difference: 102 pitches.

Now, there is no real reason that we had to choose Matrix 2 as our example of what unbiased umpires would look like; we could have chosen Matrix 1 instead. Or, we could have chosen any other matrix where none of the umpires show racial bias. Basically, if you start with unbiased Matrix 1, pick any number, and add it to each of the entries in any row, you get another unbiased matrix. Or if you add the same number to each of the entries in any *column*, you get an unbiased matrix.
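One way to make "unbiased" concrete: a matrix shows no race interaction if every cell is just an umpire effect plus a pitcher effect. The sketch below, in Python, checks that for Matrix 2 and shows that shifting a whole row or column keeps the matrix unbiased, while shifting a single cell does not. The helper names are mine, and the 0.28 bump is only an illustration.

```python
MATRIX_2 = [
    [32.06, 31.47, 30.61],  # white umpire
    [31.91, 31.32, 30.46],  # hispanic umpire
    [31.93, 31.34, 30.48],  # black umpire
]  # columns: white, hispanic, black pitcher


def is_unbiased(matrix, tol=1e-9):
    """True if every cell equals a row effect plus a column effect.

    Equivalent test: each cell's "double difference" against the first
    row and first column is zero.
    """
    corner = matrix[0][0]
    for row in matrix:
        for j, cell in enumerate(row):
            if abs(cell - row[0] - matrix[0][j] + corner) > tol:
                return False
    return True


def shift(matrix, amount, row=None, col=None):
    """Return a copy with `amount` added to one row, one column, or one cell."""
    return [
        [
            cell + amount
            if (row is None or r == row) and (col is None or c == col)
            else cell
            for c, cell in enumerate(values)
        ]
        for r, values in enumerate(matrix)
    ]


print(is_unbiased(MATRIX_2))                            # Matrix 2 itself
print(is_unbiased(shift(MATRIX_2, 0.5, row=1)))         # whole row moved
print(is_unbiased(shift(MATRIX_2, -0.3, col=2)))        # whole column moved
print(is_unbiased(shift(MATRIX_2, 0.28, row=2, col=2)))  # one cell moved
```

Only the last case, where one diagonal cell moves on its own, fails the check. That kind of departure is what a same-race effect would look like.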

My argument is that if ANY of those matrices are sufficiently similar to real life, you have to argue that the data doesn't show any evidence of bias. And that's what I'd argue for Matrix 2. The real life data in Table 2 was only 108 pitches away from matching the unbiased one, and none of the individual discrepancies was statistically significant. And if we were to test the difference between the same-race and different-race numbers – which came out to 102 pitches – we'd also find that it's not statistically significant.

----

OK, that was my quick, intuitive analysis, using a specific "unbiased" matrix, selected out of an infinity of possibilities. But we can go more formal, and use linear regression to investigate the issue.

Here's what I did. Using the pitch data from the authors' Table 2 (but divided by 100 because my 20-year-old DOS-based software can't handle a million rows), I regressed each pitch using indicator variables for umpire and pitcher race. What that translates to, in plain English, is basically that I asked for a matrix where the umpires are unbiased.

(Technical note: after dividing by 100, I added 2 called balls to the Hispanic Ump/Black Pitcher cell, and one called ball to the Black/Black cell. This makes the percentages closer to the original ones, fixing rounding errors caused by dividing by 100.)

But if there is an infinity of such matrices, which one will the regression choose? Well, the nature of the regression is that it will insist that:

(a) the total number (or average proportion) of strikes has to be the same in the unbiased matrix as in the original – that is, no adding or subtracting strikes, just redistributing;

(b) no adding or subtracting strikes in any individual row or column of the matrix, either; and

(c) subject to the two restrictions above, choose the unbiased matrix that is the closest to the original, based on sums of squares.
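Those three properties can be seen in a small sketch. This is not the DOS regression package used here; it is a plain-Python stand-in that fits "umpire effect plus pitcher effect" by weighted least squares, using simple back-and-forth averaging (backfitting). The strike rates are Matrix 2 with a made-up bump in one cell, and the pitch counts are invented for illustration. Neither is the Table 2 data.

```python
def additive_fit(rates, weights, sweeps=500):
    """Weighted least-squares fit of rates[i][j] ~ row[i] + col[j].

    Alternately re-estimates column and row effects as weighted averages
    of what the other set of effects leaves unexplained.
    """
    n_rows, n_cols = len(rates), len(rates[0])
    row = [0.0] * n_rows
    col = [0.0] * n_cols
    for _ in range(sweeps):
        for j in range(n_cols):
            total = sum(weights[i][j] for i in range(n_rows))
            col[j] = sum(weights[i][j] * (rates[i][j] - row[i])
                         for i in range(n_rows)) / total
        for i in range(n_rows):
            total = sum(weights[i])
            row[i] = sum(weights[i][j] * (rates[i][j] - col[j])
                         for j in range(n_cols)) / total
    return [[row[i] + col[j] for j in range(n_cols)] for i in range(n_rows)]


# Strike rates in percent: Matrix 2, plus a HYPOTHETICAL +0.30 in the
# black umpire / black pitcher cell so the fit has something to smooth out.
observed = [
    [32.06, 31.47, 30.61],  # white umpire
    [31.91, 31.32, 30.46],  # hispanic umpire
    [31.93, 31.34, 30.78],  # black umpire
]
# HYPOTHETICAL pitch counts (hundreds). These are NOT the Table 2 counts.
pitches = [
    [700, 200, 30],
    [20, 8, 1],
    [30, 10, 2],
]

fitted = additive_fit(observed, pitches)


def strikes(matrix, i, j):
    """Strikes (in hundreds of pitches times percent) implied by one cell."""
    return pitches[i][j] * matrix[i][j]


for i, label in enumerate(["white ump", "hispanic ump", "black ump"]):
    before = sum(strikes(observed, i, j) for j in range(3))
    after = sum(strikes(fitted, i, j) for j in range(3))
    print(f"{label}: observed {before:.4f}, fitted {after:.4f}")
```

Each row total (and, likewise, each column total and the grand total) is unchanged by the fit; the strikes are only moved between cells.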

Overall, the same-race pairs resulted in 37 more strikes than expected, and the different-race pairs 37 fewer strikes. That's a difference of 74 strikes, which doesn't seem like a lot. I can't actually give you a p-value, because I had to use only 1/100 of the full sample (if anyone knows of any good, free regression software for Windows, let me know), but the difference is definitely not significant at the 95% level.

Another thing to notice is that, even though there does appear to be a 37 strike bias on the diagonal, all of it appears to be in the H/H and B/B cells. When a white umpire calls a white pitcher, he actually calls FEWER strikes than expected! This is kind of the opposite of what you might expect if the effect was really caused by racial preferences; racism, historically, has been inflicted by whites on minorities, but, here, white umpires actually appear to be showing no bias at all! (I'll return to this point in a future post.)

Again, this exercise is something I did for myself. It's not quite what the Hamermesh study did, yet.

----

Before we do get to the study itself, let me do one more regression, but add an indicator variable for "same race" – or, as the study puts it, UPM. That is, instead of just fitting the umpire and pitcher tendencies, let's also add a variable for whether the umpire matches the pitcher, and see how much that improves the fit of the model. In English, what this regression is saying is this:

1. Start with a matrix where all the cells are exactly the same.
2. Adjust all the rows by various amounts (one amount per row) to reflect that different (races of) umpires have different (collective) strike zones.
3. Adjust all the columns by various amounts (one amount per column) to reflect that different (races of) pitchers have different abilities to throw strikes.
4. Adjust the same-race diagonal, each cell by the same amount, to reflect any racial bias among umpires for their same race (UPM bias).
5. In doing all these adjustments, don't add or subtract any total strikes – just readjust the strikes that are already there.
6. Choose all the adjustments to come as close as possible to the original matrix, as measured by sum of squares.

It isn't important because what we really care about is the value of the UPM variable. Is it zero, which means no race bias? Is it positive, which means (favorable) same-race bias? Is it negative, which means unfavorable same-race bias? And how big is it?

30.5466%
--- plus .1799% if the ump is white
--- minus .0409% if the ump is hispanic
--- plus 1.221% if the pitcher is white
--- plus .7469% if the pitcher is hispanic
--- plus .1169% UPM [if the umpire matches the pitcher]

It turns out that, even with over a million pitches, the UPM coefficient of 0.1169 is not statistically significantly different from zero. We can't conclude any racial bias.

But how baseball-ly significant is it? How important is a UPM coefficient of .1169? And what does it mean?

At first glance, it might seem that 0.1169% of pitches are affected. But that's not what it means.

The 0.1169 means that, after adjusting for the overall tendencies of the umpire and pitcher, a same-race pairing will produce 0.1169 percentage points more strikes, as compared to a different-race pairing. Put into the language of the 3x3 matrix, the regression tells us that the three cells on the diagonal average 0.1169 percentage points higher than the average of the other six cells. (If you want to verify, just do the calculation on the "expected" matrix a few paragraphs up.)
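The verification can be done directly from the regression coefficients. The Python sketch below rebuilds the fitted 3x3 matrix and compares the diagonal average with the off-diagonal average. It assumes that black umpires and black pitchers are the baseline categories (adjustment of zero), since the equation lists adjustments only for white and hispanic. The function and variable names are mine.

```python
BASE = 30.5466  # percent strikes before any adjustment
UMPIRE = {"white": 0.1799, "hispanic": -0.0409, "black": 0.0}  # black = assumed baseline
PITCHER = {"white": 1.221, "hispanic": 0.7469, "black": 0.0}   # black = assumed baseline
UPM = 0.1169  # added when umpire and pitcher are the same race

RACES = ["white", "hispanic", "black"]


def fitted_rate(umpire, pitcher):
    """Fitted strike percentage for one umpire-race / pitcher-race pairing."""
    rate = BASE + UMPIRE[umpire] + PITCHER[pitcher]
    if umpire == pitcher:
        rate += UPM
    return rate


# Print the fitted matrix: rows are umpires, columns are pitchers.
for ump in RACES:
    cells = "  ".join(f"{fitted_rate(ump, pit):7.4f}" for pit in RACES)
    print(f"{ump:>8} umpire  {cells}")

diagonal = [fitted_rate(r, r) for r in RACES]
off_diagonal = [fitted_rate(u, p) for u in RACES for p in RACES if u != p]
gap = sum(diagonal) / len(diagonal) - sum(off_diagonal) / len(off_diagonal)
print(f"diagonal minus off-diagonal average: {gap:.4f}")
```

The gap comes out equal to the UPM coefficient because each umpire effect and each pitcher effect appears once on the diagonal and twice off it, so they cancel when the two averages are subtracted.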

Also, the 0.1169 is a *relative* value. It's the *difference* between the same-race case and the different-race case. We don't know if the 0.1169 comes from "too many" strikes called in the same-race cases, "too few" strikes called in the different-race cases, or (most likely), a combination of both.

Suppose the entire effect is extra strikes by the same-race pairs. There are 750,817 pitches on the same-race diagonal. Multiplying that by 0.001169 gives 878 pitches.

On the other hand, suppose the entire effect is caused by too few strikes (too many balls) called in the different-race pairs. There are 348,189 pitches in those six mixed-race cells. Multiplying that by 0.001169 gives 407 pitches.

If the effect is mixed between extra strikes and extra balls, the number of affected pitches will be somewhere between 407 and 878.

Since there were 1,099,006 total called pitches, the effect is somewhere between one pitch in 1,252 and one pitch in 2,700.

At 70 called pitches per game, that's somewhere between one race-related pitch every 18 games, and one every 39 games. Not very baseball significant at all.
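For anyone who wants to retrace the arithmetic of the last few paragraphs, here is a short Python sketch. All the inputs are the figures quoted above; only the variable names are mine.

```python
UPM_POINTS = 0.1169        # same-race effect, in percentage points
SAME_RACE = 750_817        # called pitches on the same-race diagonal
MIXED_RACE = 348_189       # called pitches in the six mixed-race cells
TOTAL = SAME_RACE + MIXED_RACE
CALLED_PER_GAME = 70       # called pitches per game, as assumed in the text


def affected(pitches):
    """Pitches changed if the whole effect falls on this group of pitches."""
    return pitches * UPM_POINTS / 100.0


scenarios = {
    "all extra strikes for same-race pairs": affected(SAME_RACE),
    "all extra balls for mixed-race pairs": affected(MIXED_RACE),
}

print(f"total called pitches: {TOTAL:,}")
for label, count in scenarios.items():
    one_in = TOTAL / count
    games = one_in / CALLED_PER_GAME
    print(f"{label}: {count:.0f} pitches, "
          f"one in {one_in:.0f}, one every {games:.0f} games")
```

The two scenarios bracket the answer; any mix of extra strikes and extra balls lands between them.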

----

So far, and based on only the raw data in the paper's Table 2, we've determined that:

-- the data looks pretty close to non-race-related;

-- if anything, any race bias appears to be in the hispanic/hispanic case and the black/black case, with the white/white case almost perfectly unbiased;

-- adding the UPM variable gives a coefficient that is slightly positive for same-race bias, but

-- the coefficient is not statistically significant (I haven't shown that yet, but trust me for now), and

-- the coefficient is not very significant in the baseball sense, either.

So far, this is all preliminary – I'm trying to lay some groundwork before we look at the actual Hamermesh regressions. Nothing I've written here, so far, is a direct comment on anything in the actual paper. That will come in the next post.
