Sabermetric Research

Phil Birnbaum

Sunday, June 26, 2011

How can we separate referee bias from other sources of home-field advantage? Part II

You have two competing theories about home field advantage (HFA). Theory one is that HFA is almost all the result of referee bias. Theory two is that referee bias is part of the cause, but only a small part, and that there are other factors involved.

What kind of evidence would distinguish between the two? You need something that's independent of referee or umpire influence. That's harder than it looks. For instance, in the NHL, you'd expect most of the referee bias to come from penalty calls. However, even if you leave out power plays, you can see home teams doing significantly better than road teams. With both teams at full strength, the home side still outscores the visiting side.

Does that prove that something other than officiating is at work? No, it doesn't. Because it's possible that, knowing that the referees are likely to call more infractions against them, road teams are forced to play a less aggressive style of hockey, one that costs them goals even at full strength. So that part of HFA could still be referee bias at work, even if you don't see it in the penalty calls.

In order to more legitimately show that there are other factors at work, you need to find some statistic where there's no plausible story about how the refereeing could be the cause.

For basketball, we have foul shooting. The evidence shows that HFA in foul shooting exists, and is about what you'd expect it to be. Can we do the same for baseball? A couple of weeks ago, I asked for suggestions. Someone e-mailed me to suggest looking at fastball speed. That's a great idea, but I don't know where to find the data (can anyone help?). So, I tried something else: wild pitches.

It's hard to see how a wild pitch (or passed ball) could be the result of umpire bias. It's a straightforward call based on what happens on the field. I suppose that it's possible that if the catcher tries to throw out a baserunner, and the umpire is more likely to call that runner out, that would be a bias for the home team (since a WP is not charged when a runner is thrown out at the next base). But that happens so infrequently that it's not an issue.

Also, it has been found (by John Walsh, in the 2011 Hardball Times Annual) that home plate umpires favor home teams in their pitch calling (by 0.8 pitches per game). And so, you could argue that maybe visiting pitchers have to pitch a little differently to overcome that disadvantage, and that's what causes any increase in wild pitches.

That's possible. However, you'd expect any effect to go the other way, to *fewer* wild pitches, wouldn't you? If the strike zone is smaller for the visiting pitcher, he's more likely to compensate by generally being closer to the strike zone. It's hard to see how that would lead to more pitches in the dirt.

So, all things considered, I think wild pitches are a pretty good candidate. For all MLB games from 2000 to 2009, I figured how many wild pitches were thrown by home teams and road teams. I omitted wild pitches on third strikes, for reasons I'll explain later.

The raw numbers:

Road: 6943
Home: 6829

So far, it looks like a definite HFA exists. But, since a wild pitch can only occur with runners on base, maybe it's just that home teams aren't in that situation as much as visiting teams (since home teams generally pitch better). Also, maybe road teams throw more pitches per batter than home teams, for whatever reason.

So I adjusted for the number of pitches thrown with runners on base (leaving out intentional balls). The results, per 100,000 pitches:

Road: 460 WP per 100,000 pitches
Home: 448 WP per 100,000 pitches

That's a 2.7% difference, which is reasonably large. For passed balls, the difference was even wider:

Road: 97 PB per 100,000 pitches
Home: 91 PB per 100,000 pitches

The difference is 6.6% this time.

In case you're interested, there were 1,506,500 pitches for the visiting team, and 1,522,647 pitches for the home team. The difference is mostly the home team always having to pitch the ninth inning, partially offset by the fact that the home team faces proportionally fewer situations with men on base.
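As a sanity check, the per-100,000 rates follow directly from the raw counts and pitch totals above. Here's a quick sketch in Python (small differences from the rounded figures in the text are just rounding):

```python
# Wild pitch counts and pitch totals with runners on base, from the text above
road_wp, road_pitches = 6943, 1506500
home_wp, home_pitches = 6829, 1522647

road_rate = road_wp / road_pitches * 100_000   # ~461 WP per 100,000 pitches
home_rate = home_wp / home_pitches * 100_000   # ~448 WP per 100,000 pitches

# Relative difference: how much more often road pitchers throw a wild pitch
pct_diff = (road_rate - home_rate) / home_rate * 100

print(round(road_rate, 1), round(home_rate, 1), round(pct_diff, 1))
```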

-------

One other possible objection is that the probability of a wild pitch could vary by count. Maybe "wasting a pitch" on 0-2 leads to more wild pitches than at other counts.

So, I rechecked, but included only pitches at 0-0 counts, which seems like a reasonable control. The results:

Road: 428 WP per 100,000
Home: 401 WP per 100,000

Road: 116 PB per 100,000
Home: 110 PB per 100,000

Very similar. I'll check the statistical significance before I present this at SABR next week, but I'm pretty sure that the chance of this happening by luck is pretty low.
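One way to check (a back-of-the-envelope sketch, not necessarily the test I'll use at SABR) is a two-proportion z-test on the raw wild-pitch counts:

```python
import math

# Wild pitch counts and pitch totals with runners on base, from above
wp_road, n_road = 6943, 1506500
wp_home, n_home = 6829, 1522647

p_road, p_home = wp_road / n_road, wp_home / n_home
p_pool = (wp_road + wp_home) / (n_road + n_home)

# Standard two-proportion z statistic (normal approximation)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_road + 1 / n_home))
z = (p_road - p_home) / se
print(round(z, 2))
```

Passed balls, and the 0-0 subset, could be tested the same way, and the WP and PB evidence combined.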

-------

What does this mean for HFA?

For the 1988 American League (which is all I have handy right now), the linear weight for a wild pitch was about 0.27 runs. For the entire sample, the total home/road difference was 181 events (combining WP and PB). So that's about 49 runs, which is about 5 wins.

5 wins divided by 30 teams, divided by 10 years, divided by 81 home games, equals 0.0002 wins per game. Since the total HFA per game is about .040, that means that wild pitches (+PB) make up one-half of one percent of home field advantage.
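The arithmetic can be written out explicitly. Note that converting 49 runs into roughly 5 wins assumes the usual sabermetric rule of thumb of about 10 runs per win:

```python
linear_weight = 0.27      # runs per WP (1988 AL figure, from the text)
extra_events = 181        # road WP+PB minus home WP+PB, 2000-2009
runs_per_win = 10         # standard rule-of-thumb conversion (assumption)

runs = linear_weight * extra_events          # ~49 runs
wins = runs / runs_per_win                   # ~5 wins
per_game = wins / (30 * 10 * 81)             # teams * years * home games
share_of_hfa = per_game / 0.040              # total HFA is about .040 per game

print(round(runs, 1), round(per_game, 4), round(share_of_hfa, 3))
```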

That seems reasonable to me.

So, I think we have some good evidence that there's HFA in aspects of the game not influenced by the umpire -- namely, wild pitches.

Monday, June 20, 2011

Five percent significance is often too weak

In academic research, why is the standard significance level set to 5 percent?

I don't know, but I'm guessing a consensus emerged that 0.05 is a reasonable threshold beyond which it's OK to assume any effect you found is real. The idea might be that, when searching for an effect that doesn't actually exist, academia figures that one false positive out of 20 is something they can live with -- but no more than that.

The problem is that the "1 out of 20" isn't generally true if you're looking at more than one variable.

Suppose, for instance, that I want to see whether hitters do better on one specific day of the week more than others. So I set up a regression with six variables for days of the week (the seventh is the default case, and doesn't get a variable), and I figure out that batters do the best on Monday, compared to the other days. And, moreover, the effect is statistically significant, at p<0.05.

Should I accept the result? Maybe not. Because, instead of one variable, I had six, and I didn't know in advance what I was looking for. So I had six chances for a false positive, not just one.

Obviously, if I have enough variables, the chances increase that at least one of them will appear statistically significant just by luck. Imagine if I had 100 independent variables in my regression. Then, five of them would show up as significant just by chance. If I stick to the .05 criterion, I should expect five false positives, on average.

In the "day of the week" case, I didn't have one hundred variables -- just six. What are the chances that at least one of six variables would be significant at 5 percent? The answer: about .26. So I thought I was using a .05 significance level, but I was really using a .26 significance level.

If I want to keep a 5 percent probability of having *any* false positive, I can't use the threshold of 0.05 any more. I have to use 0.01. Actually, a little less than that: 0.0085. Intuitively, that's because six variables, multiplied by 0.0085 each, works out to about .05. (That's not the precise way to calculate it, but it's close enough.)

But suppose I don't do that, and I stick with my 0.05. Then, if I have a regression with six meaningless variables, there's still a 26.5 percent chance I'll have something positive to report. If I have ten variables, that goes up to 40 percent! Ten variables is actually not uncommon in academic studies.
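The numbers above follow from the standard family-wise error calculation. Here's a sketch (the per-test threshold is what statisticians call the Šidák correction):

```python
def familywise_rate(per_test_alpha, k):
    """Chance of at least one false positive among k independent tests."""
    return 1 - (1 - per_test_alpha) ** k

def per_test_threshold(family_alpha, k):
    """Per-test threshold that keeps the family-wise rate at family_alpha."""
    return 1 - (1 - family_alpha) ** (1 / k)

print(round(familywise_rate(0.05, 6), 3))     # ~0.265 -- the ".26" above
print(round(familywise_rate(0.05, 10), 2))    # ~0.40 -- the "40 percent"
print(round(per_test_threshold(0.05, 6), 4))  # ~0.0085
```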

-----

And it's actually even higher than that in practice. For one thing, if you do a regression, and you get close to significance for variable X, you can try adding or removing a variable or two, and see if that pushes X over the threshold. You see that you're at 0.06, and you think, "wait, maybe I didn't control for home/road!" And you think, "yeah, I really should have controlled for home/road." And you're right about that, it's a good idea. So you add a new variable for home/road, and, suddenly, the 0.06 moves to 0.049, and you're in. But: if you'd got the 0.049 in the first place, you probably never would have thought of adding home/road. So, really, you're giving yourself two (or more) kicks at the can.

------

As an aside .... there's another thing you can do to try to get a significant result, which, in my opinion, is cheating, a bit. Here's how it works. When you choose which six of the seven days of the week to include as variables, you leave out the one that's most extreme, so every other day gets measured against it. That inflates all your estimates, and your significance levels.

For instance, suppose that, compared to the average, the observed effects of days of the week are something like this (numbers made up): Monday +13, Tuesday +2, Wednesday -1, Thursday -10, Friday 0, Saturday -3, Sunday -1.

Now, leave out Monday, the most extreme day, so every other day is measured relative to Monday instead of relative to the average. Thursday is now -23, which is almost 3 standard deviations below zero -- clearly significant! But that's because of the way you structured your study. Effectively, you arranged it so that you wound up looking at Monday-versus-Thursday, the most extreme of all the possible comparisons. There are 21 such comparisons, so, at a 5 percent chance of finding a false positive for each, you'd expect 1.05 false positives overall, which means a pretty good chance that the most extreme comparison will wind up looking significant.

As I said, I think this is cheating. I think you should normalize all your days of the week to the overall average (rather than to the most extreme day), to avoid this kind of issue.
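A quick Monte Carlo illustrates the point: draw seven day effects that are pure noise, and count how often the most extreme pair differs by a nominally "significant" amount. This is a sketch with a made-up scale (each day estimate is a standard normal, so the difference of two days has standard deviation sqrt(2)):

```python
import math
import random

random.seed(1)

trials = 20_000
threshold = 1.96 * math.sqrt(2)   # nominal 5% cutoff for a two-day difference
hits = 0
for _ in range(trials):
    days = [random.gauss(0, 1) for _ in range(7)]   # pure noise, no real effect
    if max(days) - min(days) > threshold:           # the most extreme comparison
        hits += 1

frac = hits / trials
# Prints a sizable fraction -- far more than the nominal 5 percent
print(frac)
```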

------

Anyway, the point is that the more variables you have in your study, and the more different sets of variables you looked at before publishing, the higher your chances of finding at least one variable with the required significance. So, when a study is published, the weight you give the evidence of significance should be inversely proportional to the number of variables looked at.

It should also depend on how many rejiggerings the author did before finally settling on the regression that got published. As for that, there's no way to tell from just the published study.

Again, if I wanted to cheat, I could try this. I run a huge regression with 200 different nonsense variables. I take the one that came out the most significant -- it'll probably be less than 0.01 -- and run a regression on that one alone. It'll probably still come in as significant, even without the other 199 variables. (If it doesn't, I'll just take the next most significant and try that one instead.)
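Here's a toy simulation of that kind of cheat (not from any real study -- a sketch): generate 200 pure-noise "variables," test each one for a nonzero mean, and see how many clear the 0.05 bar anyway.

```python
import math
import random

random.seed(2)

n_vars, n_obs = 200, 400
significant = []
for i in range(n_vars):
    xs = [random.gauss(0, 1) for _ in range(n_obs)]   # pure noise
    mean = sum(xs) / n_obs
    z = mean * math.sqrt(n_obs)                        # z-test for mean = 0
    if abs(z) > 1.96:                                  # nominal p < 0.05
        significant.append((i, round(z, 2)))

# With 200 meaningless variables, about ten clear the bar by luck alone
print(len(significant))
```

Pick the most extreme of those, write it up alone, and it still looks "significant" -- because it's the same data.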

Then, I write a paper, or a blog post, suggesting that that particular variable should be highly predictive, based on some theory I make up on the spot. I might say something like this:

"Thursdays are getaway days, which means the team is in a bit of a hurry. Psychological theory suggests that human cognitive skills are reduced with perceived time pressure from work authorities. That means batters should concentrate less. Therefore, you'd expect batting to be worse on Thursdays."

I'll also add a paragraph about why that doesn't apply to pitchers, and why batters should hit well on Mondays.

A month or two later, I publish a working paper that looks only at Thursdays and Mondays, and finds the exact effect I predicted!

I hasten to add that I don't think this kind of conscious cheating actually goes on. I'm just trying to make the point that an 0.05 is not an 0.05 is not an 0.05. The weight you give to that 0.05 has to take into account as much of the context as you can figure out.

------

And so, I'd like to see authors use a more stringent threshold for significance the more variables they use. They could choose a level so that the chance of finding significance is not 0.05 per variable, but 0.05 for the paper as a whole. That means that with six new variables, you use 0.0085 as your new significance level. Heck, round it up to 0.01 if you want.

Let 0.01 be the new 0.05. That would be appropriate for most studies, I think.

------

One last point here, and this is something that really bugs me. You'll read a paper, and the author will find variable X significant at 0.05, and he'll go ahead and write his conclusion as if he's 100 percent certain that the effect he found is real.

But, as I've just argued, there's a pretty good chance it's not. And, even if you disagree with what I've written here, still, there's always *some* chance the finding is spurious.

So why don't authors mention that? They could just throw it in casually, in a sentence or two, like this:

"... of course, it could be that we're just looking at a false positive, and there's no day-to-day effect at all. Remember, we'd see this kind of result about 1 in 20 times even if there were no effect at all."

Why won't the authors even consider a false positive? Is it just the style and standards of how academic papers work, that when you find a 5 percent significance level, you have to pretend it removes all doubt? Maybe it's just me. But when I see such confident conclusions based on one 0.05 cherry-picked from a list of fifteen variables, I can't help feeling like I'm being sold a bill of goods.

------

UPDATE: Commenter Mettle (fifth comment -- don't know how to link to it directly) points out that much of what I'm saying here isn't new (and referred me to this very relevant Wikipedia page).

However, the question remains: if this is a well-established issue, why don't authors routinely address it in their papers?

Sunday, June 12, 2011

Why we only like to watch live sporting events

"Replays of historic games on ESPN Classic get very low ratings. That must be because fans don't like to watch games where they already know who wins. So you should expect fans to be happier when there's lots of doubt as to who wins. Fans must want to watch games where teams are evenly matched, because that's where the outcome is most in doubt."

I never bought into that logic at all. For one thing, there's a huge difference between a game where you 100% know who wins, and one where you're only 99% sure who's going to win (think lottery ticket buyers). For another thing, there's more to "uncertainty of outcome" than who wins ... I've watched women's hockey games where the uncertainty is whether Canada will outscore Bulgaria by 19-0, or 22-0, or only 16-1.

How can we separate referee bias from other sources of home-field advantage?

Are umpires and referees biased in favor of home teams?

The evidence seems to say they are. In "Scorecasting," Tobias Moskowitz and Jon Wertheim listed many bits of evidence that suggest such bias. While I don't agree with them that refereeing is the *only* thing that causes home-field advantage, they make a pretty good case that it at least causes *some* of it.

Next month, I'll be giving a presentation on this topic at the SABR convention in Long Beach. My tentative plan is, first, to show evidence (mostly baseball) showing umpire bias. For that, I'll use examples from "Scorecasting." In addition, there's a great study from John Walsh in the 2011 Hardball Times book, which uses Pitch f/x data to show that umpires miscall the strike zone in favor of the home team by about 0.8 pitches a game (which is worth .015 in winning percentage, or about one-third of observed HFA). And, there was some work done by Mitchel Lichtman at "The Book" blog.

Second, I plan to show evidence that some of the performance differential is very unlikely to be caused by umpires or referees. For instance, suppose I find out that when putting the ball in play with nobody on and nobody out on a 2-1 count, home teams have significantly better outcomes than road teams. If that's the case, it can't have much to do with umpires, right? Because if the ball is put in play, the umpire doesn't have to make a call, except, perhaps, on a safe/out play, and we know those are rarely missed.

That sounds reasonable, but it's not necessarily right. Perhaps the reason road teams have worse ball-in-play outcomes (if in fact, they do) is that, knowing that umpires are biased against them, they have to swing at worse pitches to avoid taking an unwarranted strike. For that reason, it could be that the effect of umpires is much more than the 0.8 pitches a game. Indeed, it's logically possible that the *entire* home field advantage is caused by umpires. For instance, if the batter knows that he'll *never* get a called ball on a certain outside pitch, he'll swing at all of them. The bias will then show up only in the outcome of balls in play, and in extra (swinging) strikes, but not in extra called strikes.

So, you can't really be sure that the 0.8 figure isn't a biased, lowball estimate of the umpire's bias.

Still, you can probably argue that a large extra effect of this sort is implausible. If two-thirds of umpire bias was hidden by batters compensating, it would leave some evidence. You'd see a lot more swinging strikes for the visiting batters. You'd see a lot more balls in play for the visiting batters. You'd see lower pitch counts for the home pitcher, because the visiting hitters would be making contact more often.

Or would you? Maybe even those effects are too small to be seen with the naked eye, as it were. If it took many years of Pitch f/x data, and many years of waiting for John Walsh to notice an umpire bias of .015, who's to say that an additional umpire bias of .030 can immediately be seen in other stats?

Does anyone see a way out of this dilemma, or have any suggestions on what kind of evidence would help resolve the issue? The only one I can think of is NBA foul shooting, where there's absolutely no referee influence, but still a significant HFA. For baseball, though ... there's nothing like that that I can think of.

I might just have to bite the bullet and go with the argument I'm making here: that a large hidden umpire bias effect is implausible, but not impossible.

Monday, June 06, 2011

Interpreting regression interaction terms

Last post, after talking about the results from the "choking foul shooter" study, I mentioned that there was one additional assumption I had to make. That assumption was that, in the regression, the coefficients for "last 15 seconds" and "down 1-4 points" were close to zero.

The easiest way to explain that is to go through what an interaction term means in a regression. (Warning: This is boring statistics stuff, no sports content until the end.)

------

Suppose I want to figure out if stimulants help a student do better on an exam. So I run a regression to predict the exam score. I use a bunch of variables, like age, time studying, performance on other exams, grades on assignments, number of classes missed, and so on, but I also include a dummy variable for whether the student had (both) coffee and Red Bull before the exam.

After the exam, I run the regression, and I find the coefficient for "both coffee and Red Bull" is -3, and statistically significant. I conclude that if I were a student, I might consider not taking both coffee and Red Bull.

Fair enough, so far.

But, now, suppose I do the same experiment again, but, this time, I add a couple of new dummy variables -- whether or not the student had coffee (with or without Red Bull), and whether or not the student had Red Bull (with or without coffee). I don't remove the original "had both" variable -- that stays in.

I run the regression again, and, again, the coefficient for "both coffee and Red Bull" comes out to -3 -- exactly the same as last time. What am I able to conclude this time about the desirability of drinking both coffee and Red Bull?

The answer: almost nothing. That coefficient, *on its own*, does not give much useful information at all about how performance is affected by the coffee/Red Bull combination.

Let me explain why.

-------

(First, a quick note on terminology. In a regression, the "both coffee and Red Bull" variable would be referred to as "the interaction of coffee and Red Bull." That would be written as "coffee x Red Bull," or a suitable abbreviation. In fact, I'm going to start referring to "coffee" as C, "Red Bull" as R, and "coffee x Red Bull" as CxR. The "x" is a multiplication sign -- it's there because you can get the value of CxR by multiplying together the dummy values for C and R. That is, if either C or R is zero, CxR equals zero; if both C and R are 1, then CxR equals one. That's exactly what we want.)

-------

In a regression result, the simplest way to interpret the coefficient of a dummy variable is, "what happens when you change the value from 0 to 1 and leave all the other variables the same." In the first regression, that works fine. But in the second regression, it can't work. Because if you change CxR and leave everything else constant, your data and regression become inconsistent. You wind up with CxR being 1 (meaning both coffee and Red Bull), but you'll have either C=0 (no coffee) or R=0 (no Red Bull). Those three variables are tied together, so you can't just change CxR and leave the other two constant.

Put another way, there are four possible combinations for C, R, and CxR:

(C=0, R=0, CxR=0) -- neither
(C=1, R=0, CxR=0) -- coffee only
(C=0, R=1, CxR=0) -- Red Bull only
(C=1, R=1, CxR=1) -- both

You can't change CxR from 0 to 1, and still have a combination that's on the list. So the "change CxR but leave all other variables the same" strategy no longer works. If you change CxR from 0 to 1, you'll have to change one of the other variables, too.

-------

Which ones should you change? It depends what question you're trying to answer. For example, suppose you do the regression and you get these coefficients:

C = -5
R = -10
CxR = -3

If you're trying to ask, "what's the effect of taking coffee alone versus nothing at all," it's like asking, "what is the effect of changing (C=0, R=0, CxR=0) to (C=1, R=0, CxR = 0)?" The answer is -5.

If you're trying to ask, "what's the effect of taking both coffee and Red Bull versus nothing at all?", it's like asking, "what is the effect of changing (C=0, R=0, CxR=0) to (C=1, R=1, CxR=1)?" The answer is -18.

And so on. But none of those kinds of questions lead to the answer of -3 points, because none of these questions can be answered by changing CxR alone.

So what does the -3 represent? The non-linearity of the coffee and Red Bull variables. Or, put another way, the "increasing or diminishing returns" to combining coffee and Red Bull. Or, put a third way, the effects of the *interaction* of coffee and Red Bull, independent of their individual effects. Or, put a fourth way, the amount of effects *duplicated* from both coffee and Red Bull, that you can't count twice even if you take both drinks.

The -3 is NOT any indication of whether it's a good thing to take coffee and Red Bull together. Even though the coefficient of the interaction is negative, coffee and Red Bull together might be a positive thing. Suppose the regression coefficients had looked like this:

C = +10
R = +20
CxR = -3

In this case, you're still going to want to take both coffee and Red Bull. What the -3 is telling you is, there are diminishing returns to taking both. You might think that, since coffee improves you by 10, and Red Bull improves you by 20, that, if you take both, you'll improve by 30. That's not right. There are diminishing returns of -3, so, if you take both, you'll only improve by 27.
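The bookkeeping here is easy to get wrong, so here's a tiny sketch that computes each comparison by summing only the coefficients that actually change (the coefficients are the made-up ones from the examples above):

```python
def effect(c, r, coef_c, coef_r, coef_cxr):
    """Predicted change versus taking nothing, for coffee=c, red_bull=r (0 or 1).

    The interaction dummy is c*r, so it only kicks in when both are 1.
    """
    return c * coef_c + r * coef_r + (c * r) * coef_cxr

# First set of made-up coefficients: C = -5, R = -10, CxR = -3
print(effect(1, 0, -5, -10, -3))   # coffee alone: -5
print(effect(1, 1, -5, -10, -3))   # both: -18, not -3

# Second set: C = +10, R = +20, CxR = -3 (diminishing returns)
print(effect(1, 1, 10, 20, -3))    # both: 27, not 30
```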

-------

Of course, if the coefficients of C and R are both zero, then the CxR variable is indeed the entire effect. So if coffee does nothing, and Red Bull does nothing, but, when you take them together, you lose 3 points ... in that case, the CxR variable actually IS the effect of taking both C and R.

-------

This is fairly standard stuff, I would think ... I looked for an explanation on the web, so I wouldn't have to type all this, but I couldn't find one.

Anyway, going back to the choking study ... there, I looked at a variable called "Last15 x Down1_4", which was the interaction of shots that happen in the last 15 seconds (a dummy variable called "Last15") with the shooting team down by 1 to 4 points (dummy variable "Down1_4").

It turned out that the coefficient for that was -0.058 (as compared to 11-point+ blowouts). I implied that meant that shooters were 5.8 percentage points worse in those clutch situations than in blowouts.

But that wasn't right, because "Down1_4" and "Last15" were also in the regression. It's like the "Coffee / Red Bull / Both" case. If I want to compare the effects of shooting in the last 15 seconds down by 1-4 points, against shooting where *neither* of those is true, I have to add up all three coefficients:

Down1_4 = A
Last15 = B
Down1_4 x Last15 = -0.058

To get the true clutch effect, I have to compute A + B - 0.058. It could turn out that A and B are huge: maybe they're +7 percentage points each! In that case, the effect would be 0.07 + 0.07 - 0.058, which works out to +0.082 -- which would mean shooters were GREAT in the clutch.

The study doesn't give us A and B. However, the authors do tell us (and author Dan Stone reiterated in the comments to the previous post) that almost all the omitted coefficients are less than 0.01.

Still, suppose A and B are both as high as +0.01. Then A + B - 0.058 would be -0.038, a smaller choke effect than I thought. Or, suppose they were both as low as -0.01. In that case, players would be even chokier -- at -0.078.
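The range of possible clutch effects is just this sum. The -0.058 is the interaction coefficient from the study; A and B stand for the (unpublished) Down1_4 and Last15 coefficients, bounded at 0.01 per the authors' statement:

```python
cxr = -0.058   # Down1_4 x Last15 interaction coefficient, from the study

def clutch_effect(a, b):
    """Clutch vs. neither-condition comparison: sum the coefficients that change."""
    return a + b + cxr

print(round(clutch_effect(0.01, 0.01), 3))    # upper bound: -0.038
print(round(clutch_effect(0.0, 0.0), 3))      # if A and B are zero: -0.058
print(round(clutch_effect(-0.01, -0.01), 3))  # lower bound: -0.078
```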

That's why I added a note to the end of my post, saying I had to make one additional assumption. That assumption is that A and B were both close to zero. If they're exactly zero, the -0.058 stands.