Friday, August 04, 2017

Deconstructing an NBA time-zone regression

Warning: for regression geeks only.

------

Recently, I came across an NBA study that found an implausibly huge effect of teams playing in other time zones. The study uses a fairly simple regression, so I started thinking about what could be happening. My point here isn't to call attention to the study, just to figure out the puzzle of how such a simple regression could come up with such a weird result.

------

The authors looked at every NBA regular-season game from 1991-92 to 2001-02. They tried to predict which team won, using these variables:

-- indicator for home team / season
-- indicator for road team / season
-- time zones east for road team
-- time zones west for road team

The "time zones" variable was set to zero if the game was played in the road team's normal time zone, or if it was played in the opposite direction. So, if an east-coast team played on the west coast, the west variable would be 3, and the east variable would be 0.

The team indicators are meant to represent team quality.

------

When the authors ran the regression, they found the "number of time zones" variable large and statistically significant. For each time zone moving east, teams played .084 better than expected (after controlling for teams). A team moving west played .077 worse than expected.

That means a .500 road team on the West Coast would actually play .752 ball on the East Coast (.500 plus three time zones at .084 each). And that's regardless of how long the visiting team has been in the home team's time zone. It could be a week or more into a road trip, and the regression says it's still .752.

The authors attribute the effect to "large, biological effects of playing in different time zones discovered in medicine and physiology research."

------

So, what's going on? I'm going to try to get to the answer, but I'll start with a couple of dead ends that nonetheless helped me figure out what the regression is actually doing.
I should say in advance that I can't prove any of this, because I don't have their data and I didn't repeat their regression. This is just from my armchair.

Let's start with this. Suppose it were true, that for physiological reasons, teams always play worse going west, and teams always play better going east. If that were the case, how could you ever know? No matter what you see in the data, it would look EXACTLY like the West teams were just better quality than the East teams. (Which they have been, lately.)

To see that argument more easily: suppose the teams on the West Coast are all NBA teams. The MST teams are minor-league AAA. The CST teams are AA. And the East Coast teams are minor league A ball. But all the leagues play against each other.

In that case, you'd see exactly the pattern the authors got: teams are .500 against each other in the same time zone, but worse when they travel west to play against better leagues, and better when they travel east to play against worse leagues.

No matter what results you get, there's no way to tell whether it's time zone difference, or team quality.

So is that the issue, that the regression is just measuring a quality difference between teams in different time zones? No, I don't think so. I believe the "time zone" coefficient of the regression is measuring something completely irrelevant (and, in fact, random). I'll get to that in a bit.

------

Let's start by considering a slightly simpler version of this regression. Suppose we include all the team indicator variables, but, for now, we don't include the time-zone number. What happens?

Everything works, I think. We get decent estimates of team quality, both home and road, for every team/year in the study. So far, so good.

Now, let's add a bit more complexity. Let's create a regression with two time zones, "West" and "East," and add a variable for the effect of that time zone change.

What happens now?

The regression will fail. There's an infinite number of possible solutions.
(In technical terms, the regression matrix is "singular." We have "collinearity" among the variables.)

How do we know? Because there's more than one set of coefficients that fits the data perfectly.

(Technical note: a regression will always fail if you have an indicator variable for every team. To get around this, you'll usually omit one team (and the others will come out relative to the one you omitted). The collinearity I'm talking about is even *after* doing that.)

Suppose the regression spit out that the time-zone effect is actually .080, and it also spit out quality estimates for all the teams.

From that solution, we can find another solution that works just as well. Change the time-zone effect to zero. Then, add .080 to the quality estimate of every West team. Every team/team estimate will wind up working out exactly the same.

Suppose, in the first result, the Raptors were .400 on the road, the Nuggets were .500 at home, and the time-zone effect is .080. In that case, the regression will estimate the Raptors at .320 against the Nuggets. (That's .400 - (.500 - .500) - .080.)

In the second result, the regression leaves the Raptors at .400, but moves the Nuggets to .580, and the time-zone effect to zero. The Raptors are still estimated at .320 against the Nuggets. (This time, it's .400 - (.580 - .500) - .000.)

You can create as many other solutions as you like that fit the data identically: just add any X to the time-zone estimate, and subtract the same X from every Western team.

The regression is able to figure out that the data doesn't give a unique solution, so it craps out, with a message that the regression matrix is singular.

------

All that was for a regression with only two time zones. If we now expand to include all four zones, that gives six different effects each direction (E moving to C, C to M, M to P, E to M, C to P, and E to P). What if we include six time-zone variables, one for each effect?

Again, we get an infinity of solutions.
We can produce new solutions almost the same way as before. Just take any solution, subtract X from each E team quality, and add X to the E-C, E-M and E-P coefficients. You wind up with the same estimates.

------

But the authors' regression actually did have one unique best fit solution. That's because they did one more thing that we haven't done.

We can get to their regression in two steps.

First, we collapse the six variables into three -- one for "one time zone" (regardless of which zone it is), one for "two time zones," and one for "three time zones".

Second, we collapse those three variables into one, "number of time zones," which implicitly forces the two-zone effect and three-zone effect to be double and triple, respectively, the value of the one-zone effect. I'll call that the "x/2x/3x rule" and we'll assume that it actually does hold.

So, with the new variable, we run the regression again. What happens?

In the ideal case, the regression fails again.

By "ideal case," I mean one where all the error terms are zero, where every pair of teams plays exactly as expected. That is, if the estimates predict the Raptors will play .350 against the Nuggets, they actually *do* play .350 against the Nuggets. It will never happen that every pair will go perfectly in real life, but maybe assume that the dataset is trillions of games and the errors even out.

In that special "no errors" case, you still have an infinity of solutions. To get a second solution from a first, you can, for instance, double the time zone effects from x/2x/3x to 2x/4x/6x. Then, subtract x from each CST team, subtract 2x from each MST team, and subtract 3x from each PST team. You'll wind up with exactly the same estimates as before.

------

For this particular regression to not crap out, there have to be errors. Which is not a problem for any real dataset.
The Raptors certainly won't go the exact predicted .350 against the Nuggets, either because of luck, or because it's not mathematically possible (you'd need to go 7-for-20, and the Raptors aren't playing 20 games a season in Denver).

The errors make the regression work.

Why? Before, x/2x/3x fit all the observations perfectly. So you could create duplicate solutions by adding and subtracting X and 2X from the teams, and adding X and 2X to the one-zone effects and two-zone effects.

Now, because of errors, not all the observed two-zone effects are exactly double the one-zone effects. So not everything cancels out, and you get different residuals. That means that this time there's a unique solution, and the regression spits it out.

------

In this new, valid, regression, what's the expected value of the estimate for the time-zone effect?

I think it must be zero.

The estimate of the coefficient is a function of the observed error terms in the data. But the errors are, by definition, just as likely to be negative as positive. I believe (but won't prove) that if you reverse the signs of all the error terms, you also reverse the sign of the time zone coefficient estimate.

So, the coefficient is as likely to be negative as positive, which means by symmetry, its expected value must be zero.

In other words: the coefficient in the study, the one that looks like it's actually showing the physiological effects of changing time zone ... is actually completely random, with expected value zero.

It literally has nothing at all to do with anything basketball-related!

------

So, that's one factor that's giving the weird result, that the regression is fitting the data to randomness. Another factor, and (I think) the bigger one, is that the model is wrong.

There's an adage, "All models are wrong; some models are useful." My argument is that this model is much too wrong to be useful.
Specifically, the "too wrong" part is the requirement that the time-zone effect must be proportional to the number of zones -- the "x/2x/3x" assumption.

It seems like a reasonable assumption, that the effect should be proportional to the time lag. But, if it's not, that can distort the results quite a bit. Here's a simplified example showing how that distortion can happen.

Suppose you were to run the regression without the time-zone coefficient, and you get talent estimates for the teams, and you look at the errors in predicted vs. actual. For East teams, you find the errors are

+.040 against Central
+.000 against Mountain
-.040 against Pacific

That means that East teams played .040 better than expected against Central teams (after adjusting for team quality). They played exactly as expected against Mountain Time teams, and .040 worse than expected against West Coast teams.

The average of those numbers is zero. Intuitively, you'd look at those numbers and think: "Hey, there's no appreciable time-zone effect. Sure, the East teams lost a little more than normal against the Pacific teams, but they won a little more than normal against the Central teams, so it's mostly a wash."

Also, you'd notice that it really doesn't look like the observed errors follow x/2x/3x. The closest fit seems to be when you make x equal to zero, to get 0/0/0.

So, does the regression see that and spit out 0/0/0, accepting the errors it found? No. It actually finds a way to make everything fit perfectly!

To do that, it increases its estimates of every Eastern team by .080. Now, every East team appears to underperform by .080 against each of the three other time zones. Which means the observed errors are now

-.040 against Central
-.080 against Mountain
-.120 against Pacific

And that DOES follow the x/2x/3x model -- which means you can now fit the data perfectly.

Using 0/0/0, the .500 Raptors were expected to be .500 against an average Central team (.500 minus 0), but they actually went .540.
Using -.040/-.080/-.120, the .580 Raptors are expected to be .540 against an average Central team (.580 minus .040), and that's exactly what they did.

So the regression says, "Ha! That must be the effect of time zone! It follows the x/2x/3x requirement, and it fits the data perfectly, because all the errors now come out to zero!"

So you conclude that

(a) over a 20-year period, the East teams were .580 teams but played down to .500 because they suffered from a huge time-zone effect.

Well, do you really want to believe that? You have at least two other options you can justify:

(b) over a 20-year period, the East teams were .500 teams and there was a time-zone effect of +40 points playing in CST, and -40 points playing in PST, but those effects weren't statistically significant.

(c) over a 20-year period, the East teams were .500 teams and due to lack of statistical significance and no obvious pattern, we conclude there's no real time-zone effect.

The only reason to choose (a) is if you are almost entirely convinced of two things: first, that x/2x/3x is the only reasonable model to consider, and, second, that 40/80/120 points is plausible enough to not assume that it's just random crap, despite the statistical significance.

You have to abandon your model at this point, don't you? I mean, I can see how, before running the regression, the x/2x/3x assumption seemed as reasonable as any. But, now, to maintain that it's plausible, you have to also believe it's plausible that an Eastern team loses .120 points of winning percentage when it plays on the West Coast.

Actually, it's worse than that! The .120 was from this contrived example. The real data shows a drop of more than .200 when playing on the West Coast!

The results of the regression should change your mind about the model, and alert you that the x/2x/3x is not the right hypothesis for how time-zone effects work.

------

Does this seem like cheating?
We try a regression, we get statistically-significant estimates, but we don't like the result so we retroactively reject the model. Is that reasonable?

Yes, it is. Because, you have to either reject the model, or accept its implications. IF we accept the model, then we're forced to accept that there's a 240-point West-to-East time zone effect, and we're forced to accept that West Coast teams that play at a 41-41 level against other West Coast teams somehow raise their game to the 61-21 level against East Coast teams that are equal to them on paper.

Choosing the x/2x/3x model led you to an absurd conclusion. Better to acknowledge that your model, therefore, must be wrong.

Still think it's cheating? Here's an analogy:

Suppose I don't know how old my friend's son is. I guess he's around 4, because, hey, that's a reasonable guess, from my understanding of how old my friend is and how long he's been married. Then, I find out the son is six feet tall.

It would be wrong for me to keep my assumption, wouldn't it? I can't say, "Hey, on the reasonable model that my friend's son is four years old, the regression spit out a statistically significant estimate of 72 inches. So, I'm entitled to conclude my friend's son is the tallest four-year-old in human history."

That's exactly what this paper is doing.

When your model spews out improbable estimates for your coefficients, the model is probably wrong. To check, try a different, still-plausible model. If the result doesn't hold up, you know the conclusions are the result of the specific model you chose.

------

By the way, if the statistical significance is concerning you, consider this. When the authors repeated the analysis for a later group of years, the time-zone effect was much smaller. It was .012 going east and -.008 going west, which wasn't even close to statistical significance.
If the study had combined both samples into one, it wouldn't have found significance at all.

Oh, and, by the way: it's a known result that when you have strong correlation in your regression variables (like here), you get wide confidence intervals and weird estimates (like here). I posted about that a few years ago.

------

The original question was: what's going on with the regression, that it winds up implying that a .500 team on the West Coast is a .752 team on the East Coast?

The summary is: there are three separate things going on, all of which contribute:

1. there's no way to disentangle time zone effects from team quality effects.
2. the regression only works because of random errors, and the estimate of the time-zone coefficient is only a function of random luck.
3. the x/2x/3x model leads to conclusions that are too implausible to accept, given what we know about how the NBA works.
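
Two of the arithmetic claims above can be checked numerically. What follows is a minimal sketch in Python with NumPy, not the study's regression and not the study's data. The four-team league, the single signed time-zone variable (+1 travelling west, -1 travelling east), and every quality number other than the Raptors/Nuggets figures from the two-zone example are assumptions made purely for illustration.

```python
import itertools
import numpy as np

# Hypothetical toy league (an assumption for illustration):
# teams 0 and 1 are in the East, teams 2 and 3 are in the West.
zone = np.array([0, 0, 1, 1])  # 0 = East, 1 = West
games = list(itertools.permutations(range(4), 2))  # every (road, home) pairing

def tz_variable(road, home):
    # Signed time-zone variable: +1 travelling west, -1 travelling east, 0 otherwise.
    return zone[home] - zone[road]

def design_matrix(include_tz):
    # One row per game: road-team indicators, home-team indicators, optional tz column.
    rows = []
    for road, home in games:
        row = list(np.eye(4)[road]) + list(np.eye(4)[home])
        if include_tz:
            row.append(tz_variable(road, home))
        rows.append(row)
    return np.array(rows, dtype=float)

# If the time-zone column carried new information, the rank would go up by one.
rank_without_tz = np.linalg.matrix_rank(design_matrix(False))
rank_with_tz = np.linalg.matrix_rank(design_matrix(True))
print("rank without tz column:", rank_without_tz)
print("rank with tz column:   ", rank_with_tz)

def predict(road_q, home_q, tz_effect):
    # Road team's expected winning percentage, as in the Raptors/Nuggets arithmetic.
    return np.array([
        road_q[r] - (home_q[h] - 0.500) - tz_effect * tz_variable(r, h)
        for r, h in games
    ])

# Made-up qualities, except team 0 (.400 on the road) and team 2 (.500 at home),
# which stand in for the Raptors and the Nuggets.
road_q = np.array([0.400, 0.450, 0.480, 0.520])
home_q = np.array([0.550, 0.600, 0.500, 0.620])

# Solution A: a .080 time-zone effect.
pred_a = predict(road_q, home_q, 0.080)
# Solution B: no time-zone effect, every West team rated .080 better (home and road).
bump = 0.080 * zone
pred_b = predict(road_q + bump, home_q + bump, 0.0)
print("same predictions:", np.allclose(pred_a, pred_b))

# The x/2x/3x refit: East residuals against Central / Mountain / Pacific.
residuals = np.array([0.040, 0.000, -0.040])
shifted = residuals - 0.080  # after rating every East team .080 higher
per_zone = shifted / np.array([1, 2, 3])
print("implied effect per zone:", per_zone)
```

If the sketch behaves as intended, adding the time-zone column should leave the rank unchanged, the two solutions should give the same prediction for every game, and the shifted residuals should all imply the same per-zone effect. The study used separate east and west variables, so the exact form of the collinearity there may differ; this only illustrates the simplified two-zone story.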

------

UPDATE, August 6/17: I got out of my armchair and built a simulation. The results were as I expected. The time-zone effect I built in wound up absorbed by the team constants, and the time-zone coefficient varied around zero in multiple runs.

1 Comments:

I think there might be ways to get at this using actual data, rather than simulations. Consider, for example, professional golf. We know how many people play on the PGA tour (or other tours) in a year. We can gather information of their scores on every hole if we want to. But even at the level of every round, there's a lot of data. (I can't find a downloadable database right now, but they almost certainly exist). What I do have is current average scores per round for 210 PGA golfers, through last week's tournament. The mean score per round is 71.28 with a s.d. of 1.09. The golfer with the lowest average score (68.89) is 2.19 s.d. better than average; the worst scoring golfer has an average of 76.71 (s.d. worse than average 4.99). Anyway...this is just to suggest an additional approach to this analysis...