Monday, December 03, 2007

Does pace impact defensive efficiency? Don't use r-squared, use the regression equation

When someone runs a regression, they will wind up reporting a value for r or r-squared. If the value is small, they'll argue that the two variables don't really have much of a relationship. But that's not necessarily true.

Before I talk about why, I should say that if the result goes the other way – the r or r-squared is high, and statistically significant -- that *does* mean there's a strong relationship. If the correlation between cigarettes smoked and lung cancer is, say, 0.7, that's a pretty big number, and we can conclude that lung cancer and smoking are strongly related.

But a low value doesn't necessarily mean the opposite.

For instance: inspired by the smoking example, I looked at another lifestyle choice. Then, I ran a regression on expected remaining lifespan, based on that lifestyle choice. The results:

r = -.17, r-squared = .03

What is the effect of that lifestyle choice on lifespan? It looks like it should be small. After all, it "explains" only 3% of the variance in years lived.

But that wouldn't be correct. The lifestyle choice really does have a large effect on lifespan. In fact, the lifestyle choice I used in the equation is (literally) suicide.

Here's what I did. I took 999 random 40-year-olds, and assumed their expected remaining lifespan was 40 years, with an SD of about 8. Then, I assumed the 1,000th person jumped in front of a moving subway train, with an expected remaining lifespan of zero. (These numbers are made up, by the way.)

The results were what I showed above: an r of only –.17.
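For the curious, the experiment is easy to reproduce. Here's a minimal sketch in Python (the numbers are made up, just as in the setup above, so the exact r will wobble a little with the random seed):

```python
import numpy as np

rng = np.random.default_rng(0)

# 999 ordinary 40-year-olds: expected remaining lifespan ~40 years, SD ~8
lifespan = rng.normal(40, 8, size=999)
suicide = np.zeros(999)

# the 1,000th person commits suicide: remaining lifespan exactly 0
lifespan = np.append(lifespan, 0.0)
suicide = np.append(suicide, 1.0)

# correlation, and the simple regression line (slope, intercept)
r = np.corrcoef(suicide, lifespan)[0, 1]
slope, intercept = np.polyfit(suicide, lifespan, 1)

print(f"r = {r:.2f}, r-squared = {r * r:.2f}")
print(f"lifespan = {intercept:.1f} + ({slope:.1f}) * suicide")
```

The r comes out small (around -0.17), but the slope comes out right around -40 years: the regression equation sees the huge effect even though the r-squared doesn't.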

Why does this happen? It happens because the r, and the r-squared, do NOT measure whether suicide and lifespan are related. Rather, they measure something subtly different: whether suicide is a big factor in how long people live.

And suicide is NOT that big a factor in how long people live. Most people don't commit suicide; in my model, only 1 in 1000. The r-squared shows how much effect suicide has *as a percentage of all other factors*. Because there are so many other factors – in real life, heart disease and cancer are about 40 times as common as suicide -- the r-squared comes out small.

If you want to know the strength of the relationship between A and B, don't look at the r or the r-squared. Instead, look at the regression equation. In my suicide experiment, the equation turned out to be

Lifespan = 40.0 – 40.0 (suicide)

That is, exactly what you would expect: the lifespan is 40 years, but subtract 40 from that (giving zero) if you commit suicide.

And, even though the r-squared was only 0.03, that r-squared is statistically significant, at beyond the 99.99% level.

Again: the r-squared is heavily dependent on how "frequent" the lifestyle choice is in the population. But the significance level and the regression equation are not.

To prove it, let me rerun my experiment a few times, with different percentages of suicide in the population:

The r-squared varies a lot – but all the experiments tell you that suicide costs you 40 years of life, and that the result is statistically significant.
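Here's a sketch of that rerun (illustrative simulated data, not real mortality statistics), using scipy's linregress to get the significance level along with the slope:

```python
import numpy as np
from scipy.stats import linregress

rng = np.random.default_rng(1)
n = 100_000  # big population, so even a 0.1% suicide rate is well represented

results = {}
for rate in (0.001, 0.01, 0.05, 0.20):
    suicide = (rng.random(n) < rate).astype(float)
    # non-suicides live ~40 more years (SD 8); suicides live 0 more years
    lifespan = np.where(suicide == 1, 0.0, rng.normal(40, 8, n))
    fit = linregress(suicide, lifespan)
    results[rate] = fit
    print(f"rate={rate:5.3f}  r^2={fit.rvalue ** 2:.3f}  "
          f"slope={fit.slope:6.1f}  p={fit.pvalue:.1e}")
```

The r-squared climbs steadily as suicide gets more common in the population, but the slope stays pinned near -40 years, and the p-value stays vanishingly small throughout.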

The moral of the story:

The r-squared (or r) does NOT tell you the extent to which A causes B, or even the strength of the relationship between A and B. It tells you the extent to which A explains B relative to all the other explanations of B.

If you want to quantify the effect a change in A has on B, do not look at the r or r-squared. Instead, look at the regression equation.

------

Which brings us to today's post at "The Wages of Wins." There, David Berri checks whether teams who play a fast-paced brand of basketball (as measured by possessions per game) wind up playing worse defense (as measured by points allowed per possession) because of it. Berri quotes Matthew Yglesias:

"For example, there’s a popular conception of a link between pace and defensive orientation — specifically the idea that teams that choose to play at a fast pace are sacrificing something in the defense department. On the most naive level, that’s simply because a high pace leads to more points being given up. But I think it’s generally assumed that it holds up in efficiency terms as well. The 2006-2007 Phoenix Suns, for example, were first in offensive efficiency, third in pace, and fourteenth in defense. But is this really true? If you look at the data season-by-season is there a correlation between pace and defense?"

Berri runs a regression for 34 years of team data. So, is there a relationship? He writes,

"The correlation coefficient between relative possessions and defensive efficiency is 0.17. Regressing defensive efficiency on relative possession reveals that there is a statistically significant relationship. The more possessions a team has per game - again, relative to the league average - the more points the team’s opponents will score per possession. But relative possessions only explains 2.8% of defensive efficiency. In sum, pace doesn’t tell us much about defensive efficiency ... " [emphasis mine]

But I don't think that's right. As we saw, the r-squared of 2.8% (or the r of 0.17) means only that, historically, pace is small *compared to other explanations of defensive efficiency.* And that makes sense. Even if pace has a significant impact on defense, we'd expect other factors to be even more important. The players on the team, for instance, are a big factor. Luck is also a big factor. The coach's strategy probably has a large impact on defensive efficiency. Compared to all those things, pace is pretty minor. And we probably knew before we started that personnel matters more than pace.

And so I would guess that's not really what Yglesias wants to know. What I bet he's interested in, and what teams would be interested in, and what I'd be interested in, is this: if a team speeds up the pace by (say) 2 possessions per team per game, how much will its defense suffer? That's an important question: if you're evaluating how good a team (or player) is on defense, you want to know if you can take the stats at face value, or if you have to bump them up to compensate for fast play, or if you have to discount them for teams who play a little slower. It's like a park factor, but for defensive efficiency. The regression should be able to tell you just how big that park factor is.

That's the real question, and the r-squared doesn't answer it at all. Given the data Berri gives us, the effect of pace on defensive efficiency could be small, or it could be large. After all, the effect of suicide on lifespan was huge, even though the r-squared was small. And just like in the suicide case, if a lot more teams suddenly decide to start playing at a different pace, the r and r-squared will go up – but the relationship between pace and defense will likely not change.

To really understand what effect pace has on defense, we need the regression equation. Berri doesn't give it to us. He does tell us the result is statistically significant, so we do know there *is* some kind of non-zero effect. But without the equation, we don't know how big it is (or even whether it's positive or negative). All we know is that pace *does* significantly impact a team's defensive stats, and that the effect (as judged by statistical significance) appears to be real.

I haven't looked at Berri's article, but I know you aren't going to get a significant (in any sense) result looking at season-level data, which, from your description, it looks like he was using. If you have a regression model like

off.ppp_game ~ off.ppp_season + def.ppp_season + off.pace_season

(where ppp = points/possession, pace = poss/g) only the ppp predictors will be significant. (Since possessions alternate, the number of offensive possessions is just about equal to defensive possessions.)

If your model looks like

off.ppp_game ~ off.ppp_season + def.ppp_season + off.pace_game

that is a different story. In this case, the coefficient of the pace variable is negative -- that is, holding general team offense and opponent defense constant, a higher number of possessions within a game is associated with a lower efficiency.

Yeah, that won't work. (I just tried it to make sure, but there's all kinds of stuff you won't see by just assuming all your opponents are average.) You need game-level detail, which is what I did. Once you take opponent defense into account, which you really can only do using game data, you can see that a team that has a lot of possessions in a single game generally shot worse than their season average.

ed: I'm confused. If you just do the two columns, it doesn't work? What do you mean by "doesn't work"? Do you mean you don't get significance, or that the result isn't meaningful because you haven't controlled for enough variables?

But I think I see what you're saying ... you're saying that if a team plays fast, its opponents must play just as fast. So both the team and its opponents should have the same effect on their DE. And treating all the opponents as a single team, their DE is your team's OE, so the regressions on DE and OE should give the same result.

Is that what you're saying? Good call, I didn't notice that.

One possible objection is that a team that plays fast might be different, in some ways, from the aggregate opponent, because their players/strategy are optimized for fast play. But your point remains.

Actually, all I meant is that you won't get valid statistical significance by using a single variable as a predictor. By doing it this way you are implicitly assuming all opponents are the same, when in fact there are systematic differences in opponents due to scheduling. (This doesn't even touch on the strategic decisions to force a faster or slower pace against certain teams -- again, something you can't see at the season level.) Thinking about it a bit, this didn't necessarily need to be the case, but it was.

Sure, I agree that the model with a single variable is a bit oversimplified. But isn't it a decent start? Doesn't it tell you *something*, and then you can go from there?

It's like Pythagoras ... someone could say, well, you know, teams face different opposition with different run patterns. Some teams have great stoppers, and some let their relievers give up 12 runs in blowouts.

And, yes, you're making some implicit assumptions, but it turns out they don't matter much, everything balances out. You couldn't assume that, you had to check, but it does turn out that way. Couldn't it be the same here?

Oh, absolutely. That is what I was trying to get at in my last sentence. In fact, I am working on something you can only see at the season level -- it's too noisy at the game level. So colour me hypocritical!

Phil, I don't know why you would want to "park adjust" a team's DE for pace. Their DE is their DE, regardless of the reason, right?

You park adjust players and teams in baseball to put everyone on a level playing field, so you can compare apples to apples when you want to fairly compare players, teams, etc. Right? If a team has a DE of X, that should not change in terms of evaluating how "good" it is overall, whether it is a fast-paced or slow-paced team, right?

So even if DE and pace are related, I am not sure I see any practical use for this info.

Also, a team's pace (poss per game) is likely a result of a fast-paced offense and has little to do with defense, right? So if a team has a fast paced offense, why would that affect its defense? Could it be that they are deliberately (or even subconsciously) playing fast on offense to "make up for" a poor defense (DE)?

Or could it be that a team that has a poor defense actually lets the other team score more quickly and hence has a slightly higher pace? IOW, that the relationship between pace and defense happens to be ONLY with respect to their defensive pace (and that it turns out that defense DOES affect pace)? Or, in general, that a slow defensive pace tends to mean a better defense. Or simply that if you have a good defense, the other team holds the ball longer (can't get shots off as early). Could that be the entire "reason" for a correlation between pace and DE?

Anyway, my main question (other than the "reason" for the correlation) was why you would think that you would want to adjust a team's DE for its pace, like a park effect? Isn't that like saying I'll adjust a player's HR rate according to his K rate (since they are correlated), like a park effect?

mgl: You're right -- you may not want to adjust. It depends on the reason for the correlation.

I was thinking that it might be the case that (a) pace is the result of a deliberate strategy choice, (b) that strategy choice causes poor defensive positioning which causes a low DE.

In which case, if you want to figure out how good the team's defense is, you adjust for the strategy. It's kind of like, if a football team gains 3 yards per play normally, but 2 yards per play on fourth-and-1, you take that into account.

But, as you point out, it could be like K and HR. The question, "how does DE improve when you slow the pace" would be like "how much does K go down if you stop swinging for the fences?" Which is still an interesting question, but, as you say, not like a park effect.

In any case, that was just off the cuff ... I was trying to illustrate that the point of the exercise is to ask "how much does pace affect DE?" rather than "is pace an important determinant of DE relative to other factors?"

And so perhaps my choice of "park effect" to explain what I meant was not the best.

Maybe I'm thinking about this incorrectly, but it seems to me that there are factors that directly impact both pace and efficiency, and you would want to control for those before seeing if there were any other relationship. First is FG%: the higher it is, the faster the pace (because team will rebound c. 30% of missed shots, extending the length of that possession). Second is ORB%: the higher it is, the slower the pace (again, extends duration of possession). Perhaps these are exactly offsetting factors, which is why Berri found no correlation with OE. But it's not clear to me that has to be the case.

Gents, I'm not exactly sure what you're talking about. Efficiency is not affected by pace, by design: we measure it as pts/possession, where possessions alternate (i.e., an offensive rebound does not begin a new possession, it extends the possession). Historically, pace has been weakly correlated with efficiency at the season level, more so at the game level. But this need not be the case, and it is easy to imagine a slow, deliberate team that is also very efficient on both ends -- let's call them the Spurs.

Ed: I don't see how they can be entirely separate. I assume pace is a function of time elapsed per possession. If a team misses a lot of shots on offense, won't that increase the number of seconds per possession (since a FG always ends the possession, but only 70% of misses end the possession)? And won't a team with a high ORB% extend the average duration of their possessions?
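Guy's intuition can be checked with a toy calculation. In a simplified model where a possession ends only on a made shot or a defensive rebound (ignoring turnovers and free throws -- an assumption of this sketch, not real NBA accounting), the expected number of shot attempts per possession is 1 / (1 - miss rate × ORB%):

```python
def attempts_per_possession(fg_pct, orb_pct):
    """Expected field-goal attempts in one possession, in a toy model
    where each miss is rebounded by the offense with probability orb_pct
    (the possession ends on a make or a defensive rebound; turnovers
    and free throws are ignored)."""
    miss_rate = 1.0 - fg_pct
    return 1.0 / (1.0 - miss_rate * orb_pct)

# a high-FG%, low-ORB% team has short possessions (faster pace) ...
print(attempts_per_possession(fg_pct=0.50, orb_pct=0.20))
# ... while a low-FG%, high-ORB% team has longer ones (slower pace)
print(attempts_per_possession(fg_pct=0.42, orb_pct=0.35))
```

So under these toy assumptions, shooting and rebounding rates do mechanically move possession length, and hence pace, even though efficiency itself is measured per possession.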

guy -- I think you're overthinking it. If a team increases their number of possessions per minute, that doesn't affect their efficiency -- efficiency is calculated as pts/possession. In the same way ERA isn't affected by a pitcher getting more innings.

"If a team increases their number of possessions per minute, that doesn't affect their efficiency"

I'm arguing the reverse: that certain components of efficiency will impact possessions/minute. For example, if a team grabs a lot of OREBs, it will have more long possessions and that will reduce pace. Now, maybe this is a small factor compared to other sources of variance in pace -- I don't know. But I don't think the metrics are fully independent. To take an extreme case, a high FG%/low ORB%/high OppFG%/high DRB% team should have a lot of possessions per minute.

Guy -- I assumed that you were saying that there was a necessary connection between efficiency and pace. But that isn't what you're saying, so never mind what I said earlier. As to what you actually said, yes, certain offensive strategies are correlated with pace. But from a coaching perspective, the causality goes the other way: a coach decides he wants a fast-paced offense -- how does he maximise efficiency in that context? From that perspective, the only thing that really matters is that offensive and defensive efficiency are not necessarily affected (which they aren't). Without this necessary effect, you cannot justify pace-adjusting efficiency.

I know very little about basketball, although your suicide example violates a couple pre-conditions for correlation. Suicide is a dichotomous variable. Correlation is a game played between at least ordinal scale variables (better if they're interval, but such is life). Also, 1 suicide and 999 live-out-the-rest-of-natural-life folks is a highly skewed sample. This could certainly be run by an irresponsible statistician and mis-interpreted by a novice reader, but this example is more of a case of the inappropriate use of a statistical technique when a better one (t-test?) is available. It's the GIGO rule. Garbage in, garbage out.