Thursday, January 20, 2011

Sabermetric basketball statistics are too flawed to work

You know all those player evaluation statistics in basketball, like "Wins Produced," "Player Efficiency Rating," and so forth? I don't think they work. I've been thinking about it, and I don't trust any of them enough to put much faith in their results.

That's the opposite of how I feel about baseball. For baseball, if the sportswriter consensus is that player A is an excellent offensive player, but it turns out his OPS is a mediocre .700, I'm going to trust OPS. But, for basketball, if the sportswriters say a guy's good, but his "Wins Produced" is just average, I might be inclined to trust the sportswriters.

I don't think the stats work well enough to be useful.

I'm willing to be proven wrong. A lot of basketball analysts, all of whom know a lot more about basketball than I do (and many of whom are a lot smarter than I am), will disagree. I know they'll disagree because they do, in fact, use the stats. So, there are probably arguments I haven't considered. Let me know what those are, and let me know if you think my own logic is flawed.

------

The most obvious problem is rebounds, which I've posted about many times (including these posts over the last couple of weeks). The problem is that a large proportion of rebounds are "taken" from teammates, in the sense that if the player credited with the rebound hadn't got it, another teammate would have.

We don't know the exact numbers, but maybe 70% of defensive and 50% of offensive rebounds are taken from a teammate's total.

More importantly, it's not random, and it's not the same for all players. Some rebounders will cover much more of other players' territory than others. So when player X has a huge rebounding total, we don't know whether he's just good at rebounding, whether he's just taking rebounds from teammates, or whether it's some combination of the two.

So, even if we decide to take 70% of every defensive rebound, and assign it to teammates, we don't know that's the right number for the particular team and rebounder. This would lead to potentially large errors in player evaluations.
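To make the sensitivity concrete, here's a tiny sketch (with invented numbers, not figures from the post) of how much a player's credited rebounds swing depending on what "diminishment" rate you assume:

```python
# Illustrative only: how the assumed rate of rebounds "taken from teammates"
# changes a player's credited total. The 800-rebound season and the rates
# are hypothetical numbers, not data from the post.

def credited_rebounds(raw_def_reb, taken_from_teammates_rate):
    """Credit only the share of rebounds that wouldn't have gone to a teammate anyway."""
    return raw_def_reb * (1.0 - taken_from_teammates_rate)

raw = 800  # a big defensive-rebounding season (hypothetical)
for rate in (0.50, 0.70, 0.90):
    print(f"assumed rate {rate:.0%}: credit {credited_rebounds(raw, rate):.0f} rebounds")
```

If the true rate for this player is 50% but the metric assumes 70%, the player is shorted 160 rebounds of credit; that's the "potentially large error" in the paragraph above.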

The bottom line: we know exactly what a rebound is worth for a team, but we don't know which players are responsible, in what proportion, for the team's overall performance.

------

Now, that's just rebounds. If that were all there were, we could just leave that out of the statistic, and go with what we have. But there's a similar problem with shooting accuracy.

I ran the same test for shooting that I ran for rebounds. For the 2008-09 season, I ran a regression for each of the five positions. Each row of the regression was a single team for that year, and I checked how each position's shooting (measured by eFG%) affected the average of the other four positions (a simple average, not weighted by attempts).

It turns out that there is a strong positive correlation in shooting percentage among teammates. If one teammate shoots accurately, the rest of the team gets carried along.

Here are the numbers (updated, see end of post):

PG: slope 0.30, correlation 0.63
SG: slope 0.40, correlation 0.62
SF: slope 0.26, correlation 0.27
PF: slope 0.28, correlation 0.27
C: slope 0.27, correlation 0.43

To read one line off the chart: for every one percentage point increase in shooting percentage by the SF (say, from 47% to 48%), you saw an increase of 0.26 percentage points in each of his teammates (say, from 47% to 47.26%).
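For readers who want to replicate the setup, here's a minimal sketch of the regression described above. The data here is randomly generated (with a shared team effect baked in, so teammates correlate); the actual post used real 2008-09 team-by-position eFG% data:

```python
# Sketch of the team-level regression: for each position, regress the simple
# average eFG% of the other four positions on that position's eFG%.
# One row per team. All data below is synthetic, for illustration only.
import random
from statistics import mean

random.seed(0)
POSITIONS = ["PG", "SG", "SF", "PF", "C"]

# Synthetic 30-team league: each position's eFG% is a shared team effect
# plus individual noise, so teammates' percentages correlate.
teams = []
for _ in range(30):
    team_effect = random.gauss(0, 0.02)
    teams.append({p: 0.48 + team_effect + random.gauss(0, 0.02) for p in POSITIONS})

def slope_and_corr(xs, ys):
    """Ordinary least-squares slope and Pearson correlation of ys on xs."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / vx, cov / (vx ** 0.5 * vy ** 0.5)

for pos in POSITIONS:
    x = [t[pos] for t in teams]
    y = [mean(t[p] for p in POSITIONS if p != pos) for t in teams]
    b, r = slope_and_corr(x, y)
    print(f"{pos}: slope {b:.2f}, correlation {r:.2f}")
```

With real data, a positive slope is exactly the "teammates get carried along" effect in the table above.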

The coefficients are a lot more important than they look at first glance, because they represent a change in the average of all four teammates. Suppose all five players took the same number of shots (which they don't, but never mind right now). That means that when the SF makes one extra field goal, each teammate also makes an extra 0.26, for a total of 1.04 extra field goals across his four teammates.

That's a huge effect.

And, it makes sense, if my logic is right (correct me if I'm wrong). Suppose you have a team where everyone has a talent of .450, but then you get a new guy on the team (player X) with a talent of .550. You're going to want him to shoot more often than the other players. For instance, if X and another guy (call him Y) are equally open for a roughly equal shot, you're going to want to give the ball to X. Even if Y is a little more open than X, you'll figure that X will still outshoot Y -- maybe not .550 to .450, but, in this situation, maybe .500 to .450. So X gets the ball more often.

But, then, the defense will concentrate a little more on X, and a little less on the .450 guys. That means X might see his percentage drop from .550 to .500, say. But the extra attention to X creates more open shots for the .450 guys, and they improve to (say) .480 each.

Most of the new statistics simply treat FG% as if it's solely the achievement of the player taking the shot, when, it seems, it is very significantly influenced by his teammates.

------

Some of that, of course, might be that teams with good players tend to have other good players; that is, it's all correlation, and not causation. But there's evidence that's not the case, as illustrated by a recent debate on the value of Carmelo Anthony.

Last week, Nate Silver showed that if you looked at Carmelo Anthony's teammates' performance, and then looked at that performance when Anthony wasn't on their team, you see a difference of .038 in shooting percentage. That's huge -- about 15 wins a season.

Dave Berri responded with three criticisms. First, that Silver weighted by player instead of by game; second, that Silver hadn't considered the age of the teammates (since very young players improve anyway as they get older); and, third, that if you control for age and a bunch of other things, the results aren't statistically significantly different from zero. (However, Berri didn't post the full regression results, and did not claim that his estimate was different from .038.)

Finally, over at Basketball Prospectus, Kevin Pelton ran a similar analysis, but within games instead of between seasons (which eliminates the age problem, and a bunch of other possible confounding variables). He found a difference of .028. Not quite as high as Silver, but still pretty impressive. Furthermore, a similar analysis of all of Anthony's career shows similar improvements in team performance, which suggests the effect is real.
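The arithmetic behind these with/without comparisons is simple; here's a sketch with invented shot totals (not Anthony's actual numbers) showing the kind of split Silver and Pelton computed:

```python
# Sketch of a with/without-the-star comparison: pool a teammate's shooting
# in the two conditions and take the difference in eFG%.
# All shot counts below are invented for illustration.

def efg(fgm, fg3m, fga):
    """Effective field-goal percentage: (FGM + 0.5 * 3PM) / FGA."""
    return (fgm + 0.5 * fg3m) / fga

# (FGM, 3PM, FGA) for a hypothetical teammate, star on court vs. star off
with_star = (300, 40, 620)
without_star = (120, 15, 270)

diff = efg(*with_star) - efg(*without_star)
print(f"eFG% with star: {efg(*with_star):.3f}, "
      f"without: {efg(*without_star):.3f}, difference: {diff:+.3f}")
```

A difference on the order of .028 to .038, sustained over a career's worth of attempts, is what makes the effect look real rather than noise.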

To be clear, this kind of analysis is the kind that, I'd argue, works great -- comparing the team's performance with the player and without him. What I think *doesn't* work is just using the raw shooting percentages. Because how do you know what those percentages mean? Suppose one team is all at .460, and another team is all at .490. The .490 means that you have more players on the team above average than below average. But, the above average players are lifting the percentages of the below average players, and the below-average players are reducing the percentages of the above-average players. But which are which? We have no way of telling.

Here's a hockey example. Of Luc Robitaille's eight highest-scoring NHL seasons, six of them came while he was a teammate of Wayne Gretzky. In 1990-91, Robitaille finished with 101 points. How much of the credit for those points do you give to Robitaille, and how much of the credit do you give to Gretzky? There's no way to tell from the single-season raw totals, is there? You have to know something about Robitaille, and Gretzky, and the rest of their careers, before you can give a decent estimate. And your estimate will be that Gretzky should get some of the credit for Robitaille's performance.

Similarly, when Carmelo Anthony increases all his teammates' shooting percentages by 30 points, *and it's the teammates that get most of that credit* ... that's a serious problem with the stat, isn't it?

------

So far, we've only found problems with two components of player performance -- rebounds and shooting percentage. However, those are the two biggest factors that go into a player's evaluation. And, additionally, you could argue that the same thing applies to some of the other stats.

For instance, blocked shots: those are primarily a function of opportunity, aren't they? Some players take a lot more shots than others, so the guy who defends against Allen Iverson is going to block a lot more shots than his teammates, all else being equal.

------

Still, it could be possible that the problems aren't that big, and that, while the new statistics aren't perfect, they're still better than existing statistics. That's quite reasonable. However, I think that, given the obvious problems, the burden of proof shifts to those who maintain the stats still work.

The one piece of evidence that I know of, with regard to that issue, is the famous study from David Lewin and Dan Rosenbaum. It's called "The Pot Calling the Kettle Black – Are NBA Statistical Models More Irrational than 'Irrational' Decision Makers?" (I wrote about it here; you can find it online here; and you can read a David Berri critique of it here.)

What Lewin and Rosenbaum did was try to predict how teams would perform in a given season, based on their players' statistics from the previous season. If the new sabermetric statistics were better evaluators of talent than, say, just points per game, they should predict better.

In their results, "minutes per game" -- which is probably the closest representation you can get to what the coach thinks of a player's skill -- was the second highest of all the measures. And the new stats were nothing special, although "Alternate Win Score" did come out on top. Notably, even "points per game," widely derided by most analysts, finished better than PER and Berri's "Wins Produced."

When this study came out, I thought part of the problem was that the new statistics don't measure defense, but "minutes per game" does, in a roundabout way (good defensive players will be given more minutes by their coach). I still think that. But, now, I think part of the problem is that the new statistics don't properly measure offense, either. They just aren't able to do a good job of judging how much of the team's offensive performance to allocate to the individual players.

Now that I think I understand why Lewin and Rosenbaum got the results they did, I have come to agree with their conclusions. Correct me if I'm wrong, but logic and evidence seem to say that sabermetric basketball statistics simply do not work very well for players.

-----

UPDATE: some commenters in the blogosphere are assuming that I mean that sabermetric research can't work for basketball. That's not what I mean. I'm referring here only to the "formula" type stats.

I think the "plus-minus"-type approaches, like those in the Carmelo Anthony section of the post above, are quite valid, if you have a big enough sample to be meaningful.

But, just picking up a box score or looking up standard player stats online, and trying to figure out from that which players are better than others, and by how much (the approach that "Wins Produced" and other stats take) ... well, I don't think you're ever going to be able to make that work.

UPDATE: I found a slight problem with the data: one team was missing and one team I entered twice. I've updated the post. The conclusions don't change.

For the record, the wrong slopes were .30/.39/.31/.25/.24. The corrected slopes, as above, are .30/.40/.26/.28/.27.

The wrong correlations were .59/.58/.37/.26/.40. The corrected correlations are .63/.62/.27/.27/.43.

33 Comments:

Phil, I was wondering if you have seen any sabermetric-type research on soccer. This came to mind when I was reading an op-ed on how the U.S. was 26th out of 32 teams in the World Cup in completed passing %. The guy's conclusion was that if the U.S. wants to seriously contend then they must improve on this stat. So much was flawed about this argument that I was flabbergasted. Can you point me to any articles or sites that try to use sabermetric principles to explain soccer success?

I think the "big question" you're asking is not whether or not "sabermetric style" statistics and analysis can be applied to basketball -- I think it can -- but which measurements are useful. Someone like Wayne Winston would probably argue that you're absolutely correct that there's too much noise and outside influence in box score statistics to massage them into something useful. However, I think he'd also argue that you can massage +/-, adjusting for opponent, and get an idea of which five-man units work well together and which don't. It'd be nice to know WHY Landry Fields makes such a huge difference for the Knicks, but until we get a better handle on how to measure his individual contributions, we can see that the five-man units on which he plays are significantly better than the five-man units without him.

J-Doug: If you're saying that team rebounds never show up in any player's stats, then they shouldn't affect my results, because I never use them. For team REBs in my previous posts, I used the sum of the five positions.

Phil, +/- approaches can't really do much within a single season--the multicollinearity of the sample just kills it. Take a look at RAPM for a ridge-regression approach, which is about as good as you can do with raw +/- data. That's at http://stats-for-the-nba.appspot.com/, though no standard errors are available since it's a brute-force method.

Perhaps the best approach to take, one that hasn't been done as far as I know, would be to use a box-score statistic to inform a Bayesian prior that would then be used to stabilize the adjusted +/- regression.
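[For readers unfamiliar with the technique this comment describes, here is a minimal sketch of ridge-regression adjusted plus-minus. Everything below is synthetic toy data; RAPM implementations like the one linked above use real stint-level data and much larger samples.]

```python
# Toy ridge-regression adjusted plus-minus. Each row is a "stint": the
# columns indicate which players are on the floor, and the target is the
# point margin for that stint. The ridge penalty is what tames the
# multicollinearity mentioned above. All numbers here are invented.
import numpy as np

rng = np.random.default_rng(1)
n_players, n_stints = 8, 200
true_value = rng.normal(0, 2, n_players)          # hypothetical player values

# Design matrix: +1 / -1 / 0 for on-floor-for, on-floor-against, off-floor
X = rng.choice([-1, 0, 1], size=(n_stints, n_players)).astype(float)
y = X @ true_value + rng.normal(0, 5, n_stints)   # noisy stint margins

lam = 10.0  # ridge penalty: shrinks collinear, high-variance estimates toward 0
beta = np.linalg.solve(X.T @ X + lam * np.eye(n_players), X.T @ y)
print(np.round(beta, 2))
```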

Phil - I think you credit the Rosenbaum/Lewin study too much. Their methodology for comparing the various 'statistics' is flawed, and does not mirror Wins Produced in the way they think it does. Your other concerns may be valid, but hopefully that paper doesn't inform your opinion too much.

Hey, I have a question. I was wondering if there is also a flaw in the plus/minus. I generally like the stat much more than others, but say there was a player that received many more minutes than some others on his team, and because he received more minutes he spent a lot more time on the court playing without his team's other best players. He would end up playing a lot with the bench, who are bound to have lower plus/minus ratings, and therefore, even if he was helping to keep the bench's ratings higher, this player's rating would actually drop. I assume there is a way to break it down and see that it was clearly affected by those teammates, but how would that be done? And could it be done in a way that would accurately judge the influence of his play on the team's play while he was on the court?

Phil: Although Rosenbaum and Lewin found that advanced metrics didn't improve prediction of future wins, PER, Efficiency, and Alternate WS all did a better job than MP at predicting players' plus-minus ratings. The idea here is that although PM is very noisy for individual players, it should represent an unbiased estimate of player productivity in the aggregate. Doesn't that suggest the metrics are capturing valuable information about individual player contributions?

Alex: I'm flabbergasted. You wrote a whole post at your blog on the Rosenbaum/Lewin paper, demonstrating mainly that you don't understand the paper or its methodology. You confirmed their core point, but then reported your result as refuting the paper! Now you want to continue criticizing a paper you still don't get? This is verging on self-parody....

Also take a look at predicting current season record here (Jan 16 post): http://sonicscentral.com/apbrmetrics/viewtopic.php?t=2618&start=90. It appears that several metrics (including Dsmok1's) do a much better job than just regressing prior season record. Of course it's just one partial season, so maybe none will outperform in the long run. But I think some will.

You'd *expect* the metrics to beat playing time. In the Lewin/Rosenbaum study, they didn't (in the aggregate) when measured less precisely. In these other studies, they do when measured more precisely.

Fair enough. But that's a pretty low hurdle. Batting average will predict next year's wins (and next year's WAR) better than playing time. Points in hockey will predict next year's wins better than playing time. That doesn't mean those are wonderful stats.

Suppose one team is 20 FG above average and another team is 20 below. If a metric assigns those to the wrong players, then every player will be wrongly evaluated: but the correlation to next season will still be positive, because, overall, you got the direction right. That's not a huge victory for the metric.

Not that I'm saying the improved correlation doesn't matter at all, but, as I say, it's a low hurdle for the stat to have cleared.
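[A toy simulation makes the point above concrete: even if every player-level credit is assigned wrongly, team totals are untouched, so any team-level correlation survives. All values below are synthetic.]

```python
# Sketch: randomly reassign each team's total credit among its players.
# Every individual rating becomes wrong, but the team total is preserved,
# so team-level predictions look fine anyway. Synthetic values throughout.
import random

random.seed(42)
# 10 teams x 5 players: each player's "true" value above/below average
teams = [[random.gauss(0, 5) for _ in range(5)] for _ in range(10)]

def misattribute(team):
    """Reassign the team's total credit randomly among its players."""
    total = sum(team)
    weights = [random.random() for _ in team]
    scale = total / sum(weights)
    return [w * scale for w in weights]

for team in teams:
    wrong = misattribute(team)
    # individual credits differ, but the team total is preserved exactly
    assert abs(sum(wrong) - sum(team)) < 1e-9
print("team totals preserved despite wrong player-level attribution")
```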

Phil, to be (even) more rigorous about it, wouldn't it make sense to look at the change in eFG% of teammates relative to when that player is on/off the floor?

When the player is on the floor, teammates will have a certain eFG%. When that player is off the floor, if he has a large effect, teammates eFG% should change more. Isn't *that* the effect we are interested in? This would separate teams that simply have a lot of good players from teams that rely on a single player to lift teammates (say Cleveland circa 2009).

Seems to me this is more along the lines of the recent studies by Silver, Pelton, etc..

Sure, you could do more complicated "plus-minus" type analyses to get the answer, too. But those are more work, and I don't have the data.

And whatever comes out of studies like that won't contradict the simpler studies. The only way I can think of that things could come out differently is if teams do in fact vary so much that good players very, very often play with other good players on the team.

That's the only way I can think of that your proposed study would contradict this one, but maybe you can think of others.

Why couldn't you use some form of usage rate to deal with the problem of shooting efficiency? You said, in your hypothetical scenario, that a team would want a player with more talent to shoot more than the other players. Why not take that into account?

Part of the problem is that the linear roll-up metrics in question here - WP48, PER, Win Score - well, they're kind of primitive and out-dated by today's standards.

PER and WP were both more or less finished ten years ago. Win Score is just a simple at-a-glance formula which doesn't claim to capture player skill in any sophisticated way.

Win Shares, the other oft-used roll-up metric, is explicitly an estimate of an estimate. It assigns a lot of team defensive performance to individuals based on minutes played because the field goal defense stats you'd need to do it properly aren't recorded anywhere for free.

The other part of the problem is how aggressively marketed Wins Produced is. Hollinger will readily admit that defense is not captured by PER because it's not captured by box score statistics. The Wages of Wins bunch will adamantly insist the opposite.

I think the simplest way of summarizing all of this is that the flaw in applying sabermetric tools to basketball largely rests on the assumption that the events measured are independent random events. In baseball, that's close enough to true to be a useful assumption; in basketball, it's fairly obviously flawed, due to the more dynamic nature of the game relative to the stop-start nature of baseball.

Everyone implicitly realizes this, as you don't get people saying things like "Ray Allen shooting a 3 is the best possible outcome for an offensive possession for the Celtics, therefore, every possession should end in a 3 by Allen," because it's immediately evident that such a "pure" strategy doesn't reflect the way basketball is actually played. But people haven't really internalized the implications as they continue to try and decontextualize events which rely heavily on context for their meaning and/or predictive value.

The proper use of statistics in evaluating the worth of baseball players is, quite literally, a whole different ball-game from attempts to do likewise for basketball players.

In basketball, the best 'sabremetricians', so-to-speak, have ALWAYS been the real-life elite level coaches who understand how the game actually works in a different way from almost everyone else, based in large part on scores and scores of individual match-ups and mis-matches. Cheers

I agree with Jacob. Throughout this whole process, you've set up basically a straw man by looking at win score, PER, etc. *Real* APBRmetricians do not take those metrics seriously. They are, as I've said before, novelty stats designed to sell books. I resent the fact that you have besmirched the reputation of NBA stat analysis by taking down these straw men, because all the average person will see is your hyperbolic blanket statements about how stat analysis doesn't work for basketball. You have made the job that much harder for the rest of us who are using legitimate metrics and doing legitimate research instead of trying to sell books.

As DSMok1 stated, a box-score-based Bayesian prior for adjusted plus-minus increases predictive power greatly when combined with APM. Also, a method of adjusting Oliver's offensive rating for usage using skill-curve tradeoffs was validated by Goldman's research at the Sloan convention. I agree that this isn't baseball, but there are some box score methods that do add to our understanding. The problem I suppose is that the most high-profile metrics -- PER, Wins Produced -- are garbage, and almost all APBR people will tell you that. Instead of taking down the straw men of those metrics, you should have dug deeper and highlighted the better metrics.

It is, though. You take adjusted +/-, regress box score stats onto it and come up with a predicted +/- based only on box score stats. This explains only 25-30% of defense but 70% of offense, and adds to the predictive accuracy of raw adjusted +/-, especially for high standard error players. That approach is actually empirical as well, unlike PER where the weights are made-up.

OK, I'm not a statistician, but here is the statistic I think I would want: something that measures the success (in points generated) of the offensive possessions in which the player touches the ball. That would certainly show the differences between PGs, but it would also indicate more than the coach's decision whether to have a player on the floor: it would reflect the other players' decisions whether or not to involve that player in a play, and how successful the results are if that player has his hands on the ball at all during a possession. A turnover can be attributed to the last player who had the ball, but sometimes it's the fault of the most recent passer. The +/- takes into account defense, but doesn't take into account the player's involvement with offensive success. Showing offensive success per minute a player is on the court is good, but perhaps we should be able to subtract from that the possessions in which that player didn't touch the ball.