The Randomness of Ratings

Fresh off a Boston Celtics sweep of the Indiana Pacers, phrases like “Remember that the Boston starters were a -3.5 net rating for the series” piped up across Twitter. The implication was that the Boston starters should have lost the series, and the most common rumblings blamed the depth of Indiana’s bench for costing them the series. There may be some truth to that. But my immediate reaction was “So? -3.5 is almost nothing at all.” After all, I wasn’t given how many possessions were played or who was on the court for Indiana during those possessions. Even then, it’s a small sample size and can be blanketed by noise.

So let’s break down what makes a -3.5 rating…

Recall that net rating is calculated as

Net Rating = Offensive Rating - Defensive Rating = 100 x (Points Scored / Offensive Possessions) - 100 x (Points Allowed / Defensive Possessions)

This is just the difference of offensive and defensive ratings. It is merely a linear stretching of points per possession out to per 100 possessions, to give the effect of these players playing a whole game at this uniform consistency. And that’s okay; it’s mainly there for readers to digest the information in an easier manner.

Rarely does a rotation play many more possessions of one type than the other, particularly within a four-game series. We typically see three to four stints per game for a starting rotation. Rake that over four games, and we expect the starters to play 12-16 stints. Therefore, at its worst, the difference between offensive and defensive possessions would be 32 possessions. In reality, it’s much closer to zero.

Using these facts, we can begin to construct what a -3.5 rating really means: a differential of -0.035 points per possession. What does this number actually mean? Roughly every 28 possessions played, the Boston starters needed an extra offensive possession to match what their defense was giving up. Does this mean the Boston starters were outscored? Without extra information, possibly.

Example: Suppose the Boston starters have 114 offensive possessions to Indiana’s 109, with a final score of 110 – 109. The starters outscore their competition while maintaining a -3.5 net rating.
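The arithmetic of the example is easy to verify. A minimal check, using only the hypothetical box score from the example (these are illustrative numbers, not actual series data):

```python
# Hypothetical box score from the example above (not real series data).
bos_pts, bos_poss = 110, 114   # Boston points and offensive possessions
ind_pts, ind_poss = 109, 109   # Indiana points and offensive possessions

ortg = 100 * bos_pts / bos_poss   # Boston's offensive rating: ~96.49
drtg = 100 * ind_pts / ind_poss   # points allowed per 100 possessions: 100.0
net = ortg - drtg
print(round(net, 2))              # → -3.51
```

Boston wins on the scoreboard, 110 to 109, while carrying a -3.5 net rating purely because they needed five extra possessions to get there.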

While this may not be the reality of the Boston starters, the point here is to not fall into the trap of comparing ratings without context.

Negative Ratings can be Wins; But They’re Also Random

A bigger challenge with ratings is the randomness of it all. Over the past couple of years, different methods of smoothing have been used to reduce the noise in ratings. One of the most-used forms is luck-adjusted rating. Even this is just a regression methodology at the zeroth-order level with a few first-order effects mixed in. Other models such as Adjusted Plus-Minus and all of its various add-ons/follow-ons/hierarchical or Bayesian updates/etc. are again just regression methods applied at the first-order level. Interaction methods developed by folks like myself or a couple of my past collaborators (and teams) are still just regression methods applied at higher-order levels. The point is, every single one of these methods treats stints as observations and then applies the smoothing at the response level. Every one of the methods above is a marked improvement over citing raw net ratings, but even they fail at capturing the randomness of an actual stint.

Let’s take a deep look at a single stint from the Boston-Indiana series.

Game Three Starters: Boston 29, Indiana 18

At the start of game three, the Celtics lit up the floor by scoring on 12 of their first 18 possessions to race out to a 29-18 lead. Buoyed by five three-point field goals, Boston maintained an offensive rating of 161.11 for their first stint. In contrast, the Pacers spent half their possessions coming up empty through bad passes and missed field goals, converting only 44% of their possessions into field goals en route to 18 points; an offensive rating of 100.00. The differential suggests the Celtics had a net rating of 61.11, indicating the starters were vastly superior to their opponents. A little troubling for a team that ended up at -3.5 when all was said and done.

For this stint, the distribution of points per possession is given as

Boston Celtics

0 points: 6 possessions

1 point: 0 possessions

2 points: 7 possessions

3 points: 5 possessions

Indiana Pacers

0 points: 9 possessions

1 point: 1 possession

2 points: 7 possessions

3 points: 1 possession
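These counts reproduce the stint ratings exactly; a quick sanity check:

```python
# Points-per-possession counts from the stint, as listed above.
points = [0, 1, 2, 3]
boston = [6, 0, 7, 5]    # possessions yielding 0/1/2/3 points
indiana = [9, 1, 7, 1]

def off_rating(counts):
    """Offensive rating: points per possession, stretched to 100 possessions."""
    poss = sum(counts)
    pts = sum(p * c for p, c in zip(points, counts))
    return 100 * pts / poss

bos, ind = off_rating(boston), off_rating(indiana)
print(round(bos, 2), round(ind, 2), round(bos - ind, 2))
# → 161.11 100.0 61.11
```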

Let’s play a little game with this “training data.”

Replay the First 18 Possessions One Million Times

Supposing the distribution of points scored per possession is given by the Celtics-Pacers stint above, we can simulate the 18-possession stint over and over to understand the randomness of the data. Of course, we assume there is noise in the above data, so we will apply a basic Bayesian filter for multinomial data. Furthermore, we won’t even apply luck adjustments, to bias everything we can towards Boston.

The idea here is to look at a net rating and understand, given the randomness of scoring, how noisy that rating really is.

Here, we apply a simple algorithm that samples the distribution of points scored from the multinomial-Dirichlet model trained on the Celtics’ +61.11 net rating stint.
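A minimal sketch of such a replay. The flat Dirichlet(1, 1, 1, 1) prior here is an assumption on my part; the exact prior and any further adjustments in the original model aren't stated, so the numbers this sketch produces will differ from the figures quoted in the text:

```python
import numpy as np

rng = np.random.default_rng(42)
POINTS = np.array([0, 1, 2, 3])
N_POSS = 18

# Observed counts of 0/1/2/3-point possessions from the stint above.
BOSTON = np.array([6, 0, 7, 5])
INDIANA = np.array([9, 1, 7, 1])

def replay(counts, n_sims=200_000, prior=1.0):
    """Dirichlet-multinomial replay of the 18-possession stint.

    Draws a scoring distribution from the Dirichlet posterior (flat
    prior is an assumption), then plays 18 possessions from each draw.
    Returns total points per simulated stint."""
    probs = rng.dirichlet(counts + prior, size=n_sims)   # (n_sims, 4)
    # Sample each possession by inverting the cumulative distribution.
    u = rng.random((n_sims, N_POSS))
    cum = probs.cumsum(axis=1)
    outcome = (u[:, :, None] > cum[:, None, :]).sum(axis=2)
    return POINTS[outcome].sum(axis=1)

bos_pts = replay(BOSTON)
ind_pts = replay(INDIANA)
p_pacers_win = (ind_pts > bos_pts).mean()
net = (bos_pts - ind_pts) / N_POSS * 100
print(f"P(Pacers outscore the stint): {p_pacers_win:.3f}")
print(f"Net rating mean {net.mean():.1f}, sd {net.std():.1f}")
```

Note that the flat prior widens both teams' scoring distributions, so this sketch hands the Pacers more stint wins than the 5.2% quoted below; replaying straight from the raw observed counts, with no smoothing at all, lands much closer to that figure.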

Running the simulation, we see that even with this absurd differential, the Pacers are expected to win more than 5% of these stints! The probability of a Pacers win under these scoring distributions is 5.2%. Now, this doesn’t mean that when Boston posts a +61.11 net rating, the Pacers will win 5% of the time. It means that when Boston plays like a +61.11 net rating team, the Pacers are still expected to win more than 5% of the time.

Therefore, the net rating doesn’t indicate that Boston is 61 points better; it’s merely a symptom of whatever the true net rating is. In fact, let’s take a look at the distribution of offensive ratings:

Distribution of Offensive Ratings for the Celtics (Green) and the Pacers (Blue) by replaying the stint 1,000,000 times using the +61.11 Boston net rating as a seed.

We see there is significant overlap in the two distributions. In fact, to illustrate the symptom effect described above, Indiana played at a 72.7 offensive rating, yet latched onto a 100.00 offensive rating. Similarly, Boston’s distribution of scoring reflects a 131.84 offensive rating despite the 161 that was posted. What this shows is that the teams are symptomatic of “luck.”

(Note: For those fully aware of statistical analysis and the resulting continuity correction applied by the Dirichlet-multinomial model above, luck is being defined as points over/under expectation, inflated in small-probability regions. In this case, that means free throws and three-point field goals; hence the drops just noted.)

The more important takeaway is that the style of play from Boston led to a larger variance in play. That is, their ratings have a standard deviation of 28 points. Compare this to the Pacers’ much smaller 20 points, and we see that ratings follow a heteroskedastic process.
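The stint-level spread can be checked directly from the scoring distributions. This is a plug-in calculation with no prior smoothing (a simplification; the smoothed model in the text will give slightly different numbers), treating each possession as an i.i.d. draw from the observed distribution:

```python
import math

points = [0, 1, 2, 3]

def stint_rating_sd(counts, n_poss=18):
    """Analytic sd of a stint's offensive rating when each possession is
    an i.i.d. draw from the observed scoring distribution (no smoothing)."""
    n = sum(counts)
    mean = sum(p * c for p, c in zip(points, counts)) / n
    var = sum(p * p * c for p, c in zip(points, counts)) / n - mean ** 2
    # rating = 100 * (mean points per possession), so its sd over a stint
    # scales as 100 * sd(points) / sqrt(n_poss)
    return 100 * math.sqrt(var / n_poss)

print(round(stint_rating_sd([6, 0, 7, 5]), 1))  # Boston: ~28.5
print(round(stint_rating_sd([9, 1, 7, 1]), 1))  # Indiana: ~24.8
```

This unsmoothed calculation lands right on the 28 points quoted for Boston; Indiana's comes out nearer 25 than 20, a gap that presumably reflects the smoothing in the full model. Either way, Boston's three-point-heavy profile produces the clearly wider spread.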

With that in mind, we can look at the net ratings for the Boston starters:

Net rating for the Boston starters, trained off the first stint in Game 3 against the Indiana Pacers.

What ends up happening is the phenomenon that beats up most regression analyses on ratings: skewness. Here, we can actually see the skewness, as the distribution is left-tailed. In fact, due to randomness, we see that a game with a given true net rating of +61.11 could produce a net rating of -100.

So What’s the Point?

The point here is, a -3.5 net rating is relatively meaningless. It’s just another descriptive number that needs a lot more context. Negative net ratings still produce wins. That’s a problem when trying to understand how well a unit works together.

Furthermore, even if a very high net rating is used as truth, we can still get wildly varying net ratings.

In fact, a former Sloan presenter once told me that “six possessions are enough to invoke the Central Limit Theorem,” which I’ve never seen to be true. Above is yet another example where we triple that size and still get heavy skewness in the results: using the tests derived at Columbia University, the skewness for this sample is strong, with a p-value of 4.38 x 10^-29 for one million samples.
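The skew is visible even in a bare-bones version of the replay. Here is a plug-in simulation straight from the observed counts (no Dirichlet smoothing, a simplification of the model above), with the sample skewness computed by hand rather than through a formal test:

```python
import numpy as np

rng = np.random.default_rng(7)
points = np.array([0, 1, 2, 3])

# Raw per-possession scoring probabilities from the stint's counts.
bos_p = np.array([6, 0, 7, 5]) / 18
ind_p = np.array([9, 1, 7, 1]) / 18

# Replay the 18-possession stint many times for each team.
n = 100_000
bos = rng.multinomial(18, bos_p, size=n) @ points
ind = rng.multinomial(18, ind_p, size=n) @ points
net = (bos - ind) / 18 * 100

# Fisher-Pearson sample skewness of the simulated net ratings.
m2 = ((net - net.mean()) ** 2).mean()
m3 = ((net - net.mean()) ** 3).mean()
skew = m3 / m2 ** 1.5
print(round(skew, 2))
```

Even this stripped-down replay comes out left-skewed: Boston's three-point-heavy distribution caps out above but leaves a long tail of cold stints below.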

Lastly, ratings are heteroskedastic, meaning every regression model poorly reduces noise if heteroskedasticity is not taken into account.

How We Remedy This…

More importantly, the argument is to recognize that ratings are symptoms of other phenomena. Instead, we should focus on transactional interactions, such as the actions and scenarios that feed into points per possession from possession to possession. This isn’t to suggest using a singular points-per-possession number, but rather to develop an artificial-intelligence-based approach to understanding the decision-making process of a collective unit given the state of the gaming system.

Currently, several teams are approaching this venture. Some of it is developed on play-to-play analysis, such as live- and dead-ball turnovers, thanks to Mike Beuoy and Seth Partnow. Some is developed from tracking, such as quantifying actions as competing-risk models, thanks to Dan Cervone. These are just a handful of examples in existence, and even then they struggle to maintain fidelity to the game; a fact of the ever-changing landscape of how points are scored.

Until we are able to represent the stochastic partial differential equation that defines basketball, we are left nibbling at its edges with summary statistics, regression models, and partial “solutions.” And that’s okay for now.

Just remember that a 61.11 positive net rating match-up is expected to lose over 5% of the time.