Tuesday, August 16, 2011

The Tango method of regression to the mean -- a proof

To go from a record of performance to an estimate of a team's talent, you have to regress its winning percentage towards the mean. How do you figure out how much to regress?

Tango has often given these instructions:

-----

1. First, figure out the standard deviation of team performance. For MLB, for all teams playing at least 160 games up until 2009, that figure is 0.070 (about 11.34 wins per 162 games).

Second, figure out the theoretical standard deviation of luck over a season, using the binomial approximation to normal. That's estimated by the formula

Square root of (p(1-p)/g)

For baseball, p = .500 (since the average team must be .500), and g = 162. So the SD of luck works out to about 0.039 (6.36 games per season).

So SD(performance) = 0.070, and SD(luck) = 0.039. Square those numbers to get var(performance) and var(luck). Then, if luck is independent of talent, we get

var(performance) = var(talent) + var(luck)

That means var(talent) equals 0.058 squared, so SD(talent) = 0.058.

2. Now, find the number of games for which the SD(luck) equals SD(talent), or 0.058. It turns out that's about 74 games, because the square root of (p(1-p))/74 is approximately equal to 0.058.

3. That number, 74, is your "answer". So, now, any time you want to regress a team's record to the mean, take 74 games of .500 ball (37-37), and add them to the actual performance. The result is your best estimate of the team's talent.

For instance, suppose your team goes 100-62. What's its expected talent? Adjust the record to 137-99. That gives an estimated talent of .581, or 94-68.

Or, suppose your team starts 2-6. Adjust it to 39-43. That's an estimated talent of .476, or 77-85.

-----
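Tango's recipe above can be sketched in a few lines of code. This is just a rough sketch of the steps as described; all the numbers are the ones from the post:

```python
import math

# Step 1: SD of talent from SD of performance and SD of luck
sd_performance = 0.070                    # observed SD of team winning percentage
p, g = 0.500, 162                         # league average, games per season
sd_luck = math.sqrt(p * (1 - p) / g)      # ~0.039
var_talent = sd_performance**2 - sd_luck**2
sd_talent = math.sqrt(var_talent)         # ~0.058

# Step 2: number of games for which SD(luck) equals SD(talent)
games_to_add = round(p * (1 - p) / var_talent)   # ~74

# Step 3: regress by adding that many games of .500 ball
def estimate_talent(wins, losses, mean=0.500):
    """Add games_to_add games of average ball to the observed record."""
    return (wins + mean * games_to_add) / (wins + losses + games_to_add)

print(games_to_add)                        # 74
print(round(estimate_talent(100, 62), 3))  # 100-62 team -> 0.581
print(round(estimate_talent(2, 6), 3))     # 2-6 start   -> 0.476
```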

Those estimates seemed reasonable to me, but I often wondered: does this really work? Is it really true that you can add 74 games to a 162 game season, and it'll work, but you can also add 74 games to an 8 game season, and that'll work too? Surely you want to add fewer .500 games when your original sample is smaller, no?

And why always add the exact number of games that makes the talent SD equal to the luck SD? Is it a rule of thumb? Is it a guess? Again, that can't be the mathematically best way, can it?

It can, actually. I spent a couple of hours doing some algebra, and it turns out that Tango's method is exactly right. I was very surprised. Also, I don't know how Tango figured it out ... maybe he used an easier, more intuitive way to figure out that it works than going through a bunch of algebra.

But I can't find one, so let me take you through the algebra, if you care. Tango, is there an obvious explanation for why this works, more obvious than what I've done?

Let v^2 =var(overall), and let t^2 = var(talent). Also, let "g" be the number of games.

From the binomial approximation to normal, we know var(luck) = (.25/g). So

v^2 = t^2 + .25/g .... (equation 1)

where v = SD(overall), t = SD(talent), and sqr(.25/g) = SD(luck).

Suppose you run a regression on overall outcome vs. talent. The variance of talent is t^2. The variance of overall outcome is v^2. Therefore, we know that talent will explain t^2/v^2 of the variance of outcome, so the r-squared we get out of the regression will be t^2/v^2. That means the correlation coefficient, "r", will be equal to the square root of that, or t/v.

There's a general property of regression that applies here: if we want to predict talent from outcome, and the observed outcome is y standard deviations from the mean, then our estimate of talent will be y(t/v) -- that is, y times r -- standard deviations from the mean. That's true for any regression of two variables.

So:

Expected talent = average + (number of SDs outcome is away from the mean) * (t/v) * (SD of talent)

That last equation means that when we look at how far the observation is from average, we "keep" t^2/v^2 of the difference, and regress to the mean by the rest. In other words, we regress to the mean by (1 - t^2/v^2), or "(100 * (1 - t^2/v^2)) percent".

Now, if we regress to the mean by (1 - t^2/v^2), that's exactly the same as averaging t^2/v^2 parts observation with (1 - t^2/v^2) parts mean.

For instance, if you're regressing one-third of the way to the mean, you can do it two ways. You can (a) move from the average to the observation, and then move the other way by 1/3 of the difference, or (b) you can just take an average of two parts original and one part mean.
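The equivalence of (a) and (b) is easy to check with toy arithmetic (my own numbers, just for illustration):

```python
mean = 0.500
observed = 0.620   # a hypothetical observed winning percentage

# (a) move to the observation, then come back 1/3 of the difference
way_a = mean + (2/3) * (observed - mean)

# (b) average two parts observation, one part mean
way_b = (2 * observed + 1 * mean) / 3

print(round(way_a, 3), round(way_b, 3))   # both 0.58
```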

But how does that translate, in practical terms, into how many games of average performance we need to add?

From above, we know that:

For every t^2/v^2 games of observed performance, we want (1 - t^2/v^2) games of average performance.

And now a little algebra:

For every 1 game of observed performance, we want (1 - t^2/v^2)/(t^2/v^2) games of average performance.

Simplifying gives,

For every game of observed performance, we want (v^2-t^2)/t^2 games of average performance.

Multiply by g:

For every "g" games of observed performance, we want g(v^2-t^2)/t^2 games of average performance.

But, from equation 1, we know that (v^2-t^2) is just the squared SD of luck, which is .25/g. So,

For every "g" games of observed performance, we want g(.25/g)/t^2 games of average performance.

The "g"s cancel, and we get,

For every "g" games of observed performance, we want .25/t^2 games of average performance.

And that doesn't depend on g! So no matter whether you're regressing a team over 1 game, or 10 games, or 20 games, or 162 games, you can always add *the same number of average games* and get the right answer! I wouldn't have guessed that.
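Here's a quick numerical check of that result (my own sketch; the 0.058 is from the post). For any g, regressing by (1 - t^2/v^2) gives the same answer as adding a fixed .25/t^2 games of .500 ball:

```python
t2 = 0.058**2                  # var(talent), from the post
k = 0.25 / t2                  # games of .500 ball to add, ~74

for g in (1, 10, 20, 162):
    v2 = t2 + 0.25 / g         # var(observed) over g games
    observed = 0.600           # hypothetical observed winning percentage

    # regression approach: keep t2/v2 of the deviation from .500
    regressed = 0.500 + (t2 / v2) * (observed - 0.500)

    # add-games approach: add k games of .500 ball
    added = (observed * g + 0.500 * k) / (g + k)

    assert abs(regressed - added) < 1e-9
    print(g, round(regressed, 4))
```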

--------

But how many games? Well, it's (.25/t^2) games.

For baseball, we calculated earlier that t = 0.058. So .25/t^2 equals ... 74 games. Exactly as Tango said, the number of games we're adding is exactly the number of games for which SD(luck) equals SD(talent)!

Is that a coincidence? No, it's not. It's the way it has to be. Why? Here's a semi-intuitive explanation.

As we saw above, the number of games we have to add does NOT depend on the number of games we started with in the observed W-L record. So, we can pick any number of games. Suppose we just happened to start with 74 games -- maybe a team that was 40-34, or something.

Now, for that team, the SD of its talent is 0.058. And, the SD of its luck is also 0.058. Therefore, if we were to do a regression of talent vs. observed, we would necessarily come up with an r-squared of 0.5 -- since the variances of talent and luck are exactly equal, talent explains half of the total variance.

That means the correlation coefficient, r, is the square root of 0.5, or 1 divided by the square root of 2. For every SD change in performance, we predict 1/sqr(2) SD change in talent. But the SD of talent is exactly 1/sqr(2) times the SD of performance. Multiply those two 1/sqr(2)'s together and you get 1/2, which means for every win change in performance, we predict 1/2 win change in talent.

That's another way of saying that we want to regress exactly halfway back to the mean. That, in turn, is the equivalent of averaging one part observation, and one part mean. Since we have 74 games of observation, we need to add 74 games of mean.

So, in the case of "starting with 74 games of observation," the answer is, "we need to add 74 games of .500 to properly regress to the mean."

However, we showed above that we want to add the *same* number of .500 games regardless of how many observed games we started with. Since this case works out to 74 games, *all* situations must work out to 74 games.

QED, I guess.

--------

And, of course, and again as Tango has pointed out, this works for *any* binomial variable, like batting average or hockey save percentage. The only thing you have to keep in mind is that the ".25" in the formula for luck is based on an average being .500. It's really p(1-p), which works out to .25 if your p equals .500. If your p doesn't equal .500, use p(1-p) instead. So, in hockey, where a typical save percentage is .880, use (.880)(.120) = .1056 instead.
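As a sketch, here's the same arithmetic for save percentage. Note that the SD(talent) figure below is hypothetical, purely for illustration; it is not from the post:

```python
p = 0.880                        # league-average save percentage
var_luck_per_shot = p * (1 - p)  # 0.1056, replaces the .25 in the formula

sd_talent = 0.010                # hypothetical SD of goalie save-pct talent
shots_to_add = var_luck_per_shot / sd_talent**2

# number of shots of league-average goaltending to add to a goalie's record
print(round(shots_to_add))       # 1056
```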

--------

Sorry this is so ugly to read in blog form. Maybe I'll make the equations nicer and rerun this in "By the Numbers." Let me know if I've done anything wrong, or if I've just duplicated Tango's proof. For all I know, Tango has already explained all this somewhere else.

But this is still kind of complicated. Tango, do you have a more intuitive explanation of why this works, one that doesn't need all this algebra?

--------

(Update, 11:30pm: part of the explanation above "QED" was wrong ... now fixed.)

14 Comments:

Re: number of games to add being constant: if the team has played very few games, that's almost no information about the team's true talent, so the talent estimate should be assumed to be close to average/dominated by the regression term. If the team has played a lot of games, the talent estimate should be dominated by the actual record. So it's not so surprising that the regression number of games is constant.

Re: picking the number of regression games such that SD(luck) = SD(talent): I'm not a statistician, but I'd guess that there is some elegant explanation in terms of maximum likelihood, since we're trying to find the point on the talent distribution that best explains our team's current performance.

I didn't read everything, but this looks very similar to a technique used by actuaries (at least it's part of the standard exams -- not sure how often it's used in practice): "Buhlmann Credibility". If you pick up a book by Buhlmann (A Course in Credibility Theory and its Applications) it should have a proof. Loss Models (Wiley Series) also has a proof. This model states that the best "reasonable estimate" for a given group is:

x' = u * (v/a)/(n + v/a) + (mean of x) * n/(n + v/a)

where:

v = expected value of the process variance (the natural variation of results for a given team -- generally binomial for sports teams' winning probabilities)

a = variance of the hypothetical means (the "true" variance between teams, excluding natural variation)

u = the entire population's mean (expected value of the hypothetical means) -- for sports teams this should be 50%

Generally speaking, Var(total) = v + a. (It gets a lot more complicated when these things are estimated.)

For sports teams v = p(1-p), so it is easy. "a" is a little harder, but can be derived from Var(total).
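A quick sketch (my own, not the commenter's) confirming that the Buhlmann formula above gives the same answer as Tango's add-74-games rule, using the numbers from the post:

```python
p = 0.500
v = p * (1 - p)        # expected process variance per game (binomial)
a = 0.058**2           # variance of the hypothetical means, i.e. var(talent)
k = v / a              # Buhlmann credibility constant, ~74

n, wins = 162, 100     # a 100-62 team
x_bar = wins / n
u = 0.500              # population mean

# Buhlmann estimate: credibility-weighted average of sample mean and population mean
credibility = n / (n + k)
estimate = credibility * x_bar + (1 - credibility) * u

# Tango's rule: add k games of .500 ball to the record
tango = (wins + u * k) / (n + k)

assert abs(estimate - tango) < 1e-12
print(round(estimate, 3))
```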

That the number of games has to be constant is intuitive in retrospect (I claim) because it obviously works. It's equivalent to a team that's already won and lost a certain number of games, and we know how to (maximum likelihood) estimate the talent by dividing wins by games. The rule works, and there's only one correct answer, so no other rule works.

Whether it's prospectively intuitive I'm not sure, but maybe it kinda is because it can't matter if you update your estimate in one step or two steps (and it doesn't matter what order), and we know that if we just did wins/games then number of steps and order don't matter.

Say an MLB team wins W-W-L. If you update their estimate in one step, its

If you flipped a weighted coin one hundred times and got 60 heads, what is the weighting of the coin?

Obviously we can't say for sure. The best guess is that it's weighted to flip heads 60% of the time. But it may be you just got lucky with a 50% coin, or unlucky with a 68% coin.

The first equation in my comment, the binomial likelihood eqn, is the mathematical representation of that.

If that was all we knew we'd be done. But, if I knew that you pulled the coin from a hat, and I knew the sd of the distribution of coin weightings of all the money that was in said hat (equation 2 in my comment)... then we're not done.

We have to multiply 1 and 2 together.

That's what your math is doing as well, just less explicitly. Agree?

If you do it graphically, by turning those two curves into bar charts ... you would barely need a calculator. And it would become apparent quickly that the form we've chosen for the distribution (Gaussian) probably isn't right, just convenient.

As Vic mentioned, understanding this subject as a standard Bayesian inference model is much more intuitive.

The phrase "regression to the mean", though we all use it, is actually a little misleading. We are not really taking the observed data and regressing it to the mean. We are taking the prior and adjusting it based on new information.

We are not finding the "probability of the data, given the prior" (that would be a standard frequentist p-value), we are finding the "probability of the prior, given the data."

So the original query of why we add the same constant number to our sample size, regardless of whether our sample is large or small, should be thought of the other way around.

In other words, we are starting with an original assumption, and that number is large or small depending on how strong the assumption is. And then we adjust that assumption as we observe new data. If we observe only a little bit of new data, then we adjust the assumption only slightly. If we observe a lot of new data, then we adjust the assumption more significantly.

So in this case, we are starting with the assumption of "baseball teams are .500 in talent". (Whatever percentage you choose is referred to as "eta" in the Bayesian beta model.)

We are also putting an estimate on the strength of our assumption by essentially describing "how many games worth of .500" you think the teams are. You've deemed that number to be 74 here. (This value is referred to as "K" in the Bayesian beta model.)

I think it's important to remember that these eta and K values are subjective. Nothing in the universe makes them inherently "correct" or "incorrect." They are assumptions designed to make your model as predictive as possible. They might be awesome assumptions, perhaps the most predictive that anyone has yet shown, or they might be shit. You haven't really broached that subject here. (Subtracting binomial variances to obtain the assumptions is a method that has empirically given decent results with certain baseball stats, but in no way is it "certain truth" that a particular sports skill must follow the mathematical laws of the binomial distribution.)

Regardless, what follows after the assumptions is just math. You have your 74 and .500 numbers as your prior, and if you now observe a team play 10 games at a .600 clip, you slightly adjust your prior upwards to get a talent estimate for that team. If you observe a team play 162 games at a .600 clip, you adjust your prior more heavily to get their talent estimate.
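That update can be sketched numerically. This is just an illustration of the mechanics described above, with K = 74 and eta = .500, treating the prior as 37 pseudo-wins and 37 pseudo-losses:

```python
K, eta = 74, 0.500
prior_wins = K * eta             # 37 pseudo-wins
prior_losses = K * (1 - eta)     # 37 pseudo-losses

def posterior_mean(wins, losses):
    """Mean of the beta posterior after observing a W-L record."""
    return (prior_wins + wins) / (K + wins + losses)

print(round(posterior_mean(6, 4), 3))     # 10 games at .600 -> 0.512
print(round(posterior_mean(97, 65), 3))   # 162 games at ~.600 -> 0.568
```

As described: a little new data barely moves the prior, a lot of new data moves it much more.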

That helps ... the Bayesian way of thinking about it does make it a bit more intuitive.

However, there's still this: how is it that you can start with 37-37, and just add each game to the W-L record?

That is, suppose the first game is a W. You adjust your Bayesian estimate, and, by coincidence, it goes from the equivalent of 37-37 to 38-37. Then you win again, and do the Bayesian calculation again, and the expected value of your posterior distribution goes from 38-37 to 39-37. And so on.

Wow! That's a coincidence that isn't intuitively obvious. More generally, why does "start with a certain number of wins and losses and then just adjust the W-L record" work at all?

I mean, I understand the math, but it isn't apparently obvious why something so simple would be the solution to the Bayesian process.

I see what you're saying. I think it becomes more obvious if you consider the original assumption. Specifically, that we are treating baseball games as Bernoulli trials (binomial and independent).

To use Vic's example, let's say a coin is weighted and you don't know the weighting. You flip it three times and get one head. So right now your best guess is that the coin's true talent heads% is .333. You flip it again and get tails. What is your best guess now? .250, of course. How did you arrive at that answer? By simply adding to the numerator and/or denominator.

But now let's say I tell you that before I gave you the coin, I actually flipped it myself 74 times and got 37 heads. How will you incorporate this information? The same way as before, by just adding to the numerator and/or denominator.

That's essentially all a prior is. It can be thought of as a prior experiment, with its own sample size (K) and mean result (eta). Jim Albert himself explained it to me like that, and it turned on a light bulb in my head.

That example makes absolute sense. It's a good intuitive explanation of why, when you start with a uniform prior, your best estimate of the mean of the posterior is just the observed success rate h/N.

However, now, instead of a uniform prior, your prior is normally distributed N~(.500, .058). Why does it now (intuitively) follow that the mean of the posterior is (h+37)/(N+37)?

Just to reiterate, I agree with you and Vic that the Bayesian way is a nice way to look at this ... my wonderment is that the result comes out so neat and clean, whereas the Bayesian math is complicated. Usually where there's a simple answer, there's a simple way of looking at things. Usually, but not always.

The math distills down to a simple equation because we destined it to do so, with our choice of assumptions (in the case of my math below, the assumption that league-wide team talent is distributed in beta form; in your post, the subtle conversion of the binomial probability into its Gaussian cousin, and team talent of the same form).

It's the E=mc^2 principle: Intuitively we believe that if a bigass page of math reduces down to trivial arithmetic ... it is surely profound. That's the sleight of hand at play here, Phil.

It's a bit like showing a 'proof' (to use Tango's term) for intelligent design theory. It only works if you embrace the initial premise, that being there is an all knowing God.

I don't know if the Sabermetrics religion needs an atheist, Phil. Probably best to leave well enough alone. Still, if there is to be one, you'd be a terrific choice imo. Clearly Mehta feels the same. Though in both our cases I suspect it is a passing fancy.

Indulge us, Phil, just throw the high level math to the curb on this one, look at your thinking again and execute it with first principles and no assumptions. Nuts and bolts stuff.

I'm almost sure we'll convert you to atheism eventually, Phil. I'm almost equally sure it won't happen on this particular occasion. And I'm near positive that we'll regret it once successful.

Such is life. It's just the freaking Internet anyways. And I always enjoy your writing, terrific critical thinking with every read imo. And while I can't imagine why you would give two craps about my opinion, I think you're as cool as Christ, Phil. Keep up the great work.

There is a proof that demonstrates that the equation I provided in comment #2 is equivalent to the bayesian results under specified conditions (namely prior and model distributions are members of a linear exponential family of functions).