Resident Fantasy Genius

When Hitters' Stats Stabilize

Four years ago, former BPer Russell Carleton (then monikered “Pizza Cutter”) ran a study at the now-defunct MVN’s StatSpeak blog that examined how long it takes for different stats to “stabilize.” Since then, it has become perhaps the most-referenced study in our little corner of the internet.

It has been a while since the initial study was run, and since there are a few little pieces of the methodology that I believe could be improved, I decided to run a similar study myself.

Methodology

The methodology I used was a combination of the one used by Russell, a similar methodology used by Harry Pavlidis at The Hardball Times, and some improvements suggested by Tom Tango.

Like Russell and Harry, I ran a split-half correlation on a number of stats, which means that I would take, say, 100 plate appearances from every hitter, split them into two 50 PA groups for each hitter, and run a correlation between the two groups. Unlike Russell and Harry’s initial studies (and per Tango’s suggestion), I randomized selected PAs to put into each group instead of going odds and evens, which eliminates accidentally catching parks, opponents, etc. in the correlation.
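The random split described above can be sketched in a few lines. This is an illustrative reconstruction, not the author's actual code; the per-PA outcome arrays (1 for a strikeout, 0 otherwise, say) are a hypothetical input format.

```python
import numpy as np

rng = np.random.default_rng(0)

def split_half_r(pa_outcomes):
    """Split-half correlation for one stat.

    pa_outcomes: dict mapping player -> array of per-PA outcomes
    (e.g. 1 for a strikeout, 0 otherwise). PAs are assigned to the
    two halves at random rather than odds-and-evens, so parks,
    opponents, etc. don't line up across the halves.
    """
    half_a, half_b = [], []
    for outcomes in pa_outcomes.values():
        shuffled = rng.permutation(outcomes)
        mid = len(shuffled) // 2
        half_a.append(shuffled[:mid].mean())         # rate in half A
        half_b.append(shuffled[mid:2 * mid].mean())  # rate in half B
    return np.corrcoef(half_a, half_b)[0, 1]
```

Running this on, say, 100 PA per hitter gives the correlation between two random 50-PA halves, which is the R the rest of the article talks about.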

Taking a step back for a second, the purpose of the study is to find the point at which each stat produces an R of 0.50. It's at this point that we can predict 50 percent of the future variation in a stat. Put another way, if a stat stabilizes at 100 PA and we have 100 PA of data for a player, we'd use a 50/50 split of the player's actual data and the league average (or whatever mean we choose) to estimate his true talent for that stat. The more data we observe for a player, the less league average we use.
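The 50/50 blend at the stabilization point is just the standard shrinkage estimate. A minimal sketch (the article leaves the choice of mean open, so `league_rate` here stands in for "whatever mean we choose"):

```python
def true_talent_estimate(observed_rate, league_rate, n, stab_n):
    """Blend a player's observed rate with the league mean.

    The weight on the player's own rate is n / (n + stab_n), so at
    exactly the stabilization point (n == stab_n) the split is 50/50,
    and more observed data shifts the weight toward the player.
    """
    w = n / (n + stab_n)
    return w * observed_rate + (1 - w) * league_rate

# A hitter with a .300 K rate over 100 PA, league average .200,
# K rate stabilizing at 100 PA: 0.5 * .300 + 0.5 * .200 = .250
estimate = true_talent_estimate(0.300, 0.200, 100, 100)
```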

Russell ran the correlations at several different intervals and stopped once he reached his stabilization point. But per Tango’s suggestion, I ran the correlation at about 15 different intervals because even if the R isn’t 0.50 at a particular interval, based on what the R is, we can estimate at what interval the R would be 0.50. I took all of these implied intervals and weighted them by the sample size at each interval to arrive at a final result.
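The article doesn't spell out the formula for the implied interval, but under the usual model where the correlation behaves like R = n / (n + x), the point x at which R would be 0.50 can be backed out from any observed (n, R) pair. A hedged sketch of that step and the weighted average:

```python
def implied_stabilization(n, r):
    """If R = n / (n + x), then solving for x gives
    x = n * (1 - r) / r, the sample size at which R would be 0.50."""
    return n * (1 - r) / r

def weighted_stabilization(intervals):
    """intervals: list of (sample_size, observed_r) pairs.

    Weight each interval's implied stabilization point by its
    sample size, as the article describes."""
    total_w = sum(n for n, _ in intervals)
    return sum(n * implied_stabilization(n, r)
               for n, r in intervals) / total_w
```

For example, an observed R of 0.375 at 60 PA implies stabilization at 60 * 0.625 / 0.375 = 100 PA, the same point as an R of 0.50 at 100 PA.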

Another change to the initial methodology is the denominators I use for each stat. Russell used plate appearances as his denominator for every stat, but I think we'd be better off using an individualized denominator for each stat. As an example, Adam Dunn has put the ball into play 48 percent of the time this year, while A.J. Pierzynski has put the ball into play 88 percent of the time. If we’re trying to figure out how long it will take their BABIPs to stabilize, how can we expect them to stabilize after the same number of plate appearances? Pierzynski is putting far more balls into play and will therefore see his BABIP stabilize sooner.

Finally, Russell’s study was done four years ago, so he had much less data to work with than I do now. In my study, I’ve used 11 years (2000-2010) of data for most stats and as much as possible for batted-ball types (2004-2010 for Retrosheet, 2005-2010 for MLBAM).

I had considered adjusting for various contextual factors (ballparks, opponents, umpires, weather, etc.) but decided that it would be most useful not to. When you, the readers, are looking at stats online trying to make decisions for your fantasy team, you’re not going to be making these adjustments, so I didn’t want these figures to be reflective of them. But if we were using these for a projection system, we would want to account for context.

The Results for Hitters

Here are the results of running these tests for a number of stats. The following table has the stat in the first column, the denominator used in the second, how many units of that denominator before it stabilizes in the third, and approximately how many years that translates to in the fourth.

To read the “Stabilizes” column—the one we most care about—we would say, “Strikeout rate, defined as K/(PA-IBB-HBP), stabilizes after 100 PA-IBB-HBP.”

The “Years” column is for a league-average player (assuming 650 PA is one full season) and will sometimes vary significantly from player to player. It’s there as a quick-and-dirty way to make comparisons between stats since they’re all using different denominators. This allows us to say, “OK, strikeout rate takes about one-fifth of a year to stabilize, but singles rate takes over two years.”

*Stolen-base success rate came back inconclusive. It’s a stat that we know has a lot of random variation in it, but this method proved unsuccessful in putting a number on it. At every interval, the correlation was between 0.10 and 0.15, so each interval implied a wildly different stabilization figure, ranging from 30 attempts to 450 attempts. The weighted average was 94 attempts, but with such a wide range of outcomes, I don’t trust that number.

Most of these findings are similar to Russell’s. Things like strikeout rate and walk rate stabilize pretty quickly, while things like hits take a while. Home runs stabilize quicker than some might believe, and hit-by-pitches actually stabilize in under a year, which I found moderately surprising.

When looking at this table, you likely noticed two sets of batted balls. Since batted-ball types are subjective and classified at the discretion of the scorer, data providers will classify balls differently. I ran these tests on the classifications used by Retrosheet (RS) and by MLBAM. It’s interesting to note that all of the MLBAM batted-ball types stabilize a bit sooner than the Retrosheet types, which might indicate that MLBAM scores them more accurately than Retrosheet does (or at least has done so over the past six years or so—it’s possible things have systematically changed over that time; also, I used slightly different time frames, so it’s not a perfect apples-to-apples comparison).

I’d love to run these tests using Baseball Info Solutions classifications—another major batted-ball provider—but unfortunately I don’t have access to the necessary data. If someone from BIS is reading this and would be curious to know, feel free to get in touch with me and I’d be happy to run the data.

It’s also interesting to note the difference between what I found for line-drive rate and what Russell found. Russell found line-drive rate to stabilize the quickest of all batted-ball rates, but here it’s the slowest for both Retrosheet and MLBAM (which jibes with common wisdom and with numerous less-rigorous tests that have been conducted over the past couple of years). Ground balls stabilize very quickly, and both categories of fly balls come shortly after. Infield flies are often an afterthought when analyzing a player, but they stabilize quickly and should be acknowledged.

Correcting Misconceptions about the Use of these Numbers

While Russell’s original study was linked all over the place, people who use these numbers are sometimes confused about what they truly mean and how to use them properly.

First, note that these numbers are not magic. Once a hitter reaches 100 PA-IBB-HBP, his strikeout rate doesn’t suddenly become a perfect representation of his talent; 99 PA-IBB-HBP tell us almost exactly as much. One writer described the concept of stats stabilizing as “pretty simple—at a certain threshold of either plate appearances (for hitters) or batters faced (for pitchers) a number will stabilize such that it can be taken at close to face value.” That’s not true, though. Once a hitter reaches a threshold, his rate still can’t be taken at face value. At any particular threshold, we still need to include 50 percent of mean performance to get the most accurate representation of the player.

Another way writers often use these numbers is to say that once a hitter reaches a threshold in a particular year, it becomes safe to assume that he has reached a new level of production. But that’s also not true. Just because a hitter reaches, say, a 100 PA threshold doesn’t mean that every plate appearance before the most recent 100 are meaningless. Recent performance is more important than older performance, but older performance still matters. What we should do is weight the older performance less, include it with the recent performance, and then use our threshold to determine how much mean performance we need. So while a player may have just reached the 100 PA threshold, he may have 300 effective plate appearances once we account for past data, in which case we’d use 75 percent of the player and 25 percent of the mean.
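The "effective plate appearances" idea above can be made concrete. The discount factors here are purely illustrative (the article doesn't specify how much to down-weight older performance):

```python
def effective_pa(pa_by_recency, weights):
    """Discount older performance when pooling it with recent data.

    pa_by_recency: PA counts from most recent to oldest.
    weights: illustrative discount factors, 1.0 = full weight.
    """
    return sum(w * pa for w, pa in zip(weights, pa_by_recency))

# Illustrative numbers: 100 recent PA at full weight plus 400 older
# PA discounted to half weight gives 300 "effective" PA. With a
# 100-PA stabilization point, the blend is 300 / (300 + 100) = 0.75,
# i.e. 75 percent player and 25 percent mean, as in the paragraph above.
eff = effective_pa([100, 400], [1.0, 0.5])
w = eff / (eff + 100)
```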

---

Next week, I’ll take a look at how fast stats stabilize for pitchers. If anyone has any questions, as always, feel free to get in touch with me via e-mail, Twitter, or Facebook.

Interesting analysis. I haven't fully absorbed all of it yet but it is food for thought. One small question - what is the rationale behind the "....we’d use 75 percent of the player and 25 percent of the mean." or more specifically, how was the 75/25 split derived?

I used the regression to the mean equation n/(n + x) where n is the sample size and x is a constant. So if a player has 300 PA (n) and it stabilizes after 100 PA (x), it would be 300/(300+100) = 75%. So we'd use 75% player and 25% mean.

I could do this, but it wouldn't tell us anything new, really. What's important is where the r-squared is 0.50, because that's where it's easiest to run our regression equation (see my response to the first comment). But once we've run the study, we can find out where the r-squared would be at any number other than 0.50.

With an equation of n/(n + x), it would be .75 when n is 3 times greater than x. It would be .90 when n is 9 times greater than x. It would be .95 when n is 19 times greater than x. So for Ks, it would be .75 at 300, .90 at 900, and .95 at 1,900.

Derek, why did you use .5 R^2 and not .5 R? I was under the impression that .5 R would show the point where you would regress half-way to league-average going forward...?

Here's how I interpret what this means, and I agree the whole "stabilizes" shtick is very misleading: When a player has this many plate appearances (or balls in play or whatever), then his results thus far make for as good a projection going forward as simply using league-average rates. With fewer PAs, league average is a better forecast, while with more PAs, actual results are a better forecast.

Of course, we tend to ignore past seasons too much, and a forecast that includes data beyond the current season is probably your best bet.

One minor edit. In my original piece, I looked for where the R passed .70, which is (roughly) an R^2 of 50%. At that point, a projection of talent incorporating regression to the mean would be 70% performance and 30% league average (or whatever mean you want to use).

As a reader, I would like to see articles like this one end with a couple of short examples of how this analysis can help us think about a particular player. I don't always read the methodology closely, but I would like to know what it means for how to understand players. Applying the study's insight to analyze a case or two would make me much more likely to click on methodological articles in the future.

The analysis behind the in-season PECOTA forecasts is a bit more rigorous, although it's along the same lines. We don't actually need to regress the in-season PECOTAs to the mean, exactly.

PECOTA isn't just projecting a player's performance as a point forecast, but as a distribution. We can use that distribution as a Bayesian prior in order to update the forecast - by combining the observed 2011 results with that distribution, we get an updated forecast. Since regression to the mean is accounted for in the pre-season forecast, the updated distribution accounts for regression.

(And yes, it is done by component - so in-season results on things like K-rate are considered more predictive than results on balls in play.)

It's very clear that the first three things to look at are K%, UBB%, and HR/Contact (and interesting to see that the latter generally stabilizes before HR/OFFB, which means that variations in OFFB rate do not generally bring with them the baseline HR/OFFB rate).

It's quite unclear how to best break down balls in play, however. You could do 1B/BIP and XBH/BIP independently, and they both stabilize before BABIP. If you did BABIP first, you would want to look next at XBH / Hits in Play. Alternatively, you could first do XBH / BIP and then do 1B / (1B + Outs in Play), the question being whether that stabilizes before 1B / BIP.

And then there's the very interesting question of whether finding a stat that stabilizes very *slowly* is actually desirable, because that way you can isolate the luck. It's at least informative. If in fact 1B / (1B + Outs in Play) stabilizes even slower than BABIP, then we've demonstrated that most of the luck on BIP resides in the singles, since the former removes XBH entirely.