
Moonshot: Projecting Uncertainty

There are two important aspects of prediction. The first concerns the accuracy of the prediction—that is, how close a prediction is to the actual, observed result. The second is uncertainty, which is how sure a forecaster is about his or her projection.[1] These issues are fundamental forecasting concepts, and similarly apply to predictions of the weather, the stock market, or the outcome of tomorrow’s ballgame. At present, only one of these facets of a prediction gets much attention in the world of baseball projections, and that is accuracy. Accuracy is measured by the absolute error, which defines how close, on average, a forecast is to the actual, observed result. Projectionists struggle primarily to minimize this number.[2]
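As a minimal sketch of the accuracy measure described above, the snippet below computes the mean absolute error next to the root mean squared error for a handful of hitters. The OPS values and function names are made up for illustration and are not taken from any projection system.

```python
import math

def mean_absolute_error(projected, observed):
    """Average absolute gap between projections and observed results."""
    return sum(abs(p - o) for p, o in zip(projected, observed)) / len(projected)

def root_mean_squared_error(projected, observed):
    """Square root of the average squared gap; large misses weigh more."""
    squared = [(p - o) ** 2 for p, o in zip(projected, observed)]
    return math.sqrt(sum(squared) / len(squared))

# Hypothetical OPS projections and results for four hitters (invented numbers).
projected_ops = [0.750, 0.800, 0.700, 0.820]
observed_ops = [0.760, 0.790, 0.710, 0.620]  # the last hitter is a large miss

mae = mean_absolute_error(projected_ops, observed_ops)
rmse = root_mean_squared_error(projected_ops, observed_ops)
print(f"MAE:  {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
```

With one large miss in the sample, the squared-error measure comes out well above the absolute-error measure, which is the sense in which absolute error gives less weight to extreme misses.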

The under-examined facet of prediction that we will address in this article is the uncertainty. Whereas we know that predictions tend to be accurate to within a hundred or so points of OPS, we would also like to know whether we are more or less likely to be wrong on certain players. The uncertainty is often treated as a second-order concern because it is usually more difficult to estimate. However, as we show, it is possible to predict ahead of time which players’ forecasts are more uncertain than others. This concept is important because certain teams may prefer high versus low-risk players—a team with high win expectations (90+ wins) might prefer to reduce risk, whereas a middle-of-the-road team (80-85 wins) would presumably seek risk in order to “get lucky” and reach the postseason.

For this study, we investigated the three major slash statistics: batting average, on-base percentage, and slugging percentage. We chose these three because combined they do well at describing a hitter’s abilities. In order to predict uncertainty on a player-by-player basis, we looked for possible correlates in a player’s statistics.

The first place we looked was PECOTA. PECOTA possesses an underappreciated feature: In addition to a weighted mean projection for each player, PECOTA produces a range of predictions (percentiles), along with the probability of each outcome (a 90th percentile projection means that 90 percent of the time, a player’s actual performance is expected to fall below this line). Some players have wide-ranging percentile predictions, while others have narrower ones; these percentiles constitute a direct prediction of PECOTA’s own uncertainty. Steamer possesses a similar feature, so we also included its percentile forecasts. Other measures of uncertainty we considered include Tango’s reliability score, the total number and standard deviation of forecasts submitted to the Baseball Projection Project, a player’s experience in terms of the number of seasons he had played, and a player’s career variability in terms of the standard deviation of his performance in each statistic.
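One way to turn a set of percentile forecasts into a single uncertainty number is to take their standard deviation, which is the summary used for the percentile predictors later in this article. The sketch below assumes nine percentile forecasts (10th through 90th) per player; the batting averages are invented and do not come from PECOTA or Steamer.

```python
import statistics

# Hypothetical percentile forecasts of batting average (10th..90th percentile)
# for two hitters with the same median projection of .280.
percentiles_steady = [0.262, 0.268, 0.272, 0.276, 0.280, 0.284, 0.288, 0.292, 0.298]
percentiles_volatile = [0.220, 0.240, 0.255, 0.268, 0.280, 0.292, 0.305, 0.320, 0.340]

def percentile_spread(percentile_forecasts):
    """Sample standard deviation across one player's percentile forecasts."""
    return statistics.stdev(percentile_forecasts)

spread_steady = percentile_spread(percentiles_steady)
spread_volatile = percentile_spread(percentiles_volatile)
print(f"steady hitter spread:   {spread_steady:.4f}")
print(f"volatile hitter spread: {spread_volatile:.4f}")
```

Both hypothetical hitters share a median forecast, so a system that reports only the median would treat them identically; the spread is what separates them.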

The results show that we can make inferences about our own uncertainty for each of the three slash statistics. Linear models using all of the above variables were able to explain significant amounts of the absolute prediction error[3] in AVG, OBP, and SLG. The following tables show the results of our regressions for 413 players in 2014. First, here is the combined regression of all predictors:

And here are the R2 values for each variable individually correlated against the absolute error.

The major contributors to predicting uncertainty in the combined regression were Steamer’s percentile projections and the career number of PAs. Both of these predictors make sense: The longer a player has played, the better our information becomes on that player. Steamer’s percentiles (more precisely, the standard deviation in Steamer’s percentiles) are also able to tell us something about how likely a player is to deviate from his forecasts.
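To make the single-predictor R2 values concrete, here is a minimal sketch of how one predictor (the percentile spread) could be scored against the absolute projection error. For a one-variable linear regression with an intercept, R2 equals the squared correlation, which is what the function computes. The six data points are invented, and this is not the authors' actual code or data.

```python
def r_squared(predictor, outcome):
    """R2 of a one-variable linear regression (squared Pearson correlation)."""
    n = len(predictor)
    mean_x = sum(predictor) / n
    mean_y = sum(outcome) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(predictor, outcome))
    sxx = sum((x - mean_x) ** 2 for x in predictor)
    syy = sum((y - mean_y) ** 2 for y in outcome)
    return sxy ** 2 / (sxx * syy)

# Hypothetical values for six players: the standard deviation of each player's
# percentile forecasts, and the absolute error of his AVG projection.
percentile_spreads = [0.010, 0.015, 0.020, 0.025, 0.030, 0.035]
absolute_errors = [0.012, 0.020, 0.015, 0.030, 0.022, 0.040]

r2_spread = r_squared(percentile_spreads, absolute_errors)
print(f"R2 of percentile spread vs. absolute error: {r2_spread:.3f}")
```

The combined regression would use several predictors at once, which needs a multiple-regression routine rather than this single-variable shortcut.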

Reassuringly, when we excluded Steamer’s percentile projections, the standard deviation in PECOTA’s percentile projections foretold uncertainty in all three statistics in similar ways. When the PECOTA or Steamer percentiles spanned a wide range for a statistic, it suggested that we were likely to be less accurate for that player (that is, there was likely to be a greater absolute error). However, when combined in the same regression, Steamer’s percentiles rendered PECOTA’s information redundant, due to the high (r~.5) correlation between them.

Other possible predictors proved less useful, or useful only for some statistics. Tom Tango’s reliability score did little to foretell prediction error, except when considered individually. The career standard deviation did not improve the model to any significant degree, and the standard deviation between projections was only useful for AVG, not OBP or SLG. Overall, we were best able to predict absolute error in AVG (R2=.2161), followed by OBP (R2=.2085), and SLG (R2=.1707).

In this first case, we used all players who possessed the necessary data, applying no plate appearance cutoff. This choice is the most inclusive, but neglects the fact that variability in each statistic is increased in smaller samples. Accordingly, a player’s projection error is likely to be more extreme in a sample of 100 PAs than it is in 400. By the time one gets to a full season’s worth of plate appearances, some statistics even become reasonably stable.
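The sample-size effect can be illustrated with a simple binomial model. This is a simplifying assumption, not the article's method: it treats every plate appearance as an independent trial with a fixed success rate, and the .260 true rate is hypothetical.

```python
import math

def binomial_sd(true_rate, trials):
    """Standard deviation of an observed rate over `trials` independent tries."""
    return math.sqrt(true_rate * (1 - true_rate) / trials)

TRUE_AVG = 0.260  # hypothetical underlying batting average

sd_100 = binomial_sd(TRUE_AVG, 100)
sd_400 = binomial_sd(TRUE_AVG, 400)
print(f"SD of observed AVG over 100 trials: {sd_100:.4f}")
print(f"SD of observed AVG over 400 trials: {sd_400:.4f}")
```

Under this model the random noise shrinks with the square root of the sample size, so quadrupling the playing time halves the expected scatter around a hitter's true rate.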

To account for this variability problem, we repeated the same analyses while applying a threshold of 400 PAs to the players we modeled. This choice encompasses most of the regular, everyday players in the league. The results for this more restricted subset were considerably less promising: Most predictor variables dropped out of statistical significance, and the total R2 scores fell substantially (R2~.04-.08).

However, there is a hidden problem that necessitates a more nuanced interpretation. As we reduced the sample size of our dataset substantially by applying the PA threshold, we also reduced our ability to detect significant predictors of uncertainty (recall that p-values are functions of both effect size and sample size). As a result, we expect that even if some predictors of uncertainty work to the same degree in the restricted PA sample, they might not be called significant.
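The dependence of significance on sample size can be sketched with the t-statistic for a correlation coefficient, t = r * sqrt((n - 2) / (1 - r^2)). In the example below the correlation of 0.15 and the reduced sample of 120 players are hypothetical; only the full sample of 413 players comes from the article.

```python
import math

def t_statistic(r, n):
    """t-statistic for testing whether a correlation r from n points is zero."""
    return r * math.sqrt((n - 2) / (1 - r ** 2))

R = 0.15  # hypothetical effect size, held fixed across both samples

t_full = t_statistic(R, 413)   # full sample size reported in the article
t_small = t_statistic(R, 120)  # hypothetical size of a 400 PA subset

# A two-sided test at the 5 percent level needs |t| of roughly 2.
print(f"t with 413 players: {t_full:.2f}")
print(f"t with 120 players: {t_small:.2f}")
```

The same effect size clears the usual significance bar in the larger sample but not in the smaller one, which is why a predictor can drop out of significance without becoming any weaker.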

Re-examining the 400 PA uncertainty regressions, we found that percentiles (Steamer and PECOTA alike) remained weakly associated with prediction error for OBP and AVG. This result may suggest that percentiles still have some utility for predicting uncertainty, even for players who achieve high numbers of plate appearances. Career standard deviation proved more useful for the everyday players, as well. (We hope to confirm this by looking at multiple years in follow-up work.) Even so, as noted, the R2 values were much diminished, showing that uncertainty is harder to estimate for everyday players. In the long run, rather than applying plate appearance cutoffs, a more potent strategy might be to regress all observed statistics according to the number of PAs.
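The article does not specify how such a regression would be done. One common approach, shown here only as an assumed sketch, is to shrink each observed rate toward the league average with a weight that grows with plate appearances. The stabilizer constant of 300 PAs and the league OBP of .320 are hypothetical.

```python
def regress_to_mean(observed_rate, plate_appearances, league_rate, stabilizer_pa):
    """Blend an observed rate with the league rate, weighting by playing time."""
    weight = plate_appearances / (plate_appearances + stabilizer_pa)
    return weight * observed_rate + (1 - weight) * league_rate

LEAGUE_OBP = 0.320   # hypothetical league average
STABILIZER_PA = 300  # hypothetical constant; larger values shrink harder

# The same observed .350 OBP is trusted more after 600 PAs than after 100.
obp_after_100 = regress_to_mean(0.350, 100, LEAGUE_OBP, STABILIZER_PA)
obp_after_600 = regress_to_mean(0.350, 600, LEAGUE_OBP, STABILIZER_PA)
print(f"regressed OBP after 100 PAs: {obp_after_100:.4f}")
print(f"regressed OBP after 600 PAs: {obp_after_600:.4f}")
```

Comparing projections against regressed results of this kind would keep part-time players in the sample while damping the small-sample noise in their observed statistics.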

Because we have shown that uncertainty is not the same for all players, and that we can estimate that uncertainty in advance, we think these results have significant implications for understanding projections. Some players, particularly those with less experience, are the most likely to deviate strongly from their projections. Forecasting systems that produce distributions of possible outcomes, such as Steamer and PECOTA, are also able to gauge their own uncertainty to some degree.

Whether variable or unpredictable performance is beneficial to a team depends to a large extent on its level of competitiveness. Good, playoff-contending teams should want to buy players with low uncertainties, as they are most likely to deliver their projected value. Teams on the other end of the win curve, whether rebuilding or poor contenders for the playoffs, should want to buy more uncertain players, for two reasons. First, the more uncertain players are likely to be available at a discount, which will allow these lower-tier contenders to get better value for them. Second, if the players strongly overachieve their projections, they can be sold to contending teams for future value (i.e., young players or prospects). In practice, some baseball analysts already think in these terms to some degree, but our analysis provides rigorous, quantitative support for doing so.
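The risk argument can be made concrete with a toy model. The sketch below assumes a team's final win total is normally distributed around its projection, which is an illustrative assumption rather than anything estimated in this article; the 88-win playoff target, the projections, and the standard deviations are all hypothetical.

```python
from statistics import NormalDist

def chance_of_reaching(target_wins, projected_wins, win_sd):
    """Probability that a normally distributed win total meets the target."""
    return 1 - NormalDist(mu=projected_wins, sigma=win_sd).cdf(target_wins)

PLAYOFF_TARGET = 88  # hypothetical win total needed for the postseason

# A contender projected for 92 wins, built from low-risk or high-risk players.
contender_low_risk = chance_of_reaching(PLAYOFF_TARGET, 92, 4)
contender_high_risk = chance_of_reaching(PLAYOFF_TARGET, 92, 8)

# A fringe team projected for 83 wins, with the same two roster styles.
fringe_low_risk = chance_of_reaching(PLAYOFF_TARGET, 83, 4)
fringe_high_risk = chance_of_reaching(PLAYOFF_TARGET, 83, 8)

print(f"contender: low risk {contender_low_risk:.2f}, high risk {contender_high_risk:.2f}")
print(f"fringe:    low risk {fringe_low_risk:.2f}, high risk {fringe_high_risk:.2f}")
```

When the projection sits above the target, extra variance lowers the chance of reaching it; when the projection sits below the target, extra variance raises that chance.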

We can see abundant examples of this pattern of acquisition in the real business of baseball. The Padres, formerly thought to be poor contenders, have bought low on a series of high-risk acquisitions, like Matt Kemp and Wil Myers, both of whom have shown extremely variable performances in the past, and carry outsized PECOTA percentile projections. If these players overachieve this year, the Padres might be surprise playoff contenders. If, as seems more likely, the Padres fall out of contention by midseason, any of the risky Padres who overachieve can be traded, accelerating the team’s rebuilding schedule.

Speaking of Matt Kemp and Wil Myers, one way this analysis could be improved is by incorporating injury information. Both players, and many others who have variable performances, have sustained debilitating injuries, which can severely reduce playing time or effectiveness. Injuries are a great wild card in projections, and a factor front offices undoubtedly appraise in pricing players (both in free agency and trades).

We have shown that uncertainty varies between players, and can be predicted ahead of time. Percentile projections are a powerful tool in forecasting uncertainty: The greater the standard deviation in percentiles, the more likely a player’s performance is to deviate from his projection. Other useful factors include a player’s career number of PAs, and to a lesser extent, the deviations between forecasting systems. Predicting uncertainty became much harder after applying a PA threshold, but percentiles and career variance still proved somewhat useful. Understanding not only the error in projections, but also how that error is distributed among players, could open the door to a more nuanced appreciation of player valuations and front office strategy.

Will Larson is a Ph.D. economist who moonlights as an amateur baseball statistician. He runs the Baseball Projection Project at www.bbprojectionproject.com. You can tweet to him @larsonwd or visit his personal website at www.williamlarson.com.

[1] More formally, we use “accuracy” to refer to the forecast error, and “uncertainty” to refer to the forecast error variance.

[2] The other popular loss function is the root mean squared error (RMSE). We prefer the absolute forecast error because it downweights extreme “misses” versus RMSE calculations.

[3] Here we contrast the actual, observed result with the median PECOTA prediction. Results were similar when using the average of all prediction algorithms instead of just PECOTA.

I'd guess that there is a dimension of risk tolerance that inhibits non-contenders from employing a "buy variance" strategy. If we borrow from the wealth management sphere and say that the dimensions of risk tolerance are willingness (comfortable rolling the dice) and ability (can afford the monetary losses), the non-contenders are probably a lot more willing than able. Contenders are often more able than willing, and the groups end up with similar levels of risk tolerance.
This assumes that the non-contenders are usually the low revenue/low payroll teams and the contenders are the high revenue/high payroll teams.
If this is the case then the discount received from taking on risk is reduced and the Dodgers end up with Brett Anderson, the Yankees with Stephen Drew, etc. The low revenue teams can be priced out of the lottery tickets or have to hope they have a situation like the Padres for pitchers or Rockies for hitters.

Another factor that might go into these calculations is that the maximum upside isn't as high as it is in, say, a business portfolio. You can't take on 10 risky bets and make out well if only one of them hits, because your risky guys probably produce between -1 and 1 WARP most of the times they fail, but are unlikely to get above 4-5 WARP even when they succeed spectacularly.
Now, if making the playoffs was a winner-take-all cut-off, you could see that distort the calculation, and in fantasy leagues it often does - coming 4th doesn't help if there are only prizes for 1-3, so if you are stuck in 5th, you might as well roll the dice and hope you hit a hard-way. In real baseball, though, a team that wins 86 games but was in the playoff hunt the whole way does worse than one that wins 88 games and makes the playoffs, but the difference isn't huge (unless they go on to win the World Series). So the marginal utility of those extra 2 wins is capped, whereas the marginal utility of the extra category point that puts you in the money is infinite.

Yeah, these are both very good points. It's especially interesting in that now, as you mentioned, some of the very rich teams seem to be consciously trying the "buy variance" strategy (in particular, the Yankees, Red Sox, and Dodgers). I suspect that this has to do partly with the saberification of baseball, and it will be interesting to see what tactics the small market teams, like the post-Friedman Rays, come up with to adjust to the situation.

Nice job.
I think there are implications for fantasy baseball as well.
One major challenge in fantasy baseball is everyone competes with equal resources and generally similar information. Recognizing which projections are most reliable and least reliable is useful for owners.
For example, in a draft, an owner may be realizing that he is running short on home runs, and without luck, will not be competitive. Given the choice between drafting two similar players, each projected to hit 15 home runs, the player with the greater variability in outcome is the better choice if he really needs 20 home runs from that position.

Yes, definitely fantasy implications too, although I am no fantasy expert. As I mentioned in the article though, I think the best fantasy analysts (like our own crew here at BP) already know about how projection uncertainty can vary, and they already talk in those terms (for example, mentioning upside and downside risks, etc.). But certainly some analysts, fantasy-focused and otherwise, do not carefully take into account the spread around a projection, concentrating only on the mean or median outcome.

Some of the fantasy implications depend on league size. In a shallower league (say a 12-team mixed), maybe the high-variance player (Javier Baez?) is an asset only if they reach their 80% projection. But the low-variance player (hello Mark Buehrle) really has no path to relevance.
