So far, Giancarlo Stanton has hit 56 home runs in 555 at bats over 149 games. Miami has 10 games left to play. What’s the chance he’ll hit 61 or more home runs? Let’s make a simple back-of-the-envelope Bayesian model and see what the posterior event probability estimate is.

Sampling notation

A simple model that assumes a home run rate per at bat with a uniform (conjugate) prior:

The data we’ve seen so far is 56 home runs in 555 at bats, so that gives us our likelihood.

Now we need to simulate the rest of the season and compute event probabilities. We start by assuming the at-bats in the rest of the season is Poisson.

We then take the number of home runs to be binomial given the number of at bats and the home run rate.

Finally, we define an indicator variable that takes the value 1 if the total number of home runs is 61 or greater and the value of 0 otherwise.

Event probability

The probability Stanton hits 61 or more home runs (conditioned on our model and his performance so far) is then the posterior expectation of that indicator variable,

Computation in R

The posterior for is analytic because the prior is conjugate, letting us simulate the posterior chance of success given the observed successes (56) and number of trials (555). The number of at bats is independent and also easy to simulate with a Poisson random number generator. We then simulate the number of hits on the outside as a random binomial, and finally, we compare it to the total and then report the fraction of simulations in which the simulated number of home runs put Stanton at 61 or more:

That is, I’d give Stanton about a 34% chance conditioned on all of our assumptions and what he’s done so far.

Disclaimer

The above is intended for recreational use only and is not intended for serious bookmaking.

Exercise

You guessed it—code this up in Stan. You can do it for any batter, any number of games left, etc. It really works for any old statistics. It’d be even better hierarchically with more players (that’ll pull the estimate for down toward the league average). Finally, the event probability can be done with an indicator variable in the generated quantities block.

The basic expression looks like you need discrete random variables, but we only need them here for the posterior calculation in generated quantities. So we can use Stan’s random number generator functions to do the posterior simulations right in Stan.

What would be better for at bats? I know it’s not a fixed number. The Poisson is wrong in that if the player starts and stays in then there’s a minimum number of at bats given starting position in the batting order. Then the variation is going to be on the upside beyond that and will be correlated with the team winning, which will be mildly correlated itself with home runs. I’m not sure how I’d code that. I’d imagine at bats tails off exponentially.

And yes, you’d definitely want a better prior on home run rate. A hierarchical model would be obvious if you have other player data. I was just working from the very simple data presented. The hierarchical model itself is tricky in that the population of home run rates is very skewed with a pileup at the top and a long tail.

The way I’d really start with this would be to plot (a) at bats (for individual and other players), and (b) home run rates across players. Then I’d start thinking about beter distributions.

Some minor bugs/typos/whatever:
1. The “b” parameter of the beta posterior distribution for theta should be 1+555-56, not 1+555, with this correction I get the event probability to be 33% .
2. The last factor of the integrand in the “event probability” integral should be the beta posterior density, not the binomial likelihood (the constant matters here).

Observe that when at bats is Poisson(lambda) and home runs is Binomial(n=at bats, p=theta), home runs is just a thinned Poisson, i.e., home runs ~ Poisson(lambda*theta).

My hunch would be that as long as no subject matter information about plausible values for lambda & theta enter the model, one could as well forget the at-bats and model the Poisson distribution of home runs directly.

With a “scale-free” p(param) propto param^(-1) for the home-runs rate, the posterior predictive for the home runs during the last 10 games is Neg-bin(56, 149/10) [BDA2, p. 52-53], with which I get a probability of 0.326 for the case event (compare to the 33% I got with random sampling from Bob’s model). Of course, (I suppose) the two-level approach starts to be useful as soon as some kind of informative priors are used.

This is the type of example I would like to see fleshed out to explain how Bayesian analysis differs from more standard frequentist analysis. My own analysis simply models the remaining at bats as Poisson(37.24832), where the parameter is the average number of at bats per game through the first 149 games, multiplied by the 10 remaining games (actually one subtle improvement is to base lambda on 154 games, since he missed 5 of those games thus far). Then, the number of home runs he will hit in the remaining games is Binomial, with n=Poisson and probability = .100901 (his home runs per at bat thus far). The resulting probability is 32.2%. With the change to account for missed games, it drops to 29.6%.

So, my question is: what is gained by moving to a Bayesian framework over this straightforward calculation?

I think that in this sort of example the real benefits of Bayesian inference start coming when you start adding more information such as what sorts of pitches are thrown to Stanton and aspects of his swing, compared to that of other players. And also when there is interest in predicting more granular outcomes, not just the number of home runs that will be hit.

I like the number of home runs aspect because of its simplicity. Starting simple makes it easier to see what to add. We’re always suggesting to people using Stan to start with simple models, see where they fail to model reality well, then elaborate.

I used a similar example of hits in the case study on binary trials and used it to highlight the difference between plugging in an MLE for batting average and using a Bayesian posterior which accounts for uncertainty in the estimate of hit rates. Plugging in an MLE gives you a binomial, which is underdispersed compared to what you actually see. The Bayesian posterior in the beta prior case gives you a beta-binomial posterior, which is better calibrated on actual data sizes compared to plugging in an MLE; the difference approaches zero asymptotically as the beta posterior concentrates around the MLE. The case study also contrasted complete pooling, no pooling and partial pooling and also contrasted beta priors on the probability scale and normal priors on the log odds scale.

In this particular example, you’re getting the propagation of the uncertainty in the estimate of .

And I agree that using 154 games rather than the 149 he played in would’ve made more sense. I’d probably want to do something like separately model whether he played (something like an injury occurring and then persisting or healing) separately from number of at-bats per game.

I think you want to regress HR/AB towards the mean with about 140 AB of league average performance which is about 6.6 HR per 140 AB so Beta(theta|6.6+56, 555+140-56-6.6) would be about right if you only have this year’s data. You could do a little better if you had past year’s data and Stanton’s exit velocities.

That’s in line with the roughly 300 AB in a prior on hit rate I found for position players using one year of data. I’d expect a wider range of home run ability than batting ability (and of course the two are correlated as they’re both outcomes, one of which is included in the other).

And as has come up in other discussions, once you start regressing (usually on the log odds), you can start doing things like regressing on at bats—more at bats is correlated with being a better hitter.

This was a really fun and straight-forward exercise. For those of us that are just learning Stan, this would be a very helpful recurring feature. (As if you don’t have enough work to do!)

To make it even more work…I wonder if similar examples could be collected and disseminated through mc-stan.org. Maybe some “Simple Tutorial Exercises” — something to help get your feet wet with Stan syntax and to make sure to get some practice wrapping your head around problems.