This post draws extensively on material from the book Introduction to Empirical Bayes by David Robinson. I've found the book to be an excellent introduction to Empirical Bayes, and a great introduction to the Lahman Baseball dataset as well. You can find the book here. I strongly suggest it!

Empirical Bayes refers to a "hacker" approach to a specific kind of Bayesian analysis. Empirical Bayes can be successfully employed when each data point can be modelled by a Binomial distribution with parameters $N_i$ and $p_i$, where $p_i$ is the quantity we want to estimate for each data point. Crucially, it may be the case that each point has a specific $p_i$ that is not, in general, equal to any other $p_j$. In that case, we can think of the set $\{p_i\}$ as being drawn from a distribution that has a well-defined mean and variance. If that distribution is not uniform, then the data points should not be considered independent of each other: each point carries information about the plausible values of the others. Therefore, we would like an approach that "borrows strength" from all the data points to help us infer the most likely value $p_k$ for a data point $k$ for which we may otherwise have limited evidence (for example, if we wanted to know the hit rate of a baseball player who has played only five games, instead of 1,000).

Notice that the above scenario is different from what we often hope for in science. In scientific experiments, we will often set up an experimental design to measure a variable $p$, and the experiment will be replicated several times to generate various measurements of this variable. In this case, each replicate is considered to be a binomial draw using the same coin, $p$, and the variability is attributed to random sampling. These two scenarios look similar, but they are fundamentally different. In the data, the difference shows up in the variance: tossing many different coins with binomial sampling will result in a greater variance than tossing a single coin with binomial sampling.
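A quick simulation makes the variance difference concrete. This is a sketch with illustrative parameters (300 at-bats per player, coins drawn from Beta(81, 219), and 100,000 players rather than the full simulation below):

```r
# Compare the spread of hit counts when everyone shares one coin
# versus when each player gets their own coin from a Beta(81, 219) prior.
set.seed(42)
n_players <- 100000
N <- 300
p_bar <- 81 / (81 + 219)  # mean of the coin distribution, 0.27

one_coin   <- rbinom(n_players, N, p_bar)                      # single shared coin
many_coins <- rbinom(n_players, N, rbeta(n_players, 81, 219))  # one coin per player

var(one_coin)    # close to N * p * (1 - p), about 59
var(many_coins)  # noticeably larger: binomial noise plus the spread of the p_i
```

The extra variance in `many_coins` is exactly the signal Empirical Bayes exploits when estimating the prior from the data.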

Empirical Bayes is the result of placing a Beta prior on a Binomial likelihood and evaluating the result (since the Beta is a conjugate prior for the Binomial, the posterior is an "updated" Beta distribution). The Beta distribution depends on two hyperparameters, $\alpha$ and $\beta$. A proper Bayesian treatment calls for priors on these two hyperparameters (and priors on the priors of the priors, etc.), but Empirical Bayes is happy to ignore any further priors and instead estimate $\alpha$ and $\beta$ directly from the data.
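The conjugate update itself is one line of arithmetic: a $\mathrm{Beta}(\alpha, \beta)$ prior combined with $h$ hits in $n$ at-bats gives a $\mathrm{Beta}(\alpha + h, \beta + n - h)$ posterior. A minimal sketch with illustrative numbers:

```r
# Conjugate Beta-Binomial update: prior Beta(alpha, beta) plus
# h hits in n at-bats gives posterior Beta(alpha + h, beta + n - h).
alpha <- 81; beta <- 219
h <- 100; n <- 300

post_alpha <- alpha + h        # 181
post_beta  <- beta + (n - h)   # 419
post_mean  <- post_alpha / (post_alpha + post_beta)
post_mean                      # 181 / 600, about 0.302
```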

To motivate the problem, let's start with a simulation. Suppose we simulate 1 million baseball players, each of whom has exactly 300 at-bats (chances to hit the ball; I'm not a baseball player). For each player, we will draw a coin from a beta distribution, $p_i \sim \mathrm{Beta}(81, 219)$, and then we will simulate their hitting career with a Binomial distribution using that coin ($\mathrm{Bin}(300, p_i)$).


# Simulate player statistics. For each player, choose a coin from the beta
# distribution, Beta(81, 219). Then, simulate the number of balls they hit
# out of 300 at-bats, given the coin they were assigned.
num_trial <- 1e6
sims <- data_frame(
  true_average = rbeta(num_trial, 81, 219),
  hits = rbinom(num_trial, 300, true_average))
sims %>% head(5)

  true_average  hits
1    0.3137426    81
2    0.3141298   106
3    0.2944093    97
4    0.2468624    78
5    0.2416608    62

Now that we have a dataset, we can ask a range of questions. For example, what is the range of hit-rates ($p_i$'s) of players who hit exactly 100/300 balls?


# Look at only players who hit exactly 100/300 balls
hit_100 <- sims %>%
  filter(hits == 100)

# Make a histogram of the players who hit 100 balls out of 300 attempts.
# Also draw a KDE on top.
sims %>%
  filter(hits == 100) %>%
  ggplot(aes(true_average, color = factor(hits))) +
  geom_histogram(aes(y = ..density..)) +
  geom_density() +
  labs(x = 'True average of players with H hits / 300 total bats',
       y = 'Density', color = "H")

# Print the median true average for the players who hit 100/300
median(hit_100$true_average)


0.301503283873861

The above graph shows us that we would not be surprised if the players had a hit-rate between 0.25 and 0.35. However, if we were told that a player with a hit-rate of 0.4 scored 100 hits out of 300 at-bats, we ought to be very sceptical about the claimed hit-rate for that player.

Someone walks up and asks you who the best batter of all time is. Confident you can find the answer, you ask your database for the batter with the highest hit rate, computed as hits/total bats. The database returns a list of complete strangers with relatively few total batting attempts, and none of the all-time famous players. Clearly, the players with little evidence are dominating this calculation: they are outliers. But what is a good, principled way to downplay these outliers?

One way to downplay outliers would be to shrink players with spectacular hit-rates towards the average rate in a manner that depends directly on the evidence they have for their hit rate. So, players with little evidence would be "shrunk" aggressively towards the average, and players with a lot of evidence wouldn't be shrunk at all. The "shrinkage" will be mediated by the prior distribution, which depends on two hyperparameters, $\alpha$ and $\beta$. The evidence is incorporated in the likelihood, which is a binomial distribution. Combined, they generate a posterior distribution that incorporates information about our prior beliefs AND the evidence for each player.

The obvious next step is to determine the correct shape of the prior. One way to get a (very good) idea about what the prior should look like is to look at the batting proportions of players who have been playing for long enough to gather credible statistics on their hit rates, then fit a beta-binomial curve to that data. This will get us the hyperparameters $\alpha$ and $\beta$ (but crucially it will not get us anything about the uncertainty in these parameters; for that we would need an MCMC sampler).

Before we can do that, we should get the data, clean it up, and limit ourselves to batters, since pitchers are apparently terrible at batting and could introduce systematic bias into our prior.
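The fitting step can be sketched as a maximum-likelihood fit of a beta-binomial. The code below runs on simulated stand-in data (the real analysis would use the filtered Lahman batters); the starting values and bounds are illustrative, and `dbetabinom.ab` comes from the VGAM package:

```r
# Fit beta-binomial hyperparameters by maximum likelihood.
library(stats4)
library(VGAM)

# Stand-in data: 200 high-AB players whose true coins come from Beta(81, 219).
set.seed(1)
AB <- sample(500:5000, 200, replace = TRUE)
H  <- rbinom(200, AB, rbeta(200, 81, 219))

# Negative log-likelihood of the beta-binomial over all players
ll <- function(alpha, beta) {
  -sum(dbetabinom.ab(H, AB, alpha, beta, log = TRUE))
}
fit <- mle(ll, start = list(alpha = 1, beta = 10), method = "L-BFGS-B",
           lower = c(0.001, 0.1))
coef(fit)  # estimated alpha and beta; should land near the true (81, 219)
```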

We can see that the MLE estimates are $\alpha = 101$ and $\beta = 288$. This means the average player is expected to have a hit rate of $101/(101+288) \approx 0.26$. It also means that the prior carries the weight of roughly $\alpha + \beta \approx 400$ at-bats: players who have batted fewer than about 400 times are considered to have insufficient evidence for their raw estimates to stand on their own. On the other hand, this prior states that we ought to believe the raw estimates for batters who have played more than 4,000 times. Using these two numbers, we can now shrink the expected hit-rates of each player:
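The shrinkage itself is just the posterior mean, $(\alpha + H)/(\alpha + \beta + AB)$. A sketch using the fitted hyperparameters and two hypothetical players:

```r
# Shrunken batting average: posterior mean under the fitted Beta prior.
alpha <- 101
beta  <- 288
shrunk_average <- function(H, AB) (alpha + H) / (alpha + beta + AB)

shrunk_average(4, 10)        # little evidence: pulled close to the prior mean ~0.26
shrunk_average(3000, 10000)  # lots of evidence: stays close to the raw 0.30
```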

Notice how the players with tens of thousands of at-bats (yellow) all fall close to the red line that marks $y=x$. These players are not shrunk at all: their evidence is so strong it overpowers the prior. On the other hand, dark purple dots tend to lie near the blue line that marks the prior mean, $\alpha/(\alpha + \beta)$. The dark purple dots are players who have barely played any games, and therefore they are aggressively shrunk towards the average estimate regardless of their actual hit rate.

Having established our empirical prior, we can now perform computation rapidly and easily on our data. For example, one thing we can do is compute 95% credible intervals for each player. All we need to do is compute the correct percentiles using each player's specific posterior Beta distribution, and we're done! The code below implements that in a couple of lines.
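A sketch of that computation, for a hypothetical player with 100 hits in 300 at-bats under the fitted prior:

```r
# 95% credible interval from the player's posterior Beta(alpha + H, beta + AB - H).
alpha <- 101; beta <- 288
H <- 100; AB <- 300
ci <- qbeta(c(0.025, 0.975), alpha + H, beta + AB - H)
ci  # lower and upper bounds on this player's true average
```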

Another thing we can do rapidly is test hypotheses. Suppose we wanted to know whether a particular player $i$ has a hit rate $p_i \geq 0.3$. We could quantify the probability that we mistakenly claimed a player had a hit-rate greater than 0.3 when in reality the player has a hit-rate below 0.3. In other words, we could quantify the probability that we made a classification mistake. This probability corresponds to the Posterior Error Probability, PEP:
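In symbols, using the per-player posterior from the fitted prior:

$$\mathrm{PEP}_i = P(p_i < 0.3 \mid \text{data}) = \int_0^{0.3} \mathrm{Beta}\!\left(p;\, \alpha + H_i,\, \beta + AB_i - H_i\right)\, dp$$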

The marvelous thing about the PEP is that we can use it to rapidly calculate q-values. The PEP essentially measures how likely it is that the null hypothesis is true ($H_0:~p_i < 0.3$) given the evidence and the prior put together.

Suppose now that we took the $n$ best players from the dataset and asked how many rejections of the null would, on average, be mistaken, given their PEPs. The expected number of mistaken rejections is the sum of the PEPs of the players we selected, which is $n$ times their mean PEP. This yields the false discovery rate for that group: if the mean PEP is $x$, then the average number of false positives we would expect in that group is $nx$.

Marvelously, if we allow $n$ to vary from 1 to $K$ (the total number of players in the dataset), and if we calculate the mean PEP cumulatively over the PEPs sorted in increasing order, we end up with a list of numbers known as the q-values. The q-value of a player is therefore the false discovery rate we would incur, on average, if we rejected the null for that player and for every player with a smaller PEP. We can use this interpretation to choose a significance threshold that, on average, will keep the false discovery rate below a desired level.

The code below implements the PEP first, then the q-value computation.
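A self-contained sketch of both steps, run here on simulated stand-in data (the real analysis would apply the same two `mutate` calls to the `career_eb` table):

```r
# Compute PEPs and q-values under the fitted Beta(101, 288) prior.
library(dplyr)

set.seed(7)
alpha <- 101; beta <- 288
players <- tibble(
  AB = sample(300:6000, 500, replace = TRUE),
  H  = rbinom(500, AB, rbeta(500, alpha, beta))
)

players <- players %>%
  # PEP: posterior probability that p_i < 0.3, i.e. that claiming
  # "this player is a 0.3+ hitter" would be a mistake
  mutate(PEP = pbeta(0.3, alpha + H, beta + AB - H)) %>%
  # q-value: cumulative mean of the sorted PEPs = expected FDR at each cutoff
  arrange(PEP) %>%
  mutate(qvalue = cummean(PEP))
```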

There! We could now set a q-value threshold and control the FDR at that threshold (on average) for our dataset. Let's ask how many players have a batting rate greater than 0.3 with a threshold of 0.01 (a 1% false discovery rate):


nrow(career_eb %>%
  filter(qvalue < 0.01))

61

This concludes the first post on Baseball Statistics. I've found David Robinson's book eminently readable and an excellent introduction to the subject. I think I have a few ideas about places where Empirical Bayes could be extremely powerful, but I have a few outstanding questions:

Empirical Bayes works well when large amounts of data are available. How large is large? How small is too small? Some soft bounds would be nice as a guide (I will explore this through simulation in another post).

Empirical Bayes seems very powerful, but a major drawback is that we can't easily get an idea about uncertainty in the hyperparameters, and how that uncertainty propagates into/from the parameters. Are there diagnostic tools to tell us when Empirical Bayes may be making a mistake, and we should opt for a full-fledged MCMC approach?

All in all, I have had fun learning about Empirical Bayes. I suppose the most important take-home for me is that Empirical Bayes can be a good first approximation to hierarchical models. Whereas a hierarchical Bayesian model would model the hyperparameters as a distribution, Empirical Bayes sets these hyperparameters to their most likely value inferred from the data. The hierarchical approximation and the empirical approach will be in good agreement if the distribution is nicely peaked around the most likely estimate, but can diverge otherwise. Of course, the reason to use full hierarchical models is often that the distributions are NOT nicely peaked, so the purist will say that a full hierarchical treatment is always required. However, this disregards the power of the empirical approach: it can be implemented extremely quickly.