Teaching Bayesian stats backward

Most presentations of Bayesian statistics I’ve seen start with elementary examples of Bayes’ Theorem. And most of these use the canonical example of testing for rare diseases. But the connection between these examples and Bayesian statistics is not obvious at first. Maybe this isn’t the best approach.

What if we begin with the end in mind? Bayesian calculations produce posterior probability distributions on parameters. An effective way to teach Bayesian statistics might be to start there. Suppose we had probability distributions on our parameters. Never mind where they came from. Never mind classical objections that say you can’t do this. What if you could? If you had such distributions, what could you do with them?

For starters, point estimation and interval estimation become trivial. You could, for example, use the distribution mean as a point estimate and the interval between two quantiles as an interval estimate. The distributions tell you far more than point estimates or interval estimates could; these estimates are simply summaries of the information contained in the distributions.
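Here is a minimal sketch of how little work that takes once a posterior is in hand. Everything specific in it is an assumption made for illustration: the posterior is taken to be Beta(8, 4) on a success probability, the helper name `summarize_posterior` is hypothetical, and the 95% level is just a common default. The posterior is represented by random draws, and the summaries are read straight off the draws.

```python
import random
import statistics


def summarize_posterior(draws, level=0.95):
    """Return (mean, (lower, upper)) for a list of posterior draws.

    The interval is the central one: it cuts equal probability
    from each tail, using empirical quantiles of the draws.
    """
    ordered = sorted(draws)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    lower = ordered[int(tail * (n - 1))]
    upper = ordered[int((1.0 - tail) * (n - 1))]
    return statistics.fmean(ordered), (lower, upper)


# Assumed posterior, for illustration only: Beta(8, 4).
rng = random.Random(0)  # fixed seed so the sketch is repeatable
draws = [rng.betavariate(8, 4) for _ in range(20000)]

mean, (lo, hi) = summarize_posterior(draws)
print(f"point estimate (posterior mean): {mean:.3f}")
print(f"95% interval estimate: ({lo:.3f}, {hi:.3f})")
```

The same few lines work for any posterior you can draw from, which is part of the point: the summaries are easy, and the distribution is the real result.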

It makes logical sense to start with Bayes’ Theorem since that’s the tool used to construct posterior distributions. But I think it makes pedagogical sense to start with the posterior distribution and work backward to how one would come up with such a thing.

Bayesian statistics is so named because Bayes’ Theorem is essential to its calculations. But that’s a little like calling classical statistics “Central Limitist” statistics because it relies heavily on the Central Limit Theorem.

The key idea of Bayesian statistics is to represent all uncertainty by probability distributions. That idea can be obscured by an early emphasis on calculations.

11 thoughts on “Teaching Bayesian stats backward”

I couldn’t agree more. Honestly, this is exactly how I think Bayesian statistics should be taught. Bayes’ Theorem seems like a minor algorithmic detail for Bayesian statistics, just like computing the sum of squares is an incidental detail for the ANOVA. The posterior is where the action is, especially when the posterior isn’t Gaussian.

To start, postulate the existence of an “oracle” that you can consult to give you the posterior distributions. After you show the usefulness, you can then show that you can “consult” the sample data and also consult with experts who have prior experience and information. Nice motivation.

It’s a great idea, and pedagogically speaking it’s basically what the classical crowd do already: start with the idea of testing a hypothesis and only later worry about showing how to construct an actual test. Indeed, if you don’t take enough stats courses you can easily believe that the latter is a matter of looking the right one up in a big book of named tests.

On the other hand, I’d disagree with JMW that Bayes’ theorem is an algorithmic detail. It’s a fundamental conceptual element that shows how to connect a model to data. What is an algorithmic detail is how you actually ‘realise’ Bayes’ theorem to get hold of the posterior distribution. Perhaps the better analogy for SS computation is MCMC sampling.

Conjugate Prior: I agree that MCMC is a better analogy. Many people believe Bayes = MCMC. I’ve had people look at me dumbfounded when I tell them I’ve done a Bayesian calculation without MCMC, i.e. by calculating integrals numerically.
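As a rough sketch of what a Bayesian calculation without MCMC can look like, the code below computes a posterior mean by plain numerical integration on a grid. The setup is invented for illustration: a binomial likelihood with 7 successes in 10 trials, a flat prior, and a triangular prior peaked at 0.5. The function names are hypothetical, and the midpoint rule is only one of many quadrature choices.

```python
def posterior_mean_on_grid(successes, trials, prior, n_grid=2000):
    """Posterior mean of a success probability by midpoint-rule integration.

    `prior` is any function giving the (possibly unnormalized) prior
    density at a value in (0, 1). No sampling is involved.
    """
    width = 1.0 / n_grid
    thetas = [(i + 0.5) * width for i in range(n_grid)]
    # Unnormalized posterior = likelihood * prior at each grid point.
    # The binomial coefficient is constant in theta, so it cancels.
    unnorm = [
        t**successes * (1.0 - t) ** (trials - successes) * prior(t)
        for t in thetas
    ]
    evidence = sum(unnorm) * width  # normalizing integral
    return sum(t * u for t, u in zip(thetas, unnorm)) * width / evidence


def flat_prior(theta):
    """Uniform prior on (0, 1)."""
    return 1.0


def triangular_prior(theta):
    """Non-conjugate prior peaked at 0.5 (an assumption for illustration)."""
    return 2.0 - 4.0 * abs(theta - 0.5)


mean_flat = posterior_mean_on_grid(7, 10, flat_prior)
mean_tri = posterior_mean_on_grid(7, 10, triangular_prior)
print(f"flat prior:       {mean_flat:.4f}")
print(f"triangular prior: {mean_tri:.4f}")
```

With the flat prior the posterior is Beta(8, 4), so the grid answer can be checked against the exact mean 8/12. The triangular prior has no such closed form here, yet the same integral handles it, pulling the posterior mean toward 0.5.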

Bayesian stats isn’t hard per se. In fact, it’s much simpler than frequentist stats because everything’s more direct when you talk about probabilities of parameters.

What was hard for me (and others I’ve worked with) is the concept of a random variable (and it’s not the measure theory; I was a pure math major as an undergrad, so I did analysis). The problem is more philosophical, specifically when random variables are used counterfactually. That is, our models assume that even events that have already happened could have happened differently. (There are deep philosophical waters here, of course.)

Where people begin losing the notation is when we write things like Pr[X = x], with X being an RV and x being a value. If you don’t go through probability theory, this whole distinction between random variables and values of random variables gets confused. Especially if you start out reading applied Bayesian model books like Gelman et al., which conflate the X and x notationally in a way that’s REALLY confusing for beginners.

I find the easiest way to think about all of this is using a sampling notation for directed graphical models like BUGS. You can put the model together, plug in values for some known nodes, and infer the rest. Then you can work backwards through all the magic of sampling.

Bob: The philosophical waters come with the likelihood function, rather than the prior, so everyone has to face them at some point. But they don’t have to be very deep. The mechanism view of causality (e.g. Woodward’s ‘Making Things Happen’) gives a (relatively) uncontroversial reading of the relevant counterfactuals.

And I totally agree with the X=x comment. But it’s even worse for beginners than you suggest, since ‘P’ itself typically denotes a different mathematical function each time it’s tokened!