Seemingly intuitive and low math intros to Bayes never seem to deliver as hoped: Why?

This post was prompted by recent nicely done videos by Rasmus Baath that provide an intuitive and low math introduction to Bayesian material. Now, I do not know that these have delivered less than he hoped for. Nor have I asked him. However, given similar material I and others have tried out in the past that did not deliver what was hoped for, I am anticipating that and speculating why here. I have real doubts about such material actually enabling others to meaningfully interpret Bayesian analyses, let alone implement them themselves. For instance, in a conversation last year with David Spiegelhalter, his take was that some material I had could easily be followed by many, but the concepts that material was trying to get across were very subtle and few would have the background to connect to them. On the other hand, maybe I am hoping to be convinced otherwise here.

For those too impatient to watch the three roughly 30 minute videos, I will quickly describe my material David commented on (which I think is fairly similar to Rasmus’). I am more familiar with it and doing that avoids any risk of misinterpreting anything Rasmus did. It is based on a speculative description of what Francis Galton did in 1885 which was discussed more thoroughly by Stephen Stigler. It also involves a continuous (like) example which I highly prefer starting with. I think continuity is of overriding importance so one should start with it unless they absolutely can not.

Galton constructed a two stage quincunx (diagram for a 1990 patent application!) with the first stage representing his understanding of the variation of the targeted plant height in a randomly chosen plant seed of a given variety. The pellet haphazardly falls through the pins and lands at the bottom of the first level as the target height of the seed. His understanding I think is a better choice of wording than belief, information or even probability (which it can be taken to be given the haphazardness). Also it is much much better than prior! Continuing on from the first level, the pellet falls down a second set of pins landing at the very bottom as the height the plant actually grew to. This second level represents Galton’s understanding of how a seed of a given targeted height varies in the height it actually grows. Admittedly this physical representation is actually discrete but the level of discreteness can be lessened without preset limits (other than practicality).

Possibly, he would have assessed the ability of his machine to adequately represent his understanding by running it over and over again and comparing the set of plant heights represented on the bottom level with knowledge of past, if not current, heights this variety of seed usually did grow to. He should have. Another way to put Galton’s work would be that of building (and checking) a two stage simulation to adequately emulate one’s understanding of targeted plant heights and actual plant heights that have been observed to grow. Having assessed his machine as adequate (by surviving a fake data simulation check) he might have then thought about how to learn about a particular given seed’s targeted height (possibly already growing or grown) given he would only get to see the actual height grown. The targeted height remains unknown and the actual height becomes known. It is clear that Galton decided that in trying to assess the targeted height from an actual height one should not look downward from a given targeted height but rather upward from the actual height grown.

Now by doing multiple drops of pellets, one at a time, and recording only where the pellet was at the bottom of the first level if and only if it lands at a particular location on the bottom level matching an actual grown height, he would be doing a two stage simulation with rejection. This clearly provides an exact (smallish) sample from the posterior given the exact joint probability model (physically) specified/simulated by the quincunx. It is exactly the same as the conceptual way to understand Bayes suggested by Don Rubin in 1982. As such, it would have been an early fully Bayesian analysis, even if not actually perceived as such at the time (though Stigler argues that it likely was).
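
As a minimal sketch of this two stage simulation with rejection, the following emulates the machine in Python. The pin counts (20 rows for the first level, 10 for the second), the observed height of 6 and the function names are all assumptions chosen for illustration, not a reconstruction of Galton's actual device.

```python
import random
from collections import Counter

def drop(rows, start, rng):
    """Let a pellet fall through `rows` rows of pins; each pin knocks it one slot left or right."""
    pos = start
    for _ in range(rows):
        pos += 1 if rng.random() < 0.5 else -1
    return pos

def two_stage_rejection(observed, n_drops, rows1=20, rows2=10, seed=1):
    """Return the first-level positions of the pellets that end at `observed`."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_drops):
        target = drop(rows1, 0, rng)       # first level: targeted height
        actual = drop(rows2, target, rng)  # second level: height actually grown
        if actual == observed:             # record the target only on an exact match
            kept.append(target)
    return kept

kept = two_stage_rejection(observed=6, n_drops=40_000)  # hypothetical observed height
counts = Counter(kept)
for target in sorted(counts):
    print(f"target {target:+d}: {counts[target] / len(kept):.3f}")
print("posterior mean of target:", round(sum(kept) / len(kept), 2))
```

Heights here are offsets from the centre slot, in pin units. With these settings the posterior mean should come out near 4, two thirds of the observed 6, because 20 of the 30 rows of pins belong to the first level.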

This awkward to carry out, but arguably less challenging, way to grasp Bayesian analysis can be worked up to address numerous concepts in statistics (both implementing calculations and motivating formulas) that are again less challenging to grasp (or so it’s hoped). This is what I perceive Rasmus, Richard McElreath and others are essentially doing. Authors do differ in their choices of which concepts to focus on. My initial recognition of these possibilities led to this overly exuberant but very poorly thought through post back in 2010 (some links are broken).

To more fully discuss this below (which may be of interest only to those very interested), I will extend the quincunx to multiple samples (n > 1) and multiple parameters, clarify the connection to approximate Bayesian computation (ABC) and point out something much more sensible when there is a formula for the second level of the quincunx (the evil likelihood function). The likelihood might provide a smoother transition to (MCMC) sampling from the typical set instead of the entirety of parameter space. I will also say some nice things about Rasmus’s videos and of course make a few criticisms.

As for n > 1, had Galton conceived of cloning the germinated seeds to allow multiple actual heights with the same targeted height, he could have emulated sample sizes of n > 1 in a direct if not very awkward way using multiple quincunxes. In the first quincunx, the first level would represent the prior, and the subset of the pellets that ended up on the bottom of the second level matching the height of the first plant actually grown would represent the posterior given the first plant height. The pellets representing the posterior in the first quincunx (the subset at the bottom of the first level) would then need to be transferred to the bottom of the first level of a second machine (as the new prior). They then would be let fall down to the bottom of its second level to represent the posterior given both the first and second plant height. And so on for each and every sample height grown from the cloned seed.
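
The chain of machines can be sketched the same way: the pellets kept by one machine become the prior for the next. The pin counts, the three plant heights and the helper names below are hypothetical, and the sketch assumes every clone shares the one targeted height.

```python
import random

def drop(rows, start, rng):
    """Let a pellet fall through `rows` rows of pins, moving one slot left or right per row."""
    pos = start
    for _ in range(rows):
        pos += 1 if rng.random() < 0.5 else -1
    return pos

def update(prior_targets, observed, rows2, rng):
    """One machine: keep each incoming target whose second-level drop matches `observed`."""
    return [t for t in prior_targets if drop(rows2, t, rng) == observed]

rng = random.Random(2)
rows1, rows2 = 20, 10
heights = [6, 2, 4]                       # hypothetical heights of the cloned plants

# First level of the first machine: the prior on the targeted height.
targets = [drop(rows1, 0, rng) for _ in range(150_000)]

# Each further machine starts from the pellets the previous one kept.
for y in heights:
    targets = update(targets, y, rows2, rng)
    mean_target = sum(targets) / len(targets)
    print(f"after height {y}: {len(targets)} pellets kept, mean target {mean_target:.2f}")
```

The shrinking number of kept pellets at each step shows why the physical version becomes awkward so quickly as n grows.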

As for multiple parameters, Rasmus moved on to two parameters in his videos simply by running two quincunxes in parallel. At some point the benefit/cost of this physical analogue (or metaphor) quickly approaches zero and it should be discarded. Perhaps at this point just move on to two stage rejection sampling with multiple parameters and sample sizes > 1.

The history of approximate Bayesian computation is interesting and perhaps especially so for me as I thought I had invented it for a class of graduate Epidemiology students in 2005. I needed a way of convincing them Bayes was not a synonym for MCMC and thought of using two stage rejection sampling to do this. Though, in a later discussion with Don Rubin it is likely I got it from his paper linked above and just forgot about that. But two stage rejection sampling is and is not ABC.

The motivation for ABC was from not having a tractable likelihood but still wanting to do a Bayesian analysis. It was the same motivation for my DPhil thesis which was not having a tractable likelihood for many published summaries (with no access to individual values) but still wanting to do a likelihood based confidence interval analysis (I was not yet favouring a Bayesian approach). In fact, the group that is generally credited with first doing ABC (with that Bayes motivation) included my internal thesis examiner RC Griffiths (Oxford). Now, I first heard about ABC in David Cox’s recorded JSM talk in Florida. Afterwards, whenever I exchanged emails with some in Griffiths’ group and others doing ABC, there arose a lot of initial confusion.

That was because in my thesis work, I did have the likelihood in closed form for individual observations but only had summaries which usually did not have a tractable likelihood. The ABC group did not have a tractable likelihood for individual observations ever (or it was too expensive to compute). Because of this, when I used ABC to get posteriors from summarised data, because that was all that I had observed, it would be actually approximating the exact posterior (given one had only observed the summaries). So to some of them I was not actually doing ABC but some weird other thing. (I am not aware if anyone has published such an ABC like analysis, for instance a meta-analysis of published summaries).

So now into some of the technicalities of real ABC and not quite real ABC. Let’s take a very simple example of one continuous observation that was recorded only to two decimal places and one unknown parameter. In general, with any prior and data generating model, two stage rejection sampling matched to two decimal places provides a sample from the exact posterior. So not ABC, just full Bayes done inefficiently. On the other hand, if it was recorded to 5 or 10 or more decimal places, matching all the decimal places may not be feasible and choosing to match to just two decimal places would be real ABC. Now think of 20, 30 or more samples recorded to two decimal places. Matching all, even to 2 decimal places, is not feasible but deciding to match the sample mean to all decimal places of the sample mean recorded will be feasible and is ABC having used just the summary. Well, unless one assumes the data generating model is Normal as then by sufficiency it’s not ABC – it’s just full Bayes. These distinctions are somewhat annoying – but the degree of approximation does need to be recognised and admittedly ABC is the wrong label when the posterior obtained is exact rather than approximate.
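
The two cases can be sketched as follows. The Normal(0, 1) prior, the Normal(theta, 1) data model and the recorded values 1.20 and 0.85 are assumptions made only for illustration. Because the assumed model is Normal, the second case is, by the sufficiency remark above, still full Bayes; swapping in a non-Normal simulator would make the same code an ABC approximation.

```python
import random

rng = random.Random(3)

def prior_draw():
    return rng.gauss(0.0, 1.0)                        # assumed prior: Normal(0, 1)

def simulate(theta, n):
    return [rng.gauss(theta, 1.0) for _ in range(n)]  # assumed model: Normal(theta, 1)

def mean(values):
    return sum(values) / len(values)

# Case 1: a single observation recorded to two decimal places.
y_obs = 1.20                                          # hypothetical recorded value
exact = []
for _ in range(200_000):
    theta = prior_draw()
    if round(simulate(theta, 1)[0], 2) == y_obs:      # match every recorded decimal
        exact.append(theta)

# Case 2: n = 20 observations, matching only the recorded sample mean.
ybar_obs, n = 0.85, 20                                # hypothetical recorded summary
summary = []
for _ in range(60_000):
    theta = prior_draw()
    if round(mean(simulate(theta, n)), 2) == ybar_obs:
        summary.append(theta)

print(len(exact), "kept in case 1; posterior mean about", round(mean(exact), 2))
print(len(summary), "kept in case 2; posterior mean about", round(mean(summary), 2))
```

Only a fraction of a percent of the draws survive in either case, which is the inefficiency referred to above.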

Now some criticism of Rasmus’s videos. I really did not like the part where the number of positive responses to a mail out of 16 offers was analysed using a uniform prior – primarily motivated as non-informative – and the resulting (formal) posterior probabilities discussed as if they were relevant or useful. This is not the kind of certainty about uncertainty to be obtained through statistical alchemy that we want to be inadvertently instilling in people. Now, informative priors were later pitched as being better by Rasmus, but I think the damage has already been done.

Unfortunately, the issue is not that well addressed in the statistical literature and someone even once wrote that most Bayesians would be very clear about the posterior not being the posterior but rather dependent on the particular prior. At least in any published Bayesian analysis. Now, if I was interested in how someone in particular or some group in particular would react, their prior and hence their implied posterior probabilities given the data they have observed would be relevant, useful and could even be taken literally. But if I was interested in how the world would react, the prior would need to be credibly related to the world for me to take posterior probabilities as relevant and in any remote sense literal.

Now, if calibrated, posterior probabilities could provide useful uncertainty intervals. That’s a different topic. For an instance of priors being unconnected to the world, Andrew’s multiple comp post provided an example of a uniform prior on the line that is horribly not credibly related to the effect sizes in the research area one is working in. Additionally, the studies are way too weak to in any sense get over that horrible unrelatedness. In introductory material, just don’t use flat priors at all. Do mention them but point out that they can be very dangerous in general (i.e. consult an experienced expert before using) but don’t use them in introductory material.

I really did like the side by side plots of the sample space and parameter space for the linear regression example. The sample space plot showing the fitted line (and the individual x and y values) and the parameter space plot initially having a dot at the intercept and slope values that give the maximum probability of observing the individual x and y values actually observed. Later dots were added and scaled to show intercept and slope values that give less probability and then posterior probabilities were printed over these. Now, I do think it would be better if there was something in the parameter space plot that roughly represented the individual x and y values observed.

Here one could take the value of the intercept (that jointly with the slope gave the maximum probability) as fixed or known and then with just one unknown left, find the slope of maximum probability using each individual x and y value and plot those. Then do the same for the intercept by taking the slope value (that gave the maximum probability) as known. The complication here comes from the intercept and slope parameters being tangled up together. Much more can be done here, but admittedly I have been able to convince very few that this sort of thing would be worth the trouble. (Well one journal editor, but they found that the technical innovations involved were not worthy of getting a published paper in their journal). What about the standard deviation parameter? It was taken as known from the start? Actually that does not matter as much as that parameter is much less tangled up with the intercept and slope parameters.
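
One possible reading of this suggestion, sketched under the assumption of Normal errors (so that the maximum probability fit is least squares): with the intercept held fixed, the slope that best explains a single point is the one whose line passes through it, and likewise for the intercept with the slope held fixed. The data and variable names are hypothetical.

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]          # hypothetical x values (none equal to zero)
ys = [2.1, 3.9, 6.2, 7.8, 10.1]         # hypothetical y values
n = len(xs)

# Joint maximum probability (least squares) intercept and slope.
xbar, ybar = sum(xs) / n, sum(ys) / n
b_hat = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
a_hat = ybar - b_hat * xbar

# Intercept taken as known: the slope each single observation would pick.
slopes = [(y - a_hat) / x for x, y in zip(xs, ys)]

# Slope taken as known: the intercept each single observation would pick.
intercepts = [y - b_hat * x for x, y in zip(xs, ys)]

for x, y, b, a in zip(xs, ys, slopes, intercepts):
    print(f"(x={x}, y={y}): slope {b:.3f} at a={a_hat:.3f}, intercept {a:.3f} at b={b_hat:.3f}")
```

Plotting `slopes` along the line a = `a_hat` and `intercepts` along the line b = `b_hat` would put one mark per observation into the parameter space plot. The per-observation intercepts average back to `a_hat`, and the per-observation slopes, weighted by x squared, average back to `b_hat`.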

When one does have a closed form for the likelihood (and it is not numerically intensive), two stage rejection sampling is sort of silly. If you think about two stage rejection sampling, in the first stage you get a sample of the proportions certain values have in the prior (i.e. estimates of the prior probabilities of those values). In the second stage you keep the proportion of those certain values that generated simulated values that matched (or closely approximated) observed values. The proportions kept in the second stage are estimates of probabilities of generating the observed values given the parameter values generated in the first stage. Hence they are estimates of the likelihood – P(X|parameter) – but you have that in closed form. So take the parameter values generated in the first stage and simply weight them by the likelihood (i.e. importance sampling) to get a better sample from the posterior much more efficiently. Doing this, one can easily implement more realistic examples such as multiple regression with half a dozen or more covariates. Some are tempted to avoid any estimation at all by using a systematic approximation of the joint distribution on grids of points leading to P(discretized(X),discretized(parameter)) and P(parameter) * P(X|parameter) for some level of discreteness. I think this is a mistake as it severely breaks continuity, does not scale to realistic examples and suggests a return to unthinkingly plugging and chugging mindlessly through weird formulas that one just needs to get used to.
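
A minimal sketch of that weighting step, assuming a Normal(0, 1) prior, a Normal(theta, 1) likelihood and five hypothetical observations (none of which come from the videos or from my earlier material):

```python
import math
import random

rng = random.Random(4)
data = [1.2, 0.4, 0.9, 1.6, 0.7]            # hypothetical observations

def log_likelihood(theta, ys, sigma=1.0):
    """Closed-form Normal log likelihood of all observations given theta."""
    const = math.log(sigma * math.sqrt(2 * math.pi))
    return sum(-0.5 * ((y - theta) / sigma) ** 2 - const for y in ys)

# Stage one is unchanged: draw parameter values from the prior.
thetas = [rng.gauss(0.0, 1.0) for _ in range(20_000)]

# Stage two is replaced: weight each draw by the likelihood instead of simulating and matching.
log_w = [log_likelihood(t, data) for t in thetas]
top = max(log_w)
weights = [math.exp(lw - top) for lw in log_w]   # subtract the max for numerical stability

total = sum(weights)
post_mean = sum(w * t for w, t in zip(weights, thetas)) / total
ess = total ** 2 / sum(w * w for w in weights)   # effective sample size of the weighted draws

# Optional: resample to get an unweighted sample from the posterior.
resampled = rng.choices(thetas, weights=weights, k=5_000)

print("posterior mean:", round(post_mean, 3), "effective sample size:", round(ess))
```

Every prior draw contributes something here, whereas exact matching on five continuous observations would reject essentially all of them. For this conjugate setup the posterior mean is sum(data) / (n + 1) = 0.8, which gives a check on the weighted estimate.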

These more realistic higher dimension examples may help bridge people to the need for sampling from the typical set instead of the entirety of parameter space. I did work it through for bridging to sequential importance sampling by _walking_ from the prior to the posterior in smaller, safer steps. But bridging to the typical set likely would be better.

In closing this long post, I feel I should acknowledge the quality of Rasmus’s videos which he is sharing with everyone. I am sure it took a lot of time and work. It took me more time than I want to admit to put this post together, perhaps since it’s been a few years since I actually worked on such material. Seeing others make some progress prompted me to try again by at least thinking about the challenges.

So there are a lot of things in these videos I’m not happy about (but I’m happy about some of it! :). But it’s not the lack of mathematics… What I’m most unhappy about is that I’m too hand-wavy about what probability is and instead rely on people’s intuition about “random” draws. I’m not too fond of the “randomness” concept, and if I were to do a new version I would perhaps try to talk more about probability and then introduce Bayes by means of grid approximation. What I do like about introducing Bayes by approximate Bayesian computation is that I think it quickly gives people an intuition for what kind of models you can build.

By the way, the side by side plots for the linear regression example were stole… heavily inspired by a presentation by Robert Grant (www.robertgrantstats.co.uk/). :)

I like the organization. As someone who used to work on semantics, I really like the analogy to time and point when thinking about how hard it is to define something like probability in common language versus mathematically.

I’d like to see all this talk about personal “subjective” probability turned into more of an epistemic discussion about knowledge and the scientific community.

Yes, that’s a good point in the second paragraph. Those notes were written for a continuing ed course focused largely on pointing out the problems with frequentist statistics. I’ve retired from teaching it, but I’ll pass on your suggestion to the person who taught it this May (and I hope will continue to teach it for a while).

Oh! I would like to add that I think short tutorials can be harmful in (at least) three ways:

1. Make a topic seem harder and/or more boring than it really is.
—————————————————————–
Many Bayesian tutorials I’ve seen fall into this category. This is irritating because I usually say that a good thing with Bayesian statistics is that it is *easier* to understand than classical statistics, and then people go online and find some tutorial which doesn’t make any sense to them, and then they call me a liar.

An example would be this Introduction to Bayesian statistics: https://youtu.be/0F0QoMCSKJ4 . You expect an introduction, but if you just took two courses in basic statistics (t-test, anova, and the likes) then you don’t know what a likelihood function is, you’ve never heard of the Beta distribution and you probably don’t remember that the strange eight θ is called “theta”, etc.

2. Mislead people in ways that will backfire later.
—————————————————————————–
This is my subjective view, but there is a lot of confusing crud in how people teach Bayes, and a lot of it can be found in tutorials, which might be people’s first introduction to the subject. Again the video above (https://youtu.be/0F0QoMCSKJ4) is a good example: “Probability is our degree of belief”, is causing a lot of confusion and makes it sound like you can only use Bayes when you have a personal subjective prior that you produced by the magic of introspection. It also starts by estimating the “theta” of a coin, which is very confusing as we don’t need to estimate anything, there are few things I’m as certain of in life as that coins are 50/50. Parameters are not assumed fixed but “treated as random variables” makes no sense if you don’t know what a random variable is, and is not how I think of parameters in Bayesian models; they are assumed fixed but *unknown*.

3. Make people think they know more than they do.
————————————————-

If you make people think they master a subject that is really very tricky, then that is also going to backfire when reality hits. This is kind of the problem with Stan (and maybe R in general) in that it is very easy to make it do things even if you don’t know what you’re doing. :)

With respect to my tutorial, I hope I’m not too guilty of (1), the way I hand-wave over probability is definitely an example of (2), and trying to cram Stan in at the end is certainly going to result in a (3). But the most important thing about a tutorial to such a big subject as Bayesian statistics is to make people *enthusiastic* and wanting to learn more (and maybe pick up Richard McElreath’s book), and that was the main goal of my tutorial :)

For (1), I think getting people interested is great. They need to be motivated to put in the hard work it’ll take to learn things properly. Hitting strangers with a calculus quiz probably isn’t the best way to win friends and influence people.

For (2), I think random variables are the hardest thing to understand conceptually.

In my experience, both from learning the material and from interacting with others now that I know it, there is a lot of confusion around parameters in Bayesian models and randomness. The key thing people often miss is that a random variable always only has a fixed value. Sometimes that value has been observed (the outcome of yesterday’s coin flip), but often it hasn’t (the outcome of tomorrow’s coin flip). That’s true for both parameters and “ordinary” (i.e., frequentist acceptable) random variables.

Let’s say we look at a random variable for the outcome of a coin flip. Perfectly OK for a frequentist as coin flipping is hypothetically repeatable. Yet the flip in question, represented by random variable Y, is just the result of that one particular flip. If you have two flips, you get two random variables, Z1 and Z2, both of which have a single value. I was just browsing the appendix of a standard PK/PD textbook’s intro to probability and they got this horribly confused, describing stochastic processes as repeatedly sampling from a single random variable.

What’s hard I think is that this reasoning is counterfactual if we’re talking about data in the past, which we model as a random variable with a known value. We know how the coin flip or surgery or educational test came out, yet we persist in treating it as random. Conditioning on single values in conditional densities may be more challenging theoretically than random variables themselves, but I think random variables are much harder conceptually.

Then there’s this lingering frequentist-Bayesian debate of what’s allowed to be modeled as a random variable. If you think of probability as representing knowledge rather than belief, it doesn’t sound so bad. Our current knowledge is uncertain at any given point. For example, we may only know the speed of light or the gravitational constant to a few decimal places; there’s lingering uncertainty. Nobody runs around saying our knowledge of the speed of light is subjective and that each physicist has to use their own subjective judgements as a result. The notion of subjectivity also gets tied up with researcher degrees of freedom, where people often forget you get a lot of latitude in classical analyses when picking experimental designs, hypothesis tests, likelihoods, and penalty functions. Let’s change the ground of this marketing battle (or perhaps even better, just ignore it).

Going onto crossvalidated and reading questions like “I just watched an ML MOOC and I was thinking we could just replace linear regression with gradient descent! Has anyone thought of that yet?” really demonstrates the danger of (3). I regret to say questions of that caliber are the mode, although there are still some good discussions.

Meanwhile, if you go to mathoverflow, I think the good discussions are more common. Maybe that’s because machine learning + stats is a huge fad right now, maybe that’s because pure math MOOCs are less popular or maybe that’s because I’m less capable of identifying a good pure math discussion.

You might be interested in something Jeff Leek drew attention to for MOOCs – generate voice over presentations so that you can change what is said in a presentation and regenerate the whole presentation with the changes https://github.com/seankross/ari

> intuition about “random” draws.
I did take people through using the digits of Pi and simple program code as a way to generate uniform pseudo-random variables, with the emphasis being to make things look as pattern-less as possible and showing how these pseudo-random variables do useful things such as estimating integrals and, by using envelope rejection sampling, getting Normal, Gamma etc. samples. They seemed comfortable with the material and _seemed_ to get some sense of random outcomes from it. (Went better than most material).
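
A rough sketch of that kind of exercise (not the original course code) might look like the following. The digit generator is Gibbons' unbounded spigot algorithm; grouping four digits per draw, the integrand x^2 and the triangular target density f(x) = 2x are assumptions picked to keep the example short, standing in for the Normal and Gamma targets mentioned above.

```python
from itertools import islice

def pi_digits():
    """Yield the decimal digits of pi one at a time (Gibbons' unbounded spigot)."""
    q, r, t, k, n, l = 1, 0, 1, 1, 3, 3
    while True:
        if 4 * q + r - t < n * t:
            yield n
            q, r, n = 10 * q, 10 * (r - n * t), (10 * (3 * q + r)) // t - 10 * n
        else:
            q, r, t, k, n, l = (q * k, (2 * q + r) * l, t * l, k + 1,
                                (q * (7 * k + 2) + r * l) // (t * l), l + 2)

def uniforms_from_pi(count, digits_per_draw=4):
    """Turn consecutive groups of digits into numbers in [0, 1)."""
    digits = pi_digits()
    draws = []
    for _ in range(count):
        chunk = list(islice(digits, digits_per_draw))
        draws.append(int("".join(map(str, chunk))) / 10 ** digits_per_draw)
    return draws

u = uniforms_from_pi(300)

# Estimating an integral: the average of x^2 over the draws approximates 1/3.
integral = sum(x * x for x in u) / len(u)

# Envelope rejection: propose x from the flat envelope and accept it with
# probability f(x) / 2 = x, which leaves draws from the density f(x) = 2x.
accepted = [x for x, v in zip(u[0::2], u[1::2]) if v <= x]

print("integral estimate:", round(integral, 3))
print("accepted", len(accepted), "of", len(u) // 2, "proposals")
```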

If you don’t know math and you don’t know Bayes, I suggest spending your time learning math. If the goal’s to eventually understand statistics, I’d suggest learning enough calculus that you understand conceptually what derivatives and integrals are doing and just enough combinatorics that you can understand binomial coefficients and the hypergeometric distribution (of course you won’t know what that is unless you already know the answer!). You can go a long way without ever solving an integral; we have MCMC for that. (I used to TA math and it was a nightmare in calc 2 when the students just threw trig integrals at you all recitation long; now I just plug them into Wolfram Alpha online.)

When I was a professor of computational linguistics, we didn’t try to teach our graduate students to program like many linguistics departments did—we made them take the core CS data structures and algorithms class. By the same reasoning, if they didn’t know logic, we made them take a semester of mathematical logic (this was before the stats revolution in natural language processing). I had the joy of getting introduction to semantics students who’d all had a semester of serious mathematical logic through the completeness and incompleteness theorems. I was the envy of my peers—they had to start by teaching basic propositional and predicate logic, which honestly requires a semester to do right. So when I wrote my semantics book based on my classes, everyone told me they couldn’t teach it; I told them the trick was sending the students to a semester of mathematical logic first.

I don’t know that anyone’s served by cramming the material for 3 classes into one. I fear a lot of machine learning and data science classes fall into that trap: they get such a heterogeneous enrollment that the syllabus includes trying to teach programming, trying to teach basic statistics, and then getting to the core issues in machine learning (maybe convergence and algorithm analysis, but if the students have only had a few weeks intro to stats and algorithms, it’s going to be hard going).

Once you know math, Bayes is almost trivial to introduce (of course, it’s going to take a decade to master it). I taught myself textbook Bayes quite easily, but after a few years practicing it and writing software, I decided to join the circus, because I was getting stuck working on my own.

First, I agree that a fair amount of abstract thinking is needed to get much out of these low math presentations and that likely is best developed by doing some math before, but not the 3 or 4 year long undergrad math courses. To me that is just like needing Latin before engaging in any higher learning a century or two ago.

Yes a lot of material will be inaccessible to you and many will encourage you to leave academia if you don’t have it. And if you want to study with certain masters, there is no way around it (e.g. Andrew’s frequent use of advanced math stat insights to get crafty Bayesian workarounds). But it is not necessary and especially not to just get started – the intent is to get people going, not stop them from going further. Hopefully the opposite. And that will take years.

It also limits statistics just to those who will do it and more importantly not get endlessly distracted and exhausted by all those regularity condition games – e.g. defining a weird data generating model, insisting the parameter space not be bound away from 0 and that observed data be counterfactually taken as continuous – just so that with one parameter the MLE is inconsistent. Huh? Yes MLE inconsistency is important with multiple parameters but can be directly demonstrated with a large number of small 2 group studies with binary outcomes. https://radfordneal.wordpress.com/2008/08/09/inconsistent-maximum-likelihood-estimation-an-ordinary-example/

I agree—I hope people don’t think I was suggesting 3 or 4 years of math—that’d be 6 or 8 semesters. You really just need most of the concepts from the first two years of calc and linear algebra, because we have computers to do all the solving for us. Nobody needs to learn how to do an integral analytically any more than they need to learn long division (though it may help provide a foundation) or how to do Gaussian elimination by hand.

We’ve got lots of irons in the fire in terms of different case studies and presentations. I’m aiming a bit lower than Michael. Then we’re building up with Andrew and others to write the big Stan methodology book.

I liked Rasmus’ videos, and I think their content was helpful for students in my Masters level Bayesian class.

But I am — and am increasingly — skeptical of the idea that you can take someone without a lot of relevant math background and get them to the point where they can write meaningful Stan programs in one semester. I try (see https://courseworks.columbia.edu/access/content/attachment/QMSSG4065_001_2017_1/Syllabus/95ab6bf2-5d79-49e0-abc7-4d65e69ff299/BSSS.pdf ) with students that have as strong a background as one could hope for (in a social science program, at least) and can only spend the last month on using the Stan language. I think most of the students would be happier if I didn’t get to the Stan language at all (i.e. stopped with using Stan to estimate models via the **brms** R package) and went slower through the material in the first two months.

So, I cringe when the Stan developers do a 1 to 3 day workshop that tries to get people who have even less of a math background up to the point where they are writing Stan programs at the end. And (although I know Rasmus is going to soon try), I worry that there are not enough people that are prepared / committed enough to go through a sequence of several online courses to get to the point where they are writing Stan programs.

I don’t think this is a problem specific to Stan, but an issue with statistics in general. You can get people doing things, but understanding what they’re doing is much harder. Continuing my rant from above, I’d say learn math before stats, and basic probability and math stats before (or along with) Stan.

I’m trying to write the intro to math stats and Bayes in long form right now and should have a few chapters to share soon. I’m trying to write it more as a computer science book than as a stats book, by which I mean being very explicit. But I’m sticking to pseudocode rather than ghettoizing it into the R or Python communities. As Andrew once told me about writing the regression book with Jennifer, it’s hard to actually write things that are correct without the precise language of mathematics.

I personally found BDA impossible to understand as an outsider, despite having an undergrad degree in math (including enough analysis and topology to make the basic measure theory easy for me); what I didn’t have was a good enough understanding of probability theory as used in statistics (all the exponential family stuff he leans on, for example, or anything having to do with covariance) as well as too weak a background in math stats (didn’t really understand random variables or estimators as random variables well enough, and that’s roughly presupposed by BDA; the very first hierarchical modeling example in chapter 5 requires an understanding of moment matching for an empirical Bayes-like step and multivariate Jacobians for formulating the prior). I think learning stats has actually helped me learn the math. Especially things like Jacobians and properties of positive definite matrices.

I’ll continue to be an outlier here. I don’t believe much math is necessary for learning statistics – eventually, yes, but to get the basic critical thinking about data and its meaning I see nothing wrong with using friendlier software that does not require much math. This does invite people to do things they don’t really understand, but in the real world people are going to do that anyway. I’d rather they have some ability to think about data and do stupid things than see them do stupid things without any understanding of measurement, variability, randomness, etc.

I was not trained in Bayesian analysis and I have struggled to get a concrete handle on how a Bayesian analysis differs from a frequentist analysis. I produced a small discrete simulation, based on hypothetical political polling, that shows how a flat prior coincides with the frequentist approach, and how they differ according to the prior information (obtained from prior hypothetical political polls – this also allows me to talk at a conceptual level about conditions under which such a poll should be considered prior information or not). I am puzzled by Andrew’s preference for continuous examples since I find the discrete approach much easier conceptually (and mathematically, of course).

Sure. You have to start somewhere. I really love Jim Albert’s book Curve Ball for introducing statistical reasoning with almost no math beyond some pretty simple algebra. It gives you a great feeling for topics like uncertainty in estimation and regression to the mean (rookie of the year effect). For dealing with quantitative stuff, I think the place to start would be something like Garrett Grolemund and Hadley Wickham’s R for Data Science. That has great exercises for visualizing data and thinking about it, with lots of code to make it concrete.

Preference for continuity is because learning discrete things first can cement some really bad intuitions that block the ability to learn the continuous case. The same might be said for learning in low dimensions. Personally, I think it’s easier to start with discrete probabilities. I personally started with baseball and Dungeons and Dragons as a kid before I knew calculus.

The key thing to understand about what distinguishes Bayes is that everything’s modeled probabilistically. What’s hard is understanding the limitations imposed by frequentist scruples about not modeling parameters as random variables. That leads you into hypothesis testing, which doesn’t make sense most of the time in a Bayesian setting (a point hypothesis has probability zero in a continuous setting).

“I don’t think this is a problem specific to Stan, but an issue with statistics in general. You can get people doing things, but understanding what they’re doing is much harder.”

I agree, although if the user is not personally writing down functions that involve PDFs, it is not that necessary to know what a PDF is. So applied frequentists can do applied frequentism without knowing what a PDF is or any calculus. To do a regression, they just have to assert that “the errors have an i.i.d. normal distribution” and the implementation utilizes the normal log-likelihood function without the user seeing that level of detail. The user can then misinterpret the coefficients or the p-values or whatever, but not knowing what a PDF is is not the cause of that misinterpretation.

The difficulty for teaching the Stan language is that users do need to know what a PDF is (and some other calculus things like a change-of-variables), but they are usually coming from a frequentist or supervised learning background where that stuff is not only not taught but they are taught that “hard math” is not necessary for them to publish papers or get jobs in this or that field.

I see. So they really don’t get math stats before they hit your class. That’s really too bad, especially if you have no control over that (or no audience if you did require math stats). You can get people to feed data into the hopper and collect the analyses that get spit out. You can probably even get them to specify a regression model a la RStanArm. But I see the difficulty in getting them to write down a Stan model.

I was raised by wolves in machine learning out in the wilds of natural language processing. In that realm, everything is specified as a PDF that gets fed into some kind of estimator, usually some kind of penalized MLE or variational approximation or some ad hoc variant thereof. Though even there people will just grab someone else’s package and feed data into it and look at the results (clustering algorithms are particularly prone to produce this kind of exploratory data analysis).

Just to share a perspective I obtained when working with Doug Altman’s group in Oxford 2001-2003.

I mentioned to Doug that Nelder and McCullagh’s generalized linear models was based on likelihood. Doug’s response was “no it’s not, not at all.” So I showed him the contents list and pointed out 2.2.2 Likelihood functions. “Oh, I missed that”. Additionally, when I was talking to one of the statisticians in Doug’s group they firmly said – “if what you are suggesting involves Jacobians I have no interest at all. Haven’t had to deal with those since my second year calculus course.” They were doing good statistical analyses and explaining what they meant. So it’s not necessary.

Though I don’t think they could grasp likelihood functions, and they repeatedly expressed a lack of interest in anything that involved them, which led to this paper being re-written over and over again (until I think Doug just gave in) https://www.ncbi.nlm.nih.gov/pubmed/16118810 .

What are the pre-reqs to your QMSS class? Have they taken an intro to math stats like in the masters program in stats (at the level of the DeGroot and Schervish text)?

I’ve done three of these Stan courses and they’ve all assumed the audience already knew stats pretty well and just wanted to learn what was up with Stan. The more intro courses I’ve done have focused on teaching the basics of Bayesian stats rather than focusing on Stan.

I think there may also be a misconception that students should be able to learn everything in one pass. I think it helps to see the material so you know what you’re aiming for and what background you have to fill in. I usually find that I don’t understand the material from a class until I do the follow on course. I don’t think I understood calc until I did analysis and don’t think I understood analysis properly until I did topology. I did abstract linear algebra rather than sticking to the real numbers and learning about determinants; I didn’t understand any of that until I had to sort out how multivariate distributions work. Sure, understanding multivariate distributions is really fundamental, but you can go a long way developing intuitions in univariate settings (danger: the intuitions formed can be very misleading about things like concentration of measure).

or actually knuckle down and have to apply what I’ve learned with colleagues who are serious about getting details right.

Not everyone’s going to do that backfill, but I think for the Stan language itself (as opposed to brms or rstanarm or rethinking), we should be aiming at the audience who does have that background.

Bob said, “I think there may also be a misconception that students should be able to learn everything in one pass. I think it helps to see the material so you know what you’re aiming for and what background you have to fill in. I usually find that I don’t understand the material from a class until I do the follow on course.”

I don’t think any of those are actually correct in practice. Of course, each can be correct in some certain circumstances, but I think in the typical case, we’re quantifying something else. I’m not quite ready to defend a specific thing but the paper is in the works, hopefully it will make something clear.

What I think I can say though is that a Bayesian update can be thought of mathematically as doing two things:

1) Down-weight all regions of space by a factor (the likelihood) that measures how well the predictions of a model which has those parameter values match what is to be expected from comparing the model to reality. Imagine you take the likelihood and divide it by its maximum value, so then everywhere it’s less than or equal to 1. So multiplying by it takes the density down basically everywhere.

2) Reweight everything to maintain the total integral = 1, that is, increase the density in each remaining region by a quantity that is proportional to its current value until a global constraint is met.

Imagine a sandbox. You dig out some sand from all the regions that do a poor job. Then you use your bucket to put the dug sand back, in proportion to how high the remaining sand is.
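
For what it’s worth, the two steps above can be sketched on a grid. This is only a minimal illustration, assuming a flat prior on (0, 1) and a hypothetical data set of 7 successes in 10 trials; the function name `bayes_update` and the grid size are made up for the example, and the integral is approximated by a Riemann sum.

```python
def bayes_update(prior_density, likelihood, width):
    """Down-weight by the scaled likelihood, then renormalize (grid approximation)."""
    peak = max(likelihood)
    # Step 1: dig out sand. The scaled likelihood is <= 1 everywhere,
    # so the density goes down (or stays put) in every cell.
    remaining = [d * (l / peak) for d, l in zip(prior_density, likelihood)]
    # Step 2: put the sand back in proportion to what remains,
    # so that the (Riemann-sum) integral is 1 again.
    area = sum(remaining) * width
    return [r / area for r in remaining]

# Assumed toy setup: success probability theta on a grid of cell midpoints.
n_cells = 200
width = 1.0 / n_cells
theta = [(i + 0.5) * width for i in range(n_cells)]
prior_density = [1.0] * n_cells                      # flat prior on (0, 1)
likelihood = [t**7 * (1 - t)**3 for t in theta]      # 7 successes in 10 trials
posterior_density = bayes_update(prior_density, likelihood, width)
```

Dividing by the maximum of the likelihood changes nothing in the final answer, since step 2 rescales anyway; it only makes the “digging out” picture literal.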

The result is that high probability regions are regions where the model “works relatively well” according to the limits built in to its description: the prior (a measure of theoretical understanding of the purpose of the parameters within the predictive model) and the likelihood (a measure of agreement between prediction and observation)

There’s more to it, but this basic idea of measuring a kind of agreement is my main thesis.

I found it really useful to read John Stuart Mill from A System of Logic (1882, Part III, Chapter 18) on the topic of inductive inference; the most direct quote is the following.

We must remember that the probability of an event is not a quality of the event itself, but a mere name for the degree of ground which we, or some one else, have for expecting it. … Every event is in itself certain, not probable; if we knew all, we should either know positively that it will happen, or positively that it will not. But its probability to us means the degree of expectation of its occurrence, which we are warranted in entertaining by our present evidence.

Lots of great stuff in that chapter. He is clearly outlining an epistemic approach (one based on the theory of knowledge) to understanding statistics rather than a doxastic approach (one based on theories of belief).

For me, the pragmatists, Peirce, James, Dewey, et al., filled in the rest of the picture by focusing on community rather than individual knowledge and belief. I think there’s a straight line from here to Andrew’s views, but I’m sure he’ll correct me if he sees things differently.

Then Kolmogorov provided a consistent axiomatic system with its normative approach to epistemics and model-theoretic demonstration of consistency by way of measure theory. Classic 1930s mathematics following on from set theory and logic at the time. In what I think of as his greatest work, Kolmogorov later revolutionized axiomatic set theory and the theory of computation with Kolmogorov complexity (independently discovered by two others!).

Bob:
Your other comments require more thoughtful response, but here quickly Peirce very much disagreed with “Every event is in itself certain, not probable; if we knew all, we should either know positively that it will happen, or positively that it will not.” (e.g. his quip “All clocks are clouds”)

The one important inductive inference point in my post, I think, was the need for the prior and data generating models to both be connected to the world, to be meaningful representations in some sense of what is out there. Now what percentage of uncertainty is out there versus in the representation (that is better being a random one than systematic one given our knowledge), I don’t think is that important.

Thanks for the clarification on Peirce. I took a seminar on Rorty, but never really dove into the history or particulars of the progenitors (I’m very detail oriented, but not enough so to be a philosopher!).

You want to draw from the posterior but (think you) can more easily draw from the prior and re-weight that to make it from the posterior given prior=0 implies posterior =0. You will be right in cases where sampling from the full parameter space is OK.
(I have slides somewhere going through that stuff that I can send you if you wish).
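
A rough sketch of that re-weighting idea, in case it helps: draw from the prior, weight each draw by its likelihood, and use the weighted draws as if they came from the posterior. The toy model (a Normal(0, 1) prior on a mean and one Normal observation with unit variance) and the function name are assumptions for illustration only; they are not from the slides mentioned.

```python
import math
import random

def posterior_mean_by_reweighting(draw_prior, likelihood, n_draws=20000, seed=1):
    """Estimate a posterior mean from prior draws weighted by the likelihood."""
    rng = random.Random(seed)
    draws = [draw_prior(rng) for _ in range(n_draws)]
    weights = [likelihood(d) for d in draws]   # un-normalized importance weights
    total = sum(weights)
    return sum(w * d for w, d in zip(weights, draws)) / total

# Assumed toy model: mu ~ Normal(0, 1), one observation y ~ Normal(mu, 1).
y_obs = 1.0
estimate = posterior_mean_by_reweighting(
    draw_prior=lambda rng: rng.gauss(0.0, 1.0),
    likelihood=lambda mu: math.exp(-0.5 * (y_obs - mu) ** 2),
)
# For this conjugate model the exact posterior mean is 0.5,
# so the estimate should land close to that.
```

This works here because the prior puts mass everywhere the posterior does. When the posterior is concentrated in a region the prior rarely visits, most weights are near zero and the estimate gets very noisy.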

I’ll argue for the epistemic interpretation: a probability is (or can be) “the reasonable credibility of a proposition” (R. T. Cox). Presumably you’re aware of Cox’s Theorem; I have a new paper out that derives the same result from purely logical considerations. Starting with the idea that we want to extend propositional logic to handle degrees of certainty / plausibilities, I propose four *existing* properties of propositional logic that we should retain in the extended logic. As with Cox’s Theorem, I find that there is an order-preserving isomorphism between plausibilities and probabilities that respects the laws of probability. Furthermore, I obtain Laplace’s classical definition of probability as a *theorem*. These results are obtained without assuming that plausibilities are real values, or even totally ordered.

I’ve never tried to teach this, so I may be too optimistic, but I think it would be easier to explain Bayes by presenting the joint distribution and how we get the posterior distribution for the parameter conditioning on the data.

If students can interpret a histogram as a probability distribution, surely they can understand a 2d histogram with some additional effort. Once the basic intuition is acquired, it should be easier to introduce the continuous limit, the Monte Carlo approximation, etc. Something like this: http://imgur.com/a/2C9mA

I agree! It just needs a little bit more work describing what probability and probability distributions are, and then explaining that conditioning is not only something one does with one’s hair, but something you can also do with data.

In the two undergrad courses I taught at Duke in 2007, almost all the students were very confused by joint tables/histograms being conditioned on the data to reveal the posterior tables/histograms. Now they had just been taken through a binomial grid example (p=1/10,…,9/10) and 2 successes out of 3 draws with standard formulas. The way many coped with the confusion was to calculate P(p) from the marginal histogram of p and B(2/3|p) from the conditional histograms, then plug and chug the usual formula to get what was already given in the posterior histogram.
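
To make the joint-table view concrete, here is a small sketch of that binomial grid example. The flat prior over the nine grid values is an assumption (the prior used in class is not stated above), and the helper names are invented for the illustration.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of k successes in n independent draws with success chance p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

grid = [i / 10 for i in range(1, 10)]       # p = 1/10, ..., 9/10
n = 3
prior = {p: 1 / len(grid) for p in grid}    # assumed flat prior

# Joint table over (parameter, data): P(p, k) = P(p) * P(k | p).
joint = {(p, k): prior[p] * binom_pmf(k, n, p) for p in grid for k in range(n + 1)}

def condition_on(joint, k_obs):
    """Keep the column of the joint table matching the data, then rescale it."""
    column = {p: pr for (p, k), pr in joint.items() if k == k_obs}
    total = sum(column.values())            # marginal probability of the data
    return {p: pr / total for p, pr in column.items()}

posterior = condition_on(joint, 2)          # 2 successes out of 3 draws

# "Plug and chug" with the usual formula gives the same numbers.
unnormalized = {p: prior[p] * binom_pmf(2, n, p) for p in grid}
norm = sum(unnormalized.values())
formula = {p: v / norm for p, v in unnormalized.items()}
```

The point of putting both in one place is that conditioning is just “pick the column, rescale it”; the formula is the same operation written algebraically.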

But I think the bigger point is that when they do get “how we get the posterior distribution for the parameter [from the joint distribution by] conditioning on the data” – it does not seem to get them much further.

Considering that one stats prof of my acquaintance thought at one time (and maybe still does) that the non-parametric bootstrap was *by definition* generating bootstrap data sets by sampling with replacement from the original data set, I’m going to say that there’s not much to choose from there. (Apparently the distinction between (a) using the empirical distribution as a plug-in estimator of the sampling distribution, and (b) doing a Monte Carlo approximation of that, is more subtle than I had realized.)

No it really is very complicated (mathematically) – or at least Peter McCullagh was able to get a really technical paper out of it.

What is interesting, is that when he was giving a talk on his initial work on that paper – he realized he had yet to notice the following.

For ease of blog comment think of a sample size of 3 and using the empirical distribution as a plug-in estimator of true distribution. The sampling distribution of three samples drawn from 3 possible values has 27 possible samples. Pretending one cannot write that down but must sample from it, using sampling without replacement one will learn it exactly with a reasonable number of draws. The non-parametric bootstrap does sampling with replacement and so less efficiently learns the distribution with some error if the simulation is not large enough. With large sample sizes the difference is trivial.
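
A small sketch of the size-3 case, with made-up data values: all 27 ordered resamples can be written down, so the bootstrap distribution of the mean is available exactly, and the usual resampling is a Monte Carlo approximation to it.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
import random

data = [1, 2, 4]  # hypothetical sample of size 3

# Brute force: each of the 3**3 = 27 ordered resamples is equally likely
# under the empirical distribution.
resamples = list(product(data, repeat=len(data)))
exact = Counter()
for r in resamples:
    exact[Fraction(sum(r), len(r))] += Fraction(1, len(resamples))

# Monte Carlo: the usual bootstrap, sampling with replacement from the data.
rng = random.Random(0)
n_boot = 5000
counts = Counter(Fraction(sum(rng.choices(data, k=3)), 3) for _ in range(n_boot))
approx = {m: c / n_boot for m, c in counts.items()}
```

Comparing `approx` with `exact` shows the simulation error directly; with 3 data points it is cheap to remove, but the number of resamples grows as n**n, which is why simulation is the default.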

Do you mean that the definition given originally by Efron (cited in the presentation you linked to, by the way) is not valid because there exist extensions/improvements/corrections/alternatives? But you say that “the non-parametric bootstrap does sampling with replacement”. I still don’t understand what Corey’s issue is with the statement he quotes (that the non-parametric bootstrap was *by definition* generating bootstrap data sets by sampling with replacement from the original data set).

I appreciate you providing a case in point! ;-) You’re right that Efron skips over this distinction; I don’t know why. The first paragraph of Section 2.3 (on page 22) of Bootstrap Methods and Their Application by Davison and Hinkley is a bit clearer on the question.

Let’s take a step back. Suppose I want to know the probability distribution of the result if I roll four six-sided dice, drop the smallest outcome, and take the sum of the other three outcomes. One approach I can take is brute force: explicitly loop through the 1296 elements of the sample space and push the probability measure forward through the drop-smallest-take-sum-of-remaining function. Another approach I can take is Monte Carlo: simulate the four die rolls and calculate the result of interest over and over. Brute force is exact but would rapidly become impractical with more dice and trickier functions.
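
Both approaches fit in a few lines. This is just a sketch; the names and the number of simulations are arbitrary choices for the illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
import random

def drop_lowest_sum(rolls):
    """Sum of the rolls after discarding the single smallest one."""
    return sum(rolls) - min(rolls)

# Brute force: push the uniform measure on all 6**4 = 1296 outcomes
# forward through the function of interest.
outcomes = list(product(range(1, 7), repeat=4))
exact = Counter()
for rolls in outcomes:
    exact[drop_lowest_sum(rolls)] += Fraction(1, len(outcomes))

# Monte Carlo: simulate the four rolls over and over.
rng = random.Random(0)
n_sims = 20000
counts = Counter(
    drop_lowest_sum([rng.randint(1, 6) for _ in range(4)]) for _ in range(n_sims)
)
approx = {s: counts[s] / n_sims for s in range(3, 19)}
```

The exact table gives, for instance, probability 1/1296 for a total of 3 and 21/1296 for a total of 18; the simulated table only approximates these.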

The non-parametric bootstrap is simply “use the empirical measure as the estimator of the (unknown) probability law of the data”. The task then becomes “push the empirical measure forward through the functions defining the estimators of the things you actually care about.” You could do that with brute force if you wanted to and if it was practical; it’s just that often the only feasible way to proceed is by Monte Carlo simulation from the empirical measure which in the IID case turns out to correspond to “generate bootstrap data sets by sampling with replacement from the observed data”.

> The non-parametric bootstrap is simply “use the empirical measure as the estimator of the (unknown) probability law of the data”.

It’s based on that, but it’s not “simply” that. You’re missing the resampling aspect!

From Davison&Hinkley (page 2, “purpose of the book”): “The key idea is to resample from the original data – either directly or via a fitted model – to create replicate datasets, from which the variability of the quantities of interest can be assessed without long-winded and error-prone analytical calculation.”

The example following the paragraph you mentioned in page 22 ends with the following remark: “This resampling procedure is called the nonparametric bootstrap.”

Yeah, I’d probably say the Bootstrap is a resampling _procedure_ or _algorithm_ rather than a model, but which can of course be understood as relying on the empirical distribution as a plug-in estimator.

“… if I roll four six-sided dice, drop the smallest outcome, and take the sum of the other three outcomes”, I’d say you were rolling up a D&D character. I wrote a blog post a while back on advantage in D&D 5e advocating Monte Carlo methods.

This is also a good exercise to illustrate how many Monte Carlo simulations you need. For instance, there’s one outcome (3), that is 1/1296. You’re going to need a whole lot of Monte Carlo to estimate that probability to any degree of accuracy. I always try to make this point about the challenges of discrete sampling—it’s very bad at the tails.
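
A back-of-the-envelope version of “a whole lot”: the standard error of a simulated proportion is sqrt(p(1-p)/n), so asking for a given relative error r means n = (1-p)/(p r^2). The sketch below just evaluates that; the function name and the 10% target are assumptions for the example.

```python
from fractions import Fraction
from math import ceil

def draws_needed(p, relative_error):
    """Draws n at which the standard error sqrt(p(1-p)/n) equals relative_error * p."""
    return ceil((1 - p) / (p * relative_error**2))

p_three = Fraction(1, 1296)                               # chance of a total of 3
n_10_percent = draws_needed(p_three, Fraction(1, 10))     # 129500
n_1_percent = draws_needed(p_three, Fraction(1, 100))     # 12950000
```

So roughly 130,000 simulations are needed just to get the standard error down to 10% of the probability, and each extra digit of accuracy costs a factor of 100.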

There’s a whole school of thought that says you do even stats 101 with bootstrap, and I’ve found that yes even very beginning students can understand it assuming you are good at teaching and don’t rush. Causeweb has a lot of materials related to the Lock bootstrap based intro stats book and using the Statkey website.

I mean in this thread the definition of low math is pretty high, but if you want really low math, bootstrap is quite a good entry point to a lot of topics around uncertainty.

Bootstrap is great but the hard part is coming up with the estimate to bootstrap. That is, you have data y, estimate theta_hat(y), bootstrap distribution p(y_boot|y), and bootstrap distribution of the estimate, theta_hat(y_boot), induced by p(y_boot|y). The literature on the bootstrap is all about coming up with a reasonable p(y_boot|y) and also figuring out what to do with the induced distribution on theta_hat(y_boot). That’s all fine, but I think the real action is in deciding the estimator theta_hat(). I work on problems where least squares and maximum likelihood estimates don’t work well. You can take a bad estimate and bootstrap all you want, without the problem getting fixed. Again, this is not a slam on the bootstrap, just a reminder that bootstrap requires an externally supplied theta_hat().
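
In code, the point is that the estimator is an argument to the bootstrap, not something the bootstrap provides. A minimal sketch, assuming IID resampling for p(y_boot|y) and using made-up data with one outlier; `bootstrap_estimates` is a hypothetical helper, not anyone’s library function.

```python
import random
import statistics

def bootstrap_estimates(y, theta_hat, n_boot=2000, seed=0):
    """Apply an externally supplied estimator to resampled copies of the data."""
    rng = random.Random(seed)
    n = len(y)
    # p(y_boot | y) here is plain sampling with replacement from y.
    return [theta_hat(rng.choices(y, k=n)) for _ in range(n_boot)]

y = [2.1, 3.4, 1.9, 5.6, 4.2, 3.3, 2.8, 40.0]   # hypothetical data, one outlier

# Same bootstrap machinery, two different estimators plugged in.
boot_mean = bootstrap_estimates(y, statistics.mean)
boot_median = bootstrap_estimates(y, statistics.median)
```

Whatever `theta_hat` does badly on the original data it also does badly on every resample; the bootstrap describes the variability of the estimate, it does not repair it.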

> Bootstrap is great but the hard part is coming up with the estimate to bootstrap…a reminder that bootstrap requires an externally supplied theta_hat()

Yes, I agree.

> I work on problems where least squares and maximum likelihood estimates don’t work well.

Me too. The main problems with these are a) lack of robustness and b) problems with non-identifiable models.

In many ways I do think it is a good way to separate estimation theory from uncertainty of estimation (this is something I’ve changed my mind on). In this sense, statistics in the bootstrap-style is just about uncertainty of an estimator determined via other means.

Regularisation theory is one way of thinking about estimation theory. Bayes is another. They have some overlap but are not identical. Increasingly I prefer the mathematical foundations of the former to the latter. In particular, I think Bayes still has serious issues with a) and b) above.

First, any simplification compromises performance, so it detracts from popularity. I love ABC but it is not very practical in some situations (though IMHO that can be addressed).

Second, and perhaps more relevant to you, some of the “analogies” used, like your Galton machines, simulate a random process – but the structure of the model (the pin arrangement etc) is very far removed from the process being modelled (growing seeds). So it is like learning a new language. Not like the model plane in a wind tunnel.

I think the latter can be solved with software. But here I am thinking more Lego and SIM city than STAN.

But more generally, I’d be interested in knowing if there are any theory of learning out there that offer a principled answer to the question you raise. Just like people have proposed a grammar of graphics, I wonder if there are theories about representation of models and inference. A programming syntax seems SO antiquated in an age of AR, VR, and AI.

In Bayesian statistics you normally need at least one layer of probability modelling more than what corresponds to the observations. I think this (and the computational issues arising from it) is the major difficulty with straightforward access to Bayesian statistics at an elementary level.

I’m not sure what you mean by “start from” here. If you’re doing a standard bootstrap analysis of the variance of some random variable, that’s going to be a random variable defined as an estimator of some unknown parameter (that is, a function of the random variables representing the data). You draw simulated data sets from the original data with replacement and you get plug-in estimates for the variance of estimators. In practice, the plug in nature of defining new random variables in terms of others is very much like dealing with Bayesian posteriors calculated through MCMC.

Admittedly, I should have broken up this sentence “not that unusual in statistics to get the same statistical output (uncertainty interval, estimate, tail probability,etc.) for every sample, or some samples or the same distribution of outputs or the same expectations of outputs or just close enough expectations of outputs” in which the last phrase is for the bootstrap.

I think the differences are not something to make distinctions of – but we may have to just disagree about that.

But let me recall my first set of experiences with the obviously simple model free bootstrap.

When I started my biostats program at U of T (1984) Rob Tibshirani was a postdoc and gave us a seminar on the bootstrap. In the seminar, I asked “isn’t that just the method of moments but instead of matching the first four moments you are matching many more. We know that isn’t going to be a good idea.” Rob did not agree and I think he said that he did not think so.

So, over the next few years I would ask any faculty member I thought might know if the bootstrap was the method of moments. The best answer I got was – “every thing is finite so it likely is something like that”.

Then years later I came across the von Mises step function approximation in the Encyclopedia of Statistics – yikes that’s exactly the simple bootstrap – re-express the sample as a step function with 2n-1 (or was it n-1) steps (which is a distribution function). So assume a discrete distribution function and then estimate it by matching the first 2n-1 moments. Now re-read what goes wrong with the method of moments estimation … OK, there is lots wrong with the simple bootstrap and Efron and others have been arguing that only various fixed up versions should be used and some of these work well in certain problems.

Continuing on the math journey, von Mises step function approximations are orthogonal polynomial approximations and, most generally, Chebyshev systems (https://www.encyclopediaofmath.org/index.php/Chebyshev_system). This nicely displays Bob’s claim that as you continue to be engaged your level of math does or should increase. Anyway, Peter Hall was visiting Toronto years later so I got a chance to discuss this with him. His comment was that one could define the bootstrap as the method of moments but he preferred to define it in other ways that lead to more interesting math (really enjoyed my interactions with Peter, sad he is gone).

Now the main point – Efron mathematicized the bootstrap, thereby making his contribution to something people had been doing for years. Mathematical objects are never simple model free animals!

> Now the main point – Efron mathematicized the bootstrap, thereby making his contribution to something people had been doing for years. Mathematical objects are never simple model free animals!

I am very pro math. But ‘mathematicising’ is not, to me, the same thing as ‘modelling’ unless the latter is so broad as to mean little.

Why is ABC a seemingly attractive way to teach Bayes? I would agree it’s because it presents a constructive _procedure_ for carrying out Bayes updates instead of just an axiomatic or algebraic characterisation. The procedure itself seems more concrete even than the idea of Bayes updates.
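
For readers who have not seen it, the procedure in its simplest form looks something like the sketch below. Everything specific here is an assumption for illustration: a Uniform(0, 1) prior on a success probability, binomial data, a hypothetical observation of 7 successes in 10 trials, and exact matching of simulated to observed data (which is feasible only because the data are discrete and small).

```python
import random

def abc_rejection(k_obs, n, n_draws=50000, seed=0):
    """Keep prior draws whose simulated data exactly match the observed data."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        p = rng.random()                                  # 1. draw a parameter from the prior
        k_sim = sum(rng.random() < p for _ in range(n))   # 2. simulate data given that parameter
        if k_sim == k_obs:                                # 3. keep it only if the fake data match
            accepted.append(p)
    return accepted

kept = abc_rejection(k_obs=7, n=10)
posterior_mean = sum(kept) / len(kept)
```

With exact matching the kept draws are draws from the posterior, which for this toy setup is Beta(8, 4) with mean 2/3. Only about 1 in 11 draws survives, which hints at why the approach gets impractical once the data are richer.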

Of course you can and should analyse this procedure to see where it takes you. For example, you can characterise the real numbers algebraically, and you can characterise them via a construction procedure using the rationals. It’s nice to see that they agree.

On the other hand, constructive mathematics (for example) offers a different view on mathematical concepts that some students find helpful. For example, I recently-ish had a student returning to university mathematics after dropping out when younger in part due to difficulties with concepts like limits. I explained these ideas to them in terms of procedures and that seemed to help significantly.

In terms of introducing difficult concepts I think:

– Introduce a concrete problem to be solved or a clear goal

– Introduce a constructive procedure for tackling this specific problem, taking you through a series of reasonable and understandable steps

– Demonstrate the good properties of the chosen procedure for tackling the problem

– Discuss what can go wrong and the limitations of the procedure. E.g. using counterexamples.

The last two points are very important, of course, and can make all the difference.

To paraphrase one of Andrew’s least favourite people – we don’t _understand_ mathematics, but we learn how to _do_ it.

Now, back to Bootstrap.
Bootstrap is similar to ABC – its most direct interpretation is as a procedure. You can also analyse its properties as a procedure: where does this process ‘take you’?

One benefit, as I tried to argue, is that the ‘series of reasonable and understandable steps’ are simpler than ABC because it starts with a concrete given – the dataset – and not an abstract unknown – the parameter/prior over a parameter.

Bootstrap is direct, ABC/rejection is like the contrapositive. They may be equivalent (I would argue against in this case tbh) but one direction is more intuitive.

I don’t think it’s necessarily true that the bootstrap direction is more intuitive. I think it depends a LOT on where you’re standing. I’d argue that the prior over the parameter, in many modeling situations, is much more concrete and understandable thing than “the sampling distribution of the mean under repeated trials”. Specifically, take my blog example, the dropping paper balls experiment:

Now, in this experiment, the parameter g for gravitational acceleration, is a meaningful thing, it’s known to be somewhere close to 9.80 m/s^2 wherever you are on the earth, and there are tables for various cities, so it would be totally unreasonable to infer from dropping balls that g = 7.1 or g=12.16 is a reasonable value

Further, the aerodynamic 2*radius should be somewhere close-ish to the measured diameter of the ball, it has to be that order of magnitude. If we measure a ball at 7.5cm diameter and we get from our model that it falls as if it were a perfect sphere of diameter 0.1 cm we know we’ve gone wrong somewhere…

So, if your starting point is a physicsy mathematical model of some real stuff happening, the prior is often a very concrete idea of about what the different parameters should be because they have semantics associated to them… they mean something concrete. And, the idea of a sampling distribution of repeated trials under identical conditions… not so much. What does it even mean to have identical conditions? What you consider to be “identical” will strongly affect what you consider to be the “sampling distribution”, so the “sampling distribution” is a really abstract thing depending on some subtleties such that different groups of students would see potentially very different distributions when they try identical repetitions of certain experiments.

On the other hand, I do agree with you about constructive procedures and their role in understanding models. It’s no good to worry about abstract properties of mathematical objects… like the integrability of the indicator function on the irrationals… a totally meaningless thing for applied math. You can’t compute the decimal expansion of even *one* irrational.

Re: estimating gravity from falling objects under a physical model. You might be interested in this paper:

‘Statistics of Parameter Estimates: A Concrete Example’ by Aguilar et al. They include both Bayesian and frequentist methods (the frequentist approach is based on Philip Stark’s work in inverse problems). See http://epubs.siam.org/doi/abs/10.1137/130929230

As might be expected by the cynical, none of the methods actually work _especially_ well considering the example is so simple. But they all work OK ish. The culprit is likely model misspecification (e.g. the assumed drag law) and/or measurement issues. The confidence intervals seem to do the best though, being more conservative.

Model misspecification for sure. They treat estimating g as an important component. They have g to 5 decimal places already. If they use that as their prior then C becomes the real quantity of interest, which is truly unknown… Yet they don’t do that calculation. But, this is precisely the most important thing about Bayes, the ability to incorporate many sources of information in a systematic way. They should have done that calculation in addition to the others. Oh well, otherwise it is in fact a nice paper.

I’m not entirely convinced by Christian’s comments. I think that if students really understand the idea of probability modeling, then it’s not too big a step to go to a second layer of such modelling. I think the real problem is really understanding/accepting/internalizing the idea of probability modeling.

I think that probability modelling is very, very hard to “really understand”, and I don’t think it’s realistic to wait for this to happen before going on with the beef of statistics; rather, if we’re lucky, students get closer and closer to “real understanding” the more they learn. Having to deal with two layers from very early on doesn’t make this task any easier.

> doesn’t make this task any easier.
But it might make it more informative and profitable for future progress – or that’s the hope.

Your first statement is close to the overall impression I am getting here – these two stage rejection sampling teaching vignettes will enable folks to get some sense of what Bayes is and how it works. It’s not going to be anywhere near a competent sense or even a sense we hope for but it is something that hopefully sets up and encourages further engagement to get fuller senses over time.

The holy grail being to get a competent sense of what to make of Bayesian analyses addressing real problems – their upshot. That third, pragmatic sense needs to be based on recognition of what Bayes is and how (formally) it works but is open ended, so the journey never should end for any of us.

Several people have commented on how students’ backgrounds (in a number of ways) can affect how one needs to go about teaching them Bayesian statistics. My only experience actually teaching (an introduction to) Bayesian statistics was for an unusual audience: middle and high school mathematics teachers enrolled in a summer master’s program for such teachers. But I think there is a lot that I learned in teaching that course for four summers that might be relevant for introducing Bayesian statistics to other populations.

One thing is that having students work together on (ideally carefully chosen) assignments can be helpful. One reason is that they have to explain to their peers what they are doing rather than rely on passive intake. Also, different (sometimes subtly so) explanations may be needed for different students. So I might say something that helps one student “get it,” and that student may be able to add to what I said in a way (that I would not have thought of) that helps another student “get it”.

Most of the students really seemed to appreciate the Bayesian ideas, especially compared to frequentist concepts. (I think that was partly because I emphasized the details and subtleties of the frequentist ideas, rather than letting students get away with “the gist”)

I tried the material out when on a multi-day course entitled training for trainers with about a dozen civil servants who were also on the course (it was my project). I went through the Galton quincunx story in a presentation but also designed a two stage simulation with dice and cards that they were able to implement in small groups and get posterior probabilities. The Galton story was continuous like but the dice and card simulation was discrete, but other than that I got to tell the story, have them complete an exercise and discuss it afterwards.

Most seemed to enjoy the material and I do think they all grasped what was going on – but I don’t think any light bulbs actually lit up about this being what statistics is and how it works in real research. As my son aptly commented when he was a teen “Yea, Dad I see what’s going on with the pellets and why that subset of pellets (posterior) would be of interest, but shouldn’t there be a formula that does a better job?”
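For readers who want to try the pellet story themselves, here is a minimal sketch of the two stage simulation, set next to the kind of formula the teen was asking for. Everything specific in it is an assumption made for illustration: both stages are taken to be normal, the observed height of 1.5 is made up, and the names `two_stage_posterior` and `normal_formula` are hypothetical. It is not a reconstruction of the dice and cards exercise.

```python
import random
import statistics

def two_stage_posterior(observed, n_pellets=200_000, prior_mean=0.0, prior_sd=1.0,
                        growth_sd=1.0, tolerance=0.05, seed=1):
    """Drop pellets through both stages; keep the stage-1 value of those landing near `observed`."""
    rng = random.Random(seed)
    kept = []
    for _ in range(n_pellets):
        target = rng.gauss(prior_mean, prior_sd)   # stage 1: targeted height of the seed
        height = rng.gauss(target, growth_sd)      # stage 2: height the plant actually grew to
        if abs(height - observed) <= tolerance:    # pellet ended up (nearly) where we observed
            kept.append(target)
    return kept

def normal_formula(observed, prior_mean=0.0, prior_sd=1.0, growth_sd=1.0):
    """Closed-form posterior mean and sd when both stages are normal (assumed here)."""
    precision = 1 / prior_sd**2 + 1 / growth_sd**2
    mean = (prior_mean / prior_sd**2 + observed / growth_sd**2) / precision
    return mean, precision ** -0.5

kept = two_stage_posterior(observed=1.5)
print("pellets kept:", len(kept))
print("simulation  :", statistics.mean(kept), statistics.stdev(kept))
print("formula     :", normal_formula(1.5))
```

The subset of kept pellets plays the role of the posterior. The formula gives the same answer only because of the normal assumptions; the simulation keeps working when the two stages are changed to something with no tidy formula.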

As a clueless user of statistics, I gained a lot from taking four years to do the math+stats courses I needed and this made me able to understand books like BDA3 for the first time. I have seen mathematicians do Jacobians in their heads as they rush through transformations on the blackboard at lightning speed; I think that I would need a lot more practice to be able to do that fluently. I wish there was a book just on Jacobians for the Clueless Statistician, so one can practice these moves in the safety of one’s own home.

As a teacher of statistics, I think it’s OK if I can teach the mathematically unsophisticated student to fit linear mixed models in Stan and know what a vcov matrix is and such like things. I see these as gateway drugs; the student typically takes one of two trajectories. Either they drift forever in a low plateau of linear mixed models, never straying from the one model they know how to fit, or they get curious and put in the work to learn more and get hooked for life. The former will not get very far, but they will/may still do good work scientifically. The latter group, which is maybe 1% of my contact population, will almost always do well and come up with new and important insights. People who come in with strong negative opinions about math or who can’t learn to manage the rising sense of panic on not understanding a new concept usually don’t make it into the 1%.

To both groups, I try to teach that just asking for expert advice and “recommendations” will not serve them well. They have to know enough to make their own decisions and be able to defend them. In my field, people just think they can ask an expert whether to fit, e.g., full variance covariance matrices in a hierarchical linear model; they want a mechanical procedure they can apply no matter what data come in. And many people willingly step forward to provide such blanket “recommendations”. Then the poor user gets confused if someone else disagrees with such “rules”, and then plaintively asks on Facebook and Twitter, “what am I supposed to think? Someone please just tell me what to do.”

Lots of concepts discussed in this thread (like what is “random”, what is “distribution”, what is “parameter”) have nothing to do with math. These are all philosophical concepts and should be approached as such.

You can master all the measure theory behind statistics and still have no idea what you are doing when “computing a posterior for Theta”.

I’d say knowing at least a bit of measure theory is a necessary condition for understanding posteriors—it would be hard to say you know what you are doing computing a posterior if you don’t know at least enough about measure theory to formulate a proper definition. Once you know measure theory, defining the mathematics of Bayesian posteriors is trivial. But I take your point that it’s not sufficient. I could understand the math before I saw what the statistics was really all about and I think that’s partly philosophical. I think there are at least three stumbling blocks:

1. Models are just approximations of reality and involve choices by the modeler.

2. Bayesian inference is inductive inference—it shows how to update knowledge with uncertainty in the light of evidence with uncertainty. Bayesian probabilities model that uncertainty.

3. Statistics is counterfactual—we model past events probabilistically as if they might have turned out otherwise, even if they only have one value.

I think you can understand probability quite well without measure theory. Well enough to understand posteriors. Otherwise, anyone doing statistics before the 1930’s had no idea of what they were doing?

I think there are big advantages to understanding what is going on in integration, and differentiation as well. As you know, I’m a big fan of nonstandard analysis; I think it gives people who are comfortable with basic algebra a direct path to understanding the needed ideas from integration and differentiation without getting bogged down in traditional measure theory background that has basically no advantages for them. The difference between Lebesgue integration and Riemann integration, for example, is probably irrelevant for 95%+ of applied problems.

You can do essentially all of probability using nonstandard analysis and finite probability spaces. Edward Nelson’s PDF is a must-have:

I should say, his book isn’t for the mathematically naive. It’s not a way to learn about probability for people who know very little math. But, it’s a demonstration that traditional measure theory is mostly shrouding radically elementary ideas in a mystique of the pathology of infinities, Cantor sets, and Banach-Tarski paradoxes etc. Just sticking to finite but nonstandard sets gets you all the useful probability theory you could want.

“I’d say knowing at least a bit of measure theory is a necessary condition for understanding posteriors — it would be hard to say you know what you are doing computing a posterior if you don’t know at least enough about measure theory to formulate a proper definition.”

Things you will find in BDA: how to do Bayesian statistics at an advanced level.
Things you will not find in BDA: measure theory.

Actually the last chapter in the book “Dirichlet process models” (introduced in the 3rd edition and written I suppose by Dunson) uses a lot of measure-theoretic language. I don’t know if it could have been written in simple terms like the rest of the book, but in any case this shows you can do a lot before being confronted with measure theory.

I should’ve been more specific. I didn’t mean taking a graduate course in measure theory in the definition, theorem, proof, example style. I agree with everyone else that’s more than most people are going to need.

I only meant enough calculus to understand what limits, derivatives, and integrals are, and enough measure theory to understand event probabilities, random variables, and the related notions of expectation and covariance, as well as the key notion of concentration of measure (this last bit is the thing most people miss that is critical for understanding Bayesian posteriors). That amount of math stats and measure theory is presupposed by BDA. The very first hierarchical modeling example in chapter 5 assumes you can compute a multivariate Jacobian and match moments of a beta distribution. On the other hand, Gelman and Hill manage to mostly sidestep measure theory, though they provide a hand-waving definition of random variables (balls in an urn [aka sample space] with values written on them for the random variables—they don’t mention you need an uncountable number of balls!).

Trying to write about stats these days without measure theory would be like trying to write about gravitation without manifolds. Sure, it’s possible, but you’ll be left out of most of the conversations.

> Trying to write about stats these days without measure theory would be like trying to write about gravitation without manifolds. Sure, it’s possible, but you’ll be left out of most of the conversations.

OK students, you might have heard about this thing called ‘gravity’ but we won’t cover that until grad school after you’ve done analysis on manifolds.

(PS funnily enough I actually do think manifolds should be introduced much earlier in standard curricula…)

I do think if you’re going to teach stats 101 from a frequentist perspective, bootstrap and simulation are the way to go.

I personally would like to see people teach deterministic type model building first… ODEs and algebraic models and so forth. Teach people the importance of dimensionless quantities, and segue from choice of dimensionless scales to choice of priors. After all, most of dimensional analysis is about choosing scales so that things are O(1) and that goes reasonably well towards then putting a weighting function over the quantity.

Then, once you have a weighting function over the quantities… talk about how to find out more from collecting data and comparing predictions to the data under different possible values of the unknowns. How much weight should we give to parameter values that cause predictions to vary highly from measured? How about if predictions are close to measured? How do we measure “close”?

I wouldn’t even once mention the idea of “random”, just different weights to be given to different values of theoretical quantities, and different weights to be given to different degrees of difference between prediction and measured quantities.

Once those ideas are cemented, we can move on to how to compute with these weights… at that point you could introduce several computing techniques: resampling, ABC, HMC/MCMC, etc.
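The weighting idea in the last few paragraphs can be sketched on a grid before any of those computing techniques are needed. This is only an illustration under loudly stated assumptions: the model is a drag-free falling object, the three measurements are made up, the bell-shaped weighting functions and their scales are arbitrary choices, and `posterior_weights` is a hypothetical helper name.

```python
import math

def posterior_weights(grid, prior_weight, predict, data, closeness):
    """Weight each candidate value by its prior weight times how close its predictions are."""
    weights = []
    for value in grid:
        w = prior_weight(value)
        for x, measured in data:
            w *= closeness(predict(value, x) - measured)
        weights.append(w)
    total = sum(weights)
    return [w / total for w in weights]   # normalise so the weights sum to 1

# Assumed model: distance fallen d = 0.5 * g * t**2 (no drag).
def predict(g, t):
    return 0.5 * g * t * t

# Assumed weighting over g: bell-shaped around 10 m/s^2 with an O(1) scale.
def prior_weight(g):
    return math.exp(-0.5 * ((g - 10.0) / 1.0) ** 2)

# Assumed measure of "close": bell-shaped in the prediction error, scale 0.1 m.
def closeness(difference):
    return math.exp(-0.5 * (difference / 0.1) ** 2)

data = [(0.5, 1.21), (1.0, 4.95), (1.5, 11.0)]   # made-up (time in s, distance in m) pairs
grid = [8.0 + 0.01 * i for i in range(401)]       # candidate g values from 8 to 12

weights = posterior_weights(grid, prior_weight, predict, data, closeness)
best = grid[weights.index(max(weights))]
mean = sum(g * w for g, w in zip(grid, weights))
print("most heavily weighted g:", best)
print("weighted average of g  :", mean)
```

Nothing here mentions randomness: there is a weight over values of the unknown and a weight over sizes of the prediction error, and the two are multiplied. Resampling, ABC or HMC become relevant once the grid has too many dimensions to enumerate.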

These days I teach a lot of engineering students, in particular mathematics and mathematical modelling to engineering students.

I spend almost all of my time on ‘mechanistic’ modelling via ODEs, PDEs, reaction networks etc. I recently taught one lecture on estimation for some ODE models we’ve been deriving.

Given the time constraints I taught them about a) distances for measuring model fit to data, b) measures of model complexity and c) how to explore the trade off between a and b. Basically, penalised nonlinear regression (or multiobjective optimisation) with discussion of the role and ways of choosing/exploring penalties. No randomness required.

I would call this ‘parameter estimation’.

In terms of _data analysis_ I would just teach how to explore variability of useful statistics via bootstrap/resampling.
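A minimal sketch of what exploring the variability of a statistic by resampling might look like. The measurements are made up, and `bootstrap_statistic` and `percentile_interval` are hypothetical helper names. It shows only the mechanics and makes no claim that the resulting interval has its nominal coverage.

```python
import random
import statistics

def bootstrap_statistic(data, statistic, n_resamples=2000, seed=0):
    """Recompute `statistic` on resamples drawn with replacement from `data`."""
    rng = random.Random(seed)
    n = len(data)
    return [statistic(rng.choices(data, k=n)) for _ in range(n_resamples)]

def percentile_interval(values, level=0.90):
    """Middle `level` fraction of the bootstrap replicates."""
    ordered = sorted(values)
    lo_index = int((1 - level) / 2 * len(ordered))
    hi_index = int((1 + level) / 2 * len(ordered)) - 1
    return ordered[lo_index], ordered[hi_index]

measurements = [4.1, 5.3, 3.8, 6.0, 4.7, 5.1, 4.4, 5.8, 4.9]   # made-up data
replicates = bootstrap_statistic(measurements, statistics.median)
print("sample median      :", statistics.median(measurements))
print("spread of medians  :", statistics.stdev(replicates))
print("percentile interval:", percentile_interval(replicates))
```

Swapping `statistics.median` for any other function of the data is the whole appeal: students can see how much a summary moves around without deriving its sampling distribution.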

(We also teach computational/numerical methods where they code the methods from scratch and project-based design papers where they code heuristic optimisation/search algorithms themselves tailored to specific problems).

I think seriously don’t – but perhaps I should do a post to clarify why I think that.

Briefly, I think it’s just the mistaken sense that getting the sampling distribution solves statistical inference (Efron/Tibshirani preferring the use of bootstrap ideas and Cox/Fraser preferring the use of higher order asymptotics.)

When dealing with realistic problems (e.g. nuisance parameters, estimates that depend on higher moments, parameter spaces that are non-linear, etc.) it can be very hard to get actual coverage anywhere near supposed coverage. Yes, thoughtful experienced experts can get it right.

Efron once wrote (around 2000) that most statisticians don’t. That included me, in that when the percentile intervals were _similar_ to the BCA intervals in applications, I (mis)thought they had no advantage (i.e. I forgot to think about the distribution of intervals over repeated applications).

Most times I remember checking out bootstrap work done by colleagues, it was not valid according to technical papers written by experts.
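One way to see the gap between supposed and actual coverage is to repeat the whole application many times and count how often the interval catches the truth. The sketch below does that for a plain percentile interval around a sample variance, an estimate that depends on higher moments. The settings are assumptions chosen for illustration (exponential data, samples of 15, a nominal 90% level), and `estimated_coverage` is a hypothetical name. BCA intervals are not implemented here.

```python
import random

def sample_variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)

def percentile_interval(sample, statistic, rng, n_resamples=200, level=0.90):
    """Plain percentile bootstrap interval for `statistic`."""
    n = len(sample)
    reps = sorted(statistic(rng.choices(sample, k=n)) for _ in range(n_resamples))
    lo = reps[int((1 - level) / 2 * n_resamples)]
    hi = reps[int((1 + level) / 2 * n_resamples) - 1]
    return lo, hi

def estimated_coverage(n=15, n_experiments=300, level=0.90, seed=0):
    """Fraction of repeated experiments whose interval contains the true variance."""
    rng = random.Random(seed)
    true_variance = 1.0                      # variance of an Exponential(1) distribution
    hits = 0
    for _ in range(n_experiments):
        sample = [rng.expovariate(1.0) for _ in range(n)]
        lo, hi = percentile_interval(sample, sample_variance, rng, level=level)
        hits += lo <= true_variance <= hi    # True counts as 1
    return hits / n_experiments

print("nominal coverage  : 0.90")
print("estimated coverage:", estimated_coverage())
```

The point is the distribution of intervals over repeated applications. A single interval that looks reasonable says little about how the procedure behaves in the long run.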

Also, Efron did write about the bootstrap as being a Bayesian approach that automatically/implicitly defined a _reasonable_ non-informative prior.

> The very first hierarchical modeling example in chapter 5 assumes you can compute a multivariate Jacobian and match moments of a beta distribution.

How much measure theory do you need for that? Unless understanding multivariate calculus means that you know enough measure theory, even if you’ve never heard of a sigma-algebra.

> Trying to write about stats these days without measure theory would be like trying to write about gravitation without manifolds. Sure, it’s possible, but you’ll be left out of most of the conversations.

That’s a different question. It’s useful for multiple reasons but it should not be an end in itself. And it doesn’t necessarily help students who are not interested in pursuing an academic career in this field.

“If anyone wants to concentrate his attention on infinite sets, measure theory, and mathematical pathology in general, he has every right to do so. And he need not justify this by pointing to useful applications or apologize for the lack of them; as was noted long ago, abstract mathematics is worth knowing for its own sake.
But others in turn have equal rights. If we choose to concentrate on those aspects of mathematics which are useful in real problems and which enable us to carry out the important substantive calculations correctly – but which the mathematical pathologists never get around to – we feel free to do so without apology.”

Yes, I think that “multivariable calculus” is really more what Bob is talking about than measure theory. I do think that in constructing a sampler like Stan, working with ideas around Hamiltonian dynamics and so forth, you need something more than an undergrad multivariable calc class, but most people are using Stan not developing it. Even as a user it’s useful to understand some ideas about manifolds etc but ultimately it’s not measurability and the pathologies of infinities etc that are the keys to the castle, it’s functions over many variables.

As I tried to lay out in another comment, I just meant understanding that we’re dealing with a sample space, a collection of events represented as subsets of that space, and event probabilities satisfying some basic axioms. The event probability function is the measure. That’s all I meant. That lets you understand random variables. Without that basic understanding, I think it’s hard to get along. And of course Andrew knows all this stuff.

The stuff I think of as “measure theory” is all that stuff about “measurable functions” and “measurable sets” and how the individual points have probability 0 and only certain sets even have probability, and the probability of an event is no longer the sum of the probabilities of the elements in it. All that is pretty unintuitive and requires you to adopt a point of view that is, in my opinion, contrary to what is useful in model building. Hence, I like the nonstandard number of balls in an urn view of mathematical probability. It corresponds to a direct idealization of what is actually done (namely, using RNGs to give you IEEE floats and the like).

The next thing I think is of concern is that teaching mathematical probability theory tends to move you away from other interpretations of probability and cement a “one true interpretation” view. For example the Jaynesian plausibility of the truth view has nothing to do with balls and urns so to speak. So, I don’t exactly disagree with you about the importance of having some mathematical knowledge, but I think some subtle thinking needs to go into how to convey useful mathematical concepts without having them shut-out useful model-building ways of thinking. The math can have many different meanings attached to it, and how you teach the math seems to affect how well people are able to transport the math across to the meaning.

I don’t think I know any measure theory. Or maybe I do, and I just don’t know that it’s called that. The most theoretical things I’ve ever done are the appendix in BDA (which was my reconstruction of arguments that I’d heard before but not ever seen formally explained; actually it turns out it was all in De Groot’s book, but I didn’t know that at the time) and my 0.234 paper (but Gareth took over the writing of that one so the proof ended up being written in some sort of mathematical code that I couldn’t really follow; my own proof was much more long-winded and I’m sure was less rigorous).

Honestly very comforting to hear you say that Andrew. I always go back to BDA and find it usually to be ‘more than enough’ rigor for what I work on. The idea of needing to master measure theory or nonstandard analysis or whatever to ‘really’ do Bayes/stats is a bit horrifying :)

An event is a subset of the sample space, $\Omega$. Let $\mathcal{F}$ be the collection of events (there are a few reasonable closure conditions such as that $\Omega$ is an event and we have complements and countable unions—we almost never need to worry about these details).

A probability measure is a function $\Pr : \mathcal{F} \rightarrow [0, 1]$ satisfying some basic axioms (like positivity and that disjoint events have additive probabilities—these details are all obvious).

Oh, and a random variable is a function $X : \Omega \rightarrow \mathbb{R}$ with cumulative distribution function $F_X(x) = \Pr[\{\omega \in \Omega : X(\omega) \leq x\}]$.

If you want to get fancy, you can define a pdf as the derivative of a continuous cdf.
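For a concrete feel, the definitions above can be spelled out on a small finite sample space, where none of the closure subtleties arise. This is a sketch under assumptions not in the comment: the sample space is two fair dice, every subset counts as an event, and the names `omega`, `pr`, `total` and `cdf` are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product

# Sample space: the 36 ordered outcomes of rolling two fair dice (an assumed example).
omega = list(product(range(1, 7), repeat=2))

def pr(event):
    """Probability measure: an event is any subset of omega; outcomes are equally weighted."""
    return Fraction(len(set(event) & set(omega)), len(omega))

def total(outcome):
    """A random variable is just a function from outcomes to numbers."""
    return outcome[0] + outcome[1]

def cdf(random_variable, x):
    """Probability of the event 'random_variable is at most x'."""
    return pr({w for w in omega if random_variable(w) <= x})

sum_is_2 = {w for w in omega if total(w) == 2}
sum_is_3 = {w for w in omega if total(w) == 3}

print(pr(omega))                                               # whole space
print(pr(sum_is_2 | sum_is_3) == pr(sum_is_2) + pr(sum_is_3))  # disjoint events add
print(cdf(total, 7))
```

Using exact fractions keeps the additivity check free of rounding. The continuous case replaces counting with integration, which is where the pdf comes in.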

OK, this is mostly just stuff from a good somewhat rigorous intro to probability course. I know you’re directing this at Andrew, but by measure theory I start thinking about subtle distinctions between Lebesgue and Riemann integrals etc.

Thanks for laying out what you mean by “measure theory” — my understanding of that phrase is in line with Chris and Juho’s comments.

But if we’re talking about teaching how to apply statistics, we also need to give examples of how this applies to real world problems. So my question is: If you were teaching statistics, how would you illustrate your definition in a context such as studying the effect of a drug on a certain population of patients?

This comment thread has confirmed for me what I had begun to think after reading comments by you experts here for several years: colleges should offer majors (and probably should form new departments) called Data Analysis, or Data Science. It is a distinct field that of course includes statistics, and philosophy of statistics, but also measurement theory, machine learning, and the history, sources, and ethics of big data. Add new courses in meta-analysis and “replication.” It is plainly a rich enough vein to fill 16 to 20 undergraduate semesters when treated in depth. My daughter is seeking this major, in fact.

I don’t think we need a new major (but then I don’t think we need “machine learning” or “uncertainty quantification” either). I think we need to be teaching all these things in statistics as well as visualization and communication and serious computing.

Creating new faculty at a university is always an uphill battle. Departments jealously guard tenure lines. Data science has been a boon to both computer science and statistics departments in terms of faculty growth, but I don’t think either wants to see data science departments. As far as I know, Columbia’s Data Science Institute doesn’t have any faculty tenure lines.

You are the experts. I’m just a lawyer. What I see is the way you all discuss things. The problem you identify again and again is that the formal statistical methods you need to teach your undergrads in a hurry are just a slice of the process of getting from the chaotic “real world” to a pile of “data,” to a sensible question (NHST, boo!) to a reasonable answer. Also, apparently online processes now create more “information” every few days than was available from all of the world’s printed sources in history before 2003. What is it? Where is it? What good is it? What tools do we have to sift it? What counts as good programming of those tools?

It was actually your comment of August 2, and Shravan’s and Andrew’s comments, that highlight for me the utility of situating stats courses in a broader, less rushed curriculum about data gathering, analysis, and use. Making the math concrete.

Your comment underscores that the adjacent academic departments will always “do their own thing” better and at higher research levels than the type of engineering department I’m describing (and may be better at avoiding what you view as fads, if I understand your scare quotes around “machine learning”). Sure, that’s a given. But once upon a time we didn’t have statistics departments, they were math. We didn’t have computer science departments, they were — what? I.E.? Math? We once didn’t have area studies or urban studies or epidemiology. Fields and departments seem to arise when faculty start saying, gee, a lot of our students could benefit from looking at the picture in this adjacent way, but we just don’t have the time in our degree sequence, and it would be a shame to sacrifice all the cool stuff our senior undergrads get into.

I think I see you all talking that way. And I know your students are growing up in a world uniquely — uniquely, mind! — awash in “data.” The world needs academic faculty aimed directly at that phenomenon, and I’d be willing to bet it will get them, in large part because rich parents and donors will see what I see.

“But once upon a time we didn’t have statistics departments, they were math. We didn’t have computer science departments, they were — what? I.E.? Math?”

I’m not sure this is entirely true. Yes, once upon a time we didn’t have statistics departments, but my impression is that statistics was surfacing in a variety of academic departments (as well as non-academic environments) — yes, math was one of them, but statistics was developed in various ways by economists, demographers, astronomers, biologists, industrial engineers, and others.

“The world needs academic faculty aimed directly at that phenomenon, and I’d be willing to bet it will get them, in large part because rich parents and donors will see what I see.”

The university from which I am now retired did not have a statistics department until after I retired, and I was involved in the move to create one, so can offer some interesting comments on how a statistics department might arise.

1. In preparation for co-writing a proposal for a statistics department (not the first attempt anyone had made at this — a previous one a decade or so earlier had failed), I looked at the “sister institutions” that my university usually compared itself to. I found that of the nine or so, Texas and Indiana were the only ones without a statistics department. These were also the only universities in the group that did not have any of the following: A school of medicine, school of public health, or school of agriculture. I think this was not coincidental, but that the existence of such entities was usually crucial in making the case for a statistics department.

2. At the time of the proposal, the Deans involved were of the opinion that getting the OK for a department at that time was not likely, but that a proposal for a Division of Statistics had a better chance of being accepted, so that’s what we proposed.

3. Coincidentally, there was also an interest within Natural Sciences for a unit (separate from the Department of Computer Sciences) dedicated to Scientific Computation.

4. The Dean decided to combine both initiatives, and proposed a Division of Statistics and Scientific Computation, which was approved.

5. After several years, that Division was “upgraded” to a Department of Statistics and Data Sciences.

Bottom line: The establishment of a new department can involve a variety of factors and may vary from place to place.

Nope, not what I view as fads. Deep belief nets are really effective non-linear classifiers. I just think of machine learning as part of statistics.

I used to work in computational linguistics, which was split between linguistics departments and computer science departments. I always thought it was just linguistics done by people who knew more math and computer science. Because of the computational burden and the need to develop algorithms, it had a home in computer science. And because there were applications, like search and speech recognition. I think that’s also why machine learning is in computer science—there are some hard computational problems and somebody needed to do them. And the work needed to do them looked like systems and algorithms, so it looked like computer science. I just think it should be part of statistics. Just because we use computers to get the answer doesn’t make it computer science (and I’m a computer scientist by training more than anything else!). Physics uses a ton of computing, but there aren’t computational physicists in computer science departments. Why not? I think it’s because physicists are better at doing their own computing than statisticians.

I’m still hearing what I’m hearing. The disciplinary boundaries aren’t fully satisfactory to you, and there is a higher unity behind (or across) them. That is in part how sociology and demography and political science spawned urban studies in the 1960s and 70s, something I do know a little about. I don’t know, but I suspect that schools of public health earlier grew out of medicine and demography in a similar way. Anyway, I won’t have anything to do with this.

I also want to clarify that I don’t think statistics is doing statistics better. I think statistics has done itself a disservice focusing on a lot of esoteric theory and on hypothesis testing and classical estimation rather than prediction. I think the need for prediction and the lack of satisfaction trying to get it out of traditional statistics textbooks, is where the real split is at, and is in large part the driving force behind the rise of machine learning in computer science. I think it’s also the motivating factor behind the rise of Bayesian methods in statistics departments. There are some non-probabilistic systems like SVMs (and even those are statistical in some sense), but for the most part, machine learning is based on probability theory the same way statistics is.

“the process of getting from the chaotic “real world” to a pile of “data,” to a sensible question (NHST, boo!) to a reasonable answer”

Learning how to do this process, which is just what I would call “science”, is what PhD candidates in research are learning how to do, in principle. And I think it’s no mistake that 1) a PhD takes a long time to complete, and 2) it evidently doesn’t appropriately cover all the steps anyway.

That is to say, while I agree with the general idea of an education focused on these issues, I can’t see that an undergrad major would be enough space to do justice to it. Current trends in what the data science community cares about teaching aren’t encouraging, either. Either you get a research scientist who understands “data gathering” pretty deeply and mostly ad-hocs the analysis, or you get a data scientist who has the analysis down but thinks “data gathering” means finding the right URL and being critical about the codebook.

“I would argue that an undergrad education probably doesn’t give enough perspective to do all of this, even though the basic mathematical tools are there. You need to be comfortable building things from scratch and dealing with people in intense situations. I’m not sure how to train someone for the latter…”