…I am discovering that one of the biggest sources of confusion about the foundations of statistics has to do with what it means or should mean to use “background knowledge” and “judgment” in making statistical and scientific inferences. David Cox and I address this in our “Conversation” in RMM (2011)….

Insofar as humans conduct science and draw inferences, and insofar as learning about the world is not reducible to a priori deductions, it is obvious that “human judgments” are involved. True enough, but too trivial an observation to help us distinguish among the very different ways judgments should enter according to contrasting inferential accounts. When Bayesians claim that frequentists do not use or are barred from using background information, what they really mean is that frequentists do not use prior probabilities of hypotheses, at least when those hypotheses are regarded as correct or incorrect, if only approximately. So, for example, we would not assign relative frequencies to the truth of hypotheses such as (1) prion transmission is via protein folding without nucleic acid, or (2) the deflection of light is approximately 1.75” (as if, as Peirce puts it, “universes were as plenty as blackberries”). How odd it would be to try to model these hypotheses as themselves having distributions: to us, statistical hypotheses assign probabilities to outcomes or values of a random variable.

However, quite a lot of background information goes into designing, carrying out, and analyzing inquiries into hypotheses regarded as correct or incorrect. For a frequentist, that is where background knowledge enters. There is no reason to suppose that the background required in order sensibly to generate, interpret, and draw inferences about H should—or even can—enter through prior probabilities for H itself! Of course, presumably, Bayesians also require background information in order to determine that “data x” have been observed, to determine how to model and conduct the inquiry, and to check the adequacy of statistical models for the purposes of the inquiry. So the Bayesian prior only purports to add some other kind of judgment, about the degree of belief in H. It does not get away from the other background judgments that frequentists employ.

This relates to a second point that came up in our conversation when Cox asked, “Do we want to put in a lot of information external to the data, or as little as possible?”

As I understand him, he is emphasizing the fact that the frequentist conception of scientific inquiry involves piecemeal inquiries, each of which manages to restrain the probe so as to ask one question at a time reliably. We don’t want our account to demand we list all the possible hypotheses about prions, or relativistic gravity, or whatever—not to mention all the ways in which each can fail—simply to get a single inquiry going! Actual science, actual learning is hardly well captured this way. We will use plenty of background knowledge to design and put together results from multiple inquiries, but we find it neither useful nor even plausible to try to capture all of that with prior degrees of belief in the hypotheses of interest. I see no Bayesian argument otherwise, but I invite them to supply one.[i]

A fundamental tenet of the conception of inductive learning most at home with the frequentist philosophy is that inductive inference requires building up incisive arguments and inferences by putting together several different piece-meal results. Although the complexity of the story makes it more difficult to set out neatly, as, for example, if a single algorithm is thought to capture the whole of inductive inference, the payoff is an account that approaches the kind of full-bodied arguments that scientists build up in order to obtain reliable knowledge and understanding of a field. (Mayo and Cox 2010)

An analogy I used in EGEK (Mayo 1996) is the contrast between “ready-to-wear” and “designer” methods: the former do not require huge treasuries just to get one inference or outfit going!

Some mistakenly infer from the idea of Bayesian latitude toward background opinions, subjective beliefs, and desires, that the Bayesian gives us an account that appreciates the full complexity of inference—but latitude for complexity is very different from latitude for introducing beliefs and desires into the data analysis! How ironic that it is the Bayesian and not the frequentist who is keen to package inquiry into a neat and tidy algorithm, where all background enters via quantitative sum-ups of prior degrees of belief in the hypothesis under study. In the same vein, I hear people say time and again that since it is difficult to test the assumptions of models, we should recognize the subjectivity and background and be Bayesians! Wouldn’t it be better to have an account that provides methods for actually testing assumptions? One that insists that any unresolved knowledge gap be communicated in the final report in a way that allows others to critique and extend the inquiry? This is what an error-statistical account requires, and it is at the heart of that account’s objectivity. The onus of providing such information comes with the requirement to report, at least approximately, whether formally or informally, the error-probing capabilities of the given methods used. We wish to use background information, not to express degrees of subjective belief, but to avoid being misled by our subjective beliefs, biases and desires!

This is part and parcel of an overall philosophy of science that I discuss elsewhere.

[i] Of course, in some cases, a hypothesized parameter can be regarded as the result of a random trial and can be assigned probabilities kosher for a frequentist; however, computing a conditional probability is open to frequentists and Bayesians alike—if that is what is of interest—but even in those cases the prior scarcely exhausts “background information” for inference.

21 thoughts on “More on using background info”

No big disagreement but just the note that de Finetti-type old school subjectivists don’t think that it makes sense to put probabilities on the truth of hypotheses. They use this formalism only as a technical device to have (hopefully) well interpretable prior predictive probabilities for observable events.
(You probably know this anyway but your text may suggest otherwise.)

Thanks Christian. Yes, I should add that (and of course, there’s lots, lots more to say here). This was an outgrowth of the empiricism of the day (where some even hoped to reduce theories to equivalent observation sentences). But priors as “technical devices” leave the foundations of the enterprise in limbo.

The problem with de Finetti’s views is not, in my opinion, that they leave the foundations in limbo. I believe that he has rather well explained how to interpret probabilities and why it makes sense (using the betting rate operationalisation) to apply them to observable events only. I’d think that there is something for contemporary Bayesians to learn in there (or at least to not forget so often).

In many practical setups I’m not a “de Finettian” (for example when it comes to the kind of general theories you are interested in) but I think that credit should be given where it is due.

Christian: I think the betting rate operationalisation is a disaster, both for scientific practice and for foundations. Recall Senn’s paper (from RMM) and our discussion of it on this blog. But I’m interested in what you think contemporary Bayesians can learn from him.

Well, disaster or not, the betting rate operationalisation gives a clear explanation of what is meant by “probability”. Whether this is useful for what scientists want to do is a different story of course.
I’d qualify the frequentist explanation of probability as “fairly clear” (infinite repetition etc.) and pretty much all Bayesian ones apart from betting rates for observable events as “unclear” (I haven’t made my mind up about Jaynes yet who may be slightly better).
So what they can learn from de Finetti is a proper explanation of what the prior (and therefore the posterior) means. Many of the contemporary Bayesians apparently have a frequentist interpretation of the sampling model in mind and mix that up in the same formalism with a prior that is a very different animal, so nobody can know what the posteriors mean.

Christian: If one is prepared to admit one’s account of evidence is inapplicable to scientific hypotheses or even generalizations, or perhaps admit a very different account of evidence is needed for them, then maybe all learning (that remains) could be cashed out in terms of changes to betting ratios. Do you think that is plausible? I think that is quite a bizarre way to view learning, so I don’t really see your point that we get from de Finetti “a proper explanation of what the prior (and therefore the posterior) means”.
Recall as well the issue of temporal incoherence, and dynamic Dutch books:

The people discussed in relation to these posts included philosophers Howson and Williamson, both denying that they advocate updating by Bayes’ Theorem (though one is subjective, the other “objective”). Many reject Bayesian conditionalisation, and recommend we start over on each experiment. So what happened to the purported advantage of mathematically incorporating the prior?
In the following I allude to a post from Gelman’s blog:

Mayo: It seems that you discuss much broader issues than the point I wanted to make. I defend betting rates on observable events as a proper explanation of a possible meaning of probability. I’m not saying that it is the only legitimate meaning that probability can be given (in my own work I’m usually frequentist), and neither am I saying that the de Finettian way is the one that should be taken when dealing with scientific hypotheses (although I have a weak spot for sticking to what is observable; but discussing this in detail would probably open Pandora’s box).
I have written before, in this blog, that I find the issue of interpretation of probabilities problematic in Gelman’s work and I don’t find Howson or Williamson more convincing in that respect either. I guess it’s true that they debunk betting representations, but that in itself is not an argument.
More interesting is the Dutch book discussion. My take of it is that “avoiding Dutch books” is not so much a real necessity (among other things for the reasons given in the linked discussion) but a constructive device for the subjectivist interpretation. There is no need to subscribe to it but if you do, you get the probability formalism out of it (with very few further ingredients). If you don’t, you need something else in order to explain probabilities from a Bayesian point of view and I can’t see that they have anything as convincing.

Christian: Even if, as you say, betting rates on observable events give a proper explanation of a possible meaning of probability (at least of subjective probability), it would not have thereby given an adequate explanation of what if anything probability has got to do with evidence or statistical inference. We may agree on that, and you’re right that my interest is with inference/evidence or the like. But just limiting oneself, as I think you are for the moment, to understanding probability via betting: don’t these betting representations rely on basic events in games of chance (to obtain the indifferences), and so do not these also rely, deep down, on a relative frequency notion?

1) I’d say that de Finettian probabilities in fact have got to do with evidence insofar as they tell us how existing subjective beliefs should be updated in face of new evidence (assuming that one subscribes to the “avoid Dutch books”-logic). This is all fine if that’s what is wanted (for example it may be quite useful in decision situations with clearly quantifiable loss).
I agree that when it comes to inference about scientific hypotheses, it doesn’t yield the kind of statement/knowledge that we are after (I see why a de Finettian could disagree; but in this respect they should defend their views themselves).
2) Subjectivism doesn’t need indifference. The individual may or may not assess basic events as indifferent. There are no restrictions in this respect (only in objective Bayes).
3) @David Rohde: I like your posting, but I work on mixture models and I suspect that it is not very transparent for people who don’t.

Really deep, deep down, the definition either assumes linear utility of money (perhaps for small amounts of money) or uses the principle of indifference, e.g. a lottery. So you are right that there is a problem, but it’s not quite the one you describe… and as far as I know there is no answer to this problem.

… so there is something to add to your collection of anti-Bayes arguments. :)

Thanks Christian, I was hoping you would comment (knowing you worked in the area).

My point was that if you take prediction as fundamental, some apparent technical problems in Bayesian statistics disappear. Although please feel free to elaborate, seeing as you did such a good job explaining de Finetti.

The original operational subjective theory is, to be pedantic, not about updating probabilities and therefore not about temporal coherence… so counterexamples against Bayesian updating because of difficulties with temporal coherence, such as the ones you mention, are not relevant to the purist operational subjective theory.

I think it is reasonable to say the operational subjective theory is a pretty lean theory… no probabilities for parameters and no updating… this means most criticism of Bayes more generally just isn’t relevant….

Also I am not sure temporal coherence is even that relevant in practice. I write the best statistical software that I can now, and there are good reasons to say that it must implement approximate conditioning… The arguments that the software should implement conditioning are not diminished by examples that suggest I might reasonably disagree with the software’s output in the future.

Not sure whether I can explain the mixture models issue better than David did (particularly because it may require a general introduction to mixture models)… the point is, if you have a model with ambiguous parameterisation (i.e. more than one parameter vector indicating the same model, as happens in the mixture model with label switching), interpretations of the prior and posterior over the parameter space depend on how exactly probability is distributed between the different versions (parameters) of the same model. This is behind the problems with MCMC David wrote about. However, as it is the same model, all these parameter choices produce the same predictions for future events, so for the predictive distribution it doesn’t matter where (out of the different parameters yielding the same model) the probability goes.
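
The invariance described above can be checked numerically. What follows is only a minimal sketch, assuming a two-component Gaussian mixture with made-up parameter values; the helper names (`normal_pdf`, `mixture_pdf`, `relabel`) are hypothetical and chosen purely for illustration. It shows that swapping the component labels changes the parameter vector but leaves the density, and hence any prediction for future observations, unchanged.

```python
import math
from itertools import permutations


def normal_pdf(x, mean, sd):
    """Density of a Normal(mean, sd) distribution at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, weights, means, sds):
    """Density of a finite Gaussian mixture at x."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in zip(weights, means, sds))


def relabel(weights, means, sds, perm):
    """Return the same mixture with its component labels permuted."""
    return (
        [weights[k] for k in perm],
        [means[k] for k in perm],
        [sds[k] for k in perm],
    )


# Hypothetical parameter values, for illustration only.
weights, means, sds = [0.3, 0.7], [-2.0, 1.5], [1.0, 0.5]
grid = [-4.0 + 0.5 * i for i in range(17)]

# Largest change in the density over the grid, across all relabellings.
max_gap = 0.0
for perm in permutations(range(len(weights))):
    w, m, s = relabel(weights, means, sds, perm)
    for x in grid:
        gap = abs(mixture_pdf(x, w, m, s) - mixture_pdf(x, weights, means, sds))
        max_gap = max(max_gap, gap)

print("largest density difference across relabellings:", max_gap)
```

The parameter vectors differ after relabelling, yet the printed difference is zero up to floating-point rounding, which is the sense in which it "doesn’t matter where the probability goes" for the predictive distribution.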

Christian: By the way, I meant to note that I do quite agree with you that
“Many of the contemporary Bayesians apparently have a frequentist interpretation of the sampling model in mind and mix that up in the same formalism with a prior that is a very different animal, so nobody can know what the posteriors mean.” That is, there’s a tendency to mix up different meanings of the prior.

This is a crucial problem. However, it’s even difficult to clearly point out all the equivocations, at least I haven’t been able to. There’s so much of a tendency to slur together “how frequently (a value occurs),” “how much weight of evidence for…,” “how much uncertainty do you feel toward…” coupled with simply defining the prior as some kind of “mathematical structure used to obtain posteriors”. Add to that a willingness to allow slip-sliding between all of these even in a single problem.* I am trying to lend some clarity in the book I’m writing, but it’s quite difficult and may be hopeless. Maybe it doesn’t matter in certain applications….

Mayo: If I may just chip in the Jaynesian point of view on this question: like the prior distribution, the sampling distribution captures and encodes information about what data are plausible. In the IID setting, this translates directly into the expected frequency distribution by Bernoulli’s weak law of large numbers. The Jaynesian view of the weak law of large numbers is distinct from that of the frequentist view in that in the former, the weak law of large numbers gives a prediction, whereas in the latter, the weak law of large numbers is a definition or a tautology.

It seems to me that Bayesians of whatever school tend to take the weak law of large numbers as read, leading to the appearance of equivocation from a strict frequentist point of view.
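
The link between an IID sampling distribution and observed relative frequencies can be illustrated with a small simulation. This is only a sketch under stated assumptions: the Bernoulli parameter `P_SUCCESS` and the function `relative_frequency` are hypothetical, introduced here for illustration. It shows the behaviour the weak law of large numbers describes and takes no side on whether that law is read as a prediction or as part of what probability means.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

P_SUCCESS = 0.3  # hypothetical Bernoulli parameter, chosen for illustration


def relative_frequency(n_trials, p=P_SUCCESS):
    """Fraction of successes in n_trials independent Bernoulli(p) draws."""
    successes = sum(1 for _ in range(n_trials) if random.random() < p)
    return successes / n_trials


# The relative frequency settles near p as the number of trials grows.
for n in (10, 1_000, 100_000):
    print(n, round(relative_frequency(n), 4))
```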

Corey: I don’t have a clue as to why you call the LLN a tautology (for frequentists), but perhaps it is relevant to note “the relationship between the [LLN] construed as an empirical law and as a mathematical law” (Mayo 1996, EGEK*, Chapter 5 Models of Experimental Inquiry p. 169). I follow Neyman here. Sorry, am in transit.
*accessible from my blog LHS.

Mayo: It’s on page 165 in the dead tree edition of EGEK that you’ve most kindly mailed to me.
I call it a tautology because, in my understanding, the sine qua non of frequentism is the claim that mathematical probability’s *only* scientifically valid use is the modelling of relative frequency. So I used the word tautology in the sense of the original Greek: tauto, “the same” and logos, “word/idea”.
I don’t disagree with anything written in the section to which you’ve referred me: it makes a positive argument for modelling relative frequency with mathematical probability, but nothing in it requires that the modelling of relative frequency be the sole scientific use of mathematical probability. The argument for the uniqueness of this use of mathematical probability comes elsewhere in the book, I think.

Corey: First, I’ve no idea why your comments are being held up—I’ll ask the Elba folk. Second, I’ve utterly lost or failed to grasp the train of your last two posts (in relation to what I thought this post was talking about)—but maybe if I try, this will get to some of the mystery surrounding what appears straightforward (to error statisticians): Frequentist statistics/error statistics/sampling theory have always been empirical accounts–they get their justification in terms of empirical connections e.g., between relative frequencies, as derived from a probability model, and relative frequencies in “real experiments” (as Neyman calls them). And a tautology has a fairly uncontroversial meaning: it is a proposition that is true by definition of its components, true in all worlds, noncontingent, or the like, e.g., A V ~A.
Finally, I see now (that I’m on the ground) that the quote is actually on p. 168 of the dead tree volume of EGEK, but the page you give, 165, is also relevant for the “empirical fact of long-run stability” (where long run just means long enough to see a pattern emerge). I hope people will look at those few pages, if they are interested. I’m hoping tomorrow to launch into a new section of the book directed precisely to what we were talking about: the issues on using background knowledge, belief, and disbelief in inquiry.

Mayo (I can’t nest under your comment): I was trying to offer a brief Bayesian account lacking “all the equivocations” and “slip-sliding”. I got distracted into defending my use of the word “tautology”.

Echoing Hennig’s point about lessons of de Finetti, and trying to give a practical example.

An implication of de Finetti is don’t do decision theory on the parameter space, but rather on the predictive distribution. Gelman makes the same argument, although of course he would not attribute it to de Finetti.

To give a practical example… in mixture models “label switching” is sometimes seen as a problem. The standard Gibbs sampler for a mixture will often mix well within one important mode of the posterior and fail to mix between modes. If this poorly mixing sampler is used you can take a point estimate (apply decision theory to the parameter space) and get a reasonable answer for the location of clusters.

On the other hand, if your MCMC sampler works better than expected (perhaps because a better algorithm is used), then due to symmetries in the setup it is obvious a priori that the posterior for all clusters must be identical. As a consequence, if you attempt to do decision theory on the parameter space you will not get a meaningful location of the clusters.

In short if your algorithm works badly point estimates of the parameter work out ok, but if it works well they don’t.
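
A toy emulation makes the contrast concrete. This is not a Gibbs sampler: it is a minimal sketch that fakes “posterior draws” around two hypothetical cluster locations (`TRUE_MEANS`), once with a single fixed labelling (the poorly mixing case) and once with the labels swapped at random (the well-mixing case, assuming a posterior symmetric under relabelling). All names and values are assumptions made for illustration.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

TRUE_MEANS = (-2.0, 1.5)  # hypothetical cluster locations
N_DRAWS = 20000


def stuck_draw():
    """Emulated draw from a sampler that stays in one labelling."""
    return tuple(m + random.gauss(0.0, 0.1) for m in TRUE_MEANS)


def mixing_draw():
    """Emulated draw from a sampler that visits both labellings equally."""
    draw = stuck_draw()
    return draw if random.random() < 0.5 else draw[::-1]


def component_means(draws):
    """Point estimate on the parameter space: average of each labelled mean."""
    n = len(draws)
    return tuple(sum(d[k] for d in draws) / n for k in range(2))


def predictive_mean(draws):
    """Mean of an equal-weight mixture averaged over draws (label-invariant)."""
    return sum(0.5 * (d[0] + d[1]) for d in draws) / len(draws)


stuck = [stuck_draw() for _ in range(N_DRAWS)]
mixing = [mixing_draw() for _ in range(N_DRAWS)]

stuck_est = component_means(stuck)
mixing_est = component_means(mixing)

print("stuck sampler, per-label means: ", stuck_est)
print("mixing sampler, per-label means:", mixing_est)
print("predictive means:", predictive_mean(stuck), predictive_mean(mixing))
```

In this emulation the per-label averages from the stuck sampler sit near the two cluster locations, while those from the label-switching sampler collapse toward a common value between them. The predictive summary, which does not depend on labels, is essentially the same in both cases.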

by Gilles Celeux, Merrilee Hurn, Christian Robert
a powerful MCMC algorithm is employed. As the MCMC works well, standard point estimation techniques don’t; they suggest elaborate alternatives that appear to work well.

Other smart people do other smart things to get around the symmetries in the posterior to combat the label switching problem.

The trouble for me is, I don’t think label switching is a problem – failure to label switch is a problem… and I think if the operational subjective view had a more persistent voice in applied work (as well as foundational discussion) the community might understand this issue better.