Different meanings of Bayesian statistics

I had a discussion with Christian Robert about the mystical feelings that Bayesian statistics sometimes seems to inspire. The discussion originated with an article by Eliezer, so it seemed appropriate to put the discussion here on Eliezer's blog. As background, both Christian and I have done a lot of research on Bayesian methods and computation, and we've also written books on the topic, so in some ways we're perhaps too close to the topic to be the best judges of how a newcomer will think about Bayes.

Christian began by describing Eliezer's article about constructing Bayes’ theorem for simple binomial outcomes with two possible causes as "indeed funny and entertaining (at least at the beginning) but, as a mathematician, I [Christian] do not see how these many pages build more intuition than looking at the mere definition of a conditional probability and at the inversion that is the essence of Bayes’ theorem. The author agrees to some level about this . . . there is however a whole crowd on the blogs that seems to see more in Bayes’s theorem than a mere probability inversion . . . a focus that actually confuses, to some extent, the theorem [two-line proof, no problem, Bayes' theorem being indeed tautological] with the construction of prior probabilities or densities [a forever-debatable issue]."

I replied that there are several different points of fascination about Bayes:

1. Surprising results from conditional probability. For example, if you test positive for a disease with a 1% prevalence rate, and the test is 95% effective, you probably don’t have the disease.
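
To make the arithmetic in that example concrete, here is a minimal sketch in Python. It assumes that "95% effective" means the test has both 95% sensitivity and 95% specificity, which is one reading of the phrase and not something stated above; the function name is made up for illustration.

```python
def posterior_given_positive(prevalence, sensitivity, specificity):
    """P(disease | positive test), computed with Bayes' theorem."""
    # Probability of a positive result from someone who has the disease.
    true_pos = sensitivity * prevalence
    # Probability of a positive result from someone who does not.
    false_pos = (1.0 - specificity) * (1.0 - prevalence)
    return true_pos / (true_pos + false_pos)


# Assumed reading of "95% effective": sensitivity = specificity = 0.95.
p = posterior_given_positive(prevalence=0.01, sensitivity=0.95, specificity=0.95)
print(f"P(disease | positive) = {p:.3f}")
```

Under these assumptions the posterior probability is about 0.16, because the false positives from the healthy 99% of the population outnumber the true positives from the sick 1%.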

My impression is that people have difficulty separating these ideas. In my opinion, all five of the above items are cool but they don’t always go together in any given problem. For example, the conditional probability laws in point 1 above are always valid, but not always particularly relevant, especially in continuous problems. (Consider the example in chapter 1 of Bayesian Data Analysis of empirical probabilities for football point spreads, or the example of kidney cancer rates in chapter 2.) Similarly, subjective probability is great, but in many, many applications it doesn’t arise at all.

Anyway, all of the five items above are magical, but a lot of the magic comes from the specific models being used (and, for many statisticians, the willingness to dive into the unknown by using an unconventional model at all), not just from the simple formula.

To put it another way, the influence goes in both directions. On one hand, the logical power of Bayes' theorem facilitates its use as a practical statistical tool (i.e., much of what I do for a living). From the other direction, the success of Bayes in practice gives additional backing to the logical appeal of Bayesian decision analysis.

I also think Bayes gets a bit of mystique from being part of the only formal models of intelligence we have (AIXI etc).

I think AIXI models something, I just don’t think that something happens to be getting stuff done in the real world. Specifically, I don’t think the point of intelligence is to find the best function between the input and output, unless you stretch the meaning of input and output to include the whole physical interaction of the computer and the world (which would break AIXI).

Daniel Burfoot

I think a lot of the mysticism revolves around the issue of picking a prior, and how that essentially arbitrary choice can totally change the conclusions you arrive at from your analysis.

For me, the best way to clear up the mysticism is to make the switch from probabilities to codelengths. Here’s how it works. Instead of trying to find the “probability” of a data set (what does that even mean?), you’re trying to encode it and send it to your friend. There is a one-to-one mapping between codelengths and probabilities, so nothing is lost by doing this. However, when you start thinking in the encoding-data mode, certain things become very clear:

1) There is no question but that you have to pick a prior. The prior is just the data format that you and your friend agree on to transmit data. There can be no funny business. You cannot change the prior after seeing the data (overfit), because if you do your friend won’t be able to decode it.

2) If you get lucky by choosing a data format/prior that happens to match the data well, you will get good (short) codes.

3) If your format is really a meta-format, that is, if it allows the sender to look at the data, analyze it, develop a new specific format, and then send the data using that specific format, then obviously you need to send information regarding the new specific format in advance of the actual data. Furthermore, your choice of meta-format will affect the choice of specific format, and there is just nothing that can be done about this.

4) The age-old philosophical question of “zero probabilities” (do we ever assign anything zero probability?) becomes in this context a rather more practical question: can our data format be used, in principle, to send any observed data set (perhaps with a very long code)?

5) In the limit as we assume less and less about the type of data that we observe, and correspondingly make our data format more and more general, we eventually arrive at the Solomonoff distribution. Here the data format is simply the specification of a Turing machine.
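
The mapping between probabilities and codelengths in the points above can be sketched in a few lines of Python. This is an illustration under assumptions that are not in the comment itself: it uses the ideal codelength of -log2(p) bits, ignores rounding to whole bits, and the three-symbol distributions are invented for the example.

```python
import math


def codelength_bits(p):
    """Ideal codelength, in bits, of an outcome that the format gives probability p."""
    if p == 0.0:
        return math.inf  # a zero-probability outcome cannot be sent at all
    return -math.log2(p)


def expected_codelength(true_dist, prior):
    """Average bits per symbol when data follow true_dist but are coded with prior."""
    return sum(q * codelength_bits(prior[s]) for s, q in true_dist.items() if q > 0)


# Hypothetical source and two candidate "data formats" (priors).
true_dist = {"a": 0.5, "b": 0.25, "c": 0.25}
lucky_prior = dict(true_dist)                     # happens to match the data
uniform_prior = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}

lucky = expected_codelength(true_dist, lucky_prior)
generic = expected_codelength(true_dist, uniform_prior)
print(f"matched prior: {lucky:.3f} bits, uniform prior: {generic:.3f} bits")
```

With these made-up numbers the matched format averages 1.5 bits per symbol and the uniform one about 1.585, which is point 2 in miniature. The infinite codelength returned for a zero probability corresponds to point 4: such a format cannot send that data set at any length.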

I have no objection to your formulation, but your comments do have answers within the standard world of statistics:

1. Picking the prior is no more difficult than picking the data model (the likelihood, in statistical jargon).

2. You write, “Instead of trying to find the ‘probability’ of a data set (what does that even mean?)…” Probabilities can be calibrated empirically; see, for example, chapter 1 of Bayesian Data Analysis.

Daniel Burfoot

Right, my point is simply a philosophical one that thinking in terms of codelengths can clear up issues that seem to inspire “mystical feelings” when talking about probabilities or priors. The actual answers are all the same.

anonym

I think part of the “mystical” feeling probably comes from the realization of how widely applicable Bayes’ theorem is and a sense that it can function as something like the foundation of a calculus of thought such as people like Leibniz and Boole have sought for so long and can be applied to every aspect of thought.

This is like number 5 in the post, but when phrased as “coherent reasoning”, it sounds like something that would be applicable occasionally to matters of the intellect alone rather than also to mundane, seemingly non-intellectual decisions like which route to take home from the office, how much to trust somebody initially and how to update appropriately, whether chili sauce is likely to improve this dish, and almost everything else. The quasi-mystical feeling comes from the sense of having unlocked one of the universe’s great secrets, a secret that is foundational and thus applicable to pretty much every aspect of life and thought, every single day of your life. When viewed like this, it’s a bit like language itself, and I imagine that somebody who could all of a sudden become a language user after life without language would have similar epiphanies.

Cyan

Daniel Burfoot,

I’ve got some questions about the codelength version of Bayesian statistics, which I’m assuming is synonymous with MML:

– do you have any recommendations for a good introductory text with some problem sets?
– MDL attempts to avoid the necessity of using a prior distribution; how does it go wrong?
– is there a codelength/MML analogue to Bayesian posterior inconsistency results such as those of Persi and Diaconis [ref]?

Cyan

(I had a brainfart — that should be Freedman and Diaconis.)

mjgeddes

Heh,

The ‘mystical feelings’ inspired by Bayes are quite misplaced. Bayes can be beaten.

The assumption required for Bayesian reasoning to work is that the possible outcomes being assigned probabilities are independent of the motivations of the agent observing these outcomes. As the motivations of the agent start to become mixed with the possible outcomes, Bayesian reasoning starts to break down.

This is clearly seen in puzzles of Anthropic reasoning, such as the Doomsday argument, or puzzles of self-reference such as Newcomb’s box, which cannot be solved via Bayesian methods.

The reason Bayes breaks down in these situations is that the boundaries between ontological categories are somewhat fluid, whereas Bayes requires that these boundaries be precisely defined. The reason for the ambiguity of ontological categories is the impossibility of finite algorithmic definitions of the meaning (semantics) of many concepts.

Analogy formation is a more general and powerful method of reasoning than Bayes, because analogy formation provides a means to ensure interoperability between different knowledge domains, and thus it can deal with fluid ontological categories.

Bayesian reasoning is merely a special case of analogy formation, namely the case where the semantic meaning of concepts is fixed (i.e., the case where the boundaries between ontological categories are precisely defined).

The assumption required for Bayesian reasoning to work is that the possible outcomes being assigned probabilities are independent of the motivations of the agent observing these outcomes. As the motivations of the agent start to become mixed with the possible outcomes, Bayesian reasoning starts to break down.

It sounds to me as though you have your concept of reason muddled-in with ideas from decision theory. “What is true” and “what to do” are rather different issues.

Daniel Burfoot

Cyan,

The foundation of MDL is information theory; the standard textbook by Cover and Thomas has many good problems. For a good tutorial specifically about MDL, search Google Scholar for “Grunwald” and “MDL tutorial”. Also see Vapnik’s awesome “Nature of Statistical Learning Theory” for a good discussion of the relationship between Bayes, MDL, and VC-theory style regularization as well as other intriguing topics.

I don’t believe MDL does avoid the necessity of using a prior distribution; I believe it makes the necessity of choosing such a distribution philosophically clear and unavoidable (it is the choice of data format agreed on in advance between sender and receiver).

The problem of inconsistency is a deep technical one and as I said I don’t believe the codelength view offers any technical advantages, only philosophical ones. The basic analogue of the inconsistency problem seems to be the case where you have a data format that could in principle achieve low code rates for a certain data set, if you could infer the right set of model parameters; but for whatever reason the data “tricks” you into inferring the wrong parameters and so you can’t get the optimal low codelength.

One of the best starting points is Baxter and Olivier’s “MDL and MML: Similarities and Differences”, 1994 (Tech Report in three parts). David Dowe’s page http://www.csse.monash.edu.au/~dld/MML.html may also be interesting for you.
The best intro paper on MDL is probably Grünwald’s “A Tutorial Introduction to the Minimum Description Length Principle”, which also addresses your question about priors in MDL (and mentions some consistency results, if I remember correctly). Grünwald’s recent book on MDL also makes for an interesting read, if you want to dig deeper. Li & Vitányi’s canonical book on Kolmogorov complexity will give you the most profound understanding of the topics Daniel Burfoot mentioned above.

anonym: I think part of the “mystical” feeling probably comes from the realization of how widely applicable Bayes’ theorem is and a sense that it can function as something like the foundation of a calculus of thought such as people like Leibniz and Boole have sought for so long and can be applied to every aspect of thought.

“Calculus of thought” is a solved problem. What Leibniz sought, Boole found, and Frege, Russell, and Whitehead brought to completion: the calculus of thought is propositional and first-order predicate calculus. That is, the calculus with which we must reason, to reason validly, not the ways in which we do reason, which is whatever our meatware does, valid or not, and is not a calculus.

A problem with regarding Bayesian reasoning or anything else — quantum logic, modal logic, or whatever — as the calculus of thought is that on the metalevel, we always go on reasoning in POML: plain old mathematical logic, where things are simply true, or false. Bayes’ theorem itself is of this nature. It is about probabilities, but is not a probabilistic statement. What is the probability that P(A|B) = P(B|A) P(A)/P(B)? One. Bayesian and other calculi are mathematically accurate models of certain aspects of the world, but they do not model valid reasoning in general. Standard mathematical logic does.

Tyrrell McAllister

@Richard Kennaway

I think the question of whether probabilistic reasoning or POML reasoning is more fundamental is not so straightforward. When one searches for a “most” fundamental logic, one finds that the structure of the candidate systems becomes “loopy”, with each perfectly capable of being embedded within the others.

For example, should you go with POML or intuitionistic reasoning in mathematics? There’s no formal criterion. Every intuitionistic theorem is a classical theorem once you restrict the quantifiers properly. Every classical theorem is an intuitionistic theorem if you replace “P or Q” with “not-(not-P and not-Q)”.

This is just an approximation.
The only way to check a mathematical proof is by feeding it to a proof-verifying physical system. That system could e.g. be a program running on a digital computer or a human (including yourself). Unless you have perfect and absolutely reliable knowledge about both the physical system and the physical laws of this universe – which you can’t, in practice – there’s always the chance that the system in question does it wrong; in this case, that it outputs “proof is correct” even though the proof isn’t.
Feeding the same possible proof to very many different proof-verifying physical systems can dramatically decrease the probability of an incorrect verdict like that. It can’t push it to exactly zero.
My personal probability of Bayes’ theorem being incorrect is ‘epsilon’, i.e. nonzero but too low for me to bother tracking the exact order of magnitude.

Cyan

Thanks, Daniel and Manuel!

Daniel, I’ve read Grünwald’s tutorial, and it’s clear that MDL is not Bayesian: just look at the normalized maximum likelihood distribution, which when it exists solves the MDL problem — and violates the likelihood principle, as it requires a summation over the data space. (…And does not require an explicit prior; not sure if there’s an implicit prior determined by the choice of data format, per your statement.) Rissanen is pretty contemptuous of the Bayesian approach (and the frequentist approach and the likelihood approach; dude has some strong opinions). I was hoping for a nice 70 page Grünwald-style tutorial on MML, as I have not been able to find one myself.

This has been an interesting discussion, revealing to me that the participants in this forum have a much different perspective on Bayes than Christian Robert and I have.

I have little to add to the discussion except to comment on Richard Kennaway’s statement that “standard mathematical logic does [model valid reasoning in general].”

One thing I’ve learned in applied statistics is that there are lots of different logical frameworks that can work well. So, sure, mathematical logic can model valid reasoning, so can Bayes, so can fuzzy sets, machine learning, etc. All these systems can work, and they all have logical holes too. It’s the nature of inference.

Intuitionistic logic is the only one I can think of that has any possibility of competing for the office of calculus ratiocinator. I can just about imagine conducting all one’s thought on every meta-level without ever assuming that if a proposition cannot be false, it must be true. But a classicist can pass among intuitionists just by prefixing everything with not-not, and who’s to know if a professed intuitionist isn’t doing the same? Personally, I think intuitionism was a historical accident that would never have happened if Babbage’s machines had been more practical, but that is another story.

@Sebastian: My personal probability of Bayes’ theorem being incorrect is ‘epsilon’, i.e. nonzero but too low for me to bother tracking the exact order of magnitude.

If I’m not assigning actual numbers, I’m not doing probabilistic inference, even if I believe I am. Even if I have an actual epsilon not plucked out of the air, if I throw it away, I’ve reverted to POML.

BTW, combining the last two points, I see from Google that there is such a thing as intuitionistic Bayesianism. I do not know how well known this is.

@Andrew: All these systems can work, and they all have logical holes too.

Logical holes in POML? Gödel’s completeness theorem proves their absence. (His incompleteness theorems talk about theories expressed in POML, not POML itself.) Did you have something else in mind?

mjgeddes

It sounds to me as though you have your concept of reason muddled-in with ideas from decision theory. “What is true” and “what to do” are rather different issues.

You don’t have to be a genius to see that there’s something suspiciously incomplete about Bayes. The Anthropic puzzles have not been solved using Bayesian methods, and Bayesian experts report being ‘confused’ in certain cases such as the Doomsday argument. This hints that there may be more powerful reasoning methods.

We know that Deduction is merely a special case of Induction (namely the case where the probabilities are set to 100%). In other words, Deduction is merely a shadow of Induction. Could it be that Induction in turn is merely a shadow of some as yet undiscovered more powerful method still? If so, there would be some shifting parameter which is not probability, but that when set to some special case would look like a probability.

Semantic distance, perhaps?

Daniel Burfoot

Cyan,

See, this kind of terminological disagreement illustrates why I think it’s better to use the codelength idea 🙂

Can normalized maximum likelihood be used to send data? If so, then it implies an implicit prior over data sets which is exactly 2^(-l(x)), where l(x) is the length of the code. Whether or not this means it is “equivalent” to Bayes would seem to depend on what the word “Bayesian” means to you; in my lexicon it means a philosophical commitment to the necessity of using prior distributions that are essentially arbitrary. Once you’ve accepted that priors are necessary, then the rules for updating them are mathematical theorems which are no longer disputable.

Note that the above argument “Can method X be used to send data? If so, then it implies an implicit prior over data sets…” works for a wide range of methods X (e.g. Support Vector machines, Belief nets) which various people have claimed are not explicitly Bayesian.

It also means they are ALL subject to the mighty No Free Lunch Theorem which says roughly that in general, data compression cannot be achieved. All modeling and statistical learning techniques should therefore be prefaced by disclaimers noting that “this method does not work in general, but if we make certain assumptions about the nature of the process generating the data…”
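
Both halves of this argument can be illustrated with a short Python sketch. Everything specific in it is an assumption made for the example: the four-item prefix code is invented, and the second part is only the standard counting (pigeonhole) argument behind the claim that no lossless code can shorten every input, not a statement or proof of the No Free Lunch Theorem itself.

```python
from fractions import Fraction


def implied_prior(codelengths):
    """Map each data set x to 2^(-l(x)), where l(x) is its codelength in bits."""
    return {x: Fraction(1, 2 ** l) for x, l in codelengths.items()}


# Hypothetical prefix code over four data sets, with codewords 0, 10, 110, 111.
codelengths = {"w": 1, "x": 2, "y": 3, "z": 3}
prior = implied_prior(codelengths)
kraft_sum = sum(prior.values())  # never exceeds 1 for a uniquely decodable code
print("implied prior:", {x: str(p) for x, p in prior.items()}, "sum =", kraft_sum)


def count_shorter_strings(n):
    """Number of binary strings of length strictly less than n (empty string included)."""
    return sum(2 ** k for k in range(n))  # equals 2**n - 1


n = 8
inputs = 2 ** n                             # distinct n-bit data sets
shorter_outputs = count_shorter_strings(n)  # distinct codewords shorter than n bits
print(f"{inputs} inputs but only {shorter_outputs} shorter codewords")
```

Because there are always fewer short codewords than inputs of a given length, a format that shortens some data sets must lengthen others. In the language of the argument above, the implied prior can only put extra mass on some data sets by taking it away from the rest.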

Andrew, thanks for starting this discussion, looking forward to future OB posts from you (don’t tell Eliezer that you’re into things like the Gibbs sampler and Metropolis algorithm, though).