Bayesian Probability Theory and Quantum Mechanics

John Baez

September 12, 2003

It's not at all easy to define the concept of
probability. If you ask most
people, a coin has probability 1/2 of landing heads up if,
when you flip it a large number of times, it lands heads up
close to half the time. But this is fatally vague!

After all, what counts as a "large number" of times?
And what does "close to half" mean? If we don't
define these concepts precisely, the above definition is
useless for actually deciding when a coin has probability 1/2 to
land heads up!

Say we
start flipping a coin and it keeps landing heads up, as
in the play Rosencrantz and Guildenstern Are Dead by
Tom Stoppard. How many times does it need to land heads
up before we decide that this is not happening with probability
1/2? Five? Ten? A thousand? A million?

This question has no good answer.
There's no
definite point at which we become sure the probability
is something other than 1/2. Instead, we gradually
become convinced that the probability is higher. It seems
ever more likely that something is amiss. But, at
any point we could turn out to be wrong. We could have been
the victims of an improbable fluke.
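To make "gradually becoming convinced" quantitative, here is a small
sketch in the Bayesian spirit of the rest of this page. The 50/50 prior
split between a fair coin and a two-headed coin is purely an
illustrative assumption of mine:

```python
# Posterior probability that the coin is fair after seeing n heads in a
# row, assuming (illustratively) equal prior credence in "fair coin" and
# "two-headed coin".  There is no sharp cutoff: the probability shrinks
# smoothly toward zero, and never reaches it.
def prob_fair(n_heads, prior_fair=0.5):
    like_fair = 0.5 ** n_heads      # P(n heads in a row | fair)
    like_two_headed = 1.0           # P(n heads in a row | two-headed)
    numerator = prior_fair * like_fair
    denominator = numerator + (1 - prior_fair) * like_two_headed
    return numerator / denominator

for n in (5, 10, 20, 100):
    print(n, prob_fair(n))
```

Note that the result is never exactly zero: at any finite number of
flips we could still be seeing a fluke.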

Note the words
"likely" and "improbable". We're starting
to use concepts from probability theory -
and yet we are in the middle of trying to define
probability! Very odd. Suspiciously circular.

Some people try to get around this as follows.
They say the coin has probability 1/2 of landing heads
up if over an infinite number of flips it lands
heads up half the time. There's one big problem, though:
this criterion is useless in practice, because we can
never flip a coin an infinite number of times!

Ultimately, one has to face the fact that probability cannot
be usefully defined in terms of the frequency of
occurrence of some event over a large (or infinite) number
of trials. In the jargon of probability theory, the
frequentist interpretation of probability is wrong.

Note: I'm not saying probability has nothing to do
with frequency. Indeed, they're deeply related!
All I'm saying is that we can't usefully define
probability solely in terms of frequency.

If you're not convinced yet, consider a statement like
this: "Mr. X has a 60% chance of winning the next presidential
election". There is no way to determine this by holding
the next presidential election a large number of times
and checking that Mr. X wins about 60% of the time.
Nonetheless, I claim this statement is meaningful.
If you don't believe me, argue with the British bookies
who post odds and take bets on such events. They make
their living by doing this!

Carefully examining such situations, we are led
to the Bayesian interpretation of probability.

The basic idea behind the Bayesian interpretation is
that probability is not something we start by measuring.
Instead, we must start by assuming some probabilities.
Then we can use these to calculate how likely
various events are. Then we can do experiments to see what actually
happens. Finally, we can use this new data to update our assumptions -
and the Bayesian interpretation gives some recipes for doing this.
But everything starts by assuming some probabilities at the start.
This is called the "prior probability distribution"
or prior for short.

Subjective Bayesians argue that
one's choice of prior is unavoidably subjective.
Objective Bayesians try to find rules for choosing
the "right" prior. Personally I don't think there's
a serious conflict here. The choice of prior is subjective, but
in some situations there are nice rules to help one choose it.
For example, in a situation where your evidence suggests
that your coin has a symmetry - the two sides
don't seem very different - you can use this
to pick a prior which says the chance of it landing heads up
equals that of it landing tails up. Of course you could be
neglecting the fact that some sneaky guy weighted the coin
and it just looks symmetrical. That's life.

What follows is a collection of nice posts on
the Bayesian interpretation of probability and its
relevance to quantum theory. It turns out that a lot
of arguments about the interpretation of quantum theory
are at least partially arguments about the meaning of
probability itself! For example, suppose you have an
electron in a state where the probability to measure
its spin being "up" along the z axis is 50%.
Then you measure its spin and find it is indeed up.
The probability now jumps to 100%. What has happened?
Did we "collapse the wavefunction" of the
electron by means of some mysterious physical process?
Or did we just change our prior
based on new information? Bayesianism suggests the latter.
This seems to imply that the "wavefunction" of
a specific individual
electron is just a summary of our assumptions about it,
not something we can ever measure. Some people find this infuriating;
I find it explains a lot of things that otherwise seem mysterious.

There are a lot of tricky issues here. Quantum theory
is more general than classical probability theory. It
presents a lot of new puzzles of its own. But, at the
very least, we need a clear understanding of what "probability"
means before we can tackle these quantum quandaries.

I believe the frequentist interpretation just isn't good enough
for understanding the role of probability in quantum theory.
This is especially clear in quantum cosmology, where
we apply quantum theory to the entire universe. We can't
prepare a large number of identical copies of the whole
universe to run experiments on!

>baez@guitar.ucr.edu (john baez) writes:
>>People sometimes feel like retroactively projecting the wavefunction
>>down onto the component that fits their observation, *after* they've
>>made the observation. Any attempt to do this sort of thing is, IMHO,
>>a confused version of what you really are doing when applying quantum
>>mechanics correctly, namely, to *first* model all your assumptions
>>about the state of the universe by a wavefunction, and *then* compute
>>probabilities with that wavefunction.
>>Doing this systematically gives all the right answers to quantum
>>mechanics problems, so there's no need to do anything else.
>The "collapsarian" style for calculating amplitudes is to use one's
>current observations to get an up-to-the-minute best fit wave
>function. The Everett style seems to be to fix a wave function once
>and for all, and *never* change it in light of observations.

This is a really crucial issue, and it's probably due to long
discussions with Daryl that I evolved my current position on this issue.
I do *not* side with the - no doubt mythological - "Everettistas"
described below:

>Here is a sample conversation between two Everettistas, who have fallen
>from a plane and are hurtling towards the ground without parachutes:
> Mike: What do you think our chances of survival are?
> Ron: Don't worry, they're really good. In the vast majority of
> possible worlds, we didn't even take this plane trip.

Note that anyone who acted that way would be silly, and that their error
would have little to do with QUANTUM MECHANICS, but mainly with
PROBABILITY THEORY. Probability theory is the special case of quantum
mechanics in which one's algebra of observables is commutative. (This
becomes a theorem in the context of C*-algebra theory.) Quantum
mechanics has special features due to the noncommutativity, but
probability theory already exhibits certain subtleties of
interpretation, which I believe are a large part of what's at stake
here.

Part of the point of Bayesianism is that you start with a "prior"
probability measure. Let me just call this the "prior" - I think
Bayesians have some bit of jargon like this. (I wish some expert on
Bayesianism would step in here and give a 3-paragraph description of its
tenets, since I feel unqualified.) In any event, when I say you
"*first* model all your assumptions about the state of the universe by a
wavefunction, and *then* compute probabilities with that wavefunction,"
I mean that the wavefunction plays the role of the "prior".

Bayesianism is called "subjective" in that it applies no matter how you
get your prior. In other words, you could be a pessimist and wake up in
the morning assuming that sometime today a nuclear attack will
devastate your town, and constantly be surprised as each hour goes by
without an attack. This might or might not be smart, but if you are a
good Bayesian you can correctly compute probabilities assuming this
prior.

Similarly, you could do good experiments, crappy experiments, or no
experiments, and guess a wavefunction based on what you know, guess, or
hope, but if you know the rules of quantum mechanics, you can compute
probabilities correctly *assuming* the wavefunction. You could even find
the wavefunction written on a crumpled-up piece of paper in my trashcan!
No problem!

When you compute probabilities, however, you don't just compute
"straight" probabilities with respect to the prior, you also compute
*conditional* probabilities. I.e., even if you are the above pessimist,
you can consider the probability that you'd enjoy a fine afterdinner
brandy *given* that the nuclear attack hasn't occurred yet. You might
think there is only a 5% chance that you will be alive at this point,
but still think that *given* that the attack hasn't occurred, there is a
95% chance that some brandy would be enjoyable.
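With the made-up numbers from this paragraph, the arithmetic looks like
this:

```python
# The pessimist's numbers (illustrative, from the paragraph above).
p_alive = 0.05                 # prior probability of surviving the day
p_brandy_given_alive = 0.95    # P(brandy enjoyable | no attack yet)

# Straight probability with respect to the prior: small.
p_brandy = p_alive * p_brandy_given_alive
print(p_brandy)

# Conditioning on "the attack hasn't occurred" recovers the 95%.
print(p_brandy / p_alive)
```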

So suppose Ron and Mike use as their prior some wavefunction based on
(approximate) measurements they did of the positions and velocities of
all the elementary particles in the world on Tuesday. When they are
falling out of the plane on Wednesday, they *could* use the prior to
compute the probability that they actually took that plane trip. It
might indeed be very low. However, they might find another calculation
infinitely more interesting: namely, the conditional probability that they
will survive, *given* that they took the trip and fell out of the plane.

Note: if you want, you can think of this process of switching from
computing probabilities using the prior to computing conditional
probabilities as a mysterious PHYSICAL PROCESS - the "collapse of the
wavefunction". This would be wrongheaded, because in fact it is simply
a change on *your* part of what you want to compute! If you think of
it as a physical process you will be very mystified about things like
when and how it occurred!

As I said once upon a time, we can imagine a sleepy physics student
attending a physics lecture. At the beginning of the class the
professor is working out a problem in which an object has velocity v = 0
at time t = 0. At this point the student drifts off to sleep. Later,
he wakes up and finds the professor working out a TOTALLY DIFFERENT
PROBLEM in which an object has velocity v = 1 at t = 1. The student
doesn't realize that he has been asleep for half an hour, and so he
raises his hand and asks "Professor!! At what time t did the
acceleration occur??" Asking when the wavefunction collapses is like
this.

Now I admit that this view of quantum mechanics takes a while to get
used to. In particular, there really *are* issues where quantum
mechanics is funnier than classical probability theory. In classical
probability theory there are pure states in which *all* observables have
definite values. Subconsciously we expect this in quantum mechanics,
even though it's not so. So we always want to ask what's "REALLY going
on" in quantum mechanics - meaning that we secretly yearn for a
wavefunction that is an eigenstate of all observables. If we had such a
thing, and we used *it* as a prior, we wouldn't need to worry much about
conditional probabilities and the like. But alas there is no such
thing, as far as we can tell.

>This is a really crucial issue, and it's probably due to long
>discussions with Daryl that I evolved my current position on this issue.
>I do *not* side with the - no doubt mythological - "Everettistas"
>described below:
>>Here is a sample conversation between two Everettistas, who have fallen
>>from a plane and are hurtling towards the ground without parachutes:
>>
>> Mike: What do you think our chances of survival are?
>>
>> Ron: Don't worry, they're really good. In the vast majority of
>> possible worlds, we didn't even take this plane trip.
>Part of the point of Bayesianism is that you start with a "prior"
>probability measure. Let me just call this the "prior" - I think
>Bayesians have some bit of jargon like this. (I wish some expert on
>Bayesianism would step in here and give a 3-paragraph description of its
>tenets, since I feel unqualified.)

:-). I have found this point of view to be very helpful for getting
a better understanding of quantum mechanics and even for understanding
why people argue about it so much. In fact, the spectrum of interpretations
in quantum mechanics has a close analogue in probability theory.
The "wave function is real" view is analogous to the "frequentist" view of
probability theory where probabilities describe "random phenomena" like
rolling dice or radioactive decays and the "wave function represents what
you know about the system" view is analogous to the Bayesian view where
probability is just a consistent way of assigning likelihoods to propositions
independent of whether they have anything to do with a "random process." Just
as in quantum mechanics, arguments have raged for many (more than 100) years
without any real resolution and, just as in quantum mechanics, when the
two camps actually solve the same problem, the mathematics is basically
the same. A typical example of this sort of disagreement is Laplace's
successful calculation of the probability that Jupiter's mass is within
some interval. To a frequentist, the mass of Jupiter is a number.
Admittedly, this number is unknown, but it is definitely not a random
variable (since there is no "random process changing Jupiter's mass")
and so it is utter nonsense to talk about its p.d.f. This may seem
like a silly kind of disagreement, but the consequences in terms of
what problems can be solved and in terms of understanding what probability
theory is all about couldn't be greater.

Most people are more familiar with the frequentist view where you
say that if you perform a "random experiment" N times with n successes
then the "probability of success" is the large N limit of n/N. You then
assume that these probabilities obey Kolmogorov's axioms, and you're all
set. The rest of probability theory is solving harder and harder problems.
The Bayesian view of things is a bit different and starts this way. Suppose
that we want to attach a non-negative real number to pairs of propositions
(a,b) and this number is supposed to somehow reflect how likely it is that
"b" is true if "a" is known. Let me write this number as a->b [*].
For "->" to be a useful likelihood measure one expects a few modest things
of it. For example, if you know a->b, this should determine a->.not.b and
the procedure to get from a->b to a->.not.b shouldn't depend on "a" or "b."
It turns out that this and just a little bit more is enough to entirely fix
probability theory, as shown in an obscure paper by Cox in Am.J.Phys.
in 1946. One gets the sum and product rules

(a -> b) + (a -> .not.b) = 1

(a -> b.and.c) = (a->b) (a.and.b -> c)

which is the Bayesian form of probability theory. You can then trivially
show (Bayes Theorem) that

(a.and.b -> c) = (a->c) {(a.and.c -> b)/(a->b)}

if (a->b) is nonzero. This is often used in the following context:

a = "stuff that you know"

b = "more stuff that you found out"

c = "something that you're interested in"

Then if you already know (a->c), Bayes theorem tells you how to find the
probability that c is true, given your additional knowledge b, i.e.
(a.and.b -> c). For example, suppose that you happen to know that the
behavior of a random variable x obeys one of a family of pdfs f(x,t)
where t is some unknown parameter. Given a sample of independent x values
X = (x1,x2,...,xn), what can you say about t? Using Bayes theorem,
it's easy as pie. If "e" is the initial knowledge of the experiment as
just described, then you want to calculate

(e.and.X -> t) = (e->t) {(e.and.t->X)/(e->X)}

Here (e->t) is called the "prior" probability that the true pdf is f(.,t).
If we have no reason to prefer one value of t over another, we can use
the "uniform prior" (e->t) = const. Then, since (e.and.t->X) =
(e.and.t->x1.and.x2.and.x3...xn) = f(x1,t)f(x2,t)...f(xn,t),

(e.and.X -> t) = const. Prod{j=1,n} f(xj,t)

and you're done. This is usually called the likelihood method. There are
more sophisticated methods for choosing priors in various situations
(e.g. "Maximum Entropy") but the basic idea is the same.
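The calculation just sketched can be carried out numerically. Here is a
minimal sketch, assuming (my choice, not the poster's) that f(x,t) is a
Gaussian with unknown mean t and known width, and using a grid of t
values to stand in for the uniform prior:

```python
import math

def posterior_on_grid(samples, t_grid, sigma=1.0):
    """Normalized posterior (e.and.X -> t) on a grid of t values,
    assuming a uniform prior (e->t) = const and f(x,t) = Normal(t, sigma).
    (The Gaussian family is an illustrative assumption.)"""
    post = []
    for t in t_grid:
        loglike = sum(-0.5 * ((x - t) / sigma) ** 2 for x in samples)
        post.append(math.exp(loglike))
    z = sum(post)
    return [p / z for p in post]

samples = [1.8, 2.1, 2.4, 1.9]
grid = [i / 100 for i in range(0, 401)]   # t from 0.00 to 4.00
post = posterior_on_grid(samples, grid)
t_map = grid[post.index(max(post))]
print(t_map)   # the most probable t: here, the sample mean
```

For a Gaussian with a flat prior the posterior peaks exactly at the
sample mean, which is what the product of f(xj,t) factors predicts.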

So far, I have left out one important point. In the frequentist
view of probability you start off assuming that probabilities have a particular
frequency meaning. In the Bayesian view, this must be derived by
considering copies of a single experiment and considering the probability
that n/N of them have success. You can then get the standard frequency
meaning of probabilities provided that you assume (roughly) that
probability zero events don't happen.
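The "probability that n/N of them have success" can be computed
directly. A sketch (the value p = 0.3 and the tolerance are illustrative
numbers of mine): the probability that the observed frequency n/N lies
close to p, for N independent copies, approaches 1 as N grows:

```python
from math import comb

def prob_freq_close(N, p, eps):
    """Probability that the observed frequency n/N lies within eps of p,
    for N independent copies each succeeding with probability p."""
    return sum(comb(N, n) * p**n * (1 - p)**(N - n)
               for n in range(N + 1)
               if abs(n / N - p) <= eps)

for N in (10, 100, 1000):
    print(N, prob_freq_close(N, 0.3, 0.05))
```

The frequency meaning then follows once you assume, roughly, that
probability-zero (or probability-near-zero) events don't happen.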

Note that because of this frequency meaning, probability theory is
not just a piece of mathematics. It is really a physical theory about
the world which might or might not be correct. From this point of view, it
is tempting to try to explain quantum phenomena by modifying probability
theory. As far as I can tell, this idea actually works and has more
consequences than "just another interpretation" of quantum mechanics.

>Bayesianism is called "subjective" in that it applies no matter how you
>get your prior. In other words, you could be a pessimist and wake up in
>the morning assuming that sometime today a nuclear attack will
>devastate your town, and constantly be surprised as each hour goes by
>without an attack. This might or might not be smart, but if you are a
>good Bayesian you can correctly compute probabilities assuming this
>prior.

That's right. Of course, you can get the wrong answer if you
have the wrong prior, but this is viewed as progress! From the
Bayesian point of view, science progresses by finding out that your prior
isn't working. For example, your prior may include a physical theory that is
wrong.

>When you compute probabilities, however, you don't just compute
>"straight" probabilities with respect to the prior, you also compute
>*conditional* probabilities.

In the Bayesian view, all probabilities are conditional since they
all depend on what you know. This is also true in Kolmogorov's system
but only within a fixed sample space.

>Note: if you want, you can think of this process of switching from
>computing probabilities using the prior to computing conditional
>probabilities as a mysterious PHYSICAL PROCESS - the "collapse of the
>wavefunction". This would be wrongheaded, because in fact it is simply
>a change on *your* part of what you want to compute! If you think of
>it as a physical process you will be very mystified about things like
>when and how it occurred!

Yes! As I've said, probably too many times, it's like wondering what
physical process causes a probability distribution to "collapse" when you
flip a coin.

>Note that anyone who acted that way would be silly, and that their error
>would have little to do with QUANTUM MECHANICS, but mainly with
>PROBABILITY THEORY. Probability theory is the special case of quantum
>mechanics in which one's algebra of observables is commutative. (This
>becomes a theorem in the context of C*-algebra theory.)

Could you post or email me the reference for this theorem?

>Now I admit that this view of quantum mechanics takes a while to get
>used to. In particular, there really *are* issues where quantum
>mechanics is funnier than classical probability theory. In classical
>probability theory there are pure states in which *all* observables have
>definite values.

Notice that a statement like: the coin is in "state" (1/2,1/2) would
be very bad language from the Bayesian point of view since (1/2,1/2)
represents what you know and not some physical property of the coin.
One of the reasons that this point of view "takes getting used to" in
quantum mechanics is that the language of standard quantum theory constantly
reinforces the idea that Psi is the "state of the system."

>Subconsciously we expect this in quantum mechanics,
>even though it's not so. So we always want to ask what's "REALLY going
>on" in quantum mechanics - meaning that we secretly yearn for a
>wavefunction that is an eigenstate of all observables. If we had such a
>thing, and we used *it* as a prior, we wouldn't need to worry much about
>conditional probabilities and the like. But alas there is no such
>thing, as far as we can tell.

That would be like yearning for the "true probability distribution" for
coin flippage. But the fact that there isn't any such thing independent
of your state of knowledge doesn't mean that there isn't something REALLY
going on (e.g. a REAL copper penny being flipped by a real human being).
In spite of non-commuting observables and Bell's theorem and all its
variations, I don't think that it has quite been shown that there can't
be something "REALLY going on", as you say.

By the way, Ed Jaynes is writing a book on Bayesian probability
theory which is easily readable by undergraduates. For some reason, its
current draft is available on the web at:

>The Bayesian approach to this question - when is a
>prior distribution "right"? - seems to me to be the most
>clear-headed one... though I still have a lot more to learn about what
>the Bayesians actually say, and the different flavors of Bayesianism.

I'm interested in flavors too and especially if someone could comment
on how the Jaynesian version is related to the rest of the field or to add to
my list of references:

"Maximum Entropy and Bayesian Methods", ed. J.Skilling, Kluwer, 1988
[this is one of a series of conferences]

>John Baez writes:
>>The Bayesian approach to this question - when is a
>>prior distribution "right"? - seems to me to be the most
>>clear-headed one... though I still have a lot more to learn about what
>>the Bayesians actually say, and the different flavors of Bayesianism.
>I'm interested in flavors too and especially if someone could comment on
>how the Jaynesian version is related to the rest of the field or to add
>to my list of references:

The late Harold Jeffreys was a proponent of the
"objectivist" school; Jaynes frequently mentions
Jeffreys in his writings. Jeffreys wrote a treatise,
_Theory of Probability_, which is sometimes in print
(Oxford), in which he sets out his views.

Jack Good (I.J. Good) discusses the classification
of Bayesians in an article reprinted in his book,
_Good Thoughts_.

>[deletia]
>It is this insistence on a unified, "natural" approach that I don't
>like. The impression I get (and I guess I don't really know enough
>about it to make any strong claims) is that once you've decided on
>your subjective prior distribution(s), from then on, you just turn the
>crank. I'd rather have the privilege of, at any time, deciding to toss
>out my old idea of what the probabilities are, and use instead a new
>(possibly unrelated) set of probabilities. I don't want to be told
>how my probabilistic guesstimates are supposed to change with time.

It's true: Bayesianism requires you to update your probabilities every time
you incorporate new data, and do so in a manner consistent with the data. Why
is this seen by frequentists as a Bad Thing? Because of the different
interpretation Bayesians and Frequentists put on the word "probability".

>[deletia]
>Maybe Bayesianism doesn't dictate what *prior* to use (although I
>thought that some Bayesians insist that you use the maximum entropy
>principle).

Me, for one.

>But it seems to me that they *do* dictate how my
>probability distributions can change with time.

Indeed, but at this point, you should ask what it is that you mean by
probability? Do you interpret it as the degree of belief in a proposition, as
the expected frequency of occurrence of an event in an infinite sequence of
trials, or some loose mish-mash of the two?

>Maybe I'm wrong about this, but if so, then what *does* Bayesianism do?
>What is the "Bayesian approach", if it doesn't in any way
>constrain how one does probability? It just tells you what kind of
>attitude you should have towards probability?

Bayesianism uses the first of the above interpretations of probability, using
Cox's theorem as validation, although the approach dates much further back, to
the days of Laplace. From your comments objecting to being told how your
probabilities should change with time (!) I suspect you use the second
interpretation, which puts you firmly in the frequentist camp.

In a reasonable world, there is room for both approaches, provided people are
clear about what they are calculating. The problems occur most often when
people try to do things using a frequentist approach to data analysis which
simply isn't appropriate. For further reading on the subject, I suggest Berger
and Sellke's paper on the subject a few years ago. (Journal of the American
Statistical Association, March 1987, Vol 82, No 397, 112 ff) A most
stimulating exchange of views followed it. Some of which confirm that we don't
live in a reasonable world.

>It is this insistence on a unified, "natural" approach that I don't
>like. The impression I get (and I guess I don't really know enough
>about it to make any strong claims) is that once you've decided on
>your subjective prior distribution(s), from then on, you just turn the
>crank.

Pretty much. You assume a prior distribution and calculate the
probability of the distribution given the data (well, you often calculate
the posterior distribution, i.e. the probability distribution given
your prejudices (your prior) and the data).

>I'd rather have the privilege of, at any time, deciding to toss
>out my old idea of what the probabilities are, and use instead a new
>(possibly unrelated) set of probabilities. I don't want to be told
>how my probabilistic guesstimates are supposed to change with time.

To quote somebody else (John Baez?) "Huh?" The change-with-time part
seems to be simply the rules for how the addition of new data changes
your posterior distribution. If you believe in what you're doing, then
acquiring new data does change your beliefs. It should, after all,
if you get enough data you might find out your prejudices (your prior)
were wrong. Looking at the math you find that if you have enough data
it doesn't matter what you take for the prior, which is comforting.
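The comforting fact in the last sentence is easy to see numerically.
A sketch (Beta priors are my illustrative choice, since they make the
updating a one-liner, and the 70%-heads data is invented):

```python
# Two Bayesians with wildly different priors about a coin's heads
# probability update on the same data; with enough data their
# posteriors agree, so the choice of prior stops mattering.
def posterior_mean(a, b, heads, tails):
    # Beta(a, b) prior  ->  Beta(a + heads, b + tails) posterior
    return (a + heads) / (a + b + heads + tails)

for flips in (10, 1000, 100000):
    heads = (7 * flips) // 10        # suppose 70% of flips came up heads
    tails = flips - heads
    optimist = posterior_mean(50, 1, heads, tails)   # prior mean ~ 0.98
    skeptic = posterior_mean(1, 50, heads, tails)    # prior mean ~ 0.02
    print(flips, round(optimist, 3), round(skeptic, 3))
```

After ten flips the two still disagree wildly; after a hundred thousand
they agree to three decimal places.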

Of course, if you like, you can change your mind and substitute a new
prior distribution, and recalculate. If that is what you want we've
got it! Call now to order ...

[John Baez wrote:]
>Huh? As Bill Taylor explained, the whole point of Bayesianism, or what
>he more descriptively calls "universal subjective priorism", is that
>there's no point to saying your subjective probabilities are "wrong"!
>You seem in fact to be arguing FOR Bayesianism in what you just wrote.
>This was the whole point of my parables in which one is told one's prior
>distribution by a little bird, or a fox. Given the prior one can
>compute probabilities, but without a prior one can't compute the
>probability that someone else's predictions are right or not, so one
>can't "judge" a prior without reference to some other prior.
Maybe Bayesianism doesn't dictate what *prior* to use (although I
thought that some Bayesians insist that you use the maximum entropy
principle). But it seems to me that they *do* dictate how my
probability distributions can change with time.

Only in the sense that it dictates how to calculate the posterior
distribution given the prior distribution and the data. This part is
plain arithmetic and is generally not the controversial part of
Bayesianism (that is even an anti-Bayesian would say well if you must
do it, do it that way). The anti-Bayesian critique usually centers on
the subjectivity inherent in the choice of prior. Bayesians counter
this with the "at least we have our prejudices up front" argument.
(An example: using chi-squared to fit a multi-parameter model is
equivalent to Bayesian inference on data with gaussian noise and
uniform priors for the parameters.)

>Maybe I'm wrong about
>this, but if so, then what *does* Bayesianism do? What is the
>"Bayesian approach", if it doesn't in any way constrain how one does
>probability? It just tells you what kind of attitude you should have
>towards probability?

Oddly enough, that change of attitude can change how you calculate, in
certain examples. A common example is that if you use classical
methods to fit for a parameter from the data, you are generally
assuming a uniform prior on the parameter, i.e. 0 is as likely as any
other value. If you don't know a priori that the effect represented
by the parameter exists, that is you want to discriminate between the
null hypothesis and "the parameter exists with value X," maybe that is
the wrong prior to use. Maybe you should go Bayesian and put half the
prior probability on the null hypothesis, parameter = 0, and
distribute the rest as you see fit over nonzero values.

This can have a big effect on the results: data which show that a
parameter is statistically significantly different from zero with a
classical analysis can be consistent with the nonexistence of the
parameter in a Bayesian analysis. Philip Morrison claims that this
may be responsible for much of the hoo-ha over the ephemeral "fifth
force" and the retired Princeton professor who thinks he's shown that
ESP exists. (Take that, S*rf*tti!)
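Here is a minimal numerical sketch of that effect (all numbers are
invented for illustration): a measurement two standard errors away from
zero, which a classical test calls significant at about the 5% level,
can still leave more than half the posterior probability on the null
once half the prior sits on parameter = 0:

```python
import math

# Observed value two standard errors from zero, with half the prior on
# the null (parameter = 0) and half spread over a wide range of nonzero
# values.  The slab width is an illustrative assumption.
xbar, sigma = 2.0, 1.0     # observed value and measurement error
slab_width = 20.0          # prior spread of nonzero parameter values

def normal_pdf(x, mu, s):
    return math.exp(-0.5 * ((x - mu) / s) ** 2) / (s * math.sqrt(2 * math.pi))

# Evidence under the null: parameter exactly 0.
like_null = normal_pdf(xbar, 0.0, sigma)

# Evidence under the alternative: parameter uniform over the slab,
# approximated by averaging the likelihood over a grid.
grid = [-slab_width / 2 + slab_width * i / 2000 for i in range(2001)]
like_alt = sum(normal_pdf(xbar, t, sigma) for t in grid) / len(grid)

posterior_null = like_null / (like_null + like_alt)  # 50/50 prior odds
print(round(posterior_null, 3))
```

Widening the slab pushes even more posterior weight onto the null; that
sensitivity to the prior is exactly what the argument is about.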

Bayesian analysis also makes your life infinitely simpler, in the
sense that you don't have to run around remembering a zillion
different classical-statistical formulae for the case of normal
distribution with known mean and unknown variance, unknown mean and
known variance, and so on.

All this Bayesian stuff used to intimidate me until I got hold of a
good introductory book and worked out a few examples, then I realized
it was actually easier than the other statistics. By now there are a
number of sources for learning about it: I recommend Peter Lee's
"Bayesian Statistics: An Introduction." (Oxford U.P.)

The most recent post, an excellent one by Ben Weiner, makes a good point
more elegantly than I had done earlier...

>Bayesian analysis also makes your life infinitely simpler, in the
>sense that you don't have to run around remembering a zillion
>different classical-statistical formulae for the case of normal
>distribution with known mean and unknown variance, unknown mean and
>known variance, and so on.

Quite so; this uniformity of approach is one of the most appealing things
about it. And also is the simplicity of getting results; either numerically
with hardware, or theoretically (when conjugate priors are used, though
this smacks of objective Bayesianism, see below!)

But Oh Dear! Ben, I was sorry to see you using "infinitely" there, when
you just mean "very much"! Surely the media-man-in-the-street hasn't
corrupted your linguistic habits so terribly! You know how we *loathe*
these non-technical "infinity" uses; hardly admissible hyperbole! ;-)

Many others have made good comments on this ongoing thread, but I'll just
add a few answers to some outstanding points.

Firstly to deal with a couple of irritants...

Mike Price:

>and it is not apparent (to me, anyhow) that the assumption of
>probability isn't already lurking somewhere under a stone, in your
>definition, even excluding the explicit use of "average".

Well this is surely no problem. There are plenty of places where "averages"
are taken (i.e. arithmetic means), that have nothing to do with probability.
Saul Youssef:

>>about as useful as speaking of negative natural numbers!
>That's an odd point to make since negative numbers are very useful!

You made this comment in another thread a while ago, Saul, very similarly.
You'll note I did say negative *natural* numbers; not just negative numbers.
Sure negative numbers are useful; but negative natural numbers, like negative
probabilities, are about as useful as square circles.

However, all was forgiven, Saul, when I saw this...

it's those annoying Ayn Randians from news groups outside of our own
galaxy who are really getting to me...

<snort.> Right on; they really are from another universe, where greed
masquerading as conscience is considered OK!

More seriously though, I was struck by this comment from John Baez:

You can make observations and use them to GUESS a probability
distribution, but this has an irreducible element of subjectivity in
it.

Highlighting the word "guess" seems a little unwise; as the use of this
word suggests you vaguely feel there's *still* a *real* prior out there
somewhere; one you can never know but try to come close to. As we've all
insisted, this is totally wrong thinking; the true-blue (or subjective)
Bayesian can make no sense of this idea. Your prior isn't a guess at anything,
it's just your own quantification of your own uncertainty.

And this connects to another point, well emphasized by Saul:

Many people think of probability theory as a something applying to
problems that have "random" elements like dice, roulette wheels or
radioactive decays. The main insight of the Bayesians is that [it has]
nothing in particular to do with "random" processes (whatever they
are) and apply in a vastly greater domain.

Exactly so; as many others have also observed. Bayesian ideas are used
in any situation where there is *uncertainty*, whether of a probabilistic
nature or otherwise.

I have a nice anecdote about this. Dennis Lindley was giving a talk to us;
and started at a just-freshly-cleaned whiteboard, by writing up in large
caps the word UNCERTAINTY. He turned to his notes, fiddled with them for a
second or two, and turned back to the board to start talking - when he checked
suddenly, amidst scattered chuckles. The ink in UNCERTAINTY had melted into
the residual dampness on the whiteboard, and produced a delightfully fuzzy
splodge that could still just be read as "uncertainty", but with overtones
of Heisenberg-like clouds of ink-particle probabilities. Very apt!! Dennis
said he must try and get it set up like that for any future talks...

---

Now there's a whole sub-thread stemming from this plaint by Daryl McCullough:

I don't want to be told
how my probabilistic guesstimates are supposed to change with time.

This has already been answered well:- it's the natural probabilistic way
they ought to change, if you have any probability models at all. But I can
see how you might still feel grumpy about this. Why shouldn't you go back
and change your prior if the subsequent data are making it look
*really* stupid?

This is tough to answer. For one thing, if your prior was so silly as to have
zero probabilities in it, (or zero-density intervals, in the continuous case),
then you may *have* to. F'rinstance, if you declared that there was *zero*
prior chance of a six turning up on a dice - but then a six *did* turn up;
well, you're completely stuffed! You just have to go back and start again
without the silly zeros. And it'd be much the same if you had the prior not
quite zero but about 10^(-35). It'd still take billions of sixes turning
up before you'd posteriorly admit there was a reasonable chance of getting
some sixes. Clearly that was a silly prior. (Not *wrong*, note, just silly;
even by your own standards.) The Bayesian, like anyone else, has to use
some common sense and start with reasonable priors that admit a fair chance
to anything that could remotely happen. For a coin, for instance, one would
have most of the probability peaked near .5, but with reasonable amounts
smeared out toward 0 and 1, with perhaps some small (but not invisible)
probability masses at 0 and 1; to allow for the outside chance of a two-header.
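To put numbers on the 10^(-35) point, here's a toy two-hypothesis sketch of my own (the framing and figures are hypothetical, not from the thread):

```python
# Toy hypotheses for a die: H1 = "it always shows six", H0 = "it's fair",
# with a tiny-but-nonzero prior P(H1) = 1e-35 (hypothetical numbers).
prior_h1 = 1e-35
prior_odds = prior_h1 / (1 - prior_h1)
lr = 6.0   # likelihood ratio per observed six: P(six|H1)/P(six|H0) = 1/(1/6)

def posterior_h1(n_sixes):
    """Posterior P(H1) after n consecutive sixes."""
    odds = prior_odds * lr ** n_sixes
    return odds / (1 + odds)

print(posterior_h1(10))   # still utterly negligible
print(posterior_h1(45))   # the absurd prior finally overwhelmed (~0.51)
```

(With these toy numbers a run of a few dozen sixes suffices rather than billions, but the moral is the same: the sillier the prior, the more overwhelming the run of evidence must be before the posterior budges at all.)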

But I think Daryl might still have a complaint. Even after all this, he may
still feel it sensible to be able to go back and change your prior *after*
having seen some data. But I think this would be (reasonably) regarded as
fairly irrational by a Bayesian. Of course, as Paul Budnik says:

However it is well known that
people do not make rational decisions about betting. That is why it is
possible to win at the race track if you do enough homework and have
enough self discipline.

Quite so! Though the way to win off this irrationality is to be the bookie!
The thing is - the scientist is supposed to have got a bit beyond this
kind of "hot streak" irrational thinking; and the Bayesian is just saying
that a common-sense prior and Bayes-theorem posteriors is the way to do it.

Otherwise, we could have this scenario. Daryl is trying to determine a coin's
propensity for coming up heads. He starts with a prior with a peak tight
around 0.5 . Then the first three tosses come up tails, so he goes back
and changes his prior to one much more spread out on the low end; but then
7 of the next 8 tosses come up heads, so he goes back and starts with a
new prior spread out more on the upper side; then the next few tosses...

Clearly this is an extreme example. But I hope Daryl, you might agree that
your desire to "go back and change the prior" would probably be essentially
like this, even in messier situations. What you're really doing is (mentally)
estimating posteriors from some (perhaps uncrystallized) subjective prior.
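The orthodox route through that coin scenario is just conjugate updating of one prior by all the data, with no restarting. A minimal sketch (my own choice of a Beta(50, 50) prior to stand in for "a peak tight around 0.5"):

```python
# Beta(a, b) prior for the heads-propensity p; Beta(50, 50) is peaked
# tightly near 0.5.  Conjugate updating: after h heads and t tails the
# posterior is Beta(a + h, b + t) -- no restarting with a fresh prior.
a, b = 50.0, 50.0

def update(a, b, heads, tails):
    return a + heads, b + tails

# Daryl's sequence: first 3 tails, then 7 heads out of 8 tosses.
a, b = update(a, b, 0, 3)
a, b = update(a, b, 7, 1)
print(a / (a + b))   # posterior mean ~0.514 after all 11 tosses
```

The posterior drifts smoothly wherever the data pull it; "going back and changing the prior" after each surprise is just a noisy, inconsistent approximation to this.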

It may be, of course, that Daryl's real worries are as in his comment:

I thought that some Bayesians insist that you use the maximum entropy principle

Saul dealt with this - it *may* be useful sometimes to do this, but (the
true-blue Bayesian would say) *only* if you think it is! No-one says you
have to; except the red-hot objective Bayesian, who may be the nearest to
a religious nutter in this debate. There is one big problem with "objective
priors", which usually boil down to uniform - or as near as one can get to
it when strict uniform would be "improper". John Baez was obliquely referring
to it, I think:

any computation of probabilities is a computation w.r.t. some probability
measure [...]
entropy is defined relative to a prior distribution [...]

This is a variant of Laplace's paradox, about which much has been written.
I've not read it myself; Jim Berger is an objective Bayesian, I gather, and
when he was talking here I tried to get him to expound on it, but he skirted
away from it. The trouble is, as Laplace first observed, that a uniform
prior is no longer uniform if you just re-parametrize the underlying
"observation space" in some non-linear way. An objectivist may insist that
there is usually only one "natural" way to do the parametrization, but this
is far from clear! In estimating the spread in a population, is sigma-squared
(the variance) or sigma (the standard deviation) more natural? It could
make a substantial difference when you assign your "natural" objective prior.
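The sigma-versus-sigma-squared disagreement is easy to exhibit numerically; here's a quick Monte Carlo sketch of my own (priors restricted to (0, 1) purely for convenience):

```python
import random

# Uniform prior on sigma over (0, 1) vs "uniform prior on the variance
# v = sigma**2": under the first, the implied density of v is 1/(2*sqrt(v)),
# which is anything but flat.  Monte Carlo check of P(v < 0.25):
random.seed(1)
n = 200_000
frac = sum(1 for _ in range(n) if random.random() ** 2 < 0.25) / n

print(frac)   # ~0.5, since v < 0.25 iff sigma < 0.5
# A flat prior on v would give P(v < 0.25) = 0.25 instead: the two
# "natural" objective priors genuinely disagree.
```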

That example is not a great one, but multivariate situations are notoriously
susceptible to changes in parametrization. And there could be worse, if
John's remarks above are extended in an obvious mathematical way:- it may
even be that the "natural" parametrization is (measure-theoretically) singular
with respect to the "obvious" one! Like a "thick Cantor" distribution. This
would play havoc with things if you started with the "wrong" objective prior.
This may seem mere silliness and nit-pickery, but maybe not. I have this
haunting recollection (i.e. no hint of references!), of having seen a paper
about the orbits of asteroids. (Falsely reminiscent of a classic paper of
I.J.Good on Bode's law!) Apparently the gravitational resonance of Jupiter
*not only* tends to judder them into bands where the orbital frequency is a
simple ratio of Jupiter's, *but also* has the effect that the perturbational
width of the band surrounding frequency p/q is of order 1/q^2. So when all
these are unioned then intersected the result is a Cantor set of bands of
positive measure! Well obviously I've recalled it all awry - but
it was something like that, anyway. I remember it was the only case of a
Cantor set I'd ever seen that looked remotely natural. Sorry for the vague
nature of this example - maybe it rings a bell with someone else?
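For anyone who hasn't met a Cantor set of positive measure: the standard textbook construction (Smith-Volterra-Cantor, nothing to do with the asteroid paper) goes like this, and a one-liner confirms the measure:

```python
# Smith-Volterra-Cantor ("fat Cantor") set: from [0, 1], at stage k remove
# an open middle interval of length 4**(-k) from each of the 2**(k-1)
# surviving pieces.  Total length removed = sum_k 2**(k-1) / 4**k = 1/2,
# so what remains is nowhere dense yet has Lebesgue measure 1/2.
removed = sum(2 ** (k - 1) / 4 ** k for k in range(1, 60))
print(1 - removed)   # ~0.5, the measure of the fat Cantor set
```

A prior singular with respect to Lebesgue measure but supported on such a set would indeed play havoc with any density-based "objective" recipe.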

------

Both Saul Youssef and Daryl McCullough veered onto another sub-thread that
I approached above:- ultra-tiny probabilities, and how to handle them.

Daryl:

The only meaning that can be given to the claim
that "heads and tails each have probability 1/2" is in terms of the
limits of infinite runs of coin-flips - it is roughly equivalent to
saying that the probability of an infinite run for which the relative
frequency of the occurrence of heads is not 1/2 is zero. However,
that simply makes the meaning of "probability 1/2" dependent on the
meaning of "probability 0". But what is the meaning of *that*?

It doesn't really seem to have much to do with coherence, as you suggest.
I don't think this *is* a problem for the Bayesian, though it may be a
concern for the frequentist.

Saul:

Then the probability that n/N successes are observed is binomial and in
the large N limit concentrated around n/N=p. This means nothing, yet,
of course, unless we add the assumption (roughly) that "probability
zero events don't happen".

Exactly so. This gets to the heart of what seems to me to be the *only*
proper way to do frequentist statistics - well anyway it's the way I do it
myself in the only context where I personally do statistics, i.e. testing
the effectiveness of my own card-shuffling methods!

But there is a serious point. I had a colleague who used to say that when
a client came to you with a statistical problem, he wanted an answer couched
in similar terms to the way he presented it - and these would typically
NOT include the words probability, likelihood, or whatever. He wants something
more definite. Now such a client may be thought unreasonable; but I'm not so
sure. Let's consider an easy example.

There is an observation taken from a probability distribution we all *know*
to be Uniform on [A, A+1]. Only A is uncertain. One observation is taken,
and it turns out to be 2.7. We can safely declare that A is in the
range 1.7 to 2.7! No mucking about with confidence intervals or priors or
significance or anything else the client doesn't want to hear about! Great.
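A two-line simulation of my own makes the certainty plain (the simulation knows A; the statistician only sees x):

```python
import random

# If X ~ Uniform(A, A+1), then A is *certain* to lie in [X-1, X].
# (A is "unknown" to the statistician; the simulation gets to know it.)
random.seed(0)
A = 1.9
for _ in range(1000):
    x = A + random.random()      # one observation from Uniform(A, A+1)
    assert x - 1 <= A <= x       # never fails, whatever A is
print("A pinned to an interval of width 1, with certainty")
```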

Of course no client beyond kindergarten level is going to ask for our opinion
in that scenario! But if we can make every statistical situation come out
in a similarly definite way, we would really be onto something. And this is
not necessarily absurd. In the standard situation where we observe X-bar of
a sample of size n from Normal(mu, 1), we can do it. Just give a confidence
interval of 99.9999 % confidence! (And this can be done by merely (say)
tripling the usual 95% one.) This is now "certain to be right", as in the
uniform example! "Certain", as in the excerpts above, meaning merely in the
thermodynamic sense that "we'll never get it wrong"; or better, that other
non-statistical effects (such as broken equipment, universal insanity, nuclear
war, etc) will swamp this tiny statistical uncertainty. Of course, getting
near-perfect hypothesis tests or confidence intervals like the above, may
entail HUGE useless intervals, or whatever; but not necessarily always so.
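A quick check of the "merely tripling" remark, using standard normal quantiles (my own sketch, nothing from the thread):

```python
from statistics import NormalDist

# Two-sided z half-widths for estimating the mean of Normal(mu, 1) from a
# sample of size n: the interval is xbar +/- z / sqrt(n).
z95 = NormalDist().inv_cdf(1 - 0.05 / 2)             # ~1.96 for 95%
z_near_certain = NormalDist().inv_cdf(1 - 1e-6 / 2)  # ~4.89 for 99.9999%

print(z_near_certain / z95)   # ~2.5: "tripling" the 95% interval is generous
```

So the near-certain interval is only about two and a half times as wide in this case; far from HUGE and useless.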

Not once in your whole career will you ever make a mistake this way! Of course,
it may be objected, as Daryl says:

there is no consistent way to treat small but nonzero probabilities
(say, a 1 in a million chance) as zero, because they have a tendency of
adding up.

But if you're only going to do a thousand of these in your career (and that's
busy!), you'll still never be wrong! Daryl would want to complain that we
might set up a computer simulation to do millions of these tests in a row;
and *now* we'd be in trouble. But the frequentist merely answers - that's
changing the experiment half-way through! We just reset the confidence level
so that *not one* of these million simulations will come out with the wrong
answer (with huge likelihood). This re-setting the levels is a bit reminiscent
of the Copenhagenist constantly shifting the point where the "collapse" occurs,
every time the opposition presses him with a tighter experiment.

-----

Well all this has little to do with QM or physics any more. But while I'm
waffling on, I'll just pick up on another comment from Ben Weiner:

The one thing that still bugs me is the relation of Bayesian approaches
to non-parametric statistics. Or the lack of same ...

Absolutely! This is a constant embarrassment to Bayesians, as Lindley is the
first to admit. I did once attend a Bayesian non-parametric seminar, but it
was rather disappointing. The matters attended to were far from the usual
non-parametric concerns. Non-parametrics are one of the few things that ever
struck me as being a form of "absolute" statistics.
They still strike me as having an almost magical way of getting something
out of almost nothing.

Simple example: You are to observe 3 independent measurements from some
probability distribution which you haven't a CLUE what it might be! (Very
un-Bayesian idea!) You know only that it's continuous; (unnecessary technical
convenience.) If you want to estimate the MEDIAN, (not the mean), of the
underlying distribution, you can make an "exact" statement:-

With a "75% chance", the median is between the top one and the bottom one!
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is, as I say, a beautiful "absolute" result; requiring not the
slightest hint of any prior on the underlying distribution! If only all
statistics could be as neat as this!
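The 75% comes from nothing but symmetry: each observation falls below the true median independently with probability 1/2, and the median escapes the sample range only if all three land on the same side, i.e. with probability 2 x (1/2)^3 = 1/4. A simulation of my own (the Exponential(1) choice is arbitrary; any continuous distribution gives the same answer):

```python
import random

# Each of 3 iid draws from ANY continuous distribution falls below the true
# median independently with probability 1/2.  The median escapes the sample
# range only if all 3 land on the same side: 2 * (1/2)**3 = 1/4.
random.seed(42)
n, hits = 100_000, 0
true_median = 0.6931471805599453          # ln 2, median of Exponential(1)
for _ in range(n):
    xs = [random.expovariate(1.0) for _ in range(3)]
    if min(xs) < true_median < max(xs):
        hits += 1
print(hits / n)   # ~0.75, whatever continuous distribution you pick
```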

Naturally, the true-blue Bayesian would throw up his hands in horror at this
example, for several reasons; but especially he would say I've changed...

True, from his every-probability-is-conditional standpoint I have committed
heresy - the conditionals have changed.

But really, most people would think it sensible to regard these as being
essentially the same statement. Certainly, after a hard day, in fading light,
it's hard to distinguish...

P(3 observations straddle the true median) = .75 and

P(true median is within the 3 observations) = .75 .

I would say, (and I think all but fully committed Bayesians would say),
the second is just a re-phrasing of the first...

-------------------------------------------------------------------------
Bill Taylor wft@math.canterbury.ac.nz
-------------------------------------------------------------------------
Of COURSE you're entitled to your own opinions, just keep them to yourself.

Personally, I don't mind talking sometimes as if there were a "real
prior out there", because I'm perfectly aware of the limitations of
this way of talking. This is exactly what I think I'm doing when I
write down a wavefunction psi and say to myself "this is the state
of the electron" instead of "this is the state I am assuming
for the electron". The former is quicker and easier to say. It's
also not clear that it ever meant anything other than the latter!

I know that this seems like a philosophical point, but I think that
this particular issue has big consequences. For instance...

It seems to me that the "state" view of the wavefunction
leads directly to people trying to invent dynamical collapse
mechanisms, to "measure the wavefunction", to "make macroscopic
superpositions", to think that QM has non-local effects in it,
to believe in MWI, and so on. I think that one has to at least
to admit that many papers have been written based on taking one
side of this seemingly pedantic distinction. Also, if the state
view really is correct, then maybe there really *are* many worlds; this
may then have other consequences, etc.

I'm sure that none of this confuses you personally, but my impression
is that most people learning QM take the state view without realizing
that they are assuming something and then get all confused. Also...

If the wavefunction really has the same status as a Bayesian
probability distribution, you would expect there to be a systematic
way to "improve wavefunctions based on your prior knowledge"
just as there is in Bayesian Inference. Understanding exactly the
right way to do this has had tremendous consequences in statistics
and it's easy to imagine it having important practical consequences
in quantum mechanics as well. However, the "state" view may prevent you
from improving your wavefunction at all, because you think it's not
allowed! This is just like frequentists who have ineffective or
non-existent solutions to certain problems because they think of
their probability distributions as "real."

By the way, I know of a paper or two where people do take the
Bayesian view of wavefunctions and try to improve them using maxent
e.g. Canosa, Rossignoli and Plastino, Phys.Rev.C 45(1992)1162. If
anyone knows of others, I would be interested to hear about it.

It's so nice when people get your oblique references. Speaking of
references, do you (or anyone out there) have any good references to the
concept of an "improper prior" (by which I guess you must mean
an infinite measure, like Lebesgue measure on the real line, which can't
be normalized to give a probability measure) in probability or
statistics? I happen to be working on a paper which deals with some
crazy ideas, some of which concern the role this notion plays in quantum
theory (where operator algebraists call it a "weight"). I
know Don Page has thought about this kind of thing in the context of
quantum cosmology, but what I'd like to see is what statisticians think
about it. You can get into big trouble with them (e.g., the paradoxes
where you "pick a random real number"), but there still seem
to be cases (like the above) where one wants to think of them as a kind
of prior.

John, most books on Bayesian theory will discuss such "improper priors".
They are very commonly used as indifference priors. See, e.g., Berger's
book on decision theory. Jaynes' manuscript also discusses them.

Vic Barnett, _Comparative Statistical Inference_ (Wiley),
which is a balanced view of both Bayesian and frequentist
statistics.

Also, Ed Jaynes is in the process of writing a book (which of
course flogs his point of view and, being by him, is fairly
polemical). He encourages distribution of the manuscript and
is interested in comments. I have the LaTeX source of a (I
won't say _the_) recent version. Some chapters are fairly
incomplete. If you are interested I could deposit it in our
FTP area for you. It is pretty thick (1", double-sided).

Cheers, Bill

John,

I thought you might also be interested in the following
books, which are written from a historical point of
view:

First and foremost, Stephen M. Stigler's _The History
of Statistics_ (Harvard/Belknap 1986). This only goes
up to 1900.

You might enjoy reading Persi Diaconis' article,
"Bayesian Numerical Analysis", in _Statistical
Decision Theory and Related Topics IV, Vol 1_,
p. 163 (1988). Also, John Skilling's
article, "The Eigenvalues of Mega-Dimensional
Matrices" in _Maximum Entropy and Bayesian
Methods_, p. 455 (1989). While these are a little
off from what you propose, there is enough common
ground to make them interesting, I think.

Cheers, Bill

With age and experience in research come the twin dangers of
dwindling into a philosopher of science while being enlarged into a
dotard. - C. Truesdell, An Idiot's Fugitive Essays on Science.