Friday, March 30, 2012

My previous post on the base rate fallacy dealt with an example of binary hypothesis testing - hypothesis testing where there are exactly two competing hypotheses, such as either a person has a disease or they do not.

Traditional frequentist hypothesis testing is often considered (slightly erroneously) to be limited to binary hypothesis testing, the two being the null and alternate hypotheses. In fact the alternate hypothesis is never formulated, and does not take part in any calculation, so frequentist hypothesis testing typically only tests a single hypothesis. If this seems bizarre to you, then welcome to the club.

In the Bayesian framework, however, binary hypothesis testing is just a special case. In general there can be any number of hypotheses. For example, the murder was committed by Mr. A, or Mr. B, or Mr. C,.....

In this post I will demonstrate an example of testing an infinite number of hypotheses. In doing so, I will investigate the logical position of ad-hominem arguments - arguments based on who is making a particular point, rather than arguments that address that point. We are often told that such arguments are illogical and should take no place in rational discussion and decision making. A Bayesian, however, acknowledges that it is illogical to ignore relevant prior information when trying to reach a decision about something - anything, including a person. We'll get to the nitty gritty in a moment.

First though, to get warmed up, an amusing example concerning a book, ‘Logic and Its Limits,’ by P. Shaw. It is generally a good
book, introducing many forms of logical error that can occur, particularly in
verbal arguments and in the written word, rather than emphasizing the algebraic
manipulation of formal logic. What I describe here should not be taken as a negative review of this book, it's really just a little glitch - perhaps more like a typo than anything else. But the author of the book is described in its front
matter as ‘senior lecturer in philosophy at the University of Glasgow,’ and so
this mistake from a section on ad hominem
arguments is deliciously ironic. One of the exercises posed to the reader in
this chapter is to analyze the following statement:

"As
Descartes said, appeals to authority are never really reliable."

At the back of the book, in the
answers-to-exercises section, the author tells us that the statement is:

"self-refuting"

It seems to me that the only way one
might think this is self-refuting is if it is assumed that Descartes is himself
an authority. Suppose then that the statement is false as claimed. Then an
authority (Descartes) has told us something unreliable. Since we would have
proof that authorities do make mistakes, we would have to accept that authority
can never be relied upon, as there can be no means to know when an authority is
or is not reliable. Thus, by assuming the statement false we are forced to
accept that the statement is true. Reductio ad absurdum.

How should we
treat the authority of Mr. Shaw on the subject of logic?

(To clarify: the reason we can
assert that Descartes’ statement is correct is that we have performed the
appropriate logical analysis ourselves, and so our statement is not that ‘it is
true because Descartes said it,’ - which would be an appeal to authority - but
that ‘since Descartes, an expert, said it, the falsity of this particular
statement would be a logical contradiction.’ By proving that appeals to
authority are never reliable, we also demonstrate that the statement beginning
“As Descartes said” is not an appeal to
authority.)

Now, on with the main program. Consider that it is possible that the details of an argument may be difficult to access or analyze, but we still would like to assess the probability that the argument is sound. Here is a simple example: a man (let's call him Weety Peebles) has a penchant and a talent for getting his provocative ideas published in the press. In the past you have observed ten occasions when his wild ideas have made the headlines. On each occasion, you have analyzed his claims, and on all occasions but one, found them to be totally devoid of any truth, and to have no logical merit whatsoever. On an eleventh occasion you see in your newspaper a headline '[something interesting if true], declares Weety Peebles.' You don't really want to bother reading it, however, because of your experience. What is the probability that the story is worth reading?

Suppose that when Mr. Peebles goes public with one of his stories he tells the truth with some relative frequency - unknown, but fixed by the many variables that define his nature. In other words, over long stretches of time the fraction of his stories that are true remains about the same. Our first job is to determine the probability distribution that describes the information we have about this frequency. The possibilities for the desired frequency, f, include all numbers between 0 and 1 - an uncountably infinite number of possible frequencies. For each of these values, f, let Hf be the hypothesis that the true frequency is f. By analogy with the principle of indifference for discrete sample spaces, we'll start with a prior that is uniform over the range 0 to 1. This is how we encode the fact that we start with no reason to favor one frequency more than any other. Now all we need to do is use Bayes' theorem to update the prior distribution using the data we have relating to those ten occasions of examining Mr. Peebles' output.

For a set of hypotheses numbered 1 to n, the general form of Bayes' theorem for the first of those hypotheses, H1, is

P(H1 | DI) = P(H1 | I) × P(D | H1I) / P(D | I)     (1)

Where all n propositions are exhaustive and mutually exclusive, the denominator in this equation can be resolved as shown:

P(H1 | DI) = P(H1 | I) × P(D | H1I) / P(DH1 + DH2 + .... + DHn | I)     (2)

(The product, DH, means that both D and H are true, while any sum, A+B, means that A or B is true.) Next, applying the product and extended sum rules (see appendix below), this becomes:

P(H1 | DI) = P(H1 | I) × P(D | H1I) / Σi=1..n P(D | HiI) × P(Hi | I)     (3)

Finally, if the hypotheses are drawn from a continuous sample space, then the sum in the denominator simply becomes an integral.

To perform this update we can divide the range of possible frequencies into, say, 200 intervals, each of width 0.005, and let these approximate our continuous sample space. Then, for each of these intervals, we need to calculate P(Hf | DI).

This means calculating P(D | HiI) for each frequency, which is just fi if it was an occasion where Weety spoke the truth, or (1 - fi) if it was not. We can perform 10 such updates, one for each of the occasions in our experience. Alternatively, we can perform a single update for the 10 occasions in one go, using a result obtained by Thomas Bayes, some time in the 18th century:

P(f | DI) = [(N + 1)! / (n! (N - n)!)] × f^n (1 - f)^(N - n)     (4)

Here, N represents the number of occasions we have experience of and n represents the number of times Mr. Peebles was not talking rubbish. (Note that this formula applies only when we start with a uniform prior, and only when f is constant.)

I performed this calculation very quickly using a spreadsheet (because I'm integrating over little rectangles, I multiply each P by Δf, the width of each rectangle), obtaining the following probability distribution for f, given 10 past experiences, including only one occasion where there was some merit to Weety's story:
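The spreadsheet calculation can be sketched in a few lines of Python (a sketch only; the grid of 200 intervals of width 0.005 and the numbers N = 10, n = 1 are those used above):

```python
# Posterior for f on a grid of 200 intervals, using formula (4)
# with N = 10 past stories, n = 1 of which had merit.
from math import factorial

N, n = 10, 1
df = 0.005                                   # width of each interval
fs = [(i + 0.5) * df for i in range(200)]    # midpoint of each interval

coeff = factorial(N + 1) / (factorial(n) * factorial(N - n))
# Each P is multiplied by df, integrating over little rectangles.
posterior = [coeff * f**n * (1 - f)**(N - n) * df for f in fs]

print(sum(posterior))                        # ≈ 1: the distribution is normalized
print(fs[posterior.index(max(posterior))])   # peak near f = n/N = 0.1
```

Plotting `posterior` against `fs` reproduces the distribution described below, with nearly all the mass between 0 and 0.5.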

Recall the interpretation of the numbers: a frequency of 0 means he never tells the truth, while a frequency of 1 means he always tells the truth. We see that nearly all the 'mass' is located between 0 and 0.5 - as we might have anticipated.

To complete the exercise, it only remains to determine the probability that on our current, 11th occasion, the news item in question is worth reading - in other words, the probability that Weety Peebles is not lying again. We still don't know the actual frequency that determines how often Peebles is truthful, so we have to integrate over the total sample space. We divided the continuous sample space into 200 narrow intervals, each of which we approximate as a discrete frequency. For each of these frequencies we take the product of that frequency, f, with its probability, P(f); adding them all up gives the total probability that the current news story is true. The result of this summation is 0.167 - almost a 17% chance that the story is true. (Note that if I had taken f to be simply 1/10, the exact fraction corresponding to our experience, we would have wrongly estimated only a 10% chance of a story worth reading, which would have done Peebles a slight disservice.)
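The final summation can also be sketched in Python (again a sketch; the grid and the posterior from formula (4) are rebuilt here so the snippet stands alone):

```python
# Probability that the 11th story is true: sum of f × P(f) over the grid.
from math import factorial

N, n, df = 10, 1, 0.005
fs = [(i + 0.5) * df for i in range(200)]
coeff = factorial(N + 1) / (factorial(n) * factorial(N - n))
posterior = [coeff * f**n * (1 - f)**(N - n) * df for f in fs]

p_true = sum(f * p for f, p in zip(fs, posterior))
print(round(p_true, 3))   # 0.167, matching the exact result (n + 1)/(N + 2) = 2/12
```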

Without me even specifying anything about the content of Weety Peebles' story, you are now in a position to say: 'Mr. Peebles, you are very probably talking out of your arse.'

And that is an ad hom with some seriously refined logic behind it.

In general, we know from our experience that we can learn from our experience. Our common sense, therefore, should have warned us that ad-hominem arguments can be rational. In hindsight, we probably recognize that we use them all the time. What we see about Bayes' theorem is that it provides a formal way of quantifying our common sense.

Thursday, March 29, 2012

A man takes a
diagnostic test for a certain disease and the result is positive. The false
positive rate for the test in this case is the same as the false negative rate, 0.001. The
background prevalence of the disease is 1 in 10,000. What is the probability
that he has the disease?

This problem is
one of the simplest possible examples of a broad class of problems, known as
hypothesis testing, concerned with defining a set of mutually contradictory
statements about the world (hypotheses) and figuring out some kind of measure
of the faith we can have in each of them.

It might be
tempting to think that the desired probability is just 1- (false-positive
rate), which would be 0.999. Be warned, however, that this is quite an infamous problem. In 1982, a
study was published [1] for which 100 physicians had been asked to
solve an equivalent question. All but 5 got the answer wrong by a factor of
about 10. Maybe it’s a good idea then to go through the logic carefully.

Think about the
following:

What values should the correct answer
depend on?

Other than reducing the
false-positive rate, what would increase the probability that a person
receiving a positive test result would have the disease?

The correct
calculation needs to find some kind of balance between the likelihood that the
person has the disease (the frequency with which the disease is contracted by
similar people) and the likelihood that the positive test result was a mistake
(the false positive rate). We should see intuitively that if the prevalence of
the disease is high, the probability that any particular positive test result
is a true positive is higher than if the disease is extremely rare.

The rate with
which the disease is contracted is 1 in 10,000 people, so to make it simple, we
will imagine that we have tested 10,000 people. Therefore we expect 1 true case
of the disease. We also expect 10 false positives, so our estimate goes from
0.999 to 1 in 11, 0.09091. This answer is very close, but not precisely
correct.

The frequency
with which we see true positives must be reduced by the possibility of
false negatives as well. How do we encode that in our calculation?

We require the
conditional probability that the man has the disease, given that his test
result was positive, P(D|R+). This is the number of ways of getting
a positive result and having the disease, divided by the total number of ways
of getting a positive test result,

P(D | R+) = P(R+D) / [P(R+D) + P(R+C)] = P(R+ | D) P(D) / [P(R+ | D) P(D) + P(R+ | C) P(C)]     (1),

where D is the
proposition that he has the disease, C means he is clear, and R+ denotes
the positive test result.

If we ask what
is the probability of drawing the ace of hearts on the first draw from a deck
of cards and the ace of spades on the second, without replacing the first card
before the second draw, we have P(AHAS) = P(AH)P(AS|AH).
The probability for the second draw is modified by what we know to have taken place on the
first.

The formula we
have arrived at above, by simple application of common sense, is known as Bayes'
theorem. Many people assume the answer to be more like 0.999, but the correct
answer is an order of magnitude smaller. As mentioned, most medical doctors also get
questions like this wrong by about an order of magnitude. The correct answer to
the question, 0.0908, is called in medical science the positive-predictive value of the test. Generally, it is known as the posterior probability.

Bayes’ theorem
has been a controversial idea during the development of statistical reasoning,
with many authorities dismissing it as an absurdity. This has led to the
consequence that orthodox statistics, still today, does not employ this vitally
important technique. Here, we have developed a special case of Bayes’ theorem
by simple reasoning. In generality, it follows as a straightforward
re-arrangement of probabilistic laws (the product and sum rules) that are so
simple that most authors treat them as axioms, but which in fact can be
rigorously derived (with a little effort) from extremely simple and perfectly
reasonable principles. It is
overwhelmingly one of my central beliefs about science that a logical calculus
of probability can only be achieved, and the highest quality inferences
extracted from data when Bayes’ theorem is accepted and applied whenever
appropriate.

The general
statement of Bayes’ theorem is

P(H | DI) = P(H | I) × P(D | HI) / P(D | I)     (4).

Here 'I' represents the background information: a set of statements concerning the scope of the problem that are considered true for the purposes of the calculation. In working through the medical testing problem, above, I have omitted the 'I', but in every case where I write down a probability without including the 'I', this is to be recognized as shorthand - the 'I' is always really there, and the calculation makes no sense without it.

The error that
leads many people to overestimate, by an order of magnitude, probabilities
such as the one required in this question is known as the base-rate fallacy.
Specifically in this case, the base rate, or expected incidence, of the disease
has been ignored, leading to a calamitous miscalculation. The base-rate fallacy
amounts to believing that P(A|B) = P(B|A). In the above calculation this
corresponds to saying that P(D|R+), which was desired, is the same
as P(R+|D), the latter being equal to 1 – false positive rate.

In frequentist statistics, a probability is identified with a frequency. In this framework, therefore, it makes no sense to ask what is the probability that a hypothesis H is true, since there is no sense in which a relative frequency for the truth of H can be obtained. As a measure of faith in the proposition H in light of data, D, therefore, the frequentist habitually uses not P(H|D), but P(D|H), and so he commits himself to committing the base-rate fallacy.

In case it is
still not completely clear that the base-rate fallacy is indeed a fallacy, let's employ a thought experiment with an extreme case. (These extreme cases, while not necessarily realistic, allow the desired outcome of a theory to be obtained directly and compared with the result of the theory - something computer scientists call a 'sanity check'.) Imagine the
case where the base rate is higher than the sensitivity of the test. For
example let the sensitivity be 98% (i.e. a 2% false negative rate) and let the
background prevalence of the disease be 99%. Then, P(B|A) is 0.98, and
substituting this for P(A|B), we have an answer that is lower than P(A) = 0.99.
The positive result of a high-quality test (98% sensitivity) would then give a lower
probability that the test subject has the disease than before the
test result was known - an absurd conclusion.
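The sanity check itself can be run numerically. This is a sketch with the stated numbers; the 2% false positive rate used for the correct calculation is an assumption added here, since only the sensitivity and prevalence are given in the text:

```python
# Extreme-case sanity check: sensitivity 0.98, prevalence 0.99.
sens = 0.98              # P(R+ | D), sensitivity
fp_rate = 0.02           # P(R+ | C) - an assumed value, not stated in the text
p_disease = 0.99         # P(D), background prevalence

fallacy = sens           # base-rate fallacy: taking P(D | R+) to equal P(R+ | D)
correct = sens * p_disease / (sens * p_disease + fp_rate * (1 - p_disease))

print(fallacy < p_disease)    # True: the fallacious answer drops below the prior
print(round(correct, 4))      # 0.9998: the correct posterior rises, as it must
```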

[1] Eddy, D. M. (1982). Probabilistic reasoning in
clinical medicine: Problems and opportunities. In D. Kahneman,P. Slovic, & A. Tversky (Eds.),
Judgment under uncertainty: Heuristics and biases (pp. 249–267). Cambridge,England: Cambridge University Press. (In
this study 95 out of 100 physicians answered between 0.7 and 0.8 to a similar
question, to which the correct answer was 0.078.)

Wednesday, March 28, 2012

In 1948 Claude
Shannon forged a link between the thermodynamic concept of entropy and a new
formal concept of information. This event marked the beginning of information
theory. This discovery captured the imagination of Ed Jaynes, a physicist with
strong interest in statistical mechanics and probability theory. His expertise
in statistical mechanics meant that he understood entropy better than many. His
recognition of probability theory as an extended form of logic meant that he
understood that probability calculations (and therefore all of science) are
concerned not directly with truths about reality, as many have supposed, but
with information about truths.

The distinction
may seem strange – science accepts that there are statements about nature that
are objectively either true or false, and definitely not some combination of
true and false, so the most desirable goal must be to know which of the
options, ‘true’ or ‘false’ is the case. But the truth values of such statements
are not accessible to human sensation, and therefore remain hidden also from
human science. This is a difficult fact for intelligent animals like us to deal
with, but we have learned to do so, partly by inventing a set of procedures
called science. Science acknowledges that the truth of a proposition can not be
known with certainty, and so it sets out instead to determine the probability
of truth. For this purpose, it combines empirical information and logic.

For Ed Jaynes, therefore, Shannon’s new information theory
was instantly recognizable as a breakthrough of massive importance. Jaynes
thought about this new tool, meditated on it, digested it, and played with it
intensely. One of the outcomes of this meditation was a beautiful idea known as
maximum entropy. The title of this blog, then, is a tribute to Edwin Jaynes, to
this beautiful idea of his, and to the many more exceptional ideas he produced.

As a physicist,
I never received much education in statistics and probability – we know the sum
and product rules, we know how to write down the formulae for the Poisson and
normal distributions and how to calculate a mean and a standard deviation, and
that’s about it really. Oh and some typically badly understood model fitting by
maximum likelihood (we call it ‘method of least squares’, which if you know
stats, tells you how limited our understanding is).

During my PhD
studies in semiconductor physics, I became very dissatisfied with this
situation, as it gradually dawned on me that scientific method and statistical
inference must rightly be considered as synonymous: they are both the rational
procedure for estimating what is likely to be true, given our necessarily
limited information. I set out to teach myself as much as I could about
statistics. Not surprisingly, my first investigations led me to what is often
referred to as orthodox methodology. I laboured with the traditional hypothesis
tests – t-tests and so forth – but I found the whole framework very
unpalatable: confused, disjointed, self-contradicting – just ugly. Then I
stumbled on Bayes’ theorem, and my world view was elevated to a higher plane.
Some time after that I discovered Ed Jaynes’ book, ‘Probability Theory: The
Logic of Science,’ and my horizon was expanded again, by another order of magnitude.
Problems that I had thought to be only approachable by the orthodox methods
became recognizable as simple extensions of Bayes’ theorem, and any nagging
doubts I had about the validity of the Bayesian program were banished by
Jaynes’ clearly formulated logic.

It is not that I
am totally against orthodox (sometimes called frequentist) methods. But the
success of frequentist techniques is limited to the range of circumstances in
which they do a reasonable job of approximating Bayes’ theorem. The range of
applications, however, in which the two approaches diverge is unfortunately
quite large, while orthodox theory seems to have nothing fundamental to say about when to expect such divergence.

Bayes’ theorem
works by taking a prior probability distribution and combining it with some
data to produce an updated distribution, known as the posterior probability.
After the next set of data comes in, the posterior probability is treated as
the new prior, and another update is performed. The process goes on as long as
we wish, with presumably the posterior probability distributions narrowing down
ever closer upon a particular hypothesis.

One of the
problems we might anticipate with this procedure, however, is where does the
process start? What do we use as our original prior? The principle of
indifference works in many cases. Indifference works like this: if I am told
that a 6 sided die is to be thrown, with no additional information about the
die or the method of throwing, then symmetry considerations require that the
probability for any of the sides to end up facing upwards is 1/6. For some
more complex situations, however, indifference fails. One of the things that the principle of maximum
entropy achieves is to provide a technique for assigning priors in a huge range
of new problems that are inaccessible to the principle of indifference.

As Shannon
discovered, information can be considered as the flip side of entropy, a
thermodynamic idea representing disorder – the more information, the less
entropy. Why then should science be interested in maximizing entropy? What we
are looking for is the probability distribution that incorporates whatever
information we have, without inadvertently incorporating any assumed
information that we do not have. We need that probability distribution with the
maximum amount of entropy possible, given the constraints set by our available
information. Maximum entropy, therefore, is a tool for specifying exactly how
much information we possess on a given matter, which is evidently one of the
highest possible goals of honest, rational science. This is why I feel that
‘maximum entropy’ is an appropriate title for this blog about scientific
method.

About Me

I'm behind the grasshopper. I'm a physicist at the University of Houston. I work on radiation monitoring, using pixelated particle detectors, for NASA's astronauts. Previously, I worked in x-ray imaging and, before that, in semiconductor physics. (I don't know if the grasshopper has his own blog.)