a blog about science, statistics, and rationality - one of my favorite things

Friday, April 27, 2012

Logical vs. Causal Dependence

At the end of a previous post, I promised to discuss the difference between logical dependence, the substance of probability theory, and causal dependence, which is often assumed to be the thing that probability is directly concerned with. Let's get the ball rolling with a simple example:

A box contains
10 balls: 4 black, 3 white, and 3 red. A man extracts exactly one ball,
‘randomly.’ The extracted ball is never replaced in the box. Consider the
following 2 situations:

a) You know that the extracted ball was red. What is the probability
that another ball extracted in the same way, will also be red?

b) A second ball has been extracted in the same manner as the first,
and is known to be black. The colour of the first ball is not known to you.
What is the probability that it was white?

You should try
to verify that the answer in the first case is 2/9, and in the second case is
1/3.
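
For readers who would rather count than calculate, here is a minimal sketch that checks both answers by enumerating every equally likely ordered pair of draws. The helper names (`conditional`, `first_is`, `second_is`) are my own, introduced only for this illustration.

```python
from fractions import Fraction
from itertools import permutations

# Label the 10 balls individually so that every ordered pair of draws
# (first, second) without replacement is equally likely.
balls = ["black"] * 4 + ["white"] * 3 + ["red"] * 3
draws = list(permutations(range(len(balls)), 2))  # 10 * 9 = 90 ordered pairs


def conditional(event, given):
    """P(event | given), found by counting equally likely ordered draws."""
    matching_given = [d for d in draws if given(d)]
    matching_both = [d for d in matching_given if event(d)]
    return Fraction(len(matching_both), len(matching_given))


def first_is(colour):
    return lambda d: balls[d[0]] == colour


def second_is(colour):
    return lambda d: balls[d[1]] == colour


# Situation (a): the first ball is known to be red.
p_a = conditional(second_is("red"), first_is("red"))

# Situation (b): the second ball is known to be black.
p_b = conditional(first_is("white"), second_is("black"))

# Same question as (b) with the labels 'first' and 'second' exchanged.
p_b_swapped = conditional(second_is("white"), first_is("black"))

print("situation (a):", p_a)
print("situation (b):", p_b)
print("(b) with the draw labels exchanged:", p_b_swapped)
```

The last line of the sketch asks the question of situation (b) with the two draws swapped, so the two printed values can be compared directly.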

Nearly everybody
will agree with my answer to situation (a), but some may hesitate about the
answer for situation (b). This hesitation seems to result from the feeling that
when we write P(A|B) ≠ P(A), then B is, at least partially, the cause of A. (P(A|B) means ‘the
probability for A given the assumption that B is true.’) If that feeling were correct, knowledge of B could not influence the probability for A in situation (b), because the colour of the second ball can have had no
causal influence on the colour of the first ball.

In fact, it
makes no difference at all in what order the balls are drawn, in such cases.
The labels ‘first,’ ‘second,’ ‘nth,’ are really just arbitrary
labels, and we can exchange them as we please, without affecting the outcome of
the calculation.

In case there is
still a doubt, consider a simplified version of our thought experiment:

The box had exactly 2 balls, 1 black and 1 white. Both balls were
drawn, ‘at random.’ The second ball drawn was black. What is the probability
that the first was black?

The product rule
can be written P(AB) = P(A|B)P(B). With this formulation, we can account for
cases where A depends on B. When thinking about this dependence, however, it is
often tempting to think in terms of causal dependence. But probability theory
is concerned with calculations of plausibility with incomplete knowledge, and
so what we really need to consider is not causal dependence, but logical
dependence. We can verify that P(A|B) does not imply that B is the cause of A,
since, thanks to the commutativity of Boolean algebra, AB = BA, and we could
just as easily have written the product rule as P(AB) = P(B|A)P(A).
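
As a worked check, equating the two forms of the product rule gives Bayes' theorem, which can be applied to situation (b) above. The labels W1 ≡ ‘the first ball was white’ and B2 ≡ ‘the second ball was black’ are my own shorthand, introduced only for this illustration; P(B2) = 4/10 follows from the fact that the order labels are interchangeable.

```latex
\begin{align*}
P(W_1 \mid B_2)\,P(B_2) &= P(B_2 \mid W_1)\,P(W_1) \\
P(W_1 \mid B_2) &= \frac{P(B_2 \mid W_1)\,P(W_1)}{P(B_2)}
                 = \frac{(4/9)(3/10)}{4/10}
                 = \frac{1}{3}
\end{align*}
```

The later draw enters the calculation only as information, which is all that the conditional probability requires.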

What is the probability that X committed a crime yesterday, given that he confessed to it today? Surely it is altered by our knowledge of the confession, indicating that
the propositions are not independent in the sense we need for probability
calculations. But it is also clear that a crime committed yesterday was not caused by a confession today.

Edwin Jaynes, in
‘Probability Theory: The Logic of Science,’ gave the following technical
example of the errors that can occur by focusing on causal dependence, rather
than logical dependence. Consider multiple hypothesis testing with a set of n
hypotheses, H1, H2, …, Hn, being examined in
the light of m datasets, D1, D2, …, Dm. When
the datasets are logically independent, the direct probability for the
totality of the data given any one of the hypotheses, Hi, satisfies
a factorization condition,

$$P(D_1 \ldots D_m \mid H_i, I) = \prod_j P(D_j \mid H_i, I) \qquad (1)$$

(The capital 'pi' means multiply for all 'j'.) It can be
shown, however, that the corresponding condition for the alternative hypothesis, Hi' (the proposition that Hi is false),

$$P(D_1 \ldots D_m \mid H_i', I) = \prod_j P(D_j \mid H_i', I) \qquad (2)$$

does not hold
except in highly trivial cases, though some authors have assumed it to be
generally true, based on the fact that no Di has any causal effect
on any other Dj. (Equation (2) requires that P(Dj | Di, Hi', I) = P(Dj | Hi', I).) The datasets maintain their causal independence, as
they must, but they are no longer logically independent. This is because the
amount that equivalent units of new information change the relative
plausibilities of multiple hypotheses depends on the data that have gone before:
the effect of new data on a hypothesis depends on which other hypothesis it
competes with most directly.

In Jaynes’
example, he imagined a machine producing some component in large quantities and
an effort to determine the fraction of components fabricated that are faulty by
randomly sampling 1 component at a time and examining it for faults. The prior
information is supposed to be specific enough to narrow the number of possible
hypotheses to 3:

A ≡ ‘The fraction of components that are faulty is 1/3.’

B ≡ ‘The fraction of components that are faulty is 1/6.’

C ≡ ‘The fraction of components that are faulty is 99/100.’

The prior probabilities for
these hypotheses are as shown at the extreme left of the graph below. The graph
is the calculation of the evolution of the probabilities for each hypothesis as
the number of tested components increases. Recall that, from Bayes’ theorem, each P(Hi|D, I) depends on both P(D|Hi, I) and P(D|Hi', I). Each tested component is found to be faulty, so the
information added is identical with each sample, but the rates of change of the
3 curves (plotted logarithmically) are not constant.

Evolution of the probabilities of 3 hypotheses as
constant new data are added.

Taken from E.T. Jaynes,
‘Probability Theory: The Logic of Science,’ chapter 4.

The
‘Evidence,’ plotted on the vertical axis, is perhaps an unfamiliar expression
of probability information. It is the log odds, given by

$$E = 10 \log_{10} \frac{P(H)}{P(H')} \qquad (3)$$

with the factor
of 10 because we choose to measure evidence in decibels. The base 10 is used because of a perceived psychological advantage (our brains seem to be good at thinking in terms of factors of 10). Because we have used a
logarithmic scale, the products expressed above in equations (1) and (2) become sums, and for constant pieces of new information, we expect to add a
constant amount to the evidence, if both factorization conditions hold. The slopes of the curves are not constant,
however, indicating that this is not the case,
and consecutive items of data are not independent: ΔE depends on what data have preceded that point. Specifically, wherever a pair
of hypotheses cross on the graph, there is a change of slope of the remaining
hypothesis.
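
The changing slopes can be reproduced numerically. The following is a rough sketch rather than Jaynes' own calculation: the fault rates come from hypotheses A, B and C above, but the prior values in `prior` are assumptions of mine (the post shows the priors only on the graph), chosen so that the starting evidence is about -10 dB for A, +10 dB for B and -60 dB for C. The function name `evidence_db` is likewise just a label for this example.

```python
import math

# Fraction of faulty components asserted by each hypothesis (from the text).
fault_rate = {"A": 1 / 3, "B": 1 / 6, "C": 99 / 100}

# ASSUMED priors, for illustration only; the post gives them only graphically.
prior = {
    "A": (1 / 11) * (1 - 1e-6),
    "B": (10 / 11) * (1 - 1e-6),
    "C": 1e-6,
}


def evidence_db(h, n_faulty):
    """Evidence for h in decibels, equation (3), after n_faulty faulty samples.

    The posterior odds for h against h' are the unnormalised weight of h
    divided by the summed weights of all the other hypotheses.
    """
    weight = {k: prior[k] * fault_rate[k] ** n_faulty for k in prior}
    odds = weight[h] / sum(w for k, w in weight.items() if k != h)
    return 10 * math.log10(odds)


n_samples = 20
evidence = {h: [evidence_db(h, n) for n in range(n_samples + 1)] for h in prior}

# Change in evidence produced by each successive (identical) faulty sample.
steps = {h: [e[n + 1] - e[n] for n in range(n_samples)] for h, e in evidence.items()}

for n in range(n_samples):
    row = "  ".join(f"{h}: {steps[h][n]:+6.2f} dB" for h in prior)
    print(f"sample {n + 1:2d}  {row}")
```

If both factorization conditions held, every column of the printed table would be constant. Under the assumed priors, the step for C is larger while B is the leading alternative than it is after A has overtaken B, and the step for A changes sign once C has overtaken B.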

When we calculate P(D|Hi, I) we are
supposing for the purposes of calculation that Hi is true, and so
the result we get is independent of P(Hi), which is why P(D|Hi, I) factorizes. But P(D|Hi', I) is different, because
when the total number of hypotheses is greater than 2, Hi' is composite and decomposes into at least 2 hypotheses, so P(D|Hi', I) relies upon the
relative probabilities for those component propositions.
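
To make this concrete, here is a small sketch, using exact fractions, that tests the factorization condition (2) for the composite hypothesis C' (that is, ‘A or B’) with two faulty components as the data. The priors are assumed values of mine, not taken from the text, and `likelihood_given_not` is a hypothetical helper name. It computes P(D|Hi', I) as the prior-weighted average of P(D|Hk, I) over the remaining hypotheses Hk.

```python
from fractions import Fraction as F

# Fraction of faulty components asserted by each hypothesis (from the text).
fault_rate = {"A": F(1, 3), "B": F(1, 6), "C": F(99, 100)}

# ASSUMED priors, for illustration only.
tiny = F(1, 10**6)
prior = {"A": F(1, 11) * (1 - tiny), "B": F(10, 11) * (1 - tiny), "C": tiny}


def likelihood_given_not(h, n_faulty):
    """P(n_faulty consecutive faulty components | h is false).

    h' is composite, so its likelihood is the prior-weighted mixture of the
    likelihoods under each of the other hypotheses.
    """
    others = [k for k in prior if k != h]
    numerator = sum(prior[k] * fault_rate[k] ** n_faulty for k in others)
    denominator = sum(prior[k] for k in others)
    return numerator / denominator


single = likelihood_given_not("C", 1)  # P(D1 | C')
joint = likelihood_given_not("C", 2)   # P(D1 D2 | C')
second_given_first = joint / single    # P(D2 | D1, C')

print("P(D1 | C')          =", single)
print("P(D1 | C')^2        =", single**2)
print("P(D1 D2 | C')       =", joint)
print("P(D2 | D1, C')      =", second_given_first)
print("condition (2) holds:", joint == single**2)
```

Under C itself, the joint probability of two faulty components is exactly (99/100)^2, so condition (1) holds. Under C', seeing one faulty component shifts weight from B towards A, which makes the next faulty component more probable, so the product form fails even though neither component has any causal effect on the other.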


About Me

I'm behind the grasshopper. I'm a physicist at the University of Houston. I work on radiation monitoring, using pixelated particle detectors, for NASA's astronauts. Previously, I worked in x-ray imaging and, before that, in semiconductor physics. (I don't know if the grasshopper has his own blog.)