Sunday, March 25, 2012

Judea Pearl, a UCLA professor of computer science, is one of the world's leading thinkers -- if not the leading thinker -- on conceptual approaches to causal inference. He is the author of the book Causality and of numerous articles and presentations. He also operates the UCLA Causality Blog, a link to which appears in the left-hand column of the present page. On top of all this, Pearl recently received the Association for Computing Machinery (ACM) Turing Award for his contributions to artificial intelligence.

"Accessible" is not a word I would use to describe Pearl's writings, however. I have previously described the level of Pearl's writing as "quite frankly, well over my head." Heavy with logic symbols, Pearl's texts would, I suspect, challenge even many well-educated students of causality.

Fortunately for those of us seeking greater understanding of Pearl's ideas, Michael Nielsen has written an article trying to explain Pearl's "causal calculus" to a wider audience. I couldn't understand everything Nielsen wrote, but in relative terms, I found his exposition easier to grasp than Pearl's.

Fairly early on, Nielsen introduces the familiar example of smoking and lung cancer to discuss what conclusions can be drawn from correlational (observational) vs. randomized-controlled research designs. (He seems to use the word "experimental" generically for any empirical investigation, adding terms such as "intervention" or "randomized controlled" when he means that participants are randomly assigned to conditions.) Noting that human participants cannot ethically be randomly assigned to smoke cigarettes, Nielsen tantalizes the reader as follows:

We’ll see that even without doing a randomized controlled experiment it’s possible (with the aid of some reasonable assumptions) to infer what the outcome of a randomized controlled experiment would have been, using only relatively easily accessible experimental data, data that doesn’t require experimental intervention to force people to smoke or not, but which can be obtained from purely observational studies.

The main points I gleaned from Nielsen's piece were that (a) we can learn more than I previously thought simply from diagramming hypothetical causal relations between variables as in structural equation modeling or path analysis; and (b) one's conceptual model can be translated into conditional probability statements (i.e., given x, what is the probability of y) that potentially can be manipulated to answer causal questions without a randomized experiment. As Nielsen explains:

...Pearl had what turns out to be a very clever idea: to imagine a hypothetical world in which it really is possible to force someone to (for example) smoke, or not smoke. In particular, he introduced a conditional causal probability p(cancer|do(smoking)), which is the conditional probability of cancer in this hypothetical world. This should be read as the (causal conditional) probability of cancer given that we “do” smoking, i.e., someone has been forced to smoke in a (hypothetical) randomized experiment.

Now, at first sight this appears a rather useless thing to do. But what makes it a clever imaginative leap is that although it may be impossible or impractical to do a controlled experiment to determine p(cancer|do(smoking)), Pearl was able to establish a set of rules – a causal calculus – that such causal conditional probabilities should obey. And, by making use of this causal calculus, it turns out to sometimes be possible to infer the value of probabilities such as p(cancer|do(smoking)), even when a controlled, randomized experiment is impossible.
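A small numerical sketch may help show why p(cancer|do(smoking)) is a different quantity from the ordinary conditional probability p(cancer|smoking). The model and every number in it are invented for illustration (none of it comes from Nielsen or Pearl): a hidden binary factor u raises both the chance of smoking and the chance of cancer. Because the model is fully specified, both quantities can be computed exactly, which would not be possible with real data where u is unobserved.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration.
P_U = {0: 0.7, 1: 0.3}                 # p(u): hidden factor, e.g. a gene
P_SMOKE1_GIVEN_U = {0: 0.2, 1: 0.8}    # p(smoking=1 | u)
# p(cancer=1 | smoking, u), keyed by (smoking, u)
P_CANCER1_GIVEN = {(0, 0): 0.05, (1, 0): 0.15, (0, 1): 0.30, (1, 1): 0.50}


def bern(p_one, value):
    """Probability that a binary variable equals `value`, given p(value=1)."""
    return p_one if value == 1 else 1.0 - p_one


def joint(u, smoking, cancer):
    """Joint probability p(u, smoking, cancer) in the observational world."""
    return (P_U[u]
            * bern(P_SMOKE1_GIVEN_U[u], smoking)
            * bern(P_CANCER1_GIVEN[(smoking, u)], cancer))


def p_cancer_given_smoking(smoking):
    """Ordinary conditional probability p(cancer=1 | smoking)."""
    numerator = sum(joint(u, smoking, 1) for u in (0, 1))
    denominator = sum(joint(u, smoking, c)
                      for u, c in product((0, 1), repeat=2))
    return numerator / denominator


def p_cancer_do_smoking(smoking):
    """p(cancer=1 | do(smoking)): smoking is forced, so u keeps its
    population distribution p(u) instead of p(u | smoking)."""
    return sum(P_U[u] * P_CANCER1_GIVEN[(smoking, u)] for u in (0, 1))


if __name__ == "__main__":
    print("p(cancer | smoking)     =", round(p_cancer_given_smoking(1), 3))
    print("p(cancer | do(smoking)) =", round(p_cancer_do_smoking(1), 3))
```

With these made-up numbers the observational figure (about 0.371) overstates the interventional one (0.255), because people who choose to smoke are disproportionately those with the hidden risk factor.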

Returning to the lung-cancer example, it is theoretically possible that smoking leads directly to lung cancer or that an unobserved third variable causes both smoking and lung cancer (also, lung cancer may cause people to begin smoking, but that seems implausible). As Nielsen discusses, we can insert a fourth variable, namely particulate lung residue ("tar"), between smoking and lung cancer in the proposed causal sequence. This inclusion helps us partially break the connection between the hidden third variable and the other variables. Argues Nielsen: "But if the hidden causal factor is genetic, as the tobacco companies argued was the case, then it seems highly unlikely that the genetic factor caused tar in the lungs, except by the indirect route of causing those people to smoke."

Through manipulations such as the above: "the causal calculus lets us do something that seems almost miraculous: we can figure out the probability that someone would get cancer given that they are in the smoking group in a randomized controlled experiment, without needing to do the randomized controlled experiment. And this is true even though there may be a hidden causal factor underlying both smoking and cancer."

Ultimately, the manipulation of equations can lead to a formula to estimate the conditional probability of developing cancer given random assignment to a smoking condition, p(cancer|do(smoking)), as a function of "quantities which may be observed directly from experimental data, and which don’t require intervention to do a randomized, controlled experiment" (see Equation 5 in Nielsen's article). For any given problem, such non-intervention-based probabilities to plug into the equation may or may not be available.
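For the smoking, tar and cancer setup, a result of this kind is commonly known as the front-door adjustment. In notation assumed here (x = smoking, z = tar, y = cancer; it may not match Nielsen's Equation 5 symbol for symbol), it reads p(y|do(x)) = sum over z of p(z|x) times [sum over x' of p(y|x',z) p(x')]. The sketch below checks that formula on a toy model whose numbers are entirely hypothetical. A hidden factor u influences smoking and cancer but reaches tar only through smoking, and smoking affects cancer only through tar. The formula is fed nothing but the observable joint distribution of (x, z, y), and its answer is compared with the true interventional probability, which can be computed here only because the toy model spells out u.

```python
from itertools import product

# Hypothetical numbers, chosen only for illustration.
P_U = {0: 0.7, 1: 0.3}                # p(u): hidden factor
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}       # p(smoking=1 | u)
P_Z1_GIVEN_X = {0: 0.1, 1: 0.9}       # p(tar=1 | smoking)
# p(cancer=1 | tar, u), keyed by (tar, u)
P_Y1_GIVEN_ZU = {(0, 0): 0.05, (1, 0): 0.25, (0, 1): 0.30, (1, 1): 0.60}


def bern(p_one, value):
    """Probability that a binary variable equals `value`, given p(value=1)."""
    return p_one if value == 1 else 1.0 - p_one


def observed_joint():
    """p(x, z, y) with the hidden factor u summed out: what a purely
    observational study could estimate."""
    table = {}
    for x, z, y in product((0, 1), repeat=3):
        table[(x, z, y)] = sum(
            P_U[u]
            * bern(P_X1_GIVEN_U[u], x)
            * bern(P_Z1_GIVEN_X[x], z)
            * bern(P_Y1_GIVEN_ZU[(z, u)], y)
            for u in (0, 1)
        )
    return table


def prob(table, x=None, z=None, y=None):
    """Marginal probability of the specified values; None means 'any'."""
    return sum(
        p for (xi, zi, yi), p in table.items()
        if (x is None or xi == x)
        and (z is None or zi == z)
        and (y is None or yi == y)
    )


def front_door(table, x):
    """Front-door estimate of p(y=1 | do(x)) from observational data only."""
    total = 0.0
    for z in (0, 1):
        p_z_given_x = prob(table, x=x, z=z) / prob(table, x=x)
        inner = sum(
            prob(table, x=xp, z=z, y=1) / prob(table, x=xp, z=z)
            * prob(table, x=xp)
            for xp in (0, 1)
        )
        total += p_z_given_x * inner
    return total


def true_do(x):
    """Ground truth p(y=1 | do(x)), using the hidden factor directly."""
    return sum(
        bern(P_Z1_GIVEN_X[x], z) * P_U[u] * P_Y1_GIVEN_ZU[(z, u)]
        for z in (0, 1) for u in (0, 1)
    )


if __name__ == "__main__":
    table = observed_joint()
    print("front-door estimate:", round(front_door(table, 1), 6))
    print("true p(y|do(x=1)):  ", round(true_do(1), 6))
    print("naive p(y|x=1):     ", round(prob(table, x=1, y=1) / prob(table, x=1), 6))
```

In this toy model the front-door estimate agrees with the true interventional probability, while the naive conditional probability does not. The agreement depends on the structural assumptions stated above (and on every smoking/tar combination having nonzero probability); if the hidden factor affected tar directly, the formula would no longer be guaranteed to hold.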

Nielsen concludes the article by exploring possible future directions in the study of causality. For those interested in causal inference without randomized-controlled studies, Nielsen's article is a must-read.