Causation

—Larry Wasserman

In this post, I will discuss something elementary yet important: causation versus association.

Although it is well-worn territory, the topic of causation still causes enormous confusion. The media confuse correlation and causation all the time. In fact, it is common to see a reporter discuss a study, warn the listener that the result is only an association and has not been proved to be causal, and then go on to discuss the finding as if it is causal. It usually goes something like this:

“A study reports that those who sleep less are more likely to have health problems.”

So far so good.

“Researchers emphasize that they have not established a causal connection.”

Even better, no claim of causation. But then:

“So make sure you get sufficient sleep to avoid these nasty health problems.”

Ouch, there it is. The leap from association to causation.

What’s worse is that even people trained to know better, namely statisticians and ML people, make the same mistake all the time. They will teach the difference between the two in class and then, a minute after leaving class, fall back into the same fog of confusion as the hypothetical reporter above. (I am guilty of this too.) This just shows how hard-wired our brains are for making the causal leap.

There are (at least) two formal ways to discuss causation rigorously: one is based on counterfactuals and the other is based on causal directed acyclic graphs (DAGs). They are essentially equivalent. Some things are more easily discussed in one language than the other. I will use the language of DAGs here.

Consider a putative cause $X$ and a response $Y$. Let $U$ represent all variables that could affect $X$ or $Y$. To be concrete, let’s say $X$ is stress and $Y$ is getting a cold. The variable $U$ is a very high-dimensional vector including genetic variables, environmental variables, etc. The elements of $U$ are called confounding variables.

The causal DAG looks like this:

[Figure: causal DAG with arrows U → X, U → Y, and X → Y.]

Suppose we only observe $X$ and $Y$ on a large number of people. $U$ is unobserved.

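The DAG has several implications. First, the distribution $p(u,x,y)$ factors as $p(u)\,p(x \mid u)\,p(y \mid x,u)$. Well, that’s a pretty vacuous statement but let’s keep going. The association between $X$ and $Y$ is described by the conditional distribution $p(y \mid x)$. This distribution can be consistently estimated from the observed data. No doubt we will see an association between $X$ and $Y$ (that is, $p(y \mid x)$ will indeed be a function of $x$). As usual,

$$p(y \mid x) = \frac{p(x,y)}{p(x)} = \frac{\int p(u,x,y)\,du}{p(x)} = \int p(y \mid x,u)\,\frac{p(x \mid u)\,p(u)}{p(x)}\,du = \int p(y \mid x,u)\,p(u \mid x)\,du.$$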

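The causal distribution is the distribution we get, not by conditioning, but by intervening and changing the graph. Specifically, we break the arrow into $X$ and we fix $X$ at a value $x$. The new graph is

[Figure: intervened DAG with X fixed at the value x and arrows U → Y and X → Y; the arrow U → X is removed.]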

The joint distribution for this graph is

$$p(u)\,\delta_x\,p(y \mid x,u)$$

where $p(x \mid u)$ has been replaced with a point mass distribution $\delta_x$ at $x$. The causal distribution is the marginal distribution of $Y$ in the new graph, which is

$$\int p(y \mid x,u)\,p(u)\,du.$$

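Using the language of Spirtes, Glymour and Scheines and Pearl, we can summarize this as:

$$p(y \mid \text{Observe } X = x) = p(y \mid x) = \int p(y \mid x,u)\,p(u \mid x)\,du$$

while

$$p(y \mid \text{Set } X = x) = \int p(y \mid x,u)\,p(u)\,du.$$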

We immediately deduce the following:

1. They are different. This is just the formalization of the fact that causation is not association.

2. If there is no arrow from $X$ to $Y$ in the original graph we will find that $p(y \mid \text{Observe } X = x)$ depends on $x$ but $p(y \mid \text{Set } X = x)$ does not depend on $x$. This is the common case where there is no causal relationship between $X$ and $Y$ yet we see a predictive relationship between $X$ and $Y$. This is grist for many bogus stories on CNN and in the NY Times.

3. The causal distribution $p(y \mid \text{Set } X = x)$ is not estimable. It depends on $p(u)$ and $p(y \mid x,u)$ but $U$ is not observed.

4. The reason why epidemiologists collect data on lots of other variables is that they are trying to measure $U$ or, at least, measure some elements of $U$. Then they can estimate

$$p(y \mid \text{Set } X = x) = \int p(y \mid x,u)\,p(u)\,du.$$

This is called adjusting for confounders. Of course, they will never measure all of $U$, which is why observational studies, though useful, must be taken with a grain of salt.

5. In a randomized study, where we assign the value of $X$ to subjects randomly, we break the arrow from $U$ to $X$. In this case, it is easy to check that

$$p(y \mid \text{Observe } X = x) = p(y \mid \text{Set } X = x).$$

In other words, we force association to equal causation. That’s why we spend millions of dollars doing randomized studies and it’s why randomized studies are the gold standard.
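The points in the list above can be checked by brute force on a toy model. What follows is only a sketch: it assumes binary U, X and Y, and every probability table in it is invented for illustration (none of the numbers come from any study). There is deliberately no arrow from X to Y. The observational distribution averages p(y | x, u) with weights p(u | x), while the interventional one uses weights p(u).

```python
# Toy causal model with binary U, X, Y. All numbers are invented for illustration.
P_U = {0: 0.5, 1: 0.5}             # p(u)
P_X1_GIVEN_U = {0: 0.2, 1: 0.8}    # p(X=1 | u): the confounder pushes X up
P_Y1_GIVEN_U = {0: 0.1, 1: 0.7}    # p(Y=1 | x, u): same for every x, so no X -> Y arrow
RANDOMIZED = {0: 0.5, 1: 0.5}      # p(X=1 | u) under a coin flip: no U -> X arrow


def p_x_given_u(x, u, x1_table):
    """p(x | u) for binary x, given a table of p(X=1 | u)."""
    return x1_table[u] if x == 1 else 1.0 - x1_table[u]


def p_y1_given_xu(x, u):
    """p(Y=1 | x, u); x is deliberately ignored (no causal effect of X on Y)."""
    return P_Y1_GIVEN_U[u]


def observe(x, x1_table=P_X1_GIVEN_U):
    """p(Y=1 | Observe X=x) = sum_u p(Y=1 | x, u) p(u | x)."""
    p_x = sum(p_x_given_u(x, u, x1_table) * P_U[u] for u in P_U)
    return sum(
        p_y1_given_xu(x, u) * p_x_given_u(x, u, x1_table) * P_U[u] / p_x
        for u in P_U
    )


def intervene(x):
    """p(Y=1 | Set X=x) = sum_u p(Y=1 | x, u) p(u)."""
    return sum(p_y1_given_xu(x, u) * P_U[u] for u in P_U)


if __name__ == "__main__":
    for x in (0, 1):
        print(
            f"x={x}: observe={observe(x):.2f}  set={intervene(x):.2f}  "
            f"observe under randomization={observe(x, RANDOMIZED):.2f}"
        )
```

With these made-up tables the observational probability of Y=1 moves from 0.22 to 0.58 as x goes from 0 to 1, while the interventional probability stays at 0.40 for both values. Replacing p(x | u) with a coin flip makes the observational value equal 0.40 as well, which is point 5.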

This raises an interesting question: have there been randomized studies to follow up results from observational studies? In fact, there have. In a recent article, Young and Karr (2011) found 12 such randomized studies following up on 52 positive claims from observational studies. They then asked, of the 52 claims, how many were verified by the randomized studies? The answer is depressing: zero.

That’s right, 0 out of 52 effects turned out to be real.

We should be careful not to over-generalize here because these studies are certainly not a representative sample of all studies. Still, 0/52 is sobering.

In a future post, I will discuss Simpson’s paradox. Here is a preview. Suppose $U$ is observed. If there is an arrow from $U$ to $X$ then $p(y \mid x) \neq \int p(y \mid x,u)\,p(u)\,du$ but when there is no arrow from $U$ to $X$ then $p(y \mid x) = \int p(y \mid x,u)\,p(u)\,du$. Nothing complicated. But when we change the math into words, people (including most statisticians and computer scientists) get very confused. I’ll save the details for a future post.

Quote: “This raises an interesting question: have there been randomized studies to follow up results from observational studies? In fact, there have. In a recent article, Young and Karr (2011) found 12 such randomized studies following up on 52 positive claims from observational studies. They then asked, of the 52 claims, how many were verified by the randomized studies? The answer is depressing: zero. That’s right, 0 out of 52 effects turned out to be real.
We should be careful not to over-generalize here because these studies are certainly not a representative sample of all studies. Still, 0/52 is sobering.”

Sobering indeed, but not in the way intended. I suggest that the blog comment is more a display of a statistician failing to delve into the material and instead merely repeating uncritically a questionable claim. What the 0/52 should alert you to is that this “meta-study” looks cooked. Forensic stats (starting with Fisher’s famed comments on Mendel’s data) tells us that data that fit a story too well (in this case, perfectly fitting Young and Karr’s claims) need to be looked at very closely.

OK, good that it was mentioned that the 52 picked were in no sense a random sample of what could be said to have been studied in some way by RCTs – but how about noticing that Young and Karr looked at only 12 RCTs, not 52 (the 52 are from the 12, so these are in some ways highly correlated results). How many trials were intervention studies of hazards like, say, smoking-cessation interventions? Zero. How many were postmarketing trials powered for adverse drug side effects? Zero. 10/12 are nutrient supplement studies, two involve HRTs. There are hundreds of trials of nutrients and of HRTs, so why these 12?

On the representativeness issue, nutrient epidemiology has long been recognized as the most biased and noisy branch of the field, with study findings usually disappearing upon any reasonable adjustments for measurement error and multiplicity. That has been known for a few decades, so this topic looks nicely cherry picked. Any replication failure by RCTs only shows that those adjustments were right and could have been used to bypass the RCTs altogether. Indicting all of observational epidemiology based on this sub-branch is like indicting the entire field of statistics because there is rampant misuse of statistical significance throughout health and medical science…hmmm, maybe not such a bad idea because…

Ask what criteria Young and Karr used to make such discrete claims of “positive” vs. “negative” let alone verification vs. no verification – all these studies including the RCTs involved at least random error, so how were ambiguous cases shuffled into one of the artificial binary slots? “Significance” is mentioned but what would be the confidence intervals for the difference between the observational studies and the RCTs – especially if the hundreds of relevant studies and RCTs were pooled with covariate control and measurement error adjustments were made? Not a hint in Young and Karr.

And then, should the RCTs be considered sacrosanct? Everyone honest in the RCT business knows those studies can have considerable compliance problems and that participants are far from representative of a practice population and are often picked precisely for having different risks of outcomes than any general population (e.g., low risk for side effects, high risks for targeted outcomes). And that the treatment versions and protocols followed in the trials may be far from what happens in practice. For vitamin and mineral supplements the differences are spectacular, since the commercial supplements and food fortifications in use lead to treatment levels and versions that are all over the place compared to the singular chemical forms and precise doses that RCTs attempt to enforce.

I could go on but the point is Young and Karr called their analysis “informal”, and reported nothing on how they dealt with the aforementioned problems. Apparently clear replicable defensible sampling and analysis protocols and criteria are only for primary researchers. Meta-critics like Young and Karr (and Feinstein before them, a notoriously unreliable reporter as later investigations documented) need not account even for glaring problems with their approach, let alone provide a sound statistical derivation of their claims. I agree some fixing of observational protocols is needed, but statistics needs to set a better example with scientific approaches to the problems, rather than with “informal” treatments worse than the literature under scrutiny.

“Many”? How does “many” imply “all”?
Regardless, the statistical independence implied by a factorial design (assuming the trial was carried out perfectly, an assumption needing scrutiny) is conditional on the study. Unconditionally (meta-analytically, across studies) even results from factorial designs can be correlated by the action of study-related design features such as level and handling of noncompliance, loss, measurement error, and so on. To ignore methodologic and meta-analytic issues like these and pretend the trials were handed down from some perfect-RCT heaven is as naive as the epidemiology Young and Karr criticize.
Noteworthy for its continuing absence in this discussion is the Achilles heel of their claims of “conflict”: no explicit and validated criteria for “conflict” or “agreement” were given in the article. Whatever it was, we can only hope it was not mere difference in whether some P-value was above or below 0.05, a criterion that is immediately discredited by comparing two studies, one with 95% confidence interval from 1.01 to 4.03 and the other with 95% CI from 0.99 to 3.98.
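The two-interval example can be worked through numerically. The sketch below rests on assumptions that are not stated in the comment: that both intervals are for a ratio measure (such as a relative risk), that they are Wald intervals symmetric on the log scale, and that the two studies are independent. Under those assumptions it recovers each study's implied P-value against a ratio of 1, and a P-value for the difference between the two studies.

```python
import math

Z_975 = 1.959964  # standard normal 97.5% quantile used for a 95% interval


def log_estimate_and_se(lower, upper):
    """Recover the log point estimate and its standard error from a 95% CI,
    assuming the interval is symmetric on the log scale."""
    log_lo, log_hi = math.log(lower), math.log(upper)
    return (log_lo + log_hi) / 2.0, (log_hi - log_lo) / (2.0 * Z_975)


def two_sided_p(z):
    """Two-sided normal p-value for a z statistic."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def p_value_from_ci(lower, upper):
    """P-value for the null 'ratio = 1' implied by the interval."""
    est, se = log_estimate_and_se(lower, upper)
    return two_sided_p(est / se)


def p_value_for_difference(ci_a, ci_b):
    """P-value for 'both studies estimate the same ratio' (independence assumed)."""
    est_a, se_a = log_estimate_and_se(*ci_a)
    est_b, se_b = log_estimate_and_se(*ci_b)
    return two_sided_p((est_a - est_b) / math.hypot(se_a, se_b))


if __name__ == "__main__":
    study_a, study_b = (1.01, 4.03), (0.99, 3.98)
    print(f"study A: p = {p_value_from_ci(*study_a):.3f}")
    print(f"study B: p = {p_value_from_ci(*study_b):.3f}")
    print(f"A vs B:  p = {p_value_for_difference(study_a, study_b):.2f}")
```

Under these assumptions the first study comes out at roughly p = 0.047 and the second at roughly p = 0.053, so a 0.05 cutoff would label one "positive" and the other "negative", even though the test of their difference gives a P-value of about 0.97.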
