One quick disclaimer before I proceed. When I have quoted one or more Wikipedia articles in the text, it is because I have found them well-written, informative, and adequately illustrative; however, I shall make no claim as to their veracity and/or authenticity because I have not been able to access and verify all the background references therein. If you find an error, please feel free to chide me in the comments.

An important maxim used in science, or more precisely, in the scientific study of relationships between variables, is that ‘Correlation does not imply Causation’. Indeed, unless such causality has been verifiably established through independent means, any claim that it does commits the logical fallacy of questionable cause, cum hoc, ergo propter hoc (Latin for “with this, therefore because of this”).

It is important for all to understand this concept – those who are engaged in scientific studies, as well as those who read about and interpret such studies.

Correlation is a statistical relationship between two or more random variables; for simplicity’s sake, let’s consider two, say, A and B, such that if changes in the values of variable A statistically correspond to changes in the values of variable B, a correlation is said to exist between A and B. This reflects a statistical dependence of A on B, and vice versa, and therefore, statistically-computed correlations can be used in a predictive manner. To pick a completely random example, the epidermal growth factor receptor (EGFR) is expressed on neoplastic cells in colorectal carcinoma. The number of cells expressing EGFR was found to be correlated with the size of the tumor (adenoma), i.e., cells from a larger tumor express more EGFR. Therefore, EGFR expression may be useful as a prognostic biomarker for adenoma progression.
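To make the idea concrete, here is a minimal sketch of one common way to quantify correlation, the Pearson coefficient, which measures linear association only. The helper name `pearson_r` and the numbers for A and B are made up for illustration; they are not data from any study.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Made-up values for illustration only (not data from any study).
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.1, 3.9, 6.2, 8.1, 9.8]
r = pearson_r(a, b)
print(f"r = {r:.3f}")
```

A value of r close to +1 or -1 says that A and B move together in a nearly linear way; it says nothing about why they do.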

Those who have already identified the problem in this assertion, congratulations! As the paper cautions, although the EGFR pathway is important to colorectal carcinogenesis, it is unknown at this point whether the observed increase in EGFR expression is because neoplastic cells make more EGFR per se for some reason, or because a larger tumor would house numerically more of the cells that are capable of making EGFR. This, as you can understand, is an important distinction, and therefore, the authors conclude correctly that “Further larger studies are needed to explore EGFR expression as a biomarker for adenoma progression.”
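The distinction can be illustrated with a toy simulation. Everything in this sketch is hypothetical: the tumor sizes, the fixed 30% fraction of EGFR-positive cells, and the variable names are assumptions chosen for illustration, not values from the paper. The point is only that a raw count of positive cells can track tumor size closely even when the per-cell tendency to express EGFR never changes.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

rng = random.Random(42)
FRACTION_POSITIVE = 0.3  # hypothetical: identical for every simulated tumor

sizes, counts, fractions = [], [], []
for _tumor in range(200):
    n_cells = rng.randint(200, 2000)  # stand-in for tumor size
    # Each cell is EGFR-positive with the same probability, whatever the size.
    positive = sum(1 for _cell in range(n_cells)
                   if rng.random() < FRACTION_POSITIVE)
    sizes.append(n_cells)
    counts.append(positive)
    fractions.append(positive / n_cells)

r_count = pearson_r(sizes, counts)        # count of positive cells vs. size
r_fraction = pearson_r(sizes, fractions)  # fraction of positive cells vs. size
print(f"count vs. size:    r = {r_count:.2f}")
print(f"fraction vs. size: r = {r_fraction:.2f}")
```

In this toy setup the count correlates strongly with size while the fraction does not, which is the second of the two explanations the authors could not rule out.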

Such examples abound, all illustrating how correlations can be useful in suggesting possible causal or mechanistic relationships between variables, but more importantly, such statistical interdependence between the said variables is not sufficient for logical implication of a causal relationship. In other words, while empirically A may be observed to vary in conjunction with B, that observation is not enough to assume A causes B.

But what happens when one makes such an erroneous assumption? For starters, one is then disregarding four other possibilities, any of which may be true and account for the correlation. The complete set of possibilities, beginning with the assumed one, is:

A may cause B.

B may cause A.

An unknown or uncharacterized third variable C may cause both A and B.

A and B may influence each other, in the presence or absence of C, in a self-reinforcing feedback loop.

A and B may change at the same time in the absence of any direct logical or actual relationship to each other, besides the fact that the changes are occurring at the same time – a situation also known as coincidence. A coincidence may allude to multiple, complex or indirect factors that are unknown or too nebulous to ascribe causality to, or may reflect pure, random chance.

Each of these five hypotheses is testable and there are statistical methods available to reduce the occurrence of coincidences. Therefore, the mere observation that A and B are statistically correlated doesn’t lend itself to any definitive conclusion as to the existence and/or directionality of a causal relationship between them.
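The third possibility, a hidden variable C driving both A and B, is easy to simulate. In the sketch below all data are synthetic, and `partial_r` is a helper name introduced here; it applies the standard first-order partial correlation formula, which controls for C only in a linear sense. A and B never influence each other in this simulation, yet they come out clearly correlated until C is accounted for.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def partial_r(r_ab, r_ac, r_bc):
    """Correlation of A and B after linearly controlling for C."""
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))

rng = random.Random(0)
n = 5000
c = [rng.gauss(0, 1) for _ in range(n)]   # the hidden common cause
a = [ci + rng.gauss(0, 1) for ci in c]    # A depends on C plus its own noise
b = [ci + rng.gauss(0, 1) for ci in c]    # B depends on C plus its own noise

r_ab = pearson_r(a, b)
r_ab_given_c = partial_r(r_ab, pearson_r(a, c), pearson_r(b, c))
print(f"raw correlation of A and B:      {r_ab:.2f}")
print(f"correlation after controlling C: {r_ab_given_c:.2f}")
```

The catch, of course, is that this only works when C has been identified and measured; an unknown C cannot be controlled for this way.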

Determination of causality is an entirely different ball of wax, and that discussion is beyond the scope of this post. Suffice it to say that in the sciences, causality is not assumed or given. The scientific method requires that the scientists set up empirical experiments to determine causality in a relationship under investigation.

The scientific method works in logical progression.

Initial observations (of a putative relationship between variables) are made.

An explanation is proposed in the form of one or several hypotheses about possible causal relationships, including one of no relationship (the Null hypothesis).

Certain predictions or models may be generated on the basis of each of the hypotheses, which in turn guide the experimental design.

Experiments are designed to demonstrate the falsifiability of the hypotheses, i.e., to test the logical possibility that the hypotheses could be proven false by a particular empirical observation. Indeed, testing for falsifiability or refutability is a key part of the scientific process.

Once designed, the experiments are used to test the hypotheses rigorously, and the data are analyzed critically to reach a conclusion, accepting or rejecting the hypotheses.
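As one small illustration of testing against the Null hypothesis, here is a sketch of a permutation test for "no association between A and B". The data are synthetic, and the function name `permutation_p_value` and its parameters are assumptions made for this example. Note what the test can and cannot do: a small p-value argues against coincidence, but it does not say whether A causes B, B causes A, or something else drives both.

```python
import math
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def permutation_p_value(xs, ys, n_perm=2000, seed=0):
    """Two-sided p-value for the null hypothesis of no association.

    Shuffling ys breaks any real pairing with xs, so the shuffled
    correlations show what chance alone would produce.
    """
    rng = random.Random(seed)
    observed = abs(pearson_r(xs, ys))
    shuffled = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(pearson_r(xs, shuffled)) >= observed:
            hits += 1
    # Adding 1 to both counts keeps the estimate away from exactly zero.
    return (hits + 1) / (n_perm + 1)

# Synthetic data in which B really does depend on A.
rng = random.Random(1)
a = [rng.gauss(0, 1) for _ in range(100)]
b = [x + rng.gauss(0, 1) for x in a]

p = permutation_p_value(a, b)
print(f"two-sided permutation p-value: {p:.4f}")
```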

But the method doesn’t cease there. All empirical observations are potentially under continued scrutiny, which involves reconsideration of the derived results, as well as re-examination of the methodology, especially in the light of newer techniques that are capable of taking deeper and more accurate measurements. Such is the dynamic nature of the scientific method.

Establishment of causality, therefore, has to pass through the same rigorous filters before it can be accepted. But if it does, the conclusions may be considered unimpeachably valid, within the given set of circumstances.

So… Correlation doesn’t inherently imply causation.

Some modern examples are in Part Deux. Please don’t hesitate to comment.

5 Comments

I think you’re being a bit too strict with inferring causality. Epidemiologists are often happy to assert causation just from their studies. They’re aware of the problems, but can often use prior arguments to determine causality. For example, we’re happy to infer that passive smoking causes cancer because we know enough about the mechanisms of smoking and cancer.

And nobody ever did a randomised experiment to see if smoking causes cancer in humans.

@Bob: you raise a very good point. I was going to touch upon it in part deux, but let me put in a few words. I think we do need to be strict in interpreting causality, because it is specifically this modality – mere correlation erroneously interpreted as causation – that provides pabulum to the purveyors of pseudoscience in their everlasting quest for scientific legitimacy, as you are well aware.

I also think – pardon me – it is a bit facile to make a statement like "Epidemiologists are often happy to assert causation just from their studies" (that is, if you meant it seriously). The example that you’ve mentioned is particularly important, because precisely for that reason – the lack of direct empirical data linking smoking to cancer – tobacco regulation, in many countries, not just the US, has not progressed beyond a taxation model and use of statutory warnings, on the cartons or packets, that lack authority and are largely ignored by current and would-be smokers.

Also, I am not sure that "we’re happy to infer that passive smoking causes cancer because we know enough about the mechanisms of smoking and cancer" paints the complete picture. The epidemiologists in this particular case have looked at data from multiple sources before linking smoking to cancer, not the least of which is the empirical observation of the state of the lungs of long-term smokers, using biopsies as well as post-mortem dissections. Individual chemicals present in fresh and burnt tobacco, as well as substances in cigarette smoke (such as carbon monoxide), have been analyzed and checked for their carcinogenic potential and other physiologic effects. Therefore, the inference that ‘Smoking causes cancer’ has a wealth of evidence on its side, way beyond a mere Correlation = Causation scenario.

Nice introduction, Kausik. The correlation-causation conundrum is one reason I’m so happy using mathematical models. A well designed and analysed model will not rely on correlation (many models do), but can uncover all causal relationships. After all, the modeller is the one who puts them there in the first place.

As an aside, I think it was Steve Novella who recently pointed out that (to paraphrase) "Correlation generally does imply causation. However, it most certainly does not always mean causation."

The use of the word imply here is strictly in its mathematical sense: a logical relation between propositions p and q (p implies q) of the form "if p then q"; if p is true then q cannot be false, written as p → q.