October 16, 2009

Correlation

One of the “inspirational thoughts” on my opening page is the observation by the late Stephen Jay Gould that

“The invalid assumption that correlation implies cause is probably among the two or three most serious and common errors of human reasoning.”

It’s very easy to equate correlation with causation and take inappropriate action as a result – it’s an example of faulty thinking that I see fairly frequently on forums such as OTN or the Oracle newsgroups.

If you want to get an insight into the difference between correlation and causation, you ought to read Robyn Sands’ note on “Nonsense Correlation”.

Related

I suspect that by now many people know that correlation does not imply cause. In fact, I keep hearing “Correlation does not imply cause” even when it does not apply.

Correlation does imply cause when the correlation was verified after modifying only the suspected cause in a well controlled and well designed experiment.

If you divide sick people randomly into two groups, give one group a medicine and the other a placebo, and the group that received the medicine gets better while the placebo group does not – it is fairly reasonable to assume that the correlation between the medicine and the symptoms does imply that the medicine is what caused the health condition to improve.

One can say that the intention of the scientific method is to verify causes of correlations.

It’s possible that many people are intellectually aware that “correlation does not imply cause” – but I suspect that there is a big difference between recognising the correctness of the statement and automatically applying it.

Correlation between effects may lead you to an attempt to establish cause, but to establish cause you need:

A plausible hypothesis
An absence of an alternative plausible hypothesis
Predictability

Technically it’s not the correlation between the medicine and the symptoms that implies the medicine is the cause – it’s the match between the prediction and the actual events, combined with the absence of an alternative explanation.

I would prefer to rephrase your closing comment to say that the intention of the scientific method is to ensure that incorrect hypotheses about causes of correlation are ultimately falsified.

(I’m not quibbling about the difference between “verify” and “falsify” here, by the way – in day to day terms something has been verified when all sensible attempts to falsify it have failed – the bit I am trying to emphasise is the intent to eliminate error.)

One of the most annoying tricks that pharmaceutical companies often pull is to come up with a hypothesis that says “Medicine X will lower heart rates”, run an experiment, and find out that medicine X does not lower the heart rate. But since they measure many things, they notice that cholesterol levels went down. So they publish a paper saying “Medicine X lowers cholesterol levels.” This is not a valid result of the experiment. They would need a second experiment to test just this hypothesis.

When you measure 20 different variables, there is a high probability that one of them will change significantly after your intervention (but not as a result of the intervention!) just by chance.
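The arithmetic behind this point is easy to check. Here is a minimal Python sketch (the 20 variables and the conventional 5% significance threshold are the only inputs; everything else is illustrative): with 20 independent null measurements, the chance that at least one looks “significant” purely by chance is 1 − 0.95²⁰, roughly 64%.

```python
import random

random.seed(42)
ALPHA = 0.05        # per-variable false-positive rate (5% significance level)
N_VARS = 20         # variables measured in each experiment
N_TRIALS = 10_000   # simulated experiments where the intervention does nothing

# Analytic probability that at least one of 20 independent null variables
# crosses the significance threshold purely by chance:
p_at_least_one = 1 - (1 - ALPHA) ** N_VARS
print(f"analytic:  {p_at_least_one:.3f}")   # about 0.642

# Monte Carlo check: under the null, each variable's p-value is uniform
# on [0, 1], so "significant" just means a uniform draw below ALPHA.
hits = sum(
    any(random.random() < ALPHA for _ in range(N_VARS))
    for _ in range(N_TRIALS)
)
print(f"simulated: {hits / N_TRIALS:.3f}")
```

So even when the intervention does nothing at all, nearly two experiments in three will hand you a publishable-looking “effect” somewhere among the 20 measurements.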

Statspack is misleading in this way, because it shows too many data points. Surely many of them will be different after I change a parameter or a query. The trick is to know what measurement will differ and how, and to know it in advance.

If act “A” results in state “B” 90% of the time, it is highly interesting, and you want to know the cause.

Robyn’s example is excellent. There is a strong correlation between heart disease and dental health. However, until we know the cause, we do not know if brushing your teeth will prevent heart disease. Maybe eating less pork is good for the teeth? Or maybe good genetics is everything.

If you don’t look into the cause of a correlation, you end up with silly advice like “Don’t use functions in the where clause because it causes bad performance”. If you know the cause, you can say something useful: “If you use functions in the where clause, you may end up not using an index that you want to use. You can solve the problem by modifying the query or by using a function-based index.”

Too much correlation (and data mining!) is in my opinion one of the biggest reasons that US healthcare is where it is. Pharmaceutical companies mine lots of data, come up with a silly correlation, publish a paper, use it for marketing and doctors then prescribe the medicine for conditions it is not actually effective for – resulting in big healthcare expenses, rich companies, and unhealthy Americans.

Prozac is a good example – lots of people pay for Prozac (or our insurance does) believing that it is effective against mild depression. There is no proof that it is effective. The correlation can be attributed to random chance, or to placebo effect. So much money is thrown away due to misleading research and marketing!

The reason I’m bothering to write all this on a Saturday is not because database performance is so important, it’s because health is important. If we all understood science a bit better, maybe it would be more difficult to get us to spend our money on crap medicines instead of things that work. Database research is often easier for us to read than medical research, so it’s a good place to start practicing scientific skills :)

It may have been Ben Goldacre ( http://www.badscience.net ) who once pointed out that one of the UK tabloids was clearly intent on dividing everything into one of two classes: things that caused cancer and things that cured cancer. His book is a very interesting, and sometimes appalling, read.

It is extraordinary how even the more respectable papers and news programs will produce headlines and soundbites that are clearly idiotic cherry-picking, compression, and hyping of cautiously stated results from careful investigations.

Actually, Wikipedia has a pretty convincing explanation of why it works. It seems to be a nicotine replacement – nicotine attaches to specific brain receptors and acts as an inhibitor, and Zyban does exactly the same – therefore people who take Zyban no longer need nicotine.

That’s pretty neat.

The research probably went like this:
1) Findings were reported from doctors and patients about this great side-effect of Wellbutrin.
2) Drug company verified findings in a controlled trial.
3) Researchers found the reason it works.

Note that step 3 is optional, but step 2 is mandatory for turning an interesting idea into a medicine.
Controlled experiments are what turn a correlation into a cause, by eliminating other possible explanations such as chance, the fact that non-depressed people find it easier to quit smoking, etc.

Not really the right way of thinking about it. The purpose of the trials is to eliminate errors and ensure that the step from “simple correlation” to “probable cause” is justified. (And to check for threatening side-effects, of course.)

Your use of the phrase “(no causation)” is also not entirely appropriate – it is reasonable to recognise causation without understanding (or having a complete understanding of) mechanism. In many cases understanding of mechanism comes later – and results in refinement of treatment or predictions of side effects that need to be addressed.

But knowing how a drug works makes for an infinitely more effective trial, especially when checking for negative side-effects …

For example, knowing that paracetamol is metabolized by the liver, and that the real drug that works on the brain is a metabolite of it, calls for special investigations about possible liver damage.

And since knowing how a drug works improves the effectiveness of the controlled trial so much – investigating how it works, or at least trying to, is mandatory as well, in my opinion.

Anyway, out of metaphors – Oracle is a “bit” less complex than a human (or even an alga). You don’t need a PhD and a full research team to conduct controlled trials … just average analytical skills, sqlplus and a lot of sweating.

Notice that the drug was first approved by the FDA in December 1985, withdrawn in 1986, and re-introduced in 1989; but it wasn’t until 1997 – more than 11 years after the first release, and 8 years after the re-release – that it was approved as an aid to quit smoking and sold as Zyban.

There’s probably a few years of testing in between – and it’s quite likely that most of that testing would be looking for harmful side-effects.

Then – in 2006 – the drug was also approved as a treatment for SAD (seasonal affective disorder).

It would be interesting to know how much of the time lag went into:

a) initial observation of the correlation
b) testing for causality
c) testing for side effects
d) research into why it works.

Amusingly, when attempting to find Tom Kyte’s excellent brief “In search of the truth. Or Correlation is not Causation” to link to from this entry (my old link no longer works since the upgrade of AskTom), it turns out it forms part of an interesting debate between yourself, John, and Tom versus Don & Mike Ault – of course you would know that!