For the love of graphs

Never lose sight of the innate beauty of a well-designed graph and the full message that it can convey. (Source: Nikada/iStockPhoto)

Although I'm pretty much allergic to finance and business news in the evening bulletins, I do look forward to Alan Kohler's segment on the ABC News because no one deals with graphs on telly better than him. Graphs can be difficult to interpret and are often misleading but they are one of the most effective ways of communicating complex information. Graphs are used and abused in the name of science so I thought I might take a moment to try and emulate Alan and explain some of the most common problems with interpreting scientific graphs.

It was once unkindly said of my PhD thesis that, if you torture the data long enough and hard enough, it will tell you anything (a crude misquotation of economist Ronald Coase). Given that graphs are simplified depictions of complex data sets, the same can be said of interrogating and over-interpreting them.

Here are a couple of beautiful examples that came across my desk recently. I particularly like this one:

What a stunning correlation between the increase in the incidence of autism and the proliferation of organic foods! The statistical fit is excellent but the effect is actually weakening over time: according to this graph it took around $6.9k in sales of organic food per case of autism in 1998, while it took around $8k in sales per case in 2007.

Once again there is a very close statistical fit, this time between the decline in the US murder rate and the decline in market share of Internet Explorer. But here an analysis of effect reveals a dramatic increase in effectiveness over time, with 209 murders per percentage point of market share in 2006 against 348 murders per percentage point in 2011; a whopping 66 per cent productivity increase!
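The tongue-in-cheek 'effect' figures for both graphs are simple percentage changes, and they can be checked in a few lines. This is a minimal sketch that uses only the numbers quoted above; the helper name `percent_change` is made up for illustration and nothing here comes from the original graphs' data.

```python
def percent_change(old, new):
    """Return the change from old to new as a percentage of old."""
    return (new - old) / old * 100


# Organic food sales per autism case, in thousands of dollars (figures quoted above).
organic_1998 = 6.9
organic_2007 = 8.0

# Murders per percentage point of Internet Explorer market share (figures quoted above).
murders_2006 = 209
murders_2011 = 348

organic_change = percent_change(organic_1998, organic_2007)
murder_change = percent_change(murders_2006, murders_2011)

print(f"Organic sales per case: up {organic_change:.1f} per cent")
print(f"Murders per point of market share: up {murder_change:.1f} per cent")
```

The second figure comes out at roughly 66.5 per cent, which agrees with the 66 per cent quoted above.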

Garbage in, garbage out

Rule number one in graph interpretation is that the graph is only as good as the data that goes in: the 'garbage in, garbage out' rule.

In both of the cases above you would have to question the integrity of the original data. In both cases, I don't know where that data has come from, whether it's accurate and reliable, or even whether it actually exists. (While some references are provided in the first graph, I have not checked their veracity, accuracy or even if they are real.) This alone ought to ring alarm bells.

And of course, both of these graphs are meaningless jokes as well as wonderful illustrations of the golden rule that correlation does not necessarily indicate causation. There is no viable mechanism by which the sales of organic foods can cause autism (in fact the causes of autism are poorly understood) and the same can be said of the relationship between the decline in the US murder rate and the market share of Internet Explorer.
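It is easy to see how two unrelated quantities can produce a near-perfect fit simply because both happen to trend in the same direction over the same years. The sketch below is an illustration only: both series are made-up numbers generated independently of each other (they are not the autism, organic food, murder or browser data), and `pearson_r` is a hand-written helper for the standard Pearson correlation coefficient.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the run is repeatable
n_years = 10

# Two synthetic, independently generated series that both drift upwards.
# Neither is computed from the other; the only thing they share is a trend.
series_a = [100 + 12 * i + random.gauss(0, 5) for i in range(n_years)]
series_b = [3000 + 450 * i + random.gauss(0, 200) for i in range(n_years)]

r = pearson_r(series_a, series_b)
print(f"Correlation between two unrelated trending series: r = {r:.3f}")
```

The coefficient comes out close to 1 even though neither series has any influence on the other, which is exactly the trap the two joke graphs are built on.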

What is more revealing is how these and other examples of the 'correlation means causation' fallacy play into our preconceptions and prejudices.

We have evolved as pattern-seeking creatures; our survival has depended on spotting correlations and acting on them. From cycles in the growth and flowering of food plants and the migration of animals through to climatic phenomena and other environmental forces, if we noticed that one event was linked to another, that could directly affect our survival. But the next step of attributing causation between two phenomena is purely a construct within our own heads. Determining causality is an intellectual pursuit; detecting correlation is mostly just observation.

The underlying prejudice in the two examples I've given here is obvious. The autism v organic food sales graph has been doing the rounds of skeptic groups and pro-vaccination lobbyists. They have chosen autism because it is a favourite condition of the anti-vaccination lobbyists and they have chosen organic food sales because the skeptics are skeptical of the more extreme health claims for organic foods while the anti-vaccers tend to uncritically embrace those same claims. Bringing the two together in this way is obviously fallacious and designed to ram home the point that correlations can be misleading, something the anti-vaccination groups seem unable or unwilling to grasp.

Even simpler to explain is the demise of Internet Explorer and the decline of the US murder rate. This graph was obviously created by those tech-savvy people who can detect significant differences between the various web browsers (I can't!) and who regularly express frustration with Internet Explorer, a frustration that could lead to wanting to kill someone (even if only as overblown hyperbole).

These are classic cases of 'if it looks too good to be true, it probably is'. Swallowing the causation indicated by a correlation is all the easier if that causation is already deeply planted in your own innate bias. If it plays to our preconceptions, it's easier to accept and more likely to slide past our bullshit detectors.

Define the cause

Thus the cardinal rule when interpreting graphs must be to explain the physical mechanism linking the variables before looking at the correlation as an illustration of that mechanism. Define the cause before observing the effect.

While global temperature varies above and below the trend, its correlation with the concentration of CO2 in the atmosphere is very strong. And it ought to be. We have known for over 100 years that CO2 molecules can store heat energy and, as a consequence, the more CO2 there is in the atmosphere, the warmer it will be.

We can quantify the amount of energy that CO2 can store and use this to make predictions as to what the temperature ought to be at a given concentration. And, as this experiment has rolled out over the years, the observed temperatures agree with the predicted temperatures as the concentration of CO2 in the atmosphere increases. Correlation and causation are linked here in the most fundamental way.

Let's look at another case:

I found this on a pro-science website showing a very strong correlation between the take up of anti-measles vaccines and the decline of the incidence of measles in England. We know that vaccination against measles works and the mechanism by which it acts. Graphs like this provide the observations of how that relationship has played out in the real world.

Again this is data from Great Britain and mostly we see what we expect. We know smoking is a leading cause of lung cancer (around 80 per cent of cases of lung cancer are attributed to smoking) and so it's no surprise that, as rates of smoking among men dropped from the early 1970s, there was a corresponding but delayed drop in the reported incidence of lung cancer (the disease can occur decades after an individual has given up).

That's a brilliant correlation for the males in this graph but what about the women? While their rates of smoking have always been lower than the men's and they experienced a similar drop in rates from the early 1970s, the incidence of lung cancer actually increased from 1975. This is due to the relatively high rates of women smokers 30 years ago and the two to three-decade lag in contracting cancer. So, while initially it might appear counterintuitive, this increase in female cases of lung cancer is entirely consistent with the underlying causal relationship.
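The effect of a lag like this can be sketched with a toy model. Everything in it is an assumption made for illustration: the smoking curve is synthetic (a straight rise to a peak in 1970, then a straight fall), the lag is fixed at 25 years, and cancer incidence is set equal to the smoking rate 25 years earlier. None of it is the British data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


LAG_YEARS = 25  # assumed lag, within the two to three decades mentioned above
PEAK_YEAR = 1970  # assumed peak of the synthetic smoking curve


def smoking_rate(year):
    """Synthetic smoking rate: rises to a peak in PEAK_YEAR, then falls."""
    return max(0.0, 50.0 - abs(year - PEAK_YEAR))


def cancer_rate(year):
    """Toy model: cancer incidence simply mirrors smoking LAG_YEARS earlier."""
    return smoking_rate(year - LAG_YEARS)


years = list(range(1975, 2001))
smoking_same_year = [smoking_rate(y) for y in years]
smoking_lagged = [smoking_rate(y - LAG_YEARS) for y in years]
cancer = [cancer_rate(y) for y in years]

r_same_year = pearson_r(smoking_same_year, cancer)
r_lagged = pearson_r(smoking_lagged, cancer)

print(f"Smoking vs cancer, same year:   r = {r_same_year:.2f}")
print(f"Smoking {LAG_YEARS} years earlier vs cancer: r = {r_lagged:.2f}")
```

In this toy model the same-year correlation is negative (smoking is falling while cancer is still mostly rising), yet the lagged correlation is perfect by construction. A graph that ignores the lag would appear to contradict a causal link that is actually there.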

So the take-home lesson here is that interpreting graphs can be a fraught exercise.

The apparent ease with which the likes of Alan Kohler can do it belies some basic homework and research that needs to be done behind the scenes. As appealing and simple as a graph may appear, beware the hidden bias in the underlying data.

You need to question both what's behind the graph and what it's actually telling you. But never lose sight of the innate beauty of a well-designed graph and the full message that it can convey.

About the author: Dr Paul Willis is the director of RiAus, Australia's unique national science hub, which showcases the importance of science in everyday life. The well-known palaeontologist and broadcaster previously worked for ABC TV's Catalyst program. This article was first published on the RiAus website.