Category: spurious correlations

Last week saw the release of the latest Roberts & Winters collaboration (with guest star Keith Chen). The paper, Future Tense and Economic Decisions: Controlling for Cultural Evolution, builds upon Chen’s previous work by controlling for historical relationships between cultures. As Sean pointed out in his excellent overview, the analysis was extremely complicated, taking over two years to complete, and the results were something of a mixed bag, even if our headline conclusion suggested that the relationship between future tense (FTR) and saving money is spurious. What I want to briefly discuss here is one of the many findings buried in this paper — that the relationship could be a result of a small-number bias.

One cool aspect about the World Values Survey (WVS) is that it contains successive waves of data (Wave 3: 1995-98; Wave 4: 1999-2004; Wave 5: 2005-09; Wave 6: 2010-14). This allows us to test the hypothesis that FTR is a predictor of savings behaviour and not just an artefact of the structural properties of the dataset. What do I mean by this? Basically, independent datasets sometimes look good together: they produce patterns that line up neatly and yield a strong effect. One possible explanation for this pattern is that there is a real causal relationship (x influences y). Another possibility is that these patterns aligned by chance and what we’re dealing with is a small-number bias: the tendency for small datasets to initially show a strong relationship that disappears with larger, more representative samples.
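To get an intuition for how easily this happens, here is a toy simulation (not from the paper, just an illustration): two variables with no causal link at all will still produce some impressively strong correlations if you look at enough small samples, while a large sample sits near the true value of zero.

```python
import random
import statistics

random.seed(42)

def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Two truly independent variables: any observed correlation is noise.
population_x = [random.gauss(0, 1) for _ in range(100_000)]
population_y = [random.gauss(0, 1) for _ in range(100_000)]

# The strongest |r| seen across many small samples is typically large,
# while a single big sample stays close to the true correlation of zero.
small_rs = []
for _ in range(200):
    idx = random.sample(range(100_000), 20)
    small_rs.append(pearson_r([population_x[i] for i in idx],
                              [population_y[i] for i in idx]))

print("max |r| over 200 samples of n=20:",
      round(max(abs(r) for r in small_rs), 2))
print("|r| for one sample of n=10000:",
      round(abs(pearson_r(population_x[:10_000], population_y[:10_000])), 3))
```

The point is not that small samples always lie, but that a strong effect in a small sample is weak evidence on its own: you need to see whether it survives as the data grows.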

Since Chen’s original study, which only had access to Waves 3-5 (1995-2009), the WVS has added Wave 6, giving us an additional 5 years to see if the initial finding holds up to scrutiny. If the finding is a result of the small-number bias, then we should expect FTR to produce stronger effects with smaller sub-samples of data, the initial effect being washed out as more data is added. We can also compare the effect of FTR with that of unemployment and see if there are any differences in how these two variables react to more data being added. Unemployment is particularly useful because we’ve already got a clear causal story regarding its effect on savings behaviour: unemployed individuals are less likely to save than someone who is employed, as the latter will simply have a greater capacity to set aside money for savings (of course, employment could also be a proxy for other factors, such as educational background and a decreased likelihood to engage in risky behaviour etc.).

What did we find? Well, when looking at the coefficients from the mixed effect models, the estimated FTR coefficient is stronger with smaller sub-samples of data (FTR coefficients for Wave 3 = 0.57; Waves 3-4 = 0.72; Waves 3-5 = 0.41; Waves 3-6 = 0.26). As the graphs below show, when more data is added over the years, a fuller sample is achieved and the statistical effect weakens. In particular, the FTR coefficient is at its weakest when all the currently available data is used. By comparison, the coefficient for employment status is weaker with smaller sub-samples of data (employment coefficient for Wave 3 = 0.41; Waves 3-4 = 0.54; Waves 3-5 = 0.60; Waves 3-6 = 0.61). That is, employment status does not appear to exhibit a small-number bias, and as the sample size increases we can be increasingly confident that employment status has an effect on savings behaviour.
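One quick way to see the opposite trajectories of the two predictors is to lay the coefficients quoted above side by side (the values come straight from the text; the little script is just a convenience for tabulating them):

```python
# Coefficients quoted above, one entry per cumulative sub-sample of waves.
waves = ["3", "3-4", "3-5", "3-6"]
ftr = [0.57, 0.72, 0.41, 0.26]
employment = [0.41, 0.54, 0.60, 0.61]

print(f"{'Waves':>6} {'FTR':>6} {'Employment':>11}")
for w, f, e in zip(waves, ftr, employment):
    print(f"{w:>6} {f:>6.2f} {e:>11.2f}")

# FTR peaks early and then shrinks as data accumulates;
# employment climbs and then stabilises.
print("FTR change, first vs full sample:", round(ftr[-1] - ftr[0], 2))
print("Employment change, first vs full sample:",
      round(employment[-1] - employment[0], 2))
```

The FTR coefficient ends up less than half its starting value once all waves are in, while the employment coefficient grows and then settles, which is what you would expect of a genuine effect.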

So it looks like the relationship between savings behaviour and FTR is an artefact of the small-number bias. But it could be the case that FTR does have a real effect, albeit a weaker one — we’ve just got a better resolution for variables like unemployment, and these are dampening the effect of FTR. All we can conclude for now is that the latest set of results suggests a much weaker effect of FTR on savings behaviour. When coupled with the findings of the mixed effect model — that FTR is not a significant predictor of savings behaviour — it strongly suggests this is a spurious finding. It’ll be interesting to see how these results hold up when Wave 7 is released.

This week our paper on future tense and saving money is published (Roberts, Winters & Chen, 2015). In this paper we test a previous claim by Keith Chen about whether the language people speak influences their economic decisions (see Chen’s TED talk here or paper). We find that at least part of the previous study’s claims are not robust to controlling for historical relationships between cultures. We suggest that large-scale cross-cultural patterns should always take cultural history into account.

Does language influence the way we think?

There is a longstanding debate about whether the constraints of the languages we speak influence the way we behave. In 2012, Keith Chen discovered a correlation between the way a language allows people to talk about future events and their economic decisions: speakers of languages which make an obligatory grammatical distinction between the present and the future are less likely to save money.

James and I have written about Galton’s problem in large datasets. Because two modern languages can have a common ancestor, the traits that they exhibit aren’t independent observations. This can lead to spurious correlations: patterns in the data that are statistical artefacts rather than indications of causal links between traits.

However, I’ve often felt like we haven’t articulated the general concept very well. For an upcoming paper, we created some diagrams that try to present the problem in its simplest form.

Spurious correlations can be caused by cultural inheritance

Above is an illustration of how cultural inheritance can lead to spurious correlations. At the top are three independent historical cultures, each of which has a bundle of various traits which are represented as coloured shapes. Each trait is causally independent of the others. On the right is a contingency table for the colours of triangles and squares. There is no particular relationship between the colour of triangles and the colour of squares. However, over time these cultures split into new cultures. Along the bottom of the graph are the currently observable cultures. We now see a pattern has emerged in the raw numbers (pink triangles occur with orange squares, and blue triangles occur with red squares). The mechanism that brought about this pattern is simply that the traits are inherited together, with some combinations replicating more often than others: there is no causal mechanism whereby pink triangles are more likely to cause orange squares.
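The scenario in the diagram can be sketched as a toy simulation (the colours and lineage sizes are illustrative, not taken from any real dataset): each ancestral culture picks its triangle and square colours independently, descendants copy the whole bundle, and the contingency table over descendants ends up looking like an association.

```python
import random
from collections import Counter

random.seed(1)

# Three independent ancestral cultures. Each picks a triangle colour and
# a square colour independently, so there is no causal link between them.
triangle_colours = ["pink", "blue"]
square_colours = ["orange", "red"]
ancestors = [(random.choice(triangle_colours), random.choice(square_colours))
             for _ in range(3)]

# Each ancestor splits into descendant cultures, copying its whole trait
# bundle; some lineages happen to replicate more often than others.
descendants = []
for bundle in ancestors:
    for _ in range(random.randint(1, 6)):
        descendants.append(bundle)

# A contingency table over the descendants shows colour combinations
# clustering together: they were inherited jointly, not causally linked.
for (tri, sq), n in sorted(Counter(descendants).items()):
    print(f"{tri} triangle + {sq} square: {n}")
```

However many descendants you count, there are really only three independent observations, which is exactly why treating modern cultures as independent data points inflates the apparent evidence.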

Spurious correlations can be caused by borrowing

Above is an illustration of how borrowing (or areal effects or horizontal cultural inheritance) can lead to spurious correlations. Three cultures (left to right) evolve over time (top to bottom). Each culture has a bundle of various traits which are represented as coloured shapes. Each trait is causally independent of the others. On the right is a count of the number of cultures with both blue triangles and red squares. In the top generation, only one out of three cultures have both. Over some period of time, the blue triangle is borrowed from the culture on the left to the culture in the middle, and then from the culture in the middle to the culture on the right. By the end, all languages have blue triangles and red squares. The mechanism that brought about this pattern is simply that one trait spread through the population: there is no causal mechanism whereby blue triangles are more likely to cause red squares.
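The borrowing scenario can also be written down as a minimal sketch (culture list and trait names are purely illustrative): the blue triangle diffuses neighbour to neighbour, the red squares never change, and by the end every culture shows the "correlated" combination.

```python
# Three neighbouring cultures (left to right); only the leftmost starts
# with a blue triangle, but all three already have red squares.
cultures = [
    {"triangle": "blue",  "square": "red"},
    {"triangle": "pink",  "square": "red"},
    {"triangle": "green", "square": "red"},
]

def blue_and_red(cs):
    """Count cultures that have both a blue triangle and a red square."""
    return sum(c["triangle"] == "blue" and c["square"] == "red" for c in cs)

print("before borrowing:", blue_and_red(cultures))  # before borrowing: 1

# The blue triangle is borrowed neighbour to neighbour, one step per
# generation, while the squares are never touched.
for i in range(1, len(cultures)):
    cultures[i]["triangle"] = cultures[i - 1]["triangle"]

print("after borrowing:", blue_and_red(cultures))   # after borrowing: 3
```

Nothing about red squares made blue triangles more likely; one trait simply spread through a population that already shared the other.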

A similar effect would be caused by a bundle of causally unrelated features being borrowed, as shown below.

Everett, Blasi & Roberts (2015) review the literature on how inhaling dry air affects phonation, suggesting that lexical tone is harder to produce and perceive in dry environments. This leads to a prediction that languages should adapt to this pressure, so that lexical tone should not be found in dry climates, and the paper presents statistical evidence in favour of this prediction.

I’m working with the Language in Interaction project to create an app game about linguistic diversity. It’s a game where you listen to several recordings of people talking and have to match the ones who are speaking the same language. It’s quite a lot like the Great Language Game, but we’re using many lesser-known languages from the DOBES archive.

But first – we need a name. Help us create one with the power of Iterated Learning!

Given this blog’s link with Chen’s study (see Sean’s RT posts here and here), and that Sean and I recently had our own paper published on the topic of these correlational studies, I thought I’d share some of my own thoughts regarding this video. First up, the video provides some excellent animation, and it does a reasonable job of distilling the core argument of Chen’s paper. However, I do have some concerns, namely the conclusion presented in the video that “even seemingly insignificant features of our language can have a massive impact on our health, our national prosperity and the very way we live and die”.

This is stated far too strongly. After all, the study is only correlational in nature, and there are no experiments supporting this claim. Also, the video makes no mention of the various critiques that have popped up around the web by professional linguists, such as this excellent post by Osten Dahl. Of course, we could hand-wave away these critiques, and argue it’s just a fun video. But I worry these popular renditions often lend significant media weight to dubious and unsubstantiated claims, with the potential to influence social policy. Still, we can’t completely blame the video. There’s somewhat of an academic smokescreen at work in the way Chen writes up the paper — it reads as if he had a particular hypothesis, and then tested this using an available dataset. I’m not 100% sure this is the whole story. I wouldn’t be too surprised to hear the initial finding was discovered, rather than actively sought out in a strict hypothesis-testing sense. This is all conjecture on my part, and I could be completely wrong here, but it does seem like Chen was fishing for correlations: you throw out your line into a large sea of data, find a particularly strong association, and then proceed to attach a hypothesis to it. Such practices are exactly the type of problem Sean and I were warning against in our paper. And as Geoff Pullum pointed out: Chen’s causal intuition could easily have been reversed and presented in an equally compelling fashion. It just happened to be the case that the correlation fell in one particular direction.

Besides the numerous theoretical and methodological critiques of the paper, the simple fact of the matter is that Chen’s work is being presented as if it’s demonstrated a causal relation. Let’s be clear about this: he hasn’t even got close to making that point. All he’s found is a strong correlation. So far, the best we can say is that we’re at the hypothesis-generating stage, with the general hypothesis being that differences in grammatical marking of the future influence future-oriented behaviours. Now, if we are to test this hypothesis, then experimental work is going to be needed. I doubt this will be too difficult to do given the large literature on delayed gratification. One useful approach might be found in the Stanford Marshmallow Experiment.

In an experiment like that, you could control for a whole host of factors, whilst seeing if delayed gratification varied according to the language of particular groups. Surely Chen would expect there to be differences between those populations with strong-FTR languages and those with weak-FTR languages? Also, I wouldn’t be too surprised if we discovered that marshmallow consumption is linked to a propensity to save as well as road traffic accidents, acacia trees and campfires. In short: marshmallows are the social science equivalent of the Higgs Boson. They’ll unify everything.