Google Correlate does not imply Google Causation

I need something sexy, something to lure new readers to this new blog and get them excited. So let’s talk about statistical correlations. No, wait, failed statistical correlations!

Google Correlate is a nifty new Google product that takes data sets and finds search terms that correlate with them. For example, if you set it to “correlate over time” and enter a data set of average US temperature, it might return the search term “skiing”, because people are most likely to ski when it’s cold and so searches for skiing will be correlated with temperature. You can also just enter in Google search terms and see what other search terms they’re correlated with.

The results seem to fall into two categories: obvious and nonsensical.

Ones with clear time patterns are obvious. If you enter in skiing, you’ll get “how to ski”, “buy skis”, “snowboarding”, “ski resorts”, and the like. If you enter in a news trend that was only popular at one point, you’ll get both related terms and other news trends only popular at that one point – for example, “school shooting” brings up “jan berenstain”, not because the Berenstain Bears books secretly cause school shootings (…one hopes) but because she died the same week as a relatively big one and so people were searching for both around the same time.

Things that don’t have obvious time patterns seem to bring up results that are both nonsensical and very, very convincing-looking. The worst are diseases.

This is Google Correlate’s result for heart attack. It matches it to “pink lace dress” with a correlation of .88 (for comparison, a study comparing cigarette use vs. lung cancer rates across different social groups found a correlation of .71).

Figure 1: Correlation between interest in heart attacks and in pink lace dresses, by time.

As far as I can tell, this is just an artifact of Google having lots and lots of search terms: with that many candidates, you would expect some of them to be heavily correlated by mere coincidence.
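To get a feel for how far coincidence alone can go, here is a toy simulation. It is only a sketch and says nothing about how Google Correlate works internally: the "target" and the "search terms" are pure random noise, and the series length (20 points), the number of candidates (5,000), and the function names are all assumptions picked for illustration.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def best_spurious_correlation(n_points=20, n_candidates=5000, seed=0):
    """Highest correlation between one random target and many random candidates.

    Every series is independent noise, so any correlation found is coincidence.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_points)]
    best = -1.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_points)]
        best = max(best, pearson(target, candidate))
    return best


one_candidate = best_spurious_correlation(n_candidates=1)
many_candidates = best_spurious_correlation(n_candidates=5000)
print(f"best r with 1 candidate:     {one_candidate:.2f}")
print(f"best r with 5000 candidates: {many_candidates:.2f}")
```

The more candidates you let the search pick from, the better the best match looks, even though nothing here is related to anything else. Real search series also share trends and seasonality, which this noise-only sketch leaves out.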

Google also has a correlate-by-state feature. This one has even weirder results for heart attack, like “can you get a” and “is it a” (note that these are the entire search terms). I understand that “is it a heart attack” is a reasonable question, but I don’t understand who would just enter that phrase into Google and hope it would figure it out. I’m kind of imagining someone having a heart attack going on Google, typing as far as “is it a…” and then falling over dead, but I assume the real explanation is more prosaic, like someone expecting autocomplete to work but being disappointed.

Google’s state-by-state feature seemed potentially really exciting to me. I wrote a while back on the effect of parasite load, and I had the dataset lying around with different states ranked on different metrics. I entered the data for parasite load and got the following search terms: “Toy Johnson”, “Bernie Mac”, “booty models”, “Harvey suits”, “Beyonce clothing line”.

Figure 2: Correlation between parasite prevalence and interest in booty models, by state.

I didn’t actually know what most of these were (I kinda thought Bernie Mac was a real estate conglomerate, which turns out to be false) but upon closer investigation they are all black people or Stuff Black People Like. So I think what’s happening here is that the high-parasite-load states are all in the South and relatively poor with low access to health care, which also selects for black people. This obviously has significant implications for the study’s attempt to determine that high parasite load causes certain social trends.

My next thought was “if I multiply this data set by negative one, I will have an objective pipeline to figuring out Stuff White People Like. That sounds interesting.” So I tried it, and my results were: “black albino”, “shake that eminem”, “tony hawk pro skater”, and “green day time of your life”. I was sort of hoping that “Black Albino” was the name of a band or something (it would actually be a pretty good one) but no, it turns out white people are just fascinated with the idea of black albinos. White people are kind of weird.
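The negative-one trick works because negating a data set flips the sign of every correlation, so whatever sat at the bottom of the ranking moves to the top. A minimal sketch, with made-up numbers and placeholder term names (`term_a` and so on are hypothetical, not real search terms or real state data):

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_terms(dataset, term_series):
    """Return term names sorted from most to least correlated with dataset."""
    return sorted(
        term_series,
        key=lambda name: pearson(dataset, term_series[name]),
        reverse=True,
    )


# Hypothetical per-state scores and hypothetical search-interest series.
dataset = [1.0, 2.0, 3.0, 4.0, 5.0]
terms = {
    "term_a": [2.0, 4.1, 6.2, 7.9, 10.0],  # rises with the dataset
    "term_b": [5.0, 3.0, 4.0, 2.0, 1.0],   # mostly falls as the dataset rises
    "term_c": [3.0, 1.0, 4.0, 1.0, 5.0],   # only loosely related
}

negated = [-value for value in dataset]

original_ranking = rank_terms(dataset, terms)
flipped_ranking = rank_terms(negated, terms)
print(original_ranking)
print(flipped_ranking)
```

With no ties, the flipped ranking is exactly the original ranking reversed.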

Figure 3: A black albino. Happy now, white people?

But let’s keep going through the state-by-state data set. My next Big Social Statistic was “importance of family ties, by state”. States with higher family ties were more likely to search for: “how to swim”, “composition book”, “noni juice”, “muscle men”, “girl kiss”, “Toyota Tacoma 2008”.

Figure 4: Correlation between strength of family ties and interest in swimming, by state.

A lot of these seem related to physical fitness, or ruggedness (the Tacoma seems to be a very sporty, rugged car), or masculinity. I’m not really sure what to make of this.

The last Social Science Statistic in the dataset was Religiosity, which correlated with the following search terms: “Christmas themes”, “rotary cutter”, “Honda rebel 250”. Christmas themes seems sort of plausible. I dunno about the rest.

So as far as I can tell Google Correlate is not very interesting. It doesn’t reveal any deep connections between concepts, or even guess what concept my dataset came from to begin with. For something potentially so powerful this is disappointing.

I can think of two possible uses for it. The first is as a sanity check to make sure your data aren’t completely confounded. If you think you’re measuring average number of roof tiles per house or something, and your data’s Google Correlate results come back with Toy Johnson and Beyonce clothing, you’re probably just measuring race and for some reason different races have different numbers of roof tiles on their houses. Which means if you think you’ve found a correlation between roof tiles and something fascinating like voting record, you’re probably just being confounded by race. This is a real problem in a lot of studies.
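The roof-tile scenario is easy to simulate. The sketch below uses entirely synthetic data: a hidden factor drives both a fake "roof tiles" variable and a fake "voting record" variable, which have no direct link to each other. The variable names, noise levels, and sample size are assumptions for illustration, and partial correlation is just one standard way of controlling for a third variable, swapped in here as an example.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(xs, ys, zs):
    """Correlation of xs and ys after controlling for zs."""
    r_xy = pearson(xs, ys)
    r_xz = pearson(xs, zs)
    r_yz = pearson(ys, zs)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


rng = random.Random(1)

# Synthetic data: the hidden factor influences both measurements,
# but roof tiles and voting record never influence each other.
hidden = [rng.gauss(0, 1) for _ in range(2000)]
roof_tiles = [h + rng.gauss(0, 0.5) for h in hidden]
voting_record = [h + rng.gauss(0, 0.5) for h in hidden]

raw_r = pearson(roof_tiles, voting_record)
adjusted_r = partial_correlation(roof_tiles, voting_record, hidden)
print(f"raw correlation:                    {raw_r:.2f}")
print(f"after controlling for hidden factor: {adjusted_r:.2f}")
```

The raw correlation comes out strong, and it collapses toward zero once the hidden factor is accounted for. The catch in practice is that you have to know what the hidden factor is before you can control for it, which is where a sanity check like the one described above earns its keep.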

The second is as a cheap hack for creating datasets. I entered “Jesus” in and got a state-by-state list of who searched for Jesus. It looked a lot like my state-by-state map of religiosity. The correlates were all things like “Apostle”, “Paul”, “preaching”, and for some reason “Abednego”, who is a very minor Biblical character who has no business being in the top ten correlates of Jesus at all. If you wanted to make a cheap map of state-religiosity in order to correlate to parasite load or whatever, Google Trends seems like a plausible method.

On the other hand, I tried to see if I could recreate the study’s state map of parasite load. I asked it to correlate “metronidazole”, a medication commonly used in the treatment of parasitic diseases, on the grounds that people with parasites would be prescribed metronidazole and then look it up to see if it was safe. The result looked only a little like my map of state-by-state parasite data, and the number one correlated search term (r = .89) was “Is Lil’ Wayne gay?”

> So if nothing else, this exercise has proven my suspicion that the sort of people who worry about whether Lil’ Wayne is gay are, in fact, crawling with parasites.

With this and “It is quite uncontroversial among historians that Lincoln attempted to summon the dead,” I’m quite enthralled by your abilities at obviously joking statements which are quite literally true.
