Google-Wide Association Studies

April 26th, 2016, 1:00pm by Sam Wang

I will comment on the East Coast primaries at the end of the post. First I will write about something more interesting: Google Correlate!

>>>

In human genetics there is a form of analysis called a genome-wide association study (“GWAS”). In this kind of analysis, the researcher looks for bits of DNA that show up more often in people with some trait or disease. Motivations for doing this kind of study include (a) finding genetic variations that contribute to a condition, so they can be studied; and (b) providing a way of estimating the chance that a condition will occur. However, GWAS is full of challenges. One of my research interests is autism. Autism is strongly driven by combinations of genes, yet GWAS has only succeeded in identifying a small fraction of the risk. Many of these bits of DNA have all kinds of other effects (this is a project in my lab…and hey, I’m recruiting!).

The Google Correlate method for political prediction is analogous to GWAS…but better! In this analogy, Google search terms are the “genes.” Thousands (maybe millions) of Google search terms are statistically associated with the frequency at which a state votes for Donald Trump, Ted Cruz, John Kasich, Hillary Clinton, or Bernie Sanders. Some of these terms make intuitive sense; others are mind-bending.

I wrote about this idea the other day. (To learn how the method works, and to do it yourself, read this first.) Today I want to explain a little further. I will show some fascinating and often hilarious results.

Reader N. turned me on to Google Correlate, which is basically part of the engine behind Google Trends. Correlate takes a pattern you give it – baked bean sales by state, robbery rates over time, or whatever – and gives back the search terms that have a similar pattern. There are billions of search terms – similar to the number of DNA “letters” in the human genome.

N. created a text file of vote shares by state. Here is what the first lines of a file for Trump would look like:
Iowa,0.513
New Hampshire,0.186
South Carolina,0.357
Nevada,0.302
Alabama,0.306
Alaska,0.492
Arkansas,0.455
Georgia,0.347
Massachusetts,0.125
…

These Trump support numbers are fractions of the total Trump+Cruz+Kasich vote. Percentages are okay too, since Google Correlate rescales everything to a range of -1 to +1. Typos are not okay – Correlate is very unforgiving of misspelled state names.
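For readers who want to build such a file themselves, here is a minimal Python sketch. It assumes you already have raw vote counts per state; the function names `vote_share` and `correlate_csv` are hypothetical, not part of any Google tool, and the exact upload format Correlate accepts should be checked against its own instructions.

```python
import csv
import io


def vote_share(candidate_votes, all_votes):
    """Fraction of the combined vote (e.g. Trump+Cruz+Kasich) won by one candidate."""
    total = sum(all_votes)
    if total == 0:
        raise ValueError("no votes recorded")
    return candidate_votes / total


def correlate_csv(shares):
    """Render a {state: share} mapping as 'State,share' lines, one state per line."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for state, share in shares.items():
        # State names must be spelled exactly; misspellings are not tolerated.
        writer.writerow([state, f"{share:.3f}"])
    return buf.getvalue()
```

Calling `correlate_csv({"Iowa": 0.513, "New Hampshire": 0.186})` reproduces the first two lines of the file shown above.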

If you upload Trump, Cruz, and Kasich files, Correlate gives back a list of the most-correlated search terms. Those lists look like this:

These lists were generated using vote-share data that excluded Ohio (Kasich’s home state) and Wyoming (an unusually nonrepresentative voting process, even by the standards of caucuses). That was N.’s decision, after playing around with the data a bit.

The way to read the table is as follows: state-by-state, Trump support is correlated with the frequency of “DeGrassi season 13” with a correlation coefficient of 0.7438. Why? If a term shows up on this list, it doesn’t necessarily mean that the person doing the search supports Candidate X. It could also mean that relatives or neighbors of Candidate X’s voters tend to make that search.
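To make the number concrete: a correlation coefficient like this compares two lists of per-state values, one of vote shares and one of search frequencies. The sketch below assumes an ordinary Pearson coefficient; the post does not describe Correlate's internals, so treat this as an illustration of the statistic rather than of Google's code. The function name `pearson_r` is hypothetical.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of per-state values."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two lists of equal length, at least 2 values each")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross-products of deviations.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when one list is constant")
    return sxy / math.sqrt(sxx * syy)
```

Here `xs` would hold a candidate's vote share in each state and `ys` the relative frequency of one search term in the same states, in the same order.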

To return to the GWAS analogy, such indirect connections are essential to genomic analysis: the snippet of DNA that is tracked is usually close to, but hardly ever identical to, the snippet that causes a trait. In genetics, even when there is a causal connection, it is not so obvious what is going on. For example, one gene, whose protein was thought to be mainly for how blood cells adhere to one another, turns out to be important in how synapses adhere as well – with implications for schizophrenia. The point is, we should be careful to avoid overinterpreting or overgeneralizing from the search terms above. But we should also keep our minds open about what we find!

Now, to examine some of these “hits”:

John Kasich. Places that like Kasich are richer in some fairly policy-wonkish search terms: “net cost,” “renewable portfolio standard,” the economist Joseph Stiglitz, Financial Times writer Martin Wolf, and Vox writer Dylan Matthews. These terms have a ring of plausibility. They might be good fodder for small talk…if you are talking with a Kasich supporter!

But then there are terms that I don’t entirely understand: Route 73 and Haven Pizza. Maybe someone can explain those to me. It is also true that with billions of search terms to choose from, occasionally a correlation will arise by chance. These might be false positives.

Donald Trump. Note that the correlations are weaker. That could be because Trump support is broad-based in the Republican Party. Or it could be that the connection between the voter and the Google-searcher is indirect (i.e. they are different individuals who live near one another).

At first, this list was quite puzzling. A prominent cluster of search terms is pop culture-related: DeGrassi (that’s a TV show that focuses on the problems of teens), Kids’ Choice Awards, and Nickelodeon star Alexa Nikolas. And…“never had a boyfriend.” Wait, what? I thought Trump voters skewed older.

N.’s coworker had a thought. According to correlations from Neil Irwin and Josh Katz at The Upshot, Trump supporters are abundant in communities where many people have not finished high school, or places with lots of mobile home dwellers and “old economy” jobs like manufacturing. She suggests that in some families, the children of Trump voters are the only ones in the house who use the Internet. And at least some of them are looking for entertainment and relationship advice online. I do not think it is totally advisable to ask Google about boyfriends…but these searchers are helping us to predict voting patterns. So…you go, girls!

>>>

The final step in making a prediction is converting these search terms to predicted vote share. Each of the search terms has a correlation score attached to it. One could do a weighted sum using those scores…but they are not that different, so N. calculated a simple average of the terms’ coefficients, state by state. She then did a simple linear regression between that average and the vote-share for that candidate. The resulting linear fit is a formula that can be used to predict vote share in places that have not voted yet. N. repeated this for the other two candidates. The result is a set of inferred values of vote share – what a genomics researcher would call “imputed” values.
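The averaging-and-regression step described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not N.'s actual code: it assumes each search term comes with a per-state popularity value, and the function names (`average_by_state`, `fit_line`, `predict_shares`) and data layout are hypothetical.

```python
def average_by_state(term_popularity):
    """Average per-state popularity over all terms.

    term_popularity maps term -> {state: popularity}. Only states present
    for every term are kept.
    """
    if not term_popularity:
        raise ValueError("need at least one search term")
    per_term = list(term_popularity.values())
    states = set.intersection(*(set(values) for values in per_term))
    return {s: sum(values[s] for values in per_term) / len(per_term) for s in states}


def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two matching points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values are all identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def predict_shares(averages, known_shares):
    """Fit on states that have voted, then impute shares for states that have not."""
    voted = [s for s in averages if s in known_shares]
    slope, intercept = fit_line(
        [averages[s] for s in voted], [known_shares[s] for s in voted]
    )
    return {
        s: slope * avg + intercept
        for s, avg in averages.items()
        if s not in known_shares
    }
```

One would run `predict_shares` once per candidate, each time with that candidate's own list of search terms and known vote shares. Nothing in this sketch forces the three candidates' predicted shares to sum to 1, so a final renormalization may be needed.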

And what about today’s primaries? The PEC poll estimate and the Correlate-based estimate are in close agreement:

Because none of these five states have voted yet, none of them went into the Google-Correlate approach. The two green columns of estimates were made independently of one another. Today we will see how these two methods do. They both suggest that in the East Coast primary, Donald Trump will get about 150 delegates (this includes 43 out of 54 district-level Pennsylvania delegates, whose estimated faithfulness is 0.8), over 85% of the total to be given today.

I have one final thought on the analogy with GWAS. These days, modelers who attempt to predict votes from demographic factors are relying on linkages that they suppose to exist between populations and voting behavior. The education/Trump connection is an example of that. In human genetics, that would be called a “candidate gene” approach. Such approaches can turn up a good result if the hypothesis happens to be true. Often, they lead to results that have not held up so well over time.

In both human genetics and political modeling, constructing a good model requires the modeler to be an extremely good guesser. Google Correlate has the potential to search millions of possibilities at once. I think a clever modeler could really go to town with this tool.

>>>

Democrats, I have not forgotten you. Here, as an added bonus, are search terms that correlate with Clinton and Sanders support. The age/race divide is quite apparent. Where Sanders supporters are found, you can evidently get quinoa soup, PBS Nova, and people who do not know what “wonky” means. And baked goods. A lot of baked goods. It is a veritable cornucopia of Stuff White People Like.

26 Comments so far ↓

Very enjoyable and interesting post. About “haven pizza,” New Haven style pizza is a type of pizza that originated in Connecticut, and has spread out from there. https://en.wikipedia.org/wiki/New_Haven-style_pizza How that connects with John Kasich, I’m not quite sure.

This is mindbendingly interesting. It’s rare to find an analysis that is so completely new. I admit I had expected search terms high on the lists to be more obviously associated with the candidates (religion and abortion with Cruz, unemployment benefits with Trump, etc).

This is more like Neural Nets, where peeking at the values at the hidden nodes gives you almost no insight. One can come up with “just so” stories that sound pretty plausible (like Trump voters’ kids searching for help getting a boyfriend) but how to confirm that?

If pollsters get data on browsing habits for respondents, they can run a relatively easy analysis to tie specific searches more concretely to voting habits. To Dr Wang’s point though, this still would be correlational, not causal.

Interesting. Most pundits think Trump will win around 100 delegates, +/- 5 or 6 today. If he wins 154 delegates, he will cross the psychologically important mark of 1000 before the race goes to Indiana.

A couple of posts ago Sam posted the .csv files for Trump, Cruz, Kasich. The post is titled “Two independent ways of predicting GOP primaries lead to highly similar forecasts”. If you go to the comments, Sam responds to one of them by posting links to the files.

I’m still a little bit confused with the final step to make predictions. So I know that “Degrassi season 13” is correlated with Trump at 0.7438. I then average this with the corr coefficients of the term for the other candidates?

When he says “Each of the search terms has a correlation score attached to it,” he means a relative popularity score of the search term for each state, not what he shows in the picture. You can download an example csv here: https://www.google.com/trends/correlate/csv?e=id%3Au4WEFVag_a7&t=all Those state-by-state popularity numbers are linearly correlated to the state-by-state vote shares. You then take an average over all of the 100 search terms for each state, regress that average against the vote share, and then use the average and the regression to predict the unknown states.

I think it’s more likely that the baked goods are strongly correlated with each other despite people searching for them independently? So because he’s correlated with one of them, he’s closely correlated with many of them.

(Also, this wouldn’t be affected by another Sanders – what we’re getting correlations for is the voting percentage pattern.)

I’d be interested in seeing Clinton Sanders with open primary/caucus vs closed. With his huge advantage with independent voters (the majority of voters) this would be much more informative and predictive.

Google Correlate is pure comedy.
However, many search terms [pizza, slang, pop culture, highways with sunset views] may tie in with student demographics. Surely one of the main reasons to research Kasich is a compare-and-contrast school assignment. Extra credit exam essay _with_ browsing privileges? Whatevs.