Trump on a glide path (since mid-March)

April 27th, 2016, 9:19am by Sam Wang

The race has been stable for weeks, varying only by factors that are local to each state. Last night’s voting confirmed that – there was nothing new revealed. In terms of voter sentiment, the GOP race has been essentially unchanged since March.

How do we know this? Two reasons. The first is that national polls have been stable for four weeks, since March 22. The second is the remarkable success of a predictive method based on Google Correlate, which relies solely on past voting and web search patterns – and does not use polls or demographics at all. Here is how PEC and N.’s Google Correlate method did:

PEC, using a simple approach based on polls and border counties, did well. So did Google Correlate, when its results were fed through the delegate rules process.

Even more remarkable is the Google Correlate-based estimate of vote share. The chart below uses Google Correlate-based predictions (“Google-Wide Association Study” results) based on data excluding the candidates’ home states, and combines both Democratic and Republican primaries:

I’ll update this as more information becomes available, especially if I can find demographics-based predictions. At this moment, the closest thing I have is “538 demo,” which indicates an early estimate made by FiveThirtyEight assuming some kind of state-by-state constant shift. Google Correlate did notably better.

The “media data pundit” state-of-the-art is demographics, which can give some predictive information, for instance in this year’s Democratic nomination. However, this year’s multicandidate Republican race has seemingly not lent itself well to such an approach. Demographic variables like %evangelicals are crude proxies for voter preference. Amazingly, Google search terms like “DeGrassi season 13” do much better. A future challenge is to understand why.

The pre-East Coast primary delegate estimate was 1303 (IQR 1271-1326, probability 90%). The main effect of yesterday’s voting was to reduce uncertainty. I think it is reasonable to say that yesterday’s voting ended any realistic doubts about Trump being the eventual nominee. That is on a par with previous Republican nomination races: Romney and McCain each came to be regarded by their party as the presumptive nominee in late April.

Since Google Correlate estimates larger margins than PEC does, it should give even less uncertainty. To demonstrate that, I have fed Google Correlate-based vote estimates into the PEC delegate estimator. Trump’s median projected delegate count then becomes 1334 (interquartile range 1306-1341), with a 99.6% probability of getting to 1237. The Google Correlate-based histogram looks like this:
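For readers wondering how a median, an interquartile range, and a probability of reaching 1237 can all come out of one histogram, here is a minimal sketch. It assumes the estimator produces a list of simulated delegate totals; the function name `summarize_delegates` and the toy numbers are hypothetical and are not PEC's actual code or output.

```python
from statistics import median, quantiles

def summarize_delegates(simulated_totals, threshold=1237):
    """Summarize simulated delegate totals.

    Returns the median, the interquartile range, and the fraction of
    simulations that reach the threshold (1237 is the majority mark).
    """
    # method="inclusive" treats the list as the full set of outcomes.
    q1, _, q3 = quantiles(simulated_totals, n=4, method="inclusive")
    hits = sum(1 for total in simulated_totals if total >= threshold)
    return {
        "median": median(simulated_totals),
        "iqr": (q1, q3),
        "p_threshold": hits / len(simulated_totals),
    }

# Toy totals for illustration only; these are not PEC simulation output.
toy_totals = [1200, 1250, 1300, 1350, 1400]
summary = summarize_delegates(toy_totals)
print(summary)
```

A narrower spread of simulated totals pushes both ends of the interquartile range above the threshold, which is why larger estimated margins translate into less uncertainty.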

I had briefly considered switching to Google Correlate-based estimates as the official PEC estimate. However, that approach does have a latent assumption that whatever new voting comes in is enough to capture any swings in the race. Before I take any such step, I have to think about the implications of that. Though come to think of it, polls have the same problem – less so in frequently-polled states, more so in rarely-polled states.

35 Comments so far ↓

While Google Correlate is fascinating it is still a new method whose weaknesses have not been probed. We need to think hard about how this method could be led astray, and devise tests for robustness. (PEC comments section is probably as good a venue for that as any)

That being said, the method definitely adds information as an awesome and completely orthogonal crosscheck. If N. were coming up for tenure at any Poli Sci dept in the country, she would be a shoo-in.

I’m not deep in this, but if I understand correctly, the Google Correlate method comes from comparing search trends in states that vote early in the calendar to search trends in states that vote later in the process. That creates a prediction method which is highly useful specifically during the second half of a long primary season, yes? But it would be useless as a predictor of Iowa, New Hampshire, or the general election, and at best of limited utility before some critical mass of geographically diverse election returns have been generated (perhaps until after the first Super Tuesday?). Since this iterative voting presidential primary situation only comes up once every four years, and is highly competitive less often than that, it is not the time to forget what we know about polling.

Also, if there is a true inflection point in the race, Correlate would be powerless to catch it. It may only be performing so well because, as you said, the race has been stable since March.

It also looks to me like, in this situation, the best results come from averaging both approaches.

The border counties method is grounded in study of demographic parameters that are well understood to be statistically significant in predicting voting behavior: ethnicity, education, population density, income, and of course, geography. When this method goes awry, it is for known reasons (for example, cannot use rural/suburban border counties to extrapolate into urban centers).

Google Correlate is very powerful, I admit, but the mechanism is somewhat opaque. We can speculate as to why Degrassi reruns or New Haven pizza work so well, but it’s not as straightforward as demographics.

Reading through what I wrote above, it occurs to me maybe I’m just displaying the inherent conservativeness of my field. Certainly a few more home runs like yesterday and Google Correlate will have silenced any misgivings.

The next time you update, I’d be especially interested in seeing this probability histogram next to one where a Trump loss in Indiana has been manually fudged in. (or made clear in some other way) Intuitively, it seems like a lot of the variance in these results comes down to Indiana.

It is refreshing to see Sam saying Indiana doesn’t matter. What a buzzkill for the talking/writing heads.

But given what another commenter reported, namely that the last time there was a “brokered convention” Eisenhower was the nominee, if Trump is close, the Republican party’s choices are suicide and Trump.

As an American, I would like to point out for history the partial list of creatures from the depths of Hell that Trump has slain. Jeb Bush, Marco Rubio, Ted Cruz, Kasich, etc. etc. etc. The entire deep bench has been sent howling into the night. Bravo, Donald! Whatever your political affiliation, it will be a breath of fresh air to have an election between the old guard and the old guard.

Sam, if you’re still interested in demographic models from the Democratic side, I’ve been tinkering with a county-level demographic model of the Democratic primaries since late March. It considers black population, Hispanic population, whether a state is a caucus state, and whether the state is in the Census-designated South to account for a significant difference in exit polling between Southern and non-Southern black voters. It got a ~9 point RMSE in margin last night.

For the general election I could imagine a Nate-Silver-ish scheme in which a Google Correlate model based on recent state polling can be used as a proxy “poll”, rolled in with some weighting function for states where polls don’t exist or are getting stale. It could make it possible to get started on a fifty-state electoral-vote model now.

I was wondering last night what would have happened had N shopped this methodology around to other prediction sites. You think they would have been interested in publishing it? Sam does an absolutely horrible job of protecting his turf. ;)

I would suspect the weakness of the Correlate method would be the level of interest in a particular race. I’d have to look into Google’s methods here, but I’d suspect search terms aren’t all they use, given the various methods of tracking nowadays.

While demographic factors certainly have a longer track record, I don’t think they’re necessarily more straightforward. There isn’t anything straightforward about how melanin or senescence impacts voting behavior. Demographic factors are still proxies for belief systems, culture, etc, and in that way, while we don’t fully understand DeGrassi season 13 as a cultural phenomenon, it may actually be more closely linked to factors that actually influence voting. This is much like ordination and the associated challenges in interpretation.

Apologies, this should have been a reply to Amitabh Lath’s 04/27 12:24 comment.

One thing that struck me about the Correlate lists is that they seem suspiciously “clean”, given how much internet traffic is devoted to pornography. Were the results filtered by Google or are cookie recipe searches just more common?

Is it easy to rerun the numbers assuming a Cruz victory in Indiana? If so, I think it would (a) be interesting, and (b) be a good test of the quality of the narrative. (Is it really make or break, or is the Trump train too far down the tracks at this point?)

It seems like ethnicity, education, income, geography are crude pre-big-data ways to understand demography. I expect search behavior when well worked through eventually to become vastly more accurate and effective at predictions. For example, county-level search data may become available which would greatly increase the accuracy of N’s type of analysis.

For the general election, there will be plenty of polls, both national and of every swing state. With plenty of polling data, I wonder whether there will be any use for demographic information in estimating the outcome, and if so, whether there will be an application for something like Google Correlate.

These estimators like border counties and Google Correlate are only needed when there is sparse or no polling as in Indiana. The problem is what do we do when GC (inevitably) gets a prediction wrong. With geographic/demographic extrapolation techniques we can figure out what implicit assumptions we made that went wrong. Or at least try. I have not wrapped my head around the big data method sufficiently to map out a similar diagnostic path.

The fundamentals are baking. I’m not sure we can definitely say there will be a landslide until we see some numbers, but unless there is a recession, or some terrorist attack, perhaps combined with Clinton and Sanders unable to reach an accord, Hillary Clinton looks to be more likely to win right now.

I’ve got a silly question. Does anyone have a prediction about how Indiana will vote for the GOP based on Google Correlate? If so, please do tell! (And then let’s see how that prediction does in a week.)

Of what use is Google Correlate in the general election? All the voting takes place on one day. Can we use Google Correlate to, dare I say it, unskew the polling, or to extend the power of polls to poorly polled states?

– When something in the race changes, the future results will be different from the past ones, so anything that just extrapolates out past results (correlate, border counties), will clearly lose accuracy. So, we’ll always need polls and/or actual voting to detect changes. Conversely, if past extrapolating methods persistently differ from polling/actual results, we might suspect that something changed in the race.
For related reasons, we may possibly want to exclude overly old state data from our data set or include some new polling data in our data set?

– Maybe we could split our training data into a few subsets and see which search terms the resulting trends have in common. Hopefully there would be a bunch of them, and maybe these would be the “stable” ones and possibly they would be more predictive than all of the terms combined. (Or maybe not!)

– I wonder how Google would do as a polling company. Track down people with representative search patterns, ask them their opinion on whatever political question, then extrapolate out. (Raises privacy concerns, of course.)
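The subset idea in the second point above could be sketched roughly as follows. This is only an illustration under stated assumptions: terms are ranked by Pearson correlation with vote share (an assumption about how the ranking works), and the names `top_terms` and `stable_terms` and all of the data are made up.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either series is constant."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / (sxx * syy) ** 0.5

def top_terms(term_freqs, vote_share, states, k):
    """The k terms whose search frequency correlates best with vote share."""
    scores = {
        term: pearson([freqs[s] for s in states], [vote_share[s] for s in states])
        for term, freqs in term_freqs.items()
    }
    return set(sorted(scores, key=scores.get, reverse=True)[:k])

def stable_terms(term_freqs, vote_share, subsets, k):
    """Terms that make the top-k list in every subset of states."""
    selected = [top_terms(term_freqs, vote_share, subset, k) for subset in subsets]
    return set.intersection(*selected)

# Made-up states, vote shares, and search frequencies, for illustration only.
vote_share = {"A": 10, "B": 20, "C": 30, "D": 40, "E": 50, "F": 60}
term_freqs = {
    "steady": {"A": 1, "B": 2, "C": 3, "D": 4, "E": 5, "F": 6},
    "flaky": {"A": 1, "B": 2, "C": 3.5, "D": 6, "E": 5, "F": 4},
    "noise": {"A": 3, "B": 1, "C": 2, "D": 1, "E": 3, "F": 2},
}
subsets = [["A", "B", "C"], ["D", "E", "F"]]
print(stable_terms(term_freqs, vote_share, subsets, k=2))
```

Here "flaky" ranks highly in the first subset but not the second, so only "steady" survives. Whether the surviving terms would actually predict better than the full list is the open question raised above.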

Are you asking about the algorithm, or the reason why it would be predictive (a ‘why’)?

The algorithm is basically: ask Google for a list of (~100) search terms where search frequency seems to correlate with past results. Fit a linear model between each of those search frequencies (individually) and vote share. Calculate predicted vote share in a future state based on search frequencies in that state. Take the average of the predicted vote shares.
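As a rough sketch of the steps just described, assuming the correlated search terms and their per-state frequencies have already been obtained (the function names, data layout, and numbers below are hypothetical, not N.'s actual code):

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept for one term."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    if sxx == 0:
        # A term with the same frequency everywhere carries no signal.
        return 0.0, y_bar
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return slope, y_bar - slope * x_bar

def predict_vote_share(term_freqs, vote_share, target_state):
    """Average of one-term linear predictions for a state yet to vote.

    term_freqs: {term: {state: search frequency}}
    vote_share: {state: vote share} for states that have already voted
    """
    voted = sorted(vote_share)
    predictions = []
    for freqs in term_freqs.values():
        xs = [freqs[s] for s in voted]
        ys = [vote_share[s] for s in voted]
        slope, intercept = fit_line(xs, ys)
        predictions.append(slope * freqs[target_state] + intercept)
    return mean(predictions)

# Made-up numbers for illustration only: three states have voted, "D" has not.
vote_share = {"A": 30.0, "B": 40.0, "C": 50.0}
term_freqs = {
    "term one": {"A": 1.0, "B": 2.0, "C": 3.0, "D": 4.0},
    "term two": {"A": 3.0, "B": 2.0, "C": 1.0, "D": 0.0},
}
print(predict_vote_share(term_freqs, vote_share, "D"))
```

Each term gets its own one-variable fit, and the final estimate is the plain average of the per-term predictions, so no single search term dominates.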

My attempt at summarizing a ‘why’ (based on Sam’s comments) is that the search terms are actually categorizing swaths of populations (i.e. if you gathered up individual-level data, you could cluster people by their search terms), which you can kinda use almost like “demographics”. These swaths match up well with vote shares this race, even though the swaths are not anything traditionally measured as “demographics”.

I speculate that Google must be doing something to only output correlations that have reasonable statistical significance (at least, in number of searches).

I also should note that it’s “predictive” in that it leverages information about past states for future states. It obviously isn’t predictive in the sense that it can see whether there might be future events in the race that will affect those future states.

Here’s my take. I think of the Google searches as a bunch of data on consumer behavior. Because demographics are correlated with consumer behavior, lots of companies use demographics to try to predict consumer behavior for use in marketing. Demographic-based voter prediction is an attempt to predict voting behavior using similar models, with polling data as an input. The Google Correlate method is inferring one set of behavior (voting) from a population based on a different set of behavior (web search). I don’t think we have to think of the prediction model as having a direct demographic meaning, but rather it’s correlated with demographics in some way. Instead of “Bernie Sanders is doing well with young white voters”, it’s “Bernie Sanders is doing well with Whole Foods shoppers”. There’s an overlap between those things, but it’s not perfect, and one might be more predictive than the other. Demographics, consumer behavior (Google searches, but it could just as easily be Amazon purchases) and voting are all separate variables, all correlated with each other in some unknown way. The more tightly correlated any of those two variables are with each other, the more accurate the prediction. The Google Correlate method is probably working well because the data set is so huge that there is a reasonable chance of finding a few cases with predictive value.

Sam, I hope you react to this with humor and not venom (I am nervous of bringing up anything to do with 538).

While writing the previous post (guessing that the search-term technique may be conceptually equivalent to clustering people into non-standard demographics), I recalled that someone actually tipped off 538 that racially-charged search terms appeared to correlate with Trump’s early primary results. However, 538 relayed the correlation in factoid-type posts about xenophobia within Trump’s supporters, and didn’t pursue the technique itself. It’s possible that the original source was testing specific hypotheses about search terms, and they simply didn’t think to extend it.

Just another datapoint that they’re not really “on the ball” with new ideas this year despite their size. Any new modeling efforts seem to be going towards sports and the like (which, I guess, was the point of going to ESPN).

(On second thought, maybe it’s because of their size? This is a great technique, but it might be one that is relatively specific to our current situation, where it is late in a “competitive” primary. So it could be that larger ‘slower’ organizations wouldn’t want to invest into publishing something both one-use and experimental, while PEC can iterate faster and release interesting albeit less-proven findings.)

I’ve been beating that drum in the comments over there for a long time, to no avail.

Anyway, if I had to guess, they made an editorial decision not to pursue such methods on a regular basis because that would read as taking a normative stand on the nature of Trump’s support (which I have no problem doing, but I am not trying to sell anything).

That “tipoff” came from NYT (The Upshot). They ran a story containing that information, which, unfortunately, I don’t have the time to locate. Could be the oft-cited “Trump’s supporter—a certain kind of Democrat” one.

A question re: Google Correlate:
At what point was there enough data to make this work? We, at this point, have several dozen states’ worth of data to work with, but how effective would the method have been had we attempted to use it from earlier days of the campaign? Does Google offer the kind of historical data that would allow for testing of that?

That’s an interesting idea and tractable without any help from Google. Run the Correlate analysis with Iowa and New Hampshire, and try to predict South Carolina. Then use the first three only to predict Nevada, etc. I would be particularly interested to see if big polling upsets could be predicted by this method using only data from the states that had already voted. If Sam had been using this before Michigan, would we have known there was a problem with the Sanders-Clinton polls?
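The walk-forward test described here might look something like the sketch below. It is only a harness: `predict` stands in for whatever Correlate-based predictor is being tested, and the `baseline_predict` placeholder, the state labels, and the numbers are all made up for illustration.

```python
from statistics import mean

def sequential_backtest(calendar, actual, predict, min_train=2):
    """Predict each state using only the states that voted before it.

    calendar: states in voting order
    actual: {state: observed vote share}
    predict: callable(train, target) -> predicted share, where train is
             {state: share} for earlier states only
    Returns a list of (state, predicted, actual, error) tuples.
    """
    rows = []
    for i in range(min_train, len(calendar)):
        target = calendar[i]
        train = {s: actual[s] for s in calendar[:i]}
        predicted = predict(train, target)
        rows.append((target, predicted, actual[target], predicted - actual[target]))
    return rows

def baseline_predict(train, target):
    # Placeholder only: ignores search data and returns the mean of earlier results.
    return mean(train.values())

# Made-up calendar and vote shares, not real primary results.
calendar = ["state_1", "state_2", "state_3", "state_4"]
actual = {"state_1": 20.0, "state_2": 30.0, "state_3": 40.0, "state_4": 50.0}
for row in sequential_backtest(calendar, actual, baseline_predict):
    print(row)
```

Swapping a search-based predictor in for the placeholder and comparing its errors against the polling errors in the same states would show whether an upset like Michigan stood out in advance.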