Technical note: Correlated change among states (revisited)

August 28th, 2008, 2:51pm by Sam Wang

The next week and a half will be stormy, not only in the Gulf but also for polls. Where political events take the median EV history will become clearer once we have polls taken after the Republican convention.

In the meantime, let’s return to a topic that is obscure but loved by some hardcore polling enthusiasts: covariation.

A substantial fraction of my mail continues to concern the subject of correlated swings between states, and how that correlation may affect the snapshot. I addressed this topic before, giving an argument for why it does not matter for a snapshot. However, some people remain unconvinced. So let me take another bite at the question – with data.

The central assumption of the Meta-Analysis is that on average, recent polls sample a population equivalent to participants in a real election held today. Election Eve polls perform quite well (see here and here) in predicting Election Day outcomes. In the face of this fact, how should we think about the idea that outcomes among states are coupled?

At a basic level, this information is already contained in the primary polling data, which give margins in all states. Consider the following four states:

TX — CO — IA — NJ

where TX is the most reliably Republican, NJ is the most reliably Democratic, and the others are somewhere in between. If states stay fixed relative to one another – in other words, if fluctuation is uniform nationwide – McCain could win {TX}, {TX, CO}, {TX, CO, IA}, or {TX, CO, IA, NJ}, but no other combinations. Reflecting this, the order of polling margins is usually TX < CO < IA < NJ (or the other way around, depending on your political preference).
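The nesting property is easy to verify computationally. Here is a minimal sketch (with hypothetical margins, not actual 2008 poll numbers) showing that under a uniform nationwide swing, margins keep their rank order, so the set of states McCain carries is always a prefix of the Republican-to-Democratic ordering:

```python
# Hypothetical McCain-minus-Obama margins (percentage points), ordered
# from most Republican to most Democratic.
margins = {"TX": 12.0, "CO": 1.0, "IA": -3.0, "NJ": -10.0}
order = ["TX", "CO", "IA", "NJ"]

def mccain_wins(swing):
    """States McCain carries after adding a uniform swing to every margin."""
    return [s for s in order if margins[s] + swing > 0]

# Sweep the swing over a wide range; every resulting win set is a prefix
# of the ordering -- {TX}, {TX, CO}, {TX, CO, IA}, {TX, CO, IA, NJ} --
# and no other combination ever appears.
for swing in range(-15, 16):
    wins = mccain_wins(swing)
    assert wins == order[:len(wins)]
```

Non-uniform state-by-state noise would occasionally scramble the order, but as long as swings are mostly shared, the nested structure dominates.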

This tendency toward rank-ordering is related to why the probability distribution in the right sidebar is spiky. At any given moment, only a few states are in play. Recently, Obama campaign manager David Plouffe cited his focus on 18 states. Most of these states are represented in the jerseyvotes (voter power) calculation in the right sidebar. The number in substantial doubt on Election Day will certainly be fewer.

Covariation is said to come up in two cases. Let us consider them.

Case 1: Polls reflect hypothetical voting, but a state is won by the candidate who is not leading in polls.

One intuitive view says that an unexpected outcome is evidence that polls are biased, and that other states are likely to be off in the same direction. However, the current state probabilities indicate that if we guessed the winner in every state using a probability > 50% criterion, on average we would get about 3 of them wrong. This reflects the fact that the race appears to be near-tied in six states. Surprises are to be expected. Indeed, they are the reason for the existence of the Meta-Analysis.
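The expected number of wrong calls follows directly from the state probabilities: each state contributes min(p, 1 − p) expected misses. A sketch with hypothetical probabilities (not the actual snapshot values) shows how six near-tied states alone account for roughly 3 expected misses:

```python
# Hypothetical win probabilities: 45 safe states plus 6 near-tied ones.
probs = [0.999] * 45 + [0.55] * 6

# Calling each state for the candidate with probability > 50%, the
# expected number of misses is the sum of min(p, 1 - p) over states.
expected_misses = sum(min(p, 1 - p) for p in probs)
# Six states at ~55% contribute 6 * 0.45 = 2.7 of the total.
```

Under this toy assignment the expectation comes out just under 3, nearly all of it from the six toss-ups.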

Case 2: Sentiment differs between the polled sample and the voting sample.

This can happen for various reasons. Voter sentiment can change over long periods, especially when the election is months away. Some voting populations may be undersampled, such as voters who have cell phones but no landlines.

The tendency for opinion in different states to shift (or differ) from polls in the same direction and by similar but nonidentical amounts has been addressed in detail by Nate Silver here and here. Rather than discuss how these shifts may vary from state to state, today I want to bring up a more fundamental modeling question: How much does adding covariation affect the statistical properties of a current-poll snapshot or a projection? This leads to the action item: In a snapshot or future projection, how much detail about covariation is actually necessary?

Let’s look at the modeling question with a simple, worst-case calculation using today’s data.

To simulate covariation, let’s assume that true voting differs from the polls in every state by the same amount, S. This assumption is the simplest – and most extreme – version of the covariation idea. Modeling this case puts an upper bound on the size of the effect.

However, we don’t know what S is. So let it vary. For instance, if the election were today, it might vary over a range of -1% to +1%. In November the range might be wider. In all cases I will distribute it randomly around zero. It can also have a bias, but that’s not today’s subject.

In the Meta-Analysis, varying S is easily done by setting biaspct (which is S) to a range of values, calculating the probability distribution for each value, and averaging the distributions. Here is what we get:

Adding covariation to the Meta-Analysis

(Click the image to see details)
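The averaging procedure can be sketched in a few lines. This is not the actual Meta-Analysis code; the states, margins, EV counts, and effective polling uncertainty below are all hypothetical. For each value of the uniform bias S, every state margin is shifted, margins are converted to win probabilities, the exact EV distribution is built by polynomial convolution, and the resulting distributions are averaged over S:

```python
import math

# Hypothetical (name, Dem-minus-Rep margin in %, electoral votes).
states = [("A", 5.0, 10), ("B", 1.0, 20), ("C", -2.0, 15), ("D", -8.0, 25)]
SIGMA = 2.0  # assumed effective polling uncertainty, percentage points

def win_prob(margin):
    """P(Democrat wins) for a given margin, assuming Gaussian poll error."""
    return 0.5 * (1 + math.erf(margin / (SIGMA * math.sqrt(2))))

def ev_distribution(bias):
    """Exact Democratic-EV distribution for one bias value, by convolution."""
    total_ev = sum(ev for _, _, ev in states)
    dist = [0.0] * (total_ev + 1)
    dist[0] = 1.0
    for _, margin, ev in states:
        p = win_prob(margin + bias)
        new = [0.0] * (total_ev + 1)
        for k, prob in enumerate(dist):
            if prob:
                new[k] += prob * (1 - p)   # lose this state
                new[k + ev] += prob * p    # win this state
        dist = new
    return dist

# Average the distributions over a range of S, here -1% to +1%.
biases = [-1.0, -0.5, 0.0, 0.5, 1.0]
avg = [sum(d) / len(biases)
       for d in zip(*(ev_distribution(b) for b in biases))]
```

Because every component distribution has the same spiky shape shifted in probability weight rather than location, the average keeps its peaks in place while the tails fatten.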

In all three cases, the distribution is spiky, as implied by the rank-ordering principle. As expected, the peaks are in exactly the same locations in all three distributions.

The statistics of the distribution are (all values given in Obama EV):

In all three cases, the median and mode are the same. With a covariation of +/-1%, the confidence intervals get 3-7 EV wider. With a covariation of +/-2%, the confidence intervals get 12-21 EV wider.
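Measuring that widening only requires reading quantiles off a discrete EV distribution. A small helper sketch (not PEC's code; the example distribution is hypothetical) shows how the median and a 95% confidence interval are extracted:

```python
def quantile(dist, q):
    """Smallest EV whose cumulative probability reaches q."""
    cum = 0.0
    for ev, p in enumerate(dist):
        cum += p
        if cum >= q:
            return ev
    return len(dist) - 1

def summarize(dist):
    """Return (median, (lower 95% bound, upper 95% bound))."""
    return quantile(dist, 0.5), (quantile(dist, 0.025), quantile(dist, 0.975))

# Toy distribution over EV = 0..4: mixing in covariation spreads weight
# into the tails, widening the interval while the median stays put.
median, (lo95, hi95) = summarize([0.05, 0.2, 0.5, 0.2, 0.05])
```

Applied to the three distributions above, the same calculation yields identical medians and progressively wider intervals as the covariation range grows.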

Covariation of +/-1% is an amount of variation one might expect on Election Eve – the snapshot approach. In this case covariation does not affect the basic estimate, but it does increase the uncertainty by a modest amount.

Covariation of +/-2% (or more) represents a case in which national voter sentiment goes someplace – that is, a prediction. The more the covariation, the wider the spread. The decreased certainty of the outcome is reflected in the tails of the distribution and the decreased probability of getting to 270 EV. If you understand the Popular Meta-Margin, you will not be surprised by this. Covariation of +/-2% creates some scenarios in which opinion moves by almost the amount of the Meta-Margin and the trailing candidate has a chance. The Meta-Margin has the advantage of telling you explicitly what this quantity needs to be.

(By the way, I think “win probabilities” are intrinsically inaccurate for an event two months off, mainly because of the difficulty of figuring out the right assumptions. For this reason I don’t report them, and I think the differences above are of no consequence.)

The bottom line of these models is this: For a snapshot, covariation doesn’t matter. For a long-term prediction, changes in national sentiment need to be modeled in some way that should include partial or total covariation.

Finally: Since the biggest source of uncertainty in modeling long-term changes is knowing how large the national changes will be, an adequate assumption is total covariation. This is a simple but effective approach to long-term prediction.