How Our Prediction *Really* Works

September 17th, 2014, 11:10am by Sam Wang

I hear that the Princeton Election Consortium calculation has come under criticism for being statistically overconfident. I think there is confusion here, which requires a little explanation – and an appreciation for what I’ve learned since I started doing this in 2004. Basically, after 2012, any predictive calculation started to build in Election Day uncertainty. By conflating a 2010 snapshot with the 2012/2014 predictive model, Nate Silver has made a factual error.

The key difference is between the snapshot and the prediction. Our snapshots are precise because they give a picture of conditions today. Our November prediction builds in the possibility of change occurring in the coming seven weeks. Thus the November prediction above (today at 70%) will, in the near future, usually be less certain than the snapshot (today at 80%). As a reminder, the predictive model is documented and is open-source.

I explained this in 2012. As an example, when our current prediction method is applied to past Presidential races, it gives a cliffhanger in 2004 and clear Obama wins in 2008 and 2012. A polls-only approach suggests that this year, Senate control is also a cliffhanger, with a slight advantage for Democrats+Independents.

I’m sure there are more points I have missed. Have at it in comments. Please be nice about everyone, including any rival sites. Nonsubstantive and rude comments will be moderated.

The Dems will win between 50 and 52 seats. Women, Latinos, and the Black vote will be the difference. The Democratic ground game will be huge in this election, since Obama has given the party his method of tracking Democratic voters and getting them to the polls. You cannot argue with him, since he crushed both McCain and Romney in the last two Presidential elections. The GOP is in big trouble, since you cannot win elections by suppressing the vote and waging a war on women and minorities.

That is my thought as well. Barring a Black Swan event – such as a new financial meltdown or a 9/11-scale terrorist attack on American soil – I believe Democratic candidates will over-perform due to highly effective GOTV efforts.

Just look at how many races currently have the candidates within 2% of each other:

Well, since Nate’s model has been moving more and more in line with Sam’s rather than vice versa (so far at least), I feel I have to side with Sam on this one. As I see it, historical data and the like are helpful for thinking about an election, say, a year or six months out, but once polls become available you really should weight them more. So it made sense to say that the Dems had a tough election ahead of them back in December, but right now it makes a lot more sense to look at polls rather than GDP figures from the spring to figure out whether Kay Hagan is going to be reelected.

The GOP gained 63 seats that year. As Nate points out, this means that the actual result was more than 11 standard deviations away. Note that erfc(11) = 1.4409e-54. In other words, if you held an election in America every attosecond since the beginning of the universe, you still would never expect that result to happen.
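For concreteness, that arithmetic is easy to check. The sketch below is mine (Python rather than MATLAB), and the 13.8-billion-year figure for the age of the universe is my own back-of-envelope assumption:

```python
import math

# Tail probability quoted in the comment: erfc(11)
p = math.erfc(11)  # ≈ 1.44e-54

# Back-of-envelope: age of the universe (~13.8 billion years) in attoseconds
age_universe_s = 13.8e9 * 365.25 * 24 * 3600   # ~4.35e17 seconds
elections = age_universe_s * 1e18              # one election per attosecond

# Expected number of 11-sigma results over all those elections
expected = p * elections
print(p)         # ≈ 1.44e-54
print(expected)  # ≈ 6e-19 — still effectively zero
```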

It should be pointed out that after the election we will still not know, with absolute certainty, which model is best. However before the election both Nate’s model and Sam’s model help us to see where to focus our energy and where not to waste it. –bks

Any individual election gives us no indication of which model performed better, except in extreme cases. However, since Prof. Wang’s model is open source, we can rerun the current model on historical data (which he has done) and evaluate its accuracy. Can the same be said for Silver? I get the feeling that he adds so much special sauce that there is no way to run it against historical data, but correct me if I am wrong.

Hi Sam — it would be helpful to hear you explain how your current model/November prediction estimates uncertainty differently than the other “polls-only” aggregators. Don’t have a sense of this because I haven’t tried to figure out the other models, but would be helpful to have you outline what differences you think exist, if any. Thanks!

Our predictions were the same in those two cases, so this is mainly an issue of error calculation. I agree some re-examination of how to calculate error bars is necessary, and I have thought about how to do it better. (To recall Keynes: “When the facts change, I change my mind.”)

However, I think most readers care more about “sign errors,” such as missed calls in 2012.

Your claim about sign errors essentially assumes people don’t understand statistics, which is true for the most part. But for those of us who do, it undermines the integrity of your predictions.

Consider:
Model A predicts D 51%, R 49% with an MOE of 3%.
Model B predicts D 40%, R 60% with an MOE of 3%.
The result is D 49%, R 51%.

Model B had no “sign error”, but its prediction was statistically much worse. It signaled that the actual result was highly improbable.
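To put numbers on this, here is an illustrative sketch (my own, not from either model) that treats each MOE as a 95% interval, so sigma = MOE/1.96, and asks how improbable the actual result was under each forecast:

```python
import math

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * math.erfc(-z / math.sqrt(2))

def surprise(pred_d, result_d, moe):
    """Two-sided probability of a deviation at least as large as observed,
    under a forecast whose 95% margin of error is `moe`."""
    sigma = moe / 1.96
    z = (result_d - pred_d) / sigma
    return 2 * (1 - normal_cdf(abs(z)))

result = 49.0
print(surprise(51.0, result, 3.0))  # Model A: ~0.19, result well within range
print(surprise(40.0, result, 3.0))  # Model B: ~4e-9, result nearly impossible
```

Model A made a “sign error” but assigned the outcome reasonable probability; Model B got the sign right while declaring the actual result a nine-in-a-billion event.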

In general, Nate’s complaint with your model was my first impression upon viewing this site recently. You are extremely confident about the range of possible outcomes, and your histogram has an extreme lack of tails no matter what it’s technically supposed to be predicting (which isn’t entirely clear).

To be fair, Nate has a different problem, in which he seems to over-rely on his “fundamentals” calculation and doesn’t respond properly to poll developments that buck a state’s expected behavior. That’s how 538 totally missed Montana and North Dakota in 2012, and why they are the last to respond to the evidence that Orman is probably leading in KS.

I’m a complete layman – that said, the factor that never stops screaming in my ear is turnout. I’d like to see a statistic on how those who don’t vote would vote, and I’m fairly certain that as a group they’d vote overwhelmingly Democratic. The higher the turnout, the better the chances for the Dems – hence, Dems do much better in Presidential election years.

Which brings us to the WANG/SILVER polemic. Competent professional pollsters succeed or fail by their assessment of how likely people are to turn out. Sam’s model assumes that – on average – they’ll get it right, but what if they don’t? What if the Dems ran a saturation national ad campaign promising to raise the minimum wage to $25/hr on the first day after they gained control of the house & senate (using the nuclear option), writing and publishing the bill in the newspaper? This would cause an absolutely insane uproar and suddenly every person in America would know that an election was occurring that could impact their lives at least as strongly as – say – the Scottish independence referendum, and in a much more predictable way. This would destroy the likely voter assumptions, taking the model down in the process, no?

But a legitimate black swan event is something that comes out of the blue and can’t be anticipated. 9/11, for example. But then again, to Bin Laden that wasn’t a black swan event – it was a carefully staged event designed to provoke the very reaction it provoked. And it worked, sadly.

Publishing a piece of proposed legislation in the NYT and promising to pass it word for word on the first day of the session would be a black swan event to GOP strategists, but not to the people who paid for the ad. It doesn’t have to be $25 and it doesn’t have to be the minimum wage. It doesn’t have to CHANGE a single voter’s allegiance – all it has to do is drastically increase turnout by way of drastically increasing awareness of the election’s existence, and of the importance of who controls the House of Reps. to the pocketbooks of “unlikely voters.”

Did you ever see “The Shooting of Buckwheat” on SNL? Every few months there’s something like that – often something entirely unworthy of viral attention – that utterly saturates the media and water cooler conversation. How hard could it be to manufacture such a faux black swan centered around the Nov. election? You could take any poll out there and change the turnout assumption and it would completely skew the result to the left, right? So – in an election where all the cards are stacked against you, why not do something drastic to increase turnout?

That scenario (for no real reason) reminded me of a laborious modeling assignment in college geology, calculating the aftereffects of seismic events. It gets crazier and harder the bigger the event. Your wage hike would be like a very large earthquake. Would Dem voters get the message and turn out? Or would it just get every registered Republican out and backfire? Or would it be so radical that you split Dems on the issue and cause some to defect while uniting Republicans, creating a red tsunami? (If the third seems unlikely, remember that raising the wage to $10.10 will destroy 100,000 jobs. $25 would probably have even liberal economists like me using the R-word.) I think it’s this uncertainty that makes events like this fun to ponder, but unlikely.

This is literally a second-order nerd argument. I questioned (berated) Sam earlier about the covariances of his polling errors and the variance of the aggregate. I did the same with Nate Silver last year – but with no response. (I may have screwed up making comments on 538.) This has NOTHING to do with the direction of the W vs. L. It’s just about how to square and sum a bunch of errors. (BTW, I really like 538’s work with the empirical error distribution.) After a few thousand elections we can compare forecast error distributions.

I respectfully disagree that “error calculations” are a secondary issue. Nate’s arguing that your overconfidence down the line is distorting your big picture probabilities. He’s not objecting to your polls-only approach, but saying you have an unreasonable degree of confidence in small polling leads, typified at the extremes by the Angle and 2010 House examples. I don’t know that he’s right, but I’d love to hear a debate on this ground.

Okay, I believe you when you say you’re rethinking your error estimate, and I mean no disrespect. I have read your code carefully, and I fail to see how your error estimate is empirically checked in your code. Consider, for instance, these two lines in your m file:

Your guess as to the systematic error is 0.7%. Why? Historically, has it been 0.7%? Are you projecting what you think will be the systematic error this year? Is this the trend for recent polling years or for all polling years? If recent, where is the cut-off? Does the prediction fall apart if you extend it two years more?

Then, all of a sudden, on line 27:

if and(h0) % election is soon, so combine current and long-term
blackswanfactor=3;
systematic=1.0;

As you get closer to the election your systematic error increases and your two-tailed fudge factor triples. How did you determine those numbers? Are they based on historical information? An informed prior with a Bayes posterior using current data? Shouldn’t the results be less uncertain as we approach the election? The central limit theorem says that this sampling distribution approaches a normal distribution, but you have 2 sigma = 70%. Why would that be?

I appreciate that your code is available for download, whereas Nate’s is secret, but given the track record these notions of uncertainty DO seem kind of like arbitrary assumptions. Without a historical analysis that justifies them, why would I believe them to be reasonable?

The truthful answer to this is that after launching the calculation using the Presidential value, I found sources of error that led to my larger estimate for “systematic.” This was based on poll vs. vote discrepancies in past Senate elections. The problem is then to introduce this better value without causing a fake news event. I chose the transition to <35 days.

Regarding the second point you flag: on the time scale of <35 days, the poll aggregate starts to assume random-walk qualities. That is, the snapshot starts to predict future behavior. You can see this in both the 2012 Presidential and the 2014 Senate time series.

Thank you for making such an informed point.

P.S. I myself have pondered whether the blackswanfactor parameter should be lowered from 3. Anyway, play with tcdf(MM/systematic,blackswanfactor) and tell me any insights.
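For anyone without MATLAB handy, the fat-tails idea behind tcdf with 3 degrees of freedom can be sketched in Python using the closed-form CDF of Student’s t with ν=3 (this is an illustrative sketch of the distribution itself, not the PEC code):

```python
import math

def t3_cdf(x):
    """CDF of Student's t with 3 degrees of freedom (closed form)."""
    u = x / math.sqrt(3)
    return 0.5 + (u / (1 + u * u) + math.atan(u)) / math.pi

def normal_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2))

# Tail mass beyond 3 sigma-equivalents: the t_3 distribution keeps
# roughly 20x more probability in the tail than a normal does.
print(1 - t3_cdf(3.0))      # ~0.029
print(1 - normal_cdf(3.0))  # ~0.0013
```

That extra tail mass is what lets a black-swan-style surprise stay on the table even when the snapshot margin looks large.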

This is a tangential topic to the discussion, but I personally would presume the independent nature of Mr. Orman means, in effect, we could have a three-party system in the Senate. Counting the Vice President’s tie-breaking vote, we have 101 voters. 51 votes will be required for a majority. If the Republicans win 48 to 50 seats and can persuade enough Independents to vote with them, they can form a ruling coalition.

The only way they can rule without a coalition is to have at least 51 votes.

The chances are best that we will have some kind of a ruling coalition in charge of the Senate.

To me, that’s the “statistic” that I’d like to see measured: What are the chances that the Republicans will be in charge; what are the chances that the Democrats will be in charge, and what are the chances that a ruling coalition will be in charge?

When people talk about the short era when Obama had a Congressional majority, they often forget that Joe Lieberman frequently functioned as a third party of one, and played his role as perpetual swing vote to the hilt.

It seems like that would be a tough statistic to produce due to multiple unknowns. Either party could offer a desirable position, or possibly support on pet or pork projects. If true courting were engaged, we would assume Republicans would point to red as a natural choice for a long-term career as a Kansas Senator. Democrats, on the other hand, might point to the 24 seats the GOP will defend in 2016 and entice Mr. Orman with six years as a committee chair instead of two. There’s no way to quantify these factors, so sadly this is one of the few times in life when math doesn’t solve everything.

Estimated from Meta-Margin errors in 2008/2012, in Presidential and Senate races. The salient point is how well the *aggregate* does. In 2010, Democratic Senate candidates overperformed a bit. We can get into whether the error is well estimated, offthread or on.

The ‘arbitrary’ nature of fundamentals certainly seems to be a weak point of Silver’s approach. If, for example, you introduced 50% uncertainty into each Senate race, then you’d provide more information (info just as relevant as non-poll data) and reduce ‘swings’ in probability. If that’s true, then Silver has really just equated “big percentages” with “volatility” with “bad.”

I am just the lowliest of laypersons, but does your code account for the fact that an election is not a single-day event? One of the remarkable aspects of the last elections is that so many ballots are cast days and weeks in advance, such that a black swan’s impact would seem to be proportionally reduced by however many votes were cast before it. Also, states that allow for easier voting, including casting ballots before Election Day, encourage more participation from voters, and IF those voters are motivated, they could skew midterms with more impact, because total midterm turnout will always be lower than in Presidential years. Do factors like these, and other factors such as weather (which was for a long time one of the most cited predictors of election results), figure at all in your predictions?

Thanks, yes, it would be great if you could speak to the estimation and performance of the error calculation “on thread.” I get that you think Nate misrepresented you, but please don’t punish the rest of us! Naturally some of his questions have become ours. The most serious challenge he makes is that you are more confident in small polling leads than is empirically warranted, that this overconfidence has been several times exposed, and that it distorts your bottom line probabilities. I don’t have a clue if he’s right but would like you to answer the charge.

It is my feeling that there are three issues here, which deserve to be carefully separated. Here is how I would sort them out for you today:

(1) How to estimate the accuracy and uncertainty of a snapshot. This is in some sense impossible because there is no election to validate it.

(2) How to estimate the degree of movement between a snapshot today and a snapshot on Election Eve. This can basically be done by calculating by how much, and how quickly, the snapshot varies over time. Let’s call the amount of that movement sigma_movement.

(3) How to estimate the accuracy of an Election Eve snapshot. This can be done by the obvious means of comparing the snapshot with the election outcome. Let’s call the distribution of those differences sigma_systematic.

In the notation above, the outcome of the election is, by definition,
OUTCOME = SNAPSHOT + sigma_movement + sigma_systematic.

In this framework, if we can understand the sigmas, then we can make a prediction.
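As an illustrative sketch (my own, not the actual PEC code): if the two error terms were independent and roughly normal, they would combine in quadrature, and the probability that today’s margin survives to Election Day reduces to a normal tail probability. All numbers below are placeholders.

```python
import math

def win_probability(meta_margin, sigma_movement, sigma_systematic):
    """Probability the leader's margin survives, assuming the two error
    terms are independent and normal, so they add in quadrature."""
    sigma_total = math.sqrt(sigma_movement**2 + sigma_systematic**2)
    return 0.5 * math.erfc(-meta_margin / (sigma_total * math.sqrt(2)))

# e.g. a 1-point lead, 2 points of expected drift, 1 point of systematic error
print(round(win_probability(1.0, 2.0, 1.0), 2))  # ~0.67
```

Note how the same 1-point lead that looks decisive in a snapshot (small sampling error) becomes a modest edge once sigma_movement and sigma_systematic are stirred in.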

In fairness, Silver has reasonably called out my pre-2012 writings, in which I mistakenly interpreted the sharpness of the Election Eve snapshot as meaning that sigma_systematic was negligible.

I think SBernow is right in pointing out that Nate Silver’s basic claim that your model is wrong, his basic criticism, has to do with your sigma_systematic. My reading of your final 2012 prediction is that there was less than a 1% chance of Romney winning. Nate Silver predicted approximately a 20% chance of Romney winning, based on systematic error that could be correlated across states. It’s not necessarily that his error is right and yours is wrong, but it does seem like you might be underestimating systematic error.

In your 2014 Senate Election Day Model you say “I have assumed that the systematic bias has an average (rms) value of 0.7%.” I find it likely that the value should be closer to 2-3%. You seem to be suggesting now that you have tested this value against past elections, but the description does not sound like that. And if you’ve only tested it against the last 2-3 elections, those could underestimate the possibility of error.

I think you are doing valuable, accessible work, but I think his criticism is valid as well.

Thanks for that. I doubt I’m alone in using election forecasts, and these sorts of debates, as a way to become more numerate. (Humanities guy.) What helps me most (ironically?) is not elided math but when it’s spelled out at some length. We’re not so stupid that we can’t grasp logic and illogic, sometimes.

BTW, from Silver’s write-up I could not tell if he is doing the covariances correctly in his simulations. He puts all his errors into virtual urns and then draws them over and over with replacement. Thing is, if the urn-value covariances are not zero, the probabilities of drawing the empirical values in the urn are not distributed uniformly – the covariances screw up the distribution in interesting ways. It’s an easy (but tedious) correction with so many covariances. I think your swan method – which I do not understand, BTW – might work just fine. (I’m guessing that the swan – or bs – method just takes the 36^2 – 36 covariances and adds a small amount. There are a LOT of covariances in question, and even small values times over a thousand covariances add up.) :-)
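The size of the correlated-error effect is easy to sketch analytically (illustrative numbers only, not anyone’s actual model): for n races with equal error sd sigma and pairwise correlation rho, the variance of the average error is sigma² · (1 + (n−1)·rho)/n.

```python
import math

def aggregate_sd(n, sigma, rho):
    """Standard deviation of the mean of n state-level errors, each with
    sd `sigma` and pairwise correlation `rho`:
    Var(mean) = sigma^2 * (1 + (n-1)*rho) / n."""
    return sigma * math.sqrt((1 + (n - 1) * rho) / n)

# 36 races with 3-point poll errors: independence vs. modest correlation
print(aggregate_sd(36, 3.0, 0.0))   # ~0.5: independent errors mostly cancel
print(aggregate_sd(36, 3.0, 0.25))  # ~1.56: correlated errors barely cancel
```

Even a modest rho triples the aggregate uncertainty relative to the independent-urns picture, which is the whole dispute over systematic error in miniature.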

I have a serious question. With Alaska, GA, and LA not being decided on Election Night, I feel the numbers when everyone goes to sleep will read Dems 49, R’s 46, IF Orman wins in KS. What is your opinion of what he would do? I’m no math genius, but this seems to have a high probability of happening (I give IA, CO, and NC to the Dems and ARK to the R’s). Would Orman bother waiting a month (could he?) to make the Dems the majority? At that point the Dems would only have to win one (most likely AK). I’m just wondering why no one has considered or even talked about this scenario, since he could instantly give one side the majority. You can’t tell me his people haven’t thought of this. Just wondering what you think.

Regardless of whatever sauce or “systemic error” that whichever method uses, it is hard to talk about which one is the “more accurate” method since everything spits out probabilities in the end.

What would be a nice metric, though, is not to compare who takes what state or who wins the presidency, but to measure each method’s “predictions” against the actual margin of victory. While PEC is the only one that presents a poll median that can directly translate into a win percentage for a state, other sites like 538 and NYT have individual state-by-state win probabilities that could be converted into implied vote margins.

In this case, you could also use the differences between actual win margins and PEC’s state poll medians to arrive at a new systematic error, rather than the assumed 0.7%.

So, what would be a good way to compare different (series of) advance predictions? Perhaps assigning each prediction a Brier score, and then averaging those over time? That seems like it could be useful.
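For concreteness, a Brier score is just the mean squared difference between forecast probabilities and binary outcomes (0 is perfect; 0.25 is what you get by always saying 50%). The forecasts below are made up, purely to show the computation:

```python
def brier_score(forecasts, outcomes):
    """Mean squared difference between forecast probabilities (0-1) and
    binary outcomes (1 = event happened, 0 = it didn't)."""
    return sum((p - o) ** 2 for p, o in zip(forecasts, outcomes)) / len(forecasts)

# Illustrative: a confident forecaster vs. a hedged one on the same 3 races
print(brier_score([0.9, 0.8, 0.7], [1, 1, 0]))  # ≈ 0.18
print(brier_score([0.6, 0.6, 0.5], [1, 1, 0]))  # ≈ 0.19
```

Averaging the daily Brier score over the campaign, as suggested, would reward forecasters for being confidently right early, and penalize confident wrongness in proportion.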

In your prediction code (Senate_November_prediction), it looks like you “correct” the standard deviation sd_mm for the “Orman effect” by taking a root-sum-of-squares of the original sd_mm with Orman_offset. However, if I’m reading it right, the formula used doesn’t actually square sd_mm inside the square root (so even if Orman_offset=0, sd_mm would still be changed by this line of code). Is that what you want? (I only see a percent or two change in the result if sd_mm is squared.)

The general idea here was to estimate variation in the Meta-Margin (sd_mm) to predict the future. Obviously that should be sd_mm^2, not sd_mm. Thank you. Gotta fix that.

The next question is how to build in uncertainty about the Kansas Senate race. As you can see, my intent was to combine that uncertainty with sd_mm in quadrature, i.e. assume that poll movement is independent of other states. Now that we’re on the topic…what is the appropriate term to put in place of Orman_offset*Orman_offset? Perhaps Orman_offset*(1-Orman_offset), if it’s a binomial trial…

(Note that I just implicitly put in the idea that 1.0% of Meta-Margin = 1 Senate seat, which holds for the data set so far.)

The best I could do so far was to imagine adding a truncated binomial to represent “Orman uncertainty” prior to that race’s phase change in early September. (“Truncated” because the new “random” variable is identically 0 after the phase change.) Then a rough calculation gave:

sd_mm=sqrt(sd_mm^2 + f S^2 p (1-p) )

with f = 1-O_fraction, S = sensitivity of mm-to-senate seat (1.0%, I think, in your calc), and p is the binomial probability.

The overall mean will be increased by pfS. (Should pS be selected so that the pre-phase-change mean = the post-phase-change mean?) Of course, the biggest increase to sd occurs for p=0.5.
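In code, the proposal above looks like the following (Python, with purely illustrative numbers; sd_mm, f, S, and p are placeholders, not values from the actual m-file):

```python
import math

def adjusted_sd(sd_mm, f, S, p):
    """The proposed correction: add a truncated-binomial 'Orman' variance
    term to sd_mm in quadrature.
    f = fraction of the series before the phase change,
    S = Meta-Margin sensitivity (points per Senate seat),
    p = binomial probability of the Orman outcome."""
    return math.sqrt(sd_mm**2 + f * S**2 * p * (1 - p))

# Illustrative: sd_mm = 0.8 points, S = 1.0 point/seat, half the series
# pre-phase-change. The widening is largest at p = 0.5, as noted.
print(adjusted_sd(0.8, 0.5, 1.0, 0.5))
```

As expected from the p(1−p) term, the correction vanishes at p = 0 or p = 1 and peaks at p = 0.5, where the Kansas race is most uncertain.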

In comparing this model to some sample data runs, I noted:

1) the actual sd was about 10% less than estimated. I suspect that’s because the two series are not really independent: I was adding the binomial only to the pre-phase-change data, where the mm was generally lower.

2) Not just the sd but also the mean are increased by this approach, with the result that the prediction of D+I control probability was actually increased relative to doing nothing: applying your prediction algorithm with Orman_offset =0, I get a prediction of 63% (vs your 70% with a non-zero Orman_offset). That increase makes me nervous…

Anyway, as I said at the start, I’m blundering around here and am probably completely misunderstanding your thinking. Sorry for not having something more useful to submit…