Feeding Karl Rove a bug

November 9th, 2012, 1:40am by Sam Wang

Early on Election Night, the New Hampshire results made clear that the state polls were on target, just as they were in 2000-2008 – more accurate than national polls. At that point it seemed more interesting to watch Fox News for reactions. At first they were filled with confidence of a Romney win. As data came in, a funereal air fell over the proceedings. And as is well known by now, Karl Rove became wrapped up in his calculations and had to be called out by Megyn Kelly.

Do such biases ever help? What about analytical improvements, like the layers added at FiveThirtyEight? Today I report that by a quantitative measure of prediction error, we did as well in Presidential races as Nate Silver, and on close Senate races, we did substantially better – 10 out of 10, compared with his 8 out of 10. Let’s drill into that a little.

For us the keys to success were (a) a high-quality data feed, and (b) avoiding the insertion of biases. Indeed, Mark Blumenthal and Andrew Scheinkman at Pollster.com gave us great data. After that we chose a median-based, polls-only approach to minimize pollster biases.

I will be honest and say that an Election Eve test is not very interesting. Long-term predictions are of greater importance – as well as other ways that aggregation adds value, like tracking ups and downs, as we did. By Election Eve, anyone who is looking at the data honestly can figure out what will happen the next day. Still, let us go along with this week’s media frenzy.

First, the obvious: of the 51 races, one was essentially a coin toss – Florida. Nate Silver, Drew Linzer, and Simon Jackman won the coin toss; Scott Dillon and I lost (though I briefly made a good guess). Is there a better way to quantify this?

One way is to look at our final polling margins, compared with returns.

Whenever a candidate led in pre-election polls, he won. This was true even for a margin of Romney +1% (NC). Evidently state polls have a systematic error of less than 1% – as good as 2008! (Also, like 2008, pre-election polls substantially underestimated actual margins, this year by a factor of 0.8 +/- 0.3. Majority-party voters in nonswing states like to vote – or minority-party voters don’t.)

Since Florida was a coin toss, it is better to examine our state win probabilities, as suggested at Science 2.0. The closer the probabilities are to 1.00, the more confident they are. Probability should also measure the true frequency of an event. If I say a probability is 0.80, I expect to be wrong 1 out of 5 times. Our record of 50 out of 51 (counting Florida as a loss) means that our average probability should have been about 0.98. It was 0.97.
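That back-of-the-envelope calibration check can be written out explicitly. A quick sketch (race counts are from the post; treating each race as an independent yes/no trial is my simplifying assumption):

```python
# If every call carried the same win probability p, the expected number of
# correct calls out of N races would be about N * p. Working backwards from
# the record gives the average probability the record implies.
n_races = 51
n_correct = 50  # counting Florida as a loss
implied_avg_probability = n_correct / n_races
print(round(implied_avg_probability, 2))  # → 0.98
```

That implied 0.98 is what the observed 0.97 average is being compared against.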

This can be quantified using the Brier score, as described by Simon Jackman of Pollster.com. This score is the average of the squared deviations from a perfect prediction. For example, if Obama won a race that we said was 90% probable, that’s a score of (1.0-0.9)^2 = 0.01. If we were only 70% sure, the score is (1.0-0.7)^2 = 0.09. The average score for all 51 races is the Brier score. The Brier score rewards being correct – and rewards high confidence.
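The calculation is simple enough to sketch in a few lines (the function name is mine; `probs` are predicted win probabilities for one side, `outcomes` are 1 for a win by that side, 0 otherwise):

```python
def brier_score(probs, outcomes):
    """Mean squared deviation of predicted win probabilities from actual
    results (1 = win, 0 = loss). 0 is perfect; answering 50% on every
    race guarantees 0.25."""
    return sum((o - p) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# The two worked examples from the text:
print(brier_score([0.9], [1]))  # (1.0 - 0.9)^2, i.e. about 0.01
print(brier_score([0.7], [1]))  # (1.0 - 0.7)^2, i.e. about 0.09
```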

For the Presidential races, the Brier scores come out to:

                                    Presidential Brier score   Normalized Brier
  100% confidence in every result   0.0000                     1.000
  Princeton Election Consortium     0.0076                     0.970
  FiveThirtyEight                   0.0091                     0.964
  Simon Jackman                     0.0099                     0.960
  Random guessing                   0.25                       0.000

We appear to be slightly better than our very able colleagues. The additional factors used by the FiveThirtyEight model include national polls and maybe some other parameters. It seems that these parameters did not help.

A more interesting case is the Senate, where the 10 closest races had these probabilities:

  State           538 D win %   PEC D win %
  Arizona             4%           12%
  Connecticut        96%           99.8%
  Indiana            70%           84%
  Massachusetts      94%           96%
  Missouri           98%           96%
  Montana            34%           69%
  Nevada             17%           27%
  North Dakota        8%           75%
  Virginia           88%           96%
  Wisconsin          79%           72%

Note that a number of these races (Indiana, Montana, North Dakota, Virginia) were races I designated as knife-edge at ActBlue.

I have indicated in red the cases where the win probability pointed in the opposite direction from the outcome (Montana and North Dakota in the FiveThirtyEight column). These are not exactly errors – but they are mismatched probabilities. The Brier scores come out to

                                    Senate race Brier score   Normalized Brier
  100% confidence in results        0.000                     1.000
  Princeton Election Consortium     0.039                     0.844
  FiveThirtyEight                   0.221                     0.116
  Random guessing                   0.250                     0.000

In this case, additional factors used by FiveThirtyEight – “fundamentals” – may have actively hurt the prediction. This suggests that fundamentals are helpful mainly when polls are not available.

Update: I have added a normalized Brier score, defined as 1-4*Brierscore. This is a more intuitive measure. Thanks to Nils Barth. I’ll update this post with more information shortly.
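In code, the rescaling and the values in the tables line up directly (function name is mine; raw scores are from the tables above):

```python
def normalized_brier(brier):
    """Rescale a Brier score so that 1 = perfect (B = 0),
    0 = chance (B = 0.25), and negative = worse than chance."""
    return 1 - 4 * brier

print(normalized_brier(0.0076))  # PEC, Presidential: about 0.970
print(normalized_brier(0.039))   # PEC, Senate: about 0.844
print(normalized_brier(0.221))   # 538, Senate: about 0.116
```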

210 Comments so far

Digging even deeper … for Montana, Nate’s “state fundamentals” appear to have been the reason for his wrong call, but in North Dakota, Nate does not list the polls from Pharos and Mellman at all – the ones that appear to be the basis for your correct call. I knew that Nate weighted polls, but not that he ignored some completely …

Well, it doesn’t matter! And thanks to Act Blue, I contributed to those Democratic victories!

Now it’s time to start writing to the Senators that I helped elect to call out the vital importance of filibuster reform, and how much I’m counting on them to do something about it!!!

Might be an issue of using a different information source for polls. Nate excludes leaked campaign polls (don’t know if this is the case with Mellman, probably not), but at least for Pharos that might have been simple oversight.

Also, this could perhaps be interpreted as an argument against adjusting for house effects, but there are counterarguments to that one.

(1) leave ties alone, i.e. call them ties
(2) adjust house effects
(3) bring in the national trend, which appeared to be moving toward Obama
(4) take my lumps and say that polls-alone gets us 50/51, and move on from this parlor trick where I dropped a card!

I tried many, many times to contribute to the people running against Michele Bachmann and Paul Ryan. I went to their official websites and clicked on the Contribution buttons. In both cases I was directed to ActBlue. I then filled out the necessary info and pushed the appropriate button. I would then get an icon whirring around while my donation was processed — forever and forever. The transaction was never completed.

I tried this five or six times with my primary browser, Chrome. It never worked for either of them.

I then tried it using Internet Explorer 8 and the most recent version of Firefox.

Same story. Impossible to make a donation.

I sent emails to both websites telling them of my problem and asking how I could contribute.

In neither case did I ever get a reply.

In my opinion, both these candidates were so incompetent that they deserved to lose.

I live in Arizona — but I went to the website of Sen. McCaskill in Mo. and of three other candidates for the House scattered around the country. In all four cases I was able to donate without problems.

So what is the matter with Act Blue? In my particular case, at least, it actively kept me from making donations.

How about adding Linzer’s scores? I’d also be curious about your larger take on his approach. It’s just one election, but his model sure looks like an Oracle right now — essentially called the outcome in late June.

Yes, I think you should look at Professor Linzer’s probabilities again, Dr. Wang. For the swing states, where there were lots of polls, his confidence intervals seem to be point-blank bullseyes in every case, particularly Florida, just a gnat’s width off the 50% line.

What are the bars in the figure – one standard error? No bars = not enough polling?

There seems to be something odd going on in that figure – the final poll margins seem to be underestimating the actual ones consistently. ( Discouragement/enthusiasm? Or something else? )

Do the other years have this same effect? Depending on how one is calculating poll-vs-vote error one might get different measures as to the reliability of state polls. Could this account for any of why PEC and 538 disagree on how confident we can be in the state polls? Ironically enough, this would be a “systemic” error that doesn’t show up as a constant bias across all states. I could imagine that this might cause Silver to underestimate the reliability of state polling near the 50% mark?

No bars = no polls, so we had to fill in the previous margin. In the code someplace we put an error bar in to allow calculation of a probability. Since the Meta-Analysis only cares about the probability, which for these states is 0 or 1, the error bar is not a critical parameter.

In regard to the decreased slope – good eye. Yes, it happened in 2008. Follow that link.

I think the difference in confidence between PEC and 538 may arise from the use of national polls over there, since he wrote about the large uncertainties. Maybe state polls too – I don’t know. However, I do not think that estimating probability with great exactitude is in his interest, given that he is trying to appeal to a wide audience that does not mind uncertainty.

This “Brier score” is a very weird measure. The only score that motivates giving your exact probability is the log of the probability you assigned to the final result – higher (less negative) is better.
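A minimal sketch of that log score (worth noting, though, that the Brier score is also a proper scoring rule, so both reward reporting your true probability):

```python
import math

def log_score(p_win, won):
    """Log of the probability assigned to what actually happened.
    Always <= 0; closer to zero is better."""
    p = p_win if won else 1.0 - p_win
    return math.log(p)

# A confident correct call scores closer to zero than a hedged one:
print(log_score(0.9, True))  # about -0.105
print(log_score(0.7, True))  # about -0.357
```

Unlike the Brier score, the log score diverges to minus infinity if you assign probability 0 to something that happens, so it punishes overconfidence far more harshly.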

Is there any standard way to combine something like the Brier score with a notion of predictive power that rewards earlier predictions?

It seems a bit tricky, because you want to reward people who got it right early on, but only if they continued to get it right up to the end.

For example, however they did it, I’m impressed that Votamatic seems to have converged on the final EV result very early on and barely deviated from it over the course of the campaign. Intuitively, that should be given more predictive credence than someone who gyrated wildly but then nailed things the night before the election. [Btw, not a criticism at all of sites that were tracking the current mood of the populace — different measures for different goals.]

There ought to be a Mount Rushmore of election forecasters! Thanks so much for what you guys do! I can’t express enough how thankful I am for this! You and Nate kept me sane throughout this long bleeping election during those times I wanted to just shove my head down my toilet with these damn pundits and media members with their constant Ro-mentum BS!

Thanks to Sam’s Act Blue recommended places to put a buck to best use, I donated (small) sums to just four Senate races. ND, IND, MA, and VA. Sam batted 1.000 for me, and I couldn’t be happier.

(Well, yes, I could, if the O team hadn’t so badly misconceived their approach to the first debate, weakening their upward pull on Dem house chances. But you can’t have everything — not this year anyway.)

David, interesting article in the NYT today on the insider view of debate prep on both sides, how much & many warnings O got of just the effect you (& many others) describe, and O’s failure to take them seriously. The story of the reactions inside the O campaign team while monitoring the first debate live is fascinating in a morbid sort of way.

Dr. Wang, you changed your FL EV estimate because the axis of evil (Rasmussen, Gravis, ARG) distorted the model’s forecast with a few late polls.
You assumed nonparametrics would remove their bias.
But it was asymmetrical bias. Non-gaussian.

Dr. Wang.
PEC was the major influence on my 332 EV estimate. I took all my data from here. But I didn’t change from 332 because Rasmussen set off my cheater detection module.
I made a series of 2×2 payoff matrices that demonstrated Ras actually could maximize payoff by selective cheating because of asymmetrical enthusiasm.
In essence I believed Ras would be wrong enough that FL was actually going Obama.
It would be interesting to see if some sort of cheater detection could be quantified and incorporated into the models.

wheelers, could you provide some more details on your payoff matrices. What sort of “payoff” are you referring to? Affecting the election, things like the PEC model, or something else? What “enthusiasm” is involved? And what is the “cheating”? Is it manipulating the numbers, or do you have something else in mind?

Froggy, I used asymmetrical enthusiasm in the GOP base as one payoff. I took the estimated value from RAND actually (likelihood), because they were sampling a captive population. In essence RAND solved the responder problem by paying $2 for each survey returned.
For another matrix I used the payoff as increased contributions from Republican donors – which probably isn’t independent.
I ran simulations then to repeat the experiment.
Both payoffs privileged cheating over the payoff of maintaining the Rasmussen reputation.
I got the experimental design from Hofstadter’s Metamagical Themas, where he was testing the existence of the Superrational with the Platonia Dilemma.
I mean, it’s just a crude approximation and not a rigorous test.
A more rigorous test would quantify Rasmussen against both other pollsters and events…
And you see… he didn’t have to change his methodology. He failed to capture the cell phone demographics and the Hispanic vote in NV and CO in 2010. There was no market force pressuring him to change. Everyone consumed Rasmussen.
The aggregators just averaged and weighted his stuff and still used it. Wang, Silver, Blumenthal, etc.
But I believe his frequent polls and cavalier treatment of Hispanic and cell demographics (dependent I’m sure) pulled the aggregations off course.

And this is pretty standard classic evo theory of cooperation and cheater detection stuff.
I’m just not as good at explaining as Dr. Wang.
I think Rasmussen had an effect on all the aggregators. They all used his data, and he’s probably the most prolific pollster. In Dr. Wang’s case I just don’t know if the CLT is as proof against cheating as it is against random variation.
;)

I actually think this post (or the presidential part of it) is a bit premature. The final results are not in yet (and in some places, notably Ohio, won’t be in for a couple more weeks). None of the races will be flipped, but the margins will change, probably in Obama’s direction in most swing states.

Why am I mentioning that? Because in terms of margin, the swing state polls actually don’t seem to have been that accurate this year. There was a considerable systematic error in most swing states, but in Romney’s direction. Obama is up by 7 points or so in Iowa and NH. The poll median has him up by 2. With other states the margins are off by less (so far; we should wait), and are off in Obama’s direction in OH (I expect this to change by the time we have the final result, though). But on the whole, this argues for a lower level of confidence, not a higher one.

538 will be a lot closer on the margin than you were once the ballots are counted. Obama went up one point on the provisional ballots, counted late, last time.
Your work is really good, but I do not see the value in the prediction. The Meta-Margin was 2.46 and the prediction 2.2; 2004 repeats itself.
You would have gotten a lot closer to the result by just trusting the math – your strong point.

Jackman writes: “For example, while the averages compiled by HuffPost Pollster and the other polling aggregators were correct in forecasting Obama the winner of the key swing states, HuffPost’s averages understated the president’s victory margin by 2 to 3 percentage points in Wisconsin, Nevada, Iowa, New Hampshire and Colorado (as of this writing, based on the current AP vote count).”

Actually, it is interesting to note that FiveThirtyEight’s margins of victory were closer to the actual results. Taking all the closest states, the average absolute differences from the actual victory margins are smaller than those of PEC, Votamatic and Pollster.
I am keeping a spreadsheet, but as mentioned above, we might want to wait a little for the final margins to be known before drawing too firm conclusions.

Pat – Good work! You predicted my next question 12 minutes before I posted it. When we know the final margins, where will you be posting your analysis? I would definitely like to see it — don’t want to miss it.

I suppose the easiest would be to send a link to a shared Google Doc?
Just a practical question: where do I get the final margins for all states in PEC? (beyond the swing states listed in the right-hand column)

The “Power of your vote” column has all the swing (and a few non-swing) states. The rest are not published anywhere I can see here, but you know the method, and can just calculate them by hand from Pollster charts (where there is any polling at all, that is).

The scale of Brier scores (0 = perfect and guessing 50% for everything guarantees 0.25) is confusing if you’re not used to it – a 0–1 range is clearer. Perhaps clearest is 1 – 4*B, so 1 = perfect, 0 = chance, negative = worse than chance. This “Normalized Brier” accords with intuition: 100% = perfect, 0% = no information, negative = worse than nothing. (From quick look at literature, various normalizations of the Brier score seem used in different contexts, so this seems an ok term.)

Comparison set makes a big difference in Brier score – computing both for whole nation (to show overall uncertainty) and swing states (to show hard-to-tell range) would be interesting. Using a common category also helps one compare scores, to see how Presidential and Senate predictions compare.

Since the Brier score is an average over all predictions, padding the set with safe states improves your score – thus the “whole nation President” and “swing states Senate” Brier scores are not comparable (apples and oranges). Let’s say that 40/50 states were 0%/100% sure – then guessing 50% for the 10 remaining states already gets you B = 0.05.
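That dilution effect is easy to verify with the commenter’s hypothetical numbers:

```python
# Hypothetical race set: 40 of 50 races called with certainty and
# correctly (each contributes 0.0 to the average), plus 10 pure coin
# flips at 50% (each contributes 0.25). Averaging dilutes the hard
# races with the easy ones.
scores = [0.0] * 40 + [0.25] * 10
brier = sum(scores) / len(scores)
print(brier)  # → 0.05
```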

Using normalized Brier score (1–4*B, as above) and using only 10 states for Presidential gives (approximately) 85%, 82%, 80%, while normalized scores for Senate are 84% (Sam) and 12% (Nate). (Presidential swing state scores are slightly better b/c no penalty for not being 0%/100% on other states.) By this measure, Sam’s Presidential swing states and Senate swing states are about the same (and both very good), while Nate’s Presidential swing states are v. good, but Senate only just better than chance.

A scatter plot, as Simon Jackman does, is quite informative, both for a single set of predictions and for comparing two sets of predictions, especially a close-up of the transition region. (Formally, x = win prediction %, y = (2-party) outcome %.)

Should be a sigmoid curve (roughly), crossing at (50%, 50%), with the sharpness of the transition showing the confidence of the predictions; bias shows up if the curve “crosses” above or below the 50% outcome line. Especially interesting data points are:
* Wrong side – correct outcomes are in the NE and SW quadrants (upper-right/lower-left), showing wins for >50%, losses for <50%. Any predictions in NW or SE are (binary) misses.
* Ranking mismatches – when the ranking of probabilities disagrees with the final outcomes, either due to incorrect ranking of red/blueness, or due to particularly strong or weak confidence in a prediction, say due to extensive or missing polling.

Professor Wang – In the Jackman article you link to, he also calculates the Root Mean Square Error and Median Absolute Error for his state-by-state point predictions. How did PEC, 538, and Jackman perform compared to one another using that measure?

It wasn’t a primary goal in our analysis. We put very little effort into estimating margins in nonswing states on the grounds that it had little practical consequence. The numbers are available on this site.

Professor Wang – Thank you for the reply. That is entirely understandable. At the same time, predicting the state-win probabilities may not have been the top priority of all sites. So, using the Brier score should not be the sole or primary way of analyzing the results from different sites. Brier score and RMSE seem to be reasonable ways to look at the projections produced by various sites, to see where their relative strengths and weaknesses are, while taking each site’s priorities into account.
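For reference, the two point-prediction error measures under discussion can be sketched like this (helper names are mine; inputs would be predicted and actual vote margins in percentage points):

```python
import statistics

def rmse(predicted, actual):
    """Root mean square error of point predictions, e.g. vote margins.
    Squaring weights a few badly-missed states heavily."""
    n = len(predicted)
    return (sum((p - a) ** 2 for p, a in zip(predicted, actual)) / n) ** 0.5

def median_abs_error(predicted, actual):
    """Median absolute error; robust to a few badly-missed states."""
    return statistics.median(abs(p - a) for p, a in zip(predicted, actual))
```

The contrast between the two (sensitive vs. robust to outliers) is one reason Jackman reports both.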

As well, I think the issue raises interesting questions for forecasters and those who follow forecasting. Predicting win probabilities, or predicting vote share: which of the two projections should be a higher priority for forecasting models, and why? Why should one be a higher priority than the other? Or is each of the two projections more useful for different purposes: win probabilities more useful for some purposes, vote share more useful for others? Etc.

Professor Wang, it would be great to hear some insights from your perspective. Why do you put a higher priority on win probabilities than on vote share? In your model, do more accurate vote share predictions result in more accurate win probability predictions, or is that incorrect? If it’s correct, wouldn’t that be a reason for giving the two equal priority?

I have a question about the order of states. The median electoral vote this year was located in Colorado, just as it was in 2008. All along, if you had allocated Obama the states that went for Kerry or Gore plus Nevada (all of which Obama won by 9+ in ’08), that would have been 263 electoral votes. So it was clear that he needed Ohio, Virginia, or Colorado to push him over the top.

But according to all reports during the campaign, the Obama team felt most confident about Ohio. For example, this report the day before the election:

“Chicago… feels nearly as certain of carrying Ohio; and that Obama is just a tad ahead in Virginia. As for Colorado… Team Obama believes… too close to call.”

But in the end, Obama won Colorado by about 5, Virginia by about 3, and Ohio by less than 2 (at current count). This is also exactly what happened in ’08, when he won Colorado by nearly 9, Virginia by more than 6, and Ohio by less than 5.

So why were the Obama strategists/pollsters off in their assessment of these states? Part of the Colorado thing might be a mini-Nevada effect, for example the public polls also underestimated Obama, as they did in 08, and the same for Bennet in 10, when polls showed Buck by 3. But still, now we have 2 consecutive election cycles in which Democrats have performed in the order Colorado > Virginia > Ohio. It would behoove pollsters to learn how to poll these states a bit better.

Oh, one partial answer to my own question is that the size of the margin may not be the only factor in assessing probability – the stability of the lead is important too. Maybe the Obama campaign was more certain the lead was durable in Ohio for some reason. Still don’t see why though – it ended up a little too close for comfort…

Yes, I suppose this has to do with the “elasticity” of the state. Same with Pennsylvania: it was virtually uncontested, but ended up with a much closer margin than other battleground states (Iowa, New Hampshire, Nevada, Wisconsin).

With CO and VA, the Hispanic vote sort of suggests itself (take into account that most polls underestimated not so much the Hispanic turnout, as Obama’s margin of victory with it, probably as a result of not doing any polling in Spanish).

But the substantial misses of the polling with IA, NH and WI (as of now) – now that’s more interesting.

Are you going to follow up on your analysis of whether the aggregate House of Reps vote corresponded to Republicans retaining the majority, or whether that was a result of redistricting? If Dems actually won a majority of the aggregate vote, it is very important to communicate that to the chattering class, which persists in reporting the GOP House majority as a reflection of the “will of the People.”

What about the House, Dr. Wang? I seem to recall we were predicting 210 +/- 10. Any idea what the others were predicting? Not sure if one can do a Brier on this one, as I doubt people had probabilities for each seat.

Hi Sam,
Your analysis of Nate versus PEC above is great except, people will only remember 50/50 versus 49/50, not 0.0091 versus 0.0076. I do wish your work got the same level of attention his does since your approach is better.

I have one specific suggestion: I know you prefer not to be underconfident; but Florida was a clear case where you knew the margin was tied, and that the actual vote margin would be less than 50K votes. Why not simply call it tied instead of tossing a coin?

If you had done so, you would have been correct 50/50 states, and would probably get the same kudos that Nate is (deservedly) getting.

I too am seeing a trend line drawn through the actual data that has a greater slope than the ‘perfect prediction’ 45-degree line. So this means that the predictions were actually more accurate in the closer races, and were a bit overly conservative in the less competitive ones, with the conservative-bias error proportional to the real-result margin, yes? Is this an algorithm problem, or a problem with an inaccuracy of polling?

Also, my take-away understanding of the Brier score is that even though PEC didn’t quite call as many states right as, say, Nate Silver did, it gets a better “grade” because PEC made more confident (smaller margin of error) predictions, is that right? This was my feeling about Mr. Silver’s results–that they were a bit overly “hedged.”

Sam, have you done (or will you do) any analysis of the voting of demographic subgroups that appears to have been so decisive in this election? I have the sense that there are some emerging received wisdoms — e.g., the “gender gap” favoring the President — that could benefit from a deeper look ….

Actually, I ended up getting all 9 battlegrounds correct in my final EV Map – but I used a time-honored method for calling Florida for Obama: since it was a statistical TIE, I merely looked at the TREND of the polls over the final 2 weeks – and since they were definitely trending from Romney to Obama, I broke the tie in his favor.

No muss – no fuss! (Plus I used the psychological component of assuming that the political attempt to block Democratic votes would only infuriate Democrats all the more – and thus help Obama in the end.)

Dr Sam, in one of your interviews you said that, because they have so much more money, the campaigns can do even more with polling data than you can. I have always believed that, with Carter supposedly being told he had lost weeks before the election as an example. If that is true how is it possible that Romney was surprised when he lost? Did they lie to him? Was his analytic team incompetent?

I was pondering this. The only thing I can think is that they were gaming out the most hopeful scenario, and began to believe it themselves. Ultimately, their polling shop had to be run by a few people. If they set up an internal conversation where they mainly trusted one another, then they might become impervious to external criticism – especially if they mistook it as being driven by partisanship as opposed to data. That’s what I mean by motivated reasoning.

Excellent work! Thanks for all the hard work. Regarding motivated reasoning, wouldn’t that also explain many people looking at PEC, 538, etc. to feel confident about the Democratic victory? Of course, the candidates have to have their pollsters provide the right information – otherwise, there is no more value in having a team do the polling analysis than in watching FIXed news!

I’m guessing that his pollster perhaps might have known Romney was losing but did not tell him because they knew how committed he had been for 7 long years and they did not want to crush him or they were afraid, from long experience with CEO types, that bad news causes the death of the messenger. From all reports, Romney was not at all prepared for the results.

“If that is true how is it possible that Romney was surprised when he lost? Did they lie to him? Was his analytic team incompetent?”

This may not be all that different from what others have said, but I think they made up their minds about a turnout model before they ever even started polling. They couldn’t believe they were losing because they couldn’t believe that Obama voters were actually going to vote.

To be perfectly clear, you must admit that 538 had a bit more uncertainty because it incorporated some chance of systematic bias among all the polls, one way or the other.
So, since in this election there was no such bias (state polls provided the best information, and you used it without including that extra uncertainty), your result is better by Brier score. Had it been an election with a systematic bias, it would have been quite a bit worse.
Not to diminish your modeling – I’m a big fan of both models.

The campaign I feel bad about, which everyone seemed to miss, was the NV Senate race, with Berkley (D) vs. Heller (R). This also came down to the wire, but no one highlighted it.

I feel slightly guilty, because that race was the one where I drew the line (I spent a lot of money on out-of-state elections for Senate and HoR; not much compared to Adelson’s millions, but proportionately a lot more than I probably should have). If I had had any suspicion that she would come so close, I would have tossed in a little more.

I give you 51/51, Sam. You said Florida would be very close, and it was. You admitted that calling it red was a guess, but you made it clear that your real call was just that it was very close. Close enough for me.

As for Nate’s “fundamentals,” I think they bake in the stereotypes people have about states. “Well, Montana is one of those big mountain west states. Of course it’s red by nature.” As you just demonstrated, ’tis better to kick those kinds of assumptions to the curb and stick to the polls.

Dr. Wang, in your jackaroyd appearance, you said something that leads me to believe you feel the GOP hasn’t really taken away any lessons from this –
Do you think there is going to be an intraparty war between the tea party faction (fundamentalists) and the reformers?
And what are your tea leaves for what will happen in 2014?

Sam, is it just possible that the Obama (and Romney) campaigns were monitoring your and Nate’s blogs and coming to the same conclusions? Or even possible that they simulated your software for their own private polling data?

I must say the Obama campaign team exuded confidence well before polling began (the “I will shave my moustache” comment), and Cutter wore a satisfied smirk when PA was declared for Obama on election night.

Sam, regarding your reply to my question above about the increased inaccuracies in the results for the higher-margin races: you reply that it is polling inaccuracy, but that doesn’t totally explain what I’m seeing in the Pres-margins-returns-2012.jpg graphic. If the higher-margin races simply had less accurate polling results due to less polling, I’d expect to see a 45-degree trend line implied by the numbers, but a broader vertical scatter of the points at the high-margin ends, equally distributed above and below the 45-degree trend line. Instead I see a discernibly consistent trend line of higher slope than the 45-degree line. Why would the polling at both high-margin ends of the graph have consistently too-conservative results?

Otherwise, if you look at “probability”, you have so many states at the 99% level that it becomes noise in the data. I think there should be a big difference between calling a state at 100% with a 20-point margin of victory and with a 10-point margin of victory, even if both ultimately are a 100% probability of a win.

I think #drunknatesilver is punishment enough for Nate.
wow, I loved this.

“Wang, the Princeton professor, believes pundits and computer-aided analysts can coexist.
“It’s possible to be Homer and write about the wine-dark sea,” he said. “But sometimes you want the guy with the thermometer.”

It’s possible to be an uberl33t Poll Jedi and quote TS Eliot and Homer.
Third culture intellectuals FTW.

I can understand the skepticism. I generally trust Nate because I understand how his model functions. Sure, he could tweak something on the back end that might have the appearance of normality, but then it comes down to whether he is a trustworthy person and whether the integrity of the NYT adds to that credence.

On the other hand, Nate is still under the purview of the NYT editors, and although his model might be safe from their reach, his editorializing is not.

Wheelers cat, if we’re talking about pulled posts, let’s not neglect Dr. Wang’s mid-October post on Nate Silver’s error bars, which was only briefly online before it disappeared, never to be seen again. (I have a copy on my computer at home, should anyone be tempted to try and deny its existence.)

When I wrote, “should anyone be tempted to try and deny its existence,” I certainly didn’t mean that Dr. Sam would do that — he’s far too much of a stand-up guy for that sort of behavior. There was actually some good stuff in that “lost” post regarding sources of error, things that might later have made their way into other posts.

Hmmm, I actually thought I recycled most of that stuff in other posts. It’s ok – remind me (and everyone) of what it said. I have been pondering whether it’s of interest. If I recall, some of it was fairly nerdy inside baseball…

Here is a link to a spreadsheet I made, comparing for each state the margins given by 538, PEC, Pollster and Votamatic with the actual (preliminary) result.
The average error for the 10 closest states is given below the table.
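For anyone who wants to redo this calculation, a quick sketch of the per-aggregator error metric (the state names and numbers below are invented placeholders, not the spreadsheet's actual figures):

```python
# Mean absolute error of one aggregator's projected margins vs. returns.
# Margins are Obama minus Romney, in percentage points; values illustrative only.
projected = {"FL": 0.0, "OH": 2.9, "VA": 1.3, "CO": 1.5}
actual    = {"FL": 0.9, "OH": 3.0, "VA": 3.9, "CO": 5.4}

errors = [abs(projected[s] - actual[s]) for s in projected]
mae = sum(errors) / len(errors)
print(f"mean absolute error: {mae:.2f} points")
```

Running the same loop once per aggregator, restricted to the 10 closest states, reproduces the averages below the table.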

Is it accurate to say that PEC was more accurate in projecting probabilities, by the Brier score, and 538 was more accurate in projecting the vote share & margin of victory?

If so, this raises an interesting question. Which is more important in assessing the results produced by a projection model: the accuracy of the probabilities, using a Brier score, or the accuracy of the vote share/vote margin? Which one, and why?

Or are they equally important? Or is each of the two projections more useful for different purposes: more accurate probabilities more useful for some purposes, more accurate vote shares/margins for others? Etc.
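For concreteness, the Brier score mentioned above is just the mean squared difference between the stated win probability and the 0/1 outcome. A minimal sketch (probabilities and outcomes invented for illustration):

```python
# Brier score for a set of win-probability forecasts.
# Each pair is (forecast probability for the leader, 1 if the leader won else 0).
forecasts = [
    (0.92, 1),  # a confident call that came through
    (0.50, 1),  # a coin-toss state
    (0.99, 1),
    (0.80, 0),  # a miss
]

brier = sum((p - outcome) ** 2 for p, outcome in forecasts) / len(forecasts)
print(f"Brier score: {brier:.3f}")  # lower is better; 0 is perfect
```

Note that the Brier score rewards calibrated probabilities, while mean margin error rewards accurate vote shares — a model can do well on one and worse on the other.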

It seems so, though admittedly the gap between an average (absolute) error of 1.89% for PEC and 1.46% for FiveThirtyEight over the 10 closest states is not that huge.
In general, it looks like all aggregators (i.e., the polls) missed in the same direction: they mostly underestimated Obama’s support in the swing states, with the notable exception of Ohio (and possibly North Carolina).

Pat – Sorry, I only took a hurried look at your spreadsheet before going to work. I see you already did a fifty state calculation.

I wonder what it was in the 538 model that made it more accurate on vote share. My first hypothesis on what made the biggest difference would be the use of “state fundamentals” particularly for non-swing states, where there was less polling data.

Yes, this is especially clear in non-competitive states like Hawaii or Tennessee, where state polls alone were probably rare and missed the final margin by quite a bit. Nate’s fundamentals apparently helped in those cases.
It also seems to have helped a bit (to a smaller extent) in swing states, but I’m not sure why.
We may just have to wait a little longer for the final results to come out.

You guys all did great, both because of high-quality poll data and because you did not let your own bias influence your science. Will the same pollsters be good in 2016, or is it stochastic? Did Scotty Rasmussen have a finger on the scales, or just bad luck? And will the focus on poll aggregators in 2016 feed *forward* into the results because of anticipatory actions by the pollsters themselves?

I do not have any faith in Intrade as predictive or as a “performance benchmark”. But I do believe in competitive performance, bidding theory, and evo theory of cooperation.
Rove and Rasmussen are going to see their market value fall.
Precipitously.

Sam, correct me if I am wrong, but I believe your model produced a predicted vote-share by state as well as standard errors about that estimate (which is ultimately what was used to create the win probabilities).

Have you checked yet how the standard errors on the state-by-state vote shares performed? For example, was the vote-share for Obama within the 95% confidence interval about 95% of the time?

I ask because a major concern with your model was that you were understating the uncertainty due to bias in state polls (it’s clear from the result there was no large systematic bias, but there may have been idiosyncratic bias your model missed causing it to overstate the certainty of the result).
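The coverage check I have in mind is simple to run once the final shares are certified — something like this (the intervals and actuals below are invented, and I'm assuming a normal 95% interval of ±1.96 standard errors):

```python
# Calibration check: did the actual vote share fall inside the model's
# 95% interval roughly 95% of the time? All numbers illustrative.
intervals = [  # (predicted share %, standard error, actual share %)
    (50.1, 1.5, 50.0),
    (52.0, 1.2, 51.0),
    (47.5, 1.5, 44.0),  # actual falls outside the interval
    (55.0, 1.8, 56.1),
]

z = 1.96  # half-width of a 95% normal interval, in standard errors
covered = sum(1 for mu, se, actual in intervals if abs(actual - mu) <= z * se)
print(f"coverage: {covered}/{len(intervals)}")
```

With 51 races, coverage well below ~48 of 51 would suggest the stated uncertainties were too small.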

(1) The one significant difference between Dr. Sam and Nate was in Montana. Nate’s site shows that the polls he used favored Tester, but his model also includes the state partisan tendency, and that tipped it towards the Republican. Nate was wrong this time.

(2) A broader issue: Oddly, Sam and Nate may be doing a disservice by bringing these accurate predictions to public attention. If it were widely known that Obama was a 95% favorite to win the election, then people might not have stood for hours in line to vote – and that might have caused Romney to win! We may be better off with ignorant pundits, or with a moratorium on polling during the final week.

As one who has been interested in election modeling for a long time, I first became interested in what Rasmussen was doing back when he first started. Initially I was very curious as to how his robocall methods would fare, since he was able to get reasonably decent sample sizes.

But, as time wore on, I discovered that he was way off on individual states, particularly in 2000. As I studied his methods more, I found that he was not using acceptable weightings (i.e., his R/D/I assumptions versus demographic stats.)

Now we have the fact that his robocall methods are not supplemented with cell phone contacts. In short, his work is biased and not reliable at all. I have found that while he got lucky on at least one national percentage call (2008), he is awful for state calls. Nate Silver has already written about this. Moreover, I suspect that his robocall methods have problems that bias his results. I cannot nail this down, but that is what I think. Perhaps some of you who are much more astute than I am can shed some light on this.

I would very much like to think that the market will censure him, but I am afraid that he will always have a right wing R segment that will like to hear his results —- not all that different than those same people listening to Rush Limbaugh. Intentional or not (I say it is intentional) he leans R and panders to those who want to hear that.

In conclusion, I view Rasmussen as a total disgrace to statistics. But I fear he will still be around for the same reasons that Limbaugh is.

but everyone still used him.
the number of polls dropped from 1700 in 2008 to 1200 in 2012.
ALL the aggregators were held hostage to regulatory capture and the cartel of red-house-effect pollsters.

let’s say Silver refuses to use Rasmussen data in the future. The NYT will say that is “unfair”.
And it’s inefficient.
We just need to understand how to accurately remove error from poll houses that exhibit asymmetrical political bias.
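one crude way to estimate a house effect, for the record: compare each house’s polls against the median of all contemporaneous polls of the same race, and treat the difference as that house’s lean. (the houses and margins below are invented, and this ignores timing and race-by-race matching, so it’s only a sketch.)

```python
# Crude house-effect estimate: each house's median margin minus the
# median of all polls. Positive = leans toward the leading candidate.
from statistics import median

polls = [  # (house, margin in points)
    ("HouseA", 2.0), ("HouseB", 3.0), ("HouseC", -1.0),
    ("HouseA", 1.5), ("HouseC", -2.0), ("HouseB", 2.5),
]

overall = median(m for _, m in polls)
house_effect = {}
for house in {h for h, _ in polls}:
    own = [m for h, m in polls if h == house]
    house_effect[house] = median(own) - overall

for h, e in sorted(house_effect.items()):
    print(f"{h}: {e:+.2f} points relative to the field")
```

subtracting each house’s estimated lean before aggregating would let you keep every poll without refusing anyone’s data.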

Since Nate weights his polls, he doesn’t have to refuse to use any poll — that is his basic model anyhow. His predictions were still pretty solid overall. It’s not like there aren’t left-leaning polls, which is why Sam’s median works. My real issue with 538 is that Nate now writes like he has a word quota, and the paywall. Every once in a while I’d have to ‘toss my cookies’ to continue. Not a big deal, but annoying.

I agree with many comments stating that any evaluation should be delayed until all vote counts are certified.

I also think that any evaluation of a model’s success should be done by an independent third-party. Someone with the statistical knowledge but with no interest in demonstrating that one model is “better” than another. I think Andrew Gelman at Columbia would be an awesome person to do this. It seems some other commenters are doing this now by building a spreadsheet.

It would also be great to see an evaluation that included past predictions. Not sure how to set this up, but some metric to evaluate a model’s prediction at -6 months, -3 months, -1 month, and -1 day from election. For instance, Votamatic had a steady prediction of 332 Obama EVs, plus or minus a few bumps, for a very long time.