Guess the beans in the jar!

November 24th, 2008, 1:00am by Sam Wang

Are you any good at those contests in which you guess the number of beans in the jar? If so, I sense an opportunity in the Minnesota recount. The first entry comes from FiveThirtyEight: Franken by 27 votes. Several commentators have linked to it with great credulity.

This seems like a classic setup for biased assimilation. If Franken wins by anywhere from 10 to 50 votes, fans will love it. If not, then people will remember the disclaimers. So – let’s all get into the game. Be a prognosticator for a day!

A statistically responsible move in any model is to calculate the uncertainty. What is it in the case of the 27-vote prediction? At least +/-200 votes.

Start by reading the post. A summary: As the Minnesota recount progresses, some precincts have more challenged ballots than others. Counting the challenged ballots would favor Franken slightly, perhaps enough to tilt the election.

The analysis is predicated on the idea that there’s a systematic relationship between the number of challenges per precinct and the net effect on the Franken-Coleman margin:

…the fewer the number of challenged ballots, the better Franken is doing, and the higher the number of challenged ballots, the worse he is doing; the relationship is in fact quite strong.

Then the idea is to estimate the net change in support for Franken and Coleman:

We can address this phenomenon more systematically by means of a regression analysis.

The data (precinct-level, I hope) are fitted to an eight(!!!)-parameter model that takes into account all the challenges. After setting the six parameters that relate to challenges to zero, this formula remains:

franken_net = t * 8.922 – 3.622

where franken_net is the net gain that comes from an uncontested recount, t is Franken’s initial vote fraction, and 8.922 and 3.622 are the two remaining fit parameters. When t is plugged in, the result comes out to a net projected gain of 242 votes for Franken. Since the initial lead was Coleman by 215 votes, this would lead to a 27-vote victory.

But there’s a problem. Fit parameters always have uncertainties. In this case, the uncertainties are not given. Let’s assume some modest uncertainties. For example, what if the values are 8.922 +/- 0.5 and 3.622 +/-0.5? Running through the calculation presented, this makes the net projected gain anywhere from 26 to 458 votes. The final result would then be anywhere from Coleman winning by 189 votes to Franken winning by 243 votes. Or, to put it another way, a more accurate prediction would have been “Franken by 27 +/- 216 votes.” So 27 is basically a random guess.
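For anyone who wants to check the arithmetic, here is a minimal Python sketch of the last step only. It takes the 26-to-458 range of projected gains quoted above as given, since the precinct-level inputs needed to re-derive that range from the fit parameters are not reproduced here. The function name is made up for illustration.

```python
# Coleman's lead in the initial count, as stated in the post.
INITIAL_COLEMAN_LEAD = 215

def final_margin(franken_net_gain):
    """Franken's final margin in votes; a negative value means Coleman wins."""
    return franken_net_gain - INITIAL_COLEMAN_LEAD

# Low, central and high projected net gains for Franken, taken from the text.
low = final_margin(26)
central = final_margin(242)
high = final_margin(458)

# Half the width of the range is the "+/-" attached to the central guess.
half_width = (high - low) / 2

print(f"Franken by {central} +/- {half_width:.0f} (range {low} to {high})")
```

The central value sits exactly halfway between the two ends, which is why the range can be written as a symmetric "27 +/- 216."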

Well, let’s all play. I’ll guess as follows. Model 1: Assume that precincts where 0-2 ballots were challenged reflect the likeliest change after challenges are resolved. They represent a net of 91 votes for Franken (28+31+32), or 91/(2233+419+154)=0.032 vote per precinct. If the remaining 133+59+26=218 precincts perform similarly after challenges are resolved, they will yield a total of 7 more for Franken, for a net gain of 91+7=98 votes. Model 2: Assume that total net gains in high-challenge precincts are similar to low-challenge precincts, i.e. Franken +91, for a total of 91+91=182 votes. Subtracting these gains from Coleman’s initial 215-vote lead, the range of final margins is Coleman by 33 to 117 votes. If you want an exact guess, I’ll go with Coleman by 117.
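The two back-of-envelope models can be written out in a few lines of Python. This is only a sketch of the arithmetic above: the numbers are the ones quoted in the text, and the variable names are invented for illustration.

```python
INITIAL_COLEMAN_LEAD = 215

# Net Franken gains and precinct counts for the three low-challenge groups
# (precincts with 0-2 challenged ballots), as quoted in the text.
low_challenge_gains = [28, 31, 32]
low_challenge_precincts = [2233, 419, 154]
# Precinct counts for the remaining, higher-challenge groups.
high_challenge_precincts = [133, 59, 26]

gain_low = sum(low_challenge_gains)                # 91 votes
rate = gain_low / sum(low_challenge_precincts)     # about 0.032 vote per precinct

# Model 1: high-challenge precincts gain at the same per-precinct rate.
model1_gain = gain_low + round(rate * sum(high_challenge_precincts))
# Model 2: high-challenge precincts gain the same total as low-challenge ones.
model2_gain = gain_low + gain_low

coleman_margin_model1 = INITIAL_COLEMAN_LEAD - model1_gain
coleman_margin_model2 = INITIAL_COLEMAN_LEAD - model2_gain

print(f"Model 1: Coleman by {coleman_margin_model1}")
print(f"Model 2: Coleman by {coleman_margin_model2}")
```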

Wow, that was fun! Speaking of biased assimilation, maybe now I’ll get some links from right-leaning sites.

Enter your own guess in comments. Give reasons if you have them.

P.S. In comments, the topic arises of the degree to which challenges represent an attempt to throw the credibility of the recount into question. Perhaps relevant, via electoral-vote.com, are Minnesota Public Radio’s images of challenged ballots. Aggressive challenges seem to have been made by both sides.

46 Comments so far ↓

I think this highlights a cynical and calculated risk by Nate. I think he is counting on media ascertainment bias. And I think it is a shrewd move, though as an academic, it raises all kinds of ethical flags for me. I think the modeling is just unnecessary obfuscation to give the easily dazzled something concrete (and difficult to understand) to hang their credulity on. IMO, the model isn’t, strictly speaking, a terrible one, but it is simply more complicated than the evidence suggests is warranted.

If Nate is ultimately close to being right by chance, then he will be elevated to a god of political statistics. If he is even correct in predicting a Franken win but is off on the magnitude, he’ll still get plenty of credit. If Coleman only ekes out a win, then Nate’s hemmed, hawed, and hedged enough for most people to give him a pass, even if he doesn’t get any kudos. I think the only possible risk for Nate’s reputation is if Coleman pulls out a comfortable win.

In any event, we could have concluded much the same without any of the recount data after election day. And we could be even more certain if we could get a good handle on how many ballots went from “disqualified” to “counted”.

Of the limited information that we have, there are general indications that recounts would, if they had a systematic favorite, slightly favor Dems. (Poorer urban districts with older, less well-maintained machines, poorly staffed precincts with less clear instruction/assistance for voting, more Dem first time voters, and let us not forget the Lizard People.)

Anyway, given the extreme closeness of this race and reasonable error rates by machines as well as by voters, it isn’t unreasonable to call it a near tossup, putting Nate in a good position to win regardless of the model he chooses to justify his risk. If there are only 10,000 undervotes that can be recovered in the recount, then those votes need only be sampled from an underlying binomial with parameter p=0.512 in order to give Franken even money at recovering 240 votes in the recount.

Which is just another way to reiterate what you’ve already said. The thing is damn close, and the actual parameters of interest (the binomial p and n) are obfuscated by challenges. I’m not convinced that those challenges even follow something that can be estimated by regression, and even if they could, as you point out the error bars are rather large.
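The binomial remark above can be checked with a short Python sketch. It assumes, as a simplification not stated in the comment, that every recovered ballot goes to either Franken or Coleman, so the net gain for Franken is (Franken votes) minus (Coleman votes). The function name is hypothetical.

```python
import math

def net_gain_stats(n, p):
    """Mean and standard deviation of (Franken votes - Coleman votes)
    when each of n recovered ballots goes to Franken with probability p
    and to Coleman otherwise."""
    mean = n * (2 * p - 1)
    sd = 2 * math.sqrt(n * p * (1 - p))
    return mean, sd

mean, sd = net_gain_stats(10_000, 0.512)
print(f"expected net gain: {mean:.0f} votes, standard deviation: {sd:.0f} votes")
```

Under these assumptions the expected net gain is 240 votes, which is why p=0.512 gives roughly even odds of recovering 240. The spread from sampling alone is about 100 votes, before any uncertainty in p or n is considered.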

On the more constructive side, does anyone have any better ideas of how to estimate how many ballots can go from “undervote” to a Franken or Coleman vote? And better yet, would anybody have a good way to guess at what p those were sampled from?

I thoroughly enjoyed this post. When I first saw Silver’s projection, I was laughing so hard. Even more incredible was the number of responses, revealing how many people took it seriously.

It’s come to the point where Nate Silver can post a flaky (to say the least) analysis and get away with it. He’s amassed a huge number of fanboys who know nothing about precision (or stats, for that matter), and he’s managed to get enough media exposure to gain credibility.

He hasn’t attempted -at least publicly- to assess his model, given the election results. In the end he was way off (compared to PEC and electoral-vote). Remember all this ‘the race will tighten’ and justifications for random tweaks that he was making? Why doesn’t he critically look at his own methods? Well, I guess he isn’t an academic.

However, I was seriously disappointed -from an academic viewpoint- with Sam’s post-election assessment (see the AAF’s excellent post – Nov 19, 2008 at 12:47 pm in How meta-analysis did, 2004-2008). The only justification for Sam’s assessment is that, while he is a top researcher and his methods are therefore much more robust, the internet world is very different from the academic world. There’s no scrutiny, no peer-review process and success depends on one’s ability to promote oneself to the public.

In any case, this site will continue to be my main source for polling data analysis. It is (arguably) the most accurate, transparent (the code is available for all to see), and most scientific. It never fell for the narrative fallacy which is the most annoying aspect of fivethirtyeight.

JJtw: “I think this highlights a cynical and calculated risk by Nate. I think he is counting on media ascertainment bias. And I think it is a shrewd move, though as an academic, it raises all kinds of ethical flags for me. I think the modeling is just unnecessary obfuscation to give the easily dazzled something concrete (and difficult to understand) to hang their credulity on. ”

I think this assessment is overly harsh, and the explanation is much simpler:

Nate needs content.

He is still working out how to keep his site’s traffic up when there is no imminent election. His calling card is complicated statistical modeling. It’s what his fans want, and demand. So, at least for now, he is going to be modeling everything he can get his hands on that might interest his readers. Like talk radio and foxnews, you can’t keep your audience by saying things like: “The issue is unclear and can be seen many different ways. We’ll just have to wait and see.”

I don’t see this as particularly unethical; it just highlights the different missions of his site versus others, including this one. So long as he is at least trying to give a reasonable statistical take on whatever he is addressing, I think he’s within bounds. None of which means that this particular analysis of his is worth much, just that I don’t see cynicism so much as customer service.

He does say that “The error bars on this regression analysis are fairly high, and so even if you buy my analysis, you should not regard Franken as more than a very slight favorite.” Of course, media that quote him tend to skip that part (which, in any case, I think he should have emphasised more). Wouldn’t it be interesting if he announced those error bars, and they turn out to be +/- 216 votes?

He did say one thing that I think is completely correct, and ought to be what the public demands: the default position, or interim count, that the state announces treats every challenge as assumed to be successful. This creates a powerful incentive, for PR reasons, to issue a huge number of challenges.

If the state instead gave its interim counts based on each precinct’s election judge’s initial ruling, we would have a pretty good picture of what — unless the judges are grossly incompetent — ought to be very very close to the final result in those precincts.

Also, the state ought to be posting images of every challenged ballot.

JJtw – One way to think about overvotes/undervotes is to calculate the sum (# Challenges)*n+dFranken+dColeman, where “dFranken” and “dColeman” are the columns in the table, and are negative numbers. This sum gives the minimum number of new votes that must have been allowed in those precincts.

It’s hard to say what’s going on in those last 26 precincts because the number of challenges can’t be estimated from the table. It’s probably similar to the 5-9 case.

So in total, there were at least 650-700 new votes that were allowed without challenge. A total of about 1900 challenged ballots in 791 precincts need to be resolved.

Also, note that the number of new votes per precinct is highest where the challenges were densest. This suggests to me that there were some real problems in those precincts. Perhaps the challenges are not capricious.

I agree that fivethirtyeight needs content and the prediction has enough qualifiers and hedges such that it’s nothing more than an educated guess. The actual post is not claiming to be much more than that. It’s a little distressing to see the guess getting reported as actual news, though.

I went ahead and bought the book anyway, so no need to send it to me when I hit the bullseye with my pick.

Franken by 1. (That’d be my vote then, it’d be nice and glorious, even though Al stole a girl from me back in the early 80s in Hollywood.)

However, I’d like to ask– is this site going to become just a midrash on Nate’s site? I would still visit it, I guess, but my advice to Sam would be not to get sidetracked by the annoying success of others…

CJC – Yeah, sorry about the midrash-iness. I swore I wouldn’t go down that road. But somebody wrote asking about it. I somewhat regret responding. But now that I have, at least let’s try to make it fun. (Al stole a girl from you???)

There’s also an issue that there is a decline in interesting poll-y things to comment on. In this respect AAF has hit the nail on the head.

Obviously it’s time to move on to other interesting topics! I love error bars as much as the next guy, actually much more, but surely there is something else to say…

I guess Franken by 215 because that is how much Coleman was ahead in the initial count so it seems somehow appropriate, if irrational, for it to go exactly that far in the other direction. Also, I can guess just as well as the next guy. I’ll be waiting for my book. I want a signed copy. :)

For my arbitrary guess, I tried to think of a good physical or mathematical constant…many of them are too small (pi, e) for a vote margin, but then it hit me: particle physics defines the fundamental nature of the universe, and the minnesota recount must obey the laws of physics. Thus, I go with Franken by 1/(fine structure constant) ~ 137. This means, of course, that Feynman diagrams can be used to predict senate races; a possible blog topic for the underused particle theorist (of which there must be many, until the LHC comes around).

By the way, completely ignoring or under-emphasizing the error in predictions is extraordinarily common in the baseball statistics world, whence Mr. Silver originates.

In regard to error analysis, many things make sense now, especially since it seems unthinkable that he would not understand this. As opposed to my origin in physics, where uncertainty is well loved. I have even been known to calculate the uncertainty of an uncertainty. But I thought he had a background in economics. Don’t they do error analysis there?

I think the Silver analysis was a reasonable attempt to make an educated guess, but it was poorly contextualized. What he should have done was report the Hotelling regression bands (which almost certainly exceed 27 votes at any reasonable level of significance) and clarify that that means the center is completely meaningless.

He would win a lot more respect in my eyes if his prediction had been “Franken to lose/win by -23 to 73 votes with 95% confidence” or something along those lines. Accurately predicting the margin would be quite worthwhile, I believe.

I agree – Nate’s predictions and discussions are entertaining, but I was a bit dumbfounded that he actually boiled that mess of half-baked assumptions down into a number – and put it in the headline! 27 +/- 216 indeed.

Particularly ironic is the fact that Nate has (probably) already managed to affect the recount. When he made a post suggesting that the winner might be the campaign that challenged the most ballots, both campaigns immediately started challenging more ballots! Never mind that he soon contradicted himself in a later post … or that this latest prediction is now relying on the very same number-of-challenges variable that he himself influenced!

Honestly, at this point it’s an utter mess.

Me, I predict that they *both* win, getting an equal number of votes. Plus or minus an uncertainty of 250. I figure that gives me the best odds of being within 50 votes of the final tally, and guarantees that I will at least have picked the winner correctly. ;-)

I’ll go with Coleman by 30. My reasoning is as follows: Sam gave the range 33-117 for Coleman, but picked 117. Given the uncertainties at play it seems silly to worry about more than one significant figure, so I’ll round the lower value of 33 to 30. While I would much rather have Franken win, if Coleman wins I’d at least have a chance at a consolation prize.

He does give some indication of error bands — The t-stats are listed as 2.89 on t and 2.36 on the constant. Although he doesn’t give us the full covariance matrix, we can at least tell that the standard error of t is about 3 (8.92/2.89) and the standard error of _cons is about 1.5. Far larger than your assumptions of 0.5 each.

Of course, the covariance will likely reduce the overall uncertainty in the point estimate somewhat.
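The standard errors quoted above follow from the usual relation t-statistic = estimate / standard error. A minimal Python sketch, using only the coefficients and t-stats mentioned in this thread (the function name is made up, and the covariance between the two coefficients is ignored because it was not reported):

```python
def standard_error(estimate, t_stat):
    """Standard error implied by a coefficient and its t-statistic."""
    return abs(estimate / t_stat)

se_slope = standard_error(8.922, 2.89)    # coefficient on t
se_const = standard_error(-3.622, 2.36)   # constant term (_cons)

print(f"SE of slope:    {se_slope:.2f}")
print(f"SE of constant: {se_const:.2f}")
```

These come out to roughly 3.1 and 1.5, several times the +/- 0.5 assumed in the post.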

Why didn’t he bother using the built-in post-estimation testing functions in Stata to give an uncertainty band on the estimate?

I was reading the comment thread on his post. There are some sophisticated criticisms. I guess by now geeks are the ones left reading.

To calculate the uncertainty (u’) of an uncertainty (u), just remember that u itself is a variable that can be sampled. This comes up, really it does. For example, the SEM of a weekful of polls is often less than expected from sampling error. How often does that occur, and what should one do about it? For the terminal nerds, An Introduction To Error Analysis by Taylor is a good place to start on such things. One of many fine leisure readings in my house.
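As a rough illustration of the uncertainty of an uncertainty: for N independent, normally distributed measurements, the fractional uncertainty in the estimated standard deviation is approximately 1/sqrt(2(N-1)), a standard result covered in Taylor's book. The Python sketch below assumes that idealized setting, which real polls only approximate, and the function name is hypothetical.

```python
import math

def fractional_uncertainty_of_sd(n):
    """Approximate fractional uncertainty in a standard deviation
    estimated from n independent Gaussian measurements."""
    if n < 2:
        raise ValueError("need at least two measurements")
    return 1 / math.sqrt(2 * (n - 1))

# With 9 measurements, the estimated spread is itself uncertain by about 25%.
print(f"{fractional_uncertainty_of_sd(9):.0%}")
```

With only a handful of polls, then, an observed spread that is somewhat smaller than the expected sampling error is not by itself surprising.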

Stats Guy – The error might be three times larger? Now that is just sad. I believe it could be made smaller by removing some of parameters #3 through #8. Or by constraining the fit. Put that in your spline and smoke it!

I’m picking Franken by 13 in the final official recount before the lawsuits. But I really would prefer to see a tie as maybe the best outcome, because I think neither side would sue over a coin toss. (Although this coin already seems to have landed on its side.)

This is an awfully petty critique, Sam. The point of the analysis is that there is a strong relationship between the fraction of challenged ballots in a precinct and the extent to which the state’s reported results have tended to favor Coleman. This relationship implies that the running totals provided by the Secretary of State and the Star Tribune may be misleading, and that Coleman may not in fact be the favorite to prevail in the recount, even though he has a nominal lead.

As far as things are headlined and disclaimed: (1) in fact, they ARE disclaimed, e.g. I write that “even if you buy my analysis, you should not regard Franken as more than a very slight favorite”; (2) I’m not writing principally to an academic audience, and (3) the dataset I’m working with is publicly available, and my methodology is both fairly trivial and fully disclosed, and could be recreated (or improved upon!) by you or probably anyone else reading this in no more than 20 minutes.

I predicted Franken by 250 on Nate’s board, from a very crude linear fit to just the first dribs and drabs of data back before the challenges exploded. In the first couple days, the excess of Coleman’s challenges over Franken’s obviously represented the number of thoroughly bogus challenges that Coleman is bound to lose: that was the basis of my projection, and I am sticking to it.

Nate – Thank you for coming by. Despite any disagreements, it is kind of you.

In regard to pettiness, I apologize. My complaint is not about the existence of a regression, but with the presentation of apparent precision without a directly accompanying statement of error. I agree with you that you have made a technically correct disclaimer. But you must be aware of the sillier reactions to your post.

Many of your readers don’t know much statistics, but they like and trust you. I think people like that need a little connecting of dots. Why not do something to prevent them from going in wrong directions? Even saying “30” instead of “27” would be a help. Perhaps it’s the teacher in me.

In regard to the fraction of challenged ballots being a correlative clue: If I understand your data correctly, there seems to be at least one other correlation – see my post above. The minimum number of new ballots per precinct also increases with the number of challenges. This suggests to me that there may be genuine problems with counting in high-new-ballot precincts. This is a natural explanation for those precincts having a higher number of challenges.

I do think that by calling attention to the possibility of bad-faith challenges, you may increase the possibility that the hearing is not biased by who appears to be ahead. That’s good.

You are completely correct that your collection and presentation of data makes it possible for others to do simple analysis, as I have done here. It’s interesting, it’s enjoyable, and I appreciate it.

I understand why people are counseling the critics of Nate to ease up a bit and I do understand where Nate is coming from in defending himself. However, I find it rather difficult to see any of this criticism as anything but constructive.

To Nate: You can still communicate uncertainty without writing in academic jargon. In defending yourself in point (2), I see no constraints imposed upon you that can’t be trivially solved. Your lede could read something like: “The outcome of the recount is very uncertain. It seems just as likely that Coleman will win by 73 as it is that Franken will win by 127 votes. According to my calculations, the real answer will be somewhere in between, at around Franken +27. To be honest, this is a real hard one to predict.” That’s perfectly layman language and accurately conveys the uncertainty. More importantly, it gives prominence to the uncertainty. To do otherwise IS statistical deception. Trumpeting the prediction of a model while burying the uncertainty reminds me of the disclaimers in smoking and pharma ads. Nate is a good enough writer, an insightful enough blogger, and well-trained enough Statistician (Nate, you did take at least Stigler’s intro series right?) to be expected to communicate such uncertainties without undue trouble. Precisely BECAUSE we know Nate’s record we have higher expectations.

And to the critics of critics: None of the critiques has been ad hominem, and practically everything mentioned on this blog is at least correct in spirit. There is nothing wrong with asking Nate to highlight his uncertainty more prominently or to call him out when we think he’s been less than forthcoming. Nor is there anything wrong with criticizing the premise of Nate’s model. Sure, Sam adds a liberal dose of snark and maybe that could be left at the door, but the heart of his criticism was methodological. And it is a bit of “pot meet kettle” for Nate to ask someone with statistical quibbles to avoid snarky comments. Where would 538 be without the occasional snark? So, to Nate and his defenders, keep up the rebuttals. It is a good back and forth. But engage the criticisms (both statistical and spin-related critiques) and see what you can do with them. There is decent crowd-sourcing potential from a few academic geeks here. You could probably use that to your benefit. What seems certain to me is that the worst response is to be defensive.

PS
Sam, I glossed over the biased assimilation link you included above. My ascertainment bias comment was rather parallel to your original post. Sorry for the redundancy. I was thinking along the lines of publication biases wherein “interesting” results picked up and published in good journals inflate the reputation of the scientist. So a mediocre scientist can be well-regarded as long as he has a few home-runs to outshine his whiffs. It might be easier to get a job with 2 pubs in Nature and 10 questionable ones in the Proceedings of the National Academy of Navel Gazing than it is to get a job with 20 solid publications in Genetics and Genome Research.

This thread has been interesting and enjoyable. I’d be very interested in Sam’s and commenters’ thoughts on where next we might find an election (or other public-data accessible) that is as engaging to the general public in the U.S. as the 2008 U.S. head of state election where some kind of good statistical analysis can give the public and the press accurate, useful information.

I can’t recall in which Woody Allen movie someone said something like “Another last word freak.”

I don’t want this comment to be viewed as a thread-ender, but I think that if Sam has sufficient time we may be able to tempt him into something forward looking that he hasn’t yet considered.

Basically, the celebrity of Nate Silver and FiveThirtyEight led to mainstreaming of the idea that poll aggregation was a useful activity. However, my posting and the comment thread call attention to the issue that among many readers, core lessons of a statistically-based point of view have not stuck. Thus the interest in “a 27-vote margin.”

Celebrity is a double-edged sword – it draws people in, which is good – but it also focuses them on personalities as opposed to subject matter.

I am definitely interested in further discussion of what’s a good substrate to fire people’s imaginations. The first thing that comes to mind is hurricane tracking. But I think that to really work, there needs to be a personal element. In this respect political campaigns are perfect.

I posted a critique of Silver’s “27 votes” review on an online site. The focus being (potential and actual) issues with the variables considered and not considered, assumptions, lack of providing margin of error, CI. I referred to 27 as his “prediction.” 2 responses essentially called me an idiot for doing so rather than a “projection” and not knowing the difference. (Another slammed me for daring to criticize Silver).

To refresh my memory, I found these definitions of the two from a stats book that now confuse me:

Prediction–An estimate based on the analysis of a past series of data for a point outside the series.
Projection–A prediction based on certain assumptions.

So, a projection is a prediction assuming certain conditions, whereas a prediction is not based on certain assumptions? The difference being the use of assumptions?

(Isn’t a prediction dependent on an assumption that certain conditions relevant to the past series of data remain the same and relevant for the predicted point outside the series?)

Seeing references here (contra elsewhere where I don’t have faith in their use of the two terms) to Silver’s 27 as a prediction leaves me wondering about what label is correct for this case and the difference between the two terms (the use of the word “prediction” for defining “projection” is a bit awkward, at least from the definitions I’ve encountered). My polisci grad work is too remote in my mind to clearly recall my elementary statistics.

Any help in clarifying the concepts and the proper labeling/understanding of Silver’s 27? (He uses a prediction in his franken_net variable and obviously certain assumptions in his “suggested value.”)

In 4th grade I won a jar of jelly beans by guessing the amount in the jar. I guessed 700 (the total was 738) while everyone guessed around 200. I’m 43 now, and winning that guessing contest is still the greatest moment of my life.

Olav — the Meta-Analysis gave a prediction of 352, not 364. The 364 was an unofficial number, based on a subjective tweak to the Meta-Analysis results. It is listed at the top of this web page in error.

The fact is, the two sites’ predictions were very close to each other, with the biggest difference being that 538 tried to give specific state-by-state picks, which (i) I don’t think Sam tried to do, and (ii) came out pretty accurately (I think 50/51, including DC). I think this 50/51 result is probably where the widespread view that Nate was amazingly accurate comes from, rather than his overall EV prediction. (by the way, his state-by-state projection generated a 353 EV prediction, almost exactly the Meta-Analysis prediction).

Also, it’s a bit silly to compare the two sites’ predictions, because their predictions were getting at different things.

538’s prediction of 348.6 was an average based on simulations. But the five “Most Likely Obama EV Totals” on 538’s model were, in descending order, 311, 353 and 364 (all between 900 and 1000 per 10,000 simulations), followed by 338 (a bit over 800 per 10,000), and, much farther behind, 291 (at around 500 per 10,000 simulations).

I don’t know if Sam’s method provides that kind of “five most likely results” information or if the math doesn’t allow for that particular kind of analysis.

And please, I hope nobody jumps on me for using “prediction” “projection” “pick” “most likely result” or whatever else basically interchangeably. None of those distinctions are relevant when the time comes to take your ticket to the window and see whether you won.