Grading policies

November 3rd, 2014, 8:33pm by Sam Wang

How should PEC be graded? As the last few polls trickle in, let me give a suggestion for how to evaluate predictions after the election. Late tonight I’ll give actual predictions (and give you a chance to record your own predictions).

My preferred measure is the Brier score. As I explain this concept, I’ll refer to some suggestions from FiveThirtyEight and Drew Linzer.

Are “well-calibrated” probabilities enough? FiveThirtyEight has suggested that probabilities should be well-calibrated, i.e. 50% probabilities should be correct 50% of the time, 75% probabilities should be correct 75% of the time, and so on.

This is sufficient if one thinks of prediction as being like gambling, i.e. the avoidance of money loss. The problem is that if I assigned a 50% win probability to every race, and roughly half the races went each way, I could call that well-calibrated. But it would not be informative.

Ideally, we’d want a measure that rewards confidence, but does not reward random guessing – and really sticks it to you if you get a prediction wrong. There’s a simple measure that does this: the Brier score. Here’s how it works.

Basically, you express your win probability as a fraction (i.e. 100% is 1.0, and 0% is 0.0). Then score the outcomes as wins (1.0) and losses (0.0). Calculate the difference between the probability and outcome, and square it. That is a Brier score for one prediction. Average the scores for all your predictions to get your overall Brier score. The lowest score wins.

Here is an example.

In this example, the person forecasting races A and B just gave 50% probabilities, which are basically random guesses. He/she ended up with a Brier score of 0.25. The person forecasting races C and D made more confident predictions, and ended up with a Brier score of 0.04.

And of course, two wrong predictions (let’s call that the Dick Morris score) would lead to a Brier score of 1.0, which is very high.
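The arithmetic above can be sketched in a few lines of Python. The 80% probabilities for races C and D are an assumption on my part, chosen only because they reproduce the 0.04 figure when both calls are correct:

```python
def brier_score(probs, outcomes):
    """Average squared difference between forecast probabilities (0.0-1.0)
    and outcomes coded as 1.0 (win) or 0.0 (loss). Lower is better."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Races A and B: 50% guesses, both candidates won.
print(brier_score([0.5, 0.5], [1.0, 1.0]))   # 0.25

# Races C and D: more confident calls (80% is an assumed value), both correct.
print(brier_score([0.8, 0.8], [1.0, 1.0]))   # ~0.04

# Two fully confident calls that were both wrong (the "Dick Morris score").
print(brier_score([1.0, 1.0], [0.0, 0.0]))   # 1.0
```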

(Note that the Brier score is not the only way to go. A few weeks ago, reader Forrest Collman made a pitch for a logarithmic-scale evaluation. Check that out.)

The Brier score concept is fairly commonplace. I believe other aggregation sites will be using it for evaluation. In 2012, Rationality.org used Brier scores to evaluate our Senate predictions. We did considerably better than FiveThirtyEight, in large part because of two wrong calls by FiveThirtyEight (North Dakota and Montana). I am not certain we will do better this year – there are so many uncertain races. But we’ll try!

How do we reward correct predictions made far in advance? The truth is that all prognosticators should perform at similar levels on Election Eve. A better test is whether we made predictions that were ultimately correct, weeks or months before the election. Here is what Drew Linzer says:

A good election forecast zeroes in on the correct outcome as quickly as possible, without overreacting to daily noise http://t.co/vILBieob2q

I have to think about what the best measure would be. My first thought is to calculate an average Brier score over the entire campaign, starting in June. If anyone has further ideas, I’m all ears.
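One way to make that first thought concrete: score each snapshot of the campaign against the eventual outcomes, then average across snapshots. A minimal sketch, with made-up daily numbers:

```python
def brier_score(probs, outcomes):
    """Mean squared error of win probabilities against 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def campaign_brier(daily_probs, outcomes):
    """Average the daily Brier scores over the whole campaign.
    daily_probs: one list of win probabilities per day."""
    daily = [brier_score(day, outcomes) for day in daily_probs]
    return sum(daily) / len(daily)

# Hypothetical three-day campaign for two races that both end in wins:
snapshots = [[0.6, 0.5], [0.7, 0.6], [0.9, 0.8]]
print(campaign_brier(snapshots, [1.0, 1.0]))   # ~0.118
```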

Note: A previous version of this essay made incorrect statements about The Monkey Cage’s stance on how to evaluate probabilities. I regret my error. I am also told by John Sides that HuffPollster will be using Brier scores to evaluate predictions. Good for them!

14 Comments so far ↓

What about calculating Brier scores for the average of each month leading up to the election?

It doesn’t give one ‘winner, winner, chicken dinner’ result, but it would test the various ideas that (1) broader models are more useful earlier; (2) everyone is about the same at the end; and, assuming 1 and 2 are more or less validated, (3) identify when the methodologies effectively converge; and, relatedly, (4) whether the fundamentals-based models underweight polls for an appropriate duration of the campaign.

Wouldn’t it be possible to calculate a running Brier score over all the snapshots for the whole period of evaluation? It would just mean imposing the Brier score calculations on the set at every point, which wouldn’t be very computationally intensive. That way you could look at several measures: average score over the campaign, range of scores (max – min), point in time past which the score stayed below a certain point, etc.
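Those extra measures are easy to bolt on. A sketch under the same assumptions (one list of win probabilities per snapshot, outcomes known after the fact; the 0.1 threshold is arbitrary):

```python
def brier_score(probs, outcomes):
    """Mean squared error of win probabilities against 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

def brier_summary(snapshots, outcomes, threshold=0.1):
    """Score every snapshot, then report the campaign average, the
    max-min range, and the first snapshot index after which the score
    stays below the threshold (None if it never settles there)."""
    scores = [brier_score(s, outcomes) for s in snapshots]
    stays_below = None
    for i in range(len(scores) - 1, -1, -1):
        if scores[i] >= threshold:
            break
        stays_below = i
    return {
        "average": sum(scores) / len(scores),
        "range": max(scores) - min(scores),
        "stays_below_from": stays_below,
    }
```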

Ooh I like that idea. It’s easy to get the average over time, but that chart could be very interesting to look at. It could be useful to look at the high and low points and try to figure out what in the data caused it, possibly suggesting ways to improve the model next time around.

Averaging the score over time seems reasonable. I picture it as integrating the squared error between your prediction-over-time curve and a line representing the actual outcome, and it’s hard to see a simpler way to generalize a Brier score to cover past predictions.

Sam,
You mentioned WP, 538, etc. Is this a grading system just for PEC? Have the other poll aggregators agreed to compute Brier scores for their predictions and communicate them to the media?

Even with all the posts about being unable to depend on the polls, almost everyone has called the same winners in each state and the same overall result. So if there are many ties, it would make sense to have another method for breaking ties among the different processes used.

Right now, on Election Day, PEC and 538 agree exactly on who wins every state. The biggest difference in win probability is 18%, on Kansas. Comparing all the poll aggregators yields much the same similarities.
Given that 538 uses a lot of special sauces like house bias and fundamentals to adjust almost every poll, RCP rejects some polls, and PEC just takes the median of all polls, I find this intriguing and slightly humorous.
Since almost all are singing the same song, we do need something similar to Drew Linzer’s chronological charts.

I think the margin of victory should be considered in the evaluation of the prediction. For example, if the victory margin is 1%, then a 70% winning prediction is the best prediction; if the margin is 2%, then 90% is best; if it is 3%, then 100%; and if it is 0%, then 50%. Anything in between can be linearly interpolated.
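The commenter’s scheme can be sketched as a piecewise-linear map. The breakpoints below come directly from the examples given; treating margins above 3% as capped at 100% is an assumption:

```python
def ideal_win_prob(margin_pct):
    """Map a victory margin (in percentage points) to the 'best' win
    probability under the commenter's scheme: 0% -> 50%, 1% -> 70%,
    2% -> 90%, 3%+ -> 100%, linearly interpolated in between."""
    pts = [(0.0, 0.50), (1.0, 0.70), (2.0, 0.90), (3.0, 1.00)]
    m = min(abs(margin_pct), 3.0)
    for (x0, y0), (x1, y1) in zip(pts, pts[1:]):
        if m <= x1:
            return y0 + (y1 - y0) * (m - x0) / (x1 - x0)
```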

Is anyone else producing seat-count histograms? The correlation between your histogram and one with 100% in the bin for the actual result could be interesting. It would reward a peaky distribution centered on the result, but a wide distribution would score better than a peaky one that missed.

You would get a simple zero to one score, but I don’t know if it would be any more useful than the Brier score, and it still only works on the final prediction unless you do an average over time.

Looking at the model comparisons on The Upshot, it looks to me like Drew Linzer’s model at DailyKos scored best just because it had the most pro-Republican forecast (other than WaPo, which had the right overall answer but far too much certainty, and got NC wrong). But the polls-only models seem to have outperformed the polls-plus-fundamentals models on Election Night, with HuffPo probably in second place. It doesn’t look like you are going to be winning any prizes for most accurate prognosticator, but I think you’re still doing a good job and a great thing putting this out for us.

Both A and B win. So the Brier score is the same for both prognosticators.

But candidate A won by 15 points, and candidate B won by 1 point.

Intuitively, I would say prognosticator 2 was better than prognosticator 1. Prognosticator 2 recognized the “iffiness” of race y and the “sure-thing-ness” of race x, whereas prognosticator 1 missed that.

So this leads me to suggest some kind of weighting of the “raw” Brier scores, where the weight is a function of the actual winner’s gap.

Unfortunately, as I tried to come up with examples, I discovered that ranking of prognosticators depends critically on choice of weighting scheme, and there is no “obviously right” weighting scheme. And the ultimate ranking of different prognosticators will depend on a somewhat arbitrary decision about the weighting scheme.
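That sensitivity to the weighting scheme is easy to demonstrate. A sketch with two hypothetical forecasters (all numbers are made up): on one race decided by 15 points and one decided by 1 point, the ranking flips between an unweighted Brier score and one weighted by the victory margin.

```python
def weighted_brier(probs, outcomes, margins, weight):
    """Brier score in which each race's squared error is weighted by
    a function of the actual victory margin."""
    ws = [weight(m) for m in margins]
    total = sum(w * (p - o) ** 2 for w, p, o in zip(ws, probs, outcomes))
    return total / sum(ws)

margins = [15.0, 1.0]    # race x: 15-point win; race y: 1-point win
outcomes = [1.0, 1.0]    # both favorites won
p1 = [0.70, 0.70]        # prognosticator 1 treats both races alike
p2 = [0.95, 0.55]        # prognosticator 2 separates the sure thing from the toss-up

flat = lambda m: 1.0     # ordinary (unweighted) Brier score
by_margin = lambda m: m  # weight blowouts more heavily

# Unweighted, prognosticator 1 scores better; weighted by margin,
# prognosticator 2 does -- the ranking depends on the weighting scheme.
```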