Main menu

Post navigation

Lies, Damn Lies, and Predictions

Oh man, not another election! Why do we have to choose our leaders? Isn’t that what we have the Supreme Court for?
— Homer Simpson

Nate Silver is now putting Barak Obama’s chance of reelection at around 85%, and he has been on the receiving end of considerable criticism from supporters of Mitt Romney. Some have criticized his statistical analysis by pointing out that he has a soft voice and he is not fat (wait, what? read for yourself – presumably the point is that Silver is gay and that gay people cannot be trusted with such manly pursuits as statistics), but the main point seems to be: if Romney wins the election then Silver and his models are completely discredited. (E.g. here.) This is like someone saying that a die has approximately a 83% probability of not turning a 2, and others saying, if I roll a die and it turns a 2, this whole “probability” thing that you speak of is discredited.

But still, when someone offers predictions in terms of probability, rather than simply stating that a certain outcome is more likely, how can we evaluate the quality of such predictions?

In the following let us assume that we have a sequence of binary events, and that each event has a probability of occurring as a and of occurring as . A predictor gives out predicted probabilities , and then events happen. Now what? How would we score the predictions? Equivalently, how would we fairly compensate the predictor?

A simple way to “score” the prediction is to say that for each event we have a “penalty” that is , or a score that is . For example, the prediction that the correct event happens with 100% probability gets a score of 1, but the prediction that the correct event happens with 85% probability gets a score of .85.

Unfortunately this scoring system is not “truthful,” that is, it does not encourage the predictor to tell us the true probabilities. For example suppose that a predictor has computed the probability of an event as 85% and is very confident in the accuracy of the model. Then, if he publishes the accurate prediction he is going to get a score of .85 with probability .85 and a score .15 with probability .15. So he is worse off than if he had published the prediction of the event happening with probability 100%, in which case the expected score is .85. In general, the scheme makes it always advantageous to round the probability to 0% or 100%.

Is there a truthful scoring system? I am not sure what the answer is.

If one is scoring multiple predictions of independent events, one can look at all the cases in which the prediction was, say, in the range of 80% to 90%, and see if indeed the event happened, say, a fraction between 75% and 95% of the times, and so on.

One disadvantage of this approach is that it seems to require a discretization of the probabilities, which seems like an arbitrary choice and one that could affect the final score quite substantially. Is there a more elegant way to score multiple independent events without resorting to discretization? Can it be proved to be truthful?

Another observation is that such an approach is still not entirely truthful if it is applied to events that happen sequentially. Indeed, suppose that we have a series of, say, 10 events for which we predicted a 60% probability of a 1, and the event 1 happened 7 out of 10 times. Now we have to make a prediction of a new event, for which our model predicts a 10% probability. We may then want to publish a 60% prediction, because this will help even out the “bucket” of 60% predictions.

I don’t think that there is any way around the previous problem, though it seems clear that it would affect only a small fraction of the predictions. (The complexity theorists among the readers may remember similar ideas being used in a paper of Feigenbaum and Fortnow.)

Surely the task of scoring predictions must have been studied in countless papers, and the answers to the above questions must be well known, although I am not sure what are the right keywords to use to search for such work. In computer science, there are a lot of interesting results about using expert advice, but they are all concerned with how you score your own way of picking which expert to trust rather than the experts themselves. (This means that the predictions of the experts are not affected by the scoring system, unlike the setting discussed in this post.)

Related

11 thoughts on “Lies, Damn Lies, and Predictions”

Brier invented the concept of scoring rules to solve this problem. Suppose one tells the predictor: “if you predict that the probability of the event is x, then I’m going to pay you f_0(x) if the event does not happen, and f_1(x) if it happens.”.
(And the functions f_0 and f_1 are public knowledge.)

Of course, one aims to select f_0 and f_1 in such a way that, for each p \in [0,1], the expected gain (1-p) f_0(x) + p f_1(x) is uniquely maximized at x = p.

Another way of giving a truthful score is the logarithmic scoring rule. Let’s say there’s a random variable X in {0,1} and you predict Pr[X = 1] = p. Then if X = 1 actually happens I pay you log(p). Otherwise I pay you log(1 – p). This incentivizes you to give your true beliefs.

Any rule that incentivizes an expert to reveal their true beliefs is called a strictly proper scoring rule.

Lets say we give the predictor a penalty c(p) that depends on the probability p that he assigned to the output. If he told his best guess p, he would expect a penalty of c(p)p+c(1-p)(1-p). We want him to be honest, so he shouldn’t get a higher or lower expected outcome by changing the estimate that he announces. That is, c'(p)p-‘c(1-p)(1-p)=0 so c'(p)p should be the same for p and 1-p. One obvious choice is c(p)= -log(p) (and this would be the only choice that worked if there was more than two possible outcomes, because then we would need c'(p)p to be constant). If the predictor know the true distribution P of the outcome X, his expected penalty is known as the information entropy H(X). If he don’t know it, he would get H(X)+ D(P||Q) where P is the true distribution, Q is his belief about the distribution and D(*||*) is the Kullback-Leibler divergence. This divergence is a measure of how far he is from the truth. Of cause we cannot compute this, since we don’t know the true distribution of X.

If a predictor predicted a lot of events (not necessarily with same distribution, so it could be presidential elections), the average of his penalties would be an estimate of the average value of H(X)+ D(P||Q). There is no way of telling how large a part of this comes from the randomness in X and how large a part comes from the ignorance of the predictor (this should be clear: It is not possible to tell if the world is deterministic, so in particular there is no way of telling if the result of the election is determined. It could be that H(X) was always 0). However, by subtracting the penalties between two predictors, we would find the difference between their average Kullback-Leibler divergences to the truth, which might be a good measure of who makes the best predictions.

Ops, I read the article some hours ago, and consider if I should write a blogpost on my own blog or just a comment here. When I decided to write the comment I didn’t press refresh before writing, so I hadn’t seen the above comments.

For a process like the election, there is another potential way : Choose many random subsets of voters and see what is the fraction of subsets in which candidate A wins. Ideally, if the “correct” winning probability of A is p, then whp, the candidate A will win on p fraction of the subsets. So, if a predictor outputs q, the predictor is penalized something proportional to say |p-q|. This seems to be truthful, isn’t it?

Anindya: without judging whether this works for a direct election, where the outcome is based solely on the number of votes for a given candidate, I don’t think this approach would work for an election system as found in the US. In particular, the fact that the probability of winning depends on the actual distribution of votes by state (so that having 50.1% of the votes is some key state is much more important than having 100% in some other) should prevent the random subset approach to be truthful — shouldn’t it?