Nate Silver & Calibration

Nate Silver needs no introduction. While I should have read his book by now, I have not. From my student Kane Sweeney, I learn that I should have. Kane, if I may ape Alvin Roth, is a student on the job market this year with a nice paper on the design of healthcare exchanges. Given the imminent rollout of these exchanges, I would have expected a deluge of market design papers on the subject. Kane’s is the only one I’m aware of. But, I digress (in a good cause).

Returning to Silver, he writes in his book:

One of the most important tests of a forecast — I would argue that it is the single most important one — is called calibration. Out of all the times you said there was a 40 percent chance of rain, how often did rain actually occur? If, over the long run, it really did rain about 40 percent of the time, that means your forecasts were well calibrated.

Many years ago, Dean Foster and I wrote a paper called Asymptotic Calibration. In another plug for a student, see this post. An aside to Kevin: the `algebraically tedious’ bit will come back to haunt you! I digress again. Returning to the point I want to make: one interpretation of our paper is that calibration is perhaps not such a good test. This is because, as we show, given sufficient time, anyone can generate probability forecasts that are close to calibrated. We do mean anyone, including those who know nothing about the weather. See Eran Shmaya’s earlier posts on the literature around this.
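
To make that claim precise, here is one way to state asymptotic calibration; this is a paraphrase in my own notation, not the exact formulation from the paper. Suppose the forecaster announces, each day $t$, a probability $p_t$ drawn from a finite grid. Let $n_T(p)$ be the number of days $t \le T$ on which $p$ was forecast, and let $\bar{\rho}_T(p)$ be the fraction of those days on which it rained. The forecasts are asymptotically calibrated if

$$\lim_{T \to \infty} \sum_{p} \frac{n_T(p)}{T}\,\bigl| \bar{\rho}_T(p) - p \bigr| = 0.$$

The point of the paper is that a suitably randomized forecasting scheme can guarantee this against every sequence of outcomes, which is why knowing nothing about the weather is no obstacle.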

Comments

Hah! I’m only the messenger: It is Fudenberg and Levine who open with “their construction, which relies on a series of approximations, is somewhat complicated; this note provides a shorter and simpler one”!

The short answer to your question is that I think of calibration as a necessary condition for a `good’ forecast but not a sufficient one. Now for a longer answer.

Imagine the following sequence of outcomes: 0, 1, 0, 1, 0, 1, 0, 1, … Consider now three distinct sequences of probability forecasts. The goal is to predict the probability of seeing a `1’.

1. 0.5, 0.5, 0.5, 0.5, …
2. 0, 1, 0, 1, 0, …
3. 0.1, 0.9, 0.1, 0.9, …

The first and second forecasts are both calibrated with respect to the sequence of outcomes. Thus, very different forecasts can be calibrated with respect to the same data. The third forecast is NOT calibrated: of the periods in which it announced 0.1, a `1’ occurred 0 percent of the time rather than 10 percent. Yet, I think you would agree, it is more `informative’ than the first. This suggests that calibration does not capture everything we would want in a measure of forecast accuracy. In particular, the third forecast seems to have captured the pattern in the data. There is a measure of this `pattern matching’ called coherence or resolution.
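
For the skeptical reader, here is a small Python sketch that tallies the empirical frequencies behind these claims; the helper function is mine, purely for illustration, and does not come from any of the papers mentioned.

```python
from collections import defaultdict

def calibration_table(forecasts, outcomes):
    """For each distinct forecast value p, report the fraction of
    periods with forecast p in which the outcome was 1."""
    buckets = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        buckets[p].append(y)
    return {p: sum(ys) / len(ys) for p, ys in sorted(buckets.items())}

T = 1000
outcomes = [t % 2 for t in range(T)]                 # 0, 1, 0, 1, ...

f1 = [0.5] * T                                       # forecast 1: always 0.5
f2 = [t % 2 for t in range(T)]                       # forecast 2: 0, 1, 0, 1, ...
f3 = [0.1 if t % 2 == 0 else 0.9 for t in range(T)]  # forecast 3: 0.1, 0.9, ...

for name, f in [("forecast 1", f1), ("forecast 2", f2), ("forecast 3", f3)]:
    print(name, calibration_table(f, outcomes))

# forecast 1 {0.5: 0.5}             calibrated: said 0.5, saw a 1 half the time
# forecast 2 {0: 0.0, 1: 1.0}       calibrated: each forecast matches its frequency
# forecast 3 {0.1: 0.0, 0.9: 1.0}   not calibrated: said 0.1, never saw a 1
```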

However, one of the results that Dean Foster and I left on the cutting room floor was this: any probability forecast can be adjusted to be calibrated without reducing its `resolution’. If memory serves, Teddy Seidenfeld may have been the first to make this observation.
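
To see why such an adjustment is plausible, one can lean on the standard Murphy decomposition of the Brier score; this is my gloss, not the argument from the material we cut. Writing $n(p)$ for the number of periods with forecast $p$, $\bar{y}(p)$ for the empirical frequency of the outcome on those periods, and $\bar{y}$ for the overall frequency,

$$\frac{1}{T}\sum_{t=1}^{T}(p_t - y_t)^2 \;=\; \underbrace{\sum_{p}\frac{n(p)}{T}\bigl(p - \bar{y}(p)\bigr)^2}_{\text{reliability}} \;-\; \underbrace{\sum_{p}\frac{n(p)}{T}\bigl(\bar{y}(p) - \bar{y}\bigr)^2}_{\text{resolution}} \;+\; \underbrace{\bar{y}(1-\bar{y})}_{\text{uncertainty}}.$$

Replacing each forecast value $p$ by $\bar{y}(p)$ drives the reliability term to zero while leaving the resolution term unchanged: the conditioning cells either stay the same or merge only when their conditional means already coincide.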

By the way, as was pointed out to me by a colleague, Nate Silver’s recent probability forecasts on which states would go for Obama were not calibrated!