The Problems With Forecasting and How to Get Better at It

Are political scientists any good at making predictions? Jacqueline Stevens, a professor of political science at Northwestern University, argued in an Op-Ed last Sunday that political scientists make for lousy forecasters. Ms. Stevens suggested that National Science Foundation grants to political scientists should be awarded by lottery rather than directed toward “research aimed at political prediction,” which she said is “doomed to fail.”

My position on these issues is somewhat complicated. (I’ve found my name invoked in various places both in support of and in opposition to Ms. Stevens’s thesis.) I do think that the subject of prediction is fascinating and essential — enough so that I just wrote a book about it called “The Signal and the Noise,” which is set to be published this September.

The book covers a wide range of topics – not just politics – but two things are fairly clear in a political science context. First, Ms. Stevens is right that there is a problem – prediction has gone very badly in the discipline. But second, her proposed solution might make matters worse. Some of the habits she encourages are the same ones that have helped produce such lousy forecasts in the first place.

Ms. Stevens cited the work of Philip E. Tetlock, a professor of psychology at the University of Pennsylvania, whose 2006 book “Expert Political Judgment: How Good Is It? How Can We Know?” described the results of a landmark two-decade study he conducted examining the success rates of experts who made political predictions. Mr. Tetlock’s experts included a number of political scientists, but also those in government, journalism and other fields – anyone who wrote or thought about politics potentially qualified.

Mr. Tetlock found that the experts’ predictive judgment wasn’t very good. Most of his experts were outperformed by a statistical algorithm, and many were worse than “dart-throwing monkeys.” Most had no clue about some generation-defining events — like the collapse of the Soviet Union — until they began to occur. Mr. Tetlock found that credentials made little difference: having a Ph.D. in political science, for instance, was not a significant factor either way in predicting success.

Political scientists have also tried their hand at prediction more directly, through “fundamentals” models that forecast presidential elections from measures like economic performance. These efforts have gone badly: models based on the fundamentals alone have missed election results by an average of eight points since they began to be published widely in 1992. (Those models that combined economic and polling data have had considerably better results.) This is worse than you would have done just by glancing at the Gallup poll, or even by guessing that the outcome of the election would be split 50-50.

It was also much worse than what the models advertised. Most of them claimed to have pinpoint accuracy, and would have given odds anywhere from hundreds-to-one to billions-to-one against some of the outcomes that actually occurred, like the virtual tie between George W. Bush and Al Gore in 2000. (Many of the models had envisaged a Gore landslide instead.)

I’ve gotten various reactions since publishing these results, some of which have verged on utter denial. Some political scientists have obfuscated the problem (intentionally or not) by treating the data the models used to fit their equations as tantamount to actual predictions – in essence, claiming credit for “predicting” the past. (Here’s a tip: I have a model that says you should bet a lot on George Mason to make the Final Four in 2006. You’ll make a fortune. Now you’ll just have to get your hands on a time machine.)
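The trap of in-sample “prediction” is easy to demonstrate with a toy simulation (the data and the “model” here are entirely hypothetical, not any published forecasting model): a model that simply memorizes past outcomes scores perfectly on the past while doing no better than a coin flip on anything it hasn’t seen.

```python
import random

random.seed(7)

# Hypothetical illustration: past and future election winners are random,
# so there is nothing genuine for any model to learn.
past = {year: random.choice(["D", "R"]) for year in range(1952, 2012, 4)}
future = {year: random.choice(["D", "R"]) for year in range(2012, 2052, 4)}

def memorizing_model(year):
    # Looks up the answer if the election is in its "training data,"
    # otherwise falls back to a coin flip.
    return past.get(year, random.choice(["D", "R"]))

in_sample = sum(memorizing_model(y) == past[y] for y in past) / len(past)
out_sample = sum(memorizing_model(y) == future[y] for y in future) / len(future)
print(f"in-sample accuracy: {in_sample:.0%}, out-of-sample accuracy: {out_sample:.0%}")
```

The in-sample score is always 100 percent, which tells you nothing at all about how the model will fare on elections that haven’t happened yet.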

Political scientists have also noted that some of the forecast models have done better than others. To be clear, I do think that some of them are more soundly constructed. But so far, the results of the “fundamentals” models, when tested on real data, have been consistent with the hypothesis that they have no forecasting skill at all: just random variance centered around a poorly performing mean. Cherry-picking the most successful models may be the equivalent of attributing genius to the octopus that predicted the World Cup.
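The cherry-picking problem can be illustrated with a quick sketch (illustrative numbers only, not the actual published models): simulate a field of models with identical, nonexistent skill, and the luckiest one will still look impressive in retrospect.

```python
import random

random.seed(13)

# Hypothetical illustration: 20 models each "forecast" 5 elections.
# Every model's errors are drawn from the same distribution, so any
# differences between the models are pure chance.
n_models, n_elections = 20, 5
avg_miss = [
    sum(abs(random.gauss(0, 10)) for _ in range(n_elections)) / n_elections
    for _ in range(n_models)
]

overall = sum(avg_miss) / len(avg_miss)
print(f"typical model's average miss:  {overall:.1f} points")
print(f"luckiest model's average miss: {min(avg_miss):.1f} points")
```

Judged only by its track record, the luckiest of the twenty would look like a standout forecaster, even though it was generated by exactly the same noise as the rest.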

But there is also another, more sophisticated defense of the failures of prediction. “Prediction is simply not what we do,” writes Seth Masket, an extremely talented political scientist from the University of Denver. Instead, Mr. Masket and others say, the goal of political science is to explain the world rather than to predict it.

There is an interesting discussion of this theme in the comments section of the political scientist Matthew Dickinson’s blog. As I wrote over there, “I find the whole distinction between theory/explanation and forecasting/prediction to be extremely problematic.”

One can take an extreme position, as Ms. Stevens does, that accurate political predictions are “the field’s benchmark for what counts as science.” One can also claim, as Mr. Dickinson does, that predictions are not highly scientific unless they are rooted in clearly articulated theory.

Some of these distinctions, I think, are semantic rather than substantive. Where I was able to reach some agreement with Mr. Dickinson is in the notion that predictions in political science are usually more a means than an end.

Although there is much riding on the outcome of the presidential election, the success or failure of a political scientist’s prediction about it – or my prediction of it, of course – isn’t going to contribute much one way or the other to human welfare. This might be contrasted with, say, a weather forecast, or an economic forecast, which will have a more direct impact on life and policy decisions.

In political science, much research does not lend itself to testable predictions. Theories about political institutions, for instance, might take decades to verify – if they can be verified at all.

But herein lies the problem. Theories and statistical models are different types of approximations of the real world. Without testing them by means of prediction, how do we know whether they are any good?

Although some models and theories lend themselves more readily to prediction than others, largely the same techniques are used to formulate the testable and untestable ones alike. It is extremely easy to mistake the random noise in data for a signal, or to mistake correlation for causation. And one may become enamored of the model or the theory, which will usually be neater and more seductive than the reality.

The fact that the relatively few hypotheses from political science that we can test by means of a prediction are faring so poorly suggests that many of the untested theories and models are equally wrong. In fact, they may be worse, since we have no way to learn from our mistakes and dispose of the wrong ideas.

So I agree with Ms. Stevens (and with the philosopher of science Karl Popper) that the failure of these predictions ought to be extremely discomforting, whether or not prediction is the goal of political science per se.

For better or for worse, however, these problems are not confined to political science. Economics has had at least as many problems. So have a number of hard sciences, ranging from seismology to epidemiology. The majority of “statistically significant” results documented in medical journals also cannot be reproduced when another experimenter seeks to validate the relationships that they describe.

Nor are the problems limited to the ivory tower. Mr. Tetlock’s work suggests, for instance, that those forecasters who appear most frequently in the news media, and cross the line into being pundits, make especially poor predictions. Studies of the panelists on “The McLaughlin Group” find that they aren’t remotely good at forecasting.

In some ways, in fact, poor forecasts seem to be a part of the human condition. The work of behavioral scientists like Daniel Kahneman has been, in my view, the most important development in the social sciences in the past half century because its implications are so far-reaching. Mr. Kahneman’s work describes the myriad ways in which we interpret information incorrectly or with bias. The gap between perception and reality is substantial.

Self-awareness is a big part of the issue – we need to know more about what we don’t know. That the presidential forecasting models have performed badly is one thing; my view is that forecasting presidential elections is actually pretty hard. But that these models have failed despite claiming to be extremely accurate is less excusable. (Be careful about interpreting the FiveThirtyEight election forecasts as gospel, as I’d say of any forecasting system. I do, however, strive to describe the uncertainty in them as carefully and completely as possible and to articulate a realistic range of possible outcomes. The goal is not to mistake precision for accuracy.)
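Describing uncertainty carefully means, in practice, being calibrated: when a forecasting system says 80 percent, the event should happen about 80 percent of the time. Here is a minimal calibration check, using made-up forecasts and outcomes rather than any real FiveThirtyEight data:

```python
from collections import defaultdict

# Made-up forecasts (stated win probabilities) and outcomes (1 = event
# happened), purely for illustration.
forecasts = [0.9, 0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6, 0.6, 0.6]
outcomes  = [1,   1,   1,   1,   0,   1,   1,   0,   0,   1]

# Group forecasts by their stated probability, then compare that stated
# probability with the frequency at which the event actually occurred.
bins = defaultdict(list)
for p, o in zip(forecasts, outcomes):
    bins[p].append(o)

for p in sorted(bins):
    hit_rate = sum(bins[p]) / len(bins[p])
    print(f"stated {p:.0%} -> observed {hit_rate:.0%} ({len(bins[p])} forecasts)")
```

A well-calibrated forecaster’s stated probabilities line up with the observed frequencies, bin by bin; with real data you would also want far more forecasts per bin than this toy example has.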

Ms. Stevens, however, ignored one of Mr. Tetlock’s key findings. Although Mr. Tetlock’s expert forecasters were bad on the whole, some were considerably less bad than others. (My book, likewise, documents some success stories – both in entire disciplines like meteorology and among individual forecasters within these disciplines – in addition to the various forecasting failures.)

Mr. Tetlock calls these modestly successful forecasters “foxes,” as opposed to the completely unsuccessful ones, whom he calls “hedgehogs.” These names come from a quote attributed to the Greek poet Archilochus: “the fox knows many things, but the hedgehog knows one big thing.”

What makes a fox a fox and a hedgehog a hedgehog is something that deserves a fuller explanation than I will offer here, but Archilochus points us in the right direction: foxes (the better forecasters) tend to believe in a plethora of little ideas rather than one big idea. Their pluralism also tends to make them more comfortable with probability and uncertainty and more inclined toward Bayesian methods. They also lean toward more inductivist approaches, rather than trying to deduce everything from first principles. (Although I agree with Andrew Gelman and Cosma Rohilla Shalizi that this particular distinction gets complicated.)
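Bayesian updating, of the sort the foxes favor, amounts to revising a probability incrementally as each piece of evidence arrives, rather than committing to one big idea up front. A minimal sketch, with illustrative numbers:

```python
def update(prior, p_evidence_if_true, p_evidence_if_false):
    """Bayes' rule: P(H | E) = P(E | H) * P(H) / P(E)."""
    numerator = p_evidence_if_true * prior
    evidence = numerator + p_evidence_if_false * (1 - prior)
    return numerator / evidence

belief = 0.5  # start out genuinely uncertain
for _ in range(3):  # three pieces of moderately supportive evidence
    belief = update(belief, 0.7, 0.4)

print(f"posterior after three updates: {belief:.2f}")
```

Each individual piece of evidence moves the probability only modestly; confidence accumulates gradually, which is the opposite of the hedgehog’s habit of leaping to near-certainty from a single big theory.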

These findings generally contradict the approach Ms. Stevens recommends to other scholars. Indeed, she makes some of the hyperbolic claims (“research aimed at political prediction is doomed to fail”) that Mr. Tetlock’s foxes would be careful to avoid. In her papers, she inveighs against probabilism and inductivism, qualities that seem to be associated with better (or at least “less bad”) forecasting. While she is emphatically correct that one should be suspicious of quantitative approaches that claim to produce miraculous results from messy data sets, it is not quite clear what her alternative is, or whether it would be any better. (Mr. Tetlock’s work, as I mentioned, found that statistically driven approaches performed well as compared with expert judgment.)

Likewise, the solution Ms. Stevens proposes – distributing N.S.F. grants by lottery – verges on postmodern reductionism. Although competition for research grants and space in prestigious journals can produce problems, finding ways to make the competition healthier would seem better than giving up and replacing it with a lottery. (And why, as Ms. Stevens suggests, should the lottery be open solely to political science Ph.D.’s like herself – especially when Mr. Tetlock found that these credentials had little relationship with forecasting skill?)

In short, I don’t know that Ms. Stevens’s recommendations follow from her evidence. If we’re so bad at prediction, and yet it is so essential to science, it seems like we need more research into it rather than less.

Nate Silver is the founder and editor in chief of FiveThirtyEight. @natesilver538