Saturday, February 18, 2006

Probability, prediction and verification VI: Verification

At last, and after getting slightly sidetracked in various ways, I'll get back to the meat of things.

Forecast verification is the act of checking the forecast against the reality, to see how good it was. The basic aim is to see if the forecast was valid, in the sense that reality did not throw up any major surprises. You don't want your forecast to be confident of sunshine and warmth, but reality to be cold and rainy.

Obviously, for any current forecast, this check cannot even be attempted before the valid time of the forecast has arrived. So anyone who complains that today's multidecadal climate forecasts cannot be verified is merely stating a truism based on the definitions of the terms. A weather forecast also cannot be verified in advance of its valid time (say, tomorrow). But this in itself obviously does not mean that weather forecasts cannot be trusted and used. On the contrary, they prove themselves to be highly valuable on a daily basis, with industries ranging from agriculture to the military depending heavily on them. (That's despite there being ultimately no objective rigorous basis for the way in which the epistemic uncertainty in weather prediction is handled, as I've explained in more detail here, here and here.)

In fact, even after the valid time of the forecast has passed, and even assuming that precise observations are available, verification is still not a trivial matter. Returning to my previous example of a rain forecast, if the forecast said "70% chance of rain", then either a rainy or dry day is an entirely acceptable outcome. So was that forecast inherently unverifiable? The inevitable answer is that yes, of course it was! Even for a quantitative forecast ("tomorrow will have a max of 12C, with an uncertainty of 1C"), it will only be on the rare occasion that the observed temperature falls far enough outside the plausible range of forecast uncertainty that one might be able to say that the forecast failed to verify. In fact, if we assume the forecast uncertainty is Gaussian (or any other continuous unbounded function), there is no threshold at which the forecast fails to verify in absolute terms - you might simply have got the 1 chance in 1000 that the target was 3 standard deviations from the forecast mean. Indeed, with one forecast every day, you'd expect to see this roughly once every 3 years. [Note that we check whether the data lie within the uncertainty of the forecast, not whether the (central) forecast falls within the observational uncertainty of the data - see here for more on this.]
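As a rough check on those numbers, here is a minimal Python sketch, assuming the forecast error really is Gaussian. The function name tail_probability is purely illustrative. It suggests that "1 in 1000" is a convenient round figure: the exact chance of a miss of 3 standard deviations or more is about 1 in 740 in one particular direction, and about 1 in 370 if misses in either direction count.

```python
import math

def tail_probability(k, two_sided=True):
    """Chance that a Gaussian variable lies more than k standard deviations from its mean."""
    one_sided = 0.5 * math.erfc(k / math.sqrt(2.0))
    return 2.0 * one_sided if two_sided else one_sided

p_two = tail_probability(3.0)                    # miss in either direction
p_one = tail_probability(3.0, two_sided=False)   # miss in one chosen direction

# With one forecast per day, 1/p is the average number of days between such misses.
print(f"two-sided: p = {p_two:.5f}, about one miss every {1.0 / p_two:.0f} days")
print(f"one-sided: p = {p_one:.5f}, about one miss every {1.0 / p_one:.0f} days")
```

Either way the point stands: a single 3 standard deviation miss is rare but entirely expected from a good system that issues a forecast every day.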

Once you have more than a handful of forecasts, however, you can usually make a realistic assessment of the reliability of the system as a whole - if many days are 3 standard deviations from the mean, you'll probably judge it more likely that the system is bad than that you happened to hit the 1 in 10^100 unlucky streak in a good system :-) But the latter can never be truly proven false, of course. Conversely, if the forecast system has validated consistently over a period of time, we will probably trust today's forecast, but even if the system is known to be statistically perfect, there is still a 1 in 1000 chance that it will be 3 standard deviations wrong tomorrow. Each day is a unique forecast based on the current atmospheric state, which has not occurred before. As I explained before, the forecast uncertainty is fundamentally epistemic not aleatory, so there is no sense in which there is a "correct" or "objective" probabilistic forecast in the first place. The uncertainty is fundamentally a description of our ignorance, not some intrinsic randomness.
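To put a number on the "unlucky streak" argument, here is a small sketch using the binomial distribution. It assumes, unrealistically, that each day's forecast error is independent of the others, and it takes the round 1 in 1000 figure as the daily chance of a 3 standard deviation miss. The function name prob_at_least and the choice of 10 misses in a year are illustrative only.

```python
import math

def prob_at_least(n_days, n_misses, p_miss):
    """Probability of n_misses or more misses in n_days independent forecasts."""
    return sum(
        math.comb(n_days, k) * p_miss ** k * (1.0 - p_miss) ** (n_days - k)
        for k in range(n_misses, n_days + 1)
    )

p_daily = 0.001  # assumed chance of a 3 standard deviation miss on any one day

one_or_more = prob_at_least(365, 1, p_daily)   # at least one bad day in a year
ten_or_more = prob_at_least(365, 10, p_daily)  # ten or more bad days in a year

print(f"at least 1 miss in a year:   {one_or_more:.3f}")
print(f"at least 10 misses in a year: {ten_or_more:.2e}")
```

One bad day in a year is unremarkable, but ten of them would be so improbable for a well-calibrated system that a faulty system is by far the more plausible explanation, even though the unlucky streak can never be strictly ruled out.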

Obviously it would take a long time to collect adequate statistics from successive 100 year climate forecasts if we started now. And given the rate of ongoing model development, this approach could never tell us much about the skill of the most up-to-date models anyway, since models are replaced every few years. We can, however, use simulations of the historical record (and the present) to test how well the models can hindcast variations in the climate which are known to have occurred. In its simplest form, this sort of test provides only a lower bound on forecast errors, since the models are largely built and tuned to simulate existing observational data.

When the models fail to reproduce the data, of course it calls their validity into question - at least, it does if the data are reliable. A striking example of models teaching us about reality is in the recent resolution of the tropospheric data/model incompatibility in favour of the models (OK, I'm over-egging things a little perhaps). Looking back over the longer scale, we have Hansen's famous forecast from 1988, which has proved to be spot on over the subsequent 17 years. In fact, the simplicity of the physics means that one thing we really can forecast quite confidently is a continued global warming in coming decades: the IPCC TAR said it was likely to continue at 0.1-0.2C/decade for several decades to come, and although this perhaps could be nudged marginally higher (we are getting close to the 0.2 limit), it won't be far wrong.

A slightly more sophisticated general technique known as cross-validation involves withholding some historical data, training the model on the rest of the data, and seeing if it correctly predicts the data which were withheld. In order to avoid accusations of cheating, it is necessary to use some sort of automatic tuning technique. If the data take the form of a time series which is split into an initial training interval followed by a forecast interval, then this accurately mimics the situation of a real forecast. It is also how new versions of weather prediction systems are tested prior to introduction - repeat the forecasts of the past year (say), and adopt the new system if it shows greater skill than the current one. We demonstrated a simple example of this cross-validation approach in this paper a few years ago, and broadly similar methods can be found throughout the more prediction-focussed corners of the climate research literature (e.g. Reto Knutti used a neural network in this forthcoming paper, training it on half the data and verifying it on the other half). These sorts of formal forecast methodologies have not been widely undertaken in the GCM-building community in the past, partly because until recently there were no computationally affordable automatic tuning methods, and partly because most climate scientists don't have much of a background in prediction and estimation - they are primarily physical scientists with an interest in understanding processes, rather than forecasters whose main aim is to predict the future. But there is now plenty of work going on to bridge this gap, and here's the obligatory plug for the modest contribution we're making in this area :-) Climate scientists may never get to the level that weather forecasting is at, in terms of attaching clear and reliable probabilities to all of our predictions, but we are definitely making progress.
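The time-series version of this idea can be sketched in a few lines of Python. This is only a toy, not the method used in any of the papers mentioned: the "model" is a straight line, the "automatic tuning" is an ordinary least-squares fit, and the data are made up (a steady trend plus a deterministic wiggle standing in for noise). All names here are hypothetical.

```python
import math

def fit_line(ts, ys):
    """Ordinary least-squares fit of y = a + b*t (the 'automatic tuning' step)."""
    n = len(ts)
    t_mean = sum(ts) / n
    y_mean = sum(ys) / n
    sxx = sum((t - t_mean) ** 2 for t in ts)
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in zip(ts, ys))
    b = sxy / sxx
    a = y_mean - b * t_mean
    return a, b

def rmse(predicted, observed):
    """Root-mean-square difference between two equal-length sequences."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed))

# Made-up series: a trend of 0.02 per step plus a small deterministic wiggle.
times = list(range(100))
series = [0.02 * t + 0.1 * math.sin(1.7 * t) for t in times]

# Split into an initial training interval and a later, withheld forecast interval.
split = 50
train_t, train_y = times[:split], series[:split]
test_t, test_y = times[split:], series[split:]

a, b = fit_line(train_t, train_y)        # tuned on the training interval only
forecast = [a + b * t for t in test_t]   # forecast of the withheld interval
withheld_rmse = rmse(forecast, test_y)

print(f"fitted trend {b:.4f} per step, RMSE on withheld data {withheld_rmse:.3f}")
```

The essential discipline is that nothing from the withheld interval is allowed to influence the fit, so the error on that interval is an honest estimate of forecast error rather than a measure of how well the model was tuned.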

4 comments:

1) There seems to be an assumption of statistical stationarity here. What about verification in the context of non-stationarity?

2) You assert that forecast uncertainty is not aleatory. I find this implausible, what about Lorenz? Do you really think that forecasts can be made deterministically?

3) You positively cite Hansen's 1980s forecasts, but what about Bill Gray's equally accurate forecasts of increased hurricane activity? How to differentiate forecasts verified for the right reasons from the others?

4) You don't include the notion of skill here, which requires a naive baseline. Choice of the naive baseline matters for understanding forecast "goodness" -- how to choose this baseline? On ENSO forecasts Knaff and Landsea claim that climatology is overly simplistic since ENSO is cyclical. Should there be a trend line as the naive forecast of future temperature, or is stationarity appropriate?

5) Finally, Murphy differentiates between forecast quality, skill, and value as qualities of forecast goodness. If the ultimate goal is to make forecasts useful to people who make decisions, isn't this degree of precision warranted in such discussions?

That's an interesting set of questions that almost justifies a new post, but:

1. Stationarity would be nice, but it only really applies in a situation of frequentist, aleatory uncertainty - which means it is always an abstraction of the real world. The reliability of a specific forecast is not knowable even in that best case.

2. The Lorenz model is a prime example of uncertainty in a deterministic context. That's not aleatory uncertainty! Randomness in forecasting is a useful way of coping with our uncertainty, it is not intrinsic to the system.

3. I was specifically looking for examples of verification of climate models, but yes I could have also mentioned Gray. It's obviously hard to differentiate between clever and lucky forecasts in any one-off situation, but it seems reasonable to prefer people with a hypothesis (model) which fits a wide range of conditions, versus someone who gives some ad-hoc prediction with no testable method behind the claim (and that goes double when they refuse to bet on their forecast, or make it just happen to fit the consensus over the plausible betting horizon of 20 years or so).

4. I've talked about skill before here and here, and Chris Randles made similar points about the baseline. Of course just happening to extrapolate a 30-year trend for 30-50 years into the future is probably a good forecast right now, but it's only through the models that we know these time scales to be appropriate. Anyone who wants to argue for some sort of trend extrapolation as a baseline for measuring skill would have to show that it would have outperformed stationarity in the past. It would be an interesting question to investigate further. Certainly very simple models give a good simulation of global temperature, when forced appropriately. It is not clear to me that GCMs add a great deal of skill on top of that.

5. Obviously value is what ultimately matters, and that depends on the user(s). Given that current long-term predictions are completely ignored irrespective of what they say, it is hard to argue that they have any value whatsoever :-( However, some people are doing stuff that actually has users over shorter time scales, and I'm hoping to head in that direction myself. If the Tyndall Centre's millennial assessment stops someone from putting a nuclear waste store within a few metres of the coast, then it might prove to be very valuable indeed, despite the sarcastic comments I made about its relevance to mitigation.
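On the baseline question in point 4, one common way to express skill is a mean-squared-error skill score, which compares a forecast with a naive reference forecast. The sketch below is only an illustration of that comparison: the function names are hypothetical and the numbers are made up, not taken from any real temperature record.

```python
def mean_squared_error(predicted, observed):
    """Average squared difference between two equal-length sequences."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)

def skill_score(forecast, baseline, observed):
    """1 = perfect, 0 = no better than the baseline, negative = worse than it.

    Assumes the baseline itself is not perfect (otherwise this divides by zero).
    """
    return 1.0 - mean_squared_error(forecast, observed) / mean_squared_error(baseline, observed)

# Made-up anomalies, purely for illustration.
observed = [0.10, 0.25, 0.30, 0.45]
climatology = [0.0, 0.0, 0.0, 0.0]   # stationarity: no change from the reference period
trend = [0.10, 0.20, 0.30, 0.40]     # extrapolation of a past linear trend

trend_vs_climatology = skill_score(trend, climatology, observed)
print(f"skill of trend extrapolation relative to climatology: {trend_vs_climatology:.3f}")
```

The same function can be turned around to ask how much skill a more complex model adds over trend extrapolation, simply by passing the trend forecast as the baseline. The answer depends entirely on which baseline is chosen, which is why the choice needs justifying against past data.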

James- Thanks. Just a quick follow up on the notion of aleatory uncertainty. You seem to be defining this term to mean randomness. I'd suggest that aleatory uncertainty includes all of those uncertainties that cannot be reduced through new knowledge. From this perspective randomness is thus an example of aleatory uncertainty, but not the same thing.

Well, definitions of aleatory uncertainty aren't always very clear, but give me good enough knowledge of the initial conditions, and I can predict the future of the Lorenz model as far ahead as I want...and given enough knowledge to also build a good model of the weather/climate system, the same is true in real life.
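The Lorenz claim can be illustrated with a small "identical twin" experiment: run the model once as the truth, run it again from a slightly wrong starting point, and see how long the two stay close. This is a sketch under stated assumptions: the model is taken to be perfect, the parameters are the standard Lorenz (1963) values, and the starting point, step size and divergence threshold are arbitrary choices of mine. The function names are hypothetical.

```python
import math

def lorenz_step(state, dt, sigma=10.0, rho=28.0, beta=8.0 / 3.0):
    """Advance the Lorenz (1963) equations by one fourth-order Runge-Kutta step."""
    def deriv(s):
        x, y, z = s
        return (sigma * (y - x), x * (rho - z) - y, x * y - beta * z)

    def shifted(s, k, h):
        return tuple(si + h * ki for si, ki in zip(s, k))

    k1 = deriv(state)
    k2 = deriv(shifted(state, k1, dt / 2))
    k3 = deriv(shifted(state, k2, dt / 2))
    k4 = deriv(shifted(state, k3, dt))
    return tuple(
        s + dt / 6 * (a + 2 * b + 2 * c + d)
        for s, a, b, c, d in zip(state, k1, k2, k3, k4)
    )

def time_to_diverge(initial_error, threshold=1.0, dt=0.01, max_time=60.0):
    """Model time until a run with a perturbed start drifts `threshold` away from the truth."""
    truth = (1.0, 1.0, 1.0)
    forecast = (1.0 + initial_error, 1.0, 1.0)
    for step in range(1, int(max_time / dt) + 1):
        truth = lorenz_step(truth, dt)
        forecast = lorenz_step(forecast, dt)
        if math.dist(truth, forecast) > threshold:
            return step * dt
    return None  # stayed within the threshold for the whole run

for err in (1e-3, 1e-6, 1e-9):
    print(f"initial error {err:g}: forecast stays close for about {time_to_diverge(err)} time units")
```

Each improvement in knowledge of the initial state buys a longer useful forecast, which is the sense in which the uncertainty here reflects ignorance rather than intrinsic randomness. In this toy the gain is ultimately capped by floating-point precision, but that is a limit of the arithmetic, not of the system.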