
Friday, December 4, 2009

Bayesian Climate Model Averaging

Another instalment in the 'lack-of-climate-model-validation-bothers-me' series (see Lindzen's talk for a good intro). I've been reading Jaynes' book lately so naturally the Bayesian approach to the issue seems most germane. The whole climate science / public policy intersection can be viewed as one big decision theory problem (acting to maximize utility under uncertainty). To come out of that game well you generally need to have good models (hopefully with nice, tight predictive distributions) and smooth, gently sloping loss functions. I'll leave the loss functions for now and focus on the modelling aspect (since that's what I'm familiar with).

Validation (comparing the model predictions to experimental observations) is generally what allows you to find out if you've made good choices of what model structure to use, what physics to include and what physics to neglect. In most applications of computational physics this is a straightforward (if sometimes expensive) process. The problem is harder with climate models, because we can't do designed experiments on the Earth.

Here are a couple of choice quotes from Reichler and Kim 2007 about the difficulties of climate model validation.

Several important issues complicate the model validation process. First, identifying model errors is difficult because of the complex and sometimes poorly understood nature of climate itself, making it difficult to decide which of the many aspects of climate are important for a good simulation. Second, climate models must be compared against present (e.g., 1979-1999) or past climate, since verifying observations for future climate are unavailable. Present climate, however, is not an independent data set since it has already been used for the model development (Williamson 1995). On the other hand, information about past climate carries large inherent uncertainties, complicating the validation process of past climate simulations (e.g., Schmidt et al. 2004). Third, there is a lack of reliable and consistent observations for present climate, and some climate processes occur at temporal or spatial scales that are either unobservable or unresolvable. Finally, good model performance evaluated from the present climate does not necessarily guarantee reliable predictions of future climate (Murphy et al. 2004).

The paper quoted above is a comparison of three generations of IPCC-family models. The study shows improvement in prediction of modern climate as the models improve from 1990 to 2001 to 2007. It also shows that the ensemble mean is more skilled than any individual model (more on this later). The reasons given to explain the improvement make intuitive sense:

Two developments, more realistic parameterizations and finer resolutions, are likely to be most responsible for the good performance seen in the latest model generation. For example, there has been a constant refinement over the years in how sub-grid scale processes are parameterized in models. Current models also tend to have higher vertical and horizontal resolution than their predecessors. Higher resolution reduces the dependency of models on parameterizations, eliminating problems since parameterizations are not always entirely physical. That increased resolution improves model performance has been shown in various previous studies (e.g., Mullen and Buizza 2002, Mo et al. 2005, Roeckner et al. 2006).

A problem faced by climate modelers is that it is unlikely that we'll be able to run grid-resolved solutions of the climate within the lifetime of anyone now living (us CFD guys have the same problem with grid resolution scaling for high Reynolds number flows). There will always be a need for 'sub-grid' parameterizations; the hope is that eventually they will become "entirely physical" and well calibrated (if you think they already are, then you have been taken in by someone's propaganda).

Bayesian model averaging (BMA) is one way to account for our uncertainty in model structure / physics choices. Instead of choosing a 'right' model, we get predictive distributions for things we care about by marginalizing over the uncertain model structures (and the uncertain parameters too). This paper shows that it is a useful procedure for short-term forecasting. The benefit with short-term forecasts is that we can evaluate the accuracy by closing the loop between predictions and observations. Min and Hense apply this idea to the IPCC AR4 coupled-climate models. Here's a short snippet from that paper providing some motivation for the use of BMA:

However, more than 50% of the models with anthropogenic-only forcing cannot reproduce the observed warming reasonably. This indicates the important role of natural forcing although other factors like different climate sensitivity, forcing uncertainty, and a climate drift might be responsible for the discrepancy in anthropogenic-only models. Besides, Bayesian and conventional skill comparisons demonstrate that a skill-weighted average with the Bayes factors (Bayesian model averaging, BMA) overwhelms the arithmetic ensemble mean and three other weighted averages based on conventional statistics, illuminating future applicability of BMA to climate predictions.
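To make the marginalization over model structures concrete, here is a minimal sketch of the bookkeeping behind BMA. It is not the procedure from Min and Hense; the function names (`bma_weights`, `bma_mean_and_variance`), the equal prior odds, and all of the numbers are assumptions made up for illustration. Each model gets a posterior probability proportional to its evidence times its prior, and the averaged prediction is a mixture whose variance includes the disagreement between models.

```python
import numpy as np


def bma_weights(log_evidence, log_prior=None):
    """Posterior model probabilities from per-model log evidences."""
    log_evidence = np.asarray(log_evidence, dtype=float)
    if log_prior is None:
        # Assumption: equal prior probability for every model.
        log_prior = np.zeros_like(log_evidence)
    log_post = log_evidence + log_prior
    log_post = log_post - log_post.max()  # avoid underflow in exp
    w = np.exp(log_post)
    return w / w.sum()


def bma_mean_and_variance(weights, means, variances):
    """Mean and variance of the weighted mixture of model predictions."""
    weights = np.asarray(weights, dtype=float)
    means = np.asarray(means, dtype=float)
    variances = np.asarray(variances, dtype=float)
    mean = np.sum(weights * means)
    # Within-model variance plus between-model spread about the BMA mean.
    var = np.sum(weights * (variances + (means - mean) ** 2))
    return mean, var


# Made-up numbers for three hypothetical models.
log_ev = [-10.0, -11.0, -14.0]   # log evidence of each model
means = [0.2, 0.3, 0.5]          # each model's predicted trend
variances = [0.01, 0.02, 0.01]   # each model's predictive variance

w = bma_weights(log_ev)
mean, var = bma_mean_and_variance(w, means, variances)
print(w, mean, var)
```

The ratio of any two weights is the Bayes factor between those models (times the prior odds), which is the "skill-weighted average with the Bayes factors" idea in the quote above.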

The ensemble means or Bayesian averages tend to outperform individual models, but why is this? Here's what R&K2007 has to say:

Our results indicate that multi-model ensembles are a legitimate and effective means to improve the outcome of climate simulations. As yet, it is not exactly clear why the multi-model mean is better than any individual model. One possible explanation is that the model solutions scatter more or less evenly about the truth (unless the errors are systematic), and the errors behave like random noise that can be efficiently removed by averaging. Such noise arises from internal climate variability (Barnett et al. 1994), and probably to a much larger extent from uncertainties in the formulation of models (Murphy et al. 2004; Stainforth et al. 2005).
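The "errors behave like random noise" explanation is easy to check on toy data. The sketch below is a simulation under stated assumptions, not an analysis of any real climate model: errors are Gaussian, independent between models, and optionally share a common offset (`shared_bias`, a hypothetical parameter). Averaging shrinks the independent part roughly like one over the square root of the ensemble size, while the shared part is untouched.

```python
import numpy as np

rng = np.random.default_rng(0)
n_trials = 20000


def rmse_of_ensemble_mean(n_models, noise_sd, shared_bias):
    """RMSE of the mean of n_models simulated model errors."""
    # Each row is one trial; each column is one simulated model's error.
    errors = shared_bias + noise_sd * rng.standard_normal((n_trials, n_models))
    ens_mean_error = errors.mean(axis=1)
    return np.sqrt(np.mean(ens_mean_error ** 2))


sizes = [1, 4, 16]
# Independent errors only: RMSE should fall roughly as 1/sqrt(N).
unbiased = [rmse_of_ensemble_mean(n, 1.0, 0.0) for n in sizes]
# Same noise plus an offset common to all models: RMSE cannot drop below it.
biased = [rmse_of_ensemble_mean(n, 1.0, 0.5) for n in sizes]

for n, u, b in zip(sizes, unbiased, biased):
    print(f"N={n:2d}  independent: {u:.3f}  with shared bias: {b:.3f}")
```

In this toy setting the second column levels off near the shared offset no matter how many models are averaged, which is the caveat hiding in the quote's "unless the errors are systematic".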

Another interesting paper that explores this finds that models which have good scores on the calibration data do not tend to outperform other models over a subsequent validation period.

Error in the ensemble mean decreases systematically with ensemble size, N, and for a random selection as approximately 1/N^a, where a lies between 0.6 and 1. This is larger than the exponent of a random sample (a = 0.5) and appears to be an indicator of systematic bias in the model simulations.
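An exponent like the one in this quote can be estimated with a straight-line fit in log-log space. The sketch below only shows the mechanics; the data are synthetic, generated with a known exponent of 0.8, and the helper name `fit_power_law_exponent` is my own. It does not reproduce the paper's numbers, and the fitted value will depend on which error metric is used.

```python
import numpy as np


def fit_power_law_exponent(n, err):
    """Least-squares fit of log(err) = log(c) - a * log(n); returns (a, c)."""
    slope, intercept = np.polyfit(np.log(n), np.log(err), 1)
    return -slope, np.exp(intercept)


# Synthetic example: error = 2 / N^0.8 at five ensemble sizes.
n = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
err = 2.0 * n ** -0.8

a, c = fit_power_law_exponent(n, err)
print(f"fitted exponent a = {a:.3f}, prefactor c = {c:.3f}")
```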

The systematic bias should not be surprising: it is very difficult to get all of the physics right (and remove the bias) when you aren't able to do no-kidding validation experiments. The authors begin their conclusion with:

In our analysis there is no evidence of future prediction skill delivered by past performance-based model selection. There seems to be little persistence in relative model skill, as illustrated by the percentage turnover in Figure 3. We speculate that the cause of this behavior is the non-stationarity of climate feedback strengths. Models that respond accurately in one period are likely to have the correct feedback strength at that time. However, the feedback strength and forcing is not stationary, favoring no particular model or groups of models consistently.

This means it is very difficult to protect ourselves from 'over-fitting' the models to our available historical record, and it certainly indicates that we should be cautious in basing policy decisions on climate model forecasts. The 'science is settled' crowd, while busy banging the consensus drum and clamouring for urgent action (NOW!), never seem to offer this sort of nuanced approach to policy though.

If you have read any good, recent climate model validation papers, please post them in the comments. Please don't post polemics about polar bears and arctic sea ice; my skepticism is honest, and your activism should be too.

11 comments:

First paragraph of the abstract: Ensembles used for probabilistic weather forecasting often exhibit a spread-skill relationship, but they tend to be underdispersive. This paper proposes a principled statistical method for postprocessing ensembles based on Bayesian model averaging (BMA), which is a standard method for combining predictive distributions from different sources. The BMA predictive probability density function (PDF) of any quantity of interest is a weighted average of PDFs centered around the individual (possibly bias-corrected) forecasts, where the weights are equal to posterior probabilities of the models generating the forecasts, and reflect the models’ skill over the training period. The BMA PDF can be represented as an unweighted ensemble of any desired size, by simulating from the BMA predictive distribution. The BMA weights can be used to assess the usefulness of ensemble members, and this can be used as a basis for selecting ensemble members; this can be useful given the cost of running large ensembles.
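The step of representing the BMA PDF "as an unweighted ensemble of any desired size" amounts to sampling from a mixture. Here is a rough sketch, assuming Gaussian component PDFs and made-up weights, forecasts and spreads; the function name `sample_bma_mixture` is hypothetical, and the paper's own choice of component distribution and fitting method is not shown here.

```python
import numpy as np


def sample_bma_mixture(weights, forecasts, sigmas, size, rng):
    """Draw an unweighted ensemble from a weighted mixture of Gaussians."""
    weights = np.asarray(weights, dtype=float)
    forecasts = np.asarray(forecasts, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    # Pick a model for each draw with probability equal to its BMA weight,
    # then draw from the PDF centered on that model's forecast.
    member = rng.choice(len(weights), size=size, p=weights)
    return rng.normal(forecasts[member], sigmas[member])


rng = np.random.default_rng(1)
weights = [0.6, 0.3, 0.1]        # assumed posterior model probabilities
forecasts = [14.0, 15.0, 17.0]   # assumed (bias-corrected) forecasts
sigmas = [0.5, 0.5, 1.0]         # assumed spread of each component PDF

draws = sample_bma_mixture(weights, forecasts, sigmas, 50000, rng)
print(draws.mean(), np.percentile(draws, [5, 95]))
```

Every draw counts equally, so quantiles of `draws` can be read off directly as predictive intervals.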

Abstract: Projections of future climate change caused by increasing greenhouse gases depend critically on numerical climate models coupling the ocean and atmosphere (GCMs). However, different models differ substantially in their projections, which raises the question of how the different models can best be combined into a probability distribution of future climate change. For this analysis, we have collected both current and future projected mean temperatures produced by nine climate models for 22 regions of the earth. We also have estimates of current mean temperatures from actual observations, together with standard errors, that can be used to calibrate the climate models. We propose a Bayesian analysis that allows us to combine the different climate models into a posterior distribution of future temperature increase, for each of the 22 regions, while allowing for the different climate models to have different variances. Two versions of the analysis are proposed, a univariate analysis in which each region is analyzed separately, and a multivariate analysis in which the 22 regions are combined into an overall statistical model. A cross-validation approach is proposed to confirm the reasonableness of our Bayesian predictive distributions. The results of this analysis allow for a quantification of the uncertainty of climate model projections as a Bayesian posterior distribution, substantially extending previous approaches to uncertainty in climate models.

"A difficulty with this kind of Bayesian analysis is how to validate the statistical assumptions. Of course, direct validation based on future climate is impossible. However the following alternative viewpoint is feasible: if we think of the given climate models as a random sample from the universe of possible climate models, we can ask ourselves how well the statistical approach would do in predicting the response of a new climate model. This leads to a cross-validation approach. In effect, this makes an assumption of exchangability among the available climate models."

"There are of course some limitations to what these procedures can achieve. Although the different climate modeling groups are independent in the sense that they consist of disjoint groups of people, each developing their own computer code, all the GCMs are based on similar physical assumptions and if there were systematic errors affecting future projections in all the GCMs, our procedures could not detect that. On the other hand, another argument sometimes raised by so-called climate skeptics is that disagreements among existing GCMs are sufficient reason to doubt the correctness of any of their conclusions. The methods presented in this paper provide some counter to that argument, because we have shown that by making reasonable statistical assumptions, we can calculate a posterior density that captures the variability among all the models, but that still results in posterior-predictive intervals that are narrow enough to draw meaningful conclusions about probabilities of future climate change."

In other words, the predictive distributions are informative (they're not just uniform distributions), but we still can't protect ourselves from systematic bias (which the results cited in the post above seem to indicate is present). This unquantified risk to decision making is the fundamental problem that lack of model validation admits.

By assuming the absence of significant systematic error are we not, in effect, assuming validation? That is, it seems that by ignoring the need to consider systematic error we then get to ignore the need for validation.

Rather we get this concept of "cross-validation." A process that seems to involve taking a collection of unvalidated models and using their output to validate some other unvalidated model. (Repeat until all unvalidated models are validated.)

But really, how large and dynamic is the systematic error? Believing it large or small seems here a matter of faith -- a prior.

It is almost like using an ensemble of religions to "cross-validate" a belief in the existence of God. And using it to counter the "argument sometimes raised by so-called [religious] skeptics [...] that disagreements among existing [religions] are sufficient reason to doubt the correctness of any of their conclusions."

Also, since Bayesian priors can be anything, this affects the outcome of conditioning. If my prior is a near certain belief that the climate modelers basically behave as a "herd" (say, because they all used the same historical climate data for tuning their differing parameterizations, etc.) then a basic consistency among the various models will only strengthen this prior. Not weaken it.

By assuming the absence of significant systematic error are we not, in effect, assuming validation?

That's my concern.

... how large and dynamic? Believing it large or small seems here a matter of faith...

Not quite; this paper that I linked in the post seems to indicate that the systematic bias is significant. Unfortunately, since we can't (or are too impatient to) do validation testing for climate models like we normally would for numerical weather prediction, or CFD, or [pick your simulation], we can't estimate the sign or magnitude of the bias (and then of course control for it in our new and improved model).

...since Bayesian priors can be anything...

I think that's why they chose uninformative priors; they want to avoid criticism that they are 'cooking the books'.

...the climate modelers basically behave as a "herd"...all used the same historical climate data...

...a basic consistency among the various models will only strengthen this prior. Not weaken it.

What am I missing here?

I don't think you are missing anything; the cross-validation approach still doesn't protect us from fooling ourselves the way real empirical validation would (it's an unfortunate similarity of terminology too because the two 'validations' are not the same thing at all).

Also, that 'herd' behaviour and 'spread-skill' relationship (less spread means better predictions, more spread means worse predictions) is exhibited by the weather prediction ensembles, but the model that tends to perform well on the training set changes as the training set moves forward in time (and the optimal length of the training set changes based on the thing you are trying to forecast and how far ahead you are trying to forecast it). The reason the BMA approach works well there is that we have a chance to close the loop with new observations every day (and gradually change the weights we give to each model).

I think it's still applicable to climate model forecasting, but I don't think we have the political will to do validation because we have to wait much longer to close the loop. Unfortunately, calling for a decade or two of climate forecast validation, and tying policy decisions to gradual changes over decades isn't exactly compatible with urgent calls to decisive action (even if it is compatible with rational decision making, I mean we're talking about a process with time-scales on the order of decades, centuries and millennia right?).

But really, how large and dynamic is the systematic error? Believing it large or small seems here a matter of faith -- a prior.

Here's a paper that treats the bias problem that way (from the abstract): "[...] In addition, unlike previous studies, our methodology explicitly considers model biases that are allowed to be time-dependent (i.e. change between control and scenario period). More specifically, the model considers additive and multiplicative model biases for each RCM and introduces two plausible assumptions (“constant bias” and “constant relationship”) about extrapolating the biases from the control to the scenario period. The resulting identifiability problem is resolved by using informative priors for the bias changes. A sensitivity analysis illustrates the role of the informative prior. [...] Our results show the necessity to consider potential bias changes when projecting climate under an emission scenario. Further work is needed to determine how bias information can be exploited for this task."

Comment on article by Sanso et al.: "But GCM natural variability is a property of the GCM: it does not proxy the difference between the GCM and the climate system. In climate science this has been appreciated and discussed, but only recently has there been a genuine effort to determine a variance for the model structural error that is not based on internal variability (Murphy et al. 2007)."

"Sanso et al. present us with diagnostics based on holding-out 43 of the 426 evaluations from Y, and then predicting the model response on the hold-out and comparing it with the actual values. However, I suspect that there is plenty of information about MIT2DCM from the 383 evaluations that remain in Y. The experimental design for Y was a multi-level grid, so the evaluations that remain will almost certainly still do a good job of spanning the three-dimensional model-parameter space. Therefore I am not surprised that the diagnostics show that the hold-out sample is predicted well, but I am not sure that this tells us much about the statistical model for θ∗, W, Y, z, or the reliability of Sanso et al.’s conclusions about the updated distribution for θ∗: the verdict on the statistical model from the evidence in the paper is ‘unproven’."

"I particularly commend the use of a statistical model to link model evaluations, model parameters, and system observations. This, and the inclusion of an explicit term for model structural error, are major steps forward for Climate Science."

A Bayesian Framework for Multimodel Regression

Abstract: This paper presents a framework based on Bayesian regression and constrained least squares methods for incorporating prior beliefs in a linear regression problem. Prior beliefs are essential in regression theory when the number of predictors is not a small fraction of the sample size, a situation that leads to overfitting—that is, to fitting variability due to sampling errors. Under suitable assumptions, both the Bayesian estimate and the constrained least squares solution reduce to standard ridge regression. New generalizations of ridge regression based on priors relevant to multimodel combinations also are presented. In all cases, the strength of the prior is measured by a parameter called the ridge parameter. A “two-deep” cross-validation procedure is used to select the optimal ridge parameter and estimate the prediction error.

The proposed regression estimates are tested on the Development of a European Multimodel Ensemble System for Seasonal to Interannual Prediction (DEMETER) hindcasts of seasonal mean 2-m temperature over land. Surprisingly, none of the regression models proposed here can consistently beat the skill of a simple multimodel mean, despite the fact that one of the regression models recovers the multimodel mean in a suitable limit. This discrepancy arises from the fact that methods employed to select the ridge parameter are themselves sensitive to sampling errors. It is plausible that incorporating the prior belief that regression parameters are “large scale” can reduce overfitting and result in improved performance relative to the multimodel mean. Despite this, results from the multimodel mean demonstrate that seasonal mean 2-m temperature is predictable for at least three months in several regions.
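The claim that the Bayesian estimate reduces to standard ridge regression can be checked numerically in the simplest case. The sketch below assumes a zero-mean Gaussian prior with variance `prior_var` on every coefficient and Gaussian noise with variance `noise_var`; under those assumptions the posterior mean equals the ridge solution with ridge parameter `noise_var / prior_var`. The data are random and the function names are mine; the paper's multimodel-specific priors and its "two-deep" cross-validation are not reproduced here.

```python
import numpy as np


def ridge_estimate(X, y, lam):
    """Standard ridge regression: solve (X'X + lam*I) beta = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)


def gaussian_posterior_mean(X, y, noise_var, prior_var):
    """Posterior mean of beta for y = X beta + noise, beta ~ N(0, prior_var*I)."""
    p = X.shape[1]
    precision = X.T @ X / noise_var + np.eye(p) / prior_var
    return np.linalg.solve(precision, X.T @ y / noise_var)


# Random stand-in data: 20 "hindcasts" from 5 hypothetical models.
rng = np.random.default_rng(2)
X = rng.standard_normal((20, 5))
true_beta = np.array([0.4, 0.3, 0.2, 0.1, 0.0])
y = X @ true_beta + 0.5 * rng.standard_normal(20)

noise_var, prior_var = 0.25, 0.1
beta_ridge = ridge_estimate(X, y, lam=noise_var / prior_var)
beta_bayes = gaussian_posterior_mean(X, y, noise_var, prior_var)
beta_ols = ridge_estimate(X, y, lam=0.0)  # no prior: ordinary least squares

print(np.allclose(beta_ridge, beta_bayes))
print(np.linalg.norm(beta_ridge), np.linalg.norm(beta_ols))
```

A tighter prior (smaller `prior_var`) means a larger ridge parameter and more shrinkage of the coefficients toward zero, which is the sense in which "the strength of the prior is measured by" the ridge parameter.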

It is also a good illustration of Jaynes' claim that properly applied probability theory (Bayesian) does away with the multitude of ad-hoceries in the standard statistical toolbox: The purpose of this paper is to clarify the fact that a wide variety of methods for reducing overfitting in linear regression problems, including many of those mentioned above, can be interpreted in a single Bayesian framework. Bayesian theory allows one to incorporate “prior knowledge” in the estimation process.

James Annan has a short comment [pdf] in press, here's an interesting paragraph: Min and Hense [2006] suggest another alternative to the reporting of probabilities, explicitly treating the issue as a decision problem in which the expected loss is to be minimised and thus emphasising the close link between Bayesian probability and decision theory. The companion paper Min and Hense [2007] considers the issue of D&A on a regional and seasonal basis. Uncertainties are relatively higher at smaller scales, and moreover it is on a local basis that climate change will actually impact the environment. Therefore, this area of research is likely to remain important long after the main questions of climate change on the global scale are considered settled.

I like this part of their conclusion: In contrast, we think that the Bayesian approach is not only flexible but facilitates an open debate on the assumptions that generate probabilistic forecasts.

Making the assumptions explicit is a big step towards productive discussion and consensus building.