Tuesday, November 15, 2011

How not to compare models to data again

It's been bugging me for some time, the way many people talk about the CMIP3/IPCC ensemble not spanning the "full/true" (both terms appear in the literature) range of uncertainty. For example, the IPCC AR4 says "the current AOGCMs may not cover the full range of uncertainty for climate sensitivity". The first error here is in the apparent belief that there even is such a thing as a "full" or "true" range of uncertainty that the models ought to represent. Within the Bayesian paradigm, uncertainty is an indication of the belief of the researcher(s) involved, and is not intrinsic to the system being considered. So, at best, it might be legitimate to say that the models do not represent my/our/the IPCC authors' uncertainty adequately. This could perhaps be dismissed as a pedantic quibble, were it not for the way that this category error concerning the nature of the uncertainty underpins the justification of the claim. I finally got around to writing this argument up, and it has now been accepted for publication. So here is the outline of the case.

The IPCC statement is actually a summary of a lot of probabilistic estimates, as presented in their Box 10.2 Fig 1:

The top left panel has a lot of pdfs for the equilibrium climate sensitivity, generated by various authors, with their 90% probability ranges presented as bars on the right. In the bottom right panel, we have the range of sensitivity values from the CMIP3 ensemble (pale blue dots). Clearly, the spread of the latter is narrower than most/all of the former, which is the basis for the IPCC statement. There are numerous examples of similar statements in the literature, too (and not restricted to climate sensitivity).

Many of the pdfs are based on some sort of Bayesian inversion of the warming trend over the 20th century (often, both surface air temp and ocean heat uptake data are used). This calculation requires a prior pdf for the sensitivity and perhaps other parameters. And herein lies the root of the problem.

Consider the following trivial example: We have an ensemble of models, each of which provides an output "X" that we are interested in. Let's assume that this set of values is well approximated by the standard Gaussian N(0,1). Now, let's also assume we have a single observation which takes the value 1.5, and which has an associated observational uncertainty of 2. The IPCC-approved method for evaluating the ensemble is to perform a Bayesian inversion on the observation, which in this trivial case will (assuming a uniform prior) result in the "observationally-constrained pdf" for X of N(1.5,2).
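For anyone who wants to see the arithmetic, here is a minimal numerical sketch of this toy calculation (using numpy and scipy, with a grid standing in for the analytical solution):

```python
import numpy as np
from scipy import stats

# Ensemble spread for X: N(0, 1).  Single observation: 1.5, error sd 2.
# Grid-based Bayesian update with a uniform prior on [-10, 10]:
x = np.linspace(-10, 10, 4001)
likelihood = stats.norm.pdf(1.5, loc=x, scale=2.0)  # p(obs | X = x)
posterior = likelihood / likelihood.sum()           # flat prior cancels

mean = (x * posterior).sum()
sd = np.sqrt(((x - mean) ** 2 * posterior).sum())
print(round(mean, 2), round(sd, 2))  # close to 1.5 and 2.0, i.e. N(1.5, 2)
```

The uniform prior drops out of the calculation, so the "observationally-constrained pdf" is just the normalised likelihood, N(1.5,2), apart from negligible truncation at the prior's edges.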

It seems that the model ensemble is a bit biased, and substantially too narrow, and therefore does not cover the "full range of uncertainty" according to the observation, right?

No, actually, this inference is dead wrong. Perhaps the most striking and immediate way to convince yourself of this is to note that if this method was valid, then it would not matter what value was observed - so long as it had an observational uncertainty of 2, we would automatically conclude that the ensemble was too narrow and (with some non-negligible probability) did not include the truth. Therefore, we could write down this conclusion without even bothering to make this inaccurate observation at all, just by threatening to do so. And what's worse, the more inaccurate the (hypothetical) observation is, the worse our ensemble will appear to be! I hope it is obvious to all that this state of affairs is nonsensical. An observation cannot cause us to reject the models more strongly as it gets more inaccurate - rather, the limiting case of a worthless observation tells us absolutely nothing at all.
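This observation-independence is easy to check numerically: whatever value happens to be observed, the uniform-prior posterior has the same fixed width, so the "ensemble too narrow" verdict is guaranteed before any measurement is made. A small sketch (the particular observed values are arbitrary):

```python
from scipy import stats

# The ensemble's 90% range under N(0, 1) has width ~3.29...
ens_width = stats.norm.ppf(0.95) - stats.norm.ppf(0.05)

# ...but whatever value y is observed (error sd 2), the uniform-prior
# "observationally constrained" pdf is N(y, 2), whose 90% width is
# fixed in advance of the observation:
for y in [-3.0, 0.0, 1.5, 4.0]:
    width = (stats.norm.ppf(0.95, loc=y, scale=2.0)
             - stats.norm.ppf(0.05, loc=y, scale=2.0))
    print(y, round(width, 2))  # always ~6.58, twice the ensemble width
```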

That's all very well as a theoretical point, but it needs a practical application. So we also performed a similar sort of calculation for a more realistic scenario, more directly comparable to the IPCC situation. Using a simple energy balance model (actually the two-box model discussed by Isaac Held here, which dates at least to Gregory if not before), we used surface air temperature rise and ocean heat uptake as constraints on sensitivity and the ocean heat uptake efficiency parameter. The following fig shows the results of this, along with an ensemble of models (blue dots) which are intended to roughly represent the CMIP3 ensemble (in that they have a similar range of equilibrium sensitivity, ocean heat uptake efficiency, and transient climate sensitivity).

The qualitative similarity of this figure to several outputs of the Forest, Sokolov et al group is not entirely coincidental, and it should be clear that if we integrate out the ocean heat uptake efficiency, the marginal distributions for sensitivity (of the Bayesian estimate, and "CMIP3" ensemble) will be qualitatively similar to those in the IPCC figure, with the Bayesian pdf of course having a greater spread than the "CMIP3" proxy ensemble. Just as in the trivial Gaussian case above, we can check that this will remain true irrespective of the actual value of the observations made. Thus, we have another case where it may seem intuitively reasonable to state that the ensemble "may not represent the full range of uncertainty", but in fact it is clear that this conclusion could, if valid, be stated without the need to trouble ourselves by actually making any observations. Therefore, it can hardly be claimed that this result was due to the observations actually indicating any problem with the ensemble.

So let's have another look at what is going on.

The belief that the posterior pdf correctly represents the researchers' views depends on the prior also correctly representing their prior views. But in this case, the low confidence in the models is imposed at the outset, and is not something generated by the observations. In the trivial Gaussian case, the models represent the prior belief that X should (with 90% probability) lie in [-1.64,1.64], but a uniform prior on [-10,10] only assigns 16% probability to this range. The posterior probability of this range, once we update with the observation 1.5±2, has actually tripled to 47%. Similarly, in the energy balance example, the prior we used only assigns 28% probability to the 90% spread of the models, and this probability doubles to 56% in the posterior. So the correct interpretation of the results is not that the observations have shown up any limitation in the model ensemble, but rather that if one starts out with a strong prior presumption that the models are unlikely to be right, then although the observations actually substantially increase our faith in the models, they are not sufficient to persuade us to be highly confident in them.
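These probabilities are easy to reproduce; a sketch using scipy (under the uniform prior, the posterior is just N(1.5,2) truncated to [-10,10]):

```python
from scipy import stats

lo, hi = -1.64, 1.64  # the ensemble's 90% range under N(0, 1)

# Prior probability of that range under U[-10, 10]:
prior_p = (hi - lo) / 20.0                        # about 0.16

# Posterior probability after observing 1.5 +/- 2 (the posterior is
# N(1.5, 2) truncated to the prior's support [-10, 10]):
post = stats.truncnorm((-10 - 1.5) / 2.0, (10 - 1.5) / 2.0,
                       loc=1.5, scale=2.0)
post_p = post.cdf(hi) - post.cdf(lo)              # about 0.47

print(round(prior_p, 2), round(post_p, 2))
```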

Fortunately, there is an alternative way of looking at things, which is to see how well the ensemble (or more generally, probabilistic prediction) actually predicted the observation. This is not new, of course - quite the reverse, it is surely how most people have always evaluated predictions. There is a minor detail which is important to be aware of, which is that if the observation is inaccurate, then we must generate a prediction of the observation, rather than the truth, in order for the evaluation to be fair. (Without this detail, a mismatch between prediction and observation may be due to observational error, and it would be incorrect to interpret this as a predictive failure). One important benefit of this "forward" procedure is that it takes place entirely in observation-space, so we don't need to presume any direct correspondence between the internal parameters of the model, and the real world. It also eliminates the need to perform any difficult inversions of observational procedures.

For the trivial numerical example, the predictive distribution for the observation is given by N(0,2.2) (with 2.2 being sqrt(1^2+2^2), since the predictive and observational uncertainties are independent and add in quadrature). That is the solid blue curve in the following figure:

The observed value of 1.5 obviously lies well inside the predictive interval. Therefore, it is hard to see how this observation can logically be interpreted as reducing our confidence in the models. We can also perform a Bayesian calculation, starting with a prior that is based on the ensemble, and updating with the observation. In this case, the posterior (magenta dotted curve above) is N(0.3,0.9) and this assigns a slightly increased probability of 92% to the prior 90% probability range of [-1.64,1.64]. Thus, the analysis shows that if we started out believing the models, the observation would slightly enhance our confidence in them.
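Both numbers can be checked with the standard conjugate-normal update:

```python
import numpy as np
from scipy import stats

# Predictive distribution for the observation: N(0, sqrt(1^2 + 2^2)).
pred_sd = np.sqrt(1.0**2 + 2.0**2)     # about 2.24
# The observed 1.5 is only ~0.67 predictive sd from the centre.
print(round(1.5 / pred_sd, 2))

# Conjugate-normal update: prior N(0, 1) from the ensemble, obs 1.5 +/- 2.
prec = 1 / 1.0**2 + 1 / 2.0**2
post = stats.norm(loc=(1.5 / 2.0**2) / prec, scale=np.sqrt(1 / prec))
p_range = post.cdf(1.64) - post.cdf(-1.64)
print(round(post.mean(), 2), round(post.std(), 2), round(p_range, 2))
# posterior N(0.3, 0.89), assigning ~92% to [-1.64, 1.64]
```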

For the more realistic climate example, the comparison is performed between the actual air temperature trends of the models, and their ocean heat gains. The red dot in the figure below shows the pair of observed values:

This shows good agreement for the energy balance models (blue dots - the solid contours are the predictive distribution accounting for observational uncertainty), and also for the real CMIP3 models (purple crosses), so again the only conclusion we can reasonably draw from these comparisons is that these observations fail to show any weakness in the models.

The take-home point is that observations can only conflict with a probabilistic prediction (such as that arising from the simple "democratic" interpretation of the IPCC ensemble) through being both outside (in the extreme tail of) the model range, and also precise, such that they constrain the truth to lie outside the predictive range. While this may seem like a rather trivial point, I think it's an important one to present, in view of how the erroneous but intuitive interpretation of these Bayesian inversions has come to dominate the consensus viewpoint. It was a pleasant surprise (especially after this saga) that it sped through the review process with rather encouraging comments.

When I've heard "doesn't represent the full range..." I've always interpreted it as something along the lines of "the ensemble of opportunity is not the ensemble you'd get if you explicitly asked researchers to characterize their (subjective) uncertainty." Is there an egregious example of someone interpreting it the other way ( ensemble != 'true' uncertainty )? (Maybe I should just wait for you to post the paper?)

Re "Is there an egregious example of someone interpreting it the other way."

I suggest reading the following post. While it doesn't give much in the way of specific details of an egregious example, it does indicate James thinks there is widespread wrong interpretation of results and the ensemble is performing pretty well while researchers think it is doing badly.

Part of the problem is that people have started to adopt the Bayesian estimates as their personal beliefs, and to dress them up in this language, which enables them to present the results to the anti-Bayesians (of which there is still a substantial body) without the latter actually understanding what went into the sausage factory...

Here is a random example of someone (cited in that IPCC figure) saying as the result of their simple analysis "there is a 70 percent chance that [sensitivity] exceeds the maximum IPCC value". In fact he is one of the anti-Bayesians who thinks he did that calculation without using any prior at all.

It is quite clear from discussions I have had that some people really believe that these sort of analyses point to there being a problem with the models being too similar or not having a broad enough spread (at least probabilistically). But as we have shown, this is simply not a valid deduction.

When you say "uncertainty ... is not intrinsic to the system being considered" -- can you say that a few different ways to maybe make it clearer to readers like me who just barely stumbled through calculus and stat 101, many decades ago?

The notion that there is a true range of uncertainty seems intuitive (" ... when I finally got in touch with my feelings I found out I can't trust them ...")

I could imagine you mean any of:

The uncertainty we are interested in is not the uncertainty in the system being considered?

The system being considered has no intrinsic uncertainty?

The intrinsic uncertainty of the system being considered is hidden under the carpet/in the box with Schrodinger's cat/discarded because the real end has only one result so the other possibilities don't matter?

It seems like "models too similar" is a bit of a red herring. Should some teams deliberately use incorrect numerical methods, or parameters they don't believe in, in order to increase the spread? I'm sure there are some verification issues around reliance on shared code and so forth, but beyond some point the burden ought to be on the critic to identify an actual problem or unrepresented alternative approach.

"Not having broad enough spread" only makes sense if the ensemble of best guesses understates some variability that you'd get if you asked each team to contribute a few additional runs that were representative of their subjective probability distributions. That seems like a testable hypothesis, but expensive to carry out.

My guess is that the idea that the spread is too narrow resonates with a lot of people due to the considerable evidence about overconfidence of expert forecasts. Most of that originates in social science and model-free settings, so it isn't obviously relevant to climate, but it certainly shapes a lot of people's gut reaction to modeling.

Hank - basically, the answer is that the uncertainty we are interested in is not actually a property of the system at all.

Consider the following:

"Someone" creates a new climate model and compiles it into an executable. You can't examine the code directly, but just see some of the outputs as it simulates the evolution of historical climate (change). Your observations of the outputs get increasingly limited and inaccurate as you look further back in time.

The question you want to know the answer to is, what is the equilibrium sensitivity of the model? This is a well-defined value - run the model at 280 and 560ppm CO2 for a long time, and the temp difference will converge to a particular value, with essentially as much precision as you want. But you aren't allowed to actually do this experiment.

Your uncertainty in the model's climate sensitivity is a property of you, not of the model. The model has a fixed value, you just don't know it. There is no "true" distribution of sensitivity (other than the delta-function at the correct value, for pedants). If you do a Bayesian estimation based on what the model does over the 20th century or LGM etc, you have to start with a prior belief about the sensitivity, which may depend on your judgements about the "someone" who built the model. Depending on your prior, and the particular properties of the model, you will probably end up with a substantial uncertainty in its sensitivity - just as most people find when they do this with the real world.

Tom, this is exactly the argument I have tried to make in writing, in this manuscript. However, I haven't worked out how to publish it (was rejected by GRL, no real surprise there).

I would certainly not claim to have proved that the ensemble is definitely broad enough - fundamentally, I think this is basically unprovable (and indeed false at some level of detail - eg the raw model predictions for hurricanes are generally wrong, cos they don't resolve them). But most of the claims to have shown that the model spread is grossly inadequate in large-scale average variables such as temperature trend are not well-founded IMO.

Regarding the example of the true sensitivity distribution of the black-box climate model: If the model were stochastic and path-dependent, it could be that different realizations would converge to different equilibrium temperatures for the same forcings. Therefore one could argue that its sensitivity does have some true distribution.

If this is also true of the real system, it seems that it would amplify the point of your original claim against truth-centrism ("First, since we don't know the truth (in the widest sense) we have no possible way of generating models that scatter evenly about it."). Even if we did know the truth, i.e. we had an exact replica of the black box model, up to stochastic inputs, and could generate an ensemble that reflected the true distribution of outcomes, there's no reason to think that 'reality' (a single realization of the original model) would lie at the center of that.

Yes, I deliberately glossed over the state/path-dependence issue and "real" stochasticity. This can indeed lead to a small uncertainty in the equilibrium achieved in a given experiment, but this uncertainty is very small indeed in model world, and I am sure that most scientists think it to be small in the real world too - so long as we are only talking about moderate differences in climate state, compatible with the present day climate.

For fun, and perhaps to shed some light on the conclusion that the uncertainty is not a function of the system:

"Hank - basically, the answer is that the uncertainty we are interested in is not actually a property of the earth at all.

Consider the following:

"Someone" creates a new planet. You can't examine the planet directly, but just see some of the data as it evolves its climate history (change). Your observations of the data get increasingly limited and inaccurate as you look further back in time.

The question you want to know the answer to is, what is the equilibrium sensitivity of the planet? This is a well-defined value - run the planet at 280 and 560ppm CO2 for a long time, and the temp difference will converge to a particular value, with essentially as much precision as you want. But you aren't allowed to actually do this experiment. (Alas, we're doing it.)

Your (sic - kind of prejudices the case, try "The") uncertainty in the planet's climate sensitivity is a property of you, not of the planet. The planet has a fixed value, you just don't know it. There is no "true" distribution of sensitivity (other than the delta-function at the correct value, for pedants). If you do a Bayesian estimation based on what the planet does over the 20th century or LGM etc, you have to start with a prior belief about the sensitivity, which may depend on your judgements about the "someone" who built the planet. Depending on your prior, and the particular properties of the planet, you will probably end up with a substantial uncertainty in its sensitivity - just as most people find when they do this with the real planet."

Hank said: "When you say 'uncertainty ... is not intrinsic to the system being considered'"

Why the lengthy posts?

Consider a weighted die: person A has seen 10 rolls averaging 4.5, and person B has seen 8 rolls averaging 5.

The uncertainty of the system, e.g. in the average of 10,000 rolls, is very low and is not the uncertainty we are dealing with, which is that of person A, of person B, or perhaps of some omniscient observer who has seen all the rolls of the die. This uncertainty is clearly intrinsic to the person, not the system.

In NWP land I use the term 'true' uncertainty to mean an uncertainty that yields well calibrated probability forecasts, i.e. neither over nor underconfident. It's not a property of me personally, but an output of a forecast system.

Of course you can't do this for climate sensitivity modelling, but I'd argue there is still a correct uncertainty that you obtain by optimally combining all the available evidence.

ac, that seems understandable to some extent, but what would you say if someone else (perhaps even yourself in a few years time, performing a reanalysis with a better model) came up with a different system that also gave perfectly calibrated predictions but which differed from your "true" values?

As for "correct", well I'd certainly agree there is such a thing as a "correct" calculation according to Bayes but it depends on what you put in - including the prior, and there may be genuine disagreement there even if nowhere else. In practice, there is plenty of room for judgement in interpreting the available evidence too.

Well I wanted to make the link to the climate system as clear as possible. But your example is also useful in allowing us to consider the interplay between the two types of uncertainty. If asked for a prediction of the next throw, A and B may well both give probabilistic answers which are somewhat similar not only to each other but also may (with a bit of luck) be reasonably close to the "true" probability distribution of the die - there really is a "true" uncertainty here (the long-term frequency distribution of the die), which is relevant to the prediction. However, if asked for the mean of 10,000 throws, A and B will still give rather broad answers, but this time they will have a far greater spread than the intrinsic uncertainty of the experimental results - which isn't quite a delta function, but which would be pretty tight (eg if the 10,000 throws experiment was repeatedly performed). Climate sensitivity, being essentially determined by the long-term average over the attractor of a chaotic system which we don't understand very well, is essentially an example of the second case.
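For those who like simulations, here is a sketch of the die example; the face weights below are made up purely for illustration, the point being that the intrinsic spread of the 10,000-throw mean is far tighter than the spread A or B could honestly quote after seeing 8-10 rolls:

```python
import numpy as np

rng = np.random.default_rng(0)
faces = np.arange(1, 7)
# Hypothetical weights for the loaded die (made up for illustration):
p = np.array([0.05, 0.05, 0.15, 0.20, 0.25, 0.30])
true_mean = float((faces * p).sum())

# Repeat the "10,000 throws" experiment 200 times: the intrinsic spread
# of the long-run average is tiny (sd ~ 1.4/sqrt(10000) ~ 0.014),
# far tighter than any honest subjective estimate from a handful of rolls.
means = [rng.choice(faces, size=10_000, p=p).mean() for _ in range(200)]
print(round(true_mean, 2), round(float(np.std(means)), 3))
```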

Not too surprised at the PNAS thing either - however their radiation map doesn't seem to mesh with the exposure we've been experiencing in Yamagata-ken, then again - they do say their program doesn't model regional Japanese well.

I find this idea of declaring ensembles of models with giant uncertainties to be "non-excludable" to be excruciatingly uninteresting. Let's invert the problem. How good does a model have to be on an annual basis to give us a 1C confidence interval in its prediction 100 yrs from now. That's actually interesting and testable.

I don't see anything in that paper to support the claim. It was already well known that it is possible to envisage models that match recent observations but which have a broader range of projections than the CMIP models. This Rowlands et al paper appears to merely be a variant on the same argument. It is precisely the point of my paper to illustrate why this fact cannot discredit the CMIP models.

Additionally, I don't understand the calculations underpinning this new paper. It looks from their Fig 1 that the high projections are coming from models that massively overpredicted the recent warming too. It might have been interesting to address the question of how much future warming is generated by the models which do not also overpredict the past warming.

>" It might have been interesting to address the question of how much future warming is generated by the models which do not also overpredict the past warming."

Yes, that might be interesting to see. However, ISTM they are asserting that their goodness of fit measure is better than just using 'overpredict past warming' (which is probably just a global average temperature?) because their goodness of fit also includes spatial patterns. Looking at figure 2a, the good models below the dotted line have a range that is not much narrower than the range for the bad models above the dotted line. If quality is improved a little by lowering the dotted line then the IPCC AR4 upper limit on temp change looks better, with just one very good model at nearly 3C temp change.

The 66th percentile used seems an arbitrary choice to make a headline-drawing conclusion that the IPCC AR4 temp change range is too low at the high end. That it appears too low at the low end seems more robust.

Looking at the paper more carefully, I think it's worth a longer post. It seems that their "likely" range is simply the full range of outcomes from the sample that passes their statistical test (see how the vertical lines exactly match the extreme members in the "good" set), and their range is therefore heavily dependent on sample size, and under some very simple and natural assumptions would grow without bound as the ensemble grows (ie, does not asymptote to any finite value).
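The sample-size dependence is easy to illustrate: if, purely for the sake of argument, each member's projection were an independent draw from a Gaussian, the min-to-max range of the sample keeps widening as the ensemble grows:

```python
import numpy as np

rng = np.random.default_rng(1)

# Min-to-max range of n draws from N(0, 1), averaged over 50 replicate
# ensembles: it grows with n and never converges to a finite value.
spreads = []
for n in [100, 1_000, 10_000, 100_000]:
    draws = rng.standard_normal((50, n))
    spreads.append(float((draws.max(axis=1) - draws.min(axis=1)).mean()))
    print(n, round(spreads[-1], 2))
```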

It is somewhat analogous to the first cpdn paper when they basically said "we found a high sensitivity model that is compatible with the current climate".

As such, I don't think there is any valid interpretation of their range, in terms of IPCC "likely" etc.

Yes some verification is needed but AFAICS they have done this with their 'goodness of fit' measure. I don't really know how good the measure they have used is as it isn't easy to come up with a good single measure.

I would accept that models that have been tuned over long periods of time have probably had more examination of more aspects, and thus CMIP models might have a greater tendency to be good than the measure CPDN has used shows.

Re "model in a given run": They have used initial condition ensembles where available. It would be interesting to look in more detail at that single model that has a high transient temperature rise (nearly 3K for 2041-2060) as it is as good as any other model per the goodness of fit measure. Was it a single run or an initial condition ensemble of how many runs? Does the output stand up to more scrutiny than just applying their goodness of fit measure?

>"under some very simple and natural assumptions would grow without bound"

You might think such assumptions are simple and natural. I have difficulty believing there would be a model that shows over 20K temperature change by 2041-2060 that has a good hindcast goodness of fit, unless there was some problem with the experiment design or an unstable computation in the forecast.

The question seems to be why isn't the 66% likely range calculated for each arbitrary goodness of fit threshold?

If natural variability is considered as a sample from a Gaussian (and that's one of the most tame possibilities, with very thin tails) then, given enough samples, you can get a model to do literally anything.

Given enough samples, quantum mechanics says the earth can jump into the middle of the sun, but the average time you would have to wait for it to happen is many orders of magnitude longer than the length of time the universe has existed. So I don't think it is worth considering these extremely remote possibilities.

Back in the land of sense: once a 15K global average temperature anomaly arises through natural variability, there is such a strong feedback that reaching a 20K anomaly becomes sufficiently unlikely that I think we should ignore it, in the same way as the possibility of the earth jumping into the centre of the sun.

Chris, I think you are misunderstanding their method, which is fair enough as it seems pretty obscure and bizarre to me. From a Bayesian perspective, sure these things are rare enough that they have an extremely low probability, and will not lie in a sensible 66% range. But these guys are presenting the full range from their ensemble, after checking consistency with the historical record. If they happened to get a single model that warmed by 15C, they would use this as the upper bound on their 66% interval.

If they had applied this methodology to the original cpdn sensitivity paper, they would have had a 66% range going right up to 11C sensitivity, because they found a model up there that had a plausible climatology. Since that time, I believe they have generated models with even higher values, like 20C.

(That's assuming their climatology test was set at something they would describe as 66%, but the basic point stands whatever probability level it was labelled.)

This method does not make much sense to me, and it certainly does not appear to be compatible with the Bayesian perspective in which probability is interpreted as a degree of belief. Of course, the authors (at least some of them) have rejected the Bayesian approach in previous papers, but that rather leaves the question of what 66% actually does mean to them.

I think I am not understanding. In 2a I think I see one triangle below and right of the dashed lines. But perhaps you are right, and the triangles are plotted so the top apex is where the point is, and that is above the dashed line.

But either way on that, the horizontal dashed line is a quality control where 33% of models are rejected. AFAICS that doesn't make the remainder a 66% likely range, it just makes them the models that pass quality control test. A 66% likely range should AFAICS then be calculated using the good models. They clearly haven't done this - looks more like 99.5% range to me.

If they calculated a 66% range from the good models, this would be expected to be smaller than the IPCC likely range, as they have only used one model type (HadCM3L) and using other model designs would give a wider range. I haven't seen any discussion equating a 99.5% range of good models of one model type to a 66% likely range of all model types. Seems a bit like plucking a number out of the air to me. But I am probably just not understanding.