Ross on Panel Regressions

Ross comments:

One of the benefits of panel regressions is that they force you to spell your null hypothesis out clearly. In this case the null is: the models and the observations have the same trend over 1979-2009. People seem to be gasping at the audacity of assuming such a thing, but you have to in order to test model-obs equivalence.

Under that assumption, using the Prais-Winsten panel method (which is very common and is coded into most major stats packages) the variances and covariances turn out to be as shown in our results, and the parameters for testing trend equivalence are as shown, and the associated t and F statistics turn out to be large relative to a distribution under the null. That is the basis of the panel inferences and conclusions in MMH.

It appears to me that what our critics want to do is build into the null hypothesis some notion of model heterogeneity, which presupposes a lack of equivalence among models and, by implication, observations. But if the estimation is done based on that assumption, then the resulting estimates cannot be used to test the equivalence hypothesis. In other words, you can’t argue that models agree with the observed data, using a test estimated on the assumption that they do not. As best I understand it, that is what our critics are trying to do. If you propose a test based on a null hypothesis that models do not agree among themselves, and it yields low t and F scores, this does not mean the hypothesis of consistency between models and observations is not rejected. It is a contradictory test: if the null is not rejected, it cannot imply that the models agree with the observations, since model heterogeneity was part of the null when estimating the coefficients used to construct the test.

In order to test whether modeled and observed trends agree, test statistics have to be constructed based on an estimation under the null of trend equivalence. Simple as that. Panel regressions and multivariate trend estimation methods are the current best methods for doing the job.

Now if the modelers want to argue that “of course” the models do not agree with the observations because they don’t even agree with each other, and it would be pointless even to test whether they match observations because everyone knows they don’t; or words to that effect, then let’s get that on the table ASAP because there are a lot of folks who are under the impression that GCM’s are accurate representations of the Earth’s climate.

133 Comments

Gosh, Ross, when you say it like that you don’t give them much wiggle room. Perhaps they’ll answer by explaining that regular statistics is rather boring. The novel kind is much more fun and leads to “better” conclusions. ;-)

The result of the null hypothesis testing also depends on magical quantities called the uncertainties. In other words, if the model says 0.2+/-0.01 and the observations say 0.1+/-0.01, then the null hypothesis can be safely rejected. If they say 0.2+/-0.1 and 0.1+/-0.1, respectively, then it’s not so sure you can reject the null hypothesis, although the mean trends do not agree. This is a basic fact, and no matter where your statistical method comes from (economics, social sciences or phrenology), you have to state clearly which uncertainties you are working with.
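The arithmetic behind those two cases can be sketched as follows. This is only an illustration: it assumes the +/- figures are one-sigma standard errors, that the two estimates are independent, and that a rough |z| > 2 rule stands in for a formal test. The function name is made up for the example.

```python
import math

def trend_difference_z(trend_a, se_a, trend_b, se_b):
    """z-score for the difference between two independent trend estimates.

    Assumes se_a and se_b are one-sigma standard errors (an assumption,
    the comment above does not say what the +/- values represent).
    """
    return (trend_a - trend_b) / math.hypot(se_a, se_b)

# Case 1: tight error bars, 0.2+/-0.01 against 0.1+/-0.01
z_tight = trend_difference_z(0.2, 0.01, 0.1, 0.01)

# Case 2: wide error bars, 0.2+/-0.1 against 0.1+/-0.1
z_wide = trend_difference_z(0.2, 0.1, 0.1, 0.1)

for label, z in (("tight", z_tight), ("wide", z_wide)):
    verdict = "reject" if abs(z) > 2.0 else "cannot reject"
    print(f"{label}: z = {z:.2f} -> {verdict} equal trends")
```

The same 0.1 gap in trends gives a z-score ten times larger when the error bars are ten times smaller, which is why the choice of uncertainties decides the outcome.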

And in this respect, the last posts of this site made me extremely suspicious. Steve and Ross never seem out of tricky ideas when it comes to shrinking the observations/models’ error bars until they don’t overlap any more.

This is what Steve did in his last post, until he had to retract it. Fig. 2 of MM2010, reported in an earlier post, also shows suspiciously shrunken error bars.

Steve: if you don’t like these methods, key results fail under Santer’s own methods as well. Ask Gavin about Santer’s Table III and IV under updated data.

One of the major conceptual issues students run into when learning basic hypothesis testing is that you have to assume “no effect” or “no difference” to see if the data indicate a difference.

The reason for doing so is simple: If you assume “no difference”, you can calculate a probability of observing a value of the test statistic at least as far away from the value implied by “no difference” as you did observe.

If your maintained hypothesis is “there is a difference”, then you cannot calculate such a probability.

If the chances of observing something that is at least as far away from the value of the test statistic as you did observe are small enough under the “no difference” hypothesis, you say the data do not support the “no difference” hypothesis and reject “no difference”.

That does not make the alternative the “truth”. It just says that if there really were no difference between model trends and observed trend, the chances of seeing the trends we saw are small.

Oh, and this is true of all statistical testing regardless of how complicated the math might get.

The significance level is the probability (which you choose ahead of time) of incorrectly rejecting “no difference” with which you are comfortable. The p-value is the probability, computed under “no difference”, of seeing a test statistic at least as extreme as the one you observed. If the p-value is at or below the significance level you chose, you reject “no difference”.
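As a rough sketch of that decision rule, assuming a large-sample test statistic that is approximately standard normal under “no difference” (the helper names and the sample statistics 2.5 and 1.0 are invented for the illustration):

```python
import math

def two_sided_p_value(test_stat):
    """P(|Z| >= |test_stat|) for a standard normal Z, i.e. computed
    under the assumption that "no difference" is true."""
    return math.erfc(abs(test_stat) / math.sqrt(2.0))

def reject_no_difference(test_stat, significance_level=0.05):
    """Reject when the p-value is at or below the pre-chosen level."""
    return two_sided_p_value(test_stat) <= significance_level

for stat in (2.5, 1.0):
    p = two_sided_p_value(stat)
    print(f"statistic {stat}: p = {p:.3f}, reject = {reject_no_difference(stat)}")
```

The significance level is fixed before looking at the data; only the p-value depends on what was observed.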

That is why statements like “95% significant” make absolutely no sense and lay bare the basic ignorance of team members when it comes to these things.

Sinan:
So why should not the null hypothesis be that the trend is the first order effect from the growth in CO2? The problematic aspect of GCMs, IMHO, is the multiplier or amplification that is built into the models. Actual observed temperature trends look like those from the first order effects.

@Bernie: I am not sure I understand what you mean. The question here is: We have a bunch of models that generate a bunch of values for a variable of interest. Do the trends from those models match the trends from observations?

Ok, I see that you are testing the hypothesis that all the models are identical with each other and with the truth. This is even more strict than testing whether the mean of the models is equal to the truth, but has the same property that once you get enough models you will certainly reject it. Nothing is exactly equal to anything but itself. Personally I think it is more interesting to look at the “ensemble model” (using my definition rather than Steve’s ie the collection of all models) to see if this might be a useful way of using models to estimate trends with uncertainty. You obviously have to accept that none of the models is perfect, since they must disagree with each other in order to be an ensemble, and this can be difficult for some modelers, but if the variation between models is small, and plausibly includes the truth, you have a good forecast. If the spread is large you don’t have a good forecast even if it does include the truth. If the spread doesn’t include the truth you don’t have a good forecast at all.

I would be interested in your comments on my assertion in the previous post that this isn’t a time series problem at all.

“ensemble models” have the disadvantage that it’s impossible to interpret the results in any fundamental physical sense. You might as well adapt wideband delphi techniques to the problem – also problematic for interpretation, but with some significant history of use.

You can interpret the results from an ensemble in the same way as you interpret the results from a single model – a temperature trend is just that, wherever it comes from. What you can’t interpret is the physics of the ensemble model itself, ie all the careful mathematics you use on an individual model doesn’t have a simple interpretation in the ensemble. However, this doesn’t matter if all you want to do is make a prediction.

Exactly, and without the ability to interpret the underlying physical basis, the results are even further removed from being used to project future behavior, which is, after all, the sole purpose of the models to begin with. An ensemble model that predicts past behavior but has serious limitations for projecting future behavior isn’t particularly useful.

If the spread is large you don’t have a good forecast even if it does include the truth. If the spread doesn’t include the truth you don’t have a good forecast at all.

The models are tuned so that they match the historic record. Given that there are only a few subsequent years for which the models are not tuned, it is rather difficult to conceive of a situation in which reality will have had enough time to deviate from the ensemble as you define it. So in this case the models cannot be wrong but, following the ideas of C.S. Peirce, the theories that they represent are useless.

I think that Ross has figured out exactly what they mean – even though he is slightly afraid to admit to himself that that’s what the “critics” mean.

The critics mean that the models beautifully describe the observations – and the word “beautifully” means “up to errors comparable to the difference between the models themselves.”

So if the models predict between 2 and 8 degrees of warming per century, that’s 5 plus minus 3 degrees of warming, and indeed, up to tolerable 2-sigma errors i.e. up to a 6-degree accuracy, most (over 50% of the) models indeed do agree with the observed warming (which is near zero). ;-)

The agreement between the models and reality would be even more spectacular if you included the models that predict 15 degrees of warming per century. ;-)

This is the “consensus science” approach to the comparisons between models and empirical data. If there’s no real consensus between the models, then every model gets an A from the “agreement with the observation” discipline. ;-)

And because most models end up with having an A, we must trust them when they predict something that is clearly not happening according to the observations. That’s the democratic kind of science, folks! :-)

I’m a Quaternary scientist, and I’ve long believed that models are of limited use in climate science, geomorphology and elsewhere. The last part of Ross’ post says: ‘“of course” the models do not agree with the observations because they don’t even agree with each other, and it would be pointless even to test whether they match observations because everyone knows they don’t; or words to that effect, then let’s get that on the table ASAP because there are a lot of folks who are under the impression that GCM’s are accurate representations of the Earth’s climate’.

It seems to me that this is self-evidently true and relatively non-controversial. Which is why using estimates of past climate change in response to changes in the forcings is a better guide to the likely future response of the earth’s climate.

Put it another way: the models might not be able to predict the detail of future climate change, but we can be sure that warming will be the result of increasing GHG emissions. We can also be pretty sure of the magnitude of the warming.

One group are wanting to test this view:
The models are individually biased, but the mean is somehow equal (or at least close enough for government work). So, in this case, the idea is that one acceptable reason for the mean to differ from observations is that you only had ‘N’ models. If you’d had 10*N or 100*N models you would get a better multi-model mean. So, you have to account for that. (Of course, you must also account for uncertainty in observations.) For better or worse, that’s the Santer type test. It’s not necessarily a bad test, especially since the “within model variance” ought to be somehow picked up in the estimate of the variance of observed trends. (Of course, he had to assume AR1 to estimate that, but it’s a stab at the issue.)

The other group wants the test to not reject the models if a few of the models are individually correct. I can see that view, but by the same token, if the models are all wrong on average and the between model spread is very large, this test can result in the perplexing situation that you can never show the mean either high (or low) if 10% of the models are low (or high).

Your test strikes me as the best aligned with phenomenology. The “method” tests for consistency of observations against a model that has the typical mean (deterministic signal) and typical variance (due to ‘weather’), and so uses the within model variance. The question is meaningful, the method seems correct.

It strikes me that lots of people who are criticizing the method don’t want to step back and discuss what questions are worth asking. They want to presuppose that only one question is worth asking (which is ridiculous from both a policy and scientific point of view.)

But I think for both policy and science, we both want to know whether the multi-model mean is high (or low) and we want to know if any of the individual models might be ok. Both questions are important to making policy decisions. (Or, at least, I want to know if the batch is high overall when planning. I want to know this even if we don’t have enough individual runs from individual models to have power to test individual models.)

My concern is that a model could be “correct”, i.e. in line with the observed results, for the troposphere trend, but not so with the corresponding surface trend. Or, for that matter, a model could be wrong for the troposphere and surface, but get the difference correct. Also a model could be correct for a regional area like 20S to 20N and not some other region or globally. I continue to judge that the best statistic for comparing the observed and model results is to use difference series between the surface and troposphere. I think the differencing will cut down the variance and autocorrelations. Besides I thought the original question here involved the differences or ratios of temperature trends for surface to troposphere in the tropics. At least that is what I took away from Douglass et al.

I think the code and data for doing the panel regression with difference series are available and while Ross McKitrick has done a good job explaining the linear algebra involved, my doing it may be a little too blackboxish even for me.

@Kenneth: I agree with the differencing approach. (That’s why, in another thread, I asked if the data were available in a simple table somewhere).

Since then, I have downloaded the SI for the paper. It seems straightforward to extract what I need from what is contained in the archive.

I did a couple of scatter plots between ensemble means from two models and satellite observations. In both cases, the cloud was a downward sloped ellipsoid. I find that curious but not very meaningful because I am not too certain of my steps.

IMHO, a very simple way to check if model generated values per month per year match observations is to run the regression:

(Eq.1) obs_ym = b0 + b1 model_ym

and test the joint restriction:

H0: b0 = 0 and b1 = 1

Then, one can say model generated values match observed values, for a given model.

Alternatively, testing

H0: b1 = 1

allows for a constant shift factor between model generated values and observed values.

The residuals from Eq.1 plotted against observed values should give a good idea of the direction of bias in the model.
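A minimal sketch of the regression-and-restriction test described above, using only the standard library. It assumes independent, equal-variance errors, which monthly temperature series would probably violate because of autocorrelation, so treat it as an illustration of the logic only. The function names and the six-point series are hypothetical.

```python
def fit_line(x, y):
    """OLS fit of y = b0 + b1*x; returns (b0, b1, residual sum of squares)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    rss = sum((yi - b0 - b1 * xi) ** 2 for xi, yi in zip(x, y))
    return b0, b1, rss

def joint_f_stat(model, obs):
    """F statistic for H0: b0 = 0 and b1 = 1 in obs = b0 + b1*model.

    Under H0 the restricted fit is simply obs = model, so the restricted
    residual sum of squares is the sum of squared model-minus-obs gaps.
    Compare the result with an F(2, n-2) distribution.
    """
    n = len(model)
    _, _, rss_unrestricted = fit_line(model, obs)
    rss_restricted = sum((o - m) ** 2 for m, o in zip(model, obs))
    return ((rss_restricted - rss_unrestricted) / 2.0) / (rss_unrestricted / (n - 2))

# Hypothetical data: a model series, observations that track it closely,
# and observations with half the model's slope.
model = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
wiggle = [0.1, -0.1, 0.1, -0.1, 0.1, -0.1]
obs_close = [m + w for m, w in zip(model, wiggle)]
obs_half = [0.5 * m + w for m, w in zip(model, wiggle)]

f_close = joint_f_stat(model, obs_close)
f_half = joint_f_stat(model, obs_half)
print(f"F when obs track the model: {f_close:.2f}")
print(f"F when obs have half the slope: {f_half:.1f}")
```

Testing only H0: b1 = 1 would instead compare the estimated slope with 1 using its standard error, leaving the intercept free to absorb a constant shift.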

My concern is that a model could be “correct”, i.e. in line with the observed results, for the troposphere trend, but not so with the corresponding surface trend.

Sure. That’s why it can be useful to test more than one measure. MMH does: it looks at LT, MT. A ‘truly correct based on physics’ model should get all possible measures right, so you should test many types of observables. When doing the stats, you need to account for the number of tests you do when figuring out the criterion for rejection, but the more things it “passes” the better.
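One common way to account for the number of tests is a Bonferroni correction. The comment does not say which adjustment is meant, so this choice is an assumption, and the p-values below are invented purely for illustration.

```python
def bonferroni_rejections(p_values, family_alpha=0.05):
    """Flag which tests reject when the overall error rate family_alpha
    is split evenly across all tests (Bonferroni correction)."""
    per_test_cutoff = family_alpha / len(p_values)
    return [p <= per_test_cutoff for p in p_values]

# Hypothetical p-values for two observables, e.g. an LT test and an MT test.
decisions = bonferroni_rejections([0.04, 0.001])
print(decisions)
```

With two tests the per-test cutoff drops from 0.05 to 0.025, so a p-value of 0.04 no longer counts as a rejection on its own.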

Of course this doesn’t mean the hurdle for writing a paper should be running the test on all possible observables.

“It strikes me that lots of people who are criticizing the method don’t want to step back and discuss what questions are worth asking. They want to presuppose that only one question is worth asking (which is ridiculous from both a policy and scientific point of view.)”
Of course, the two questions that exclude each other as you state.
And, of course, there is the whole extra problem of ‘correlation’ and ‘significance’.
Also, baleful statistics: Out of 100 who live A there is a significant correlation with heart disease. Therefore A causes heart disease? No, dupe head!

“Now if the modelers want to argue that “of course” the models do not agree with the observations because they don’t even agree with each other, and it would be pointless even to test whether they match observations because everyone knows they don’t; or words to that effect, then let’s get that on the table ASAP because there are a lot of folks who are under the impression that GCM’s are accurate representations of the Earth’s climate.”

Taking “you cannot have your cake and eat it too” up a few notches.

I think we see more examples of this dichotomy, with more analyses of models versus observations, and reconstructions for that matter. If uncertainty is going to cover a multitude of sins then that uncertainty must be dealt with.

“The figure below shows the IPCC distribution of 55 forecasts N[0.19, 0.21] as the blue curve, and I have invented a new distribution (red curve) by adding a bunch of hypothetical nonsense forecasts such that the distribution is now N[0.19, 1.0].

The blue point represents a hypothetical observation.

According to the metric of evaluating forecasts and observations proposed by James Annan my forecasting ability improved immensely simply by adding 55 nonsense forecasts, since the blue observational point now falls closer to the center of the new (and improved distribution).”
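The effect Roger describes can be reproduced in a few lines. This sketch assumes the second number in N[0.19, 0.21] and N[0.19, 1.0] is a standard deviation, and it invents an observation of -0.25 because the quoted text gives no value for the blue point.

```python
def standardized_distance(observation, mean, sd):
    """How many standard deviations the observation lies from the mean."""
    return (observation - mean) / sd

hypothetical_obs = -0.25  # made-up value, not taken from the figure

z_original = standardized_distance(hypothetical_obs, 0.19, 0.21)
z_padded = standardized_distance(hypothetical_obs, 0.19, 1.0)

print(f"original ensemble: {z_original:.2f} sd from the mean")
print(f"padded with nonsense forecasts: {z_padded:.2f} sd from the mean")
```

The observation and the ensemble mean are unchanged; only the spread grew, yet the observation moves from outside two standard deviations to well inside one.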

Thank you that someone can apply themselves! I’m too old to restart (got my A level at 13 and, since, been an autodidact) so I, like a lot of others, have to follow my commendably logical nose – always brings me back to Steve!

We need a method that can be applied a priori for selecting valid model results and perhaps even valid observed series. Without such a procedure we have to assume that one model result is as valid as another. And by the way, I would guess that valid for a model does not necessarily imply that it matches the observed, only that given the assumptions for forcing and feedbacks it gives valid (correct) answers, i.e. that it gets the physics (and parameterization) right.

Does anyone out there have such an a priori selection method? Perhaps Mann will provide one such as he has for reconstructions based on the posterior RE values. I am ready for some good old down home Texas sharpshooting.

Roger has proposed a Thought Experiment. These are often useful in trying to understand unfamiliar ideas. His “adding a bunch of lousy models” will indeed improve this statistic’s characterization of the validity of the modeling. This is at best counterintuitive, and likely points to a flaw in the statistic (or its meaning), as conceived. The IPCC’s belief that the models that are admitted to its ensemble are “good” or “good enough” does not weaken Roger’s experiment, it seems to me.

Lazar had a different Thought Experiment at DeepClimate, also useful, I think. Paraphrasing,

Let’s take the MMH10 approach to an extreme. Let’s suppose that the internal variability for runs of each model is zero. Thus the “within group” SD must be zero. Let’s further suppose that the different models agree pretty well with each other, and that observations fall within the tight band of model projections. Then, by the MMH10 method, you will determine the average for the “ensemble of models,” and give this average an SD. But the MMH10-calculated SD will be zero. So, in this Thought Experiment, observations would be within the range of the closely-spaced models that make up the “ensemble of models.” And yet, since the average of the “ensemble of models” has an SD of zero and an SEM of zero, the MMH10 test will declare that the modeling has “failed” — the observed trend falls very far from the ensemble average (an infinite number of SDs away, in this extreme example). Even though by a common-sense standard, this would be a very happy outcome for all the modelers: different models, all with minimum (zero) run-to-run variability, all clustered closely together, and with the actual data observed to be within the cluster.

While Lazar’s Thought Experiment is very different from Roger’s, they each seem to say something useful about how to go about thinking about the right way(s) to use statistical tools to approach the problem that interested Douglass, Santer, and now MMH.

Steve: Amac, the approach in MMH10 is different than the one outlined in my post. I thought that I could show something very simply and save people the algebra, but I made an error. I’ll re-do the post when I get back from Italy. Unfortunately, I also distracted from MMH which has an entirely different approach and I urge you to read it.

How delicious that honest sound of “Oops”! Thank you for sharing such a special noise!

Steve: I don’t claim to be perfect. If I make a mistake, I try to acknowledge and correct it. It’s a practice that I would recommend to the Team. There’s a useful point that I’ll restate when I redo the post, probably when I’m back from Italy.

“Steve: I don’t claim to be perfect. If I make a mistake, I try to acknowledge and correct it.”

I know. And when Craig Loehle makes a mistake, he acknowledges it and tries to correct it. Same with Mosher. etc. etc.

Why don’t I hear that same honest oops from Gavin when AMac points out that there is no calibration with Tiljander? If Mann would utter the honest oops, there’d be no need to still be trying to get him to fess up on a paper whose mistakes have been known for over ten years, and no need for applause for a paper such as McShane and Wyner who point out that proxies are no better than red noise. Why don’t we have Jones saying “Gosh, that ‘delete all emails’ email I wrote really IS atrocious, isn’t it? What was I thinking? I’m going to go for a long walk and think about that.” Why does Tamino still maintain Mann’s decentering is ok, even after Jolliffe says he wouldn’t know how to interpret such a thing? etc. etc.

I know I’m not the only one to notice this difference in attitude, but I want to let you know (as I did with Mosher) that to me, and I suspect to many others, it makes all the difference in the world.

AMac, Lucia had a post or comment on that exact point, and Lazar ignores the real point. Surely in such a case the models would fail but it wouldn’t be anything worth noting. The difference is small and in that case the test is of limited use. But in the actual case we’re considering, the models don’t even come close, do they? Of course you could wait for many, many years before Nick, James A, and the others accept that obvious point, but we wouldn’t be having the same argument if the situation was as Lazar’s example sets out, would we?

I for one would say in Lazar’s case that the models aren’t an exact model (and that’s a truism), but they’re obviously getting it reasonably right, and the “mean model” doesn’t appear to be badly out of alignment. But in real life, that isn’t true.

Steve, ok… could you state whether MMH tests include model spread… structural uncertainty… model heterogeneity… whatever you want to call it… your post (schools) seemed to imply that it didn’t, that the only error variance is the “within group” internal variability… Ross McKitrick clearly implies that it didn’t… no need to explain the algebra (hopefully).

P.S. regardless of whether it is a sufficiently accurate description of MMH, could you resurrect the content of that post (the logical points, not necessarily the graph)? Otherwise interesting discussion in the comments is devoid of context and meaning.

Steve: I promise without fail to revive this because there’s a good point here. I’m really squeezed on time right now because I’m going to Erice on Tuesday, have a presentation to finish plus grandkids on holiday. I’ll be away for about 10 days as well and will be at a conference and on holiday. But I promise to re-visit and re-analyse quite promptly after I get back – OK?

If I showed you the hoary old algebraic proof of 1=2, you’d tell me that “divide by zero” does not invalidate algebra, it just shows that extreme cases require special handling. Lazar’s thought experiment seems to me to fall into that class.

Ensemble predictions allow for uncertainty in modelled physical representations of the real world… that is the reason for making ensemble predictions… to include the effects of different guesses and assumptions about real world physics into prediction uncertainties. The purpose of reducing internal variability to zero was to illustrate the implications on the test of not including those structural uncertainties in an ensemble prediction. Internal variability simply obscures the point… which was examining the logic of the test.

> The purpose of reducing internal variability to zero was to illustrate the implications on the test of not including those structural uncertainties in an ensemble prediction.

This seems to me to be exactly right.

Now, upthread, Steve McI cautioned me that MMH10 doesn’t take the internal-variability approach that Lazar’s thought experiment addresses. OK, we’ll have to pull the paper and/or await Steve’s follow-ons. That’s not an argument against the logic of Lazar’s Thought Experiment.

For generalists, Thought Experiments that take an assertion to an extreme (eg, take internal variability to zero) and then look at how the analysis handles that extreme case are useful tools. True for Roger’s instance (sorry, Chip), and true for Lazar’s.

I think the problem is in making inferences about models (individuals) from the group, when testing “the models” as an ensemble. It is better logically, empirically, to treat an ensemble model *as a single model*, and GCMs as inputs.

E.g.

“Models are consistent with observations”. Well, some models are more consistent, some are less, the statement doesn’t tell us which are which, and since the spread can be anything, doesn’t give information about real distances. Pretty meaningless.

Suppose that obs lie just outside the 2 s.d. limits. And suppose that one GCM is an (for the sake of argument) exact match to obs. “Models are inconsistent with observations”…? This is nuts. Some models are fairly close to observations, some are fairly far away, and one is an exact match.

OTOH, if an ensemble is treated as a single model, “consistent”/“inconsistent” fairly describes observations inside/outside the model prediction. The prediction might be wide, but if that accurately reflects our level of understanding of the physical system, then it is what it is and let the grumblers grumble.

P.S. It was a pleasure reading your exchanges with TCO (Tiljander), as a good discussion; it was tough, fair, honest, unbiased, insightful, and collegial. A breath of fresh air. Please keep it up.

The prediction might be wide, but if that accurately reflects our level of understanding of the physical system, then it is what it is and let the grumblers grumble.

I suspect this is the crux of the matter. The hard part is that the popular press, and decisionmakers, have a hard time listening/communicating to a message whose meat is uncertainty levels rather than data values.
I.e. innumerate people want to hear “2 degrees per century” rather than “a range from -1 to +3 degrees per century.” They want one simple number, the “most likely” number. After all, graphs are prettier without all that shading ;)

Let’s take the MMH10 approach to an extreme. Let’s suppose that the internal variability for runs of each model is zero. Thus the “within group” SD must be zero. Let’s further suppose that the different models agree pretty well with each other, and that observations fall within the tight band of model projections. Then, by the MMH10 method, you will determine the average for the “ensemble of models,” and give this average an SD. But the MMH10-calculated SD will be zero. So, in this Thought Experiment, observations would be within the range of the closely-spaced models that make up the “ensemble of models.” And yet, since the average of the “ensemble of models” has an SD of zero and an SEM of zero, the MMH10 test will declare that the modeling has “failed” — the observed trend falls very far from the ensemble average (an infinite number of SDs away, in this extreme example).

This is not true. Here’s a demo you can do on a spreadsheet, though I did it on Stata. Generate a trend t=1,…,100. Now generate 3 deterministic “model” runs using t0=0*t, t1=1*t and t2=2*t. So each one has zero SD. And generate some “observed temperature” data using tt=0.8*t+N(0,1). So the observed trend is 0.8, within the spread of models. Now do the panel estimation as in MMH, by constructing a dummy variable d=0 for models and d=1 for obs, stack the 4 series, construct dt = d*t and do the panel regression. OLS will do fine here since there is no autocorrelation. The estimated trend on t will be 1.00 and the SE will be about 0.08. The test of a model-obs difference will not reject. You could even leave out the N(0,1), in other words make the “observed temperature” data deterministic, and the t-test on the model/obs difference will be 1.21 (p=0.227). In this case there is no within-model variance, only between-model variance, but the panel regression still takes it into account. I don’t know where they got the idea that the variance of the trend would be zero if all the (detrended) model runs had zero variance. That would only happen if all the models were identical and exactly linear, but in that case there would, in fact, be zero variance on the trend.
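For readers without Stata, here is a standard-library sketch of the deterministic variant of that demo (observed series 0.8*t with no noise). It uses plain OLS, not the Prais-Winsten panel-corrected estimator used in MMH. It relies on the fact that when the dummy d is interacted with both the intercept and the trend, the OLS coefficients equal those from fitting the model rows and the observation rows separately, with one pooled residual variance. All variable and function names are illustrative.

```python
import math

def slope_fit(t, y):
    """OLS of y on t with intercept; returns (slope, RSS, sum of squared t deviations)."""
    n = len(t)
    t_bar, y_bar = sum(t) / n, sum(y) / n
    sxx = sum((ti - t_bar) ** 2 for ti in t)
    sxy = sum((ti - t_bar) * (yi - y_bar) for ti, yi in zip(t, y))
    slope = sxy / sxx
    intercept = y_bar - slope * t_bar
    rss = sum((yi - intercept - slope * ti) ** 2 for ti, yi in zip(t, y))
    return slope, rss, sxx

t = list(range(1, 101))

# Three deterministic "model" runs with slopes 0, 1 and 2, stacked (d = 0).
model_slopes = [0.0, 1.0, 2.0]
model_t = t * len(model_slopes)
model_y = [s * ti for s in model_slopes for ti in t]

# Deterministic "observed" series with slope 0.8 (d = 1).
obs_y = [0.8 * ti for ti in t]

trend_models, rss_models, sxx_models = slope_fit(model_t, model_y)
trend_obs, rss_obs, sxx_obs = slope_fit(t, obs_y)

# Pooled residual variance: 400 stacked rows, 4 estimated coefficients.
n_rows = len(model_t) + len(t)
s2 = (rss_models + rss_obs) / (n_rows - 4)

se_trend = math.sqrt(s2 / sxx_models)                      # SE of the coefficient on t
trend_diff = trend_obs - trend_models                      # coefficient on d*t
se_diff = math.sqrt(s2 * (1.0 / sxx_models + 1.0 / sxx_obs))
t_stat = trend_diff / se_diff

print(f"trend on t = {trend_models:.2f}, SE = {se_trend:.3f}")
print(f"model/obs difference = {trend_diff:.2f}, |t| = {abs(t_stat):.2f}")
```

Even though every detrended series has zero variance, the disagreement between the three model slopes shows up in the pooled residuals, so the standard errors are not zero and the difference from the observed slope is not significant.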

Lazar, for the basic model/obs diff-significance test, the panel regression equation is equation 10:
y=b0+b1.t+b2.d.t+b3.d+e. The code file is VF09.do and even if you don’t use Stata it should be readable. The SE on the model trend is the SE on the trend slope (b2). It comes from the computation of panel-corrected standard errors where each panel is assigned its own variance, its own AR1 coefficient and the off-diagonal covariances are also calculated.

All the models enter the estimation individually. We do not average the models together when estimating the regression parameters. For models with multiple runs we average the runs together to create an ensemble mean for that model, but if we used the runs individually it would not change our conclusions (IIRC — this was something a reviewer demanded in an early round).

Yes. And the way you do the test is as follows. Stack the vectors in the order you have written them, then define the dummy variable d=(one,zero,zero,zero)’ where one is a 100-length vector of ones, and zero is a 100-length vector of zeros. Then estimate

You may want to re-read what Roger said: “my forecasting ability improved immensely simply by adding 55 nonsense forecasts, since the blue observational point now falls closer to the center of the new (and improved distribution)”

If the assumptions are invalid, then so too may be the conclusions based on the test. I don’t think that if Roger wanted to improve his forecasting ability, he would add known nonsense. So, what he proposed doesn’t make any sense. If, on the other hand, the 55 other forecasts he identified were perfectly justifiable, then adding them to the mix would be perfectly acceptable–whatever the result.

Chip: So what if the IPCC and Ross both take the same assumption that “all the models are equally-well expected to reflect a possible reality.”

I see your gripe as one of, “if you assume all data is valid, then inserting bad data could give a false conclusion because all the model data you currently have (and expect in the future to have) has been pre-screened and is known to be good data”.

If you are used to writing programs that are designed to interpret input, the way you test the flaws in the design is not only to input more of what you already have, but to input data that is quite different.

You test to see not only what inserting a model with a trend that exactly matches observations does to the outcome; you also insert model data that is way outside of anything you currently think would be added. If the result says that the more trash you put in, the closer the fit to observations, you know something is wrong with the way you are trying to process the data.
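
A minimal sketch of that kind of stress test, assuming a naive procedure that measures how far the observation sits from the ensemble mean in units of ensemble spread. All numbers are hypothetical, and the 55 scattered values only echo the count in Roger's example.

```python
import numpy as np

def z_score(obs, forecasts):
    """Distance of obs from the ensemble mean, in units of ensemble spread."""
    forecasts = np.asarray(forecasts, dtype=float)
    return abs(obs - forecasts.mean()) / forecasts.std(ddof=1)

obs = 0.10                                        # hypothetical observed trend
models = np.array([0.20, 0.22, 0.25, 0.27, 0.30])  # hypothetical model trends

# "Nonsense" forecasts: wildly scattered values on both sides of obs.
nonsense = np.linspace(-2.0, 2.0, 55)

z_before = z_score(obs, models)
z_after = z_score(obs, np.concatenate([models, nonsense]))
print(z_before, z_after)
```

Under this naive measure the observation looks far more "consistent" once the junk is added, which is the warning sign being described.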

Why does the input of nonsense matter if you assume all current data is good? If in 2 months from now a model comes out that has data very similar to what Ross calls “nonsense” and is accepted by the IPCC as valid, it would be a good idea to know if the program you are working on handles that input.

The reverse is also true: suppose that in 2 months they determine that a few of the models you assume today to be good were actually calculated wrong, to a point that did not make sense. Then you already know how the program handles nonsense data, because you tested for it.

Also, and this goes to another point made (but not by you): some people want to say that all the models represent possible realities, yet complain that a comparison is unfair because you are not calculating how each run in a model does. Either you are assuming they all are valid or you’re not.

If some models or individual model runs are more valid than others, then that needs to be stated by the IPCC. If all are equally valid, then a test to see how valid as a group they all are is completely reasonable. Ross says the group does not represent reality. So now it must be found which, if any, models or runs are closest to reality. At any rate we know our original assumption that all models represent a possible reality is false.

Last thought… why are any of the models “assumed” to be reasonable when real observational readings are available when the models are run to begin with? Shouldn’t a claim to be reality based depend on actually checking reality to see how close they came to as many parameters as you have available? Then pronouncing exactly what areas you matched observations and which ones were off and by how much they were off?

“Last thought… why are any of the models “assumed” to be reasonable when real observational readings are available when the models are run to begin with? Shouldn’t a claim to be reality based depend on actually checking reality to see how close they came to as many parameters as you have available? Then pronouncing exactly what areas you matched observations and which ones were off and by how much they were off?”

I think we are sometimes missing the point that the models are supposed to be using first principles (with some parameterizations) to calculate a temperature and trend – given the forcings and feedback. In order to determine how well it performs one must compare it to the observed (assuming we know the observed with sufficient certainty to make the comparison). The model may perform its calculations well and simply the assumptions are wrong. It could make its calculations wrong and get a “right” answer by sheer chance. It could get its assumptions wrong but multiple wrongs may give the correct answer.

Steve M was attempting to make the point that I suspect he will on another effort – just not with as dramatic results as the first attempt – that the model runs within a given model give reasonably stable results, but that between models the differences are greater. That observation raises legitimate questions that need answers.

Perhaps those who question using a mean of model results can address the questions that arise and do it in a detailed and specific manner by pointing to why some models are more valid than others and even perhaps providing a selection criteria.

I think Pielke’s point is that any time you have widely varying model results that obviously the range of those results will cover the observed results and one could say that one or some of the model results are close to the observed. That however begs the question that the model results are widely varying and apparently without a reasonable explanation. I do, however, think that adding in nonsense model results is not a good way of making that point for even I can explain why those do not work.

“The model may perform its calculations well and simply the assumptions are wrong. It could make its calculations wrong and get a “right” answer by sheer chance. It could get its assumptions wrong but multiple wrongs may give the correct answer.”

You’re right, which is why having multiple tests with various observation data can help discover what is correct and what is not.

That’s the problem I see. It is a case of having your cake and eating it too. Of all the possible tests to run, the few that are passed get picked out and shown as support, and the rest are disregarded as unimportant. If you can run 3 tests and only one out of 3 passes, that should be stated plainly in the report on the model.

What we get… IT PASSED THE TEST!
What we should get… It passed one test but failed 2 others. They can then go on to explain why the other 2 tests matter less than the one that passed, however ignoring the other 2 tests is wrong.

Also, if even you can do it then please explain why adding nonsense models as a test of a procedure is a bad idea. Several people have given explanations as to why it is necessary; if you think it is wrong, please explain why.

It says that if you add nonsense models you have a nonsense test. However, the IPCC has not identified any of the models as nonsense.

The test, as it has been set up, has as one of its assumptions that the included forecasts are not nonsense. Obviously, if this assumption is violated, so too is the validity of the test. If it is not violated then the test has a chance of being a reasonable one, no?

The test, as it has been set up, has as one of its assumptions that the included forecasts are not nonsense

But, Chip, aren’t we assuming what we need to prove? What if some Mann or other were to submit a ‘model’ that truly was nonsense? But we assume it’s not nonsense?
You misunderstand a ‘nonsense’ test: you do not apply a test which cannot be tested, ie it has to be appropriate – I do not test the grip of a hammer by staring at it – hence, Roger’s tests were valid.

Chip,
I think you are missing the point. It is all nonsense. Read Orrin Pilkey’s book “Useless Arithmetic: Why Environmental Scientists Can’t Predict the Future.” Nature is simply too chaotic to be modeled into the distant future.

You guys do know that comparing multiple models of the same process, where some make one set of predictions (maybe a value goes up over time) and others make other predictions (maybe the value goes down over time), in such a way that the set of resulting trajectories somehow bound the measured data and this characteristic is submitted as some sort of validation is on its face ridiculous.

I can just see some poor sap at Boeing or LMCO or Airbus, saying to his boss “Well I’ve run three different finite element stress analyses, using three different models on the bracket for which I am responsible. One model shows that the bracket is fine until we get temps above X degrees C, the other model shows that cold soaking is the real issue, the bracket gets brittle and comes apart due to excess vibration, and the third model shows that above a certain frequency it doesn’t really matter what the temp is, the bracket fails. We did do 1 test and the bracket didn’t fail at these medium vibe/temp levels. Which is, btw, bracketed by all three models. I think we should just run all of the models all the time, skip the testing cause any test we make will fall within the bounds, so we should just use a very cheap material for this bracket cause it always works given my model ensemble output”

Jaye, what are the models for? I mean, if they’re so wayward why not discard the lot? Or is one ‘better’ than another? How d’you test for that? Retrospectively? Gypsy Lee down the bay front does pretty well with that kind of ‘rigour’!

If they don’t go through independent V&V then I don’t know what they are for. Seems to me, I could propose two models that would bound temperature as a function of time f(t) = a big number and f(t) = 0. There ya go, two models that are guaranteed to bound any temperature trend you might have.

We model/test/model to do our design work. Thing is our stuff has to work.

However, it is simply laughable that an ensemble of models each with different behaviors wrt trend prediction should be cobbled together to show much of anything. Annan’s silliness in attempting to show that two distributions are similar if the mean of one falls somewhere in the meaty portion of another distribution is ridiculous. Back when I was doing pattern recognition, we used Mahalanobis distance to do this sort of thing. If you aren’t taking the variance of both distributions into account then you have a meaningless metric.
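
For anyone who hasn't met it, here is a small sketch of a Mahalanobis-type distance between the means of two samples, using the pooled covariance of both so that the spread of each distribution enters the metric. The pooling choice and the simulated data are assumptions made for illustration; other covariance choices are possible.

```python
import numpy as np

def mahalanobis_between_means(a, b):
    """Mahalanobis distance between the means of two samples.

    a, b: arrays of shape (n_samples, n_features).
    Uses the pooled covariance, so both spreads are taken into account.
    """
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    diff = a.mean(axis=0) - b.mean(axis=0)
    na, nb = len(a), len(b)
    cov_a = np.atleast_2d(np.cov(a, rowvar=False))
    cov_b = np.atleast_2d(np.cov(b, rowvar=False))
    pooled = ((na - 1) * cov_a + (nb - 1) * cov_b) / (na + nb - 2)
    return float(np.sqrt(diff @ np.linalg.solve(pooled, diff)))

rng = np.random.default_rng(1)
# Same separation of means in both cases; only the spread differs.
d_tight = mahalanobis_between_means(rng.normal([0, 0], 0.1, size=(200, 2)),
                                    rng.normal([1, 1], 0.1, size=(200, 2)))
d_wide = mahalanobis_between_means(rng.normal([0, 0], 2.0, size=(200, 2)),
                                   rng.normal([1, 1], 2.0, size=(200, 2)))
print(d_tight, d_wide)
```

The same gap between means gives a large distance when both samples are tight and a small one when both are widely scattered.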

Does statistics have much to say about scenarios? (as in “Scenarios are images of the future, or alternative futures. They are neither predictions nor forecasts. Rather, each scenario is one alternative image of how the future might unfold.”)

Or are those wise words in quotes from the IPCC simply there to avoid anyone doing anything reckless like trying to understand the underlying models that are used to alarm people?

The models are projections. If everything stays roughly like it is now, this is what we think will happen.

The crappy part that I see is that the models’ error bars are large enough so that the projections show little warming to extreme warming and (with the error bars) can cover every range in between. So if you have a little warming, the models are correct; if you have a lot of warming, the models are still correct.

This is the worst part: if you are actually observing low warming that is within the error bars of the lowest model projection, you can apparently still claim the high amount of warming is possible because “the models are correct”.

I agree this is the worst part. I think that is what Roger was trying to show by testing nonsense forecasts. The range of outcomes is meaningless. Given the range of outcomes that you so ably outline it seems that direction to policy makers hinges on the precautionary principle.

I know we’re not supposed to editorialize, but given what you’ve said (which makes complete sense) it would seem anyone with a political agenda would want this result. Otherwise, one would believe that it is necessary to invalidate and remove the outlier models.

“It appears to me that what our critics want to do is build into the null hypothesis some notion of model heterogeneity, which presupposes a lack of equivalence among models and, by implication, observations. But if the estimation is done based on that assumption, then the resulting estimates cannot be used to test the equivalence hypothesis.”

True, but no big deal. It is known a priori that models and data are not equivalent here, or ever (a point made by snowrunner’s comment). Is this one of those ‘silly null’ tests?

“In other words, you can’t argue that models agree with the observed data, using a test estimated on the assumption that they do not.”

You substituted “equivalence” for “agree”.

“because there are a lot of folks who are under the impression that GCM’s are accurate representations of the Earth’s climate.”

… and “equivalence” for “accurate”.

Climate scientists may mean something other than ‘exact match’ when they claim that models are accurate or agree.

Else you could say that *all* models are inaccurate. “All models are wrong, but some are useful” — George Box.

Afraid I don’t understand what the purpose of the test is. An ensemble is a *single model* created for the purposes of making predictions that include estimates of structural uncertainty. Assuming instead a prediction based on a one hundred percent accurate and complete physical representation of the climate system would produce claims of overconfidence, right? If your test does not include structural uncertainties (model heterogeneity), then it’s not testing ensembles and ensemble predictions *as they are created and used* in the IPCC reports and elsewhere. But it’s not testing individual models either, which would also be useful. What is it testing exactly?

James Annan once described the inclusion of model heterogeneity in tests as ‘testing modelling as an approach’ (paraphrasing), the assumption being that truth is bracketed by the model spread. That seems reasonable intuitively.

See Stilgar’s comment above. The problem is that the range is so wide that it includes “nothing to see here, move along” at one end and “armageddon” at the other end. That’s just not useful if we’re supposed to be allocating real dollars, coming from real workers’ pay cheques. I don’t think anyone disputes that a range of outcomes is the best we can do, but how do we narrow the range?

You can use the VF framework to test model-data equivalence while allowing for model heterogeneity. To do so you construct the joint test {[model1=obs], [model2=obs], [model3=obs], …, [model23=obs]} (see math below).

The catch is that, to be testing modeling as an “approach”, you need to do this as a joint test across models, which is what the curly braces denote. In other words what you cannot do is run separate tests [model1=obs], [model2=obs], [model3=obs], …, [model23=obs]; and then pick the lowest F score, and if it is insignificant, claim the “approach” is consistent with observations. In this case each test is run as if it is the only model in the world, or all the other models are unrelated to the one being tested, which contradicts the notion that you are testing models jointly. The independence assumption means that a low F score on model #23 would tell you no more about models #1 through #22 than it would Environment Canada’s weather prediction model or the US Fed’s GDP forecasting model. Claiming otherwise means trying to have it both ways–the independence assumption lets you get pretty much any F score you want, but then you have to pretend you never made the independence assumption in order to say something nontrivial about modeling.

If you want a result that tells you something about the group (or “approach”) of models, without imposing the assumption that all the models have the same trend, you need to use the multivariate testing approach to construct an F score of the joint test given by the expression in the {}’s above. In equation (16) in the MMH paper, the R matrix will consist of a 23-row identity matrix for its top left block, zeros in the bottom 4 rows, and the rest (the first 23 rows of the last 4 columns) will consist of the constant -0.25. Also q now equals 23, which is the number of restrictions being tested. This tests each model against the mean of the 4 observed trends. (Write out Rb to see why).
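
Writing out Rb numerically may help. The sketch below builds R exactly as described and checks that each of its first 23 entries is a model trend minus the mean of the 4 observed trends. The trend values and the covariance matrix V are invented, and the `wald_stat` helper is just the textbook Wald quadratic form, which I am assuming has the same general shape as, but is not necessarily identical to, equation (16) in MMH.

```python
import numpy as np

n_models, n_obs = 23, 4
k = n_models + n_obs                      # 27 trend coefficients in b

R = np.zeros((k, k))
R[:n_models, :n_models] = np.eye(n_models)   # top-left identity block
R[:n_models, n_models:] = -1.0 / n_obs       # -0.25 in the last 4 columns
# The bottom 4 rows stay zero.

# Hypothetical trend vector: 23 model trends followed by 4 observed trends.
b = np.concatenate([np.linspace(0.15, 0.35, n_models),
                    [0.05, 0.08, 0.10, 0.13]])

Rb = R @ b
expected = b[:n_models] - b[n_models:].mean()   # model_i minus mean(obs)
print(np.allclose(Rb[:n_models], expected), Rb[n_models:])

def wald_stat(R_q, b, V):
    """Generic Wald form (Rb)' (R V R')^{-1} (Rb) over the q nonzero rows."""
    r = R_q @ b
    return float(r @ np.linalg.solve(R_q @ V @ R_q.T, r))

V = (0.01 ** 2) * np.eye(k)               # hypothetical covariance of b
W = wald_stat(R[:n_models], b, V)
print("q =", np.linalg.matrix_rank(R), "Wald statistic:", W)
```

The rank of R is 23, which matches q, the number of restrictions being tested.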

As q increases the critical value of F does as well. For 23 restrictions the 99% critical value for the F2 test is 118.3. The test I have described, if I’ve done things correctly, yields F scores of 661.5 in the LT layer and 903.02 in the MT layer on data spanning 1979-2009. That means we can reject at the 1% significance level the hypothesis that models jointly have the same trend as the mean observations. This is not the same as testing the model mean trend against the observational mean trend (which also rejects significantly).

“In this case the null is: the models and the observations have the same trend over 1979-2009. People seem to be gasping at the audacity of assuming such a thing, but you have to in order to test model-obs equivalence.”
It isn’t audacious – it’s boring. It’s rejecting a hypothesis that no-one thinks is true. If you think that’s wrong, please provide quotes.

The models do not have the same trend as each other – that’s well known. The observations do not have the same trend as each other either – one of the results of MMH was that the highest level of significance went to the difference between RSS and UAH.

Without considering model heterogeneity, you can make a very artificial conclusion about this particular set of models, in terms of what they say about the tropical troposphere. But as soon as you modify that set, the conclusion doesn’t work.

It’s like measuring 23 Swedes to find their average height, and assessing the mean according to the uncertainty of your measuring instruments. You might have a very accurate result, and then show that the mean of a similar set of Danes was significantly different, based on measurement error. But unless you take account of the variability between the 23 Swedes (and between Danes), you can’t say anything about those national populations.
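
The distinction can be put in numbers with a quick sketch. The heights, the between-person spread and the instrument error are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 23
between_person_sd = 7.0    # hypothetical spread of heights in the population, cm
instrument_sd = 0.1        # hypothetical measurement error, cm

true_heights = rng.normal(180.0, between_person_sd, n)
measured = true_heights + rng.normal(0.0, instrument_sd, n)

# Uncertainty in the mean height of THESE 23 people (instrument error only).
se_measurement = instrument_sd / np.sqrt(n)
# Uncertainty in the national mean inferred from a sample of 23 people.
se_population = measured.std(ddof=1) / np.sqrt(n)
print(se_measurement, se_population)
```

The first standard error answers "what is the mean of these 23?", the second answers "what is the mean of Swedes?", and the two differ by well over an order of magnitude here.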

One can create models of electric circuits in the language SPICE and use it to simulate their behavior. The SPICE results correspond to reality and can be used to justify significant investments in tooling. There is no need to find other electric circuit models that will give different results just to be sure that reality is bracketed. Somehow I do not find this “boring”.

Can the non-boring climate models do the same? Do they model reality in the same way that SPICE does?

“The models do not have the same trend as each other – that’s well known. The observations do not have the same trend as each other either – one of the results of MMH was that the highest level of significance went to the difference between RSS and UAH.”

So let me paraphrase it this way:

Our models are different and don’t match the observations.

The observations don’t match each other or the models.

But because the models say we will have extraordinary warming we need to invest billions to cut CO2 emissions.

Re: K Denison (Aug 13 21:06),
No, it just means that whether the individual trends are significantly different from each other, based on the scatter of time series residuals, is not helpful in that decision.

The models still all predict warming. And the observations measure it.

Re: steven Mosher (Aug 14 04:07),
Reduced complexity models have been very fashionable lately. On evaluation, there’s a whole chapter (8) in the AR4 on the subject, including Sec 8.8 on simpler models.

But one thing complex models provide is detailed elucidation of physical mechanisms, which is often more important than the headline result.

If you increased the aerosol parameter, but still within its own uncertainty range, you can predict cooling too. That tells us only that you can get whatever result you like depending on the inputs you make.

The models have obviously been tweaked to match the global temperature. That is not a prediction of anything, it’s a cheat. But they couldn’t cheat to get the correct answer spatially or altitudinally. So yes they are all wrong but neither are they useful for policy. Of course lots of people like to assume they are useful, on the basis of “they are all we have”. That doesn’t hold water outside of academia.

The Douglass et al. test was actually a specific test of the assumption of large positive water vapour feedback in the models. The test and this one falsified it. Of course we now have multiple lines of evidence of that: Lindzen’s outgoing radiation test, the lack of a cooling stratosphere since 1995, the “missing heat” in the oceans. The hypothesis was tested and found to be wrong. It is that simple! Ordinarily in scientific investigation, that would be the end of it and we go back to a zero to one degree sensitivity and call off the apocalypse. The myth persists for other reasons – very human ones.

I’m sorry, you all miss the cart for the horse – the ‘models’ ‘predict’ warming because they assume warming – which happens to ‘correlate’ with ‘real’ warming. Does that mean the ‘models’ have predictive skill? No, for then you would have to ask are the assumptions in the models correct. GHG laboratory theory notwithstanding.

And, then, there is the argument that if the models don’t predict the absolute temp., they at least predict its rate – is that meaningful?

Two tests – are the assumptions correct and, then, retrospectively, did they work?

Or let me put it a different way – if I assume a very simple GHG model and then pile upon it stochastic complexity – and, then, find that my GHG shines through – and, then, find it has a positive temp. curve, even though it’s way out of line with reality – well, I’m right pleased – I got my result!

It strikes me that you (and James Annan, and others) have a point: you can claim that if you include enough models, some of those model runs are consistent with the observations. But where you are wrong, and wrong headed about it, is that this doesn’t allow you to make any scientific “proof” statements about the models in general. You can have your point; it is a very weak one and doesn’t mean what you seem to think it means, and I assert you would be foolish to, for example, urge urgent action on anyone based on the fact that you have proved (and not the fact you seem to think you have proved).

What is clear is that what we could call the “mean model”, that is a model that has the typical characteristics of the models with results that represent the mean of the results, is NOT, understand that word ? (and I don’t think from the evidence to date you do), NOT, consistent with observations. That is a stronger statement that actually informs us about the properties of the models. And what is clear is that at this point the models in general wildly overstate the degree of warming to be expected.

Re: Ed Snack (Aug 14 04:45),
I haven’t sought to make proof claims. Here it’s the other way around; MMH are claiming to prove that models and data are significantly different.

Yes, if they are shown to be not significantly different, that is a null result and doesn’t prove anything. If you want to claim that illegitimate assertions have been made there, you should provide a quote.

So our conclusion – that model tropical T2 and T2LT trends are, in
virtually all realizations and models, not significantly different from
either RSS or UAH trends – is not sensitive to whether we do the
significance testing with “ocean only” or combined “land+ocean”
temperature changes.

Bottom line: Douglass et al. claim that “In all cases UAH and RSS
satellite trends are inconsistent with model trends.” (page 6, lines
61-62). This claim is categorically wrong. In fact, based on our
results, one could justifiably claim that THERE IS ONLY ONE CASE in
which model T2LT and T2 trends are inconsistent with UAH and RSS
results! These guys screwed up big time.

Re: TAG (Aug 14 05:23),
Context. That quote follows on immediately from where he lists the H1 pairwise comparison results, as they appear in Santer et al. He’s just listed the numbers of results in which pairwise model-result significance is rejected, and that is what he is talking about.

But that’s different from the proposition here that “the models and the results have the same trend”. That’s actually the point of H1 – you don’t have to try to define a common trend. But that’s the basis of the null hypothesis that Ross states.

Danes was significantly different, based on measurement error. But unless you take account of the variability between the 23 Swedes (and between Danes), you can’t say anything about those national populations.

The tack taken previously by Santer et al and by Douglass et al has been to treat all models as an ensemble of attempts to correctly simulate the “real climate” (pardon me). The large spread in their derived values of equilibrium climate sensitivity (ECS) naturally leads to a large standard error of the mean. Since this large spread has persisted for decades, so has the large standard error. This practice appears to prevent the weeding out of those models that exhibit the largest departure from observational history.

An alternative approach, which would likely be followed by physicists, is to assume that each model contains different physics and that some models agree better with observations than other models. We would then try to find out what structural or parametric properties differentiate the better agreeing models from those that depart more substantially from observations. From this process, there is a good chance we would learn something about climate and about climate models that would be useful in further research.

However, nothing like this is done or appears to be considered. Instead we have continuing salvos between defenders of “THE MODELS” and those who challenge their conformance to observation.

It was my impression, like Lazar’s, that the model ensemble represented a ‘sample of the model space’, of which the observations of the real world are one example, but not necessarily one represented in the set of available models. It’s a bit like string theory I guess, narrowing the space of possible theories via constraints. Anyway, looking forward to your response to that perception.

I’m not sure if this is a useful comment – it’s saying little more than what others have said indirectly – but I think the discussion could be helped by explicitly moving it to the meta-model level.

At this level you can explicitly attach concepts such as purpose to the model, and introduce evaluation criteria like fitness for purpose. Viewing things from the meta level means that evaluation criteria become explicit and the statistical techniques to use in those evaluations become more clear cut.

Some quick comments that follow from taking this perspective are:

1. Bottom up models are not the only ones that can give forecasts of future climate. Utility and simplicity might mean that top down or partial models are “better” for a range of purposes. (Think about the use of multi-scale modelling that includes the nano scale to predict chemical reactions – it’s coming but so far for most practical applications “top down” and partial models reign supreme). I don’t buy the idea that the climate models are all we have and therefore public policy has to use these.

2. Models whose explicit purpose was just to be sure the observation fell within the range of projection would not be of much interest, particularly if they were complex and there were much simpler models that could do the same thing. (Relevant to Pielke Jr. vs Annan).

3. In the early days people working on complex modelling techniques may set themselves limited goals for their complete models, but as someone observed in this thread the real effort would go into the subsystems first. It is the performance of these that gives confidence in the final outputs, particularly where validation of the whole model is difficult.

4. Multiple different models all with a common purpose are best dealt with separately first because until that is done we really don’t know how much information they contain. It’s not clear to me what evaluations of composites of any kind from multiple models tells us under these circumstances.

etc

I do think it is useful to think of these climate models as being in the same state as multiscale modelling for chemical reactions. The problem right now is that I suspect they are being used for purposes well beyond their demonstrated capability.

For public policy we should be looking for alternatives until the state of the art improves.

David Stockwell and HAS, forgive my Joe Blog ignorance but what?!! Surely these ‘models’ must be used as predictors ie have predictive skill? Now we can test each model individually or test them as an ‘assembled mean’ – they fail all tests – the rest is mere verbiage.

Lewis, my point is that models can predict a large number of things. Much of the debate in this thread has been about how climate models should be judged i.e. what is it they are trying to do.

Climate modelers here are arguing for a low level of test. A couple of examples: “we weren’t modeling the short-run mean, just the rough rate of temperature increase” and “we’re doing this to give insight into the underlying processes”. They are arguing the tests being proposed by others are unfair.

My point is that it helps to get purpose and performance criteria upfront as part of the model specification. That way you focus the debate on the important issues, and models don’t get misused because they have yet to demonstrate their fitness for purposes, or because they were never designed for that purpose anyway.

My view is that both have happened with complex climate models, and it is therefore not surprising that the modelers now want to lower expectations. This doesn’t mean that we should stop working on them, but let’s get the purpose and performance criteria clear.

If I understand you correctly, agreed – because models aren’t merely or even constructed as crystal balls but are created in order to replicate current understanding ( and so help in furthering that understanding ). The question, however, is, in this thread, how does one test such models? One way is to compare output predictors against real temps etc. Hence supra. You do agree that we must be able to verify if the models replicate what they’re supposed to replicate?

“You do agree that we must be able to verify if the models replicate what they’re supposed to replicate?”

Absolutely, and I’m further saying that modelers need to be explicit about what their models are designed to replicate (and desirably the tests to be used to validate and verify them). That then gives two areas for debate: are the tests the appropriate ones given the purpose, and how well does the model perform against those tests. The two issues are getting mixed up supra.

Can we not take each of the models individually, run them 100 times from, let’s say, a starting scenario of 1990 to the present to determine a sample of “change in temperature” over the 20 year period? Then we look at the real change in temperature over the period, let’s call it X, and say “where on the distribution curve does X lie?”, and from that evaluate the accuracy of the models individually?
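
Something like the following sketch is what I have in mind, with simulated numbers standing in for the 100 runs of one model. The run distribution and the value of X are hypothetical, and the helper name is made up for illustration.

```python
import numpy as np

def percentile_of_observation(runs, observed):
    """Fraction of simulated values at or below the observed value."""
    runs = np.asarray(runs, dtype=float)
    return float(np.mean(runs <= observed))

rng = np.random.default_rng(7)
# Hypothetical: 100 runs of one model, 20-year temperature change in deg C.
runs = rng.normal(0.50, 0.10, 100)
observed = 0.30                       # hypothetical observed change X

p = percentile_of_observation(runs, observed)
two_sided = 2 * min(p, 1 - p)         # rough two-sided tail fraction
print(p, two_sided)
```

A value of p near 0 or 1 says X sits in the tail of that model's own run-to-run distribution, and the same check can be repeated model by model.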

If the models are shown to be individually inaccurate, then combining them as one statistical ensemble just to get a wider standard distribution which encompasses the observed data is extremely UNSCIENTIFIC. Have there even been any scientific papers published on the “science” of how an ensemble of individually invalid models can actually have any kind of scientific validity or accuracy by taking a mean ? The whole concept smacks of convenience and lack of rigour. Surely this ridiculous concept is easy pickings for a competent statistician to take apart.

The only use I’ve seen of known bad models is to use several which are believed (based on fundamental considerations) to over estimate or underestimate to try to bound the solution. I don’t think that approach would be operationally useful in this case. Perhaps others have better ideas?

Furthermore, if the ensemble of models were run over the whole 1900s, I’m guessing there would be close to no deviation from the historic temperature trend (that they get trained on), with the deviations occurring of course once they enter uncharted territory. This on its own is enough to discredit the ensemble as meaningless.

1. The ‘hockey stick’ has no real evidence ( unless you count ‘intuition’ as evidence )
2. The models fail any kind of reality test.
3. Oh what was 3 – something to do with ‘global’ temps!
Wow. And, yet, when not distracted, there is such a rich field of real research!
Steve, I’m very interested by how statistics are used and abused – in the pharma industry this is a real plague – such and such is correlated with such and such – replacing any common sense notion of cause and effect.
Of course, we’re supposed to take this with a pinch of salt – but now, a whole fauna of ‘peer-reviewed’ papers spews out with such spurious nonsense. I mean, you should be glad you only have to comment on GCMs and ‘hockey sticks’. Don’t envy you!

I commented on his post as appears below. It is from memory so may not be exactly word for word, but is very close. I was hoping to get James to respond. Instead, my comment and question disappeared in due course. Here it is:

James,
I think I understand what you are trying to say. It is obvious you are able to write more clearly when you wish to. Here is what I think you are trying to say:

1. MMH is correct but their conclusions are wrongheaded and irrelevant because Santer screwed up his analysis of model ensembles.
2. The IPCC Experts at NCAR are dummies for following Santer’s wrongheaded statements about model ensembles.
3. This is all very embarrassing because you tried to warn climate scientists that standard practice in handling this is clearly wrong and nonsensical, but your warning was damnably cryptic. Climate scientists are following Santer’s wrongheaded but clearly written analysis and ignoring your brilliant but poorly written analysis.

Do I have it about right?
———
I thought it was a fair restatement and fair question for clarification. Was I wrong?

This ensemble business reminds me of the Dire Straits song “Industrial Disease” which has the lyric “Two men say they’re Jesus. One of ’em must be wrong. There’s a protest singer. He’s singing a protest song.”

All of these exercises do seem a bit like expecting the average phone number to be meaningful. 90% of the intellectual effort on these models should be going into getting the most precise and accurate understanding of why they differ and then developing tests to address those specific meso- and micro-differences. Otherwise, astrological or chicken-entrail estimators could be calibrated to past data and thrown into the mix with no reason to believe that they would not “improve” the ensemble in some way.

PS: It seems to me that “consistency” is a reserved term in statistics, like unbiasedness or sufficiency, and should not be abused in the vague way that the climate guys are abusing it. A consistent estimator is one that converges to the true value of a parameter as the amount of data goes to infinity. So if the ensemble mean is a “consistent” estimator of the temperature trend, then its spread should shrink and be centered on the true value of the trend as more data accumulates (assuming there is a stable trend). Some well-known specification tests for econometric models exploit this property/definition of consistency.
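To make the statistical meaning of consistency concrete, here is a small Python sketch under stated assumptions: the "data" are synthetic series with a known linear trend plus independent Gaussian noise, and the estimator is a plain least-squares slope. The function names and the parameter values (true slope 0.02, noise 0.1) are hypothetical choices for illustration, and real temperature series are autocorrelated, which this toy ignores. It illustrates the textbook property only, not any test applied to climate models.

```python
import random
import statistics

def ols_slope(y):
    """Least-squares slope of y against the time index 0, 1, ..., n-1."""
    n = len(y)
    t_mean = (n - 1) / 2
    y_mean = sum(y) / n
    num = sum((t - t_mean) * (v - y_mean) for t, v in enumerate(y))
    den = sum((t - t_mean) ** 2 for t in range(n))
    return num / den

def slope_spread(n_points, true_slope, noise_sd, n_reps, rng):
    """Mean and standard deviation of the slope estimate over repeated synthetic series."""
    estimates = []
    for _ in range(n_reps):
        # Synthetic series: known trend plus independent noise (an assumption).
        y = [true_slope * t + rng.gauss(0.0, noise_sd) for t in range(n_points)]
        estimates.append(ols_slope(y))
    return statistics.mean(estimates), statistics.stdev(estimates)

if __name__ == "__main__":
    rng = random.Random(1)  # fixed seed so the sketch is repeatable
    for n_points in (20, 80, 320):
        mean_est, sd_est = slope_spread(n_points, 0.02, 0.1, 200, rng)
        print(f"n={n_points:4d}  mean slope={mean_est:.5f}  spread={sd_est:.5f}")
```

For a consistent estimator the spread column shrinks toward zero and the mean stays centered on the true slope as the series gets longer. An estimator whose spread does not shrink, or that settles on the wrong value, is not consistent in this sense.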

srp,
It seems to me people are coming up with weird ensemble approaches and writing about them in intentionally vague ways in an effort to preserve some bit of credibility for the models. For example, James Annan wants it in the record that Santer’s ensemble approach is wrong but he would not say so clearly. Even he admits his criticism was “cryptic.”

One of the ways IPCC gets the very high estimates of future warming is to use not only the high emissions scenario, but also the upper end of the model forecasts. The wider the spread of model outputs, the worse the upper warming forecast becomes. If you drop the models with the worst agreement with 20th Century climate, you also tend to drop those with the most future warming. Thus the alarmists have an interest in keeping all the models.