A Mixed Effects Perspective on MMH10

Today’s post is complementary to MMH10, which, as readers obviously realize, is in Ross’ excellent style. There has been a kneejerk reaction from climate scientists that the article is “wrong” – typically assuming that we have neglected some trivial Santerism (which we haven’t).

This post is NOT – repeat NOT – an explication of MMH10. Let me repeat another point – Santer’s ensemble results don’t hold up with additional data using his own method. So this result doesn’t depend on the validity of the MMH10 method.

Some of the defences of the Santer results are based on the idea that the standard deviation of an “ensemble mean model” should be huge relative to the standard deviation of realizations of any given model. Unfortunately, the precise meaning of an “ensemble mean model” ishard to pin down in precise statistical terms.

I thought that it would be useful to try to pin the concept of “ensemble mean model” down in statistical terms. As I was doing so, I did the simple boxplot diagram presented here that provided interesting results and made me think more about the stratification issue. But this post is not an elucidation of the MMH10 algebra, which comes at things from a different perspective. ’ll try to figure out a way of presenting the MMH10 algebra in a friendly version as well on another occasion.

UPDATE Aug 13 – there was a mistake in the collation of trends that I used in this post -which was (confusingly) supposed to help clarify things rather than confuse things. I used trends calculated up to 2099 instead of 2009, which ended up making the within-group standard deviations too narrow. I’ll rework this and re-post. However, I’m going to Italy next Tuesday and am working on a presentation so the re-post will have to wait. This has also been a distraction from the MMH issues so let’s focus on them.

The point that several people are making is that you can’t use the SE of the mean trend of the models to determine whether the models are consistent with the observations. SE has a 1/root(N) so as N increases you are inevitably going to conclude that the models are inconsitent with the data.
Here are some relevant quotes.

From Santer:

“DCPS07’s use of σSE is incorrect. While σSE is an appropriate measure of how well the multi-model mean trend can be estimated from a finite sample of model results, it is not an appropriate measure for deciding whether this trend is consistent with a single observed trend.”

Two from ‘M’ on Annan’s blog:

“I can’t swear that the McKitrick paper uses this the sigma-SE where they divide the standard deviation by the square root of the number of model runs, but given their tiny error bounds on the model uncertainty, it sure looks that way (they don’t seem to specify in the text). ”

Answer: Almost certainly not, given Table 1. eg, this uncertainty limit would reject many individual model runs, which is clearly wrong (if you are trying to show that observations are not consistent with models).”

and two from Pete here:

“So the error bars on the model estimates represent confidence intervals, not prediction intervals? Isn’t this the same mistake Douglass et al made?”

“The problem is not with your standard errors, it’s that you calculated confidence intervals from those SEs instead of prediction intervals.

How many false positives do you get when you run your test substituting model runs* for observations? How does that compare to the nominal size of your tests?

It’s hard to figure out the theoretical rationale for Santer’s se_model term (also used by Douglass) when expressed in the terminology used here. Santer calculated the mean trend for each model and then calculated the inter-model standard deviation. This yields (in mixed effect terms) the between-group standard deviation, rather than the relevant within-group standard deviation. They then divided the between-group standard deviation by the square root of the number of models (minus 1) – a divisor of about 5. Because, in this case, the between-groups standard deviation is about 12 times larger than the within-groups standard deviation, this reduces the Santer model standard error to a number that is within an order of magnitude of the within-groups standard deviation. Because the quantities are small relative to the se_obs, this doesn’t affect the results too much. But it’s calculated wrong and the lack of statistical understanding distorts the debate.

The nswer seems to be that the critics are misundestanding what was done and statistics in general. The don’t understand what a MMM is

I personally keep thinking of this as – or at least analogous to – the difference between accuracy and precision.

All of the models presented are quite precise. An additional run of the same model yields a tight scatter on previous results.

But issues involving the accuracy completely swamp the decent precision.

If one were presented with five sticks and a declaration “These are all metersticks” – and they yielded a plot like the above, there would be an obvious call for calibration. “Ok, stick 1 always reads 2.7% below the mean, stick 2 7.4% above…” (And calibration can be extended to whatever order seems useful.)

You would be able to get meaningful information out of “sticks” – the accuracy corrected by the calibration, and the tight precision carrying through.

But instead of minimizing accuracy error through calibration, the combination into a large ‘ensemble error’ is odd.

Steve: I don’t think that the issues here have anything to do with this particular metaphor.

ya Jeff, that is a shocker. especially for a system that large. It tells me that the model is pretty much deterministic. It tells me that the differences in answers are structural.. related to either missing processes or miscontrued processes. Like using a constant value for TSI in the forecasts going forward.

I don’t understand why you are surprised. Since the physics is similar, the underlying assumptions are similar and the numerical method is similar, it seems to me obvious that you’re gonna obtain more or less similar results within a given model, no matter how many runs you try.

By similar results, I mean similar mean trends associated with similar intrinsic uncertainties (that can be large). Since the mean trends of a given model are similar, it is obvious that the corresponding std deviation is going to be very small.

What I do not understand is how Steve handle the individual (big) uncertainties.

In the modeling world it is too expensive and time consuming for each team to build a model from scratch. Expect lots of sharing of code and data. Having lots of modelers can be a good thing since it results in more eyes looking at the problem. Good solutions are kept, and things that don’t work are discarded. I seemed to notice that they have over-emphasized the aerosols from volcanic erruptions, this may go back to the idea that they haven’t a clue as to what causes these periodic dips in temperature (with GHGs increasing all of the time). Someone has oversold the idea of aerosol cooling and this showed up in the difference between the measured and modeled data. They also can’t do phase. Any phase difference is going to cause a divergence between measured and modeled results. I would skip trying to model oscillations and just try to get the broad long-term slope. After all, these models are being used to estimate long-term changes in climate. Will it be a 1 deg, 3 deg or 10 deg increase in 100 years. Any missed perturbation of the long-term slope as to phase or
amplitude looks bad. Just say we don’t know, this is as good as we can predict, its a ball-park. In modeling, you have to know your limitations.

A while back Gavin had this to say during a discussion hosted by the Bulletin of Atomic Scientists

“However, their models are all limited by similar computational constraints and have developed in the same modeling tradition. Thus while they are different, they are not independent in any strict statistical sense.”

He was talking about the 20 or so climate groups undertaking modeling work.

Here’s another analogy taken straight from classic social science statistics on stratified samples.

Let’s suppose that a school district has schools in many different neighborhoods and that there is considerable difference in family income between neighborhoods but little difference in family income within a neighborhood. A boxplot of family income by school would then yield a boxplot like the one that I’ve shown.

Think now what a “median school” would look like. It would be the school with the median family income and, like all the other schools, would have relatively low within-group standard deviation.

The “ensemble mean school” should not have properties that depart remarkably from the median school. Thus it is an ideal school that comes from a neighborhood with the overall average income and with a within-school standard deviation that is in some sense an average of the within-school standard deviations over the school district.

Here’s what it wouldn’t be. It isn’t a school with an average income but with an income distribution that encompasses the entire city. (This is Gavin’s analysis.)

Nor does it logically derive from the standard deviation of the inter-school average (the “SE” analysis).

Both perspectives miss the stratified structure of the data and how it affects the analysis – hence, the wrongheadedness of the discussion by climate scientists on both sides.

I like this analogy a lot. It makes sense and it’s easy to understand why this statistical method is being used in the first place. I’m going to be interested in seeing how our supposedly sophisticated CC statisticians react to it. I suspect they’re just going to dig themselves in deeper in terms of showing their lack of actual statistical savvy.

Nice work Steve. To put it in even simpler terms, perhaps even those a Senator could understand (well, maybe not), it is quite obvious from the boxplots that 5 of the six models are differerent from the observed temperatures. The climatologists can’t then just toss in one that isn’t different and claim the ensemble is now magically consistent with the observations.

I think this comment nails it. You don’t really need to delve into the arcana of statistical analysis when there are only half a dozen things to compare. Even if there are 20 or 30 models you could simply compare them one by one to observations, and count the ones that fail the test.

An interesting, if off-topic question: if only one or two models do match past observation, why is this so, and what do those models say about the next 50 years?

If I remember correctly, the stratified sample was the invention of one George Gallup, for whom the Gallup poll is obviously named. It was studied in my psychology courses, and is used for many different scientific studies. It made Mr. Gallup rich and famous, as he became known for being able to more accurately predict the outcome of elections (notably presidential elections) than other pollsters at the time. He was able to make these more accurate predictions even when using a much smaller sample size than other pollsters.

Santer defenders are increasingly making the point that, with enough passage of time, there is liable to me a “statistically significant” difference between models and observations even in circumstances where the models might have done a pretty good job.

Unfortunately, that point doesn’t help a whole lot for Santer’s argument that there wasn’t a “statistically signficant” difference between models and observations – which isn’t true using up-to-date data.

As to whether models have done a “pretty good” job on this point or not – surely there’s a role for common sense here – as Ross’ diagrams show very clearly. The ensemble mean trend is twice as high as RSS and four times as high as UAH. That’s the ensemble average – yes, there are some low end models that are not necessarily contradicted, but that only increases the pressure on the high end models.

This is a common sense point that got confused by Santer et al 2008.

Lucia makes this point clearly as follows:

It is true that in the limit that you collect an infinite amount of data, you will reject the multi-model mean if it’s wrong by even a small amount. With a long enough observational periods– day 1000 years–, you could reject the model mean even if it’s only off by 1%.

But that’s not considered problem in statistics, it happens in many statistical methods. Should you ever have enough data to reject the mean when it’s only off by some trivial amount (say 5% or so), you just point out that you got a statistically significant result, but the magnitude of the effect is small.

That’s not what’s happening here. In this paper, the model is off by a factor of 2 to 4, and the result is statistically significant.

Oof! We’re soon going to need a degree in stats for follow this weblog but I suppose someone has to attempt to educate The Team and any other climate scientists who happen by. Let’s hope this find this useful rather than dismissing it.

One thing about dividing by 1/sqart(n-1); I thought this was founded on an assumption that the errors in question had a normal distribution. Is there any legitimacy in applying that to the models as Santer does?

This is probably one of those things , like using simple least squares fit on data with errors in BOTH x and y, that many people seem to do blindly without bothering to check the assumptions/conditions implied in using the technique.

I’ve seen this used to calculate the error in a fitted straight line in relation to the number of data points used in a graph where the experimental errors can be assumed to have normal distribution.

Can that be justified in averaging different models? It seems to be making important and unstated assumptions about the nature of the error in the models.

Steve: normal distribution is a separate issue. Don’t worry about it in the context of stratified models. Re-read the post as you’ve misunderstood the point of distinguishing between within-group variance and between-group variance. This has nothing to do with normal distributions or not.

I maintain that a between-group analysis of models is meaningless, no matter what statistical method is used. The inputs that give rise to the differences in model outputs differ in kind from one another, they are not in any sense distributed around a mean value. If model A uses f(x,y,z…) and model B uses g(x,y,z…)to come up with a value for ouput M, it makes no sense to do statistics with M(A) and M(B).

The common sense approach here is to determine which model outperforms the others for each given output of interest.

The variation between models has everything to do with the algorithms selected by the researchers and their unique way of wiring the whole system together.

For example: For future forcings modelers have a choice.

Use a constant TSI. which is known to be false
Use a TSI that varies according to some cycle, like 11years. GISS model E does this. This too locally imprecise, in that the exact phase and amplitude of the TSI is not entirely predictable. And the variation in TSI is rather small. But that’s the kind of thing you would expect all modelling groups to agree on. Lets all use an 11 year cycle within these bounds and within these amplitude bounds.

Now, there is a reason to keep the SPREAD of models wide.
That way they all survive the tests.

Any again, people need to look at the attribution studies to see how these stats played out there.

One thought: If the models corresponded better to reality — in particular, if they reproduced the full spectrum of the noise correctly (we’re back to Koutsoyiannis and LTP processes) — the within-group variances would be much larger than those currently reported. So large, in fact, that there would likely be no significant discrepancy between the model trends and the observed data. Problem solved!

Well, not entirely. Once you admit LTP, a new problem emerges: You might find that there is no significant trend to discuss.

Tim, the observations are about half the model ensemble. In our first submission to IJC, we observed that if Santer-style methods had been applied to test whether there was a significant trend using their original data, they would have got no significant trend. One of the reviewers stated:

The question of trend significance from zero was not covered by S08 and is a side show. I see no reason why this should remain in any published submission.

Take, as an example, the GISS_ER model which has 5 runs at PCMDI. The mean (trend) is 0.308 deg C/decade; the within-group population of trends has a (tight) standard deviation of 0.00183 deg C/dec. In this case, the standard error of the model is obviously the standard deviation of the population of runs (which is a within-group variance).

By doing so, aren’t you throwing away the individual uncertainty associated with each run?

(Above is Fig. 3 of Santer08 with data up to 1999, below is Steve’s figure with updated data)

I think I understand what you have stated in the thread introduction Steve M:

1) Santer et al. 2008 compared individual model realizations (57 in all) to observed results and used the standard deviation of the trend slope for each model realization in the comparison.

2) What you show in the box plots is the mean of the realizations for 6 different models and the variation for each of the 6 ensemble means. You also show a single realization (of course) of UAH and RSS with the standard deviations of the single realizations.

Santer showed 5 realizations for the model MRI-CGCM2.3.2 in Santer et al. 2008 with the trends of:
0.042, 0.348,0.277, 0.362, 0.371 and a mean = 0.280 and sd = 0.138
If the outlier first realization is removed you obtain mean = 0.340 and sd = 0.042.

I would like to ask whether you had considered using the trend difference series between the surface and the troposphere temperaturess for panel regression or mixed effects analyses.

I had not seen Lucia’s rejoinder before about using SE as in “But that’s not considered problem in statistics, it happens in many statistical methods. Should you ever have enough data to reject the mean when it’s only off by some trivial amount (say 5% or so), you just point out that you got a statistically significant result, but the magnitude of the effect is small.” Brilliant.

Your trend estimates for MRI-CGCM2.3.2 have a standard deviation of 0.0076 (sd(Info$T2LT[Info$model=='mri_cgcm_2.3.2a'])). But Figure 1 of Santer shows trend estimates for realisations from the model with a standard deviation of 0.14.

Report from the Great Unbathed: I am a scientist who has never had to use statistics and who, therefore, is woefully unable to keep up with the statistical postings here. NEVERTHELESS, I find postings like this one to be fascinating and–even for me–ultimately informative. I keep learning, little bit by little bit. Between the box plot and the “schools” analogy, I think I get it. I much appreciate this kind of post and the comments. Thanks to you all. Much appreciated. Please keep it up.

I believe if you added one or more of the surface temperature observations to the boxplot, you’d illustrate even better exactly what Gavin’s method of “analysis” accomplishes.

Namely, Gavin’s method lets those models that are wrong about the surface record make his ensemble range overlap the troposphere observations, while letting those models that are wrong about the troposphere observations create an overlap with the surface observations. Thus, he can claim “no model-data inconsistency”
even though no single model matches both surface and troposphere observations.

my take on this (shoot me down if I am completely wrong) is that if we had a case where the individual model runs were exactly the same (ie a linear model) the standard deviation within the model runs would be zero (not a million miles away from reality).

In this case the Se-model term would be exactly right and (as there is a small deviation in the real inter model results) the error in the real se-model term is small.

Where I am confused is it seems that the H1 and H2 tests seem to have taken the divisor as the number of model runs where in my case clearly as the model runs are are all the same the divisor should be the number of models.

It seems to me that we have exactly the opposite situation of what you would expect – ie you would expect that the single model, multi-run SD would be much larger than the observational SD (many histories versus one history). Since it is not the conclusion must be that the models have nowhere near the natural variation that the actual measured climate has. I believe GS has stated that there is a “minimal” role for natural variation in these models – these graphs point it out. So, what the modelers are trying to do is aggregate the cross-model variance as a substitute for actual natural climate variability, which the models lack and can’t seem to simulate. That is of course unless there is instrumentation or sampling noise that is masquerading as natural variability in the temperature record?

This is not a correct interpretation. Each run of a single model has a strong variability, and therefore a large uncertainty associated with the mean trend (see Fig. 3-4 of Santer08).

In the case of a multi-run, single model, all runs have similar mean trends, associated with similar large uncertainties.

It seems to me that Steve calculate the SD from the set of mean trends… which gives a small SD since the mean trends are similar! It also seems to me that, in the process, the large individual uncertainties (which represent the variability within a given run) have been thrown away.

nono, here’s my take on your point. let’s think about it statistically. Let’s assume that the population of trends within a model has a “largish” standard deviation along the lines of that of the individual runs. The distribution of the sample standard deviation/variance has a known distribution based on normality – which is a reasonable enough base case to verify the math. This distribution is a sort of gamma distribution (Pearson Type III) according to http://mathworld.wolfram.com/SampleVarianceDistribution.htm .

The observed sample within-group standard deviation is very low. The chances of observing that low a sample standard deviation would be extraordinary unlikely given the distribution.

Intuitively one would expect the sample standard deviation to be about the same size as the standard errors thrown up in trend estimation. So if the sample standard deviation is much smaller – as it is, then this requires an explanation. The discussion of covariance in MMH may help you here.

You also need to assume independence for this line of argument. Independence is a reasonable assumption between models and observations (unless you’re Ross, then it’s “overly pessimistic”), but amongst model runs from the same model it’s not likely to hold.

I’d like to see what Figure 2 would look like if the uncertainty in the model trend estimates and the uncertainty in the estimate for the ensemble model mean were included.

Steve, can you confirm that what you are showing is that the trends of a single model don’t show much change over several runs with different CMIP4 forcings and NOT the variance associated with individual trends themselves? As I read it, the variance of the trends themselves is much, much larger than the variance of the mean of the intra-model trends.

Or am I off-base here?

Steve – Ron, it would be helpful if you tried to express your question, if you can, using the terms within-group variance or between-group variance. Otherwise there is a risk of crosspurposes.

I haven’t used the term “variance of the mean of the within-group trends” – I’ve talked about the within-group population variance (or equivalently standard deviation.)

MMH10 looked at mean and variance of the trend of the multi-model-means (inter-model or across-models) and compared that with the mean and variance of various observational data sets.

Chad and I are looking at the mean and variance of the trends of individual models and comparing that with the mean and variance of various observational data sets.

McIntyre in this post is looking at the mean and variance between the trends of multiple runs of a single model (intra-model mean or within-group mean). This shows that the modeled trends output are relatively insensitive to the perturbations of the intial conditions used in the CMIP4 runs.

You got it! Only I don’t think instrumentation noise is anywhere near the level of natural variability. Nature variability is tremendous. When you think about the fact the Northwest Passage was clear in 1944 just like it was in 2007, that tells you something important about variability in arctic sea ice. The ice did not disappear and reappear because of poor instrumentation.

Seems like they could add a stochastic component to match observed natural variability, but that would widen the terminal distribution in 100 yrs to include lots of minimal warming or possibly cooling. I think what needs to be understood is the wide variability in outcomes in IPCC is not model run variability simulating natural climate variability but model assumption variability or CO2 scenario variability. I’ve never considered that before.

To put it another way, there really is no Brownian motion type of process in these models that will let them wander away from the warming drift. It’s a reasonable question to ask whether that isn’t a category one flaw given how the temperature anomaly data looks. To argue the other side: I suspect that since the air 6 ft off the ground holds < .1% of the earth's heat content, we should be able to come up with a heat content measure that monotonically rises with increasing CO2, even if the air temp measurements are noisy. So, why can't we?

“the models have nowhere near the natural variation that the actual measured climate has”

I have to admit that having followed this topic somewhat closely I failed to grasp this essential point until just now. The incorrectly calculated ensemble SD gives a casual observer (like a peer reviewer) the misimpression that the models are simulating real world variability when in fact its just an artifact of bad math that’s imprinted the between-group variance onto the ensemble distribution. Wow.

Steve, nice post. MMH10 could turn out to be quite the important paper.

How the SDs computed for the observations? I assumed that it was some calculation based on the noise in the data but if that is the case are the SDs for the models calculated in the same way? Do these results really show the models have no ‘weather noise’?

People who do not understand why panel models are necessary (and therefore either live in a purely time series world or a purely cross section world) are frequently surprised when their conclusions no longer hold when their data meets a properly specified panel model.

It is hard to dig those people out of the ground when they lack both the intellectual capacity and the financial incentive to do so.

I think I understand the context for the above. First, Santer et al tried to rebut Douglas and show that computer models as a whole do a good job predicting actual temperature trends. Second, their approach contained statistical mis-steps which suggest that they lacked an understanding of what they were doing. Third, when theor approach was used in a statistically more defensible way against a fuller set of available data, i.e, a 20 year temp record, the models do a lousy job matching the actual trend. Fourth, efforts to explain away the negaitve result show the same basic lack of understanding of statistics.

That said, and my apologies if this has already been asked and answered, I am puzzled by why anyone would use an ensemble of models in the first place when there is little evidence that any of the individual models accurately predict the observed trends. Is not this one of Annan’s issues

Steve: The Pythagorean sum applies under the plausible assumption that the standard error of the models is uncorrelated with the standard error of the observations.

Better tell Ross:

MMH10: While detrended climate model projections may be uncorrelated with observations, the assumption of no covariance among trend coefficients implies models have no low-frequency correspondence with observations in response to observed forcings, which seems overly pessimistic.

Your paper says that equation (5) needs a covariance term. You’ve argued in this post that no covariance term is necessary.

Steve: in this post, I was trying to explain in simple terms why the results “make sense” using a different perspective. The within-group standard deviation is obviously very “small”. In MMH10, we observe that there is covariance between residuals and this has to be considered and does so using econometric methods, getting similar results. Both results derive from their internal logic. I think that the reconciliation of the two perspectives (and it’s an interesting exercise that I’m just mulling over) is that it is the covariance of residuals that results in the standard deviation being “smaller” than one would otherwise expect and thus the points reconcile neatly. I’m not 100% sure about this but it makes sense in a couple of ways. I may not revisit this point for a while since I’m going to be away for about 10 days.

Steve is talking about correlation between the standard error of the models and the standard error of the observations. Ross is talking about the correlation between trend coefficients across models. Different point.

Understanding that I know bupkus about higher-level stats, and *know* I don’t, please understand this post in that context, rather than, say, somehow I am proposing to have found the fatal flaw in MMH10.

I found myself thinking it interesting that you’re applying “social sciences” statistical methods to essentially physical sciences data. This made me wonder if the level of “intelligence” that can be assumed to be buried in any social science data analysis has some relevance as to the appropriateness of that mode of analysis in this context.

But then I “hmmed” to myself some more, about the fact that we’re talking about groups of model runs, and while (one hopes) the model runs are tied closely to physical reality, one would be a little bit naive to not recognize the impact of a social phenomenon often pointed at by skeptics re a problem with establishment climate science –“groupthink”.

So. . . I guess I would at least be interested, if you care to expound (either in the direction I’m speculating or not), as to why you believe a social sciences statistical method based, presumably at least in part, on herd behavior, is appropriate here, and whether that factor played a part in your deciding to apply it here.
Steve: irrelevant. It’s just a technique for handling grouped data.

So. . . I guess I would at least be interested, if you care to expound (either in the direction I’m speculating or not), as to why you believe a social sciences statistical method based, presumably at least in part, on herd behavior, is appropriate here, and whether that factor played a part in your deciding to apply it here.

There is no such thing as a purely social sciences (or any other sciences) statistical method.

Statistical methodology is based on a set of assumptions under which statistics most relevant to a situation can be calculated in an optimum fashion and decisions made using proper hypothesis testing techniques. If the assumptions for a given procedure are satisfied, the method can be reasonably applied to a problem in any scientific area. Some methods (e.g. time series, experimental design, survey sampling, multivariate analysis, etc.) are popular in specific disciplines (and may even have been developed in those disciplines) because a particular type of environment is commonly studied in that area of science.

A major problem with climate science is that it has been extremely rare for researchers using statistics to explicitly state an appropriate statistical structure in clear mathematical terms so that one can verify that the assumptions implicit in the application of a procedure are satisfied.

Before the PCMDI was started, there were templates for modellers distributed, so that certain values or data sets would or could be taken as “semi-fixed” before modelling began. For example, the CSIRO Australia sheet of January 2005 is at

Theoretically, the adoption of values for some variables invalidates, or requires caveats on, the later use of comparative statistics. This is especially the case when the variability of such a semi-fixed factor is unknown. The method forces within-group variability to smaller values if the same values are used in replicated runs.

It follows that some of the between-group variability depends on whether each group uses the same values for these “semi-fixed’ variables. In the CSIRO case, one such example chosen from many is stated as

“6. surface albedo scheme
Land surface albedo taken from Sib data set (varying monthly).
Snow albedo changes according to snow age and zenith angle.”

I have not looked beyond Australia to see how much pre-calculation commonality was defined by such “semi-fixed” variables.

However, the very act of circulating a check list of inclusions for modelers introduces a distortion. It is common sense to agree on values like pi, Stefan-Boltzmann, etc, but then there is an intermediate group of these “semi-fixed” that might cause a modeler to say “We were not going to include that. But let’s do it in any case, even though it’s not our specialty”. (This is of course, speculation on my part).

The existence of the template confirms that there was some organization of responses. It is unknown to me if some groups swapped notes with each other. If they did, and if they altered values afterwards, this again distorts an analysis of between-group variability because the results are no longer independent of each other.

The CSIRO results were compared to the averages as calculated, presumably from material available to Douglass et al for the 67 simulation runs. From their table II(a) I have extracted the CSIRO results (2 runs submitted), the mean of the 67 simulations and the standard deviation as described. Units are in millidegrees C per decade. Pressures are in hPa.

The close agreement at pressures of 600, 500 & 400 hPa was described here as coincidental, but it leaves one with an impression that there might have been some comparison of model runs, because the probability of such close results has many zeros after the decimal.

This is not an allegation of impropriety. It is simply an observation that might be germane to the present thread.

But one thing troubles me; it seems to me that any kind of statistical combination of individual models makes no physical sense. I mean, the individual models are each a logically constructed representation of (the modelers hope) physical reality. Each run of a model represents one possible “evolution” of reality according to that model. As such, it makes perfect sense to compare the individual realizations for each model to observations; net trend and variability can be meaningfully compared to net trend and variability in the observational data. If a model is consistently off in net trend or variability, then it can be judged to be inconsistent with measurements.

But I don’t see how an average of trends from different models, or an average of variation from different models, generates anything that aids in model verification/validation. It seems to me that verification, or lack of it, should be based on individual models, not an average of models. If a particular model is consistent with observations, then that model’s projections would seem to be a ‘valid’ representation of reality, while any model which consistently makes projections which are in conflict with observations can reasonably be judged as ‘invalid’.

The money quote “And one result is the realization that the previous generations of models have under-represented uncertainty in the physical climate system – i.e. the previous projections for future climate change were more precise than they should have been.”

This should alleviate the concerns expressed by Steve Mosher and Jeff Id.

No concerns, only curiosity about the method. The explanation above makes a lot of sense to me. I’m not sure how the various SD’s are combined for the model mean. When I first read the paper and Nick Stokes did his histogram of trends, it was obvious that the whole SD of the spread of trends wasn’t used.

The problem is the IPCC insists on using the ensemble of models as evidence of the worth of the models when the IPCC should be evaluating each model and discarding the ones that don’t measure up. As long as the IPCC insists on using the ensemble to influence policy makers critics must focus on testing the ensemble.

Hypothesis testing usually involves determining that there is an appropriately small probability that the DIFFERENCE between a model and observation COULD INCLUDE 0. Once this has been demonstrated, the model is usually considered to be invalid no matter how big the difference. However, you could go further and determine the probability that the difference between observation and model is at least X deg/decade. Right now, all you say (with a very high degree of certainty) is that model projections are too high, but you don’t saying anything about how much too high. I would be interested in knowing what the probability is that the model trend is at least 50%, 100%, 150% and 200% more than the observed UAH and RSS trends. For example, if you can say there is only a 3% probability that the models disagree by less than a factor of 2X compared to UAH and only a 15% chance that they disagree by less than a factor of 3X, you will have illustrated the size of the gap between models and observations.

The upper tropical troposphere is the most sensitive site on the planet to GHG-induced warming. Could you extrapolate from there to the the IPCC’s global predictions (which are based on ensemble means)? Possible conclusion: “Based on discrepancies observed in the upper tropical troposphere, the IPCC’s global projections of warming are very likely (confidence estimate) to be 50% too high and likely (confidence estimate) to be 100% too high. There is a 50% chance the error is greater than a factor of X.”

One of the benefits of panel regressions is that it forces you to spell your null hypothesis out clearly. In this case the null is: the models and the observations have the same trend over 1979-2009. People seem to be gasping at the audacity of assuming such a thing, but you have to in order to test model-obs equivalence.

Under that assumption, using the Prais-Winsten panel method (which is very common and is coded into most major stats packages) the variances and covariances turn out to be as shown in our results, and the parameters for testing trend equivalence are as shown, and the associated t and F statistics turn out to be large relative to a distribution under the null. That is the basis of the panel inferences and conclusions in MMH.

It appears to me that what our critics want to do is build into the null hypothesis some notion of model heterogeneity, which presupposes a lack of equivalence among models and, by implication, observations. But if the estimation is done based on that assumption, then the resulting estimates cannot be used to test the equivalence hypothesis. In other words, you can’t argue that models agree with the observed data, using a test estimated on the assumption that they do not. As best I understand it, that is what our critics are trying to do. If you propose a test based on a null hypothesis that models do not agree among themselves, and it yields low t and F scores, this does not mean the hypothesis of consistency between models and observations is not rejected. It is a contradictory test: if the null is not rejected, it cannot imply that the models agree with the observations, since model heterogeneity was part of the null when estimating the coefficients used to construct the test.

In order to test whether modeled and observed trends agree, test statistics have to be constructed based on an estimation under the null of trend equivalence. Simple as that. Panel regressions and multivariate trend estimation methods are the current best methods for doing the job.

Now if the modelers want to argue that “of course” the models do not agree with the observations because they don’t even agree with each other, and it would be pointless even to test whether they match observations because everyone knows they don’t; or words to that effect, then let’s get that on the table ASAP because there are a lot of folks who are under the impression that GCM’s are accurate representations of the Earth’s climate.

“Now if the modelers want to argue that “of course” the models do not agree with the observations because they don’t even agree with each other, and it would be pointless even to test whether they match observations because everyone knows they don’t;”

Gavin Schmidt said that (almost those exact words) on Annan’s blog a few days ago

To repeat the obvious, your work shows that the models are inconsistent with the observations, which means the models do not correspond to reality — they do not work — or possibly that something else is wrong. Either way, you have exposed a huge problem, and the modelers need to fix it before they do anything with their “results”.

The MMH10 analysis is important. I would encourage everyone to take the time to understand it

Not quite. Which is Steve F’s earlier point. Some models do a better job than others. But if you focus on that then you show that the trend is less than the ensemble and less scary and, therefore, less need for immediate dramatic action and in all likelihood no action on CO2 emissions.

Santer et al’s huge effort to rebutt Douglas is, I guess, because Douglas’ results threatened the legitimacy of the scary scenarios. Now we know that the 17 Santer authors were misleading themselves and us. Watch now for a new effort to rehabiltate GCMs.

The employed estimator has a distribution around the H0 value, under the H0. This information is used to perform tests on the true value of the trend, for each chosen H0 (e.g. is it equal to 0? is it than 0? is it equal to some other trend?)

In case of MMH what is tested is:

H0: model generated trend parameter = historical/true trend parameter

,with both sides of the equality estimated, the former on GCM outputs and latter on observed temperatures.

If the same model is run multiple times it generates multiple trend estimates (yes, they have a distribution). We know that these estimates are correlated. This consitutes additional information about the trends generated by that particular model.

More information -> tighter confidence intervals for all tests involving trends generated by that model.

Re: Ross McKitrick (Aug 13 01:47),
Pete is right. Ross’s null hypothesis makes no sense. We have a large number of models and can run each many different times with different initial conditions or different parameter values, giving many different trends. We have only one observed climate. So what does “the models and the observations have the same trend” mean? See also latest comment from snowrunner.

He is also right about the box plots being ‘wrong’, or at least not explained. The SDs are much smaller here than in MMH. You are using a different SD here.

Both of these points relate back to what I asked on the first thread, unanswered:
“It is unfortunate that the paper does not explain clearly the meaning of the ‘bars’ on figs 2 and 3, either in the caption or the text. How are they defined for (a) the models and (b) the observations? This has caused considerable confusion both here and on Annan’s blog. I hope this can be clarified before the paper goes to press. Bear in mind that most casual readers will just focus on these figures (as this blog post does).”

There is something wrong in Table 1. At the top it says the numbers in brackets are SD but in the caption it says they are SE.

Re: PaulM (Aug 13 05:00),
Yes, the boxplots are not well explained. The R boxplot function does not show standard deviation. You’ll see that the whiskers are not symmetrical. I understand they actually terminate on data points, which could be the extremes, but the complicated R default, used here, is that they are the furthest data points within a limit that is 1.5 times the interquartile range. All this gets a bit rough for small samples.

One consequence is that they include no contribution from model trend se’s. It is therefore rather odd to show them on a plot with observations where the red arrows do represent trend se’s and not data spread. They are totally different things.

One begins to wonder what the discussion is all about, apart from statistical (il)literacy.

Best, VS

PS. Steve, just for the record, the within-corrections employed in equations 6-11 in MMH are of the fixed effects, FE, type (not random effects, RE).

I’m noting this because the term fixed effects is used 0 times in both the paper and this post, while you are mentioning random effects. In addition, you also wouldn’t be able to test your hypothesis employing a RE estimator, as you are interested in the actual point estimates, rather than simply correcting for heterogeneity.

RE vs FE, in terms of accuracy versus precision: FE is more accurate (ie. it’s more robust, see below), but less precise (ie. more parameters to estimate).

In your example you are referring to mixed effects (or ‘random intercept’) models used by sociologists, which are of the random effects type. In practice, such specifications are flawed to the bone, as the RE enters the composite error term (not X, like the FE) and should therefore be completely independent of the other explanatory variables included in order to achieve estimator consistency.

Needless to say, in the case of schools etc. this assumption arguably never holds.

Ross: Are you saying that your null hypothesis is that the models and the observations ALL have the same trend? So this would be rejected if all models bar one had exactly the same trend as the observations?

Steve: You have come up with a very unusual definition of an ensemble model. It is true that your “median school” would probably have roughly the same within school variance as all the others, and it is also true that any new model would probably have roughly the same between run variance as the models we have seen. However, the point of the ensemble is that the modelers know that the individual models are not correct, and that multiple runs of the same model aren’t sufficiently different to allow a correct expression of the uncertainty in the estimation. The ensemble assumption is that the spread of results among different models does include enough variability to do this. Thus the truth should be somewhere in the range of values produced by the full range of models. If you want to do hypothesis testing, you have to be a bit more precise than this, so you might say that the model results come from a distribution centered on the truth (I think IPCC said this), and then you could do the test that the observations were equal to the model mean. As many people have pointed out this is rather pointless because the hypothesis will be rejected as soon as you have enough models to estimate the mean sufficiently precisely. The other hypothesis that is sometimes suggested is that the truth lies somewhere in the distribution of models. You can then calculate the p-value as the probability that a model chosen at random would give a result more extreme than the truth. The difference between the test of these two hypotheses is the difference between using the se and sd of the distribution.

I think your test is something of a hybrid. The point you have not made is that even under your definition of an ensemble model you don’t actually know the mean (or the median, if you prefer), you have to estimate it, and to do this you use the between model se (not sd). Thus you have an uncertain mean, plus some within model variability. You are then testing whether the truth lies in the distribution of results from this proposed model. If there was no between model variability you would be testing the second hypothesis, if there was no within model variability you would be testing the first.

It is perfectly true that the hypothesis that the truth lies somewhere in the ensemble distribution will never be rejected if there are enough extreme models. There is nothing wrong with this though, so long as the full uncertainty is reported. If the model predictions are that the temperature increase in the next 20 years will be say 0.2C+/- 0.3C, then this is probably impossible to falsify, but also not of much use. If that is a true statement of the current state of the art of modeling, fair enough.

“However, the point of the ensemble is that the modelers know that the individual models are not correct, and that multiple runs of the same model aren’t sufficiently different to allow a correct expression of the uncertainty in the estimation. The ensemble assumption is that the spread of results among different models does include enough variability to do this. Thus the truth should be somewhere in the range of values produced by the full range of models.”

I could not possibly disagree with you more. Climate models are based on physical theory, and are designed/implemented to as best possible fairly represent physical processes. If multiple runs of a single model do not include the measured trend (within their distribution of multiple runs), then there are deficiencies (AKA errors) in the model which are causing the discrepancy. Different models may have completely different and completely unrelated errors, but there is no reason to think that these errors somehow “off-set” each other when an average is formed, and there is absolutely no reason to believe that a group of error-prone models somehow become “correct” when their individual runs are combined to form a pool of runs. This is simply a fantasy.

I’m not saying that this is my view, just that this is the point of ensemble modeling. I quite agree that there is no reason to suppose that the models should average out to the correct value, or that the range of models should include the true value (there is a sort of implication that to make a set of wrong models right, all you need is a few models wrong enough in the opposite direction). Nevertheless, this is what ensemble modeling is about, and this is what the various tests are supposed to address.

VS: The tests in table 1 are for the significance (ie different to zero) of the trends for each model or data set individually aren’t they? They don’t have anything to do with a comparison of models to observations.

Let me raise another point, which is fundamental to MMH and much other work on comparing models to observations: This is not a time series problem.

A model is intended to give an estimate of what is going to happen in reality. We can’t compare it with the future so we produce runs for the last 20 years or so and compare the results with the observations. For global values the measurement error is quite small so for the purpose of this argument I will say we know what actually happened exactly. Now we have modeled values for every year, and the truth for every year, so the obvious thing to do would be to compare them directly. It would be really useful if a model could give us an accurate value for every year in the future. However, we know that the models are not very good at this, in fact there is almost no correlation between the yearly modeled values and the truth, so we take summaries of the truth (like the trend) and compare that to the same summary of the model. We can define trend in any way we like, but however we do it we know it exactly. One possibility is the slope of the least squares regression line. Another might be the difference between the last and first year’s data. Whatever we choose, we know exactly what it is. The same is true for an individual model run. We know the summary. For the truth, there is no argument, we know what happened and we want to know how close the model came. We certainly don’t fit an AR model and say we are interested in the trend parameter but we don’t know what it is exactly. This is equivalent to saying the world could have been different and we are interested in the parameter used to generate multiple worlds of which we have one realization. Even if you believe this multiple world hypothesis, we are still only interested in the one we actually live in. For the models the argument is a bit different, because we COULD have got different results. However, we know for certain they are not generated by an AR model. They are generated by a series of mathematical equations which give different results depending on the starting values. There is no need to do any estimation however, because we have different runs (for most models) and so we know the variability in the trend between runs. This variability may or may not be similar to that found by assuming an AR model (has anyone checked how similar it is?). The only reason to use time series modeling is to predict into the future. If you want to know what might happen in the real world based on the observations so far, you could fit a time series model and predict from it. In that case you certainly would want to include the uncertainty in the trend estimate. You could do the same for the models, ie if you want to know what they will predict in the future you could fit a model, or alternatively you could just run the model into the future (fitting a model isn’t as silly as it sounds because you could get a lot of runs quickly, this is what emulators do).

Picture the models as being experts. I ask the experts for their estimates. I have to squeeze the experts a bit to get a straight answer (= uncertainty within-group, the answers are qualified and slightly fuzzy).

Then I observe that most experts are actually inaccurate, even taking the hesitations into account. Politically, I also cannot exclude experts that are fully wrong. And I don’t have other experts nor tarot readers to ask (i.e. don’t know how to fix the models), how can I proceed?

I assume that each expert has his bias, which should compensate [*]. I then make the average of all expert’s estimates and assume that to be close to the truth.

What has been proven in the new paper MMH10 is that this last assumption still yield a result far from the reality. Hence it actually shows that the hypothesis marked with [*] was wrong – averaging did not cancel errors out.

Another way to say that is:
For averaging to cancel errors, of course, some experts (about 1/2) should overstate and some other (about 1/2) should understate reality. The models are mostly overstating reality (i.e. show a trend too large) so obviously averaging will still give an overstatment!

OK – I admit to still knowing that I do not understand this. I understand the school example, but I am not clearer on understanding how this can be used to relate to the climate models. Where are my false assumptions?
1) The models are individually valid for their chosen parameters.
2) The observed temperature is equivalent to observing a single school.
3) The model within group variance relates to the accuracy of the observation to predict the parametrisation.
4) So we are able to select a sub-set of models which do match the observations for our actual school of interest.

I think I might have just found the answer – does this process allow me to determine which of the models could be relevant to my observation series, in effect observing that the models with higher or lower trends could be valid, just not for this instance of earth – and that the model ensemble is not a useful predictor for anything other than the precision which could be expected for a single ideal model?

Can these models be treated as a collection of schools, and what does the ensemble tell us?

In the continuing spirit that anomalous results can be rich in information, have a look at model 22 in Douglass et al, INCM_3_0 Institute for Numerical Mathematics, Russia. Would you exclude it? What criteria would you use to justify exclusion? With each passing year it seems to be the best performer, does it not?

The observed trend in temperature 1970-2009 appears “consistent with” that produced by climate models for the same period because you get such a large range of predictions from different models.

I think the funniest thing to point out is that the various models aren’t even “consistent with” one another (let alone the observations), so how can you turn around, average out their inconsistent results and then state that collectively they produce a trend consistent with observations.

It _is_ funny that ‘the various models aren’t even “consistent with” one another (let alone the observations)’. This is a problem, though not necessarily a concern for any particular model. Modeling involves simplifying — explicitly or implicitly omitting countless factors — and this is a subjective activity. One does not expect two models to produce exactly the same estimates.

What one does expect, however, is a “good” correspondence (defined in terms of a distance metric involving the difference between the model predictions and the observations with respect to model standard error of prediction and uncertainty in the observations) between model predictions and observations. If I understand correctly, we are not seeing that.

Obviously it is not necessarily the case that only the models are wrong. The observations could be wrong, too.

FInally, this needs to be sorted out before anyone should trust the model results. 😉

I think it is more nuanced than that. confidence intervals overlap, but in 1 case of the 4 (RSS/UAH LT and MT), the confidence limit does not encompass zero (indicating significant positive trend) while the others do not.

Given the presenting policy issue of CO2 emissions and the agreed upon nature of CO2 as a GHG, why wouldn’t the null hypothesis be whatever the temperature trend is without any feedbacks, positive or negative, since that is what the models are leargely designed to incorporate?

One thing that I need to re-iterate is that this post is NOT – repeat NOT – an explication of MMH10. I’m going to re-emphasize that.

However that Santer’s ensemble results don’t hold up with additional data is confirmed in a number of ways.

1) using his own method;
2) using the MMH10 method;
3) using the mixed effects analysis of this post

The matrix algebra of MMH10 is hard and I was trying to figure out a way of giving a perspective on the problem given the kneejerk assumption from Annan, Schmidt and others that MMH10 was “wrong” in some elementary way.

As I wrote it, I got intrigued by the stratification issue. But this post is not an elucidation of the MMH10 algebra, which comes at things from a different perspective.

I’ll try to figure out a way of presenting the MMH10 algebra in a friendly version as well.

I’ve edited the intro to this post a little to further emphasize that this post is NOT an explication of MMH10 but an attempt to place into perspective some of the blog comment “rebuttals” of MMH10, hopefully shedding light rather than confusion.

I am not clear why ECHAM5 has a mean trend of 0.446 in your graph (and R script), and 0.227 in Table 1 in the paper. The other series agree much more closely (though not exactly). Can you explain why the means are not identical?

there was a mistake in the collation of trends that I used in this post -which was (confusingly) supposed to help clarify things rather than confuse things. I used trends calculated up to 2099 instead of 2009, which ended up making the within-group standard deviations too narrow. I’ll rework this and re-post. However, I’m going to Italy next Tuesday and am working on a presentation so the re-post will have to wait. This has also been a distraction from the MMH issues so let’s focus on them.