The work presented is a straightforward comparison of temperature trends, both observed and modelled. The goal is to check the consistency of the two - ie, asking the question "are the observations inconsistent with the models"?

This is approached through a standard null hypothesis significance test, which I've talked about at some length before. The null hypothesis is that the observations are drawn from the distribution defined by the model ensemble. We are considering whether or not this null hypothesis can be rejected (and at what confidence level). If so, this would tend to cast doubts on either or both of the forced response and the internal variability of the models.
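To make the test concrete, here is a minimal sketch of the simplest (purely empirical) version: count what fraction of the ensemble trends lie at or below the observed one. The trend values and the name `one_sided_p` are invented for illustration and are not taken from the analysis itself.

```python
def one_sided_p(model_trends, observed_trend):
    """Fraction of ensemble trends at or below the observed trend."""
    if not model_trends:
        raise ValueError("need at least one model trend")
    at_or_below = sum(1 for t in model_trends if t <= observed_trend)
    return at_or_below / len(model_trends)


# Invented trends in degrees C per decade, for illustration only.
model_trends = [0.05, 0.10, 0.15, 0.18, 0.20, 0.22, 0.25, 0.30, 0.35, 0.40]
observed_trend = 0.08

p = one_sided_p(model_trends, observed_trend)
print(f"one-sided p = {p:.2f}")
```

A small value of `p` says the observation sits near the bottom edge of the ensemble; whether that is small enough to reject the null is the judgement call discussed further down.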

It may be worth emphasising right at the outset that our analysis is almost identical in principle to that presented by Gavin on RC some time ago. In that post, he formed the distribution of model results (over two different intervals) and used this to assess how likely a negative trend would be. Here is his main picture:

He argued (correctly) that if the models described the forced and natural behaviour adequately, a negative 8-year trend was not particularly unlikely, but over 20 years it would be very unlikely, though not impossible (1% according to his Gaussian fit).
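For anyone who wants to see the mechanics, a Gaussian-fit calculation of this kind can be sketched in a few lines. This is only an illustration under my own assumptions: the two ensembles are invented (a wide one standing in for short trends, a narrow one for long trends), and `prob_negative_trend` is a hypothetical helper name, not anything from the RC post.

```python
from statistics import NormalDist, mean, stdev


def prob_negative_trend(model_trends):
    """Gaussian-fit probability that a trend drawn from the ensemble is below zero."""
    fit = NormalDist(mean(model_trends), stdev(model_trends))
    return fit.cdf(0.0)


# Invented ensembles (degrees C per decade), for illustration only.
short_trends = [-0.2, 0.0, 0.2, 0.4, 0.6]   # wide spread, like short intervals
long_trends = [0.1, 0.15, 0.2, 0.25, 0.3]   # narrow spread, like long intervals

p_short = prob_negative_trend(short_trends)
p_long = prob_negative_trend(long_trends)
print(f"P(trend < 0): short = {p_short:.3f}, long = {p_long:.4f}")
```

Both ensembles have the same mean trend; only the spread differs, which is why a negative trend goes from unremarkable to very unlikely as the interval lengthens.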

We have extended that basic calculation in a few ways, firstly by considering a more complete range of intervals (to avoid accusations of cherry-picking on the start date). Also, rather than using an arbitrary threshold of zero trend, we have specifically looked at where the observed trends actually lie (well, we also show where zero lies in the distributions). I don't believe there is anything remotely sneaky or underhand in the basic premise or method. One subtle difference, which I believe to be appropriate, is to use an equal weighting across models rather than across simulations (which is what I believe Gavin did). I don't think there is any reason to give one model more weight just because more simulations were performed with it. In practice this barely affects the results. Another clever trick (not mine, so I can praise it without a hint of boastfulness) is to use not just the exactly matching time intervals from the models to compare to the data, but also to consider other intervals of equal length but different start months. It so happens that the mean trend of the models is very much constant up to 2020 and of course there were no exciting external events like volcanoes, so this gives a somewhat larger sample size with which to characterise the model ensemble. For longer trends, these intervals are largely overlapping, so it's not entirely clear how much better this approach is quantitatively, but it's still a nice idea.
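The difference between the two weightings can be sketched as follows. The model names and trends are invented, and the example is deliberately lopsided (one model with five runs, another with one) so that the effect is visible; with the real ensemble the two answers are close, as noted above.

```python
def model_weighted_fraction_below(runs_by_model, threshold):
    """Fraction of trends below threshold, with every model given equal total weight."""
    n_models = len(runs_by_model)
    total = 0.0
    for trends in runs_by_model.values():
        fraction = sum(1 for t in trends if t < threshold) / len(trends)
        total += fraction / n_models
    return total


def run_weighted_fraction_below(runs_by_model, threshold):
    """Same quantity, but with every individual simulation given equal weight."""
    all_trends = [t for trends in runs_by_model.values() for t in trends]
    return sum(1 for t in all_trends if t < threshold) / len(all_trends)


# Invented trends (degrees C per decade): five runs of one model, one of another.
runs = {
    "model_a": [0.30, 0.25, 0.20, 0.35, 0.28],
    "model_b": [-0.05],
}

model_weighted = model_weighted_fraction_below(runs, 0.0)
run_weighted = run_weighted_fraction_below(runs, 0.0)
print(model_weighted, run_weighted)
```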

Anyway, without further ado, here are the results. First the surface observations, plotted as their trend overlaying the model distribution:

You should note that our results agree pretty well with Gavin's - over 8 years, the probability of a negative trend is around 15% on this graph, and we don't go to 20y but it's about 1% at 15y and changing very slowly. So I don't think there is any reason to doubt the analysis.

Then the satellite analyses (compared to the appropriate tropospheric temps, so the y axis is a little different):

And finally a summary of all obs plotted as the cumulative probability (ie one-sided p-level):

As you can see, the surface obs are mostly lowish (all in the lower half), and for several of the years the satellite analyses are really very near the edge indeed.

Note that the observational data points are certainly not independent realisations of the climate trend - they all use overlapping intervals which include the most recent 5 years. Really it's just a lot of different ways of looking at the same system. (If each trend length were independent, then the disagreement would be striking, as it's not plausible that all 11 different values would lie so close to the edge, even with the GISS analysis. But no-one is making that argument.)

It is also worth pointing out that this analysis method contradicts the confused and irrelevant calculations that some have previously presented elsewhere in the blogosphere. Contrary to the impression you might get from those links, the surface obs are certainly not outside the symmetric 95% interval (ie below the 2.5% threshold on the above plots), though you can get just past 5% for HadCRU for particular lengths of trend and a couple of the satellite data points do go below 2.5%, particularly those affected by the super-El-Nino of 1998.

As for the interpretation...well this is where it gets debatable, of course. People may not be entitled to their own facts, but they are entitled to reasonable interpretations of these facts. Clearly, over this time interval, the observed trends lie towards the lower end of the modelled range. No-one disputes that. But at no point do they go outside it, and the lowest value for any of the surface obs is only just outside the cumulative 5% level. (Note this would only correspond to a 10% level on a two-sided test). So it would be hard to argue directly for a rejection of the null hypothesis. On the other hand, it is probably not a good idea to be too blase about it. If the models were wrong, this is exactly what we'd expect to see in the years before the evidence became indisputable. Another point to note is that the satellite data shows worse agreement with the models, right down to the 1% level at one point, and I find it hard to accept that this issue has really been fully reconciled.
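The one-sided versus two-sided bookkeeping is simple enough to write down. A small sketch, with function names of my own choosing:

```python
def two_sided_p(one_sided):
    """Two-sided p-value implied by a one-sided cumulative probability."""
    if not 0.0 <= one_sided <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    return 2.0 * min(one_sided, 1.0 - one_sided)


def outside_symmetric_interval(one_sided, level=0.95):
    """True if the value falls outside the central interval of the given level."""
    return two_sided_p(one_sided) < 1.0 - level


print(two_sided_p(0.05))                  # the 5% cumulative level
print(outside_symmetric_interval(0.05))   # surface obs at their lowest
print(outside_symmetric_interval(0.01))   # lowest satellite point
```

So a point sitting at the cumulative 5% level is still inside the symmetric 95% interval, while one at the 1% level is not.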

A shopping list of possible reasons for the results includes:

Natural variability - the obs aren't really that unlikely anyway, they are still within the model range

Incorrect forcing - eg some of the models don't include solar effects, but some of them do (according to Gavin on that post - I haven't actually looked this up). I don't think the other major forcings can be wrong enough to matter, though missing mechanisms such as stratospheric water vapour certainly could be a factor, let alone "unknown unknowns"

Models (collectively) over-estimating the forced response

Models (collectively) under-estimating the natural variability

Problems with the obs

I don't think the results are very conclusive regarding these reasons. I do think that the analysis is worth keeping an eye on. Anyone who thinks that even mainstream climate scientists are not wondering about the apparent/possible slowdown in the warming rate is kidding themselves. As I quoted recently:

However, the trend in global surface temperatures has been nearly flat since the late 1990s despite continuing increases in the forcing due to the sum of the well-mixed greenhouse gases (CO2, CH4, halocarbons, and N2O), raising questions regarding the understanding of forced climate change, its drivers, the parameters that define natural internal variability (2), and how fully these terms are represented in climate models.

That wasn't some sceptic diatribe, but rather Solomon et al, writing in Science (stratospheric water vapour paper). And there was also the Easterling and Wehner paper (which incidentally also uses a very similar underlying methodology for the model ensemble). Knight et al as well: "Observations indicate that global temperature rise has slowed in the last decade"

So all those who are hoping to burn me at the stake, please put away your matches.

99 comments:

Tony Sidaway
said...

Thanks. This is some way over my head, but I get the basic gist, that your analysis shows that the observations are close to falsifying the models and although this isn't yet significant it's consistent with a scenario in which the models are about to be broken.

I note that you briefly link to Nick Stokes' comments on moyhu in your introduction. Do you have a response to his claim that the approach adopted is inappropriate?

Specifically, he says: 'But here's the fallacy - that 95% range is not a measure of expected spread of the observations. It expresses the likelihood that a model output will be that far from the central measure of this particular selection of models. It measures computational variability and may include some measure of spread of model bias. But it includes nothing of the variability of actual measured weather.'

It's not often recognized that the model ensemble is a collection of best guesses, not a full exploration of some plausible range of climate sensitivity. If models remain on the low side, their best guesses might be falsified, but the trend could remain perfectly consistent with some other significant sensitivity (say, 2C), even absent missing forcings or other data issues. Obviously that can't go on forever, but it leads to a very different interpretation of falsification than skeptics normally convey.

According to the Met Office, HadCRUT3 is at the lower end of likely warming (http://www.metoffice.gov.uk/corporate/pressoffice/2009/pr20091218b.html) because of its limited coverage of some of the fastest warming regions of the world (the Arctic, a lot of land area). NCDC also excludes the Arctic and Antarctic, while GISS coverage is almost global. So shouldn't a proper comparison be done after subsampling the model output at the grid cells with available data for each temperature reconstruction?

After all, model data are defined over the entire world, and the only almost global analysis (GISS) is just somewhat lower than the expected warming:

And, you pointed to Moyhu; he illustrates the four month difference now, with his blinkenlines -- were there other times within the 15-year(?) span you assessed when a four month change would have made such a difference, or did something unusual happen in 2010 after the paper's cutoff?

James, you cited Gavin's analysis. But he presented only the statistics of model results. The problem with your analysis is that you have superimposed climate index results on model results without proper caution with variances.

Your null hypothesis needs to be stated carefully. As I see it, it is:

H0: Climate indices (not climate) behave as if from the chosen population of model outputs.

Now you can test any null hypothesis you like, but falsifying an improbable null hypothesis is of little interest - unless someone then misuses it to claim a more significant result has been proved, like say, global warming has stopped!

I've said elsewhere that your plots are misleading in that the black probability curves are not related to observed climate statistics but to model variability. Now you could say that it's part of your null hypothesis that these are the same. If so, then that may well be the part that you have (almost) falsified. And that's not surprising. You could argue that there's reason to believe (or hope) that model variability will approach climate variability. I think there are good reasons for doubting this (volcanoes, ENSO etc), but in any case, there's the further issue of measurement uncertainty. Models don't really have this - there are no missing values, instrument issues etc. Indices do, as shown minimally by the fact that they aren't all the same. And the differences underestimate the measurement uncertainty, since each index class uses a large amount of shared data, and error in that data won't show up in the differences.

There are, of course, other issues with the probability bounds. Did you make a correction for time correlation of model residuals, as in Santer et al 2008, for example? How are readers supposed to interpret the highly correlated trends with their supposed individual probability curves? This lack of independence allows statements like your "for several of the years the satellite analyses are really very near the edge indeed". It's actually not years but trend lengths, and the fact that there are several may follow from just one event (as with the 1998 dip).
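For readers unfamiliar with the time-correlation correction mentioned above: the usual AR(1) adjustment estimates the lag-1 autocorrelation r1 of the regression residuals and replaces the sample size n with an effective size n(1 - r1)/(1 + r1) when computing the trend uncertainty. The sketch below shows only that step, on an invented residual series; the function names are hypothetical, and this is not a claim about what was or wasn't done in the analysis under discussion.

```python
def lag1_autocorrelation(residuals):
    """Lag-1 autocorrelation of a series of regression residuals."""
    n = len(residuals)
    if n < 3:
        raise ValueError("need at least three residuals")
    m = sum(residuals) / n
    num = sum((residuals[i] - m) * (residuals[i + 1] - m) for i in range(n - 1))
    den = sum((x - m) ** 2 for x in residuals)
    if den == 0.0:
        raise ValueError("residuals are constant")
    return num / den


def effective_sample_size(residuals):
    """AR(1)-adjusted sample size: n * (1 - r1) / (1 + r1)."""
    r1 = lag1_autocorrelation(residuals)
    return len(residuals) * (1.0 - r1) / (1.0 + r1)


# Invented, strongly persistent residuals (illustration only).
residuals = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]
r1 = lag1_autocorrelation(residuals)
n_eff = effective_sample_size(residuals)
print(r1, n_eff)
```

Positive autocorrelation shrinks the effective sample size, which widens the uncertainty on a fitted trend.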

This is an interesting project and it will be fun to see how it plays out over the coming years. You seem to have been overloaded with questions, so here's a few more for the stack:

How did you build the multi-model PDF? Is it simply the frequency distribution of all realisations weighted such that each model has equal probability mass, or is there a parametric fit somewhere? What does the first figure (pic1.png) look like if you consider longer trends?

Another question, in which I demonstrate my near-total ignorance about the models:

Assuming the results describe a real discrepancy, could it be something like an incorrect common assumption in the models about the rate at which heat goes into the deep oceans? That seems to be the implication of recent remarks by Trenberth and others.

BTW, this may or may not be relevant, but a new paper by Kerry Emanuel (which I found riveting because of its findings about hurricane trends) found that the GCMs (not quite sure if he meant all of them, but certainly the ones that have been used for Atlantic hurricane modeling) err in that they fail to show the measured sharp declining trend in lower stratospheric temperatures (maybe a consequence of AGW, but if so by a more complex process than the upper stratospheric trend).

Note also some related work by Jim Elsner finding that the solar cycle has a disproportionate effect on stratospheric ozone and so tropopause temps (=> TCs).

Even if these results have no relevance for this discussion, they would seem to imply a major gap in the models.

Tony, I think the criticism is wrong. The models also have their own natural variability exactly analogous with weather. The null hypothesis is that reality can be considered a sample from the model ensemble. Yes, the results will change according to what happens in the future. In that respect we are putting ourselves out there, but really we are just presenting the most appropriate and obvious way of comparing models to data.

Anon1: Yes, I was more than a little surprised to see her name on it. I don't know whether she realises how and why this analysis differs from what she has previously posted - I've not had any contact with her.

BCL, well it depends on how big a record. I haven't done the sums. A strong warming would certainly pull the results back up through the model range. Here's hoping :-)

Tom, I think your premise is rather debatable, but that's something that has to be written up in future. I do agree that even if the obs are outside the bottom end, this only means a smaller response than the model range, and not a zero one. Indeed I believe all the co-authors expect indefinite future warming under increasing GHG concentrations.

Anon2, I agree that data coverage is a possible issue with the obs estimates. In principle we could generate model analogues by masking out the unobserved areas, but it would be a lot of work. And note that these estimates *are* consistently presented as estimates of global temperature, so I don't think our simplification is unreasonable.
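For what the masking idea amounts to, here is a toy sketch: an area-weighted (cosine of latitude) mean over only those grid cells flagged as observed. The grid, the values and the function name are all invented for illustration; a real calculation would use each dataset's actual coverage mask, month by month.

```python
import math


def masked_global_mean(field, lats, observed):
    """Area-weighted mean of field over the grid cells flagged as observed.

    field[i][j]    : value in latitude band i, longitude cell j
    lats[i]        : latitude of band i in degrees
    observed[i][j] : True where the observational dataset has coverage
    """
    num = 0.0
    den = 0.0
    for row, lat, mask_row in zip(field, lats, observed):
        weight = math.cos(math.radians(lat))
        for value, seen in zip(row, mask_row):
            if seen:
                num += weight * value
                den += weight
    if den == 0.0:
        raise ValueError("no observed grid cells")
    return num / den


# Toy model field: warming is strongest in the polar band (invented numbers).
lats = [0.0, 60.0, 80.0]
field = [[1.0, 1.0], [2.0, 2.0], [4.0, 4.0]]
full = [[True, True], [True, True], [True, True]]
no_polar = [[True, True], [True, True], [False, False]]

full_mean = masked_global_mean(field, lats, full)
masked_mean = masked_global_mean(field, lats, no_polar)
print(full_mean, masked_mean)
```

In this toy case, dropping the fast-warming polar band lowers the "global" mean, which is the direction of the effect being argued about.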

Hank1: OK I could have chosen better words. All models are *wrong* in some sense but the question is whether we can tell this by looking at the short-term trends. Answer is, well we sort of can't quite but it might be looking a little dodgy, especially the troposphere/satellite comparison.

Hank2, I haven't looked. The short trends are bound to be noisy. Note that we have just had a pretty strong El Nino. Wait another couple of years for the La Nina, who knows, it may be worse. This is the most straightforward and appropriate way of doing the comparison IMO.

Nick, yes, variability is part of the hypothesis. We did look a little bit at obs uncertainty, and they are small enough to not matter (at least if you believe the error estimates of Brohan et al). OTOH the alternate analyses do have some differences which could be used as an uncertainty estimate. If I was in charge I might do some more detailed analyses but I really don't think the current one is inadequate. I agree that consecutive years are not independent, there is no specific claim relating to that.

ac, the pdf was a fit to the frequency distribution (see Chip's presentation for a pic), in fact a few shapes were chosen (gaussian and t-distribution with different degrees of freedom IIRC) and it doesn't make any difference to the overall picture.

Steve, yes too little ocean heat uptake would make the models warm too fast compared to obs, which is one possible cause of a mismatch. I wouldn't like to say if this was a likely one though, I think most people argue the models have if anything too much mixing.

Yes,>Anon1: Yes, I was more than a little surprised to see her name on it. I don't know whether she realises how and why this analysis differs from what she has previously posted - I've not had any contact with her.

I do realize how and why this analysis is different from others I have posted.

James, I think a lot of the energy on this subject is a response to the statement in Chip's presentation that "Global warming has stopped (or at least greatly slowed) and this is fast becoming a problem".

There does seem to be a pattern of otherwise reasonable papers whose conclusions are greatly exaggerated by "skeptics".

My guess is that Chip's presentation from Heartland (this paper shows that global warming has stopped!) is representative of how this paper would be treated in certain quarters if and when it's published.

As I said in previous comments on an earlier thread, there are start and stop dates (including some of the ones that we used) that show a negative trend - which according to any definition I’ve ever heard would indicate that net warming did not occur during that particular period. Thus, the warming stopped (or slowed) in the data we used.

The results of our analysis allowed us to see how significant that was. In some cases, it was looking pretty unusual.

Note: I did not say that “the anthropogenic warming pressure on global temperatures stopped.” It hasn’t, and it won’t for a long time to come (no matter our best intentions). And as James alluded to, none of the co-authors think that rising greenhouse gas concentrations won’t lead to rising temperatures - but that doesn’t mean that the models have everything, including the magnitude of the rise, correct. The co-authors most likely have different viewpoints on that subject!

Chip writes: There are start and stop dates (including some of the ones that we used) that show a negative trend - which according to any definition I’ve ever heard would indicate that net warming did not occur during that particular period. Thus, the warming stopped (or slowed) in the data we used.

This seems like an exceptionally un-useful way of describing the data.

"The planet warmed from 1978-1981, but global warming stopped in 1982. In 1983 it resumed briefly, before stopping again in 1984-1985, after which global warming started up again in 1986-1988 ..."

One can argue that the models overpredict warming. One can point out that temperatures were relatively flat during much of the past decade -- though every year since 2001 was in the 10 warmest on record (GISSTEMP Land/Ocean).

But the statement that "global warming has stopped" is both scientifically questionable (compare current temperatures to those during the much stronger El Nino episodes of 1998 or 1983) and obvious red meat for the WUWT/Heartland audience.

I understand why Chip would try to spin the paper that way -- it's Chip's job to try to turn mainstream climate science into contrarian-friendly talking points. I'm just a bit bemused that James is (apparently) surprised by people's reactions.

Interesting that you mention Gavin's test on RC. In the article you mention, the test determines whether the trend lies within the spread of the model runs, but the test in the later paper by Santer et al. (including Gavin) seems to be based on the standard error of the mean of the ensemble, rather than the standard deviation. I was wondering if you had any comment on this (as the discussion is on testing the model-data consistency in trends).

Also, it seems to me that the graphs for different trend lengths implicitly involve some multiple hypothesis testing issues. Clearly if the models are "inconsistent" at the 95% level for one value of the trend length, that doesn't mean that there is a model-data discrepancy at the 95% level. Clearly something like the Bonferroni adjustment would be too strict, but it seems to me to be a caveat worth mentioning. No Spanish Inquisition though ;o)
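To illustrate the multiple-testing caveat: the Bonferroni adjustment simply divides the significance level by the number of tests. The p-values below are invented, one for each of 11 trend lengths, and the function names are hypothetical. Because the trends share most of their data, the adjustment is overly strict here, so treat this as the conservative end of the range rather than the right answer.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni adjustment."""
    if n_tests < 1:
        raise ValueError("need at least one test")
    return alpha / n_tests


def any_significant(p_values, alpha=0.05, adjust=True):
    """True if any p-value falls below the (optionally adjusted) threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values)) if adjust else alpha
    return any(p < threshold for p in p_values)


# Invented one-sided p-values, one per trend length (illustration only).
p_values = [0.20, 0.12, 0.08, 0.04, 0.03, 0.06, 0.09, 0.11, 0.15, 0.18, 0.22]

unadjusted = any_significant(p_values, adjust=False)
adjusted = any_significant(p_values, adjust=True)
print(unadjusted, adjusted)
```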

I doubt that James was surprised. There's a definite value to pinning those folks down to a method that they cannot honorably abandon later (not that most or all of them won't anyway). And, while I doubt the models are very far off on sensitivity since there's so much confirmation for that, they could much more plausibly be off on the trend through 2100. If so, though, the range of appropriate policy responses isn't going to change since we've moved rather quickly from blithely assuming that a +3C change wouldn't be so bad to figuring out that even +2C is probably too much, especially when carbon feedbacks are taken into account. What's happening in the real world matters too, considering that e.g. the actual cryospheric response to the real world warming trend is quite sharp and got noticeably sharper during the decade of less warming than projected by the model average.

They didn't let you write the Conclusions section nor the press releases nor the Heartland show piece, obviously. Dang!

Do any of these models you looked at already include the Solomon water vapor paper info that you link to as an example? Is it feasible to adjust for that and say what difference it makes?

Are the next round of models expected to increase the uncertainty ranges? (I think I've read that a broadening of the uncertainty ranges is expected--and expected to cause problems for the summary writers.) Are the next round of models expected to move the trend line up or down, by incorporating e.g. Solomon's water vapor info?

_____

Possibly some other time:

A broader question -- who's tracking research areas where science is getting better, accumulating a weight of work not yet incorporated in most climate models? I'm thinking of the recent ANDRILL work, LeQuere et al. on changes in plankton population dynamics, and comparing paleo conditions to anthropocene (will plankton blooms in polluted water resemble those in past; has the trophic collapse in the oceans changed the ocean response to runoff, etc).

This is a new one, the kind of stuff I'm wondering about: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.143.6184&rep=rep1&type=pdf

Ocean Sci., 3, 43–53, 2007 www.ocean-sci.net/3/43/2007/

How does ocean ventilation change under global warming? "... Implications of this result for global biogeochemical cycles are considered .... our results suggest that significant and detectable changes in shadow zone ventilation may be occurring today."

Does changing the end year rather than the start year affect the graphs?

I notice Hadcrut 1995-2005 is 0.023C/year

1995-2010 it's roughly half that (0.011C/year), suggesting the last 5 years have a big impact on the trend over the period.

A good deal of the flattening of temperature in recent years may be due to ENSO (declining trend over recent years despite the recent El Nino) and solar cycle decline, in which case setting the end point as 2010 might be biasing all the trends downwards slightly.

Then again the last 30 years of HadCRUT shows 0.016C/year which is lower than most of those model runs.

Lucia: Good, I hope you also realise that this is a more appropriate way to do it :-)

Anon1, well papers can get exaggerated and misrepresented even when the sceptics didn't write them, eg the Potty Peer vs Pinkerton. In this case, I can see the point of view that the statistics are not unequivocal and the interpretation is disputable. Regardless, I think the comparison is a useful one.

Skanky, well even a continuous string of (marginally) new records wouldn't necessarily keep the obs within the model range. I just haven't bothered doing the sums (yet). I will post more on the prospects for a new record shortly and may mention it then.

Anon2, I think Chip is fair in just describing the data there. Yes, it's not useful in itself, because data are just a list of numbers. It's always the case that data need interpretation. The big question is whether this "slowdown" actually represents the underlying forced response or natural variability.

Dikran, what's the paper? I agree multiple hypotheses abound. That's one reason why I haven't made any definitive statements about the results. Don't forget for the whole process to be statistically valid in the first place we had to determine the tests before ever seeing the data, which is obviously a principle that is honoured almost entirely in the breach in climate science (and also in many other branches of science).

Steve, you're right there. I'm not surprised, just defending what I think is a quite reasonable piece of work. And yes, I believe there is value in finding points of agreement (even if only methodological) between the different "camps" (to the extent that we are in different camps).

Hank, no none of these models have a stratosphere, and I guess even the next lot won't address the strat. wv realistically. It is definitely one possible factor.

Cthulhu, well it's reasonable to use all data to date. (At some point it would be embarrassing to not update with newer obs also, but I don't think we have got there yet.) Also, it is reasonable to not want to go back before 1995, so as to avoid Pinatubo - even though all the models have it, it would invalidate the trick of using other intervals of the same length. It would be interesting to take out the El Nino effect from data and all models, but this would require a much more detailed analysis to determine the El Nino signal in each model.

"Skanky, well even a continuous string of (marginally) new records wouldn't necessarily keep the obs within the model range. I just haven't bothered doing the sums (yet). I will post more on the prospects for a new record shortly and may mention it then."

I was interested in how it might affect your bets, but looking back at them, the trend value is not specified (in the ones I can find with a quick look). For some reason, I had thought you were betting based on the 0.1 - 0.2 per decade prediction, but that's probably somebody else. Apologies.

>>Lucia: Good, I hope you also realise that this is a more appropriate way to do it :-)

What's "it"? I think different tests are suited to slightly different questions. Because more than one question exists, I think several methods are useful ways to "do it", with "it" being answering some question that is posed.

Now, rather than delving into what "it" might be, I'll ask: Have you read Santer17 (pdf) yet?

If you believe this "a more appropriate way to do it" than the method Ben Santer chose when testing consistency of modeled and observed trends in the tropical troposphere, why?

I think the way we do it in the GRL submission has advantages and disadvantages relative to Santer's method. The possible disadvantages of this method include that the internal variability in models may differ from that of earth, and that the contribution of structural uncertainty (i.e. biases) across models to trends is not distinguished from the contribution of internal variability (i.e. weather).

These are assumptions we have to make. I think they fall in the range of reasonable assumptions. But I think it's also useful to examine related questions using other methods, and Santer's method also has advantages. For example: At least in principle, his method attempts to estimate the variability in earth trends using earth data. While we can argue whether or not his choice of AR1 and a few other assumptions were reasonable, his method relaxes our assumption that the internal variability in models matches that of earth.

It's also worth noting that the tests actually examine slightly different questions. One tests whether the multi-model mean is consistent with the data; one tests whether the observation falls inside a distribution. Since the ensemble contains models with a range of bias, it may actually be possible that the multi-model mean will ultimately disagree with the observation while observations will fall within the range of weather in the full ensemble. (I can think of several conditions under which this can occur. One involves at least some models containing extremely wild weather; the other involves at least some models being useful while others are highly biased.)

I could say more about the relative advantages and disadvantages, but I think I would rather invite you to discuss what you think are the advantages and disadvantages of the methods. My general point of view is that it is convenient to have access to two (or more) methods, each of which makes different assumptions, test different questions and might give slightly different answers.

I, too, would be interested in your thoughts about applying the Santer17 approach to surface temperatures. Lucia and I have had many discussions about this, with each of us having a slight preference for a different approach, however, as Lucia points out, the methods (our current one and Santer17) address different questions, so it seems that both have their uses.

James:
>>it is reasonable to not want to go back before 1995, so as to avoid Pinatubo - even though all the models have it, it would invalidate the trick of using other intervals of the same length

The limitation is worse than you suggest owing to a second complication. Some models don't include Pinatubo.

If one were to merely examine the distribution of, for example, 7 year trends across the full model ensemble as a function of start year beginning in 1900 and going forward, one would see the spread of the distribution increase rather dramatically around any volcanic eruption. This is because some models include volcanic aerosols and show the dips and recoveries; others don't.

This (and other features of the ensemble) results in other limitations, which arise when people start to wonder if it makes sense to compare the distribution of all rolling 7 year trends from 1900-2000 from an individual model or the model ensemble to earth. For example, if "model A" perfectly mimicked ENSO, PDO etc, but did not experience volcanic forcing, one might anticipate the spread of 7 year trends should be less than observed in earth -- which did experience volcanic forcing. (Or, if one had a theory that, notwithstanding the violent dip and recovery after strong volcanic eruptions, the overall effect might be to calm the earth, you'd only go so far as to suggest that the spread of the trends should be different. But in any case, if the forcing applied to a large number of runs from a perfect model did not match that of the earth, we couldn't necessarily expect the spread in model trends to be the same. Leaving out a forcing with a dramatic visible effect both in model runs and in the observations might well be expected to have a noticeable effect on the spread.)

> none of these models have a stratosphere, and I guess even the next lot won't address the strat. wv realistically. It is definitely one possible factor

And it's a factor biasing results in the direction described?

What other factors will be taken into consideration by models that include a stratosphere, that we know about now?

Seems like an adjustment for 'known uncalibrated' factors might be interesting.

If you were just doing Kentucky windage,* as though you had an improved model, which way would you expect the results to go by adding in what you know ought to be included?
____________
*http://www.microwaves101.com/encyclopedia/slang.cfm

"Anon2, I agree that data coverage is a possible issue with the obs estimates. In principle we could generate model analogues by masking out the unobserved areas, but it would be a lot of work. And note that these estimates *are* consistently presented as estimates of global temperature, so I don't think our simplification is unreasonable."

That seems like a bit of a dodge. All the surface temperature groups make estimates of the uncertainties that arise from the gaps in the data. By masking out unobserved areas in the models you'd have a consistent estimate of that error and it would matter far less whether you believe Brohan et al's errors.

"We did look a little bit at obs uncertainty, and they are small enough to not matter (at least if you believe the error estimates of Brohan et al). "

These two are connected because the largest uncertainty in HadCRUT3 arises because the coverage at high latitudes is so poor. In recent years there's been a lot of warming in the Arctic which GISS picks up but HadCRUT3 doesn't. HadCRUT includes the effect of that missing data in the error estimate.

The implication is that the effect of observational error is not small and it does matter - compare the GISS and HadCRUT lines in the plots.

How did you come to the conclusion that the effect of observational uncertainty is small?

Lucia wrote "One tests whether the multi-model mean is consistent with the data; one tests whether the observation falls inside a distribution."

Why would one want to test whether the multi-model mean is consistent with the data? AFAICS there is no good reason why it should be as it is an estimate of only the forced component of climate change, whereas the data is a combination of this forced change and also an unforced component. Even if the model physics was perfect, the observed trend would be asymptotically outside two standard errors as the size of the ensemble grows large. So the test will fail the ideal model where the climate physics is correct and there are an infinite number of model runs (and hence a perfect characterisation of the uncertainties).

Also AFAICS for the test to be reasonable the error bars on the observations would have to reflect the variability of the unforced component of the climate change, rather than merely the uncertainty in estimating the trend from the observations. The best estimate we have of that would seem to be the spread of the model runs!

The only thing that I would hope to be confident of claiming (were I a climate modeller) is that the observed trend should lie within the spread of the models. That would establish consistency of the models with the data.

Caveat emptor - I am not a climate modeller, so it is possible I have misunderstood plenty!

Hi Dikran Marsupial

>>Why would one want to test whether the multi-model mean is consistent with the data?

I think Santer was perfectly reasonable to test the multi-model mean. Here are some good reasons. First, the notion that it should be tested is suggested on page 608 of Chapter 8 of the WG1 report to the AR4:

"The multi-model averaging serves to filter out biases of individual models and only retains errors that are generally pervasive. There is some evidence that the multi-model mean field is often in better agreement with observations than any of the fields simulated by the individual models (see Section 8.3.1.1.2), which supports continued reliance on a diversity of modelling approaches in projecting future climate change and provides some further interest in evaluating the multi-model mean results. Faced with the rich variety of climate characteristics".

So, it seems to me that Santer made a perfectly reasonable decision to test multi-model mean results. He did so because the authors of the AR4 suggest that we ought to be interested in testing those results.

>>So the test will fail the ideal model where the climate physics is correct and there are an infinite number of model runs (and hence a perfect characterisation of the uncertainties).

This does not happen when one applies the method Santer used. If you examine equation (3) in Santer, you will note the term {sbo} in the square root is uninfluenced by the number of model runs. It is estimated based on the residuals to a linear fit for the data. This quantity is an estimate of the variability in trends one would expect to see if you could re-initialize the earth's weather at the start time and test the trend again. In principle, it accounts for all factors contributing to the uncertainty-- both internal variability and measurement error.

(As I previously noted, Santer makes the assumption about the spectral properties of the noise. That's a fundamental way in which his method differs from the one in the paper on which James is a co-author.)

>> Also AFAICS for the test to be reasonable the error bars on the observations would have to reflect the variability of the unforced component of the climate change, rather than merely the uncertainty in estimating the trend from the observations.

You are correct. This is accounted for in the {sbm} term in equation 3.

>> The best estimate we have of that would seem to be the spread of the model runs!

Do you mean the best estimate of the earth's internal variability?

I'm guessing that James agrees with you. However, this claim is based on believing that a) the models themselves produce the correct spectral properties of "weather" and b) that the contribution of structural uncertainty (i.e. biases) to the spread in trends across models is small relative to the variability due to weather.

The fact that both (a) and (b) constitute assumptions is the reason I believe it is useful to run both tests (and even additional tests if such are available). Since I think both tests have some use, and both answer useful questions, I would not criticize Santer for having elected to apply the test he chose. (I do have a few thoughts on some of the specific assumptions he made. But I'm not planning to discuss those at length in comments here.)

I can see the point in assessing the skill of the multi-model mean in reproducing the observed trend, but that is not the same thing as testing for consistency. While there are good reasons to think the multi-model mean is the prediction most likely to be right, there is no reason to expect the observed trend to lie any closer to the mean than merely within the spread of the models (due to the effect of the unforced component in the observed trend).

Testing for consistency is essentially a test for whether the multi-model mean is falsified by the data. The data don't falsify the multi-model mean unless they lie outside the spread of the model as that defines the error bars of the prediction of the observed trend. The standard error defines the error bars on the prediction of the forced component of the trend, but that is not the same thing.

I think we are talking at cross-purposes: equation (3) in Santer et al. seems to relate to tests of individual model realisations. The tests of the multi-model mean are in section 4.2, namely equation (12), which is an improvement on the incorrect test proposed in DCPS07. Here the standard error of the mean is used as the uncertainty on the multi-model mean, which is dependent on the size of the ensemble. This means that an ideal model with perfect physics and an infinite ensemble is likely to fail the d^*_1 test (12) unless the uncertainty of the observed trend is large. If the model is correct, it shouldn't rely on the observational uncertainty to save it from being falsified even though it is correct!
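To make the ensemble-size dependence concrete, here is a minimal sketch of a statistic of the general kind being discussed: the difference between the ensemble mean and the observed trend, divided by the combined uncertainty, where the model term is a standard error that shrinks as 1/sqrt(n). The function `d_statistic`, its exact form and all the numbers are assumptions for illustration; this is not a transcription of any equation in Santer et al. What should count as the observational term is precisely the point in dispute above.

```python
import math

def d_statistic(model_mean, obs_trend, model_sd, n_models, obs_se):
    """(ensemble mean - obs) / sqrt(SE of ensemble mean^2 + obs SE^2).

    Assumed form, for illustration only.
    """
    se_mean = model_sd / math.sqrt(n_models)
    return (model_mean - obs_trend) / math.sqrt(se_mean ** 2 + obs_se ** 2)

# Hypothetical numbers: ensemble mean trend 0.20, spread across runs 0.10,
# and an observed trend of 0.10, i.e. exactly one SD below the mean.
for n in (10, 100, 10_000):
    d = d_statistic(0.20, 0.10, model_sd=0.10, n_models=n, obs_se=0.02)
    print(f"n = {n:>6}: d = {d:.2f}")
```

With a fixed observational term, the statistic grows with the number of runs and levels off at the difference divided by `obs_se`, even though the observation sits only one SD from the mean.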

You are correct, I was indeed suggesting that the spread of the models is an estimate of the Earth's internal variability. However, it could be that the models significantly over- or under-estimate the true value, which would make its use in a test of the models a bit circular if used in that way.

Basically, it seems to me that the reason the Douglass et al test was wrong was not so much that it ignored the uncertainty on the observations, but in using the standard error of the mean instead of the standard deviation to represent the uncertainty of the model prediction. If you could isolate the forced component of the observed trend it would have been a reasonable test, however I don't know how that could be done.

I hope that goes some way to clarifying my point, rather than obfuscating it further!

I think Santer et al probably chose that way of presenting the analysis because they were basically commenting on the Douglass et al paper, showing how they had got it wrong. By all means calculate the multi-model mean to start with, but then you still have to account for the model variability if you want to test consistency with the ensemble as a whole.

Consider three models, which warm by 1, 2 and 4 degrees C over some long period, with small uncertainty. If the obs show 2C warming (again with v small uncertainty), my understanding is that Lucia's analysis would say this "refutes" the ensemble, since it is inconsistent with the mean of 2.33. But in this case the obs agree perfectly with the central model.

(If Lucia disagrees with my example, I'd be interested in seeing a calculation that she would do in that case, assuming all uncertainties are less than 0.1C)

>>If the obs show 2C warming (again with v small uncertainty), my understanding is that Lucia's analysis would say this "refutes" the ensemble, since it is inconsistent with the mean of 2.33. But in this case the obs agree perfectly with the central model.

Refutes the ensemble? I'm not sure what it means to "refute an ensemble".

The outcome you describe would refute the notion that the multi-model mean creates an unbiased projection. This is meaningful if someone or some entity either is, or has been, communicating the notion that the multi-model mean is a method that cancels biases and results in a less biased projection/forecast/prediction or what have you than any individual model, including the most central model.

If, instead, a group creating a forecast suggested the right way to create their forecast is to pick the most central model, then I would not test their method by testing the multi-model mean. I would test the method they had adopted: that is, test the observations against the most central model.

The test of the multi-model mean would not refute the accuracy or usefulness of any individual model. One would need a test of the model mean for the individual model to do that. It is entirely possible for the multi-model mean to be wrong owing to the inclusion of a high fraction of biased models that tend to be biased in one particular direction for some unknown reason.

Whether I say "IPCC projections continue to falsify" as the outcome of a test depends on what we consider to be the "IPCC projections".

If the IPCC projections correspond to the multi-model mean with some particular uncertainty range, and the observed trend -- including its uncertainty-- does not fall in that range, then yes, I would continue to say the IPCC projections falsify.

Now going on to the numbers in your specific example:

1) Assuming the standard error in the observed trend of 2 is ±0.1, I would say that the trend of 2.33 as a point value is outside the range consistent with the data. This is what I meant in my first answer, as it corresponds to the question I thought you were asking. I'm now not sure I understood your question correctly.

2) As for the method I currently use: we are testing the notion of a multi-model mean of GCMs drawn randomly from some ensemble of all possible GCMs with some qualifying properties (e.g. stability during spin-up, sufficient resolution, etc.). In that case, I would not say the observation of a trend of 2 with a standard error of 0.1 falsifies the projection based on the multi-model mean. The reason is that in this case, the standard error in the projection is ±0.8 (= 1.57/sqrt(3), where 1.57 is the SD of [1,2,4]). Assuming the model trends are normally distributed, the "t" value for two degrees of freedom is 4.3, and any observation within ±3.79 of 2 will fall inside the 95% confidence interval. The observed value of 2 clearly falls inside the range -1.79 to 5.79, so I would not find the observation of "2" inconsistent with the multi-model mean of all models.
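For anyone who wants to rerun the arithmetic in (2), here is a minimal sketch using only the three trends from the example. The critical value 4.303 is the usual two-sided 95% Student-t table value for 2 degrees of freedom, hard-coded here as an assumption rather than computed. The sample SD of [1, 2, 4] comes out at about 1.53 and the standard error at about 0.88, and the interval below is centred on the ensemble mean of 2.33.

```python
import math
import statistics

trends = [1.0, 2.0, 4.0]                 # the three hypothetical model trends
mean = statistics.mean(trends)
sd = statistics.stdev(trends)            # sample SD (n - 1 in the denominator)
se = sd / math.sqrt(len(trends))         # standard error of the 3-model mean

T_CRIT_2DOF = 4.303                      # assumed table value: 95% two-sided, 2 dof
half_width = T_CRIT_2DOF * se
low, high = mean - half_width, mean + half_width

print(f"mean = {mean:.3f}, SD = {sd:.4f}, SE = {se:.4f}")
print(f"95% interval on the mean: {low:.2f} to {high:.2f}")
print("observation of 2.0 inside:", low <= 2.0 <= high)
```

The conclusion in (2) does not depend on the rounding: with only three models the interval is so wide that an observation of 2 sits comfortably inside it.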

3) As for the linked post written in 2008, some context is required.

At the time I wrote the linked post, the model runs were not readily available to the peanut gallery. In this case, I interpreted the uncertainty intervals based on figures communicated to the public in the IPCC AR4 itself: that is, those in figure 10.4 and the words in that document.

The IPCC AR4 explains that the authors chose to create their projections and compute uncertainty intervals based on a notion that differs from the one we are using in the paper on which you are a co-author: The authors of the IPCC computed the multi-model mean temperature trajectory and the ±1 sd from the distribution of individual model means (not ±1SD of the distribution of all runs). Uncertainty for the mean trajectory of the temperatures (not individual realizations) was provided to the public in the form of a graph. You will find that graphic reproduced in comment 2069 in the article you linked.

During a period when additional information is not available to the public, I think it is quite fair for a member of the public to interpret "the IPCC projections" to be those projections as communicated to the public. I continue to think this is fair. Those are the projections I tested.

As it happens, based on the reading of the IPCC AR4, the IPCC method of using ±1SD of the multi-model mean (not the spread of the individual realizations) gives tighter uncertainty intervals than using the method we are using in our GRL submission. So, I would think it fair to use their own method to compute uncertainty intervals to determine whether or not their projections were "falsified". Now, let's look at the grouping you gave: [1,2,4] would correspond to a multi-model mean of 2.33 with ±1.57 standard error. No matter how I sliced it, the observation of 2.00 would not falsify this projection-- it falls inside the uncertainty intervals. So, I would not say that the observation of 2.00 falsified a projection of 2.33 with an uncertainty interval of ±1.57.

James, I had a similar objection to Lucia's falsification. But there are corresponding problems with this latest work. You've taken an estimate of the mean for a model ensemble, plotted the estimate of the pdf curves, and then superimposed the various indices. I've objected that the pdf values are not based on index data, but you've alternately suggested that index and model variance ought to be the same, and that the proposition that they are the same is part of H0. There must be relevant data-based evidence that you could cite.

But then there are the other sources of uncertainty. Your ensemble mean of 0.2 or 0.25 C/dec has uncertainty. So do the variances. I've alluded to the measurement uncertainty - you've said it should be small, but the difference in the indices provides a lower bound there, and it isn't small.

Then there's the question of selection of the models in the ensemble. Is the message from your plot true only for that selection? If a conclusion is to be selection-invariant, you need to add an estimate of the variability due to reasonable alternative choices of ensembles.

http://www.combine-project.eu/Partners.762.0.html

"... 22 beneficiaries, i.e. partner organizations receiving funding from the EU. The project is open for other institutions to become associated partners ..."

Thanks for the clarification. So if we happened to have a much larger ensemble of say 300 models for which 100 have trend each of 1, 2, and 4, then in this case the standard error of the mean would be 1.57/sqrt(300) =0.08 and you would then say that the projections *are* falsified by the data, right? (I think there are a couple of typos in your calculation, but no biggie. If my numbers still don't work out, then just try another factor of 10.)

I agree that the answer you get depends on the question posed. I would argue (well I think it is blindingly obvious) that to claim in this case that obs of 2 "falsify" an ensemble which consists of equal numbers of predictions of 1, 2 and 4 suggests that the question you are asking is not a very useful one.
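The scaling at issue here can be checked directly. The sketch below builds ensembles with equal numbers of models at trends 1, 2 and 4 (the helper `summarize` is a hypothetical name) and prints the mean, the standard deviation and the standard error of the mean as the ensemble grows. The mean stays at 2.33 and the SD stays of order 1, while the SE shrinks roughly as 1/sqrt(N), so the answer to "is 2 within a couple of SEs of the mean?" changes with N even though the ensemble looks the same.

```python
import math
import statistics

def summarize(copies):
    """Ensemble with `copies` models at each of the trends 1, 2 and 4."""
    ensemble = [1.0] * copies + [2.0] * copies + [4.0] * copies
    sd = statistics.stdev(ensemble)              # spread of the ensemble
    se = sd / math.sqrt(len(ensemble))           # uncertainty of its mean
    return statistics.mean(ensemble), sd, se

for copies in (1, 100, 10_000):
    mean, sd, se = summarize(copies)
    print(f"N = {3 * copies:>6}  mean = {mean:.3f}  SD = {sd:.3f}  SE = {se:.4f}")
```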

If you were misled by some of the IPCC presentation, that's unfortunate. But even so, on realising the confusion, it still would seem more sensible to actually perform a useful analysis rather than continue to present a meaningless one. I suspect one reason for the IPCC presentation is that it never crossed their minds that anyone would try to analyse such short trends based on that graph! Their main focus is the O(100y) time scale, after all.

By the way, the model output has been freely available for many years - certainly predating the publication of the AR4 itself.

The models have a range of forced responses. They also have a range of internal variability. These combine independently (not that we have tested this independence, or rely on it in any way) to give the overall temperature change.

We are testing whether the combined effects of forced response + internal variability of the real system lie in the range generated by the models, probabilistically speaking. That is all.

The choice of ensemble is defined by the decisions of the researchers that their models are sufficiently good to submit to the IPCC. Many people have interpreted this ensemble in probabilistic terms, and I consider it is a reasonable first-order assumption to do so. We are testing whether the data invalidate this use.

As for the obs error that several have mentioned...well if you already have a gaussian with a 1sd width of about 0.2 due to model variation, and you add some independent error with a 1sd width of about 0.05 due to obs error, you only increase the width of the overall distribution to 0.206, which is pretty much negligible. I already covered this here.
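The quadrature sum quoted above is quick to verify. This is a minimal sketch using the two widths given in the comment (0.2 and 0.05) and assuming, as stated there, that the two error sources are independent and Gaussian.

```python
import math

model_sd = 0.2    # 1-sd width from model variation (figure quoted above)
obs_sd = 0.05     # 1-sd width of independent observational error (figure quoted above)

# Independent errors add in quadrature: sqrt(model_sd**2 + obs_sd**2).
combined = math.hypot(model_sd, obs_sd)
print(round(combined, 3))   # 0.206
```

That is an increase of about 3% in the width of the distribution. A larger observational error would of course matter more, which is the substance of the disagreement over the size of the obs uncertainty.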

James

There could easily be typos. My husband was coming in the door for dinner and I didn't check for them. Right now, my spreadsheet is saying 1.5275.... =SD and SD/sqrt(3) = 0.8819... I wrote 0.8 above. Oddly, I don't think typos in the numbers are affecting our communication since I think you are trying to create an extreme example to explore what I would say, right? If 100 hadn't been enough, I'd be happy to go for 1,000,000. Not a problem.

>>So if we happened to have a much larger ensemble of say 300 models... the projections *are* falsified by the data, right?

Not right.

(It turns out I'm going to have to break up this long response owing to the 4,096 character limitation of your comments field!)

First, I need to note that we have a problem: you are creating a hypothetical ensemble and then asking me what I would say about falsifying "the projections" without stipulating the actual projections "panel A" created and communicated based on that ensemble. I dispute the notion that projections create themselves out of an ensemble. People or groups of people create projections using some method, and then they communicate those specific projections. If "panel A" has created "their" projections, others can compare those projections to the data. Meanwhile, rival "panel B", working with exactly the same ensemble, might prefer a different method of creating projections, and they would create some other set of projections. When creating and communicating their projections, each group would, presumably, indicate some level of uncertainty, and possibly describe their method of creating a projection out of an ensemble.

Of course, no matter what Panel A or B do to create projections, people can also test any features of the ensemble both panels used, possibly to explore ancillary questions.

So, before I can answer whether I would say "Panel A's projection" has been falsified, I need you to describe what Panel A's projections actually are. You cannot simply describe the members of the ensemble and leave it to each individual reader to decide what they would have projected given the identical ensemble.

To avoid the 8 hour delay associated with time zones, work, etc., I'll make some assumptions for the purpose of your counterfactual, which seems to be designed to get me to explain when I consider "Panel A's projection" falsified or refuted. In that vein, I'll create "Panel A's projection" and then discuss whether or not I consider it refuted by an observation of 2.0. I'll assume that Panel A had decided to create a figure similar to 10.4 in Chapter 10 of the AR4, and that they created their figure based on the ensemble you describe (i.e. 100 models having a trend of 1, 100 having a trend of 2, and 100 having a trend of 4). Let us further assume they used the method discussed in the AR4 and that they called this figure and related tables "Panel A's projections". In this case,

* Panel A's figure would have shown a 'most likely' trend of 2.33 as the multi-model mean.
* Panel A's figure would have communicated uncertainty intervals of ±1.57 on their graphic. (So, 0.8 to 3.86.)
* To the extent that Panel A was vague about the actual meaning of their uncertainty intervals, there could be disputes over what they projected, but I would strongly suggest that the range shown to the public on that graph was 0.8 to 3.86, and the public would tend to think that's the range Panel A thinks is somehow "likely".

Given this, someone like you could ask me how I would evaluate "Panel A's projections". I would evaluate "Panel A's projections" as being a most probable value of 2.33 with an uncertainty of ±1.57. I would do this because ±1.57 would be the uncertainty Panel A judged as the uncertainty range for their projections.

This is basically the notion outlined in (3) above. I should note that I would evaluate "their projections" this way whether I thought their method was sound, insane or what have you. If Panel A had created projections by consulting 12 psychics, and then told the world they predicted some mean and a standard deviation, I would use that mean and standard deviation and then state whether "Panel A's projections" were consistent with what they actually projected.

Since you asked about the 1.57/sqrt(300), I will now move on to the relevance of that standard error.

Although we can test projections as stated above, it is my opinion that we are also permitted to ask and test other ancillary questions. These questions might be interesting depending on what claims appear in the literature. The questions might be relevant to improving the process Panel A used to create their projections. But one thing is very clear: Tests of these ancillary questions do not answer the question "Are Panel A's projections refuted".

Now, suppose someone or somebody (say Panel B) made the claim that creating projections based on the multi-model mean of the ensemble would result in a cancelling of biases and results in a method that is somehow better than using any individual model. In that case, it might be interesting to examine whether the multi-model mean was consistent with the data.

I (or anyone) could proceed with such a test. Given the ensemble you described, I'd have to seriously explore some questions about independence of models and randomness of the draw. But let's assume I concluded that they were independent and selected randomly. I would proceed with the notion that I had 300 independent samples drawn randomly in some vague sense.

Of course, without doing any sophisticated test, I would notice that a population with 100 1's, 100 2's and 100 4's is not normally distributed. This means I would not compute the standard error using SD/sqrt(N) and I would not use the "t" distribution to estimate the 95% confidence intervals. But it would be possible to determine the 95% confidence intervals on the mean of 100 samples either by brute force or possibly by finding the exact analytical solution for this case. I would then compare the observed value of 2.0-- including its uncertainty intervals-- to see if the observation is consistent with the multi-model mean of 2.33. I'm not going to bother computing the correct 95% uncertainty interval-- I'm guessing it's pretty small and the observation of 2.0 will fall outside the 95% range. (If not, we can just change your hypothetical to 100,000 1s, 2s and 4s, right?)

If so, I would say "The multi-model mean is inconsistent with the observation." What this means is that I would expect that if we drew larger and larger sets of samples, and computed the multi-model mean of these very large samples, the multi-model mean would converge to a value that is not 2.0.
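One brute-force way to get the interval mentioned above is to simulate it. The sketch below is an illustration under stated assumptions, not the calculation anyone in this thread actually ran: it assumes each model is an independent draw that is equally likely to have a trend of 1, 2 or 4, resamples 300-member ensembles many times, and reads off the central 95% range of their means. The helper `mean_interval`, the number of repeats and the seed are arbitrary choices.

```python
import random
import statistics

def mean_interval(population, n_draws, n_repeats=2000, seed=0):
    """Central 95% range of the mean of n_draws values resampled from population."""
    rng = random.Random(seed)
    means = sorted(statistics.fmean(rng.choices(population, k=n_draws))
                   for _ in range(n_repeats))
    return means[int(0.025 * n_repeats)], means[int(0.975 * n_repeats) - 1]

population = [1.0, 2.0, 4.0]   # each trend equally likely, as in the hypothetical
low, high = mean_interval(population, n_draws=300)

print(f"95% range of the 300-member mean: {low:.2f} to {high:.2f}")
print("observation of 2.0 inside:", low <= 2.0 <= high)
```

The range comes out a little over a tenth of a degree either side of 2.33, so an observation of 2.0 with an uncertainty of 0.1 or less falls below it, as guessed above.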

But, returning to your question about whether I would say Panel A's projection is falsified: The answer is no. The reason: Panel A's projection does not consist of merely the multi-model mean. The claim that the multi-model mean is unbiased is "Panel B's claim about the multi-model mean". (If Panel A and Panel B happen to be the same body, so be it. But that fact doesn't transform their claim about the multi-model mean into "their projection". They padded their projection in a way they thought suitable.)

Relating this to the IPCC issue: In this counterfactual, "Panel A" chose to communicate error bars of ±1 SD when making their projection. If we want to say that the data are inconsistent with Panel A's projections, then the data have to be inconsistent with what Panel A actually projected-- which in this counterfactual was 2.33 ±1.57. The value of 2 falls in there.

Now: to turn to the GRL paper. We could, of course, also check whether the value of "2" falls in the distribution of (1, 2, 4)-- following the method in our GRL submission. In this case, we would report that it does fall in the distribution.

But what we have here are three different questions:
1) Are the projections actually made by Panel A, using the method they chose, consistent with the data?
2) Is the multi-model mean of a collection of ensembles chosen according to the rules set forth by Panel A unbiased relative to observations?
3) Do observations fall inside the distribution of all members of the ensemble?

In this counterfactual, the answer to question 1 is "The projections are not refuted". The answer to question 2 is "The claim the multi-model mean is unbiased is refuted." The answer to question 3 is "The observations fall inside the distribution of all members of the ensemble."

In my opinion, all three questions are interesting. I don't think one is necessarily more important than the others. The appropriate method for testing each question differs. So, we are going to see different questions and different methods.

Most of all, I think it's important not to conflate the answer to one question with the answer to a separate question, or to impose a phrase that might be appropriate when testing question (1) onto the answer to questions (2) or (3). You cannot say the IPCC projections are falsified, rejected, refuted or anything based on the answers to questions (2) and (3) because you aren't testing the projections. You are testing a specific property of the ensemble and determining whether that property is consistent with the observations.

Hank--I initially did not know James was a coauthor either. I was discussing various issues back and forth with Chip. I did know James was a co-author before this was submitted to GRL. When I learned he had joined, I was also surprised. But it was a perfectly pleasant surprise even though I know he criticized what I wrote in 2008.

I also wondered how much flak he was going to get for contributing to this paper. :)

Lucia, it seems to me to be best to consider a thought-experiment, that is structurally the same, but stripped of the (unnecessary) details of the application.

Say we had a coin that had been biased by adding a small weight to the tails side, so there would be more heads than tails. For the sake of argument, assume the true probability of a head is 0.6.

A coin modeler decides to try and make a prediction of the number of heads in the next 20 tosses of the coin. He inspects the coin, and using his expert knowledge of coin dynamics guesses that the probability of a head is 0.6 (i.e. the model is exactly correct).

Sadly, he is a computer scientist, so rather than using simple probabilistic arguments, he uses a Monte-Carlo simulation and makes his prediction based on the mean of an ensemble of 1,000,000 individual model runs, each consisting of 20 flips of a simulated biased coin with a 0.6 probability of a head. He predicts that there will be 12 heads out of a possible 20.

Clearly the ensemble mean gives *exactly* the most probable result, and so is the best possible single predictor. As such it ought to pass any sensible test of the "consistency of the ensemble mean".

Say we actually observe only 10 heads out of 20 (not that unlikely, the probability of this outcome is 0.1171, assuming I calculated it correctly). The ensemble mean then fails the test we have been discussing as the observed result lies outside the standard error of the ensemble mean.

This example shows why the ensemble mean can be the best single predictor, but that doesn't mean the observed result should lie within say 2 standard errors of the mean. Like a GCM ensemble, the observations should lie within say 2 standard deviations of the mean.

N.B. the discrete nature of the statistic makes no difference, as you can make the number of coin tosses arbitrarily large and the same arguments can still be made.
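The numbers in the coin example can be checked with a few lines. This is a minimal sketch using only the values given above (20 tosses, probability of a head 0.6, an ensemble of 1,000,000 runs, 10 heads observed); it computes the exact binomial probability and compares the observed count with the ensemble mean using the standard deviation of a single run and, separately, the standard error of the ensemble mean.

```python
import math

n, p = 20, 0.6          # tosses per run and probability of a head
runs = 1_000_000        # size of the Monte Carlo ensemble in the example
observed = 10           # heads actually observed

def pmf(k):
    """Binomial probability of exactly k heads in n tosses."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

expected = n * p                       # 12 heads
sd = math.sqrt(n * p * (1 - p))        # spread of a single 20-toss run
se = sd / math.sqrt(runs)              # uncertainty of the ensemble mean

print(f"P(exactly {observed} heads) = {pmf(observed):.4f}")
print(f"SD of one run = {sd:.3f}, SE of ensemble mean = {se:.5f}")
print("within 2 SD of the mean:", abs(observed - expected) <= 2 * sd)
print("within 2 SE of the mean:", abs(observed - expected) <= 2 * se)
```

The observed count is about 0.9 SD from the mean but roughly nine hundred SEs from it, which is the distinction the example is drawing.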

>>Clearly the ensemble mean gives *exactly* the most probable result, and so is the best possible single predictor. As such it ought to pass any sensible test of the "consistency of the ensemble mean".....

>>This example shows why the ensemble mean can be the best single predictor, but that doesn't mean the observed result should lie within say 2 standard errors of the mean. Like a GCM ensemble, the observations should lie within say 2 standard deviations of the mean.

Your example does not show why the ensemble mean of models is the best single predictor for the earth's weather.

The difficulty is that in your example, you are using flips of "coin A" to determine the average outcome for "coin A". Consider this: What if the person flipping coin A is using the outcome of the flips to discover the likely outcome when he flips "coin B"? What if, notwithstanding the bias in "coin A", "coin B" is unbiased? What if the bias for coin B is 0.4 instead of 0.6?

The ensemble average of coin flips of coin A is only a good estimate of the average outcome for coin B if both exhibit the same average outcome. Whether or not they do share the same outcome is a very real question. This is important with respect to our discussion of whether or not we can assume that the multi-model average of models (i.e. "coin A") will be an unbiased estimate of realizations of earth's weather (coin B). In reality, whether the models are an unbiased estimator of earth's weather is a question that can be asked and should be tested.

Let me edit what you wrote so that I can agree with it:

>>Clearly the ensemble mean gives *exactly* the most probable result for coin A, and so is the best possible single predictor for flips of coin A. As such the method of testing whether an individual flip of coin A is consistent with the distribution of flips of coin A ought to pass any sensible test of the "consistency of the ensemble mean".

Now let's look at the numbers you suggest:

>> Say we actually observe only 10 heads out of 20 (not that unlikely, the probability of this outcome is 0.1171, assuming I calculated it correctly).

I think you are misunderstanding the way these tests are generally done.

I also get that the probability of getting exactly 10 heads out of 20 for a coin that is expected to get 12 heads out of 20 is 0.1171. However, this value is not relevant to testing the hypothesis that the coin will result in 12 heads out of 20 on average. Since we are going to reject the hypothesis that the coin gets 6/10 heads for any and all outcomes that are sufficiently far from 6/10, we examine cumulative distributions. Using "BINOMDIST()" in Excel, I'm finding that 7 or fewer heads out of 20 happens 2.1% of the time, but 8 or fewer happens 5.7% of the time. Since we reject the claim that the average result is 6/10 both if we had too many heads and if we had too few heads, and I (at least) generally only reject a null hypothesis when it's outside the 95% confidence intervals, I only reject on the "too few" side if we saw fewer than 7 out of 20 heads.
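The cumulative figures quoted above can be reproduced without a spreadsheet. This is a minimal sketch that sums exact binomial probabilities for 20 tosses with a head probability of 0.6; the helper name `binom_cdf` is an arbitrary choice.

```python
import math

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p), by direct summation."""
    return sum(math.comb(n, i) * p ** i * (1 - p) ** (n - i)
               for i in range(k + 1))

n, p = 20, 0.6
for k in (6, 7, 8):
    print(f"P(heads <= {k}) = {binom_cdf(k, n, p):.3f}")

# Upper tail, for the "too many heads" side of a two-sided test.
print(f"P(heads >= 17) = {1 - binom_cdf(16, n, p):.3f}")
```

An observation of 10 heads is nowhere near either tail, so this kind of two-sided test on a single set of 20 flips would not reject a head probability of 0.6.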

It is true that in some experiments, I will flip coin A, do the analysis and decree that the outcome of a set of coin A flips is inconsistent with the known true value for the average of coin A. But that's the way these things are in frequentist statistics. The rule is to state the rate at which this particular mistake will be made. If I claim I'll make it 5% of the time, then I should make it 5% of the time.

This is what I think happened: Chip was collaborating with people as he put the paper together. People were added in stages when he needed assistance on some particular issue. We were all receiving various emails, sometimes with each other's names on them, sometimes individually.

That said, it was only during the final stages of preparation of the GRL submission that I knew the precise names (other than mine) on the authors list. I'm pretty sure all names appeared on the final review copies, and I know James was listed as one of the authors at that point, and possibly slightly earlier. (I have to admit to not remembering the precise instant when I learned that James was also an author.)

After Chip sent us all the paper and I (and I've always assumed every other author) had sent Chip our ok, the paper was submitted to GRL. GRL sent me (and I assume everyone else) an email to verify that our names had not been attached without our knowledge and consent. That is very courteous of GRL. :)

At a later point, Chip put together his presentation for Heartland. He sent me an email letting me know that he planned to do this, and I said it was fine with me. I assumed he let other authors know as well, but I didn't quiz him on the precise nature of his communication with each nor did I demand he send me copies of every communication with every author on his list.

That's the story about authorship as far as I know it. I don't have any issues with this. If I had a gripe, I would have complained to Chip.

Lucia, I am not sure where your "coinA" and "coinB" come in. The aim of a statistical model is to construct something that is statistically exchangeable with reality. In this example I chose to make the model exactly correct, so the most likely outcome for CoinA is exactly the same as the most likely outcome for your coinB. That was the point of the thought experiment, to show an example where an ensemble mean from a perfect model fails the test, despite giving the best possible single prediction.

As for the numbers, I do know how significance tests are performed. The probability of 0.1171 was intended to demonstrate that the condition under which the model fails the test is not particularly unlikely.

Note that the test you proposed no longer involves the standard error of the mean, but depends on the spread of the distribution, which sort of makes my point!

"It is true that in some experiments, I will flip coin A, do the analysis and decree that the outcome of a set of coin A flips is inconsistent with the known true value for the average of coin A. But that's the way these things are in frequentist statistics."

Yes, but why would one expect the outcome of a particular example to be exactly the same as the mean of its generating distribution? It is a bit like rolling a die and being surprised at getting a six because it is a long way from the average score.

It doesn't surprise me at all that the observed trend is not exactly the same as the (population rather than sample) mean of the models. This is simply because the model mean is only an estimate of the forced component, whereas the observed trend is a combination of the forced trend and the natural variability.

It occurred to me that we know the oceans are warming, the measure of heat content shows that; we know ice is melting a lot and can estimate that; and we know it is warming. What I don't know is how much the models take into account the heat needed to melt the ice and increase the ocean temperature. It'll take me a while to find out as well. Anyone want to say if they know the answer?

Dikran Marsupial>>Lucia, I am not sure where your "coinA" and "coinB" come in. The aim of a statistical model is to construct something that is statistically exchangeable with reality.

It is also the aim of a GCM to create something that is statistically interchangeable with reality. However, the fact that this is the aim does not automatically make the output of a GCM statistically interchangeable with reality. (Likewise, the fact that my aim is to lose 20 lbs by October does not make my losing 20 lbs by October a reality. When October arrives, we'll see what happens.)

So, in the coin analogy we have: output of a GCM = coin A; reality = coin B.

It is the aim of modelers to make "coin A" have the properties of "coin B". Whether or not they achieve what they hope to achieve is a question that can and should be tested. (I think it's fair to say that since 'weather' in individual models does not appear statistically interchangeable between models, it's also fair to suggest that the weather in at least some models is not statistically interchangeable with earth weather. I'm willing to do analyses based on the assumption that the collection of models does create a distribution of weather that is statistically interchangeable with earth weather, and then point out in the caveats to any conclusions that this assumption may not be true. It's pretty common to do this sort of thing in engineering and science.)

>> That was the point of the thought experiment, to show an example where an ensemble mean from a perfect model fails the test, despite giving the best possible single prediction.

I thought this was the point and that's why I then proceeded to go through your numbers. You are not applying the test the way I would apply the test, nor the way I think anyone would apply it.

>>As for the numbers, I do know how significance tests are performed. The probability of 0.1171 was intended to demonstrate that the condition under which the model fails the test is not particularly unlikely.

Whose test does the model fail? Not one I've ever done or suggested anyone do.

It is true that someone could hypothetically design the very poor test you described (which differs from any traditional significance test I am aware of), apply that very poor test and consequently make some odd observations. For example, using the method you describe, if coin A has a known rate of heads of 0.6, and we flip it 2000 times getting exactly 1200 heads, we would conclude that the data showed the rate of heads is not 60%. After all, getting exactly 1200 heads out of 2000 flips will happen only 1.8% of the time.
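The 2000-flip example can be checked with a short sketch along these lines. It works in log space (via `math.lgamma`) only because the raw binomial terms for 2000 flips overflow or underflow ordinary floating point; the function name is illustrative, not from any library.

```python
from math import exp, lgamma, log

def binom_log_pmf(k, n, p):
    """Log-probability of exactly k heads in n flips (log-gamma avoids overflow)."""
    return (lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)
            + k * log(p) + (n - k) * log(1 - p))

n, p, observed = 2000, 0.6, 1200

# Probability of landing on the expected count exactly: small for any single count.
p_exact = exp(binom_log_pmf(observed, n, p))

# Lower-tail probability P(X <= observed), the quantity a tail-based test uses.
p_lower = sum(exp(binom_log_pmf(k, n, p)) for k in range(observed + 1))

print(f"P(exactly {observed} heads) = {p_exact:.4f}")
print(f"P({observed} or fewer heads) = {p_lower:.4f}")
```

The exact-count probability comes out near the 1.8% quoted above, while the tail probability is close to one half, so a test based on tails would see nothing unusual in 1200 heads.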

This peculiar behavior of the method you have described numbers among the reasons few (and I might suggest no one) use it.

Nevertheless, someone somewhere might come up with the idea that the method you described works. But the fact that some person could hypothetically design a very, very poor test and apply it does not mean that ordinary two-tailed t-tests or anything similar have suddenly been shown to be inappropriate.

>>Yes, but why would one expect the outcome of a particular example to be exactly the same as the mean of its generating distribution?

I never suggested it would. Can you name anyone who has?

>>It doesn't surprise me at all that the observed trend is not exactly the same as the (population rather than sample) mean of the models.

Me neither. I doubt anyone is surprised by this.

As I said, the whole point of the thought experiment was to construct a model that happened to be correct, so by definition it is exchangeable for reality, so discussing exchangeability only clouds the central issue.

In my post, I did not perform a statistical test, I left it unsaid as the tests have already been discussed. As I said, the 0.1171 figure is not part of a test, it is just an indication that the situation where the model fails the test is not particularly unusual. It isn't a p-value or anything like that.

Perhaps it would be easier if you explained how you would test for the consistency of the ensemble mean with the observations in this case.

I was assuming that you would use the standard error of the mean as the uncertainty on the ensemble mean, and note that the observations are not uncertain, we can just count the heads, and then just see if the observations lie within the uncertainty of the ensemble mean. That seemed to be what you were suggesting and the ensemble mean fails that test, even though it is the ideal unbiased estimator.
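To make the distinction concrete, here is a small sketch of the two candidate checks for the perfect-model coin. The ensemble size of 100 runs is a hypothetical value chosen for illustration (it is not given in the discussion), and "2 standard deviations" is used as a rough stand-in for a 95% interval.

```python
from math import sqrt

n_flips, p_heads = 20, 0.6   # the coin from the thought experiment
n_runs = 100                 # hypothetical ensemble size (an assumption)
observed = 10                # heads actually observed

mean = n_flips * p_heads                           # expected number of heads
spread = sqrt(n_flips * p_heads * (1 - p_heads))   # sd of a single 20-flip run
sem = spread / sqrt(n_runs)                        # standard error of the ensemble mean

def within(value, centre, half_width):
    """True if value lies inside centre +/- half_width."""
    return abs(value - centre) <= half_width

# Check 1: is the observation inside the uncertainty of the ensemble MEAN?
passes_sem_test = within(observed, mean, 2 * sem)

# Check 2: is the observation inside the SPREAD of individual runs?
passes_spread_test = within(observed, mean, 2 * spread)

print(f"mean = {mean:.1f}, spread = {spread:.2f}, sem = {sem:.2f}")
print(f"inside mean +/- 2 sem:    {passes_sem_test}")
print(f"inside mean +/- 2 spread: {passes_spread_test}")
```

Under these assumptions the observation of 10 heads falls outside the narrow interval built from the standard error of the mean but well inside the interval built from the spread, which is the behaviour the thought experiment is meant to illustrate. The larger the assumed ensemble, the narrower the first interval becomes.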

"As for the obs error that several have mentioned...well if you already have a gaussian with a 1sd width of about 0.2 due to model variation, and you add some independent error with a 1sd width of about 0.05 due to obs error, you only increase the width of the overall distribution to 0.206, which is pretty much negligible. I already covered this here."

Compare the GISS and HadCRUT3 lines in your diagram. That is a measure of the observational uncertainty and it clearly isn't negligible.
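For reference, the arithmetic in the quoted passage is a sum in quadrature, which assumes the two error sources are independent and roughly Gaussian. A minimal check, using only the two widths stated in the quote:

```python
from math import hypot

model_sd = 0.2   # 1-sd width attributed to model variation in the quoted comment
obs_sd = 0.05    # 1-sd width attributed to observational error in the quoted comment

# Independent Gaussian errors add in quadrature: sqrt(0.2**2 + 0.05**2).
combined_sd = hypot(model_sd, obs_sd)
print(f"combined 1-sd width = {combined_sd:.3f}")
```

This only confirms the 0.206 figure given the stated inputs. Whether 0.05 is the right size for the observational error is the point in dispute, and the sketch says nothing about that.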

Dikran>>Perhaps it would be easier if you explained how you would test for the consistency of the ensemble mean with the observations in this case.

You said you read the Santer paper you brought up in your question to James. I would do something along the lines in that paper, though I might make a few modifications. Before I explain further, do you understand how Santer tested the consistency of the ensemble mean relative to observations? Because, if not, you should read the paper and then ask questions in places where you don't understand the particular step.

>>...so by definition it is exchangeable for reality, so discussing exchangeability only clouds the central issue.

No. Because based on your coin analogy, you have advanced a conclusion about models and earth data. Specifically, you said this, "Like a GCM ensemble, the observations should lie within say 2 standard deviations of the mean."

Assuming by "observations" you mean "observations of earth data", and by "mean" you mean "mean of the GCM ensemble", your coin analogy is missing an important element. That element is that while your "coin A" is "coin A", and so exhibits all the properties of "coin A", a GCM ensemble is not an ensemble of earth data.

If you want to discuss whether a particular test would give reasonable results when testing a known truth about the distribution of outcomes for "coin A" against a sample outcome of "coin A", that's fine. But don't throw in additional irrelevant and incorrect claims about what that tells us about the relationship between the multi-model mean of an ensemble and observations of earth data.

Regarding authorship, I see absolutely nothing underhand in Chip's behaviour, or anyone else for that matter. Criticise me, if anyone. I had been in conversation with Chip for some time regarding the analysis - there was an earlier version of the manuscript which did not bear my name - and eventually I suggested that I'd made enough of a contribution to be listed as a co-author. I'd say this sort of thing (adding/changing co-authors during the work) is not particularly uncommon, although I expect it happens less often than not. I'm currently engaged in other collaborations where the author list is not yet fully decided (though I do have a good idea of what I think it should/will be).

I was also fully aware of Chip's Heartland presentation prior to the event. As I said, I would probably not have used exactly the same terms, but given the somewhat investigative and exploratory nature of the analysis I'm not going to insist on the One True Way of interpretation. As I alluded to earlier, the whole machinery of NHST is predicated on the test being decided before the data are seen, which obviously cannot be adhered to in cases such as this. I do however stick firmly to my guns on the basic approach presented in this work being the most natural and obvious way of comparing the model ensemble to the observations. I believe it is directly compatible with the RC post, and the Easterling and Wehner paper, for example, but goes some way past them in terms of detail.

Looking at the 8 year trends (for example), I see a central line at about -0.008, and ones either side at about -0.004 and -0.012 (the ticks are .004). So the 1sd uncertainty on that is 0.004. That is per year, I was talking about decadal trends.

(I would accept my calc is a bit of a hand-wave, but I don't think it is really our job here to try to do a detailed uncertainty analysis on the obs to challenge the figures that the originators of these analyses have actually published, such as Brohan et al for HadCRUT.)

That's an awfully elaborate narrative (or series of narratives) that you've got there. It rather seems like you are trying to interpret things in a way that enables you to put "IPCC" and "falsified" in the same sentence, rather than investigate how well the IPCC models match up to reality.

BTW, it might be worth noting somewhere that the IPCC figure you refer to never even claimed to present the GCM projections, rather it presents the output from a simple climate model (SCM) which was tuned to emulate the GCMs. This precludes the possibility of that graph representing short-term natural variability in any way whatsoever.

Regarding the multi-model mean, there are some confused statements in the literature, but the following are actually trivial truths: (1) the MMM will be better than an average model for anything you choose to look at, (2) the MMM is very unlikely to be better than all models in matching the actual temperature trend, and of course (3) the MMM is going to be biased one way or another, it is basically impossible for it to coincide precisely with the truth. So rejecting the "nill hypothesis" that the MMM coincides with the truth would be a pretty worthless exercise.

Something that we haven't discussed here, but also seems relevant, is the difficulty of accurately determining an underlying trend in a short time series with unknown autocorrelation structure. This was a major error in the Schwartz paper a couple of years back. Using monthly data and fitting AR(1) enables you to deduce a rather low level of multi-year variability, but it is far from clear that this is actually correct. An obvious test would be to apply the same analysis to short sections of GCM output and see how often the 95% confidence intervals actually exclude the forced trend. I'd be interested in seeing the results of this.
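A rough sketch of the kind of check described here, with synthetic AR(1) noise standing in for GCM output. That substitution is an assumption: no model data is used, and the series length, autocorrelation, noise level and trend below are all hypothetical. The confidence interval uses an effective-sample-size correction based on the lag-1 autocorrelation of the residuals, which is one common choice and not necessarily the one used in any of the papers under discussion.

```python
import math
import random

def ar1_noise(n, phi, sigma, rng):
    """Generate n samples of AR(1) noise: e[t] = phi * e[t-1] + N(0, sigma)."""
    # Start from the stationary distribution so the first samples are not special.
    e = rng.gauss(0.0, sigma / math.sqrt(1.0 - phi * phi))
    out = []
    for _ in range(n):
        out.append(e)
        e = phi * e + rng.gauss(0.0, sigma)
    return out

def trend_with_ar1_ci(y, z=1.96):
    """Return the OLS slope and an approximate 95% half-width with AR(1) correction."""
    n = len(y)
    t_mean = (n - 1) / 2.0
    y_mean = sum(y) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    slope = sum((t - t_mean) * (y[t] - y_mean) for t in range(n)) / sxx
    resid = [y[t] - y_mean - slope * (t - t_mean) for t in range(n)]
    sse = sum(e * e for e in resid)
    # Lag-1 autocorrelation of residuals, clamped to keep the correction sane.
    r1 = sum(resid[t] * resid[t - 1] for t in range(1, n)) / sse
    r1 = max(0.0, min(r1, 0.99))
    # Effective sample size shrinks as autocorrelation grows (floor of 3 is arbitrary).
    n_eff = max(n * (1.0 - r1) / (1.0 + r1), 3.0)
    se = math.sqrt(sse / (n_eff - 2.0) / sxx)
    return slope, z * se

def exclusion_rate(n_trials=2000, n=96, phi=0.6, true_slope=0.002,
                   sigma=0.1, seed=0):
    """Fraction of synthetic series whose interval excludes the known trend."""
    rng = random.Random(seed)
    misses = 0
    for _ in range(n_trials):
        noise = ar1_noise(n, phi, sigma, rng)
        y = [true_slope * t + noise[t] for t in range(n)]
        slope, half_width = trend_with_ar1_ci(y)
        if abs(slope - true_slope) > half_width:
            misses += 1
    return misses / n_trials

if __name__ == "__main__":
    for phi in (0.0, 0.3, 0.6, 0.9):
        print(f"phi = {phi:.1f}: exclusion rate = {exclusion_rate(phi=phi):.3f}")
```

If the intervals were well calibrated, the exclusion rate would sit near 5% for every value of `phi`. Running the same loop over short sections of actual GCM output, with each model's own forced trend as the reference, would be the real version of the test; the synthetic version can only show how the correction behaves when the noise really is AR(1).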

James, no criticism at all from me, I wanted to ask about the process because it's

I grew up as a faculty brat and learned early that scientists work as individuals -- not as teams or tribes.

Coauthors may not agree on much other than the points stated and supported in a particular paper they shared work on. I recall times when coauthors fiercely criticized each others' press releases and lectures for going beyond the publication. (And yes that does still happen.)

It's an important _way_ to do science --with coauthors selected for their work. I think it happened more back when travel and phone calls and postage were significant expenses and many coauthors didn't know each other.

Describing the process helps debunk the 'tribalism' notion.

That's also why I wondered if it could have been done by a group of authors deciding to hire a manager to help them pull a paper together -- and what difference that might make, since there is a significant amount of management involved even if the scientists pick one another to work with.

Paring away at claims that go beyond the publication, now, that'll be fun to watch.

Nice stuff, James. I'm not a stats jock, but I can see the clear communication and the strong desire to come to grasp and do hypothesis tests. Kudos.

I remember Moshpit touting Lucia. Even two years ago, it was hard to see what she was even alleging given how she meandered and never wrote a paper. Too much trumpeting, not enough analysis. And pretty clear she was not interested in really thinking about the nature of variability (which is the key issue). I guess there is a reason she is knitting afghans rather than working in fluid dynamics.

James, far from criticising you for co-authoring a paper with sceptic types, I applaud it. Where the two "sides" can agree on something, science is best served by saying so. The division into opposing, mutually distrusting camps is human nature, but not very scientific!

James said: "An obvious test would be to apply the same analysis to short sections of GCM output and see how often the 95% confidence intervals actually exclude the forced trend. I'd be interested in seeing the results of this."

This was the sort of analysis I was thinking of running, just to allay my concerns about the Santer-17 test, but I haven't got round to it yet. I'd also be interested in seeing such an analysis (or indeed working with anyone interested in such an analysis).

"So rejecting the "nill hypothesis" that the MMM coincides with the truth would be a pretty worthless exercise."

That is pretty much what I was trying to say. Does e.g. the IPCC make any claim that it should?

AFAICS, the Santer-17 paper does not show the models are "working fine" and does not suggest that they are "perfect". What it does show is that the models aren't falsified by the observations. Showing that the models are consistent with the data is not a strong endorsement of the models, it just means that we haven't proved that they are wrong yet. That is why claiming the models are inconsistent with the data would be a damning criticism (and hence a very big claim).

What causes disagreement is where claims are made that the models have been falsified (are inconsistent) on the basis of a test that doesn't actually establish that inconsistency, but perhaps does show cause for concern about the skill of the model. The test has to support the claim being made, if you want to show the models aren't useful, then show that they have poor skill or make predictions that are too vague to be useful, rather than go for the much greater claim that they are falsified by the observations, for which much stronger evidence is required.

Hi James, I'm going to divide this in two (because of your comment box limitations.) The second response is, I think, the more productive part of the conversation.

>>That's an awfully elaborate narrative (or series of narratives) that you've got there. ...rather than investigate how well the IPCC models match up to reality.

You asked me whether I would call a certain procedure "falsifying projections" and also asked me to put the answer in context of a specific post written in 2008.

I don't know why you think a specific 2008 post that discusses "projections" must by definition be a test to see how the individual underlying models match up to reality. (As you see above, I indicated I reject the notion that projections create themselves out of an ensemble of models. People do that using some method.) So, I think testing "projections" is not precisely the same as testing "models".

If you consider my answer to your question, which appeared to assume there is no distinction between "projections" and "an ensemble of models", to be an elaborate narrative, you are entitled to think so. But I see no way of answering that question without explaining that I think there is a distinction.

In any case, even if you don't see the distinction between projections and "the models", I don't know why you would think that testing whether an important visible characteristic of the ensemble matched the observations does not constitute a test of how well the IPCC models match up to reality. (I'll grant it's not a complete test.)

I also don't know why you think two individual posts from 2008 will permit you to make any conclusions about my full interest with regard to testing whether IPCC models match up to reality.

>>BTW, it might be worth noting somewhere that the IPCC figure you refer to never even claimed to present the GCM projections,

This is an odd interpretation since it is the figure presented to the public in the section called "10.3 Projected Changes in the Physical System" and is also the illustration provided in the section entitled "Projections of Future Climate Change" in the SPM of the AR4. However they chose to create the graph itself, they created this graph and presented it to the public. The caveats you insist on are nowhere mentioned.

In any case, if this graph does not represent projections, I am not the only one who has interpreted this figure as corresponding to "IPCC projections". The Union of Concerned Scientists also includes this graph in their collection of "Projections" by the IPCC. See "Projections of Climate Change". So, I think we are going to disagree on what constitutes "the projections". If this is our disagreement, it is not a statistical point. It is a matter of semantics.

>>So rejecting the "nill hypothesis" that the MMM coincides with the truth would be a pretty worthless exercise.

Maybe. Or maybe not. If it is, it's rather amazing that Santer's response to Douglass wasn't merely to point out that rejecting the null hypothesis that the MMM does not match the data is a pretty worthless exercise!

Cont'd.

>>Something that we haven't discussed here, but also seems relevant, is the difficulty of accurately determining an underlying trend in a short time series with unknown autocorrelation structure.

Sure. I consider this a big difficulty.

>>An obvious test would be to apply the same analysis to short sections of GCM output and see how often the 95% confidence intervals actually exclude the forced trend. I'd be interested in seeing the results of this.

I'm not sure I understand the precise test you envision.

I'd like to do the precise test I think you are suggesting, but it's not quite possible because biases in models mean the multi-model mean is not the forced trend for individual models. If the mean trends in individual models do differ from each other, then we should expect the false positive rate for a "good" test to be greater than the target rate when testing whether individual runs differ from the forced trend. I don't know how much greater the false positive rate should be, and that's a question that has to be considered if we get something like, say, an 8% false positive rate where we expected 5%. Would the extra 3% be due to the method, or the fact that we are comparing the realizations to something that is not the forced trend?

Anyway, I've been planning to put a script together to do this for the full forecast. I think the general idea is discussed under a post called "The Carrot Eater test", which has a follow-on post discussing the issue of why the structural uncertainty in models matters to this, along with some (not necessarily good) estimates of how large that might be.

I have done other tests with the related intention of asking "if the models are correct, how do the variances in trends estimated using 'method X' compare to variances in models with more than 1 run". I have not presented these at my blog-- partly because I think readers might not be particularly interested, partly because I want to double check my results.

I've done a number of other tests.

I've discussed some of these in a sketchy way with Chip and would be happy to discuss them with you or anyone interested. But obviously, your comments block is not a convenient spot for me to discuss these things, owing to the combination of space limitations and it just not being a place where I can later easily find my own stuff. It's more convenient for me to present anything with any detail at my own blog, or later in a manuscript.

James said: "So rejecting the "nill hypothesis" that the MMM coincides with the truth would be a pretty worthless exercise."

Lucia replied "Maybe. Or maybe not. If it is, it's rather amazing that Santer's response to Douglass wasn't merely to point out that rejecting the null hypothesis that the MMM does not match the data is a pretty worthless exercise!"

If Douglass et al. had merely pointed out that the MMM does not match the data, then that may well have been the reaction. However Douglass et al. went much further than that. From the summary:

"We have tested the proposition that greenhouse model simulations and trend observations can be reconciled. Our conclusion is that the present evidence, with the application of a robust statistical test, supports rejection of this proposition."

Note that they claim that the model simulations cannot be reconciled with the observations, not the MMM. The prediction of the ensemble of model simulations is that the most plausible prediction is given by the MMM, with the credible interval given by the spread of the model runs. That means the models and observations are reconciled if the observations lie within the stated credible interval, and indeed they do.

Douglass et al. made a claim that is not supported by the test they actually performed, and as I pointed out, that is what caused the controversy.

Dikran Marsupial, >>Douglass et al. made a claim that is not supported by the test they actually performed, and as I pointed out, that is what caused the controversy.

I agree it did not. The reason is that the models did not pass a test that was constituted to detect a discrepancy between the multi-model mean and the observations.

But it seems to me that James claimed that the models actually failing a properly constituted test would be unimportant and I was responding to that claim.

I don't think no one cares about detecting whether the multi-model mean is high or low. There are many arguments we can have about whether or not bias has been demonstrated. Some people may think other questions are more interesting.

But even if the notion that the multi-model mean is not perfect is widely accepted among specialists, at any given time, policy makers and the public certainly care whether or not the big fat central line on graphs indicating what they are to expect is biased high or low. If data are indicating either situation, this is something that interests the public even if scientists think discussions of this are snooze inducing and should be greeted with yawns.

By the way, on another note: Singer sought me out to discuss his response to the Santer paper.

The conversation involved us mostly disagreeing on technical points, him thanking me for discussing his response and offering to thank me formally by adding a note to his paper when published. I asked him not to add any such formal thanks to the paper. (I'm not sure there is anything approaching a manuscript yet. If there is, I didn't see it.) Then, we went to lunch.

"... the IPCC figure you refer to never even claimed to present the GCM projections, rather it presents the output from a simple climate model (SCM) which was tuned to emulate the GCMs. This precludes the possibility of that graph representing short-term natural variability in any way whatsoever."

You're saying that the simple climate model is used to produce a picture -- an illustration, is that correct?

I gather Lucia can't use that picture to do short term analysis because it doesn't contain the information she's looking for -- is that right?

It's easy to find papers mentioning using SCMs in this way for various purposes -- perhaps a topic explaining it would be generally helpful.

Recall the problem creating "the digitized version of the 1990 IPCC curve" --mistakenly assuming it was more than a cartoon, deriving imaginary data from it, then testing that data. (Wegman, Barton hearing)

James, finishing the digression in hope it suggests a way to clarify what's being done -- from poking a bit, I would encourage you or some other climate scientist to write an explanation of what's being done in papers using SCMs and why it's done (time, expense do matter!).

Here as an example (perhaps relevant to the RC thread "On Attribution" about using local fingerprints):

"We present a methodology for quantifying the leading sources of uncertainty in climate change projections that allows more robust prediction of probability distribution functions (PDFs) for transient regional climate change than is possible, for example, with the multimodel ensemble in the the CMIP3 archive used for the IPCC Fourth Assessment. ....... The scaling uses a simple climate model (SCM), with global climate feedbacks and local response sampled from the equilibrium response, and other SCM parameters tuned to the response of other AOGCM ensembles. Use of the SCM allows efficient sampling of uncertainties not fully sampled by expensive GCM simulation, including uncertainty in aerosol radiative forcing, the rate of ocean heat uptake, and the strength of carbon-cycle feedbacks. Uncertainties arising from statistical components of the method, such as emulation or scaling, are quantified by validation with GCM ensemble output, and included as additional variance in our projections...."

"But it seems to me that James claimed that the models actually failing a properly constituted test would be unimportant and I was respoding to that claim."

My reading of James' comment was that the question answered by the "properly constituted test" was uninteresting, presumably because basic reasoning is enough to tell you the answer to the question without the need for performing a test. As I have pointed out the MMM is not intended as an accurate prediction of the observed trend, just of the forced component of that trend, so there is no reason to expect them to coincide.

This is one of the difficulties of frequentist statistics, an unbiased estimate of the forced component of the trend cannot be expected to be an accurate estimate of the observed trend on this particular Earth, just the best estimate on average over a large sample of alternate Earths with the same forcing, but different realisations of the internal variability. Sadly we can't observe these alternative realities to find out if the MMM actually is unbiased, which rather limits the conclusions we can draw from the one Earth we can actually observe.

I didn't say that "no one cares about detecting whether the multi-model mean is high or low". It is obviously high at the current time (you don't need a statistical test to see that). However that may just mean that the MMM is a good estimate of the forced component of the trend (which is what it is supposed to be), but the unforced component is sufficiently dominant to push the observed trend well into the lower tail of the ensemble. Or it could be that the unforced component is small, meaning the MMM is biased; however, AFAICS there isn't a way of telling which explanation is more plausible.

It seems to me a mistake to concentrate on any point estimate, rather than considering the distribution of plausible outcomes, especially if you are going to ignore the stated uncertainty of that point prediction and substitute a different one.

Dikran>>As I have pointed out the MMM is not intended as an accurate prediction of the observed trend, just of the forced component of that trend, so there is no reason to expect them to coincide.

Yes, you pointed this out and I agreed with you that the MMM is not intended to be an accurate prediction of the trend. I have never thought it was intended to be an accurate prediction of any individual realization of a trend.

It is worth noting that this is why the test in Santer is designed to test whether the MMM is consistent with the forced component associated with the observed trend.

You do understand this, right? You do understand why Santer's method is better than Douglass's method, right?

>>It seems to me a mistake to concentrate on any point estimate,

Of course. I don't know anyone who is currently concentrating on a point estimate.

Hank--Are you suggesting that use of SCMs as one of the processing steps to creating projections based on an ensemble of GCMS make the resulting projections presented to the public and policy makers in the "projections" sections of the IPCC report not projections?

BTW: I am familiar with SCMs. What I want to know is whether you think the projections actually presented to the public and policy makers in the form of figures, text and tables in the sections of the AR4 with headings like "projections" are not "the IPCC projections". If the material presented to the public in these sections is "not the projections", what is?

And if the IPCC does not include its projections in the report itself, do you think they should include their actual honest to goodness projections in future reports so that the public can learn what the projections actually are?

I wrote "As I have pointed out the MMM is not intended as an accurate prediction of the observed trend, just of the forced component of that trend, so there is no reason to expect them to coincide."

Lucia said "Yes, you pointed this out and I agreed with you that the MMM is not intended to be an accurate prediction of the trend. I have never thought it was intended to be an accurate prediction of any individual realization of a trend."

If we both agree that the MMM is not intended to be an accurate prediction of the observed trend, why are you expressing interest in statistical tests that determine if the MMM is consistent with the observed trend? As James said "... rejecting the "nill hypothesis" that the MMM coincides with the truth would be a pretty worthless exercise."

"(I would accept my calc is a bit of a hand-wave, but I don't think it is really our job here to try to do a detailed uncertainty analysis on the obs to challenge the figures that the originators of these analyses have actually published, such as Brohan et al for HadCRUT.)"

It seems that the job you have set yourself is to reconcile (or not) the short term variability in the observations with short term variability in the models. The observational error is a component of that variability and if you don't account for it properly, you run the risk of getting the wrong answer.

The models typically show their greatest rate of warming at high latitudes, particularly in the Arctic. HadCRUT3 systematically under-represents this area so is very likely to have a trend that is too low. The same is true, to a lesser extent, for the NCDC analysis.

If accounted for properly by subsetting the models to where there are observations in HadCRUT3, you won't get a slight broadening of your distribution as you claim, you will probably narrow the range (there's a lot of variability at high latitude) and almost certainly shift the whole thing down (the Arctic warms faster than the rest of the globe).

It would be interesting to see the radiosonde estimates of the lower tropospheric temperatures as well to give a more comprehensive estimate of the uncertainty on the upper air trends (a la Santer et al. and numerous similar). The trends in these analyses probably lie closer to the middle of your distribution and if you exclude them it might look like cherry picking.

"Uncertainty in predictions of anthropogenic climate change arises at all stages of the modelling process described in Section 10.1. .... Probabilistic estimates of climate sensitivity and TCR from SCMs and EMICs are assessed in Section 9.6 and compared with estimates from AOGCMs in Box 10.2."

>>But this has nothing to do with the paper being discussed, does it?

No. But since you are trying to make points about things that have nothing to do with the paper being discussed, you should anticipate that others might respond to the things you say.

Thank you for telling me you think the AR4 was clear. But what is your actual answer to the question I asked? Is it that you think a section discussing uncertainty in estimating sensitivity describes projections?

Like you, I also think the AR4 is quite clear. Unlike you, who quote text that * does not describe projections (it discusses sensitivity instead), * does not contain the word "projections", and * comes from a section that does not contain the word "projections" in the heading,

I will quote from a text that * uses the word "projections", * specifically discusses "projections" of temperature (not sensitivity, which is a property, not a projection), * comes from a section that discusses "projections", and finally * is in the single most visible portion of the AR4: the Summary for Policymakers.

With no further ado, this is a quote from the section entitled "Projections of Future Changes in Climate" in the SPM of the AR4:

"Best estimates and likely ranges for global average surface air warming for six SRES emissions marker scenarios are given in this assessment and are shown in Table SPM.3. For example, the best estimate for the low scenario (B1) is 1.8°C (likely range is 1.1°C to 2.9°C), and the best estimate for the high scenario (A1FI) is 4.0°C (likely range is 2.4°C to 6.4°C). Although these projections are broadly consistent with the span quoted in the TAR (1.4°C to 5.8°C), they are not directly comparable (see Figure SPM.5). The Fourth Assessment Report is more advanced as it provides best estimates and an assessed likelihood range for each of the marker scenarios. The new assessment of the likely ranges now relies on a larger number of climate models of increasing complexity and realism, as well as new information regarding the nature of feedbacks from the carbon cycle and constraints on climate response from observations. {10.5}"

Note that Figure SPM.5 illustrates "the projections" in the AR4. One could compare these to "the projections" from the TAR, also provided in the form of a figure (which evidently you consider to be a "cartoon"). Comparison of the two "cartoons" would permit those who read both documents to see that the form of the projections is not directly comparable (as stated in the supporting text). That is to say: examination of the figure showing the projections permits you to diagnose a feature of the projections.

So, it seems very clear to me that the figure and table that the authors of the AR4 tell us describe their projections actually describe "the projections".

In contrast, the section you quote, which discusses the uncertainties in estimating the magnitude of a property called "sensitivity", doesn't come close to telling anyone what the "projections" of anything might be. The sensitivity could be any number whatsoever; without knowledge of all sorts of other things like, at a minimum, heat uptake by the oceans, current forcings, and the entire time evolution of forcings, one could not even use knowledge of the sensitivity to create projections.

I'm pretty sure you understand the difference between an estimate of the magnitude of sensitivity and a projection. So, I am amazed that you would quote such a thing when trying to communicate what you think the projections are!

Dikrum>>If we both agree that the MMM is not intended to be an accurate prediction of the observed trend, why are you expressing interest in statistical tests that determine if the MMM is consistent with the observed trend?

Because testing whether the MMM is consistent with an observed trend does not involve assuming that the MMM will accurately predict any individual trends. I don't know why you think it does.

Once again: Do you understand what Santer did in the paper you asked James about? Do you understand what the {s(bo)} terms in his various equations are supposed to describe? (In words, not numbers?)

Lucia said "Because testing whether the MMM is consistent with an observed trend does not involve assuming that the MMM will accurately predict any individual trends. I don't know why you think it does."

Because if A is an estimator of B, then there is no reason to expect A to be consistent with C = f(B,D).

I don't know how to put it any more clearly than that: if the MMM is not intended to be an estimator of the observed trend (just the forced component), there is no reason to expect it to be consistent with the observations. However, I am willing to be corrected on this point if you can give a good reason why it should be consistent with the observations (other than in the sense that the observations lie within the spread of the ensemble).

And yes, I do understand the point of the Santer test; the question is why it (eqn 12) is superior to seeing if the observations lie within the spread of the ensemble, which AFAICS is the obvious test for consistency of the ensemble.
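For readers trying to follow the disagreement, the two kinds of test can be sketched side by side. The first statistic below divides the gap between the MMM and the observed trend by the standard error of the ensemble mean combined with the uncertainty of the observed trend, which is the general form of a mean-based test; the second asks whether the observed trend is a plausible single draw from the ensemble spread. The function names and all trend values are hypothetical, and neither function is claimed to reproduce eqn 12 exactly.

```python
import math
import statistics


def mean_based_statistic(model_trends, obs_trend, obs_trend_se):
    """Gap between MMM and observation, scaled by the uncertainty of the MEAN."""
    n = len(model_trends)
    mmm = statistics.fmean(model_trends)
    s_models = statistics.stdev(model_trends)  # inter-model standard deviation
    return (mmm - obs_trend) / math.sqrt(s_models ** 2 / n + obs_trend_se ** 2)


def spread_based_statistic(model_trends, obs_trend):
    """Gap between MMM and observation, scaled by the SPREAD of the ensemble.

    The factor sqrt(1 + 1/n) allows for the mean itself being estimated
    from a finite ensemble.
    """
    n = len(model_trends)
    mmm = statistics.fmean(model_trends)
    s_models = statistics.stdev(model_trends)
    return (mmm - obs_trend) / (s_models * math.sqrt(1.0 + 1.0 / n))


# Hypothetical trends in degC/decade, for illustration only.
model_trends = [0.10, 0.15, 0.20, 0.25, 0.30]
obs_trend = 0.10
obs_trend_se = 0.05  # assumed uncertainty of the observed trend

d_mean = mean_based_statistic(model_trends, obs_trend, obs_trend_se)
d_spread = spread_based_statistic(model_trends, obs_trend)
print(f"mean-based statistic:   {d_mean:.3f}")
print(f"spread-based statistic: {d_spread:.3f}")
```

Note that the first denominator shrinks as more models are added while the second does not, so the two statistics answer different questions: whether the MMM matches the forced component, versus whether the observations look like a member of the ensemble.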

> warming is almost exclusively confined to the dry, cold, anticyclones of Siberia and northwestern North America.... Warming of this air mass type may, in fact, be benign or even beneficial ....

Sorry, I got called away on an urgent holiday. Thank you for all carrying on in my absence. Was there anything in particular more I needed to say?

Lucia, with reference to those descriptions of the projections, there is a good reason why they are all stated on a 100y time scale - notably, that over such a long time we can basically ignore internal variability, but over 20 years the IPCC only said the trend was likely to be about 0.2C/decade (and even then they did not precisely quantify what that meant). And that statement says very little about the <10y trend.

I think not exactly, Hank. Remember that what the models are trying to project is the likely response to the parts of the system known to change quickly. As the longer-term feedbacks kick in (generally assumed to require >100 years, although that assumption looks increasingly shaky), there will be a substantial divergence, and as I keep mentioning studies of the Miocene and especially Pliocene warm periods give us a pretty good idea of where we'll end up (the equilibrium conditions, anyway, bearing in mind that the really unpleasant effects would be associated with the transient) if we stay on the present path.

Well the slope over say 50 years is a function of CO2 concentration increase, sensitivity and ocean heat uptake. The final equilibrium (assuming concentration stabilises) will be proportional to the sensitivity, but obviously depends on the stabilisation level. The simple models can simulate this sort of behaviour pretty well IMO.
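As a rough illustration of that behaviour, here is a minimal one-box energy balance sketch. It is not any particular SCM: the heat capacity, the forcing for doubled CO2 (3.7 W/m^2) and the 70-year forcing ramp are all assumed values chosen for illustration, and a single box is a much cruder treatment of ocean heat uptake than real simple models use.

```python
def run_box_model(sensitivity, forcing, heat_capacity=8.0, f2x=3.7, dt=1.0):
    """Integrate C dT/dt = F(t) - lambda * T with forward Euler steps.

    sensitivity   : equilibrium warming for doubled CO2 (K)
    forcing       : list of yearly forcing values (W/m^2)
    heat_capacity : assumed effective heat capacity (W yr m^-2 K^-1)
    f2x           : assumed forcing for doubled CO2 (W/m^2)
    """
    feedback = f2x / sensitivity  # lambda, in W m^-2 K^-1
    temperature = 0.0
    history = []
    for f in forcing:
        temperature += dt * (f - feedback * temperature) / heat_capacity
        history.append(temperature)
    return history


RAMP_YEARS = 70     # forcing rises linearly to f2x, then stabilises
TOTAL_YEARS = 570
forcing = [3.7 * min(year, RAMP_YEARS) / RAMP_YEARS
           for year in range(1, TOTAL_YEARS + 1)]

low = run_box_model(2.0, forcing)   # sensitivity of 2 K
high = run_box_model(3.0, forcing)  # sensitivity of 3 K

print(f"warming at end of ramp: {low[RAMP_YEARS - 1]:.2f} K vs {high[RAMP_YEARS - 1]:.2f} K")
print(f"final warming:          {low[-1]:.2f} K vs {high[-1]:.2f} K")
```

In this sketch the final warming converges on the chosen sensitivity once the forcing stabilises at the doubled-CO2 level, while the warming during the ramp lags behind it by an amount set by the heat capacity.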

When you look at the 10y time scale, there is lots of natural variability to consider, superimposed on this forced response.

There was some mention of the solar cycle in this thread, but I think it was shortchanged. The last solar maximum occurred somewhere around 2001 and the solar minimum was last year (I think). Most estimates put the solar cycle worth about 0.1 degrees from peak to trough. Correct me if I'm wrong, but if model runs of the "future" leave out the solar cycle, and the "future" began in 2000, then we would expect observations to be about 0.1 degrees cooler now than they would be if the solar cycle didn't exist, correct?

Another point. Not that it would be easy, but it would be better to apply masks to the model output and calculate those anomalies based on the same area covered by the various temperature analyses. That would be a more apples to apples comparison.
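A minimal sketch of what such masking involves, assuming a regular latitude-longitude grid with cos(latitude) area weights. The grid, the trend values and the coverage mask are invented for illustration and do not represent HadCRUT3 or any model.

```python
import math


def area_weighted_mean(field, lats_deg, mask=None):
    """Area-weighted mean of field[lat][lon], optionally restricted by a mask.

    mask[lat][lon] is True where observations exist; cells where it is
    False are left out of both the weighted sum and the total weight.
    """
    weighted_sum = 0.0
    total_weight = 0.0
    for i, lat in enumerate(lats_deg):
        weight = math.cos(math.radians(lat))  # cell area scales with cos(lat)
        for j, value in enumerate(field[i]):
            if mask is not None and not mask[i][j]:
                continue
            weighted_sum += weight * value
            total_weight += weight
    return weighted_sum / total_weight


# Toy grid: six latitude bands, four longitude columns.
lats = [-75, -45, -15, 15, 45, 75]
n_lon = 4

# Hypothetical model trends (degC/decade): faster warming in the Arctic band.
trend = [[0.4 if lat == 75 else 0.2 for _ in range(n_lon)] for lat in lats]

# Hypothetical observational coverage: nothing poleward of 60 degrees.
coverage = [[abs(lat) < 60 for _ in range(n_lon)] for lat in lats]

full_mean = area_weighted_mean(trend, lats)
masked_mean = area_weighted_mean(trend, lats, coverage)
print(f"full global mean trend: {full_mean:.4f}")
print(f"masked mean trend:      {masked_mean:.4f}")
```

With these made-up numbers the masked mean comes out lower than the full mean, because the fast-warming high-latitude cells are the ones dropped.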

I agree that in principle accounting for all the various factors should give a more "pure" comparison of the underlying models and climate system. Another possible tweak would be to back out the ENSO effect from both models and data, so as to get closer to the forced response, and we will probably do at least some of these things.
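One simple way to back out an ENSO effect is to regress the temperature series jointly on time and an ENSO index, then subtract the fitted ENSO term. The sketch below does this on synthetic data; the index, the coefficients and the lack of any lag between index and temperature are all assumptions made for illustration, not a description of the method that would actually be used.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x


def fit_trend_and_enso(times, temps, enso):
    """Least-squares fit of temps = intercept + trend * time + coef * enso."""
    cols = [[1.0] * len(times), list(times), list(enso)]
    ata = [[sum(u * v for u, v in zip(ci, cj)) for cj in cols] for ci in cols]
    atb = [sum(u * y for u, y in zip(ci, temps)) for ci in cols]
    return solve_linear(ata, atb)


def remove_enso(temps, enso, enso_coef):
    """Subtract the fitted ENSO contribution from the temperature series."""
    return [y - enso_coef * e for y, e in zip(temps, enso)]


# Synthetic monthly data over ten years (all values hypothetical).
times = [month / 12.0 for month in range(120)]
enso = [math.sin(2.0 * math.pi * t / 3.7) for t in times]
temps = [0.02 * t + 0.1 * e for t, e in zip(times, enso)]

intercept, trend, enso_coef = fit_trend_and_enso(times, temps, enso)
cleaned = remove_enso(temps, enso, enso_coef)
print(f"fitted trend: {trend:.4f} per year, ENSO coefficient: {enso_coef:.4f}")
```

The same fit would be applied separately to each model run and to each observational series, so that what remains in every case is closer to the forced response.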

OTOH these models (including the representation of forcings) were presented as state of the art and the temperature time series are also routinely referred to as representing global mean temperature, so I don't think it is invalid to perform the comparison on that basis. Solar should certainly be mentioned as a contributor to the discrepancy, however.

Can you compare your description above to the descriptions of the material recently discussed here: http://klimazwiebel.blogspot.com/2010/12/tony-gilland-time-to-move-on-from-ipcc.html?showComment=1292515463147#c3286621 (what was submitted to GRL? and what was presented at Heartland?)

I think the Heartland presentation was basically the same material as the GRL manuscript, though I am commenting now from memory. I don't think Chip's assessment in that linked comment is unreasonable, though it's just one possible perspective.