Brown and Caldeira: A closer look shows global warming will not be greater than we thought

A guest post by Nic Lewis

Introduction

Last week a paper predicting greater than expected global warming, by scientists Patrick Brown and Ken Caldeira, was published by Nature.[1] The paper (henceforth referred to as BC17) says in its abstract:

“Across-model relationships between currently observable attributes of the climate system and the simulated magnitude of future warming have the potential to inform projections. Here we show that robust across-model relationships exist between the global spatial patterns of several fundamental attributes of Earth’s top-of-atmosphere energy budget and the magnitude of projected global warming. When we constrain the model projections with observations, we obtain greater means and narrower ranges of future global warming across the major radiative forcing scenarios, in general. In particular, we find that the observationally informed warming projection for the end of the twenty-first century for the steepest radiative forcing scenario is about 15 per cent warmer (+0.5 degrees Celsius) with a reduction of about a third in the two-standard-deviation spread (−1.2 degrees Celsius) relative to the raw model projections reported by the Intergovernmental Panel on Climate Change.”

Patrick Brown’s very informative blog post about the paper gives a good idea of how they reached these conclusions. As he writes, the central premise underlying the study is that climate models that are going to be the most skilful in their projections of future warming should also be the most skilful in other contexts like simulating the recent past. It thus falls within the “emergent constraint” paradigm. Personally, I’m doubtful that emergent constraint approaches generally tell one much about the relationship to the real world of aspects of model behaviour other than those which are closely related to the comparison with observations. However, they are quite widely used.

In BC17’s case, the simulated aspects of the recent past (the “predictor variables”) involve spatial fields of top-of-the-atmosphere (TOA) radiative fluxes. As the authors state, these fluxes reflect fundamental characteristics of the climate system and have been well measured by satellite instrumentation in the recent past – although (multi) decadal internal variability in them could be a confounding factor. BC17 derive a relationship in current generation (CMIP5) global climate models between predictors consisting of three basic aspects of each of these simulated fluxes in the recent past, and simulated increases in global mean surface temperature (GMST) under IPCC scenarios (ΔT). Those relationships are then applied to the observed values of the predictor variables to derive an observationally-constrained prediction of future warming.[2]

The paper is well written, the method used is clearly explained in some detail and the authors have archived both pre-processed data and their code.[3] On the face of it, this is an exemplary study, and given its potential relevance to the extent of future global warming I can see why Nature decided to publish it. I am writing an article commenting on it for two reasons. First, because I think BC17’s conclusions are wrong. And secondly, to help bring to the attention of more people the statistical methodology that BC17 employed, which is not widely used in climate science.

What BC17 did

BC17 uses three measures of TOA radiative flux: outgoing longwave radiation (OLR), outgoing shortwave radiation (OSR) – being reflected solar radiation – and the net downwards radiative imbalance (N).[4] The aspects of each of these measures that are used as predictors are their climatology (the 2001–2015 mean), the magnitude (standard deviation) of their seasonal cycle, and their monthly variability (the standard deviation of their deseasonalized monthly values). These are all cell mean values on a grid with 37 latitudes and 72 longitudes, giving nine predictor fields – three aspects (climatology, seasonal cycle and monthly variability) for each of three variables (OLR, OSR and N) – each with 2,664 values. So, for each climate model there are up to 23,976 predictors of GMST change.
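In code, these three aspects might be computed along the following lines. This is a Python sketch of my own, not BC17's Matlab code; the function name `predictor_aspects` and the array layout (months first, then latitude and longitude) are assumptions for illustration only:

```python
import numpy as np

def predictor_aspects(flux):
    """flux: (n_months, n_lat, n_lon) monthly-mean TOA flux, n_months a
    multiple of 12. Returns three (n_lat, n_lon) fields: the climatology,
    the seasonal-cycle magnitude, and the deseasonalized monthly variability."""
    n_months = flux.shape[0]
    n_years = n_months // 12
    # climatology: simple mean over the whole period
    climatology = flux.mean(axis=0)
    # mean annual cycle: average each calendar month across the years
    cycle = flux.reshape(n_years, 12, *flux.shape[1:]).mean(axis=0)
    # seasonal-cycle magnitude: SD of the 12-point mean annual cycle
    seasonal_sd = cycle.std(axis=0)
    # monthly variability: SD after subtracting the mean annual cycle
    anomalies = flux - np.tile(cycle, (n_years, 1, 1))
    monthly_sd = anomalies.std(axis=0)
    return climatology, seasonal_sd, monthly_sd
```

Each of the three returned fields has 37 × 72 = 2,664 values, matching the per-field predictor count quoted above.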

BC17 consider all four IPCC RCP scenarios and focus on mid-century and end-century warming; in each case there is a single predictand, ΔT. They term the ratio of the ΔT predicted by their method to the unweighted mean of the ΔT values actually simulated by each of the models involved the ‘Prediction ratio’. They assess the predictive skill as the ratio of the root-mean-square error of the differences for each model between its predicted ΔT and its actual (simulated) ΔT, to the standard deviation of the simulated changes across all the models. They call this the Spread ratio. For this purpose, each model’s predicted ΔT is calculated using the relationship between the predictors and ΔT determined using only the remaining models.[5]
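The two ratios can be sketched in code as follows. This is my own illustration, not the paper's code: for simplicity it uses ordinary least squares with an intercept as a stand-in for BC17's PLS regression, and the function names are mine:

```python
import numpy as np

def ols_fit_predict(X_train, y_train, x_new):
    """Stand-in statistical model: OLS with an intercept (BC17 use PLS)."""
    A = np.column_stack([np.ones(len(y_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return float(np.concatenate([[1.0], x_new]) @ coef)

def prediction_ratio(X, y, x_obs, fit_predict):
    """Observationally constrained prediction of warming, divided by the
    raw (unweighted) multimodel-mean warming."""
    return fit_predict(X, y, x_obs) / y.mean()

def spread_ratio(X, y, fit_predict):
    """Hold-one-out RMSE of model-by-model predictions, divided by the
    across-model standard deviation of simulated warming y.
    A ratio below one indicates skill beyond the naive multimodel mean."""
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i        # refit with model i held out
        preds[i] = fit_predict(X[keep], y[keep], X[i])
    rmse = np.sqrt(np.mean((preds - y) ** 2))
    return rmse / y.std(ddof=1)
```

Here `X` holds one row of predictor values per climate model and `y` the corresponding simulated ΔT values, so each held-out model's warming is predicted from a relationship fitted to the remaining models only, as the paper describes.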

As there are more predictors than data realizations (with each CMIP5 model providing one realization), using them directly to predict ΔT would involve massive over-fitting. The authors avoid over-fitting by using a partial least squares (PLS) regression method. PLS regression is designed to compress as much as possible of the relevant information in the predictors into a small number of orthogonal components, ranked in order of (decreasing) relevance to predicting the predictand(s), here ΔT. The more PLS components used, the more accurate the in-sample predictions will be, but beyond some point over-fitting will occur. The method involves eigen-decomposition of the cross-covariance, in the set of models involved, between the predictors and ΔT. It is particularly helpful when there are a large number of collinear predictors, as here. PLS is closely related to statistical techniques such as principal components regression and canonical correlation analysis. The number of PLS components to retain is chosen having regard to prediction errors estimated using cross-validation, a widely-used technique.[6] BC17 illustrates use of up to ten PLS components, but bases its results on using the first seven PLS components, to ensure that over-fitting is avoided.
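For readers unfamiliar with PLS, a minimal single-predictand implementation (PLS1, with NIPALS-style deflation) might look as follows. This is an illustrative sketch of the general technique, not the paper's Matlab code; variable names are mine, and a real analysis would use an established routine such as Matlab's plsregress:

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Minimal PLS1: each component is the direction of maximum covariance
    with y in the (deflated) predictors. Returns (xbar, ybar, B) such that
    predictions are yhat = ybar + (X_new - xbar) @ B."""
    xbar, ybar = X.mean(axis=0), y.mean()
    Xk, yk = X - xbar, y - ybar
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk                     # covariance-maximizing direction
        w /= np.linalg.norm(w)            # unit-length weight vector
        t = Xk @ w                        # component scores
        p = Xk.T @ t / (t @ t)            # predictor loadings
        qk = (yk @ t) / (t @ t)           # predictand loading
        Xk = Xk - np.outer(t, p)          # deflate predictors and predictand
        yk = yk - qk * t
        W.append(w); P.append(p); q.append(qk)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    B = W @ np.linalg.solve(P.T @ W, q)   # coefficients on original predictors
    return xbar, ybar, B

def pls1_predict(X_train, y_train, X_new, n_components):
    xbar, ybar, B = pls1_fit(X_train, y_train, n_components)
    return (X_new - xbar) @ B + ybar
```

With the maximum number of components this reduces to ordinary least squares; the point of the method is that truncating to a few components regularizes the fit when, as here, there are vastly more (collinear) predictors than models.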

The main result of the paper, as highlighted in the abstract, is that for the highest-emissions RCP8.5 scenario predicted warming circa 2090 [7] is about 15% higher than the raw multimodel mean, and has a spread only about two-thirds as large as that for the model-ensemble. That is, the Prediction ratio in that case is about 1.15, and the Spread ratio is about 2/3. This is shown in their Figure 1, reproduced below. The left hand panels all involve RCP8.5 2090 ΔT as the predictand, but with the nine different predictors used separately. The right hand panels involve different predictands, with all predictors used simultaneously. The Prediction ratio and Spread ratio for the main-results RCP8.5 2090 ΔT case highlighted in the abstract are shown by the solid red lines in panels b and d respectively, at an x-axis value of 7 PLS components.

Figure 1. Sensitivity of results to predictors or predictands used and to the number of PLS components used. a, Prediction ratios for the nine predictor fields, each individually targeting the ΔT 2090-RCP8.5 predictand. b, As in a but using all nine of the predictor fields simultaneously while switching the predictand that is targeted. c, d As in a, b respectively but showing the Spread ratios using hold-one-out cross-validation.

Is there anything to object to in this work, leaving aside issues with the whole emergent constraint approach? Well, it seems to me that Figure 1 shows their main result to be unsupportable. Had I been a reviewer I would have recommended against publishing, in the current form at least. As it is, this adds to the list of Nature-brand journal climate science papers that I regard as seriously flawed.

Where BC17 goes wrong, and how to improve its results

The issue is simple. The paper is, as it says, all about increasing skill: making better projections of future warming with narrower ranges by constraining model projections with observations. In order to be credible, and to narrow the projection range, the predictions of model warming must be superior to a naïve prediction that each model’s warming will be in line with the multimodel average. If that is not achieved – the Spread ratio is not below one – then no skill has been shown, and therefore the Prediction ratio has no credibility. It follows that the results with the lowest Spread ratio – the highest skill – are prima facie most reliable and to be preferred.

Figure 1 provides an easy comparison between different predictors of skill in predicting ΔT for the RCP8.5 2090 case. That case, as well as being the one dealt with in the paper’s abstract, involves the largest warming and as a result the highest signal to noise ratio. Moreover, it has data for nearly all the models (36 out of 40). Accordingly, RCP8.5 2090 is the ideal case for skill comparisons. Henceforth I will be referring to that case unless stated otherwise.

Panel d shows, as the paper implies, that use of all the predictors results in a Spread ratio of about 2/3 with 7 PLS components. The Spread ratio falls marginally to 65% with 10 PLS components. The corresponding Prediction ratios are 1.137 with 7 components and 1.141 with 10 components. One can debate how many PLS components to retain, but it makes very little difference whether 7 or 10 are used and 7 is the safer choice. The 13.7% uplift in predicted warming in the 7 PLS components case used by the authors is rather lower than the “about 15%” stated in the paper, but no matter.

The key point is this. Panel c shows that using just the OLR seasonal cycle predictor produces a much more skilful result than using all predictors simultaneously. The Spread ratio is only 0.53 with 7 PLS components (0.51 with 10). That is substantially more skilful than when using all the predictors – a 40% greater reduction below one in the Spread ratio. Therefore, the results based on just the OLR seasonal cycle predictor must be considered to be more reliable than those based on all the predictors simultaneously.[8] Accordingly, the paper’s main results should have been based on them in preference to the less skilful all-predictors-simultaneously results. Doing so would have had a huge impact. The RCP8.5 2090 Prediction ratio using the OLR seasonal cycle predictor is under half that using all predictors – it implies a 6% uplift in projected warming, not “about 15%”.

Of course, it is possible that an even better predictor, not investigated in BC17, might exist. For instance, although use of the OLR seasonal cycle predictor is clearly preferable to use of all predictors simultaneously, some combination of two predictors might provide higher skill. It would be demanding to test all possible cases, but as the OLR seasonal cycle is far superior to any other single predictor it makes sense to test all two-predictor combinations that include it. I accordingly tested all combinations of OLR seasonal cycle plus one of the other eight predictors. None of them gave as high a skill (as low a Spread ratio) as just using the OLR seasonal cycle predictor, but they all showed more skill than the use of all nine predictors simultaneously, save to a marginal extent in one case.

In view of the general pattern of more predictors producing a less skilful result, I thought it worth investigating using just a cut down version of the OLR seasonal cycle spatial field. The 37 latitude, 72 longitude grid provides 2,664 variables, still an extraordinarily large number of predictors when there are only 36 models, each providing one instance of the predictand, to fit the statistical model. It is thought that most of the intermodel spread in climate sensitivity, and hence presumably in future warming, arises from differences in model behaviour in the tropics.[9] Therefore, the effect of excluding higher latitudes from the predictor field seemed worth investigating.

I tested use of the OLR seasonal cycle over the 30S–30N latitude zone only, thereby reducing the number of predictor variables to 936 – still a large number, but under 4% of the 23,976 predictor variables used in BC17. The Spread ratio declined further, to 51% using 7 PLS components.[10] Moreover, the Prediction ratio fell to 1.03, implying a trivial 3% uplift of observationally-constrained warming.
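The restriction itself is straightforward. Assuming a regular 5° grid running from 90S to 90N (my inference from the 37 × 72 dimensions quoted earlier), keeping only latitudes from 30S to 30N retains 13 of the 37 latitude rows:

```python
import numpy as np

lats = np.linspace(-90, 90, 37)          # 5-degree latitude grid (assumed layout)
field = np.zeros((37, 72))               # stand-in OLR seasonal-cycle field
tropics = (lats >= -30) & (lats <= 30)   # 13 latitude rows between 30S and 30N
predictors = field[tropics].ravel()      # 13 * 72 = 936 predictor variables
```

The 936 retained values match the predictor count quoted above.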

I conclude from this exercise that the results in BC17 are not supported by a more careful analysis of their data, using their own statistical method. Instead, results based on what appears to be the most appropriate choice of predictor variables – OLR seasonal cycle over 30S–30N latitudes – indicate a negligible (3%) increase in mean predicted warming, and on their face support a greater narrowing of the range of predicted warming.

Possible reasons for the problems with BC17’s application of PLS regression

It is not fully clear to me why using all the predictors simultaneously results in much less skilful prediction than using just the OLR seasonal cycle. I am not very experienced in the use of PLS regression, but my understanding is that it should create components that each in turn consist of an optimal mixture – having regard to maximizing retained cross-covariance between the predictors and the predictand – of all the available predictors that is orthogonal to that in earlier components. Naïvely, one might therefore have expected it to be a case of “the more the merrier” in terms of adding additional predictors. However, that is clearly not the case.

One key issue may be that BC17 use data that have been only partially standardized. For any given level of correlation with the predictand, a high-variance predictor variable will have a higher covariance with it than a low-variance one. That will result in higher variance predictor variables tending to be more dominant in the decomposition than lower variance ones, and more highly weighted in the PLS components. To avoid this effect, it is common when using techniques such as PLS that involve eigen-decomposition to standardize all predictor variables to unit standard deviation. BC17 do standardize predictor variables, but they divide them by the global model-mean standard deviation of each predictor field, not by their individual standard deviations at each spatial location. That still leaves the standard deviations of individual predictor variables at different spatial locations varying by a factor of up to 20 within each predictor field.
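The distinction between the two standardizations is easy to demonstrate on synthetic data. This is a sketch of my own; the roughly twenty-fold spread in per-location standard deviations mimics the situation described above:

```python
import numpy as np

rng = np.random.default_rng(0)
scales = rng.uniform(0.1, 2.0, size=2664)    # per-location SDs varying ~20-fold
X = rng.normal(size=(36, 2664)) * scales     # 36 models x 2664 grid-cell values

# Partial (BC17-style): divide the whole field by a single field-level scale,
# leaving per-location standard deviations still varying widely
X_partial = X / X.std(axis=0).mean()

# Full standardization: divide each grid cell by its own across-model SD,
# so every predictor variable has unit standard deviation
X_full = X / X.std(axis=0)
```

After partial standardization the per-location standard deviations retain their original relative spread, so high-variance locations still dominate the cross-covariance; after full standardization all locations contribute on an equal footing.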

When I apply full standardization to the predictor variables in the all-predictors-simultaneously case, the excess over one of the Prediction ratio halves, to 7.0%, using 7 PLS components. The Spread ratio increases marginally, but it is only an approximate measure of skill so that is of little significance. By contrast, when only the OLR seasonal cycle predictor field is used, either in full or restricted to latitudes 30S–30N, full standardization has only a marginal impact on the Prediction ratio. These findings provide further evidence that BC17’s results, based on use of all predictor variables without full standardization, are unstable and much less reliable than results based on use of only the OLR seasonal cycle predictor field, whether extending across the globe or just tropical latitudes.

Why BC17’s results would be highly doubtful even if their application of PLS were sound

Despite their superiority over BC17’s all-predictors-simultaneously results, I do not think that revised results based on use of only the OLR seasonal cycle predictor, over 30S–30N, would really provide a guide to how much global warming there would actually be late this century on the RCP8.5 scenario, or any other scenario. BC17 make the fundamental assumption that the relationship of future warming to certain aspects of the recent climate that holds in climate models also applies in the real climate system. I think this is an unfounded, and very probably invalid, assumption. Therefore, I see no justification for using observed values of those aspects to adjust model-predicted warming to correct model biases relating to those aspects, which is in effect what BC17 does.

Moreover, it is not clear that the relationship that happens to exist in CMIP5 models between present day biases and future warming is a stable one, even in global climate models. Webb et al (2013), who examined the origin of differences in climate sensitivity, forcing and feedback in the previous generation of climate models, reported that they “do not find any clear relationships between present day biases and forcings or feedbacks across the AR4 ensemble”.

Furthermore, it is well known that some CMIP5 models have significantly non-zero N (and therefore also biased OLR and/or OSR) in their unforced control runs, despite exhibiting almost no trend in GMST. Since a long-term lack of trend in GMST should indicate zero TOA radiative flux imbalance, this implies the existence of energy leakages within those models. Such models typically appear to behave unexceptionally in other regards, including as to future warming. However, they will have a distorted relationship between climatological values of TOA radiative flux variables and future warming that is not indicative of any genuine relationship between them that may exist in climate models, let alone of any such relationship in the real climate system.

There is yet a further indicator that the approach used in the study tells one little even about the relationship in models between the selected aspects of TOA radiative fluxes and future warming. As I have shown, in CMIP5 models that relationship is considerably stronger for the OLR seasonal cycle than for any of the other predictors or any combination of predictors. But it is well established that the dominant contributor to intermodel variation in climate sensitivity is differences in low cloud feedback. Such differences affect OSR, not OLR, so it would be surprising that an aspect of OLR would be the most useful predictor of future warming if there were a genuine, underlying relationship in climate models between present day aspects of TOA radiative fluxes and future warming.

Conclusion

To sum up, I have shown strong evidence that this study’s results and conclusions are unsound. Nevertheless, the authors are to be congratulated on bringing the partial least squares method to the attention of a wide audience of climate scientists, for the thoroughness of their methods section and for making pre-processed data and computer code readily available, hence enabling straightforward replication of their results and testing of alternative methodological choices.

[2] Uncertainty ranges for the predictions are derived from cross-validation based estimates of uncertainty in the relationships between the predictors and the future warming. Other sources of uncertainty are not accounted for.

[3] I had some initial difficulty in running the authors’ Matlab code, as a result of only having access to an old Matlab version that lacked necessary functions, but I was able to adapt an open source version of the Matlab PLS regression module and to replicate the paper’s key results. I thank Patrick Brown for assisting my efforts by providing by return of email a missing data file and the non-standard color map used.

[4] N = incoming solar radiation – OSR – OLR; with OSR and OLR being correlated, there is only partial redundancy in also using the derived measure N.

[6] For each CMIP5 model, ΔT is predicted based on a fit estimated with that model excluded. The average of the squared resulting prediction errors will start to rise when too many PLS components are used.

[8] Although I only quote results for the RCP8.5 2090 case, which is what the abstract covers, I have checked that the same is also true for the RCP4.5 2090 case (a Spread ratio of 0.66 using 7 PLS components, against 0.85 when using all predictors). In view of the large margin of superiority in both cases it seems highly probable that use of the OLR seasonal cycle produces more skilful predictions for all predictand cases.

43 Comments

As he writes, the central premise underlying the study is that climate models that are going to be the most skilful in their projections of future warming should also be the most skilful in other contexts like simulating the recent past.

In the US, when brokerage houses advertise, to keep from deceiving the rubes they are required by law to include language like this:

“investments are quite different from physical systems.” Very true! But there are also similarities, in that both the climate and returns on financial assets are complex, chaotic systems about which making predictions of future events is fiendishly difficult.

Climate models and financial (e.g., growth) models are similar in that they are both underspecified and mis-specified. They cannot include all the correct variables in the correct forms, and some independent variables are either underspecified or omitted due to lack of data or inability of modelers. If you include too many variables (overspecification) the prediction algorithm contains redundant predictor variables: the model may be “correct,” but you have gone overboard by adding redundant predictors, leading to problems such as inflated standard errors for the regression coefficients (i.e., overconfidence in the prediction algorithm). As such, these models should be used with great caution for prediction of a response, and cannot be used to ascribe cause and effect. Feynman would say that we have gone overboard and made the model more complicated and harder to understand than necessary – therefore make it simpler.

Nic,
Thank you again for a clear exposition with a stark conclusion.
The first Analysis of Variance I was asked to perform was by hand, before even calculators were easily available. The design was necessarily limited to a few hundred combinations of variables, not the ‘up to 23,976’ you calculated here. It was drummed into me that inspection of all or most of the values of these combinations was an integral part of the procedure, to see if the calculated values contained any that were contrary to expectation, or just looked plain silly. A single wrong data entry can lead to this. If you have a design giving 20,000 combinations, manual inspection needs to be replaced with an automated one. I am not sure that this can be done effectively, particularly to find a wrong entry.
It still remains the case that most of the findings from climate science suffer from lack of formal, rigid error analysis. It would be quite a task for a statistician to create and calculate a proper trail of errors as this BC17 exercise progressed through its many successive steps. One is left again with the lurking suspicion that proper, final error bars would be larger by far than the results claimed. Geoff.

So they couldn’t see that Fig 1 part c destroys their entire paper! Apparently not fully conversant with statistics, rather a recurring theme with climate scientists. Perhaps they will discover an error in the OLR seasonal data and thus remove this troublesome parameter. A wonderful side benefit of this is that they could then accept Nic’s observation that a single parameter does better than all of them put together, and point out that the OLR climatology parameter (which appears to work about as well as all of them together) leads to a 20% increase in the 2090 temperature!

The key point is this. Panel c shows that using just the OLR seasonal cycle predictor produces a much more skillful result than using all predictors simultaneously.

Nic, doesn’t the presumption that the models are functioning based on physics necessitate an increase in skill with the addition of each independent component predictor? Thanks for your extensive work, and sorry if my question reflects a lack of understanding of all of it.

Ron Graf,
Irrespective of how the models function, if the PLS method is achieving its object then one might expect adding additional predictors to improve skill, as long as they were not perfectly correlated, as well as skill increasing with the number of retained PLS components (each of which is a differently-weighted combination of all the predictor variables), up to when overfitting occurs. But I think the problem is probably that overfitting or misfitting (resulting in a suboptimal PLS component) occurs right from the construction of the first PLS component – not really surprising given the huge number of predictors to choose weights for.

If prediction error is (wrongly) measured using a single fit for the entire set of CMIP5 models (rather than using cross-validation), then when all nine predictors are used simultaneously the prediction error does decline more rapidly with the number of PLS components used – and is lower except when three PLS components are used – than when just the OLR seasonal cycle predictor is used. In either case, the prediction error reduces to zero when the maximum number of PLS components is used (one less than the number of models), since then there are sufficient degrees of freedom available to exactly fit each CMIP5 model’s predictand.

If the PLS method were able to minimize cross-validation based prediction error when forming each PLS component, rather than maximizing cross-covariance, then it probably would achieve a superior result (lower Spread ratio) when using all predictors simultaneously than just any one of them, but such a method would be extremely computationally demanding.

Nic, if I understand correctly, you’re saying the basic statistical intention of the paper was not achieved; the PLS method as applied apparently improperly weighted predictors as evidenced by the superior skill of a single predictor, OLR seasonal cycle, over their group of predictors. It seems you are also leaving open the possibility for productive use of their idea if enough computational force is available and it can be done correctly. I read you as giving little weight to the statistical result of the 6% uplift you obtained from using simply the OLR SC.

IMO, one cannot cross-validate against an ensemble of unvalidated ideas or models (as Brown et al does) to produce anything of value. I welcome others’ opinions on this.

However, testing the models’ component physical functions against the real observed record, especially since the models were last tuned, is I believe a worthy and long overdue enterprise.

By the way, I left a comment at Patrick Brown’s blog regarding the critique of his paper and invited him to respond here. (It might be in moderation for a while.)

I am having difficulty understanding the spread ratio. Nic wrote;
“They assess the predictive skill as the ratio of the root-mean-square error of the differences for each model between its predicted ΔT and its actual (simulated) ΔT, to the standard deviation of the simulated changes across all the models. They call this the Spread ratio. For this purpose, each model’s predicted ΔT is calculated using the relationship between the predictors and ΔT determined using only the remaining models.”

So for a given model, let’s say the RMS divergence at 2090 between the predicted temperature by the author’s method, and the model’s simulated temperature, is 0.2 C. That is the numerator of the ratio. The denominator is the standard deviation of all the models’ ΔTs at 2090. Is that what “of the simulated changes across all the models” means? If the standard deviation of the models’ temperature change for RCP8.5 at 2090 is 0.75 (est. from fig. TS.15), then the ratio is 0.2/0.75 = 0.27. This obviously isn’t correct. Where is this wrong?

Ken,
Your understanding is basically correct, except that the RMS divergence at 2090 between the predicted temperature by the author’s method and simulated temperatures is the mean across all the models, not for a single model.

The standard deviation of the estimated RCP8.5 2090 simulated warming, across the 36 models with data, is 0.60 C, a bit lower than your 0.75 C.

However, your 0.2 C estimate of the RMSE prediction error in their skill estimate is wildly optimistic. Using all predictor fields simultaneously, and the 7 PLS components they use, the actual figure is double that: 0.40 C. So, as they report, the Spread ratio is 2/3.

If just the OLR seasonal cycle magnitude field is used, the RMSE prediction error reduces to 0.32 C, or a bit lower if only 30S–30N latitude zone values are used.

It is not a very impressive prediction skill in any of these cases. And the Spread ratio is even higher for cases other than RCP8.5 2090. The average Spread ratio across the other seven scenario–projection date combinations, when using all predictor fields, is 0.9 – a trivial reduction in RMSE prediction error, to 0.54 C, from the original standard deviation of 0.60 C.

One of the problems is that the variables used in the paper as predictors tend to be used explicitly for tuning GCMs. As a result, none of the GCMs gets the predictors very far away from observations, save where they have high energy leakages. So a lower difference from observations may largely reflect more attention having been paid to matching observations during the tuning process. It is certainly far from obvious that there will be any consistent link between a GCM’s fidelity to observed recent TOA radiative fluxes and the realism of its projected 2090 warming.

You can also see the differences between clear and cloudy skies. It also has output for some climate models that are trying to predict weather 1-3 months in the future. So you can see what models do well or poorly at reproducing (given current SSTs, which change slowly).

Nic, Here’s an example that I would like to put forward as a way of thinking about their method. One historical metric we could use to constrain models is aerosol forcing. If we constrain an OAGCM to agree with historical best estimates, according to Mauritsen and Stevens, we would get more warming in the future. That’s because the models with high ECS also have higher aerosol forcing (and too high in the historical period). It seems to me this exercise merely shows that the original model was wrong in more fundamental ways, such as having an ECS that is too high to reproduce historical warming.

From a mathematical point of view, this method seems to be uninformative. It merely would adjust the “best tuning” of the modeling group to match more historical measures and indeed might contradict expert judgment and other constraints used to build and tune the model.

“Models, all the way down”, is not and will never be a convincing argument, even when it is published in Nature. This forgettable study will be, quite reasonably, forgotten in a few months. Really, the study is symptomatic of what is wrong with climate science.

Yes, Steve, the fundamental problem here is that AOGCM’s are simply not skillful at much and tweaking them in this way doesn’t in my view tell us very much. My example illustrates why the paper may in fact show that AOGCM’s are actually flawed.

Steve, this citation of Bjorn Stevens could be enlightening when it comes to the ability of the GCMs to replicate the aerosol (direct and indirect) forcing:
“We are not averse to the idea that Faer may be more negative than the lower bound of S15, possibly for reasons already stated in that paper. We are averse to the idea that climate models, which have gross and well-documented deficiencies in their representation of aerosol–cloud interactions (cf. Boucher et al. 2013), provide a meaningful quantification of forcing uncertainty. Surely after decades of satellite measurements, countless field experiments, and numerous finescale modeling studies that have repeatedly highlighted basic deficiencies in the ability of comprehensive climate models to represent processes contributing to atmospheric aerosol forcing, it is time to give up on the fantasy that somehow their output can be accepted at face value.”
Source: http://pubman.mpdl.mpg.de/pubman/item/escidoc:2382803:9/component/escidoc:2464328/jcli-d-17-0034.1.pdf

Steve, yes indeed. It’s rather difficult to formulate a harsher judgement than Stevens did in a scientific paper, IMO. Anyway, the BC17 paper seems to reflect just such an approach of finding models that are more reliable than others. The GHG–aerosol forcing balance seems to be the key point for any TCR/ECS estimation. Therefore (beyond some difficulties with their own statistical tools, as Nic showed) this paper can’t contribute to narrowing the bounds of the real world’s sensitivity to GHG, IMO. It’s missing the point.

Frank, thanks for that reference. It’s a good one that I’ve bookmarked for future use.

I would have thought that as evidence of lack of skill and fundamental problems with AOGCMs accumulates, climate scientists would stop relying on them to predict the future. I suspect it’s an easy publication to make a bunch of model runs; fundamental progress is much harder.

Frankclimate, dpy6629, Nic, and stevefitzpatrick: Given the uncertainty in aerosol forcing and the uncertainty it produces in the output of EBMs, perhaps BC17 should have added the ability to reproduce CERES measurements of aerosol reflection of SWR (through clear skies over the oceans) to their predictors. See Figure 1 vs Figure 2 in Stevens’ paper.

As I said elsewhere, most of the predictors/observations used by BC17 are not phenomena related to dOLR/dTs and dOSR/dTs, the factors that determine climate sensitivity. Aerosol forcing is somewhat relevant to dOSR/dTs, because aerosols have increased OSR during the period of historic warming. High-sensitivity models may have been tuned to reproduce historic warming by creating high sensitivity to cooling by aerosols. To the extent that such models don’t agree with observations of aerosols as a predictor, BC17’s methodology (assuming I understand it) would underweight those models’ contribution to the high end of the climate sensitivity range.
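The weighting idea under discussion can be illustrated with a toy emergent-constraint regression (all numbers below are synthetic, not from BC17): regress each model’s projected warming on its discrepancy from observations in some predictor, then read the fit off at zero discrepancy. The spread of the constrained estimate is the residual spread around the fit, which is narrower than the raw ensemble spread whenever the predictor explains across-model variance.

```python
import numpy as np

# Toy across-model "emergent constraint" (synthetic ensemble, not BC17 data).
rng = np.random.default_rng(0)
n_models = 36
metric = rng.normal(0.0, 1.0, n_models)    # model-minus-obs discrepancy in a predictor
# projected end-of-century warming, K: correlated with the discrepancy plus noise
warming = 4.0 + 0.5 * metric + rng.normal(0.0, 0.2, n_models)

slope, intercept = np.polyfit(metric, warming, 1)
constrained = intercept + slope * 0.0      # evaluate fit at zero discrepancy (perfect model)
residual_sd = np.std(warming - (intercept + slope * metric))

print(f"raw ensemble: {warming.mean():.2f} +/- {warming.std():.2f} K")
print(f"constrained:  {constrained:.2f} +/- {residual_sd:.2f} K")
```

The narrowing is purely statistical: it says nothing about whether the predictor is physically linked to sensitivity, which is the objection raised in this thread.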

Frank, I think the choice of the OSR/OLR seasonal cycle (many models are tuned to this measure) involves some kind of circular reasoning. The most important issue for high/low ECS (the impact of GHG vs. aerosol forcing on GMST) is the elephant in the room, not included in BC17. That is a pity, at the least.

frankclimate: Tsushima and Manabe (2013) shows that various AOGCMs disagree seriously with each other and with CERES about the change in the seasonal cycle of: 1) OLR from cloudy skies, 2) OSR from cloudy skies, and 3) OSR from clear skies (seasonal change in surface albedo). CERES and the AOGCMs agree about OLR from clear skies (WV+LR+Planck feedback = 2.1 W/m2/K). It would be interesting to know whether the latter agreement exists because models have been tuned to reproduce this metric.

Let me try to understand.
The models most to be trusted for predicting surface warming over the next century are supposed to be those that get TOA radiation most nearly correct. Some models are skilful at hindcasting TOA radiation for the years 2001 to 2015. These are the same models that are skilful at hindcasting the seasonal OLR radiation for the same years. These are the models we shall trust.
So, which models are we talking about, and what values do they show when it comes to hindcasting and forecasting OHC, TOA net SW radiation, TOA net LW radiation and changing lapse rate?

Nic, is it not true that the harsh reality is that the output of the climate models on which the IPCC relies for its dangerous global warming forecasts has no necessary connection to reality, because of their structural inadequacies? See Section 1 at https://climatesense-norpag.blogspot.com/2017/02/the-coming-cooling-usefully-accurate_17.html
Here is a quote:
“The climate model forecasts, on which the entire Catastrophic Anthropogenic Global Warming meme rests, are structured with no regard to the natural 60+/- year and, more importantly, 1,000 year periodicities that are so obvious in the temperature record. The modelers approach is simply a scientific disaster and lacks even average commonsense. It is exactly like taking the temperature trend from, say, February to July and projecting it ahead linearly for 20 years beyond an inversion point. The models are generally back-tuned for less than 150 years when the relevant time scale is millennial. The radiative forcings shown in Fig. 1 reflect the past assumptions. The IPCC future temperature projections depend in addition on the Representative Concentration Pathways (RCPs) chosen for analysis. The RCPs depend on highly speculative scenarios, principally population and energy source and price forecasts, dreamt up by sundry sources. The cost/benefit analysis of actions taken to limit CO2 levels depends on the discount rate used and allowances made, if any, for the positive future positive economic effects of CO2 production on agriculture and of fossil fuel based energy production. The structural uncertainties inherent in this phase of the temperature projections are clearly so large, especially when added to the uncertainties of the science already discussed, that the outcomes provide no basis for action or even rational discussion by government policymakers. The IPCC range of ECS estimates reflects merely the predilections of the modellers – a classic case of “Weapons of Math Destruction” (6).

Harrison and Stainforth 2009 say (7): “Reductionism argues that deterministic approaches to science and positivist views of causation are the appropriate methodologies for exploring complex, multivariate systems where the behavior of a complex system can be deduced from the fundamental reductionist understanding. Rather, large complex systems may be better understood, and perhaps only understood, in terms of observed, emergent behavior. The practical implication is that there exist system behaviors and structures that are not amenable to explanation or prediction by reductionist methodologies. The search for objective constraints with which to reduce the uncertainty in regional predictions has proven elusive. The problem of equifinality ……. that different model structures and different parameter sets of a model can produce similar observed behavior of the system under study – has rarely been addressed.” A new forecasting paradigm is required.

Nic: If climate models are going to correctly predict climate sensitivity, then they need to correctly predict dOLR/dTs and dOSR/dTs from both clear and cloudy skies. That is why I have always been fascinated by Tsushima and Manabe (2013), which looks at these parameters GLOBALLY during the seasonal cycle. The global amplitude of the seasonal cycle is 3.5 K (which I think is a hemispheric average of about 10 K of warming in the NH and -3 K of “warming” in the SH). This drives a global seasonal increase in OLR of about 7.5 W/m2 (or 2.1 W/m2/K). There is less dynamic range in equatorial regions than elsewhere. Since all climate models agree (possibly because of tuning) about the combined WV+LR feedback through clear skies (globally at least), the differences between models, and between models and observations (globally at least), come from OLR from cloudy skies. With little humidity and constant CO2 above cloud tops, variation in OLR from cloudy skies may reflect variation in cloud-top altitude.
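The 2.1 W/m2/K figure quoted above is just the regression slope of global OLR on global Ts over the seasonal cycle; a minimal synthetic sketch (idealized sinusoidal cycle, not CERES data):

```python
import numpy as np

# Idealized global seasonal cycle: 3.5 K peak-to-peak in Ts, with OLR
# responding in phase at 2.1 W/m^2/K (both values from the discussion above).
months = np.arange(12)
ts = (3.5 / 2) * np.sin(2 * np.pi * months / 12)   # global-mean Ts anomaly, K
olr = 2.1 * ts                                     # clear-sky-like OLR response, W/m^2

slope = np.polyfit(ts, olr, 1)[0]
print(f"dOLR/dTs ~ {slope:.2f} W/m^2/K")           # recovers 2.1
print(f"seasonal OLR range ~ {olr.max() - olr.min():.2f} W/m^2")
```

In the real data the slope is estimated the same way, month by month, and the model-vs-CERES disagreements Tsushima and Manabe report are differences in exactly this kind of regression.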

I’ve always thought that the next step after TM13 would be to look at how well climate models perform REGIONALLY in terms of seasonal changes in dOLR/dT and dOSR/dT from both clear and cloudy skies. However, BC17 looks at regional changes in OLR, OSR, and radiative imbalance (N), not explicitly at how they change with regional temperature (Ts). And they didn’t separate clear and cloudy skies, where the physical mechanisms have different origins. The superior utility of the OLR seasonal cycle in narrowing the spread of climate sensitivity may arise because it is a better measure of dOLR/dTs.

Unlike OLR, the seasonal cycle in global OSR is only partially explained by the seasonal cycle in global Ts. Some components of global OSR (reflection from sea ice and seasonal snow cover) lag behind changes in Ts. Both Lindzen and Spencer find better correlations between OSR and Ts about 3 months earlier (in the tropics for Lindzen). Physically, emission of OLR is a function of temperature, but reflection of SWR is not. Reflection of SWR depends on the need for convection to carry 100 W/m2 of (mostly latent) heat away from the surface, which produces parcels of rising cloudy air and descending clear air. This circulation took place during the LGM when GMST was 5 K lower, and will still be occurring if GMST rises 5 K.
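The lag point can be illustrated with a synthetic series (not CERES data, and purely for illustration): a reflection component that responds to Ts from roughly three months earlier correlates poorly with simultaneous Ts but strongly with suitably lagged Ts.

```python
import numpy as np

# Synthetic illustration of the ~3-month lag between Ts and the albedo
# component of OSR (sea ice / snow cover) discussed above.
months = np.arange(120)                       # ten years of monthly samples
ts = np.sin(2 * np.pi * months / 12)          # idealized seasonal Ts anomaly
osr_albedo = np.roll(ts, 3)                   # reflection tracking Ts 3 months earlier

r_simultaneous = np.corrcoef(ts, osr_albedo)[0, 1]
r_lagged = np.corrcoef(np.roll(ts, 3), osr_albedo)[0, 1]
print(f"correlation with simultaneous Ts:     {r_simultaneous:+.2f}")  # ~0
print(f"correlation with Ts 3 months earlier: {r_lagged:+.2f}")        # ~1
```

With a pure 12-month sinusoid a 3-month lag is a quarter-cycle phase shift, so the simultaneous correlation vanishes entirely; in real data it is merely degraded, which is why Lindzen and Spencer find the lagged relationship more robust.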

All AOGCMs are presumably tuned so that the global climatology of OLR and OSR agrees with observations. So the global average OLR must be about 240 W/m2 and OSR about 102 W/m2. Therefore, these observations/predictors reflect regional DIFFERENCES in temperature (OLR), clouds (OLR and OSR), and surface albedo (OSR). AOGCMs are also tuned to agree with observations of sea-ice extent, so this aspect of OSR from clear skies is tuned. Climatological OLR at a particular location obviously varies with “climatological Ts”, but it isn’t a direct measure of how well a model reproduces dOLR/dTs.

Assuming constant solar irradiance at the TOA, the radiative imbalance (N) is a function of OLR and OSR. In theory, adding N to the OLR and OSR in this analysis provides redundant information.
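That redundancy is easy to demonstrate: with incoming solar S treated as fixed, N = S − OLR − OSR is an exact linear combination of the other two predictors, so a linear analysis gains nothing from it. A sketch with synthetic across-model values (the fluxes below are invented round numbers):

```python
import numpy as np

# With S fixed, N = S - OLR - OSR, so a design matrix containing an intercept,
# OLR, OSR, and N is rank-deficient: N adds no independent information.
rng = np.random.default_rng(1)
S = 340.0                                   # assumed constant TOA insolation, W/m^2
olr = 240.0 + rng.normal(0.0, 2.0, 30)      # synthetic across-model OLR values
osr = 102.0 + rng.normal(0.0, 2.0, 30)      # synthetic across-model OSR values
n = S - olr - osr                           # radiative imbalance, fully determined

X3 = np.column_stack([np.ones(30), olr, osr])
X4 = np.column_stack([np.ones(30), olr, osr, n])
print(np.linalg.matrix_rank(X3))  # 3
print(np.linalg.matrix_rank(X4))  # still 3: the N column is redundant
```

A regression method that isn’t careful about collinearity (as Nic’s critique of BC17’s PLS setup discusses) can behave poorly when fed such a predictor set.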

When using all of these observations/predictors to narrow the spread in climate sensitivity, BC17 assumes that statistics will properly weight the physical importance of these parameters to climate sensitivity. This process should be broken down into two steps: 1) ask how well models reproduce what we observe; 2) based on physics, ask which of these observables is most clearly linked to climate sensitivity. My personal answer to the second question is dOLR/dTs and dOSR/dTs from both clear and cloudy skies. TM13 tells us models need improvement. (The limited correlation between ANOMALIES in monthly N and Ts, with variation in Ts driven by ENSO, doesn’t provide a satisfying answer.)

Frank
There is much in what you say. The lack of any direct link between BC17’s predictor variables and measures of dOLR/dT and dOSR/dT is indeed a very major weakness of the study. It is made worse by the variables that they use being ones that most if not all models are tuned to match.
Your point “The superior utility of the OLR seasonal cycle in narrowing the spread of climate sensitivity may arise because it is a better measure of dOLR/dTs.” is a good one.

Nic: Above I mentioned that BC17 should first analyze their predictors before using them to weight climate sensitivity. If I understand correctly, the predictors are discrepancies between model output and observations. Let’s suppose some models do a better job of reproducing OSR (or perhaps seasonal changes in OSR) from areas with marine boundary-layer clouds. Other models might have storm tracks in slightly the wrong location, but with little NET change in OSR from this problem. Given the importance of changes in boundary-layer clouds to climate sensitivity, it would make more sense to focus on predictors from areas where boundary-layer clouds are important. However, one would want a predictor that has not been used to specifically tune the behavior of boundary-layer clouds, perhaps seasonal change rather than annual climatology. If models have been tuned to produce the “right amount” of boundary-layer clouds, perhaps the ability to simulate an appropriate amount of natural variability in boundary-layer clouds would be an important metric.

By ignoring the nature of the “predictors” – which I think might also be termed “model errors” (assuming I do understand this methodology) – BC17 cleverly avoids characterizing those errors, yet asserts that taking them into account narrows the spread of climate sensitivity. I think your analysis has demonstrated the pitfall of applying this method to predictors without a clear connection to climate sensitivity.

From the abstract of Zhao (2016), GFDL model:

“The authors demonstrate that model estimates of climate sensitivity can be strongly affected by the manner through which cumulus cloud condensate is converted into precipitation in a model’s convection parameterization, processes that are only crudely accounted for in GCMs. In particular, two commonly used methods for converting cumulus condensate into precipitation can lead to drastically different climate sensitivity, as estimated here with an atmosphere–land model by increasing sea surface temperatures uniformly and examining the response in the top-of-atmosphere energy balance. The effect can be quantified through a bulk convective detrainment efficiency, which measures the ability of cumulus convection to generate condensate per unit precipitation. The model differences, dominated by shortwave feedbacks, come from broad regimes ranging from large-scale ascent to subsidence regions. Given current uncertainties in representing convective precipitation microphysics and the current inability to find a clear observational constraint that favors one version of the authors’ model over the others, the implications of this ability to engineer climate sensitivity need to be considered when estimating the uncertainty in climate projections.”

The “drastic difference” in climate sensitivity is from ECS 3.0 K to 1.8 K (assuming F_2x = 3.7 W/m2). If the authors are correct and no observational constraint favors one model over the other, then the approach of BC17 appears worthless. I’d like to see how well each model reproduces feedbacks in response to seasonal warming. (Studying the relative merits of a climate model with an ECS that agrees with EBMs might not enhance one’s career.)

Frank: you make an excellent point. I agree, the demonstration by Zhao et al (2016) that one could engineer ECS in a GFDL GCM by varying the convective parameterization, without any clear change in how well the model represented any observable aspects of the current climate, seems to me a death blow to using such aspects, as BC17 does, as emergent constraints for ECS. My reading of Zhao et al was that the authors were rather worried, and I think surprised, by this finding.

Nic and frankclimate: The more I think about Zhao, the more questions I have about asserting a climate sensitivity of 1.8 K, which is perhaps why neither the paper nor Isaac Held refers to ECS. Zhao was working on the atmosphere-only model (AM4) for the next generation of GFDL model(s). It suddenly dawned on me that I didn’t know what kind of ocean (if any) was below that atmosphere, or why the term “Cess climate sensitivity” was being used. So I have pasted some of the caveats from the paper about the “Cess approach”. Some of the papers you have discussed suggest that the curvature in Gregory plots (and the changing feedbacks it implies) arises from changes in the ocean.

We don’t know what atmosphere (AM4-H, AM4-M, AM4-L, or something totally different) was used in the model(s) GFDL is using for CMIP6, or whether any would exhibit an ECS of less than 2 K. We can be sure that whatever atmospheric parameterization was chosen, there are probably equally good choices that would have dramatically different climate sensitivity. This parameterization uncertainty – which isn’t properly explored using the IPCC’s ensemble of models – is often ignored.

“To assess the models’ cloud feedback and climate sensitivity, we follow the Cess approach by conducting a pair of present-day and global warming simulations for each model using prescribed SSTs and greenhouse gas (GHG) concentrations (Cess et al. 1990). The present-day simulations are forced by the observed HadISST climatological SSTs averaged over the period of 1981–2000, with GHG concentrations fixed at the year 2000 level. The global warming experiments are identical to the present-day simulations, except SSTs are uniformly increased by 2 K. A Cess climate sensitivity parameter λ can then be computed as λ = ΔTs/ΔG, where Ts denotes global mean SST, G is TOA net radiative flux, and Δ indicates the difference between warming and present-day simulations”

“Another simplification of the Cess approach is the use of uniform SST warming experiments with an atmospheric-only model for studying cloud feedback. Previous studies demonstrated that the cloud feedbacks derived from the Cess experiments well capture the intermodel differences of feedbacks in the equilibrium response of slab ocean models to a doubling of CO2 (e.g., Ringer et al. 2006). Very recently, Ringer et al. (2014) and Brient et al. (2015) analyzed the CMIP5 fully coupled ocean–atmosphere models and their corresponding Cess experiments and confirmed again that the Cess experiments provide a good guide to the global cloud feedbacks determined from the coupled simulations, including the intermodel spread. The differences in total climate feedback parameter between the Cess and coupled models arise primarily from differences in clear-sky feedbacks that are anticipated from the nature of the Cess experimental design (i.e., ignoring the polar amplification and sea ice albedo feedback). As a result, the Cess climate sensitivity parameter should not be interpreted at its face value for estimates of model equilibrium climate sensitivity. With these limitations in mind, the Cess approach has been widely used in characterizing and understanding many aspects of intermodel differences in cloud feedback and climate sensitivity between GCMs”
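The Cess λ defined in the first quoted paragraph is a one-line calculation; here is a sketch with made-up flux values (only the difference in TOA net flux between the two runs matters, and, per the quoted caveats, λ × F_2x should not be read as a true ECS):

```python
# Minimal sketch of the Cess sensitivity parameter, lambda = dTs / dG,
# using invented TOA net flux values for the two prescribed-SST runs.
F_2X = 3.7        # W/m^2 per CO2 doubling (conventional value)
d_ts = 2.0        # prescribed uniform SST increase, K (per Cess et al. 1990)
g_control = 0.6   # TOA net radiative flux, present-day run, W/m^2 (assumed)
g_warm = 3.1      # TOA net radiative flux, +2 K run, W/m^2 (assumed)

lam = d_ts / (g_warm - g_control)            # Cess lambda, K per W/m^2
print(f"lambda ~ {lam:.2f} K/(W/m^2)")
print(f"implied Cess 'sensitivity' ~ {lam * F_2X:.1f} K")  # NOT equilibrium ECS
```

Because the SSTs are prescribed and sea-ice/polar feedbacks are excluded, this number characterizes mostly the atmospheric (largely cloud) feedbacks, which is how Zhao uses it to compare parameterization variants.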

Schmidt found some errors in forcings and ocean heat uptake, redid the calculations, and obtained a graph (since removed from the RealClimate post, perhaps to save disk space). I found it elsewhere; it shows that if total current aerosol forcing were about -1.0 W/m2, then the best ECS estimate would be about 1.7 C. An ECS of 3.0 would require a current aerosol forcing of -1.75 W/m2.

1. This, I believe, is all corroboration of Nic’s recent energy-balance estimates of ECS, especially Nic’s statement that uncertainty in aerosol forcing is the biggest uncertainty in this type of calculation.
2. It shows once again how unreliable alarming papers can be, even in Nature.
3. It shows once again that if we use aerosol forcing vs. recent data as a measure of GCM reliability, only the lower-ECS models survive.

Point 3 above would also call into question Brown and Caldeira, and one would have to ask why this obvious measure was not used.

Frank, some aspects of the possibility of “engineering” the sensitivity of models via cloud microphysics are also discussed by Isaac Held: https://www.gfdl.noaa.gov/blog_held/66-clouds-are-hard/ . With respect to Zhao (2016) he points out:
“The problem is that, while it may be possible to find some properties of the climate simulation that look better in one of these models than the others, the biases in other parts of the model affecting the same metric can make it hard to make a convincing case that you have constrained cloud feedback. At this point, we are not convinced that we have emergent constraints that clearly favor one version of this proto-AM4 model over the others. We are uncomfortable having the freedom to engineer climate sensitivity to this degree. You can always try to use the magnitude of the warming over the past century itself to constrain cloud feedback, but this gets convolved with estimates of aerosol forcing and internal variability. Ideally we would like to constrain cloud feedbacks in other ways so as to bring these other constraints to bear on the attribution of the observed warming.”
It’s a more fundamental critique of the approach of BC17, IMO, beyond some technical issues with which I’m not familiar.
PS: It’s so sad that the last post of Isaac on his blog is more than 1 year old!

[…] thank Patrick Brown for his detailed response (also here) to statistical issues that I raised in my critique “Brown and Caldeira: A closer look shows global warming will not be greater than we thought” […]
