Review Comments on the "IPCC Test"

In a recent post, I indicated that the IPCC authors seem to have invented a “test” for long-term persistence that is nowhere attested in any statistical literature and, if I’ve interpreted what they’ve done correctly, appears to be a useless test.

Jean S and I have made a few references to the Review Comments on the “IPCC Test” and I thought that it would be interesting to collate them a little more systematically as they show the bullying tactics of IPCC authors and the total failure of review editors to ensure an adequate reply to reasonable comments.

Cohn and Lins 2005

The issues raised in Cohn and Lins 2005 underpin the problems with Table 3.2, the futile attempts of the IPCC authors to fix those problems, and their rejection of critical comments. Cohn and Lins 2005 is familiar to CA readers (it in turn cites interesting analyses by Koutsoyiannis). They analyzed the CRU temperature series (also discussed in AR4 Table 3.3); they reported trends similar to IPCC’s, but observed that the significance of the trend was reduced by orders of magnitude under a plausible form of persistence:

All of the tests report nearly the same estimated trend magnitude, which ranges from 0.0045 to 0.0053 °C/year. As far as the magnitude is concerned, it makes little difference which test is used. Choice of trend test, however, does matter when computing trend significance. The simplest test, Tb,{0,0,0} (which assumes no LTP), finds strong evidence of trend, a p-value of 1.8 × 10⁻²⁷. Tb,{φ,0,0} (which allows for short-term persistence) yields a p-value of 5.2 × 10⁻¹¹, 16 orders of magnitude larger and still highly significant. The p-value corresponding to either Tb,{0,d,q} or Tb,{φ,d,0}, an unadjusted LRT trend test that considers both short-term and long-term persistence, is about 7%, which is not significant under the null hypothesis. In changing from one test to another, 25 orders of magnitude of significance vanished. This result is somewhat troubling given the uncertainty about the stochastic process and consequently about which test to rely on.
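To see the mechanics behind the kind of sensitivity Cohn and Lins describe, here is a minimal sketch (not their code; synthetic data and illustrative numbers only) of how the t-statistic on the same trend estimate shrinks once an AR(1) allowance is made via the standard effective-sample-size adjustment:

```python
# Illustrative sketch only (not Cohn and Lins' code): the trend estimate
# barely changes, but its t-statistic shrinks once AR(1) persistence in
# the residuals is allowed for via an effective-sample-size adjustment.
import numpy as np

rng = np.random.default_rng(0)
n = 150
t = np.arange(n, dtype=float)

# Synthetic "temperature" series: small trend + AR(1) noise
rho_true = 0.6
eps = np.empty(n)
eps[0] = rng.normal()
for i in range(1, n):
    eps[i] = rho_true * eps[i - 1] + rng.normal()
y = 0.005 * t + 0.2 * eps

# OLS trend and the naive standard error (assumes white residuals)
X = np.column_stack([np.ones(n), t])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta
s2 = resid @ resid / (n - 2)
se_naive = np.sqrt(s2 / np.sum((t - t.mean()) ** 2))

# AR(1) adjustment: effective sample size n_eff = n(1 - rho)/(1 + rho)
rho_hat = np.corrcoef(resid[:-1], resid[1:])[0, 1]
n_eff = n * (1 - rho_hat) / (1 + rho_hat)
se_adj = se_naive * np.sqrt(n / n_eff)

print(f"trend = {beta[1]:.4f}/step, naive t = {beta[1] / se_naive:.1f}, "
      f"AR(1)-adjusted t = {beta[1] / se_adj:.1f}")
```

This sketch covers only the short-term-persistence step; Cohn and Lins’ point is that a long-memory model inflates the standard error further still, which is how 25 orders of magnitude of significance can vanish.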

First Order Draft

The AR4 First Draft contained no mention of Cohn and Lins 2005 and, without considering the issues raised there about trend significance, waded right into attempts to claim trend significance. Here is the caption for Table 3.2 in the First Draft, which makes no mention of the Durbin-Watson test “after allowing for first-order serial correlation”:

[First Draft] Table 3.2. Temperature trends (K decade⁻¹) in hemispheric and global land surface air temperatures, SST and night marine air temperatures. Trends with ± standard error ranges and significances (bold: <1%; italic, 1%–5%) were estimated by Restricted Maximum Likelihood (Diggle et al., 1999); see Appendix 3.A.1.2, which allows for serial correlation in the residuals of the data about the linear trend. All trends are based on annual averages without estimates of intrinsic uncertainties.

In the First Draft Appendix 3.A, they said at 113:50ff:

Some of the linear trends for global fields and averages in this Chapter have been estimated using Restricted Maximum Likelihood (REML, Diggle et al., 1999). REML estimates yield error bars which take account both of the estimated errors of the input data and of the autocorrelation of the residuals from the fitted trend. The error bars are, therefore, wider and more realistic than those provided by the Ordinary Least Squares (OLS) technique. If, for example, a century-long series has multidecadal variability as well as a trend, the deviations from the fitted linear trend will be autocorrelated. This will cause the REML technique to widen the error bars, reflecting the greater difficulty in distinguishing a trend when it is superimposed on other long-term variations, and the sensitivity of estimated trends to the period of analysis in such circumstances.

Clearly, however, even the REML technique cannot widen its error estimates to take account of variations outside the period of record when used, for example, to estimate trends from MSU data, which began in 1979. So, the errors estimated by REML may still be too small for short records.
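The widening the Appendix alludes to can be made concrete. For AR(1) residuals with lag-one correlation rho, the naive OLS standard error is understated roughly by the factor sqrt((1+rho)/(1-rho)); the numbers below are a sketch under that standard variance-inflation approximation, not anything from the chapter:

```python
# Sketch of the AR(1) variance-inflation factor sqrt((1+rho)/(1-rho)):
# the more persistent the residuals, the wider the honest error bars.
import numpy as np

rhos = (0.0, 0.3, 0.6, 0.9)
factors = [float(np.sqrt((1 + r) / (1 - r))) for r in rhos]
for r, f in zip(rhos, factors):
    print(f"rho = {r:.1f}  ->  naive error bars should be widened by ~{f:.2f}x")
```

Note that this factor grows without bound as rho approaches 1, which is one way to see why near-unit-root or long-memory behaviour is so damaging to naive significance claims.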

First Draft Comments
Table 3.2 prompted a number of FOD Comments, of which the most cogent was by a certain Ross McKitrick (I was unaware of this when I started this analysis). He said:

3-425 A 9:0 Table 3.2. Here, and in Appendix 3.A.1.2, reference is made to “Restricted Maximum Likelihood” standard errors, but the citation is to a general text book (Diggle), not a published article. Considering the importance of the contents of this table to the Chapter, the reader needs considerably more guidance about the estimating methodology, as well as reference to current literature.

There is a substantial literature dating back to the early 1990s showing that anomaly data have long autocorrelation processes in them, making for long term persistence and near unit-root behaviour. It is well known in the climate literature that this can severely bias significance estimates in trend regressions. Yet there is no mention of this problem and it seems that the t-stats in this table reflect only a first order autocorrelation correction, almost certainly making them misleading. I will suggest some improved wording, but I believe this table needs a serious re-do and the reader is owed a substantial discussion of the problems of estimating significance of trends in climatic data. Below I cite a forthcoming treatment of the issue by Cohn and Lins, who comment “It is therefore surprising that nearly every assessment of trend significance in geophysical variables published during the past few decades has failed to account properly for long term persistence….

For example, with respect to temperature data there is overwhelming evidence that the planet has warmed during the past century. But could this warming be due to natural dynamics? Given what we know about the complexity, long term persistence, and non-linearity of the climate system, it seems the answer might be yes.” All the trends should be re-estimated using, at minimum, an ARMA(1,1) model, not an AR(1) model; and the lag processes need to be extended out to sufficient length to ensure the ARMA coefficients become insignificant. The treatment of this key issue in this chapter is at least 10 years behind the state of the art (see, for instance, Woodward and Gray JClim 1993, who were already ahead of where this discussion is), and unless substantial improvement is made this Table and related discussions should be removed altogether.
[Ross McKitrick]

The “IPCC Test” for long-term persistence (the Durbin-Watson statistic “after allowing for first-order serial correlation”) is first mentioned in the reply of the chapter 3 authors to McKitrick’s comment – without any statistical reference showing the validity of this method:

👿 Discussion expanded to new appendix. Residuals from linear trends after fit of AR(1) model do not show strong long term autocorrelation processes as illustrated by the Durbin-Watson statistics now given in Table 3.3.

As shown below, the DW statistics were not actually shown in Table 3.3 despite this statement (but the values subsequently provided by Phil Jones are over 2).
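For readers unfamiliar with the statistic at issue: the Durbin-Watson D is approximately 2(1 − ρ) for lag-one autocorrelation ρ, so values near 2 indicate little first-order autocorrelation and values above 2 point to negative autocorrelation. A minimal numpy sketch on synthetic series (textbook definition only; nothing here is from the chapter):

```python
# Minimal sketch of the Durbin-Watson statistic, DW ~= 2 * (1 - rho):
# near 2 for white noise, above 2 for negatively autocorrelated series.
import numpy as np

def durbin_watson(e):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(1)
n = 500
white = rng.normal(size=n)

neg = np.empty(n)  # AR(1) series with negative coefficient
neg[0] = rng.normal()
for i in range(1, n):
    neg[i] = -0.4 * neg[i - 1] + rng.normal()

dw_white = durbin_watson(white)
dw_neg = durbin_watson(neg)
print(f"white noise:  DW ~ {dw_white:.2f}")   # near 2
print(f"rho = -0.4:   DW ~ {dw_neg:.2f}")     # near 2*(1 - (-0.4)) = 2.8
```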

In a related First Draft comment, McKitrick said:

3-431 A 9:7 9:7 In light of the above, after line 7 insert the following: “Table 3.2 provides trend estimates from numerous climate databases. It should be noted that determining the statistical significance of a trend line in geophysical data is difficult, and most forms of bias in such time series will tend to overstate the significance. Zheng and Basher (1999), Cohn and Lins (2005) and others have used time series methods to show that failure to properly treat the pervasive forms of long term persistence and autocorrelation in trend residuals can make erroneous detection of trends a typical outcome in climatic data analysis.” References for the above: Zheng, Xiaogu and Reid E. Basher (1999). “Structural Time Series Models and Trend Detection in Global and Regional Temperature Series.” Journal of Climate 12, 2347-2358. Cohn, Timothy and Harry J. Lins (2005). “Nature’s Style: Naturally Trendy.” Geophysical Research Letters, Accepted and Forthcoming, Fall 2005.

[Ross McKitrick]

prompting a similar response from the IPCC authors.

👿 Accepted. Discussed more in new Appendix. We refer to recent paper by Cohn and Lins (2005). See also 3-425.

Their statement that they “accepted” this comment was not entirely candid, as their discussion in the Appendix contained a dismissal of Cohn and Lins (without actually providing a citation for the dismissal). Another reviewer (Ellsworth Dutton) observed:

3-2036 A 113:53 113:56 The discussion of the Ordinary Least Squares trend error bars is incorrect because a proper application of the uncertainty of the least squares fit to any function requires that the residuals from that fit are randomly distributed. Therefore, if there are autocorrelated variations other than accounted for by the linear fit, which it is stated that there are, then it is improper to assign the error bars whose definition is derived on the assumption that the fit accounts for all autocorrelation – this is a common error in the application of least square fit statistics. A proper statistical analysis should be pursued that provides for all the autocorrelated behavior of the system and assigns correct uncertainties to all parameters of the fit. – This is fundamental statistics. Nonetheless, the section as is does serve to point out how the error bars were determined, which is legitimate but maybe would be smaller if the linear fit error was isolated from the other components.
[Ellsworth Dutton]

The IPCC authors again purported to “accept” the comment, but only agreed to “clarify” their claims. The language in the Second Draft does not fix the underlying problems.

👿 Accepted. Text clarified.

Second Order Draft

In the Second Draft, the IPCC authors first introduced their “test” for long-term persistence, stating that the DW statistic “after allowing for first-order serial correlation” does not show significant “positive” serial correlation.

Table 3.3. Linear trends (°C decade⁻¹) in hemispheric and global combined land surface air temperatures and SST. Trends are estimated and presented as in Table 3.2. Annual averages, along with estimates of uncertainties for CRU/UKMO, were used to estimate trends. R2 is the squared trend correlation in percent. The Durbin Watson D-statistic (not shown) for the residuals, after allowing for first-order serial correlation, never indicates significant positive serial correlation.

The actual DW statistics, promised in the Review Comments, were not shown. Phil Jones sent them to me on request, and the relevant DW statistics (for what they are worth) are all above 2 – which actually points to negative autocorrelation; indeed, values sufficiently far above 2 indicate negative autocorrelation significant at the 95% level. Had these statistics actually been reported in the Second Draft (as promised), someone might have thought this odd. In the running text, there were a couple of important caveats on trend significance. At page 9:16, they said:

Table 3.2 provides trend estimates from a number of hemispheric and global temperature databases. Determining the statistical significance of a trend line in geophysical data is difficult, and many oversimplified techniques will tend to overstate the significance. Zheng and Basher (1999), Cohn and Lins (2005) and others have used time series methods to show that failure to properly treat the pervasive forms of long-term persistence and autocorrelation in trend residuals can make erroneous detection of trends a typical outcome in climatic data analysis.

In the Appendix at 116:53-57, they said:

The linear trends are estimated by Restricted Maximum Likelihood regression (REML, Diggle et al., 1999), and estimates of statistical significance assume that the terms have serially uncorrelated errors and that the residuals have an AR1 structure. The error bars, shown as ± standard error ranges, are therefore wider and more realistic than those provided by the standard ordinary least squares technique. If, for example, a century-long series has multi-decadal variability as well as a trend, the deviations from the fitted linear trend will be autocorrelated. This will cause the REML technique to widen the error bars, reflecting the greater difficulty in distinguishing a trend when it is superimposed on other long-term variations, and the sensitivity of estimated trends to the period of analysis in such circumstances. Clearly, however, even the REML technique cannot widen its error estimates to take account of variations outside the sample period of record. While more sophisticated and non-linear methods are available, they are not as transparent. Robust methods for the estimation of linear and nonlinear trends in the presence of episodic components became available recently (Grieser et al., 2002).

As some components of the climate system respond slowly to change, the climate system naturally contains persistence. Hence, the statistical significances of REML AR1-based linear trends could be overestimated (Zheng and Basher, 1999; Cohn and Lins, 2005). Nevertheless, the results depend on the statistical model used, and more complex models are not as transparent and often lack physical realism. Indeed, long-term persistence models (Cohn and Lins, 2005) have not been shown to provide a better fit to the data than simpler models.

Some of these caveats are both stronger and more prominently placed than the final AR4 text itself.
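For intuition on why the short-term/long-term persistence distinction matters so much in this dispute: AR(1) autocorrelations decay geometrically, while long-memory (fractionally integrated) autocorrelations decay only as a power law, so the two models imply very different uncertainty at long lags. A sketch with illustrative parameters (nothing here is fitted to real data):

```python
# Sketch of short- vs long-term persistence: AR(1) autocorrelation dies
# off geometrically (rho^k), while a long-memory process with fractional
# parameter d decays only as a power law (proportional to k^(2d-1)).
import numpy as np

rho, d = 0.6, 0.3  # illustrative AR(1) coefficient; fractional-difference d
lags = np.array([1, 5, 10, 25, 50])
ar1_acf = rho ** lags.astype(float)
ltp_acf = lags.astype(float) ** (2 * d - 1)  # proportional decay only

for k, a, b in zip(lags, ar1_acf, ltp_acf):
    print(f"lag {k:3d}:  AR(1) ~ {a:.6f}   long-memory ~ {b:.4f}")
```

By lag 50 the AR(1) correlation is essentially zero while the long-memory correlation is still appreciable, which is exactly why trend significance computed under AR(1) can be wildly overstated if the true process has long-term persistence.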

Second Draft Review Comments

Reviewer McKitrick, one of only a couple of reviewers to pay attention to this section, objected to the sentence “Nevertheless, the results depend on the statistical model used, and more complex models are not as transparent and often lack physical realism”, noting a certain irony in the IPCC rejecting a model because it was not “transparent”:

3-1132 A 116:55 116:56 The sentence beginning, “Nevertheless, the results depend…” is vague, disputatious and incorrect. It applies more to the REML results, which are presented without such caveat in the chapter. No citation to any literature is given to defend the implication that fractionally-integrated estimators are less physically-realistic than the linear regression models used elsewhere. Persistency models were developed in hydrology precisely to improve physical realism, so as to provide a better match between the stochastic model and the geophysical phenomena. As for transparency, the lack of transparency of GCM’s or other numerical models is never regarded as a deficiency in IPCC documents. And there is no sense in which fractional-integration models lack transparency – the methods are well-known and code is published. They are not trivial, but that doesn’t mean they are not transparent. The sentence is wrong, unnecessary and should be removed.
[Ross McKitrick (Reviewer’s comment ID #: 174-13)]

After conceding a little in the SOD, Jones and Trenberth now launched a counter-attack.

👿 Fractionally-integrated estimators have not been shown to be good models or fits to the data. On the contrary some examples exist where it is demonstrated they are not (e.g. Trenberth, K. E., and J. W. Hurrell, 1999: Comment on “The interpretation of short climate records with comments on the North Atlantic and Southern Oscillations”. Bull. Amer. Met. Soc., 80, 2721–2722. Trenberth, K. E., and J. W. Hurrell, 1999: Reply to Rajagopalan, Lall and Cane’s comment about “The interpretation of short climate records with comments on the North Atlantic and Southern Oscillations”, Bull. Amer. Met. Soc., 80, 2726–2728). We added comments in Section 3.2 and Tables 3.2 and 3.3 supporting the validity of using AR1.

I’ve consulted these citations. Trenberth and Hurrell 1999 is a comment on an article by Carl Wunsch (the Wunsch article is online, but the Trenberth and Hurrell comment is available only in dead-tree form). The Wunsch article cautioned researchers that seeming “trends” can occur in stochastic processes with no underlying trend, commenting on a series from Trenberth and Hoar as an example. The only relevant comment in Trenberth and Hurrell 1999 that I could identify is:

“We modeled the seasonal anomalies with an Autoregressive Moving Average (ARMA) model of order (3,1), which reduced the residuals to white noise random values. The latter is the definition of a good model statistically (e.g. Brockwell and Davis 1991) – in contrast to Wunsch’s claim that the ‘specific ARMA model of Trenberth and Hoar (1997) is probably too great an underparameterization of the time series (i.e. not sufficiently structured)’.”

This is hardly a magisterial statistical authority. Moreover, Wunsch, an eminent authority, made a very reasonable response to Trenberth and Hurrell, ignored here by IPCC:

The reader of this exchange will recognize that it concerns inferences about time series behavior dependent on details of models and of statistical tests at the very edge of their significance levels. In this situation, one ought to be uncomfortable about the validity of the various underlying assumptions, for example, Gaussian statistics. If the conclusion is an important one, such as the inference that ENSO behavior requires a changing climate system, the careful investigator may decide that an agnostic conclusion is the only sensible one, pending the arrival of more data.

No one is likely to argue that the climate system is statistically stationary in any rigorous or simple sense…. The debate here concerns only the conclusion that recent ENSO statistics alone demand a statistically significant change in the overall climate system. The central message of my paper was that stationary stochastic processes often exhibit highly unintuitive behavior and that one needs to be careful in drawing strong inferences from short records.

One presumes that the IPCC authors were familiar not only with the Trenberth and Hurrell comment of 1999, but with the Wunsch reply to that comment, and that their failure to consider the Wunsch reply is simply one more instance of problems arising from authors being too closely attached to their own work and their own intellectual POV.

There is also a lengthy and thoughtful comment from the U.S. Government (mentioned previously by Jean S), reiterating the problems with the flawed efforts to calculate statistical significance:

3-33 A 0:0 Throughout the chapter, results of linear trend analyses are presented that include estimates of statistical significance. In two specific sections of the chapter (page 3-9, lines 18-22 and page 3-116, lines 53-56), the comment is made that the statistical significances of trends in variables estimated using Restricted Maximum Likelihood regression (REML) — which is the method used within the report — are likely to be overestimated; with citations given for Zheng and Basher, 1999 and Cohn and Lins, 2005. On page 3-116, lines 55-56, after acknowledging that this problem stems from the presence of long-term persistence in the underlying climatic processes, the report then states “Nevertheless, the results depend on the statistical model used, and more complex models are not as transparent and often lack physical realism.” Indeed, the results do depend on the model used and, as pointed out by Cohn and Lins, 2005, simple models (like REML) do not capture the complexity of long-term persistence — that’s why results based on the use of simple models are in error. The comment that “more complex models are not as transparent and often lack physical realism” contradicts the central point of Cohn and Lins, 2005. If long-term persistence exists within climatic processes, and the 4AR draft says that it does (page 3-116, lines 53-54), then a more complex model, such as that used by Cohn and Lins (2005) MUST be used to estimate statistical significance. This is not a matter of subjective model choice but, rather, of selecting a model that can be demonstrated as capturing the inherent behavior of the process in question. REML, and all other simple linear models, do not capture the observed temporal behavior of land surface temperature, sea surface temperature, precipitation, and any other hydro-climatic variable.

The 4AR draft is reporting statistical significances that are known to be gross overestimates. To address this problem, the authors have two choices. One is to recalculate the statistical significance estimates of all variables for which significance is currently reported using a procedure such as Cohn and Lins’ (2006) Adjusted Likelihood Ratio Test that is specifically designed for use with data exhibiting long-term persistence. Alternatively, the report could retain all of the current information regarding trend magnitude (which Cohn and Lins document as being insensitive to the method used to estimate it), but remove all reference to statistical significance — in text, tables and figures. Indeed, the latter option may be desirable because, as noted by Cohn and Lins, “it may be preferable to acknowledge that the concept of statistical significance is meaningless when discussing poorly understood systems.”
[Govt. of United States of America (Reviewer’s comment ID #: 2023-132)]

Again, the IPCC authors rejected the comment, alleging that the Cohn and Lins method is “likely wrong” but without providing any reference for this allegation. They say that they “looked into the issue”, but do not cite any peer-reviewed literature supporting the allegation of error. The only grey literature on the matter was Rasmus’ laughable post at realclimate, where Rasmus quickly got into heavy weather with absurd responses to commenters, resulting in Gavin putting him in the penalty box and taking over the file directly. So this is not an imposing authority either.

👿 Rejected, but change made. After already looking into this issue it is apparent that the Cohn and Lins method is likely wrong and misrepresents statistical significance by overestimating long term persistence. There is no known paper showing these are improved models. We have computed the Durbin Watson statistics for all series and none suggest that residual long term persistence is present. It does NOT mean the simple models are in error. Lines 54-56 redone.

I presume that the U.S. Government comment incorporated the comments of USGS employees Cohn and Lins. Thus, out of the “2500+” reviewers, the only ones who touched on the issues of statistical significance in this table in the Second Draft were people familiar to CA readers: Ross McKitrick and (presumably) Cohn and/or Lins. Even they did not pick up on the novel use of the Durbin-Watson test “after allowing for first-order serial correlation”. And their comments were rejected, primarily through bullying rather than citation of authoritative statistical literature.

AR4
The Table 3.2 caption (and Table 3.3 is similar) in the final version is pretty much identical to the Second Draft version:

Table 3.2. Linear trends in hemispheric and global land-surface air temperatures, SST (shown in table as HadSST2) and Nighttime Marine Air Temperature (NMAT; shown in table as HadMAT1). Annual averages, with estimates of uncertainties for CRU and HadSST2, were used to estimate trends. Trends with 5 to 95% confidence intervals and levels of significance (bold: <1%; italic, 1–5%) were estimated by Restricted Maximum Likelihood (REML; see Appendix 3.A), which allows for serial correlation (first order autoregression AR1) in the residuals of the data about the linear trend. The Durbin Watson D-statistic (not shown) for the residuals, after allowing for first-order serial correlation, never indicates significant positive serial correlation.

But elsewhere the counter-attack against persistence was intensified. Remember the caveat in the Second Draft about persistence, originally introduced in response to comments on the First Draft:

Determining the statistical significance of a trend line in geophysical data is difficult, and many oversimplified techniques will tend to overstate the significance. Zheng and Basher (1999), Cohn and Lins (2005) and others have used time series methods to show that failure to properly treat the pervasive forms of long-term persistence and autocorrelation in trend residuals can make erroneous detection of trends a typical outcome in climatic data analysis.

In the final version, this is gone. There was no Review Comment objecting to this caveat; the deletion was entirely the authors’ own doing, presumably relying on their assertion (based on the absurd Rasmus internet post at realclimate) that Cohn and Lins were “likely wrong”. Instead of the sensible caveat of the Second Draft, at AR4 page 242 they introduced new language – never submitted to reviewers – which stated:

242 – In Table 3.2, the effects of persistence on error bars are accommodated using a red noise approximation, which effectively captures the main influences. For more extensive discussion see Appendix 3.A.

While they deleted the sensible caveat (even though no reviewer objected to it), they retained (page 336) the disputatious paragraph in the Appendix criticized by reviewer McKitrick without changing a comma:

As some components of the climate system respond slowly to change, the climate system naturally contains persistence. Hence, the statistical significances of REML AR1-based linear trends could be overestimated (Zheng and Basher, 1999; Cohn and Lins, 2005). Nevertheless, the results depend on the statistical model used, and more complex models are not as transparent and often lack physical realism. Indeed, long-term persistence models (Cohn and Lins, 2005) have not been shown to provide a better fit to the data than simpler models.

Summary
So we have an interesting situation in which the IPCC introduced an implausible statistical test without any statistical authority, and the IPCC authors cited this test to reject reasonable criticism. As a result, reasonable review comments were rejected without the authors providing any valid reason. The authors also deleted a sensible caveat included in the Second Draft – without any review comments objecting to the caveat – and inserted in its place an unsupportable claim about their statistical methodology, never submitted for review. The upshot is that one does not have an authoritative and independent review of the important issues relating to long-term persistence in climate series, but something that does not rise much above a realclimate posting.

In assessing the review process here (as opposed to the conclusion), despite the supposed wonders of the IPCC review process by “2500+” reviewers, the actual review appears to be weaker than the review for an ordinary journal article (unless one is talking about something like Stephen Schneider editing Wahl and Ammann). There is negligible participation by the “2500+” IPCC reviewers, with the primary review coming from McKitrick and (presumably) Cohn/Lins. While we have not seen what the review editors said, the evidence is that they passively acquiesced in the bullying responses by the lead authors and even permitted them to make changes to the Second Draft expounding their own point of view. All in all, read through in sequence, the handling of these review comments by the lead authors is a very unsavory spectacle.

The 66% refers to the AGW in the past. There are other events in the warming trend, so they are not entirely sure which parts are anthro and which parts are non-anthro. But they understand the relationship between CO2 and temp well, so in the future, as CO2 increases, they have high confidence that extreme events will be clearly distinguished from the background signal fluctuations.

It isn’t a case of “we know the future better than we know the past” in general, which seems to be your confusion. This isn’t “in general”; it is specific to separating the human-contributed events from the background variation.

Re #1:
I think because of their belief in the effects of increasing CO2 on climate, in which they have a “very strong” belief. Their knowledge of the actual climate is not so strong as their belief in CO2 effects, so it gets marked down to “likely”. Remember, these are not objectively determined statistical probabilities, but measures of personal confidence (belief).

Re the main topic:
The main editors should not be involved in any of the research used. This conflict of intellectual interest makes the quality of the report lower than that of the peer reviewed literature, which itself is far from perfect.

Ken:
But surely #1 actually has a point in that it assumes that AGW is a major component of any warming trend in the past and in the future. If, for example, AGW is actually a small though measurable component in the past, the assertion that general “extreme” events will be more frequent in the future seems debatable to say the least. For the second statement to be valid, they must claim that (a) AGW is significant and not likely to be swamped by other effects which could feasibly move in the opposite direction, readily offsetting a “small but measurable AGW” effect, and (b) it will continue. Surely (b) on its own is insufficient for the assertion? Doesn’t the relative contribution of AGW compared to other factors need to be specified?

Steve, Thanks so much for the work you put into this post. It reads to me like a publication. Sonja Boehmer-Christiansen was looking for something about the IPCC recently, it strikes me this case study on how the IPCC consensus is achieved on an important issue may be appropriate for E&E. Not saying you should, just that it stands out like that.

Ken
What troubles me most and makes no sense to my layman’s mind is this…

It is my understanding that the Mann graph of our climate over the last 150 years is the starting point for most of the climate models.
The Mann graph details an “area of uncertainty” of almost .5 degrees.

I read this area of uncertainty as a plus or minus, i.e. the temperature in 1900 was X plus or minus .5 degrees. So all these climate models that use the Mann data to test are in fact starting with data that is, to say the least, not entirely accurate when one considers that we are talking about a .6 degree so-called climate change in the last few decades. So if the starting data has a large margin of error, and the climate models calibrated with that data have their own built-in error margin, the 66% confidence level of past AGW would in itself seem high. And if, when models are tested against past climate, they generate only a 66% confidence level, it would seem to me that to state any level of confidence above 66% for the future, when the answer is not known, would be unattainable, if one is being honest.

Ken says “But they understand the relationship between CO2 and temp well..”

I wasn’t sure if Ken was being ironic. The IPCC does not understand the relationship between CO2 and temp; otherwise it would acknowledge the non-linear nature of the impact of increasing CO2, it would be able to explain why we have had warming from the LIA with no increase in CO2, why all the models that are programmed to believe an increase in CO2 means an increase in temperature predict warming when the satellite record shows virtually no recent change at a time when CO2 has climbed, why none of the models have any useful predictive skill, and why the IPCC removed an original real 1000-year temperature record from the second report and replaced it with the HS in AR3, which denied the historical record that did not require dubious statistical spaghetti to show that it was warmer than today 1000 years ago.

By the way, our old friends RMS announced £500 million for damage caused by heavy rain over the past two weeks in the north of England. Two days later, the Asscn of British Insurers reckon more like £1.5 bn. Strange, though: reference to AGW as the cause of the rain is more muted than usual. Regrettably, it looks like the sun has come out for the Gorefest Livearth concert.

In the absence of the IPCC formalising a “dissenting report” (a customary procedure for events like Parliamentary Inquiries) I have several times suggested that active people in each country contact their homeland authors and ask them for public comments, be they affirming or dissenting, then compile and publish them. I’m getting encouragement to do this in Australia.

Some contributing authors stick doggedly to the Party line, but there is a discernible undercurrent of dissent and disagreement. Tap it in your own country and report.

(1) What is Restricted Maximum Likelihood? I know what Restricted Least Squares is and what Constrained Maximum Likelihood is, but I have not heard of Restricted Maximum Likelihood.

(2) Am I reading the post correctly regarding the procedure used in chapter 3?
i) estimate the trend (from a univariate series with no explanatory variables but time?) by REML;
ii) calculate the residuals;
iii) clean the AR1 from the residuals (regress the residuals on an AR1 or lag term);
iv) take the second set of residuals and calculate a Durbin-Watson statistic.
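If I’ve read steps (i)–(iv) correctly, the whole thing can be sketched in a few lines. This is only my reconstruction under stated assumptions — in particular I substitute plain OLS for the REML trend fit, since the exact REML setup is undocumented — run on a synthetic trend-plus-AR(1) series:

```python
# Sketch of steps (i)-(iv) as I read them. OLS stands in for REML here
# (an assumption: the actual Ch. 3 setup is undocumented).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic series: linear trend plus AR(1) noise with rho = 0.5
n, rho = 120, 0.5
e = rng.standard_normal(n)
u = np.zeros(n)
for t in range(1, n):
    u[t] = rho * u[t - 1] + e[t]
t_idx = np.arange(n, dtype=float)
y = 0.005 * t_idx + u

# (i) fit the trend (OLS stand-in for REML)
X = np.column_stack([np.ones(n), t_idx])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta                       # (ii) the residuals

# (iii) clean the AR1: regress resid[t] on resid[t-1]
rho_hat = (resid[1:] @ resid[:-1]) / (resid[:-1] @ resid[:-1])
innov = resid[1:] - rho_hat * resid[:-1]   # "second set of residuals"

# (iv) Durbin-Watson on those second residuals
dw = np.sum(np.diff(innov) ** 2) / np.sum(innov ** 2)
print(round(rho_hat, 3), round(dw, 3))
```

On data like these the final DW lands near 2 almost by construction: the AR(1) sweep in step (iii) has already removed the first-order correlation the statistic measures.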

If that is the procedure, it is worse than undocumented. It’s nutty. If one sticks a first-order lag or AR term in a regression and then estimates (iterated linear or non-linear), the D-W (I) statistic is algebraically driven to around 2 [the D-W stat is approximately 2*(1 − rho hat), and the residuals with a first-order lag term have the rho-hat effect removed]. If the term “allowing for” the first-order serial correlation means correcting for it, then the calculation of the D-W (I) statistic is nothing more than a pro forma statement of the correction. It says nothing about other serial correlation properties, which as I understand it is the pertinent issue. McKitrick’s point about the other tests of serial correlation doesn’t just apply here; it has applied for over three decades. The Box-Pierce Q statistic (for the correlogram) dates from around 1970 (with the now-standard Box-Ljung about 10 years later). A look at those statistics (on the correlogram of the residuals after removing the first-order autocorrelation) shows that most temperature series are doing “something else” (to borrow Ilie Nastase’s words about what he and most other tennis pros were doing versus what Bjorn Borg was doing). Whether the “something else” is fractional differencing, long and complicated ARMAs, or whatever, is both a valid and important question, but from what is shown in the post it is being vigorously ducked.
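The algebraic relation invoked here is easy to confirm numerically: for a residual series x, DW ≈ 2·(1 − rho-hat), with rho-hat the lag-1 autocorrelation, the approximation failing only through end effects. A quick check on simulated persistent residuals:

```python
# Numerical check of the identity DW ~ 2*(1 - rho_hat) on a
# first-order persistent series.
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(200)
for t in range(1, len(x)):       # give the series AR(1) persistence
    x[t] += 0.6 * x[t - 1]

dw = np.sum(np.diff(x) ** 2) / np.sum(x ** 2)
rho_hat = (x[1:] @ x[:-1]) / (x @ x)
print(round(dw, 3), round(2 * (1 - rho_hat), 3))  # nearly identical
```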

(3) Were the D-W stats — D-W (I) I presume? — that you received from Phil Jones really greater than the negative correlation upper bound (2.6 or something like that, right?)?

REML is used in mixed-effects modeling – see Pinheiro and Bates. It’s mentioned in Diggle, a real statistician, not the Team. But the context here is different since there’s no mixed-effects modeling. I think that they allow for AR1 residuals in fitting the trend, an option in the R nlme package, so that they simultaneously fit a trend and an AR1 coefficient. Then they apply the above procedure to these residuals. The DW stats are typically above 2, but not above 2.6. Phil Jones agreed that this “nutty” procedure was what they used, but I would like to see the code to ensure that it is what they actually used, as I’m not sure that Jones really knows.

After all, it’s just an IPCC report and there have been “2500+” reviewers. Why would anyone expect anyone to know what they did?

SteveM, thanks for another fascinating post. Once again, you have hit the nail squarely.

I, too, was somewhat surprised by the AR4 statement that

…long-term persistence models (Cohn and Lins, 2005) have not been shown to provide a better fit to the data than simpler models.

When I first read this, it seemed that such a bone-headed dismissal of models employing LTP likely reflected ignorance rather than deliberate misrepresentation. Not everyone is well versed in the Hurst phenomenon, and I accept that some people remain skeptical about things they do not yet understand.

However, having now seen Ross’s lucid and constructive review comments on the early drafts, I’m not sure what to think.

First, as was noted by Ross, there is an enormous body of literature (going back to Hurst [1951], with hundreds of references in the 1960s and 1970s, and still growing rapidly) showing that LTP models do indeed provide a better fit to data arising from this class of physical processes (see Koutsoyiannis’s recent papers and references therein). If the IPCC authors were unaware of this literature, Ross’s review comments should have set off alarm bells.

Second, also along the lines of Ross’s comments, I am perplexed by the assertion that LTP models lack simplicity. For the case at hand, the LTP model involves adding a single parameter (specifically the Hurst coefficient, H, which is equivalent to the fractional differencing coefficient, d) to the model. Model complexity is usually quantified as the number of model parameters (degrees of freedom), and by this metric the LTP model is no more complex than the simplest iid linear trend model.

In any case, simplicity is not the primary goal for a model. First, the model must reflect the salient features of the physical system (i.e. “models should be as simple as possible, but no simpler“). The reason to reject the “simpler models” — those apparently endorsed by IPCC — is that they fail to do so.
Dozens of researchers have shown this.

As SteveM has noted, the worrisome aspect of this story is the failure of the IPCC review process. Given Ross’s reviews, the authors should have produced a more thoughtful report.

“It is very likely (IPCC defines “very likely” as 90% probability) that hot extremes, heat waves, and heavy rains will continue to become more frequent.”

Is there any evidence that any of these weather events have become more frequent, much less “continue” to do so? If so, more frequent as compared to what baseline? And why is that baseline the “correct” one?

The Use of the Durbin-Watson Statistic as a Goodness-of-Fit Test for the Linear Model

I hope the following may be of some value on this thread. The above is the title of Chapter 2 of my 1978 doctoral thesis. I attempted to get it published in JRSS series B and Biometrika, but failed. So it is science, just not published science🙂

The thrust of the chapter is to examine what happens if the Durbin-Watson statistic is applied to residuals from a general linear model in which they are defective, not by serial correlation, but by non-identicality arising from a low frequency function missing from the model. The chapter shows how to compute the power of the test against any such component, and that it is quite a powerful test. Of course, against a specific extra (alternative) component, it is not quite as powerful as the F-ratio, but detects a wider range of such components.
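The phenomenon is easy to reproduce with toy data (my own synthetic example, not the thesis data): fit a straight line where the truth is quadratic with i.i.d. errors, and DW on the residuals comes out far below 2 even though the errors are entirely uncorrelated — the missing low-frequency component, not serial correlation, drives the statistic.

```python
# Residuals from a straight-line fit to quadratic data with i.i.d.
# errors: a "significant" DW with no true autocorrelation anywhere.
import numpy as np

rng = np.random.default_rng(2)
n = 100
x = np.linspace(0, 1, n)
y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * rng.standard_normal(n)

# straight-line fit only: the quadratic term is the missing component
X = np.column_stack([np.ones(n), x])
resid = y - X @ np.linalg.lstsq(X, y, rcond=None)[0]
dw = np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)
print(round(dw, 2))  # far below 2, well past any lower DW bound
```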

The importance to this thread is as follows. The “IPCC method”, as per previous discussion on the thread, seems to assume that a significant DW result arises from errors which are autoregressive of degree 1, then estimate the coefficient in that process, then remove the effect from the residuals, and finally declare (through a second DW) that they are free of correlation. Exactly how they adjust the “error bars” in the estimates of slope and intercept as a result of this procedure is not clear, apparently to me or anyone else on thread. However, even if it is done correctly for that correlated model of the errors, I argue that it is irrelevant if the errors instead suffer from a missing component. For, a missing component both changes the estimates of the original components and changes the projections if an extrapolation is essayed.

At my viva voce I said that it would be great to have a statistical test which distinguished between this rough confounding of correlation and non-identicality. The external examiner opined that that would be very difficult. My day job is not principally in statistics, but I kept a close eye on stats journals up to the mid 80’s and saw nothing on this in that period. Perhaps other correspondents could confirm whether or not there has been anything further on this since then.

As an example of what can happen, Chapter 1 of my thesis analyzed some published data which were claimed to conform to y = bx + cx². The high significance of DW, after amalgamation over 95 experiments, persisted when a cubic term was added, but disappeared when a non-zero intercept was allowed (it had been excluded because (0,0) was a canonical point in the experiment). The conclusion was that it was best to assume this simple model, rather than to assume correlated errors.

My subjective belief is that for data where the DW statistic is significant, unless there is an a priori reason why true serial correlation should be present, a missing component is a far more likely explanation.

Looking, for example, at the “Juckes Union” (http://www.climateaudit.org/?p=945) data residuals, my eye suggests that a negative quadratic term would help, and if done orthogonally to the linear term the latter coefficient would actually increase. Thus we would have a sharper rise fitted, but leading to a flattening off which, if-we-extrapolate-which-is-dangerous-but-what-the-linear-alarmists-do-anyway, leads to a decline. That’s just my eye – someone should actually do the regression.

To summarize, the DW test can detect many sorts of departure from i.i.d. residuals. It is therefore a flawed process to correct them via an AR(1) process without consideration of more plausible alternatives, in particular modest but important curvature in the dependent variable. What are the data really trying to tell you?

See (#19), you make an interesting and important point that the DW test is sensitive to a

low frequency function missing from the model.

If one were only interested in testing model validity, it might not matter whether model mis-specification were due to an omitted variable or because of unmodeled time-series structure in the noise. However, in general it seems that one would want to know.

Incidentally, I’m not sure I understand:

…my eye suggests that a negative quadratic term would help, and if done orthogonally to the linear term the latter coefficient would actually increase…

IIUC, (properly constructed) orthogonal functions allow one to include higher-order predictors without affecting the coefficients on the lower-order predictors. However, the statistical significance of coefficients will typically improve. Is that what you mean?
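A small numerical check of this point, using numpy’s Legendre design matrix as the (approximately) orthogonal basis — an illustration of the general claim, not of the Juckes data:

```python
# With orthogonal regressors, adding a higher-order term leaves the
# lower-order coefficients (essentially) unchanged. The discrete grid
# makes the Legendre basis only approximately orthogonal, so the
# coefficients move negligibly rather than not at all.
import numpy as np

rng = np.random.default_rng(3)
x = np.linspace(-1, 1, 201)
y = 0.3 + 1.2 * x - 0.7 * x**2 + 0.1 * rng.standard_normal(len(x))

def ls_coeffs(degree):
    # design matrix of Legendre polynomials P_0 ... P_degree
    X = np.polynomial.legendre.legvander(x, degree)
    return np.linalg.lstsq(X, y, rcond=None)[0]

b1 = ls_coeffs(1)   # constant + linear only
b2 = ls_coeffs(2)   # add the quadratic term
print(b1[:2], b2[:2])  # lower-order coefficients nearly identical
```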

TAC: you may call me “Rich” (sorry about the silly pseudonym but I’m sticking with it).

I think you’re right. What I was doing was assuming that the Juckes residuals resulted from a straight line regression, so the next component would be quadratic. But, by eye admittedly, I felt that the quadratic could not be symmetric about the centre, which it would have to be to be orthogonal to a linear component, but wanted to be higher on the right. But that would mean it included a linear trend as well, which would have been picked up by the straight line regression.

On reading the article on p=945 more carefully, I see that linear regression is not mentioned, just residuals between a model and data. I am concluding that I would actually expect a bx + cx² fit to the residuals to reduce the error sum of squares quite a lot, with b > 0 and c < 0.

As I said before, someone should do the regression – if pointed at the data I might even do it myself.

I am slightly disappointed Steve didn’t pick up on this, but I can see he’s been very busy, and perhaps he hasn’t even seen it.

It seems that CA has recently been chasing down data problems and focusing less on mathematics and statistics. This is not a bad thing: Until we get a decent dataset, I’m not sure it makes much sense to worry about how we do our analyses. I, for one, am amazed at how bad the data are.😦

In any case, I think your point about the DW statistic and specifically how it applies to long-term persistence (“LTP”) is correct.

Your point is also quite general. It is not usually easy to figure out why a stochastic process exhibits LTP. Particularly where one can identify plausible external forcings (e.g. the Sun) and complex internal dynamics (e.g. Navier-Stokes processes), it may be impossible to tell. And your point is important from a practical perspective: It goes to the heart of the AGW/NGW (“NGW” = “Natural Global Warming”) “attribution” debate.

As you know, in practice statisticians are often comfortable attributing unexplained variability to “noise” — it is part of their training — while non-statisticians tend to look for an “undiscovered” causal factor. I tend to view this latter habit as potentially hazardous. In the past, it has led to some unfortunate incidents when individuals or groups have latched onto something — sometimes purely physical, but also sometimes involving religious, moral or tribal attributes — as the explanation for some observed effect. In some cases, despite overwhelming evidence to refute the original hypothesis, beliefs created in this manner can persist for a very long time. Incidentally, one symptom of this approach is the “what else could it be?” argument, which shows up with some frequency in the AGW debate.

Nassim Taleb has published a couple of wonderful books, including most recently The Black Swan, which touch on this topic.

Of course, the list of “things it could be” is infinite: As you point out, omitted variables (both exogenous “forcings” and endogenous interactions); Natural dynamics; Measurement error; etc. Nature is replete with scaling phenomena; it is unusual to encounter natural systems that do not exhibit LTP.

Finally, if you have not done so, I would recommend taking a look at Demetris Koutsoyiannis’s recent papers. He has given a lot of thought to both the physical and statistical aspects of natural systems, and he knows a great deal about LTP.

Within the context of an AR(p) model for the residuals, I don’t see that the IPCC test, as clarified by Jones and Parker, is all that outrageous.

I’ve been playing with data from Port Columbus International Airport, one of the newly-revealed CRU stations and one that was used recently by my OSU colleague Jeff Rogers as demonstrating a warming trend here in Ohio. The last decade is a little warmer than 1948-57, the first in the series, but not statistically so when just comparing decadal means. So I tried a linear trend line.
OLS gives a positive and nearly significant trend (t = 1.938), but a DW of 1.380, well below the 1.55 1-tailed 5% lower bound for the sample size (58) and number of regressors (1). So I had EViews estimate an AR(1) term by ML, which is like the old Cochrane-Orcutt method but a little fancier in its treatment of the first observation.

The trend was still positive but now insignificant (t = 1.560, p = .125), but EViews still gave me a DW, now 1.955. I assume this was the DW for the innovations of the AR(1) equation, and so would be a test for the presence of further, i.e. AR(2) serial correlation in the residuals about the time trend.

It’s apparently true (for reasons I don’t comprehend) that the presence of a lagged dependent variable (here equivalent to the AR(1) term) invalidates the usual interpretation of the DW stat, but 1.955 is so close to 2 that it’s hard to see that there would be any second order serial correlation. Just to be safe, I tried adding an AR(2) term to the equation. Sure enough, it was zilch (t = -.054, p = .957). The distribution of AR terms is not as simple to interpret as that of ordinary regressors, but the problems that arise are biggest in the neighborhood of a unit root, not zero, so dropping the AR(2) terms seems to be justified here.
I suppose the canonical thing would be to transform the DW into a Durbin’s h-stat if there were a close call. However, just using the DW directly doesn’t seem to be unreasonable in this case, and is certainly a lot better than just ignoring the possibility of higher order autocorrelation.

Even though EViews reports the DW for the AR(1) regression, this does not necessarily mean that it approves of its use, since it reports a DW for every regression it runs, even one it knows is not a time series!

This is not to say that the trend standard errors were necessarily computed correctly in an AR(1) context in Table 3.2, so it’s certainly worth trying to replicate them to see how they stand up. Also, it may be that a fractionally integrated Long Term Persistence model of the type advocated by Cohn and Lins might not be appropriate — I’ve heard a lot about such models, but just don’t know anything about estimating them yet. I’ll have to look at C&L.

BTW, Steve McI in his introduction above writes

The actual DW statistics, promised in the Review Comments, were not shown. Phil Jones sent them to me on request and the relevant DW statistics (for what they are worth) are all above 2 – which is actually showing significant negative autocorrelation. Indeed, DW statistics higher than 2 are 95% significant for negative autocorrelation.

This is wrong — the upper critical values for DW are rarely tabulated, since DW were originally more concerned about positive serial correlation (which makes OLS standard errors too small) than about negative serial correlation. However, the distribution is approximately (though not exactly) symmetrical about 2, so that the upper bounds of a two-tailed 10% test for zero serial correlation would be roughly as far above 2 as the conventional one-tailed 5% bounds are below 2. Just being above 2 doesn’t indicate anything in itself except the absence of any evidence of positive serial correlation.
(The exact DW critical values depend on sample size, number of regressors, and also the serial correlation of the regressors themselves. Although it is possible to compute these exactly — and some packages like SHAZAM will do this — most people just use the upper and lower bounds reported by DW for the best and worst cases. A DW below the lower value rejects 0 serial correlation (on a 1-tailed test versus positive s.c.), a DW above the upper value can’t reject 0 serial correlation, and a DW between the two bounds may or may not reject depending on the exact distribution. Here the test is “inconclusive” for those (like me) who are too lazy to compute the exact critical value.)

Although EViews 6.0 does thrust the DW in your face when you run a regression with an AR(1) term added, the new manual does in fact say that you should ignore it and instead look at the Q-stats or Breusch-Godfrey test. (Vol. II, p. 66). So it is safe to say that IPCC’s use of DW is outdated and is discouraged by current econometric practice, at least. At a minimum, they should have looked at Durbin’s h-stat.

However, long-term memory (i.e. fractional differencing), which may be pertinent, raises much deeper issues than are considered by routine econometric analysis at present. I’m very interested, but really don’t understand this at the moment.

The treatment is only as good as the data permit. You did not mention, but which version of the Port Columbus data did you use? Remember that much of the record now has one or several “adjustments” made to it and the thought of measures like serial correlations of already crunched data gives me concerns. Geoff.

Anyway, I think the main thing to take from the initial DW stat is a big warning sign. Once you add a lagged dependent variable, DW gets biased towards 2, so a DW test “after allowing for first order serial correlation” is a bit of an odd thing to talk about. Nine times out of ten adding a lagged dependent variable will make the DW stat ‘clear up’ – but that doesn’t mean you’ve necessarily dealt with the underlying problem.

Hu (#23, #24): I’m not sure I follow your reasoning. First: Has anyone actually studied the properties of the DW statistic as a test for LTP? At a glance, DW — which employs correlation between adjacent pairs of time-series data — would seem ill-suited to the task of detecting LTP because the essential property of LTP is non-vanishing correlations between distantly separated observations. Finite-parameter stationary ARMA models cannot reproduce this property (to be clear, there are ARMA models (e.g. ARMA(1,1), with AR parameter chosen close to 1 and the MA parameter close to -1) that exhibit interesting LTP-like properties in finite samples).
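The ARMA(1,1) point can be made concrete from the closed-form autocorrelation function, rho(1) = (1 + φθ)(φ + θ)/(1 + 2φθ + θ²) and rho(k) = φ^(k−1)·rho(1). With illustrative values φ = 0.97 and θ = −0.9 (my choice, not from any fitted model), the lag-1 correlation is small, yet it decays so slowly that substantial correlation survives to lag 20, while an AR(1) matched at lag 1 has vanished entirely:

```python
# Theoretical ACF of an ARMA(1,1) with AR parameter near 1 and MA
# parameter near -1: modest lag-1 correlation, then very slow decay --
# superficially LTP-like over finite samples.
phi, theta = 0.97, -0.9

def arma11_acf(k):
    r1 = (1 + phi * theta) * (phi + theta) / (1 + 2 * phi * theta + theta**2)
    return r1 * phi ** (k - 1)

r1, r20 = arma11_acf(1), arma11_acf(20)
ar1_r20 = r1 ** 20   # AR(1) with the same lag-1 correlation, at lag 20
print(round(r1, 3), round(r20, 3), ar1_r20)
```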

To return to SteveM’s original question: Is anyone aware of research supporting use of the DW statistic as a test for LTP?

RE TAC #27:
I don’t disagree at all. DW just tests for AR(1) “short memory” serial correlation. (Even if rho is near 1, the memory is said to be “short” because the autocorrelations fall off geometrically with distance.)

My point was just that the IPCC’s using it to test for AR(2) serial correlation is a fairly obvious, if naive and invalid, thing to do.
I don’t think their wording on Table 3.2 claims that their DW is testing for LTP. They merely dismiss LTP as raised by Cohn and Lins out of hand, and then use their DW (incorrectly) to test for AR(2). Their wording isn’t terribly clear, however, whence the confusion. They can be legitimately faulted for using DW instead of Durbin’s h (or some other valid test), but they would not be the first to make this mistake.

Durbin (1970) showed that DW gives inconsistent results when applied to the residuals of an AR(1) model, and proposed instead his h statistic. This converts DW into an approximate serial correlation coefficient using rho ≈ 1 − DW/2, and then jacks it up by a factor which is the square root of a certain quotient. This has an asymptotic N(0,1) distribution.

Many people are frightened away from Durbin’s h by the fact that the denominator of the quotient can be negative, resulting in an imaginary test statistic that is hard to place on a table of normal critical values. However, as the denominator approaches 0, the statistic approaches infinity, and hence clearly indicates a reject. My view is that a negative quotient should thus be interpreted as giving a statistic that is “beyond infinity”, and therefore an even easier reject.
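A direct coding of Durbin’s h as just described, with a negative quotient treated as the “beyond infinity” automatic reject. The sample values fed to it below are illustrative only, not taken from any regression in this thread; var_lag stands for the estimated variance of the coefficient on the lagged dependent variable.

```python
# Durbin's h: rho_hat = 1 - DW/2, scaled by sqrt(n / (1 - n*var_lag)),
# asymptotically N(0,1). A non-positive denominator is read as an
# automatic rejection ("beyond infinity").
import math

def durbin_h(dw, n, var_lag):
    rho_hat = 1.0 - dw / 2.0
    denom = 1.0 - n * var_lag
    if denom <= 0:
        return math.inf   # imaginary/undefined case: clear reject
    return rho_hat * math.sqrt(n / denom)

# e.g. DW = 1.955 from a 58-observation AR(1) fit, var_lag = 0.01
print(round(durbin_h(1.955, 58, 0.01), 3))
```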

Inference and testing with “long memory” or fractional differencing (where the correlations die out slowly, as a power function rather than geometrically) has long baffled me. However, on thinking about it over the past couple of days, I think the following brute-force method would work for an ARFIMA equation of the form P(L)(1 − L)^d y = Q(L)e: First, figure out the autocorrelation matrix in terms of d and the variance of e. (There are messy equations somewhere for the FI part of this.) Then compute the likelihood as a function of d, var(e), and the P and Q coefficients, and maximize it numerically. The null hypothesis d = 0 is on the boundary of the parameter space, so the LR statistic for it doesn’t have the usual chi-square distribution, and in any event, if the y’s are the residuals of a first-stage regression (e.g. of temp on a time trend, CO2, or tree rings), d and/or the AR coefficients will probably be biased downwards. However, all the parameters can be median-unbiased along the lines of the Monte Carlo method of Andrews (Econometrica 1993), something I have gotten to work recently in the AR(p) case. This procedure should also give an exact finite-sample p-value for a null such as d = 0 (conditional on the median-unbiased estimates of the other coefficients).

What was stumping me was how to write out a recursive one-pass formula for the likelihood. Evidently there just isn’t one! In fact, the only way to simulate such a process (as required by Andrews’ method) is to use the Cholesky decomposition of the n × n autocorrelation matrix.
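For the pure fractional-noise case ARFIMA(0,d,0), the Cholesky route can be sketched directly, using the standard recursion rho(k) = rho(k−1)·(k−1+d)/(k−d) for the autocorrelations. This is the brute-force O(n³) approach, workable for moderate n:

```python
# Simulate ARFIMA(0,d,0) by colouring i.i.d. normals with the Cholesky
# factor of the exact autocorrelation matrix.
import numpy as np

def simulate_fd(n, d, rng):
    rho = np.ones(n)
    for k in range(1, n):                     # exact FD autocorrelations
        rho[k] = rho[k - 1] * (k - 1 + d) / (k - d)
    # Toeplitz correlation matrix and its Cholesky factor
    idx = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    L = np.linalg.cholesky(rho[idx])
    return L @ rng.standard_normal(n)

rng = np.random.default_rng(4)
x = simulate_fd(500, 0.3, rng)
# lag-1 sample autocorrelation should sit near the theoretical d/(1-d)
r1 = (x[1:] @ x[:-1]) / (x @ x)
print(round(r1, 3))
```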

RE Geoff Sherrington, #25:
I used SOM 3220 data from NCDC, which is relatively unadjusted. (Areally at most). This is not a USHCN site, and so doesn’t get the full battery of CDIAC adjustments. In any event, it has always measured at midnight, so there is no TOB to worry about, and apparently has always used glass thermometers so there is no MMTS adjustment to make.

Wooster OH, which is also in CRU and is relatively unurbanized, shows a distinct downtrend since 1920 or 1930 in the 3220 data. I haven’t tried to test it for any kind of significance yet, though, as I’m still trying to figure out how to get data through 2006 with and without all the adjustments. (It switched from midnight to 0700 and then 0800, which is a relatively small TOB, but I might as well use the official adjustments for this.)

I’m guessing that the downtrend for Wooster will also be insignificant, but that the difference between the two stations since 1948 will have a significant trend, indicating progressive urbanization of the Columbus airport site, and therefore the unsuitability of it and stations like it for CRU.

What was stumping me was how to write out a recursive one-pass formula for the likelihood. Evidently there just isn’t one!

Hosking [1984] presents a remarkably accurate approximation to the FARIMA (arfima) likelihood function that avoids using the full correlation matrix. Although this amounts to just a numerical shortcut — the Cholesky-decomposition approach will work — it is of practical importance: Long-term persistence may require keeping track of thousands to millions of lagged correlations, and the corresponding correlation matrix thus may have millions to trillions of elements.

Persistence and Time Trends in the Temperatures in Spain
Luis A. Gil-Alana

Abstract:

This paper deals with the analysis of the temperatures in several locations in Spain during the last 50 years. We focus on the degree of persistence of the series, measured through a fractional differencing parameter. This is crucial to properly estimate the parameters of the time trend coefficients in order to determine the degree of warming in the area. The results indicate that all series are fractionally integrated with orders of integration ranging between 0 and 0.5. Moreover, the time trend coefficients are all positive though they are statistically insignificant, which is in contrast with the results based on nonfractional integration.

So persistence matters after all when deciding whether a trend is significant. The IPCC’s treatment of this issue was a real scam. Remember that after the 2nd review round they had inserted the following paragraph based on reviewer comments (who were drawing from peer-reviewed literature) to which the lead authors had no response except hand-waving and denial:

Determining the statistical significance of a trend line in geophysical data is difficult, and many oversimplified techniques will tend to overstate the significance. Zheng and Basher (1999), Cohn and Lins (2005) and others have used time series methods to show that failure to properly treat the pervasive forms of long-term persistence and autocorrelation in trend residuals can make erroneous detection of trends a typical outcome in climatic data analysis.

Then with no authorization this paragraph was subsequently deleted from the published IPCC report. They can remove the text about persistence from the IPCC report, but they can’t remove the persistence from temperature data.

With respect to the time trend coefficients, the values are positive though they are statistically insignificant, which is in contrast with the results based on I(0) disturbances. The fact that the time trend coefficients are found to be insignificant in our work does not necessarily rule out the hypothesis of warming in the temperatures since this can be a consequence of the sample size used or even misspecification of the model.

If someone has access to the IMPROVE-series, I think it might be worth repeating the experiment with that dataset.
