McShane and Wyner Discussion

McShane and Wyner is being published as a “discussion paper” and has attracted numerous submissions so far, including a discussion by Ross and me which has been accepted. As readers have noticed, discussions by Schmidt, Mann and Rutherford and by Tingley are online. Other submissions have been made by Wahl and Ammann and by Nychka et al.

Too much discussion so far has focused on whether MW’s short history of the dispute has captured all its nuances. I presumed that they were merely trying to set the table in a few pages, but, needless to say, this short history has attracted controversy (which I’m not going to dwell on today). It’s too bad that climate scientists have focused so much on this minor aspect of the paper. Most of the rest of the attention has been paid to their “own” reconstruction later in the paper, about which I will comment on another occasion.

Today I’m going to comment on the analyses in their sections 3.2 and 3.3, which are, in effect, an extended commentary on benchmarking the RE statistic – though few, if any, climate scientists seem to have grasped this point thus far, in part because MW do not clearly link their analysis to this issue, even though the relationship is quite clear.

MW observe that the climate science practice of assuming that the “proxies” are a “signal” plus low-order AR1 noise is not one that is supported by observed AR1 coefficients – see their Figure 4. This is a point familiar to CA readers. They observe that pseudoproxy tests limited to low-order AR1 noise insufficiently replicate the properties of observed proxies, and they conduct further simulations using “empirical AR1” coefficients and Brownian motion (random walk) pseudoproxies.

Their test template is to calculate holdout (i.e. verification) RMSE (root mean square error) statistics using the lasso multivariate methodology, summarizing their results in their Figures 9 and 10. The denominator in an RE calculation is proportional to the square of the RMSE of the “in-sample mean”, which they also call the “intercept” model. Thus, the RE of a proxy reconstruction is directly linked to the RMSE of the proxy reconstruction and the in-sample mean RMSE_{intercept} as follows:

RE_{proxy} = 1 - holdout_RMSE_{proxy}^2 / holdout_RMSE_{intercept}^2
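In code, the relationship between RE and the two holdout RMSEs can be sketched as follows (a minimal illustration, not MW’s actual code; the series values are made up):

```python
import math

def rmse(pred, obs):
    """Root mean square error between predictions and observations."""
    return math.sqrt(sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs))

def re_statistic(recon, obs, calibration_mean):
    """RE = 1 - RMSE_recon^2 / RMSE_intercept^2, where the 'intercept'
    model predicts the calibration-period (in-sample) mean throughout."""
    intercept_pred = [calibration_mean] * len(obs)
    return 1.0 - rmse(recon, obs) ** 2 / rmse(intercept_pred, obs) ** 2

# A reconstruction that tracks the holdout data beats the intercept model
# (RE > 0); one no better than the in-sample mean gives RE at or below zero.
obs = [0.1, 0.3, 0.5, 0.7]
good = [0.15, 0.25, 0.55, 0.65]
print(re_statistic(good, obs, calibration_mean=0.0))  # positive
```

A reconstruction identical to the intercept model gives RE = 0 exactly, which is why the benchmarking question is about how far above zero RE must be.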

The benchmarking against a class of pseudoproxies is calculated from the distribution obtained from simulations.

In their Figure 9, if the RMSE boxplot for a class of pseudoproxies is better (smaller) than the RMSE boxplot for the in-sample mean (intercept), then the RE statistic for the proxy reconstruction will not prove “significant” against that class of pseudoproxy.

They make the surprising observation that white noise pseudoproxies outperform low-order red noise (the usual climate science benchmark) under their test. More controversial claims arise from their observation that “empirical AR1” pseudoproxies and Brownian motion pseudoproxies (random walks) outperform actual proxies under their test setup. (This can be seen in the smaller RMSEs in the figure shown below.)
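The flavor of this result can be reproduced in a toy setting. The sketch below is not MW’s setup (they ran the lasso on the full multiproxy network); it simply calibrates a single AR1 pseudoproxy against a trending “temperature” by ordinary least squares and counts how often the holdout RMSE beats the in-sample-mean (intercept) model. All parameters (series length, trend, AR1 coefficient) are illustrative:

```python
import math
import random

def ar1_series(n, rho, rng):
    """AR1 pseudoproxy: x[t] = rho * x[t-1] + white noise."""
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def holdout_rmse(proxy, temp, cal, hold):
    """OLS calibration of temp on proxy over `cal`, then RMSE over `hold`."""
    px = [proxy[i] for i in cal]
    ty = [temp[i] for i in cal]
    mx, my = sum(px) / len(px), sum(ty) / len(ty)
    sxx = sum((a - mx) ** 2 for a in px)
    slope = sum((a - mx) * (c - my) for a, c in zip(px, ty)) / sxx
    icept = my - slope * mx
    errs = [(icept + slope * proxy[i]) - temp[i] for i in hold]
    return math.sqrt(sum(e * e for e in errs) / len(errs))

rng = random.Random(0)
n = 150
temp = [0.01 * t for t in range(n)]        # trending "temperature" target
cal, hold = range(0, 100), range(100, n)

cal_mean = sum(temp[i] for i in cal) / len(cal)
rmse_icept = math.sqrt(sum((temp[i] - cal_mean) ** 2 for i in hold) / len(hold))

# Count how often persistent noise beats the intercept model out of sample.
wins = sum(
    holdout_rmse(ar1_series(n, 0.9, rng), temp, cal, hold) < rmse_icept
    for _ in range(200)
)
print(f"AR1(0.9) pseudoproxies beating the intercept model: {wins}/200")
```

A substantial fraction of persistent-noise runs beats the intercept model on the holdout block, which is exactly why RE benchmarks calibrated against weak noise are too lenient.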

The benchmarking of RE statistics was an issue that was put into play in McIntyre and McKitrick 2005a, where we observed that you could get high RE statistics from pseudoproxies with autocorrelation coefficients mimicking those of actual proxies, rather than the very low-order AR1 coefficients assumed (without proof) by climate scientists. While MW cite us favorably on the concept of “empirical AR1” pseudoproxies, they unfortunately don’t link their figures to the issue of RE benchmarking as clearly as they might have – confusingly tracing the issue back to von Storch et al 2004 (which addressed a different issue), rather than McIntyre and McKitrick 2005a,c.

They did their simulations using a multivariate method (the “lasso”), familiar in their community but not actually used in paleoclimate. They faced a tradeoff here: between analysing the properties of the proxy network with a methodology known to the outside world and using one of the methods actually employed in the field (today’s version of RegEM or CPS). I think that they’d have been better off doing a form of CPS, as it would have removed an obvious criticism. Having said that, my own surmise is that the relative performance of Mann08 proxies versus pseudoproxies will not be particularly sensitive to the use of CPS versus lasso. I just don’t see that the weighting patterns from the lasso are going to be different enough to radically change the results.

They accurately note that the Team responded vehemently in Ammann and Wahl 2007 against the concept of “empirical AR1” coefficients. The battleground issue here is whether the high AR1 coefficients observed empirically are properties imparted to the “proxies” by the climate system or whether they arise from gross problems with the proxies. Ammann and Wahl (2007) argued that using autocorrelation coefficients estimated from actual proxies results in

“train[ing] the stochastic engine with significant (if not dominant) low frequency climate signal rather than purely non-climatic noise and its persistence”.

This was asserted rather than demonstrated and, as we know, assertion suffices far too often in climate science. It seems evident that if the proxy networks contained a “dominant” or even “significant” “low frequency climate signal” (as Ammann and Wahl assert), then the graphs of the proxy series would have a consistent low frequency appearance (as opposed to the visually inconsistent appearance shown in MW Figure 6 and elsewhere). Ammann and Wahl do not explain this inconsistency. The very inconsistency of the series within proxy networks such as Mann et al 2008 argues forcefully against the interpretation of high empirical autocorrelation coefficients as being imported from a climate “signal”, as opposed to being an inherent feature of the proxies themselves.

So while I expect that climate scientists will argue that “empirical AR1” coefficients are too severe a pseudoproxy test, I, for one, do not think that they are – if anything, they are probably not severe enough. (Note that empirical AR1 coefficients place less structure on the autocorrelation than the hosking.sim simulations used with the NOAMER tree ring network in our 2005 papers, and they simplify this aspect of the analysis.) MW make the following additional and reasonable observation in dismissing Ammann and Wahl’s objection to empirical AR1 coefficients:

it is hard to argue that a procedure is truly skillful if it cannot consistently outperform noise – no matter how artfully structured.

33 Comments

Interestingly, I was just reading about benchmarking RE in Montford’s book last night. As I recall, the issue is whether the value is about 0 or about .58, Mann favoring the former and you the latter, but I’d better go back and check.

Dave, it’s not that there’s any “right” number. All the statistic does is compare against percentiles generated by forms of noise. Many standard statistical tests assume very elementary forms of noise – more or less white noise – things that don’t apply with these highly autocorrelated series.

We observed that pseudoproxies with more realistic autocorrelation threw out very high RE statistics. I notice that Mann et al 2008 reported a high RE statistic, but, with the various smoothing that they did, I suspect that they hadn’t benchmarked the RE statistic appropriately in that setup either.

It’s great to hear that there’s a response to McShane and Wyner from Steve and Ross that’s been accepted. It’s hard to express how grateful we are for your efforts over so many years – and for the ‘ringside seat’ we enjoy in Climate Audit.

So let’s talk about Jones’ own assertion that there’s so much noise in these “proxies” that we’ll probably never get a reasonably precise modern-to-medieval comparison. What do Schmidt, Mann et al. have to say to that? The science gets better with each iteration? Yah. And one day I’m gonna make the big leagues. Because every day my game improves.

I don’t think Jones is on Mann’s Christmas card list anymore. Mann will attempt to ignore any comments or papers by Jones for as long as possible. I think the recent Jones paper published in Nature showing natural climate variability forcing ocean temps down in late 60s and early 70s indicates Jones is actually trying to do science. Of course, I’ve been accused of being an optimist before.

I’m not at all sure that AR1 is even the correct model. I’ve been playing with FARIMA(1,d,1). Quite a few of the proxies seem to have values of d greater than 0.1, not to mention fairly high MA coefficients. Unfortunately, the R package fArma leaves something to be desired in terms of completeness. In a FARIMA (or ARFIMA) model, d ranges from 0 to 0.5 and is related to the Hurst coefficient H by d = H – 0.5.

I thought that M&W’s main point was that the calibration period was too short and the proxies too noisy for any useful model to be created. The point was that the verification statistics of the models created by different techniques were more or less equivalent. However, the reconstructions created from these separate models were widely divergent.

I’m thinking if there really is a “climate signal” buried in noise they may be able to “track” the signal using a Kalman filter, assuming they have some kind of a priori knowledge of the signal characteristics. So far I have heard that it is “low frequency”. I wonder what other parameters the signal has other than that. Of course, the climate signal, as embedded in the proxies, is expected to increase over the 20th century and be mildly flat, decreasing or increasing in earlier centuries. Higher frequency changes in the signal due to volcanoes, forest fires and other natural phenomena can be filtered out, or ignored by the Kalman filter. I would like someone to tell me: how will we know when we “detect” the climate signal, and how can we know that it is due to increased GHGs rather than other natural phenomena? The KF should be able to track a well defined signal through nearly any kind of noise except noise that is highly correlated with the signal itself.
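For illustration, here is a generic scalar Kalman filter sketch (a textbook toy, not anything from MW or the proxy literature; the drift rate and the noise variances q and r are invented). It recovers a slow drift from heavy noise, but only because the state model and variances are specified in advance, which is exactly the a priori knowledge at issue:

```python
import random

def kalman_1d(obs, q, r):
    """Scalar Kalman filter: random-walk state (process variance q),
    observation noise variance r."""
    x, p, out = 0.0, 1.0, []
    for z in obs:
        p = p + q                       # predict: state uncertainty grows
        k = p / (p + r)                 # Kalman gain
        x = x + k * (z - x)             # update toward the observation
        p = (1.0 - k) * p
        out.append(x)
    return out

rng = random.Random(1)
n = 500
signal = [0.002 * t for t in range(n)]              # slow "low frequency" drift
obs = [s + rng.gauss(0.0, 1.0) for s in signal]     # buried in heavy noise

est = kalman_1d(obs, q=1e-4, r=1.0)

raw_err = sum((o - s) ** 2 for o, s in zip(obs, signal)) / n
kf_err = sum((e - s) ** 2 for e, s in zip(est, signal)) / n
print(f"raw MSE {raw_err:.3f}  vs  filtered MSE {kf_err:.3f}")
```

The catch the comment raises stands: the filter needs q, r and a state model up front, and noise correlated with the presumed signal defeats it.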

“The very inconsistency of the series within proxy networks such as Mann et al 2008 argues forcefully against the interpretation of high empirical autocorrelation coefficients as being imported from a climate “signal”, as opposed to being an inherent feature of the proxies themselves.”

Especially when slopes always ‘correlate’ even if they have noise on them.

In a post I did on the topic, the known temperature signal had to be dropped to 7 percent of its amplitude in order to match the rejection rate of Mann08. The noise itself (when matched to proxy AR1) passed nearly to the 40% level without any signal. The point is that the 40% pass rate was used as clear evidence of a temperature signal, on the assumption that pure noise would only pass at a much lower level – indicating a good signal to correlate to.

From the Mann08 SI:

Although 484 (40%) pass the temperature screening process over the full (1850–1995) calibration interval, one would expect that no more than 150 (13%) of the proxy series would pass the screening procedure described above by chance alone.

In my post, AR1-matched random data nearly passed the 40% threshold, so the claimed 13% chance rate clearly cannot be true. When we consider that he picked two gridcells and took the best one, the 7% signal-to-noise estimate drops to indistinguishable from zero. No temp at all. Of course I could have messed something up, but it was pretty startling IMO. I wonder how surprised a biologist would be by a 0.8C change not measurably affecting growth.
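The screening effect described here is easy to reproduce. The sketch below is my own illustration, not Jeff Id’s actual code: it screens AR1 pseudoproxies against a trending “temperature” by correlation, with and without a pick-the-better-of-two-gridcells step. The AR1 coefficient, trend, and screening cutoff are all illustrative choices:

```python
import random

def ar1_series(n, rho, rng):
    """AR1 pseudoproxy: x[t] = rho * x[t-1] + white noise."""
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0.0, 1.0)
        out.append(x)
    return out

def corr(a, b):
    """Pearson correlation of two equal-length series."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / (sa * sb)

rng = random.Random(2)
n = 146                                  # roughly an 1850-1995 calibration length
temp = [0.005 * t for t in range(n)]     # trending instrumental target
threshold = 0.11                         # illustrative screening cutoff

trials = 1000
pass_one = sum(abs(corr(ar1_series(n, 0.9, rng), temp)) > threshold
               for _ in range(trials))
# "Pick the better of two gridcells": screen on the larger of two correlations.
pass_best2 = sum(max(abs(corr(ar1_series(n, 0.9, rng), temp)),
                     abs(corr(ar1_series(n, 0.9, rng), temp))) > threshold
                 for _ in range(trials))
print(f"one gridcell: {pass_one / 10:.1f}%   best of two: {pass_best2 / 10:.1f}%")
```

Persistent pure noise sails past a cutoff calibrated to white noise, and the best-of-two step inflates the pass rate further, which is the comment’s point about screening.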

It’s my guess that most people who are used to analysing experimental data will look at the proxy data and feel that however subtle the statistical technique, and however great the effort expended, little good can come from this whole approach, even now that honest men are trying.

There are lots of technical points here whose connection to the final answer to the big question is very indirect (though resolving them is of course part of the serious work). However, if we focus on this point:

“It is hard to argue that a procedure is truly skillful if it cannot consistently outperform noise – no matter how artfully structured,”

it is really an essential point. If your explanation of some data is only as skillful as something that can be called “noise”, then you really don’t have any real evidence that the underlying processes contain anything other than “noise” (or anything more amenable to extrapolation).

Well, I would probably not go quite as far as MW. If some noise were *really* awkward, contrived, or likely not to appear from the actual laws of Nature, then its high skill could be inconsequential.

But it’s clear that, so far, the past climate data may be described by “pretty reasonable” types of noise, which means that reliable evidence for something substantially different from noise is non-existent.

“It is hard to argue that a procedure is truly skillful if it cannot consistently outperform noise – no matter how artfully structured,”

In addition to being a key substantive point, as Lubos points out, this is a well-deserved sarcasm. Namely, you might think your procedure/method/calculation is very impressive, understandable only to experienced “experts” in the climate community. But if it can’t outperform noise, it is junk.

The burden is on the party propounding its theory, in this case, Mann, et al. So far, we have no affirmative reason to believe their methodology works, and they certainly haven’t been able to demonstrate that it does. Until they can, there is no justification for taking it seriously.

The criticism goes deeper than that, Eric. There isn’t any statistical theory that can assign physical meaning in the absence of a physical theory.

There is no physical theory that can convert a tree ring into a temperature. There is no physical theory that can parse an O-18 series to separate paleo-temperature trends from the effects of paleo-variations in storm tracks.

Every single proxy paleo-temperature reconstruction is a mockery of science.

Pat, I think this is a bit harsh and I wouldn’t go as far as your statement regarding every single proxy palaeo-temperature reconstruction. There are big efforts ongoing in several laboratories to define new palaeothermometers that have a rigorous physical theory behind them and are only a function of temperature. One of these is the ordering of 13-C and 18-O in carbonate minerals. At high temperatures the isotopes of carbon and oxygen are randomly distributed across the lattice. At low temperatures there is an ordering such that there is a measurable deviation from a random distribution. This is temperature dependent, is independent of the isotopic composition of the precipitating water (seawater, freshwater etc.) and thus independent of storm tracks, the position of the ITCZ, monsoon strength etc.

There are analytical problems, and the sensitivity is such that the precision is only on the order of 1 to 2 degrees C. This is not adequate for Holocene climate change but the field is progressing fast.

However, the point you make about storm tracks etc. can be parsed into the widely recognised issues of measuring the isotope fractionation between a carbonate mineral and its source water when there is no way to estimate the source water isotope composition. This is especially difficult for freshwater, terrestrial samples where the source water composition is fixed by the water cycle. For marine samples this is not so much of a problem, especially open ocean studies.

Paul, your comment is fairly technical (to me at least) but I think that you are saying that there is a linear relationship between temperature and growth.

At the risk of repeating a point made many times on CA, how can the palaeoclimatologists extract a temperature signal from tree rings when they assume a linear relationship between width of ring and temperature. That is, the wider the ring, the higher the temperature.

Any gardener knows that this does not accord with the real world. Maximum growth occurs when a range of conditions are optimum – temperature, fertilisation, moisture etc etc. If temperature is high, the plants are stressed, and don’t grow as well. If temperatures are low, the plant doesn’t thrive.

The relationship between temperature and ring width is likely to be an inverted quadratic (parabolic) function: maximum growth occurs when temperature (and the many other factors) are at an optimum. Too high or too low and growth stalls.

How come the palaeoclimatologists cannot figure this out?? Don’t they ever observe their gardens?

Paul, risking Steve’s wrath, I hereby withdraw my criticism for any proxy temperature reconstruction based in physical theory. :-) That, and very best wishes for your efforts, and those of your colleagues.

folks, please make comments that are specific to McShane and Wyner rather than editorializing against the concept of proxies. The issues at hand are technical ones about reconstructions in the presence of large and imponderable “noise” and the behavior of pseudoproxies.

One of the key comments for me in M&W is –

“On the other hand, limiting the validation exercise to these two blocks is problematic because both blocks have very dramatic and obvious features: the temperatures in the initial block are fairly constant and are the coldest in the instrumental record whereas the temperatures in the final block are rapidly increasing and are the warmest in the instrumental record. Thus, validation conducted on these two blocks will prima facie favor procedures which project the local level and gradient of the temperature near the boundary of the in-sample period. However, while such procedures perform well on the front and back blocks, they are not as competitive on interior blocks. Furthermore, they cannot be used for plausible historical reconstructions!”

This throws the spotlight back on the instrumental temperature record and whether the above statement is correct. There continue to be many reasons to suggest that the temperature record is inaccurate. Unless the relation between a proxy response and a local instrumental temperature is accurately characterised, we can get the instrumental-era tail wagging the millennium dog. (Accuracy should not be confused with correlation or precision). There are dangers in accepting the temperature record at face value.

BTW, it would be an interesting analysis if both the initial and final blocks were horizontal lines of similar averages. Many such discrete locations exist as weather stations from 1900 onwards. We might end up with a historic record from books, showing MWP agriculture in cold places, with a statistical reconstruction showing an invariant temperature for over 1000 years. I mention this only in support of the M&W statement that problems arise because the typical proxy response to temperature is weak – and that the approach breaks down at this limiting case.

I’d like to comment on the RE statistic because, like Dave Dardinger, I recently learned about it from the “Hockey Stick Illusion”, in which SM is credited with most of the analysis, and because I’d like to understand it better. In a shortened notation, from the present article we get:

RE = 1 - MSEx/MSEy = (MSEy - MSEx)/MSEy

Here MSEy is the mean squared error from the simple mean, or intercept, model y, and MSEx is from the more complex model x.

Now, if model y (the intercept) were actually a submodel of model x, with both fitted by least squares over the same sample, then RE would be like an F-statistic, and necessarily non-negative (so zero would not be a sensible threshold for significance).

But the MSEs are apparently computed over a “holdout”, or “verification”, period, using parameters estimated from a distinct calibration period, and hence RE > 0 is not mathematically enforced. Even so, given that model x has more parameters and fewer degrees of freedom, one should still expect RE to be bounded away from zero before declaring significance. One might expect the threshold to depend on the degrees of freedom, as in an F-statistic.

I’d be grateful for any clarification that experts can add to that (vague) assertion.
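The in-sample versus holdout distinction can be checked numerically. In the sketch below (my own toy example, not from the paper), an intercept-only model is nested in a one-regressor least-squares model: in-sample the nested comparison forces RE ≥ 0, but when the fitted coefficients are carried into a holdout block, nothing prevents RE from going negative:

```python
import random

def mse(pred, obs):
    """Mean squared error."""
    return sum((p - o) ** 2 for p, o in zip(pred, obs)) / len(obs)

def fit_ols(x, y):
    """Least-squares intercept and slope of y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    return my - b * mx, b

rng = random.Random(3)
x = [rng.gauss(0, 1) for _ in range(60)]
y = [rng.gauss(0, 1) for _ in range(60)]       # pure noise: no real relation
cal, hold = slice(0, 30), slice(30, 60)

a, b = fit_ols(x[cal], y[cal])
ybar = sum(y[cal]) / 30

# In-sample: the fitted model can never do worse than its own submodel.
re_in = 1 - mse([a + b * v for v in x[cal]], y[cal]) / mse([ybar] * 30, y[cal])
assert re_in >= 0

# Out-of-sample: RE >= 0 is not enforced; with pure noise it is typically negative.
re_out = 1 - mse([a + b * v for v in x[hold]], y[hold]) / mse([ybar] * 30, y[hold])
print(f"in-sample RE = {re_in:.3f}, holdout RE = {re_out:.3f}")
```

This is only the nesting point, not an answer on where the significance threshold should sit; that still depends on the noise class used for benchmarking.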

Tingley, Martin P. Spurious predictions with random time series: The Lasso in the context of paleoclimatic reconstructions. A Discussion of “A Statistical Analysis of Multiple Temperature Proxies: Are Reconstructions of Surface Temperatures over the Last 1000 Years Reliable?” by Blakeley B. McShane and Abraham J. Wyner. Submitted to the Annals of Applied Statistics pdf. A more detailed version can be found here.

I am not sure of your point. The second paper you reference certainly mentions M&W but does not contain any criticism of the paper. I am also not certain why you would use the adjective “flamboyant” to describe the authors. This paper appears to have a refreshing objective of trying to bring modern statistical methods to climate science. If anything, it is a (mild) criticism of “mainstream” reconstructions for not making use of statistical knowledge available.

I’m obviously coming rather late to the party here, but Tingley’s comments about the LASSO being inappropriate have been bugging me for a while.

Tingley claims the LASSO is inappropriate for multi-proxy studies because LASSO is a method for sparse regression (i.e. most appropriate when only a small subset of the possible covariates are expected to have a true effect). Tingley notes that the LASSO is equivalent to putting a double-exponential prior on the regression coefficients and that this does not make sense from a scientific perspective. He is correct that, for a fixed value of the bounding parameter, the LASSO is equivalent to such a Bayesian analysis. However, M&W use k-fold cross-validation to choose the bounding parameter. If, as is claimed, the temperature proxies are all highly informative, the cross-validation would find that very little bounding is needed (i.e. tend towards a standard unconstrained regression, with the bounding parameter lambda = 1). Hence the choice of bounding parameter is highly data driven, meaning the actual procedure is nothing like using a double-exponential prior.

Tingley, in the detailed version of his discussion, tries to show how poor M&W’s use of the LASSO is by using the LASSO himself, but he states:

“I do not perform the cross-validation procedure used in MW2010 to determine the LASSO penalization parameter (lambda on page 13 of MW2010). Instead, I use the default setting of the glmnet package, which sets lambda to be 0.05 times the smallest value of lambda for which all coefficients are zero. The LASSO penalization is thus very small.”

Hence instead of doing what M&W have proposed, he is making sure the regression will always be really sparse. Unsurprisingly, he then finds that his version of the LASSO performs badly.

It is a classic straw man argument. However, I suspect Tingley doesn’t understand the importance of the cross-validation step, because what he has done is so easy to rebut and so obviously wrong to anyone with knowledge of the LASSO.
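The effect of the penalty choice is easy to see in a stripped-down setting. The sketch below is not the glmnet/lars machinery M&W used: it takes the single-predictor case, where the lasso solution is just soft-thresholding of the OLS coefficient (with an illustrative penalty normalization), and compares a validation-driven penalty choice with an oversized fixed one:

```python
import random

def soft_threshold(b, lam):
    """Single-predictor lasso: soft-threshold the OLS coefficient by lam
    (illustrative normalization of the penalty)."""
    if b > lam:
        return b - lam
    if b < -lam:
        return b + lam
    return 0.0

rng = random.Random(4)
n = 200
x = [rng.gauss(0, 1) for _ in range(n)]
y = [2.0 * v + rng.gauss(0, 0.5) for v in x]     # strongly informative predictor
train, val = slice(0, 100), slice(100, 200)

def ols_coef(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = sum((a - mx) * (c - my) for a, c in zip(xs, ys)) / \
        sum((a - mx) ** 2 for a in xs)
    return b, my, mx

b_ols, my, mx = ols_coef(x[train], y[train])

def val_mse(lam):
    """Validation error of the soft-thresholded fit at penalty lam."""
    b = soft_threshold(b_ols, lam)
    return sum(((my + b * (v - mx)) - w) ** 2
               for v, w in zip(x[val], y[val])) / 100

grid = [0.0, 0.01, 0.1, 0.5, 1.0, 2.0, 5.0]
lam_cv = min(grid, key=val_mse)          # data-driven penalty choice
print("chosen penalty:", lam_cv, "vs forced-sparse penalty: 5.0")
print(f"val MSE at chosen lam: {val_mse(lam_cv):.3f}, at lam=5.0: {val_mse(5.0):.3f}")
```

With an informative predictor, the validation-driven choice lands near zero penalty (little shrinkage), while a large fixed penalty zeroes the coefficient and falls back to the intercept; choosing the penalty for the method rather than from the data changes the outcome, which is the straw-man point above.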

I haven’t read them properly yet, but it’s good to see Tingley get a mauling in the rejoinder.

It’s a shame the editor felt the need to write:

“Thus, while research on climate change should continue, now is the time for individuals and governments to act to limit the consequences of greenhouse gas emissions on the Earth’s climate over the next century and well beyond.”

[RomanM: Sounds like a good excuse for a new thread, Steve. I see there is a submission from Ross and yourself in the mix.]:)