Calibration in the Mann et al 2007 Network Revisited

In a post a few months ago, I discussed MBH99 proxies (and similar points will doubtless apply to the other overlapping series) from the point of view of the elementary calibration diagram of Draper and Smith 1981 (page 49), an older version of a standard text. Nothing exotic.

One of the problems that arose in these plots is that virtually none of the individual calibrations made any sense according to the Draper and Smith style plot. Since then, I've dug a little further into the problem. If a statistical relationship is not significant under a standard t-test (a t-statistic of about 2), then a Draper and Smith style plot throws up nonsensical confidence intervals. I've determined that if you lower your t-standard, you can typically "get" confidence intervals for the individual calibrations that are at least intelligible. The larger question is whether a multivariate calibration can overcome the deficiencies of the individual calibrations (the topic of Brown 1982 and subsequent articles, which several of us have been parsing).
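Since the "t-test of 2" will recur throughout, here is a minimal sketch of the underlying check: a t-test on the slope of an ordinary calibration regression. The numbers are entirely synthetic (a hypothetical weak proxy), not any actual MBH series:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Synthetic 79-year calibration period (cf. 1902-1980): temperature x
# and a proxy y carrying only a weak temperature signal plus noise.
n = 79
x = rng.normal(0.0, 0.25, n)              # temperature anomaly, deg C
y = 0.2 * x + rng.normal(0.0, 1.0, n)     # hypothetical weak proxy

fit = stats.linregress(x, y)
t_slope = fit.slope / fit.stderr          # the calibration t-statistic
t_crit = stats.t.ppf(0.975, n - 2)        # about 1.99: the "t-test of 2"
print(f"t = {t_slope:.2f}, critical value = {t_crit:.2f}, "
      f"significant: {abs(t_slope) > t_crit}")
```

When |t| falls below the critical value, the confidence band is too wide relative to the slope, and the calibration interval construction breaks down in the way the post describes.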

However, before re-visiting that text, I want to re-plot calibration diagrams for all the MBH99 proxies, lowering the t-standard where required to get an intelligible confidence interval. The post title refers to the Mann et al 2007 network because it is the same as the MBH98-99 network. In all cases, I've limited my analysis to the verification period mean, since Wahl and Ammann have agreed that MBH has no "skill" at higher frequency. So we might as well see which proxies, if any, have "skill" at the verification period mean. In the MBH99 period (the AD1000 step), only one "climate field" is reconstructed, so all the piffle about lower-order temperature PCs can be disregarded.

Let me start by showing a plot for the MBH99 series with the highest t-score in calibration – the NOAMER PC2 (not the Mannian PC1). I haven't shown plots of the individual series (you can see these here – link), but the PC2 doesn't have a HS shape. Its individual "prediction" of the verification period temperature (sparse, centered on 1902-1980) is an even 0.0 deg C, with 95% confidence intervals of -0.1 and 0.16 deg C. While the confidence intervals are quite precise, unfortunately the observed verification temperature of -0.18 deg C lies outside them.

The next highest t-score comes from Briffa's Tornetrask reconstruction. In this case, it verifies almost exactly. The problem with putting much weight on this particular reconstruction is that Briffa adjusted his results so that they "worked", as discussed here [link].

Only one other series in the MBH99 network passes a standard t-test – Cook's old Tasmania version. Here the 95% confidence intervals are rather uninformative, running from 0.04 to 8.81 deg C, but even this wide interval failed to bracket the observed -0.19 deg C.

All other series failed a simple t-test in the calibration period and yielded perverse confidence intervals that failed even to bracket the point estimate. I examined some of the original literature, including Fieller 1954, and the most logical approach to these failed confidence intervals seems to be to lower the standard: if you can't get a 95% confidence interval that makes any sense, maybe you can get a 75% or a 50% confidence interval. There are some interesting mathematical issues involved in this, which Ross and I have been mulling over.

But the best way of seeing what is going on is simply to look at a lot of plots, and the Mann et al 2007 data set (which is identical to MBH98) really provides an excellent compendium of perverse cases, which can be recommended to statistics students even if climate science "moves on". In the following plots, I'll show on the left the 95% confidence intervals (where there is a breakdown) and on the right a confidence interval based on a lower standard. Notice that the left-hand plots all show both intersections on the same branch of the hyperbola, hence the nonsense intervals. By lowering the confidence target, one intersection on each hyperbola branch is achieved, and thus a "confidence" interval.
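The intersection construction can be sketched numerically. The following Python snippet is my own toy illustration of the classical Fieller / Draper and Smith calibration interval (deterministic made-up data, not Steve's R script or any MBH series); the quantity g = (t_crit / t_slope)^2 being at least 1 is exactly the condition for both intersections to land on the same branch:

```python
import numpy as np
from scipy import stats

def fieller_interval(x, y, y0, conf):
    """Classical calibration interval (Draper & Smith 1981; Fieller 1954)
    for the x0 implied by a new observation y0. Returns (lo, hi), or None
    when both intersections with the confidence band fall on the same
    branch of the hyperbola (no finite interval)."""
    n = len(x)
    xbar, ybar = x.mean(), y.mean()
    Sxx = ((x - xbar) ** 2).sum()
    b = ((x - xbar) * (y - ybar)).sum() / Sxx          # calibration slope
    s2 = ((y - ybar - b * (x - xbar)) ** 2).sum() / (n - 2)
    tcrit = stats.t.ppf(1 - (1 - conf) / 2, n - 2)
    g = tcrit ** 2 * s2 / (b ** 2 * Sxx)               # (tcrit / t_slope)^2
    if g >= 1:
        return None      # slope fails its t-test at this confidence level
    xhat = xbar + (y0 - ybar) / b                      # point estimate of x0
    half = (tcrit * np.sqrt(s2) / abs(b)) / (1 - g) * np.sqrt(
        (xhat - xbar) ** 2 / Sxx + (1 - g) * (1 + 1 / n))
    center = xbar + (xhat - xbar) / (1 - g)
    return (center - half, center + half)

# Deterministic toy data with a slope t-statistic of about 1.4: the 95%
# interval breaks down, but a lowered (50%) standard gives a finite,
# very wide interval -- the pattern seen in the plots below.
x = np.arange(12.0)
e = np.tile([1.0, -1.0, -1.0, 1.0], 3)   # residuals orthogonal to (1, x)
y = 0.13 * x + e
print(fieller_interval(x, y, y.mean(), 0.95))   # None: breakdown
print(fieller_interval(x, y, y.mean(), 0.50))   # finite but very wide
```

With a strong calibration slope the same formula returns a tight interval that brackets the true value; it is only the weak-slope cases that generate the perverse behavior.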

Here is the famous NOAMER PC1. (Note: this is from the original MBH archive and I need to re-check whether this is the "fixed" version or not.) While this version fails a standard t-test, if the t-standard is lowered to 1.5, then one can say that with (say) 90% confidence, the verification period mean was between -1 and -16 deg C (actual value -0.19 deg C). So this doesn't seem as helpful a standard as one might hope.

The next highest t-score for NH temperature came from the Quelccaya 1 ice core (t=1.31, which is still about the 90th percentile). Somewhat disappointing, though, is the fact that the Quelccaya 2 ice core has a t-value of only 0.36. Now that I think of it, we've seen a tendency in some recent studies to use the Quelccaya 1 ice core as a proxy on its own, discarding the Quelccaya 2 results. Hmmm. The Quelccaya 1 results purport to show a confidence interval (t=1.31) of between -0.39 and -4.8 deg C (once again not containing the observed -0.19 deg C). The Quelccaya 2 results, at an even lower confidence level (t=0.36), bracket -2 and -8 deg C, again not containing the observed -0.19 deg C. Hmmm.

Next in descending t-scores is Briffa's Polar Urals series. This is the older version before the update (which resulted in a warm MWP), prompting the Team to switch to Yamal. Once again, on the left side we see the broken-down confidence intervals at the usual 95% t-values, but with a lowered confidence standard (t=1.25), confidence intervals of between -0.9 and -8.4 deg C result (again not containing the observed value).

I hope you're not getting too bored, because there are some interesting examples still to come, though the next few are more of the same. Next we have the NOAMER PC3. In passing, how does application of a Preisendorfer rule result in 3 PCs for the AD1000 NOAMER network and 2 PCs for the AD1400 network? Maybe the PR Challenge will tell us. The t-value has now declined to 0.95, yielding the typical broken-down diagram on the left. By lowering the t-standard, we can get a less confident interval, this time between 0.46 and 5.8 deg C (and once again unfortunately not including the observed value).

Next come two accumulation series from Quelccaya. It's intriguing that this one site accounts for 4 of the 14 proxies in the network. Perhaps each individual ice core is thought to be tuned to a different channel on the teleconnection dial. The confidence intervals for one accumulation series are between 0.05 and 0.9 deg C at low confidence (but not containing the observed value), while the other accumulation series, using a t-standard of only 0.5, yields an uninformative interval of 0.5 to 15.5 deg C, again unfortunately not containing the observed value.

Next, here are three proxies all with t-values in the seemingly uninformative 0.6-0.65 range: an Argentine tree ring series, a French tree ring series and a Greenland dO18 series (the last one being used over and over again in these studies). At very low confidence levels, each of these "proxies" yields only very wide "confidence" intervals, none of which actually overlaps the observed value.

The last proxy has a bit of a place of honor. In this case, to obtain an intelligible “confidence” interval, one has to lower the t-standard to 0.02 (!!), resulting in a “confidence” interval between 12 and 19 deg C for the verification period reconstruction.

Reviewing the bidding, only one of the MBH99 proxies, considered in an individual calibration, yielded confidence intervals that contained the observed verification mean (and that proxy – Briffa’s Tornetrask series – had been fudged so that it “worked”.)

The interesting statistical question, for which hopefully the methods of Brown 1982 and subsequent literature can assist, is whether a multivariate calibration using proxies which have all individually failed so badly can yield an answer with a valid confidence interval using proper methods (as opposed to methods applied by IPCC relying on Wahl and Ammann and the rest of the Team.)

I may be way off base, but to me the statistical question posed by all the failed proxies seems to be roughly analogous to Mann telling the old business joke, “We lose money on every unit, but we make it up in volume.”

#3. That’s an apt comment. I can envisage some circumstances where you have a bunch of proxies, none of which individually work very well, but from which you can extract a valid signal. So there’s more to it than the univariate case and I’m re-visiting these issues.

There's a lesson here for the PR Challenge folks if they want to model realistic pseudoproxies. It's not enough to pitch slow-pitch as they do: adding a bit of white noise or low-order red noise to a signal and then recovering the signal. That's easy. What they have to do is construct "interesting" pseudoproxies that fail the way the ones here do, and then show that you can recover a signal from them.
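As a concrete illustration of the difference, here is a Python sketch of a "slow pitch" pseudoproxy versus a noise-dominated one. Everything is synthetic; the amplitudes and AR coefficients are my own guesses for illustration, not anything from the PR Challenge:

```python
import numpy as np

def ar1_noise(n, rho, sigma, rng):
    """AR(1) ("low-order red") noise: e[t] = rho * e[t-1] + innovation."""
    innov = rng.normal(0.0, sigma, n)
    e = np.empty(n)
    e[0] = innov[0]
    for t in range(1, n):
        e[t] = rho * e[t - 1] + innov[t]
    return e

rng = np.random.default_rng(0)
n = 581                                    # e.g. an AD1400-1980 series
signal = np.linspace(-0.5, 0.5, n)         # stand-in "temperature" target

# The "slow pitch": signal plus mildly red noise -- easy to recover.
easy = signal + ar1_noise(n, rho=0.2, sigma=0.25, rng=rng)

# A harder pseudoproxy: noise-dominated and strongly autocorrelated, the
# kind that fails a calibration t-test the way the series in this post do.
hard = 0.05 * signal + ar1_noise(n, rho=0.9, sigma=0.5, rng=rng)

for name, p in [("easy", easy), ("hard", hard)]:
    print(name, round(np.corrcoef(signal, p)[0, 1], 2))
```

The "easy" pseudoproxy correlates strongly with the target; the "hard" one typically does not, and a benchmark built only from the easy kind tells us little about the network analyzed above.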

It's an interesting problem – but I think that it's mainly back to statistics and math, and doesn't really have much to do with climate models or the properties of corals. It really is too bad that they decided not to involve me with the PR Challenge – I'm really quite a bit more familiar with the issues than the people who were there, and have thought about the problems in a more fundamental way. But hey, they're the Team.

Are there any proxies out there and available that you can test (or have tested) that do pass a standard t-test? Yamal, Tornetrask, any of the speleo, other ice core, or non-strip-bark dendro proxies, etc.? I'm wondering whether this type of test would be valid on any proxy at this point.

Steve, I’ll echo cdquarles’ request. For your rather large non-statistician peanut gallery, it would help to have a few pertinent examples showing what we *should* expect.

Right now all we can do is read and say “I dunno what this really means, but Steve says it is bad, and the measures are obviously outside the CI”… not knowing how often these kinds of results are seen, we have no basis for comparison.

The Tornetrask example would be an example of a “good” calibration. The trouble with this as a proxy is that it’s been “adjusted” to remove the divergence problem. As far as I can tell, it was Briffa’s first bit of chiropractic, but not his last.

I’ve been doing similar plots on the Juckes version of these proxies and the t-values in the Juckes version all seem to have improved even if the location is the same. I’ll need to order some of mosher’s crazy pills as well.

I would love to see some **unadjusted** data (any kinds to begin with, then eventually some climate data) that calibrates well.

At this point, it feels like this is a statistical test for discovering human intervention in data series. Should NOT be so, AFAIK… I’m just hoping that other scientific data sources can be mined to show that this test is reasonable and useful as a way of finding “good” data. I just get a bad taste when I see sooooo many adjustments…

I’ve figured out why the later correlations improve – the temperature data has been adjusted. The comparisons shown here use the temperature data used in MBH. Subsequent revisions to the temperature have “improved” the trend and with it the correlations to upward trending proxies. It’s hard to find bottom.

0 (Steve) Intuitively, the answer to your interesting statistical question is 'no'. These are not a 'precise' enough tool to give 'accurate' results. It's like the roomful of monkeys; you might get an accurate result, but it's not likely.
=========================================

The following was recently posted on Open Thread 3 at Open Mind (‘Luminous Beauty’, July 19th, 10.41 pm)

“Mann et al. (1998, 1999) used a network of 415 annually resolved proxy data … to reconstruct temperature patterns over the past thousand years. Zhang et al. (Z. Zhang, M. Mann, S. Rutherford, R. Bradley, M. Hughes et al., manuscript in preparation) have more recently assembled a much larger network of 1232 annually resolved proxy data consisting of tree rings, corals and sclerosponge series, ice cores, lake sediments, and speleothems combined with reconstructions of European seasonal surface temperatures back to 1500 CE based on a composite of proxy, historical, and early instrumental data (Luterbacher et al. 2004) … The additional inclusion of non-annually resolved, but still relatively high (e.g., decadal), resolution proxies (e.g., nonlaminated lake and ocean sediments) with high enough resolution and accurate enough age models to calibrate at decadal resolution leads to an even larger network of 1302 proxy series.

I make that 887 “new” proxies.”

I wondered if you, Steve, or anyone else was aware of this new paper?
Steve: Nope. But the "415 proxies" are the same old MBH98 proxies, Graybill bristlecones and all. They will probably add in the Briffa MXD network used in Rutherford et al., but I'm not sure what else will be in it. But always keep in mind that the vast majority of these records are very short and do not affect things before 1400. The issue will be what's in the MWP network – bristlecones. And there will be a lot of Graybill bristlecone chronologies, that's for sure.

The script is still not working for me. Hard-coding the scale.method object took care of the first "object not found" error, but now I get an "object not found" error referencing an object named Sxxinv. I have since updated my copy of R, but still no joy when using your script.

Thanks for the R education and refreshing my statistics skills.

Steve: Sorry 'bout that. I need to re-run this in a fresh session rather than one that's already open. Sxxinv should be defined as solve(Sxx).

Steve, Looong ago you provided me with Mann et al’s 112 columns of data, 13 of which were temperatures and the remainder proxies of various sorts. I have done a /huge/ amount of processing on these data, and long ago reached my own conclusions about what might reasonably be deduced from them, taking the values Mann provided at face value. (Thus not worrying about the data inconsistencies and errors that you detected and published).

Are we still talking about, and operating on, these data? (Presumably referred to as MBH98).

I use a method of analysis completely different from your elegant work, making no serious attempt to derive a decent quantitative estimate of the “uncertainties”, although I can, under various assumptions, produce one.

The method I use provides a grand overview of the data. It avoids setting up an hypothesised model, such as the very commonly used and accepted linear one, which for all we know might well be totally unrelated to the reality behind the origins of the numerical data.

The technique derives the grand patterns in the data, and enables the behaviour of any chosen subgroups of the data to be compared directly. This yields outcomes that I find very instructive, and which incidentally seem to refute completely any suggestion of “hockey stick” behaviour, save for a few isolated single site series.

If current thinking is that there might be a “more reliable” data set or sets, would it be possible to get a direct link to them?

I’m no mathematician but love the challenge of trying to understand this sort of stuff. My conclusion, as they would say in the East End of London, is “tree rings don’t prove nuffink, mate”. It gives a warm glow to a plump old man when he is able to reduce all your technical gubbins to simple English.