MBH98 Proxies

One point that many people do not understand is that merely labelling something a "proxy" and putting it in a multiproxy dataset does not mean that it has any correlation to temperature.

I’ve plotted up the 22 proxy series in the 15th century MBH98 dataset so that others could see a little more clearly what this means – the proxies are in black. This type of detailed plotting is really needed in presentations of multiproxy studies and is not really suited to academic journals (one of the many gaps that would be filled by the equivalent of engineering-calibre analysis of these studies before policy usage.)

The blue in Figure 2 is the Wahl-Amman version of the AD1400 MBH98 reconstruction based on the 22 “proxies” plotted individually; the red in Figure 1 is the last portion of the MBH98 stepwise reconstruction using 112 “proxies”. The level change in the MBH98 proxy index about 1930 is quite distinct.

Figure 1. The first 11 proxy series in the AD1400 network

Notice that most of the proxies look like white noise, but the Gaspé series has a distinct trend. I’m sure you can pick out Gaspé without it being identified. Of course the updated Gaspé series (which Jacoby has withheld) does not have this trend and they have "lost" the location. The series at bottom left (Svalbard ice melt) is obviously very non-normal, but no consideration is given to this in MBH98.

Figure 2. The other 11 proxy series in the AD1400 network, this time with the Wahl-Amman AD1400 version of MBH98 in blue.

Again most of the proxies look like white noise, but the bristlecone series (the MBH98 NOAMER PC1) has a distinct trend. I’m sure you can pick out the MBH98 NOAMER PC1 without it being identified. I understand that Hughes has new data for Sheep Mountain, but this has not been reported.

You can readily see how the Gaspé and bristlecones stamp the reconstruction. In effect, if you take the weighted average of these two series, you replicate most of the variance in the MBH98 AD1400 reconstruction – the other proxies are there for padding.

Many people think that MBH98 is in some way an average of the proxies. It isn’t. For the portion of the algorithm that does most of the lifting, they regress the proxies against the first temperature principal component, which has an upward trend. The index itself is essentially a weighted average of the proxies, weighted by this correlation. So their regression module has a mining element, over and above the mining element in their tree ring principal components calculation. For example, the upward trend in the top right series of Figure 2 has been mined from 70 series and doesn’t represent an average or even a properly calculated principal component.
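The mining effect of correlation weighting can be sketched in a few lines. Everything below is synthetic, a toy stand-in rather than the actual MBH98 algorithm: one series is given its own trend, the other 21 are pure noise, and each is weighted by its correlation with a trending "temperature PC1".

```python
import numpy as np

rng = np.random.default_rng(0)
n_years, n_proxies = 100, 22

# Stand-in for the temperature PC1: an upward trend plus a little noise
temp_pc1 = np.linspace(0.0, 1.0, n_years) + 0.1 * rng.standard_normal(n_years)

# 21 white-noise "proxies" plus one with its own strong trend
proxies = rng.standard_normal((n_proxies, n_years))
proxies[0] += np.linspace(0.0, 3.0, n_years)

# Weight each proxy by its correlation with the temperature PC1
weights = np.array([np.corrcoef(p, temp_pc1)[0, 1] for p in proxies])

# A correlation-weighted average: the trending proxy stamps the result
recon = weights @ proxies / np.abs(weights).sum()

print("corr(recon, trending proxy):", round(np.corrcoef(recon, proxies[0])[0, 1], 2))
print("corr(recon, temp PC1):", round(np.corrcoef(recon, temp_pc1)[0, 1], 2))
```

The reconstruction ends up tracking the single trending series closely, even though 21 of the 22 inputs carry no signal at all.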

One reader commented that the use of all these series shows that they weren’t data mining: that’s not the case. What happens is that their regression module, which is quite a weird algorithm in itself, strongly picks up the trends. I’ve mentioned before some simulations in which I applied the MBH98 method to one hockeystick-shaped PC1 and 21 white noise series, and the result is imprinted by the hockeystick-shaped PC1. So most of the other series are simply an illusion in terms of contributing – their weights may be as little as 0.1% of the Gaspé series’ weight.

21 Comments

For some reason this reminds me of taking a number of GCM generated climate projections and averaging the results together to “improve” the forecast.

Questions:

In Figure 2 the two middle datasets on the right hand side show incredibly linear results for the last ~decade, how did this happen? These totally unrepresentative data points are positive, so does this also skew the results?

I edited the article for clarity, punctuation, style, artistic merit etc. But one thing bothers me: why are the labels referring to 11 proxies when there are 12 shown? I don’t have the math degree, so I just wondered…

And this is supposed to be a bad thing? If the researchers had only used the increasing data, you would rightly accuse them of cherry picking data to fit their conclusion. But if they use several independent sources of data without handpicking and still get a strong signal, that indicates a lot more.

John A. I think the upper left series in each figure is the temperature reconstruction based on the other 11 series.

If the reconstruction was done by averaging, they had to do some very fancy weighting to get the red and blue results. How does the reconstruction in Fig 1 get a tenfold reduction in range from ±2 to ±0.2? The statistics of noise should only give sqrt(11) ≈ 3.3, i.e., ±0.6. Same question for Fig 2.
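The commenter’s square-root rule is easy to check numerically. A minimal sketch with synthetic data: equal-weight averaging of 11 independent unit-variance noise series.

```python
import numpy as np

rng = np.random.default_rng(1)
n_series, n_years = 11, 500

# 11 independent unit-variance noise series, averaged with equal weights
noise = rng.standard_normal((n_series, n_years))
avg = noise.mean(axis=0)

print(f"single-series sd: {noise.std():.2f}")
print(f"averaged sd:      {avg.std():.2f}  (theory: 1/sqrt(11) = {1/np.sqrt(11):.3f})")
```

An equal-weight average of 11 noise series can only shrink the spread by a factor of about 3.3, so a roughly tenfold reduction in range implies very unequal weighting of the inputs.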

I’ve added some further comments in the post to clarify a couple of questions here. Also there’s lots of cherrypicking in the data. The Gaspé series is in the AD1400 dataset only because they fiddled their selection criteria to get it into the AD1400 roster – the only case of an early extrapolation (and then they misrepresented the start date of the series). The Jacoby series were already cherrypicked at the Jacoby stage, by his withholding data that was not “temperature sensitive”.

As for selecting series, most of these “proxies” sure look like noise to me.

The reason for the linearity is that the series ended in the mid-1970s – remember all the heavy equipment that’s needed to update proxy series. So they were extended by persistence. I’m pretty sure that this doesn’t make a material difference – other than Gaspé and the bristlecones, almost nothing makes a difference. So they talk about how “robust” everything is, but if you lay a hand on their bristlecones, what squealing. Accusations of wholesale “throwing out” of data – you know the litany.
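“Extension by persistence” just means repeating the last observed value out to the end of the calibration period. A minimal sketch (the function name and toy values are hypothetical):

```python
import numpy as np

def extend_by_persistence(series, n_extra):
    # Pad a proxy series that ends early by repeating its final value;
    # this is what produces the perfectly flat tail segments noted above
    return np.concatenate([series, np.full(n_extra, series[-1])])

proxy = np.array([0.2, -0.1, 0.4])      # a hypothetical series ending early
print(extend_by_persistence(proxy, 3))  # tail is three copies of the last value
```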

The original article pointed out that there are numerous proxies that appear to show no correlation with temperature. The very few proxies that show a hockey stick shape seem to dominate the final reconstruction.

Perhaps if ruidh could rephrase their concern I would understand better.

I don’t buy that in figure 2, two plots made from two unique datasets could be so alike. Look again at the two middle datasets on the right hand side. The chance of them being such a close match is zero. So they used one proxy twice with oh so slight changes? What gives?

John, the two nearly identical series on the right hand side of the 2nd figure are the French tree ring chronologies fran009 and fran010 (both at ITRDB http://www.ngdc.noaa.gov/paleo), which are two versions of the same site, Les Merveilles, France – one live trees and one mixed live and dead trees. Why these merit two different series is beyond me. As I recall, the original author posited them as precipitation proxies (but MBH doesn’t care, it just mines for hockey sticks.)

As for replication, this is small beer. The Gaspé series was used twice – as an individual proxy it was extrapolated; as a contributor to the NOAMER series, it wasn’t. Spruce Canyon CO appears to have been used 6 times in the AD1400 network (and other networks). It was used as co509w and co509x in the NOAMER PC network; it is reported as being used twice in the Stahle/SWM network. They only give a list of sites and a list of numbered series and no concordance; the versions differ from ITRDB versions. However, there is a best match by correlation and start dates to ITRDB versions, and one doublet of earlywood and latewood widths can be allocated. Then there is another doublet which has identical values to the first doublet in the first 120-125 years and high correlation thereafter – it’s probably a not-quite duplicate edition. I’ve tried unsuccessfully to get an identification of this other site. In the Nature Corrigendum, they said it was Stahle (pers. comm.) and all parties – Nature, Mann, Stahle – have refused to provide further identification.
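The concordance exercise described above (matching unlabelled series against archived ITRDB versions by correlation over their overlap) can be sketched as follows. The data and site labels here are synthetic stand-ins, not the actual archive series.

```python
import numpy as np

def best_match(candidate, archive):
    # Score an unlabelled series against each archived series by
    # correlation over their common overlap; return the best-scoring site
    scores = {}
    for site, series in archive.items():
        n = min(len(candidate), len(series))
        scores[site] = np.corrcoef(candidate[:n], series[:n])[0, 1]
    return max(scores, key=scores.get), scores

rng = np.random.default_rng(3)
co509w = rng.standard_normal(300)
archive = {
    "co509w": co509w,
    "co509x": rng.standard_normal(300),
    "fran009": rng.standard_normal(300),
}

# A "not-quite duplicate": the archived series plus slight later changes
mystery = co509w + 0.05 * rng.standard_normal(300)
site, scores = best_match(mystery, archive)
print(site, round(scores[site], 3))
```

A near-duplicate edition stands out immediately: its correlation with the source series is close to 1, while unrelated series score near zero.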

Re#13, reading between the lines of “merely labelling something a ‘proxy’ and putting it in a multiproxy dataset does not mean that it has any correlation to temperature”, then looking at the numerous proxy records presented in Figures 1 & 2 and seeing that several of them do not appear to correlate with the temperature history presented as the hockey stick, led me to my interpretation “that there are numerous proxies that appear to show no correlation with temperature”. Am I missing something?

Steve, I was hoping you would continue this thread by posting a table that has the 22 gains calculated in the training period and, even more important, the weighting for each of the 22 proxies in the reconstruction. I think that it would be informative. Thanks, Phil B.

I haven’t yet looked carefully at any of these studies, but I have read some of your blog pages and agree that there are problems with some of this approach to picking out data based on a pattern and then weighting accordingly… at least depending on the type of data.

The reasoning behind “Mann’s” approach is imperfect but has logic behind it. [I am only considering the general idea since I haven’t looked at the actual paper details.] Simply, if the proxy does a horrible job matching known good values, then it should get weighted very little if at all, since it is likely a bad proxy or noise. If the proxy correlates well, then it should be given a high weight, since it is likely a good proxy, at least under the reasonable (but I believe incorrect) assumption that the proxy can be modeled as an accurate temp profile with a homogeneous noise distribution. [I am also assuming there is a cut-off level and that weak correlation does not result in any weight whatsoever.]

Yes, you can badly trick this general algorithm rather easily with contrived data or a large enough data set (so that the improbable noise profile that passes the cutoff has a chance of being generated).

Again, I made some assumptions here about what I’ll call a “general algorithm” of weighing during reconstruction based on correlation with known good values and after filtering at some cutoff. I don’t know the details of Mann’s algorithm.

If I was less than skim milky clear in my prior comment, let me try a little bit harder:

A) I agree there can be serious problems,

B) but I think the “general” methodology is workable. I disagree with this blog posting (if it says) that it is generally a bad idea to use a criterion to filter proxy data and to give some sort of weight to proxies for use in reconstruction.

C) I have no comment on the precise details of what Mann did because I have not read or studied the relevant papers and source code.

Good science involves testing predictions. What you are agreeing with is in essence the Texas Sharpshooter fallacy: I can collect a bunch of data and only keep (or heavily weight) the data that agrees with my hypothesis.

The problem is that we ask too much of these proxies. A quick look at a random selection of proxies from co2science.org indicates how much garbage exists among proxies.

What kind of confidence can you place in a generic proxy which does not correlate well at all with the period 1850-present?

The ideal way is to have proxies we can trust, where there is already a lot of confidence that the proxy correlates in a known way with temperature a high proportion of the time; but until/unless we have such proxies, the generic algorithm described is about the only quasi-trustworthy option, with the understanding that it is better than nothing but not expected to be accurate.

Just yesterday I glanced at the IPCC AR4 section with those graphs and noticed Mann’s graph had very large error bars for the older time periods. I am not sure how to place error bars on what could be total garbage, but if we use some sort of confidence level analysis along with the expertise of a group who specializes in that sort of proxy and whom we feel we can trust, then we probably can end up with an error range that is useful. And if current temps happen to be near the top of that range, that speaks loads for the likelihood that current temps are among the highest of the past 1000+ years. Yes, the mean can be like a hockey stick, which we ignore, but the error boundaries can still allow us to potentially make confident statements. [Since I haven’t studied this carefully, I won’t try to say much specifically about Mann’s 1998 paper.]

Honestly, are you suggesting taking proxies that don’t correlate well with the 1850-present period?

Again, I agree that the best case is to already have confidence in the proxies and chose that way; however, if the 1850-present range doesn’t correlate through some identifiable mapping, I expect not to trust the rest of that time series’ data too much.

I see our options as, do what Mann did (generically) and get an idea useful for low resolution work, or ???? and have much lower confidence level of the reconstruction.

What do you suggest is better than what Mann did? Do you generally agree with the position I have taken? All/somewhat/none?

It seems that the only ones doing so are those making claims (e.g. unprecedented modern warming) supposedly answered by the proxies, without stating how they can differentiate the variables which a proxy can be influenced by.

I didn’t address the Texas sharpshooter fallacy specifically, so let me do that now.

If your criterion for passing the test is strict (with a clean cutoff that weights failures down to 0), the chance that accidental random data sneaks in goes down to a very small value. We would be accepting only data that did a very good job in a large section of the time series, and that section would be large enough that the odds we’d have a time series do a decent job elsewhere but fail there would be extremely low, as would be the odds of a time series doing a good job at the end but failing elsewhere. The problem with this approach might be that we’d end up with 0 qualifying proxies.
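The worry about accidental passes can be quantified with a small simulation. All the numbers below are arbitrary illustrative choices, not anything from MBH98: screen 1000 pure-noise series against a trending calibration target, then average the survivors.

```python
import numpy as np

rng = np.random.default_rng(2)
n_series, n_years, calib = 1000, 200, 50   # screen on the last 50 "years"

# Hypothetical calibration target: an upward trend over the screening window
target = np.linspace(0.0, 1.0, calib)

series = rng.standard_normal((n_series, n_years))
r = np.array([np.corrcoef(s[-calib:], target)[0, 1] for s in series])

# Keep only the noise series that happen to clear the correlation cutoff
survivors = series[r > 0.3]
print(f"{len(survivors)} of {n_series} pure-noise series pass the screen")

# Their average trends upward in the calibration window but is flat before it:
# the screen itself manufactures a hockey-stick shape out of noise
composite = survivors.mean(axis=0)
print(f"pre-calibration mean: {composite[:-calib].mean():+.3f}")
print(f"final-decade mean:    {composite[-10:].mean():+.3f}")
```

This is the sharpshooter mechanism in miniature: with enough candidate series, a handful of pure-noise records will clear even a fairly strict cutoff, and averaging just those survivors bakes the calibration-period shape into the composite.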

We could assume objectiveness on the part of the scientist to try and make this work. The problem is that it’s very easy not to be objective and to cheat the system, even subconsciously, with various biases we might have. I agree completely, and this is one reason I think we should have wide error bars that we treat seriously and not just keep around for decoration and formality. And of course, peer review (e.g., by McIntyre as well as others who may not be as critical) should go a long way toward achieving a fair amount of objectivity and balance.

An important issue is the distribution of the noise. An evenly distributed noise (I am guessing) that is of low amplitude should rightly lead to low error ranges (and passing score). However, having confidence that we have identified such a noise distribution is a different matter.

Finally, I did not address weighting very much. I think I like the idea of a cutoff (eliminate the garbage and the suspect) but keep everything else with equal weight. I think you probably are arguing for equal weighting, right?

.. however, I would still test against the most accurate info we have, the temperatures for more recent times, with the greatest confidence lying in the last half century’s worth of data. To fail to test against this data increases the odds that the person picking the selection of proxies will pick a biased set, a set with greater error, yet potentially end up with a result that has smaller error bars than is warranted. We prefer a hockey stick mean with large error bars to some other shape that has a greater chance of being incorrect but smaller error bars attributed.

[Nothing I stated in this reply lessens what I stated before: that we should seek an initial universe set of proxies which have corroborating evidence and supporting theories, generally implying they are among the most likely to be accurate.]