Juckes and the Esper and Jones CVMs

The Euro Hockey Team (EHT) present a variety of results using CVM methods on proxies from 5 canonical Team studies: Jones Briffa 1998; MBH99; Esper 2002; Moberg 2005 and Hegerl 2006. CVM (take the average of scaled proxies and re-scale the average to match the instrumental record) seems to me to be a less dangerous method than inverse regression. However, the supposedly "robust" results are all from very small subsets, which have been "data snooped" in advance. I’ve replicated 4 of the CVM results so far – the non-MBH ones.
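For readers who want to see the arithmetic, here is a minimal sketch of CVM as described above: standardize each proxy, average, then rescale the average so that its mean and standard deviation over the calibration period match the instrumental record. The function name `cvm`, the choice to standardize each proxy over its full length, and the synthetic data are all my assumptions for illustration. This is not the EHT code and the numbers are not real proxies.

```python
import numpy as np

def cvm(proxies, instrumental, calib):
    """Composite-plus-variance-matching sketch (hypothetical helper).

    proxies:      array of shape (n_years, n_proxies)
    instrumental: array of shape (n_years,), NaN where there are no observations
    calib:        boolean mask of shape (n_years,) marking the calibration years
    """
    proxies = np.asarray(proxies, dtype=float)
    instrumental = np.asarray(instrumental, dtype=float)
    # Step 1: scale each proxy to zero mean and unit variance.
    # Assumption: scaling is over the full series length.
    z = (proxies - proxies.mean(axis=0)) / proxies.std(axis=0, ddof=1)
    # Step 2: take the simple average of the scaled proxies.
    composite = z.mean(axis=1)
    # Step 3: re-scale the average so its calibration-period mean and
    # standard deviation match those of the instrumental record.
    c_mean = composite[calib].mean()
    c_sd = composite[calib].std(ddof=1)
    t_mean = instrumental[calib].mean()
    t_sd = instrumental[calib].std(ddof=1)
    return (composite - c_mean) / c_sd * t_sd + t_mean

# Synthetic stand-in data only (random noise, not real proxies).
rng = np.random.default_rng(0)
years = np.arange(1000, 1981)
calib = years >= 1856                      # 1856-1980 calibration window
proxies = rng.normal(size=(years.size, 5))
instrumental = np.full(years.size, np.nan)
instrumental[calib] = rng.normal(loc=-0.2, scale=0.25, size=calib.sum())

recon = cvm(proxies, instrumental, calib)
```

Note that step 3 forces the calibration-period mean and variance to match by construction, so agreement on those two statistics says nothing about skill.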

The EHT purport to test for robustness – but their idea of robustness is to test the impact of removing one proxy after they’ve already ensured that there are, say, two foxtails plus Yamal. On numerous occasions, I’ve stated that medieval-modern levels in Team studies are highly dependent on the presence/absence of a few stereotyped series: bristlecones/foxtails, the Yamal substitution, Thompson’s Dunde ice core. The EHT present 5 CVM results and these provide a nice template for showing this one more time. Today I’ll do quick robustness analyses of the CVM-versions of Esper 2002 and Jones 1998 and do the others later.

Here is their key spaghetti graph of CVM reconstructions for the 5 studies plus the Union (All-Star?) reconstruction.

Euro Team Fig. 4. Reconstruction back to AD 1000, calibrated on 1856 to 1980 northern hemisphere temperature, using CVM, for a variety of different data collections. The MBH1999 and HPS2000 NH reconstructions and the Jones et al. (1998) instrumental data are shown for comparison. Graphs have been smoothed with a 21-year running mean and centred on 1866 to 1970. The maximum of the “Union” reconstruction in the pre-industrial period (0.25 K, 1091 AD) is shown by a cyan cross, the maximum of the instrumental record (0.841 K, 1998 AD) is shown as a red cross.

Esper Version
The figure below isolates the Esper composite, which consists of the 5 Esper series available to 1000 – two foxtail series, a Tornetrask version, the Polar Urals Update (only used by Esper) and Taymir. It’s surprising that they didn’t include Mongolia in their Esper composite, as it theoretically goes back to 1000 and has strong 20th century growth. There is something weird going on with Esper’s Mongolia version. This is the only one (out of 14 chronologies) that Esper didn’t produce when requested by Science. Hegerl et al, who rely on Esper versions, mention in passing that Mongolia was unavailable and that they performed a re-collation of 9 sites – Sol Dav/Tarvagatny Pass (mong003) will be one of them, but the others are unknown at present. It looks like Esper might have misplaced his Mongolia chronology as used. If not, the fact that it continues to be Missing In Action is really quite remarkable.

First, I’d like to demonstrate that I’ve replicated their result sufficiently to permit a sensitivity analysis. In the figure below, the archived EHT composite is plotted in black, with my emulation in red overlying it. The difference is plotted in blue. Because the emulation is very close, you can barely see the black. The correlation between the two versions is good to more than three 9’s. An obvious observation when this series is disentangled from the spaghetti is that, taken by itself, I don’t think anyone would call it a "Hockey Stick". 20th century values are somewhat elevated, but not in an exceptional range – and this is with 2 foxtail series.

Figure 2. Replication of EHT Esper composite. Black – Archived by EHT; red – emulation; blue – difference.

The obvious question is what happens if you don’t use the foxtails – after all, these results are supposed to be fantastically "robust". The figure below shows the EHT Esper composite on the left and an Esper variation without the two foxtail series on the right. As you see, without the foxtails, 11th century values are higher than modern values and, indeed, the series ends up at about its long-term average. So the amount of 20th century elevation is very small even in the EHT Esper version, and even this limited 20th century "warmth" vanishes without the foxtails. The 1856-1980 correlation of the EHT Esper composite to their instrumental temperature record is 0.59 – higher than the level of 0.43 for the non-foxtail composite. However, both results would be "significant" under the statistical methods presented by the EHT.
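The exclusion experiment is easy to sketch: rebuild the CVM composite with and without the named series, then compare the calibration-period correlation and the medieval versus modern levels. The helper names (`cvm`, `sensitivity`), the series labels and the synthetic data below are assumptions for illustration only. The real inputs are the Esper chronologies, and this sketch will not reproduce the 0.59 and 0.43 figures quoted above.

```python
import numpy as np

def cvm(proxies, temp, calib):
    """Scale proxies, average, then match calibration mean and variance."""
    z = (proxies - proxies.mean(axis=0)) / proxies.std(axis=0, ddof=1)
    comp = z.mean(axis=1)
    scaled = (comp - comp[calib].mean()) / comp[calib].std(ddof=1)
    return scaled * temp[calib].std(ddof=1) + temp[calib].mean()

def sensitivity(series, temp, calib, drop):
    """Rebuild the composite without the series named in `drop`."""
    keep = [name for name in series if name not in drop]
    proxies = np.column_stack([series[name] for name in keep])
    recon = cvm(proxies, temp, calib)
    # Correlation with the instrumental record over the calibration years.
    r = np.corrcoef(recon[calib], temp[calib])[0, 1]
    return keep, recon, r

# Synthetic stand-in data (NOT the real chronologies).
rng = np.random.default_rng(1)
years = np.arange(1000, 1981)
calib = years >= 1856
temp = np.full(years.size, np.nan)
temp[calib] = 0.005 * (years[calib] - 1856) + rng.normal(scale=0.2, size=calib.sum())

names = ["foxtail_1", "foxtail_2", "tornetrask", "polar_urals_update", "taymir"]
series = {name: rng.normal(size=years.size) for name in names}
# Give the two hypothetical "foxtail" series an artificial modern ramp.
ramp = np.clip((years - 1850) / 130.0, 0.0, None)
for name in ("foxtail_1", "foxtail_2"):
    series[name] = series[name] + 2.0 * ramp

results = {}
for label, drop in (("all", []), ("no_foxtails", ["foxtail_1", "foxtail_2"])):
    keep, recon, r = sensitivity(series, temp, calib, drop)
    medieval = recon[(years >= 1000) & (years < 1100)].mean()
    modern = recon[years >= 1900].mean()
    results[label] = (keep, recon, r)
    print(label, len(keep), "series, r =", round(r, 2),
          "11th c. mean =", round(medieval, 2), "20th c. mean =", round(modern, 2))
```

The same function handles substitutions as well as exclusions: swap one entry of `series` for an alternative version and rerun.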

Jones et al 1998 is shown in red in the EHT spaghetti graph (their Figure 4, excerpted above). The EHT Jones composite uses 6 series – 3 NH and 3 SH: Polar Urals (the Briffa 1995 version); Tornetrask (the Briffa 1992 version – which has an amusing "adjustment" discussed last year – see the Jones et al category); West Greenland ice core (a slightly different version than MBH); Cook’s Tasmania tree ring chronology and the Lara-Villalba tree ring chronology from Rio Alerce, Argentina.

Once again, I was able to make a nearly exact replication of the EHT Jones composite, as shown below (using the same color scheme as above). The appearance of this composite is obviously quite different from the Esper composite – with more high-frequency variability. Again, one would not be inclined to label this particular series a Hockey Stick.
EHT Jones composite. Black – archived version; red – emulation; blue – difference.

In this case, I’ve done a routine sensitivity analysis in which I’ve replaced the West Greenland stack used in Jones et al 1998 with the slightly later version used in MBH99 (I’ve verified that the Jones version is an earlier version with the originator of the data.) In the Euro All-Star Team, in their unaccountable policy of using older data, they use the Jones version. There’s not much difference between the two series, but I’ve used the later series. I’ve also used the updated Polar Urals version (from Esper) instead of the Briffa Polar Urals series (there are other problems with this series in the 11th century which I wrote about last year. I’m convinced that some of the tree cores in the Briffa 1995 study have been misdated.) Anyway, merely by using the updated Polar Urals version, the relative medieval-modern levels are changed. In this case, the 1856-1980 correlations for the variation (0.37) are a very slight improvement on the correlations for the Juckes version (0.36) – so any "significance" attributed to the Juckes version is shared equally by the version with a higher MWP.

20 Comments

Your ability to reproduce their results is impressive. You must be getting close to the point where you could pull a paper together on the ACTUAL robustness of these reconstructions (as opposed to their IMAGINED robustness). Nice job.

It’s interesting to see the series individually – disentangled from the spaghetti graphs plus the smear of instrumental temperatures.

I’ve got a really pretty graph which I’ll post up in a couple of days.

I think that I’ve identified a nice bit of cherry picking. If you look at their proxy rosters, a number of proxies, especially Moberg proxies, have been left out of the All-Star Team in addition to the Polar Urals Update being left out. Not very surprisingly, a number of the deselections have elevated MWP – the Sargasso Sea, Indigirka tree ring,… So I’ve made a composite of the cherry-unpicked series. It has a very elevated MWP and looks a lot like my apple-picked series submitted to NAS.

They have not provided ANY justification for why they keep some Moberg proxies and deselect others.

The traditional test of robustness in these multiproxy studies is to leave one proxy out and see how the reconstruction changes. This is a fairly minimal test, especially after pre-selection of proxies. Has anyone tried bootstrap resampling of the proxies, and looking at the between-reconstruction variance? This should make a more rigorous test.
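A rough sketch of what such a test could look like: draw the proxy set with replacement, rebuild a CVM composite from each draw, and look at the spread across the resulting reconstructions. The function names, the number of draws and the synthetic data are assumptions for illustration. Nothing here is taken from any of the published studies, and resampling cannot correct for a roster that was pre-selected.

```python
import numpy as np

def cvm(proxies, temp, calib):
    """Scale proxies, average, then match calibration mean and variance."""
    z = (proxies - proxies.mean(axis=0)) / proxies.std(axis=0, ddof=1)
    comp = z.mean(axis=1)
    scaled = (comp - comp[calib].mean()) / comp[calib].std(ddof=1)
    return scaled * temp[calib].std(ddof=1) + temp[calib].mean()

def bootstrap_cvm(proxies, temp, calib, n_boot=200, seed=0):
    """Resample proxy columns with replacement; one reconstruction per draw."""
    rng = np.random.default_rng(seed)
    n_years, n_prox = proxies.shape
    recons = np.empty((n_boot, n_years))
    for b in range(n_boot):
        # A draw may repeat some proxies and omit others entirely.
        cols = rng.integers(0, n_prox, size=n_prox)
        recons[b] = cvm(proxies[:, cols], temp, calib)
    return recons

# Synthetic stand-in data (random noise, not real proxies).
rng = np.random.default_rng(2)
years = np.arange(1000, 1981)
calib = years >= 1856
proxies = rng.normal(size=(years.size, 6))
temp = np.full(years.size, np.nan)
temp[calib] = rng.normal(loc=-0.2, scale=0.25, size=calib.sum())

recons = bootstrap_cvm(proxies, temp, calib)
# Between-reconstruction variance, year by year.
between_var = recons.var(axis=0, ddof=1)
# A simple 95% envelope across the resampled reconstructions.
lo, hi = np.percentile(recons, [2.5, 97.5], axis=0)
```

Because every draw is rescaled to the same calibration mean and variance, the envelope is mainly informative outside the calibration period, which is where the medieval-modern comparison lives.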

#4 Interesting idea. Bootstrapping is a good way to determine robustness when you anticipate having a small number (k ≥ 2) of “outliers” or “highly influential” observations; it is certainly better than “leave-one-out” methods. However, for the bootstrap to work, the sample you employ (the set of proxy series) must be representative of the target population. Unfortunately, as SteveM has repeatedly pointed out, the sample of proxy records used in many of the published studies has been “cherry-picked.” As a result, the “sampled population” is not the same as the “target population,” which greatly complicates interpretation of the results. I, for one, do not know of any general statistical method that can compensate for biased sampling.

You will drive yourself crazy if you try to avoid cherry-picked series; they’re all cherry-picked to some degree. I think these recons are fragile enough that a bootstrap resample will readily expose the lack of robustness.

#9. I suppose that what I’m doing here is a low-tech form of bootstrapping to allow for the effect of cherrypicking and datasnooping. I know which series they’ve left out and so the most logical experiment is to do the calculation with the left out series. If they’ve cherrypicked Briffa’s Yamal and excluded the Polar Urals Update, there’s no point bootstrapping using the cherrypicked series (I’m not saying anything that anyone here disagrees with.)

It seems to me that these simple counter-examples refute the bootstrapping arguments. I’m open to suggestions on the topic.

What I suggest is two flavors of bootstrap. (Doesn’t that sound tasty?)

(1) The first would include even the most snooped-out and untenable series, such as bcp, Yamal, Polar Urals. Because they’ve got the blade, they will be directly comparable to the Team’s work. But because you are doing a bootstrap resample, your confidence intervals will be fittingly wide.

(2) The second would exclude all suspicious series, such as those the NAS disfavors, or those that Team are always making sure are inserted in their recons (i.e. the source of Wegman’s Fig 5.8 HT recon non-independence). This will probably bring the MWP back, but at the same time show the wide uncertainty around the MWP. I suspect the envelope will be so wide as to entirely encompass Lamb’s “cartoon”.

The other point is that this kind of work is constructive and eminently publishable. As opposed to the “tendentious”, “poking holes” approach that is so disagreeable to TCO & the tree-ringers. (Good name for a band.)

#11. At least the Euro Team have provided enough documentation that one can more or less see what they’re doing and at least attempt sensitivity studies of this type. It’s all very well for TCO to argue that I should have done this for the other studies, but it’s not as easy as all that when I didn’t even have the Esper chronologies, which underpin other results. It also makes a big difference having Esper’s version of the Polar Urals Update. I can substantially replicate his calculation, but you can imagine the caterwauling that would have taken place had I presented a sensitivity study with my RCS calculation of the Polar Urals Update. (These RCS calculations are mathematically trivial, but the ringers make them out to be some kind of magic.)

On publication. Auditing is impossible without adequate documentation. That’s why they invented shredders. I’m amazed you’ve gotten as far as you have.

On RCS. Some people have to resort to making things sound like magic in order to get them on the shelf of the marketplace of ideas. The ones who just quietly go about their science don’t get the big grants.

I guess I’m dense. These studies simply do not follow the scientific method and should be rejected on that basis alone. I just don’t understand how any scientist can justify selecting only certain series that conform to their “expectations,” without some documented a priori basis for doing so. It must be that the tree ringers have somehow tacitly accepted some really bad scientific procedures. Maybe they actually think they are justified in doing this, because it is a standard procedure in dendrochronology.

Yes. I’ll post up a script showing my replication of the Euro Team composites. In this case, it looks like they’ve scaled on 1856-1980 – so that there is no “verification period”, but I might not have got it exactly.