Reconciling to Wahl and Ammann

When Wahl and Ammann’s script first came out, I was able to immediately reconcile our results to theirs – see here. As Wegman later said:

when using the same proxies and the same methodology as MM, Wahl and Ammann essentially reproduce the MM curves. Thus, far from disproving the MM work, they reinforce the MM work.

You’d never know this from Team and IPCC accounts of the matter, but this is actually the case. The most recent Wahl and Ammann fiasco with their RE benchmarks is merely one more example.

While they archived their code (excellent), it wasn’t set up to be turnkey, nor is their most recent code. My own code at the time wasn’t turnkey either, since my original concept in archiving code was really to provide documentation. However, having started down that road, it’s not much extra time or trouble to make things turnkey and, aside from the convenience to readers, there are considerable advantages for one’s own personal documentation in buttoning up code so that it is turnkey. When I picked up this file again in connection with the recently archived Ammann SI, I re-visited my script implementing Wahl and Ammann’s code and modified it slightly to make it turnkey.

I also modified it slightly so that I could readily prove that our results reconciled exactly to theirs, also with turnkey scripts, as I reported in May 2005. This reconciliation is demonstrated below on a line by line basis, which also serves to illuminate the structure of the calculations in a useful way.

It takes a little while to run and, when finished, will note on the console that it has “Read 1 item. Read 2 items. …” etc. It creates directories on your d: drive; if you don’t have a d: drive, manually edit the references in the script from d: to c:. This results in an R-list wahlfit with 11 items, one for each of the 11 MBH steps from 1400 to 1820 (1400 first).

The script prints out a few entries so that you can confirm that you’ve got the right object.

Next the script demonstrates the reconciliation of my linear algebra with Wahl and Ammann’s verbose and obscure code (which improves on the even more verbose and obscure original code). First, I collect some of the relevant targets from within the Wahl and Ammann code (made possible by collecting the objects into an R-list). The AD1400 step is in wahlfit[[1]], so we start our reconciliation there, first labelling that step, then collecting the proxy network, the reconstruction and the temperature PC1, giving each of them short names for further steps.

Next we do a little linear algebra to collect the weights of each reconstructed temperature PC in the final NH temperature reconstruction. (I showed a couple of years ago that a calculation involving hundreds of thousands of operations every time you picked up the file could be done once, since the MBH/WS recon is a linear combination of the reconstructed PCs with weights that do not vary, other than whether they are included or excluded.) The following implements this linear algebra in one line:
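The one-line R computation isn’t reproduced in this excerpt, but the underlying idea is easy to sketch. Since the NH recon is an exact linear combination of the reconstructed PCs, the weights can be recovered once by a single least-squares solve. Here is a toy illustration in Python/numpy (synthetic data and illustrative names only, not the actual Wahl and Ammann objects):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins: 581 "years" of ipc reconstructed temperature PCs and an
# NH series that is, by construction, a fixed linear combination of them.
ipc = 3
pcs = rng.normal(size=(581, ipc))        # reconstructed temperature PCs
true_w = np.array([1.8, -0.4, 0.1])      # fixed (unknown) weights
nh = pcs @ true_w                        # NH recon = linear combination

# Because the relation is exactly linear, one least-squares solve recovers
# the weights once and for all; the full calculation never needs rerunning.
w_hat, *_ = np.linalg.lstsq(pcs, nh, rcond=None)

print(np.allclose(w_hat, true_w))        # True
```

Once extracted this way, the weights can be stored and re-used; the point of the exercise is that they do not vary from run to run.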

The reconstruction itself also has very simple linear algebra – underneath all the inflated talk of maximizations and minimizations. This was a point that we emphasized as early as MM03. Wahl and Ammann have applied many of the same techniques, though, in my opinion, with some deterioration in insight, as they don’t seem to think naturally in algebraic terms. This linear algebra was posted up at the blog several years ago. In baby steps, here is the reconstruction calculation:

First, we collect the proxies (Y) and the temperature PCs (X) and scale them in the calibration period (1902-80).

Next we use equal weights (following WA; Mann used arbitrary weights, and I have no idea how he assigned them). Mannian methods result in the weights being squared (which is unexpected and unreported). It doesn’t “matter” with equal weights, which might explain why WA failed to follow MBH on this step in their benchmarking. It was not easy to figure out that Mannian code ended up squaring the weights.

weightsprx=rep(1,ncol(Y))   # equal weights for all proxies (following WA)
P=diag(weightsprx)^2        # Mannian method ends up squaring the weights

Next we calculate the vector of correlations of each proxy to the PC1. I implemented it this way in the 1-D case to show the commonality of this approach with correlation-weighted methods, believed by their proponents to be different from MBH.

rho=t(as.matrix(cor(pc[1:79,1:ipc],Y[(N-78):N,]))); #16 19

Mann then has a seemingly ad hoc re-scaling of the temperature PCs to the observed PCs. This has been a troublesome feature for people trying to replicate his results because it’s not mentioned anywhere in the original text; it seems to have been a stumbling block for Wahl and Ammann as well because, although they claimed that their implementation was free-standing, their code acknowledges a pers. comm. from Mann on this point. In our early code, this step was done in a purely empirical way, as Wahl and Ammann do as well. However, it turned out that this operation can also be expressed simply in linear algebra, which I’ve done here. This took me an undue amount of work a while ago to figure out and, aside from simplifying code, it is useful in laying bare the structure:
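As a sketch of what such a rescaling looks like in linear algebra (illustrated in Python/numpy with toy data; the exact conventions in the MBH and WA code, e.g. degrees of freedom or detrending, may differ):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy stand-ins: "observed" temperature PCs U and unscaled reconstructed
# PCs Uhat_raw over a 79-year calibration period (1902-1980 in MBH).
U = rng.normal(scale=2.0, size=(79, 2))
Uhat_raw = rng.normal(scale=0.5, size=(79, 2))

# The rescaling as a diagonal matrix: each reconstructed PC is multiplied
# by the ratio of calibration-period standard deviations, so diag(sd_adj)
# folds directly into the one-line matrix product for Uhat.
sd_adj = U.std(axis=0, ddof=1) / Uhat_raw.std(axis=0, ddof=1)
Uhat = Uhat_raw @ np.diag(sd_adj)

print(np.allclose(Uhat.std(axis=0, ddof=1), U.std(axis=0, ddof=1)))  # True
```

Expressed this way, the “empirical” rescaling becomes one more diagonal matrix in the chain of matrix products.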

Now the reconstructed temperature PC1 can be calculated in one line of linear algebra (I use the notation Uhat for the reconstructed PC1 since it is an estimate of the target temperature PC U):

Uhat = Y %*% P %*% rho %*% diag(sd.adj)

From this reconstructed PC1 and the vector of weights calculated previously, we can calculate the NH reconstruction for this step in one line, labelled tau below.

tau= ts(Uhat %*% weights.nh[1:ipc,],start=tsp(proxy)[1])

In case it’s useful, I also calculate the weights (beta) of the individual proxies in the reconstruction, which can also be done in one line:

beta= P %*% rho %*% diag(sd.adj) %*% weights.nh[1:ipc,]

Doesn’t look much like the hundreds of lines of Fortran code. (UC’s Matlab implementation is in the same spirit though it has nuances of difference.)
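To see the whole chain end to end, the one-liners above can be mimicked on synthetic data. This Python/numpy sketch uses toy dimensions and values (only the NH weight 1.81831789 is taken from the post); it also confirms that folding the weights into beta gives the same reconstruction:

```python
import numpy as np

rng = np.random.default_rng(2)

n_years, n_proxies, ipc = 581, 22, 1
Y   = rng.normal(size=(n_years, n_proxies))   # scaled proxy network (toy)
pc1 = rng.normal(size=n_years)                # target temperature PC1 (toy)

P      = np.diag(np.ones(n_proxies)) ** 2     # equal weights, squared
rho    = np.array([[np.corrcoef(Y[-79:, j], pc1[-79:])[0, 1]]
                   for j in range(n_proxies)])  # proxy-PC1 correlations, calibration period
sd_adj = np.array([1.0])                      # rescaling factor (toy value)
w_nh   = np.array([[1.81831789]])             # NH weight on PC1 (from the post)

Uhat = Y @ P @ rho @ np.diag(sd_adj)          # reconstructed PC1
tau  = Uhat @ w_nh                            # NH reconstruction
beta = P @ rho @ np.diag(sd_adj) @ w_nh       # per-proxy weights

# Folding the weights through shows the recon is linear in the proxies:
print(np.allclose(tau, Y @ beta))             # True
```

The equivalence of tau and Y %*% beta is the whole point: the reconstruction is simply a fixed linear combination of the proxies.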

Could something done so simply actually reconcile to Wahl and Ammann? Well, we collected their reconstruction for the AD1400 step above, so let’s calculate the maximum difference between the two series:

max(tau-recon) # 1.165734e-15

Bingo. I’ve collected the above operations into a function which requires three parameters – the proxy, the weights and the number of temperature PCs. This matches.

Sparse
The following shows how easily the “sparse” calculation can be implemented if one thinks about it logically. Mann defines a subset of 212 gridcells for verification (his “sparse” set); the locations are not listed anywhere. I blew up the Nature diagram to try to figure them out; Wahl and Ammann have an index list of NH sparse gridcells, which I’ve collected as sparse.index below. As with the NH reconstruction, the sparse reconstruction is a linear combination of reconstructed PCs, and the weights can be calculated in one go in an identical manner to the NH weights, as follows:

The estimate for the sparse region can then be calculated in one line as well:

sparse_est= ts(Uhat %*% weights.sparse[1:ipc,],start=tsp(proxy)[1])

Note that in this case, all that happens is that one weight differs a little, as only the first weight is used: the constant 1.62152 for the sparse estimate, as compared to 1.81831789 for the NH calculation, attenuates the estimate slightly. I’ll show that this also reconciles exactly, but first something odd has to be watched for. In constructing their calibration-verification combination, Wahl and Ammann don’t calculate calibration and verification statistics consistently on the same data set. They splice the estimate for the “sparse” subset in the verification period with the estimate for the “dense” subset in the calibration period. If these two things are spliced, you get a perfect match, as shown below:
range(W[[3]]$estimator- c(sparse_est[(1854:1901)-1399],recon[(1902:1980)-1399]) )
# -2.775558e-16 1.665335e-16

The function that I’ve used for the past few years (not as pretty as I’d write it now, but serviceable) gets identical calibration RE and verification RE scores, apples and apples:

Now here’s something for connoisseurs of Texas sharpshooting. What would happen if we did both calibration and verification on the same data set? On an earlier occasion, I extracted the target sparse data set. Instead of splicing two different series, let’s look at calibration and verification on the same time series. This time the calibration RE is only 0.177 (!), while the verification RE still rounds to 0.48 (there are probably rounding differences somewhere accounting for the slight difference in verification RE values).
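The RE comparison can be made concrete with a toy example. The RE definition below is the standard one (skill relative to the calibration-period mean; an assumption on my part rather than WA’s exact code), and the sketch shows why scoring the calibration period on one series and the verification period on another is not an apples-and-apples comparison:

```python
import numpy as np

def re_stat(obs, est, ref_mean):
    # RE = 1 - SSE(estimate) / SSE(reference mean); RE > 0 beats the mean.
    return 1.0 - np.sum((obs - est) ** 2) / np.sum((obs - ref_mean) ** 2)

rng = np.random.default_rng(3)
obs = np.cumsum(rng.normal(size=127))   # toy target series, "1854-1980"
ver, cal = obs[:48], obs[48:]           # verification 1854-1901, calibration 1902-1980

good_est = obs + rng.normal(scale=0.3, size=127)   # estimator that tracks the target
other    = rng.normal(scale=obs.std(), size=127)   # unrelated series

# Apples and apples: calibration and verification RE from the SAME estimator.
re_cal = re_stat(cal, good_est[48:], cal.mean())
re_ver = re_stat(ver, good_est[:48], cal.mean())

# Spliced: the calibration-period score comes from a different series, so the
# calibration/verification comparison mixes two different estimators.
re_cal_spliced = re_stat(cal, other[48:], cal.mean())

print(re_cal, re_ver, re_cal_spliced)
```

With the spliced construction, the calibration RE and verification RE are not statistics of one reconstruction; they are statistics of two.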

It looks like the request to fetch the AW file on UCAR’s website no longer works.

Steve: I just re-ran it tabula rasa and it worked fine. The script downloads a large gridded temperature data set (about 20 MB) – maybe there was an internet hiccup along the way. I’m on a fast cable connection. It might make sense to have that download in a standalone sequence.

Steve
Thank you for the commentary, as it makes the code semi-understandable. An old programmer says: if the commentary is included with the code, then it is there for all time and does not have to be rediscovered. When I look at code that I wrote some time ago (anywhere from several days to 35 years), the comments bring it back to life. I cannot imagine that other people would not have similar results.
Thanks again for the hard work.
Terry

Might it be more complete if you also look at the minimum (i.e. most negative) difference, or take an absolute value? Though I doubt it will show anything different.

What would happen if we did both calibration and verification on the same data set. On an earlier occasion, I extracted the target sparse data set. Instead of splicing two different series, let’s look at calibration and verification on the same time series. This time the calibration RE is only 0.177 (!), while the verification RE still rounds to 0.48

Can you explain why this happens, i.e. what the difference is between the two definitions? The nearest I could find was Bürger 2007, “On the verification of climate reconstructions”, which explains the distinction between CE and RE, but not in a way that lets me easily see the difference between verification and calibration RE if the period means are the same. I may have missed something obvious.

Steve: You’re looking for something that makes sense. This has nothing to do with definitions. They spliced two different series, calculated the calibration RE on one series and the verification RE on another, and then compared ratios from two different series !?!

With that in mind, I now turn to the latest paper, by Demetris Koutsoyiannis and colleagues, that is getting the inactivists excited. There are very clearly two parts to this paper. The first is a poor summary of the practice of climate modelling, touching all the recent contrarian talking points (global cooling, Douglass et al, Karl Popper etc.), but it is not worth dealing with in detail (the reviewers of the paper include Willie Soon, Pat Frank and Larry Gould (of Monckton/APS fame), so no guessing is needed for where they get their misconceptions). This is, however, just a distraction (though I’d recommend that the authors leave out this kind of nonsense in future if they want to be taken seriously in the wider field). The second part is their actual analysis, the results of which lead them to conclude that “models perform poorly”, and it is more interesting in conception, if not in execution.

re 7 Hey give the Team a break, Sonic. At least they did ask “What is the actual hypothesis you are testing?”
This is a great leap forward for “Climate Science” – the concept of a testable hypothesis.
Now all we need to do is apply this newfangled stuff to the “Hockey Stick”, increased CO2 = 4+ °C increased global temperature
etc. etc.

“There has been the most extraordinary series of postings at Climate Audit over the last week. As is usual at CA, there is a heavy mathematics burden for the casual reader, which, with a bit of research I think I can now just about follow. The story is a remarkable indictment of the corruption and cynicism that is rife among climate scientists, and I’m going to try to tell it in layman’s language so that the average blog reader can understand it. As far as I know it’s the first time the whole story has been set out in a single posting. It’s a long tale – and the longest posting I think I’ve ever written and piecing it together from the individual CA postings has been a long, hard but fascinating struggle. You may want to get a long drink before starting, and those who suffer from heart disorders may wish to take their beta blockers first.”

I haven’t read the entire filing yet, but a quick skim shows it just repeats talking points on every factual issue it discusses, including multiple ones known to be wrong. There’s not much to say about it that hasn’t been said dozens of times. However, there is at least one thing that’s new:

The statement that Dr. Mann “has molested and tortured data in the service of politicized science that could have dire economic consequences for the nation and the planet” is plainly factual and verifiable.

It “is plainly factual and verifiable” that Mann “has molested and tortured data.” How do you verify somebody tortured and molested data? Do you ask the data where it was touched?

That, presumably, would be the same Wahl and Ammann cited by the National Academy of Sciences report as evidence that Mann’s reconstruction was no more informative than the simple mean of the data:

Reconstructions that have poor validation statistics (i.e., low CE) will have correspondingly wide uncertainty bounds, and so can be seen to be unreliable in an objective way. Moreover, a CE statistic close to zero or negative suggests that the reconstruction is no better than the mean, and so its skill for time averages shorter than the validation period will be low. Some recent results reported in Table 1S of Wahl and Ammann (in press) indicate that their reconstruction, which uses the same procedure and full set of proxies used by Mann et al. (1999), gives CE values ranging from 0.103 to –0.215, depending on how far back in time the reconstruction is carried.

NAS Report (2006) page 91.

This is the NAS report, of course, that accepted our assertion that the underlying PCA method biases the shape of the results by loading too much weight on the bristlecone pine record:

As part of their statistical methods, Mann et al. used a type of principal component analysis that tends to bias the shape of the reconstructions. A description of this effect is given in Chapter 9. In practice, this method, though not recommended, does not appear to unduly influence reconstructions of hemispheric mean temperature…The more important aspect of this criticism is the issue of robustness with respect to the choice of proxies used in the reconstruction. For periods prior to the 16th century, the Mann et al. (1999) reconstruction that uses this particular principal component analysis technique is strongly dependent on data from the Great Basin region in the western United States. Such issues of robustness need to be taken into account in estimates of statistical uncertainties.

NAS report (2006) page 106-107

The data to which they are referring is the bristlecone pine record, which the panel recommended should not be used:

The possibility that increasing tree ring widths in modern times might be driven by increasing atmospheric carbon dioxide (CO2) concentrations, rather than increasing temperatures, was first proposed by LaMarche et al. (1984) for bristlecone pines (Pinus longaeva) in the White Mountains of California. In old age, these trees can assume a “stripbark” form, characterized by a band of trunk that remains alive and continues to grow after the rest of the stem has died. Such trees are sensitive to higher atmospheric CO2 concentrations (Graybill and Idso 1993), possibly because of greater water-use efficiency (Knapp et al. 2001, Bunn et al. 2003) or different carbon partitioning among tree parts (Tang et al. 1999)… ‘strip-bark’ samples should be avoided for temperature reconstructions

NAS Report page 50

Among our other findings as reported to the NAS was that the uncertainty levels of Mann’s reconstructions were underestimated. Did they find that an inaccurate claim?

Regarding metrics used in the validation step in the reconstruction exercise, two issues have been raised (McIntyre and McKitrick 2003, 2005a,b). One is that the choice of “significance level” for the reduction of error (RE) validation statistic is not appropriate. The other is that different statistics, specifically the coefficient of efficiency (CE) and the squared correlation (r2), should have been used (the various validation statistics are discussed in Chapter 9). Some of these criticisms are more relevant than others, but taken together, they are an important aspect of a more general finding of this committee, which is that uncertainties of the published reconstructions have been underestimated.

NAS Report p. 107.

And then there’s McShane and Wyner 2011; and if Mann thinks the Annals of Applied Statistics is not a peer-reviewed journal, let’s see him get published in it. Wahl and Ammann make an appearance there too:

That the null models may be too weak and the associated standard errors in papers such as Mann et al. (1998) are not wide enough has already been pointed out in the climate literature (von Storch et al., 2004). While there was some controversy surrounding the result of this paper (Wahl et al., 2006), its conclusions have been corroborated (von Storch and Zorita, 2005; von Storch et al., 2006; Lee et al., 2008; Christiansen et al., 2009).

McShane and Wyner 2011, pp. 17-18

and

We are not the first to observe this effect. It was shown, in McIntyre and McKitrick (2005a,c), that random sequences with complex local dependence structures can predict temperatures. Their approach has been roundly dismissed in the climate science literature… Ammann and Wahl (2007) claim that significance thresholds set by Monte Carlo simulations that use pseudo-proxies containing ”short term climate signal” (i.e., complex time dependence structures) are invalid:

Such thresholds thus enhance the danger of committing Type II errors (inappropriate failure to reject a null hypothesis of no climatic information for a reconstruction).

We agree that these thresholds decrease power. Still, these thresholds are the correct way to preserve the significance level. The proxy record has to be evaluated in terms of its innate ability to reconstruct historical temperatures (i.e., as opposed to its ability to ”mimic” the local time dependence structure of the temperature series). Ammann and Wahl (2007) wrongly attribute reconstructive skill to the proxy record which is in fact attributable to the temperature record itself. Thus, climate scientists are overoptimistic: the 149 year instrumental record has significant local time dependence and therefore far fewer independent degrees of freedom.

McShane and Wyner p. 19
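The degrees-of-freedom point in this passage can be illustrated with a toy calculation. The lag-1 effective-sample-size formula below is a standard rule of thumb, not taken from McShane and Wyner:

```python
import numpy as np

def n_eff(x):
    # Effective sample size under lag-1 autocorrelation r1:
    #   n_eff = n * (1 - r1) / (1 + r1)
    r1 = np.corrcoef(x[:-1], x[1:])[0, 1]
    return len(x) * (1 - r1) / (1 + r1)

rng = np.random.default_rng(4)
n = 149                                  # length of the instrumental record

white = rng.normal(size=n)               # independent observations

ar = np.empty(n)                         # persistent AR(1) series, phi = 0.9
ar[0] = rng.normal()
for t in range(1, n):
    ar[t] = 0.9 * ar[t - 1] + rng.normal()

# Persistence sharply reduces the number of independent degrees of freedom.
print(n_eff(white), n_eff(ar))
```

A highly persistent 149-year series carries the information of only a handful of independent observations, which is exactly why local time dependence inflates apparent reconstructive skill.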

McShane and Wyner are the gentlemen who showed, using Mann’s own data, that one cannot draw strong conclusions about the relative warmth of the MWP compared to today, which was the same point we made back in 2003:

Figure 14 reveals an important concern: models that perform similarly at predicting the instrumental temperature series (as revealed by Figures 11, 12, and 13) tell very different stories about the past. Thus, insofar as one judges models by cross-validated predictive ability, one seems to have no reason to prefer the red backcast in Figure 14 to the green even though the former suggests that recent temperatures are much warmer than those observed over the past thousand years while the latter suggests they are not.

McShane and Wyner pp. 30-31.

[We] conclude unequivocally that the evidence for a ”long-handled” hockey stick (where the shaft of the hockey stick extends to the year 1000 AD) is lacking in the data. The fundamental problem is that there is a limited amount of proxy data which dates back to 1000 AD; what is available is weakly predictive of global annual temperature… Consequently, the long flat handle of the hockey stick is best understood to be a feature of regression and less a reflection of our knowledge of the truth… Climate scientists have greatly underestimated the uncertainty of proxy-based reconstructions and hence have been overconfident in their models. We have shown that time dependence in the temperature series is sufficiently strong to permit complex sequences of random numbers to forecast out-of-sample reasonably well fairly frequently (see, for example, Figure 9). Furthermore, even proxy-based models with approximately the same amount of reconstructive skill (Figures 11, 12, and 13) produce strikingly dissimilar historical backcasts: some of these look like hockey sticks but most do not (Figure 14).