Reply to Huybers #1

An article by Peter Huybers has been accepted at GRL together with our Reply. I’m going to give a preview here; it will take a few posts.

Our original article is here. (Copyright 2005 American Geophysical Union. Further reproduction or electronic distribution is not permitted.) The Huybers Comment is posted here. Our reply is here. (Accepted for publication in GRL. Copyright 2005 American Geophysical Union. Further reproduction or electronic distribution is not permitted.)

Before discussing Huybers, I’d like to re-post the Abstract of our GRL article entitled “Hockey Sticks, Principal Components and Spurious Significance”. There has been so much disinformation about the article, especially about the supposed “MM reconstruction”, that it is useful to occasionally remind oneself of what we actually said. Our concerns about MBH98 were the biased PC methodology, robustness, statistical significance and proxy selection. Here’s our abstract:

The “hockey stick” shaped temperature reconstruction of Mann et al. [1998, 1999] has been widely applied. However it has not been previously noted in print that, prior to their principal components (PCs) analysis on tree ring networks, they carried out an unusual data transformation which strongly affects the resulting PCs. Their method, when tested on persistent red noise, nearly always produces a hockey stick shaped first principal component (PC1) and overstates the first eigenvalue. In the controversial 15th century period, the MBH98 method effectively selects only one species (bristlecone pine) into the critical North American PC1, making it implausible to describe it as the “dominant pattern of variance”. Through Monte Carlo analysis, we show that MBH98 benchmarks for significance of the Reduction of Error (RE) statistic are substantially under-stated and, using a range of cross-validation statistics, we show that the MBH98 15th century reconstruction lacks statistical significance.
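For readers who want to see the mechanics for themselves, here is a minimal sketch of the short-centering experiment. It is an illustration under simplifying assumptions, not the published simulation code: AR(1) red noise stands in for the trend-persistent noise actually used in MM05, the network dimensions merely mimic the NOAMER roster, and all function names are my own.

```python
import numpy as np

rng = np.random.default_rng(0)

def red_noise(n_years, n_series, rho, rng):
    """Persistent red noise: independent AR(1) series with autocorrelation rho."""
    x = np.zeros((n_years, n_series))
    eps = rng.standard_normal((n_years, n_series))
    for t in range(1, n_years):
        x[t] = rho * x[t - 1] + eps[t]
    return x

def pc1(X, short_center=False):
    """First principal component series of the columns of X.
    short_center=True subtracts only the mean of the final 79 'years'
    (an MBH98-style 1902-1980 short centering); otherwise the full mean."""
    mean = X[-79:].mean(axis=0) if short_center else X.mean(axis=0)
    Xc = X - mean
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[0]

def hockey_stick_index(series):
    """Blade mean minus shaft mean, in shaft standard-deviation units."""
    shaft, blade = series[:-79], series[-79:]
    return (blade.mean() - shaft.mean()) / shaft.std()

trials = 100
hsi_short = [abs(hockey_stick_index(pc1(red_noise(581, 70, 0.9, rng), True)))
             for _ in range(trials)]
hsi_full = [abs(hockey_stick_index(pc1(red_noise(581, 70, 0.9, rng), False)))
            for _ in range(trials)]
# Short centering pushes the hockey stick index up sharply
# relative to conventional full-period centering.
print(np.mean(hsi_short), np.mean(hsi_full))
```

Even in this toy form, the short-centered PC1s come out with a pronounced blade far more often than the conventionally centered ones, which is the bias at issue.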

On Apr. 28, 2005, Peter Huybers wrote a pleasant letter to me, inquiring about our work. (Our subsequent correspondence has mostly been cordial, although I marvel at the centrifugal tendencies of academic discourse.) Unlike the Hockey Team, I actually like inquiries and I sent him back a nice and detailed letter the same day, summarizing our work and giving some updated thoughts on these issues. This prompted an immediate response from Huybers. We’ve had lengthy correspondence and I do not propose to recapitulate it all. (I’ve done so with Crowley and Mann because of their egregious public misrepresentations of the actual correspondence.)

I provide a limited discussion of this correspondence to clarify points that are not clear in the article itself, especially where the correspondence evidences points of agreement, which are either unstated in the article or stated in very obscure terms.

In my opinion, our Reply is a complete response to Huybers’ points and, in one aspect, improves our original article. But you can decide for yourselves. The following headings correspond to the key points of our original abstract.

Biased Methodology
Our first point was that the MBH PC method was different from what was represented in the original article and was strongly biased. We had reported that, when applied to persistent red noise, the MBH98 PC method yielded hockey sticks over 99% of the time. Subsequent to the GRL article, we’ve done new (unreported) studies on the effect of 1-2 flawed proxies (e.g. proxies with nonclimatic trends) on the MBH98 PC method and found that 1-2 "bad apples" have an even more profound effect than a pure red noise situation. I mentioned this unpublished work to Huybers in my first letter as follows:

Our main point with the MBH98 method (in statistical terms) was that it was biased – it mined for hockey stick shaped series. This has been confirmed by a few other commentators (e.g. von Storch, Zwiers) although the ultimate impact of this bias is controversial. Intuitively, if your conclusion is that climate has a hockey stick shaped history, using such a biased method is a pretty risky way of supporting that conclusion. …
In some other experiments, arising out of this discussion, we’ve experimented with simulations in which some proxies have an added non-climatic fertilization trend – this demonstrates the effect a little more clearly.

Huybers replied that the existence of the bias in the MBH98 PC method was not an issue as far as he’s concerned (with similar comments in later correspondence):

I thought the PDF you showed of the hockey-stick index was highly convincing regarding the bias of un-centered PCs.

In his article, Huybers credits us with pointing out a bias in the MBH reconstruction and later reports that "this bias was checked using a Monte Carlo algorithm independent of MM05’s". His opening paragraph mentions "having reproduced the statistical results of MM05".

We thought that Huybers’ comment was not very clear as to what he had replicated, so, in our Reply, we made a very clear statement on his agreement as follows:

McIntyre and McKitrick [2005a, “MM05″] showed that the actual MBH98 PC method used an unreported short-centering method, which was biased towards producing a hockey stick shaped PC1 with an inflated eigenvalue. Huybers concurs with these particular findings…

Our editor requested that we take this statement out, on the basis that we were trying to divert attention away from the "real" differences of opinion. We argued vehemently against this, on the basis that the community would ultimately want to know what we agreed on as well as what we disagreed on. I had raised this issue with Huybers directly when I learned, some time after his first letter, that he was planning a Comment to GRL. I said:

While Comments in journals tend to be biased towards negative comments, in this case, many people are seeking guidance on what to think. To the extent that you agree with many or even some of our points, as indicated below, and have verified at least some of the points in dispute, I think that it would be very constructive to submit a Comment reporting on such verification and I think that GRL would probably welcome something like that. That’s not to say that you shouldn’t also submit on points of disagreement. However, under the circumstances, it’s such a contentious issue that comments about R2 versus RE, on robustness etc. would itself probably attract a lot of attention.

Huybers replied to this a few days later as follows:

Steve, as you noted earlier, comments tend to be rather negative, but it can also be useful to point out where results corroborate what you initially found. To that end, the comment both starts and concludes by calling attention to how short-centering biases the PCA results and hence the MBH98 reconstruction. I also made note of the possible non-temperature effects on the tree rings and the R2 statistical results you published….

We quoted this paragraph to our editor and he agreed that we could say that Huybers "concurred" with the finding of bias in the MBH PC methodology. But it seemed like an odd point to have to fight to get into print. I’ll refer to this paragraph again in connection with the R2 results. Thus, whatever else may be in dispute, the data-mining bias of the MBH PC method is not on the table as far as Huybers was concerned. In fact, in one of his later emails, he said:

As I mentioned earlier, it seems to me the "short-centered" PCA does affect the results and this is a bias that should be accounted for. Efforts would seem better applied at correcting for the bias as opposed to arguing for its insignificance.

Non-Robustness to Bristlecones
The second big issue is the non-robustness of MBH98 to bristlecones. In our article, we said that (1) the biased MBH method preferentially selected bristlecones into their PC1, and that this series, which MBH had identified in previous controversy as the "dominant component of variance", consisted almost entirely of bristlecones; and (2) there were serious questions in the specialist literature [Graybill and Idso, 1993] about the validity of bristlecones as a temperature proxy due to potential CO2 fertilization. We expanded considerably on this issue in our EE article, where, in addition to CO2 fertilization, we noted other possible non-temperature factors including increased precipitation, phosphate fertilization and nitrate fertilization.

The non-robustness of MBH results to bristlecones is notably avoided by realclimate. Their defence is now that any methodology which either does not use the flawed bristlecones (or which reduces their weight in the final reconstruction) amounts to "throwing out" data. They don’t explain exactly how they propose to reconcile this defence with their claims that their reconstruction is robust to the presence/absence of all dendroclimatic indicators (which presumably includes bristlecones).

I raised these issues in my April 28 letter to Huybers as follows:

we have tried to emphasize the relationship between data and methods. We then followed the biased method into the critical North American tree ring network to see what it did: it picked out bristlecone pines which have an anomalous 20th century growth spurt, explicitly said by specialists not to be temperature related. If you remove the bristlecone pines from the dataset, the hockey stick disappears…

We pointed out that, using centered PC calculations and 2 retained PC series (as in MBH98), the bristlecone pattern goes down to the PC4. Mann et al. have argued that they can get high 15th century results if they use 5 PCs. (We discuss this in E&E.) This goes to robustness – now climate history turns on the PC4 or, alternatively, on the bristlecones (which are a flawed proxy). In effect, the biased method searches for and overweights the worst proxies. Another unattractive property of the biased method is illustrated in our E&E article, where we show that the MBH98 method will invert increased ring widths in 15th century non-bristlecones and show lower 15th century results.

Huybers seemed to agree with our concerns about the dependence of MBH results on bristlecones with the following:

I have followed this discussion. Results should be made as robust as possible, and I agree that it is unsatisfying when results are sensitive to a small subset of the data.

In our article, we pointed out that the bristlecones were weighted very differently under the MBH PC methodology and under a conventional PC calculation. We illustrated the differences in Figure 3 of our GRL article, showing that the differences were not just trivial. There are two options in PC algorithms – using a covariance matrix and using a correlation matrix. MBH only said that they used a "conventional" calculation. If a network is in common units, the conventional methodology is a covariance PC calculation, which is what we used to make Figure 3 and what we used in our emulation of MBH98 under centered PC calculations presented in our EE article (no such calculations were presented in our GRL article). We were not saying that covariance PCs would yield a meaningful indicator out of the swamp of MBH tree ring chronologies – only that this is what a reasonable person implementing MBH methodology would do. Huybers asked about this in his first letter. I replied:

In our calculations, we used the covariance matrix rather than the correlation matrix. Tree ring chronologies are already standardized to dimensionless ratios. So I think that the use of the covariance matrix rather than the correlation matrix is the preferred route and can be justified in standard texts. The use of a correlation matrix (i.e. re-normalizing) is certainly an option, but climate history should not stand or fall on this choice. The bristlecones do get promoted higher with a correlation matrix than with a covariance matrix. In our recent debates with Mann, they’ve also used the covariance matrix – their PC4 under centered calculations (shown at realclimate) matches our own calculations. The big issues are robustness and proxy selection- these remain unchanged.

Huybers’ response seemed to agree with this position, replying:

Yes, it is again rather unsatisfying that answers are so sensitive to seemingly small changes in technique.
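The covariance-versus-correlation sensitivity can be illustrated with a toy network. The series below are hypothetical (not the NOAMER data) and the setup is my own: one high-variance trending series alongside several low-variance series that share a common signal. Renormalizing each series to unit variance (the correlation-matrix route) promotes the low-variance series, and can flip which pattern dominates the PC1.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500
# Toy network: one high-variance trending series, plus nine
# low-variance series sharing a common low-amplitude signal.
common = rng.standard_normal(n)
trend = np.linspace(0.0, 3.0, n) + 0.3 * rng.standard_normal(n)
low_var = 0.2 * common[:, None] + 0.1 * rng.standard_normal((n, 9))
X = np.column_stack([trend, low_var])

def pc1_weights(X, use_correlation):
    """Loadings of the leading PC, from the covariance or correlation matrix."""
    Xc = X - X.mean(axis=0)
    if use_correlation:
        Xc = Xc / Xc.std(axis=0)  # renormalize each series to unit variance
    _, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
    return np.abs(vecs[:, -1])   # eigenvector of the largest eigenvalue

w_cov = pc1_weights(X, use_correlation=False)
w_cor = pc1_weights(X, use_correlation=True)
print(w_cov.argmax(), w_cor.argmax())  # which series dominates each PC1
```

In the covariance calculation the trending series dominates; after renormalization the low-variance block takes over. Whether renormalizing is appropriate for chronologies already standardized to dimensionless units is exactly the point in dispute.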

However, this ended up being one of the two main issues in his Comment. I’ll discuss this at length in my next post.

As to the impact of bristlecones, Huybers did not directly address our finding. He acknowledged that the validity of bristlecones as a proxy would be a question for "future work", as follows:

"Another point raised by MM05 is that many of the strongest trends in tree ring chronologies may be unrelated to temperature change [Graybill and Idso, 1993] – in future studies, this may warrant the exclusion or down-weighting of certain records, but this is an additional step which would have to be explicitly stated."

In our Reply, we pointed out that the net effect of the correlation PC1 was to increase the weight of the bristlecones and that it is a little late in the day to defer the determination of bristlecone validity to "future studies". IPCC 2AR had already taken the position that CO2 fertilization was an issue that needed to be handled with caution prior to relying on tree ring chronologies. In our opinion, it is unacceptable that the Graybill and Idso bristlecones – the sites most identified with the issue of CO2 fertilization – should have come to dominate the canonical temperature reconstruction through the back door. If climate reconstruction is to depend on proxies which may be affected by fertilization, this issue should have been articulated and argued in the plain light of day back in 1998, rather than being decoded some years later. MBH98 warranted that the proxies had been carefully selected. You can’t now say that the study of their validity should be deferred to "future studies".

Thus, aside from all the technical issues about correlation versus covariance PCs – and we think that our position is impeccable on these issues – the fact that the correlation PC1 increases the impact of bristlecones is not itself a result that should be automatically accepted. Any unsupervised algorithm like PCA requires a little supervision prior to making a climate history.

Spurious Significance
The third leg of our argument — “spurious significance” — is a term in the title of our article that is seldom discussed in the controversy by the realclimate side (for obvious reasons). In our article, we pointed out that, in the controversial 15th century, the MBH reconstruction failed (catastrophically) the most standard cross-validation test (R2), a test used by the Hockey Team in other studies and for the AD1820 step of MBH98 itself, as well as failing other standard cross-validation statistics (CE etc), used by the Hockey Team elsewhere.

While it was then only a surmise that MBH had calculated cross-validation R2 statistics for the 15th century (and not reported them), the recent source code provided to the Barton Committee proves beyond a doubt that the cross-validation R2 statistic was calculated (and not reported). As reported on this blog, the source code shows the calculation of the cross-validation R2 very clearly together with other statistics. Then when one examines the SI to MBH98, the other statistics are collated into a table but the cross-validation R2 is left out. (There’s one other statistic that’s left out: the RE for their El Nino reconstruction, although in this case, they inconsistently reported the cross-validation R2 statistic. One is left to guess that the El Nino cross-validation RE statistics might also be pretty bad.)

If the MBH temperature index has a true underlying relationship to temperature, it is impossible for it to have a cross-validation R2 of 0.01. It is not the only statistic that should be looked at, but it is a very important one to look at. If a reconstruction fails a very simple test like this, then it’s not a valid reconstruction. The realclimate response, such as it is, is to present a bizarre and hypothetical synthetic case where a reconstruction passes an R2 test with flying colors, but does not pass an RE test. They then accuse us of proposing the exclusive use of the R2 test. Obviously, we did no such thing. We commented on these issues to Huybers in our original letter as follows:

What we observed is that the PC1s from the simulations yielded a spuriously high RE statistic and negligible R2. This is what we observed in the MBH98 15th century calculation. It’s my view that if they are recovering a "temperature signal", they should have both a high RE and R2 statistic…

The take-home point is really that you can have spuriously high RE statistics and that a "skilful" reconstruction should pass several verification statistics. Mann’s response to this has been a diatribe against the R2 statistic.
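To make the two statistics concrete, here is a minimal sketch in my own notation (not the MBH98 code): RE compares the reconstruction’s squared error against a calibration-mean benchmark, while R2 is the squared correlation over the verification period. A "reconstruction" that captures only the verification-period mean level, with wiggles that are pure noise, scores a high RE and a negligible R2 – exactly the spurious combination at issue.

```python
import numpy as np

def re_statistic(obs, pred):
    """Reduction of Error: 1 - SSE(pred) / SSE of the zero benchmark,
    with series expressed as anomalies from the calibration-period mean."""
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum(obs ** 2)

def r2_statistic(obs, pred):
    """Squared Pearson correlation over the verification period."""
    r = np.corrcoef(obs, pred)[0, 1]
    return r * r

rng = np.random.default_rng(2)
n = 200
# Verification-period "observations": a level shift of -0.5 from the
# calibration mean, plus interannual noise.
obs = -0.5 + 0.2 * rng.standard_normal(n)
# A "reconstruction" that captures the level shift but whose wiggles
# are pure noise, unrelated to the observed year-to-year variations.
pred = -0.5 + 0.2 * rng.standard_normal(n)

print(round(re_statistic(obs, pred), 2), round(r2_statistic(obs, pred), 3))
# RE comes out strongly positive; R2 comes out near zero.
```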

Again Huybers seemed to agree as follows:

It would be more convincing that the reconstruction had skill if both the R2 and RE statistics were significant. I believe this is a valid point which you made clearly in the GRL paper. The MBH98 methodology is unfortunately difficult to understand – climate reconstructions would benefit both from greater accuracy and clarity.

In the comment quoted above, Huybers said that in his article:

I also made note of the possible non-temperature effects on the tree rings and the R2 statistical results you published….

In his first version of his article, Huybers stated:

Note that unlike RE, the cross-correlation statistic [R2] is insensitive to changes in variance and thus the MM05 estimate of the cross-correlation critical value (indicating that the observed cross-correlation is insignificant) is not biased.

Somewhat perversely, this statement was left out of the final version. (This statement is slightly incorrect: the R2 statistic has a known distribution. We concluded that it was insignificant based on tables, rather than simulations, although the simulations gave results that were more or less consistent with the tables.) We initially interpreted this comment, together with Huybers’ correspondence, as indicating that he had replicated our R2 results and we felt that it was important to report this apparent agreement. Again our editor felt that this was an attempt to "divert" attention from the "real" issues. Then it turned out that Huybers did not appear to have calculated the R2 statistic. (Note that Wahl and Ammann also failed to report the cross-validation R2 statistic on their webpage. Given our specific focus on this issue, this reluctance to calculate a simple statistic seems incomprehensible – you’d almost think that they were all afraid of the answer.) Anyway we negotiated back and forth on what we could say about Huybers’ position. Eventually Huybers agreed that we could say that he "did not dispute" our calculation of the R2 statistic, so we went with that and our editor agreed.

The Issues
So if Huybers agrees or doesn’t disagree about these things, what is Huybers’ article actually about? It’s about two things.

First, he argues that Figure 3 of our article, which illustrated the impact of the biased MBH method on the NOAMER network, “exaggerated” the bias by using a covariance PC1 instead of a correlation PC1, another possible PC methodology, although one that we do not believe can be justified. We will show that both of Huybers’ references recommend use of covariance PC1s when the networks are in common units (as here, where the networks are already standardized to dimensionless units). Huybers provides a figure supposedly showing this exaggeration. But when you examine the figure, you find that the supposed "bias" depends on short-segment centering – a bizarre and ironic situation to say the least. Huybers also described the covariance PC1 as the “MM05 normalization”, which we supposedly "proposed" as a means of "removing the bias" in MBH98. Of course we did no such thing. We protested vehemently to GRL about this mischaracterization of what we did, but had no success whatever in getting this language altered. Any reader of our work knows that we endorse nothing in MBH methodology and have not proposed any methods of "removing a bias". Merely demonstrating the result of a covariance PC calculation on the NOAMER network does not transmogrify that calculation into an “MM normalization convention”. More on this over the weekend.

The second issue pertains to the benchmarking of the RE statistic. Huybers observed that variances of the simulated PC1s used in our explanation of the spurious RE statistic were less than the corresponding target variances and proposed that these should be re-scaled so that the variances matched the target variances. When Huybers re-scaled the variances, he got an RE benchmark of 0, so that the MBH result once again seemed to have statistical “significance”. Readers of this blog will be attentive to issues of “spurious” significance, i.e. where a statistic is high even without a valid model. Spurious relationships can sometimes be discovered by looking at other statistics, e.g. the Durbin-Watson in some of the discussions we’ve had here. Even if Huybers were right, all that would happen is that the spurious RE statistic would be unexplained.

However, we had a really interesting response, which built on some prior discussions on this blog. You may recall John Hunter grinding me in the spring about whether the hockey stick-ness observed in the simulated PC1s would carry through to a NH reconstruction under MBH methods. (Having raised a good question, he rather spoiled the effect by asking me for an answer about every 4 hours, as though I had nothing else to do.) In a reasonable length of time (a few days), I checked the results by making up a proxy network consisting of 21 white noise series and 1 simulated PC1. The hockey stick-ness carried through, which I reported on the blog.

I used the same approach in replying to Huybers. I used the simulated PC1s saved from our GRL simulations, made up networks of 22 series (with 21 white noise series) and did an MBH-type calculation. Bingo. We got high RE statistics together with matching variances. Our explanation that the RE statistic was spurious held together. I’m going to describe these results in detail in a follow-up post, because I found them interesting and because they show some interesting statistical aspects of the MBH model. I’ll show some very pretty (to me) diagrams illustrating what’s going on. I’ll also show why the "other 20" MBH proxies model like white noise.
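A stripped-down sketch of that kind of experiment is below. It is my own simplification, not the actual reply calculations: a synthetic hockey-stick series stands in for a saved simulated PC1, and plain least squares stands in for the MBH inverse-regression machinery. Still, it shows the mechanism: one spuriously trending series plus white noise can calibrate against a trending instrumental period and then return a strongly positive RE with a negligible verification R2.

```python
import numpy as np

rng = np.random.default_rng(3)
n_cal, n_ver = 79, 46   # e.g. 1902-1980 calibration, 1856-1901 verification

# "Instrumental temperature" as anomalies from the calibration mean:
# a rising calibration period and a cooler verification period.
t_cal = np.linspace(-0.4, 0.4, n_cal) + 0.1 * rng.standard_normal(n_cal)
t_ver = -0.5 + 0.1 * rng.standard_normal(n_ver)

# Proxy network: one hockey-stick series (flat shaft, 20th-century blade)
# plus 21 white-noise "proxies".
hs_cal = np.linspace(0.0, 1.0, n_cal) + 0.1 * rng.standard_normal(n_cal)
hs_ver = 0.1 * rng.standard_normal(n_ver)
P_cal = np.column_stack([hs_cal, rng.standard_normal((n_cal, 21))])
P_ver = np.column_stack([hs_ver, rng.standard_normal((n_ver, 21))])

# Calibrate by ordinary least squares (with intercept), then reconstruct.
A_cal = np.column_stack([np.ones(n_cal), P_cal])
beta, *_ = np.linalg.lstsq(A_cal, t_cal, rcond=None)
recon = np.column_stack([np.ones(n_ver), P_ver]) @ beta

re = 1.0 - np.sum((t_ver - recon) ** 2) / np.sum(t_ver ** 2)
r = np.corrcoef(t_ver, recon)[0, 1]
print(round(re, 2), round(r * r, 3))  # positive RE, negligible R2
```

The reconstruction inherits the right verification-period level from the calibration fit to the blade, so RE is high, while its year-to-year wiggles are noise, so R2 is negligible – the same RE/R2 pattern discussed throughout.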

Postscript
As you will see, our view is that our simulations using a pseudoproxy network (rather than Huybers’ simple re-scaling) completely outflank Huybers. In my opinion, there should be no residual point of dispute between competent people (as Huybers is). However, the affair is very partisan and, rather than these new outflanking results putting an end to the issue, I was (and remain) concerned that partisans will pick on some of Huybers’ claims to prolong the controversy, even if those claims are completely refuted.

Given what I perceive as being an underlying interest in resolving these and similar matters, I made the following offer to Huybers:

1) that he could review our simulations replying to him with no obligation;
2) if he agreed with our results – and only if he fully agreed with our results – then we would submit a joint paper to GRL reporting on agreed results;
3) if he did not agree fully with our Reply, then we would proceed on the course that we were pursuing.

I suggested to him that, while this might be unorthodox in academic terms, it was something that I thought GRL would welcome and that the broader community would certainly welcome. Huybers showed no interest whatever in this offer. So readers will be left one more time to try to sort this stuff out. I thought that we made a good suggestion and I’m sorry that the opportunity was missed. Having said that, I’m satisfied that our response is accurate and thorough, that Huybers has made no points which are not completely responded to, and I am quite content to let the chips fall where they may.

50 Comments

Sorry to ask this so late in the day, but the frequent references to “15th Century cross-validation tests” means that you take, say, the 10 tree-ring series that you have for the 15th century and see how well the size of the growth rings correlate with each other in that century? And if they do then maybe they are a good temperature proxy, but if the correlation between them is near-zero, then you got nothin?

No. MBH98 did calculations in steps, changing the roster of proxies with each step as proxies became unavailable. The AD1820 step included 11 actual temperature series (which are a peculiar “proxy” for temperature). The AD1400 step has 22 proxies, of which 3 are PC series summarizing networks of 70 and 6 sites.

A high cross-validation R2 for the 15th century step means for the reconstruction using the AD1400 proxies only. One of the reasons that the reconstruction has a cross-validation R2 of ~0 is that the individual proxies have negligible relationship to temperature – other than the spurious relationship of the data-mined NOAMER PC1.

I pity the poor reviewers and editor! Strong opinions all around, methinks – and I am sure that in this rather politically contentious area, the editor would want the strongest and best advice.

But in general, fantastic news. More and more people (Huybers, reviewers, editors) are examining the issue that you raise in peer-reviewed journals; all are agreeing on certain points, and so the consensus on what is established is moving on, and the circle of people who are well-informed on what that consensus is continues to widen. I have no doubt that the platform of agreement about the methodology of MBH’98 will continue to widen.

Seems good that they agreed with some of your issues. Not sure about all the contentions between you and them on the substance, but some things I noticed.

a. You talk about Huyber figure 3. I don’t see one.
b. Your figure (curves) are very scrunched and harsh looking. Could you make them look a little nicer? As they are, the eye does not want to dig into them.
c. Maybe it is correct to say that you did not advocate PCA or means. But the Huybers comment is more interesting in that they raise the issue of what IS the right method and suggest that simple means make more sense.
d. It’s interesting (and picky of me to note it) that Huybers appends the instrumental record to the curves. It is distracting and not exactly relevant since all the discussion is about the proxy record itself (and having the instrumental appended makes the eye think that the proxies are saying more than they are…are more hockeystickish then they are.)

TCO, we have minimal space to reply. I’ll put a nicer version of the figure up here, but got squished on count.

I’m not convinced that there is a “right” method to extract an answer from the swamp of this data set through a simple unsupervised statistic, be it a mean or a PC1. If a mean, why wouldn’t you work in the mean of all 212 series as they become available? There are a lot of bristlecones in the AD1400 network – should they be weighted by area or species? Huybers has given no proof that the mean measures anything. His argument that it is a “robust” feature doesn’t wash – read our Reply.

I thought that our comment about heteroskedastic-autocorrelation consistent standardization was pretty neat. It’s a novel issue in this area, but a big issue in econometrics. I’m glad to have it in print, even if the point is tersely made. It will take a while for the point to sink in. It also affects studies like Esper et al [2005], about scaling and regression. So much to do, so little time.

The instrumental add-on is standard Hockey Team practice and I was disappointed to see it here.

a. you still haven’t responded to 6.a.
b. how does word count stop you from drawing appealing figures? That thing looks like the visual equivalent of nails on a blackboard. My eyes glanced at Huyber’s figures to look for inferences…but refused to dig through yours. And I’m partial to you. Well…except for you being a liberal Canadian. ;)
c. second para: Ok…I will reread your stuff. The first thing that jumped out to me was that it was a comment where you were denying that Huybers had disproved an assertion of yours because you had not made one (you know, like these silly arguments in the blog that are not even on the issue but on what someone asserted). I guess that would be fine. But it’s more interesting to have an opinion on the issue itself: as such, it would read more strongly if rather than saying, “we took no position”, say: “we took no position…and still don’t because of…blablabla (fill in your issues from post above)”. Regardless, it still begs the question of how people should do this kind of research. Simple means seem at least not to be prone to some of the shenanigans of the PCA method. As far as your question about how to weight different sets of series that have some closeness, I don’t know…but: (1) I think the philosophy and methods of “meta-analysis” used in sociology may have some helpful guidance here. For instance, try talking to Hunter and Schmidt (you could look at some of their review articles…usually published in the Journal of Applied Psychology). (2) what would you do if this were a business analysis? If you had a limited time, scope, budget, etc…and this was a business problem that you had to give a best estimate to the CEO on? Should you weight for species or area? I donno…damn good question. Highly relevant. It blows my mind to think that one can’t have some methods that are better than others. Seems like a very common problem and an interesting one.
d. Autocorrelation. That’s great to slip that in. Do you think that people will start taking note of this kind of issue in this type of work? Will you hit the same concept in other papers?
e. Sorry about the so little time: publish publish publish! P.s. It is the peer reviewed articles in journals of record that matter.

10b: Most publishers limit you by column inches – area in effect, and only so much text will fit into a defined area. Any graphics will reduce the amount of space available for text, and thus the total word count.

Re #1: John, the proxies are not tested for coherence against each other, they are fitted against temperature during the overlap interval 1901-1980 then tested for their ability to predict temperature over the 1856-1900 interval. There’s a bit more to the algorithm than that but the idea is that the proxies available for a time step (e.g. AD1400-1450) have to have predictive power for temperature to validate extending the results over that step.

Re #8: PCs can be computed by (among others) using the covariance matrix or the correlation matrix of the original data. In the AD1400 proxy roster, if you draw the 2 PCs with each centered on a zero mean over 1400-1980 they track each other very closely except in the mid-20th century where they diverge a bit. So if you force them to overlap in the 20th century you open gaps all up the length from 1400-1900. That’s why Huybers Figure 2 (not 3 – typo in the preprint) makes them look very different, because he rescales to the 20th century portion.
The divergence arises because the correlation matrix PC adds extra weight to the bristlecones. Hence, if it’s an argument over which PC is “more correct” we can’t avoid, or defer to “future studies” an assessment of the bristlecones. Take them out and the whole correlation/covariance issue is moot.
Re #6c: There’s no one way of doing PCs that’s always correct. But we’re confident that where the data are pre-standardized to dimensionless indexes (as tree rings are) most texts would recommend the covariance PC. Huybers points to the presence of 2 ring density indexes that have smaller variance as the case for using the correlation PC, but fails to mention that the sites they are from are also in the network as ring width series, so the correlation PC effectively double-counts those sites. Also, the tests he proposes (e.g. ability to follow the mean) actually come out in favour of the covariance PC if you use all 212 North American proxies, not just the 70-series subset that extends to 1400, and if you correct for autocorrelation when computing the variances.

Re 10c: An irony is that if Huybers is right and the simple mean should be used henceforth then the realclimate position that they should use 5 PCs gets defenestrated. But I see this as falling into a category of research problems where, in the absence of an underlying theory, there’s no one way to construct the index. Sort of like the UN “development index” or the US “Composite Leading Indicators” for the economy. Some methods make more sense than others but calling one method uniquely “correct” is a stretch.

Ross’s comment, “the proxies are not tested for coherence against each other, they are fitted against temperature during the overlap interval 1901-1980 then tested for their ability to predict temperature over the 1856-1900 interval,” brings up another question. Suppose the test period was 1901-1940 or 1941-1980 or some other interval rather than 1901-1980. Are the temperature reconstructions robust to these changes, or do you get very different results?

Re #17: Steve, you’ll be able to use tree rings to date the tree lines. Don’t forget that some types of trees have a grass stage that can last a decade or more before they put down rings. Marginal conditions wouldn’t speed this phase along, would they?

So finally a proper use of tree rings to derive a temperature chronology!

I don’t think I got satisfaction on the issue of PCAs versus means. It almost seems, annoyingly, as if MM only care about places where they find mistakes and have little interest in a side discussion of the best methodology for a practitioner.

And I say again: what do others who do meta-analysis advise? (surely there are similar issues)

And I say again: you ought to be able to say ahead of time whether weighting by species or area is more effective. You’ve looked at the problem a lot. Are you only going to criticize others after they make decisions, and refuse to engage in constructive dialog ahead of time on how best to do the statistics?

I haven’t got to the stage where I think that I can solve how to make a temperature reconstruction. I think that you can have a perfectly good sell recommendation without necessarily having a buy recommendation in mind (and you can detract from your sell recommendation by mixing it up with a less thought out buy recommendation.)

I don’t know how I’d extract a meaningful temperature proxy out of the AD1400 North American tree ring network in particular. It’s a real mish-mash even by tree ring standards. I’m not sure that it’s fair to say that I should have an answer to how to do this. I didn’t nominate this set of proxies.

Originally tree rings were used for dating other things, because their high-frequency patterns were distinctive. Or, if they were applied to climate, it was to study precipitation, especially in the American Southwest. It’s only in the past 10-15 years that there have been real attempts to extract temperature information from them – a project associated with the core Hockey Team: Briffa, Cook, Jacoby, Hughes, Schweingruber. In fact, you can almost define the Hockey Team as people trying to get a temperature history from tree rings.

A cynic might even point out that there was more money for studying tree rings if you hooked it up with climate change, than in dating pueblos. Sort of like adding in a terrorism angle if you’re looking for government funding for municipal works.

I would be more inclined to use tree line altitudes as evidence of low-frequency change, but there are undoubtedly problems with this as well, and I’m not saying that it’s a magic bullet.

As to means versus PCA – the best use of PCA in this type of data set is probably exploratory, to see if there are clusters or groups. If you do identify some kind of proxy that you think is a temperature proxy, I think that ultimately you have to define some kind of population and then take means without cherrypicking. Easier said than done. But I don’t think that taking a mean or a PC1 from the NOAMER dataset will give anything meaningful in connection with temperature.

I know I got you spun up and have Jer defending you. But I think for someone who has spent the amount of time you have in the field and with your skills, that not caring about means versus PCA would almost seem like a lack of intellectual curiosity. Anyway…I got you to give a bit more of a response.

About pestering you…of course you have no obligation to answer me on anything…but if you do feel like doing so, I get more aggravated by never coming to grips on the issue (at the spoonfeeding to lazy non-geniuses level ;) ) than on timeliness of response.

One of the reasons why our Reply is a little heavy-handed on this, and why I’m probably a little chippy, is because of two things: in his Comment, Huybers misrepresented us as somehow proposing covariance PCs as a way of fixing Mann (among many other mischaracterizations); and in the first round, the referees thought that he’d made good points against this supposed proposal of ours. So we were somehow supposed to be defending a way of fixing this ridiculous tree ring network. In our re-draft, we ended up probably being a little over the top in disowning this, but it’s because of Huybers’ mischaracterizing things and the referees not dealing with this.

We complained bitterly to GRL about Huybers’ mischaracterizations, and asked them, and asked Huybers personally, to fix them – e.g. the suggestion that we proposed covariance PCs as a method of “removing the bias” in Mann. We got blown off. Then in our referee comments, one referee mentioned that, because of the political interest, he wanted to ensure that everything we said matched exactly what Huybers said. Fair enough, and I’m actually pretty good about not putting words in people’s mouths. But it didn’t work both ways. They completely ignored our comments about mischaracterizations – which I documented in my usual thorough way. I’m more than a bit irritated about it. In fairness, they are worn out with the topic.

I’m a little irritated about the comments by the new editor-in-chief in the EST article, that he got personally involved in the editing. Maybe it was after the rejection of Wahl and Ammann, which probably made the Hockey Team absolutely wild, and they started pressuring GRL even more. Whatever it was, at some point there was a sea change in attitude, and they didn’t lift a finger to require Huybers to clean up what he said. And Huybers wouldn’t do so at our request. There are some cheap shots and unfair language that will probably be quoted to our disadvantage. It’s one of the reasons why I’ve documented on the blog some of Huybers’ private comments.

I am with you on this one, TCO. My take on using PCA is that, since it is a dimension-reduction method, its use should be rejected if it results in a lot of PCs each explaining a small amount of variance, because then it is not really reducing the dimensions, just recoding them. In this case, a PC1 explaining only 38% of variance and the rest less than 10% each is a red flag. PCA has a lot of restrictive assumptions, such as linearity, and my rule of thumb, which I don’t have any references for, is that in the absence of a few PCs explaining at least 80% of variance, assume the assumptions are invalid and the results very suspect. The capacity of a few series to dominate PC1 is not surprising when it only has to explain 38% of the variance. The correct response should be, ‘Oh, there are a few hockey sticks in there, and a heap of other stuff; this PCA has produced a dog’s breakfast.’ As to means, well, why not; then aggregate by area and species, then discuss.

David, it’s only about 17% in the PC1 in an actual PC method – covariance or correlation. It’s 38% in the Mann method, which is not (in Preisendorfer’s terms) a PC method, since an uncentered method is not an analysis of variance (Preisendorfer, page 24). Otherwise, I agree entirely.

Also, TCO, each PC series is constrained to be orthogonal to the preceding ones, so exactly what a PC4 or PC5 is representing is not at all clear.

You can see the absurdity of Mann’s scheme if you try to think about how (say) the PC6 of the Stahle/SWM network could come to have a physical relationship with the PC11 of temperatures.

I doubt that Steve needs any defending from your comments. My previous comment to you was just a hint to you that in that post you were being even ditzier than usual. It seems the hint was too subtle; try being less ditzy.

Re #19,29: TCO, PC analysis decomposes a matrix X with k columns into a sequence of up to k weighted averages (PC1, PC2, etc) and associated eigenvalues. The sequence has meaning but the PCs themselves may not. The sequence up to the j-th PC tells you what fraction of the total variance in X can be explained by the first j PCs. But a limitation of PC analysis is that the weights may be an obscure hodgepodge of positive and negative numbers so that, say, PC4 is an index with no clear interpretation.
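Ross’s description is easy to reproduce numerically. A small sketch on random data (assumptions: plain centered covariance PCA, nothing Mannian), showing the eigenvalue sequence and the explained-variance fractions:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(200, 5))            # 200 observations of k = 5 series
Xc = X - X.mean(axis=0)                  # centre each column

vals, vecs = np.linalg.eigh(np.cov(Xc, rowvar=False))
vals, vecs = vals[::-1], vecs[:, ::-1]   # sort largest eigenvalue first

frac = vals / vals.sum()                 # variance fraction per PC
cum = np.cumsum(frac)                    # variance explained by first j PCs
pcs = Xc @ vecs                          # the PCs: weighted averages of columns

print("fractions:", frac.round(2), "cumulative:", cum.round(2))
# Each PC's variance equals its eigenvalue and the fractions sum to 1,
# but nothing guarantees any individual PC's weights are interpretable.
```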

Steve’s point in #28 was about this problem of obscurity. MBH’s method doesn’t work by taking tree rings and comparing them to local temperatures, as you might expect. It takes PCs of tree rings over large areas and compares them to PCs of temperatures over large areas. So you might have the Stahle tree ring network from US Southwest-Mexico, and temperature data from around the world, and then you find that PC#6 from the Stahle network correlates with PC#11 from the temperature data. Suppose the temperature PC#11 has large positive weights on Uzbekistan and Denmark, and large negative weights on Anchorage, Sicily and Baffin Island. How could this linear combination of temperatures physically determine tree ring growth rates on a Mexican hillside while the tree rings themselves exhibit no correlation to local temperature? In our Nature correspondence MB&H vociferously defended retention of tree ring data with no local temperature correlation precisely on these grounds: that they could find correlations to “instrumental training patterns”–fancy words for low-order PCs. We talk about this in E&E05.

So when you ask about whether a method is more “effective”, you need to spell out: effective at what? There are lots of ways to mine for obscure correlations – but big deal. I see no simple answers about what researchers in the field ought to do. Beyond the PC vs mean question, there are umpteen modeling questions: lags? logs? quadratic? cubic? spline? regime-shifts? time-varying parameters? etc. Overall I think people should try lots of things, as long as they state exactly what they’re doing, identify their data accurately and don’t oversell their results.

Obviously the objective is to come up with an understanding of previous temperatures, by use of proxies, for periods when instruments were not used. There WAS A TEMPERATURE. It physically existed. We want to uncover what it would have been. Maybe what we want to know is “had we the same set of instruments across the world in 1500, what would they have registered?” Of course, you can digress and debate how we should sum temps across the globe’s surface, but it’s a digression, since that issue exists even for a purely instrumental record, right now.

It’s still not clear to me what was plotted in the hockey stick. Was it “PC1”? Does summation of all PCs = average of the data series?

The issue of tree ring series’ correlation to world-wide temp changes seems like one that we’ve addressed separately. At the heart of the issue is the need for foundational studies showing that a proxy is a relevant proxy. If a series can be PROVEN to teleconnect, FINE. Use it as evidence of the weather in a different location. But if you just consider the foundational calibration of the proxies to be based on correlation to world-wide temps when instrumentation existed, that seems like a very dangerous thing to do. Likely to lead to data mining, or circular logic, or suppression of true proxies. And (obviously) lacking in physical rationale for the proxy selection.

RE 33: The danger is that by looking at the complete set of possible correlations, you will find some chance ones to world-wide temp trends (just because there are so many to pick from) and then use them in the past. But they have no physical basis and it was just luck that they correlated in one period so it’s not reasonable to use them going further back. (Gotta be some good stats way to describe this concern.)

Re 35: I think you can make a good cautionary argument just based on the physical issue (and by analogy to other places where one could do the same sort of silliness and how it’s been argued against… bet sociology has some good thoughts here). However, an acid test is how the “worldwide climate field proxies” do moving forward. Look at correlation of the proxy moving forward (and don’t give me that guff about needing to sample the exact same trees… if it’s so sensitive, then that just shows that the method is a poor proxy for other reasons).

TCO, econometrics went through a phase like this. The Granger and Newbold 1974 article on spurious correlation effectively savaged a whole class of economics articles.

In terms of contributions, Granger and Newbold [1974] did not present a new theory of monetary policy or economic growth, but it remains a widely cited article. So there’s a role for critical articles. If I could get to a slightly theoretical explanation of spurious RE statistics that’s a little more highbrow than our empirical demonstrations, I’d be well pleased. But practically, the empirical demonstrations are more than sufficient.
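The Granger-Newbold effect is easy to reproduce. Here is a hedged Monte Carlo sketch (synthetic data, not their original setup): pairs of independent random walks routinely yield a large R², even though the underlying series are completely unrelated, while pairs of independent white-noise series do not.

```python
import numpy as np

rng = np.random.default_rng(3)
n, trials = 500, 200

def mean_r2(random_walk):
    """Average R^2 between pairs of independent series of length n."""
    r2s = []
    for _ in range(trials):
        e1, e2 = rng.normal(size=n), rng.normal(size=n)
        if random_walk:
            e1, e2 = np.cumsum(e1), np.cumsum(e2)   # integrate into random walks
        r2s.append(np.corrcoef(e1, e2)[0, 1] ** 2)  # R^2 = squared correlation
    return float(np.mean(r2s))

print(f"white noise: {mean_r2(False):.3f}  random walks: {mean_r2(True):.3f}")
```

The point is that persistence destroys ordinary significance benchmarks, which is the same reason red-noise (rather than white-noise) benchmarks matter when judging an RE statistic.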

Didn’t you hear me? Look at the sociology literature. Economics is evil. Business-related. Lotsa Chimperor tricks. Find a good liberal-to-the-max, but well trained, sociologist like Lieberson (he has a good book on spurious statistical inferences in sociology… not highly mathematical) and see if he has articles on this issue.

For that matter, I would think that this is such a motherhood concern that BHH or Fischer would have something here. Certainly the initial developers of meta-analysis might have some insights. Have you talked to Hunter and Schmidt?

Don’t worry, Mr. T (I can call you T for short, can’t I?) If you’ll look at the posting times you’ll see your message only went up a minute before mine. I’m a fast typer, but not that fast. I didn’t see your message until just now.

While it is a mathematical issue also, I think that looking at some examples of false correlations and how to avoid them would be useful here. This can be done by looking more into the softer, motherhood-type discussions. For instance, there is probably some writing on the danger of lacking a physical rationale, and on the issue of starting with such a huge number of possible proxies that some will correlate in a test period… but if there is no physical rationale, what’s the point? You could be feeding in tea-leaf statistics and finding the one that matched for a test period. (Can also go look into hard-core stats methods of describing this danger as well, I guess…)

TCO, my approach to data analysis is that there is no ‘best’ method, and the best you can do is try to avoid stupid errors by looking at problems from different approaches to get a robust answer. It’s not just false correlations. A simple correlation is not a promise of anything real, due to multiple issues such as ‘common cause’ and temporal reversals. And what is a physical rationale but another theory in disguise? Being called a physical rationale doesn’t excuse it from uncertainty, so how can it be called on to support marginal/uncertain results? You see this false argument in the AGW literature all the time, in the form of, e.g., hurricane frequency has not reached p=0.05, but the physical rationale (i.e. models) says it should, so it is probably significant – bunkum.

I guess my point is that if you started with a hopper full of potential proxies (including things like alphabetical order of tiddlywinks champion’s last name), you could if you made your hopper big enough find some that matched for a given period. However, I expect that they would not really be good proxies for temp over a longer period. There’s gotta be some mathematical way of describing this. I think if you start with things with a physical rationale (tie it to local temp for instance), it’s “less likely to go wrong that way”.
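That intuition can be put in Monte Carlo form. A hedged sketch (all numbers invented): screen pure-noise “proxies” for correlation with a “temperature” series over a calibration window, and the survivors of the screen revert to chance-level correlation out of sample.

```python
import numpy as np

rng = np.random.default_rng(4)
n_proxies, calib, verif = 1000, 80, 80
target = rng.normal(size=calib + verif)                 # "temperature"
proxies = rng.normal(size=(n_proxies, calib + verif))   # pure-noise "proxies"

# Screen: keep proxies whose calibration-period |r| clears roughly the
# 5% two-sided level for 80 points (about 1.96/sqrt(78) = 0.22)
r_cal = np.array([np.corrcoef(p[:calib], target[:calib])[0, 1] for p in proxies])
picked = np.abs(r_cal) > 0.22
print(f"{picked.sum()} of {n_proxies} noise proxies pass the screen")

# Out of sample the survivors show no special skill
r_ver = np.array([np.corrcoef(p[calib:], target[calib:])[0, 1]
                  for p in proxies[picked]])
print(f"mean |r|: calibration {np.abs(r_cal[picked]).mean():.2f}, "
      f"verification {np.abs(r_ver).mean():.2f}")
```

With a big enough hopper, dozens of meaningless series pass any fixed screening threshold, which is the selection-effect danger described above.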

1. I will post more on one of the later threads. I thought the Huybers comment was rather good, and that the MM response was a bit “but A” in terms of not addressing specific comments of H, but rather wanting to refight the overall battle of MBH criticism. And as H had not tried to say that his article was more of a criticism than on the specific points, that seemed a bit off to me. The pers comms shed a bit more light on the issue, but still I think Steve comes across here as a bit wanting to fight a battle of not looking bad, rather than just unphlegmatically addressing issue by issue in terms of a basic search for insights.

1. Can you explain the difference between your and Mann’s sensitivity analyses to set benchmarks for RE?
2. Why do you use white noise and he use proxies and what are the pluses/minuses of either?
3. If this is the killer analysis to finish off the debate on this topic, why do you use white noise rather than red, and say that it might not be OK, might need to be reddened up, etc.?
4. Did you do this “22 series” stuff at the time of your original MM05 article? Or was this brought in afterwards? IOW, did H fail to do something that you and Mann had done (using a whole network) earlier or did he find an error and then you found another error (that both of you had failed to do last time)?