Wilson on Yamal Substitution

Rob Wilson has written in sharply criticizing me (Yamal Substitution #3) for a lack of a balanced presentation on the Yamal substitution, and, in particular, for not acknowledging the "clear statistical reasons (related to variance changes through time)" that he had provided me offline for why D’Arrigo et al 2006 made the Yamal substitution.

I take any criticism from Rob very seriously and regret that he feels that I did not represent his position on this adequately. The post in question only discussed Briffa – it made no reference to D’Arrigo et al 2006. In a previous post on this topic, I had referred in passing to Rob’s argument, where I said:

Rob Wilson has written me offline, attempting to justify the switch on the basis that the variance of the Yamal chronology is more stable than the variance of the updated Polar Urals chronology.

Because (inter alia) he’d sent the details offline, is not an official, had not made any public comments about the matter and had not mentioned that I could post them up, I referred to the position (with which I disagree), but did not present it.

Here is a more detailed presentation of Rob’s argument (which I do not agree with). I did not present this argument since it had been sent to me offline, but I have no objection to presenting and discussing it. I don’t think that it’s very convincing and, far from settling the substitution, raises as many questions as it answers. If it’s been relied on to make an important substitution, I think that it should have been clearly presented in the original article accompanied by an impact assessment. None of this should be construed as a criticism of Rob personally who is earnest and diligent, but who did not make all the decisions in respect to this article.

Reviewing the bidding a little. I first encountered D’Arrigo et al, 2006 last fall in connection with the IPCC 4AR review. It was unpublished at the time. I asked IPCC to provide the supporting data. They refused. This led to considerable correspondence which is a long story in itself, which I’ll write up on another occasion. D’Arrigo et al 2006 was published almost concurrently with Osborn and Briffa 2006, which took much of the publicity away from it. It’s too bad. D’Arrigo et al 2006 is a vastly superior paper. I published some first comments on Feb 11 focussing on bristlecones. In response to that note, Rob wrote to say that he used Briffa’s (2000) Yamal series because he "could not develop an RCS chronology that had homoscedastic variance through time using the Polar Urals data." He said that Briffa would not give him his Yamal raw data but "said that the Yamal series was a robust RCS chronology".

On Feb 12, I wrote back as follows:

Rob, what test did you use for homoscedastic variance? Did you get homoscedastic variances in the other reconstructions? Cheers, Steve

On Feb. 20, Rob wrote back:

re. variance this is a sticky issue and there is no definitive way to say if the variance over time in a chronology is right or wrong. In general, I try to develop chronologies which have as stable variance as possible. My decision is generally through eye balling the final chronology. For example, in the attached word file, you will see one of the RCS chronologies that I developed using the Polar Urals data. Clearly, the variance is not stable through time and RCS detrending has not done a good job in developing this chronology. Of course, using this chronology would greatly change the story of the past 100 years, but hopefully you would agree that this version would be wrong. Hence my use of Yamal, which at least had a roughly stable variance through time

Figure 1. Polar Urals Chronology. Rob’s Caption: One of the many RCS chronology versions that I developed. They all looked similar, although some iterations were better than others. The problem, I think, are the raw data around ~1000 and ~1400 with higher RW values, which needed to be properly processed so as not to bias the final chronology. The raw non-detrended chronology looks very similar to the above graph suggesting that RCS simply was not adequately removing the growth trend in the data.

Rob’s RCS version of Polar Urals looked somewhat like a version that I had previously posted up at CA.

Figure 2. My Version of Polar Urals Chronology

On Feb 21, I replied to him as follows:

Rob, here’s an RCS version [SM note: the one shown in Fig 2 above] that I got from the Polar Urals data using the best nls-fit to all the data. I applied a Goldfeld-Quandt test for heteroskedasticity (used in econometrics) to this series and to the Yamal series with a breakpoint at 80% through the series and got better results for the Polar Urals version than for the Yamal data. A p-value towards 0 implies heteroskedastic. This would not support the replacement. Cheers, Steve
#Yamal GQ = 1.4711, df1 = 399, df2 = 1595, p-value = 1.938e-07
#Polar Urals RCS GQ = 0.5621, df1 = 970, df2 = 241, p-value = 1
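
For readers who, like Rob, have not used the test: the following is a minimal sketch of a Goldfeld-Quandt style statistic for a single series, assuming a constant-mean model within each segment and a breakpoint 80% of the way through. The function name `gq_statistic` and the synthetic data are illustrative assumptions; this is not the code that produced the numbers above. The p-value step (an F-distribution tail probability) is left out because it needs a statistics library.

```python
import random
from statistics import variance


def gq_statistic(series, split=0.8):
    """Ratio of the late-segment variance to the early-segment variance.

    Each segment is modelled as a constant mean plus noise, so the residual
    variance of a segment is just its sample variance.
    Returns (statistic, df_late, df_early).
    """
    cut = int(len(series) * split)
    early, late = series[:cut], series[cut:]
    if len(early) < 2 or len(late) < 2:
        raise ValueError("each segment needs at least two observations")
    stat = variance(late) / variance(early)
    return stat, len(late) - 1, len(early) - 1


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic series (not chronology data): in the second one the noise
    # spread doubles over the last 20% of the record.
    stable = [rng.gauss(1.0, 0.2) for _ in range(1000)]
    widening = [rng.gauss(1.0, 0.2 if i < 800 else 0.4) for i in range(1000)]
    print("stable:  ", gq_statistic(stable))
    print("widening:", gq_statistic(widening))
```

A statistic well above 1 says the late segment is noisier than the early one; a statistic near or below 1 gives no evidence of increasing variance. Under the usual assumptions the statistic is compared against an F distribution with the two degrees of freedom returned.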

On Feb 21, Rob promptly replied:

I guess we need to agree to disagree on this subject. Attached are two 101-yr running variance plots for the RCS series I showed you yesterday and the Yamal series. I have never used the Goldfeld-Quandt test, so cannot judge the results, but from the attached figure, the Yamal series is clearly superior to the RCS version. I do not deny that there could be problems with the Yamal series especially at the recent end. But at the time of the analysis, I needed to generate a STD and RCS version for each record. As I could not generate my own version, the Briffa Yamal series was the next best option.

While this particular graphic may be an argument for using the Yamal version rather than the updated Polar Urals version, it hardly settles the matter, especially in Osborn and Briffa. Retention criteria are stated differently in Osborn and Briffa; they do not mention anything about variance stabilization as a criterion. So even if this criterion was used in D’Arrigo et al 2006, it does not mean that it was used in Osborn and Briffa. We have just learned that the Polar Urals Update was used in Esper et al, 2002.

If variance stabilization is important, why wasn’t it applied in Esper et al.? If a variance stabilization test is used in site decision-making, why isn’t it mentioned in any of the papers? What is the test – the Yamal variance doesn’t look all that stable? Given that the Yamal substitute is the most extreme hockey stick in the entire selection, shouldn’t the impact of the substitution be pointed out? Maybe it would be relevant to look at the underlying Yamal data to see what’s going on – oops, can’t do that. Were variance stabilization criteria applied to the other sites? On what theoretical grounds does one expect tree ring site chronologies to have stable variance over time – financial series don’t have stable variances over time, hence the development of ARCH and GARCH methods (see Engle and Granger Nobel prize) – maybe stable variance isn’t an appropriate criterion? Even if it is, it should be stated and justified.

39 Comments

It is interesting to me to see words such as “eyeballing” and a general sense of lack of depth regarding statistical analysis techniques in which, clearly, you are more of a content expert than Wilson – or I (I write “clearly” based on comments like his “sticky issue” comment and his admission to never using the Goldfeld-Quandt test). This seems to be a recurring theme with the “climate science” orthodoxy – a lack of statistical rigor and the preference for “eyeballing” techniques that fit in more with the making of geological maps (e.g. dotted lines for uncertain boundaries) than an area such as this.

I also have to make a somewhat more personal note, commending you for keeping certain things off line in a gentlemanly manner which, sadly, is the exception to what seems to be the modern “rule.” It was well and proper for you not to previously make mention of some of the detail you have now been essentially pushed to reveal here. Take pause and consider just what the sequence of events regarding this matter is really telling us about the motivations, morals and general levels of schooling in gentlemanliness which some of the orthodoxy seem to have. I’ll write no more as I do not want to come off being too judgmental, but will make no apology for what I have written here either.

#1 – I won’t take any credit on the offline matter. I find it difficult to determine what should and what shouldn’t be quoted and see many shades of grey. For example, I’ve written to journal editors in their official capacities. Sometimes, they have stated that their response is confidential. What then? I understand the idea of confidential discussions in legal matters, but, in such cases, you also have to provide an on-the-record response, which can be relied upon if the confidential discussions are unsuccessful. Too often officials refuse to respond on the record. I don’t understand what privilege attaches to official communications in the first place.

I’ve posted up all my correspondence with Crowley (for example) since he made a public issue of our correspondence, misrepresented it and the actual correspondence refuted it.

The situation with Rob is a lot different. In this case, Rob replied in a thoughtful way to the issue. However the response raises as many or more questions as it answers. I don’t see anything personal in this – but the dendro world tends to assume things that they have not justified or proven and too often are barely aware that they’ve even made the assumption.

It is not clear to me that the variance in weather at a particular location should be stable over time. So selecting proxies based upon their stability may be suppressing real information about the weather and climate.

Variance stabilization is a fuzzy science no matter whether we used statistical methods or ‘eye-ball’ the time series. However, it is addressed at least in dendroclimatology (e.g. reference below) something that could probably not be said for many of the other palaeo-disciplines.

The problem is that we have no knowledge of what the variance should be through time for a particular data-set. Was there greater variability through the MWP, LIA or recent period? These are precisely the sort of questions we want to answer.

Therefore, we need to be VERY careful in OVER processing our data. I could easily stabilize the variance of any time-series so that the variance is essentially constant through time, but would that be realistic – no!

In the case of the Polar Urals data, we KNOW that the raw ring-width data have strong biological growth trends and if we do not adequately detrend these trends, then we end up with a time series similar to the ones generated by both me and Steve (see above). These series are almost identical to the non-detrended raw mean ring-width chronology (perhaps you should show this Steve). This is simply not correct, even without taking variance into account.

Steve: I’ll plot up and post up the mean ring width version later in the day or tomorrow.

“14 February 2006. According to Dutch researcher Michiel Helsen, annual and seasonal temperature fluctuations are not accurately recorded in the composition of the snow of Antarctica. His research into the isotopic composition of the Antarctic snow has exposed the complexity of climate reconstructions. Polar ice caps contain valuable information about the earth’s climate. Helsen investigated the extent to which meteorological data are stored in the composition of snow in order to improve the interpretation of deep ice cores from the Antarctic ice cap. He demonstrated that annual temperature variations in Antarctica could not be accurately reconstructed from ice core investigations. The conditions during snowfall are not representative enough for the average weather over an entire year. His research also revealed that although temperature differences over the entire continent of Antarctica have a major influence on the composition of the snow, there are strong spatial variations in this. Accordingly a simple conversion of the fluctuations in the snow composition to changes in the local temperature is unreliable.”

Fascinating. We have no knowledge if variance should be (or is!) similar through historical time, yet the criterion is applied regardless. It also appears to be a “hidden” data manipulation, so there is no way we can tell if this criterion is applied throughout particular analyses, or merely used for particular data sets.

I can even imagine how you could use the same argument to justify the choice of the other series. If you looked at the variance through the calibration time period (i.e. 1800-2000), then you could claim that the Yamal variance changes more than the other series, and hence you are not using the Yamal series ! That would be an objective criterion 🙂

We also don’t know how the variance correlates with temperature; a minor detail.

IMHO, it’s time for the Team to drop tree rings and “move on.” It sure looks to me that it cannot be generally demonstrated that tree ring data are correlated to temperature. Most of the trees show very little correlation, if any. Some groups of trees show negative correlations. A few groups show good positive correlations, but this may be spurious because of CO2 fertilization effects, etc. Maybe they can “move on” to other proxies. But, considering comment #6, maybe there are no good proxies. But we have models….

re: this discussion (clear statistical reasons for Yamal vs. Polar Urals or vice versa), and in light of Dr Hoyt in 4 “selecting proxies based upon their stability may be suppressing real information about the weather and climate.”

Can similar arguments be applied, and if so how, to the rejection of Keigwin’96 Sargasso (a proxy with clear MWP and LIA) in MBH and the IPCC TAR Chapter 2, vs. the chosen Keigwin and Pickart’99 Newfoundland (a proxy with cooler MWP and warmer LIA)? It seems to me that the choice of either of these two proxies also tends to define what one’s position is on AGW.

RE: #9. Sargasso. An interesting thing to contemplate. A huge gyre, corresponding with an overlying stable, semi-permanent area of atmospheric high pressure. Ever since the Atlantic settled after the closure of the Isthmus of Panama, I reckon this regime has been in place. Given the strength of the Bermuda High as well as the near-stagnant nature of the Sargasso itself, I would reckon that if anywhere on earth would be expected to average out noise and small excursions, the Sargasso is it, par excellence. There are qualitative measures of the goodness of proxies, based on a combination of logic, tectonic configuration, ocean current patterns, atmospheric current patterns and, naturally, the biological telltale of all that Sargassum.

First, thanks Steve and Rob for having a civil and public discussion/debate over analysis that underlies, shall we say, various conclusions. Please do whatever you can to produce more of the same. It is indeed interesting.

Second, what variance? Are you talking about the variance of the series itself or the variance of the residuals of the series as a dependent variable in some regression or other procedure? (And third if the answer is the series, could you explain why homoskedastic is better than heteroskedastic for an individual series?)

#12. In this context, the issue is the variance of the ring width growth index (the "site chronology"). I don’t see any reason why the index should be homoskedastic and would not set that as a criterion. Rob has alluded to a link between heteroskedasticity in the index and inadequate adjustment for age in the trees, but I don’t see that the linkage is proved here. Maybe Rob can clarify.

Re:#8 – can anyone explain why, if tree ring data is such a good climate proxy, one must “select” certain trees/sites? With some samples showing a negative correlation, what sort of assurance do we have that even those that do show a positive correlation over the calibration period have the same correlation to temperature over larger time scales? IMO, the whole tree-rings as a climate proxy is totally bogus without such an explanation, and even if one is forthcoming (which I doubt), why do not *all* trees show a positive correlation to temperature?

The 101-yr running variance plots suggest to me that you are talking about the variance of the annual series over a 101 year window – where this is the variance of the final processed series rather than some earlier raw data. Because 101 is odd this suggests it is a centred window that is used? (Not that this makes any difference here.) i.e. Var(t)=var(series(t-50),…,series(t+50)).

If this is the case, how are trends taken into account? For example, during a period when the series is trending, the variance will mechanically increase without it necessarily indicating anything about underlying homo- or heteroskedasticity.
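
The point about trends can be checked with a short sketch. It assumes the centred-window definition guessed at above, Var(t) = var(series(t-50), …, series(t+50)); the function name `running_variance` and the toy series are illustrative assumptions, not anything taken from Rob's plots.

```python
from statistics import variance


def running_variance(series, window=101):
    """Centred running variance: Var(t) = var(series[t-h .. t+h]), h = window // 2.

    Only positions with a full window are returned, so the output is
    len(series) - window + 1 values long.
    """
    if window % 2 == 0 or window < 3:
        raise ValueError("window must be an odd number >= 3")
    half = window // 2
    return [
        variance(series[t - half : t + half + 1])
        for t in range(half, len(series) - half)
    ]


if __name__ == "__main__":
    # Deterministic toy series (not real data): the same +/-1 year-to-year
    # wiggle, once on a flat baseline and once on a steady upward trend.
    wiggle = [1.0 if i % 2 == 0 else -1.0 for i in range(400)]
    trending = [w + 0.05 * i for i, w in enumerate(wiggle)]
    print("flat baseline:", running_variance(wiggle)[0])
    print("with trend:   ", running_variance(trending)[0])
```

The year-to-year variability is identical in both toy series, yet the running variance of the trending one is larger in every window, simply because the trend itself contributes spread within the window.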

Or are we talking about some sort of cross-sectional variance across trees in any given year? That would make more sense to me in judging the robustness of a processed series but… I don’t have access to the reference Rob provided – is there anything available on the Internet?

re: 15 (re: 11) I see no problems with Keigwin’96 Sargasso — neither did Moberg’05. MBH’98, ’99 and IPCC ’01 TAR rejected it though, and we all suspect a different reason than how they justified the omission/rejection (which Moberg’05 just plain ignored).

My apologies for taking this discussion off track; I know this is a Yamal and Urals thread (I too am appreciative of the polite and open discussion between Wilson and McIntyre). But having recently audited a class by Professor Calorie J. Thermos* (who among other insights, argues putting thermometers in ice water isn’t such a bad idea), I’m requesting a revisit with new “clear statistical reasons (related to variance changes through time)” eyes, to this thread: http://www.climateaudit.org/?p=145

Re: # 19 and 21 (temperature variation) – The instrument record may or may not become toast. I guess that depends on how much it warms up. 🙂 But the higher variance in the earlier years is almost certainly a reflection of the number of stations that go into the regional or global average. That is, what one is seeing is the greater local deviations effect in the early years because there are fewer stations to average those deviations over. Our host discussed a similar issue (post???) with regard to the various proxy reconstructions with differing numbers of proxies.

Re:#16 (#8) — While I cannot speak for the dendrochronology crowd, there is an argument, albeit not a particularly wonderful one, for “selecting” certain variables. I should also note that everyone should be rightfully skeptical of procedures that select one variable over another. Anyway the argument proceeds as follows: suppose that tree ring width were truly explained by the following equation:

RW = B0 + B1*X + B2*Z + error,

where X is temperature and Z is some other explanatory variable(s) (usually not measured).

Then in data sets where there is lots of variation in Z, the omitted and unobserved variable(s), the correlation between Y (the ring width, RW) and X will often be confounded. In order to better measure B1 (the effect of X) one might look at those sets of Y’s and X’s in which the correlation is clearly evident as a proxy for cases where the variation of the Z’s is relatively unimportant.

A problem with this approach is that if Y is regressed on X (without Z), there is an omitted variable bias if there is correlation between X and Z. Of course, the researcher isn’t going to know which way the bias is without having a measurement of Z. But if one selects just those cases in which the correlation of Y and X is positive, it is almost certainly biasing the estimate of B1 upward. In the tree ring case, it would seem to be better to estimate the B1 coefficient from forestry studies in which there are good measurements on the Z’s, i.e. the “omitted” variables then use the coefficient estimate in the reconstruction. If those species with the long chronologies, however, do not have the forestry type analyses, I think that one has to live with the grand average effect, mixing the positive, negative, and nil correlations regardless of the result in order to keep the estimate unbiased.

But then, I am not a dendrochronologist, and hence, I’m only speculating on the reasoning (and problems) behind the selection of data sets.
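
The selection effect described in this comment can be illustrated with a small simulation. This is a sketch under stated assumptions: many hypothetical sites all follow RW = B0 + B1*X + B2*Z + error with the same true B1, Z is never observed, and X and Z are drawn independently (so any bias shown comes from the selection step alone, not from correlation between X and Z). The function names and parameter values are made up for illustration.

```python
import random


def ols_slope(x, y):
    """Ordinary least squares slope of y on x (with an intercept)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    return sxy / sxx


def simulate_sites(n_sites=500, n_years=50, b1=0.2, b2=1.0, seed=1):
    """Per-site estimates of B1 when RW is regressed on X alone."""
    rng = random.Random(seed)
    slopes = []
    for _ in range(n_sites):
        x = [rng.gauss(0, 1) for _ in range(n_years)]  # "temperature"
        z = [rng.gauss(0, 1) for _ in range(n_years)]  # omitted variable
        rw = [1.0 + b1 * xi + b2 * zi + rng.gauss(0, 0.5) for xi, zi in zip(x, z)]
        slopes.append(ols_slope(x, rw))
    return slopes


if __name__ == "__main__":
    slopes = simulate_sites()
    selected = [s for s in slopes if s > 0]  # keep only the "positive responders"
    print("true B1:             0.2")
    print("mean, all sites:    ", round(sum(slopes) / len(slopes), 3))
    print("mean, selected only:", round(sum(selected) / len(selected), 3))
```

Averaging over every site, including the negative and nil ones, should recover something close to the true coefficient. Averaging only the sites retained for having a positive slope cannot come out lower than the all-site average, and comes out higher whenever any site was discarded.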

I don’t have time pre-NAS to think about this issue in terms of a Demetris Koutsoyiannis viewpoint, but I think there’s a connection. I’ll post David Stockwell on this, but, if you have long-term persistence, it’s certainly not obvious that the kind of variance stabilization that Rob is selecting on is something that you should necessarily be seeking. (I’m not saying the opposite, only that it’s not obvious, at least to me, and needs to be thought through.)

If those species with the long chronologies, however, do not have the forestry type analyses, I think that one has to live with the grand average effect, mixing the positive, negative, and nil correlations regardless of the result in order to keep the estimate unbiased

Exactly. But if you do this, the large variability due to z simply overshadows x, so you have to tweak things (cherry pick or play some statistical game) to get a relationship between y and x.

Re # 28 – jae, first I should have said “relatively unbiased” not “unbiased.” The problem isn’t so much that Z’s effect (say habitat competition) overwhelms the X (say temperature), but that temperature and habitat competition may be correlated and without including it (or an “instrumental” variable, IV, as a proxy for it) you always pick up the covariance of the two variables in the regression estimator on temperature.

Second, yes, even if the omitted variable is essentially uncorrelated with the included one, the variation in the omitted variable could overwhelm the effect one is searching for. Positive significance in such cases is the old-fashioned kind of spurious correlation: correlation that is simply coincidental. And that gets back to the worry about data mining, which has the proclivity to find just those pure coincidental correlations as easily as ones with a causal basis.

Re #27 Yes. I was offering the explanation in the sense of the “Classical Linear Model,” i.e. without serially correlated errors. For the case of proxies and temperature, where we have serially correlated variables, running the Y on X regression without some form of correction (be it a full ARMA model or just an old fashioned Cochrane-Orcutt iterative, rho hat estimator) will produce varying degrees of spurious correlation/significance. However, ye old Ordinary Least Squares estimators are still unbiased with first order serial correlation. (Technical note for those who care: unbiased does not mean the expected value of the absolute difference of the estimator and its mean is zero. Thus, the serially correlated variables case, say both variables are AR1, will increase the absolute value vis-à-vis the null hypothesis of zero effect.) What OLS estimators have is a much greater variance than the good, corrective estimators. So ala the Ferson case (data mining amongst serially correlated variables), selecting only the positive correlations (the data mining) when there are serial correlations on both independent and dependent variables (the spurious correlation issue) will presumably reinforce the omitted variables effect of the same type of selection by correlation. The magnitude might be kind of interesting to figure out when you get back from the conference on deconstructing reconstructing.
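
A rough sketch of the spurious-correlation part of this argument: two series that are independent by construction, each AR(1), still produce sample correlations that scatter much more widely around zero than white noise does. The function names, the AR coefficient of 0.9 and the series length are illustrative assumptions only.

```python
import random
from statistics import mean, pstdev


def ar1(n, rho, rng):
    """AR(1) series x[t] = rho * x[t-1] + e[t] with standard normal shocks."""
    x, out = 0.0, []
    for _ in range(n):
        x = rho * x + rng.gauss(0, 1)
        out.append(x)
    return out


def correlation(a, b):
    """Pearson correlation of two equal-length sequences."""
    ma, mb = mean(a), mean(b)
    sab = sum((p - ma) * (q - mb) for p, q in zip(a, b))
    saa = sum((p - ma) ** 2 for p in a)
    sbb = sum((q - mb) ** 2 for q in b)
    return sab / (saa * sbb) ** 0.5


def correlation_spread(rho, n=100, trials=300, seed=0):
    """Std. dev. of the sample correlation between two INDEPENDENT AR(1) series."""
    rng = random.Random(seed)
    corrs = [correlation(ar1(n, rho, rng), ar1(n, rho, rng)) for _ in range(trials)]
    return pstdev(corrs)


if __name__ == "__main__":
    print("spread of r, white noise (rho=0):  ", round(correlation_spread(0.0), 3))
    print("spread of r, persistent (rho=0.9): ", round(correlation_spread(0.9), 3))
```

The true correlation is zero in both cases. The wider spread under persistence means that large positive (and large negative) correlations turn up by chance far more often, which is what makes screening on positive correlation hazardous.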

#25: It’s definitely not a linear relationship and much worse than what you guess. If you think about it for a few minutes the relation is more like

RW = A1*X*Y*(….)

because if the soil is infertile the tree dies, if it’s too dry the tree dies,… It’s not possible to get that kind of result from a simple linear relationship.

This whole episode illustrates the trouble people get into when doing a statistical analysis. Statistics isn’t magic. Statistics can’t extract information that isn’t there. Unless there is a prior understanding of the mathematical relation among all the variables affecting a proxy and independent control or measurement of the uninteresting variables, there is no way to honestly extract the variable of interest, temperature in this case. With all the knowledgeable people contributing to this blog, I have yet to see any indication that the relation between tree rings/density and tree growth factors (temperature, water, sunlight, fertility, …) is understood in any quantitative way. Wilson’s use of low variance is unjustified unless it can be shown, a priori, that it somehow keeps the uninteresting variables constant. Absent that, it’s just a form of cherry picking applied after the fact to get time series that have the desired correlation with temperature. In this context, both the Polar Urals and Yamal time series are irrelevant as thermometers since we don’t know all the conditions that controlled the tree growth.

RE: 32. I would add that there is ample, overt evidence that with the possible exception of trees in well watered mid latitude places with lower annual variation in climate (pretty much, that means solely a Marine West Coast climate) and *perhaps* certain cases in well watered places at similar latitudes with greater temperature range (e.g. Humid Continental) there is no correlation whatever between tree ring width and the actual variation of temperature experienced at real sites. The US SW – probably either correlated with moisture or with some range of multivariate optima (my hunch is the former). Subarctic boreal – all over the map with positive and negative responders to “whatever” drives their growth. Humid subtropical – moisture. Mediterranean – moisture. Tropical climes – N/A.

Which is an entirely different level of ‘non-linear’ (you can’t even make first order Taylor expansion/log-linearisation arguments with this one). It requires temperature to be the limiting factor throughout the entire (or near enough) reconstruction history for them to be valid.

Most of you have missed the point
the high variance in my RCS chronology was an artefact of ineffectual detrending of the data.
I am sure I said that over processing of data was a bad thing.
Look at our paper and you will see that the variance of some of the time-series is not that time stable. This could be corrected for, BUT as you say, there is no rationale for it.
signing off.

Perhaps the first step is for me to understand the processing you do to the raw series better. I understand that it is to adjust for variable growth by age of the tree – anything else? Can you direct me to any references that would be accessible via the Internet? I’m also missing something on this detrending process – aren’t you ultimately looking for a series with a ‘trend’ i.e. the signal of interest? Is the concern with heteroskedasticity really a concern about trends rather than variability per se?

#36 -John S,
let me answer this for Rob rather than using up his limited time on this particular question. Ring widths decline as trees get older. Methods of age adjustment are a big topic in dendro literature, which uses the trade term “standardization”. I’ll mention 4 methods that you see from time to time. At a site, you will have N cores which start and end at different times, have different mean growth rates and ultimate ages and you’re trying to get a growth index representing the site.

1) An older method common in the 1980s was to fit a cubic spline to each core; obtain the deltas from the fit and average the deltas. Dendrochronologists in the 1990s observed that this removed any long-term information (although series formed in this way were used in the programmatic article Hughes and Diaz 1994, still cited, to criticize earlier work which did not rely on tree rings.)

2) The individual cores, viewed as time series, are heavily autocorrelated. Some dendrochronologists did ARMA fitting to individual cores (prewhitening), obtained the deltas from that and averaged them. Removing the autocorrelation in the cores tends to leave a chronology with little long-term variation (e.g. Stahle chronologies used in MBH98).

3) Some chronologists (e.g. Jacoby, Graybill) fitted each core with a generalized negative exponential (neg exp plus a constant) or, if this didn’t fit, a horizontal or negative-trending straight line. This is option 2 in COFECHA. It’s usually what is meant by “conservative standardization”. There are some curious numerical issues in how this is implemented in COFECHA, but they don’t have a big impact. Cook and Peters 1997 criticized this method as potentially producing biased end effects – their examples ironically were a bristlecone (Campito Mt) and Gaspe.

Virtually every chronology archived at WDCP as a *.crn series is one of the above 3 types.

4) The term “RCS” standardization describes a procedure which has become popular in the 1990s – calculating a generalized negative exponential fit for the entire site, and calculating the deltas for each core from the one fit. This results in more centennial scale variability. In some of the sites of interest to Briffa (e.g. Polar Urals), the cores are very short on average (median – 150 years) and so standardizing individual cores eliminates any changes in the population mean.
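
To make the contrast between one site-wide curve and core-by-core standardization concrete, here is a toy sketch. It makes several simplifying assumptions that are not in the description above: the regional curve is taken as the mean ring width at each cambial age rather than a fitted generalized negative exponential, indices are ratios (width divided by expected width), and the core-by-core case is a crude stand-in that rescales the shared age curve to each core's own mean level. All names and data are hypothetical.

```python
from collections import defaultdict
from math import exp


def regional_curve(cores):
    """Mean ring width at each cambial age, pooled over every core at the site."""
    by_age = defaultdict(list)
    for _start, widths in cores:
        for age, w in enumerate(widths):
            by_age[age].append(w)
    return {age: sum(v) / len(v) for age, v in by_age.items()}


def chronology(cores, expected):
    """Average, by calendar year, of width / expected(core_index, age)."""
    by_year = defaultdict(list)
    for k, (start, widths) in enumerate(cores):
        for age, w in enumerate(widths):
            by_year[start + age].append(w / expected(k, age))
    return {yr: sum(v) / len(v) for yr, v in sorted(by_year.items())}


def rcs_chronology(cores):
    """One growth curve for the whole site; every core is indexed against it."""
    curve = regional_curve(cores)
    return chronology(cores, lambda k, age: curve[age])


def per_core_chronology(cores):
    """Stand-in for core-by-core fitting: the age curve is rescaled to each
    core's own mean level before indexing."""
    curve = regional_curve(cores)
    scale = []
    for _start, widths in cores:
        ref = sum(curve[a] for a in range(len(widths))) / len(widths)
        scale.append((sum(widths) / len(widths)) / ref)
    return chronology(cores, lambda k, age: curve[age] * scale[k])


def toy_site():
    """Two non-overlapping 100-year cores; the later one grows 50% faster."""
    growth = [0.5 + exp(-age / 30.0) for age in range(100)]
    return [(1000, growth), (1100, [1.5 * g for g in growth])]


if __name__ == "__main__":
    cores = toy_site()
    rcs, per_core = rcs_chronology(cores), per_core_chronology(cores)
    for year in (1050, 1150):
        print(year, "RCS:", round(rcs[year], 3), "per-core:", round(per_core[year], 3))
```

In this toy site the single-curve index keeps the step between the slow-growing and fast-growing generations, while the core-by-core index sits at 1.0 throughout. With short cores, whatever differs between one generation of trees and the next is absorbed into each core's own fit.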

I think that there are some interesting statistical issues involved in dendro standardization. A couple of years ago – is it that long that I’ve been doing this – I experimented with doing ring width standardization using mixed effects methods and got some really excellent results. The advantage of this is that I was able to put some of these calculations in a more general statistical framework and show how the little ad hoc dendro recipes fit into this framework and how to use the diagnostics available from the mixed effects modeling for dendro purposes.

In mixed effects terms (say the nlme package of Pinheiro and Bates), “conservative” standardization is simply a type of nlsList fit; while RCS standardization is an nls-fit. I can do either chronology in a couple of lines of code and replicate standard results applying a more general statistical method. The dialectic between nlsList and nls modeling (or lmList and lm modeling) observed in mixed effects texts is exactly and I mean EXACTLY identical to the tension between RCS and conservative standardization in dendro.

In mixed effects terms, you have interesting issues: there is an individual effect as mean growth rates vary by tree; there is an aging effect; there is a random effect from annual climate (reserving temperature/precipitation issues); you have autocorrelation in the residuals; you’ve got every issue contemplated in mixed effects and then some.

In addition, the dendro data sets are large and rich. While the mixed effects software in 2004 could deal with combinations of these issues, it didn’t quite deal with all of them. However, you could get a lot done with what was there. It would be an interesting project to follow up on this. It’s a big project; it would work like a champ; but I’ve got overtaken with too many other things. It would be a great project for a grad statistics student.

However, since this issue has come up in the context of a “practical” issue, maybe I’ll re-visit it. So much to do, so little time.

I’m not familiar with mixed effects. My naive approach to this would probably involve using a Kalman filter or equivalent to extract the two ‘unobserved’ components at a given site. I’ll have a bit more of a think…