Wednesday, July 11, 2012

A recurrent issue in statistical climatology is how to deal with long-term trends when one is trying to estimate correlations between two time series. A reader sent us the following question, posed to him by a friend of his, related to our test of the method applied by Mann et al. to produce the hockey-stick curve in 1998.
I will not enter into the question of why realclimate linked to the Comment published in Science by Wahl et al. (2006) and not to our response, although both were published side by side. Rarely, actually never, should one trust a blog as the sole source of information. Interestingly, the fact that 'the friend' was not aware that a response existed and was published in the same journal led him immediately to assume dishonest behaviour. He or she did not bother either to check whether the comment published in Science had prompted a response. 'Confirmation bias' is present everywhere. It is probably unavoidable, but a useful Chinese proverb may offer some help: 'if you think you are 100% right, then you are wrong with a 95% probability.'

Instead of delving into Chinese philosophy formulated in terms of IPCC likelihoods, I will illustrate with this example how the nasty trends present in many climate records pose challenges to the design of regression models. The basic problem is that two series that display a prominent trend will always appear correlated, independently of whether or not they are actually physically related. In the press release on the Science comment at that time (2006), we showed a nice example of a completely false inference based on the correlation between two trending series: the Northern Hemisphere mean temperature and unemployment in West Germany over the last decades. Taking this correlation at face value, one could design a statistical model that predicts the Northern Hemisphere temperature from the unemployment figures, and this statistical model would even deliver a nice value of a validation statistic commonly used in climate reconstructions, the Reduction of Error (RE). This diagnostic weights agreement between the mean values of the reconstruction and the target data more strongly than the correlation between the two series, which in turn focuses on the agreement between their short-term wiggles. Depending on how strong the interannual variability is relative to the long-term trend, either the RE or the correlation will provide the more faithful measure of the skill of the estimation. In this example, the apparent agreement between the two time series is obviously an artefact, since temperatures and unemployment are unrelated, but the problem illustrated here is present in many attempts to calibrate proxy records over the 20th century. As soon as a proxy record exhibits a trend, positive or negative, it will display an apparent correlation with the global mean temperature, and it might thus be taken as an adequate proxy to reconstruct the global mean during past times as well. It may happen that this correlation is physically sound, and thus correctly interpreted, but when the series are trending, one cannot be sure. The relationship between proxies and climate is often not physically obvious.
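The spurious-skill problem described above is easy to reproduce. The following sketch (my own illustration, not the original press-release calculation) generates two physically unrelated series that share only a linear trend, and shows that they nevertheless yield a high correlation and a positive Reduction of Error in a split calibration/validation exercise:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
t = np.arange(n)

# Two independent series that share nothing but an upward trend
x = 0.03 * t + rng.normal(0, 0.5, n)   # stand-in for "unemployment"
y = 0.02 * t + rng.normal(0, 0.3, n)   # stand-in for "temperature"

r = np.corrcoef(x, y)[0, 1]            # high, despite no physical link

# Calibrate a linear model on the first half, validate on the second
a, b = np.polyfit(x[:n // 2], y[:n // 2], 1)
y_hat = a * x[n // 2:] + b
y_val = y[n // 2:]

# Reduction of Error: skill measured against the calibration-period mean
re = 1 - np.sum((y_val - y_hat) ** 2) / np.sum((y_val - y[:n // 2].mean()) ** 2)
print(f"correlation = {r:.2f}, RE = {re:.2f}")
```

Because the trend shifts the validation-period mean away from the calibration mean, the RE rewards any predictor that merely follows the trend, which is exactly the artefact discussed in the text.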

Mann et al. (1998), in their lengthy description of their reconstruction method, mentioned at some stages that they had used 'detrended variables' to calculate some diagnostics of the skill of their method. We interpreted, wrongly as it turned out, that they had detrended the proxy and temperature series to calibrate their statistical model. This was very soon taken as proof that we had made a calculation error and that the whole analysis was flawed. Our response to their Comment showed that, in essence, detrending or not detrending the data did not make a material difference, and that in both cases the method applied to produce the hockey-stick would underestimate the long-term variations in most circumstances. Interestingly, our colleague and friend Gerd Bürger had also submitted a comment to Science in 2005 that raised very similar questions. A more elaborate version was eventually published in Geophysical Research Letters. But the journal Science thought in 2005 that Gerd's manuscript was not interesting enough to warrant publication. A few months later it had changed its opinion, convincing me that for Science all authors are equal, but some authors are more equal than others.

The stage was already set for prejudices to unfold and for the climate aficionados to choose their preferred sides. The paper by von Storch et al. (2004) was perceived by some as an attack on the hockey-stick and, by the same token, on the larger corpus of anthropogenic warming research - something it was not. The mannistas and the anti-mannistas stood poised to fend off the forays of their respective adversaries into their own territory, independently of the contents of the Wahl et al. Comment or of our response, which quite likely very few people took the time to read.

This little episode had, however, a positive ending: later on, I had the chance to meet Eugene Wahl personally, one of the nicest scientists you can imagine, both personally and professionally, and, ironically, one of the most unfairly treated since Climategate.

Eduardo: "will no enter the question of why realclimate linked to Comment published in Science by Wahl et al. (2006) and not to our response, both published side by side."

In fact, the realclimate article has a link to your response to Wahl et al. in Science, 2006. The reader you referred to just did not read properly.

However, I perceive the realclimate article as a bit unfairly slanted, because it blames you for not responding to the critique before the critique was published - but you must have been aware of it anyway. RC seems to justify this extraordinary demand by the assumption that this critique of your paper invalidates its conclusions. Judging from your response, however, you see this differently, and from that perspective there isn't any need, by any standard, to reply to an inconsequential little oversight.

I also perceive your press-release example (unemployment) as a bit misleading. It simply illustrates the high-school knowledge that correlation does not imply causation. However, to extrapolate temperature from proxies we need not only causation but also a good idea about the nature of this causation. This, in principle, cannot come from the time series themselves. Fortunately, there is a huge body of knowledge concerning the physical, biological and chemical processes that relate various proxies to temperature (as opposed to the literature pertaining to unemployment as a function of temperature). So your argument for removing low-frequency variation (the linear trend) from the time series before regression, i.e. to avoid inflation of the validation measure due to correlation 'by chance', seems misguided. The validation helps to select the best model among candidates; its value in validating the underlying assumptions is very limited indeed. Intuitively it seems right to me to leave the low-frequency signal in the calibration, if the low-frequency variation is what we are interested in. Conversely, removing the trend seems to rely on the assumption of scale invariance of the temperature-proxy relationship.

What I would like to know: 14 years after pretty much the very first attempt at a multiproxy reconstruction (MBH98), to which your article referred, and six years after the exchange described above, what is the state of the art for this problem? The linear regression methods, direct or inverse, with PCA or not, and particularly the results of Bürger & Cubasch (2005), tell me that the statistical methodology taken into consideration at the time was very much ad hoc and pedestrian. Do we have more powerful methods today, and better guidance about which method works in which case? Wouldn't this, for example, be a poster-child problem for Bayesian approaches? Any recommendation for a recent review paper?

First: I'll second hvw's question of whether there is a recent review. I should know about them, but the only thing that comes to mind is Jason Smerdon's WIREs Climate Change paper on pseudo-proxy work (see here or here). There is the editorial by Hughes and Ammann, and there is Tingley et al.'s "Piecing together the past: statistical insights into paleoclimatic reconstructions", but a "complete" review of the methodologies? Some insights may come from blog posts (lucia, SMcI, JeffID etc.) or the discussions surrounding Bo Christiansen's publications of the last years.

Tingley's BARCAST may be more powerful than the regression-based methods. Which leads again back to Smerdon's publication page.

My second point: consent about dissent. Yes, but it's kind of astonishing how quickly the scientific discussion becomes infested by emotions. One only has to look at the most recent spectacle. Interestingly, it runs along trenches quite similar to Eduardo's description above.

Let us assume that p_t is a series of proxy data, and f_t the geophysical variable of interest. Let us further assume that p_t and f_t are stationary random variables, which with respect to p_t is a nontrivial assumption (without it, statistical analysis makes little sense; one can weaken this assumption by going to quasi-stationarity or other more complex constellations, but I have never seen this done).

When building a statistical link, you assume that you learn something from the joint variability of the pairs (p_t, f_t). To do so, you must have several, or even better many, samples of (p_t, f_t). Also, you should know how often a new pair tells you something NEW about the joint generating process - that is, how often (p_{t+1}, f_{t+1}) is essentially the same constellation that was already described by (p_t, f_t). In particular, you do not want to see unrelated trends in both variables. Unfortunately, statistics can hardly tell you whether the trends are related or not, unless you refer to the difference-stationary time series analysis methods known from econometrics (Schmith, T., S. Johansen, and P. Thejll, 2007: Comment on "A Semi-Empirical Approach to Projecting Future Sea-Level Rise", Science, doi:10.1126/science.1143286). Thus, what matters is the number of independent sample pairs; the assumptions about the sampling process are a key element (whenever a statement about the reality of a link between the variations is made).
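The point about independent sample pairs can be made concrete with a common rule-of-thumb correction for serial correlation (a sketch of my own, using one standard approximation based on lag-1 autocorrelations, in the spirit of Bretherton et al. 1999):

```python
import numpy as np

def effective_n(x, y):
    """Approximate effective sample size for the correlation of two
    serially correlated series: n_eff = n * (1 - r1*r2) / (1 + r1*r2),
    where r1, r2 are the lag-1 autocorrelations of x and y."""
    n = len(x)
    r1 = np.corrcoef(x[:-1], x[1:])[0, 1]
    r2 = np.corrcoef(y[:-1], y[1:])[0, 1]
    return n * (1 - r1 * r2) / (1 + r1 * r2)

rng = np.random.default_rng(1)
n = 200

# White noise: every new pair is genuinely new information
w1, w2 = rng.normal(size=n), rng.normal(size=n)

# Strongly persistent (trending-looking) series: few independent pairs
p1 = np.cumsum(rng.normal(size=n))
p2 = np.cumsum(rng.normal(size=n))

print(effective_n(w1, w2))   # close to n
print(effective_n(p1, p2))   # far smaller than n
```

With the effective sample size in place of n, the apparent significance of a correlation between two trending series largely disappears, which is the point being made here.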

Now, let's write p = p* + p' and f = f* + f', with p* being the archive of the variations in f, and f* the archive of the variations in p. [Symmetry here, because both forward and inverse regression are in use.] In the case of a forward regression, it would be p* = alpha·f + random error, with alpha = Cov(p,f)/Var(f), and Var(p*) = alpha^2·Var(f) = Cov(p,f)^2/Var(f) = Corr(p,f)^2·Var(p). Since the correlation is in all practical situations less than 1, we find Var(p*) < Var(p). [The same holds the other way around.] Independently of whether we use forward or inverse regression, we have Var(p*) ≠ Var(p) and Var(f*) ≠ Var(f). Which is obvious, because we have the nonzero contributions f' and p', which are part of p and f but which leave no traces on the other variable: Cov(p',f) = 0 and Cov(f',p) = 0. (I hope my calculations are complete.)
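The variance deflation in the forward regression can be checked numerically. This is a minimal sketch with synthetic data (my own example, not from the comment):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 100_000
f = rng.normal(size=n)               # geophysical variable
p = 0.8 * f + rng.normal(size=n)     # proxy = signal + unrelated noise p'

# Forward regression: p* = alpha * f, with alpha = Cov(p, f) / Var(f)
cov_pf = np.mean((p - p.mean()) * (f - f.mean()))
alpha = cov_pf / np.var(f)
p_star = alpha * f

r2 = np.corrcoef(p, f)[0, 1] ** 2

# Identity: Var(p*) = Corr(p, f)^2 * Var(p), hence Var(p*) < Var(p)
print(np.var(p_star), r2 * np.var(p), np.var(p))
```

The first two printed numbers agree to floating-point precision, and both are strictly smaller than Var(p) as long as the correlation is below 1.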

With statistical analysis, 100% of the variance of f, or of p, cannot be recovered by screening p (or f). Some part of the original variability is lost, and lost for good, unless one could recover f' (or p'), which very likely is not just noise. The same applies when more sophisticated links are established, such as neural nets or whatever (methods which in general need many more samples to achieve reasonably small estimation errors; please check).

An often used trick is to employ "inflation", that is, to simply multiply the *-series by a suitable factor so that Var(p*) = Var(p). This implicitly assumes that p' = 0, or Corr(p,f) = 1, which is obviously an invalid assumption. All proxies contain variations that are not related to whatever we want to take them as representative of - temperature, precipitation etc. - but to other influences, such as local environmental changes, ranging from bug contamination to landslides.
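A small numerical illustration of why inflation does not help (my own sketch, continuing the notation above): rescaling p* so that its variance matches Var(p) amounts to dividing by the absolute correlation, and it always increases the mean squared error of the estimate:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 100_000
f = rng.normal(size=n)
p = 0.8 * f + rng.normal(size=n)     # proxy = signal + noise

cov_pf = np.mean((p - p.mean()) * (f - f.mean()))
p_star = (cov_pf / np.var(f)) * f    # forward-regression estimate of p

# Inflation: scale p_star so that its variance matches Var(p)
infl = np.sqrt(np.var(p) / np.var(p_star))
p_inflated = infl * p_star

mse_plain = np.mean((p - p_star) ** 2)
mse_inflated = np.mean((p - p_inflated) ** 2)
print(mse_plain, mse_inflated)       # inflation makes the error larger
```

Algebraically, the plain estimate has MSE = Var(p)(1 - r^2) while the inflated one has MSE = 2·Var(p)(1 - r), and (1 - r^2) < 2(1 - r) whenever r < 1, so the inflated series has the right variance but a worse fit.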

Another trick is to add complexity to the statistical method, in the hope that more simple-minded people will not understand such methods and will trust that the complexity adds reliability to the result. In general this is not the case.

In short: the problem that proxy reconstructions tell us only part of what happened is an intrinsic property of the approach and cannot be overcome by statistical analysis alone. A possible solution may be process-based modelling using proxy data to constrain the dynamical modelling (cf. data assimilation) - but on the other hand: what is lost is lost. Proxies do not tell us past states, but only part of past states and variations.

I forgot two points: a) the link between f and p, proxy and geophysical data, may not be stationary; b) correlations in this business are often about 0.7 or less, corresponding to 1/2 or less of the variance, i.e. 1/2 or more of the variance remains "unexplained".

Maybe ensemble reconstructions, as described here: http://www.climate.unibe.ch/~joos/papers/frank10nat.pdf

The idea is that when you have no chance of finding the "best" reconstruction, you can obtain valuable information about uncertainties by using several methods and creating an ensemble of reconstructions.
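A toy version of such an ensemble (my own sketch, not the method of the linked paper): apply several simple calibration methods to the same synthetic proxy and use the spread of the resulting reconstructions as a rough uncertainty measure:

```python
import numpy as np

rng = np.random.default_rng(4)
n = 150
f = np.cumsum(rng.normal(0, 0.1, n))   # synthetic "true" climate
p = f + rng.normal(0, 0.3, n)          # noisy proxy of it
cal = slice(100, 150)                  # instrumental/calibration period

recons = []

# Member 1: forward regression (f regressed on p) - deflates variance
a, b = np.polyfit(p[cal], f[cal], 1)
recons.append(a * p + b)

# Member 2: inverse regression (p regressed on f, then inverted) - inflates it
c, d = np.polyfit(f[cal], p[cal], 1)
recons.append((p - d) / c)

# Member 3: variance matching (scale proxy to calibration-period statistics)
recons.append((p - p[cal].mean()) / p[cal].std() * f[cal].std() + f[cal].mean())

recons = np.array(recons)
spread = recons.std(axis=0)            # ensemble spread as a crude error bar
print(spread.mean())
```

The forward and inverse members bracket the variance-matched one, so the spread between them gives at least a feel for the methodological uncertainty, which is the point of the ensemble idea.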

Wallacer, the issue with the trends actually comes a step before the 'correlation is not causation' meme. I would rather describe it as 'common trends are not correlation'. All series with a long-term trend appear correlated, but tested properly that correlation is not statistically significant. The number of degrees of freedom is much less than the number of time steps.
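This can be illustrated in a few lines (a sketch of my own): two unrelated series with linear trends correlate strongly, but after removing the trends the correlation collapses:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100
t = np.arange(n)

# Independent series sharing only a deterministic trend
x = 0.03 * t + rng.normal(0, 0.5, n)
y = 0.02 * t + rng.normal(0, 0.3, n)

def detrend(s):
    """Remove the least-squares linear trend."""
    return s - np.polyval(np.polyfit(t, s, 1), t)

r_raw = np.corrcoef(x, y)[0, 1]
r_det = np.corrcoef(detrend(x), detrend(y))[0, 1]
print(r_raw, r_det)   # the raw correlation is purely trend-driven
```

The detrended correlation fluctuates around zero at the level expected for independent noise, confirming that the raw value carried no evidence of a physical link.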

Well, that link was included at some point later. I remember posting a comment there to make the realclimate readers aware of the existence of the response, but it was 'moderated'. Anyway, this is not really important now, after all these years.

With the perspective of these few years - and independently of the issue of detrending - the problem of the underestimation of the variance has been confirmed by many other studies. A nice review was written by Smerdon just a few months ago (linked in comment 5), which, if I remember properly, does not include the recent applications of Bayesian methods. From the results that we are getting in other projects - still unpublished - I would say that the Bayesian hierarchical methods still suffer from this underestimation. A previous attempt with Bayesian methods, including not only proxy information but also information about the external forcing, was published by Lee and others.

As Hans explained before, this can be a fundamental property of a large family of statistical models.

I would mention that other methods, based on local calibration of one proxy record against one instrumental temperature record using inverse regression (also known as classical calibration: the predictor is the instrumental variable, the predictand is the proxy), show promising results. Bo Christiansen blogged about this here in the Klimazwiebel some time ago.

Leaving the low-frequency signal in would, in my opinion, be justified if we were completely sure that the proxy is reacting to climate and we just wished to calibrate the proxy as accurately as possible. Unfortunately, this is not the case. There are many proxy records around - though by no means all - that simply do not contain any climate signal. They have sometimes been interpreted as a temperature signal, then later as a precipitation signal, later as a mixed signal that flips in certain periods, etc. In other cases, for instance stalagmites, records from the same cave look quite different, and the experts here claim that you need very developed mechanistic knowledge of the proxy to identify the best locations within a single cave. In the Mann et al. (1998) study, clearly some precipitation records were wrongly interpreted as containing a temperature signal, something that was much criticized in the paleo community at that time.

I would say that most of us still have to learn quite a lot from the professionals, but they don't show much interest, with some exceptions. One initiative was the workshop organized last year.

Thanks for the pointers and extensive comments. Actually, I find Tingley et al. (2012) quite informative. They give a nice overview of what is out there, from a Bayesian perspective, which makes it conceptually simpler. The paper is not at all rigorous or deep, but that makes it a very accessible, easy read.

Some article-comment-response exchanges are elucidating on a meta-level and nicely illustrate how people with a classical (frequentist) background can misunderstand those who try to advance a modern (Bayesian) approach (Christiansen's (2012) LOC, Tingley's (2012) comment and Christiansen's reply).

The take-away for me is:

1) A damn hard problem.

2) The state of the art - performing at least as well as everything else in the majority of evaluations and being currently practically applicable - is RegEM (Schneider 2011).

3) This by no means implies that RegEM is "good enough"; on the contrary.

4) The way forward is Bayesian Hierarchical Models (BHMs), because only this framework allows for a clean inclusion of all the information available (e.g. the spatiotemporal covariance structure of both predictand and predictor) and proper uncertainty propagation. Most importantly, and referring to eduardo's last paragraph, BHMs can easily (well, conceptually easily) incorporate detailed, complicated (aka realistic) process-level models, and this to me seems the proper way to constrain the uncertainty of proxy reconstructions - as opposed to simulated uncertainty assessment through GCM-driven pseudo-proxies, which are constructed so that they behave in a way that is tractable in the chosen statistical framework (aka linear) but do not incorporate (and possibly even contradict) the empirical process knowledge available.

5) To actually do 4) you need Bayesian statisticians, paleo people, climatologists, a good scientific programmer and a huge cluster, all working nicely together. But I am pretty sure that projects in that direction are underway.

HvS, #6: "Another trick is to add complexity in the statistical method, in the hope that more simple minded people would not understand such methods and trust that the complexity would add reliability of the result. In general this is not the case."

Such evil tricks might actually happen, and I am certainly among the "simple-minded people" who do not understand purposely obfuscated methods (but I am not so simple-minded as to automatically assume they work). But obfuscated methods that do not work will have no impact in the long run. What worries me more is the opposite, and you actually point to a nice example of this attitude: a valid statistical criticism (Schmith et al. (2007)) of a study that is flawed by an oversimplistic statistical approach is handwavingly discarded.

Sustainable use of KLIMAZWIEBEL

The participants of KLIMAZWIEBEL are a diverse group of people interested in the climate issue; among them people who consider the man-made climate change explanation true, and others who consider this explanation false. We have scientists and lay people; natural scientists and social scientists; people with different cultural and professional backgrounds. This is a unique resource for a relevant and inspiring discussion. This resource needs sustainable management by everybody. Therefore we ask you to pay attention to these rules:

1. We do not want to see insults, ad hominem comments, lengthy tirades, ongoing repetitions, or forms of disrespect towards opponents. Lengthy presentations of amateur theories are also not welcome. Postings violating these rules will be deleted.
2. Please limit your contributions to the issues of the different threads.
3. Please give your name or use an alias - comments from "anonymous" should be avoided.
4. When you feel provoked, please refrain from ranting; instead, try to delay your response for a couple of hours, until your anger has evaporated somewhat.
5. If you want to submit a posting (begin a new thread), send it to either Eduardo Zorita or Hans von Storch - we will publish it within a short time. But please, only articles related to climate science and climate policy.
6. Use whatever language you want. But maybe not a language which is rarely understood in Hamburg.