Re: Edward (#1), I think that the Comment and the Reply are both going to be published in issue 106(6) on Feb. 10, 2009. At least, that’s the issue number at the bottom of the pdfs on the PNAS website (Comment and Reply)

Pardon? Then how do they know whether the HS blade should be turned upwards or downwards in the end? I’d have thought that the *sign* would be the most important thing about any “signal” (and especially its trend)….

Re: Arthur Edelstein (#2)
“Multivariate regression methods are insensitive to the sign of
predictors.” — MBH
Cor(X, -Y) = -Cor(X, Y)
So if your predictor is upside-down, all that does is flip the sign of its estimated coefficient.
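To make the algebra concrete, here is a minimal numerical sketch (mine, not from any of the papers in question): negating a predictor flips the sign of its least-squares coefficient, but the intercept and the fit itself are unchanged.

```python
# Sketch: flipping the sign of a predictor flips its fitted coefficient
# but leaves the regression fit identical.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200)
y = 2.0 * x + rng.normal(size=200)

b_pos, a_pos = np.polyfit(x, y, 1)    # slope, intercept for y ~ x
b_neg, a_neg = np.polyfit(-x, y, 1)   # slope, intercept for y ~ -x

assert np.isclose(b_pos, -b_neg)      # coefficient sign flips...
assert np.isclose(a_pos, a_neg)       # ...intercept and fitted values do not
assert np.isclose(np.corrcoef(x, y)[0, 1], -np.corrcoef(-x, y)[0, 1])
```

That is the (trivially true) mathematical point; whether it answers the criticism is the question the rest of the thread takes up.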

That makes sense, but at some point in building the proxy would not the sign matter?

One must remember that replies to comments on journal articles are allowed significant latitude by journal editors (the articles were peer reviewed, after all), hence Mann et al. are allowed to slip away without seriously addressing any of the points raised by Steve and Ross.

Reply to McIntyre and McKitrick: Proxy-based temperature reconstructions are robust

McIntyre and McKitrick (1) raise no valid issues regarding our paper. We specifically discussed divergence of “composite plus scale” (CPS) and “error-in-variables” (EIV) reconstructions before A.D. 1000 [ref. 2 and supporting information (SI) therein] and demonstrated (in the SI) that the EIV reconstruction is the more reliable where they diverge. The method of uncertainty estimation (use of calibration/validation residuals) is conventional (3, 4) and was described explicitly in ref. 2 (also in ref. 5), and Matlab code is available at http://www.meteo.psu.edu/mann/supplements/MultiproxyMeans07/code/codeveri/calc_error.m.

McIntyre and McKitrick’s claim that the common procedure (6) of screening proxy data (used in some of our reconstructions) generates “hockey sticks” is unsupported in peer-reviewed literature and reflects an unfamiliarity with the concept of screening regression/validation.

As clearly explained in ref. 2, proxies incorporating instrumental information were eliminated for validation and thus did not enter into skill assessment.

The claim that “upside down” data were used is bizarre. Multivariate regression methods are insensitive to the sign of predictors. Screening, when used, employed one-sided tests only when a definite sign could be a priori reasoned on physical grounds. Potential nonclimatic influences on the Tiljander and other proxies were discussed in the SI, which showed that none of our central conclusions relied on their use.

Finally, McIntyre and McKitrick misrepresent both the National Research Council report and the issues in that report that we claimed to address (see abstract in ref. 2). They ignore subsequent findings (4) concerning “strip bark” records and fail to note that we required significance of both reduction of error and coefficient of efficiency statistics relative to a standard red noise hypothesis to define a skillful reconstruction. In summary, their criticisms have no merit.

Michael E. Mann1, Raymond S. Bradley, and Malcolm K. Hughes
Pennsylvania State University, Walker Building, University Park, PA 16802
1. McIntyre S, McKitrick R (2009) Proxy inconsistency and other problems in millennial paleoclimate reconstructions. Proc Natl Acad Sci USA 106:E10.
2. Mann ME, et al. (2008) Proxy-based reconstructions of hemispheric and global surface temperature variations over the past two millennia. Proc Natl Acad Sci USA 105:13252–13257.
3. Luterbacher J, Dietrich D, Xoplaki E, Grosjean M, Wanner H (2004) European seasonal and annual temperature variability, trends, and extremes since 1500. Science 303:1499–1503.
4. Wahl ER, Ammann CM (2007) Robustness of the Mann, Bradley, Hughes reconstruction of surface temperatures: Examination of criticisms based on the nature and processing of proxy climate evidence. Clim Change 85:33–69.
5. Mann ME, Rutherford S, Wahl E, Ammann C (2007) Robustness of proxy-based climate field reconstruction methods. J Geophys Res 112:D12109.
6. Osborn TJ, Briffa KR (2006) The spatial extent of 20th-century warmth in the context of the past 1200 years. Science 311:841–844.
Author contributions: M.E.M., R.S.B., and M.K.H. wrote the paper.
The authors declare no conflict of interest.
1To whom correspondence should be addressed. E-mail: mann@psu.edu.

I note that, of course, Mann did not cite any mainstream statistical literature in support of his claims that

…The method of uncertainty estimation (use of calibration/validation residuals)
is conventional (3, 4) and was described explicitly in ref. 2 (also in ref. 5)

and

McIntyre and McKitrick’s claim that the common procedure (6) of screening proxy data (used in some of our reconstructions) generates “hockey sticks” is unsupported in peer-reviewed literature and reflects an unfamiliarity with the concept of screening regression/validation.

Yes, common in badly done studies by MBH and friends, but condemned by mainstream and extremely knowledgeable statisticians (Wegman 2006, Jolliffe 2008) as worthless. Oh, and MM03 and MM05a and b are in the corpus of peer-reviewed literature and have never been refuted, except by meaningless handwaving by the Hockey Team.

Oh yes and this little gem

They ignore subsequent findings (4) concerning “strip bark” records and fail to note that we required significance of both reduction of error and coefficient of efficiency statistics relative to a standard red noise hypothesis to define a skillful reconstruction. In summary, their criticisms have no merit.

…refers to the so-called “Jesus Paper” of Wahl and Ammann where the authors invent a new statistical metric of extremely dubious provenance in order to give the deeply flawed Mann Hockey Stick a notion of statistical skill that it clearly does not deserve.

(I just tried to post this over at RealClimate but hit some errors… Steig posted this comment in response to a question about the non-archiving of full runnable source code.)

“A good auditor doesn’t use the same Excel spreadsheet that the company being audited does. They make their own calculations with the raw data. After all, how would they know otherwise if the Excel spreadsheet was rigged? … – eric”

This is misleading at best. I spend a lot of my time “auditing” (we call it Quality Control and do it pre-emptively) software used to analyse clinical trials data in the pharmaceutical industry. As eric suggests, the ideal method is to do a blind dual-programming exercise – ie, independently write a new implementation of the same algorithm to see if you get the same results. However, this is only possible if the original algorithm is documented in sufficient (ie enormous) detail. As writing documentation is a lot of work, the actual source code is often the best documentation available.

When, as is often the case, the dual-programming throws up different results, what do you do next? The only way to make progress is to examine the original code in order to determine exactly what algorithm was used. It would be impossible to determine what algorithm was used, and hence whether it can be considered correct or not, without access to the original code.

“Multivariate regression methods are insensitive to the sign of
predictors.” — MBH

This looks like a very strange statement. Does anyone know exactly what it is intended to mean? Is it perhaps that the actual arithmetic does not distinguish between positive and negative values – i.e. it works perfectly whatever the input data? One would hope that any adequately written software would be able to take care of any numerically valid input.

Suppose you are trying to predict stock prices (SP). You might consider a consumer confidence index (CCI) and find that there is a reasonable correlation (following Mann, by considering the correlation coefficient r; in this example you would want r a lot closer to 1 than to 0). By linear regression you calculate the positive constants a and b in the equation SP = a + b*CCI, which then allows you to calculate (predict) SP when you know CCI beforehand.

But suppose you wanted to take it further by looking at other measures and came across a consumer pessimism index (CMI) which goes down when people are less pessimistic. If you did find a correlation it would be expected to be negative (SP up when CMI down, SP down when CMI up). A good correlation would have r close to -1. Linear regression would give the positive constants a’ and b’ in the equation SP = a’ – b’*CMI. Now the coefficient for CMI (-b’) is negative.

Now suppose that CCI has a hockey stick shape (is pretty constant for some time then suddenly increases). The predicted SP will have the same shape, with a sudden turn upwards. Under the same circumstances you would expect CMI to have the same shape but to suddenly turn downwards. But because the coefficient (-b’) of CMI is negative a downturn in CMI will predict an upturn in SP, just like CCI.

If you are confident you have a good grip on the relationship between SP and CCI and CMI you might be prepared to put some money on the correlation. But what about the London Metals Index (LMI) which goes up when metal prices generally go up. An increase in LMI might lead to an increase in SP (stronger business conditions?) or maybe the opposite (higher input prices, lower profits?). Never mind, just look for a correlation using the actual data over time. If there is a good positive correlation (r close to +1) then OK, it seems to give useful information, like CCI. If it gives a good negative correlation (r close to -1) again OK, like CMI. If it gives some intermediate value of r (you would have to decide what positive and negative values of r you would use to define “intermediate”) you would conclude that LMI is not much use for your purpose of predicting SP.

As I understand it this is the starting point for the filtering by regression method. You look at lots of indices like LMI and select those that pass the regression test with a “good enough” value of r. I’ll stop there and see if there are any comments on whether I’m on the right track and whether the analogy is a useful one.
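The selection step described above can be sketched in a few lines of Python. All the names, sizes, and the cutoff here are invented for illustration; the candidates are random walks, so any survivors are spurious by construction.

```python
# Sketch of screening by correlation: keep a candidate predictor only if its
# correlation with the target clears a chosen cutoff, in either direction.
import numpy as np

rng = np.random.default_rng(1)
n = 120
sp = rng.normal(size=n).cumsum()     # the series to be "predicted"
candidates = {f"index_{i}": rng.normal(size=n).cumsum() for i in range(50)}

r_cut = 0.5                          # a "good enough" |r|, chosen arbitrarily
corr = {name: np.corrcoef(s, sp)[0, 1] for name, s in candidates.items()}
survivors = {name: r for name, r in corr.items() if abs(r) > r_cut}

print(len(survivors))                # random walks can and do pass this screen
```

Note the two-sided test: a strongly negative r passes just as easily as a strongly positive one, which is exactly the situation the CCI/CMI/LMI analogy is probing.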

Re: davidc (#49), This argument only works if you have no prior reason to expect a + or – regression coefficient. In that case least squares will give you the same fit, the only difference will be whether the coefficient is positive or negative. But if you have earlier argued that series X is a good predictor for Y because theory predicts a + correlation between them, and then you get a – coefficient, so you instead switch in -X as your predictor, you can’t turn around and appeal to the algebra of least squares regression to defend your methods. If you expected a + coefficient on X, getting a positive coefficient on -X refutes your expectation. If you suddenly switch around and call -X your predictor you are being inconsistent, even if your regression program doesn’t object.

I’m struggling with this too. The argument seems to be that the regression method doesn’t care about sign. But the regression method is just math.

I thought that one of the arguments of climate scientists was that statistically knowledgeable non-climate scientists, like statisticians, economists, and engineers, couldn’t apply their statistical know-how to this area because they don’t understand the underlying physical processes.

Are they now saying that the underlying physical process (positive or negative proxy response) doesn’t matter because the algorithm doesn’t care?

Ross, I agree that correlation without a mechanistic understanding is a poor basis for prediction. I’m just trying to explain why negative correlations are just as good as positive ones, as that appeared to be misunderstood. In my analogy I (personally) would expect SP to be positively correlated with CCI, but weakly (not investment grade). But if I saw data showing a strong positive correlation (with CCI leading, of course) I might just be tempted. On the other hand, if CCI showed a strong negative correlation I would be intrigued but wouldn’t be putting money on it. And if I had two separate indices CCI1 and CCI2, one with a strong positive and one with a strong negative correlation, the negative correlation would merely undermine any confidence I might have in the positive correlation. But if you took these two series as simply numbers, without attention to their meaning, the negative correlation would support the positive one.

The other thing I was heading toward was trying to explain with an analogy what I understand to be the basis of filtering (data mining?) using regression. The example with LMI as an index is intended to illustrate the case where intuitively (to me anyway) the correlation could be either way. Nevertheless, if the correlation was very strong either way it would be tempting to take note of it. And if you had lots of these indices (say 1,200) with strong correlations, either positive or negative, you might be forgiven for thinking that you had developed a powerful tool.

But the example above with CCI1 and CCI2 shows the potential danger here. It could be that half your correlations are telling you to have no confidence in the other half. But in data mining by regression, as I understand it, any series will do. If it’s available, use it; the regression will tell you if it’s irrelevant. So you have no idea whether your data consists of pairs like CCI1&2 both with positive correlations, both with negative correlations, or with opposite correlations. Looks like p=0.5 to me.

Maybe it’s because my statistics is weak, but I can’t see a statistical way out here. It seems to me that you simply have to know what your data is all about.

If you filtered with a high value of r^2 you might be able to handle this problem by inspection of survivors (checking negative correlations for whether they undermine a positive one). But if you set a very low r^2 (or r, which I think Mann used) you will end up selecting nearly all the original data set. Then, if you want to understand what is really going on you need to look at (not compute) a large number of interactions (something like 1200! ?).
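A quick numerical sketch of the “low threshold lets nearly everything through” point (pure white noise, invented numbers, nothing to do with any actual proxy network):

```python
# The fraction of unrelated series surviving a two-sided correlation screen
# grows rapidly as the cutoff |r| is lowered.
import numpy as np

rng = np.random.default_rng(5)
target = rng.normal(size=100)
series = rng.normal(size=(1200, 100))
r = np.array([np.corrcoef(s, target)[0, 1] for s in series])

fracs = {cut: float(np.mean(np.abs(r) > cut)) for cut in (0.5, 0.3, 0.1)}
print(fracs)   # survivor fraction rises as the cutoff falls
```

For 100 points of white noise the sampling standard deviation of r is roughly 0.1, so a cutoff of |r| > 0.1 admits a large fraction of completely unrelated series while |r| > 0.5 admits essentially none.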

It also seems to me that Propensity to Hockeystickiness increases as r decreases. Has anyone looked at that?

Steve: Jeff Id did some interesting and original posts on this topic at his blog in September in the wake of Mann 08.

It also seems to me that Propensity to Hockeystickiness increases as r decreases. Has anyone looked at that?

It would, if you had a situation of a few good proxies, mixed with a lot of random proxies, and a temperature record that had enough ups and downs to effectively screen.

The situation is that we have a temperature record that has gone virtually straight up since 1900, and forms the blade of a hockey stick with high r^2 correlation on both the good proxies and a significant portion of the random ones.

This is because the information from the temperature record is inadequate, and information from an independent source, such as basic chemistry, is needed in order to be confident your proxies are good. Because the temperature is basically a straight line, even out-of-sample tests are overconfident due to the autocorrelation.

It’s not an unusual situation that there is not enough information to do what you want to do. Dendros seem to think that’s not fair.

I don’t believe the issue is with the sign of the predictors in the regression per se, it is the inconsistency of the sign of the predictors for a given class of predictors or to the physical theory. If “tree rings” are generally positively signed wrt temperature, then all tree rings need to be positively signed.

The major advantage of this response by Mann is that their arrogant nonsense is now citable, like a moth pinned to a board. The most damning part of the response is the list of citations. They can only cite themselves as to why they do what they do.

I too noticed that the only references Mann cited were himself and other ‘team’ members. One would think that if your use (or misuse) of statistics was being called into question you would cite a statistical authority…

Multivariate regression methods are insensitive to the sign of predictors, since you get the same result by flipping the predictor and its weight.

But that’s not the issue. Mann’s algorithm picks different signs for the same predictor depending on which period is being reconstructed. So for one period a predictor is a temperature proxy, for another it is an anti-temperature proxy. And yet in both cases the predictor is supposed to have significant skill. Absent extra argument which Mann does not provide, a predictor cannot have significant skill and be both a temperature proxy and an anti-temperature proxy.

In the past I figured these guys understood these kind of problems, and just chose to ignore them. But from Mann’s reply I am starting to think they are simply not smart enough to understand the problems with their approach. And they’ve successfully built themselves a self-referential citation island which means they can quote themselves as authorities without ever having to step outside the hall of mirrors.

In the past I figured these guys understood these kind of problems, and just chose to ignore them. But from Mann’s reply I am starting to think they are simply not smart enough to understand the problems with their approach. And they’ve successfully built themselves a self-referential citation island which means they can quote themselves as authorities without ever having to step outside the hall of mirrors.

Full agreement. M&M, once again you’ve performed a distinguished public service by getting this =snip – on record in the professional literature. No one with even a passing acquaintance with statistical methods could read Mann’s reply without wincing. He’s – snip

I would love to know if these folks have still been unable to understand that (and why) their methodology generates sticks out of Brownian noise. This is a matter of a simple computer program that most 10-year-old geeks are able to write themselves, perhaps after a day. Is that really a topic that should remain controversial in the research community for nearly a decade, or many decades? Something is simply seriously wrong here.

I suspect simple statistical (and perhaps general) innumeracy — which is pretty common in the geosciences. Which is one reason why he has gotten away with this silly BS for so long, I think.

Re: mugwump (#22),
Apart from the fact that the sign matters for screening, the most hilarious part of the statement (#2) is that the sign does matter for the CPS method used in the paper (that’s why sign flipping is implemented for negatively correlated proxies in the code). So the statement is like saying “all cars have ABS” when you are yourself driving a 1970s Lada.😉

Check. Actual (Reply to McIntyre and McKitrick: Proxy-based temperature reconstructions are robust):-McIntyre and McKitrick’s claim that the common procedure (6) of screening proxy data (used in some of our reconstructions) generates “hockey sticks” is unsupported in peer-reviewed literature and reflects an unfamiliarity with the concept of screening regression/validation.

“The comments are nonsense” even though they are perfectly logical.

Check. Actual:-McIntyre and McKitrick (1) raise no valid issues regarding our paper.
-The claim that “upside down” data were used is bizarre.
-In summary, their criticisms have no merit.

“This was explained in supplement A1” which was never made available to anyone.

Check. Actual:-The method of uncertainty estimation (use of calibration/validation residuals) is conventional (3, 4) and was described explicitly in ref. 2
(also in ref. 5), and Matlab code is available at http://www.meteo.psu.edu/mann/supplements/MultiproxyMeans07/code/codeveri/calc_error.m.

“The symmetry of transient variable analysis has been fully accepted in the literature” in a vague non-published study from 4 years ago which actually said something completely different.

Partial check. Actual:-Reply to McIntyre and McKitrick: Proxy-based temperature reconstructions are robust
-McIntyre and McKitrick’s claim that the common procedure (6) of screening proxy data (used in some of our reconstructions) generates “hockey sticks” is unsupported in peer-reviewed literature

I think the only key thing everyone missed is the use of the word “bizarre”.🙂

It’s funny but I cannot bring up RealClimate now. It says I am forbidden, that I don’t have permission to access the server. I have not even tried to post over there for a long time. What is going on? Do they keep a list of IP addresses of people who post there who also post on ClimateAudit? Or do they just have a hiccup with their server? Is anyone else having this problem?

Re: M. Villeger (#37), Correction, As for the reference: it’s in “Drôle de Drame” film by Marcel Carné ( adaptation and dialogue by Jacques Prévert, 1937). The actor Louis Jouvet played both roles hence my mistake.

On this point, if you are suggesting that Steve MacIntyre be regulated by an oversight committee, and have his auditor’s license revoked when he breaks ethical rules, then we may have something we can agree on.–eric]

Any immediate resolution of the issues with the Mann et al. reconstruction between those Canadians and the team will depend more on the thinking person’s view of the arguments presented on the blogs. I am probably biased, but MM present what I think are legitimate criticisms of the reconstruction. The reply consists of declarative statements that MM are wrong; instead of dealing directly with the criticisms, the team points to the team’s own peer-reviewed literature, which, insofar as it uses or prescribes the same criticized methods, merely provides evidence for the Wegman view of a citation cluster, and the methods are never judged or discussed separately from the question of how many team players use or endorse them.

While the letters are too short (by journal rules) for a truly informative and detailed discussion, I think the content and approach by writers can be informative.

Well, they seem irritated, but the “lightness” with which they interpret the inconsistency away is slightly “unbearable”. There’s no problem for them if two 95% estimates diverge; we can always say that it’s OK, and moreover we can choose whichever gives us the more convenient results. Right?

This kind of thinking is just so sloppy. If there are two methods to calculate something, including the confidence interval, then a case where their predictions fail to overlap within their own estimated error margins is just a bad problem, something that is unlikely to happen by chance if both of the methods are “in principle” fine. They seem totally undisturbed by any inconsistencies.

Also, I would love to know if these folks have still been unable to understand that (and why) their methodology generates sticks out of Brownian noise. This is a matter of a simple computer program that most 10-year-old geeks are able to write themselves, perhaps after a day. Is that really a topic that should remain controversial in the research community for nearly a decade, or many decades? Something is simply seriously wrong here.
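For what it is worth, the “10-year-old geek” experiment alluded to above can be sketched in a dozen lines. This is a hedged toy version, not anyone’s actual reconstruction code, and every number in it (series count, lengths, cutoff) is invented: screen pure random walks against a ramped “calibration” target, flip the negatively correlated survivors, and average.

```python
# Toy experiment: screening random walks against a warming ramp and
# averaging the sign-oriented survivors produces a hockey-stick blade.
import numpy as np

rng = np.random.default_rng(42)
n_series, length, cal = 1000, 600, 100           # cal = final "instrumental" window
walks = rng.normal(size=(n_series, length)).cumsum(axis=1)
ramp = np.linspace(0.0, 1.0, cal)                # monotone warming in that window

r = np.array([np.corrcoef(w[-cal:], ramp)[0, 1] for w in walks])
keep = np.abs(r) > 0.3                           # arbitrary two-sided screening cutoff
oriented = walks[keep] * np.sign(r[keep])[:, None]   # flip anti-correlated survivors

recon = oriented.mean(axis=0)                    # the screened "reconstruction"
blade = recon[-1] - recon[-cal]                  # rise across the calibration window
print(int(keep.sum()), blade > 0.0)
```

By construction every oriented survivor rises across the calibration window, so their average acquires an upturned blade, while the pre-calibration “shaft” averages toward the noise mean.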

“A good auditor doesn’t use the same Excel spreadsheet that the company being audited does. They make their own calculations with the raw data. After all, how would they know otherwise if the Excel spreadsheet was rigged? … – eric”

And then you get criticized for getting it wrong because you didn’t run the calculations correctly.

Am I correct that Mann’s evidence to back his side relies on his own work?

However, their finding that the spatial extent of 20th-century warming
is exceptional ignores the effect of proxy screening on the corresponding significance levels.

…addressing just those issues about screening that Mann said are not addressed in the peer-reviewed literature. I have always thought Gerd’s comment, that accounting for the screening blows out the CIs, is just another, more statistical way of saying screening produces hockey sticks.

It’s ironic that the very paper he references to support screening as a valid procedure started a documented dispute over the reliability of the procedure, over the consequences of screening. Issues with screening are in the peer-reviewed literature.

Re Steve # 25,
It took me quite a while to figure out that this is a reference to Jean S’s comments on the 12/30/08 thread M&M Return, in which readers were challenged to make predictions about the wording of the reply by Mann et al.

For some reason the links to the comments by Bill Illis etc are coded to this thread (#5071) rather than the earlier one (#4757), and so do not work correctly.

The statistical wrangling is over my head, but I do notice in the citations that Mann uses the time-honored Climate Science Circular Citation and Argument From (My Own) Authority, which as we all know, is unimpeachable.

I would like to understand this business of “Multivariate regression methods are insensitive to the sign of predictors.” I gotta admit that, taken at face value, it reads like pure B.S. Anyone like to take a shot at enlightening me?

T = a*P + e = (-a)*(-P) + e
In linear regression, your job is to fit the coefficient(s) (“T” and “P” are known; coefficient “a” and noise “e” are unknown), so it “does not matter” (mathematically) for your fit what the “true sign” (i.e., the physical meaning) of your predictor(s) (P) is. So if you originally obtained the coefficient “a” for the predictor “P”, you get the same fit by first flipping your predictor (i.e., taking -P) and then calculating your coefficient (it’s now -a).

This is why you want to “screen” data beforehand. Essentially you are then fixing the sign of your coefficients by allowing only predictors with the correct sign of correlation. Mann’s statement is so blatant (should I say “bizarre”?) for at least the following reasons:
1) it is literally true (obvious), but it does not apply to the CPS method used
2) because of this property, in general, you need to screen your proxies to get rid of wrong-correlation predictors (or alternatively to fix the sign of your coefficients). Now, e.g., the (anyhow corrupted) Tiljander series have the wrong correlation (positive, should be negative), i.e., they should not pass screening but they do.
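A small sketch of point 1, using invented data and a simple standardize-and-average in place of the real CPS machinery: a regression can absorb an upside-down predictor into a flipped coefficient, but a plain composite cannot.

```python
# Sign matters for a composite: flipping one proxy degrades the average,
# because an average has no per-series coefficient to absorb the flip.
import numpy as np

rng = np.random.default_rng(7)
t = rng.normal(size=300).cumsum()                    # target "temperature"
proxies = [t + rng.normal(scale=2.0, size=300) for _ in range(5)]

def cps(series):
    # toy "composite plus scale": standardize each series, then average
    z = [(s - s.mean()) / s.std() for s in series]
    return np.mean(z, axis=0)

r_good = np.corrcoef(cps(proxies), t)[0, 1]
r_flip = np.corrcoef(cps(proxies[:-1] + [-proxies[-1]]), t)[0, 1]
print(round(r_good, 3), round(r_flip, 3))
```

The flipped composite tracks the target strictly worse, which is why a sign-fixing step (screening or explicit flipping) has to exist somewhere in a CPS-style method.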

The reply letter of Mann et al. (2009) is astonishing. Apparently Mann wants us to believe that Steve cannot read the NAS panel report. The flat denial of factual errors that were recognized by both the NAS and Wegman reports is actually a little creepy.

As I understand it, Mann says the inclusion of the Tiljander series “doesn’t matter” because it was not included in the verification step. But it was included in the temperature reconstruction step! The weighting and coefficients given to this series are based on data that include instrumental data as part of the proxy, so the contribution Tiljander makes to the results is simply wrong. Correct me if I’m wrong, I’m not very good with peas under thimbles.
Steve: Upside-down Tiljander was in the no-dendro reconstruction that “proved” that bristlecones didn’t “matter”; Graybill bristlecones (not Ababneh’s update) were in the no-Tiljander reconstruction that “proved” that Tiljander didn’t “matter”.

Ammann and Wahl 2007, in their running text, made a seeming concession on this issue with respect to our point on RE statistics (but then introduced their Texas sharpshooting test as an override):

Rather than examining a null model based on hemispheric temperatures, MM05a,c report a Monte Carlo RE threshold analysis that employs random red-noise series modeled on the persistence structure present in the proxy data (note, noise here is meant in the sense of the ‘signal’ itself, rather than as an addition to the signal). They argue that random iterations of even a single red-noise series based on the first PC of the North American tree-ring proxy set in MBH can systematically generate a mean offset in the resulting “reconstructed” 20th century calibration period temperatures compared to the immediately preceding verification period. Because the MBH method employs unitless proxy series in a regression against the primary PCs of instrumental surface temperatures, the calibration will not “care” if such random mean differences in the (red-noise) proxy series are positive or negative (MM05a). Rather than averaging to zero, all differences in mean will be treated as an inherent structure of the data and they will be used as if ‘oriented’ in the same way, and thus the verification significance thresholds at the 95 and 99% levels can be expected to be well above zero. It is important to note that such a situation, however, does not occur in the case when the predictors are direct temperature series (with associated sign) that are compared against another temperature series (e.g., hemispheric mean values), and equally not when composites of proxy time series are scaled against a hemispheric (or other) temperature series in “Composite Plus Scale” (CPS) methods.

Will there be a post on the MBH reply? I realize there has been a big hairy distraction, but based on even my limited experience with multivariate statistics there seems to be a lot to explore in this response.

The Mann paper says “The screening process requires a statistically significant (P < 0.1) correlation.” As I see it you could screen either way, but P < 0.1 (nearly everyone can join in) will produce completely different results (or is it a feature of the method that it “doesn’t matter”?)

This is strange. Now when I go to PNAS to view MM09, I get the page where they want to charge me $10.00 to read it. I do not understand why sometimes I can view it and sometimes I am asked to pay for it. How do they decide what is open-access and what is not? Do they use the Heisenberg uncertainty principle?