Mann and his bristlecones

Gavin Schmidt and others have claimed that the M08 usage of the Tiljander sediments didn’t “matter”, because they could “get” a series that looked somewhat similar without the sediments. They’ve usually talked around the impact of the Tiljander series on the no-dendro reconstruction. But there are two pieces of information on this. A figure added to the SI of Mann et al 2008 showed a series said to be a no-tilj no-dendro version, about which Gavin said that it was similar to the original no-dendro version, thereby showing that the incorrect M08 use of the Tiljander series didn’t “matter”. However, Gavin elsewhere observed that the SI to Mann et al 2009 reported that withdrawing the Tiljander series from the no-dendro network resulted in the loss of 800 years of validation – something that is obviously relevant to the original M08 claim to have made a “significant” advance through their no-dendro network.

To better understand Gavin’s seemingly inconsistent claims, I re-examined my M08 CPS emulation – I had previously replicated much of this, but this time managed to get further, even decoding most of their (strange) splicing procedures. As I’ve done for MBH98, I was able to keep track of the weights for individual proxies in the reconstruction – something not done in Mann’s original code, though obviously relevant to the reconstruction. This was not a small project, since you have to keep track of the weights through the various screening, rescaling, gridding and re-gridding steps – something that can only be done by re-doing the methodology pretty much from the foundations. However, I’m confident in my methods and the results are very interesting.

The first graphic below shows, on the left, the NH and SH reconstructions for the AD1000 network for the two calibration steps considered in M08: latem (calibration 1850-1949) and earlym (calibration 1896-1995), for the “standard” M08 setup. On the right are “weight maps” for the latem and earlym networks. (The weight map here is a bit muddy – I’ve placed a somewhat better rendering of the 4 weight maps online here.) The +-signs show the locations of proxies which are not used in the reconstruction. Take a quick look and I’ll comment below.

First, there are obviously a lot of unused series in the M08 with-dendro reconstruction. Remarkably, the exclusions are nearly all dendro series. Of the 19 NH dendro chronologies, 16 are not used; only three are used: a Graybill bristlecone chronology (nv512) in the SW USA, Briffa’s Tornetrask (Sweden) and the Jacoby-D’Arrigo Mongolia series – all three of which are staples of the AR4 spaghetti graphs. Only one of 10 Graybill bristlecone chronologies “passes” screening.

In other words, nearly all of the proxies in the AD1000 network are “no-dendro” proxies. I.e., the supposedly improved “validation” of the with-dendro network arises not from any general contribution of dendro chronologies to recovery of a climate signal, but from the individual contribution of three dendro series, with the other 16 screened out.

Secondly, the reconstructions are weighted averages of the individual proxy reconstructions. The latem and earlym reconstructions don’t appear at first glance to have remarkably different weights, but they have noticeably different appearances. In the NH, the earlym 20th century is at levels precedented in the MWP, while the latem reconstruction has higher values in the 20th century than in the MWP – BUT a marked divergence problem. This divergence problem results in a very low RE for the latem version (about 0), while the earlym version has an RE of 0.84. The earlym SH reconstruction has MWP values much higher than late 20th century values, while the latem SH reconstruction has MWP values lower than late 20th century values.
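For readers unfamiliar with the statistic: RE compares a reconstruction’s errors over the verification period against a naive baseline that always predicts the calibration-period mean. Here is a minimal sketch with synthetic data – my own illustration of the standard formula, not Mann’s code:

```python
import numpy as np

def reduction_of_error(obs, recon, cal_mean):
    """RE over a verification period: 1 - SSE(recon) / SSE(baseline),
    where the baseline always predicts the *calibration-period* mean.
    RE > 0 beats the naive baseline; RE = 1 is a perfect fit; a
    diverging reconstruction can drive RE well below zero."""
    obs, recon = np.asarray(obs), np.asarray(recon)
    sse = np.sum((obs - recon) ** 2)
    sse_baseline = np.sum((obs - cal_mean) ** 2)
    return 1.0 - sse / sse_baseline

# Toy illustration: a reconstruction tracking the verification data
# closely yields RE near 1; one that diverges yields a negative RE.
rng = np.random.default_rng(0)
obs = np.linspace(0.0, 1.0, 50) + 0.05 * rng.standard_normal(50)
good = obs + 0.05 * rng.standard_normal(50)
bad = -obs                     # diverges from the observations
print(reduction_of_error(obs, good, cal_mean=0.0))  # close to 1
print(reduction_of_error(obs, bad, cal_mean=0.0))   # negative
```

This is why a divergence problem in the verification period is so punishing for the latem RE: the squared errors of a diverging series quickly exceed those of the do-nothing baseline.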

Third, values of the latem RE statistic appear to be an important determinant of Mannian-style “validation” – more on this later.

As a fourth point – back in 2008, I’d noted that the M08 algorithm permitted the same proxy to have opposite orientations depending on the calibration period and that at least one proxy did this. Note the Socotra (Yemen) speleothem in the weight map. It has opposite orientations in the two reconstructions – something that seems hard to justify on a priori reasoning and which appears to have a noticeable impact on the differing appearance of the two reconstructions.

In the SH, there are obviously only a few relevant proxies. The big weight comes from Lonnie Thompson’s Quelccaya (Peru) data, with other contributions from Cook’s dendro series in Tasmania and New Zealand (the Tasmania series being an AR4 staple) and from a South African speleothem.

Other NH series include Baker’s Scottish speleothem (used upside down from the orientation in Baker’s article), the Crete (Greenland) ice cores – an AR4 staple, the old Fisher Agassiz (Ellesmere Island) melt series (used in Bradley and Jones 1993), the Dongge (China) speleothem and the Tan (China) speleothem. A number of these proxies have been discussed at CA.

Notilj Nobristle
One of the large issues in respect to MBH98-99 was the impact of bristlecones. Eventually, even Wahl and Ammann conceded that an MBH-style reconstruction did not “verify” prior to 1450 at the earliest without Graybill bristlecones. However, for the most part, the Team avoided talking about bristlecones, most often trying to equate no-bristle (or even no-Graybill) sensitivities with no-dendro sensitivity, over-generalizing criticisms of bristlecone chronologies into criticism of all dendro chronologies. M08 adopted the same tactic – discussing no-dendro, rather than no-bristle (which was the actual point at issue).

I’ve done experiments calculating M08-style CPS reconstructions, first no-tilj and then no-tilj no-bristle. At this point, no-tilj should be the base case for an M08-style result, as there is no scientific justification for including this data in an M08-style algorithm: it doesn’t meet any plausible criteria for inclusion.

Below are results for the no-tilj no-bristle case. At first glance, the shape of the recons looks fairly similar to the M08 case. In detail, there are some important differences: for example, the divergence problem in the no-tilj no-bristle latem reconstruction is much more pronounced than in the M08 reconstruction, where the huge ramp of the Tilj sediments and the bristlecones mitigates the divergence problem considerably.

These differences arise with relatively little difference in the relative weights of the other proxies.

Mannian Splicing
M08 has a unique methodology for splicing reconstruction steps – one which you definitely can’t read about in Draper and Smith. First they calculate RE statistics for the latem and earlym reconstructions. In the figure below, I’ve plotted latem and earlym RE statistics for the different steps under three cases:
(1) M08 from their archive, shown as a line
(2) no-tilj (using my emulation of M08) shown as + signs.
(3) no-tilj no-bristle shown as “o”. As noted above “nobristle” in this context only involves one series (nv512.)

This sharp decline in the latem RE statistic ends up affecting the rather weird M08 “validation” method. From the earlym and latem RE statistics, Mann calculated an “average” RE statistic – another ad hoc and unheard-of method. If the “average” RE statistic is above a benchmark that looks to be about 0.35 (note that this benchmark is a far cry from the benchmark of 0 used in MBH98 and Wahl and Ammann 2007 – one that we criticized in our 2005 articles – more on this on another occasion), the series is said to “validate”. If the addition of more data in a step fails to increase the average RE (and the average CE), then the earlier version with fewer data is used. This is “justified” in the name of avoiding overfitting, but it is actually an extra fitting step based on RE statistics.
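Stated as code, the “average RE” rule as I understand it reduces to a couple of lines (the 0.35 benchmark is the approximate value noted above; the function name and figures are my paraphrase, not Mann’s code):

```python
def mann_validates(re_latem, re_earlym, benchmark=0.35):
    """Mannian "average RE" rule as described above: a step "validates"
    if the mean of the two sub-period RE statistics clears the
    benchmark (~0.35 per the post), even if one sub-period RE is near
    or below zero."""
    return 0.5 * (re_latem + re_earlym) >= benchmark

# With latem RE ~ 0 (divergence) and earlym RE = 0.84, the average
# 0.42 clears the benchmark, so the step is deemed to "validate":
print(mann_validates(0.0, 0.84))   # True
# If removing series drives the latem RE negative, the average drops
# below the benchmark and "validation" fails:
print(mann_validates(-0.4, 0.84))  # False
```

The averaging means a complete verification-period failure in one sub-period can be papered over by a strong result in the other – which is exactly the behavior at issue in the no-tilj cases.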

In any event, the reason why the no-tilj no-bristle (and, a fortiori, the no-tilj no-dendro) version fails to “validate” prior to AD1500 or so is simply that the latem RE statistic becomes negative due to the divergence problem – without the Tilj series and nv512. (I haven’t studied the EIV/RegEM setup, but I suspect that the same sort of thing is causing its failure as well.)

In a way, the situation is remarkably similar to the MBH98 situation and bristlecone sensitivity. One point on which Mann et al and ourselves were in agreement was that the AD1400 MBH98 reconstruction failed their RE test without bristlecones. (And also earlier steps.) In the terminology of M08, without bristlecones, they did not have a “validated” reconstruction as at AD1400 and thus could not make a modern-medieval comparison with the claimed statistical confidence.

Ironically, the situation in M08 appears to be almost identical. Once the Tilj proxies are unpeeled, Mann once again doesn’t have a “validated” reconstruction prior to AD1500 or so, and thus cannot make a modern-medieval comparison with the claimed statistical confidence. (By saying this, I do not agree that his later comparisons mean anything; however, they don’t “matter” for the modern-medieval comparison.)

“First, there are obviously a lot of unused series in the M08 with-dendro reconstruction. Remarkably, the exclusions are nearly all dendro series. Out of 19 NH dendro chronologies, 16 dendro chronologies are not used; only three NH dendro chronologies are used: one Graybill bristlecone chronology (nv512) in SW USA; Briffa’s Tornetrask, Sweden and Jacoby-D’Arrigo’s Mongolia, all three of which are staples of the AR4 spaghetti graphs. Only one of 10 Graybill bristlecone chronologies “passes” screening.
In other words, nearly all of the proxies in the AD1000 network are “no-dendro” proxies. I.e. , the supposedly improved “validation” of the with-dendro network arises not because of general contribution of dendro chronologies to recovery of a climate signal, but because of the individual contribution of three dendro series with the other 16 series screened out”

I’m having trouble parsing what you mean here.

Steve Mc: We’ve heard a lot recently that the M08 reconstruction with tree rings is supposed to be all right – despite the no-dendro problems. The odd thing is that only a few tree ring series are added to the no-dendro network to make the with-dendro network: the M08 CPS methodology “screens” out the vast majority of tree ring series. There are only a few additional series in the with-dendro network, so why does it “validate” better than the no-dendro network? The answer is not immediately clear and I’m experimenting here.

What is clear is that the “improvement” in the with-dendro results doesn’t come from properties of the 19 NH dendro series potentially available in this step, but from the addition of only three series to the no-dendro network: a Graybill bristlecone, Tornetrask and the D’Arrigo-Jacoby Mongolia series – all of which are standard in AR4 reconstructions. Underneath all the calculations, we’re still getting a weighted average of the same old data.

Thank you; I’m very grateful for your work. I think it should be repeated that the immense amount of reverse-engineering you do would be unnecessary had the Teamsters archived data and methods, as is required for review and publication. How would a graph look using acceptable data, and using open-source methods one could read about in Draper and Smith? Flat, slight upward slope since the last Ice Age (now trending downward slightly), normal cyclical peaks and valleys over the millennium? Is there acceptable, calibrated, ancient data?

Steve - the archiving of data and methods for Mann et al 2008 was acceptable. The methods were weird but that’s a different issue.

Do I have this right?
Mann took many series of data. Picked those that fit the preconceived notion he had, even if they needed to be inverted. Scaled the ones left so that when added together they produced the plot he wanted. Or have I oversimplified?

A bit oversimplified. They didn’t have to match the full hockey stick. He picked out the ones that showed an uptick at the end. When you average together data that has an uptick at the end, the early part can average out to zero either way.
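This screening effect is easy to demonstrate with synthetic data – a sketch in which pure random walks (no climate signal at all) are screened for an end uptick; everything here is invented for illustration:

```python
import numpy as np

# Generate random-walk "proxies", keep only those with an uptick over
# the final "calibration" era, and average the survivors. The screened
# average shows a blade in the calibration era, while the
# pre-calibration portion averages out toward zero, as described above.
rng = np.random.default_rng(42)
n_series, n_years, cal = 1000, 600, 100   # last `cal` years = calibration era
proxies = np.cumsum(rng.standard_normal((n_series, n_years)), axis=1)
passed = proxies[:, -1] - proxies[:, -cal] > 0   # "screening": ends with uptick
screened = proxies[passed].mean(axis=0)
blade = screened[-1] - screened[-cal]            # rise over the calibration era
early_flat = np.abs(screened[:-cal]).max()       # early portion stays comparatively small
print(blade, early_flat)
```

The blade comes entirely from the screening criterion: the early increments are independent of the screen, so they average toward zero, exactly the “early part can average out to zero either way” point.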

I’m sorry Steve, but that’s exactly what it looks like when you back off and take the view from “10,000 feet”. All the statistical bafflegab is just there to confuse and disguise the fact. Until you can exactly define “validate”, for instance, you don’t “know” what he did.

The Tiljander proxies, and certain other ones as well, are best seen as “structured noise” without climate information–at least, without climate information, given the way they are employed in the Mann08 reconstructions.

So instead of looking at what happens to “the complete reconstruction” when Tiljander is removed, it might be helpful to look at it the other way.

Mann08 has a particular candidate reconstruction, and then tests it by adding “structured noise” proxies to see what happens. In the Tiljander case, the shape is a flattish noisy blade extending from 200 AD to ~1720, at which point it starts to ascend in an uneven, noisy parabola. Each of the four proxies has a somewhat different shape, with XRD having the least pronounced post-1720 rise.

One assumes that the addition of different-shaped noise would lead to different “test results”.

Steve: this is almost undecipherable. I’m ready to read something and this is not it. With all due respect, your opponents should not engage with this until you can state your points. Nor should outside readers make judgements.

You start by posing the whole concern in the context of debates with Gavin on blogs, rather than with the paper itself (in contrast, the Tilj post was much better). I mean who cares what Gavin says on a blog? I guess maybe y’all do. But I’ll get interested when you rebut the paper, not someone commenting on it, who’s not even an author and just in a blog.

You show graphs and then talk about other things, right away. I kept looking at the first graph and the paragraph after and trying to understand how that figure supported that paragraph. And hoping that later on, the figure would be discussed.

The figures are all vaguely labeled. I’m not being picky, Steve. It’s enough to have a tricky technical situation AND disagreeing parties, but to mix in poor figures and captions? Label every color, every line. Write a long figure caption. Read a book on how to do that. I wouldn’t normally get pedantic about this on a blog, but your figures are always headscratchers! I can follow Id or Watts or anyone else. Not you. For years. Yes, if I spend a bunch of time and read the entire article and then relook at the figures, going back very often, I can usually get it. But that’s unsat.

You don’t refer to specific sentences or figures in M08. In some cases, I’m concerned, you are calling something “the M08” when it’s not (can’t tell for sure).

The doing stuff in CPS is hard for me to understand, since only WMP EIV was touted.

The aha about 3 proxies being used vice 19 – I didn’t really even get your “aha” point. Not even arguing a technical point. Just couldn’t understand your logical discussion.

You refer to divergence, but don’t define it (qualitatively OR mathematically).

There’s probably other headscratchers.

Oh…you say you successfully emulated the Mann program…but then don’t show a difference plot for your code versus theirs.

Bunch of other places where you couldn’t help but sigh about some aspect of M08, but it’s not explained in this post and not supportive of the conclusions here. Doing that is taxing on the reader and unfair to the opponent. And it really makes the whole thing a meta-blogorrhea mess.

Oh…and “afortiori”. Drop the non-standard Latin. It’s pompous and distracting to the reader trying to figure out your graphs.

—————————

There might be some good points in there, Steve. The RE stuff sounded like you might be onto something. But a huge pain in the ass to read.

—————————

Oh…and before you dismiss me for being too rough on you or wanting to talk form over function, look at the comments you’ve had so far. Has a single person come to grips with any technical issues in your commentary? You’ve got some lower level attaboys and questions and Moshpit has a shorter “huh”.

By way of counterpoint; the Latin is fine, used appropriately, and in moderation. If you have difficulty understanding it, consult Google; if you have difficulty remembering it, write it on the back of your hand.

Personally I find your solipsistic demands that the post be simplified to pander to your idleness “pompous and distracting to the reader” trying to follow the discussion.

But I’ll agree with one thing; the early/late NH/SH sections of the graph could do with similar treatment to the weight maps, i.e. display at higher resolution.

A quick blink image / fading shows how the curves respond to the inclusion / exclusion of proxies quite clearly but, at least to my eyes, it’s more difficult to determine any *significant* difference at a glance from the separate figures.

“With all due respect, your opponents should not engage with this until you can state your points. Nor should outside readers make judgements.”

One of the principles of open science is ‘release early, release often’. It’s a fundamentally different approach from the ‘hide-the-decline-style’ science practiced elsewhere.

Steve is letting others know what he’s working on and that he’s found something curious. Others will now contribute. I think you could contribute in a more positive way if your criticisms were more specific.

Here’s an example of how specific criticism works in open science: in your comment you say, “[a]nd hoping that later on, the figure would be discussed”. This is not a complete sentence as it does not contain an independent subject in the main clause. In open science, I would let you know that there was an error, you would simply correct the mistake and thank me for pointing it out. In hide-the-decline-style science you would deny making an error and then try to discredit me by making a non-specific attack like ‘mpaul’s bizarre assertion is known to contain multiple fundamental errors in both fact and logic and is emblematic of the kind of rhetoric we see coming from the fossil fuel funded denialists’.

Scientist doesn’t believe in open science. He doesn’t believe in hide-the-decline science, but he is very tradition bound. Doesn’t like the seminar approach.

Personally I sense that Steve is DEEP in the bowels of this maze of Mann work and the shorthands he uses are tough to follow without pictures.

Steve: you’re right that I’ve spent a LOT of time on this lately and am coming up for air only from time to time. To use your phrase, working on Mann’s material is like sticking needles in your eyes. That’s presumably why realscientists don’t bother.

I think you make some worthwhile points in your criticisms of Steve McIntyre and Climate Audit. Probably my biggest “however” comes from considering the context.

Feynman in “Cargo Cult” paints an idealized picture of Scientists, constantly questioning their assumptions and challenging reigning theories, their own favorites, and others.

With respect to the dominant paradigm on the climate of the past millennium or two — “Hockey Stick” has been embraced by AGW Consensus scientists, so I’ll use that shorthand — I see a great deal of explicit and implicit acceptance of both the methods and the results within the Climate Science community. I see very little dissension from this orthodoxy (though there are a few people publishing “contrarian” pieces).

There are a few possible explanations for this Climate Science consensus on paleoclimate.

1) The science is settled. On careful examination, objections turn out to be poorly founded, or restricted to trivial issues.

Then there’s everything else.

2) Scientists are mostly engaged in their own subspecialty, so nobody is really looking critically at key aspects of the consensus.

3) The imperatives of the paleoclimate career track argue strongly against contrarian-ness.

4) Groupthink.

5) Whatever is slipping my mind at the moment.

McIntyre has burst uninvited into this happy area of climate science and upset it. We can provide our own descriptions and assign our own judgments to “upset it.” The description holds in a number of ways, some flattering to McIntyre and some not so.

Among the things he has demonstrated — at least to my satisfaction — is that the explanation that runs,”the science is settled” is false, in important respects.

In many ways, AGW Consensus paleoclimate reconstructions have more in common with Ptolemaic epicycles than with The Origin of Species.

Much as professional climate scientists may loathe McIntyre’s demonstrations that paleoclimate reconstructions rest on a foundation of sand, this is a great service to Science, as Feynman idealized it.

For this to be true, McIntyre doesn’t have to be a nice person, or be mostly or always right, or speak in iambic pentameter. If his Climate Audit posts are sometimes correct on important matters–and they are–that is sufficient to get scientists in the Feynman mold to take notice, jumping at the opportunity to correct, improve, and expand their work. Thesis, antithesis, synthesis.

That has not happened and is not happening.

The explanation of this phenomenon is important, and is elided by McIntyre’s critics–you included.

On to the specifics of “What McIntyre Done Wrong” —

McIntyre has offered that Climate Audit posts are more along the lines of a Journal Club presentation than a polished publication pre-print. That squares with what I see. Yes, he could do better, be clearer, organize his posts more clearly, label his graphs. And so much more.

But these plaints mirror many skeptics’ condemnations of RealClimate.org in the comments to Keith Kloor’s recent Collide-a-scape interview of Gavin Schmidt. “RealClimate would be a better site if its moderation policy was fair, and while I’m on the subject let me say…”

Gavin had the right of it — RC’s owners and bloggers are accomplishing their mission as they see it. It’s their printing press; they like the product and see no reason to change these policies. If you don’t like it, don’t read it. Better yet, start your own blog and attract your own audience.

Goose, gander. Subscriptions to CA cost no more and no less than subs to RC. What do you suppose McIntyre’s hourly wages for CA posts work out to? At RC, we know what happens to comments that begin, “I think that you’d better adhere to my following detailed advice on what to blog about, how to compose your posts, how to structure your arguments, when you may inject humor…” At CA, at least these remarks make it past moderation. But why should McIntyre’s response be anything more than, “Noted. Please follow your muse when you start your own blog.”

Much of the science under consideration at CA is clearly within your grasp and within my grasp — witness our productive exchanges on Tiljander in “The No-Dendro Illusion” thread. If you’re inspired to put in the time and effort to improve the analysis and explanations of the issues under consideration: there are surely productive avenues to pursue, now that you have composed and submitted a list of the shortcomings of the latest CA post.

You start by posing the whole concern in the context of debates with Gavin on blogs, rather than with the paper itself (in contrast, the Tilj post was much better). I mean who cares what Gavin says on a blog? I guess maybe y’all do. But I’ll get interested when you rebut the paper, not someone commenting on it, who’s not even an author and just in a blog.

Because what Gavin says on a blog, and what RC has been saying all along, end up being abused by media, environmentalists, Al Gore, etc as the gospel truth. Steve is saying there’s something they’re not telling you, that’s all.

Addendum1: Early CPS without dendro didn’t pass regardless. So who cares wrt no Tilj. I don’t see how what you are doing pressure tests what sentences in the paper or what figures in the paper. (Am open here, but honestly concerned, asking.)

Addendum2: The comment about the extra fitting step was interesting also.

Steve: whether anyone “cares” about this stuff is up to them. I’m trying to figure out how it works. It is not my view that RegEM/EIV is “right” in any sense. Or that this should preclude examining CPS. I’m trying to understand why these things do not “validate” prior to 1500. It’s interesting that both EIV and CPS fail “validation” at about the same time. I’m in a position to analyse the CPS breakdown, but not presently in a position to analyze the EIV breakdown. I suspect that reasons may prove parallel, but don’t know that.

Prior to this analysis, I didn’t realize that there were such large differences between the late-miss and early-miss RE statistics. It is by no means obvious that you can average these two to make one “passing” average. Maybe the late-miss failure should be interpreted as the model failing and the early-miss results as the RE equivalent of spurious correlation.

I guess he should just do the RE for the full period. I’m not sure that this would be exactly the arithmetic average, but it makes sense that it would be something in between the two values (of the sub-periods). The sub-periods overlap, and I have no idea of the exact algebra. But if one is a good match and the other a poor match and we look at the entire situation, what do we know? Something intermediate, no?

Steve: you can’t do an RE for the full period. You should know that by now. RE depends on a calibration-verification split. Burger and Cubasch insightfully observed that, if you use RE for choosing which “flavour” to pick, then this is part of the calibration process and no longer part of the verification/validation – an excellent and irrefutable point that Mann and others have simply ignored.

1. I thought that was your point about the extra fitting step?
2. If you average the two, you are imprecisely (since the algebra doesn’t work that way) just sorta doing RE over the full period.
3. I don’t really trust calibration/validation anyhow. Need true out of sample data to really validate (like the economists say, wait another 50 years, sucks…but so be it).
4. Screw validation. Make the calibration the best it can be. Break eggs, make omelette.

I understand validation: you take only half of the information for fitting, and judge the results (I guess some sort of weighting factors) with the other half.

Now, temperatures are well measured only in the latest decades, I guess. Fitting on 1850-1949 actually uses comparatively bad temperature data, compared to the satellites that we have now.

Would it not be more appropriate to fit the proxies on modern data and validate them on older ones? Moreover, one could compare the differences in the resulting weights by fitting with modern (e.g. 1940+) vs. old (1850-1949) data and discuss the robustness of the fitting.

One step further – do the fitting on stepwise more and more data (e.g. first 1850-1900, then 1875-1925, then 1900-1950, etc.), and see how the weighting factors change. If the fitting is robust, they should converge and stay pretty constant. Otherwise they will move “randomly”.
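That sliding-window check can be sketched in a few lines. Here correlation with temperature stands in for the weighting, and all of the data, proxy names and window choices are invented for illustration:

```python
import numpy as np

# Fit correlation "weights" over sliding calibration windows and check
# whether they stay stable, per the robustness check proposed above.
rng = np.random.default_rng(1)
years = np.arange(1850, 1996)
temp = 0.005 * (years - 1850) + 0.1 * rng.standard_normal(years.size)
proxies = {
    "signal": temp + 0.1 * rng.standard_normal(years.size),  # tracks temperature
    "noise": rng.standard_normal(years.size),                # pure noise
}
weights = {}
for start in (1850, 1875, 1900):
    win = (years >= start) & (years < start + 50)
    weights[start] = {name: np.corrcoef(p[win], temp[win])[0, 1]
                      for name, p in proxies.items()}
    print(start, weights[start])
```

A proxy carrying a real signal keeps a high, roughly stable weight across the windows, while a noise proxy’s weight wanders around zero – which is the convergence-vs-“random movement” diagnostic described above.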

Of course, that is less Auditing, and actually more something that the researcher may do…

Can Steve or someone else please clarify what is meant by “validated” and “skillful” in the context of these reconstructions? I presume these terms have precise meanings, but I’ve done scientific statistics for 20 years and I haven’t encountered them before.

Steve: these are somewhat idiosyncratic Team terminology. Wegman had never heard the term “skill” in this sort of context.

Mann doesn’t really define his usage, but in practical terms it means that the RE statistic is “significant”. There are no tables for RE significance. Mann appears to have done simulations that supposedly yield a “95% significant” benchmark of about 0.35. In Mann’s case, if the early-calibration (late-miss) recon has an RE of 0 and the late-calibration (early-miss) recon has an RE of 0.8, the “average” is 0.4 and this is deemed “95% significant”. Don’t ask me to justify this or provide a statistical theory for it.

Can Steve or someone else please clarify what is meant by “validated” and “skillful” in the context of these reconstructions? I presume these terms have precise meanings, but I’ve done scientific statistics for 20 years and I haven’t encountered them before.

For slow folk like me, RE is the reduction of error statistic, a non-standard technique defined and discussed here. And see David Stockwell’s fun toy model here, nicely illustrating some of the drawbacks of this approach.

As a side comment, Mann et al.’s habit of inventing novel and often half-assed statistical tests, and using them to “validate” their (apparently) pre-conceived notions, is one of the most repellent features of climate “science”. It will fall of its own weight… sometime.

I’m very curious about the appropriateness of using trees for this purpose. My understanding is that they’re dormant for pretty much half the annual cycle. They’re subject to the relative beginning of spring, which means they begin to renew each year when weather permits. We have no way to know when that renewal starts nor when it ends. There are other things that affect growth, such as availability of sunlight, water, and infestations. I’m pretty much convinced that tree rings are not capable of telling us what we wish they would.

Bitter cold winters, for example – what do they look like vs mild winters? How do tent caterpillar infestations look in the record?

It leaves me thinking this is interesting but not helpful information for analyzing paleoclimate. Particularly the summer vs winter weather conditions.

Well scientist, as a science student at university, I care about the “no Tilj”. In fact I care quite a bit.

I care that the Hockey Team has not fully released their data and their methods.

I care that they conduct their pseudo-science behind a wall built of closed networks, obfuscation, obstruction and double talk, making personal attacks on anyone who questions them.

If they would just release all the relevant data and code, we would not have to depend upon people like Steve McIntyre to decode the Hockey Team’s nonsensical mish mash of inverted logic and twisted statistics.

I care that there are people like Steve McIntyre who are trying to move the state of the art of Climate Science forward, despite the attacks that he continually suffers.

I care that the Hockey Team’s results are not repeatable by others, because if the results are not reproduced, it is not science. That’s the first thing we learn in our Year 1 Ethics of Science course.

“If it cannot be reproduced, then it is not science.”

When I graduate, I hope to pursue a career in Science. I care about the damage the Hockey Team is doing to the reputation and respect of the scientific community with their approach to their use of data, methods of analysis and ethics concerning healthy and rational questioning.

So, scientist, I care quite a bit about “no Tilj”, because it says everything there is to say about the level of openness, integrity and trustworthiness of the people on the Hockey Team.

In my career, I will be tainted by the actions of the Hockey Team, and I resent that.

Well said, orkneygal, agree 100% and I’m an old geezer over 50 years after graduating in chemistry. Although I’ve now forgotten most of the stuff I learnt then, I haven’t forgotten the need for professional and scientific integrity. Go get ‘em lass.

This may be slightly OT, but I continue to be bothered by the Team’s concept of weighting. Whether they weight PCs or reconstructions, I am concerned by the use of weighting without a clearly stated, and justified, criterion for the weighting.

It strikes me as surgical cherry picking: use a tiny bit of this, a bit more of that, most of this, and all of that.

I think this concept of “methodological cherry-picking” has had an airing here before. Every method / methodological choice will have its limitations; keep bumping up against those limitations in the right combination and who knows how unreliable the result may turn out to be?

This doesn’t *have* to be conscious, particularly, but if you “know” what result is “correct”, your bias will favour the magic combination that reveals the “truth”.

The weighting is actually required to make any sense out of the data. They call the initial weighting calibration, because they have to scale tree-ring widths, mollusk shells, etc. to degrees C; then they have to scale the proxies to the amount of area they cover (gridding, etc.).

However, you get no argument from me about the cherry picking, because of the methods they use.

If you only weight to convert the proxies to temperature and then to area covered, ok, but I don’t think that is what happens. I think (and for sure with PC methods) proxies are also weighted by how good their correlation is with temperature (local or global) during the calibration period–this is where mining for hockey sticks really gets its oomph.

Steve: one thing in favor of M08 relative to MBH98 is that the area-weighting in their CPS method prevents the bristlecones from running away with everything. My guess is that the EIV method (teleconnections) enables much greater weighting for bristlecones, but just a guess.
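The pipeline described in the comments above (standardize each proxy, screen on calibration-period correlation, average with weights, then rescale to temperature) can be sketched as a toy CPS composite. Everything here is illustrative: the function name, the screening threshold, and the weights are assumptions for the sketch, not Mann’s actual code or M08’s parameters:

```python
import numpy as np

def cps_composite(proxies, target, weights, r_min=0.1):
    """Toy CPS: standardize each proxy, screen on calibration-period
    correlation with the target, average with the given weights, then
    rescale the composite to the target's mean and variance."""
    kept, kept_w = [], []
    for series, w in zip(proxies, weights):
        z = (series - series.mean()) / series.std()
        if abs(np.corrcoef(z, target)[0, 1]) >= r_min:  # screening step
            kept.append(z)
            kept_w.append(w)
    comp = np.average(kept, axis=0, weights=kept_w)
    comp = (comp - comp.mean()) / comp.std()            # re-standardize
    return comp * target.std() + target.mean()          # "calibrate" to T
```

The rescaling at the end is what the comments call calibration: the composite inherits the calibration-period mean and variance of the instrumental target, which is also why post-averaging rescaling interacts with any correlation-based screening done earlier.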

Yep, good intuition. To be checked, of course, but makes sense. I also would be a little concerned about how the EIV handles a lot of proxies in the same location, which may be non-independent (for instance the 4 Tilj proxies, where LS and DS are really confounded with XRD). EIV seems a lot like old skool MBH in terms of the global teleconnection training and all that.

Really haven’t studied this stuff in detail, even in the “read a lot on CA” sense of studying in detail. So my “thoughts” are at the speculation/hunch stage. And are based on ignorance.

1. I’m sorta uneasy with EIV in general because of the global climate field. Mann talks about doing EIV with local proxy matches and hemispheric ones, but I kinda didn’t read to see that he tested everything…and I don’t think he really reports in detail. In general, my concern with EIV is similar to the concern with MBH98, that the regressions may make us prone to fishing and data mining. Intuitively, I think that some sort of direct area method or select proxy method is more likely to give us a higher quality Bayesian guess. However, it’s probably good that SOMEONE is trying these very complicated and more tenuous approaches. So if we didn’t have Mike, we would have to invent him. Just that we should take EIV* with a grain of salt.

2. The issue of flipping recons is just another way in which EIV is more fishing for signal, versus having a strong physical argument. So when we evaluate EIV, we need to keep that in mind. Doesn’t mean no one should try very complicated methods. Just we should keep the results in mind with that concern. And to the extent that we can somehow (I don’t know how) figure out if the aphysical math-heavy approach is better or the more conservative approach is, we should do so.

– snip – [SM – meandering complaining ]

3. The different periods, different signs is even more of a concern physically, but is basically similar in concept to the issue discussed above.

4. I guess if you feed enough grist into the mill, you should not be surprised that some of it gets turned upside down or even upside down in different periods. The question comes if you can just live with that and it evens out, or if you should set up additional screens for it. I don’t know.

*I’m not even a statistician, so when people use terms like CPS or EIV, I’m a little in the dark. Might as well call them red and green. I think perhaps the bigger issue is not the kind of regression, but that one uses the global climate field and one uses local. (Yes, he does talk about running local climate WITH EIV, but I don’t get the sense this is what he did for most of the actual reported results.)

2.5. Also, helpful to do some estimates of when and how flipping occurs, how much impact it has on the result, etc.

Steve – different orientations for the same series is exotic, but is in all steps because the Yemen speleothem is affected. See the change from red-to-blue. My surmise is that the switch of this one series has a noticeable contribution – perhaps the major contribution – to the latem and earlym reconstruction difference, and thus despite being an exotic circumstance it impacts results. It’s on my radar but not in the near term.

Craig,
I agree with you that a case could be made for weighting proxies based on their correlation with temperature. This would make good sense if the study uses a very large number of proxies. However, when the study uses a very small subset of available proxies, there has already been a significant weighting by exclusion. Further weighting of the members of a small subset should not be necessary if the criterion is temperature correlation. This leads me to the conclusion that Mann is firmly convinced that some data is “better” than the rest and thus should be weighted more strongly than the rest of the data.

Any weighting by correlation ends up with a suppression of historic signal. You might think that the ‘correlation’ with temperature would cause an amplification, but the suppression happens because of rescaling after the data is averaged.

There is no case to be made for weighting proxies according to correlation, because that is just the mathematical cherry picking which must be avoided. It causes a mess if you reject data which doesn’t do what you want.
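The “mathematical cherry picking” concern can be made concrete with a small simulation: generate pure red noise, screen on calibration-period correlation with a rising instrumental target (flipping signs as the method permits), and the screened composite acquires a calibration-era uptick that the underlying data never had. This is a hypothetical sketch with all parameters (AR coefficient, threshold, counts) purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(42)
n_series, n_years, cal = 1000, 600, 100   # last 100 "years" = calibration

# AR(1) "red noise" pseudo-proxies containing no climate signal at all
shocks = rng.normal(size=(n_series, n_years))
proxies = np.empty_like(shocks)
proxies[:, 0] = shocks[:, 0]
for t in range(1, n_years):
    proxies[:, t] = 0.7 * proxies[:, t - 1] + shocks[:, t]

target = np.linspace(0.0, 1.0, cal)       # rising "instrumental" series

# screen on calibration-period correlation, flipping signs as needed
r = np.array([np.corrcoef(p[-cal:], target)[0, 1] for p in proxies])
passed = np.abs(r) > 0.3
screened = proxies[passed] * np.sign(r[passed])[:, None]

composite = screened.mean(axis=0)
r_comp = np.corrcoef(composite[-cal:], target)[0, 1]
print(passed.sum(), "of", n_series, "pass; composite calibration r =", r_comp)
```

With these illustrative settings a substantial fraction of the pure-noise series passes screening, and the resulting composite tracks the instrumental trend during calibration while being flat noise before it, which is exactly the mechanism being objected to.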

Again, these proxies are scaled by standard deviation, how do you ‘weight’ them into temperature if you don’t do something?

I do not agree that weighting should be done based on degree of correlation with temperature. My preference would be to screen proxies–if no good drop them (but drop all of a certain proxy such as speleothems, don’t keep some and toss some unless metadata show a particular proxy to be compromised, such as Tiljander is). If ok, average them. This prevents cherry picking of red noise series.

I do not agree that weighting should be done based on degree of correlation with temperature.

I couldn’t agree with you more. However, I don’t think that it eliminates the use of correlation among the proxies themselves as a useful agent for weighting the proxies in forming a composite series which can then be linked to a temperature series. One would expect a common influence to possibly induce a correlation structure in a set of proxies.

IMHO, one of the shortcomings of the methodology used by Prof. Mann and others in the paleo business is the fact that they have basically ignored the relationships among the proxies themselves over time and between the individual proxies and the reconstructions (again, over time) when doing validation.

Roman
To combine proxies you still need a physical theory of why they should be combined – not just that they are correlated. In addition, suppose you have 5 proxies of the same type in a tight regional area. Excluding one or two of them because they do not correlate with the local temperature seems to me to inflate the calibration statistics. Surely you must decide ex ante on which proxies to include by virtue of your physical theory, then you can combine the proxies and test the combined proxy against the local temperature? Surely this is what you would have to do if you took multiple cores from the same tree?
The intercorrelations of the proxies then become a measure of the robustness of the particular proxy.

What I wrote does not disagree with any of your points that you need a physical theory to justify the use of particular proxies. For me, that would be a given.

If you have multiple proxies in a given area, there are still conditions spatially local to particular proxies which can make some of them react more or less strongly to temperature than others – for trees, soil type, moisture conditions, shade, etc. Even the variation due to which core you happen to get can be such a factor.

I did not say that proxies should be excluded, but rather that different weights could be used based on how well the proxies relate to each other, not only in the thermometer period but throughout the entire lifetime of the proxies. The proxies would be combined first and the end result calibrated to the temperature as in CPS. I have been looking at some methods for doing this.

Roman:
Thanks for the reply. I guess I am still uncomfortable with the notion of the weighting process ex post the calibration, unless there is a confirmation that the relationship with T is constant across the entire calibration period. What do you do if the relationship is much weaker in the second half of the period? Isn’t this one of the many problems with the weightings of the BCPs?

In my opinion, the relationship among the proxies has to be unchanged throughout the entire lifetime of the proxy set and not just within the thermometer era. This is consistent with the uniformitarianism principle as well as common sense. Deviations in the correlation structure would raise questions about the reconstruction and its relationship with T.

As well, IMHO, there is generally insufficient examination of the relationship between the individual proxies and the reconstruction itself. It is possible to estimate how strongly the recon depends on each proxy and to examine whether that dependency is the same in different time periods (similar to the concept of residual analysis in regressions) as an evaluation of the “robustness” of the reconstruction.

Steve: this is the essence of the Brown and Sundberg approach, that I’d discussed in quieter times. Proxy reconstructions are an interesting academic question – it’s too bad that academics in the field aren’t interested in academic questions.
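The dependency check described above, estimating how strongly the recon depends on each proxy and whether that dependency shifts between periods, could be sketched as a leave-one-out influence test. This is a minimal sketch assuming a simple-mean recon; the function name and split point are illustrative, not anything from M08:

```python
import numpy as np

def loo_influence(proxies, split):
    """Leave-one-out influence: for each proxy, drop it, recompute the
    simple-mean recon, and report the mean absolute change in the recon
    before and after `split`. An early/late asymmetry flags a proxy
    doing period-specific work in the reconstruction."""
    proxies = np.asarray(proxies, dtype=float)
    full = proxies.mean(axis=0)
    out = []
    for i in range(proxies.shape[0]):
        loo = np.delete(proxies, i, axis=0).mean(axis=0)
        diff = full - loo
        out.append((np.abs(diff[:split]).mean(), np.abs(diff[split:]).mean()))
    return out
```

A proxy whose early-period influence dwarfs its late-period influence (or vice versa) is the kind of non-robustness the comment is pointing at, analogous to residual analysis in regression.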

Like you, I do not agree that weighting can be done on the basis of correlation with temperature. This is because temperature is a proxy for other properties, like irradiance and heat content. As a hypothetical from dendroclimatology, consider trees near the tropics at high altitudes having the same average annual temperatures as trees nearer the poles at low altitudes. However, the growing conditions, such as day length, would upset comparisons between the two, irrespective of correlations with temperature.

I’m moving to the position where no proxy reconstruction is valid before the instrumented period. In part, this is because the instrumented period carries errors too large for calibration to be accurate over earlier times. I’m preferring the historic reconstructions noted in almanacs and the like. Seen this way, the alarm question is not so important.

Importance arises when the rate of a valid change in a global climate factor is greater than the rate at which it could be ameliorated, if that was deemed a desirable path. It is therefore important for Steve, your good self and your colleagues to show if or when needless alarm has been generated from old data inferences like proxies. However, the last 30 years have not produced alarm in my mind and 30 years is a relatively good subsample.

Yes, I’ve arrived at that position too. I’m not yet entirely convinced that no temperature proxy is useful for actual metrics prior to instrumentation, but I’m “moving forward” [:)] somewhat reluctantly because I’m really loath to disregard this evidence; the historical data does have some comparative use.

For example, there is a lake in Japan (I would need to Google “varve shales in Japan” to name it now), but the summer detritus is algal bloom, which only occurs at a known temperature.

Bit of simplification for you, scientist, as you are clearly struggling. Earlym and Latem are two separate reconstructions, with different validation periods. They produce widely differing output, and there is clearly a question mark over Mann’s method of producing a weighted average without any apparent justification for the weighting or even for the combination of the two with the unsatisfactory effect of blending the calibration and validation periods. Remember he has previous form for splicing measurement and projection data, so it is understandable if some posters here assume he has chosen his weighting in order to rescue recalcitrant data.
The accusations of cherry-picking are not off-topic, as, for those of us unimpressed by Mann’s commitment to statistical integrity as shown in the inversion of the Tiljander proxies and in some of the emails, his selection of non-standard methodology, and his foggy criteria for inclusion and exclusion of the 19 dendro datasets just amount to more of the same.

Also, I don’t think Steve said the exclusion of the 19 dendro sets was foggy. It followed from the stated rules. Maybe it was a surprise to him, but I don’t see how that was a contradiction of anything in the paper. If Steve thinks it’s an important result, he can show percent dendro of the recon as a function of time, Jeff Id style. And discuss it.

scientist and David S,
No, it was a good post all the way through. Some people tend to forget that Mann has admitted he is not a statistician. The only question I have of David S is related to “Remember he has previous form for splicing measurement and projection data…” I am aware of Mann splicing measurement and proxy data, but not projection data. Are you talking about a different paper or was that a typo?

There are multiple strange things going on in Mann’s methods. Thanks for peeling it back.
1) Going from earlym to latem, both NH and SH switch from a cooling trend from 900 AD to 1700 AD to a flat relation. This makes no sense and indicates a problem.
2) the averaging of the RE stat is very problematic and statistically meaningless
3) Using RE to screen effectively confounds RE (which itself is poorly known)
4) using Tiljander
5) switching the sign of the Yemen speleothem
6) others (such as the splicing itself, which seems totally unjustified to me).
If there is just 1 iffy item, you can evaluate the effect of this iffy item, but when there are this many, it just becomes handwaving.

Well this post isn’t about all of them. It’s not the post for re-arguing Tiljander or sighing about it. It’s the post for latem/earlym RE stuff.

I’m not clear that there is any splicing. There is stepwise reconstruction. IOW, you use the more abundant, more recent data to give a better understanding of more recent temp. And intuitively that makes sense TO USE IT, if you want the best guess (if you had to make a Bayesian “bet”) on the temperature. I guess maybe graphically, he ought to show on the line where the steps occur.

It’s not, actually. My point is about how you combine surveys that don’t map the same extent. I would bet some serious money that, even though I can’t express it mathematically, my fundamental insight is correct. We can get a real Bayesian in here to show it in math and such.

Imagine you landed on the East Coast of the US and wanted to know how Indian fierceness varied as a function of the E/W axis. You send out different expeditions. Four make it to the Appalachians. One to the Mississippi. One to the West Coast. Now, you don’t get to make any more expeditions. But you have to draw the best graph you can of Indian fierceness versus longitude. Would you ONLY use the one West Coast expedition? Would it give you the best “Bayesian bet” graph of the function? And heck, I’m understanding that the expeditions varied in time of year, quality of explorer, random chance of Sioux raiding party patterns, latitude crossed at, etc.!

If a Martian from space came down and said “Earthlings, give me your BEST estimate of global temperature from 1000-2000”, would you only use proxies that extend all the way back?

The truth is that we know the last 30 years by satellite pretty well. By thermometers for 100-200 years (with improved assurance as we get closer in time). And by proxies with less and less assurance as we go further back in time.

The way to represent that picture most effectively, to give the best picture on what we know about temperature is with some sort of trace that combines the more certain and recent series with the less certain but longer back series. As such, you get some “best guess” line going back from now until 1000 (or whatever the limit is)…and the uncertainty limits just get bigger as you go further back.

Steve: as so often, you have the most trouble with the things that you think that you know but don’t. Mostly because you talk a lot without doing the homework. If you are estimating MWP temperatures, then to estimate your confidence levels, the only relevant proxies are the ones that go back to the MWP. Splicing data that is irrelevant for the purpose disguises problems that may arise.

If you have inconsistent information about the tribes in the Rockies, then the fact that you have a lot of information about tribes in the Appalachians doesn’t resolve the inconsistency. The only way to do that is with better information – a point that I’ve made for a long time.

Hunches and intuition are of course valuable, a good starting point, but in science one must then TEST the hunches, not just assume them and never look back. That is what has been happening in paleo work–possibly (intuitively) reasonable approaches are adopted without ever testing them, or testing them too gently.

Kuhnkat: No. Actually I’m not. Really. These are separate issues and I appreciate your discerning the difference in them. Reread my post again, especially the part where I talk about widening uncertainty as you move back.

> Imagine you landed on the East Coast of the US and wanted to know how Indian fierceness varied as a function of the E/W axis. You send out different expeditions…

Imagine two groups landing on the East Coast. In sending out their expeditions into Indian Country, both make the same mistake… detailing an expedition to Southern Finland.

Group A, being Postmodern scientists, are pleased to include the Finnish findings: they fit nicely with the narrative on Indian fierceness that Group A already believes.

Group B, being disciples of Feynman, quickly recognize that Finland is inhabited by Finns not Native Americans (the concept isn’t hard to grasp, actually). They acknowledge that the expedition to Finland has to be removed from the analysis.

The boosters of Group A assure me that their group includes the Best and the Brightest–the Smartest Guys in the Room. I should definitely heed their analysis, and ignore Group B.

I dunno, though. Something doesn’t seem quite right with this picture. Even though I can’t quite put my finger on it.

First the paper was a reconstruction of temp over time. NOT a hypothesis test of your little pony.

Second, the way to combine poor evaluation of Indians in the Rockies with those in the Appalachians is to show a best guess curve and then show WIDER uncertainty levels in the Rockies. And no, duh, that higher knowledge about the Appalachians does not move the curve in the Rockies (except if it allows you to throw out some bad expeditions, maybe.)

Third, I asked you a WHILE AGO to specify exactly what you were talking about with splicing and such, so that we can at least come to grips with this issue. As we are doing here, imperfectly. You blew it off. I suspect you want to use the term splicing instead of the term stepwise reconstruction because of the connotation with the ‘splicing of instrumental temps into proxy series’ where there was false labeling (calling something a proxy that wasn’t). Of course this was a different issue.
snip
P.S. Please keep the Voice of God out of my rectangle.

I would advise the TCOesque blog participant(s) to take these exercises that Steve M performs for what this post is obviously meant to be: a sensitivity test. It is probably a lot easier for those of us untrained or less trained in the methodologies used in reconstructions to understand.

I sometimes judge that those who would question the validity of these sensitivity tests are simply doing what they feel comfortable doing, and that is to see some sense and validity in what the authors of papers publish in reputable journals, further because they themselves do not feel confident in passing judgment on the work. It therefore becomes an issue for these blog participants of whom to believe: a published document or someone criticizing it on a blog. The more difficult it is for these blog participants to understand the basics of these sensitivity tests, the more likely that, through sheer frustration, they will tend to question the blogger doing the sensitivity tests.

Of course there are those partisan participants on both sides of issues who will tend, without a reasonable understanding of the technical parts of the arguments, to simply agree with those whose conclusions they already share.

I think the weakness in the view that one must publish to truly and legitimately make a reasoned argument or criticism is this: if a blogger like Steve M were to make even a slight error, perhaps not one pertinent to what he was attempting to show, it would be jumped on by the authors and their defenders in a minute. The criticisms and counters to criticism will actually come faster in blogging than in peer-reviewed literature. Furthermore, for those who do not have good confidence in their understanding of the works and methodologies and therefore depend on point and counterpoint exchanges, that discussion happens on the blogs in something closer to real time.

I think what Steve M has done here allows a reader to get some good insights into what the authors of this paper actually were showing, and not what perhaps they would have preferred to show. It also tells us something about the “robustness” of the paper’s conclusions, and can provoke (eventually) some rather straightforward observations by the likes of a Gavin Schmidt that would, in my estimation, not otherwise have been made.

Re: averaging the RE statistic. Let us say we have a model for some population response. It is “validated” for men but not for women but we must apply it to a population combined. Can we average the RE statistic for the men and for the women and conclude that this average RE is ok so we can use it?
Steve: I think that this analogy applies. The idea of trying to “validate” a model through an “average RE” is very odd and it will take a while to think it through. This is another example showing the pitfalls of trying to develop “novel” statistics on a contentious applied statistical problem.

I like this post by CL as well. I believe this is from “Superfreakonomics”: Statistically speaking, the “average” adult human being has one breast and one testicle. How valid is that statistic in describing human beings?

The crux is you can eyeball the data and easily conclude that most of it is noise, there are some mwps, and some compromised hockey sticks, but there isn’t enough to give us a picture of the entire world – not even close. The Mannomatic II whittles down to about 20 weighted proxies. Imho there is no statistical procedure published or unpublished, novel or established, Bayesian or frequentist that can make a silk purse out of this sow’s ear. This of course is why nobody else did it before Mann and why Mann’s new methods should have been treated with much more suspicion.

If NH and SH have different RE, rather than averaging them, the proper thing to do would be to compute RE (or whatever) for the global data. I bet that if you do this the RE fails, vs the fake test of averaging the 2 RE scores.
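The point about averaging RE can be shown with a toy example, using the standard RE definition (1 - SSE of the model over SSE of a reference mean, here taken as 0 for simplicity). The numbers below are contrived purely for illustration: the averaged hemispheric REs "pass" (RE > 0) while the RE computed on the pooled data fails, which is the kind of divergence the averaging hides:

```python
import numpy as np

def re_stat(obs, pred, ref_mean=0.0):
    """Reduction of Error: 1 - SSE(model) / SSE(reference mean)."""
    sse = np.sum((obs - pred) ** 2)
    sse_ref = np.sum((obs - ref_mean) ** 2)
    return 1.0 - sse / sse_ref

# NH: a perfect prediction; SH: large errors on a higher-variance series
obs_nh  = np.array([0.5, -0.5, 0.5, -0.5])
pred_nh = obs_nh.copy()
obs_sh  = np.array([1.5, -1.5, 1.5, -1.5])
pred_sh = np.array([-0.3, 0.3, -0.3, 0.3])

re_nh = re_stat(obs_nh, pred_nh)          # 1.0: NH "validates"
re_sh = re_stat(obs_sh, pred_sh)          # -0.44: SH fails

avg_re = (re_nh + re_sh) / 2              # 0.28 > 0: the average "passes"
pooled = re_stat(np.concatenate([obs_nh, obs_sh]),
                 np.concatenate([pred_nh, pred_sh]))
print(avg_re, pooled)                     # pooled RE is negative: fails
```

Because RE is a ratio of sums of squares, averaging two ratios with different denominators is not the same as forming the ratio of the pooled sums, so a badly failing hemisphere can be masked by a well-fitting one.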

Why does early-miss NH have no divergence problem but late-miss NH has it big time? Is this because proxies with divergence issues simply failed early-miss validation so are not included in that recon at all?

Is this also the reason why early-miss NH recon looks far more like the -snip- Loehle recon, with a MWP that exceeds the CWP?

Is the answer to these questions known, or are the components of these recons unknown? It’s not clear from Steve’s text, to me at least.

I too would like to see almanac-type and historical evidence included – and “calibration” methods considered for these. And in all fairness, kudos for at least attempting to calibrate dendros. But we all know there are things like medieval and Roman grapes in northern England, more northerly ancient treelines, higher altitude farming, etc.

Steve: This is a blog. I’m more interested in the process of how things work and analysis of this methodology is a work-in-progress. There are many other sites where you can find short summaries, if that’s what interests you.