When is an Upper Confidence Limit a Lower Confidence Limit?

An odd question, you say. It’s not something one usually expects in a statistical study. But hey, this is the Hockey Team, with statistics by Frame of climateprediction.net.

Here’s Figure 1 of Hegerl et al. A typical Hockey Team spaghetti graph: lots of overlaid series, so it’s hard to see what any individual series is doing, but decent artwork. Here I want you to look first at the red line – the CH-blend "long" – together with the grey confidence intervals.

Supplementary Table 1: Reconstructions CH-blend short and CH-blend (long) for 30-90N decadal mean temperatures and their 2.5 to 97.5% uncertainty ranges (in separate file). The uncertainty range accounts both for uncertainty in the amplitude of the reconstruction, and sampling uncertainty (in separate file), the 2.5% and 97.5% amplitude uncertainty for CH-blend is 0.66 and 1.83 times the amplitude given in the best guess, that for CH-blend (long) 0.68 and 1.91 of its best guess amplitude.

Note first a curious discrepancy between the confidence intervals described in the SI (2.5-97.5%) and those shown in Figure 1 (10-90% ranges). I don’t know right now how they calculated these confidence intervals. (I still haven’t figured out MBH99 confidence intervals, nor has von Storch.) Perhaps the difference is a typo; perhaps it’s something that I don’t understand right now.

Anyhow I went through the exercise of plotting up the CH-blend long together with its upper and lower confidence intervals from Supplementary Table 1, yielding the figure below. The red line is the CH-blend, more or less matching the red line in Hegerl Figure 1. The blue line is the lower confidence interval (2 sigma) and the black line is the upper confidence interval (2 sigma). Generically, the shape of the blue line envelope seems to match the confidence interval envelope in Hegerl Figure 1, indicating virtual identity between the 2.5% confidence interval of Supplementary Table 1 and the 10% confidence interval of Figure 1.

However, there is something much more intriguing: look at the blue line in, say, the 17th century. Now look at it in the 20th century. Somehow the upper and lower confidence intervals have crossed over. I double-checked, and this is definitely in Supplementary Table 1 and not an error on my part. Maybe I’m misinterpreting something, but in all the statistical studies I’ve ever seen, it’s definitely unusual for the upper and lower confidence intervals to change places. At the crossover, they claim remarkably narrow bands.
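Whether the bounds really cross can be checked mechanically rather than by eye. A minimal sketch on synthetic numbers – the 0.575/1.622 multipliers and the "min 2 sd"/"pls 2 sd" labels are borrowed from the SI discussion in the comments below, purely for illustration:

```python
import numpy as np

def find_crossovers(lower, upper):
    """Indices where the nominal lower bound exceeds the upper bound.

    For any valid confidence interval this list should be empty.
    """
    lower, upper = np.asarray(lower), np.asarray(upper)
    return np.flatnonzero(lower > upper)

# Synthetic stand-in for the SI series: bounds built as fixed
# multiples of the reconstruction itself swap places wherever
# the anomaly changes sign.
recon = np.array([0.11, 0.02, -0.05, -0.30, 0.08])
lower = 1.622 * recon   # hypothetical "min 2 sd" multiplier
upper = 0.575 * recon   # hypothetical "pls 2 sd" multiplier
print(find_crossovers(lower, upper))   # flags the positive-anomaly years
```

An additive interval (e.g. `recon ± 0.1`) passed through the same check returns no crossovers at all.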

Plot of CH-long from Supplementary Table 1.

The same thing happens with CH-blend. The crossover gets washed out in the greyscale plot. If anyone can explain what’s going on with these confidence intervals, I’d be much obliged.

46 Comments

The upper and lower confidence intervals also swap places around the maxima near 1400 (probably all three of them), near 1550, and near 1750.

The confidence intervals don’t track the data with constant amplitude, either. Two sigma seems to vary with the year, and |-2 sigma| looks differently valued from its corresponding |+2 sigma| almost throughout the plot. Very peculiar. I don’t understand it, either.

It looks to me like all those confidence intervals are a scaled version of the data itself. In other words, if you take the data in standardized units (0 = mean, 1 = mean + 1 std. dev., etc.) and just multiply them by, say, 4, you get something like the blue line. If you multiply them by, say, 0.5, you get something like the gray line.

When the sign of the data changes, then what I described reverses. Visually, I think what I described is a very good match.

What do you get if you calculate upper_bound/data and lower_bound/data and graph it? Two straight lines parallel with the x axis?
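That ratio test is easy to run. A sketch with synthetic numbers (the 1.622/0.575 factors are hypothetical placeholders, not taken from the paper):

```python
import numpy as np

def bound_ratios(data, lower, upper, eps=1e-9):
    """Ratio of each confidence bound to the reconstruction itself.

    If the bounds were produced by simple multiplication, both ratio
    series come out flat wherever the data is meaningfully nonzero.
    """
    data, lower, upper = map(np.asarray, (data, lower, upper))
    mask = np.abs(data) > eps            # skip near-zero anomalies
    return lower[mask] / data[mask], upper[mask] / data[mask]

# Synthetic illustration: multiplicative bounds give two flat lines.
data = np.array([0.11, -0.05, -0.30, 0.08, 0.0])
lo_ratio, up_ratio = bound_ratios(data, 1.622 * data, 0.575 * data)
print(np.allclose(lo_ratio, 1.622), np.allclose(up_ratio, 0.575))
```

Plotted against the year, two constant ratio series would indeed show up as two straight lines parallel with the x axis.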

Conceptually, shouldn’t the +/- 2 std dev variation be determined for each individual data set before a principal components methodology is used to determine which data sets best represent a global trend [assuming, of course, that the PC method could be applied…].

Where a data set is comprised of, perhaps, 0.000001% of the trees that have been in a particular forest in the last 1000 yrs, it seems odd to allow a particular data set to be disproportionately weighted in the final output — and only subsequently try to make up for it by adding error bars.

I’m not a statistician, but I’d think that it would make more sense to run a series of high-low scenarios. EX: for three data sets of time-dependent values A, B, and C, each with their own std dev = s, PC would be done on the following data sets:

Assuming a robust PC method, the resulting “spaghetti graph” of outcomes would approximate upper and lower confidence, if not accurately, at least more realistically than simply multiplying the weighted outcome from small data sets by 0.66 or 1.83.

Steve, I think it is something like Nicholas (#2) said. If you take their SI data, then the 2nd column (“CH.long zon”) multiplied by 1.622 is almost perfectly their 3rd column (“min 2 sd”), the maximum absolute difference being 0.0126 (MAE = 0.0045)! The same way, if you multiply the 2nd column by 0.575 you get the 4th column (“pls 2 sd”), the maximum absolute difference being 0.0075 and the mean 0.0028!

Nicholas is right, I just tested it, the linear regression between the grey and red line is 0.576 +/- 0.001 and between the grey and blue line is 1.622 +/- 0.001. Note the nearly perfect linear relations.
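The zero-intercept regression behind that check can be sketched as follows; the series here is random, standing in for the SI columns, and the multipliers are the ones reported above:

```python
import numpy as np

def slope_through_origin(x, y):
    """Least-squares slope of y on x with the intercept forced to zero:
    b = sum(x*y) / sum(x*x).
    """
    x, y = np.asarray(x), np.asarray(y)
    return float(np.dot(x, y) / np.dot(x, x))

# Random stand-in for the reconstruction column. If the bound columns
# really are fixed multiples of it, the regression recovers the
# multipliers essentially exactly, with near-zero residuals.
rng = np.random.default_rng(0)
recon = rng.normal(-0.1, 0.15, 200)
print(slope_through_origin(recon, 1.622 * recon))   # ~1.622
print(slope_through_origin(recon, 0.575 * recon))   # ~0.575
```

A nearly perfect fit with negligible residual scatter is what distinguishes "the bounds are a rescaled copy of the data" from an ordinary, independently estimated error band.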

I just hope for Hegerl et al. that what they have put in the graph and the auxiliary material is simply a wrong dataset. It is the only reasonable explanation. Otherwise it doesn’t make sense: the data points that show no anomaly (value = 0.0) have zero uncertainty. You can also see that in the Excel file (years 1401, 1426, 1428, 1566, 1570, etc.).

re #9: I think #8&#9 are a nice example of independent verifications of the fact (#2) 😉

re #10: Well, another example of hockey team statistics.

re #11: From “Acknowledgments” of the letter:

Author Contributions G.C.H. developed and implemented the method to estimate sensitivity and to calibrate proxy records, T.J.C. provided the reconstruction of past forcing and developed the CH-blend reconstruction, W.T.H. performed the EBM simulation, and D.J.F. derived the prior estimate of sensitivity from instrumental data 1950–2000. G.C.H., T.J.C. and W.T.H analysed the results.

Thanks for this. The above diagnosis by Nicholas, Jean S and Jos is definitely right. It’s as though they estimated confidence intervals in a univariate regression with a 0 intercept by multiplying the fit by the upper and lower limits of the coefficient. This deserves a little thought. Compare this to the following text:

The uncertainty range accounts both for uncertainty in the amplitude of the reconstruction, and sampling uncertainty (in separate file), the 2.5% and 97.5% amplitude uncertainty for CH-blend is 0.66 and 1.83 times the amplitude given in the best
guess, that for CH-blend (long) 0.68 and 1.91 of its best guess amplitude.

Here’s a quick summary of coefficients obtained by linear regression of the various versions:

I just can’t understand the concept of these confidence limits. Multiplicative factors that lead to zero intervals at values of zero?? Completely absurd. Does it also mean that the entire probability function collapses at that point (not just the ±2 s.d. range)? Weird indeed.

Another question: what is with the massive skewness? Is that addressed in the paper? One thing is for sure: the skewed nature of the apparent probability distribution function is a boon for someone who might want to say that they are extremely confident that it was very warm a few hundred years ago, but it may have been a lot colder.

And finally, what is with the positive shift on the series? The Nature SI data confirms that the CH-blend (long) reaches a max of 0.11 around 1400. However, just eyeballing the graphic at the top of this thread clearly shows a max of more than 0.3 at that time. What am I missing here???

One thing is for sure: the skewed nature of the apparent probability distribution function is a boon for someone who might want to say that they are extremely confident that it was not very warm a few hundred years ago, but it may have been a lot colder.

Both figures are titled “Plot of CH-long from Supplementary Table 1.”. That confused me for a second there. I think the second one should read “CH-blend” not “CH-long”?

Yes, this all smacks of either a very silly mistake or someone who completely doesn’t understand what they’re doing. If a basic mistake like that was made, what other mistakes might there be in the data processing?

I’ve only just spotted this thread (having had a few beers with Lenny in the pub last night I thought I’d pop back here to climateaudit.org to see what was happening here). Just a factual correction: the Hegerl et al paper isn’t really “the Hockey Team, with statistics by Frame of climateprediction.net”. I had a pretty minor role, really – just getting the GHG attributable warming for the last 50 years (building on a GRL paper from last year). The interesting and (I think) significant thing about Gabi’s paper – and Von Deimling et al – is that they try to use historical data in conjunction with the recent past to try to place some bounds on climate sensitivity. This “multiple streams of data” approach (in James Annan’s nice phrase) is a new kind of tool in trying to quantify sensitivity. Personally, I think that there are quite a few issues to be ironed out (or at least discussed) with these techniques (independence/appropriateness of data streams and models; relative weights of the data sets used, etc.), but multiple data streams are a promising idea, at least. But that’s just a personal view, and not one I’d claim is necessarily widespread in the climate science community (as is the view that though equilibrium sensitivity will remain tricky to quantify, it doesn’t really matter, because there are other, more relevant things (like the transient response) that are easier to quantify).

Dave, Thanks for checking in. I’ll take a look at the GRL paper from last year.

I agree that using historical data to test for climate sensitivity is a good idea. I’m not convinced that sensitivity to CO2 forcing can be equated very readily to changes in solar forcing, but I’ve not investigated the issue. I’m certainly not convinced that Hockey Team proxies necessarily shed a great deal of light on past forcing.

Last year, I requested a list of the proxy data sets used in Hegerl et al. and got rebuffed. Unfortunately they are not listed in the Nature article. Could you provide a list of the sites?

At some point, I’d be interested in looking at attribution issues, and it would be nice if you and your coauthors made up an exemplary due diligence package for this paper so that people interested in the topic could follow through exactly what you’ve done with original data and code. This is impossible at present.

In passing, the confidence interval estimation in this article looks highly flawed. Can you explain this to readers here?

You say you’re “not convinced that sensitivity to CO2 forcing can be equated very readily to changes in solar forcing”. I’d be more skeptical about volcanic forcing giving the same sensitivity as GHG forcing. I know there has been a bit done on this (symmetry of response to different forcings) over the years. I know some folks in Reading have looked at it – you could try http://www.met.rdg.ac.uk/~aer/new_response.html and the papers they refer to. I’m sure there are other papers on the topic, too, if you want to burrow down into it.

As for the proxy data sets – I don’t have them. Not my thing, really. As a general rule I don’t release data or code without the P.I.’s approval, so I’m afraid you’ll have to try the Duke folks again. And as for the confidence interval thing – I’m afraid I’ve had my fingers bitten recently re: commenting on papers on which I was a co-author but not lead author, so I’m referring comments back to lead authors, these days.

But basically no one has actually quantified climate sensitivity by measurement?

Hi Louis,

It’s a curious thing. Basically we can observe things (in current climate) that seem to scale with lambda(=1/S) but not with S itself. This means if we get something a bit like a Gaussian on our present day observable, we get something a bit like an inverse Gaussian (with a fat tail towards high values) when we try to infer S. The implication is that we do a better job of ruling out low S than ruling out high S (using current data alone). As I said above, while I think that’s kind of interesting (and explains the very different upper bounds on S in published studies), it doesn’t necessarily matter very much.
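The asymmetry Frame describes can be illustrated with a quick Monte Carlo: push a roughly Gaussian uncertainty on lambda through S = 1/lambda and the upper tail of S fattens. All numbers here are hypothetical, chosen only to show the shape effect:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical feedback parameter: Gaussian-ish observational uncertainty
# (mean 1.0, sd 0.25, arbitrary units).
lam = rng.normal(1.0, 0.25, 100_000)
lam = lam[lam > 0.2]            # discard unphysical near-zero draws

S = 1.0 / lam                   # inferred sensitivity

median = np.median(S)
upper_tail = np.percentile(S, 97.5) - median
lower_tail = median - np.percentile(S, 2.5)
# The distance from the median up to the 97.5th percentile comes out
# far larger than the distance down to the 2.5th: high S is much
# harder to rule out than low S, exactly as described above.
print(upper_tail, lower_tail)
```

This is why a symmetric error bar on the observable turns into a skewed, fat-upper-tailed bound on sensitivity, and why published upper bounds on S differ so much more than the lower bounds.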

That’s fine. But you really have a duty to pursue the matter yourself. Your name is on the paper. We don’t need an answer from you. But you ought to be bugging your co-authors for answers (internally). Either that, or you should not list yourself as a co-author. Just take acknowledgements for your services instead.

#23. It sure looks to me like there’s a god-awful clanger in the confidence interval calculations which is a corrigendum waiting to happen. By reading this, you’re now aware of a potential problem in a paper that you are a coauthor of. In similar circumstances, I would be proactive and seek reassurance from my coauthors that the matter was not a problem; if it was a problem, I would request that they issue a corrigendum voluntarily.

Bender: well, it’s not like someone dropped a decimal place or anything like that. They actually used the wrong function to calculate the error. They’re using a function like “error_range = [reconstruction_value / 1.1, reconstruction_value * 1.1]” rather than what I would imagine is more correct, something like “error_range = [reconstruction_value - 0.1, reconstruction_value + 0.1]”. Hence the crossover problem: if the value of the reconstruction changes sign, their error range bounds swap around. But this indicates to me that they did not calculate the error correctly at all. If so, how much weight can we place on their conclusions?

This seems like more than just a simple mistake to me. It seems.. fundamental.
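Nicholas’s contrast between the two kinds of interval can be made concrete. A toy sketch using his own illustrative 1.1 and 0.1 figures (nothing here is from the paper itself):

```python
def multiplicative_range(x):
    # What the crossover pattern suggests was done: scale the value itself.
    return (x / 1.1, x * 1.1)

def additive_range(x):
    # Conventional interval: the value plus or minus a fixed error term.
    return (x - 0.1, x + 0.1)

for x in (0.2, 0.0, -0.2):
    m_lo, m_hi = multiplicative_range(x)
    # When x turns negative, the multiplicative "lower" bound ends up
    # above the "upper" bound, and at x == 0 the interval collapses to
    # zero width; the additive interval stays ordered with constant
    # width throughout.
    print(x, (m_lo, m_hi), "ordered" if m_lo <= m_hi else "crossed",
          additive_range(x))
```

The sign-flip swap and the zero-width interval at zero anomaly are both inherent to the multiplicative construction; no valid interval estimator behaves that way.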

The dividing/multiplying rather than adding/subtracting might be a simple error or a sign of poor ability. In any case, it makes one want to examine that entire part of the calculation more carefully. If someone messes up dividing instead of subtracting, you want to check that they know how to calculate the SD as well.

But I’m not trying to dwell on the error or make it a gotcha game. My real chapped ass comes from them NOT CORRECTING the simple error. People in this game seem way too defensive, and it inhibits clear corrections. And that… is just morally wrong.

I meant “simple” in terms of “attributable to a single misstep in the calculation”, “solvable”, and “not requiring a heck of a lot of commentary”.

The question asked in the title of this thread was “does this make sense?” You could try to figure out the relationship between the two curves, but my point is that reason alone tells you there is a problem: confidence intervals CAN’T cross over, because error is always positive and a confidence interval is a mean ± error. Crossover is ALWAYS a sign of a calculation error. No need to speculate further. You need to know the calculated SDs and whether the method used to calculate them is correct. Then you will have the correct confidence interval. THEN you can proceed to consider the altered significance of the findings. Get on with it.

Why always put the cart before the horse? Waste of time & energy.

“Fundamental” error? What does that mean? That they don’t know how to add? Or calculate a SD? More likely that they are simply not putting the energy into preparing supplementary material that you wish them to. That’s not a fundamental “error”. A fundamental difference in culture perhaps. They will learn … when they recognize they are being subjected to a higher level of scrutiny than ever before.

bender: Well, OK, I don’t want to start an argument over it. I just can’t think of what simple mis-step could have been taken to cause a problem like this one. Getting the confidence interval wrong in some minor way, OK, I understand that. But can you explain to me what you think they could have done wrong to get this kind of a result?

Re #35: Copy-and-paste error moving columns around in Excel? I have no idea. All you have there are three columns of data, and no equations linking them. That’s not a spreadsheet, it’s a list. Somewhere, there’s a spreadsheet (or Fortran code) that holds the answer. Meanwhile, I’m at a disadvantage on the specific importance of this, not having actually read the paper. So I’d best get it.

I think it is wise to keep on them on the issue of confidence envelopes on these time-series, by the way. Strategic, even. These are very hard to generate with any kind of credibility. They love to avoid the issue because the more honestly you look at the data, and actually model the uncertainty, the wider they get.

Most of these guys like collecting stuff in the field. They don’t like consulting with statisticians so much. You can well imagine the thrill of working for weeks with guys like Wegman (no slag intended) to get that confidence envelope ‘just right’. It’s like a detention instead of recess.

bender: ah, great points. Yes, I think if it is an error, it must be a substitution error, because I just can’t think of how it’s possible to make a mistake with a valid confidence interval formula and come up with something that looks like that…

On Hegerl et al. (2006).
The abstract is misleading. “After accounting for [SOME OF] the uncertainty in reconstructions and estimates of past external forcing, we find an independent estimate of climate sensitivity that is very similar to those from instrumental data.”

It was me that inserted the qualifier “SOME OF”. This is based on their own p. 1031 statement “We also note that model uncertainties (beyond those we account for) potentially affect all estimates of climate sensitivity.”

Question: When are they going to get around to including uncertainties BEYOND THOSE that paleoclimatologists typically choose to account for?

Question: Why wait to do this? Why not do it before policy gets carved in stone?

*Supplementary Table 1 and its heading have been replaced on 5 February 2007. This is a corrected version of the previous Table S1, in which only the scaling uncertainty was included; in the corrected Table S1, scaling and sampling uncertainties are now included.

However, there is only a single series containing year, mean, and upper and lower confidence levels. Since this series starts in 1505, my best guess is that this is only CH-blend, while CH-long has disappeared.

Happily, the confidence bands make more sense now – no crossover there. And a little regression and plotting shows that the upper/lower confidence limit no longer are simply multiples of the mean.

It’s so laughably and obviously wrong – recall Phil Jones’s alleged recent statement that past climate is unknowable with much precision – how could the original figure possibly have passed review? The most reasonable guess is confirmation bias among the authors, reviewers, etc.: “Yeah, sure, looks about right, makes for a great story.”

Q. Had the correct confidence intervals been used in the original graphic, would this have killed the manuscript on Nature’s doorstep? My guess is: yes.