"Worse than We Thought"

I’ve just collated the ERSST v3b and ERSST v2 versions for the tropics and compared them. CA readers were undoubtedly ready for adjustments, but, using the technical terminology of the leading climate journals, the adjustments were “worse than we thought”. ERSST v3 lowered SSTs before 1880 by up to 0.3 deg C.

Figure 1. Tropical SST: ERSST v3(B) less ERSST v2

The reference article for ERSST v3 is here, which states in the abstract:

Improvements in the late nineteenth century are due to improved tuning of the analysis methods

Update: I’ve amended this article in light of Ryan O’s comments, with which, on further reading, I agree. The difference between ERSST2 and ERSST3 shown in the above graphic is related to the information in SR Figure 2 shown below. You can sort of see the 0.3 deg C shift in the 19th century in this figure, though it’s not explained. SR Figure 2.

The changes that precipitate the differences are changes in parameters for the Smith-Reynolds “extended reconstruction of SST” (ERSST) algorithm, summarized in the Table below: SR08 Table 1.

What is a bit surprising, I guess, is that 19th century SST estimates are so volatile with respect to the parameters. The analysis reported uses a hurdle of 2 months per year (was 5), 2 years per LF (was 5), and a 25 deg area (was 15 deg).
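Just to illustrate the kind of effect the first parameter alone can have, here is a toy sketch (made-up counts, not the ERSST code): in a sparsely observed 19th-century grid cell, dropping the hurdle from 5 to 2 observed months per year admits many more years into the analysis.

```python
import numpy as np

# Toy sketch (not the ERSST code): how the "months per year" hurdle changes
# which sparsely observed grid-cell years are admitted. Counts are invented.
def qualifying_years(months_with_data, hurdle):
    """Return a boolean mask of years with at least `hurdle` observed months."""
    return np.asarray(months_with_data) >= hurdle

rng = np.random.default_rng(0)
counts = rng.integers(0, 6, size=40)  # simulated sparse per-year month counts

n_strict = qualifying_years(counts, hurdle=5).sum()  # old parameter
n_loose = qualifying_years(counts, hurdle=2).sum()   # new parameter
print(n_strict, n_loose)
```

With sparse coverage, the loose hurdle always admits at least as many years as the strict one, which is presumably why the early record is so sensitive to this choice.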


In other news, either Nature or Science or Reader’s Digest, I forget which, reported that the newest climate models are “better than we thought”, noting a “remarkable” low-frequency fit between ERSST v3 and GFDL Model CM2.1.

.
*rolls eyes*
.
I would say that you have to be kidding me . . . but I know you’re not.

I think that’s what it says. But it’s possible that they’ve done something else. It’s a typical climate science article – no Supporting Information providing details or source code for exactly what they did.


And I suspect that any requests for this information will be stonewalled. Which would leave us with nothing more than an unsupported assertion shielded from the scientific method.

I have a question. We recently learned that the USHCN temperature data is considered to be accurate despite the fact that 90% of the monitoring stations are sited in ways that fail to meet basic scientific standards. This is because they use a super duper, all-knowing computer algorithm which fixes all the bad data. I assume that the details of how the algorithm works have not been made available for review by outsiders. Is that correct?

If so, that would be another failure of the scientific method. Unsupported assertions unfettered by the inconvenience of real science.

Steve: To my knowledge, USHCN has not published source code; their adjustments are described in cursory publications. While I obviously object to the lack of transparency, readers have to continually remind themselves, as I remind myself, that lack of transparency does not in itself invalidate the results. As to temperature increases, there is considerable evidence of temperature increase. The increase is not a concoction of USHCN adjustments. Having said that, I think that USHCN would do themselves and everyone a service by simply archiving all their methods and code in a completely transparent way and letting people check their adjustments.

I’m still trying to get an understanding of the facts, but I think we are talking about different (albeit related) “adjustments”. I believe that before Anthony Watts started his project, the daily temperatures for each site were adjusted before recordation by a method that used the temperatures from other sites of various proximities. Anthony’s project then began to demonstrate that the vast majority of sites had been producing bad data. Even assuming that it is possible for surrounding sites to be used to “correct” the data of a bad site, they now had a much bigger problem — if 90% of the sites are bad, and they don’t know which ones are bad, or by how much and under what conditions, it would appear that they are trying to use bad data to correct other bad data. Yet, weren’t we recently told that their algorithm can take all this garbage and weave it into gold? That’s one heck of an achievement that serious people ought to take with perhaps a tad bit of skepticism. Without resort to the code, that’s not possible. We have an assertion of an extraordinary capability without anything to support it. Why would any scientist trust it?

And the biggest issue with this is not whether temperatures have risen or not (although obviously the extent does matter). The issue goes to scientific competence. Quality scientists not only avoid the pitfalls which come from abandoning the scientific method, they also refuse to rely on the work of those who do. The biggest outrage regarding the hockey stick was not the shoddiness of the work. It was the acceptance by other scientists of a study which overturned all previous scientific understanding on the subject without so much as a raised eyebrow (no audit, no replication, no nothing). It would be an outrage, even if Mann’s work had eventually turned out to be flawless.

It isn’t that the lack of transparency automatically makes the work wrong. It’s that it makes the work and the asserted “findings” untrustworthy. It’s a failure to use the scientific method. No scientific method? Then it’s not reputable science. It hasn’t reached the threshold that quality scientists ought to demand, if it is going to be considered reliable.

Re: stan (#24),
“Steve: To my knowledge, USHCN has not published source code; their adjustments are described in cursory publications. …”

The problem is that this, while it may not invalidate results, takes the discourse out of the realm of a scientific discourse. One of the immensely harmful confusions of our time is the confusion of scientific and legal argument. Denying information to a “jury” is legal argumentation. Denying information to your scientific peers is a practical guarantee of being embarrassed. The fact that you might be well-intentioned and convinced of the correctness of your interpretation is no defense against shutting out your peers from relevant disciplines and refusing the advice of experts where your own understanding is less than expert (e.g. most scientific investigators vs. professional statisticians).

OK, it wasn’t Reader’s Digest. It was International Journal of Climatology. One of Stephen Schneider’s … debaters, Ben Santer, wrote:

In summary, considerable scientific progress has been made since the first report of the U.S. Climate Change Science Program (Karl et al., 2006). There is no longer a serious and fundamental discrepancy between modelled and observed trends in tropical lapse rates, despite DCPS07’s incorrect claim to the contrary. Progress has been achieved by the development of new TSST , TL+O, and T2LT datasets …

One of the ways that this discrepancy was “resolved” – and we’ve not talked about this yet – was to eliminate the comparison of satellite temperatures to GISS and NOAA surface temperatures. CCSP had used GISS and NOAA. Santer eliminated them, replacing them by the “independent” series: ERSST v2 and ERSST v3. Santer:

The other two SST products are versions 2 and 3 of the NOAA ERSST (‘Extended Reconstructed SST’) dataset

GFDL CM2.1 was one of the models in the Santer comparison. You’d think that it must have done pretty well in a low frequency comparison of models and observations.

Seems like “The Ministry of Truth” have been hard at it again. In Newspeak this might read as “NOAA ERSST results malreported doubleplusungood-rectify”
To paraphrase the great George Orwell “He who controls the data, controls the past. He who controls the past, controls the climate.”

This is what is called “generating data”. I used to regulate nuclear power plants, and we once had one of the reactor vendors try to do this, because it was not possible to measure some important parameters during an experiment. They used a sophisticated CFD model to “generate data” that was used to validate a large systems model.

I am trying to grasp what this new work, even if valid, would really tell us. The key question surely is the AGW issue. The old and new tropical SST estimates differ prior to 1880, but our greenhouse emissions did not begin to rise substantially until well after this, from around the mid-20th century. There would seem to be little change in the SST estimates for that period.

Ah, but they say they use “improved statistical methods” and that “Most of the improvements are justified by testing with simulated data”, so how could there be any doubt?
So the adjustment for 1880 is -0.3C. And what is the v3 SST anomaly for 1880? Yes, about -0.3C.
A useful project would be to collate all these examples of ‘adjustments’, including
1. Land temperatures
2. Sea surface temperatures
3. Satellite troposphere temperatures (adjusted upwards)
4. Sea level rise (eg Cazenave 2008, raw satellite data shows no rise, until a ‘correction’ obtained from modelling, is applied).
Other examples?

They are trying to crumple, spill coffee on, and generally distress the TANG memo saying Bush was AWOL, to make believe it was written on a seventies-era typewriter, when everyone knows there was no such thing as a Times New Roman font with kerning before Windows 98. (Just on the off chance there is someone reading Climate Audit who would like a non-climate-world analogy.)

Isn’t this an attempt to conjure the tropical troposphere AGW fingerprint?

On this and other famous blogs I have made the accusation that some of the fine texture wriggles in atmospheric CO2 levels are added cosmetics to make the graphs more believably like the Mauna Loa data. It is hard to imagine such fine detail preserved from Barrow to the South Pole (where suddenly the phase shifts 180 degrees or so) when CO2 is described as a well-mixed gas.

Here we have a worse problem: the graph shown first above is derived from a synthesis of low-frequency trends and an added set of high-frequency wriggles. As admitted,

To make our study more realistic, HF variations from observations are added. The HF observations are from a combination of the optimum interpolation (OI) SST over oceans and from the Global Historical Climate Network (GHCN) over land, for the recent period (Reynolds et al. 2002; Peterson and Vose 1997).

Further, from Smith 2008-

Thus, the same HF anomalies are repeated with a 10-yr cycle over the 1860–2000 period. To simulate random errors in the test data, the variance from the base period is scaled by random noise to-signal variance ratio estimates.

Is this science, or is this visual graphics with the intent to deceive? Ask that question of the poor statistician who one day uses these values for a power spectrum analysis, without knowing that they are partly synthetic.
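For what it’s worth, here is my reading of the construction in the Smith 2008 quote, as a toy sketch. The function name and the noise ratio are mine, not the paper’s; and note that repeating the same decade verbatim obviously builds strong temporal autocorrelation into the test data.

```python
import numpy as np

# Toy sketch of the test-data construction as I read Smith 2008: a decade of
# observed HF anomalies is tiled across the 1860-2000 test period, with the
# variance scaled by a noise-to-signal ratio. Names and the 0.5 ratio are
# illustrative assumptions, not values from the paper.
def tile_hf(hf_decade, n_years, noise_to_signal=0.5, seed=0):
    """hf_decade: 120 monthly HF anomalies for a 10-yr base period."""
    reps = -(-n_years * 12 // hf_decade.size)        # ceiling division
    tiled = np.tile(hf_decade, reps)[: n_years * 12]
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, np.sqrt(noise_to_signal) * hf_decade.std(),
                       tiled.size)
    return tiled + noise

hf = np.sin(np.linspace(0, 20 * np.pi, 120))  # stand-in for a decade of HF data
series = tile_hf(hf, n_years=141)             # 1860-2000 inclusive
print(series.size)                            # 141 * 12 = 1692 months
```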

As a point of clarification on piling on here: as always, I don’t want people to editorialize about policy and that sort of stuff. Of course, there are consequences – that’s why we’re interested. If people want to object to the mixing of models into data, that’s OK; I just don’t want to have a lot of commentary about the world economy or generalized comments about AGW. I understand the concerns, but editorially they can easily make every thread look identical – so try to connect things to this particular data set.

This has been going on for a long time now. We don’t even know how much the raw data has been adjusted.

Ocean SSTs were always a problem to start with since the records are so sparse especially for the earlier time periods and for areas outside the main shipping routes.

But we have no idea how much the global temperature record has been “up-trended” by the adjustments. At least, the NCDC has confirmed the trend in US temps has been adjusted upward by 0.425C since about 1920.

Effectively we have to rely on just the satellite temperatures from now on.

If the purpose were to reconstruct SST in the past “just for basic science” then the circularity would not exist. BUT if you use models to reconstruct the SST and then test the models against this reconstruction (for purposes of proving global warming)…why is that not obviously a problem?

Something else that I’ll try to get to sometime – remind me if I forget – the GISS TRP data set is a bit of an outlier relative to CRU, NOAA and ERSST. It would be worth checking to see if GISS models verify better against GISS TRP data (and HadCRU against HadCRU TRP data.)

The article doesn’t prove the efficacy of their new changepoint adjustments – it cites their own article: Menne and Williams, J Clim, in press, as statistical authority. What bothers me, here as with Mann, is the use of some hot-off-the-press and poorly understood statistical methodology developed in-house for a controversial applied result.

I haven’t seen the paper in question yet – maybe it’s brilliantly documented and an R package implementing the procedures has already been archived at CRAN. But I’d be astonished if that were the case.

Perhaps I’m missing the point somewhat, but from the way I read your graph, these latest adjustments cool the LIA, and don’t impact much else.

One can quarrel with the methodology, or the serial check-kiting of the references, but in the large scheme of things, this doesn’t really change much, does it? It’s not like that graph just happens to track CO2 emissions or anything like that.

You’re on the right track here… but the key is that this is non-CO2 sensitivity. If we take it at face value (and I’ll leave it in the capable statistical hands of Steve, Ryan and Jeff to show how bad an assumption that is), then there’s a huge warming that cannot be attributed to CO2.

Eyeballing figure 2, from 1890 to 1950 the anomaly went from -0.6 to -0.15, a rise of 0.45°C, or about 0.075°C/decade. From 1970 to 2000 it went from -0.2 to 0.05, about 0.083°C/decade.
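For the record, the per-decade arithmetic on those eyeballed endpoints:

```python
# Change in anomaly over the span, expressed per decade, using the endpoints
# eyeballed from figure 2 above.
def rate_per_decade(a0, a1, y0, y1):
    return (a1 - a0) / (y1 - y0) * 10

early = rate_per_decade(-0.6, -0.15, 1890, 1950)  # ~0.075 C/decade
late = rate_per_decade(-0.2, 0.05, 1970, 2000)    # ~0.083 C/decade
print(round(early, 3), round(late, 3))
```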

So fine, say it was cooler back in the late 1800s, but please explain how something that was completely natural back then is now obviously the result of man.

I see little purpose in trying to use CO2 impact as a metric for every discussion.

What we’ve learned here is something quite different – that seemingly innocuous parameter selections in ERSST have a large impact on SSTs before 1880. Why is this? What is it exactly in the interaction of data and parameter selection and methodology that leads to this enormous and unexpected impact? And what are the lessons of this for related applications of the methodology (which include studies that we’ve spent a lot of time on: Steig et al 2009 and Mann et al 2008)?

But the “worse than we thought” comment that they used implies just that: CO2 is a much worse problem than we even imagined it was. I was just pointing out the flaw in their logic.
Steve: “They” didn’t use ‘worse than we thought’ in this context. I used it sarcastically to comment on the methodology. And until we understand the instability in the SST estimates from parameter selection, it’s pointless speculating on its downstream impact and I don’t want to spend time or energy on such speculations at this time or on this thread.

Hm.
.
I read their method descriptions, and I do not believe they incorporate any model data into the observations. My read of their methods is the following:
.
1. They extract the LF component from the model.
2. They sample the LF component at various rates to determine the MSE associated with each rate.
3. Based on this, they determine the minimum sampling rate below which the reconstruction algorithm will damp the values toward a zero anomaly.
.
They then perform a similar analysis using synthetic HF data based on actual observations from 1982-2001.
.
I do not believe any of the model data is then reincorporated into the observations. The point of the study seems to be focused on determining the sampling cutoff below which the reconstruction error widens dramatically. In the case of V2, it had a higher cutoff, resulting in the pre-1890 reconstruction being damped more than V3.
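A toy version of the sampling test, as I read Ryan’s summary (my own stand-in code, not the SR08 procedure): take a smooth LF signal, observe only a fraction of points, let the “reconstruction” damp unsampled points toward a zero anomaly, and watch the MSE grow as the sampling rate falls.

```python
import numpy as np

# Toy sketch of the LF sampling/MSE test described above. The damping rule
# (unsampled points go to zero anomaly) is a simplification of what the
# reconstruction actually does to sparse regions.
def damped_recon(signal, frac_sampled, rng):
    mask = rng.random(signal.size) < frac_sampled
    return np.where(mask, signal, 0.0)   # unsampled points damp to zero

def mse_vs_sampling(signal, fracs, seed=0):
    rng = np.random.default_rng(seed)
    return [float(np.mean((damped_recon(signal, f, rng) - signal) ** 2))
            for f in fracs]

lf = 0.3 * np.sin(np.linspace(0, 2 * np.pi, 500))  # stand-in LF anomaly field
errors = mse_vs_sampling(lf, fracs=[0.1, 0.35, 0.9])
print([round(e, 4) for e in errors])
```

The cutoff V2 and V3 tune is, on this reading, the sampling fraction below which that error blows up.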
.
So that alleviates one of my concerns. Without seeing the reconstruction statistics nor a full description of how the reconstructions are done, I now have an additional (and greater) concern about the reconstruction methods generating bad trends. Prior to this, I didn’t realize how much of the SST data was reconstructed instead of actual data. The data sets are already synthetic.
.
And with my only other encounters with “reconstructions” being the Mannian and Steigian varieties, I will express a deep reservation about the accuracy of the SST reconstruction – especially since they practically admit that not enough “modes” (I’m assuming PCs) are used in certain areas:
.

That is because the HF analysis will damp anomalies in regions where too few modes are chosen for the analysis. For example, the Niño-3.4 (5°S–5°N, 120°–170°W) area SSTs may be slightly damped early in the twentieth century, as discussed in section 2c (V. Kousky 2006, personal communication; see Fig. 8).

.
An additional concern is that when the error gets too high, the global averages are apparently calculated by simply truncating the offending data:
.

Damping of large-scale averages may be reduced by eliminating poorly sampled regions because anomalies in those regions may be greatly damped. In Smith et al. (2005) error estimates were used to show that most Arctic and Antarctic anomalies are unreliable and those regions were removed from the global-average computation.
…
Using the default SR05 parameters, the merged global MSE is minimized when the 25° region has at least 35% sampling. Using the improved tuning discussed above, the MSE is minimized when there is at least 20% sampling. The improved parameters yield a lower optimal sampling for global averages because they produce a less-damped analysis in the presence of sparse sampling. However, even with the improved parameters the MSE for global averages can be reduced by omitting some sparsely sampled regions.

.
This would seem to introduce potential biases when simply using the “global average” for subsequent analysis, since the spatial content of the average changes with time. To me, this looks like a method for reducing the uncertainty ranges merely by truncating the “bad” data, rather than admitting that the sampling is so sparse that large uncertainties are appropriate.
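To make the bias concrete, an illustrative sketch with invented numbers: a “global” average whose spatial content changes because poorly sampled regions are dropped. The `min_frac` parameter stands in for the 20–35% sampling thresholds in the quote above.

```python
import numpy as np

# Area-weighted "global" mean with a minimum-sampling cutoff. All values are
# invented to illustrate the commenter's concern, not taken from ERSST.
def global_mean_with_cutoff(anoms, sampled_frac, lat, min_frac=0.20):
    keep = sampled_frac >= min_frac
    w = np.cos(np.radians(lat)) * keep        # area weight, zero if dropped
    return float(np.sum(w * anoms) / np.sum(w))

lat = np.array([-60.0, 0.0, 60.0])
anoms = np.array([-0.5, 0.1, -0.4])
full = global_mean_with_cutoff(anoms, np.array([0.5, 0.5, 0.5]), lat)
# the cold southern region falls below the cutoff and is dropped:
trunc = global_mean_with_cutoff(anoms, np.array([0.05, 0.5, 0.5]), lat)
print(round(full, 3), round(trunc, 3))
```

Dropping a cold, sparsely sampled region warms the “global” mean, even though no individual measurement changed.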
.
On the HF side, they repeat the 1982-1991 (half of the sampling period) data over and over again throughout the 1860-2000 test period – yet they do not discuss any kind of correction for spatial or temporal autocorrelation. If you repeat the same data over and over again, I would expect some significant corrections would need to be applied. The article doesn’t mention any, and that, too, is concerning.
.
I also do not understand their satellite bias adjustments yet. That will require some additional reading. Something seems not quite right about it, but I cannot put my finger on it yet.

Re: Ryan O (#38), you could be right about model data not being used. If models aren’t used, then it seems like a very poor idea to open the explanation of methodology changes with an exposition of model properties. But maybe it’s just execrable exposition.

There is a connection back to Mannianism from this literature. MBH cites Smith-Reynolds “optimal interpolation” as the inspiration for Mannian “climate fields” – which are nothing more than principal components. The reduction of the Steig AVHRR data set to three PCs is very much in the same tradition – so your perception of a linkage here is very astute.

Equally some of the puzzling features of ERSST that are unclear from the articles may seem a bit clearer if the methods are interpreted in Mann-Steig terms as you suggest – an excellent idea.

Re: Steve McIntyre (#42), When I read the abstract, I thought the same thing you did. It wasn’t until the second read of the LF part of the paper (because I didn’t understand it the first time) that I realized that the model data is just used as a test target.
.
Regardless, I’m still wondering where their verification statistics are. They obviously ran multiple reconstructions, yet the article mentions no verification statistics at all. Is there an SI associated with this with the statistics tabulated?

Re: Steve McIntyre (#50), Fantastic. That is EXACTLY what is happening.
.
Jeff threw up the verification stats post on tAV, and not only can you see it in the reconstruction PCs themselves, you can also see that effect in the RegEM verification for Steig. RegEM simply ignores some of the actual station data because truncating to 3 eigenvectors does not provide enough flexibility for the fit.
.
I added the 2003 and 2004 Smith papers to my collection for reading as well. Hopefully the earlier ones have good descriptions of the reconstruction methods. It would be interesting to replicate their work and perform the same kind of sensitivity analysis we did with Steig.

Mike B,
.
Beyond the shift in the trend (and the focus always on the trend, not the absolute temperature values) there is also the effect on examining the models.
.
If I’m understanding correctly, using this same technique would make -any- model provide better hindcasts. Where a hindcast is just using your model to predict what should have happened historically and comparing that with what did actually happen.
.
Picture what would happen with a clearly crazy model. Assume it predicts temperatures falling off a cliff at 5C per century monotonically. Using the techniques outlined above, all the early -physical-measurements- would be strongly adjusted upwards. To better accord with the model. Voila – instant ice age data.
.
Not just a -model- that is -predicting- an ice age. But the raw data themselves have been massaged into becoming a pretty darn convincing predictor of an upcoming ice age entirely on their own.
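If I follow, the worry can be made concrete with a deliberately silly toy (entirely hypothetical, nothing like the actual ERSST procedure; the 80/20 blend and all values are invented): fill the sparsely observed early period mostly from a model that predicts a 5C/century plunge, and the “data” acquire the model’s ice-age trend.

```python
import numpy as np

# Purely hypothetical toy of the "crazy model" scenario above. The sparse
# early record is heavily damped toward a model predicting monotonic cooling;
# the filled "data" then carry the model's trend, not the truth's.
rng = np.random.default_rng(2)
years = np.arange(1860, 2001)
truth = 0.004 * (years - 1860) + rng.normal(0, 0.1, years.size)  # mild warming
crazy_model = -0.05 * (years - 1860)   # temperatures falling off a cliff

dense = years >= 1950                  # good coverage only recently
# sparse early values heavily damped toward the model (80/20 blend, made up)
recon = np.where(dense, truth, 0.8 * crazy_model + 0.2 * truth)

true_slope = np.polyfit(years[~dense], truth[~dense], 1)[0]
recon_slope = np.polyfit(years[~dense], recon[~dense], 1)[0]
print(round(true_slope, 4), round(recon_slope, 4))
```

The early-period trend of the filled series is dominated by the crazy model rather than the underlying truth, which is the circularity being objected to.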

You know, I just had another thought that has me puzzled at the moment.
.
The concern with the sets is the uncertainty associated with the spotty historical raw data. This revision was attempting to quantify the minimum required sampling rate to provide an accurate picture of SSTs during periods of spotty coverage.
.
So to test, they took a coupled ocean model – but not an unforced one – to provide a sampling testbed:
.

This coupled general circulation model (CGCM) simulates the large-scale climate signal using variations in forcing by greenhouse gases, aerosols, and the best available estimates of solar radiation changes (Delworth et al. 2006).

.
So the first puzzling aspect is why they would take a forced model to simulate a period where the forcings are presumed to be negligible. Why not use a control run?
.
The second puzzling aspect is that the coupled ocean models have known problems with reproducing regional characteristics. For example, one of the big problems is the strange massive cooling off the east African coast that happens sometimes during spin-up (not sure if GFDL 2.1 has this specific problem or not). So if you’re already not going to use an unforced model, why not simply resample actual data to determine the appropriate cutoff?

CA readers were undoubtedly ready for adjustments, but, using the technical terminology of the leading climate journals, the adjustments were “worse than we thought”. ERSST v3 lowered SSTs before 1880 by up to 0.3 deg C.

One more application of Hansen’s Golden Rule of climate science: Old data are always too warm, while new data are always too cold.

I’ve amended the head post in light of Ryan O’s comments. As Ryan observes, the models were used to change certain parameters in the algorithm. What’s surprising is the size of the impact of these parameter settings on the results – amounting to up to 0.3 deg C in the 19th century – something that readers should keep in mind when presented with arbitrary parameter choices in things like Steig.

Re: Steve McIntyre (#48), The size of the change has me wondering, too. That’s an awfully big jump. Methinks one might be able to generate a “64 flavors of SST” like Burger and Cubasch’s analysis of MBH.
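In the Burger-and-Cubasch spirit, enumerating the “flavors” is trivial: every combination of the old and new screening parameters quoted earlier in the thread. `run_recon` is a placeholder name for whatever the actual reconstruction code would be.

```python
from itertools import product

# Sketch of a "flavors of SST" enumeration: all combinations of the SR08
# Table 1 screening parameters (old vs new values quoted in the head post).
month_hurdle = [2, 5]   # months per year required (new, old)
lf_years = [2, 5]       # years per LF point (new, old)
area_deg = [15, 25]     # averaging area in degrees (old, new)

flavors = list(product(month_hurdle, lf_years, area_deg))
print(len(flavors))     # 2 x 2 x 2 = 8 flavors
# for params in flavors:
#     run_recon(*params)   # hypothetical: rerun the reconstruction per flavor
```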

For reference, here’s what I removed from the head post in amending it in light of Ryan’s comment:

As I read the article, here’s what they’ve done. They state that GFDL CM2.1

simulates interdecadal signals with characteristics similar to the observations where data are available. However, shorter period signals are not simulated as well by this climate model.

In areas where data is spotty, they appear to coerce the “observations” to model output, described as follows:

output is filtered to extract the model LF component. This is done using the 15-yr LF filtering described by SR05. Briefly, the LF is computed by first averaging anomalies spatially over 15° latitude–longitude moving areas and then annually. The smoothed annual averages are then median filtered using 15 annual averages to produce the LF anomaly analysis. Because the model outputs are complete, there is no damping of this test LF output. These model LF anomalies are used for the 1860–2000 test period.

They continue:

To make our study more realistic, HF variations from observations are added. The HF observations are from a combination of the optimum interpolation (OI) SST over oceans and from the Global Historical Climate Network (GHCN) over land, for the recent period (Reynolds et al. 2002; Peterson and Vose 1997). To form the merged complete data, the OI SST anomalies are averaged to the monthly 5° grid boxes for 1982–2001 and merged with the GHCN 5° monthly LST anomalies. These data fill nearly all monthly 5° grid squares within 1982–2001. The remaining unfilled grid squares are filled using linear spatial interpolation of the anomalies from their nearest neighbors.

In other news, either Nature or Science or Reader’s Digest, I forget which, reported that the newest climate models are “better than we thought”, noting a “remarkable” low-frequency fit between ERSST v3 and GFDL Model CM2.1.
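For concreteness, the LF filtering quoted in the removed text (15° spatial averaging, annual means, then a 15-yr median filter) can be sketched roughly as follows. The grid shape, the collapsed spatial step, and the edge handling are my own assumptions; the real SR05 code surely differs in detail.

```python
import numpy as np

# Minimal sketch of the SR05-style LF filter described in the quote above:
# spatial average, annual average, 15-yr running median. Assumes complete
# monthly anomalies on a small grid and edge-pads the ends (my choice).
def lf_filter(monthly, years):
    """monthly: (years*12, nlat, nlon) anomalies -> (years,) LF series."""
    area_mean = monthly.mean(axis=(1, 2))               # spatial average
    annual = area_mean.reshape(years, 12).mean(axis=1)  # annual average
    padded = np.pad(annual, 7, mode="edge")             # pad for 15-pt window
    return np.array([np.median(padded[i:i + 15]) for i in range(years)])

rng = np.random.default_rng(1)
data = rng.normal(0.0, 1.0, size=(30 * 12, 4, 4)) + 0.5  # 30 years, 4x4 grid
lf = lf_filter(data, years=30)
print(lf.shape)
```

Because the model output is complete, nothing here damps; it is when this same machinery meets gappy 19th-century data that the parameter choices start to bite.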

‘The adjustments do not just cool the late 19th century. They also increase the amount of warming since then. More warming implies greater climate sensitivity. Thus, lower temperatures in the 19th century result in higher temperature predictions for the 21st century.’

This is what I am wrestling with (#10). Would this work, if valid, imply increased climate sensitivity to human emissions? If so, why would the putative extra warming be essentially prior to 1880, well before our emissions rose substantially? Wouldn’t the new work, even if valid, only imply a higher estimate of natural variation in the late 19th century?