RegEM Impact on Peninsula Correlations

There has been a good deal of discussion regarding the correlation between temperatures at various locations throughout Antarctica. Several people have looked at the relationship between correlation and distance by creating graphs linking the two. IMO, one of the difficulties in interpreting these is that they are affected by a variety of factors, including the shape and topography of the continent and the fact that the place is completely surrounded by a large pool of water.

I think that it is informative to pick several locations and to see how the AVHRR series at that location is related to all other locations. I selected two points: the tip of the peninsula (Steig series 1) and the obvious interior point: the South Pole (grid point 1970 is the closest).

For a selected site, after calculating the 5509 correlations, we graph them using a color scale to represent the correlation (as usual, red is positive, blue is negative, and white areas represent zero correlation). The location of the grid site is represented by a green +. Keep in mind that these are correlations measuring the relationship between temperatures at each location and at the selected grid point, not positive or negative trends.
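The correlation step itself is simple; below is a minimal sketch in Python/NumPy (the original work was done in R, and the map plotting, grid coordinates, and the actual 5509-point AVHRR matrix are omitted — the array shapes here are invented):

```python
import numpy as np

def point_correlations(data, idx):
    """Correlate one grid point's series against every column.

    data: (n_months, n_gridpoints) anomaly matrix; idx: target column.
    Returns a vector of length n_gridpoints (the target correlates
    1.0 with itself).
    """
    target = data[:, idx]
    # Center each series, then take normalized dot products
    X = data - data.mean(axis=0)
    t = target - target.mean()
    denom = np.sqrt((X ** 2).sum(axis=0)) * np.sqrt((t ** 2).sum())
    return X.T @ t / denom

# Toy example: 3 "grid points", the third anti-correlated with the first
rng = np.random.default_rng(0)
a = rng.standard_normal(300)
data = np.column_stack([a, a + 0.1 * rng.standard_normal(300), -a])
r = point_correlations(data, 0)
print(r)
```

The vector `r` is what gets mapped with the red/blue color scale in the plots below.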

First, we take the latest revelation from Steig, the cloud-masked AVHRR data. The grid point is on the tip of the peninsula.

Several things stand out in the graph. Obviously, the region immediately adjacent to the grid point is strongly correlated, but what is somewhat surprising is that the correlation drops off fairly quickly, becoming negative while still in the West Antarctic area. The relatively low correlation continues through the rest of the continent.

Next, we take the original reconstruction: ant_recon.txt. This was supposedly reconstructed from the previous data using RegEM and the manned surface stations:

The correlation has strengthened dramatically in the Western Antarctic so that now the pattern exhibited by the reconstruction at the tip of the peninsula seems to be reflected by the entire west. As well, the Eastern portion has now become more strongly inversely correlated with the peninsula.

I have also looked at the two other reconstructions (detrended and PCA) created by Steig, as well as at the South Pole and how its temperatures correlate with the rest of the grid points. These can be found at my statpad site. The R script can be found in a Word document here.

52 Comments

The change in sign of West Antarctic correlation from negative in the original Comiso data (Steig “raw” AVHRR) to positive after application of the Mannomatic is quite intriguing, given that the principal reported “discovery” of Steig et al is previously unidentified warming in west Antarctica, the knock-on effect of which was to move the Antarctic P/L statement from a loss to a profit.

Beautiful colors that crisply show how crazy the negative correlation – vaguely seen already in the distance/correlation histograms applied to the Steig 3PC data – becomes with RegEM.

With these linear methods, it is pretty much inevitable that a large portion of the coefficients in front of the spatial patterns will be negative: linear algebra doesn't really allow you to segregate negative and positive numbers, so their representation comes out at roughly 50:50.
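This sign symmetry is easy to illustrate numerically: the eigenvectors of a generic covariance matrix have no preference for sign, so roughly half their entries come out negative (a toy demonstration with random data, not a claim about the Steig covariance matrix):

```python
import numpy as np

# Build a generic sample covariance matrix from random data and
# inspect the sign balance of its eigenvector entries.
rng = np.random.default_rng(6)
A = rng.standard_normal((200, 200))
C = A @ A.T / 200

_, vecs = np.linalg.eigh(C)
frac_neg = np.mean(vecs < 0)
print(frac_neg)   # close to 0.5: negative and positive loadings split ~50:50
```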

On the other hand, in actual physics (and the cloud masking reality above), such highly negative correlations between temperatures are very awkward, essentially because of the second law of thermodynamics. The heat always tries to flow from warmer to cooler places, introducing a uniformly positive influence of one place on another.

That doesn’t mean that negative correlations can’t occur. But it means that they’re pretty much statistical flukes.

At least in some approximation, systematically negative correlations between temperatures at two spots contradict the second law of thermodynamics. ;-) So this is also a point where atmospheric physics is inevitably “nonlinear” because it protects the reality against large negative correlations.

The comment above has exceptions: ENSO brings warm weather somewhere and cool weather elsewhere. But that depends on rather complicated flow patterns of the wind, humidity etc.

I have added two more plots to the post by request. These correspond to the grid point at the Casey station (coordinates: lat = -66.3, lon = 110.5) on the opposite side of the continent from the peninsula. The results:

Cloudmasked AVHRR:

Original reconstruction:

The pattern is interesting in relation to the tip of the peninsula. In the data case, there is little correlation with Casey's grid point. In the reconstruction, that part is strongly negatively related.

Roman — In the original graphs, did you for some reason subtract out the row means as well as the (month-wise) column means before computing the correlations? This would be sufficient to make the correlations average spatially to 0, so that any positive correlations would have to be offset by equal and opposite negative correlations somewhere else.
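The mechanism Hu describes can be verified in a few lines: once the row (spatial) means are removed, the covariance of any single point with all points sums to zero identically, so positive correlations must be balanced by negative ones elsewhere. A toy check (synthetic data, not the AVHRR matrix):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((200, 50))      # months x grid points
X = X - X.mean(axis=0)                  # remove monthly climatology (column means)
Xc = X - X.mean(axis=1, keepdims=True)  # ALSO remove the spatial (row) means

t = Xc[:, 0]
covs = Xc.T @ t / (len(t) - 1)          # covariance of point 0 with every point

# After row-centering, each row of Xc sums to zero, so the covariances
# sum to zero exactly -- the "offsetting negatives" effect:
print(covs.sum())
```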

(However, this couldn’t account for the surprising results in #5. How can the correlations from right to left be so strong, if the correlations from left to right die quickly and even change sign?)

No, I didn’t do anything funny with the data. Just a straight correlation calculation.

The correlations in the initial post were from all grid points compared to the single point at the peninsula tip. The correlation was strong for points on the peninsula but was negative for points on the opposite side of the continent. In #5, the single point on the opposite side is still negatively correlated with the peninsula tip. But one cannot infer from the first graph what may occur in the middle area of the second graph when the relationship between the points in that area and the first comparison point is relatively weak (as is apparent in the first graph).

Also — How can you correlate the Steig AVHRR data with the 5 oceanic surface stations that Steig used but which are not in Antarctica? These are Macquarie, Grytviken, Campbell, Signy, and Orcadas — see Steve’s map at http://www.climateaudit.org/?p=5296.

I matched each off-grid station to the nearest grid location by minimum distance and used that grid point's series for the correlation. I'll clip the islands from the point plot.
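The minimum-distance matching described here amounts to a nearest-neighbour lookup in great-circle distance; a sketch (an illustration only, not Roman's actual code — the sample coordinates are made up apart from Casey's):

```python
import numpy as np

def nearest_grid_index(lat, lon, grid_lat, grid_lon):
    """Return the index of the grid point closest to (lat, lon).

    Uses the haversine great-circle distance; arrays grid_lat/grid_lon
    hold the grid coordinates in degrees.
    """
    lat1, lon1 = np.radians(lat), np.radians(lon)
    lat2, lon2 = np.radians(grid_lat), np.radians(grid_lon)
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = np.sin(dlat / 2) ** 2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon / 2) ** 2
    return int(np.argmin(2 * np.arcsin(np.sqrt(a))))

# Tiny example grid: South Pole, Casey, and a peninsula-tip point
grid_lat = np.array([-90.0, -66.3, -63.4])
grid_lon = np.array([0.0, 110.5, -57.0])
print(nearest_grid_index(-66.0, 110.0, grid_lat, grid_lon))
```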

I’ll do some more peninsula area plots but Esperanza is one of the poor correlation plots and is in my previous post. You can see why. There’s a lot of data and it’s getting pretty interesting trying to sort out just how good this data really is. The satellite data on the east side is really good.

Here's a new question. If the data in the peninsula correlates poorly with AVHRR, what would keep its patterns locked to the peninsula during RegEM?

Dear Johan i Kanada, I would also like if others were addressing this basic question about the scientific interpretation of the negative correlations.

They can always occur by chance – although correlations occurring by chance are unlikely to get close to -1. But otherwise, I think that their possible physical origin is limited.

A region of Antarctica can’t warm up simply by taking the heat from another region of Antarctica. A natural or artificial gadget able to achieve such a goal would be called “perpetuum mobile of the second kind” and there are good reasons to think that it doesn’t exist. ;-)

On the other hand, if the two negatively correlated regions are both viewed as an effect of a third region, their common cause, the correlation may exist. For example, the typical direction of warm winds may be redirected from region A to region B. Because the redirection is an important part of the weather, A will often get cooler at the same moment when B will get warmer.

Such things can exist but another question is whether A and B can be persistently "complementary" in this way. More often, A and B can be either complementary or "together", fighting for the wind with another region C. If they're as likely to be complementary as "together", the correlation will tend to be zero once again (or positive, from proximity, heat exchange etc.).

ENSO – El Nino and La Nina – is an example of a weather spatial pattern where negative correlations are likely to exist. That’s because there exists one dominant degree of freedom – let me call it the ONI index. The temperature anomaly of Pacific equatorial waters drives everything else, influences winds etc., which bring characteristic warmth/coolness/humidity/drought to all places on Earth.

But whenever there are many degrees of freedom, the correlation will tend to be close to zero, and especially strong negative correlations will be absent. I suspect that it should be the case of Antarctica, too – there are many equally important degrees of freedom and no preferred "pan-Antarctic" degree of freedom. For example, there's no equator in Antarctica ;-) and the winds in all directions are pretty much equally likely near the pole – all of them go north. :-)

There will also be times when one of a correlated pair is in darkness and another is in light, 180 degrees apart at the same time. How do you do a TOBS adjustment on that?

The projections showing coloured correlation coefficients are pretty, but should one not look at both the purpose and limitations of a correlation coefficient? They work best when all other things are equal, but the Antarctic is a beaut place for things not being equal. If you divide the satellite scene into cloud and no cloud, what do you do with the driven snow and ice that can move very quickly, many meters deep, producing whiteouts like the one that caused the Mt Erebus aircraft disaster?

I have not been to Antarctica, but I worry about correlation spanning the continent when most of the weather stations are around the perimeter. If there are such things as “good correlations” here, I’d expect them to be in a circle roughly at the land:sea interface. That’s also roughly the way the circumpolar weather systems move, so you get lagged correlations that are better than hopping opposite via the Pole.

Finally, BTW, the high quality data for Macquarie Island show essentially no temperature change to max or min over the last 40 years. While there are reasons for excluding it from the Antarctic dataset, there is also a strong need to explain why it has remained at a uniform temperature in the face of relentless warming from omnipresent GHG.

I have been using a variant of the plot function which I believe may have been originated by you. It has been changed and adapted by several people including myself. The legend (which is what I believe you are referring to) is in the last two lines.

For purposes of this comment, I have replaced lines of spaces by a series of dots because WordPress removes all but one space when you post. The pattern of spaces appears to place the printed end values in the correct places.

The part to produce the colors (which came along when I acquired the function) can also be simplified:

Good eye! I use the first call to the map to ensure a black background in case there are missing values. A white background would be indistinguishable from 0 correlation. To restore the outline, I call map a second time; "add = T" guarantees that it won't erase previous content.

Roman’s observation in this post seems pretty fundamental to the Steig project. I double checked his calculations and got identical results. Here are my plots using a slightly varied color scale.

I agree that the correlation between the Peninsula and West Antarctica gets reversed from what the Jeffs call the "Comiso data" (Steig "raw") to the RegEM'ed data.

In the scatter plots of correlations, we’ve seen that many spurious correlations are created in the Mann-Steigian RegEM and I suspect that these might not be randomly distributed – it looks like a lot of them might be coming from this one geographic region.

Steve, Roman,
Although this is a statistics oriented site (and a good one), could you please comment on the physical interpretation (or the need for one) of negative correlation between temperatures in different locations, such as in this case in Antarctica?

Re: Johan i Kanada (#20), my guess is that the strong negative correlations of the Steig recon are artifacts of the Mannomatic and the weak negative correlations of the “Comiso” data are non-significant. Thus there’s no point seeking physical explanations at this time.

I think that the next project is to see what happens with a (say) 16-PC RegEM recon (as opposed to nibbling at 5 PCs) and see what one of these correlation maps looks like. My guess is that the relatively strong positive correlation between West Antarctica and the peninsula won’t survive well.

There’s another very important issue here that we haven’t discussed so far. Everyone’s more or less assumed that the recently archived Steig “raw” data was actually used in the RegEM. Do we know that?

RegEM returns infilled values for missing values and original values where available.

Ergo, the post-1982 Steig recon is rank 3. It HAD to have been reduced to rank 3 PRIOR to RegEM – or alternatively, they made a mistake under their reported method and inadvertently used the penultimate matrix, the one just BEFORE the non-missing values are filled back in.

As always, it continues to be a guessing game with the Team. But I’m 99.999% sure that you can’t get from A to B using the reported method.
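The rank argument is easy to check numerically: truncating a matrix to three singular vectors gives rank 3, but copying the observed values back in (as standard RegEM does) destroys that low rank. A toy illustration with random data, not Steig's:

```python
import numpy as np

rng = np.random.default_rng(2)
full = rng.standard_normal((100, 20))   # stand-in for a post-1982 data block

# Truncate to rank 3 via the SVD
U, s, Vt = np.linalg.svd(full, full_matrices=False)
rank3 = (U[:, :3] * s[:3]) @ Vt[:3]
print(np.linalg.matrix_rank(rank3))     # 3

# RegEM-style output: infilled values where "missing", ORIGINAL values elsewhere
mask = rng.random(full.shape) < 0.3     # 30% of entries treated as missing
regem_like = np.where(mask, rank3, full)

# Putting the observed values back destroys the low rank entirely:
print(np.linalg.matrix_rank(regem_like))
```

So a recon that is exactly rank 3 everywhere is inconsistent with the observed values having been filled back in.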

I have been trying to reconstruct the results of Steig’s “conventional PCA, using 3 principal components and 15 predictors” using the AVHRR data from his web site, but so far it is turning out to be another “puzzle”.

I have some questions:

Are the data for the 15 manned stations in the file Data.tab pretty much what he would have used? Two of the stations (Belgrano I and Byrd) have no observations during the post-1982 period.

What exactly is the climate science definition of a “conventional PCA”? From what I have read elsewhere, I would take that to mean that the 3 PCs would be calculated from infilled (with zeros) station data anomalies (thereby allowing the use of the two previously mentioned stations). Then a regression would be done with the PCs as predictors and either the individual grid sequences or 3 PCs calculated from those sequences as response variables. Either way, you would get the same rank 3 result for the final reconstruction.
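That reading of "conventional PCA" can be sketched as follows (synthetic data and made-up dimensions; this is my interpretation, not Steig's verified method). Whichever response variables are used, the fitted reconstruction is a linear combination of 3 time series and hence rank 3:

```python
import numpy as np

rng = np.random.default_rng(3)
stations = rng.standard_normal((300, 15))   # months x manned stations (toy)
grid = rng.standard_normal((300, 100))      # months x AVHRR grid points (toy)

# Infill missing station anomalies with zeros (the climatological mean)
missing = rng.random(stations.shape) < 0.1
stations = np.where(missing, 0.0, stations)

# 3 PCs of the (centered) station matrix
U, s, Vt = np.linalg.svd(stations - stations.mean(axis=0), full_matrices=False)
pcs = U[:, :3] * s[:3]                      # (300, 3) predictor time series

# Least-squares regression of every grid series on the 3 PCs
beta, *_ = np.linalg.lstsq(pcs, grid, rcond=None)
recon = pcs @ beta                          # fitted values = reconstruction

# The reconstruction is a linear map of 3 time series, hence rank 3
print(np.linalg.matrix_rank(recon))
```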

However, I have not been particularly successful in matching their results. I tried some other variations e.g. using all the stations (excluding the two non-overlapping ones) to predict PCs extracted from Steig’s AVHRR data set with no positive results. I will continue to look at other possibilities to see what I can uncover, but some answers to my questions would certainly be helpful.

Climate scientists – Mann, Tamino, Wahl and Ammann (who IPCC relied on in their last word on MBH) say that Mannian decentering (with its strong data mining properties) is “conventional PCA” – just an “alternative” convention, so saying that something is “conventional” cannot be said to preclude anything. Think Rocky Horror Show.

Ergo, the post-1982 Steig recon is rank 3. It HAD to have been reduced to rank 3 PRIOR to RegEM

I may be wrong, but I think that, once a decision to use 3 PCs had been made, the reduction of the satellite data to rank 3 prior to RegEM is consistent with Steig using the methodology indicated in Mann et al’s 2007 paper “Robustness of proxy-based climate field reconstruction”:

“[19] To further insure regularization of the procedure, the predictand (in the current study, the annual mean “instrumental” surface temperature grid box data which are temporally complete over the calibration interval but entirely missing prior to that interval) is represented in the data matrix X by its leading M PC time series, where M is small compared to the total number of nonzero eigenvalues of the calibration period covariance matrix. This step is performed only once, at initiation of the RegEM procedure. M represents the number of distinct modes of variance resolvable from noise, and is objectively determined by the eigenvalue-based noise/signal separation procedure described earlier, applied to the predictor data set over the calibration interval at the initiation of the RegEM procedure. The predictand in the end is then reconstructed through the appropriate eigenvector expansion, using the M reconstructed PC series.”

If I understand this correctly, the whole of the predictand reconstruction (not just the infilled missing data) will then have rank M.

In Steig's case the predictand data is the satellite observations (or the AWS data, for the AWS reconstruction) and the predictor data is the manned surface station observations. Steig presumably did not (and could not) utilise this PC pre-processing procedure with the AWS reconstruction as the AWS data, unlike the satellite data, is far from temporally complete.

It seems far from clear that Steig’s justification of the use of only 3 PCs (as being the number of statistically separable PCs) in the above procedure (i.e., setting M=3) is valid, even assuming that Mann et al’s method is valid if M is selected appropriately.

Folks, if you look back at the Chladni patterns arising simply from spatial autocorrelation, the contrast between the Peninsula and west Antarctica that one observes in the real AVHRR data can't be picked up in the first three eigenvectors. Probably just a coincidence :)
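The Chladni point can be reproduced with nothing but spatial autocorrelation: take an exponential-decay correlation matrix on a one-dimensional transect and look at its leading eigenvectors. EOF1 has no sign change, EOF2 one, EOF3 two — the standing-wave pattern falls out of autocorrelation alone (illustrative parameters only):

```python
import numpy as np

# Exponential-decay ("AR(1)-like") correlation matrix on a 1-D transect
n = 80
x = np.arange(n)
C = np.exp(-np.abs(x[:, None] - x[None, :]) / 15.0)

vals, vecs = np.linalg.eigh(C)
vecs = vecs[:, ::-1]   # eigh sorts ascending; flip to descending

# Count sign changes in each of the three leading eigenvectors:
signs = [int(np.sum(np.diff(np.sign(vecs[:, k])) != 0)) for k in range(3)]
print(signs)
```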

RE Roman post and #5, Steve #18,
It would be useful to see these plots from two or three additional points as well: one centered in W Ant (say 240E, 80S), and one or two others in E Ant, say Vostok and the South Pole.
I suspect that on top of the geometrical Chladni patterns, these plots would demonstrate that the TransAnt Mtns really do separate WA from EA, but that the correlations within these regions are not nearly as high as ant_recon.txt, with its 3 EOFs, assumes.

Thanks! Your plots of correlations from “site 817” (240, -80) demonstrate that while W.Ant. is well-correlated with itself in the unsmoothed cloudmaskedAVHRR.txt data, it is essentially uncorrelated with the Peninsula, and in particular with the tip of the Peninsula, where a lot of the manned stations are. Yet Steig's ant_recon.txt has it strongly correlated with the Peninsula, because Steig's reduction of the matrix to rank 3 greatly extends the reach of near perfect correlation, except across the TransAnt range, which registers well in EOF2.

Steig (& Mann’s) choice of k = 3 was therefore critical for spreading Peninsula warming throughout W.Ant., and is not as benign as Nic L indicates in #35. I’m guessing that just a few more EOFs would have divorced W.Ant. from the Peninsula warming, and left them without a paper.

Re: Hu McCulloch (#36),
I wasn’t indicating that Steig’s choice of 3 PCs was benign – far from it, as per my statement “It seems far from clear that Steig’s justification of the use of only 3 PCs … is valid”.

I was simply saying that once he had chosen to use 3 PCs, applying Mann et al’s reconstruction procedure would result in the satellite reconstruction having rank 3 for all data points (and not just missing ones).

I agree that the dubious use of only 3 PCs, resulting in strong correlations over extended distances, is a key issue. So far as I can see the effects of using a small number of PCs are exacerbated by the concentration of stations in the peninsula, with the application of Total Least Squares to unscaled data pulling the reconstructed data closer to the trend in the peninsula than would be the case if stations were evenly distributed. Hence the much reduced trend when gridded station data is used instead, as Jeff Id has shown.

Here's one which fits on this thread. I did all of the plots of surface station correlation to the post-1982 and pre-1982 satellite data. The results were a little different than I expected. Peninsula trends may not be the issue at all.

Roman, Jeff, Jeff, Ryan, and all others working on replicating Steig’s ant_recon.txt —

It’s been a week and a half now since Steig released the intermediate cloudmaskedAVHRR.txt file, but no news for a week on progress. Is everyone working on their taxes (as I should be :-) ), or are new complications arising?

Also, should I just thank Steig for providing this intermediate file and let my request for complete data and protocols rest, or is there important raw data that Steig should still provide that is not really available from NSIDC (such as 2004-2006)?

yep, for example, in the UWisc data considered by Jeff C, the relations between cloud and satellite life cycle were extraordinary. The Comiso AVHRR “raw” data is net of cloud adjustments but there’s no data on Comiso clouds matching the UWisc data. Without data, this question is pretty much a dead end. Comiso’s adjustments for this study are different than UWisc; blanket statements at RC that the “code is available” are untrue.

You are probably right. However hopefully it can at least be narrowed down further. Jeff C. is exploring the differences between NSIDC AVHRR and Steig AVHRR – with particular interest in those grids corresponding with surface stations. Hoping to identify where and when cloud masking was done and what the masked values were. Also looking for any clues for possible role of surface stations in calibration for infilling. Ryan O. has also been studying cloud masking and says he hopes to post on calibration of infilling as well.

Then, does 10dC clipping plus straightforward time and spatial aggregation essentially yield Steig’s cloudmaskedAVHRR.txt file, or are there other protocols and intermediate data sets we should be insisting on?

Dunno. Hopefully, when Jeff C. and/or Ryan O. posts on this, we will see if it is as simple as adding the +/- 10C rule to the NSIDC AVHRR data.

To some extent, Jeff’s making progress by assuming that some of Steig’s statements on methodology are untrue and seeing where they go. We’re a considerable distance away from replicating his results, despite all the effort from a number of people.

Ironically Steig was a member of the PR Challenge (as was coauthor Mann) and Trouet coauthors Frank and Graham. While it would be nice to cooper up data and code for old studies, at least one would expect the PR Challengers to obey the PR code of conduct and not add to the mess, as Steig and Trouet have done.

My taxes aren’t due until the end of April so I have been ignoring them. ;)

I have been trying to cobble up my own R script for TTLS and RegEM from first principles but I haven't quite got it right yet. It has led me to notice that the truncation to relatively few principal components seems to amplify those series which are the most variable and suppress series which have less fluctuation in them. This effect is not a result of any trends in the various series, since re-ordering the rows of the data matrices (i.e. shuffling the times of all of the series) should have no calculational effect on the values of the PCs. Choosing too few PCs can lead to reconstructions with trends similar to those exhibited by the more variable series.
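Both observations are easy to demonstrate numerically: the singular values (and spatial weights) of a data matrix are unchanged by shuffling its rows, while a low-rank truncation loads preferentially on the most variable column (toy data, not the actual AVHRR series):

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((200, 10))
X[:, 0] *= 5.0                      # make one series highly variable

# Shuffling the rows (scrambling the time ordering) leaves the
# singular values untouched -- trends play no role in the PC weights:
perm = rng.permutation(X.shape[0])
s1 = np.linalg.svd(X, compute_uv=False)
s2 = np.linalg.svd(X[perm], compute_uv=False)
print(np.allclose(s1, s2))          # True

# The leading right singular vector is dominated by the variable column,
# so a severe truncation amplifies that series at the expense of the rest:
U, s, Vt = np.linalg.svd(X, full_matrices=False)
print(np.argmax(np.abs(Vt[0])))     # loading is largest on column 0
```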

With regard to Steig, I have no data that I would ask for at this time, but I do have some questions as to the specific way his methods were applied. For example, the AVHRR data scraped by Ryan and the Jeffs had missing values, including several missing months. However, the version posted on the web site (which was claimed to be the data fed into the RegEM procedure) was complete. Exactly how was the infilling done? An alternative approach could have been to relegate the missing months to be reconstructed by RegEM at the same time as the pre-1982 reconstruction was being done. This apparently was not what they did.

When the cloud masking adjustments were done, was the extreme data (outside of +/- 10) merely cut back to these limits or was it removed completely and infilled along with the missing months?

Were three PCs extracted from the latest posted data and used in RegEM, or was something different done and then a second unmentioned procedure applied to make the final reconstruction decomposable into 3 pieces? It seems that if it was the former, we would have already seen a duplication of their results (or have I missed something?). This would assume that we know exactly what the station data they used looked like and how or when any infills may have been applied to that as well.

I tried running the script and it hiccupped on me in midstream (could have been my fault). Anyway, I did some reading up on the methods while I was working on the stuff and it helped me get some insight into the methods.

At the same time, I found some ways to make the program more efficient. I wrote a script to identify and group similar patterns of missing values, cutting down the number of times ttls needs to be called.
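The grouping idea can be sketched in a few lines: hash each row's missingness pattern and collect the rows sharing a pattern, so one TTLS solve can serve the whole group (an illustration in Python, not the actual R script):

```python
import numpy as np

def group_missing_patterns(X):
    """Group row indices by their pattern of missing (NaN) values.

    Rows sharing a pattern need only one TTLS solve for the whole
    group, instead of one call per row.
    """
    groups = {}
    for i, row in enumerate(np.isnan(X)):
        groups.setdefault(row.tobytes(), []).append(i)
    return list(groups.values())

X = np.array([[1.0, np.nan, 3.0],
              [4.0, np.nan, 6.0],   # same missingness pattern as row 0
              [np.nan, 8.0, 9.0],
              [1.0, 2.0, 3.0]])
print(group_missing_patterns(X))    # rows 0 and 1 fall into one group
```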

The Comiso AVHRR “raw” data is net of cloud adjustments but there’s no data on Comiso clouds matching the UWisc data. Without data, this question is pretty much a dead end.

Nature's policy still requires Steig to provide all the data and "protocols" necessary to replicate his study. Which data precisely is missing? CA contributors have been slowly downloading gigatons of data from NSIDC, but is this complete, or does it only run up to 2004 as per the NSIDC webpage? Did he somehow get newer data that is not yet available from NSIDC?

Then, does 10dC clipping plus straightforward time and spatial aggregation essentially yield Steig’s cloudmaskedAVHRR.txt file, or are there other protocols and intermediate data sets we should be insisting on? Are there some months and 50km regions with no after-masking data at all that must have been filled in by an as yet undisclosed protocol? Or is it obvious how to do this?

Dunno. I haven’t personally had any access to NSIDC data which requires an application protocol. Given that Ryan O has done so, I’m waiting to see how he does before thinking about it further.

We’re also a bit stuck on the next step. The Jeffs are experimenting with the algorithm under the assumption that Steig’s methodological description is untrue – i.e. that they did not, as claimed, apply RegEM directly to the AVHRR+surf data, but reduced the AVHRR data through some undisclosed method, perhaps PC. This is speculation on their part, pure and simple. I’m following their experiments, but it would definitely make sense to ask Nature to require PR Challenger Steig to provide further particulars on this step at some point fairly soon.

Regarding taxes, I just paid my county property taxes today. In California, county property taxes are due on April 10th. Then the Feds and state want their share of the loot five days later. It’s an ugly month.

The Jeffs are experimenting with the algorithm under the assumption that Steig’s methodological description is untrue

Not sure I’d call it untrue, just deliberately misleading (not that that is any better). In the methods section Steig has the following sentence.

We use an adaptation of RegEM in which only a small number, k, of significant eigenvectors are used

I think adaptation is the key word in that sentence. His adaptation is not running RegEM with regpar=3, as that is a standard use. His adaptation is using three eigenvectors as the input to RegEM. He could have just come out and said that, but the sentence above is a work of art. It obfuscates, but at the same time allows the claim that the method was revealed.

Re: Layman Lurker (#46), Regarding the whole cloud mask issue there are three things I have found interesting from studying the UWisc data set.

1) There is a clear discontinuity in cloud cover between NOAA-14 and NOAA-16 (Dec 2000/Jan 2001). I have traded emails with those responsible for the dataset at UWisc and they were very candid in stating that the NOAA-16 data has problems and they don’t know how to fix it. I wonder how Comiso dealt with this.

2) The average continental cloud cover ranges from roughly 45 to 80% depending upon the season. West Antarctica and the East Antarctica coast are particularly prone to clouds, more than 75% of the days would be thrown out as cloud covered. How does Comiso deal with this when calculating monthly means? Does he simply calculate a mean using the remaining days of data or is some sort of infilling algorithm used?

3) The areas Comiso shows warming appear to correlate with areas of heavy cloud cover. This could be entirely spurious, but it is interesting.

Re: Hu McCulloch (#44), Regarding the +/- 10 deg C climatological mean filter, this also would appear to cause a large percentage of the remaining days to be thrown out. The UWisc dataset has a fair number of *months* that exceed +/- 10 deg C. Think how many days must exceed the threshold to have months exceed it. I’m trying to get a better handle on how many days that might include by doing some statistical analysis on the monthly data. My initial estimate is that a 10 deg C *daily* filter would be roughly equivalent to a 5.8 deg C *monthly* filter. It is still very preliminary, but it means we could be down to only 10 – 20% of the daily data being used to calculate monthly averages for wide areas of the continent.

If Comiso was listening, I’d request his cloud mask files and his final masked daily data prior to calculating monthly means. I’d like to see just how many actual days of data were left after all the masking and filtering. I’d also like to know how he worked around the data contamination and diurnal drift issues that appear to affect NOAA-16.

I doubt I’ll ever see it. They released the satellite data (albeit a filtered and infilled version), I expect any further requests will be seen as unreasonable. Jousting with jesters, so to speak.