…-omatic Correlations

Update Mar 28: Here is Luboš’s version, replacing my much less pretty monochrome version, showing the spatial decorrelation of the “Comiso” version of the data archived a couple of days ago by Steig.
Figure 1. Spatial Correlation for Sample of “Comiso 2009” Antarctic Gridcells

(Jeff Id has compared this to corresponding surface stations at his blog and is complimentary to the Comiso versions; I haven’t had an opportunity to check this.) Luboš also shows the following version of my monochrome figure, showing the spatial decorrelation of the RegEM’ed version of the Comiso data (as reduced to 3 degrees of freedom).

This shows the effect of the RegEM on spatial decorrelation rather nicely. See Luboš here.

Having said all that, the quantity of ultimate interest is an Antarctic average, which has only 1 degree of freedom.

Following is a post by Jeff C at Jeff Id’s blog that I’ve reproduced here. The first two graphics (as they observe) are from CA, scatter plots comparing station correlation to distance, showing the remarkable difference between the decay rate of correlation for actual surface stations and the Steig AVHRR recon. Jeff C has extended this to the UWisconsin AVHRR versions and to the just-released Steig “raw” data. Jeff C’s graphic shows negligible spatial decorrelation in the Steig “raw” data. Update (12.00 pm Eastern, 11 blog Mar 27): I’ve re-done this with my own script and wasn’t able to replicate Jeff’s results. Here is my version and script.
[monochrome version replaced by Luboš’s version shown above]
Update (1.15 pm Eastern, 12.15 blog Mar 27): The two Jeffs have withdrawn their calculations: Jeff C failed to calculate anomalies prior to correlations and Jeff Id made a variable error. Errors can be made; that’s the purpose of due diligence.

Previously we have looked at scatter plots to get an understanding of how well correlated the data appears over distance. Below is a scatter plot of the raw Antarctic surface data. This contains no infilling, just actual measured data from occupied surface stations.

This plot above was originally calculated by Steve McIntyre. Note how the correlation is virtually 1 at 0 km, with a gradual decay as distance increases. This is what we would expect to see as stations closer together should have better correlated climate than stations far apart.

This plot, also provided by Steve, is a distance correlation for the satellite era (1982-2006) of the Steig 3 PC reconstruction used in the Nature paper. Note how correlation remains at 1 for some cell pairs at distances out to 3000 km. This seemed suspicious and led many of us to believe that the reduction to 3 PCs had caused a spatial smearing of the data.

This plot above is of the NSIDC AVHRR data from the University of Wisconsin website that Jeff and I have been processing for the past few weeks. Note that the “cone” is quite a bit wider than the surface data, but the distance correlation looks reasonable. Some of the cell pairs are still rather well-correlated at long distances, but we don’t see the values of 1 we saw on the Steig reconstruction.

This is the shocker. Here is the distance correlation plot for the Steig cloud-masked data released today. This data set has been presented as the satellite data used as the input to the reconstruction. If it were truly “raw” (or minimally processed) satellite data, we would expect to see a plot similar to the NSIDC plot immediately above. Instead, we see that every single data pair has a correlation of greater than 0.5!! Data from the peninsula is highly correlated with data from the East Antarctica coast and the interior despite the surface data showing nothing of the sort.

Why would this data set have such a high cell to cell correlation? I’m speculating here, but Steig talks about “enhanced cloud masking” where daily data points that exceed the climatological mean by +/- 10 deg C. are considered cloud contaminated and discarded. From my experience with the NSIDC AVHRR data, a huge number of data points would be affected by this threshold, perhaps as much as 50% of all points. If a simplified infilling algorithm was used to replace those points, high correlation might result. Regardless, this plot appears to show that the cloud-masked data set is highly-processed and suspect.

When I first ran this plot I thought it must be in error. I checked my code line by line and have repeated the results multiple times. I still find it hard to believe.

——–

Jeff Id

I’ve spent several hours verifying this post and have independently verified the results using my own code. Jeff C’s code used a subset of every 5th value in the grid (due to matrix size R can’t handle the full matrix). My independently written version used a random subset method which was derived from SteveM’s original sat correlation.

What it means:

The concept of this paper was to use spatial information to ensure proper weighting and location of individual surface stations across the Antarctic. The surface stations are the lowest noise measurement of atmospheric temperature and show a particular correlation pattern which we can consider “natural” (the first graph). This is the pattern you would expect to see in any data representing Antarctic temperature. The 3rd graph is the NSIDC dataset and represents spatial correlation of the publicly available cloud masked data from the same instruments as processed by the NSIDC. There is a wider spread of the cone angle as compared to surface station data, which is expected due to the increased noise level in the dataset, but the key is that there still is spatial information available. The last graph however has correlations pegged at almost 1 for the full width of the dataset, independent of the distance, mountain ranges, peninsula, sea contaminated pixels and the rest.

From my other post which derived the 3 PCs for the reconstruction dataset, this data doesn’t seem to be an exact copy of the original data but it is close. What’s more, we can now make sense of the second to last graph, which is derived from the full reconstruction using 3 PCs as presented by Steig. The data from graph 3 has almost a parallelogram shape because surface station data’s correlation vs distance is copied equally across the entire satellite dataset regardless of actual location.

If you take the surface station points (graph 1) and spread copies of the surface station data across the entire width of the Steig satellite data (graph 4), you get (graph 3).

I’m not in any way saying or in any way implying this was done intentionally but this is just about the perfect dataset to use if you want to weight every station equally and basically average the pre 1982 trends across the entire continent. I thought we were going to have to go through RegEM and do a lot of calculation to find if this was the case — not this time. This is the perfect scenario to blend the high concentration of known warming peninsula stations across an entire continent.

107 Comments

So does this basically mean the Steig paper is worthless? It makes little sense to say all of Antarctica is warming when what their process does is spread warming from the peninsula across the entire continent.

If I remember correctly from “the Movie” that was previously posted, most of the coastal based stations demonstrated warming anomalies, which was assumed to represent ocean temperature contamination. So basically Steig’s paper represents warming of the Antarctic Ocean — No?

So let me see if I can say this in layman terms that even I can understand.

They fashion a calculation which shows interior stations correlating well (extremely well) with peninsula stations then use that reasoning to attribute to the interior stations the same temperatures as the peninsula stations?

Nature needs to publish a correction or otherwise withdraw this paper.

It is (for me) very unexpected that the masked AVHRR data would so quickly lead to such a dispositive result.

Steig’s failure to provide the masked AVHRR data suddenly morphs from being uncollegial behavior to being downright suspicious. If Steig was not aware of this problem (and I still think that is the most likely possibility), his decision to withhold this particular piece of data was truly bad luck on his part.
Steig is now the lead author of a Nature cover story that was based on faulty data that he initially attempted to hide from the people he knew were attempting to disprove his result.

Steve called the faulty PC technique used in MBH 98 the “mannomatic”. Since Mann isn’t the lead author in this paper though he’s one of the co-authors, if I recall, the apparently faulty technique could be the Steigomatic or something else. The point is to indicate that the system used apparently will spread the heat regardless of the actual data, just as the mannomatic will produce hockeysticks even from random data.

No problem: prior to Steig the Antarctic cooling was consistent with the models, afterwards the warming was consistent with the global warming, now we’ll just be back to being consistent with the models again.

Everyone is piling on far too quickly and jumping to conclusions. I tried to re-do Jeff’s calculations and got a different result. So let’s first see what the difference is. I also wish that people would stop piling on with premature demands for Nature retraction; we’re just looking at things. There are some odd features to be sure, but we’re still trying to figure out what Steig did. Some self-discipline PLEASE.

Also please keep in mind that Steig’s paper could be goofy and Antarctica could still be warming. For example, let’s suppose that a simple average of the 15 or so stations that we have from the Antarctic show a warming. You don’t need 2 teraflops of operations to take an average and you wouldn’t get a Nature cover, but it wouldn’t mean that Antarctica was cooling. It seems reasonable to me that Antarctica is warming along with the rest of the world; that’s a different issue entirely from whether Steig’s RegEM plus preprocessing is a sensible way of handling data.

I looked back at my own code, which was written independently from Jeff’s, and I typed the wrong variable in the anomaly section, resulting in a correlation of temp rather than anomaly. You can actually see the problem in JeffC’s code above.

I’m sorry to everyone for my screw up. I explained above I didn’t see any intent either way. Blogs are real time science done in the open so these things will sometimes happen.

One of the reasons I keep coming back here is exactly what just happened.

A possible error in the data/method used in a peer reviewed document was found, it was checked and rechecked, the initial look was pretty damning and people were beginning to draw conclusions. Then, after a call for cooler heads, a review of the checking method showed an error in the math, which was not only quickly recognized, fixed and posted, but even followed up with an apology for missing it in the first place.

You will not find that on most other blogs about this topic, in fact you would more likely be ridiculed or demeaned for questioning the paper in the first place, in at least one blog I could name.

The slight wrong turn here doesn’t mean that this calculation is out of the woods yet. It sure looks like there’s been quite a bit of massaging on the “raw” AVHRR data and this is still a black box. The scatter plot of reconstruction correlations looks different from the AVHRR correlations. Why is that? What does it imply? We just got the AVHRR data a day or so ago. The possibility that the calculation is sensible should not be excluded out of hand either.

For sure. We know there’s some problem from the sat reconstruction correlation. (graph 2). Looking at Steve’s correlation plot the data shows what we would expect and require from the sat data to have any chance of achieving proper RegEM station weighting.

In my mind it’s just a matter of which step did it occur at and how much effect it has on the result.

Folks, I owe a huge apology to you all, and particularly to Dr. Steig. I re-used code to process the scatter plot that I had previously used for the satellite reconstruction. I neglected to account for the fact that the recon was in anomalies while the cloud-masked data set was in temperatures. When I recalculated the scatter plot using anomalies, the familiar pattern re-emerged.

Thanks to Hu over in the previous thread for reviewing the code and pointing out this flaw. This mistake was entirely mine and I again apologize for jumping to conclusions.

Steve – thanks for keeping me honest and helping me become a bit more humble.

Circumstances arose one day which delayed preparation of the dinner of a Soto Zen master, Fugai, and his followers. In haste the cook went to the garden with his curved knife and cut off the tops of green vegetables, chopped them together and made soup, unaware that in his haste he had included a part of a snake in the vegetables.

The followers of Fugai thought they had never tasted such good soup. But when the master himself found the snake’s head in his bowl, he summoned the cook. “What is this?” he demanded, holding up the head of the snake.

Well one thing is for sure, mistakes made are out in the open for all to see.

There’s no “CENSORED” folder here.

That doesn’t excuse the demands for retractions and the rest of the huffing and puffing. And others need to look in the mirror before they claim that they can do a better job than The Team. People can be mistaken without being the enemy.

Also should mention that if you were to split the data in to correlation “downwind” and “crosswind” you would likely end up with a similar pattern. What was found in the boundary layer studies was the scale-length for the correlation was different along the wind direction, compared to cross-wind direction.

Thanks to Steve McIntyre at Climate Audit and comments from Hu McCulloch at Climate Audit for quickly spotting this error and bringing it to our attention. I think this points out pretty well that accusations of cherry picking or playing favorites on Climate Audit aren’t reasonable. Problems get chopped up and spit out regardless of the source or meaning.

Mistakes are extremely common throughout science and engineering (some estimate 50 percent or more). The only thing to do is check one’s work, find them, admit them, fix them as soon as feasible. Even the fixes can be mistakes. Keep it up and eventually the final outcome will be a good one.

If one does that, then one loses no credibility in my eyes — rather respect.

I have a tough time respecting most politicians and some climate scientists.

Sinan, you forgot about the nose on one’s face. The “anomalies” subtract a different monthly constant from each month. Otherwise the annual cycle dominates the correlations and you obviously get huge correlations.

Mistakes are MUCH more likely to occur in favor of a result you “want” or believe to be the case in the first place. That’s human nature, and NOT an indication of improper motivation. The lesson of course is to be MORE critical of your own hypothesis. Kudos to the Jeffs, as it has always been with others in the CA crowd, for being very upfront in acknowledging the error and seeking to correct it: get it right, first and foremost. We see that that behavior is certainly not universal.

Exactly. While hiding mistakes may not be universal, confirmation bias probably is. It was certainly evident in the bloviation, chest pounding and piling on here after the initial publication of erroneous results. No need to beat Jeff and Jeff around the head and face (they are taking care of that job themselves). Still, I think we can lighten up on the self congratulations here. Science at the speed of blog is not an easy thing to do, and inevitably error will creep in. Keep in mind, however, that the world is watching what goes on here, and with that attention comes a responsibility to get it right. Steve, Ross, Spencer & Christy can attest that even a small error of a switched sign will become the bloody shirt of AGW agitprop. In the immortal words of the Sergeant on Hill Street Blues (god, I am old), “BE CAREFUL OUT THERE!”

Thanks, Jeffs — I had a hunch that you had not removed the seasonal means before running the correlations, so that you were mostly picking up the high correlations of the annual season cycle.

The corrected data shows the classic exponential decay of the correlation with distance assumed by Kriging. The high correlations at short distances mean that the values are a smooth function of location. So this raises the question again, of why Steig bothered to reduce the data to 3 PCs? Was it because the complete data set would choke up RegEM?

Perhaps we could have a new thread devoted to efforts to plug this file into RegEM together with the surface data, to see if anything like Steig’s recon comes out, and why it looks like it does. I’m not working on this but I assume Steve and others are.

This should be easy, in comparison to your efforts to reconstruct the new Steig file from the NSIDC raw data.

Now I don’t know if cloud masking is literally what it suggests, but if it is [infills for non-data under cloudy conditions], it DOES raise a point in which could lie a big source of errors. This is the posited overall warming effect of cloud cover over permanent icefields. If Svensmark’s hypothesis is correct, from about 1957 to 1977 there would have been more cloud cover over the planet which would have a cooling effect on the rest of the planet but a warming effect over Antarctica. Now if the cloudy records are omitted, the earlier Antarctica records might register too cold: thus a false amount of warming could appear to have happened from that time.

The satellite data is a measure of land skin temperature. Skin temperature may be sensitive to slight changes in windspeed. The higher the windspeed, the greater the mixing of air, even in the lowest few cm. The greater the mixing, the warmer the skin temperature even though the total air-column temperature may be unchanged. Something to ponder a bit.

Very informative (not to mention cool) use of color graphics, Luboš! Your graphs show that there is indeed a very disturbing difference between the correlation structure in the new cloudmaskedAVHRR.txt, versus the derived ant_recon.txt. This difference doesn’t show up well in monochrome plots like Steve’s #1 and 3 in the post above, except for a tendency of the ant_recon.txt correlations to stick to unity.

Your graphs show that a very useful way to summarize this mass of numbers would be with 3 lines, indicating the median and quartiles of the distribution of correlations at each distance. The mean is not as useful as the median here, since the distribution of correlations bounded by -1 and +1 can’t be symmetrical. The median is less sensitive to this asymmetry. The quartiles will then give a good sense of how well the median is doing as a summary of the relationship.

In order to compute the median and quartiles, you have to bin the distances somehow. For your purpose, 50km is fine, since that’s the resolution of the data, and you have a bazillion pairs in each bin. For comparison to the surface correlations as in Steve’s plot #2 above, however, bigger bins like 100 km might be necessary in order to have a good number (like the square root of the total) in each bin.
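A rough sketch of the binning described above, assuming the correlations are already available as (distance in km, r) pairs. The function name, the default 50 km bin width and the choice of the "inclusive" quantile method are illustrative assumptions, not anything taken from the scripts used for the plots.

```python
from collections import defaultdict
from statistics import quantiles

def binned_quartiles(pairs, bin_km=50.0, min_count=2):
    """Summarize correlations by distance bin.

    pairs: iterable of (distance_km, r)
    Returns {bin_start_km: (lower_quartile, median, upper_quartile)}.
    Bins with fewer than min_count points are skipped.
    """
    bins = defaultdict(list)
    for d, r in pairs:
        bins[int(d // bin_km) * bin_km].append(r)

    summary = {}
    for start in sorted(bins):
        values = bins[start]
        if len(values) >= min_count:
            q1, med, q3 = quantiles(values, n=4, method="inclusive")
            summary[start] = (q1, med, q3)
    return summary
```

Plotting the three values against the bin start gives the three lines. For sparse surface-station data, passing a larger `bin_km` such as 100 and a larger `min_count` would keep thinly populated bins from producing noisy quartiles.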

You have randomly sampled the pairs in order to reduce the number of correlations from 5509 × 5509 ≈ 30M down to a more manageable 0.5M. It surely doesn’t matter how this was done, but just to make the calculation more easily replicable (aka auditable), it might be better just to use only every k-th column of the matrix, starting with say column 1. Then anyone can double check that the starting column doesn’t matter. k = 5 or 6 would reduce the number of correlations by a factor of 25 or 36. These easily constructed sub-samples would be very representative of the full set of correlations.
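The arithmetic behind the every-k-th-column suggestion can be checked with a few lines. This is only a sketch: the helper names are made up, and the column index is 0-based here, so "column 1" in the comment corresponds to `start=0`.

```python
def kth_columns(n_cells, k, start=0):
    """Indices of every k-th column, beginning at `start` (0-based)."""
    return list(range(start, n_cells, k))

def n_unique_pairs(n):
    """Number of distinct unordered pairs among n columns."""
    return n * (n - 1) // 2

N_CELLS = 5509  # number of gridcells quoted in the comment above

for k in (5, 6):
    cols = kth_columns(N_CELLS, k)
    reduction = n_unique_pairs(N_CELLS) / n_unique_pairs(len(cols))
    print(f"k={k}: {len(cols)} columns, pairs reduced by a factor of about {reduction:.1f}")
```

Changing `start` to any value below `k` gives a different, equally deterministic sub-sample, which is what makes the check on the starting column easy to repeat.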

RE #51, While I was typing, Steve moved Luboš’s excellent graphs to the top of this thread.

It would be of interest also to construct Lubograms (and/or median plots) using only the first 1, 3, 5, 10, and 100 of the principal components of the anomalized cloudmaskedAVHRR.txt matrix. This would give some insight into whether the very different behavior of ant_recon.txt is due to its calibration to the surface data in RegEM, or if it’s just caused by its reduced matrix rank.

Although it has been suggested that such interpolation is unreliable owing to the distances involved [Turner et al 2005], large spatial scales are not inherently problematic if there is high spatial coherence, as is the case in continental Antarctica [Schneider et al 2004].

I see negligible evidence from Antarctic spatial decorrelation for higher “spatial coherence” in Antarctica, as opposed to say Australia. The rank-3 version clearly builds in a lot of spurious correlation.

At the end of the day, all that exists for the early portion is the 13-15 stations and it sure seems like they’re spending a lot of energy on weird multivariate methods without spending enough time on data QC.

Dear Hu, that’s a good idea. I must do some shopping now but when I return, it should be enough to add a few lines of code saying

“Truncate the 300 x 5509 numbers to the first N PCs”

and recalculate the graphs with them. Well, one may expect what will happen. They will start with an amplified level of Steig’s inaccuracy – 1 PC will have 100% correlation of everything 🙂 – and then they will slowly descend through Steig’s picture to the correct picture of the full reality. I shouldn’t have made this prediction because it has substantially reduced my eagerness to actually do it. 🙂
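The truncation step could look roughly like the following NumPy sketch. It is run on a small random matrix standing in for the 300 x 5509 anomaly matrix, and it is not the code actually used for the graphs; the function name is hypothetical and plain SVD is assumed as the way to get the PCs.

```python
import numpy as np

def truncate_to_pcs(anoms, n_pcs):
    """Rank-n_pcs approximation of a (time x cell) matrix via SVD.

    Column means are removed first, so the result is in centered anomalies.
    """
    centered = anoms - anoms.mean(axis=0)
    u, s, vt = np.linalg.svd(centered, full_matrices=False)
    return (u[:, :n_pcs] * s[:n_pcs]) @ vt[:n_pcs, :]

rng = np.random.default_rng(0)
anoms = rng.standard_normal((300, 40))  # stand-in data, NOT the real AVHRR matrix

# With a single PC every cell is a multiple of the same time series,
# so every cell-to-cell correlation is exactly +1 or -1.
rank1 = truncate_to_pcs(anoms, 1)
corr_rank1 = np.corrcoef(rank1, rowvar=False)

# Keeping more PCs lets the correlation matrix relax toward the full-data one.
corr_rank3 = np.corrcoef(truncate_to_pcs(anoms, 3), rowvar=False)
corr_full = np.corrcoef(anoms, rowvar=False)
```

One detail worth noting: with 1 PC the correlations are all of magnitude 1, but the sign depends on the sign of each cell's loading, so some pairs come out at -1 rather than +1.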

seems like they’re spending a lot of energy on weird multivariate methods without spending enough time on data QC

I know it’s a common theme but in this case I think we’ll eventually have a different reconstruction (maybe tonight or tomorrow) with a different trend that has a high degree of high frequency correlation to local stations and an equally randomized low frequency trend.

Re: Jeff Id (#56), one of the aspects of Steig et al that’s not been covered so far are the changes to AVHRR trend in the new iteration of Comiso data. Perhaps the prior iterations match the surface stations as well as the new iteration – this needs to be examined. To do this, one will need all the AVHRR versions plotted in the Steig SI and we’ve only seen the most recent.

There has been a good deal of discussion on this thread regarding the correlation between temperatures at various locations throughout Antarctica. Several people have looked at the relationship between correlation and distance by creating graphs linking the two. IMO, one of the difficulties in interpreting these is that they are affected by a variety of factors, including the shape and topography of the continent and by the fact that the place is completely surrounded by a large pool of water.

I think that it is informative to pick several locations and to see how the AVHRR series at that location is related to all other locations. I selected two points: the tip of the peninsula (Steig series 1) and the obvious interior point: the South Pole (grid point 1970 is the closest).

For a selected site, after calculating the 5509 correlations, we graph them using a color scale to represent the correlation (as usual red is positive and blue is negative and white areas represent zero correlation). The location of the grid site is represented by a green +. Keep in mind that these are correlations measuring relationships between temperatures at the grid point, not positive or negative trends.
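The calculation behind such a map is one correlation per gridcell against the chosen reference cell. A minimal NumPy sketch follows, assuming the data are held as a time-by-cell matrix; the function name and layout are illustrative and this is not the R script linked further down.

```python
import numpy as np

def correlations_with_reference(anoms, ref_index):
    """Pearson r between column `ref_index` and every column of `anoms`.

    anoms: 2-D array, rows are time steps and columns are gridcells.
    Returns a 1-D array with one correlation per gridcell.
    """
    centered = anoms - anoms.mean(axis=0)
    norms = np.sqrt((centered ** 2).sum(axis=0))
    ref = centered[:, ref_index]
    return (centered.T @ ref) / (norms * norms[ref_index])
```

Each value in the returned array would then be drawn at its gridcell's position using the color scale, with the reference cell itself always coming out at exactly 1.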

First, we take the latest revelation from Steig, the cloud masked AVHRR data. The grid point is on the tip of the peninsula.

Several things stand out in the graph. Obviously, the region immediately adjacent to the grid point is strongly correlated, but what is somewhat surprising is that the correlation drops off fairly quickly, becoming negative while still in the Western Antarctic area. The relatively low correlation continues to the rest of the continent.

Next, we take the original reconstruction: ant_recon.txt. This was supposedly reconstructed from the previous data using RegEM and the manned surface stations:

The correlation has strengthened dramatically in the Western Antarctic so that now the pattern exhibited by the reconstruction at the tip of the peninsula seems to be reflected by the entire west. As well, the Eastern portion has now become more strongly inversely correlated with the peninsula.

I have also looked at the two other reconstructions (detrended and PCA) created by Steig, as well as looking at the South Pole and how its temperatures correlate with the rest of the grid points. These can be found at my statpad site. The R script can be found in a Word document here.

If I understand what Roman has done here (need to read it twice –my problem, not his), I think it really cuts to the matter of the TIR reconstruction producing a West Antarctica that takes “heat” from the Peninsula.

I may have read too much into the intentions of the authors of Steig et al., but a major premise seemed to me to be that the well-established Peninsula warming had spread to Western Antarctica (and by inference will continue to spread to mainland Antarctica with all the problematic issues that could raise for a stable Antarctica ice shelf). Continuing this analysis of how sensitive that premise is to the methodology used is certainly in order.

The AWS reconstruction does not show the warming of the West Antarctica that the TIR reconstruction does.

You won, Steve! It’s pretty, the code is short (the proper labels on axes would be difficult for me) – I just don’t understand why you chose the places with densities below 50 to be almost invisible (black). Your colors make it look like there are no distances above 4,000 km and no correlations below 0.1. 😉

Choosing colors can be time consuming. I had a nice set of 14 colors on hand which was a little short and so I agree that I made too much black, but I wanted to get something out. I’ll tidy my list of colors some time.

Dear MrPete, in the density plots, the colors (e.g. green, yellow) are densities. On the other hand, Roman’s plots use colors as correlation, and it is pretty natural to have the “TemperatureMap” (blue-red: blue is cool, red is warm) color function.

Your traffic light color function seems a bit hard to remember. You say that green is go, yellow is cautious, red is stop. Well, red is surely stop, but I would be cautious with the greens, too. 😉 Before you invent a new color scheme, check whether it has already been invented e.g. here:

Suggestion for use of color: People tend to consider Green=go, Yellow=caution, Red=stop. Might use Green=+1, Red=-1, Yellow=0.

I did something like what you suggest in the following graph of Steig ant_recon.txt means in a comment on the Steig Eigenvectors and Chladni Patterns #2 thread:

On a scale of -6 to +6, 0 is yellow, +4 is red, and -4 is blue. Green is at -2 and orange at +2. -6 is mock-ultraviolet, +6 is mock-infrared, and the other integers are fudged in non-linearly to keep yellow prominent. These 13 key colors are then interpolated to create a 121-color MATLAB colormap using colormap colors. For 13 bands, use colormap colors(1) instead.

Re: Navy Bob (#79), I did hate dot matrix printers but they were good for one thing: If a scatter-plot contained multiple observations at a given coordinate, the markers got darker and darker each time a point was printed whereas lone observations remained faint. The effect was especially supple when using worn out ribbons.

I *hate* color in graphs unless the color is associated with a well defined physical meaning. On the other hand, when correlation coefficients are considered one simply has to use color. Just white + two cool colors would be my preference (I am speaking with no expertise in the topic at all other than thinking about what would help me actually understand the graphs posted here. They look very pretty but they are also confusing to me.)

Re: MrPete (#83), in this particular case, we are talking about the correlation in temperature anomalies between two points by distance to a reference station (as far as I can understand). So, if the association between anomalies at two locations is linear, -1 means the temperature anomaly at one location is smaller than average every time the temperature anomaly at the other location is greater than average. Vice versa for r = 1 between two locations.

A perfect correlation of +1 is what you should get if you place two thermometers in the same well-stirred bucket of water and then read them both on various days. The readings don’t have to be identical — one can be in C and the other in F, but they must be a linear transform of one another.

A very high correlation is what you should get if you have two thermometers near one another that are measuring more or less the same thing, with minor differences due to siting or flaws in the thermometers.

A perfect negative correlation is hard to obtain except as a statistical artifact. If you have two thermometers whose readings are uncorrelated, but then measure them each day as differences from their average, the one will always be exactly as high above the average as the other is below the average, and the correlation will be -1, whether they are both measured in C or one in C and the other in F. This is why in the other thread I thought maybe Roman had subtracted out row averages as well as seasonal column averages before he computed his correlations. (He didn’t).

A less than perfect negative correlation would arise at a monthly frequency between two sites on opposite sides of the tropics, if they didn’t have their seasonal means subtracted. If they did, a weak negative correlation between site anomalies would still be possible, but it would take an unusual combination of ocean currents or some such.

In Luboš’s first graph above, the correlations are all positive at first, but then die out toward zero on average by 5000km. But then just by chance half are positive and half negative. These weak correlations are meaningless.

The cluster of negative correlations at around 5000km in Jeff Id’s graph of station correlations (the third plot in the post) may be meaningful, and probably has something to do with coastal currents affecting coastal stations on opposite sides of the continent differently. But generally you’d expect distant correlations between site anomalies to die toward zero.

Thanks! OK, now I’m not flying blind. (Although I am flying seat of the pants. All of my reference tools and past work are packed away somewhere… but hey, this is live interaction with ability to be wrong at any point 🙂 )

What I heard, in layman’s and then possibly visual terms:

+1 == perfect teleconnection or perfect match. Both readings equally valid to represent measurement at the base site.
0 == random (noise). No connection at all.
-1 == perfect anti-match. For temp, this should be opposite seasonality or ???

Or…

+1 == Meaningful connection
0 == Meaningless connection
-1 == Meaningful (confusing in this case!) inverse connection
And one more item that almost always needs to be represented: missing data.

Is that about right?

So, for this one, here’s what I would try, based on now being a bit better informed

a) Set missing data to dark gray or black. Nice contrast and allows the real data to “pop”
b) Set meaningless = white
c) Set meaningful = something like #204896 , a deep color that blends well all the way to white, and that usually represents strength, etc. (This is a “steel blue”/grayed blue. It is not usually a “temperature”)
d) Set anti-meaningful to a complementary color but easily distinguished, perhaps #482096
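As a rough sketch of the scheme in items a) to d), the function below maps a correlation to a hex color by blending linearly from white at 0 to the two suggested end colors at +1 and -1. The two end colors are the ones named above; the dark gray for missing data and the function name are assumptions, and like the scheme itself this is untested against real plots.

```python
MISSING = "#404040"            # dark gray for missing data (illustrative choice)
WHITE = (255, 255, 255)        # r = 0
POSITIVE = (0x20, 0x48, 0x96)  # #204896, r = +1
NEGATIVE = (0x48, 0x20, 0x96)  # #482096, r = -1

def correlation_color(r):
    """Map a correlation in [-1, 1] (or None for missing) to a hex color."""
    if r is None:
        return MISSING
    r = max(-1.0, min(1.0, r))          # clamp values slightly out of range
    end = POSITIVE if r >= 0 else NEGATIVE
    t = abs(r)                          # 0 at white, 1 at the end color
    rgb = [round(w + (e - w) * t) for w, e in zip(WHITE, end)]
    return "#{:02x}{:02x}{:02x}".format(*rgb)
```

For example, `correlation_color(0.5)` lands halfway between white and #204896. Because both end colors share the same blue channel, the two ramps differ only in red and green, which is worth checking by eye before relying on it.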

You could pick any of a number of color pairs. It can be tricky to choose pairs that work well and that don’t immediately imply a different message.

This is quite similar to what RomanM did, with specific modifications. (And this is completely untested; I’m not set up for graphing right now. Who knows, those colors might look Really Ugly when actually used! 🙂 )

And unfortunately, schemes such as provided by Mathematica give very few compatible options. All of their two-color schemes with white in the middle are either explicitly temperature mappings, or go to black at the extremes. I don’t want +1 and -1 to look the same at all.

As Steve M said, it really does take some consideration to develop a useful color scheme. And I’m probably not there yet on this one.

I beg of you and everyone else who might be lurking/watching not to ascribe meaning to a correlation coefficient based on its value any more than “two variables move in the same direction versus two variables move in opposite directions”. The closer |r| is to 1, the closer the points are to a line but that does not necessarily mean there is a linear association between the variables.

The closer |r| is to 1, the stronger is the association between them, but the association need not be meaningful.

The value of the correlation coefficient does not tell you if the relationship is linear or meaningful. Something other than the value of r tells you if an observed correlation is meaningful.

Sinan, I believe in the Steig et al. case the spatial correlations are being used in the reconstruction process and the question that arises, in this layperson’s mind anyway, is whether the effects of one region’s temperature anomalies on another region are real or an artifact of the process. In this case, the correlations are being addressed not so much as an indication of any relationship, but as an indicator of how the Steig reconstruction process works.

Lubos Motl above comments on the “natural” thermodynamic tendency of one region’s temperature (changes) affecting a nearby region. I would guess that that argument would be in good agreement with what Steig et al. have perhaps anticipated and then concluded about the effect of a “hot” Antarctic Peninsula on the adjacent West Antarctica region with their TIR reconstruction.

Since, in my view, over recent decades I see that effect in the TIR reconstruction and not in the AWS reconstruction or with raw temperature data, I have to wonder how much of what we see in the TIR reconstruction is an artifact of the methods used in the reconstruction.

The other question that arises out of this discussion is whether a warming in the Peninsula and cooling or no trend in East and West Antarctica in recent decades would be unique to that region of the globe – and without visualizing a major yet to be discovered geothermal event under the Peninsula. I see a cooling in the SE US with warming elsewhere in the US. On a localized level in Illinois in the US, I see long term cooling and warming with large differences in magnitude from various stations around the state. Perhaps the station data are not valid or the physics on a local level are different than those on a more regional level.

Another interesting point is that the area of a hot Antarctic Peninsula is only a few percent of the total area of Antarctica.

Just to add another complication to your task, somewhere between 5 and 10 percent of us males are color blind to some degree. For those of us who are color vision challenged, your choice of complementary colors may not work. As far as I can tell, the two color blocks you show are the same. Likewise I am usually unable to interpret spaghetti graphs. In the days of hand plotted curves, the use of various geometric shapes to represent the data points allowed interpretation of multiple black curves.

Dear MrPete, the correlation coefficient between +1 and -1 is computed from datasets {x_i, y_i} (for example, x_i are temperatures at one place at time i, while y_i are those at the other place), as follows:

Subtract the average of x from x_i, and the average of y from y_i. Now, the average (or sum) of x_i is zero, much like for y_i.

Now, calculate the sum of x_i.y_i. If x_i and y_i change independently, this should be close to the sum of x_i, times the sum of y_i, which would be zero. However, if they don’t change independently, the sum of x_i.y_i will be nonzero.

But x_i.y_i have “units” – characteristic units of “x” times units of “y”. You need a unitless result between -1 and +1. So you must divide it by something with units of “x” and something with units of “y”. These two factors are the Euclidean lengths of the vector “x” and the vector “y”, according to the Pythagorean theorem (the square root of the sum of squares of x_i, and similarly for y_i).

If you imagine x_i and y_i are coordinates in an N-dimensional Euclidean space, the correlation coefficient is the scalar/inner product of the vectors x,y, divided by the length of x and length of y. This ratio is nothing else than cos(angle) where the angle is one between the vectors x,y.

If they’re pointing in the same direction, the angle is zero, and the correlation coefficient is +1. For opposite directions, it is -1. If they’re orthogonal, it’s zero.
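
As a minimal sketch, the recipe above translates almost line by line into Python (the function name and the small test vectors are only for illustration):

```python
import math

def correlation(x, y):
    """Correlation coefficient as the cosine of the angle between centered vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Step 1: subtract the averages, so each vector now sums to zero.
    dx = [a - mean_x for a in x]
    dy = [b - mean_y for b in y]
    # Step 2: the sum of x_i.y_i, i.e. the inner product of the two vectors.
    dot = sum(a * b for a, b in zip(dx, dy))
    # Step 3: divide by the Euclidean lengths to get a unitless number.
    length_x = math.sqrt(sum(a * a for a in dx))
    length_y = math.sqrt(sum(b * b for b in dy))
    return dot / (length_x * length_y)

x = [1.0, 2.0, 3.0, 4.0]
print(correlation(x, [2.0, 4.0, 6.0, 8.0]))    # same direction
print(correlation(x, [8.0, 6.0, 4.0, 2.0]))    # opposite direction
print(correlation(x, [1.0, -1.0, -1.0, 1.0]))  # orthogonal after centering
```

The three calls correspond to the three cases just described: vectors pointing the same way, pointing opposite ways, and orthogonal. Note that the function fails with a division by zero if either series is constant, since a zero-length vector has no direction.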

The discussion about color schemes is fun. I was just developing a color scheme that covers a lot of the RGB space (by “spirals”) but still allows you to distinguish everything. But for practical purposes, the discrete color schemes with jumps may be helpful, too.

When analyzing maps via Mathematica, I was irritated by a hidden detail about the ENSO maps:

The colors that appear on the map actually don’t precisely (RGB bytes) agree with any of the colors on the scale on the bottom (although they’re somewhat close to interpolations of the colors at the bottom). So irritating. So one can’t reverse-engineer the temperature anomalies of the ocean, not even the right intervals.

Luboš, it is likely the mismatch is due to the fact that the original information should be in 24-bit color, and then was reduced to 8-bit (palettized) when output to the GIF format, which only supports 8-bit color.

Ahhh! The +1 correlation question to my #1 -1 correlation question. I expect the answer is “white”! That is, proper answer is still hanging out there and will need more analysis to be determined. But I’m betting on the negative side since we know the stations themselves aren’t wildly warming. But it isn’t yet known exactly how Steig got his results.

RE #51,52, 65,
A very good use of shading and/or color here would be to illustrate the relative frequency distribution of correlations at each distance rather than their absolute number. That is, rather than using, say, a yellow-to-green scale to represent the absolute number of correlations as in Lubos’s original plots or (with a different scheme) in Steve’s version (#65), draw a median line, and then add several interquantile ranges (or even a continuum of interquantile ranges) using your color/shading scale. For example, the 50% range (i.e., the interquartile range) could be in, say, yellow, the 90% range (5% – 95% interval) in yellow-green, the 95% range in green, and the 99% range in pale green. This would be a lot more informative than just 3 lines, and a lot less clutter than a separate line for each quantile.
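
A rough sketch of the bookkeeping behind such a plot, in plain Python: bin the (distance, correlation) pairs by distance, then compute the median and the central 50%, 90%, 95% and 99% ranges in each bin. The 500 km bin width, the function names and the linear-interpolation quantile rule are all assumptions made for illustration; the shading itself is left to whatever plotting package is at hand.

```python
def quantile(sorted_vals, q):
    """Quantile (0 <= q <= 1) of an already sorted list, by linear interpolation."""
    pos = q * (len(sorted_vals) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] + frac * (sorted_vals[hi] - sorted_vals[lo])

def quantile_bands(pairs, bin_width_km=500.0, levels=(0.50, 0.90, 0.95, 0.99)):
    """Summarize correlations by distance bin.

    pairs: iterable of (distance_km, correlation).
    Returns {bin_start_km: {"median": m, "bands": {level: (low, high)}}}.
    """
    bins = {}
    for distance, r in pairs:
        start = int(distance // bin_width_km) * bin_width_km
        bins.setdefault(start, []).append(r)
    summary = {}
    for start in sorted(bins):
        vals = sorted(bins[start])
        bands = {}
        for level in levels:
            tail = (1.0 - level) / 2.0   # e.g. 0.05 on each side for the 90% range
            bands[level] = (quantile(vals, tail), quantile(vals, 1.0 - tail))
        summary[start] = {"median": quantile(vals, 0.5), "bands": bands}
    return summary

# Tiny made-up example: five pairs in the first bin, one in the second.
example = [(100.0, r) for r in (0.0, 0.25, 0.5, 0.75, 1.0)] + [(700.0, 0.2)]
for start, stats in quantile_bands(example).items():
    print(start, stats["median"], stats["bands"][0.50])
```

Each band would then be drawn as a filled region between its low and high curves, widest (99%) first and narrowest (50%) last, with the median line on top.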

RE MrPete #87, what scale of color gives 204896 for steel blue? Is this a hexadecimal code that just happens to have no digits above 9? MATLAB uses 0-1 scales for each color rather than 0-255, with the result that even if you try to hit a color on the head, rounding error could easily put you off by 1 click. This could be the source of the problem Luboš encountered in #89 above and that jeez mentions in #90, since I suspect most of the official maps are done in MATLAB.

BTW, Pete, is it just my imagination, or have you been laying low this past year since the famous Starbucks expedition? If so, welcome back!

Hu,
I’ve been laying low both before, during and after. Real Life is pretty distracting these days 🙂
The #rrggbb color spec is how you do color in HTML. Each character pair is a hex code; they just happened to all be in the 0-9 range in this example.
Would it mean anything to combine RomanM’s correlation plots with Luboš’ frequency (density) data? Seems that frequency is simply the number of identical points across the space we’re examining.

Sinan, that’s a good reminder. Although… while we know correlation doesn’t imply causation, doesn’t correlation generally imply *some* kind of meaning or significance? Even if we don’t know what it is?

Just to add another complication to your task, somewhere between 5 and 10 percent of us males are color blind to some degree. For those of us who are color vision challenged, your choice of complementary colors may not work. As far as I can tell, the two color blocks you show are the same. Likewise I am usually unable to interpret spaghetti graphs. In the days of hand plotted curves, the use of various geometric shapes to represent the data points allowed interpretation of multiple black curves.

Wow. Since the colors are shades of blue and purple, either we’re just dealing with too-small color patches, or badly adjusted monitors, or readers who are among the 1% of men who cannot see red colors. Various forms of this colorblindness exist…

Bigger patches (using the same colors as above):
This is the bluish patch, and
This is the purplish patch.

It’s pretty tough to compensate for colorblindness when selecting colors. Apparently, for people with this colorblindness, the rainbow [image: full-color rainbow] looks something like this: [image: the same rainbow as seen with this colorblindness]

As you (with full color vision) can see, there are really only two colors left, in different shades. Ouch!

Re: MrPete (#102), Thanks Pete – the rainbow looks ok to me so visit to the optician can wait 🙂

With the choice of blueish and purplish though they still look very close to me, but maybe it would look different if they were opposite ends of a scale going through white in the middle (I think this is what you are suggesting?). Maybe my monitor is not great either, because I second Stu’s comments re: spaghetti graphs, where many lines are often represented with only minor colour gradations between them.

Re: MrPete (#102),
MrPete,
Like curious, I see little difference between the larger bluish and purple swatches. On the other hand, my rainbow looks more like your first illustration than your second one. I am not sure there is a solution that will work for all, but I keep looking for a method by which I can extract data from color data plots.

If you use a graphics program like PhotoShop, Corel PhotoPaint, Photo Impact, etc., you can select a colour properties window. By dragging the mouse over the map you can get numbers for RGB and CMYK schemes. You can also add transparency to colours with this type of package, for making maps. For colour blindness tests (more than 10% of men affected), try a Google search on “Ishihara”, such as http://www.toledo-bend.com/colorblind/Ishihara.asp

A variable that is not being mentioned is the monitor that the color is being displayed on. I’m using an old, carbon belching CRT and I notice that many pictures I find on the internet seem dark. I also notice that such pictures do not look dark when displayed on an LCD monitor. Perhaps that is what is going on here.

Re: frost (#106),
That’s quite possibly part of the answer. Note too, the color I provided will be the dark extreme (+1, -1) and most data will be much lighter.

I picked adjacent colors on purpose, because the +1/-1 meanings are more similar than opposite in this case (vs temperature, where cooling and warming are opposite). (BTW, this entire discussion probably seems like picking nits to a casual observer… to me it’s a valuable interaction on how we represent information visually…)