Gavin and the PC Stories

How many principal components to retain? Recent readers of Climate Audit may not realize that this was an absolute battleground issue in MBH and Wahl and Ammann. In one sense, it was never resolved with MBH back in 2003-2005, but that was before the existence of blogs made it possible to focus attention on such problems. So I’m quite fascinated to see this play out under slightly different circumstances.

In the Steig et al case, as shown in the figure below, if one PC is used, there is a negative trend, and as more PCs are added, the trend appears to converge to zero. (I haven’t explored nuances of this calculation and merely present this graphic. Retaining 3 PCs, by a fortunate coincidence, happens to maximize the trend. [Note: These were done using a technique similar to but not identical to RegEM TTLS – a sort of Reg Truncated SVD, which converges a lot faster. Jeff C has now run RegEM with higher regpar, though not as high as shown here, and has got (these are 10 times my graphic, due to decades rather than years): 1: -0.07; 2: +0.15; 3: +0.15; 4: +0.16; 5: +0.12; 6: +0.10; 7: +0.11; 8: +0.11; 9: -0.08; 10: +0.05; 11: -0.02; 12: -0.05. So the max is not precisely at regpar=3 – there’s no need to raise eyebrows at 3 rather than 4. However, the trend does not “converge” to 0.15, but goes negative at higher regpars.])

Fig 1. Trends by retained PCs. (Using truncated SVD to approximate RegEM – this will be discussed some time.)
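For readers who want to experiment, the general flavor of truncated-SVD infilling can be sketched in a few lines. This is a minimal illustration of the idea only – not the code behind the figure, and RegEM TTLS proper is a regularized EM scheme rather than this exact loop; the function name is mine.

```python
import numpy as np

def svd_impute(X, k, n_iter=500, tol=1e-9):
    """Infill missing values (NaN) in X by iterating a rank-k truncated SVD.

    A minimal sketch of the general idea behind truncated-SVD infilling;
    the actual RegEM TTLS calculation is a regularized EM scheme, not this
    exact loop.
    """
    X = np.asarray(X, dtype=float)
    miss = np.isnan(X)
    filled = np.where(miss, np.nanmean(X, axis=0), X)   # start at column means
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :k] * s[:k]) @ Vt[:k]            # rank-k reconstruction
        new = np.where(miss, approx, X)                 # observed values stay fixed
        if np.max(np.abs(new - filled)) < tol:
            break
        filled = new
    return filled
```

The truncation parameter k plays the role of regpar here: how many modes are kept at each pass controls what the missing cells converge to, which is the whole dispute.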

As reported elsewhere, the eigenvectors corresponding to these PCs match very closely what one would expect from spatially autocorrelated series on an Antarctic-shaped disk – a phenomenon that was known in the literature a generation ago. The “physical” explanations provided by Steig et al appear to be flights of fancy – “castles in the clouds” was a phrase used by Buell in the 1970s to describe attempts at that time to attribute meaning to eigenvector patterns generated merely by the geometry. But that’s a different issue than PC retention.

Now let’s turn back the clock a little. As many readers know, some time ago, coauthor Bradley credited Mann with “originating new mathematical approaches that were crucial to identifying strong trends”. One of the “new mathematical approaches” was Mannian principal components (though it wasn’t really “principal components”). It had the unique ability of extracting hockey stick shaped series even from random data. It was that effective at identifying hockey sticks. Mannian principal components were criticized by Wegman and even by the NAS panel. However, in what I suppose is a show of solidarity against such intermeddling, third party paleoclimate use of Mann’s PC1 has actually increased since these reports (Hegerl, Juckes, Osborn and Briffa); Mann’s PC1 even occurs in one of the IPCC AR4 spaghetti graphs.

In the case of the critical North American tree ring network, Mann’s powerful data mining method worked differently than with random red noise. It found Graybill bristlecone data and formed the PC1 from this data. In the red noise case, the method aligned and inverted series to get to a HS-shaped PC1; in the NOAMER tree ring case, all it had to do was line up the Graybill bristlecones.
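The mining effect on red noise is easy to demonstrate. MBH98-style “short-centering” subtracts the calibration-period mean rather than the full-period mean, so any series with a late-period excursion acquires a large apparent variance and dominates the PC1. A toy illustration of my own construction (not Mann’s code), using an MM05-style blade-minus-shaft hockey-stick index:

```python
import numpy as np

def pc1(X, center_last=None):
    """First principal component of the columns of X.

    center_last=None: conventional full-period centering.
    center_last=m:    'short-centering' on the last m rows only --
                      the MBH98-style decentering.
    """
    if center_last is None:
        Xc = X - X.mean(axis=0)
    else:
        Xc = X - X[-center_last:].mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U[:, 0] * s[0]

def hs_index(v, blade=79):
    """Blade-minus-shaft measure: |mean of last `blade` points - overall
    mean| in standard-deviation units (after the MM05 hockey-stick index)."""
    return abs(v[-blade:].mean() - v.mean()) / v.std()

def red_noise(n, m, rho, rng):
    """m independent AR(1) series of length n with lag-one correlation rho."""
    X = np.empty((n, m))
    X[0] = rng.normal(size=m)
    for t in range(1, n):
        X[t] = rho * X[t - 1] + rng.normal(size=m)
    return X

# Pure red noise, no signal: 50 persistent series of length 581
rng = np.random.default_rng(2)
X = red_noise(581, 50, 0.9, rng)
```

Comparing `hs_index(pc1(X, 79))` against `hs_index(pc1(X))` over repeated draws shows the short-centered PC1 systematically producing the larger hockey-stick index, even though the input is signal-free noise.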

In MBH98, Mann retained 2 PCs for the AD1400 North American network in dispute. In 2003, no one even knew how many PCs Mann retained for the various network/step combinations and Mann refused to say. Trying to guess was complicated by untrue information (e.g. Mann’s “159 series”). In our 2003 Materials Complaint, we asked for a listing of retained PCs and this was provided in the July 2004 Corrigendum Supplementary Information. The only published principle for retaining tree ring PCs was this:

Certain densely sampled regional dendroclimatic data sets have been represented in the network by a smaller number of leading principal components (typically 3–11 depending on the spatial extent and size of the data set). This form of representation ensures a reasonably homogeneous spatial sampling in the multiproxy network (112 indicators back to 1820).

This makes no mention of Preisendorfer’s Rule N – a rule mentioned in connection with temperature principal components. Code archived for the House Committee in 2005 evidences use of Preisendorfer’s Rule N in connection with temperature PCs – but no code evidencing its use in connection with tree ring PCs was provided then. Nor, given the observed retentions (to be discussed below) does it seem possible that this rule was actually used.

In the absence of any reported principle for deciding how many tree ring PCs to retain, for our emulation of MBH98, prior to the Corrigendum SI, we guessed as best we could, using strands of information from here and there, and after July 2004, used the retentions in the Corrigendum SI. For the AD1400 NOAMER network, we had used 2 PCs (the number used in MBH98 for this network/step combination) right from the outset. However, if you used the default settings of a standard principal components algorithm (covariance matrix), the bristlecones got demoted to the PC4. Instead of contributing 38% of the variance, they yielded less than 8% of the variance. (Using a correlation matrix, they only got demoted to the PC2 – something that we reported in passing in MM2005-EE, but which others paid a lot of attention to later in the piece.)

Using the retention schedule of the Corrigendum SI (2 PCs), this meant that two NOAMER covariance PCs were retained – neither of which was imprinted by the bristlecones. So in the subsequent regression phase, there wasn’t a HS to grab and results were very different than MBH. Mann quickly noticed that the bristlecones were in the PC4 and mentioned this in his Nature reply and in his December 2004 realclimate post trying to preempt our still unpublished 2005 articles (where we specifically report this phenomenon). We cited a realclimate post in MM2005-EE, by the way.

In the regression phase as carried out in MBH98, it didn’t matter whether the bristlecones got in through the PC4 or the PC1, as long as they got in. In his 2nd Nature reply, Mann argued that application of Preisendorfer’s Rule N to the NOAMER AD1400 network entitled him to revise the number of retained PCs – the “right” number of PCs to retain was now said to be 5. The argument originally presented in his 2nd Nature reply became a realclimate post in late 2004.

In a couple of the earliest CA posts, Was Preisendorfer’s Rule N Used in MBH98 Tree Ring Networks? (see also here), I replicated Mann’s calculation for the North American AD1400 network and then tested other network/calculation step combinations to see if the observed PC retention counts could be generated by Rule N applied to that network. It was impossible to replicate observed counts using this alleged rule. Some differences were extreme – it was impossible to see how Rule N could result in 9 retained PCs for the AD1750 Stahle/SWM network and 1 retained PC for the AD1450 Vaganov network.

Not that the “community” had the faintest interest in whether any descriptions of MBH methodology were true or not. However my guess is that some present readers who are scratching their heads at Gavin’s “explanations” of retained PCs in Steig et al will be able to appreciate the absurdity of Mann’s claims to have made a retention schedule using Rule N. I have no idea how it was actually made – but however it was done, it wasn’t done using the procedure that supposedly rationalized going down to the PC4 in the NOAMER network.

Some of Gavin’s new comments seem to be only loosely connected with the actual record. He said:

Schneider et al (2004) looked much more closely at how many eigenmodes can be usefully extracted from the data and how much of the variance they explain. Their answer was 3 or possibly 4. That’s just how it works out.

Applying PCA to the covariance matrix of monthly TIR anomalies covering the Antarctic continent results in two modes with distinct eigenvalues that meet the separation criteria of North et al. (1982). The leading mode explains 52% of the variance in TIR, while the second mode accounts for 9% of the variance. The first EOF, shown in Fig. 1a as a regression of TIR anomaly data onto the first normalized principal component (TIR-PC1, Fig. 1b) is associated most strongly with the high plateau of East Antarctica. Locally, high correlations in East Antarctica indicate that up to 80% of the variance in TIR can be explained by this first mode, as determined by r2 values. More moderate correlation of the same sign occurs over West Antarctica.

The second EOF (Fig. 1c) is centered on the Ross Ice Shelf and on the Marie Byrd Land region of the continent, where 40-60% of the TIR variance is explained. Most of West Antarctica is of the same sign, but the pattern changes sign over the Ronne-Filchner ice shelf (at 60°W) and most of East Antarctica. Some coastal areas near 120°E have the same sign as West Antarctica. Only a small fraction of the variance in East Antarctic temperatures can be explained by mode 2.

The two EOFs are illustrated in Schneider et al Figure 1 and look virtually identical to the eigenvectors that I plotted (from the PC analysis of the AVHRR data) as shown below (you need to mentally change blue to red for the PC1 – the sign doesn’t “matter”).

From Schneider et al 2004 Figure 1.

“Two modes with distinct eigenvalues that meet the separation criteria of North et al. (1982)” doesn’t mean the same thing to me as

3 or possibly 4. That’s just how it works out

I guess you have to be a to understand this equivalence. Gavin also says in reply to Ryan O:

Since we are interested in the robust features of the spatial correlation, you don’t want to include too many PCs or eigenmodes (each with ever more localised structures) since you will be including features that are very dependent on individual (and possibly suspect) records

Unless, of course, they are bristlecones.

Lest we forget, Wahl and Ammann had their own rationalization for the bristlecones. They argued that if you keep adding PCs until you include the bristlecones, the results “converge” – an interesting argument to keep in mind, given the apparent “convergence” of Steig results to no trend as more PCs are added. Wahl and Ammann:

When two or three PCs are used, the resulting reconstructions (represented by scenario 5d, the pink (1400–1449) and green (1450–1499) curve in Figure 3) are highly similar (supplemental information). As reported below, these reconstructions are functionally equivalent to reconstructions in which the bristlecone/foxtail pine records are directly excluded (cf. pink/blue curve for scenarios 6a/b in Figure 4). When four or five PCs are used, the resulting reconstructions (represented by scenario 5c, within the thick blue range in Figure 3) are virtually indistinguishable (supplemental information) and are very similar to scenario 5b. The convergence of results obtained using four or five PCs, coupled with the closeness of 5c to 5b, indicates that information relevant to the global eigenvector patterns being reconstructed is no longer added by higher-order PCs beyond the level necessary to capture the temporal information structure of the data (four PCs using unstandardized data, or two PCs using standardized data).

The Wahl and Ammann strategy was condemned by Wegman as “having no statistical integrity”. Wegman:

Wahl and Ammann [argue] that if one adds enough principal components back into the proxy, one obtains the hockey stick shape again. This is precisely the point of contention…

A cardinal rule of statistical inference is that the method of analysis must be decided before looking at the data. The rules and strategy of analysis cannot be changed in order to obtain the desired result. Such a strategy carries no statistical integrity and cannot be used as a basis for drawing sound inferential conclusions.

Again this proved not to be an incidental issue. Wahl and Ammann’s summary of the procedure – the one said by Wegman to have “no statistical integrity” – said:

“when the full information in the proxy data is represented by the PC series [i.e. enough to get the bristlecones in], the impact of PC calculation methods on climate reconstruction in the MBH method is extremely small… a slight modification to the original Mann et al. reconstruction is justifiable for the first half of the 15th century (∼+0.05–0.10◦), which leaves entirely unaltered the primary conclusion of Mann et al.”…

It was this conclusion that was adopted by the IPCC as the last word on the entire episode:

The McIntyre and McKitrick 2005a,b criticism [relating to the extraction of the dominant modes of variability present in a network of western North American tree ring chronologies, using Principal Components Analysis] may have some theoretical foundation, but Wahl and Amman (2006) also show that the impact on the amplitude of the final reconstruction is very small (~0.05°C).

So however trivial these matters may seem, there’s lots of fairly intricate debate in the background. I, for one, welcome the entry of another data set.

96 Comments

How many PCs to retain? Whatever it takes to get the trend – the thing that is presumed to be “robust” and the pattern to which all observations “converge” given enough observation time and spatial replication. Notice the circular logic. “Robustness” and “convergence” – which ought to be objectively defined – are defined by whether or not a particular pre-determined pattern (hockey stick) is being produced. No trend = no robustness, no convergence. QED.

I found a useful chapter on EOF’s as used in meteorology here. It told me, first, what the “separation criteria” involve. It’s just separation of the eigenvalues – to avoid duplicates. I can see why you have to do that. With the disk, the EOF’s 2 and 3 have the same eigenvalue, and so the EOF’s can be mixed arbitrarily (a 2D subspace). It would be bad to cut off at PC2, because you would be relying on an arbitrary choice of an EOF from that 2D space.
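The degenerate disk pair described here is easy to reproduce numerically: eigendecompose an isotropic exponential covariance on a grid of points inside a disk, and the second and third eigenvalues come out equal. A sketch (the grid size and length scale are arbitrary choices of mine):

```python
import numpy as np

# Points on a regular grid inside the unit disk
xs = np.linspace(-1.0, 1.0, 25)
pts = np.array([(x, y) for x in xs for y in xs if x * x + y * y <= 1.0])

# Isotropic exponential spatial covariance, C_ij = exp(-|r_i - r_j| / L)
L = 0.5
d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
C = np.exp(-d / L)

lam = np.linalg.eigvalsh(C)[::-1]   # eigenvalues, largest first
lam = lam / lam.sum()
# lam[1] and lam[2] are equal (a cos/sin pair of azimuthal wavenumber 1):
# any rotation of that pair is an equally valid EOF, so neither individual
# pattern is 'separable' in the North et al (1982) sense.
```

Because the pair shares one eigenvalue, cutting off between PC2 and PC3 would indeed mean relying on an arbitrary rotation of that 2D subspace.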

Aside from separation, Schneider, Steig et al 2004 just seem to be saying that two EOF’s seemed to explain all the variance that could be achieved, and separation was satisfied, so they stopped there. Steig09, presumably, with a different data set, either found that PC3 explained more variance, or was not separated from PC2. I can’t see that the distinction between the three they used and the two in SS04 is a big deal.

Incidentally the chapter I linked to said (p 14) that things start to get ragged about PC3.

Incidentally the chapter I linked to said (p 14) that things start to get ragged about PC3.

They were talking about results from one particular data set. This was not a generalization. The exact text is:

The cumulative EV lines for EOT and EOF are best separated at about m=8 for this data set. But as we add more modes the advantage of EOF decreases. EOFs have a clear advantage only for a few modes. For either EOF or EOT only two modes stand out, or can be called principal components, already at mode 3 we see the beginning of a continuum of slowly decreasing poorly separated EV values. In this case the regular EOT are slightly more efficient than alternative EOT, but this is not generally true.

I can’t see that the distinction between the three they used and the two in SS04 is a big deal.

You mean you can’t see the hypocrisy? The obvious confirmation bias in the number of PCs retained in these meatgrinder exercises (you must include the MBH studies to see the pattern)? It’s only a big deal if you insist the conclusions not be predetermined at the outset of the study. Which is what real scientists do insist.

Re: Nick Stokes (#3),
The notes I linked to came from this site (look for “Huug vanden Dool”). The chapter on teleconnections referenced is also interesting, and is here. Unfortunately, I can’t find the diagrams for Chap 3.

Yes, bender, I should have said that the observation about 3 PCs was for a particular data set (NH ocean oscillations, described in the teleconnections chapter). And Willis, I agree that ideally one would set out stopping criteria in advance, but they aren’t so black-and-white. For example, suppose you specify that 60% of variance should be explained, and then find that 3 PCs explain 50%, and then the contributions drop to much smaller levels, and seem to be noisy? (This is the kind of situation in the extract that bender quoted in #5.) And how do you define separation rigidly? I haven’t yet found the original North rule, but here is vanden Dool’s paraphrase (p 15):

Never truncate a λm spectrum in the middle of a shoulder (a shoulder or knee is an interruption in an otherwise regular decrease of eigenvalue with modenumber).

I can’t see that the distinction between the three they used and the two in SS04 is a big deal.

I disagree. If, as it seems, they used three PCs to reduce the original set, then, according to Steig (2004), the third PC is “noise”. RegEM does not differentiate between any inputs, and hence it gets potentially the same weight as the other two “real” PCs.

ideally one would set out stopping criteria in advance, but they aren’t so black-and-white. For example, suppose you specify that 60% of variance should be explained, and then find that 3 PCs explain 50%, and then the contributions drop to much smaller levels, and seem to be noisy?

IMO this particular case (temperatures) could be done with objective criteria (such as AIC, BIC, MDL,..) known since 1970’s but yet to reach the Team world. In some other sciences, the general problem of selecting PCs is known as model order selection.
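For concreteness, one of the objective rules alluded to above: the Wax–Kailath MDL criterion models the trailing eigenvalues as an equal noise floor and picks the order minimizing the description length. A sketch (the test-signal setup in the usage below is mine):

```python
import numpy as np

def mdl_order(X):
    """Wax & Kailath (1985) MDL order estimate: model the trailing
    eigenvalues as an equal noise floor and choose the number of 'signal'
    PCs minimizing the description length."""
    n, m = X.shape
    Xc = X - X.mean(axis=0)
    lam = np.linalg.svd(Xc, compute_uv=False) ** 2 / n
    scores = []
    for k in range(m):                      # k = hypothesized number of signals
        tail = lam[k:]
        a = tail.mean()                     # arithmetic mean of noise eigenvalues
        g = np.exp(np.log(tail).mean())     # geometric mean
        scores.append(n * (m - k) * np.log(a / g)
                      + 0.5 * k * (2 * m - k) * np.log(n))
    return int(np.argmin(scores))
```

The appeal of a rule like this is exactly the point being argued in the thread: the criterion is fixed in advance and returns a retention count without anyone eyeballing the resulting trend first.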

The point on separation was discussed in an earlier thread (citing North) and this is not at issue here. I quite agree that the EOF2 and EOF3 are a pair for a disk – so there doesn’t seem to be a plausible reason for truncating one without the other. But that’s not the argument that Schmidt presented. He said that Schneider et al 2004 said 3 or possibly 4 – that’s “how it is”. Well, show me where Schneider et al says 3 or possibly 4.

And maybe they re-did the analysis in Steig 09 and got a different answer. That’s fine. But you’re just guessing. It’s their job to say what they did.

And this is not a first time offence. I’ve never been able to find out how the MBH PC retentions were done.

And contrary to what you say, sometimes it does “matter”. That’s why it was such an issue in MBH. And it was never resolved as Mann never provided an explanation that matched the results. Although the “community” is satisfied with empty words describing methods that weren’t actually used, I’m not; I want to know what they really did.

Am I the only one who finds their antics both incredible and incredibly funny? I cracked up reading this. Two PCs are plenty!! No, we need four minimum! No, three or four!!

Nick, you raise a valid point. The difficulty is, you need to define your criteria for how many PCs you will retain before you do the analysis.

In other words, you need to say “OK, anything with a separation of less than X is a pair. We won’t split a pair. And we’ll pick enough PCs to explain X% of the variance, and not one more.”

And you need to do that before you start. You can’t do it after the fact. Wegman nailed them for that. Mann and his merry men have been doing it ever since their story switched from “one PC explains it all, PC1. Amazing!” to “You need to take four PCs minimum for the magic to happen.” It’s just amusing to watch their numerical gymnastics, puts the Romanian gymnasts to shame …

Re: Willis Eschenbach (#7),
It’s the pea under the thimble. A casual scan of the literature would never reveal this hypocrisy because you wouldn’t see the connections. It’s tragic. It’s comedic. It’s embarrassing. It’s blog-o-rific.

All I can say is that all this Mannian use of principal components reminds me of congressional district gerrymandering by political appointees in order to create voting districts that always elect from the ‘proper’ political party.

In Jackson (A user’s guide to principal components, Wiley, 1991) a large number of stopping rules are discussed. Although the author is careful to be impartial he appears to favour the “cross-validation method”. This stopping rule is recommended when the initial intention of the study is to construct a PCA model with which future sets of data will be evaluated. Given that these various climate studies are ongoing this approach may lend itself to re-analysis (as in, let’s throw in a few more bristlecones). Another criterion, which is a bit like the 80:20 rule, is simply to stop when the cumulative explained variance is close to 80%; this tends to be about 3 or 4 in industrial studies but of course it depends on the particulars of the case.
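The 80% rule is trivially mechanical, which is its virtue: fix the target before touching the data and the retention count follows. A sketch (function name mine):

```python
import numpy as np

def n_pcs_for_variance(X, target=0.80):
    """Smallest number of leading PCs whose cumulative explained variance
    reaches `target`. The statisticians' caveat: `target` has to be fixed
    before looking at the data for the rule to have any integrity."""
    Xc = X - X.mean(axis=0)
    lam = np.linalg.svd(Xc, compute_uv=False) ** 2
    cum = np.cumsum(lam) / lam.sum()
    return int(np.searchsorted(cum, target)) + 1
```

As the comment notes, on industrial data this tends to land at 3 or 4 PCs, but the count depends entirely on the eigenvalue spectrum of the particular case.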

Re: Omar Adamson (#13), Omar:
I am not sure what you mean by cross validation method in this particular context. With survey data, I test the robustness of the underlying PC structures by randomly splitting the data set and by checking major demographic subsets that a priori should not influence the underlying structure of the data. This is a relatively standard way of generating and defining a “scale” in IO psychology anyway.

In exploratory data analysis a researcher is free to do almost anything as long as they are explicit about what they did and as long as they are prepared to validate their findings with another independent set of data. The latter means that they have to be ready to support all replication efforts. To explore without fully documenting, to claim without validating and to state without fully enabling replication is simply bad practice.

“I guess you have to be a to understand this equivalence.” You are a tease, Steve – what is the missing noun? Genius? Climate Scientist? Team Member? Charlatan? Witch Doctor? Mind you, those categories are not mutually exclusive.

Your “Truncated PC” chart makes it clear that how many PCs one picks makes a difference. And your blog post describes some possible pitfalls in the picking.

For someone like me, new to PC analysis: what are the two or three “most acceptable” objective methods to determine the “correct” or “best” number of PCs to use on a given problem? I am not looking for a detailed cookbook — rather, the short answer (a couple sentences for each method or a link to a reference that describes “acceptable” methods).

Or, is the use of PCs actually a “black art” — wherein one must generate one’s own criteria for retaining PCs?

When developing and applying a new method, it is critical to test it against 1) no pattern to check for spurious results and 2) a known pattern to see if you can detect it — and also to test for bias and accuracy of the method. This practice of applying novel methods with no verification of their behaviors is simply grotesque.

It seems to me that you’d stop truncating at the point where additional PCs don’t change the results significantly. I cannot imagine a more reasonable criterion.

This is after all a form of compression, analogous to JPEG image compression, or MP3 audio compression. Both of these methods of compression project a signal onto an orthogonal basis set that’s ordered by perceptual importance, and each compression method has a cutoff beyond which you stop – when the additional information “doesn’t matter”. When the jpeg looks full and detailed to the eye, when the MP3 sounds like the original CD – you’ve still thrown away a lot of information, but that additional information is not significantly detectable by the human senses.

In Steig et al, we have a method of compressing data that is then sent through another analytical step. Based on the chart above, this second step seems quite capable of detecting the differences between the various levels of compression around the cutoff used by the authors.

It seems to me that you’d stop truncating at the point where additional PCs don’t change the results significantly. I cannot imagine a more reasonable criterion.

Yes, it does sound reasonable – hence the previous posts that suggest “stop when we have explained 80% of the variance”. Nothing wrong with that selection criterion, as far as I can tell – but I’m not a statistician. From what I can tell though, the statisticians are suggesting that you need to establish this cut-off criterion before you look at the data. Based on what I know of stats (which would fit on a postage stamp with room to spare), if you don’t specify the cut-off point a priori you risk distorting the analysis by finding significance that may not exist. However, as others have suggested, it may be reasonable providing you analyse and report on what changing the number of retained PCs does to the analysis. Can one of the stats gurus here confirm/refute this please?

There’s a difference between using 3 PCs (or 4) for establishing a trend, which requires only a small number of DOF, and using them for correlation to establish spatial position and trend across a continent, where more DOF are required. IMO Gavin’s reference to Schneider misses the point.

Higher PC numbers are required in RegEM to match the surface stations to the AWS stations. For those who haven’t seen it, Jeff C and I did a post on WUWT last weekend which shows the correlation vs distance for the two reconstructions and compares them to the natural correlation vs distance from the surface stations. It was pretty apparent to us there was a problem; by increasing the PCs we found that a reasonable correlation vs distance returned and the trend changed substantially.

For some reason, my comment thread is slightly shorter 🙂 but I did catch TCOitis last night.

As SteveM pointed out, it may be a case of more pegs than holes. In the case of satellite data the problem becomes a two-level one, where the data was initially processed to 3 PCs prior to RegEM, followed by a second level which dropped PC3 to zero in the RegEM portion. From the satellite data, the correlation vs distance is even worse.

It’s pretty relevant to the stability analysis points on the other thread. The reconstruction doesn’t just need to be robust to its data but to its parameters as well. Either that, or very clear, reasonable explanations of the differences need to be provided.

Wouldn’t it just make sense (in all these cases) to present the bar chart as depicted at the top of this post (i.e., show the effect of PC retention choice on the outcome of the analysis)? I assume it’s not too computationally expensive to do so… and then everyone understands the, uh, robustness of the result WRT PC retention (and, of course, they could also show the amount of variance explained as a cumulative function).

Well, Schneider, Steig and Comiso (2004) didn’t have Mike Mann on board yet. No wonder they couldn’t find the third PC!

The differences in trend between 2, 3, and 4 PCs aren’t great, and wouldn’t make much difference with Steig 09 CI’s. However, when the CI’s are adjusted for even AR(1) serial correlation (see “Steig 2009’s non-correction for serial correlation”), the significance of the trends could easily be overturned by using a slightly weaker point estimate.

Steig 09’s primary results are based on the TIR recon, not the AWS back-up recon. We don’t know how this was constructed from the raw data, but now that we know it is based on 3 PCs and what these are, isn’t it possible to construct trends based on just the first 1 or 2 PCs? (Given that the EOFs are orthogonal, omitting the third shouldn’t affect the coefficients on the first two, unless I’m mistaken.)

The differences in trend between 2, 3, and 4 PCs aren’t great, and wouldn’t make much difference with Steig 09 CI’s. However, when the CI’s are adjusted for even AR(1) serial correlation (see “Steig 2009’s non-correction for serial correlation”), the significance of the trends could easily be overturned by using a slightly weaker point estimate.

That’s been my point of contention, i.e. why an apparently small difference in trends could loom large in the light of adjusted CI’s (calculated trends significantly different than zero). I would be very surprised if this situation could not have been pointed out by a grad student involved with the research.

We don’t know how this was constructed from the raw data, but now that we know it is based on 3 PCs and what these are, isn’t it possible to construct trends based on just the first 1 or 2 PCs?

Yes, you simply omit the 3rd PC from the RegEM input. Step by step instructions for someone with free time this evening:
1) Calculate the TIR PCs from post 1982 data (see Roman’s post for details how to do that in R)
2) Omit the 3rd PC, and prepare the input from 2PCs and the station data (see Jeffs’ posts how to do that)
3) Run RegEM (with regpar=2 or regpar=3, IMO regpar=2 is more correct)
4) Project the TIR reconstructed PCs back to grid cells (by multiplying by EOF’s obtained in step 1)
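The steps above reduce to standard linear algebra; steps 1 and 4 can be sketched as follows (function names mine; in the actual exercise the pre-1982 PCs would come out of RegEM rather than the SVD):

```python
import numpy as np

def pca_decompose(X, k):
    """Step 1: split a (time x gridcell) field into k leading PCs
    (time series) and EOFs (spatial patterns), plus the climatology."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    pcs = U[:, :k] * s[:k]      # time x k
    eofs = Vt[:k]               # k x gridcell
    return pcs, eofs, mean

def project_back(pcs, eofs, mean):
    """Step 4: rebuild the gridded field, field = PCs @ EOFs + climatology."""
    return pcs @ eofs + mean
```

Dropping the 3rd PC just means passing k=2 at step 1, so the projection at step 4 uses only the first two EOF patterns.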

Again I remind interested readers to browse through the link to a prior attempt to verify a story about how PCs were retained in MBH98 here.

Unfortunately, methodological descriptions are all too often legends in the cover story sense – something that looks and sounds plausible and will be accepted by the “community”, but is nonetheless untrue.

I use the word legend advisedly and with some frustration here, as I’ve been notably unsuccessful in convincing the “community” that the ostensible PC selection rules of MBH were a legend although the evidence is uncomplicated and seems plain as day to me.

Another oddity here. We all have to be careful not to conflate “significant” and “separable”.

If you re-examine the disk eigenvectors, these come in couplets and multiplets. The EOF2 and EOF3 have the same eigenvalues. As North 1982 emphasized in a meteorology context, these couplets are not “separable” in that any linear combination of the eigenvectors is also an eigenvector. Even if both elements of the couplet are “significant”, it seems to me that this would pose a real problem for linear regression-type methods like RegEM because none of the individual patterns is “real”.

North 1982 alluded to a further complication when statistical error is present. What are the error bounds on the eigenvalue estimates used for the determination of whether these values are “separable”?

Taken at face value, Steig is saying that the 2nd and 3rd eigenvalues of the observed situation are “separable” even though a disk couplet wouldn’t be. Unfortunately, they didn’t give any details on how they established this interesting fact. The more one thinks about it, the harder it would be to prove that the 2nd and 3rd eigenvalues are in fact separable – especially when North 1982-type error considerations are allowed. And if eigenvalues 2 and 3 aren’t “statistically separable” – and we haven’t seen any proof or evidence that they are – then you’re back in the conundrum of mixed-up eigenvectors in your RegEM module.
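North et al’s rule of thumb makes the point concrete: the sampling error of an eigenvalue estimated from N effective samples is roughly λ√(2/N), and a mode is “separable” only if the gap to its neighbours exceeds that error. A sketch of the rule (my own coding of it):

```python
import numpy as np

def north_separable(lam, n_eff):
    """North et al (1982) rule of thumb. lam: eigenvalues, largest first;
    n_eff: effective number of independent samples. An eigenvalue is
    flagged separable only if the gap to BOTH neighbours exceeds its
    sampling error d_lam ~ lam * sqrt(2 / n_eff)."""
    lam = np.asarray(lam, dtype=float)
    err = lam * np.sqrt(2.0 / n_eff)
    gaps = lam[:-1] - lam[1:]
    gap_up = np.r_[np.inf, gaps]     # gap to the larger neighbour
    gap_dn = np.r_[gaps, np.inf]     # gap to the smaller neighbour
    return (gap_up > err) & (gap_dn > err)
```

A disk-style degenerate pair (equal second and third eigenvalues) fails this test no matter how small the sampling error, which is the point: demonstrating that modes 2 and 3 of the AVHRR field are separable would require exactly this kind of error analysis, and none is given.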

At the end of the day, I think that 99% of the problems arise out of trying to use convoluted methods for something that they aren’t designed for.

Think about what the PC1 is – it’s a weighted average of the gridcells with the interior points overweighted. A simple average evens out the weighting. All you’re trying to do is figure out the past average using 15 or so surface stations mostly around the boundary and the South Pole. There’s no reason why there should be some kind of stable regression relationship between the 15 stations and the PC1 rather than the 15 stations and the simple average.

I’ll grant, first, that I’m not the world’s leading expert on PC analysis, but I’m not entirely ignorant of it either. And based on what I think I know, the whole idea of using PCs in this way strikes me as questionable.

We know that there are issues with the sampling of temperature data in Antarctica – things like a surfeit of coastal, peninsular sampling sites, and probably greater noise (due to maintenance issues) in interior stations – but PC analysis does nothing to help this. If anything, it amplifies the problem, because the large number of correlated, well-maintained, geographically similar sites is likely to create a hefty mass of explainable variance relative to other stations. So whatever trend appears in those sites will appear in one or more of the “major” PCs, causing it to be retained while other PCs representing minor components of the variance are thrown out. Hence the oversampled area will weigh even more heavily after analysis than before.

As an additional problem, I doubt that the interior of the continent has as much temperature variability as the coasts, likely causing all interior stations to be underweighted for that reason as well. Less signal-to-noise means less explainable variance, and hence a lower ranking PC.

This is speculation, but if the satellite data shows a stronger signal-to-noise ratio on the borders of the continent (for technical or geographical reasons), then it, too, will preferentially retain the coastal, peninsular areas in the major PCs.

Is it appropriate to speculate that the employment of bristlecone pines in Mann’s paleoclimate temperature reconstructions is, in some sense, functionally equivalent to the employment of automated weather stations in Steig’s Antarctic temperature reconstructions – assuming, of course, that a thousand-year-old pine tree is as accurate a temperature-sensing and recording device as a modern-day automated weather station?

Some of Gavin’s new comments seem to be only loosely connected with the actual record…Does Gavin think that nobody’s going to check?

He doesn’t have to worry about the readers at RC checking, nor does he have to worry about them coming over here. He can just post it at RC, and the readers nod their heads and say, “Case closed! Gavin solves another case.”

What appears interesting chez gavin is that his commenters all accept what he says without question – or that is at least how it appears – but we should not forget gavin’s high-bandpass filtering system. Anyone who questions the great god Iam gets deleted.

Re: Stephen Richards #38
You’re absolutely right on the “filtering”. If you read his latest post on how George Will should have “apologized”, I commented to him that I thought it was funny for him to say that, considering the games he played with McIntyre over the Steig paper and incorrect station data. Then I asked him how Will’s column was less ethical than that. Needless to say, he didn’t print that part of my comment.

I come from a statistics background and had never heard of “Preisendorfer’s Rule N”. Preisendorfer isn’t the inventor of PCA, but is apparently the guy who first taught it to atmospheric scientists. I’m not willing to spend $895 to find out the rules he taught them, but I am willing to cite a rule I drum into everyone to whom I teach data analysis: it is not a good idea to fit to reduced data. Fit to the actual data. Data reduction (binning, smoothing, retaining only a subset of components) is only good for drawing pictures.

The authors under consideration appear to believe that throwing away higher principal components removes noise and “artifacts” from their data. That may be the case, but to know whether it is, they need a model for the noise. And given a model for the noise, really the right way to handle it is to incorporate the model for the noise into the model for the data and go back to fitting the unreduced data. Without a model for the noise, there is no data reduction technique that will magically separate your signal from your noise.

It may occasionally happen that your model for the noise is too vague or the right procedure is too difficult, so that you are forced to fit reduced data. (For example, you may only know that your noise is mostly high-frequency, so you apply a low-pass filter and then analyze the reduced data.) In that case, you need to do a lot of sensitivity analysis to show that your results are not sensitive to your data reduction technique. The graph at the top of Steve’s post would appear in this case to show the opposite.
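The kind of sensitivity check described here can be sketched in a few lines (synthetic data, for illustration only, not the Antarctic series): sweep the truncation level k and see how much the estimated trend of the mean series moves.

```python
import numpy as np

rng = np.random.default_rng(4)

t = np.arange(300)
X = rng.normal(size=(300, 20))
X[:, :5] += np.outer(0.002 * t, np.ones(5))  # weak trend in 5 of 20 series

# Truncated-SVD reduction at several retention levels k
Xc = X - X.mean(axis=0)
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
slopes = {}
for k in (1, 2, 3, 5, 10, 20):
    Xk = (U[:, :k] * s[:k]) @ Vt[:k]         # rank-k approximation
    slopes[k] = np.polyfit(t, Xk.mean(axis=1), 1)[0]
    print(k, round(slopes[k], 5))

# At full rank (k=20) the reduction is exact, so the slope matches the
# unreduced mean series; intermediate k values reveal the sensitivity.
```

If the estimates swing substantially across plausible k – as the graph at the top of the post suggests for the Steig case – that is exactly the warning sign a sensitivity analysis is meant to surface.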

I might add that another rule I try to drum into students is never tune your data analysis to maximize your signal. Too many smart scientists — and not just in climatology — have foolishly convinced themselves that doing so is a good idea. They are practically guaranteed to get a strong result, and that result is practically guaranteed to be spurious. Steve’s graph also suggests that the authors under consideration may well have fallen into this trap.

I might add that another rule I try to drum into students is never tune your data analysis to maximize your signal. Too many smart scientists — and not just in climatology — have foolishly convinced themselves that doing so is a good idea.

But…but…but…that’s EXACTLY what Mann describes when he designs his statistical meatgrinder. He chooses his favourite metric and tunes his model in order to maximize the significance given by that metric.

Mann and Wegman are poles apart on this issue. One of them must be wrong.

Re: David Wright (#40), I understand your frustration; these guys are in a world of their own making. In addition to Jackson there is another standard text on PCA by I. T. Jolliffe “Principal Component Analysis” Springer 2002. Jolliffe discusses the book by Preisendorfer and Mobley (1988) “Principal components analysis in meteorology and oceanography”, Amsterdam: Elsevier, saying the book “concentrates on meteorology and oceanography. Because of this, the notation in Preisendorfer and Mobley differs considerably from that used in mainstream statistical sources.” Later Jolliffe says its “not an easy read, it rivals Rao (1964) in its range of novel ideas relating to PCA, some of which have yet to be fully explored. The bulk of the book was written by Preisendorfer over a number of years, but following his untimely death the manuscript was edited and brought to publication by Mobley.”

No, he is probably correct but not telling the whole story. Here is Gavin’s response:

(for instance, for the AWS reconstruction the trends in the mean are -0.08, 0.15, 0.14, 0.16, 0.13 deg C/dec for k=1,2,3,4,5). – gavin]

When I ran the same comparison using Jeff Id’s implementation of RegEM, I got the following:

1 -0.07
2 +0.16
3 +0.15
4 +0.17
5 +0.12
6 +0.09

There are still some small discrepancies between Steig RegEM, Id RegEM and Steve’s R version. We are trying to figure out the differences but aren’t quite there yet (probably has to do with the anomaly calculation time window, number of iterations, stuff like that). Instead of helping resolve the discrepancies, Gavin is using them to obfuscate.

What Gavin doesn’t do is continue the trend. He shows you 2 through 5, which appear to be just a noisy variation around the same value. If he continued using higher-order PCs, it would show the same roll-off Steve shows above. k=6 is 0.09 using Jeff Id’s RegEM.

The other point I don’t see addressed here is that when you use higher-order PCs in surface station data, you no longer get the option to simply average. In my opinion Jeff C’s regridding has demonstrated beyond a doubt that the trend must be area-weighted, because there are areas of high and low station concentration. If area weighting wasn’t important, the trends wouldn’t be cut in half. It is simply hiding behind the most basic artifact of the paper to say ‘look, the trends don’t change’ — they shouldn’t change much. RegEM is presumed to account for spatial distribution errors by its nature, applying correct trends to correct areas. Otherwise they would have simply used linear averaging of station data pre-1982 and pasted the same trend on every gridcell.
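The area-weighting point can be illustrated with entirely made-up numbers: if many stations cluster in a small warming region while a few stations represent a much larger area, the simple and area-weighted means diverge sharply.

```python
import numpy as np

# Hypothetical illustration only: 10 clustered "coastal" cells trending
# +0.2 C/decade, 5 "interior" cells trending -0.05 C/decade, where each
# interior cell represents 10x the area of a coastal cell.
trends = np.array([0.2] * 10 + [-0.05] * 5)
areas = np.array([1.0] * 10 + [10.0] * 5)

simple = trends.mean()                        # ~ +0.117 C/decade
weighted = np.average(trends, weights=areas)  # ~ -0.008 C/decade
print(simple, weighted)
```

With numbers like these the simple average is dominated by the oversampled cluster, while the area-weighted mean tells a different story – which is why the regridding comparisons matter.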

I’m going to do a post on this so I’ll stop here. I don’t think we’re getting quality answers from the team though.

The other problem is that Gavin’s reply is simply throwing up noise. The discussion at RC has no bearing on the aggregate trend; the discussion is whether the choice of 3 PCs gives them sufficient geographic resolution to support the claim of substantial warming in West Antarctica. The overall trend has nothing to do with that. It’s chaff.

If he ran those numbers showing the trend for West Antarctica only – and arrived at the same general conclusion – he’d have a point. Otherwise, it’s as relevant as me telling him my shoes are black.

(and if you’re looking for hidden meaning in that, there isn’t any . . . so that’s how relevant it is)

Ryan O – I agree that Gavin’s response is throwing up noise with regard to your discussion over at RC. It seems to be a common strategy over there: say something smart and true, and somewhat related to the discussion at hand, even if it doesn’t at all answer the challenge laid out in prior comments. However, the info should limit our accusations here (e.g. it is not the case that the team just picked 3 PCs because that maximized the trend, even if the decision remains suspect).

Jeff C – thanks for the quick response. Would you happen to have the significance level of the trend using the other PCs?

FWIW, I just ran RegEM with a few more cases and also refined my earlier cases. I ran with neigs unset; we had earlier had this set to 3 but found leaving it unset gave better agreement with Steig. I let RegEM run as many iterations as it wished, not stopping until it stagnated. Here are the trends for various regpar (k) settings

Readers should also note that this shows results from a slightly different method than RegEM TTLS – a truncated SVD that yielded very similar results but converged a lot faster. In the earlier post, I said that I would post on the differences but I forgot to do so. I’ll try to get to it in the next few days.

Following up on #52, here is something else interesting, although I have no idea how to explain it. We noted this in the WUWT post and dubbed it “temporal smearing”. That may be a misnomer, but I’m not sure what else to call it.
I was looking at the AWS trends not only from 57 to 06, but also from 1957 to 1979 and from 1980 to 2006. This seemed reasonable since there is no real data pre-1980. Roman and others had noted that from 1980 on the trend was virtually flat.
Look what happens when you increase the number of PCs. The 57-06 trend drops slowly, but the 80-06 trend starts cooling pretty dramatically. At the same time, 57-79 stays flat or even slightly increases warming.

If we believe the higher number of PC trends are more accurate (by no means a given yet), it would mean that the lower number of PC trends somehow “smear” warmer temperatures from pre-1980 into post-1980.

It is as if the higher-order PC 57-06 trend is an inverted “V” with the peak around 1980. Lowering the number of PCs pushes down the peak and raises the slope in later years.
Not sure what happens at regpar=9 as it is a real outlier. RegEM doesn’t work for 13 and above. Twelve is iffy as it takes about 200 iterations to converge.

What I meant to say was that adding PCs simply creates more detail in the reconstructions, and beyond 1 PC area weighting makes more sense because of the improvement in the trend smearing. The trend may even converge to something stable more quickly.

But a general remark for all people running these tests: don’t be fixated on RegEM-TTLS; try also ridge regression. TTLS is not the choice of the original author (Schneider), whose preference is ridge regression (the standard parameters in RegEM). I have a reason (which is none of the reasons mentioned by the Team; sorry gavin, I’m not going to tell it here) for preferring TTLS over ridge in this application, but, on the other hand, I think this problem could/should be handled without any regularization at all! Steig et al. describe their choice of TTLS as follows:

Using RegEM with TTLS provides more robust results for climate field reconstruction than the ridge-regression method originally suggested in ref. 11 for data infilling problems, when there are large differences in data availability between the calibration and reconstruction intervals [10].

The citation 10 is, of course, a Team-internal citation to Mann et al (2007), which says about the issue (my bold):

However, we have found that the estimation of optimal ridge parameters is poorly constrained at decadal and longer timescales in our tests with pseudoproxy data, and we have learned that earlier results using ridge regression (e.g., M05) are consequently sensitive to, e.g., the manner in which data are standardized over the calibration period (see discussion below in section 2.3). We have found TTLS to provide more robust results with respect to these considerations.

I guess that could be substituted for the definition of “handwaving”. Anyhow, data in Steig et al is not (pseudo)proxy data; it’s actual temperatures. Moreover, the time scale is 50 years at maximum, and there is no need for standardization (in the sense described in the above paragraph). Also, notice that Smerdon et al’s (2008) preference for TTLS is for high-noise proxy data; this is not the case here.

Re: Jean S (#62),
Brilliant observation.
Appeal to authority disguised as appeal to reason. And it passes review because pseudoscientific language is so good at masking non-scientific arguments. lucia, this is yet another problem with Revkin’s critics: appeal to authority disguised as appeal to reason.

You have to realize that Bender is a bit of a jokester. You have to overlook his jests which fall flat. And in this case, remember that, AFAIK, Bender doesn’t have any Team Skeptics articles being peer reviewed at Nature. I could be wrong there as I don’t know his actual name.

bender wasn’t sniping at you. He was sniping at the term “seem to match” – which slides into Team articles as though it meant something. And he’s sniping at Nature for permitting such loose language in Nature articles. So if certain people have something that “seems to match”, that may qualify for a Nature cover.

The PDO was in a “cool” phase from the late 1940’s to around 1977, which would warm the waters around Antarctica.

The PDO was in a “warm” phase from 1977 to 2007 which would cool the waters around Antarctica.

Bender, this observer (with reasonable amounts of memory surviving for one of his years) wants to know: what raised your hackles more, the term “seem to match” or a reference to PDO?

Actually it was Steig in an earlier paper who noted that ice core samples from West Antarctica showed that the period 1935-1945 was by far the warmest 10 years of the 20th century in West Antarctica. The authors of that paper attributed that warming to an unusual period of warming of the waters of the Southern Pacific ocean. I do not recall whether they connected the Pacific warming to the PDO, Bender.

Patrick and Jeff:
Picking up on Patrick’s point, why not use the known PDO phases to set the time parameters, leaving gaps if necessary? After all there supposedly is a physical process driving the temperatures.

Jeff – I agree that more PCs don’t necessarily make the results more accurate. It is clear that strange things start happening above 8 PCs. I think you are correct that the weighting is a better approach. The thing that fascinated me about the chart above was the different effect on the two time periods. We have talked a lot about spatial smearing, but there seems to be a systematic effect over time. It could be the same phenomenon viewed from different domains, but I have not been able to sort that out in my head.

Jean – I have been focused on TTLS and have completely ignored ridge regression, since TTLS is what the team used. I’ll experiment with ridge and see what happens. Regarding neigs, it was your earlier comment that made me realize we should leave it alone instead of setting it to match regpar. Once that was done, the agreement with Steig on the k=3 case improved dramatically. Thanks for the tip.

An equally big question to how many PCs are used, IMHO, is where the AVHRR PC3 came from. This was first plotted on Jeff Id’s site, and then again on CA, as follows:

As Jeff put it, this “ain’t natural”! The PC must have been zeroed out before 1980 by hand at one point, and perhaps then fed back into the meatgrinder, where it picked up a small amount of noise in its early portion from whatever processing goes on.

When Comiso comes through with the raw AVHRR data he has promised Steve, it will be interesting to see what this PC looked like before it was “corrected”.

As a non-statistician reading this blog on a daily basis, I believe it is important that this group stay focused on the issue at hand and not fall into the various pits which it seems that “the team” has done. The previous post on Jan de Leeuw should be kept at a conscious level at all times, particularly several passages:

“Another problem, which is related to the first one, is that the models typically proposed by statistics are not very realistic.”

“Our conclusions so far, on the basis of the above, can be summarized quite briefly. The task of statistics is to describe the results of empirical investigations and experiments in such a way that the investigator can more easily make his predictions and generalizations. Thus it is not the task of statistics to make generalizations. Statistical inference, whatever it is, is not useful for empirical science. Many statistical procedures are very useful, many statistical measures of fit provide convenient scales of comparison, and many statistical models provide interesting theoretical illustrations and gauges with which we can compare our actual data. But generalization, prediction, and control are outside of statistics, and inside the various sciences. Statistics has given us many useful tools and scales, but it has not given us a methodology to make the appropriate inductions…”

Any correlations to natural phenomena to be gleaned from the Steig et al model cannot be anything more than pure speculation, because the model does not represent reality. It is a statistical manipulation of data, subjected to multiple arbitrary assumptions and parameters, which leads to an array of possibilities. I believe that it is more important to stay focused upon the arbitrary methodology at play in this model than to try to use it to correlate to natural phenomena.

But seriously. Why did Steig et al. not do a formal comparison of the PCs to the region’s EOFs?

There is a lot that Steig et al. (2009) have left to the readers’ imaginations: the 1935-1945 West Antarctica warming, the trends in the Antarctic coming forward from 1957, the differences between AWS and TIR trends, the details of a physical explanation for PCs 1, 2 and 3, the input into the PCs, and the adjustments for AR1 in trend slope CIs.

Steig et al. appear to be like the great poets or novelists, who by explaining too much would have ruined the effect. It does, however, provide fodder for CA.

We have sliced and diced the AWS recon for a while now. No matter what we try (distance weighting RegEM’s input, including more PCs, distance weighting RegEM’s output) we get the same answer. That trend is toward less warming overall and pronounced cooling post-1980.

Hu, Kenneth, Ryan and others have shown the claims of statistical robustness of the study need to be taken with a grain of salt. While I understand on the order of only 25% of their analysis, it seems clear to me they have been honest and straightforward without cherry-picking.

This is all good stuff, but the satellite recon is the big boy. Comiso’s earlier work and station temperatures indicated widespread cooling outside the peninsula and a few other places. If you read the Schneider paper Steve mentions above (Schneider 2004), it seems clear that these guys have convinced themselves that the first three PCs correlate to the dominant atmospheric factors in Antarctic climate. They don’t discuss the whole Chladni pattern issue and the “castles in the clouds” warning in drawing their conclusions. That is fine; I hold many strong opinions on flimsier evidence than they have.

Where they ventured into highly-questionable territory was the Steig 2009 satellite recon. In my opinion, they didn’t believe the temperatures in the cloud-masked satellite data. Maybe they didn’t think the cloud-masking algorithm worked properly and the data was contaminated. For whatever reason, they replaced the satellite data with what they thought had to be the real temperatures, i.e. temperatures reconstructed from those 3 PCs identified in Schneider 2004. They then gloss over this in the RegEM discussion of the methods summary of the paper, giving the misleading impression that they were infilling, as opposed to replacing. Follow that with fancy graphics on the cover of Nature and press conferences. Now that people are wising up, they stonewall in releasing the actual cloud-masked data (or possibly never intended to release it). I strongly suspect we will never get the real cloud-masked satellite data.

While I have to admire the audaciousness and PR-savviness of the whole affair, the ethics of it leave me in disbelief. I would almost certainly be fired for pulling a similar stunt in my job.

I tried posting this yesterday at RC after seeing a lack of response to Ryan O’s last comment:

376. John Norris Says: Your comment is awaiting moderation.
7 March 2009 at 8:09 PM
re: 375 in the context of 353, 364, 369, 370, 372, 373, and 374
So any further defense of “Antarctic warming is robust”?

It no longer shows up as awaiting moderation, so I am guessing it didn’t make the cut. Perhaps they are preparing an irrefutable response that shows something wasn’t taken into account in the analysis that all you folks did. Then they can close the thread.

Re: Layman Lurker (#89), They both have the same general effect, reducing the average continental warming trend. I’m going to run some comparisons to help quantify the similarities and differences between the two. One thing I find interesting is how they affect the pre-1980 vs post-1980 trend. That distinction gets lost when you only examine linear trends over the 50 year time frame.

It seems that input vs. output weighting addresses different deficiencies. Input would presumably address spurious correlations in RegEM. Output would address the geographic imbalance in calculating a mean. At this point I have not looked closely enough to draw conclusions.

I will be interested in seeing your post on that Jeff. Depending on what you see, you may want to pursue accounting for other sources of correlation as well (region, ocean, latitude, etc.). One would think there would be other correlations, and similar degradation of such correlation with the low order factors.

I lack specialised software to extract PCs/EOFs, but I do it in a way that works (for me), though I have no proof.

Is it generally true that if one forms the covariance matrix C (NxN) and any normalised vector V (N), then the product CV (once normalised) is always a better approximation of the eigenvector with the largest eigenvalue, provided that all the eigenvalues are different?

In practice, iteration always seeks and finds an approximation of that eigenvector and finding the corresponding PC/EOF is straightforward.

If so, does anyone know a reasonably simple proof? Or am I making a huge error somewhere?

What you are looking at is called the “power method” of calculating the eigenvector corresponding to the largest eigenvalue. It works if the largest eigenvalue is unique and if the initial vector you have chosen is not orthogonal to the eigenvector you are trying to find.

A Google search with “calculating eigenvectors power method” finds lots of pages on the topic including the pretty much unreadable Wiki page and a simpler explanation found here.

The method converges reasonably quickly when the separation between the largest and next largest eigenvalue is large. You can also speed up the convergence by continually squaring the covariance matrix to form powers of C of order 2^n before multiplying by the vector V instead of just successively multiplying by C at each step.
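For anyone who wants to try it, here is a minimal sketch of the power method (plain NumPy; it assumes a symmetric matrix with a unique largest eigenvalue, as discussed above):

```python
import numpy as np

def power_method(C, tol=1e-12, max_iter=10000):
    """Approximate the eigenvector of symmetric C with the largest
    eigenvalue by repeatedly multiplying a normalized vector by C."""
    rng = np.random.default_rng(1)
    v = rng.normal(size=C.shape[0])
    v /= np.linalg.norm(v)
    for _ in range(max_iter):
        w = C @ v
        w /= np.linalg.norm(w)
        # stop when the direction no longer changes (up to sign)
        if min(np.linalg.norm(w - v), np.linalg.norm(w + v)) < tol:
            return w
        v = w
    return v

# Check against a full eigendecomposition of a random covariance matrix
A = np.random.default_rng(2).normal(size=(50, 8))
C = np.cov(A)                          # 50x50 symmetric PSD
v_true = np.linalg.eigh(C)[1][:, -1]   # eigh sorts eigenvalues ascending
v_pm = power_method(C)
print(abs(v_pm @ v_true))              # close to 1 (same direction up to sign)
```

As noted above, the convergence rate depends on the ratio of the second to the first eigenvalue, and repeatedly squaring C to form C^(2^n) before multiplying speeds things up when that ratio is close to 1.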

Many thanks, the links you provided led me to a proof that I could understand. I was damn sure it was true, but it was a procedure that I came to by trial and error. I shall continue to use it. It is all I need for matrices that will fit into Excel, and macros can take care of the iterations.

Starting on another eigenvector is not a problem, as the smallest error does eventually turn the vector toward the dominant one.

At the risk of getting this all wrong, I will simply write what I thought to be the case.

When looking for a signal, it is convenient to perform a PCA, as it enables one to reduce the degrees of freedom – provided you can be sure that the PCs that are discarded contain no significant information, and that you can verify this.

Having done that, one has a smaller dataset in which to look for the signal you are interested in. If you have some data to calibrate against, you use it to discover the weights for that signal in each of the retained PCs. Having done that, you form the product of the weights and the retained PCs, and you have a recovered signal. If the calibration data was partial, you have extended the recovered signal to cover the gaps. Obviously there are tests that need to be performed to gauge the level of confidence one can have in the recovered data.

Now maybe I am being simplistic and just wrong, but I thought that it was a straightforward, almost mechanical operation. Also, I thought that you could always include all the PCs; it just requires more computation.
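For what it’s worth, the mechanical recipe described above (reduce, calibrate weights, reconstruct) can be sketched on synthetic data – a toy illustration of the general procedure only, not RegEM or Steig’s method:

```python
import numpy as np

rng = np.random.default_rng(3)

# Synthetic field: one smooth common "signal" spread across 40 series plus noise
T = 200
signal = np.cumsum(rng.normal(size=T)) / 10
field = np.outer(signal, rng.normal(size=40)) + rng.normal(size=(T, 40))

# Reduce: keep the first k PC time series of the centered field
k = 3
Fc = field - field.mean(axis=0)
U, s, Vt = np.linalg.svd(Fc, full_matrices=False)
pcs = U[:, :k] * s[:k]

# Calibrate: least-squares weights on the second half of the record only
w = np.linalg.lstsq(pcs[100:], signal[100:], rcond=None)[0]

# Reconstruct over the full period from the weighted PCs
recon = pcs @ w
corr = np.corrcoef(recon[:100], signal[:100])[0, 1]
print(corr)  # verification-period correlation
```

The tests mentioned above would go here in a real application: verification statistics on the withheld half, and sensitivity to the choice of k.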

I feel I must be missing something; I get the feeling (not on this site) that the PCs are presented as if they were signals, whereas I would consider them more like “flavour scales”. Surely if a strong PC is on average weighted down as much as up, the PC remains but the signal is cancelled out.

I do use PCs and other orthogonal bases, and often it is not all over until the last variance is explained.

Sorry if this is banal, but I think I understand, yet I simply cannot understand the way PCA is used in some of the papers. If you need at least 32 PCs (as above), then you need at least 32 – you cannot discard PCs if they contain the signal, be it a positive or negative contribution. PCs that look nothing like the signal can still contain sufficient signal to be worth including.