Category Archives: Scientific Process


Well they’ve been running around on the flat expanses of the early Holocene lake bed with impressively large machines, whacking down and gathering the soybeans and corn. This puts dirt clods on the roads that cause one on a road bike at dusk to weave and swear, but I digress. The Farmer’s Almanac says that it must therefore be about World Series time, which in turn is just about guaranteed to initiate various comments regarding the role of luck, good or bad, in deciding important baseball game outcomes.

There are several important things to be blurted out on this important topic, and with the Series at its climax and the leaves a-fallin’, now’s the time; the time is now.

It was Bill James, the baseball “sabermetric” grandpa and chief guru, who came up with the basic idea some time ago, though not with the questionable terminology applied to it I think, which I believe came later from certain disciples who knelt at his feet.

The basic idea starts off well enough but from there goes into a kind of low-key downhill slide, not unlike the truck that you didn’t bother setting the park brake for because you thought the street grade was flat but found out otherwise a few feet down the sidewalk. At which point you also discover that the bumper height of said truck does not necessarily match that of a Mercedes.

The concept applies not just to baseball but to anything involving integer scores. The basic idea is as follows (see here). Your team plays 162 baseball games, 25 soccer matches or whatever, and of course you keep score of each. You then compute the fraction S^x/(S^x + A^x), where, using the baseball case, S = runs scored, A = runs allowed and x = an exponent that varies depending on the data used (i.e. the teams and years included). You do this for each team in the league and also compute each team’s winning percentage (WP = W/G, where W = number of wins and G = games played in the season(s)). A nonlinear regression/optimization returns the optimal value of x, given the data. The resulting fraction is known as the “pythagorean expectation” of winning percentage, claiming to inform us of how many games a given team “should” have won and lost over that time, given its total runs scored and allowed.
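Since the pythag is just arithmetic, it’s easy to sketch. A minimal Python illustration (the exponent 1.83 and the 800/700 run totals here are hypothetical choices for illustration, not values fitted to any particular dataset):

```python
def pythag(runs_scored, runs_allowed, x=1.83):
    """Pythagorean expectation of winning percentage: S^x / (S^x + A^x)."""
    return runs_scored**x / (runs_scored**x + runs_allowed**x)

# A hypothetical team scoring 800 and allowing 700 runs over 162 games:
wp = pythag(800, 700)          # ~0.56
expected_wins = 162 * wp       # ~91 "expected" wins
```

In practice x itself is estimated by optimizing over many team-seasons, which is exactly the ad-hoc empirical step at issue.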

Note first that the value of x depends on the data used: the relationship is entirely empirically derived, and exponents ranging from (at least) 1.8 to 2.0 have resulted. There is no statistical theory here whatsoever, and in no description of “the pythag” have I ever seen any mention of such. This is a shame because (1) there can and should be, and (2) it seems likely that most “sabermetricians” don’t have any idea as to how or why. Maybe not all, but I haven’t seen any discuss the matter. Specifically, this is a classic case for application of Poisson-derived expectations.

However the lack of theory is one, but not really the main, point here. More at issue are the highly questionable interpretations of the causes of observed deviations from pythag expectations, where the rolling truck smashes out the grill and lights of the Mercedes.

You should base an analysis like this on the Poisson distribution for at least two very strong reasons. First, interpretations of the pythag always involve random chance. That is, the underlying view is that departures of a given team’s won-loss record from pythag expectation are always attributed to the action of randomness–random chance. Great, if you want to go down that road, that’s exactly what the Poisson distribution is designed to address. Second, it will give you additional information regarding the role of chance that you cannot get from “the pythag”.

Indeed, the Poisson gives the expected distribution of integer-valued data around a known mean, under the assumption that random deviations from that mean are solely the result of sampling error, which in turn results from complete randomness of the objects (the analogue of Complete Spatial Randomness (CSR) in spatial analysis), relative to the mean value and the size of the sampling frame. In our context, the sampling frame is a single game and the objects of analysis are the runs scored, and allowed, in each game. The point is that the Poisson is inherently designed to test exactly what the SABR-toothers are wanting to test. But they don’t use it–they instead opt for the fully ad-hoc pythag estimator (or slight variations thereof). Always.

So, you’ve got a team’s total runs scored and allowed over its season. You divide each by the number of games played to give you the mean of each. That’s all you need–the Poisson is a single-parameter distribution, the variance being a function of the mean. Now you use that computer in front of you for what it’s really ideal at–doing a whole bunch of calculations really fast–to simply draw from the runs scored, and runs allowed, distributions, randomly, say 100,000 times or whatever, to estimate your team’s real expected won-loss record under a fully random score distribution process. But you can also do more–you can test whether either the runs scored or allowed distribution fits the Poisson very well, using a chi-square goodness-of-fit test. And that’s important because it tells you, basically, whether or not they are homogeneous random processes–processes in which the data generating mechanism is unchanging through the season. In sports terms: it tells you the degree to which the team’s performance over the year, offensive and defensive, came from the same basic conditions (i.e. unchanging team performance quality/ability).
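A sketch of that simulation in Python, using only the standard library (Knuth’s multiplication method for the Poisson draws; the run totals are hypothetical, and ties are settled by a coin flip as a crude stand-in for extra innings):

```python
import math
import random

def poisson_draw(lam, rng):
    """Poisson variate via Knuth's multiplication method."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        p *= rng.random()
        k += 1
    return k - 1

def poisson_expected_wp(runs_scored, runs_allowed, games=162,
                        n_sims=100_000, seed=11):
    """Expected winning percentage when per-game runs scored and allowed
    are independent, homogeneous Poisson processes. Ties go to a coin
    flip (a simplification standing in for extra innings)."""
    rng = random.Random(seed)
    mean_rs = runs_scored / games
    mean_ra = runs_allowed / games
    wins = 0
    for _ in range(n_sims):
        rs = poisson_draw(mean_rs, rng)
        ra = poisson_draw(mean_ra, rng)
        if rs > ra or (rs == ra and rng.random() < 0.5):
            wins += 1
    return wins / n_sims
```

Comparing this simulated expectation with the team’s actual record gives a chance-only benchmark that, unlike the pythag, comes with an explicit probability model attached.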

The biggest issue remains however–interpretation. I don’t know how it all got started, but somewhere, somebody decided that a positive departure from “the pythag” (more wins than expected) equated to “good luck” and negative departures to “bad luck”. Luck being the operative word here. Actually I do know the origin–it’s a straightforward conclusion from attributing all deviations from expectation to “chance”. The problem is that many of these deviations are not in fact due to chance, and if you analyze the data using the Poisson as described above, you will have evidence of when that is, and is not, the case.

For example, a team that wins more close games than it “should”, games won by say just one or two runs, while getting badly smoked in a small subset of other games, will appear to benefit from “good luck”, according to the pythag approach. But using the Poisson approach, you can identify whether or not a team’s basic quality likely changed at various times during the season. Furthermore, you can also examine whether the joint distribution of events (runs scored, runs allowed) follows random expectation, given their individual distributions. If it does not, then you know that some non-random process is going on. For example, that team that wins (or loses) more than its expected share of close games most likely has some ability to win (or lose) close games–something about the way the team plays explains it, not random chance. There are many particular explanations, in terms of team skill and strategy, that can explain such results, and more specific data on a team’s players’ performance can lend evidence to the various possibilities.
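And here is a sketch of the goodness-of-fit side. Rather than looking up chi-square table values, this version gets a p-value by parametric bootstrap (simulate many seasons from a true Poisson with the same mean, and ask how often they produce a Pearson statistic at least as extreme); the per-game scores are hypothetical:

```python
import math
import random

def poisson_draw(lam, rng):
    """Poisson variate via Knuth's multiplication method."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while p > L:
        p *= rng.random()
        k += 1
    return k - 1

def pearson_stat(scores, lam, max_bin=10):
    """Pearson X^2 comparing observed per-game run counts with
    Poisson(lam) expectations; games with >= max_bin runs are lumped
    into a single tail bin."""
    n = len(scores)
    obs = [0] * (max_bin + 1)
    for s in scores:
        obs[min(s, max_bin)] += 1
    exp = [n * math.exp(-lam) * lam**k / math.factorial(k)
           for k in range(max_bin)]
    exp.append(n - sum(exp))  # tail bin, >= max_bin runs
    return sum((o - e)**2 / e for o, e in zip(obs, exp) if e > 0)

def poisson_gof_pvalue(scores, n_rep=2000, seed=3):
    """Parametric bootstrap p-value: the fraction of simulated
    true-Poisson seasons whose statistic is at least as large as the
    observed one."""
    rng = random.Random(seed)
    lam = sum(scores) / len(scores)
    obs_stat = pearson_stat(scores, lam)
    hits = 0
    for _ in range(n_rep):
        sim = [poisson_draw(lam, rng) for _ in scores]
        hits += pearson_stat(sim, sum(sim) / len(sim)) >= obs_stat
    return hits / n_rep
```

A tiny p-value says the season’s scores are over- or under-dispersed relative to a single homogeneous Poisson process–i.e. the team’s underlying quality likely wasn’t constant.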

So, the whole “luck” explanation that certain elements of the sabermetric crowd are quite fond of and have accepted as the Gospel of James, may be quite suspect at best, or outright wrong. I should add however that if the Indians win the series, it’s skill all the way while if the Cubs win it’ll most likely be due to luck.

I’ve been thinking some more about this issue–the idea that selection should tend to favor those genotypes with the smallest temporal variations in fitness, for a given mean fitness value (above 1.00). It’s taken some time to work through this and get a grip on what’s going on and some additional points have emerged.

The first point is that although I surely don’t know the entire history, the idea appears to be strictly mathematically derived, from modeling: theoretical. At least, that’s how it appears from the several descriptions that I’ve read, including Orr’s, and this one. These all discuss mathematics–geometric and arithmetic means, absolute and relative fitness, etc., making no mention of any empirical origins.

The reason should be evident from Orr’s experimental description, in which he sets up ultra-simplified conditions in which the several other important factors that can alter genotype frequencies over generations, are made unvarying. The point is that in a real world experimental test you would also have to control for these things, either experimentally or statistically, and that would not be easy. It’s hard to see why anybody would go to such trouble if the theory weren’t there to suggest the possibility in the first place. There is much more to say on the issue of empirical evidence. Given that it’s an accepted idea, and that testing it as the generalization it claims to be is difficult, the theoretical foundation had better be very solid. Well, I can readily conceive of two strictly theoretically-based reasons why the idea might well be suspect. For time’s sake, I’ll focus on just one of those here.

The underlying basis of the argument is that, if a growth rate (interest rate, absolute fitness, whatever) is perfectly constant over time, the product of the series gives the total change at the final time point, but if it is made non-constant, by varying it around that rate, then the final value–and thus the geometric mean–will decline. The larger the variance around the point, the greater the decline. For example, suppose a 2% increase of quantity A(0) per unit time interval, that is, F = 1.020. Measuring time in generations here, after g = 35 generations, A(35) = A(0)·F^g = A(0)·1.020^35 = 2·A(0); A is doubled in 35 generations. The geometric (and arithmetic) mean over the 35 generations is 1.020, because all the per-generation rates are identical. Now cause F to instead vary around 1.02 by setting it as the mean of a normal distribution with some arbitrarily chosen standard deviation, say 0.2. The geometric mean of the series will then drop (on average, asymptotically) to just below 1.0 (~ 0.9993). Since the geometric mean is what matters, genotype A will then not increase at all–it will instead stay about the same.
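The original code isn’t reproduced here, but the result is easy to verify with a quick sketch (Python; redrawing non-positive fitness values is my own assumption, needed because a normal distribution technically allows them while a fitness multiplier must be positive):

```python
import math
import random

def geometric_mean_fitness(mean_f, sd, g=50_000, seed=42):
    """Realized geometric mean of a fitness series drawn from
    Normal(mean_f, sd). Rare non-positive draws are redrawn, since a
    fitness multiplier must be positive (my assumption)."""
    rng = random.Random(seed)
    log_sum = 0.0
    for _ in range(g):
        f = rng.gauss(mean_f, sd)
        while f <= 0:
            f = rng.gauss(mean_f, sd)
        log_sum += math.log(f)
    return math.exp(log_sum / g)

# Varying F around 1.02 with sd = 0.2 drags the geometric mean to ~0.999;
# varying it around 1.00 drops the geometric mean to ~0.979.
```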

This is a very informative result. Using and extending it, now imagine an idealized population with two genotypes, A and B, in a temporally unvarying selection environment, with equal starting frequencies, A = B = 0.50. Since the environment doesn’t vary, there is no selection on either, that is F.A = F.B = 1.0 and they will thus maintain equal relative frequencies over time. Now impose a varying selection environment where sometimes conditions favor survival of A, other times B. We would then repeat the above exercise, except that now the mean of the distribution we construct is 1.000, not 1.020. The resulting geometric mean fitness of each genotype is now 0.9788 (just replace 1.02 with 1.00 in the above code).

So what’s going to happen? Extinction, that’s what. After 35 generations, each will be down to 0.9788^35 = 0.473 of its starting value, on average, and on the way to zero. The generalization is that any population having genotypes of ~ equal arithmetic mean (absolute) fitness and normally distributed values around that mean, will have all genotypes driven to extinction, and at a rate proportional to the magnitude of the variance. If instead, one genotype has an arithmetic mean fitness above a threshold value determined by its mean and variance, while all others are below it, then the former will be driven to fixation and the latter to extinction. These results are not tenable–this is decidedly not what we see in nature. We instead see lots of genetic variation, including vast amounts maintained over vast expanses of time. I grant that this is a fairly rough and crude test of the idea, but not an unreasonable one. Note that this also points up the potentially serious problem caused by using relative, instead of absolute, fitness, but I won’t get into that now.

Extinction of course happens in nature all the time, but what we observe in nature is the result of successful selection–populations and species that survived. We know, without question, that environments vary–wildly, any and all aspects thereof, at all scales, often. And we also know without question that selection certainly can and does filter out the most fit genotypes in those environments. Those processes are all operating but we don’t observe a world in which alleles are either eliminated or fixed. The above examples cannot be accurate mathematical descriptions of a surviving species’ variation in fitness over time–something’s wrong.

The “something wrong” is the designation of normally distributed variation, or more exactly, symmetrically distributed variation. To keep a geometric mean from departing from its no-variance value, one must skew the distribution around the mean value m, such that a value above the mean (m·t) is offset by a corresponding value below it (m/t)–that is the only way to create a stable geometric mean while varying the individual values. [EDIT: more accurately, the product of all the values must equal the mean raised to the number of values, but the values will be skewed in any case.] Mathematically, the way to do so is to work with the logarithms of the original values–the log of the geometric mean is designated as the mean of normally distributed logarithms of the individual values, of whatever size variance one wants. Exponentiation of the sum of the logarithms will equal the product of the fitness series.
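A sketch of that log-space construction (Python; the standard deviation of 0.2 on the log scale is an arbitrary illustrative choice):

```python
import math
import random

def skewed_fitness_series(gm, sd_log, g=50_000, seed=7):
    """A fitness series whose geometric mean is held (asymptotically) at
    gm: log-fitness is Normal(ln(gm), sd_log), so the exponentiated
    values are right-skewed (lognormal) rather than symmetric."""
    rng = random.Random(seed)
    return [math.exp(rng.gauss(math.log(gm), sd_log)) for _ in range(g)]

series = skewed_fitness_series(1.02, 0.2)
realized_gm = math.exp(sum(math.log(f) for f in series) / len(series))
# realized_gm stays ~1.02 despite substantial variation in the series
```

Note that the arithmetic mean of such a series sits above its geometric mean, which is part of why the symmetric-variation framing misleads.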

Hopefully, what I’m driving at is emerging. If the variance structure must obey this mathematical necessity to preserve a genotype’s mean fitness at 1.00, while still allowing the individual series values to vary…then why should we not expect the same to hold true when the mean geometric fitness is not equal to 1.00? I would argue that that’s exactly what we should expect, and that Gillespie’s original arguments–and Orr’s, and others’ summaries thereof–are not particularly defensible theoretical expectations of what is likely to be happening in nature. Specifically, the idea that the variance in fitness around an arithmetic mean should necessarily arise from symmetrically (normally) distributed values, is questionable.

As alluded to above, there is (at least) a second theoretical argument as well, but I don’t have time to get into it now (nor for this one for that matter). Suffice it to say that it involves simultaneous temporal changes in total population size and selective environments. All this without even broaching the entire hornet’s nest of empirically testing the idea, a topic reviewed five years ago by Simons. For starters, it’s not clear to me just how conservative “bet hedging” could ever be distinguished from the effects of phenotypic plasticity.

Last week at the blog Dynamic Ecology it was argued that natural selection behaves like a “risk-averse” money investor. That is, assuming that fitness varies over time (due to e.g. changing environmental variables or other selective factors), natural selection favors situations in which the mean fitness is maximized while the variance is minimized. The idea is explained in this short paper by Orr (2007), whose goal was to explain previous findings (Gillespie, 1973) intuitively. This presumes that knowledge of investor behavior is commonplace, but for my money, an examination of the math details and assumptions is what’s really needed.

Late last week a useful memo came down from the powers that be here at The Institute that I thought might prove informative regarding the inner workings of a powerful think tank, which The Institute most certainly is, in spades.

We wish, as always, to express our appreciation for the excellent, ongoing work that continues to move The Institute steadily forward, at roughly the cutting edge of science, or at least at the cutting edge of rough science. Accordingly, we take this opportunity to remind everyone of the basic tenets that have guided our various predictive activities in the past:

(1) Future events and event trajectories, notwithstanding our best efforts, continue to display an aggravating uncertainty, and it is remarkable just how easily this fact avoids taking up residence in our conscious minds.

(2) The future occupies a fairly large, and apparently non-diminishing, portion of the temporal spectrum.

(3) Given the above, it is incumbent upon us all to keep in mind the following:

(a) Phrasing article titles with undue certainty, given the actual knowledge of system behavior, while understandable from a science culture perspective, may be counter-productive in a larger context. Fortunately, many non-scientists tend to seize upon such titles and, lacking proper restraint, make them even worse, often proclaiming future event x to be a virtual certainty. Without the ability to re-direct attention to these exaggerations, often originating from the press and various activist groups, undue attention to our own excesses, for which we have no readily available excuse, could become noticeably more uncomfortable. This possibility is not in the best interest of either science or The Institute.

(b) Science doesn’t actually “prove” anything, proof being a rather archaic and overly harsh concept–a “bar too high” if you like. Rather, science is in the business of “suggesting” that certain things “may” happen somewhere “down the road”. Science, when you boil it right down to nails, is really nothing but a massive pile of suggestions of what might happen. The pile is the thing really and our goal is to contribute to it. Popper is entitled to his opinion but frankly, The Institute is not so arrogant as to assume the right of making judgments on this, that or the other members of said scientific pile.

(c) It is hoped that the relation of points (a) and (b) above does not require elaboration.

Sincerely,
The PTB

This is an excellent reminder and I have, personally, tacked this memo to the wall in front of my workstation, with intent to glance at it every now and then before tacking something else over top of it.

I just found out that the second annual Peer Review Week is well underway. There are several online articles on the topic, perhaps best found via Twitter searches using #RecognizeReview or #PeerRevWk16, or via links at the link above.

This year’s theme thereof is Recognition For Review, and in that context it’s perfect timing, relative to a peer review post that I already had in mind. I don’t think there’s any question that the peer review process as a whole has very major problems, ones which greatly weaken the clarity, efficiency and reliability of the scientific process. These problems originate largely in the design of the review process, which in turn affect review execution. However, this reality doesn’t preclude the fact that thousands of people perform excellent review work, daily. And they’re not getting much credit for it either.

Some attention then, to one of the most interesting, important–and puzzling–reviews I’ve ever seen. Occasionally a paper comes out which is worth paying intense attention to, for reasons that go beyond just its technical content, and this is surely one in my opinion. The review and paper in question are publicly available at Atmospheric Chemistry and Physics (ACP). This was a long, involved review on a long, involved paper. If you have limited time to devote to this, go read Peter Thorne’s ICARUS article, a summary of his overall review experience.

The journal is one of a set of European Geosciences Union (EGU) journals that have gone to a completely open review process. The commenting process is online and open to anyone, although two or more official reviewers are also designated by the editor; these (unlike the volunteer reviewers) may remain anonymous if they choose. For this open process alone the EGU deserves major recognition and gratitude, as it is arguably the single biggest step that can be taken to improve the peer review process. Everything has to be open.

There is a lot to say on this and I’ll start with the puzzling aspect of it. The article in question’s lead author is James Hansen, arguably still the most famous climate scientist in the world. Several of the reviews show that the article’s main claims are quite contentious, relative to the evidence and analysis presented, as summarized most completely by Thorne’s two reviews, the second of which–a phenomenal piece of review work–also summarizes Hansen et al’s responses (and non-responses) to the numerous reviewer comments, a job which presumably should really have fallen to the editor.

I’ve not yet worked all the way through everything, but you can’t read it and not wonder about some things. The authors didn’t have to submit their paper to an open review journal. So why did they? Did they assume the claims of the paper were largely non-contentious and it would thus slide smoothly through review? But given the clearly important claims, why not then submit to a highly prominent journal like Science or Nature for maximum attention and effect? Maybe they did, had it rejected and this was the second or third submission–I don’t know.

A second issue, one of several that did not sit at all well with Thorne, was the fact that Hansen et al. notified members of the press before submission, some of whom Thorne points out then treated it as if it were in fact a new peer reviewed paper, which it surely was not. When confronted on this point, Hansen was completely unapologetic, saying he would do the same thing again if given the chance, and giving as his reason the great importance of the findings to the world at large, future generations in particular. What? That response pretty well answers the question regarding his confidence in the main conclusions of the paper, and is disturbing in more than one way.

Thorne was also not at all pleased with Hansen’s flippant and/or non-responses to some of the review comments, and for this he took him severely to task for his general attitude, especially given the major weaknesses of the paper. The most important of the latter was the fact that there was no actual, model connection between the proposed processes driving rapid ice sheet melt, and the amount of fresh water flowing into the oceans to drive the rapid sea level rise that is the main claim of the paper. Rather, that flow was prescribed independently of the ice melt processes in what amounted to a set of “what if” scenarios more or less independent of the model’s ice melt dynamics. Worse, this highly important fact was not made clear and prominent: it had to be dug out by careful reading, and moreover, Hansen essentially denied that this was in fact the case.

There are major lessons here regarding conduct of peer review, how scientists should behave (senior scientists in particular), and scientific methodology. Unfortunately, I have no more time to give this right now–and I would give it a LOT more if I did. This is thus largely a “make aware” post. The paper and its review comprise a case study in many respects, and require a significant commitment. I personally have not seen a more important paper review in a very long time, if ever. Peter Thorne, some of the other volunteer reviewers, and ACP, deserve recognition for this work.

Those who cultivate the sciences among a democratic people are always afraid of losing their way in visionary speculation. They mistrust systems; they adhere closely to facts and the study of facts with their own senses. As they do not easily defer to the mere name of any fellow-man, they are never inclined to rest upon any man’s authority; but, on the contrary, they are unremitting in their efforts to point out the weaker points of their neighbor’s opinions. Scientific precedents have very little weight with them; they are never long detained by the subtlety of the schools, nor ready to accept big words for sterling coin; they penetrate, as far as they can, into the principal parts of the subject which engages them, and they expound them in the vernacular tongue. Scientific pursuits then follow a freer and a safer course, but a less lofty one.

The mind may, as it appears to me, divide science into three parts. The first comprises the most theoretical principles, and those more abstract notions, whose application is either unknown or very remote. The second is composed of those general truths, which still belong to pure theory, but lead nevertheless by a straight and short road to practical results. Methods of application and means of execution make up the third. Each of these different portions of science may be separately cultivated, although reason and experience show that none of them can prosper long, if it be absolutely cut off from the two others.

In America the purely practical part of science is admirably understood, and careful attention is paid to the theoretical portion which is immediately requisite to application. On this head the Americans always display a clear, free, original, and inventive power of mind. But hardly any one in the United States devotes himself to the essentially theoretical and abstract portion of human knowledge. In this respect the Americans carry to excess a tendency which is, I think, discernible, though in a less degree, among all democratic nations.

Discussing science on the internet can be interesting at times, even on Twitter, which seems to have been designed specifically to foster misunderstanding by way of brevity. Here are two examples from my week.

Early in the week, Brian Brettschneider, a climatologist in Alaska, put up a global map of monthly precipitation variability. Brian said the metric graphed constitutes the percentiles of a chi-square goodness-of-fit test comparing average monthly precipitation (P) against uniform monthly P. I then made the point that he might consider using the Poisson distribution of monthly P as the reference departure point instead, as this was the more correct expectation of the “no variation” situation. Brian responded that there was no knowledge, or expectation, regarding the dispersion of data, upon which to base such a decision. That response made me think a bit, and I then realized that I was thinking of the issue in terms of variation in whatever driving processes lead to precipitation measured at monthly scales, whereas Brian was thinking strictly in terms of the observations themselves–the data as they are, without assumptions. So, my suggestion was only “correct” if one is thinking about the issue the way I was. Then, yes, the Poisson distribution around the overall monthly mean will describe the expected variation of a homogeneous, random process, sampled monthly. But Brian was right in that there is no necessary reason to assume, a priori, that this is in fact the process that generated the data in various locations.

The second interchange was more significant, and worrisome. Green Party candidate for President, physician Jill Stein, stated “12.3M Americans could lose their homes due to a sea level rise of 9ft by 2050. 100% renewable energy by 2030 isn’t a choice, it’s a must.” This was followed by criticisms, but not just by the expected group but also by some scientists and activists who are concerned about climate change. One of them, an academic paleoecologist, Jacquelyn Gill, stated “I’m a climate scientist and this exceeds even extreme estimates“, and later “This is NOT correct by even the most extreme estimates“. She later added some ad-hominem barbs such as “That wasn’t a scientist speaking, it was a lawyer” and “The point of Stein’s tweet was to court green voters with a cherry-picked figure“. And some other things that aren’t worth repeating really.

OK so what’s the problem here? Shouldn’t we be criticizing exaggerations of science claims when they appear in the mass culture? Sure, fine, to the extent that you are aware of them and have the time and expertise to do so. But that ain’t really the point here, which is instead something different and more problematic IMO. Bit of a worm can in fact.

Steve Bloom has been following the climate change debate for (at least) several years, and works as hard to keep up on the science as any non-scientist I’ve seen. He saw Gill’s tweets and responded, that no, Stein’s statement did not really go so far beyond the extreme scientific estimates. He did not reference some poor or obsolete study by unknown authors from 25 years ago, but rather a long, wide ranging study by James Hansen and others, only a few months old, one that went through an impressive and unique open review process (Peter Thorne was one of the reviewers, and critical of several major aspects of the paper, final review here, and summary of overall review experience here). Their work does indeed place such a high rate of rise within the realm of defensible consideration, depending on glacier and ice sheet dynamics in Greenland and Antarctica, for which they incorporate into their modeling some recent findings on the issue. So, Jill Stein is not so off-the-wall in her comments after all, though she may have exaggerated slightly, and I don’t know where she got the “12.3M homes” figure.

The point is not that James Hansen is the infallible king of climate science, and therefore to be assumed correct. Hansen et al. might be right or they might be wrong, I don’t know. [If they’re right we’re in big trouble]. I wasn’t aware of the study until Steve’s tweeted link, and without question it will take some serious time and effort to work through the thing, even just to understand what they claim and how they got there, which is all I can expect to achieve. If I get to it at all that is.

One point is that some weird process has developed, where all of a sudden a number of scientists sort of gang up on some politician or whatever who supposedly said some outrageous thing or other. It’s not scientist A criticizing public person B this week and then scientist C criticizing public person D the next week–it’s a rather predictable group all ganging up on one source, at once. To say the least, this is suspicious behavior, especially given the magnitude of the problems I see within science itself. I do wonder how much of this is driven by climate change “skeptics” complaining about the lack of criticisms of extreme statements in the past.

To me, the bigger problem is that these criticisms are rarely aimed at scientists, but rather at various public persons. Those people are not immune to criticism, far from it. But in many cases, and clearly in this one, things being claimed originate from scientists themselves, in publications, interviews or speeches. For the most part, people don’t just fabricate claims, they derive them from science sources (or what they consider to be such), though they certainly may exaggerate them. If you don’t think the idea of such a rapid rise is tenable, fine…then take Hansen et al. to the cleaners, not Jill Stein. But, unless you are intimately familiar with the several issues involving sea level rise rates, especially ice melt, then you’ve got some very long and serious work ahead of you before you’re in any position to do so. This stuff is not easy or simple and the authors are no beginners or lightweights.

The second issue involves the whole topic of consensus, which is a very weird phenomenon among certain climate scientists (not all, by any means). As expected, when I noted that Stein was indeed basically referencing Hansen et al., I was hit with the basic argument (paraphrased) “well they’re outside of the consensus (and/or IPCC) position, so the point remains”. Okay, aside from the issues of just exactly how this sacred consensus is to be defined anyway… yeah, let’s say they are outside of it, so what? The “consensus position” now takes authority over evidence and reasoning, modeling and statistics, newly acquired data etc., that is, over the set of tools we have for deciding which, of a various set of claims, is most likely correct? Good luck advancing science with that approach, and especially in cases where questionable or outright wrong studies have formed at least part of the basis of your consensus. It’s remarkably similar to Bayesian philosophy–they’re going to force the results from prior studies to be admitted as evidence, like it or not, independent of any assessment of their relative worth. Scientific goulash.

And yes, such cases do indeed exist, even now–I work on a couple of them in ecology, and the whole endeavor of trying to clarify issues and correct bad work can be utterly maddening when you have to deal with that basic mindset.

So, without getting into the reasons, I’m reading through the entry in the International Encyclopedia of Statistical Science on “Statistical Fallacies: Misconceptions and Myths”, written by one “Shlomo Sawilowsky, Professor, Wayne State University, Detroit MI, USA”. Within the entry, 20 such fallacies are each briefly described.

Sawilowsky introduces the topic by stating:

Compilations and illustrations of statistical fallacies, misconceptions, and myths abound…The statistical faux pas is appealing, intuitive, logical, and persuasive, but demonstrably false. They are uniformly presented based on authority and supported based on assertion…these errors spontaneously regenerate every few years, propagating in peer reviewed journal articles…and dissident literature. Some of the most egregious and grievous are noted below.

Great, let’s get after it then.

He then gets into his list, which proceeds through a set of more or less standard issues, including misunderstandings of the Central Limit Theorem, Type I errors, p values, effect sizes and so forth. Up comes item 14:

14. Chi-square
(a) We live in a Chi-square society due to political correctness that dictates equality of outcome instead of equality of opportunity. The test of independence version of this statistic is accepted sans voire dire by many legal systems as the single most important arbiter of truth, justice, and salvation. It has been asserted that any statistical difference between (often even nonrandomly selected) samples of ethnicity, gender, or other demographic as compared with (often even inaccurate, incomplete, and outdated) census data is primae faciea evidence of institutional racism, sexism, or other ism. A plaintiff allegation that is supportable by a significant Chi-square is often accepted by the court (judges and juries) praesumptio iuris et de iure. Similarly, the goodness of fit version of this statistic is also placed on an unwarranted pedestal.

Bingo Shlomo!!

Now this is exactly what I want from my encyclopedia entries: a strictly apolitical, logical description of the issue at hand. In fact, I hope to delve deep into other statistical writings of Dr. Sawilowsky to gain, hopefully, even better insights than this one.

Postscript: I’m not really bent out of shape on this, and would indeed read his works (especially this one: Sawilowsky, S. (2003) Deconstructing arguments from the case against hypothesis testing. J. Mod. Appl. Stat. Meth. 2(2):467-474). I can readily overlook ideologically driven examples like this to get at the substance I’m after, but I do wonder how a professional statistician worked that into an encyclopedia entry.

I note also that the supposed “screening fallacy” popular on certain blogs is not included in the list…and I’m not the least bit surprised.

General views of the Fashioned, be it matter aggregated into the farthest stars of heaven, be it the phenomena of earthly things at hand, are not merely more attractive and elevating than the special studies which embrace particular portions of natural science; they further recommend themselves peculiarly to those who have little leisure to bestow on occupations of the latter kind. The descriptive natural sciences are mostly adapted to particular circumstances: they are not equally attractive at every season of the year, in every country, or in every district we inhabit. The immediate inspection of natural objects, which they require, we must often forego, either for long years, or always in these northern latitudes; and if our attention be limited to a determinate class of objects, the most graphic accounts of the travelling naturalist afford us little pleasure if the particular matters, which have been the special subjects of our studies, chance to be passed over without notice.

As universal history, when it succeeds in exposing the true causal connection of events, solves many enigmas in the fate of nations, and explains the varying phases of their intellectual progress—why it was now impeded, now accelerated—so must a physical history of creation, happily conceived, and executed with a due knowledge of the state of discovery, remove a portion of the contradictions which the warring forces of nature present, at first sight, in their aggregate operations. General views raise our conceptions of the dignity and grandeur of nature; and have a peculiarly enlightening and composing influence on the spirit; for they strive simultaneously to adjust the contentions of the elements by the discovery of universal laws, laws that reign in the most delicate textures which meet us on earth, no less than in the Archipelagos of thickly clustered nebulae which we see in heaven, and even in the awful depths of space—those wastes without a world.

General views accustom us to regard each organic form as a portion of a whole; to see in the plant and in the animal less the individual or dissevered kind, than the natural form, inseparably linked with the aggregate of organic forms. General views give an irresistible charm to the assurance we have from the late voyages of discovery undertaken towards either pole, and sent from the stations now fixed under almost every parallel of latitude, of the almost simultaneous occurrence of magnetic disturbances or storms, and which furnish us with a ready means of divining the connection in which the results of later observation stand to phenomena recorded as having occurred in bygone times; general views enlarge our spiritual existence, and bring us, even if we live in solitude and seclusion, into communion with the whole circle of life and activity — with the earth, with the universe.

This post constitutes a wrap-up and summary of this series of articles on clustering data.

The main point thereof is that one needs an objective method for obtaining evidence of meaningful groupings of values (clusters) in a given data set. This issue is most relevant to non-experimental science, in which one is trying to determine whether the observed data are explainable by random processes alone, or instead by processes that would lead to whatever structure has been observed in the data.

But I’m still not happy with my description in part four of this series, regarding observed and expected data distributions, and what these imply for data clustering outputs. In going over Ben Bolker’s outstanding book for the zillionth time (I have literally worn the cover off of this book; parts of it are freely available here), I find that he explains, better than I did, what I was trying to get at, in his description of the negative binomial distribution relative to the concept of statistical over-dispersion. He writes (p. 124):

…rather than counting the number of successes obtained in a fixed number of trials, as in a binomial distribution, the negative binomial counts the number of failures before a pre-determined number of successes occurs.

This failure-process parameterization is only occasionally useful in ecological modeling. Ecologists use the negative binomial because it is discrete, like the Poisson, but its variance can be larger than its mean (i.e. it can be over-dispersed). Thus, it’s a good phenomenological description of a patchy or clustered distribution with no intrinsic upper limit, that has more variance than the Poisson…The over-dispersion parameter measures the amount of clustering, or aggregation, or heterogeneity in the data…

Specifically, you can get a negative binomial distribution as the result of a Poisson sampling process where the rate lambda itself varies. If lambda is Gamma-distributed (p.131) with shape parameter k and mean u, and x is Poisson-distributed with mean lambda, then the distribution of x will be a negative binomial distribution with mean u and over-dispersion parameter k (May, 1978; Hilborn and Mangel, 1997). In this case, the negative binomial reflects unmeasured (“random”) variability in the population.

The relevance of this quote is that a distribution that is over-dispersed, that is, one that has longer right or left (or both) tails than expected from a Poisson distribution having a given mean, is evidence for a non-constant process structuring the data. The negative binomial distribution describes this non-constancy, in the form of an “over-dispersion parameter” (k). In that case, the process that is varying is doing so smoothly (as defined by a gamma distribution), and the resulting distribution of observations will therefore also be smooth. In a simpler situation, one where there are say, just two driving process states, a bi-modal distribution of observations will result.
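Bolker’s gamma-mixed Poisson recipe is easy to verify by simulation: draw each sample’s rate lambda from a gamma with shape k and mean u, then draw a Poisson count with that rate; the resulting variance should come out near u + u²/k rather than u. A sketch (Python rather than R, for illustration; k and u chosen arbitrarily):

```python
import math, random

random.seed(1)

def rpois(lam):
    """Poisson draw via Knuth's method (fine for modest lambda)."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

k_shape, u = 1.0, 10.0          # gamma shape k and mean u (arbitrary)
lams = [random.gammavariate(k_shape, u / k_shape) for _ in range(5000)]
xs = [rpois(lam) for lam in lams]

mean = sum(xs) / len(xs)
var = sum((x - mean)**2 for x in xs) / (len(xs) - 1)
print(mean, var)   # var >> mean: over-dispersed, unlike a pure Poisson
```

With k = 1 and u = 10 the variance comes out around an order of magnitude larger than the mean, which a homogeneous Poisson process cannot produce.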

Slapping a clustering algorithm on the latter will return two clusters whose distinction is truly meaningful–the two sets of values were likely generated by two different process parameters. A clustering applied to a negative binomial distribution will be arbitrary with respect to just which values get placed in which cluster, and even to the final number of clusters delineated, but not with respect to the conclusion that the observations do not result from a single homogeneous process, which is a potentially important piece of information. Observation of the data, followed by some maximum likelihood fits of negative binomial distributions, would then inform one that the driving process parameters varied smoothly, rather than discretely/bimodally.

It’s not always easy to hit all the important points in explaining an unfamiliar topic and I need to step back and mention a couple of important but omitted points.

The first of these is that, from a known mean and the assumption that values are randomly distributed, we can compute the expected distribution of individual values; and since the mean must itself be obtained from a set of individual values, we can compare the expected and observed distributions, and thus evaluate randomness. The statistical distributions designed for this task are the Poisson and the gamma, for integer- and real-valued data respectively. Much of common statistical analysis is built around the normal distribution, and people are thus generally most familiar with it and prone to use it, but the normal won’t do the job here. This is primarily because it’s not designed to handle skewed distributions, which are a problem whenever data values are small or otherwise bounded at one end of the distribution (most often at zero).

Conversely, the Poisson and gamma have no problem with such situations: they are built for the task. This is interesting, given that both are defined by just one parameter (the overall mean), instead of two as for the normal (mean and standard deviation). So they are simpler, and yet more accurate over more situations than the normal–not an everyday occurrence in modeling. Instead, for whatever reason, a lot of historical effort has been devoted to transforming skewed distributions into roughly normal ones, usually by taking logarithms or roots, as with e.g. the log-normal distribution. But this is ad hoc methodology that brings other problems with it, including those of back-transformation.

The second point is hopefully more obvious. This is that although it is easy to just look at a small set of univariate data and see evidence of structure (clustered or overly regular values), large sample sizes and/or multivariate data quickly overwhelm the brain’s ability to do this well, and at any rate we want to assign a probability to this non-randomness.

The third point is maybe the most important one, and relates to why the Poisson and gamma (and others, e.g. the binomial, negative binomial etc.) are very important in analyzing non-experimental data in particular. Indeed, this point relates to the issue of forward versus inverse modeling, and to issues in legitimacy of data mining approaches. I don’t know that it can be emphasized enough how radically different the experimental and non-experimental sciences are, in terms of method and approach and consequent confidence of inference. This is no small issue, constantly overlooked IMO.

If I’ve got an observed data set, originating from some imperfectly known set of processes operating over time and space, I’ve got immediate trouble on my hands in terms of causal inference. Needless to say, there are many such data sets in the world. When the system is known to be complex, such that elucidating the mechanistic processes at the temporal and spatial scales of interest is likely to be difficult, it makes perfect sense to examine whether certain types of structure might exist just in the observed data themselves, structure that can provide clues as to just what is going on. The standard knock on data mining, and inverse modeling approaches more generally, is the possibility of false positive results–concluding that apparent structures in the data are explainable by some driving mechanism when in fact they are due to random processes. This is of course a real possibility, but I find the objection more or less completely overblown, primarily because those who conduct this type of analysis are usually quite well aware of this possibility, thank you.

Overlooked in those criticisms is the fact that by first identifying real structure in the data–patterns explainable by random processes at only a very low probability–one can immediately gain important clues as to which possible causal factors to examine more closely, instead of going on a random fishing expedition. A lot of examples could be given here, but I’m thinking ecologically, and in ecology many variables vary in a highly discontinuous way, which affects how we have to consider things. This concept applies not only to biotic processes, which are inherently structured by the various aggregational processes at work in populations and communities of organisms, but also to various biophysical thresholds and inflection points, whose operation over large scales of space or time is often anything but well understood or documented. As just one rough but informative example, in plant ecology a large fraction of what is going on occurs underground, where all kinds of important discontinuities–chemical, hydrologic, climatic, and of course biological–can occur.

So, the search for non-random patterns within observed data sets–before ever even considering the possible drivers of those patterns–is, depending on the level of apriori knowledge of the system in question, a potentially very important activity. In fact, I would argue that this is the most natural and efficient way to proceed in running down cause and effect in complex systems. And it is also one requiring a scientist to have a definite awareness of the various possible drivers of observed patterns and their scales of variation.

So, there’s a reason plant ecologists should know some physiology, some reproductive biology, some taxonomy, some soil science, some climatology, some…

In ecology and other sciences, grouping similar objects together, for further analytical purposes or just as an end in itself, is a basic task, one accomplished by cluster analysis, one of the most fundamental tools in statistics. In all but the smallest sample sizes, the number of possible groupings very rapidly becomes enormous, and it is therefore necessary both to (1) have some way of efficiently avoiding the vast number of clearly non-optimal clusterings, and (2) choose the best solution from among those that seem at least reasonable.

First some background. There are (at least) three basic approaches to clustering. Two of these are inherently hierarchical in nature: they either aggregate individual objects into ever-larger groups (agglomerative methods), or successively divide the entire set into ever-smaller ones (divisive methods). Hierarchical methods are based on a distance matrix that defines the distance (in measurement space) between every possible pair of objects, as determined by the variables of interest (typically multivariate) and the choice of distance measure, of which there are several depending on one’s definition of “distance”. This distance matrix grows as n(n-1)/2, roughly the square of n, and so for large datasets these methods quickly become untenable, unless one has an enormous amount of computer memory available, which the average scientist typically does not.
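To see how fast this bites, note that n objects generate n(n-1)/2 pairwise distances. A quick sketch of the storage cost, assuming 8 bytes per stored distance:

```python
def dist_matrix_cost(n, bytes_per_entry=8):
    """Pairwise distance count n(n-1)/2 and its storage in gigabytes."""
    pairs = n * (n - 1) // 2
    return pairs, pairs * bytes_per_entry / 1e9

for n in (1_000, 100_000, 1_000_000):
    pairs, gb = dist_matrix_cost(n)
    print(f"n = {n:>9,}: {pairs:>15,} distances, {gb:,.1f} GB")
```

A thousand objects are trivial, but a million objects already demand roughly four terabytes just to hold the distances, before any clustering work begins.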

The k-means clustering algorithm works differently–it doesn’t use a distance matrix. Instead it chooses a number of random cluster starting points (“centers”) and then measures the distance to all objects from those points, and agglomerates stepwise according to which objects are closest to which centers. This greatly reduces the memory requirement for large data sets, but a drawback is that the output depends on the initial choice of centers; one should thus try many different starting combinations, and even then, the best solution is not guaranteed. Furthermore, one sets the number of final clusters desired beforehand, but there is no guarantee that the optimal overall solution will in fact correspond to that choice, and so one has to repeat the process for all possible cluster numbers that one deems reasonable, with “reasonable” often being less than obvious.
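The restart strategy is simple enough to sketch directly; R’s kmeans exposes it via its nstart argument. A toy one-dimensional version in Python (data, restart count, and iteration cap all arbitrary):

```python
import random

random.seed(42)

def kmeans_1d(data, k, n_starts=25, iters=100):
    """1-D Lloyd's algorithm with random restarts, keeping the solution
    with the lowest within-cluster sum of squares (WSS)."""
    best_wss, best_centers = float("inf"), None
    for _ in range(n_starts):
        centers = [float(c) for c in random.sample(data, k)]
        for _ in range(iters):
            groups = [[] for _ in range(k)]
            for x in data:
                nearest = min(range(k), key=lambda i: (x - centers[i])**2)
                groups[nearest].append(x)
            new = [sum(g) / len(g) if g else centers[i]
                   for i, g in enumerate(groups)]
            if new == centers:     # converged for this start
                break
            centers = new
        wss = sum(min((x - c)**2 for c in centers) for x in data)
        if wss < best_wss:
            best_wss, best_centers = wss, sorted(centers)
    return best_wss, best_centers

data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
wss, centers = kmeans_1d(data, 3)
print(wss, centers)   # expect the three obvious triplets (WSS = 6)
```

Individual starts can and do converge to poor local optima on even this easy data set, which is exactly why the restarts are not optional; and even then, as noted above, the best solution is not guaranteed.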

When I first did a k-means cluster analysis, years ago, I did it in SPSS, and I remember being surprised that the output did not include a probability value, that is, the likelihood of obtaining a given clustering by chance alone. There was thus no way to determine which among the many possible solutions was in fact the best one, which seemed a pretty major shortcoming, possibly an inexcusable one. Now I’m working in R, and I find… the same thing. In R, the two workhorse clustering functions, both in the main stats package, are kmeans and hclust, corresponding to k-means and hierarchical clustering respectively. Neither gives the probability of the solution as part of its output. So it wasn’t just SPSS–and if R doesn’t provide it, then it’s quite possible that no statistical software package (SAS, S-Plus, SigmaStat, etc.) does, although I don’t know that for sure.

There is one function in R that attempts to identify what it calls the “optimal clustering”: function optCluster, in the package of the same name. That function, while definitely useful, appears only to provide a set of different metrics by which to evaluate the effectiveness of a given clustering solution, as obtained from 16 possible clustering methods, with no actual probabilities attached to any of them. What I’m after is different, more defensible and definitely more probabilistic, and it requires some careful thought regarding just what clustering should be about in the first place.

If we talk about grouping objects together, we gotta be careful. This piece at Variance Explained gives the basic story of why, using examples from a k-means clustering. A principal point is that one can create clusters from any data set, but the result doesn’t necessarily mean anything. And I’m not just referring to the issue of relating the variable being clustered to other variables of interest in the system under study. I’m talking about inherent structure in the data, even univariate data.

This point is easy to grasp with a simple example. If I have the set of 10 numbers from 0 to 9, a k-means clustering into two groups will place 0 to 4 in one group and 5 to 9 in the other, as will most hierarchical clustering trees trimmed to two groups. Even if some clustering methods were to sometimes place say, 0 to 3 in one group and 4 to 9 in the other, or similar outcome (which they conceivably might–I haven’t tested them), the main point remains: there are no “natural” groupings in those ten numbers–they are as evenly spaced as is possible to be, a perfect gradient. No matter how you group them, the number of groups and the membership of each will be an arbitrary and trivial result. If, on the other hand, you’ve got the set {0,1,2,7,8,9} it’s quite clear that 0-2 and 7-9 define two natural groupings, since the members of each group are all within 1 unit of the means thereof, and with an obvious gap between the two.
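Because the data here are univariate, we don’t even need k-means to make the point: an exhaustive search over every two-way split of the sorted values will do. A sketch (the “fraction of SS explained” framing is my own, not a standard statistic):

```python
def explained_ss(xs):
    """Fraction of total sum of squares removed by the best of all
    possible two-group splits of sorted 1-D data."""
    xs = sorted(xs)
    def ss(g):
        m = sum(g) / len(g)
        return sum((x - m)**2 for x in g)
    total = ss(xs)
    best_within = min(ss(xs[:i]) + ss(xs[i:]) for i in range(1, len(xs)))
    return 1 - best_within / total

grad = explained_ss(range(10))             # perfect gradient 0..9
gap = explained_ss([0, 1, 2, 7, 8, 9])     # two natural groups
print(grad, gap)   # the gapped set scores much higher
```

Any split of the gradient removes a fair chunk of the variance simply because the data are ordered, which is exactly why the raw score alone can’t be trusted; it is the contrast with the gapped case that carries the information.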

This point is critical, as it indicates that we should seek a clustering evaluation method that is based in an algorithm capable of making this discrimination between a perfect gradient and tightly clustered data. Actually it has to do better than that–it has to be able to distinguish between perfectly spaced data, randomly spaced data, and clustered data. Randomly spaced data will have a natural degree of clustering by definition, and we need to be able to distinguish that situation from truly clustered data, which might not be so easy in practice.

There are perhaps several ways to go about doing this, but the one that is most directly obvious and relevant is based on the Poisson distribution. The Poisson defines the expected values in a set of sub-samples, given a known value determined from the entire object collection, for the variable of interest. Thus, from the mean value over all objects (no clustering), we can determine the probability that the mean values for each of the n groups resulting from a given clustering algorithm (of any method), follow the expectation defined by the Poisson distribution determined by that overall mean (the Poisson being defined by just one parameter). The lower that probability is, the more likely that the clusters returned by the algorithm do in fact represent a real feature of the data set, a natural aggregation, and not just an arbitrary partitioning of random or gradient data. Now maybe somebody’s already done this, I don’t know, but I’ve not seen it in any of the statistical software I’ve used, including R’s two workhorse packages stats and cluster.
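One simple relative of this idea is a Monte Carlo test of the index of dispersion (variance/mean, which is 1 in expectation for Poisson data): simulate many Poisson data sets at the observed mean and ask how often they come out as dispersed as the real data. This is not the cluster-means version proposed above, just the simplest member of the same family (Python for illustration; data and simulation count arbitrary):

```python
import math, random

random.seed(7)

def rpois(lam):
    """Poisson draw via Knuth's method."""
    L = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= L:
            return k
        k += 1

def dispersion(xs):
    """Index of dispersion: sample variance / mean (1 for a Poisson)."""
    m = sum(xs) / len(xs)
    return sum((x - m)**2 for x in xs) / (len(xs) - 1) / m

def poisson_mc_test(counts, n_sim=2000):
    """Monte Carlo p-value: how often Poisson data with the same mean and
    sample size are at least as over-dispersed as the observed counts."""
    d_obs = dispersion(counts)
    m = sum(counts) / len(counts)
    hits = sum(dispersion([rpois(m) for _ in counts]) >= d_obs
               for _ in range(n_sim))
    return hits / n_sim

p = poisson_mc_test([0, 1, 2, 7, 8, 9])
print(p)   # small: this much spread is unlikely under a single Poisson
```

The same simulate-under-the-null logic extends naturally to the cluster-means statistic: apply the clustering algorithm to each simulated data set and compare the resulting group structure against the one obtained from the real data.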

This is a long post. It analyzes a paper that recently appeared in Nature. It’s not highly technical but does get into some important analytical subtleties. I often don’t know where to start (or stop) with the critiques of science papers, or what good it will do anyway. But nobody ever really knows what good any given action will do, so here goes. The study topic involves climate change, but climate change is not the focus of either the study or this post. The issues are, rather, mainly ecological and statistical, set in a climate change situation. The study illustrates some serious, and diverse problems.

Before I get to it, a few points:

The job of scientists, and science publishers, is to advance knowledge in a field

The highest profile journals cover the widest range of topics. This gives them the largest and most varied readerships, and accordingly, the greatest responsibilities for getting things right, and for publishing things of the highest importance

I criticize things because of the enormous deficit of critical commentary from scientists on published material, and the failures of peer review. The degree to which the scientific enterprise as a whole just ignores this issue is a very serious indictment upon it

I do it here because I’ve already been down the road–twice in two high profile journals–of doing it through journals’ established procedures (i.e. the peer-reviewed “comment”); the investment of time and energy, given the returns, is just not worth it. I’m not wasting any more of my already limited time and energy playing by rules that don’t appear to me designed to actually resolve serious problems. Life, in the end, boils down to determining who you can and cannot trust and acting accordingly

For those without access to the paper, here are the basics. It’s a transplant study, in which perennial plants are transplanted into new environments to see how they’ll perform. Such studies have at least a 100-year history, dating to genetic studies by Bateson, the Carnegie Institution, and others. In this case, the authors focused on four forbs (broad-leaved, non-woody plants) occurring in mid-elevation mountain meadows in the Swiss Alps. They wanted to explore the effects of new plant community compositions and temperature (T) change, alone and together, on three fitness indicators: survival rate, biomass, and fraction flowering. They attempted to simulate having either (1) entire plant communities, or (2) just the four target species, experience sudden T increases, by moving them downslope 600 meters. [Of course, a real T increase in a montane environment would move responsive taxa upslope, not down.] More specifically, they wanted to know whether competition with new plant taxa–in a new community assemblage–would make any observed effects of T increases worse, relative to those experienced under competition with species they currently co-occur with.

Their Figure 1 illustrates the strategy:

Figure 1: Scenarios for the competition experienced by a focal alpine plant following climate warming. If the focal plant species (green) fails to migrate, it competes either with its current community (yellow) that also fails to migrate (scenario 1) or, at the other extreme, with a novel community (orange) that has migrated upwards from lower elevation (scenario 2). If the focal species migrates upwards to track climate, it competes either with its current community that has also migrated (scenario 3) or, at the other extreme, with a novel community (blue) that has persisted (scenario 4).

There are many things I don’t understand regarding how various folks do things, and here’s a good example. It falls, to my mind, within what Brian McGill at the Dynamic Ecology blog last year called “statistical machismo”: the use of statistical methods that are more complicated than necessary in order to appear advanced or sophisticated, when these are either unnecessary (but trendy), or worse, clearly not the best choice for the problem at hand. “Cutting edge”, their practitioners like to think of them, but I’ve got no problem in calling them sophistry that stems from a lack of real statistical understanding, combined with a willingness to do whatever will maximize the chances of getting published, of which there rarely seems to be a shortage.

I’ve had, for some time, a growing suspicion that much of the use of Bayesian statistical methods in science falls pretty squarely into this category. That of course doesn’t mean I’m right, especially since I do not fully understand everything about modern Bayesian methods, but I get the basic ideas, and the following example is a good illustration of why I think that way. It relates to my recent cogitations on the design of a general, model-based partitioning (clustering) algorithm for a common type of categorical data: data in which each sample is represented by only a small fraction of the total number of categories. In such cases, clear associations among the various categories are far from obvious.

I started thinking about the issue in relation to the estimation of forest tree community types in some widespread and very important historical tree data sets, where each sample contains individuals from at most two to four tree taxa (usually species), when there may be upwards of 10 to 30 such taxa in the population over some large landscape area (earlier post on this topic here). However, the issue has by far its greatest application in the field of population genetics, specifically w.r.t. the goal of identifying cryptic population structure–that is, identifiable groups of individuals who are breeding primarily or entirely among themselves (“demes”), leading to allele and genotype frequencies that vary characteristically from deme to deme, demes which are not otherwise readily identifiable by external phenotypic characters. These are the groups involved in the first step on the road to “incipient species”, to use Darwin’s phrase. The similarity with the tree data is that at each gene locus, any given diploid individual–which represents our sample–has only two alleles, even though many more may occur in some larger, defined population.

In 2000, Pritchard et al. published what would have to be considered a landmark study, given that it’s been cited nearly 14,000 times since. This comes out to about 2.5 citations per day; I wouldn’t have guessed that so many popgen papers were even published at that kind of rate. The paper introduces a method and program (“STRUCTURE”) for the above-stated task, one based on Bayesian techniques, using Markov Chain Monte Carlo (MCMC), which is an iterative method for estimating parameters of the posterior distribution when no analytical techniques, or approximations thereof, are available. Furthermore, the paper has spawned several spin-off papers introducing various modifications, but all based on the same basic Bayesian/MCMC approach. And each of those has gotten hundreds to thousands of citations as well.

I freely admit that I am an unabashed maximum likelihood (ML) type of thinker when it comes to statistical inference and model selection. I’m far from convinced that Bayesianism offers any clear, definitive advantage over ML methods, while appearing to be saddled with some complex, time-consuming and uncertain estimation techniques (like MCMC), which is most definitely a disadvantage. To my view, Bayesianism as a whole might well fall within Brian’s machismo category, at least as employed in current practice, if not in its fundamental tenets. I very much doubt that many people who use it do so for any reason other than that a lot of others are using it, and so they just go with the flow, thinking all is kosher. Scientists do that a lot, at least until they develop their own understanding of the issues.

As I was thinking through the problem, it seemed pretty clear to me that, although a strict analytical solution was indeed not possible, a maximum likelihood approach, heavily guided by expectations from binomial/multinomial probability and divisive clustering, was the way to go. Indeed, I can’t see any other logical and algorithmically efficient way to go about solving this type of problem. The underlying goal and assumptions remain the same as Pritchard et al.’s, namely to find groups that approximate Hardy-Weinberg equilibrium, and which therefore represent approximately randomly mating groups. And there is also still a “Monte Carlo” procedure involved, but it’s quite different: always guided by a definite strategy, and much less intensive and random than the Bayesian/MCMC approach. As far as I can tell, nobody’s taken this approach (although I just found an Iowa State student’s dissertation from last year that might have), and I don’t know why. I thought it was recognized that when you default to a uniform (i.e. uninformative) prior probability distribution–because you really have no idea otherwise, or worse, because the idea of some “prior distribution” doesn’t even make sense for the problem to begin with–and you have quite a few parameters to estimate, MCMC algorithms can be very slow to converge (if they converge at all), and may converge to unstable estimates at that. But that’s exactly what the authors did, and there are other limitations of the approach as well, such as having to constrain the total number of possible demes from the outset–presumably because the algorithm would otherwise choke on the number of possible solutions.
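For what it’s worth, the Hardy-Weinberg target itself is straightforward to evaluate at a single biallelic locus: estimate the allele frequency p from the genotype counts, then compare the observed counts against the expected n·(p², 2pq, q²). A minimal sketch (Python for illustration; the counts are invented):

```python
def hwe_chisq(n_AA, n_Aa, n_aa):
    """Chi-square statistic comparing observed genotype counts with
    Hardy-Weinberg expectations from the estimated allele frequency."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)       # frequency of allele A
    q = 1 - p
    expected = [n * p**2, n * 2 * p * q, n * q**2]
    observed = [n_AA, n_Aa, n_aa]
    return sum((o - e)**2 / e for o, e in zip(observed, expected))

print(hwe_chisq(25, 50, 25))   # 0.0: exactly at Hardy-Weinberg proportions
print(hwe_chisq(40, 20, 40))   # large: strong heterozygote deficit
```

The second call illustrates the heterozygote deficit that appears when two separately mating demes are pooled (the Wahlund effect), which is exactly the signal such partitioning algorithms exploit.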

These are the kinds of things I run into far more often than is desirable, and which generate a certain mix of confusion, frustration and depression. If I keep working on this topic–which I find fascinating and which, being statistical, generalizes to different fields, but which I really don’t have time for–I’ll post more details about the two approaches. The fan mail has been clamoring for posts on this topic. Or ask questions if you’re entirely desperate for something to do while waiting for the Final Four to go at it.

Continuing from part one, this post looks at a specific method for estimating TCS (transient climate sensitivity), for any desired year and/or radiative forcing scenario, as predicted by any AOGCM climate model. And some associated topics.

The basic idea was devised by Good et al. (2010, 2011; links at end), and expanded upon by Caldeira and Myhrvold (2013), who fit various equations to the data. The idea is fairly simple, but clever, and integrates some nice mathematical solutions/approximations, including Gregory’s linear regression ECS estimation method. Suppose you have an idealized RF pulse or “step” increase (i.e. a sudden, one-time increase, as with the instant 4X CO2 (= ~7.4 W/m^2) experiment in CMIP5), and you run any given AOGCM for, say, 150-300 years from that point. You can then record the temperature course resulting from the pulse over that time, which will rise toward an asymptote determined by the climate sensitivity; that asymptote will be twice the ECS value (because the CO2 pulse is to 4X, not 2X, CO2). From these data one can fit various curves describing the T trend as a function of time. One then simply scales that response curve linearly to any more realistic RF increase of interest, corresponding to, say, a 1.0% or 0.5% per year CO2 increase, or whatever. Lastly, if each year’s RF increase is considered as one small pulse, an overlay and summation of the temperature responses from all such pulses gives each year’s estimated temperature response, for however long the RF is increasing. The RF increase does not have to stop at any point, although it can; it can also increase or decrease at any rate over time.

The figure below, from the paper, illustrates the method (Fig. 1 of the paper, original caption):

Fig. 1 Illustrating the method. a Global mean temperature evolution in a 4xCO2 step experiment (from the HadCM3 GCM; CMIP5 GCMs give qualitatively similar results). b Reconstruction method for years 1–5 of a 1pctCO2 experiment. Red, yellow, green, blue and purple curves: temperature responses estimated for the forcing changes in years 1, 2, 3, 4 and 5 respectively. Each coloured curve is identical (for the case of the 1pctCO2 scenario) and is given by scaling the step experiment temperature response. Black curve: reconstructed temperature response, given by the sum of the coloured curves (Eq. 1a).

Good et al. (2011) did this for nine AOGCMs, testing the method against the results of the CMIP5 1% per year CO2 increase experiment. This is interesting: they are testing whether the basic functional response to an instant quadrupling of CO2 is similar to that from a 1% per year increase over 140 years. And lo and behold, the overall agreement was very high, both for the collection of models and individually, for both surface T and heat content. Their Fig. 2 is shown below:

To me, this result is rather astounding, as it says that the time course of the temperature response to a pulsed RF increase is highly similar in form, no matter the magnitude of that increase. That is absolutely not a result I would have expected, given that the thermodynamic interaction between the ocean and the atmosphere is highly important and seemingly not likely to be in phase across such different magnitudes. Of course, this result does not prove this dynamic to be a reality–only that the AOGCM models tested consider, via their encoded physics, the two responses to be highly similar in form, just differing in magnitude.

Caldeira and Myhrvold (2013) then extended this approach by fitting four different equation forms and evaluating best fits via the Akaike AIC and RMSE criteria. To do this they first used the Gregory ECS estimation method (ref at end) to define the temperature asymptote reached. They don’t give the details of their parameter estimation procedure, which must be some type of nonlinear optimization (and hence open to possible non-ML solutions), since the equation forms they tested were three (inverted) negative exponential forms and one other nonlinear form (based on heat diffusion rates in the ocean). They also don’t provide any R^2 data indicating variance accounted for, but their figures (below) demonstrate that for all but one of their model forms (a one-parameter, inverted negative exponential), the fits are extremely good (and extremely similar) across most of the AOGCMs used in CMIP5:
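For the curious, a fit of this general kind can be sketched as follows. This is only my own illustration, using a two-exponential step-response form on synthetic data–the actual equation forms, data, and optimization details are in Caldeira and Myhrvold’s paper, not here:

```python
import numpy as np
from scipy.optimize import curve_fit

def theta_2exp(t, a1, tau1, tau2):
    """Two-exponential form: fraction of the equilibrium warming reached
    t years after the pulse (fast mode + slow mode). Amplitudes sum to 1
    so that theta -> 1 as t -> infinity."""
    a2 = 1.0 - a1
    return a1 * (1 - np.exp(-t / tau1)) + a2 * (1 - np.exp(-t / tau2))

t = np.arange(1, 151)  # years after the step

# Synthetic "step experiment" data: a fast ocean mixed-layer mode and a
# slow deep-ocean mode, plus a little noise (parameter values invented).
rng = np.random.default_rng(0)
obs = theta_2exp(t, 0.6, 4.0, 80.0) + rng.normal(0.0, 0.01, t.size)

# Nonlinear least-squares fit, then an RMSE goodness-of-fit measure.
params, _ = curve_fit(theta_2exp, t, obs, p0=(0.5, 5.0, 100.0))
rmse = np.sqrt(np.mean((theta_2exp(t, *params) - obs) ** 2))
```

The normalization (theta as a fraction of equilibrium warming, as in their Fig. 2) is what lets one response curve be scaled to any forcing magnitude.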

Figure 2. Temperature results for CMIP5 models that have performed the abrupt4xCO2 simulations (black dots). Also shown are fits to this data using the functions described in the text: θ1-exp, green; θ2-exp, blue; θ3-exp, brown; θ1D, red. The left vertical axis shows the fraction of equilibrium temperature change (i.e., ΔT/ΔT4×); the right vertical axis indicates the absolute change in global mean temperature. Fit parameters are listed in SOM tables S3–S5 (available at stacks.iop.org/ERL/8/034039/mmedia).

Figure 5. Results from CMIP5 models (black dots) running simulations of the 1pctCO2 protocol. Projections made by simulations based on curve fits to the abrupt4xCO2 simulations as described in the text: θ1-exp, green; θ2-exp, blue; θ3-exp, brown; θ1D, red. All but θ1-exp provide similar approximations to the temperature results for most of the fully coupled, three-dimensional climate model simulations. Note that the GFDL-ESM2G and GFDL-ESM2M models did not continue with increasing atmospheric CO2 content after reaching twice the preindustrial concentration.

So, both Good et al. (2010, 2011) and Caldeira and Myhrvold (2013) provide strong evidence that the physical processes involving surface temperature change, as encoded in AOGCMs, are likely very similar across extremely widely varying radiative forcing increases per unit time, from unrealistically huge to (presumably) however small. Note that in both cases, a very large fraction (roughly 40-60%) of the total equilibrium temperature response occurs within the first decade (when normalized to the pulse magnitude). This seems to have implications for the importance of various feedbacks, an issue complicated by the fact that some of the models tested are Earth System Models, which include e.g. integrated carbon cycle feedbacks, while others do not. Certainly there will be major potential differences in carbon cycle feedbacks between an earth surface that has just warmed 3 degrees C instantly and one that has warmed only a tiny fraction of that amount.

TBC; the next post will demonstrate application to various delta RF scenarios.