On replication

This week has been dominated by questions of replication and of what standards are required to serve the interests of transparency and/or science (not necessarily the same thing). Possibly a recent example of replication would be helpful in showing up some of the real (as opposed to manufactured) issues that arise. The paper I’ll discuss is one of mine, but in keeping with our usual stricture against too much pro-domo writing, I won’t discuss the substance of the paper (though of course readers are welcome to read it themselves). Instead, I’ll focus on the two separate replication efforts I undertook in order to do the analysis. The paper in question is Schmidt (2009, IJoC), and it revisits two papers published in recent years purporting to show that economic activity is contaminating the surface temperature records – specifically de Laat and Maurellis (2006) and McKitrick and Michaels (2007).

Both of these papers were based on analyses of publicly available data – the EDGAR gridded CO2 emissions, UAH MSU-TLT (5.0) and HadCRUT2 in the first paper, UAH MSU-TLT, CRUTEM2v and an eclectic mix of economic indicators in the second. In the first paper (dLM06), no supplementary data were placed online, while the second (MM07) placed the specific data used in the analysis online along with an application-specific script for the calculations. In dLM06 a new method of analysis was presented (though a modification of their earlier work), while MM07 used standard multiple regression techniques. Between them these papers and their replication touch on almost all of the issues raised in recent posts and comments.

Data-as-used vs. pointers to online resources

MM07 posted their data-as-used, and since those data were drawn from dozens of different sources (GDP, coal use, population etc., as well as temperature), with trends calculated and then gridded, recreating these data from scratch would have been difficult to say the least. Thus I relied on their data collation in my own analysis. However, this means that the economic data and their processing were not independently replicated. Depending on what one is looking at, this might or might not be an issue (and it wasn’t for me).

On the other hand, dLM06 provided no data-as-used, making do with pointers to the online servers for the three principal data sets they used. Unlike for MM07, the preprocessing of their data for their analysis was straightforward – the data were already gridded, and the only required step was regridding to a specific resolution (from 1ºx1º online to 5ºx5º in the analysis). However, since the data used were not archived, the text in the paper had to be relied upon to explain exactly what data were used. It turns out that the EDGAR emissions are disaggregated into multiple source types, and the language in the paper wasn’t explicit about precisely which source types were included. This became apparent when the total emissions I came up with differed from the number given in the paper. A quick email to the author resolved the issue: they hadn’t included aircraft, shipping or biomass sources in their total. This made sense, and did not affect the calculations materially.
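The regridding step is simple enough to sketch. The following is an illustrative aggregation, not anyone’s actual code: it sums 1ºx1º emission cells into 5ºx5º boxes, assuming a complete 180x360 global array whose edges align with the coarse grid.

```python
import numpy as np

def regrid_emissions(field_1x1, factor=5):
    """Aggregate a 1x1-degree grid (lat x lon) into coarser boxes by
    summing, e.g. 180x360 -> 36x72 at 5x5 degrees."""
    nlat, nlon = field_1x1.shape
    return field_1x1.reshape(nlat // factor, factor,
                             nlon // factor, factor).sum(axis=(1, 3))

# Example: a uniform field of ones sums to 25 per 5x5 box
ones = np.ones((180, 360))
coarse = regrid_emissions(ones)
print(coarse.shape)  # (36, 72)
```

For an extensive quantity like an emissions total the cells are summed; an intensive quantity like a temperature anomaly would instead be area-weighted and averaged.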

Data updates

In all of the data used, there are ongoing updates to the raw data. For the temperature records, there are variations over time in the processing algorithms (satellites as well as surface stations); for emissions and economic data, updates in reporting or estimation; and in all cases the correction of errors is an ongoing process. Since my interest was in how robust the analyses were, I spent some time reprocessing the updated datasets. This involved downloading the EDGAR3 data, the latest UAH MSU numbers, the latest CRUTEM2/HadCRUT2v numbers, and alternative versions of the same (such as the RSS MSU data, HadCRUT3v, GISTEMP). In many cases, these updates are in different formats, have different ‘masks’ and required specific and unique processing steps. Given the complexity of (and my unfamiliarity with) economic data, I did not attempt to update those, or even ascertain whether updates had occurred.

In these two papers then, we have two of the main problems often alluded to. It is next-to-impossible to recreate exactly the calculation used in dLM06 since the data sets have changed in the meantime. However, since my scientific interest is in what their analysis says about the real world, any conclusion that was not robust to that level of minor adjustment would not have been interesting. By redoing their calculations with the current data, or with different analyses of analogous data, it is very easy to see that there is no such dependency, and thus reproducing their exact calculation becomes moot. In the MM07 case, it is very difficult for someone coming from the climate side to test the robustness of their analysis to updates in economic data, and so that wasn’t done. Thus while we have the potential for an exact replication, we are no wiser about its robustness to possibly important factors. I was, however, easily able to test the robustness of their calculations to changes in the satellite data source (RSS vs. UAH) or to updates in the surface temperature products.

Processing

MM07 used an apparently widespread statistics program called STATA and archived a script for all of their calculations. While this might have been useful for someone familiar with this proprietary software, it is next to useless for someone who doesn’t have access to it. STATA scripts are extremely high level, meaning they are easy to code and use, but since the underlying code in the routines is not visible or public, they provide no means by which to translate the exact steps taken into a different programming language or environment. However, the calculations mainly consisted of multiple linear regressions, which is a standard technique, and so other packages are readily available. I’m an old-school fortran programmer (I know, I know), and so I downloaded a fortran package that appeared to have the same functionality and adapted it to my needs. Someone using Matlab or R could have done something very similar. It was a simple matter to then check that the coefficients from my calculation and those in MM07 were practically the same, and that there was a one-to-one match in the nominal significance (which was also calculated differently). This also provides a validation of the STATA routines (which I’m sure everyone was concerned about).
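The cross-check described here – refitting the same multiple regression in a different environment and comparing coefficients – can be sketched with synthetic data. The design matrix and coefficients below are illustrative only; the actual MM07 regressors are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Hypothetical design matrix: intercept plus three predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
beta_true = np.array([1.0, 0.5, -2.0, 0.0])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# Two independent implementations of the same OLS fit
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
beta_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Any correct package should agree to near machine precision
print(np.allclose(beta_lstsq, beta_normal))  # True
```

This is the same logic as comparing the fortran package against the archived STATA output: agreement on a standard, well-posed calculation validates both implementations at once.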

The processing in dLM06 was described plainly in their paper. The idea is to define area masks as a function of the emissions data and calculate the average trend – two methods were presented (averaging over the area then calculating the trend, or calculating the trends and then averaging them over the area). With complete data these methods are equivalent, but not quite when there is missing data, though the uncertainties in the trend are more straightforward in the first case. It was pretty easy to code this up myself, so I did. It turns out that the method used in dLM06 was not the one they said, but again, having coded both, it is easy to test whether that was important (it isn’t).
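The two averaging orders are easy to write down explicitly. A minimal sketch with synthetic data (ignoring the area weighting and the specific emission masks used in the paper) shows the equivalence with complete data and the small divergence once values are missing:

```python
import numpy as np

def trend(y, t):
    """Least-squares slope of y against t, ignoring missing (NaN) values."""
    ok = ~np.isnan(y)
    return np.polyfit(t[ok], y[ok], 1)[0]

rng = np.random.default_rng(1)
nt, ncell = 40, 100
t = np.arange(nt, dtype=float)
# Synthetic gridcell series: a common 0.02/yr trend plus noise
data = 0.02 * t[:, None] + rng.normal(scale=0.5, size=(nt, ncell))

# Method 1: average over the area first, then take the trend
m1 = trend(data.mean(axis=1), t)
# Method 2: take the trend in each cell, then average the trends
m2 = np.mean([trend(data[:, j], t) for j in range(ncell)])
print(abs(m1 - m2) < 1e-10)  # True: equivalent with complete data

# With missing data the equivalence breaks (slightly)
gappy = data.copy()
gappy[5:10, :30] = np.nan
m1g = trend(np.nanmean(gappy, axis=1), t)
m2g = np.mean([trend(gappy[:, j], t) for j in range(ncell)])
print(m1g != m2g)  # the two orders now differ
```

Because the trend is a linear function of the data, averaging and trend-fitting commute exactly when every cell has a complete series; gaps break that linearity.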

Replication

Given the data from various sources and my own code for the processing steps, I ran a few test cases to show that I was getting basically the same results in the same circumstances as were reported in the original papers. That worked out fine. Had there been any further issues at this point, I would have sent out a couple of emails, but this was not necessary. Jos de Laat had helpfully replied to two previous questions (concerning what was included in the emissions and the method used for the average trend), and I’m sure he or the other authors involved would have been happy to clarify anything else that might have come up.

Are we done? Not in the least.

Science

Much of the conversation concerning replication often appears to be based on the idea that a large fraction of scientific errors, incorrect conclusions or problematic results are the result of errors in coding or analysis. The idealised implication is that if we could just eliminate coding errors, then science would be much more error-free. While there are undoubtedly individual cases where this has happened (this protein folding code for instance), the vast majority of papers that turn out to be wrong, or non-robust, are so because of incorrect basic assumptions, overestimates of the power of a test, some wishful thinking, or a failure to take account of other important processes (it might be a good idea for someone to tally this in a quantitative way – any ideas for how that might be done?).

In the cases here, the issues that I thought worth exploring from a scientific point of view were not whether the arithmetic was correct, but whether the conclusions drawn from the analyses were. To test that, I varied the data sources and the time periods used, assessed the importance of spatial auto-correlation on the effective number of degrees of freedom, and, most importantly, I looked at how these methodologies stacked up in numerical laboratories (GCM model runs) where I knew the answer already. That was the bulk of the work and where all the science lies – the replication of the previous analyses was merely a means to an end. You can read the paper to see how that all worked out (actually even the abstract might be enough).
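One standard way to fold autocorrelation into a significance test – offered here as a generic illustration, not necessarily the correction used in the paper – is to deflate the sample size to an effective number of independent values: for an AR(1)-like series with lag-one autocorrelation r1, N_eff is roughly N*(1-r1)/(1+r1).

```python
import numpy as np

def effective_n(y):
    """Effective sample size under an AR(1) assumption:
    N_eff = N * (1 - r1) / (1 + r1), r1 = lag-1 autocorrelation."""
    y = np.asarray(y, dtype=float)
    y = y - y.mean()
    r1 = np.dot(y[:-1], y[1:]) / np.dot(y, y)
    return len(y) * (1 - r1) / (1 + r1)

rng = np.random.default_rng(2)
# White noise: N_eff stays close to N
white = rng.normal(size=1000)
# A persistent AR(1) series (coefficient 0.8): far fewer
ar = np.empty(1000)
ar[0] = 0.0
for i in range(1, 1000):
    ar[i] = 0.8 * ar[i - 1] + rng.normal()
print(effective_n(white) > 800, effective_n(ar) < 300)
```

The same idea applies spatially: neighbouring gridcells are not independent samples, so nominal significance levels computed from the raw cell count overstate the real confidence.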

Bottom line

Despite minor errors in the printed description of what was done and no online code or data, my replication of the dLM06 analysis and its application to new situations was more thorough than what I was able to do with MM07, despite their more complete online materials. Precisely because I recreated the essential tools myself, I was able to explore the sensitivity of the dLM06 results to all of the factors I thought important. While I did replicate the MM07 analysis, the fact that I was dependent on their initial economic data collation means that some potentially important sensitivities did not get explored. In neither case was replication trivial, though neither was it particularly arduous. In both cases there was enough information to scientifically replicate the results despite very different approaches to archiving. I consider that both sets of authors clearly met their responsibilities to the scientific community to have their work be reproducible.

However, the bigger point is that reproducibility of an analysis does not imply correctness of the conclusions. This is something that many scientists clearly appreciate, and it probably lies at the bottom of the community’s slow uptake of online archiving standards, since they mostly aren’t necessary for demonstrating scientific robustness (as in these cases for instance). In some sense, it is a good solution to an unimportant problem. For non-scientists, this point of view is not necessarily shared, and there is often an explicit link made between any flaw in a code or description, however minor, and the dismissal of a result. However, it is not until the “does it matter?” question has been fully answered that any conclusion is warranted. The unsatisfying part of many online replication attempts is that this question is rarely explored.

To conclude? Ease of replicability does not correlate to the quality of the scientific result.

295 Responses to “On replication”

A related, common confusion is the meaning of statistical significance in all fields, not just climatology. Often it is explained “there is only a five percent chance that this experiment’s results were due to chance.” Lay folk often take that to mean “there is only a five percent chance that this drug doesn’t work.” Nope.

The significance test probability of that experiment is relevant to an extraordinarily narrow conclusion about that particular experiment. You can repeat that particular experiment slavishly hundreds of times and thereby verify that probability, but the experiment’s design might well be fundamentally flawed in regard to the take-home message about the drug’s effectiveness.

There is no rock-solid, objective, quantitative way to compute the probability of the real, broader question of whether the drug is effective. That’s the role of “consensus” of expert scientists’ judgments.
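That narrow meaning can be made concrete with a null simulation (purely illustrative, not tied to any particular study): when the true effect is exactly zero, a 5% test still flags about 5% of experiments – a statement about the test, not about the drug.

```python
import numpy as np

rng = np.random.default_rng(3)
n_experiments, n_per = 10000, 50

# The "drug" has zero true effect: every significant result is a
# false positive, yet the test behaves exactly as advertised.
samples = rng.normal(loc=0.0, scale=1.0, size=(n_experiments, n_per))
z = samples.mean(axis=1) * np.sqrt(n_per)  # z-statistic, sigma known = 1
significant = np.abs(z) > 1.96             # two-sided test at the 5% level
print(round(significant.mean(), 3))        # close to 0.05
```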

To be clear, then, was there no need for dLM06 and MM07 to have attempted to provide transparency? Could you have made your paper’s essential points without access to archived data and/or code or, in one case, a quick e-mail to an author?

You have offered this new thread, presumably, in the context of the debate about how much transparency should be offered re. the Schmidt paper now that a few minor errors have surfaced. I take it that you think he has done enough at this point and that there is no further need to help Murelka, for example, work through his reconstruction.

[Response: Perhaps you could be clearer in your questions. What ‘Schmidt’ paper are you talking about (do you mean Steig)? and who is Murelka? As to what was necessary above, I made it clear (I think) that both sets of authors provided enough data and information to replicate their results. – gavin]

[Response: If you are talking about about Steig et al, please don’t repeat the dishonest talking point that “errors have surfaced”. Surely, if you’ve actually read what we’ve written on this, you are aware that there are no errors at all in the analysis presented in Steig et al, the problems are only with some restricted AWS stations which were used in a 2ndary consistency check, and not the central analysis upon which the conclusions are based (which uses AVHRR data, not AWS data). Those who continue to repeat this manufactured controversy will not find their comments published -mike]

So seriously as the result of this process, which you felt to be useful you don’t think the author’s of papers should provide as much as possible? In both cases you cited this either did save you time, or would have. As you point out the replication of the previous analysis was a necessary means to an end.

I agree completely that science is generally not determined by trivial errors. However, it is a very fine line when you cross over from trivial to important. The tests that you did in these cases (changing time periods, looking at autocorrelation, etc.) seem non-trivial to you, but that is because you found them relevant. The original authors might not agree.

Outstanding post Gavin. The essence of what the scientific enterprise is all about. The crux: don’t get hung up on uncertainties or supposed system determinants that may in fact be unimportant. Understand sensitivities, don’t leave out the major players, and don’t ignore scale issues.

“…the vast majority of papers that turn out to be wrong, or non-robust are because of incorrect basic assumptions, overestimates of the power of a test, some wishful thinking, or a failure to take account of other important processes…”

I would add poor understanding of appropriate statistical methods, inattention to scale issues, and lack of existing, appropriate empirical data with which to test proposed models, the latter usually being beyond the control of the researcher.

Gavin, Thanks for this post. It makes some very important points that nonscientists often miss–namely that “replication” is not merely “copying”. Underlying the scientific method is the assumption that there is an objective reality that will point to the same conclusion for any valid method of inquiry. Thus, if a scientist understands the data sources and what constitutes a valid method, the new analysis should replicate the conclusions of the previous analysis. If it is truly independent, it will also validate the methodology of the previous study to some extent (i.e. it will show that any errors do not materially change the conclusions).
Another common mistake: The model is not the code. Rather the code is an expression of the model. So archiving the “code,” which is obsolete by the time the study is published, doesn’t really accomplish that much.
Finally, a mistake even many scientists make: If you want cooperation, learn to play nice with others. If someone is a jerk and making accusations of fraud, incompetence or conspiracy, why in the hell would anyone want to work with them?

Nicolas, Just because Rutherford said that physics should be explainable to a barmaid does not mean that a journal article must be a cookbook so that a barmaid can copy (not replicate) what was done. There is a reason that it takes ~10 years of graduate and undergraduate education in a specific field before one is usually considered capable of contributing to that field. There is also a reason why the scientific method–including publication–has evolved the way it has. It’s the most efficient way of elucidating understanding of the natural realm.

[Response: For those who have been following the Steig et al paper discussion, this is a different (not Scott!) Rutherford ;) – mike]

reproducibility of an analysis does not imply correctness of the conclusions.

It’s true that reproducibility doesn’t prove correctness, but it does rule out several possible failure modes and is therefore a necessary-but-not-sufficient criterion on which to judge the plausibility of a conclusion you previously had reason to doubt.

Suppose a new study has been published which claims to overturn the prior consensus view on some issue. Before this study came out, everybody believed “A”, but the author of Study X has done some nontrivial math and analysis which he claims supports conclusion “B”. How should we decide whether to update our beliefs?

One unfortunate possibility is that the author of Study X might be lying or might have made a simple mistake that affected the conclusion. If the code is available, we can quickly rule those possibilities out and move on analyzing the sources and methodology in greater detail. But until the code is available we can’t rule those out and the likely level of dialog is correspondingly impoverished.

If I’m a layman with no relevant expertise I might not be qualified to judge whether the code/data is correct, but even I can probably tell if it is present. If it is present, simply knowing that other people who share my attitudes on the subject matter have had the opportunity to inspect it gives me warm fuzzy feelings about the possibility that the new study might be valid. And when – inevitably – a few bugs are discovered, if the code is present I am more willing to accept assurances regarding the minimal impact of those bugs than I am when I essentially have to take it on faith.

To answer your questions in the other thread: yes, we do see studies receive this level of scrutiny in other fields in similar circumstances. Consider how we treat economic studies that claim to reach new and unlikely conclusions related to gun control or minimum wage laws.

I apologize for the typo (Schmidt should be Steig, yes) and Murelka is the person over at CA trying to reconstruct Steig’s AWS backup analysis. The question I was trying to ask was [and yes, it is a mere layman’s question]: are you attempting to defend the lack of transparency through the example of your paper’s refutation of the other two papers? In the context of the Steig paper controversy/tempest-in-a-teapot, the criticism has been that he has not done enough to make his work transparent, and there have been many arguments put forth here about why transparency is not necessary. I find it confusing, then, for you to have offered up three transparent papers to make an argument that transparency is not really required. I assume I am misinterpreting something, then, and hoping you can clear that misinterpretation up.

[Response: let’s get something straight. Scientific work needs to be replicable – I have never suggested otherwise, so please stop accusing me of being against transparency every time I point out that it is more complex than you think. My two examples here showed that replicability doesn’t require what is being continually demanded every time someone doesn’t like the conclusions of a study. Looking at the Steig et al data page, it is clear that there is enough information to replicate the AWS reconstruction with only a little work, which presumably interested parties will put in. – gavin]

As to the “errors that have surfaced”, I don’t know what else to call them, since the BAS is listing “corrections”, and I understand, because of your previous explanation on “Robust”, that they have only to do with the AWS backup, which is why I referred to them as “minor”. I don’t see what’s so dishonest about that, though I understand your sensitivity about the integrity of the study.

[Response: Errors in a secondary input data set are not the same as errors in the analysis. And they have already been incorporated and make little or no difference to the results. – gavin]

What kind of effect do fires, like the current one in Australia, have on climate. Do they cool, like Pinatubo, or do the soot and smoke trap more heat than they turn away? Since drought and attendant fires are expected to increase due to AGW, this seems like a reasonable issue to explore.

One unfortunate possibility is that the author of Study X might be lying or might have made a simple mistake that affected the conclusion. If the code is available, we can quickly rule those possibilities out and move on analyzing the sources and methodology in greater detail. But until the code is available we can’t rule those out and the likely level of dialog is correspondingly impoverished.

So the “skeptic” position is that climate scientists are lying or incompetent until one can prove that they aren’t.

This open admission does more to explain the antagonism between working climate scientists and the stone-throwing mob than anything I’ve seen posted thus far …

Dr. Schmidt’s work illustrates one of the major differences in ‘culture’ surrounding replication. In one culture the issue is whether your calculations and their description actually match each other. In the other culture, the goal is to reinvent the analysis guided by the description.

Among the former culture, which Dr. Schmidt has tentatively entered, people release their code and data along with their results and description of the procedure. Using materials provided by the authors, occasionally subtle but important errors are found later on by people looking at the code. That is, the code and the published description/implications do not jibe. Sometimes even when the code checks out they may show that some step in the procedure that is a judgment call has an important impact.

An analogy in symbolic math: a person states a theorem and provides a detailed proof. Another person confirms the proof but notices that one step includes “assume a solution to f(x)=0 exists.” The replicator goes on to show that this is only true under more restrictive conditions than the other assumptions. They publish a follow-up that shows that the initial results are less interesting than advertised. (As I understand Dr. Schmidt, the analogy in the other culture would be as follows: the full proof is not released, just key steps. Others would recreate the full proof from these guideposts. If they got stuck in this process they would expect no help from the original authors, especially if they were considered antagonistic.)

A real-life example in the first culture: Donohue and Levitt published an (in)famous paper concerning abortion and crime [Quarterly Journal of Economics 119(1) (2001), 249–275].
By releasing the Stata code they allowed other researchers to discover an error [Foote and Goetz, Quarterly Journal of Economics, February 2008, Vol. 123, No. 1: 407–423]. It turned out that they had not actually reported state-level fixed-effect results as the article claimed (due to an error in the code). The replicators went on to make a case that the results are much less robust than advertised. Besides getting the chance to get rich popularizing their faulty result (i.e. Freakonomics), the original authors got a chance to reply [Donohue and Levitt, Quarterly Journal of Economics, February 2008, Vol. 123, No. 1: 425–440]. Now people can make up their own minds.

Of course, if the code and the description check out then one usually would expect to get no publication out of it. In replicating the results of MM07 using materials provided publicly by the authors, Dr. Schmidt has achieved what would be expected of a senior project for applied econometrics undergraduate course. My congratulations to him. I would give him an A-.

[Response: You are too generous. I agree, replication projects are generally good for students to do, but without further work, the scientific value is small. But the point here is not to replicate this for the sake of it, but to discover whether the conclusions drawn from their analysis were valid by testing their procedure in test situations where one knows the answer already. I certainly wouldn’t have written a paper based purely on a replication without looking further into the science of what was being analysed. I gave this example to demonstrate not how clever I am (oooh! I can replicate!) but to highlight issues that people were discussing without nuance or practical examples. – gavin]

“One unfortunate possibility is that the author of Study X might be lying or might have made a simple mistake that affected the conclusion. If the code is available, we can quickly rule those possibilities out and move on analyzing the sources and methodology in greater detail. But until the code is available we can’t rule those out and the likely level of dialog is correspondingly impoverished.”

I’d have to disagree with that sequence of steps. If the author(s) did any sort of decent job in the Methods section, there should be a number of questions you could probe in your mind, alluded to above. The very last thing I would do is start wading into reading computer code, and I highly doubt there would be anything “quick” about it if I did. At any rate, the lion’s share of science is done with standard methods in +/- standard statistical packages, and this “code” consideration doesn’t even apply. You look for bigger-type study design issues long before you think about computer code. And lying? You can rule that out in 99.9999% of the cases just by knowledge of science culture and practice.

Chris Ferrall nailed it. Many skeptics come out of competing scientific traditions (math, stats, econ, compsci…) in which “can I replicate this exactly?” is a primary concern.

JimB and dhogaza:

I work in software quality assurance. One thing I’ve learned from that tradition is that the person who wrote a batch of code often isn’t the best person to evaluate its robustness. Software developers acquire blind spots; they have hidden assumptions they don’t even realize are assumptions. Somebody who expects the code to work uses it in the manner they expect it to work, and might on that basis confidently say “this works flawlessly!” right before a QA engineer finds dozens of bugs in the same program by using it in a slightly different way, or a different environment, or with a different mindset. The QA engineer is indeed trying to find something wrong with it, but that’s a good thing, because the code is intended to ultimately work in a much larger context than “just when this one guy uses it”. Being vetted makes programs better.

My default assumption about both software developers and researchers is merely that they are fallible. All code has bugs. Some bugs “matter” and many go unnoticed until an antagonistic second party – somebody who expects to see problems – attempts replication.

Thus, if I explicitly withhold information that would allow my code to be verified by others that sends a message that I don’t care if my code is robust. This should reduce confidence in any conclusions reached based on that code.

My conclusion: Ease of replicability does correlate to the believability of the scientific result.

Caveat: this applies most strongly in the case of studies that utilize novel computational procedures.

Second, I can’t understand how the distinction between “all of the code” and a “barely sufficient” amount of code (the distinction that got me mixed up in all this) seems to be escaping people. In theory they may be the same, assuming that the code was error-free to begin with, but in practice they are drastically different.

Third, the statement “Ease of replicability does not correlate to the quality of the scientific result.” delivered with a flourish at the end of this article seems totally obvious to all concerned. What it affects is the ability for others to advance the conversation, either by building on the result or by challenging it.

Fourth, the discussion in point 3 above is baffling. Nicolas asks “So seriously as the result of this process, which you felt to be useful you don’t think the author’s of papers should provide as much as possible? ” and Gavin replies “No. They should provide as much as necessary.” The only way I can reconcile this is with different meanings of “should”. Arguably, the minimum necessary to replicate the result, in the 19th century sense, is what is traditionally required for publication. If, however, one places the advancement of knowledge ahead of the advancement of one’s own position, a different normative structure applies. I can see how Gavin’s position is legalistically correct but with all due respect it seems hollow.

I will note that the distinction between standard practice among scientists and those of other trained producers of digital products is remarkable, and that repeatability of a much higher order is built into commercial workflows everywhere. If nothing else, this feeds into the perception by our critics that we are hiding something. To the extent that what we are doing is actually important, meeting the minimal standards of publication that have existed in the past is simply not rising to the occasion.

[Response: Ok, let’s talk cases: The STATA script used by MM07 was indeed ‘all the code’. Did it aid replication? No. dLM06 provided no code at all. Did that make a difference? No. But because both papers provided documentation, pointers or data, replication was relatively easy. Thus the ‘all the code’ mantra is not correlated with the ease of replicability. If you use exactly the same programs/styles/proprietary software/operating system, then a script like the STATA one would be instantly usable to you. But that doesn’t include 95% of people. Thus more general and traditional concepts of replicability have to dominate. The issue here is that there is always a cost to any new standard. Time taken to provide unnecessary and superfluous details is time taken away from doing real work. Given a cost, the benefit needs to outweigh it – and you seem to be implying that even mentioning this is somehow ‘old-school’. It’s not, and if you want to bring people along with you on this, you need to be explicit about the costs as well as trumpeting the benefits, otherwise it’s going to be seen as utopian and unrealistic. – gavin]

Dr. Schmidt writes:
I gave this example to demonstrate not how clever I am (oooh! I can replicate!) but to highlight issues that people were discussing without nuance or practical examples.

Perhaps he is not aware that the issue of replication in the ‘first culture’ has been explored with nuance and practical examples. So perhaps that explains why they are surprised/perplexed/infuriated that the other culture not only does not support it but actually trivializes that culture and claims the superiority of their approach, apparently ignorant that other fields have some familiarity with non-experimental data.

And from that I draw this quote from a highly influential economist who ‘benefited’ from this type of replication.
“The best model for this admission is Feldstein’s (1982) ‘Reply,’ the first sentence of which was ‘I am embarrassed by the one programming error that Dean Leimer and Selig Lesnoy uncovered but grateful to them for the care with which they repeated my original study.’”

Dr. Schmidt is right, I was being too generous. I really would have given his replication a B-, but given that he misses the point of my comment I would adjust the grade down.

[Response: One of the guidelines in cross-cultural communication is that one should learn not to patronize people whose culture you don’t appreciate. I am perhaps sometimes guilty of that, but so are you. If you want to move past snide insinuations of cultural superiority, you would be most welcome. There may well be lessons worth learning from other fields, but unless one recognises that different fields have different cultures, imposing standards that work well in one field on another may not be that fruitful. The biggest barrier is related to how results are valued in a field. I would venture to suggest that it is very different in economics than in climatology. We tend to grade based on getting the answer right, rather than the attitude of the student. – gavin]

Michael, I replied just after Eric’s inline response — I pointed out how the guy first asked for “the code” then “your code” and finally “antarctic code” — each request making clearer he knew little, and escalating as Eric was leaving and couldn’t respond.

You didn’t know who the guy was at the time. Would that have changed how you responded — to his specific request, at that particular time, to Eric in particular, on this?

Hard cases make bad law, as they say. This guy’s request is not a real good example of an appropriate request for a scientist’s available time.

There’s a reason scientists cooperate with other scientists. Because they can, eh? It’s mutual.

My default assumption about both software developers and researchers is merely that they are fallible.

You brought in the word “lying”, not us: “One unfortunate possibility is that the author of Study X might be lying”.

Somebody who expects the code to work uses it in the manner they expect it to work, and might on that basis confidently say “this works flawlessly!” right before a QA engineer finds dozens of bugs in the same program by using it in a slightly different way, a different environment, or a different mindset.

So you’re saying a QA engineer doesn’t simply replicate the tests done by the developer?

Are you suggesting that a QA engineer does something a bit different than the developer, attacks the code in different ways than the developer, in order to learn whether or not it is robust? In other words, are you suggesting that a QA engineer acts ANALOGOUSLY TO THE EXAMPLE PROVIDED BY GAVIN ABOVE RATHER THAN SIMPLY MIMIC WHAT THE DEVELOPER DID?

Gosh.

Here’s another trick question: does a QA engineer ever design tests without reading the code being tested? Or would you claim that the only way a QA engineer can do their job is to be able to read the code before designing tests?

When a company like MS or Apple releases beta versions of new operating systems to developers of third party software to test, do MS and Apple release full source of that operating system to each of these third party software developers?

Do these third party software developers insist that the only way they can test the new version is to have full access to all the source?
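The point about usage styles can be made concrete with a toy example. Below, a deliberately naive trend routine (hypothetical, not from any of the papers discussed) passes its author’s intended-use test, while “QA-style” calls from a different angle expose its failure modes:

```python
import math

def linear_trend(y):
    """Least-squares slope of y against index 0..n-1 (toy 'research' code)."""
    n = len(y)
    xbar = (n - 1) / 2.0
    ybar = sum(y) / n
    num = sum((i - xbar) * (yi - ybar) for i, yi in enumerate(y))
    den = sum((i - xbar) ** 2 for i in range(n))
    return num / den

# Developer-style check: the intended use case works as expected.
assert abs(linear_trend([1.0, 2.0, 3.0, 4.0]) - 1.0) < 1e-12

# QA-style checks: same code, slightly different usage.
assert math.isnan(linear_trend([1.0, float("nan"), 3.0]))  # missing data propagates silently
for bad in ([], [5.0]):          # empty input; single point (slope undefined)
    try:
        linear_trend(bad)
    except ZeroDivisionError:
        pass                     # crashes instead of reporting a meaningful error
```

Note that the developer’s narrow claim can still be true even while every QA-style call fails.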

I will note that the distinction between standard practice among scientists and that of other trained producers of digital products is remarkable, and that repeatability of a much higher order is built into commercial workflows everywhere.

But not into the production of software resulting from research into software engineering or computer science. Yet strangely you and the others aren’t calling for such standards in those fields.

Somebody who expects the code to work uses it in the manner they expect it to work, and might on that basis confidently say “this works flawlessly!” right before a QA engineer finds dozens of bugs in the same program by using it in a slightly different way, a different environment, or a different mindset.

The fact that someone might find that the software used to generate results reported in a paper fails if used differently is not relevant to the results reported in that paper. Think about it. A researcher won’t claim “this works flawlessly!”, only that “this worked flawlessly when used in our working (computer and software) environment, on a given dataset, in this way”.

That’s a much weaker quality requirement than is necessary for generalized commercial software which is expected to work correctly under a wide range of unanticipated circumstances.

Gavin in reply to #18: “The issue here is that there is always a cost to any new standard. Time taken to provide unnecessary and superfluous details is time taken away from doing real work. Given a cost, the benefit needs to outweigh it – and you seem to be implying that even mentioning this is somehow ‘old-school’. It’s not, and if you want to bring people along with you on this, you need to be explicit about the costs as well as trumpeting the benefits, otherwise it’s going to be seen as utopian and unrealistic.”

There is no doubt that there is an activation barrier, but experience shows that the long term result is a net benefit in productivity for the individual researcher as well as for the community in codifying the workflow for every digital product.

As Claerbout says in the CiSE article I linked: “I began inflicting this goal upon a team of graduate students – all our research should be reproducible by other people by means of a simple build instruction. … Although I made the claim (which was true) that reproducibility was essential to pass wisdom on to the next generation, our experience was always that the most likely recipient would be the author herself at a later stage of life.”

Again, this is such common practice in industry, including applied sciences and engineering, that many readers assume it is common practice among scientists. On that basis alone it is difficult to see this simple technical advance as “utopian and unrealistic”.

[Response: Well, most scientists are pretty much self-taught in everything useful and they are almost always working in an exploratory mode. This is a huge contrast with a large firm (think Google, Accenture or McKinsey) that spends millions of dollars training their employees to code the same way and use the same workflow methods on all their (very repetitive) projects. First, there isn’t the same level of resources; second, the work is much less repetitive; and third, no-one has designed workflow methods that will work over the large range of methods that scientists actually use. Methods just don’t easily translate. – gavin]

Re 10.
Gavin, in your short answer, you forget that biomass burning is a source of greenhouse gases: CO2, CH4 and N2O.

[Response: Yes of course. I was thinking too far ahead… and worrying about the pre-industrial biomass burning estimates that we need for our new control runs that haven’t been released yet…. sorry. – gavin]

According to his website: “I submitted it [the paper] to JGR. The editor said that it is, technically, a response to comments from critics, but none of our critics have submitted their comments for peer review, so they cannot proceed with the paper.” http://www.uoguelph.ca/~rmckitri/research/jgr07/jgr07.html

[Response: Well the editor is right. One can’t submit a response to a comment that hasn’t been submitted. However, he could have written a new paper discussing this issue more thoroughly. My take on his preprint is that his conclusion that ‘zero spatial correlation cannot be rejected’ is astonishing. Fig. 4 in my paper indicates that the d-o-f of all the fields is significantly less (and sometimes much, much less) than what you would get in the zero spatial correlation case. – gavin]

Steig et al had no dependencies other than Matlab and Tapio Schneider’s published library. Accordingly, portability is not an issue in the case at hand. Their scripts (if they exist) should work anywhere that has Matlab, and fail instantly with “Matlab not found” elsewhere.

In other cases, portability may be a bigger deal. Many of the difficulties in portability trace directly to the use of Fortran, whose design predates many contemporary standards. Things that work on one configuration commonly fail on another, which is one of the main reasons to argue that Fortran is a huge productivity sink.

Even in the worst case, like say a multi-parallel executable Fortran90/MPI/infiniband configuration running on a queue-managed cluster (sigh), whatever, a build from source and data to final output should be achievable in a single script locally. Such a script, although it cannot work everywhere, is surely helpful to others attempting to replicate results elsewhere but can be crucial to local workers attempting to revive dormant research projects.
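As a sketch of what “one script from source and data to final output” might look like in practice (the step labels and commands below are hypothetical placeholders, not any actual project’s build):

```python
import pathlib
import subprocess
import sys

# Hypothetical end-to-end pipeline: build, run, post-process.
STEPS = [
    ("build",   ["make", "-C", "src", "all"]),
    ("run",     ["mpirun", "-np", "64", "src/model", "config.nml"]),
    ("analyze", ["python", "scripts/postprocess.py", "output/raw.nc"]),
]

def reproduce(steps, log=pathlib.Path("reproduce.log")):
    """Run every step in order, logging output, stopping at the first failure."""
    with log.open("w") as fh:
        for label, cmd in steps:
            fh.write("== %s: %s\n" % (label, " ".join(cmd)))
            fh.flush()  # keep our header ordered ahead of the subprocess output
            result = subprocess.run(cmd, stdout=fh, stderr=subprocess.STDOUT)
            if result.returncode != 0:
                sys.exit("step '%s' failed (see %s)" % (label, log))

# reproduce(STEPS)   # one command reproduces the whole chain (on a matching system)
```

Such a driver cannot make the pipeline portable, but it does record exactly what was run, in what order, which is most of what a reviver of a dormant project needs.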

RM states in the conclusion … “Across numerous weighting specifications a robust LM statistic fails to reject the null hypothesis that no spatial autocorrelation is present, indicating that the estimations and inferences reported in MM07 are not affected by spatial dependence of the surface temperature field.”

The null hypothesis was “no spatial autocorrelation is present.”

They could *not* reject the null hypothesis – i.e., the data were consistent with “no spatial autocorrelation is present.”

Too many double and triple negatives – and I had to re-read it to make sure that I wasn’t cracked.

[Response: Right. I think that is very unlikely. There is lots of spatial correlation in the data. – gavin]
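For readers untangling the negatives: whether a gridded field shows spatial autocorrelation can be checked directly with a standard statistic such as Moran’s I. This sketch uses a simple rook-neighbour weighting on small synthetic grids (not the weighting specifications from the papers under discussion):

```python
import numpy as np

def morans_i(field):
    """Moran's I with rook (N/S/E/W) neighbour weights on a 2-D grid.
    Values near 0 suggest no spatial autocorrelation; values near 1,
    strong positive autocorrelation."""
    z = field - field.mean()
    rows, cols = field.shape
    num, wsum = 0.0, 0
    for i in range(rows):
        for j in range(cols):
            for di, dj in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < rows and 0 <= nj < cols:
                    num += z[i, j] * z[ni, nj]
                    wsum += 1
    return (field.size / wsum) * num / (z ** 2).sum()

rng = np.random.default_rng(0)
white_noise = rng.normal(size=(20, 20))                    # I close to 0
gradient = np.add.outer(np.arange(20.0), np.arange(20.0))  # I close to 1
```

Real temperature fields behave much more like the smooth gradient than the white noise, which is why “no spatial autocorrelation” is such a surprising null to fail to reject.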

“… he sees “taking” (information) without “giving” (feedback) as not keeping up with the takers part of a two-way process. He’s also worried about what he calls “espionage”, and data getting discussed before it’s peer reviewed.

“Oh please!

“Firstly, as to the taking without giving: In some communities, presenting is the price of attendance ….”

The STATA script provided did aid replication; it is just that Gavin did not want to stump up the hundreds of $ to purchase STATA. Whether it costs money to do an experiment is a different issue from whether you can replicate that experiment; conflating the two issues does not help.

This commentary seems to consider only the small world of climate science. There is a larger environment (science generally), where replication of results is a critical issue. There are several important papers that have been withdrawn (e.g. Science Vol. 277. no. 5325, pp. 459 – 463) after failure to replicate (e.g. Nature 385:494). Professor JPA Ioannidis has made a career out of pointing out where epidemiology studies show poor replication, and the consequent implications for clinical practice. Given that even replication is such a high hurdle, it is very helpful to have all the information to be able to replicate, rather than an unusably terse subset.

per

[Response: I’m not saying whether it could have potentially been useful, but in this case it wasn’t. But is your point that replication is fine if it’s only theoretical (i.e. if everyone bought STATA)? – gavin]

Gavin wrote: “One the guidelines in cross-cultural communication is that one should learn not to patronize people whose culture you don’t appreciate…The biggest barrier is related to how results are valued in a field. I would venture to suggest that it is very different in economics than in climatology. We tend to grade based on getting the answer right, rather than the attitude of the student.”

Yup – economists don’t care about getting the answer right, just the attitude of the student. Ben Bernanke doesn’t care about fixing the current problems in the US, just that people feel OK about it. I think you might have confused economists with politicians. Care to reconsider your characterisation of economics?

I’ll use the Steig, et al paper in my example. Suppose I’m interested in exploring RegEM, but with a different regularization scheme, and I’d like to compare the results of my new scheme with the results obtained by Steig. I decide that I’ll use the MATLAB code referenced at Steig’s Web site as a starting point to save time and add my regularization method as a new option. Unfortunately, I don’t know and/or don’t like MATLAB (the language used for the Antarctic analysis; since Gavin (still) uses FORTRAN he can probably identify with this!), but am proficient in R and decide to port the code. As a test, I’d like to run the analysis using the TTLS calculation described in the paper. The Steig site, however, only contains pointers to the Antarctic station data and AVHRR satellite data, so I download the data from those sites, convert it, and run the analysis using my freshly ported R code. I look at the results, compare them to the Steig analysis, and they don’t match.

How do I determine what went wrong? Is it my code, or has the data changed in some way? Have I made an error when converting the data? Note that I’m not just trying to reproduce the results of the paper as an exercise, but as a means of testing a new hypothesis – that my new regularization scheme is more reliable than TTLS. Now I have to do a lot of tedious debugging to determine the source of the problem.

To make matters worse, what if I’m analyzing this data five years from now and funding has been cut for archiving the Antarctic data, so the data are no longer available? Or the links referenced in the paper have changed? Or suppose the algorithm used to generate the Tir data from AVHRR has changed?

In short, I believe it’s extremely useful to have both code and source data archived. Really, it’s not that difficult to do. And while it’s true that ease of replicability doesn’t increase the quality of the science, it does make it easier for others to build on that science.
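One practical aid for exactly this debugging problem is archived intermediate output: if the original authors deposit the data-as-used and a few key intermediate arrays, a port can be checked stage by stage rather than only at the end. A minimal comparison helper might look like this (the file names are hypothetical):

```python
import numpy as np

def compare_to_reference(mine, reference, rtol=1e-5, atol=1e-8):
    """Report whether a ported result matches an archived reference array."""
    mine, reference = np.asarray(mine, float), np.asarray(reference, float)
    if mine.shape != reference.shape:
        return "shape mismatch: %s vs %s" % (mine.shape, reference.shape)
    if np.allclose(mine, reference, rtol=rtol, atol=atol):
        return "match"
    return "mismatch: max abs difference %.3e" % np.max(np.abs(mine - reference))

# Check the converted input before blaming the ported algorithm, e.g.:
# ref = np.load("archived/station_data_asused.npy")   # hypothetical archive file
# print(compare_to_reference(my_converted_data, ref))
```

Checking the converted inputs first separates “the data changed or my conversion is wrong” from “my port of the algorithm is wrong”.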

[Response: Two issues (at least). Eric’s page is not a permanent archive either – and my guess is that BAS is a more robust institution than a faculty page at UW. But it does underline the real lack of permanent, citeable databases with versioning. That’s not Eric’s fault though. As part of those databases, I would love to see upload areas for analysis code that could be edited and improved collectively. Such a system would deal with all your needs and would allow us to build more effectively on the science, and as you know, I am advocating strongly for such a thing. But since that system doesn’t yet exist, everything possible is a compromise, which, since it is not ideal, is always open to criticism. – gavin]

Dr. Schmidt replied: “The biggest barrier is related to how results are valued in a field. I would venture to suggest that it is very different in economics than in climatology. We tend to grade based on getting the answer right, rather than the attitude of the student.”

Anyone might learn something from taking an econometrics course that required a replication exercise just like the one Dr. Schmidt carried out using publicly provided data and code from the authors. However, it appears people from the other culture might refuse to submit their replication code to the teacher on the grounds that it is better if the teacher reads the methods section and writes their own code. I guess in Dr. Schmidt’s class the person would get an A for following the accepted norms. In my class such a student would get marked down for not understanding the bigger point (even while benefiting from it!).

Just as a reminder of the benefits of writing one’s own code to deal with data sets (I found this for a friend just now), one of the more famous cases (also discussed at RC in the past) illustrates the reason to write one’s own code and not rely on even long-used prior work:

“… The apparent lack of consistency between the temperature trends at the Earth’s surface and aloft was troubling because climate models indicate temperatures aloft should be rising at least as rapidly as temperatures at the Earth’s surface. Fu and collaborators devised a new algorithm for retrieving temperatures from the satellite measurements.

“In contrast to previously published work, their results indicated that the lower atmosphere is warming at a rate consistent with model predictions. Subsequent studies by other groups have borne out the reliability of Fu et al.’s trend estimates and they have shown that the algorithm used in prior estimates had a sign error which led to spuriously small trends. These studies lend greater confidence to the detection of human-induced global warming and they serve to reduce the level of uncertainty inherent in estimates of the rate of warming.”

On replication in mathematics: I recall being told by a math prof years ago that the easiest way to understand and check the work of another mathematician is NOT to pore through his proof, but to focus on the major milestones and prove them oneself.

In other words, focus on verifying the general conceptual consistency, not the specific steps taken. There is generally more than one way to skin a cat, and individuals may differ on which tools they like to do the skinning.

This is much closer to Gavin’s philosophy of replication of results than it is to M&M’s “turn over all your code” approach.

Many of the difficulties in portability trace directly to the use of Fortran, whose design predates many contemporary standards. Things that work on one configuration commonly fail on another, which is one of the main reasons to argue that Fortran is a huge productivity sink.

Portability isn’t necessarily a design requirement for a lot of research-related software.

Hell, it’s not even a design requirement of a lot of software that only runs on (say) Windows. C# applications, for instance.

Once again rather than the general, let’s be specific. I believe it removes the issues of how much extra work would be created. Dr. Steig has said that he is willing to provide the data to legitimate researchers. My response is to simply post what he would provide. I still haven’t heard from Dr. Schmidt what the objection is to that concept.

Also as to the specifics in Dr. Steig’s paper. I believe that there is probably sufficient information on AWS trends. However I don’t think there is sufficient information to reproduce the gridded AVHRR temperature results. They are quite dependent on corrections for clouds, and manipulation to produce temperature values as I understand it.

[Response: Joey Comiso is apparently working on making that available with appropriate documentation – patience. – gavin]

re. #27 — McKitrick will be posting his response tomorrow, apparently, with a preliminary comment on the thread “Gavin on McKitrick and Michaels” just begun this afternoon on CA. It should be an instructive exchange, or at least I hope it will be.

1) No need to apologize for Fortran! It has often been said (starting with Backus, I think):
Q: what high-performance language will we be using in year xxxx?
A: don’t know, but it will be called Fortran.

xxxx has generally been picked to be 10-20 years away. After all, some hoped that Algol 60 and then PL/I would make Fortran go away… and certainly, Fortran 90/95 have come a long way from Fortran II or IV.

2) I’d missed that mess on protein-folding, but not surprising, given how touchy those things are.

a) Many people keep over-generalizing from subsets of computing applications to the rest. Sometimes there are reasonable arguments, sometimes it feels like Dunning-Kruger.

b) People keep talking about version control, makefiles, rebuild scripts, etc. Many of the modern versions of those are rooted in code and methodologies done at Bell Labs in the 1970s, by various BTL colleagues. In many cases, the tool code has been rewritten (like SCCS => RCS => CVS, for example, and current versions of make have evolved from Stu Feldman’s original), but we know where the ideas came from. Likewise, in statistics, S came from Bell Labs (John Chambers) about then, and of course, John Tukey was around to stir people up to do meaningful analyses rather than torture data endlessly.

We were quite accustomed to using toolsets for automation and testing that went way beyond those widely available. [Which meant: we tried pretty hard to get our stuff out to the industry, despite the best efforts of certain lawyers worried about Consent Decrees and such. Of course old-timers know open-source code, especially in science, goes back probably to ~1948, maybe earlier, certainly much of modern, non-vendor-user-group open source approaches are rooted in those BTL efforts, although they were hardly the earliest. SHARE and DECUS go way back.]

But still, inside BTL, the amount of machinery and Q/A done varied tremendously, from {code written to analyze some lab data by a physics researcher} to {the fault-tolerance and testing done for an electronic switching system} (this was ~1980, but much is still quite relevant).

3) As Fred Brooks put it: “I believe the hard part of building software to be the specification, design, and testing of this conceptual construct, not the labor of representing it and testing the fidelity of the representation.” (emphasized in the original)

Summarizing several chapters: we’ve done a lot of automate the simple stuff, but building the right thing is something different. Fred tends to talk more about software *products* of course, so this is a slightly different domain, but I think the principle applies.

I know of two very large BTL 1970s projects, whose project methodologies were OK, who dedicated huge resources to Q/A, who were using automation tools way ahead of much of the industry at the time, with complex testframes able to provide workloads to Systems Under Test, etc, etc … and they both failed, miserably, because they turned out not to be the right products.

4) Maybe it’s worth enumerating other examples around climate science. For example, one might consider the UAH-vs-RSS example, in which the error was *not* discovered by having hordes of people paw through code.

A scientific (or empirical) skeptic is one who questions the reliability of certain kinds of claims by subjecting them to a systematic investigation. The scientific method details the specific process by which this investigation of reality is conducted. Considering the rigor of the scientific method, science itself may simply be thought of as an organized form of skepticism. This does not mean that the scientific skeptic is necessarily a scientist who conducts live experiments (though this may be the case), but that the skeptic generally accepts claims that are in his/her view likely to be true based on testable hypotheses and critical thinking.

To amplify on Neal J. King’s point in #36 with respect to the hypothetical situation described by Chris Ferrall in #13 (“A person states a theorem and provides a detailed proof. Another person confirms the proof but notices that one step includes ‘assume a solution to f(x)=0 exists.’”): In fact, mathematicians never give all the details of their proofs (with the exception of a few logicians, and, even among them, precious few since Russell and Whitehead 100 years ago). Recently, there was a bit of a kerfuffle over the question of whether Grisha Perelman’s proof of the Poincaré conjecture was really a proof or just a suggestion of a proof. The consensus was that it qualified as a proof if a professional in the field could fill in the missing steps without the need for truly original work. Projects were undertaken to fill in the details, but even these efforts were not intended to provide a level of detail that a non-professional could follow. That’s just the way things are done in the real world, even in the totally rigorous field of mathematics. It seems to me both unreasonable and unproductive to expect a different standard from climatology.

As part of those databases, I would love to see upload areas for analysis code that could be edited and improved collectively. Such a system would deal with all your needs and would allow us to build more effectively on the science and as you know, I am advocating strongly for such a thing. But, since that system doesn’t yet exist, everything possible is a compromise, which since it is not ideal, is always open to criticism.

Why wouldn’t something like Google Code (or something similar) work for the code portion of this?

[Response: There is some merit to that idea….how would it work in practice? – gavin]

I am reminded of some work I did a long time ago, when we were simulating the behaviour of a bolometer maintained at a constant temperature with a feedback loop. The simulation, a curve fit combined with a fourth-order adaptive Runge-Kutta, had no free parameters (!) and reproduced the data extremely well. We were very pleased and proceeded on our way. Some time later I reused this approach for another problem, and discovered that the algorithm was seriously flawed (an LU decomposition subroutine was horribly miscoded), so I spent some considerable time redoing a buncha calculations. Amazingly, in this case, the error made absolutely no difference! So I spent even more time analysing why this was so. So publishing the original flawed code would not really have helped. In any event, this simulation was only a small part of the publication (a faster, less complicated simulation also worked, with experimentally determined input parameters).

Another point I ought to make here regards the use of closed-source software. In my work we have used symbolic algebra packages since the days of REDUCE and MACSYMA; these days, the closed-source Mathematica, Maple and others. I recall a case where both Mathematica and Maple got the answer wrong (although in fairness, Mathematica made more sophisticated errors, and Maple failed naively…) and we took some considerable time writing and verifying our own routines (three independent routines written by three different people, cross-checked against each other) before we were satisfied. In retrospect, we ought to have done this earlier, since the error took about a person-year to resolve. Someday, if Mathematica and Maple are ever open-sourced, I might devote some of my copious spare time (not!) to digging out the flaw, if it still exists.

How did we find the error? Because, as we proceeded, we were always doing pen-and-paper approximations to the calculations and comparing with the experimental results. We did not rely entirely on machine calculation.

In this second case, we did convey bug reports. In both cases, I feel it is better to let the reviewers and readers satisfy themselves with their own calculations that the results are correct, rather than have them use our own, possibly erroneous, code.
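That habit of independent cross-checking can also be wired into the code itself: verify each numerical result against a check that does not reuse the routine being tested. For a linear solve, the residual provides such a check (a generic sketch, with NumPy standing in for one of several independent implementations):

```python
import numpy as np

def checked_solve(A, b, tol=1e-8):
    """Solve Ax = b, then verify the answer independently via its residual.
    A badly miscoded solver (e.g. a broken LU routine) fails this check."""
    x = np.linalg.solve(A, b)
    residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
    if residual > tol:
        raise ArithmeticError("self-check failed: relative residual %.2e" % residual)
    return x

A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x = checked_solve(A, b)   # raises if the solution is inconsistent with A and b
```

The residual check costs one matrix-vector product, yet would have caught the miscoded LU subroutine described above the first time it was run.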

1. Pretty moderate and pleasant discussion going on. Kudos to my side and to Gavin for that.

2. Captcha is giving me fits.

3. I don’t know why some of my posts get through and others not (even moderate remarks). It’s hard to be engaged in the discussion and/or make meaningful remarks with that much uncertainty that they will get posted.

5. For papers that use elaborate or new statistics (or both), or that turn mostly on the data analysis (as opposed to someone doping silicon, measuring conductivity, doing a 1/T plot and getting a best fit and its slope), it is a really good idea to show more thorough methods descriptions. It will even improve the quality of the work for other workers.

Nice post, Gavin. However, while I agree that exact reproducibility is not as crucial as the fundamental science, in this era of ever increasing amounts of data, I think it is important to keep track of data and methods. You mentioned emailing authors for clarifications. What happens if you want to try to reproduce results/methods in light of new information 10 years from now and the authors are no longer working in the field and/or no longer have records of their data or processing steps?

That is why it is crucial to provide and preserve solid documentation on the data and methods used. I have seen numerous papers that use “NSIDC sea ice” in their methods, with no reference to the exact dataset or version used. While, as in the case shown above, it may not matter much, in another case it could be crucial. I urge all scientists to be vigilant in making sure that refereed journal articles not only provide solid scientific results, but also solid information on their data and methods.

Walt Meier
National Snow and Ice Data Center

[Response: Hi Walt, well in ten years time, most of this kind of analysis being done now will be obsolete (not the methods, just the results) since much more data will hopefully be available. Your larger point is well taken, and that goes to the citeability of datasets – especially when they are continually evolving. I asked on a previous thread whether anyone knew of a specific database suite that gave all of that functionality (versioning of binary data, URL citeability of specific versions, forward citation to see who used what etc.), but other than vague references, there wasn’t anything concrete. Perhaps NSIDC are working on such a thing? – gavin]

[Response: There is some merit to that idea….how would it work in practice? – gavin]

1) Code author creates account at Google (he already has an account if he has a GMail address)
2) Author creates a project and uploads code
3) Author determines which other users are allowed to upload/modify the code database
4) Any user can download code, but only authorized users are allowed to modify code/data.

Google code uses subversion as the source code control system. Note that subversion allows versioning of binary as well as text files. Don’t know what the limit on storage at Google is, but it is possible that smaller datasets as well as code could be stored in this way.

I asked on a previous thread whether anyone knew of a specific database suite that gave all of that functionality (versioning of binary data, URL citeability of specific versions, forward citation to see who used what etc.), but other than vague references, there wasn’t anything concrete. Perhaps NSIDC are working on such a thing?

This would be ideal. But even a system without version control might satisfy most of these requirements – for instance, if the data can be tagged with text and a URL, then it’s pretty easy for a user to figure out which version is which, especially if the number of versions is small. We’ve developed (warning: shameless plug) a system (in beta) at PMEL that allows users to archive, tag and visualize gridded netCDF data. We might be able to expand it to support text data. Or perhaps Google Code will be adequate for most users. My point is that there are solutions to this problem that are available now – not perfect, but good enough.
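Even without a full versioning database, a citeable identifier for a specific version of a dataset can be derived from its content – a hash recorded alongside free-text tags. A minimal sketch (the registry layout and field names are illustrative only):

```python
import hashlib
import json
import pathlib

def register_version(path, description, registry=pathlib.Path("versions.json")):
    """Record a content-addressed version ID for a data file.
    The short ID is stable for identical bytes, so it can be cited in a
    methods section and checked by anyone holding a copy of the file."""
    digest = hashlib.sha256(pathlib.Path(path).read_bytes()).hexdigest()
    entries = json.loads(registry.read_text()) if registry.exists() else {}
    entries[digest[:12]] = {"file": str(path), "sha256": digest,
                            "description": description}
    registry.write_text(json.dumps(entries, indent=2, sort_keys=True))
    return digest[:12]

# version_id = register_version("seaice_extent.nc", "gridded trends as used in Fig. 2")
```

Because the ID is derived from the bytes rather than from a tag someone remembered to update, “which version was used?” becomes a checkable question even years later.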

Re: Stata, my point is that you could have bought Stata (it is commercially available), and you could have done a direct replication if it was important. In this case, there seems to be no difficulty in repeating the results. In other cases, there are difficulties in repeating the results; and in those cases, it is enormously helpful to be able to replicate exactly.