On replication

This week has been dominated by questions of replication and of what standards are required to serve the interests of transparency and/or science (not necessarily the same thing). Possibly a recent example of replication would be helpful in showing up some of the real (as opposed to manufactured) issues that arise. The paper I’ll discuss is one of mine, but in keeping with our usual stricture against too much pro-domo writing, I won’t discuss the substance of the paper (though of course readers are welcome to read it themselves). Instead, I’ll focus on the two separate replication efforts I undertook in order to do the analysis. The paper in question is Schmidt (2009, IJoC), and it revisits two papers published in recent years purporting to show that economic activity is contaminating the surface temperature records – specifically de Laat and Maurellis (2006) and McKitrick and Michaels (2007).

Both of these papers were based on analyses of publicly available data – the EDGAR gridded CO2 emissions, UAH MSU-TLT (5.0) and HadCRUT2 in the first paper, UAH MSU-TLT, CRUTEM2v and an eclectic mix of economic indicators in the second. In the first paper (dLM06), no supplementary data were placed online, while the second (MM07) placed the specific data used in the analysis online along with an application-specific script for the calculations. In dLM06 a new method of analysis was presented (though a modification of their earlier work), while MM07 used standard multiple regression techniques. Between them these papers and their replication touch on almost all of the issues raised in recent posts and comments.

Data-as-used vs. pointers to online resources

MM07 posted their data-as-used, and since those data were drawn from dozens of different sources (GDP, Coal use, population etc. as well as temperature), trends calculated and then gridded, recreating this data from scratch would have been difficult to say the least. Thus I relied on their data collation in my own analysis. However, this means that the economic data and their processing were not independently replicated. Depending on what one is looking at this might or might not be an issue (and it wasn’t for me).

On the other hand, dLM06 provided no data-as-used, making do with pointers to the online servers for the three principal data sets they used. Unlike for MM07, the preprocessing of their data for their analysis was straightforward – the data were already gridded, and the only required step was regridding to a specific resolution (from 1ºx1º online to 5ºx5º in the analysis). However, since the data used were not archived, the text in the paper had to be relied upon to explain exactly what data were used. It turns out that the EDGAR emissions are disaggregated into multiple source types, and the language in the paper wasn’t explicit about precisely which source types were included. This became apparent when the total emissions I came up with differed from the number given in the paper. A quick email to the author resolved the issue: they hadn’t included aircraft, shipping or biomass sources in their total. This made sense, and did not affect the calculations materially.
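For the curious, the regridding step itself is just block aggregation. A minimal sketch in Python/NumPy (the grid shapes are real, but the choice to sum emissions within each coarse cell, and the uniform test field, are my own illustrative assumptions, not details from dLM06):

```python
import numpy as np

def regrid_sum(fine, factor=5):
    """Aggregate a fine grid to a coarser one by summing blocks.

    Summing blocks is appropriate for extensive quantities like total
    emissions per cell; for intensive quantities (e.g. temperature)
    an area-weighted mean would be used instead.
    """
    ny, nx = fine.shape
    assert ny % factor == 0 and nx % factor == 0
    blocks = fine.reshape(ny // factor, factor, nx // factor, factor)
    return blocks.sum(axis=(1, 3))

# 180x360 cells (1ºx1º) -> 36x72 cells (5ºx5º)
emissions_1deg = np.ones((180, 360))
emissions_5deg = regrid_sum(emissions_1deg, factor=5)
print(emissions_5deg.shape)  # (36, 72)
```

Note that the block sum conserves the global total, which is exactly the number one would want to check against the figure quoted in the paper.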

Data updates

In all of the data used, there are ongoing updates to the raw data. For the temperature records, there are variations over time in the processing algorithms (for satellites as well as surface stations); for emissions and economic data, there are updates in reporting or estimation; and in all cases the correction of errors is an ongoing process. Since my interest was in how robust the analyses were, I spent some time reprocessing the updated datasets. This involved downloading the EDGAR3 data, the latest UAH MSU numbers, the latest CRUTEM2/HadCRUT2v numbers, and alternative versions of the same (such as the RSS MSU data, HadCRUT3v, GISTEMP). In many cases, these updates are in different formats, have different ‘masks’ and required specific and unique processing steps. Given the complexity of (and my unfamiliarity with) the economic data, I did not attempt to update that, or even ascertain whether updates had occurred.

In these two papers then, we have two of the main problems often alluded to. It is next-to-impossible to recreate exactly the calculation used in dLM06 since the data sets have changed in the meantime. However, since my scientific interest is in what their analysis says about the real world, any conclusion that was not robust to that level of minor adjustment would not have been interesting. By redoing their calculations with the current data, or with different analyses of analogous data, it is very easy to see that there is no such dependency, and thus reproducing their exact calculation becomes moot. In the MM07 case, it is very difficult for someone coming from the climate side to test the robustness of their analysis to updates in economic data, and so that wasn’t done. Thus while we have the potential for an exact replication, we are no wiser about its robustness to possibly important factors. I was, however, easily able to test the robustness of their calculations to changes in the satellite data source (RSS vs. UAH) or to updates in the surface temperature products.

Processing

MM07 used an apparently widespread statistics program called STATA and archived a script for all of their calculations. While this might have been useful for someone familiar with this proprietary software, it is next to useless for someone who doesn’t have access to it. STATA scripts are extremely high level, which makes them easy to code and use, but since the underlying code in the routines is not visible or public, they provide no means by which to translate the exact steps taken into a different programming language or environment. However, the calculations mainly consisted of multiple linear regression, which is a standard technique for which other packages are readily available. I’m an old-school fortran programmer (I know, I know), and so I downloaded a fortran package that appeared to have the same functionality and adapted it to my needs. Someone using Matlab or R could have done something very similar. It was then a simple matter to check that the coefficients from my calculation and those in MM07 were practically the same, and that there was a one-to-one match in the nominal significance (which was also calculated differently). This also provides a validation of the STATA routines (which I’m sure everyone was concerned about).
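Since multiple linear regression is so standard, the core of such a calculation can be sketched in a few lines in almost any environment. A minimal illustration in Python/NumPy (the synthetic predictors, coefficients and noise level here are invented for the example; they are not MM07's actual data):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins for gridded trends and economic predictors
n = 200
X = np.column_stack([np.ones(n),          # intercept
                     rng.normal(size=n),  # e.g. a GDP-growth field
                     rng.normal(size=n)]) # e.g. a population-growth field
beta_true = np.array([0.1, 0.5, -0.3])
y = X @ beta_true + 0.05 * rng.normal(size=n)

# Ordinary least squares fit
beta, *_ = np.linalg.lstsq(X, y, rcond=None)

# Residual variance, standard errors and nominal t-statistics
resid = y - X @ beta
s2 = resid @ resid / (n - X.shape[1])
se = np.sqrt(s2 * np.diag(np.linalg.inv(X.T @ X)))
t_stats = beta / se
print(beta, t_stats)
```

Checking that the recovered coefficients and their nominal significance match across two independent implementations is exactly the kind of cross-validation described above.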

The processing in dLM06 was described plainly in their paper. The idea is to define area masks as a function of the emissions data and calculate the average trend – two methods were presented (averaging over the area and then calculating the trend, or calculating the trends and then averaging them over the area). With complete data these methods are equivalent, but not quite when there is missing data, though the uncertainties in the trend are more straightforward in the first case. It was pretty easy to code this up myself, so I did. It turns out that the method used in dLM06 was not the one they said, but again, having coded both, it is easy to test whether that was important (it wasn’t).
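The equivalence of the two orderings with complete data follows from linearity: the trend of the area mean equals the area mean of the trends. A quick numerical check (my own illustration in Python, not dLM06's code; the trend size, noise level and cell count are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)

def ols_trend(t, y):
    """Least-squares slope of y against t."""
    return np.polyfit(t, y, 1)[0]

# 30 years of annual data for 50 grid cells inside some mask,
# all sharing a 0.02/yr trend plus independent noise
t = np.arange(30.0)
cells = rng.normal(size=(50, 30)) + 0.02 * t

# Method 1: average over the area first, then take the trend
trend_of_mean = ols_trend(t, cells.mean(axis=0))

# Method 2: take the trend in each cell, then average
mean_of_trends = np.mean([ols_trend(t, c) for c in cells])

print(trend_of_mean, mean_of_trends)
```

With missing data the two orderings weight cells differently and so diverge, which is why the distinction only matters where coverage is incomplete.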

Replication

Given the data from the various sources and my own code for the processing steps, I ran a few test cases to show that I was getting basically the same results in the same circumstances as were reported in the original papers. That worked out fine. Had there been any further issues at this point, I would have sent out a couple of emails, but this was not necessary. Jos de Laat had helpfully replied to two previous questions (concerning what was included in the emissions and the method used for the average trend), and I’m sure he or the other authors involved would have been happy to clarify anything else that might have come up.

Are we done? Not in the least.

Science

Much of the conversation concerning replication appears to be based on the idea that a large fraction of scientific errors, incorrect conclusions or problematic results are caused by errors in coding or analysis. The idealised implication is that if we could just eliminate coding errors, then science would be much more error-free. While there are undoubtedly individual cases where this has been true (this protein folding code for instance), the vast majority of papers that turn out to be wrong or non-robust are so because of incorrect basic assumptions, overestimates of the power of a test, some wishful thinking, or a failure to take account of other important processes (it might be a good idea for someone to tally this in a quantitative way – any ideas for how that might be done?).

In the cases here, the issues that I thought worth exploring from a scientific point of view were not whether the arithmetic was correct, but whether the conclusions drawn from the analyses were. To test that, I varied the data sources and the time periods used, assessed the effect of spatial auto-correlation on the effective number of degrees of freedom, and, most importantly, I looked at how these methodologies stacked up in numerical laboratories (GCM model runs) where I knew the answer already. That was the bulk of the work and where all the science lies – the replication of the previous analyses was merely a means to an end. You can read the paper to see how that all worked out (actually, even the abstract might be enough).
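To give a flavour of the degrees-of-freedom issue: correlated samples carry less independent information than their count suggests. A common rule of thumb for a series with lag-1 autocorrelation r is n_eff = n(1-r)/(1+r). The sketch below (in Python, illustrating the general idea rather than the specific spatial correction used in the paper) shows how large the effect can be:

```python
import numpy as np

def effective_n(x):
    """Effective sample size of a series, using the standard
    n_eff = n (1 - r) / (1 + r) approximation for lag-1
    autocorrelation r."""
    x = x - x.mean()
    r = (x[:-1] @ x[1:]) / (x @ x)
    n = len(x)
    return n * (1 - r) / (1 + r)

# A strongly autocorrelated AR(1) series
rng = np.random.default_rng(2)
n, phi = 1000, 0.8
x = np.zeros(n)
for i in range(1, n):
    x[i] = phi * x[i - 1] + rng.normal()

print(len(x), effective_n(x))  # nominal vs effective sample size
```

For phi = 0.8 the thousand nominal samples shrink to roughly a hundred effective ones, and nominal significance tests that ignore this are correspondingly overconfident.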

Bottom line

Despite minor errors in the printed description of what was done and no online code or data, my replication of the dLM06 analysis and its application to new situations was more thorough than what I was able to do with MM07, despite their more complete online materials. Precisely because I recreated the essential tools myself, I was able to explore the sensitivity of the dLM06 results to all of the factors I thought important. While I did replicate the MM07 analysis, the fact that I was dependent on their initial economic data collation means that some potentially important sensitivities did not get explored. In neither case was replication trivial, though neither was it particularly arduous. In both cases there was enough information to scientifically replicate the results despite very different approaches to archiving. I consider that both sets of authors clearly met their responsibilities to the scientific community to have their work be reproducible.

However, the bigger point is that reproducibility of an analysis does not imply correctness of the conclusions. This is something that many scientists clearly appreciate, and it probably lies at the bottom of the community’s slow uptake of online archiving standards, since these mostly aren’t necessary for demonstrating scientific robustness (as in these cases for instance). In some sense, it is a good solution to an unimportant problem. For non-scientists, this point of view is not necessarily shared, and there is often an explicit link made between any flaw in a code or description, however minor, and the dismissal of a result. However, it is not until the “does it matter?” question has been fully answered that any conclusion is warranted. The unsatisfying part of many online replication attempts is that this question is rarely explored.

To conclude? Ease of replicability does not correlate with the quality of the scientific result.

“[Response: My working directories are always a mess – full of dead ends, things that turned out to be irrelevant or that never made it into the paper, or that are part of further ongoing projects. Some elements (such as one-line unix processing steps) aren’t written down anywhere. Extracting exactly the part that corresponds to a single paper and documenting it so that it is clear what your conventions are (often unstated) is non-trivial. – gavin]”

Gavin, what you are describing here is what would be called, in any commercial or industrial setting, bad practice.

That is exactly the point I am trying to make. It’s not a point about openness, it’s about effectiveness. Good practice in any discipline evolves from long experience. The behavior you are describing is behavior every programmer occasionally does on quick projects. However, most of us know better than to defend such behavior on major work products.

It is considered bad practice with good reason. It takes a lot of effort to go back and replicate your own results from memory, but very little to maintain a script which can do all of it. If steps are expensive, you need to learn a tiny bit of rule-based logic, but that is hardly beyond the abilities of anybody doing scientific computations. The payoff is not just throwing the CA folks a bone to chew on. It’s a very important component of reasoning about computations, which are, after all, error-prone.

Basically, you are making it easier to make mistakes.

Why should pure science be held to a lower standard than applied science or commerce? Does climate science matter or doesn’t it?

Per #98,

So to make appropriate use of the code and paper (enough to enable us to DEMAND it all be available), we need a team…

The issue is not whether the CA people are competent to examine the process or not. They might or might not be.

The issue is that when Gavin claims that this is unreasonably difficult, he is making a claim that many readers already know, as a consequence of their own daily practice, to be false and indeed absurd. Indeed, these readers overlap strongly with the group of nonscientists most competent and most willing to evaluate scientific claims. This does the credibility of RC, and by further extension the whole of climate science, no good.

In any case, whether this is sound practice on the part of the scientist or not, whether it is responsible behavior on the part of the hobbyists or not, one can expect demands for replication on any observational climatology analysis. Observational climatology is not at fault for having perhaps the worst relationship with its interested public of any science. There really are, after all, some truly malign forces involved. But it’s nothing to celebrate, and it’s worth making some effort not to make it worse.

In summary: First, it is not true that maintaining end-to-end scripts is onerous. If large calculations are involved a series of scripts or a rule-based makefile may be practical, but these are easy skills to develop compared to the background needed to do science. Doing so in commercial and engineering settings is standard practice because it dramatically reduces error and increases the potential for objective tests.
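For concreteness, the rule-based logic being described amounts to nothing more than "rebuild a product when it is missing or older than its inputs". A minimal make-like sketch in Python (the filenames and pipeline steps in the comment are hypothetical):

```python
import os
import subprocess

def out_of_date(target, sources):
    """True if the target file is missing or older than any source."""
    if not os.path.exists(target):
        return True
    t = os.path.getmtime(target)
    return any(os.path.getmtime(s) > t for s in sources)

def rule(target, sources, command):
    """Re-run a processing step only when its output needs rebuilding."""
    if out_of_date(target, sources):
        subprocess.run(command, check=True)

# A hypothetical two-step pipeline for one paper would then read:
#   rule("trends.nc", ["raw_temps.nc", "regrid.py"],
#        ["python", "regrid.py", "raw_temps.nc", "trends.nc"])
#   rule("figure1.png", ["trends.nc", "plot.py"],
#        ["python", "plot.py", "trends.nc", "figure1.png"])
```

A real makefile does the same thing declaratively; the point is only that the bookkeeping is a few lines, not a project in itself.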

Second, that some branches of science don’t do this is going to be perceived as an embarrassment. Defending the absence of a practice of end-to-end in-house repeatability is difficult, and someone who has not spent much time thinking about it is likely to make silly claims in the attempt.

Of course, as climatologists, we are barraged with silly claims, but in that we are not unique. We tend to lose patience with people making strident and erroneous claims about things they don’t understand. In this, we are not unique.

Strong programmers also tend to dismiss the opinions of those making strong claims they know to be untrue. Every technical mailing list has plenty of examples, some quite funny. (Strong programmers can be quite clever in their putdowns.) Generally, if one is trying to convince others of the validity of one’s ideas, it pays to be modest and willing to learn about points where one may be less expert than the person one is trying to convince.

[Response: Michael, you spend your time decrying the fact that everything isn’t perfect. I am trying to explain to you why that is the case; it is not ‘absurd’, it is reality. Things just don’t work as well as a perfectly executed, flawless plan says they should. I was not defending my bad practice (though show me anyone among us who does everything the way it should be done at all times) – I’m just observing it. It would indeed be great if I knew exactly what was going to work ahead of time, if I never got interrupted in the middle of things, never made mistakes, never did preliminary estimates to see what was worth doing, and never hacked something so that it would work now because I had a deadline, rather than doing it properly even though that would be better in the long run. But that is simply not the real world. Methodologies stick when they work with a culture, not against it. Doing science is messy – it just is. Discussing how to make it prettier is fine, but as long as you think of scientists as willfully ignoring your wonderful advice you aren’t going to get anywhere.

Here’s an analogy. Chefs in big kitchens with hundreds of diners and dozens of menu choices have to have an extremely regimented workflow. They have sous chefs by the dozen, well-trained waiters and strong commercial pressures to make it work well night after night (and we’ve all seen Gordon Ramsay’s Kitchen Nightmares to know what happens when it doesn’t). Most scientists, however, are the equivalent of the home cook – mostly small-scale stuff and the occasional big dinner party. For bigger projects (such as team GCM development) better practice has to be enforced, but most scientific work is not like that. Do these cooks use the same methods as the professional chef? No. They don’t have the resources, nor the same training, nor the same pressures. Thus the kitchen after a domestic dinner party is usually a mess, and the clearing up is left until afterwards. Your comments are the equivalent of saying that the meal tastes bad because there is washing-up in the sink. I’m sure the host would much rather hear you offering to help clean up. – gavin]

TLE, your comment made me smile, but I have to add that it isn’t really “instant access to all human knowledge;” it’s instant access to lots of human information. Information isn’t knowledge until someone understands it, and “ain’t none of us” understand it all.

“The issue is that when Gavin claims that this is unreasonably difficult, he is making a claim that many readers already know, as a consequence of their own daily practice, to be false and indeed absurd. ”

1) Tell Microsoft they are cack because not only do they have bad documentation, but they have to rely on third parties reverse-engineering procedures to find out what their code is doing.

2) Gavin would have to do the work. Not you. So it’s easy for YOU to say you won’t do it but demand Gavin prove he can’t either. And YOU won’t do the work of replicating the paper’s work why? Are you saying it’s impossible? Maybe it is, maybe it isn’t, but that’s not the point, is it? YOU COULD.

3) Don’t you tell others that because they haven’t done what you DEMANDED of them they are scurrilous ne’er-do-wells. You prove you’re not trying to kill research so that you can go back to thinking “It’s not my fault, it’s not my fault”. All it requires is your complete job history and your bank statements. That shouldn’t be hard, should it? And if you have nothing to hide, you have nothing to fear, right????

Gavin, what you are describing here is what would be called, in any commercial or industrial setting, bad practice.

Yet science has just about the best track record for progress of any human endeavor. Scientists do things differently; Gavin’s situation is commonplace. And: it works brilliantly! But: bean-counters just can’t handle it.

We’re not slaves to procedure because if we took the time to make things “good practice” by your definition, we’d cut our productivity by 99%. God save us from industry types who think they know how to do science better than scientists.

Tidying up gets done, but nowhere near as much as people like to pretend it was already done before you asked.

Anyone who has managed, or has helped manage, a “real” software project, knows that release management consumes sizable resources. Programs written for the private use of a researcher or research group (no matter what the field), or just for private chores on a home computer, are very unlikely to be tidied up in this way. For the intended use, there’s simply no need to expend the resources, and in a research group, I’d be very surprised to see doing so be part of an approved budget.

“We’re not slaves to procedure because if we took the time to make things “good practice” by your definition, we’d cut our productivity by 99%.”

Nice theory, but isn’t good practice what is supposed to be assured by peer review and journal publishing?
And I’m not seeing such freedoms granted to skeptic scientists.
Quite the contrary – they appear to be held to a higher standard because they are outside looking in.

Would not the Precautionary Principle require that we listen to those who have actually put into practice the concepts they propose for consideration, in contrast to listening to those who have not yet even attempted to apply the concepts?

Especially those who have yet to attempt the practices and insist they cannot under any conditions known to human-kind be useful and simply dismiss them with hand-waving.

I’m sorry to have been seen as so disagreeable, but I just don’t accept that I am being unrealistic at all.

In exchange for a modest change in behavior you can have improved productivity and dramatically improved error reduction. I am not suggesting you stop exploring, just to consistently leave a trail when you make progress. This could hardly amount to a 1% tax on your time if you are so competent that you never need to backtrack. For most people, that 1% will pay back handily on the first occasion that there is the least correction somewhere in your workflow.

Regarding Tamino’s claim “Yet science has just about the best track record for progress of any human endeavor.” I appreciate the qualifier. However, I believe the applied sciences (notably engineering and medicine) actually do better than the pure sciences in terms of track record for progress, precisely because they can’t escape rigorous quality control. It is indeed slightly less fun to practice these disciplines, because they are more disciplined. It’s difficult to compare across disciplines, but it’s my impression that pure sciences are not as productive as applied ones. I also note that climatology has become an applied science, so the increased demand for formal method is a consequence of being consequential.

The productivity of science is due to the brilliance of its strongest practitioners, and I make no claim to being one of those myself. So perhaps it is absurd of me to criticize. On the other hand, I know enough about what scientists do to find these arguments hollow. I feel that the productivity of science could be vastly increased if science paid some attention to how productivity is achieved in other fields.

I am indeed trying to construct tools to help make this sort of thing easier. That is, in fact, what I do.

But it’s really a worthwhile endeavor in any case. It seems to me a matter of principle that every graphic you publish should be a graphic you can reproduce exactly from the raw data, ideally matching to the last pixel. This is one of the main advantages of computation and it seems very strange to me to reject it as too onerous.

Folks, I’m sorry, but twenty years ago good source control and release management took a lot of time. Today it saves time, and essentially everyone who does this for a living practices it. The tools have become so integrated that there is no excuse other than training for not using them. The fact that you have to start and stop on projects is a reason to use them. The fact that you might change your mind is a reason to use them.

But I just believe that Dr. Schmidt hasn’t been exposed to and trained on these tools and so doesn’t see the value yet. I hope he does as it will improve his life.

[Response: But I do use these systems – particularly for big software projects. Just not for every random little thing I do. – gavin]

But I will bring up that this is all theoretical. In the case of Dr. Steig’s paper it appears that the data and code are available for sharing, or will be soon. So I hope that they will be posted in an open repository without worrying about who looks at it.

My father used to regularly get manuscripts from people claiming they had disproved the general theory of relativity. He always wrote them a nice note in response saying he would give it “all due attention.”

If the model data isn’t significant down to the grid cell level why not just compare with randomly generated data? I assumed that your point was that the models predicted some of the spatial pattern, and that it just happens to correlate with areas of economic activity. Wasn’t this the statement in AR4?

[Response: I was initially thinking that the patterns might be part of the forced response and something related to the land/ocean mask or topography etc. I don’t think that is the case. Then I thought that maybe it was related to internal patterns of variability (not just the big things like ENSO, but the wider impacts of internal variability). These patterns do have structure that isn’t random. However, your idea has merit – one could generate synthetic fields with about the same level of spatial correlation as in the data and see what you got. Any volunteers? – gavin]
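For any volunteers: one simple way to build such synthetic fields is to smooth white noise to the desired spatial correlation scale and then compute a null distribution for whatever statistic is of interest. A rough sketch in Python (the grid size, smoothing scale and random mask here are all placeholders, not values from either paper):

```python
import numpy as np

def correlated_field(shape, scale, rng):
    """White noise smoothed with a boxcar of width `scale` cells
    along each axis, giving a field with roughly that spatial
    correlation length."""
    f = rng.normal(size=shape)
    kernel = np.ones(scale) / scale
    for axis in (0, 1):
        f = np.apply_along_axis(np.convolve, axis, f, kernel, mode="same")
    return f

rng = np.random.default_rng(3)
mask = rng.random((36, 72)) > 0.5  # stand-in for an "economic activity" mask

# Null distribution of the masked-area mean under spatially
# correlated noise with no real signal
null = [correlated_field((36, 72), 5, rng)[mask].mean() for _ in range(200)]
print(np.std(null))  # spread to compare an observed statistic against
```

In practice one would tune the smoothing so the synthetic fields match the observed spatial correlation, and then ask how often the masked-area statistic in pure noise is as large as the one reported.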

49 years ago, struggling with the first commercial computers, we came to the conclusion that in programming for mass data processing it was worth keeping a detailed record allowing one to retrace and examine every single step. The investment was worthwhile because of the time it saved curing bugs and improving programs. As the scale of programming has increased, that investment has become steadily more worthwhile, and the procedures for keeping the record have improved accordingly.

In no other human activity that I have encountered is that investment worth making. In some engineering and maintenance work, it is worth making a very substantial effort to record every step (think airliners). But I do not think any design or physical systems engineer works with records as precise as those of programmers.

Most of the world’s activities which need records that enable one to trace and replicate operate with systems analogous to a good audit trail in accounting. Such science as I am acquainted with falls in that group. The scientific initiatives to archive source data and procedures have analogies in modern accounting.

In my own principal field of economics, we tend to be a bit slapdash in our provision of the information needed to replicate studies; but we are no worse than many branches of applied science (and only a little worse than mainstream accountants). Our principal methodological faults are frequent failure to try and generate evidence which could disprove our hypotheses, and forgetting assumptions underlying some of our cook-book standard methods (see 91 above, and reams of comment on the assumed data distributions that had a lot to do with the financial world’s failure to assess its risks correctly so far this century). We had better spend time on improving our performance on those points rather than on better archiving of our work.

Would not the Precautionary Principle require that we listen to those who have actually put into practice the concepts they propose for consideration, in contrast to listening to those who have not yet even attempted to apply the concepts?

No. It’s a bit like insisting that before being allowed to fly a Cessna, you must qualify for all ratings up to and including your multi-engine commercial jet certificate.

Would you apply the precautionary principle here?

Have those who developed the idea ever insisted on it?

Especially those who have yet to attempt the practices and insist they cannot under any conditions known to human-kind be useful and simply dismiss them with hand-waving.

I’ve been a professional software engineer for nearly forty years, and am well aware of what’s involved in commercial software production (I ran, and was principal engineer for, a compiler products company for many years) and open source software production (I’m the release manager for two open source products). There are others posting here with software engineering experience (John Mashey has a vast background) who disagree with the rabble.

I don’t dismiss the well-meant (I hope) advice by simply hand-waving, but rather by pointing out what any software engineer should know: the requirements for the production and release of software products are vastly different from the requirements for one-off bits of code cobbled together for a particular purpose (for instance, the analysis of data for a single paper).

However, I believe the applied sciences (notably engineering and medicine) actually do better than the pure sciences in terms of track record for progress, precisely because they can’t escape rigorous quality control.

Michael, look, I’m all for transparency, but transparency does not mean sharing code or even sharing data. If everything worked as you envision it, then you might be right that we could improve science by archiving. The thing is that we have to not only envision how such a development would be useful, but also how it could be misused. We’ve seen plenty of examples of ignorant food tubes who love nothing better than to comb through code for every error, however trivial; every inelegant branch, however inconsequential. Do you really think responding to such idiots wouldn’t place demands on an author’s time? Hell, look at what Eric has gone through on this site.
I can also envision that if code existed for a tedious task, it might find its way into code from other groups, compromising independence and propagating any errors therein. Now maybe you could find ways to address these risks. Maybe you could anticipate other risks and mitigate them, too.
However, there’s no evidence that your solution even has a problem to solve. Science is conservative. It’s the way it is for a reason. It will change, but only as it becomes necessary, and I for one am very leery of any change that has the potential to compromise the independence of research efforts.

Michael #113 about your claim “I believe the applied sciences (notably engineering and medicine) actually do better than the pure sciences in terms of track record for progress, precisely because they can’t escape rigorous quality control.”

You have already been called to account on the veracity of that statement; however, you have not shown what errors have been removed only by rigorous quality controls, and how they measure up to the ones picked up by other methods.

And rigorous quality control has to include a metric for utility (which the above metric is). Additionally, the cost/benefit analysis should be available so that the correct level of control is maintained. To operate without these elements would be counterproductive to the aims of quality control, if not in actual conflict with the procedures themselves.

Oh, and isn’t it unpleasant when someone puts your words back to you with “And as to X’s claim ….”.

As to putting code out there, good caution about others picking it up and using it. Remember the pointer to the protein folding papers that were withdrawn because of a sign error? And that problem was in “legacy code” they’d taken from elsewhere? I wonder how many other papers were based on the same code — wherever and whenever it originated, for however long it had been recycled. Likely a few more that didn’t rise to the level of notoriety of the particular group that had to be retracted so publicly.

As Mark Twain warned about reading medical articles: “Be careful, you could die of a typographical error.”

I can also envision that if code existed for a tedious task, it might find its way into code from other groups, compromising independence and propagating any errors therein. Now maybe you could find ways to address these risks. Maybe you could anticipate other risks and mitigate them, too.

This is precisely the point at which the kind of software methodology processes being described start to make sense, because in such a case you’re moving from a situation where someone is using code they (or a close co-worker) has written for some one-off (or nearly one-off) use to a situation where code’s being shared and used by a potentially wide audience. Used in novel ways not anticipated by the original author, in slightly different operating environments, etc. The “borrower” or user might not be aware of limitations or constraints on data which if not met might lead to error, etc.

This is the point where people start organizing the bundling together of their bits of code, give the bundle a name (“RegEM”, perhaps, I’m not aware of the history of that code but it’s the kind of library set that often begins as a personal tool then grows into something that’s distributed, documented, etc), make clear constraints on the kind of datasets it works well with, make clear limitations of the code, and so forth.

RE 7: Rutherford was a Kiwi joker, and most barpersons in today’s deregulated labour market in New Zealand are tertiary students or graduates, so explaining physics to them is probably a bit easier than it was in the past… On the other hand, as someone who had two years of school physics (back in the days when computers were room-sized), I downloaded Christoph Schiller’s book Motion Mountain: The Adventure of Physics in order to improve my knowledge. It’s a great book, but physics certainly is a tough subject! One point he makes is clear enough for any layperson to understand: “Global warming exists and is due to humans.”

Ok so Dr. Schmidt does use these systems, but not for every random thing. I assume that published papers don’t fall in the random little thing category, or at least I think they shouldn’t. So in the end, no dispute, no problem. If these tools are used, publishing the code is a non-issue. Then it just gets down to philosophy of openness.

[Response: Some papers are little things, some are part of a much larger project, as always it depends. – gavin]

In AR4 what they said was “However, the locations of greatest socioeconomic development are also those that have been most warmed by atmospheric circulation changes.”

This had the sound of circular logic since I assume they are measuring the warming through the same mechanism that MM was saying was affected by the socioeconomic development.

AR4 doesn’t mention the spatial correlation problem.

So I thought that your use of the climate model output was to show the warming in these areas is predicted by the models, and is therefore independent of development. But if that was what you were trying to show then the negative correlation would tend to disprove the position taken by AR4.

If however you are saying that the model output is essentially random at those levels, then I’m not sure what the basis of the statement in AR4 is. The references are to the spatial trend patterns, which are again based on the measured results.

[Response: The AR4 statement probably refers to the trend in the NAO over the period (peaking around 1995). It’s a reasonable hypothesis, but not what is happening in the model runs I looked at. I am not aware of anyone else looking at these statistics with a wider range of AR4 model runs – though that could certainly be a fruitful next step. In fact I would strongly suggest that it be looked at by anyone wanting to claim that my results were a fluke of some sort. – gavin]

…It’s a kind of scientific integrity, a principle of scientific thought that corresponds to a kind of utter honesty–a kind of leaning over backwards. For example, if you’re doing an experiment, you should report everything that you think might make it invalid–not only what you think is right about it: other causes that could possibly explain your results; and things you thought of that you’ve eliminated by some other experiment, and how they worked–to make sure the other fellow can tell they have been eliminated.

Details that could throw doubt on your interpretation must be given, if you know them. You must do the best you can–if you know anything at all wrong, or possibly wrong–to explain it. If you make a theory, for example, and advertise it, or put it out, then you must also put down all the facts that disagree with it, as well as those that agree with it. There is also a more subtle problem. When you have put a lot of ideas together to make an elaborate theory, you want to make sure, when explaining what it fits, that those things it fits are not just the things that gave you the idea for the theory; but that the finished theory makes something else come out right, in addition.

In summary, the idea is to try to give all of the information to help others to judge the value of your contribution; not just the information that leads to judgment in one particular direction or another.

a) SCCS. We needed to have version control for both small and large projects, and easy enough to use to avoid needing a bunch of “program librarians” of the “chief programmer team” vintage. SCCS was *still* too much bother for individual physics researchers.

c) Nroff/troff macro packages to automate repetitive typing …
Troff wizards could do amazing things (for that era), but it needed tbl, eqn, and macros to be usable by researchers. It took really robust, flexible macros to be usable in large BTL typing pools.

d) UNIX “make”, originally done by Stu Feldman in Research to automate drudgery. However, it sometimes needed more work than people were willing to do to set up [because sometimes cc *.c -o myprog seemed enough], which is why people quickly whipped up *makefile generators* to start from existing work.

e) SOLID, which we did originally for one application, but got generalized later. Big projects usually rolled their own configuration management, repositories, workflows, etc, but small projects would rarely use them, because there was just too much machinery. So, they would start simple, … and end up growing their own, so that at one point, there must have been 50 different flavors (all built on top of UNIX, SCCS, etc … but still different).

My team generalized what we’d done for one project, and turned it into an easy-to-set-up-and-customize repository/workflow/templating system for software & documentation. It both needed to be simple for a tiny team, and be able to scale up. My lab director fortunately was willing to allocate budget to do this and even support its usage outside our division [monopoly money was very nice – we actually got to think long-term]. In the early 1980s, it was one of the very few such that got wide use around BTL.

f) So, I still think the *real* question is: what tools do *individual scientists* think they need that they don’t have, that reduces the overhead of doing the software-engineering-repetitive-stuff? (Ideally, to the point where it’s so trivial that someone whose real job is science, not software engineering, can just do it without wasting their time. One thinks of the moral equivalent of a makefile generator.)

I don’t expect that will make papers right or wrong, or improve the use and mis-use of statistics, or change variable names to be more meaningful, or eliminate GOTOs in old Fortran, or discover new physics … but maybe there would be less waste of time arguing about it. Remember that Fred Brooks thinks this is the easy part.

OT, but the “FAQ on Climate Models Part II” seems to be closed for comments. More objections from the persistent denialist I am dealing with elsewhere. His general approach is to cherry-pick findings from papers, ignoring their conclusions, and when these are pointed out, cherry-pick a finding from another paper in an attempt to throw doubt on these; and he has a very strong ideological bias as a “libertarian”; but he has clearly spent a lot of time reading the literature.

Specifically:

1) He claims that all current GCMs get the evaporation/precipitation cycle wrong (too little of each); and that they could therefore miss a large negative feedback from increased transport of heat to the upper troposphere where it can radiate away. Have you dealt with this here, or are there relevant papers?

[Response: don’t know where that comes from. Most GCMs are slightly high on precip (and therefore evap) in comparison with the best estimates (~3mm/day vs. 2.8 mm/day in GPCP/CMAP). – gavin]

2) He claims that all current GCMs have underestimated the shift toward earlier NH spring snow melt, hence albedo change, and hence their apparent success at reproducing temperature increase must be concealing some other, significant negative feedback. I have found a paper, “The role of terrestrial snow cover in the climate system” by Steve Vavrus (Climate Dynamics 2007, 29:73-88), reporting simulations where snow was turned to rain on reaching the ground (i.e. snow cover was completely removed), with a resulting temperature rise of 0.8 K, so I guess the general form of the answer (assuming he’s right about the GCMs not getting the spring snow melt dates right) is that such an error will not make much difference – but are there any other papers I should look out for?

[Response: He’s probably referring to the recent Huybers et al paper, but he’s misreading it. All the models have the onset of spring earlier as a function of the general level of warming – what they don’t appear to have is an additional shift in the phasing that is unrelated to the mean warming. But this is of course an error of the models being not sensitive enough. Hardly something to make one confident about the future. – gavin]

3) He cited “The Climate Change Commitment”, T. M. L. Wigley (2005) Science 307:1766-9, which used MAGICC to model the “climate commitment”, i.e. warming “in the pipeline”, as support for his claim that most of the warming up to now could be solar in origin. The paper looks at what would happen if all GHG emissions could be stopped now (i.e. in 2005), and includes the following:
“Past natural forcing (inclusion of which is the default case here) has a marked effect. The natural forcing component is surprisingly large, 64% of the total commitment in 2050, reducing to 52% by 2400.”
This would seem to suggest (but I may be wrong here) that most of the warming since 1970 could be solar. I’ve noted that the history of solar radiance Wigley uses relies on a 1995 paper by Lean, Beer and Bradley, and that more recent work by Lean (2005), “SORCE Contributions to New Understanding of Global Change and Solar Variability”, Solar Physics 230:27-53, suggests much smaller past variability; but is this still a matter of debate?

[Response: Well these things are always debated, but the general feeling is that solar is smaller than we thought a few years back. More precisely, the reasons that people had for thinking solar was important seem to have gone away with more observations and better data. As for the warming now being solar, the answer is definitely not. You would have had a decelerating trend in that case; what has happened is the opposite, and then you have stratospheric cooling trends, which are in complete contradiction with a solar source. – gavin]

You surely can’t be serious in thinking that sharing code and data with the public will result in compromised scientific results. If this is true then there are quite a few scientists I know who need to immediately begin work on coding their own LINPACK routines to avoid contaminating their work with errors from this open source library.

And note that the same argument could be applied to the publication of almost any kind of information, including journal articles and, yes, blogging.

Re: the edit to 120. It’s kind of hard when one side DOESN’T have to keep it polite.

It also doesn’t help when the whole thing becomes like the House of Commons in the UK, where you can lie, cheat and steal, but if you say that someone else is lying in the chamber, you can be in SERIOUS trouble. So you use euphemisms like “I believe the right honourable gentleman is mistaken”.

It ISN’T nice to say “And according to J Smith’s claim …”. Especially when

a) you’re already in a bit of trouble trying to get someone to do work that is more work for them, less for you
b) already stuffed it up at least once before off your own bat
c) supposed to be a people manager

I mean, don’t MBAs get taught conflict management? I know the police don’t any more, but I thought it was still a required course for managers-to-be.

Here’s another trick question: does a QA engineer ever design tests without reading the code being tested? Or would you claim that the only way a QA engineer can do their job is to be able to read the code before designing tests?

Rhetorics aside, there are really both modes of testing and they both have their merits. One can do black-box testing without knowing the code and one can do white-box testing using the original code. There is also an in-between mode called grey-box testing where some, but not all knowledge on the internals is available and being used.

Whichever mode one employs, one will need a way of creating test cases as well as a test oracle deciding whether observed behavior and results are correct or not. Creating test cases might be as simple as producing random input (fuzzing) or confronting the program with real users. (As an aside, a RealUsers blog might be funny. ;)) Deciding what is correct and what isn’t is often much more difficult.

According to my personal experience the different approaches to testing tend to work well for different types of errors and different objectives of testing. So if the problem at hand is really comprehensive software testing, a combination of approaches and techniques is the way to go and not having the source code renders some techniques unavailable.
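As a minimal sketch of the black-box mode described above (the `linear_trend` function and the linear-input property are invented for illustration, not drawn from any of the papers discussed): a property-based fuzz test feeds random inputs to the unit under test and checks them against an oracle, without ever reading the implementation.

```python
import random

def linear_trend(ys):
    """Least-squares slope of ys against index 0..n-1 (the unit under test)."""
    n = len(ys)
    xbar = (n - 1) / 2.0
    ybar = sum(ys) / n
    num = sum((i - xbar) * (y - ybar) for i, y in enumerate(ys))
    den = sum((i - xbar) ** 2 for i in range(n))
    return num / den

def test_blackbox_fuzz(trials=1000):
    """Black-box property test: for exactly linear input y = a*i + b,
    the recovered slope must equal a (the oracle). Nothing here inspects
    how linear_trend is implemented."""
    for _ in range(trials):
        a = random.uniform(-10, 10)
        b = random.uniform(-10, 10)
        n = random.randint(2, 50)
        ys = [a * i + b for i in range(n)]
        assert abs(linear_trend(ys) - a) < 1e-8
    return True
```

The same oracle would work unchanged against a reimplementation, which is exactly what makes it a black-box test; a white-box test would instead target branches and edge cases visible in the source.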

But comprehensive software testing includes testing for e.g. security, real-time properties, or stability under invalid inputs, which seem irrelevant for scientific computations. Furthermore, the (idealized) objective of testing is often to find all relevant instances of defects, including for instance defects in error handling routines or related to technical tasks such as file handling.

To validate the computational functioning of a program alone thus appears to me as a pretty narrow notion of testing and even more so of QA. I conjecture that for the computational aspects alone a black-box along with a decent specification will do the job and I suppose that scientific papers provide such specifications. They would be pretty useless if they didn’t.

How do I determine what went wrong? Is it my code, or has the data changed in some way? Have I made an error when converting the data?

Known answer test vectors might do the trick if the computation isn’t too complex in terms of input and output. Run the program on a few input vectors and publish those vectors along with the results. This approach is used with cryptographic algorithms to facilitate independent implementations.
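A hedged sketch of what that might look like (the `anomaly` function and the vectors here are invented for illustration): the original author runs the reference code once on a few fixed inputs and publishes the input/output pairs; anyone reimplementing the method can then check their version against the vectors without needing the original source.

```python
def anomaly(series, base):
    """Reference implementation: subtract the mean of the base-period
    indices from each value, rounded to 6 decimal places."""
    m = sum(series[i] for i in base) / len(base)
    return [round(v - m, 6) for v in series]

# Published (input, expected output) known-answer test vectors,
# generated once by running the reference code above.
KNOWN_ANSWER_VECTORS = [
    (([1.0, 2.0, 3.0, 4.0], range(0, 2)), [-0.5, 0.5, 1.5, 2.5]),
    (([10.0, 10.0, 13.0], range(0, 3)), [-1.0, -1.0, 2.0]),
]

def check(impl):
    """Verify any implementation against the published vectors."""
    return all(impl(series, base) == expected
               for (series, base), expected in KNOWN_ANSWER_VECTORS)
```

An independent reimplementation passes `check` or it doesn’t, which sidesteps the “is it my code or the data?” ambiguity for at least the published cases.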

Worked as a programmer for a company. For a time (because I’m a quick learner) I was put into the QA team for the same company, so the people who tested the code I did.

One report had canned data.

But I knew it was wrong because I knew the problem space, the code, and the expected answer. A bit of a renaissance man but not necessarily a good one.

Turns out the report was wrong. Transposition of elements meant the values were 48.7% and 205.3%. But because it was canned data, the QA team just assumed that the canned data for testing was just atypical.

Expectations. Need to manage them. But they aren’t part of Michael’s “Rigorous procedures”, because a procedure that covers everything of interest must include the procedure itself. And like an ouroboros, this leads to trouble. If only for the procedure itself.

Increasingly influential high-level DC gossip rag “Politico” employs remarkably credulous stenographer Erika Lovley to spread fertilizer both old and new (31,000 “scientists” waving a petition, 5 decades of cooling in the U.S., etc.) on behalf of various entrenched commercial interests:

“Scientists urge caution on global warming

Climate change skeptics on Capitol Hill are quietly watching a growing accumulation of global cooling science and other findings that could signal that the science behind global warming may still be too shaky to warrant cap-and-trade legislation.

While the new Obama administration promises aggressive, forward-thinking environmental policies, Weather Channel co-founder Joseph D’Aleo and other scientists are organizing lobbying efforts to take aim at the cap-and-trade bill that Democrats plan to unveil in January.

[blah-blah, woof-woof redacted]

Armed with statistics from the Goddard Institute for Space Studies and the National Oceanic and Atmospheric Administration’s National Climate Data Center, [Weather Channel co-founder Joseph] D’Aleo reported in the 2009 Old Farmer’s Almanac that the U.S. annual mean temperature has fluctuated for decades and has only risen 0.21 degrees since 1930 — which he says is caused by fluctuating solar activity levels and ocean temperatures, not carbon emissions.

Data from the same source shows that during five of the past seven decades, including this one, average U.S. temperatures have gone down. And the almanac predicted that the next year will see a period of cooling. ”

You surely can’t be serious in thinking that sharing code and data with the public will result in compromised scientific results. If this is true then there are quite a few scientists I know who need to immediately begin work on coding their own LINPACK routines to avoid contaminating their work with errors from this open source library.

Oh, it’s totally serious in the context in which it was stated. Packages like LINPACK are subjected to a great deal of testing, vetting, release management to ensure it works with a variety of compilers and processors, etc. Not just pieces of code that a scientist cobbled together for a particular problem.

Surely you understand the difference? If not, please stay as far away from the implementation and management of software products, open source or closed.

Regarding comment #63: “editor note: this was done in the case of Steig et al with respect to code … some of these data are proprietary (NASA), but will be made available in the near future”

I second the question in comment #67.

If possible, please provide more information about the “proprietary NASA data” used in Steig 2009. There have been several (at least 4) clear statements from Dr. Steig, both at his website and RealClimate, that his study used only “publicly available data sources.”

[reply: the raw data are public; the processed data (i.e. cloud masking) are not yet, but will be in due course. so relax]

Oh my GOD!!!
Do you really mean that saving data and methods is obsolete because ten years from now we will have improved our methods?????!!!!!????

That is the most unscientific argument I have ever heard…..
And since I have teaching science to both high school kids, undergraduates and graduates……that says a lot.

Forgive me if I have missed a previous comment. If not, then I am even more surprised that no one has commented on this earlier.

[Response: Well, no one else had your ridiculous interpretation of the comment. Perhaps you think that the ‘state-of-the-art’ on Antarctic temperature trends in 10 years time (2019) will use data-sets that stop in 2006? Please try to have a clue. – gavin]

I’m coming late to the talk – life always seems to get in the way. As a researcher who has feet firmly planted in both fields – science and computer science – I can feel the frustration on both sides. From the science side, I thoroughly agree with Gavin et al… you only need to publish enough to document and replicate your work. Scientific papers are for scientists, after all, not the general public. If you doubt that, try opening a copy of Nature to the original work section, and hand it to your local coffee slinger.

On the other hand, as a computer scientist, I think replication is a terrific criterion. It can be done in the scientific fields as well – largely NOT by publishing only the code and walking away, but rather by packing up the data set, governing formulae, statistical analysis parameters, and the statistical and other codes required; zipping them up; and writing a nice front-end user interface so that the replicator doesn’t get just a mishmash of hundreds of brittle files causing them grief, but something that spits out the number for climate sensitivity or whatever the key parameter of the study is. The computer science fanboys should appreciate that that is a LOT of work, hundreds of hours, and that it should be someone else’s job to do it, not the scientists’. After all, they’re trained in *science*, not in scripting; they don’t typically work in such a way that automation is streamlined and easy to do; and they are trying to get papers correct and out, which is their *job*.

With respect to the deep reason *why* there’s an unprecedented amount of oversight on the climate file: looking for gremlins in data analysis shouldn’t buy your conscience a pass for indulging in climate-changing activities. You’d have to ensure yourself that *all* of the papers, models, journal articles, replicates, evidence and whatnot contained invalidating errors. No computer scientist looking at the data in 2009 should really be able to convince themselves of this – even if they did hold the view originally.

Myself, I think that a round-up of “untouched by humans” evidence might be beneficial in convincing those who could be convinced by evidence. The ice caps aren’t melting for nothing, Australia isn’t drying up and burning away for nothing, and the animals aren’t migrating towards the poles for nothing. No human analysis features in those facts, so you will have to contend with them as external (unbiased) evidence in your world-view, or be established as a hypocrite.

Joe S.,
There is a big difference between a commercial or at least widely distributed and tested package and code used within a research group. The former have been validated in a wide range of applications. The latter may be adapted to a specific purpose and/or based on a model that has limited applicability. As an example of scientific code eventually gaining wide applicability, I would suggest looking at the GEANT collaboration for particle physics. Each group uses routines and can make modifications for their own code, but the collaboration has a stringent process of validation for each new piece of code and makes sure the models used and their limitations are defined.

I would suggest that if you haven’t done programming in such an environment, you may not fully understand the limitations and priorities for the code. More important, if you don’t understand the physical models, you will not understand the coded models.

I would expect that eventually, you may have some more formal treatment of code, etc. However, it’s most likely to grow out of climate science as the need arises than it is to be imposed upon it from outside.

Stef, I can’t speak for anyone else, but please don’t confuse scientific curiosity with any particular belief about climate sensitivity. In my case I’m interested in this topic because it is important, and because so much has been written. I am also interested in general in issues of scientific collaboration and openness.

Dr. Steig’s paper uses a unique, and I might add very clever, new method to come to a conclusion that is somewhat different than prior results. I am quite interested to understand more of the details, and to see the comments of others as the data and code are made available.

I actually think it might be possible to apply this technique to global temperatures as a test of the conclusions about the accuracy of the land based temperature network.

To expand on Stef in Canada’s “untouched by humans” comment, I would like to add phenological evidence – particularly those recorded in long-term datasets such as the Marsham Phenological Record & the Kyoto cherry-blossom festival dates (although there could perhaps be an UHI argument for the latter’s abrupt change).

Steve (126),
What does that excerpt have to do with publishing code or data? His points are nothing novel. That is the standard information that is already provided in the discussion/conclusion section of virtually every paper published.

However, the bigger point is that reproducibility of an analysis does not imply correctness of the conclusions.

I agree; however, failure to replicate published results is an issue, and without access to datasets and scripts you will never know whether reproducibility is a problem. In my field, computer vision, we are plagued by the reproducibility issue because published results are more the product of a particular implementation than of the method as expressed in the equations of the published paper. I’ve spent countless hours trying, without success, to reproduce published methods with promising results. In the rare cases when the scripts are available, you’ll discover a lot of implementation details and parameter values that are not in the original paper.

I’m a strong advocate of the Open Science 2.0 concept. Standards vary greatly between scientific fields, in Statistics for instance, most published methods are usually accompanied with R or S scripts from which the published results can be replicated. I understand that such a high standard cannot be followed in all fields because of the complexity of a particular research environment but publishing results from black boxes will certainly hurt the science in the long run. Also, an open and transparent Science is the best weapon against skeptics because it will greatly reduce endless speculations about implementation issues.

re: 129. Nope, it will result in LESS SCIENCE being done. And what science is done will be held on hold in its validity until each and every query and nitpick about the code, program, data or phase of the moon has been answered. Given how “Mars is warming, so it’s gotta be the sun” and “It’s happened before, people” ***still*** get wheeled out, that will probably happen several times over.

Based on your comment, I would say that AR4 would have been better worded with a conditional, essentially saying that their results may have been caused by coincidental warming from circulation changes that coincide with the areas of maximum economic development.

Section 3.2.2.2 references a number of other studies that support the idea that the instrumental record isn’t polluted, and it isn’t clear why MM got this result. But the reason given in AR4 is just speculation.

I’m sure there will be a lively debate over whether your paper shows that MM07 is fundamentally flawed.

Do you understand the comment “greenhouse-induced warming is expected to be greater over land than over the oceans” as it relates to MM?

[Response: That comment is clearly true, and given that MSU-2LT data is more dispersed, similar global trends in both the surface station and MSU-2LT fields will imply that the land-only surface trends would be expected to be larger than the co-located MSU-2LT trends. – gavin]

I can see why the land trends would be higher than the MSU trends in those areas, but how does that relate to the MM study? They are looking at the differences as it relates to economic activity and other variables. Is the thought that economic indicators would be higher at coastal locations?

[Response: It relates to what the true null hypothesis should look like. Clearly the differences between MSU and surface stations will not be random or spatially uncorrelated. I used 5 model runs – with the same model – in lieu of having an appropriate null. But I am not claiming that they define the proper null hypothesis. I think looking at more models would be useful if you wanted to do that (but you still wouldn’t be certain). The bigger problem here is that no-one apart from the authors thinks this methodology is valid, regardless of the results. I used it because I was interested in what would happen with model data – and the fact that there are very clear “significant” correlations occurring much more frequently than the nominal power of the test would imply, tells me that there is something wrong with the test. I’m happy to have other people pitch in and give their explanation of reasons for that, but continuing to insist that the test is really testing what they claimed seems to be unsupportable. – gavin]

A major advance of this assessment of climate change projections compared with the TAR is the large number of simulations available from a broader range of models. Taken together with additional information from observations, these provide a quantitative basis for estimating likelihoods for many aspects of future climate change. [ My bolding. ]

Do the numbers from these “large number of simulations available from a broader range of models” GCM calculations have any meaning? My response is that the numbers have yet to be shown to be correct.

One crucial and necessary first step is that application of Verification procedures has shown that the numbers produced by the software accurately reflect both (1) the original intent of the continuous equations for the models, and (2) the numerical solution methods applied to the discrete approximations to the continuous equations. That is, Verification shows that the equations have been solved right. Do the numbers actually satisfy the Verified-to-be-correct-as-coded discrete equations, and do the solutions of the discrete equations converge to the solution of the continuous equations? Neither of these has been demonstrated for any GCM. I will be pleased to be shown to be wrong on this point.

All software can be Verified. Objective technical criteria and associated success metrics can be developed and applied in a manner that provides assurances about the correctness of the coding of the equations and their numerical solutions. Lack of Verification leaves open the potential that the numbers from the software are simply results of “bugs” in the coding.
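One standard Verification device of the kind described here is a grid-refinement (observed order of accuracy) test. A toy sketch, with forward Euler on dy/dt = −ky standing in for a real discretization (nothing in this snippet comes from any GCM):

```python
import math

def euler_decay(y0, k, t_end, n):
    """Forward-Euler solution of dy/dt = -k*y, evaluated at t_end using n steps."""
    dt = t_end / n
    y = y0
    for _ in range(n):
        y = y - k * y * dt
    return y

def observed_order(y0=1.0, k=1.0, t_end=1.0):
    """Verification by grid refinement: compare the error against the exact
    solution at two step sizes (h and h/2). For a correctly coded first-order
    scheme, log2(e(h)/e(h/2)) should approach 1."""
    exact = y0 * math.exp(-k * t_end)
    e1 = abs(euler_decay(y0, k, t_end, 100) - exact)
    e2 = abs(euler_decay(y0, k, t_end, 200) - exact)
    return math.log(e1 / e2, 2)
```

If the observed order comes out near the scheme’s theoretical order (1 for forward Euler), the discrete solution is converging to the continuous one at the expected rate; an order that comes out wrong is a classic symptom of a coding bug in the discretization.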

The present-day software development community, in all kinds of applications and organizations, is keenly aware that lack of SQA policies and procedures, and successful applications of these to the software, leaves open a significant potential for problems to exist in the software. So far as I am aware, there are no precedents whatsoever for public policy decisions to be based on software for which no SQA procedures have been applied.

There are no other examples of calculations, the results of which guide decisions that affect the health and safety of the public, that are not Independently Verified. All aspects, from front-end pre-processor to post-processing of results for presentation, are Verified. Everyone encounters numerous examples every day, all day.

And this applies to all calculations; from one-off data processing and analysis to GCMs. Whenever Press Conferences are called, or Press Releases produced, to announce the results of even the most trivial calculation, the purpose is to influence the public. And the public, it is hoped, is provided information that helps guide and shape their individual thinking about the concepts reported. Under these conditions, the numbers must be Verified to be correct prior to public announcements. Every time.

[Response: And all Adjectives must be Capitalised. Every time. – gavin]

There are no other examples of calculations, the results of which guide decisions that affect the health and safety of the public, that are not Independently Verified. All aspects, from front-end pre-processor to post-processing of results for presentation, are Verified.

It is really stunning that climate scientists are subjected to such demands while the denialists are quite content to unskeptically accept any old blatant pseudoscientific rubbish that comes along from any old ExxonMobil-funded propaganda mill.

I think that a statement like “no-one apart from the authors thinks this methodology is valid” is a little hard to prove and unlikely to be true.

But I wasn’t asking about your paper, which did raise an interesting challenge to their approach; I was asking what the comment in AR4 about oceans and land was referring to. If they were just commenting on the fact that satellite-measured anomalies that cover land and ocean were likely to be lower than the surface-measured anomalies at coastal locations, then I don’t see why that, in particular, was a comment on MM. Unless there is some correlation between coastal location and economic activity?

Or maybe you think that comment in AR4 wasn’t really on point?

[Response: I didn’t write the comment in the AR4 report, and so my insights into their thought processes are no more likely to be insightful than yours. I imagine that it is a rebuttal to the null hypothesis in the MM07 paper that the presence of correlations between the difference between surf and trop and the economic variables automatically imply extraneous biases. That is clearly mistaken. – gavin]