Science is driven by data. New technologies… blah… publishers, including Science, have increasingly assumed more responsibility for ensuring that data are archived and available after publication… blah… Science’s policy for some time has been that “all data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science” (see www.sciencemag.org/site/feature/contribinfo/)… blah… Science is extending our data access requirement listed above to include computer codes involved in the creation or analysis of data

Well, jolly good. I look forward to them insisting the full code for HadCM3 / HadGEM / whatever is published before accepting any GCM papers using them (which, amusingly, will now include all the papers doing the increasingly fashionable “multi-model” studies using the widely available AR4 data archives).

Come to think of it, it would also prevent S+C (but not RSS?) ever publishing in Science.

It hardly needs to be said that the editors of Science, when writing an editorial entitled “Making Data Maximally Available”, meant the whole thing to be maximally readable, but accidentally forced people through a tedious registration process to get to it. So, as a service to them, I’ll reproduce it here.

Science is driven by data. New technologies have vastly increased the ease of data collection and consequently the amount of data collected, while also enabling data to be independently mined and reanalyzed by others. And society now relies on scientific data of diverse kinds; for example, in responding to disease outbreaks, managing resources, responding to climate change, and improving transportation. It is obvious that making data widely available is an essential element of scientific research. The scientific community strives to meet its basic responsibilities toward transparency, standardization, and data archiving. Yet, as pointed out in a special section of this issue (pp. 692-729), scientists are struggling with the huge amount, complexity, and variety of the data that are now being produced.

Recognizing the long shelf-life of data and their varied applications, and the close relation of data to the integrity of reported results, publishers, including Science, have increasingly assumed more responsibility for ensuring that data are archived and available after publication. Thus, Science and other journals have strengthened their policies regarding data, and as publishing moved online, added supporting online material (SOM) to expand data presentation and availability. But it is a growing challenge to ensure that data produced during the course of reported research are appropriately described, standardized, archived, and available to all.

Science’s policy for some time has been that “all data necessary to understand, assess, and extend the conclusions of the manuscript must be available to any reader of Science” (see www.sciencemag.org/site/feature/contribinfo/). Besides prohibiting references to data in unpublished papers (including those described as “in press”), we have encouraged authors to comply in one of two ways: either by depositing data in public databases that are reliably supported and likely to be maintained or, when such a database is not available, by including their data in the SOM. However, online supplements have too often become unwieldy, and journals are not equipped to curate huge data sets. For very large databases without a plausible home, we have therefore required authors to enter into an archiving agreement, in which the author commits to archive the data on an institutional Web site, with a copy of the data held at Science. But such agreements are only a stopgap solution; more support for permanent, community-maintained archives is badly needed.

To address the growing complexity of data and analyses, Science is extending our data access requirement listed above to include computer codes involved in the creation or analysis of data. To provide credit and reveal data sources more clearly, we will ask authors to produce a single list that combines references from the main paper and the SOM (this complete list will be available in the online version of the paper). And to improve the SOM, we will provide a template to constrain its content to methods and data descriptions, as an aid to reviewers and readers. We will also ask authors to provide a specific statement regarding the availability and curation of data as part of their acknowledgements, requesting that reviewers consider this a responsibility of the authors. We recognize that exceptions may be needed to these general requirements; for example, to preserve the privacy of individuals, or in some cases when data or materials are obtained from third parties, and/or for security reasons. But we expect these exceptions to be rare.

As gatekeepers to publication, journals clearly have an important part to play in making data publicly and permanently available. But the most important steps for improving the way that science is practiced and conveyed must come from the wider scientific community. Scientists play critical roles in the leadership of journals and societies, as reviewers for papers and grants, and as authors themselves. We must all accept that science is data and that data are science, and thus provide for, and justify the need for the support of, much-improved data curation.

Comments

On the one hand… ridiculous. On the other hand… it would be nice to have a version of the code used for a given paper archived properly (not that I’ve done that for my own papers – but sometimes I’ve wished I had done so). The problem is that GCMs are just so huge. Also, in some cases the code is proprietary, though I think more and more groups are making their code public (though in at least one case, the code authors aren’t allowed to make their code public to anyone who hasn’t passed a basic background check for security reasons).

[Oh I quite agree: it would indeed be nice to have the code archived, and visible. But does Science alone have the power to make it happen, or will they have to cave? -W]

I think authors of papers based on AR4 data archives can argue that the archives themselves serve as the appropriate repositories necessary to understand and/or extend the paper conclusions, rather than the model code underlying the archives.

-M

[They could indeed argue that, but if Science read their own policy they will be forced to say “but we ‘include computer codes involved in the creation or analysis of data’”. Of course, I’m sure Science has carefully thought this through :-) -W]

Multi-model studies won’t need to submit HadCM. They need to link to CMIP5 (or wherever they are getting their results) and submit their own code.

It’s when the Met Office want to publish yer acshwul HadCM results that they will run into trouble. Or choose another journal. Or they could just publish the model code (in GMD, natch). The sky has resolutely failed to fall on, say, CESM.

[Yes, them publishing the code would be best. But suppose they don’t? Then, according to Science’s logic, no-one can publish any CMIP studies. There is no get-out clause that says “of course, if it was someone else’s model, you don’t need to bother”: it says “computer codes involved in the creation”. HadCM3 was involved in the creation of CMIP data, so according to Science it *must* be published -W]

I don’t think that follows – and certainly Science is not constraining anybody else’s publication policies. Of course, I can’t read the article in Science, so I can’t actually tell. But suppose I use CMIP5 data to do an analysis of likely effects on leprechaun populations in the 21st century, and submitted that to Science. Then I would expect Science to require me to link to CMIP5, and to my own leprechaun model code, and to my leprechaun data, but not to code which was used (by someone else, somewhere else, at another time) to make the CMIP5 data.

[First off, reading the whole article: it is only behind a free-registration-wall, not a paywall, though it is easy to miss that. But as a service to the world, I’ve copied it here. I’m sure Science will thank me :-).

Second, weeeeeeell, I think it is arguable, and I won’t be surprised if Science does indeed weasel out, but for myself, I can’t see why what Science has written doesn’t imply that the code used to make the CMIP data should be available. Doubtless they will clarify this important point to avoid confusion. Of course one could regress further: is the code that allowed you to run the code (viz, your computer’s OS) included? In which case Windoze users are stuffed -W]

Ah, but the easy way around that is to ensure that a legacy machine with legacy OS used to generate/analyse the data is archived at the same time :-)

That’ll get the computer industry moving again.

And whilst we’re doing that, we might as well archive the researcher(s) who operate said legacy machines as well. Have you tried getting someone who knows how those holey cards work? My optical drives don’t seem to recognise them, and they come out a bit mangled.

Software Preservation is a serious business. Yes, sometimes it involves archiving actual hardware. More often one is into the Configuration Management realm, which can vary from keeping a VM, through keeping some installers, through recording detailed version numbers and managing version changes, to jotting down “GNU Fortran 3.6, Python 2.4”, to basically not recording anything and trusting to luck.
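As a rough illustration of the “recording detailed version numbers” end of that spectrum, here is a minimal Python sketch that writes out an environment manifest alongside a set of results. The function name `environment_manifest` and the choice of fields are assumptions made for illustration, not any journal’s or group’s actual practice; it uses only the standard library.

```python
import json
import platform
import sys
from importlib import metadata


def environment_manifest(packages=()):
    """Describe the interpreter, OS, and the versions of selected packages.

    `packages` is a hypothetical list of distribution names the analysis
    depends on; anything not installed is recorded explicitly as None.
    """
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "machine": platform.machine(),
        "executable": sys.executable,
        "packages": versions,
    }


if __name__ == "__main__":
    # Print the manifest as JSON so it can be archived next to the results.
    manifest = environment_manifest(["numpy"])
    print(json.dumps(manifest, indent=2, sort_keys=True))
```

This is well short of keeping a VM, but it is a step up from jotting a compiler version on a scrap of paper, and it costs almost nothing to run at the end of every analysis.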

I’d be in favour of generally moving science software from one end of this spectrum towards the other, much as I’ve spent a lot of the last twenty years doing my part to shift the software industry in the same direction. I remember one client where the prevailing views included (a) “SCM is too much trouble” and (b) “why would we keep a copy of the executables we ship?”

Regarding “weasel out”, yes, there is a long and sorry history of fine words and broken promises on publication policies, at journals, institutions, funding bodies, and agencies. I still welcome these fine words – I think the tide is running in this direction, and sooner or later fine words will be followed up by actual action (i.e., for a journal, a paper being rejected because the code is not available). We shall have to wait and see.

[Based on my memory, I think a lot of the actual scientists working on the code would be happy to see it published. Typically, it is the mgt layer that is the problem. So Science having some fine words might help push things the right way, and that would be good all round -W]

Nick, I don’t know enough about gene sequencing to comment on your example.
My point is that there is a key difference between algorithms and programs. The former can be described in many ways, and the latter is a very specific implementation of one or more algorithms.
Science (the method) needs to care a lot about algorithms, and need only care about programs to the extent that they aid in the understanding of algorithms and their implications. Software preservation is all about preserving programs, not algorithms.
Or at least, that is how I see the issue. Am I missing something?

“the latter is a very specific implementation of one or more algorithms”

A quick response to this as I am in a bit of a hurry this evening:

The code actually used to produce the results in any given paper implements a particular function, C (for code). The paper may or may not describe a function, P (for paper). Differences between P and C include various classes of bug (where the intention was P but the programmer, compiler, libraries, and OS combined to make C), writing errors (where the intended function was in fact P’ and it is described incorrectly in the paper), and limitations in precision or expression (for instance, where C would implement P in an ideal world without rounding error).

More often the paper implies a function, I (for implied), or really a family of functions {I1, I2, I3, …}, because usually papers are very skimpy in their descriptions of mathematical methods. C may or may not be a member of this family, but in this case the paper really isn’t providing enough information to reproduce C.

Often the maths in the paper is so skimpy that the interested reader has to guess at a function, G (for guess), and the relationship between this function and C is even more sketchy.

But the results, as presented in the paper, actually depend on C. Readers interested in the results may well want to know what C is. The most efficient way to convey it to them is to show them the source code. It isn’t perfect (I could go on at more length) but it is considerably better than P, I, or the commonplace G.
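To make the gap between P and C concrete, here is a small Python sketch of the rounding-error case only. It assumes a hypothetical paper that simply says “sum the values”: `paper_sum` stands in for P (the exact sum, correctly rounded, via `math.fsum`) and `code_sum` stands in for C (a naive left-to-right loop). Both names and the input data are made up for illustration.

```python
import math


def paper_sum(values):
    """P: the sum as described on paper (exact sum, rounded once)."""
    return math.fsum(values)


def code_sum(values):
    """C: what a straightforward implementation actually computes."""
    total = 0.0
    for v in values:
        total += v  # each addition rounds to the nearest double
    return total


values = [1e16, 1.0, -1e16]

print(paper_sum(values))                # 1.0
print(code_sum(values))                 # 0.0: the 1.0 is lost when added to 1e16
print(code_sum([1e16, -1e16, 1.0]))     # 1.0: same numbers, different order
```

Nothing in the one-line description “sum the values” tells a reader which of these the authors ran, or in what order; the source code does.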

Nick, if I pick up a paper with an algorithm I need professionally, I expect to be able to match the results reported.
I do understand there is a class of problems where an implementation is the most concise statement of the results. For such problems, the code is the paper. But at least so far, this class of problems is fairly rare.
I’m very well aware of limits in precision and rounding issues, which can be problems in their own right. I deal with these professionally frequently. But such limits should be understood by any serious implementer. They put limits on what can be computed, and any interpretation of the results must take these limits into account, and the bare code only helps if the original implementation had a problem.
What I expect is that the results are reproducible from the paper, and general understanding of the field. If the results are only reproducible from the code (or on a specific OS or computer or …), then there is a problem. A big problem.
If the results are reproducible with different programming styles, OS, etc, then they are valuable.
For example: http://www.deas.harvard.edu/climate/seminars/pdfs/Tietsche_GRL_2011.pdf
and see paragraph 5.

To add my 0.02$, I totally agree with Nick, and I think you miss the point. Often, papers simply do not contain enough information to reproduce the results. Problems I have personally encountered so far:

Paper says that a parameter x has some value, but I cannot reproduce it. After querying the author, he mentions that the parameter is probably some different value. With this, it works.

Paper says that Optimal-Control algorithm X has been used, but does not specify enough details to actually run the algorithm. By having some code snippet from the author, I can see that probably algorithm Y has been used. Still not enough information in the details to reproduce results.

Paper uses some potential energy surface from another paper, given as expansion coefficients. After implementing the expansion, I figure that this surface is totally screwed up, so probably the coefficients are plain wrong. Now what is the potential energy surface?

Same problem as above. After contacting the author, he replies that probably the potential energy surface is wrong, and sends the one he used. Works roughly after that.

I figure that in each of the above cases, supplying the code or the raw data would have helped tremendously. So supplying the code/raw data has nothing to do with abstract discussions about implementations and algorithms; it is simply about reproducibility in the face of missing or wrong data.

Hat tip (see this for good discussion):
_youknowwhatgoeshere_www.metafilter.com/100893/Cut-paste-cut-paste

“‘Churnalism’ is a news article that is published as journalism, but is essentially a press release without much added.” Churnalism.com is a site created by the British charity Media Standards Trust, which lets you input the text of a press release to compare it with the text of news articles in the British media.