Wednesday, July 3, 2013

Archiving of bioinformatics software

Some months ago I wrote a blog post about what is perceived to be the rather poor quality of many computer programs in bioinformatics (Poor bioinformatics?), noting that many bioinformaticians aren't taking seriously the need to properly engineer software, with full documentation, standard development practices and version control.

An obvious follow-up to that post is to consider the archiving of bioinformatics software. If programs are written well, then they should be permanently archived for future reference. A number of bloggers have commented on what is perceived to be the poor current state of affairs here, as well, and I thought that I might draw your attention to a few of the posts.

In many ways, this issue is the computational equivalent of storing biological data, about which I have also written recently (Releasing phylogenetic data). My comments about this were:

There is a difference between storing / releasing the original data (eg. raw DNA sequences) and the data as analyzed (eg. aligned sequences)

There are sustainable and accessible archiving facilities for raw data that are almost universally used (eg. GenBank)

Many people do not release the processed data as analyzed (some of them will if directly asked to do so)

Many of the people who do release their analyzed data do so on the homepage of one of the authors, which is better than nothing but is rarely sustainable

There are sustainable and accessible archiving facilities for processed data, such as TreeBASE and Dryad.

Analogous comments can be made about the archiving of bioinformatics software.

The first question to ask is this: what proportion of the bioinformatics software referred to in publications is actually stored in sustainable and accessible archives? A corollary to this question is: what archive facilities are being used? Casey Bergman, at the I Wish You'd Made Me Angry Earlier blog, has attempted to answer both of these questions (Where Do Bioinformaticians Host Their Code?).

In answer to the first question, Casey notes:

of the many thousands of articles published in the field of bioinformatics, as of Dec 31 2012 just under 700 papers (n=676) have easily discoverable code linked to a major repository in their abstract.

While many papers may have the code URL in the Methods or Results sections but not the Abstract, this does suggest that repository archiving is not the mode actually employed by bioinformaticians. Instead, they are archiving (if at all) on personal or institutional homepages.

Sadly, the reported rate of decay of URLs ("Error 404: Page not found") indicates that this is rarely a sustainable approach to archiving (eg. see the Google+ comment by Dave Lunt). The parallel with the TreeBASE / Dryad type of data repository has not gone unnoticed, for example by Hilmar Lapp. These repositories require and enforce standards of data and software archiving, as well as providing persistence.
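
URL decay of this kind is easy to check for. The following is a minimal Python sketch, not taken from any of the posts mentioned here; the function names and the injectable opener argument are my own assumptions, and some servers reject HEAD requests, so the results should be treated as a hint rather than as proof that a page has gone.

```python
import urllib.error
import urllib.request


def url_status(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if no response arrived."""
    # HEAD asks for the headers only, so no page content is downloaded.
    request = urllib.request.Request(url, method="HEAD")
    try:
        with opener(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Covers 404 "Page not found" and other HTTP error responses.
        return err.code
    except (urllib.error.URLError, OSError):
        # DNS failure, refused connection, timeout, and similar problems.
        return None


def find_dead_links(urls, **kwargs):
    """Return (url, status) pairs for URLs that did not answer with a 2xx code."""
    dead = []
    for url in urls:
        status = url_status(url, **kwargs)
        if status is None or not 200 <= status < 300:
            dead.append((url, status))
    return dead
```

Running find_dead_links over the software URLs cited in a set of papers would give a rough estimate of how many have already decayed.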

The answer to the second question, about which repositories, seems to be (see also the data provided by MRR in the comments to Casey's blog post):

SourceForge has been vastly predominant

Google Code has a large number of projects, but many of them have never made it to publication

GitHub has had a rapid recent growth rate, and therefore appears to be becoming the preferred repository.
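
A tally of this kind can be sketched in a few lines of Python. This is only an illustration of the idea, not the method that Casey or MRR actually used; the host-name table and the function names are my own assumptions, and the example URLs are made up.

```python
from collections import Counter
from urllib.parse import urlparse

# Assumed mapping from host-name suffix to repository label.
REPOSITORY_SUFFIXES = {
    "sourceforge.net": "SourceForge",
    "code.google.com": "Google Code",
    "googlecode.com": "Google Code",
    "github.com": "GitHub",
}


def classify_host(url):
    """Label a code URL by the repository that hosts it, or 'other'."""
    # URLs without a scheme (eg. "github.com/x") have no parsed host name
    # and therefore fall through to "other".
    host = (urlparse(url).hostname or "").lower()
    for suffix, label in REPOSITORY_SUFFIXES.items():
        if host == suffix or host.endswith("." + suffix):
            return label
    return "other"


def tally_repositories(urls):
    """Count how many of the given URLs point to each repository."""
    return Counter(classify_host(url) for url in urls)
```

Here, anything on a personal or institutional homepage ends up in the "other" category, which is the group most at risk of URL decay.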

This leads to the issue of how permanent the archiving is at the major repositories. It turns out that there is a major difference in policies, as noted by Casey Bergman:

SourceForge has a very draconian policy when it comes to deleting projects, which prevents accidental or willful deletion of a repository. In my opinion, Google Code and (especially) GitHub are too permissive in terms of allowing projects to be deleted.

A clear trend emerging in the bioinformatics community is to use GitHub as the primary repository of bioinformatics code in published papers. While I am a big fan of GitHub and I support its widespread adoption, I have concerns about the ease with which an individual can delete a published repository. In contrast to SourceForge, where it is extremely difficult to delete a repository once files have been released, and this can only be done by SourceForge itself, deleting a repository on GitHub takes only a few seconds and can be done (accidentally or intentionally) by the user who created the repository.

This is an important issue, as exemplified by Christopher Hogue in the comments section of that blog post:

In my case SourceForge preserved the SLRI toolkit my group made in Toronto. As the intellectual property underlying the code was sold to Thompson-Reuters in 2007, my host institution and the dealmakers pressured me to delete the repository. SourceForge policy kept it on the site ... [However,] the aftermath of all this is that, of everything my group did under the guise of open source, only about 30% is preserved and online, and the rest is buried in an intellectual property shoebox at Thompson-Reuters. Host institutions have a lot of power of ownership over your intellectual property. If you win the right to post work into open-source, the GitHub delete policy means that your host institution can over-ride this, and require you to take your code out of circulation. GitHub is great, but for the sake of preservation, SourceForge has the right policy, protecting your decision to go open source from later manipulations by your host institution when it becomes "valuable".

Casey Bergman's response to this issue has been to create the Bioinformatics Archive on GitHub. This is based on the idea used by the journal Computers & Geosciences, in which the journal editor forks the GitHub code into a journal "organization" for all accepted papers. This creates a permanent copy, because deleting a public GitHub repository does not delete its forks (whereas deleting a private repository does delete all of its forks). So, Casey has been personally forking the code for all publications that come to hand (currently 147 repositories) into the Bioinformatics Archive, thus creating a public repository for all of the relevant GitHub code.
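
Forking into an organization can also be scripted through the GitHub REST API, which has a "create a fork" endpoint that accepts an optional organization parameter. The sketch below only builds the request, and it is not a description of how the Bioinformatics Archive or the journal actually does this; the function name, the organization name and the token shown in the usage note are placeholders of my own.

```python
import json
import urllib.request

GITHUB_API = "https://api.github.com"


def build_fork_request(owner, repo, organization, token):
    """Build (but do not send) a request that forks owner/repo into an organization."""
    url = f"{GITHUB_API}/repos/{owner}/{repo}/forks"
    # The "organization" field asks GitHub to place the fork in that
    # organization rather than in the authenticated user's own account.
    body = json.dumps({"organization": organization}).encode("utf-8")
    headers = {
        "Accept": "application/vnd.github+json",
        "Authorization": f"Bearer {token}",
        "Content-Type": "application/json",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")
```

Passing the result to urllib.request.urlopen, for example with build_fork_request("some-author", "some-tool", "example-archive", token), would submit the fork; the token must belong to someone with permission to create repositories in that organization.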

Beyond this, what still seems to be needed is:

A publisher-driven version of the Bioinformatics Archive; journals should have a policy for the hosting of published code in a sustainable and accessible archive in a standardized manner

Redundancy to ensure persistence in the worst-case scenario; archive persistence is the key requirement, and this can only happen in public repositories, with the published URL and/or DOI pointing to a public copy of the code

The community to initiate actual action; authors need to pressure the publishers to adopt a Dryad-like strategy, in which a large group of ecology and evolutionary biology journals agreed to require the use of a public database for storing the biological data associated with their publications.

At a minimum, a persistent public repository is a snapshot of the code at the time of publication, just as a sequence alignment is a snapshot of the processed data at the time of its publication. This does not preclude further work on the code, and further publications based on the newly modified code, just as new sequence alignments can be created by adding newly acquired sequences. Open-source code can still be newly forked, and there can be user-contributed updates and public issue tracking. Multiple snapshots of code related to different publications through time are not necessarily an issue, but they will need to be handled in some sensible manner.
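
One simple way to make such a snapshot verifiable is to archive the code directory and record a checksum alongside the publication. The following Python sketch is my own illustration rather than a procedure used by any of the repositories discussed; the function name and the file names in the usage note are hypothetical.

```python
import hashlib
import tarfile
from pathlib import Path


def snapshot_code(source_dir, archive_path):
    """Write source_dir to a .tar.gz archive and return the archive's SHA-256 digest."""
    source = Path(source_dir)
    # Keep archive_path outside source_dir, otherwise the archive would
    # try to include itself.
    with tarfile.open(str(archive_path), "w:gz") as tar:
        tar.add(str(source), arcname=source.name)
    digest = hashlib.sha256()
    with open(archive_path, "rb") as handle:
        # Read in 1 MiB chunks so that large archives need not fit in memory.
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

For example, snapshot_code("mytool", "mytool-published.tar.gz") returns a digest that could be quoted in the paper, so that any archived copy can later be checked against it. The gzip header stores a timestamp, so re-archiving the same files later will give a different digest; the checksum identifies one particular archive file, not the source tree in general.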

The main reason for requiring the public archiving of code is to deal with the all-too-common situation in which code is no longer being maintained (the scholarship ran out, the grant ended, the author retired, etc). For example, Jamie Cuticchia & Gregg Silk (2004, Bioinformatics needs a software archive, Nature 429: 241) mention the loss of part of the code when the multi-million dollar Genome Database lost funding in 1998. These two authors seem to be the first to have proposed a Bioinformatics Software Archive, "in which an archival copy of bioinformatics software would be maintained in a secure central repository supported by public funding." Personal and institutional homepages are too ephemeral (suffering what is known as URL decay) and too prone to politics to be considered acceptable for the storage of data and software in high-quality science.