The (im)permanence of online biological resources

As the number of online biological databases explodes, how will we ensure their availability over time?

I received a help desk request this morning from a user looking for an old bioinformatics project. The user had stumbled across a broken link to the project’s supporting website, listed in a publication from just a few years ago.

This project is now completed, no longer funded and no longer staffed. It may be over but expectations that the resources it created should continue to exist live on — in links from the original publication, in citations, and in search engines.

This particular project happens to be hosted on a heavily taxed machine that I administer. This machine itself is nearly obsolete, out of warranty and chugging along on its last legs. And as an essentially redundant production node in a cluster, it isn’t backed up.

Now, I know in part I should be responsible for anything that’s on a machine under my purview. But I’m not a system administrator by interest, job description, or training. It’s just one of the hats I necessarily wear.

But software and operating systems evolve, security updates are released and legacy software breaks. Without dedicated maintainers who know precisely what they are to maintain and how, legacy resources will quickly become obsolete. Ironically, perhaps only the printed record will testify to the fact that these online services once existed.

This conundrum raises a number of interesting questions.

What responsibility do we have to ensure that online resources generated (directly or indirectly) as part of publicly funded projects remain available after their funding has run dry?

If we have a duty to ensure that these resources remain available — and I believe that we do — what is the easiest way to do it?

Should there be conditions on grant funds that code be documented, hardware requirements stated, and maintenance details described? Perhaps, as with reagent-sharing requirements, proof of resource longevity should be a condition of publication?

Should there be a final review from funding agencies at the time of project completion to ensure that suitable plans exist for maintenance of the resource?

Minimally, when a project winds down, there should be a final document drafted describing in detail the maintenance of the resource. It should include a simple manifest describing the data, the website, software version dependencies and hardware requirements. Where feasible, an accompanying tarball for easy restoration would be appreciated, too.
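To make the idea concrete, here is a rough sketch of what generating such a manifest and tarball might look like in Python. Everything in it is an assumption for illustration: the function names, the `MANIFEST.json` filename, and the fields in the description are hypothetical, not an established standard or anything this project actually used.

```python
import hashlib
import json
import tarfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Checksum a file in chunks so large data files don't fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(project_dir: Path, description: dict) -> dict:
    """Combine a hand-written description with a listing of every file."""
    files = [
        {
            "path": p.relative_to(project_dir).as_posix(),
            "bytes": p.stat().st_size,
            "sha256": sha256_of(p),
        }
        for p in sorted(project_dir.rglob("*"))
        if p.is_file()
    ]
    return {**description, "files": files}


def archive_project(project_dir: Path, description: dict, out_dir: Path) -> Path:
    """Write MANIFEST.json and a gzipped tarball of the project into out_dir.

    Assumption: out_dir lives outside project_dir, so the archive
    never tries to include itself.
    """
    project_dir = project_dir.resolve()
    out_dir.mkdir(parents=True, exist_ok=True)

    manifest_path = out_dir / "MANIFEST.json"
    manifest_path.write_text(
        json.dumps(build_manifest(project_dir, description), indent=2)
    )

    tarball_path = out_dir / f"{project_dir.name}.tar.gz"
    with tarfile.open(tarball_path, "w:gz") as tar:
        tar.add(project_dir, arcname=project_dir.name)
        tar.add(manifest_path, arcname="MANIFEST.json")
    return tarball_path
```

A call might look like `archive_project(Path("/var/www/old_project"), {"name": "old_project", "software": {"perl": "5.8"}, "hardware": "single node, 2 GB RAM"}, Path("/archive"))`, where the paths and description values are made up. The per-file checksums are there so that whoever restores the tarball years later can confirm that nothing was lost or corrupted along the way.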

Virtualization is also an attractive approach. But virtualization, like the transfer of data from one storage medium to its successor to prevent obsolescence, has maintenance overhead to be factored in.

But after 10+ years of the development and disappearance of online biological resources, perhaps we need to consider a consolidated public repository, established under the auspices of Ensembl, NCBI, DDBJ, or NHGRI, to host and maintain orphaned projects.

About Todd Harris

My name is Todd Harris. A geneticist by training, I now work at the intersection of biology and computer science developing tools and systems to organize, visualize, and query large-scale genomic data across a variety of organisms.

I'm driven by the desire to accelerate the pace of scientific discovery and to improve the transparency and reproducibility of the scientific process.