Jaz drives, spiral notebooks, and SCSI: how we lose scientific data

In this multipart series we take a look at why a simple principle—scientific …

Let's say you've got a nice, digitized version of some scientific data, and you've already made reasonable choices about how close to the raw data you want to get in what you preserve. Better yet, you've hounded your students often enough that they've placed it in a single format and provided all the annotations that are needed to make sense of the data. You're all set to preserve it and share it with the rest of the scientific community. Except you aren't, because doing so creates its own challenges.

Saving the digits

Once the data is digitized, the next step is saving it. And here, the same issues that everyone else faces (bit rot, obsolete media, incompatible data formats) cause problems for scientists as well. For large organizations like the LHC computing grids or a genome sequencing center, these issues are handled at an institutional level. But for most small research groups, backups and archiving are handled on an ad hoc basis and usually left up to whichever current member of the staff happens to be most computer literate, while organizing the archive and ensuring it is complete is left up to individual users.
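For a small group without institutional support, one inexpensive guard against bit rot and botched media migrations is a checksum manifest: record a hash of every file when the archive is made, then recheck the hashes whenever the files are copied to new media. What follows is a minimal sketch of that idea, not a description of what any particular lab does. The function names (`sha256_of`, `build_manifest`, `verify`) are illustrative assumptions, and only the Python standard library is used.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(root: Path) -> dict[str, str]:
    """Map each file under root (by relative path) to its checksum."""
    return {
        p.relative_to(root).as_posix(): sha256_of(p)
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return the relative paths that are missing or whose contents changed."""
    problems = []
    for rel_path, expected in manifest.items():
        target = root / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            problems.append(rel_path)
    return problems
```

Building the manifest on the old medium and running `verify` against the copy on the new one would flag any file that failed to make the trip intact. It does nothing, of course, for a disk that no surviving drive can read.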

Over the course of my research career, archiving involved magneto-optical disks, a flirtation with Zip and Jaz drives (which ended when those drives lost some data), a return to big magneto-optical disks, and then a shift to CDs and DVDs. Interfaces also went from SCSI to FireWire to USB. Anything that wasn't carefully moved forward to the new formats was simply left behind. At least some of that is now effectively lost without ever going missing. Somewhere in a box in New York is a magneto-optical disk. Somewhere in Colorado is a notebook that describes its contents. The drive that would actually read it was thrown out years ago.

And these were recent issues; some institutions have projects that go back decades, with critical information stored on punch cards or old magnetic tapes. Even with the best of intentions, retrieving that sort of material may simply be impossible.

But the system is generally set up so that the best of intentions can be hard to muster. Funding for data preservation is generally taken out of the research budget, which forces investigators into awkward choices. Many of the preservation and backup issues are best solved at the institutional IT level, but there's no mechanism for funding it at that level, either. Many research groups have rather specific needs when it comes to their data, so a one-size-fits-all solution isn't always ideal. Research groups also switch institutions with some regularity, which could create additional hassles. In short, there's no simple, idealized solution for figuring out how to preserve data.

Even if the data's around, sharing it isn't simple, either.

Going public

As we hinted at above, even if scientific data is preserved, it would be difficult to make any sense of it without a description of where the samples came from and how they were processed. That's true generally: most scientific data isn't very informative in and of itself, but requires additional detailed information before it's useful to anyone else. Dealing with this issue is essential if the data's ever going to be shared with the scientific community, but adding metadata, attaching relevant information, and ensuring it's all organized and in a format that can be read by others is a lot of work.
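To make the metadata problem concrete, here is a hedged sketch of one low-effort approach: a small JSON "sidecar" file saved next to each data file, recording where the samples came from and how they were processed. The field names and the `.meta.json` suffix are assumptions chosen for illustration rather than any community standard, and a real project would likely need many more fields.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path


@dataclass
class DatasetRecord:
    """Hypothetical minimal description of one data file."""
    data_file: str                      # name of the file being described
    sample_source: str                  # where the samples came from
    processing_steps: list[str] = field(default_factory=list)
    file_format: str = "unknown"        # e.g. "CSV" or "TIFF"
    contact: str = ""                   # who to ask about the data


def write_sidecar(record: DatasetRecord, directory: Path) -> Path:
    """Write the record as JSON next to the data file and return its path."""
    sidecar = directory / (record.data_file + ".meta.json")
    sidecar.write_text(
        json.dumps(asdict(record), indent=2, sort_keys=True),
        encoding="utf-8",
    )
    return sidecar


def read_sidecar(sidecar: Path) -> DatasetRecord:
    """Load a sidecar file back into a DatasetRecord."""
    return DatasetRecord(**json.loads(sidecar.read_text(encoding="utf-8")))
```

Plain-text JSON is used here because it stays readable even if the software that wrote it disappears. Someone still has to fill in the fields, though, which is exactly the work that competes with research time.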

Doing that work means time not spent doing research, and that's a tradeoff that most grad students and post-docs will be uninterested in making. The same thing goes for supporting a server to host the material, which represents money not spent on getting research done.

Then there's the question of how best to share it. For sharing simple chunks of data with collaborators, anything from setting up a server to shipping DVDs will work. For sharing with the wider scientific community, a server is generally the way to go, but these tend to resemble the archiving arrangements: ad hoc setups run by the most computer literate person in the lab, always operating on a minimal budget. The groups that have the resources to create a useful Web interface to the underlying data are few and far between.

Sharing vs. the law

As a matter of principle, most scientists would agree that data that's been published should be made available to the wider research community. Unfortunately, there are a lot of reasons why it can't be.

In at least one case, a group I was working with received a grant that actually specified that data needed to be shared, and allocated money for doing so—something that is unfortunately rare. The obvious issue that might make doing so a legal hassle—data from humans—wasn't a problem, since the test subjects were mice. But that didn't mean that there were no legal issues. The project was based at a research hospital, so patient records could be accessed from the same network as the server, which meant that, by law, strict security measures had to be put in place before exposing it to the outside world.

The hospital's intellectual property office then got involved. Host institutions are given the right to patent anything that comes out of research in their facilities, and the lawyers who handled this were concerned that some of the genes that were being described in the data could be patentable. It took a few days to explain that the information that was being made public was insufficient to compromise the institution's IP.

All of that changes, of course, when human subjects are involved, since strict privacy controls have to be put in place. In many cases, it's possible to simply anonymize the data and be assured that nothing can be traced back to its original owner. But the explosion of genomics data has changed that; it now may be possible to take anonymized data from a medical study and make some reasonable inferences about the pattern of genetic markers in disease carriers. As the amount of genomic data increases, the possibility of actually identifying individuals will rise.
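One rough way to see why more data makes re-identification more likely is to count how many records share each combination of seemingly harmless attributes, the idea behind what privacy researchers call k-anonymity. The sketch below is an illustration only: the function name, the field names, and the sample records are all hypothetical and not drawn from any real study.

```python
from collections import Counter


def smallest_group(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the
    same values for every listed attribute (0 if there are no records).
    A result of 1 means at least one record is unique on those attributes."""
    groups = Counter(
        tuple(record[key] for key in quasi_identifiers) for record in records
    )
    return min(groups.values(), default=0)


# Hypothetical "anonymized" records: no names, only coarse attributes.
SAMPLE_RECORDS = [
    {"age_band": "30-39", "region": "NY", "marker": "A"},
    {"age_band": "30-39", "region": "NY", "marker": "A"},
    {"age_band": "30-39", "region": "NY", "marker": "G"},
    {"age_band": "40-49", "region": "CO", "marker": "A"},
    {"age_band": "40-49", "region": "CO", "marker": "A"},
]
```

With only age band and region, every record in this toy set hides in a group of at least two. Add a single genetic marker and one record becomes unique. Real genomic data carries vastly more markers per person, which is why the groups shrink so quickly.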

There's a whole host of other issues that can keep scientists from releasing the data they work with: informed consent for samples taken from humans, security issues like the potential for use in bioweapons, reliance on commercial or patented materials and data, and so on. (This last issue is what has kept the Climatic Research Unit from releasing its full set of temperature data.)

None of this should be viewed as an excuse for keeping scientific data from being shared with the community. The digital era may have added some new wrinkles to the process, but overall, it has drastically lowered the barriers to open access to data. However, the barriers aren't gone. Some data, for legal reasons, can never be shared. In other cases, the time spent to prepare it for public access greatly outweighs the value of providing access. Simple rules, such as "all data should be available," simply don't recognize these complex realities.