Preserving science: what to do with raw research material?

In the first part of a multipart series, Ars looks at why a simple principle— …

The recent attention paid to the handling of data and computer code by the UK's Climatic Research Unit revealed an underside to science that many found disturbing. Poorly commented computer code. Data scattered among files with difficult-to-fathom formats. Old data on punch cards discarded. A complete absence of any standardized procedure for sharing data. Within the audience that already mistrusted climatologists, there was a strong sense that longstanding accusations of conspiracy had been confirmed.

Among the scientific community, the response was somewhat different. There was a general sense that the scientists involved should have been more responsive to requests for sharing data. But the chaos, confused record keeping, and data that's gone missing-in-action sounded unfortunately familiar to many researchers, who could often supply an anecdote that started with the phrase "if you think that's bad..."

The fact is, preparing, documenting, storing, and sharing scientific data is not only a very difficult challenge—it's an entire series of challenges, many of which involve significant tradeoffs. We've touched on this a bit in the past, but it pervades nearly every aspect of science. For the past several months, I've been reading expert reports and talking with people who are involved in trying to set policies for the archiving and sharing of scientific data. In a series of articles, we'll look at the challenges involved at every step of the way, from the production of scientific data to its dissemination and analysis.

Starting from scratch

Ultimately, scientific results are the product of observations, made either in the lab or out in the real world. Many of the events being observed are ephemeral: animals interact and move on, the first photons from a supernova shoot past the Earth and are gone. In these situations, the best science can do is record the event and preserve those recordings. Preservation creates its own set of issues, which we'll examine in the next installment. But many other areas of science, from paleontology to materials science, produce samples that can be preserved anywhere from weeks to centuries.

For fields like paleontology, where samples are rare, stable, and precious, the decision is easy: you preserve everything. For just about everything else, however, judgment calls have to be made: space is finite, especially freezer space, and things will ultimately have to be discarded.

In some cases, it's not a hard decision. A lot of biology and chemistry samples come with built-in timers: chemicals that degrade or diffuse, radioactive isotopes that decay, etc. Even if there is value in confirming results or reanalyzing these samples, after a certain period of time, the signals that made them valuable in the first place will simply be gone. That said, there's still room for individual judgements here.

Should a researcher simply date-stamp everything and throw it out after a set period of time, or go back and check how well a sample has held up to storage conditions before making that decision? Should short-lived samples simply be discarded once the data that seems necessary has been extracted? These sorts of decisions need to be made on a regular basis in labs around the country, often by graduate students without the sort of perspective that experience can provide.

Thinking long-term

For samples that are stable indefinitely, the decisions get even harder. Scientific samples vary wildly in quality, from the ones generated as people are learning a new technique or procedure to publication-quality material produced by experienced hands. Saving all of this material doesn't make sense, but not preserving it can leave researchers vulnerable to charges of bias or selectivity should their work prove controversial after the fact.

And there's no way of predicting when new technology will suddenly make old samples valuable. Perhaps the most famous example involves origin-of-life researcher Stanley Miller: after his death, researchers found samples he had generated in 1953 and kept around ever since. A reanalysis with current equipment showed that they contained 22 different types of amino acids, the building blocks of proteins. That's a more complex mix than the one he described in his seminal paper.

Even a clear policy for material preservation, however, is no guarantee that the material will be there indefinitely. The 2003 Northeast blackout left many labs in the New York City area scrambling to find emergency power outlets in order to preserve heat-sensitive materials. During my own career, I lost a lot of samples that hadn't even been analyzed yet when someone had the Environmental Safety group clear out a refrigerator that had been placed in an unattended room. Accidents happen all the time, and if a lab is around long enough, one will inevitably strike it.

Ultimately, money can also play a role. Freezers and liquid nitrogen storage are expensive to purchase and maintain. Keeping a mouse line around can cost a small fortune and require constant attention from technicians. Even during the period when science funding was growing rapidly, these costs sometimes forced researchers to make judgment calls on what they could afford to save. But the period of growth is long since over, and many labs are facing more of these hard decisions; some have ended up shutting down entirely, leaving no way of maintaining what, in some cases, are one-of-a-kind reagents.

Most scientists would probably agree that, in an ideal world, we'd be able to preserve and share the ultimate sources of scientific information: the samples we used to produce it in the first place. In the real world, however, most researchers have to make decisions on what to save based on a combination of scientific judgment and practical considerations, and those decisions effectively limit how much science we can confirm or reanalyze. Accidents limit it even further.

And all of those issues kick in at the earliest steps of data production. As we'll see in the next installment, they don't end there.