In a field rife with drug-addicted industries that derive billions of dollars from a single product, and stocked with researchers who scramble for government grants (sadly cut back by the recent US federal budget), the open sharing of genetic data and tools may seem a dream. But it must be more than a dream when the Sage Commons Congress can draw 150 attendees (turning away many more) from research institutions such as the Netherlands Bioinformatics Centre and Massachusetts General Hospital, leading universities from the US and Europe, a whole roster of drug companies (Pfizer, Merck, Novartis, Lilly, Genentech), tech companies such as Microsoft and Amazon.com, foundations such as Alfred P. Sloan, and representatives from the FDA and the White House. I felt distinctly ill at ease trying to fit into such a well-educated crowd, but was welcomed warmly and soon found myself using words such as “phenotype” and “patient stratification.”

Money is not the only complicating factor when trying to share knowledge about our genes and their effect on our health. The complex relationships of information generation, and how credit is handed out for that information, make biomedical data a case study all its own.

Update, May 25, 2011: presentations from the Sage Commons Congress are now available online.

The complexity of health research data

I listened a couple weeks ago as researchers at this congress, held by Sage Bionetworks, questioned some of their basic practices, and I realized that they are on the leading edge of redefining what we consider information. For most of the history of science, information consisted of a published paper, and the scientist tucked his raw data in a moldy desk drawer. Now we are seeing a trend in scientific journals toward requiring authors to release the raw data with the paper (one such repository in biology is Dryad). But this is only the beginning. Consider what remains to be done:

It takes 18 to 24 months to get a paper published. The journal and author usually don’t want to release the data until the date of publication, and some add an arbitrary waiting period after publication. That’s an extra 18 to 24 months (a whole era in some fields) during which that data is withheld from researchers who could have built new discoveries on it.

Data must be curated, which includes:

Being checked for corrupt data and missing fields (experimental artifacts)

Normalization

Verifying HIPAA compliance and other assurances that data has been properly de-identified

Possible formatting according to some standard

Reviewing for internal and external validity

Advocates of sharing hope this work will be crowdsourced to other researchers who want to use the data. But then who gets credited and rewarded for the work?
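To make the first two curation steps concrete, here is a minimal sketch of what checking for missing fields and normalizing values might look like. Nothing here comes from Sage or the Congress: the field names (sample_id, gene, expression), the function names, and the choice of z-score scaling as the normalization method are all assumptions made for illustration. Real genomic pipelines typically use more specialized normalization.

```python
import math

# Hypothetical field names; a real data set would define its own schema.
REQUIRED_FIELDS = ("sample_id", "gene", "expression")


def find_problems(records):
    """Return (index, reason) pairs for records that fail basic checks."""
    problems = []
    for i, rec in enumerate(records):
        # A field counts as missing if it is absent, None, or an empty string.
        missing = [f for f in REQUIRED_FIELDS if rec.get(f) in (None, "")]
        if missing:
            problems.append((i, "missing: " + ", ".join(missing)))
            continue
        try:
            value = float(rec["expression"])
        except (TypeError, ValueError):
            problems.append((i, "expression is not a number"))
            continue
        if math.isnan(value):
            problems.append((i, "expression is NaN"))
    return problems


def normalize(values):
    """Rescale values to zero mean and unit variance (z-scores).

    Assumption: z-scoring stands in for whatever normalization a lab uses.
    """
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        # All values identical: there is no spread to rescale.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


records = [
    {"sample_id": "s1", "gene": "geneA", "expression": "2.0"},
    {"sample_id": "s2", "gene": "", "expression": "1.0"},
    {"sample_id": "s3", "gene": "geneB", "expression": "n/a"},
]
problems = find_problems(records)
```

A report like this one, listing which record failed and why, is also a natural unit of credit: it records who found which problem in whose data.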

Negative results–experiments showing that a treatment doesn’t work–are extremely important, and the data behind them is even more important. Of course, knowing where other researchers or companies failed could boost the efforts of other researchers and companies. Furthermore this data may help accomplish patient stratification–that is, show when some patients will benefit and some will not, even when their symptoms seem the same. The medical field is notorious for suppressing negative results, and the data rarely reaches researchers who can use it.

When researchers choose to release data–or are forced to do so by their publishers–it can be in an atrocious state because it missed out on the curation steps just mentioned. The data may also be in a format that makes it hard to extract useful information, either because no one has developed and promulgated an appropriate format, or because the researcher didn’t have time to adopt it. Other researchers may not even be able to determine exactly what the format is. Sage is working on very simple text-based formats that provide a lowest common denominator that will help researchers get started.

Workflows and practices in the workplace have a big effect on the values taken by the data. These are very hard to document, but can help a great deal in reproducing and validating results. Geneticists are starting to use a workflow documentation tool called Taverna to record the ways they coordinate different software tools and data sets.

Data can be interpreted in multiple ways. Different teams look for different criteria and apply different standards of quality. It would be useful to share these variations.

A repeated theme at the Congress was “going beyond the narrative.” The narrative here is the published article. Each article tells a story and draws conclusions. But a lot goes on behind the scenes in the art and science of medicine. Furthermore, letting new hypotheses emerge from data is just as important as verifying the narrative provided by one’s initial hypothesis.

One of the big questions raised in my mind–and not covered in the conference–was the effect it would have on the education of the next generation of scientists were teams to expose all those hidden aspects of data: the workflows, the curation and validation techniques, the interpretations. Perhaps you wouldn’t need to attend the University of California at Berkeley to get a Berkeley education, or risk so many parking tickets along the way. Certainly, young researchers would have powerful resources for developing their craft, just as programmers have with the source code for free software.

I’ve just gone over a bit of the material that the organizers of the Sage Commons Congress want their field to share. Let’s turn to some of the structures and mechanisms.

Of networks

Take a step back. Why do geneticists need to share data? There are oodles of precedents, of course: the Human Genome Project, biobricks, the Astrophysics Data System (shown off in a keynote by Alyssa A. Goodman from Harvard), open courseware, open access journals, and countless individual repositories put up by scientists. A particularly relevant data sharing initiative is the International HapMap Project, working on a public map of the human genome “which will describe the common patterns of human DNA sequence variation.” This is not a loose crowdsourcing project, but more like a consortium of ten large research centers promising to release results publicly and forgo patents on the results.

The field of genetics presents specific challenges that frustrate old ways of working as individuals in labs that hoard data. Basically, networks of genetic expression require networks of researchers to untangle them.

In the beginning, geneticists modeled activities in the cell through linear paths. A particular protein would activate or inhibit a particular gene that would then trigger other activities with ultimate effects on the human body.

They found that relatively few activities could be explained linearly, though. The action of a protein might be stymied by the presence of others. And those other actors have histories of their own, with different pathways triggering or inhibiting pathways at many points. Stephen Friend, President of Sage Bionetworks, offers the example of an important gene implicated in breast cancer, the Human Epidermal growth factor Receptor 2, HER2/neu. The drugs that target this protein are weakened when another protein, Akt, is present.

Trying to map these behaviors, scientists come up with meshes of paths. The field depends now on these network models. And one of its key goals is to evaluate these network models–not as true or false, right or wrong, because they are simply models that represent the life of the cell about as well as the New York subway map represents the life of the city–but for the models’ usefulness in predicting outcomes of treatments.

Network models containing many actors and many paths–that’s why collaborations among research projects could contribute to our understanding of genetic expression. But geneticists have no forum for storing and exchanging networks. And nobody records them in the same format, which makes them difficult to build, trade, evaluate, and reuse.
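To give a feel for what a shared, lowest-common-denominator format for networks could look like, here is a minimal sketch that stores a network as tab-separated lines of source, relation, and target, then reads it back into a structure that can be queried. The format, the two relation names, the function names, and the placeholder actors (proteinA, geneB, and so on) are all hypothetical. They do not describe any format Sage or anyone else has proposed, and the edges do not represent real biology.

```python
from collections import defaultdict

# Hypothetical three-column format: source, relation, target, separated by tabs.
# Lines starting with "#" are comments. The actors below are placeholders.
SAMPLE = """\
# source  relation  target
proteinA\tactivates\tgeneB
proteinC\tinhibits\tgeneB
geneB\tactivates\tproteinD
"""

ALLOWED_RELATIONS = ("activates", "inhibits")


def parse_network(text):
    """Parse the text format into {source: [(relation, target), ...]}."""
    edges = defaultdict(list)
    for line_no, line in enumerate(text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(f"line {line_no}: expected 3 tab-separated fields")
        source, relation, target = parts
        if relation not in ALLOWED_RELATIONS:
            raise ValueError(f"line {line_no}: unknown relation {relation!r}")
        edges[source].append((relation, target))
    return dict(edges)


def regulators_of(network, target):
    """List (source, relation) pairs for every edge pointing at target."""
    return sorted(
        (source, relation)
        for source, outgoing in network.items()
        for relation, tgt in outgoing
        if tgt == target
    )


network = parse_network(SAMPLE)
```

Here regulators_of(network, "geneB") reports both the activating and the inhibiting actor, the kind of non-linear picture described above. Because the file is plain text, two labs could compare or merge their networks with ordinary tools.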

The Human Genome Project is a wonderful resource for scientists, but it contains nothing about gene expression, nothing about the network models and workflows and methods of curation mentioned earlier, nothing about software tools and templates to promote sharing, and ultimately nothing that can lead to treatments. This huge, multi-dimensional area is what the Sage Commons Congress is taking on.

More collaboration, and a better understanding of network models, may save a field that is approaching crisis. The return on investment for pharmaceutical research, according to researcher Aled Edwards, has gone down over the past 20 years. In 2009, American companies spent one hundred billion dollars on research but got only 21 drugs approved, and only 7 of those were truly novel. Meanwhile, 90% of drug trials fail. And to throw in a statistic from another talk (Vicki Seyfert-Margolis from the FDA), drug side effects create medical problems in 7% of patients who take the drugs, and require medical interventions in 3% or more of cases.
