Today, a scientific collaboration called the Encyclopedia of DNA Elements (ENCODE) published some of its data. When I say “collaboration,” I mean more than 400 scientists working in 32 different labs, and when I say “some of its data,” I mean over 1,600 experiments involving 24 types of analyses on 147 cultured cell lines.

A typical ENCODE experimental result.

ENCODE didn’t publish this massive data set in a paper. They published it in 30 papers that came out simultaneously in three journals, plus additional commentary elsewhere. To help people make sense of this information glut, Nature, the main publisher, set up a special web page where all of the papers are freely available, released an iPad app that lets users explore the results through different “threads” of inquiry, and held a press conference that featured several of the consortium’s principal researchers as well as an interpretive dance performance inspired by the results. Yes, really.

As regular readers know, I’m always ready to call out publishers who engage in excessive hype. In this case, though, I think Nature’s hoopla is entirely appropriate. This is a $185 million project that’s trying to figure out how humans work at a molecular level, and the current batch of publications presents both a rough sketch of an answer and a whole new list of big questions.

Most science news stories on ENCODE will probably begin and end with an observation about “junk DNA,” and how the new data apparently overturn the notion that most of the human genome is just taking up space. Perhaps acknowledging that this is the most easily digested result, the press materials and many of the commentary articles highlight it. Molecular biologist Joseph Ecker puts it this way in his synopsis:

One of the more remarkable findings … is that 80% of the genome contains elements linked to biochemical functions, dispatching the widely held view that the human genome is mostly ‘junk DNA.’ The authors report that the space between genes is filled with enhancers (regulatory DNA elements), promoters (the sites at which DNA’s transcription into RNA is initiated) and numerous previously overlooked regions that encode RNA transcripts that are not translated into proteins but might have regulatory roles.

But neither that result nor any other individual piece of the data is really the main point. What matters about ENCODE is the totality of it, and what the scale of the data says about the future of biology.

When the Human Genome Project released its draft sequence 11 years ago, it was a bit like Deep Thought reporting that the answer was in fact 42. By itself, the genome sequence told us that we only had about 20,000 genes, and that most of our DNA didn’t look like it had any function at all. There was obviously a lot more going on than we’d be able to glean just by looking at the sequence.

ENCODE is a follow-up project, in which researchers used a huge variety of techniques to probe the functions of all of the parts of our DNA, not just the segments that contain obvious genes. They looked for enhancers that can control the expression of genes in other parts of the genome. They screened all of the RNA in cells to find new pieces of micro-RNA, a type of gene-controlling molecule we didn’t even know about when I went to graduate school. They tested which parts of the genome were wrapped up in chromatin, a sort of deep storage system, and which were open for business in different types of cells. And on and on. In short, they examined what every piece of the genome was doing under as many different conditions as they could.

Besides finding that most of the genome is probably doing something to earn its keep, ENCODE has illuminated the scope of the problem biologists now face. It’s huge.

A graphic accompanying Brendan Maher’s excellent news feature on the project shows what ENCODE has accomplished so far, and how much work remains just to finish its initial phase. For example, the investigators have looked at only 120 of an estimated 1,800 transcription factors, proteins that control gene expression directly, and they’ve only looked at those factors in a subset of the cell lines they set out to study. That one snippet of the work produced a massive amount of information by itself.

Even after ENCODE finishes, what we’ll have will be more of a pamphlet than an encyclopedia. Cultured human cell lines are a great tool for laboratory studies, but they only partly mimic the behavior of the cells that make up a real human, which in turn vary from person to person and within a single person over time. ENCODE is giving us a two-dimensional view of a system that’s at least five-dimensional. That’s not to minimize the project; the team has made astonishing progress, but it’s just a start.

After doing the rest of the cultured cell experiments, biologists will have to figure out the results, which raises a whole new problem. I can’t tell you what all of the ENCODE data mean. Neither can the people who generated them. Besides the 30 new papers (and their supplementary online sections), the project has also produced databases, software, and other analytical tools so scientists can dive into the results directly. The conclusions of the new papers are just the bits that the experimenters thought were most interesting. As happened with the human genome sequence, people will be digging new publications out of these data for years.

Right now, we’re like astronomers looking at millions of smudges of light we can see with a new telescope, and it’s just dawning on us that those aren’t stars. They’re galaxies.

ENCODE is also part of a trend that’s raising tough ancillary questions for scientists and science publishers. Though its principal investigators undoubtedly see the project as worthwhile, $185 million is a lot of money, and the reality of government-sponsored science is that it’s a zero-sum game. Despite what some big science proponents claim, funding for consortium-based “factory research” studies such as this necessarily comes at the expense of individual investigator-led projects. In an environment where thousands of promising young researchers are scrambling for grants, can we be sure this was the best way to spend those funds?

From the publishers’ perspective, big science is fraught with disputes over credit, concerns about oversight and data integrity, and fundamental questions regarding the proper length and format for a paper. It’s not even clear that a project like this should be published in a conventional journal; perhaps the data should simply go online, accompanied (or not) by a few comments from the lead scientists. As the ENCODE juggernaut keeps rolling along, and as subsequent, even bigger projects follow it, it might not even be possible to crank out papers for each new batch of work.

But this, too, is an expected result. This is what science does: uses what’s possible to redefine what’s possible. The ability to sequence a gene becomes the ability to sequence a genome becomes the ability to sequence a thousand genomes. When our minds can’t accommodate the new information, we’ll just have to expand them.