A Fishfinder for the “Junk DNA” Seas

In a way, the Human Genome Project had it easy. Sure, mapping the roughly 23,000 genes active in humans was one of the most important scientific achievements of all time, but those genes are only part of the story. In fact, the protein-coding sequences only occupy about 1.5% of the roughly 3 billion base pairs present in human DNA; the actual genes are small islands afloat in a vast sea of largely unknown sequences.

Those mysterious stretches were once referred to dismissively as “junk DNA,” as scientists presumed that all those base pairs between the genes must be largely the cold leftovers of evolution – sequences that may have been important to other species but had lost their utility in humans. But now that we have a pretty reliable map of our 23,000 genes, it’s become apparent that rich treasure lies hidden in the junk DNA. Since all cells of the human body share the same DNA, a set of instructions must be present to direct Cell A to become a neuron and Cell B to become a heart cell and so forth. Increasingly, these “switches” are understood to be key to both construction of a functioning body and the ways that process can break down in genetic diseases. But finding those switches is no easy task.

“These sequences are literally in the middle of nowhere, these tiny things in a sea of anonymous sequences,” said Marcelo Nobrega, assistant professor of human genetics at the University of Chicago. “The question was: How are you going to find those?”

In a recent paper in the journal Genome Research, a team of researchers from the University of Chicago and the National Institutes of Health may have made that search much easier. Just as modern fishermen use computerized fishfinders to help them spot prize catches in the waters below, Nobrega, Ivan Ovcharenko and colleagues have developed a computer tool to scan the DNA depths for the tiny switches important for cell determination. After first demonstrating the model’s usefulness by tracking down sequences important for heart development, the authors said the method can be used to sniff out molecular switches that control the fate of every kind of cell, in humans or other organisms.

“The Human Genome Project gave us a book with 3 billion letters, of which 3 million are known words,” Nobrega said. “But that doesn’t tell the story, and in the time that it’s taking to unravel the other things hidden in genome, we’re learning just how complicated it’s going to be.”

The secret of this “switchfinder” model is a matter of code-breaking: training the computer to recognize sequences of base pairs that suggest a segment of DNA is not mere “junk,” but a switch. Similar code-breaking was at the heart of the Human Genome Project – every protein-coding sequence begins with a characteristic signal known as the “start codon.” But the rules that identify a switch were largely unknown until this research.
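As a toy illustration of the simplest kind of pattern-matching mentioned above, the short Python sketch below scans a made-up DNA string for the start codon, ATG. It is not the researchers' switchfinder, whose rules are not spelled out here; the function name and the example sequence are hypothetical and chosen only for demonstration.

```python
def find_start_codons(sequence):
    """Return the 0-based position of every ATG in a DNA string."""
    seq = sequence.upper()  # treat lower- and upper-case letters alike
    positions = []
    index = seq.find("ATG")
    while index != -1:
        positions.append(index)
        # Resume the search one letter further along the sequence.
        index = seq.find("ATG", index + 1)
    return positions


# Hypothetical 19-letter sequence, invented for this example.
example = "ccATGaaatgGTTTAGatg"
print(find_start_codons(example))  # [2, 7, 16]
```

Finding a fixed three-letter signal like this is easy. The switches are harder because, as the article notes, the rules that mark them were largely unknown and had to be learned by the computer.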

“We can finally say that there is a well-defined genetic code hardwired in our genomes that can be used to specifically identify heart regulatory elements in the vast sequence that makes up the human genome,” said Ovcharenko, of the NIH’s National Center for Biotechnology Information. “With the advance of computational methods, we can use computers to break this code, learn its encryption, and understand the signals heart cells receive to regulate genes.”

What the team found was not a simple matter of a single on/off switch for heart development. The model turned up 42,000 potential switches related to heart development, nearly twice the total number of genes in human DNA. These tens of thousands of elements work together in elaborate choreography to orchestrate the development of a cell into a fully functioning heart cell, turning the right genes on and off at just the right times.

The green heart of the zebrafish embryo (courtesy: Nobrega lab)

To confirm that the computer model was working, Nobrega’s laboratory picked a handful of sequences identified as switches and tested them in an animal model – one that might seem like a bizarre choice. The zebrafish, a tiny striped species commonly found in pet stores, offers the advantage of transparent embryos and well-studied genetics, making it an ideal critter for testing the role of genes in development. Nobrega’s laboratory attached some of the potential switches (known as enhancers) identified by their computer model to genes that produce either fluorescent green or blue color when activated. If the switch was truly important to the development of the heart, the organ would light up green or blue – as it did for more than 60 percent of the switches tested by the laboratory. Those results were encouraging evidence that the “switchfinder” developed by the team was working.

“If you go randomly in the genome and pull out a sequence to test, the chance you’re going to hit a heart enhancer is probably going to be a fraction of a percent,” Nobrega said. “Yet with our list of sequences, you have a 60 percent chance. It’s tremendously better.”

The model is not specific to the heart; a shorter experiment in the paper demonstrated that the program can also be used to detect switches important for certain types of brain cells, and the authors note that it can be applied to any organ or tissue. And characterizing the chorus of switches that orchestrate cell development is more than mere code-breaking; knowledge about what needs to go right can turn up clues about what goes wrong in cases of heart disease or other illnesses. One such instance of a “junk DNA” sequence causing heart disease was described yesterday in Nature – deletion of a non-coding sequence in mice dramatically affected the expression of two genes and caused the mice to die earlier than normal.

“For some of these diseases, there’s nothing wrong with the protein sequence itself,” Nobrega said. “But there may be an alteration in how and when the protein is made, which can lead to disease as well. That’s the urgency and need for this kind of work to basically crack the other codes that are present in the genome.”