Better Analytical Tools for Genome Researchers

Life scientists are producing a mountain of DNA data thanks to recent advances in the instruments used for sequencing a person’s genome. DNA sequencing can identify the order of all 3 billion base pairs of DNA in a set of human chromosomes. The Sanger sequencing method, used for this since it was developed by Frederick Sanger and colleagues in 1977, has been supplanted by next-generation sequencing (NGS) methods.

The high-throughput NGS instruments, which can sequence a large number of short DNA fragments at relatively low cost, have revolutionized the process over the past decade. These instruments can deliver hundreds of gigabases (billions of bases, the nucleotides that make up DNA) of sequence data in a single run. But the ability to read millions to billions of sequences has generated trillions of bytes of data, overwhelming current computer systems.

“The data coming from different types of high-throughput instruments are exceeding the capabilities of standard computer platforms,” says IEEE Fellow Srinivas Aluru, a professor of computer engineering at Iowa State University, in Ames.

Aluru and a team of researchers are developing a toolbox of sorts for the scientists, using high-performance computers (HPCs). The project recently received a three-year, US $2 million grant from the Big Data program run jointly by the U.S. National Science Foundation and the U.S. National Institutes of Health. With the toolbox, researchers will be able to extract and apply the knowledge gleaned from collections of large data sets to accelerate progress in science and engineering research. The Big Data program will fund research to develop and evaluate new algorithms, statistical methods, and tools for improved data collection, management, and analytics.

“We need to find ways to help researchers crunch a large amount of data very quickly,” Aluru says. “Our goal is to develop a broad array of tools to analyze high-throughput sequencing using high-performance computers in as easy a way as possible.” Researchers at Stanford, Virginia Tech, and the University of Michigan, in Ann Arbor, are also involved in the project.

MOTIVATION Statisticians and researchers from life sciences and bioinformatics, in both academia and industry, have singled out data analysis as the main hurdle in harnessing the full potential of NGS instruments, Aluru says. Once researchers can mine large databases of human genome data, they may be able to do numerous new things, including uncovering complex disease traits; determining whether microorganisms inhabiting different parts of the body are leading to increases in obesity, heart disease, and diabetes; and examining DNA gathered at bioterrorism sites for genetically modified or even unknown organisms.

Even more new applications are expected as the cost of mapping genes continues to drop. A full sequence of a human genome done with an NGS instrument currently costs a few thousand dollars, according to Aluru. That’s quite a drop from the billions it cost to map that first human genome over a decade ago. In three years, he predicts, that price could drop to just a few hundred dollars.

“It could become so cheap that I expect all of us eventually will be sequenced as part of an annual exam,” he says.

BUILDING BLOCKS Aluru’s survey of current NGS tools shows they already share common core index and data structures, algorithmic techniques, and application components. And advances in computer platforms over the past decade, such as multicore processors, many-core graphics processing units used as general-purpose accelerators for high-performance computing, and cloud computing, will help tackle the data challenges.

But computational, statistical, and machine-learning techniques still need further development, he continues. His team will begin this process by defining the main set of computational building blocks with the potential to support not only current but also future bioinformatics needs. The building blocks will be organized into a three-tiered hierarchical structure, with each higher layer taking advantage of one or more of the layers below it. Because all types of data coming from NGS instruments are ultimately DNA sequences modeled as strings over a fixed alphabet, index and data structures operating on strings are at the heart of many bioinformatics algorithms. So Aluru’s first layer will focus on developing new tools for these structures, including lookup tables, suffix trees, and overlap graphs.
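To make two of those first-layer structures concrete, here is a minimal Python sketch of a lookup table and an overlap graph built from short reads. It is an illustration only, not code from Aluru's project: the function names, the toy reads, and the parameters k and min_overlap are all assumptions chosen for the example, and a production tool would use far more compact and parallel data structures.

```python
from collections import defaultdict


def build_kmer_table(reads, k):
    """Lookup table: map each length-k substring (k-mer) to every
    (read index, offset) at which it occurs. Hypothetical helper."""
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        for pos in range(len(read) - k + 1):
            table[read[pos:pos + k]].append((read_id, pos))
    return table


def build_overlap_graph(reads, min_overlap):
    """Overlap graph: return edges (a, b, length) where a suffix of
    reads[a] of at least min_overlap bases equals a prefix of reads[b].
    Hypothetical helper; all qualifying overlaps are reported."""
    # Index each read by its first min_overlap bases so candidate
    # partners can be found without comparing every pair of reads.
    prefix_table = defaultdict(list)
    for read_id, read in enumerate(reads):
        if len(read) >= min_overlap:
            prefix_table[read[:min_overlap]].append(read_id)

    edges = []
    for a, read_a in enumerate(reads):
        for start in range(len(read_a) - min_overlap + 1):
            seed = read_a[start:start + min_overlap]
            suffix = read_a[start:]
            for b in prefix_table.get(seed, ()):
                # Skip self-overlaps; confirm the whole suffix matches.
                if a != b and reads[b].startswith(suffix):
                    edges.append((a, b, len(suffix)))
    return edges


if __name__ == "__main__":
    toy_reads = ["ACGTAC", "GTACGG", "ACGGTT"]  # made-up reads
    table = build_kmer_table(toy_reads, k=3)
    print(table["ACG"])   # [(0, 0), (1, 2), (2, 0)]
    print(build_overlap_graph(toy_reads, min_overlap=3))
    # [(0, 1, 4), (1, 2, 4)]
```

In this toy case the graph chains the three reads in order, which is the kind of information an assembler built on a higher layer could use to stitch fragments back together.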

Next come core algorithms, which support common constructs that frequently occur in NGS data analytics. The final layer, the application component, will reflect the key steps taken in NGS bioinformatics analyses and also match the researchers’ high-level processing tasks.

All three layers will be summarized in software libraries that rely on a simple and intuitive syntax. The layers will then be mapped to a variety of HPC platforms, including multicore CPUs, CPU/GPU hybrid platforms, and clouds.

The project will also develop a domain-specific language (DSL) to enable analytics software to be developed and prototyped, even by nonexperts. The DSL will also provide an efficient way to perform data-driven algorithmic optimizations that otherwise would be too time-consuming or impossible for general-purpose applications, according to Aluru.

“By identifying the core functions and developing parallel algorithms for them, and encapsulating them in software libraries mapped to a variety of HPC platforms, we can empower the informatics community to achieve big data analytics,” he says.