Charting the progress of the various large-scale genome-sequencing projects as researchers working separately on their chosen species begin to pool analytical resources

Apr 24, 2014

Anna Azvolinsky

Scientists working to sequence all manner of bacteria, archaea, plants, and animals and to make these genomes publicly available hope to use the data to inform health, industrial, and environmental issues. Large-scale sequencing consortia have been churning out data at an impressive rate, yet significant gaps remain in the genomic tree of life. And while these groups have largely been working independently of one another, together they might address more far-reaching questions, such as how life has evolved, how it currently functions, and how it might look down the line.

“We are still in the developmental stage, where every consortium focuses on a specific domain and is building up their own data and making sure it’s in good enough shape,” said Igor Grigoriev, head of the fungal genomics program at the US Department of Energy (DOE) Joint Genome Institute (JGI) in Walnut Creek, California, and part of the 1,000 Fungal Genomes project. “Some dialog between the consortia is happening but grand-scale data integration remains to happen.”

Although there is still relatively little crosstalk among consortia, some of their data are being collected in central repositories. Aside from the National Center for Biotechnology Information’s genome database, there is the JGI-funded Genomes Online Database (GOLD), which functions as a hub for completed and ongoing genome sequencing initiatives and metagenome projects. GOLD is mainly focused on microbial genomes, but includes some eukaryotic genomes. Data from many of these projects are integrated into JGI’s databases and can also be uploaded into newly developed KnowledgeBase tools funded by the DOE.

“Both talking across communities and coming up with creative tools to ask broader scientific questions across the domains of life is important,” said Grigoriev. “As a scientific community, I think we are just now at the moment . . . moving towards this.”

Amidst this early collective momentum, however, some groups are still working to sequence critical species within their own domains. In 2011, members of the microbial genomics community were ready to publish a manuscript rallying the scientific community to fund a large-scale genomic sequencing project covering important bacterial and archaeal strains. The goal was to fill gaps in the microbial tree of life. The appeal proved unnecessary, however, as progress in genome sequencing initiatives, including those on thousands of bacterial and archaeal genomes such as the Genomic Encyclopedia of Bacteria and Archaea (GEBA) pilot project, gave the geneticists and microbial biologists reason to believe that the sequencing would be completed.

“We thought there was a turning point three years ago,” said Nikos Kyrpides, who heads up the microbial genomics and metagenomics program at the JGI. “The community believed that more funding agencies would begin to support microbial sequencing studies, not just for public health and industry applications, but to cover the reference genomes of the phylogenetic tree.”

Investigators at the JGI and several international institutions have since sequenced the full genomes of 3,000 additional microbes, but coverage of the bacterial and archaeal domains remains fairly sparse. So Kyrpides and his colleagues are now submitting an updated manuscript to raise awareness of the importance of their Microbial Earth Project, which aims to sequence 7,830 representative type strains from the 11,000 species available in culture collections over the next three years. “Only about 10 percent to 15 percent of the diversity of cultured Archaea and bacterial species has been captured by sequencing so far,” said Kyrpides. “That’s enormously small.”

Part of the problem is that many government and private funding agencies are most interested in supporting scientists sequencing the genomes of species that impact human health, industry concerns, and environmental issues.

The genome of the golden star tunicate (Botryllus schlosseri) was published in July 2013. WIKIMEDIA, PARENT GERY

“Many times it is easier to receive funding to sequence species important for agriculture, for example. These types of projects go faster through the pipeline because there is more funding from governments or companies interested in funding directed efforts,” said Toni Gabaldon, the head of bioinformatics and genomics at the Center for Genomic Regulation in Barcelona, Spain. One issue, said Kyrpides, is that funding agencies don’t often work together, and it typically takes more than a single funding body to support broad, encyclopedic sequencing efforts. “We are pushing for funding agencies to change: to stop delineating projects by application, and to work together.”

Plenty of consortia dedicated to sequencing specific branches of the tree of life have cropped up as researchers working within the same domains have recognized that pooling resources can boost scientific progress. Among these groups are the Global Invertebrate Genomics Alliance (GIGA), the 5,000 Insect Genome Project (i5K), the 1,000 Fungal Genomes Project, the US National Science Foundation (NSF) Plant Genome Research Program, and the Genome 10K Project, which aims to sequence 10,000 vertebrate genomes. There is also the Smithsonian Institution-led Global Genome Initiative (GGI), a collaborative effort to sequence at least one species from every one of the 9,500 described invertebrate, vertebrate, and plant families.

As more and more long-read sequencing technologies hit the market and the overall costs of decoding genomes drop, an emerging challenge is attracting and coordinating experts to collect, annotate, and place sequencing data in their biological contexts, according to Kevin Hackett, a national program leader at the US Department of Agriculture (USDA) and one of the leaders of the i5K project.

And these analytical efforts are important; rather than having researchers compete for funding, they unite those with common goals, eliminating redundancies and lowering overall costs. According to Stephen Goff, the project director of the iPlant Collaborative, a culture of true cooperation in genomics is just beginning to evolve.

For its part, rather than generating new genomic sequencing data, the iPlant team is making cloud computing, data storage, and genomic analysis tools available to the broader plant community. For example, iPlant is providing the cyberinfrastructure and analysis tools that will help the African Orphan Crops Consortium sequence 101 crops important to the continent’s agriculture. The iPlant team has also volunteered to provide infrastructure for the i5K project and other insect sequencing projects, said Goff.

Other consortia are creating their own data storage and analysis tools. Through its Plant Genome Research Program, the NSF aims not only to generate new genome sequences, but to provide a platform to integrate all existing genomic data for evolutionary and species diversity analyses. In Europe, the members of the European Life-Sciences Infrastructure for Biological Information (ELIXIR) group intend to create a resource for scientists to store and share large data sets, such as whole genomes.

Grigoriev’s team at JGI has developed a web-based public fungal genomics resource. “MycoCosm is an example of integration of fungal genomics data and computational tools, and the bringing together of the fungal biologist research community,” he explained. “From here, we can go to the next step of integrating across multiple domains.”

Bioinformatics tools will need to evolve to keep pace as genomic analyses become more complicated—covering complex inter-domain relationships, such as the symbiotic interplay between certain plants, fungi, and endobacteria. But even within a single consortium’s database, as the number of genomic sequences increases from tens to many hundreds, scaling the storage and analytical tools has been a challenge.

“Many computational scientists and bioinformaticians are working alongside biologists to analyze and organize the sequencing data. This is a major challenge but I have a lot of optimism because there is plenty of innovation and energy in this field,” said Klaus-Peter Koepfli, one of the principal investigators of the Genome 10K project and a visiting scientist at the Smithsonian Conservation Biology Institute in Washington, D.C. “There are many obstacles to reconstructing the phylogeny of all living things, but it’s a great goal.”

How Many Species Have Been Sequenced?

During the last 250 years, 1.2 million eukaryotic species have been identified and taxonomically classified. Estimated number of species on Earth: bacterial and archaeal species, from 100,000 to 10 million [1,2]; eukaryotic species, approximately 8.7 million ± 1.3 million in total, including 2.2 million marine organisms [1].