Jack A. Gilbert and Folker Meyer

Jack A. Gilbert is an Environmental Microbiologist at the Institute for Genomic and Systems Biology at Argonne National Laboratory, Lemont, Ill., and Adjunct Professor in the Department of Ecology and Evolution, University of Chicago, Chicago, Ill., and Folker Meyer is a Computational Biologist at the Institute for Genomic and Systems Biology at Argonne National Laboratory and the Senior Fellow at the Computation Institute, University of Chicago.

Summary

● The Earth Microbiome Project (EMP) is a collaborative initiative to create data sets describing microorganisms from a broad range of ecosystems.
● An important EMP goal is to enable comparisons of many environments across the planet, leveraging data sets that are generated following standard protocols.
● Another important EMP goal is to determine how analytic biases might affect researchers when they reconstruct microbial communities from sequencing data.
● Yet another goal is to develop and then validate predictive mathematical models of ecosystems.
● Linking the microbial assemblage prediction and predictive relative metabolic turnover approaches enables us to extrapolate relative levels of metabolites as a function of the predicted community structure.

Globally, microbial cells are 1 billion times more abundant than stars in the universe. Such numbers make it a daunting task to understand microbial complexity in useful ways. Because cataloguing this vastness does not immediately provide useful products, we should design surveys with specific questions that will help lead to specific products while also refining those questions.

Microbes, ubiquitous as they are, are important to ecosystem function. Indeed, scientists made great strides during the past 100 years describing how microbial species, consortia, and communities interact with the biological, physical, and chemical world. Progress came from investigating both ends of the perspective ladder. At its base is intracellular metabolic dynamics, exploring gene to transcript expression, the folding of macromolecules, protein function, and biochemistry at the level of individual cells, usually from individual taxa (Fig. 1). Thus, we can visualize metabolic pathways that allow cells to interact with the environment while producing more and more cells. From the other end of the ladder, the "30,000 feet perspective," we explore the sum of an ecosystem's taxonomic and functional capabilities.

The Earth Microbiome Project (EMP, www.earthmicrobiome.org) is a collaborative initiative to create multiple comparable data sets describing microorganisms from a broad range of ecosystems. The EMP is in a pilot phase, with several small projects running in parallel. The main aim of one of these projects is to generate 16S rDNA amplicon and shotgun metagenomic sequence data from 10,000 environmental samples, a significant task yet one within our grasp. It will generate 15 trillion base pairs of sequence data from a diverse array of microbial ecosystems, creating an informatics challenge that will require a coordinated data management plan. Hence, the EMP depends on direct and frequent interactions between bioinformatics specialists and microbial ecologists.

Flexible Goals of the Earth Microbiome Project

Goals are moving targets in science, subject to change as new evidence becomes available. An important EMP goal is to enable comparisons of many environments across the planet, leveraging data sets that are generated following standard protocols. For nucleic acid sequencing analysis, this approach will require researchers to follow standard DNA extraction protocols to lyse cells and release DNA for analysis. Doing otherwise risks obtaining results in which the significant discriminatory factor between two samples depends on the extraction protocol, yielding data of limited relevance (Fig. 2).

This need to produce datasets that are fully comparable also will require researchers to develop and follow a standard amplicon protocol to overcome biases when amplifying DNA in samples. For instance, the many steps of PCR can introduce biases because of variables such as primer selection, the brand of Taq polymerase being used, the temperature or timing of amplification cycles, and the ionic strength of the reaction mixture. Additionally, investigators will need to follow standard sequencing protocols, which is especially important for shotgun metagenomic sequencing because, in the absence of an amplification step, sequencing itself can be a source of bias. Because several sequencing platforms are now available, each with different biases, it will be essential to standardize this choice to meet the EMP goal of having comparable sequence data sets from many different ecosystems across time and space.

Another important EMP goal is to determine how analytic biases might affect researchers when they reconstruct microbial communities from sequencing data. Developing this understanding is vital, as it is not sensible to assume that the research community will adopt prescriptive approaches. Because of the pace of technological advance, we need to better understand how data from updated protocols will compare to earlier data. One alternative approach, to resequence all samples, does not address other protocol changes that might arise.

Other goals of the EMP include developing a database for analyzing environmental samples in terms of their niche space characteristics, a global atlas of protein functions, and a catalogue of reassembled genomes and their taxonomic distributions.

Predictive Ecosystem Models for Ocean Dispersal

Another fundamental goal is to develop predictive mathematical models of ecosystems. Such models usually provide abstract representations of ecosystems. Their predictions range typically from the level of intracellular dynamics to that of regional and global scales. Invariably these dynamics involve networks of interactions among the biological, chemical, and physical variables in a system overlaid with algorithms, which describe those relationships. Thus, models predict how changing one variable will likely generate responses among other variables. Models not only enable us to predict how an ecosystem will behave without inappropriately changing it, they also help us to predict how changes might affect the ability of an ecosystem to deliver vital services upon which we rely.

Many ecosystem models are available, and they span a wide range of ecosystems. A majority cover marine systems, probably because the fluid dynamics in bodies of water fit well with the assumption that organisms disperse freely within the spatial or temporal range. Indeed, the ubiquity and age of microbial life mean that, in a dynamic ocean environment, there should be very few limitations to absolute dispersal. For example, according to the One Ocean Model of Biodiversity, which was developed by Ron O'Dor of the Consortium for Ocean Leadership in Washington, D.C., and his collaborators, a microbial "species" with a dispersal rate of approximately 1 mile per year could reach any location in a global ocean in approximately 10,000 years. Because ocean currents are agents of and potential barriers to dispersal, this rate likely is underestimated for some organisms but overestimated for others.
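The order of magnitude of that estimate can be checked with simple arithmetic. The sketch below is not part of the One Ocean Model; it assumes a spherical Earth with an equatorial circumference of roughly 24,901 miles, travel along the shortest path over the surface, and no currents or land barriers. The function name `years_to_reach` is hypothetical.

```python
# Rough plausibility check of the dispersal estimate.
# Everything except the 1 mile/year rate is an assumption made for illustration.
EARTH_CIRCUMFERENCE_MILES = 24_901    # approximate equatorial circumference
DISPERSAL_RATE_MILES_PER_YEAR = 1.0   # rate quoted in the text


def years_to_reach(distance_miles, rate_miles_per_year):
    """Time needed to cover a distance at a constant dispersal rate."""
    if rate_miles_per_year <= 0:
        raise ValueError("rate must be positive")
    return distance_miles / rate_miles_per_year


# On a sphere, the farthest point from any start is half a circumference away.
farthest_miles = EARTH_CIRCUMFERENCE_MILES / 2
years = years_to_reach(farthest_miles, DISPERSAL_RATE_MILES_PER_YEAR)
print(f"Farthest point is about {farthest_miles:,.0f} miles away; "
      f"travel time is about {years:,.0f} years")
```

Under these simplifications the travel time comes out slightly above 12,000 years, the same order of magnitude as the approximately 10,000 years cited.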

This dispersal could lead to development of a predictive model of global microbial community composition and structure. For instance, ocean currents distribute specific eukaryotic and bacterial taxa on the basis of how each taxon survives when subject to different environmental constraints, such as variations in temperature, according to a model developed by Michael Follows and Stephanie Dutkiewicz from the Massachusetts Institute of Technology in Cambridge, Mass.

Gregory Caporaso of Northern Arizona University in Flagstaff and his collaborators used an existing dataset of 10,000 16S rDNA sequencing reads from 72 consecutive time samples from surface waters at the English Channel L4 research station. From one of those samples, they generated an additional 10 million 16S rDNA reads. Strikingly, they found 99.96% of all the taxa from the initial survey in the deep-sequenced data set from one time point, yet this complement of species accounted for only 5% of the total taxonomic diversity in the 10 million reads. Apparently because of the dynamic flow of the English Channel, the same microbial community was present in every sample, validating the assumption that there is no barrier to dispersal.
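The two percentages in this comparison measure different things, which a small sketch may make clearer. The code below is illustrative only: it uses invented taxon labels rather than the L4 data, treats "taxonomic diversity" simply as the number of distinct taxa (an assumption), and the function name `overlap_summary` is hypothetical.

```python
def overlap_summary(shallow_taxa, deep_taxa):
    """Compare a shallow multi-sample survey with one deeply sequenced sample.

    Returns (recovered, share_of_deep):
      recovered     - fraction of shallow-survey taxa also seen in the deep sample
      share_of_deep - fraction of the deep sample's taxa that those shared taxa represent
    """
    shallow, deep = set(shallow_taxa), set(deep_taxa)
    if not shallow or not deep:
        raise ValueError("both taxon collections must be non-empty")
    shared = shallow & deep
    return len(shared) / len(shallow), len(shared) / len(deep)


# Toy data (hypothetical labels): 50 taxa across the time series,
# 999 taxa in the deep sample, which misses only "otu0".
shallow_survey = {f"otu{i}" for i in range(50)}
deep_sample = {f"otu{i}" for i in range(1, 1000)}

recovered, share_of_deep = overlap_summary(shallow_survey, deep_sample)
print(f"recovered in deep sample: {recovered:.2%}")
print(f"share of deep-sample taxa: {share_of_deep:.2%}")
```

A high first value combined with a low second value is the pattern reported here: nearly every taxon from the time series turns up in one deep sample, while most of the deep sample consists of taxa the shallower survey never detected.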

Modeling Terrestrial Dispersal Proves Challenging

Modeling the spatial characteristics of microbial community structure in terrestrial ecosystems proves extremely difficult, primarily because of the heterogeneity in soil and other communities. Moreover, the static nature of terrestrial systems results in patchiness, suggesting that predictions for a pasture will differ from those for a forest even if the two sites are separated by only a few meters.

However, this perceived heterogeneity might not be as patchy as once thought. Just as marine systems might allow microbes to redistribute fully every 10,000 years, microbes in terrestrial settings are likely also to redistribute over some finite period. Moreover, soil is not static. Animals, plants, water, and wind disrupt it, as do longer-term erosion, continental drift, and hydrothermal activities, leading to local, regional, continental, and global redistributions (Fig. 3).

Determining the time for a microbial species to travel 10,000 miles may prove extremely difficult but is not impossible. By creating a global inventory of microbial taxa from thousands of disparate ecosystems, the EMP can start to elucidate the degree of overlap in taxonomic composition among different terrestrial systems across different spatial scales.

This dataset also can help to determine whether there is a universal microbial community for moderate ecosystems. To test this question, it will be necessary to sequence extremely deeply (millions to billions of reads) in taxonomic space in many hundreds of different soil systems.

EMP Analysis Should Help To Validate Distribution Assumptions

The EMP is generating additional studies to aid in validating assumptions about how microorganisms are distributed across ecosystems. While taxonomic evolution of regionally isolated populations may suggest barriers to their dispersal, turnover of systems across geological time could render such differentiations moot. And while short-term predictions must consider such variations, it remains to be seen how this affects predictions about functional capabilities, which are a major goal of ecosystem modeling.

It is essential to validate any model, yet doing so can be extremely difficult. However, sampling strategies can help to fill this gap. A model that predicts how an ecosystem responds to change depends either on a fundamental understanding of the biochemical mechanisms by which the species in that system respond to changing variables or on a set of observed correlations of changes in biological variables that result from physical or chemical changes. Both should (a) predict changes in the system from conditions not used to train the model and (b) characterize the environment by predicting the impact of biological changes on physical and chemical variables.
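Criterion (a) can be illustrated with a correlation-based model that is fitted on one set of conditions and scored on conditions it has never seen. The sketch below is a minimal example under stated assumptions: the temperature and abundance values are invented, a straight-line least-squares fit stands in for whatever model a study would actually use, and the function names are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_abs_error(model, xs, ys):
    """Average absolute difference between predictions and observations."""
    slope, intercept = model
    return sum(abs(slope * x + intercept - y) for x, y in zip(xs, ys)) / len(xs)


# Hypothetical observations: water temperature (deg C) versus the
# relative abundance of one taxon.
train_temp = [8.0, 10.0, 12.0, 14.0]
train_abund = [0.10, 0.14, 0.18, 0.22]

# Held-out conditions lie outside the range used to train the model.
held_out_temp = [16.0, 18.0]
held_out_abund = [0.25, 0.31]

model = fit_line(train_temp, train_abund)
error = mean_abs_error(model, held_out_temp, held_out_abund)
print(f"slope={model[0]:.3f}, intercept={model[1]:.3f}, held-out error={error:.3f}")
```

The point of the design is that the error is computed only on the held-out conditions; a model that merely reproduces its training data says little about how the ecosystem will respond to change.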

The choice of method depends on the data available. For example, the first strategy requires a comprehensive understanding of the reaction limits and interactions within the variables of a particular biological unit. The second strategy requires a comprehensive survey of the community through time and space to define the range within the community and correlations of changes in its structure. Both methods rely on in situ and experimental observations of how the community responds to change.

However, very few long-term research stations define how particular microbial communities respond to full suites of environmental variables within, for example, a full seasonal cycle. Models that use limited data resources to predict through time and space are immensely powerful, and can inform sampling strategies. In the short term, they can identify anomalies to explore using small-scale sampling trips to refine the model. In the longer term, validated observations can be used to identify where to locate ecosystem observatories.

Other Types of Ecosystem Models Are Needed

Available ecosystem models typically deal with microbes as a black box with inputs and outputs, including for carbon, energy, and nitrogen. However, we lack models that use environmental parameters to predict microbial taxonomic community structures and to define their metabolic capabilities. Having such models would provide feedback to the physical and chemical parameters in those ecosystems.

Peter Larsen and other colleagues of ours recently developed two approaches that, when combined, could help to address those modeling needs. The first of those approaches, predictive relative metabolic turnover (PRMT) modeling, relies on relative abundances of enzyme activities, based on annotated comparative metagenomic studies, to calculate the biochemical fates of more than 900 metabolites that marine microbial communities might generate. We validate this approach by comparing the predicted turnover of carbon and phosphorus to in situ measurements of those two chemical constituents. This cyclical process provides an approach for determining how environmental conditions shape a microbial community and for predicting its functional capabilities.
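The general idea of scoring metabolite turnover from relative enzyme abundances can be sketched in a few lines. This is a toy illustration, not the published PRMT implementation: the enzyme and metabolite names, the weight table, the sign convention (positive for net production, negative for net consumption), and the use of a log2 ratio between two samples are all assumptions made for the example.

```python
import math

# Hypothetical weights linking enzyme functions to metabolites:
# +1 means the enzyme produces the metabolite, -1 means it consumes it.
ENZYME_EFFECTS = {
    "enzyme_A": {"metabolite_X": -1.0, "metabolite_Y": +1.0},
    "enzyme_B": {"metabolite_Y": -1.0},
}


def relative_turnover(abund_sample, abund_reference, effects=ENZYME_EFFECTS):
    """Score each metabolite by summing weighted log-ratios of enzyme abundance.

    A positive score suggests more net production in the sample than in the
    reference; a negative score suggests more net consumption.
    """
    scores = {}
    for enzyme, targets in effects.items():
        log_ratio = math.log2(abund_sample[enzyme] / abund_reference[enzyme])
        for metabolite, weight in targets.items():
            scores[metabolite] = scores.get(metabolite, 0.0) + weight * log_ratio
    return scores


# Invented relative abundances of enzyme functions in two metagenomes.
sample = {"enzyme_A": 0.4, "enzyme_B": 0.1}
reference = {"enzyme_A": 0.2, "enzyme_B": 0.2}

scores = relative_turnover(sample, reference)
for metabolite, score in sorted(scores.items()):
    print(f"{metabolite}: {score:+.2f}")
```

Because the scores are built from ratios between samples, they are relative: they indicate whether a metabolite is likely to be turned over more or less in one sample than in another, not its absolute concentration.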

The second of those approaches, microbial assemblage prediction (MAP), uses Bayesian networks to define relationships among physical, chemical, and biological units as a directed acyclic graph, and then overlays an artificial network of nonlinear mathematical descriptions. MAP enables us to predict the relative abundance of taxonomic units from environmental parameters. Linking MAP to PRMT enables us to extrapolate relative levels of metabolites as a function of the predicted community structure (Fig. 4). Together, these two modeling tools provide a feedback loop from environment to taxon abundance, and from metabolite turnover to environmental parameters again.
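How a graph of dependencies turns environmental parameters into predicted abundances can be shown with a toy network. The sketch below is not MAP itself: MAP infers its network and its nonlinear functions from data, whereas here the two nodes, their parents, and the logistic functions are hand-written assumptions. It relies on `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later) to evaluate each node only after its parents.

```python
import math
from graphlib import TopologicalSorter


def logistic(x):
    """Squash any real number into the range 0 to 1."""
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical network: node -> (parent nodes, nonlinear function of parents).
NETWORK = {
    "taxon_1": (("temperature",), lambda t: logistic(0.3 * (t - 12.0))),
    "taxon_2": (("nitrate", "taxon_1"), lambda n, a: logistic(n - 2.0 * a)),
}


def predict(environment, network=NETWORK):
    """Fill in every network node from measured environmental parameters."""
    graph = {node: set(parents) for node, (parents, _) in network.items()}
    values = dict(environment)
    # static_order() yields parents before children and raises CycleError
    # if the graph is not acyclic.
    for node in TopologicalSorter(graph).static_order():
        if node in values:
            continue  # measured input, nothing to compute
        parents, func = network[node]
        values[node] = func(*(values[p] for p in parents))
    return values


predictions = predict({"temperature": 12.0, "nitrate": 1.0})
print(predictions)
```

In this toy example one taxon depends only on temperature, while the other depends on both a nutrient and the first taxon, so a change in a single environmental parameter propagates through the graph to every downstream node.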

Microbial ecology research and modeling depend on coordinated sampling efforts to minimize redundancy and to improve analytic comparability. Modeling has an important role to play in improving these efforts. For example, good experimental design can lead to informative models, which then can be used to direct future experimental design and also to identify appropriate sampling strategies. We urge this research community to embrace the full gamut of such tools when designing their experiments. It is no longer acceptable to define a microbial community based on measuring only a few samples because this approach offers only limited value without a defined data management plan. To make such limited measurements more useful, it is necessary to make wider comparisons and to use models that refine sampling efforts.

ACKNOWLEDGMENTS

This work was supported by the U.S. Dept. of Energy under Contract DE-AC02-06CH11357.

SUGGESTED READING

Caporaso, J. G., D. Field, K. Paszkiewicz, R. Knight, and J. A. Gilbert. 2011. Evidence for a persistent microbial community in the Western English Channel. ISME J., in press.