Why sequence multiple species and strains?


The wide variety of
microbial sequencing projects completed or under way
throughout the world has created a rich and diverse
‘mega-database’ of microbial genomes. However, to fully gauge the
prevailing diversity and stratification patterns of all bacterial
species, it will be necessary to sequence hundreds or even thousands of
genomes representing all branches and lineages within the bacterial and
archaeal parts of the tree of life, wherein each phylum provides
an opportunity to capture the evolutionary footprints of billions of years.
It is estimated that there are at least 35 different phyla of bacteria
according to the rRNA gene sequence based tree of life [12]. The
genome sequences of bacteria that have accumulated so far represent
only three phyla, thus leaving major gaps in the genomic representation
of the bacterial diversity of our biosphere. There is therefore an
urgent need to sequence genomes from underrepresented phyla and
to improve the resolution of deep branches in the bacterial tree so as to
enable biological studies of important lineages and to decipher their
novel functions. In view of these facts, more systematic approaches to
microbial genome sequencing are needed, both to provide data for
interpreting environmental surveys and to facilitate
comparative genomic analyses and annotation of different genomes and
microbiomes. The GEBA (Genomic Encyclopedia of Bacteria and Archaea)
project is one such ‘community phylogenomics’ initiative that is being
implemented at the Joint Genome Institute (http://www.jgi.doe.gov/programs/GEBA/).
This program aims at filling the genomic gaps pertaining to bacterial
and archaeal branches of the tree of life while using the tree itself
as a guide to identify which target microorganisms need to be sequenced
completely. Some of the potential benefits of the GEBA project include
identification of new protein families across different lineages of
bacterial phyla so as to provide a comparative genomics and proteomics
platform towards annotation of forthcoming genomes and microbiomes of
the same or different phyla. It will also facilitate improved
phylogenetic anchoring of metagenomic data sets and provide a
better understanding of the processes underlying the evolutionary
diversity and functional stratification of microbes
inhabiting different niches in the environment.

Many
pathogenic bacterial species are monomorphic, meaning that they show
very little diversity upon genetic fingerprinting or limited sequence
profiling. Gaining insights into their dispersal patterns, evolutionary
genetics, emergence and reemergence in different communities and
catchments poses a great challenge for molecular epidemiologists.
Multiple genome sequences from strains of a single species offer
finer-scale resolution of genetic differences, which enables tracking
and identification of species and the development of additional genetic
markers.

Prokaryotes evolve largely by horizontal gene acquisition, vertical genome reduction and in-situ gene duplication, which together shape an optimal repertoire of genes and elements to support a successful lifestyle [7].
Lateral gene flow is widespread among different strains of a single
species, and most bacteria acquire novel functions by
harnessing the functional attributes of genes gained through
such recombinational processes. One important message that has emerged
from the analyses of complete genomes is that microbes are diverse and
highly adaptable. To understand why, we need further insights
from individual and community level genomics. Such federated
genomics approaches are also likely to help us answer several
outstanding questions: how virulence evolves as a function of
genome optimization under the different constraints imposed by a colonized
niche; how microbes regulate their genomic streamlining; which
environmental stimuli are responsible for the diversification and
stratification of microbial lineages; what functional
significance prokaryotic genomic diversity has, especially in the context
of host and tissue tropism and of parasitism versus
commensalism; and how microbial genome data and the observed
diversity can be experimentally harnessed for the generation and selection
of optimally adapted microorganisms. These questions clearly underpin
the case for sequencing additional representatives of different
pathogenic microbial species.

Novel genes constantly emerge from newly sequenced replicate genomes [13], [14],
which has given rise to the concept of a ‘dockyard’ of genes (of presumably unknown
function) that each strain harbors. This paradigm is
supported by analyses in which the pan-genome of a true bacterial
species is described as ‘open’, so that each new genome sequence
adds dozens of new genes to the existing pan-genome, as shown for Streptococcus agalactiae [14]. It is also clear from previous studies that such a pool of strain-specific genes in pathogens such as Helicobacter pylori, termed the ‘plasticity region cluster’, could be useful in adaptation to a particular host population [15].
This pathogen shows very strong geographic adaptation and is known
to harbor up to 45% strain-specific genes, most of them gained
through horizontal gene transfer [7], [15].
Recently, members of the plasticity region cluster were shown
likely to be involved in promoting the proinflammatory potential of some
strains, thus providing a survival advantage [16], [17].
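
The notion of an ‘open’ pan-genome can be illustrated with a small sketch. The Python snippet below is not taken from the cited studies; it assumes that each genome has already been reduced to a set of gene-family labels, and the strain names and labels (`strain_A`, `g1`, and so on) are entirely hypothetical. It adds genomes one at a time and records how many previously unseen genes each contributes, together with the running sizes of the pan-genome (union) and the core genome (intersection).

```python
from typing import Dict, List, Set


def pan_genome_growth(strains: Dict[str, Set[str]], order: List[str]) -> List[dict]:
    """Add genomes in the given order; track new genes, pan-genome and core sizes."""
    pan: Set[str] = set()
    core: Set[str] = set()
    rows = []
    for i, name in enumerate(order):
        genes = strains[name]
        new_genes = genes - pan          # genes not seen in any earlier genome
        pan |= genes                     # pan-genome: union of all genes so far
        core = set(genes) if i == 0 else core & genes  # core: shared by all so far
        rows.append(
            {"strain": name, "new": len(new_genes), "pan": len(pan), "core": len(core)}
        )
    return rows


# Hypothetical gene-family content for three made-up strains (illustration only).
toy_strains = {
    "strain_A": {"g1", "g2", "g3", "g4"},
    "strain_B": {"g1", "g2", "g3", "g5", "g6"},
    "strain_C": {"g1", "g2", "g7"},
}

growth = pan_genome_growth(toy_strains, ["strain_A", "strain_B", "strain_C"])
for row in growth:
    print(row)
```

In this toy data every added genome still contributes at least one new gene, which is the behavior expected of an open pan-genome; in a closed pan-genome the "new" count would fall to zero as more genomes are added.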

Another important
reason to sequence replicate genomes of a prokaryotic species is the
need to study the chronological evolution of bacterial pathogens within
their hosts. The nature and extent of the genetic polymorphisms accumulated
in the genomes of bacterial pathogens over long timescales and during
the colonization of different host niches are not known. How such
polymorphisms relate to fitness in pathogens or commensals needs
additional in-depth study. While some studies have explored
chronological strain diversity through genetic fingerprinting [18], microarrays [19] and limited sequencing [20],
whole genome profiling of isolates obtained at different time points
and sampled from different sites is required to investigate the
frequency and timing of the emergence of small insertions, deletions
and substitutions and their functional significance in terms of
adaptive mechanisms.

With complete
genomes of multiple variants of a closely related group (genus or
species), it is possible to test evolutionary hypotheses based on the
core genes of the group. The phylogenetic relatedness of such core
genes can then be harnessed to examine a larger collection of strains
by multilocus sequence typing (MLST). This genome sequence based
approach has already revolutionized the molecular epidemiology and
evolutionary genetics of many bacterial pathogens, as previously
reviewed [21]. The most noteworthy case is that of Leptospira interrogans,
whose genome sequences have yielded significant insights into
how virulence evolves as pathogens pass from one
intermediate host to another. This has been facilitated by
comparative genomics with the genome sequence of the saprophytic L. biflexa [22], which revealed differences between saprophytic and pathogenic species, as well as by genome-guided insights into the phylogeny of the various species of the pathogen [23]. Based on the core genome of pathogenic and saprophytic strains, a sensitive and accurate MLST method [24]
was developed to track and analyze individual strains of
different species at the population level, a task that was otherwise
impossible with traditional serotyping approaches. This is because
the serotype is often influenced by frequent lateral gene transfer
events within the loci that determine the repertoire of cell surface
antigens.
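
The general logic of MLST can be sketched in a few lines. The following Python example is a simplified illustration and not the published Leptospira scheme [24]: the locus names, isolate names and sequences are hypothetical, and allele and sequence type (ST) numbers are simply assigned in order of first appearance rather than looked up in a curated database. Each distinct sequence at a locus receives an allele number, and each distinct combination of allele numbers (the allelic profile) receives an ST.

```python
from typing import Dict, List, Tuple

Profile = Tuple[int, ...]


def assign_sequence_types(
    isolates: Dict[str, Dict[str, str]], loci: List[str]
) -> Dict[str, Tuple[Profile, int]]:
    """Return {isolate: (allelic profile, sequence type)} for the given loci."""
    allele_ids: Dict[str, Dict[str, int]] = {locus: {} for locus in loci}
    st_ids: Dict[Profile, int] = {}
    result: Dict[str, Tuple[Profile, int]] = {}
    for name, seqs in isolates.items():
        profile = []
        for locus in loci:
            table = allele_ids[locus]
            seq = seqs[locus].upper()
            if seq not in table:
                table[seq] = len(table) + 1   # next unused allele number at this locus
            profile.append(table[seq])
        key = tuple(profile)
        if key not in st_ids:
            st_ids[key] = len(st_ids) + 1     # next unused sequence type number
        result[name] = (key, st_ids[key])
    return result


# Hypothetical core-gene fragments for three made-up isolates (illustration only).
loci = ["locus1", "locus2", "locus3"]
isolates = {
    "iso1": {"locus1": "ATGC", "locus2": "GGTA", "locus3": "TTAA"},
    "iso2": {"locus1": "ATGC", "locus2": "GGTA", "locus3": "TTAA"},
    "iso3": {"locus1": "ATGT", "locus2": "GGTA", "locus3": "TTAG"},
}

types = assign_sequence_types(isolates, loci)
for name, (profile, st) in types.items():
    print(name, profile, f"ST{st}")
```

Because the profile is built from core genes shared by all strains, isolates with identical sequences at every locus fall into the same ST, whatever their surface antigens.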

Leaving aside the
genetic diversity of naturally occurring populations,
differences among isolates of even a single laboratory strain can be
highly significant in genetic experiments. Using whole genome sequence
determination, several important polymorphisms were detected in
replicate genomes of a single strain of Bacillus subtilis [25].
Such approaches allow rapid identification and mapping of single
nucleotide polymorphisms and mutations linked to different phenotypes,
and they are less laborious and cheaper than genetic
mapping experiments.
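
As a minimal sketch of how replicate genomes can be compared, the Python function below lists single nucleotide differences between two sequences. It assumes the sequences are already aligned and of equal length, and the short sequences used here are made up for illustration. Real analyses rely on read mapping and dedicated variant callers, so this shows only the underlying comparison, not the method used in [25].

```python
from typing import List, Tuple


def find_snps(reference: str, isolate: str) -> List[Tuple[int, str, str]]:
    """Return (1-based position, reference base, isolate base) for each mismatch.

    Positions where either sequence has a gap or an ambiguous base are skipped.
    """
    if len(reference) != len(isolate):
        raise ValueError("sequences must be aligned to the same length")
    snps = []
    pairs = zip(reference.upper(), isolate.upper())
    for pos, (ref_base, iso_base) in enumerate(pairs, start=1):
        if ref_base != iso_base and ref_base in "ACGT" and iso_base in "ACGT":
            snps.append((pos, ref_base, iso_base))
    return snps


# Hypothetical aligned fragments from two replicate genomes (illustration only).
reference_fragment = "ATGCCGTAAGCT"
isolate_fragment = "ATGCTGTAAGCA"

snps = find_snps(reference_fragment, isolate_fragment)
print(snps)
```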