Comparison of 61 sequenced Escherichia coli genomes.

Center for Biological Sequence Analysis, Department of Systems Biology, The Technical University of Denmark, Lyngby, Denmark.

Abstract

Escherichia coli is an important component of the biosphere and is an ideal model for studies of processes involved in bacterial genome evolution. Sixty-one publicly available E. coli and Shigella spp. sequenced genomes are compared, using basic methods to produce phylogenetic and proteomics trees, and to identify the pan- and core genomes of this set of sequenced strains. A hierarchical clustering of variable genes allowed clear separation of the strains into clusters, including known pathotypes; clinically relevant serotypes can also be resolved in this way. In contrast, when in silico MLST was performed, many of the strains appeared jumbled and were less well resolved. The predicted pan-genome comprises 15,741 gene families, and only 993 (6%) of the families are represented in every genome, comprising the core genome. The variable or 'accessory' genes thus make up more than 90% of the pan-genome and about 80% of a typical genome; some of these variable genes tend to be co-localized on genomic islands. The diversity within the species E. coli, and the overlap in gene content between this and related species, suggest a continuum rather than sharp species borders in this group of Enterobacteriaceae.

Phylogenetic tree based on extracted 16S rRNA sequences. a Comparison of 20 different Enterobacteriaceae, based on extracted 16S rRNA sequences from the GenBank sequence files. E. coli and Shigella are shown in green. b Tree of 61 sequenced E. coli (black) and related species (colored), based on the alignment of the 16S rRNA gene sequence. Apart from Shigella spp., the genes from E. albertii and E. fergusonii are also included (arrows). The 16S rRNA gene of S. enterica Typhimurium LT2 was used as the root. Bootstrap values, indicated in red, show that most nodes are predicted with uncertainty; nevertheless, the genera Escherichia and Shigella are not separated in this tree, and the three Escherichia species are also intermixed

Pan-genome clustering of E. coli (black) and related species (colored), based on the alignment of their variable gene content. The genomes now cluster according to species and a relatedness between E. coli K12 derivatives (green block) and group B isolates (orange block) is visible
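As a rough illustration of how genomes can be compared by variable gene content, the sketch below removes the families shared by all genomes and computes pairwise distances on what remains. The Jaccard distance, the function names, and the toy genome and family labels are assumptions made for this example; the caption does not state which distance measure or clustering procedure was used.

```python
from itertools import combinations
from typing import Dict, Set, Tuple


def variable_families(genomes: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Drop families present in every genome, keeping only variable ones."""
    core = set.intersection(*genomes.values())  # requires at least one genome
    return {name: fams - core for name, fams in genomes.items()}


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A and B| / |A or B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def pairwise_distances(
    genomes: Dict[str, Set[str]]
) -> Dict[Tuple[str, str], float]:
    """Jaccard distance between the variable gene content of each genome pair."""
    variable = variable_families(genomes)
    return {
        (x, y): jaccard_distance(variable[x], variable[y])
        for x, y in combinations(sorted(variable), 2)
    }


# Hypothetical toy data: gene-family identifiers per genome.
toy = {
    "genome_A": {"f1", "f2", "f3"},
    "genome_B": {"f1", "f2", "f4"},
    "genome_C": {"f1", "f5", "f6"},
}
dist = pairwise_distances(toy)
# The pair with the smallest distance would be merged first
# in an agglomerative (hierarchical) clustering.
closest = min(dist, key=dist.get)
```

A distance table of this kind could then be passed to any standard hierarchical clustering routine to obtain a tree.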

Pan- and core genome plot of the analyzed genomes. The blue pan-genome curve connects the cumulative number of gene families present in the analyzed genomes. The red core genome curve connects the conserved number of gene families. The gray bars show the numbers of novel gene families identified in each genome
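The three series in this plot can be sketched as a running union (pan-genome), a running intersection (core genome), and the per-genome count of families not seen before (novel families). This is a minimal sketch under assumptions: the function name, the input layout, and the toy family labels are hypothetical, and the step that groups genes into families is not shown.

```python
from typing import Dict, List, Set, Tuple


def pan_core_curves(
    genomes: List[Tuple[str, Set[str]]]
) -> List[Dict[str, object]]:
    """Walk through genomes in order and track pan, core, and novel counts."""
    pan: Set[str] = set()
    core: Set[str] = set()
    rows: List[Dict[str, object]] = []
    for i, (name, families) in enumerate(genomes):
        novel = families - pan          # families not seen in earlier genomes
        pan |= families                 # cumulative union: pan-genome
        # cumulative intersection: core genome (seeded by the first genome)
        core = set(families) if i == 0 else core & families
        rows.append(
            {"genome": name, "pan": len(pan), "core": len(core), "novel": len(novel)}
        )
    return rows


# Hypothetical toy data: gene-family identifiers per genome.
toy = [
    ("genome_A", {"f1", "f2", "f3"}),
    ("genome_B", {"f1", "f2", "f4"}),
    ("genome_C", {"f1", "f5"}),
]
rows = pan_core_curves(toy)
for row in rows:
    print(row)
```

The final pan and core sizes do not depend on the order in which genomes are added, but the shape of the curves and the per-genome novel counts do.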

BLAST atlas. In the middle, a genome atlas of E. coli O157:H7 strain EC4115 is shown, around which BLAST lanes are shown. Every lane corresponds to a genome, with the following colors (going outwards): green E. coli O157:H7 (15 lanes); light blue E. coli LANL strains (two lanes); dark blue Shigella spp. (eight lanes); red E. coli K12 and derivatives (six lanes); orange E. coli strain B phylogroup (four lanes); followed by all other E. coli genomes in different colors. The outermost three lanes represent E. fergusonii, E. albertii, and S. enterica Typhimurium LT2. Lack of color indicates that the genes at that position in strain EC4115 were not found in the genome of that lane. The positions of the replication origin and terminus are indicated