Estimating variation within the genes and inferring the phylogeny of 186 sequenced diverse Escherichia coli genomes.

Abstract

BACKGROUND:

Escherichia coli exists in commensal and pathogenic forms. By measuring the variation of individual genes across more than a hundred sequenced genomes, gene variation can be studied in detail, including the number of mutations found for any given gene. This knowledge will be useful for creating better phylogenies, for determination of molecular clocks and for improved typing techniques.

RESULTS:

We find 3,051 gene clusters/families present in at least 95% of the genomes and 1,702 gene clusters present in 100% of the genomes. The former 'soft core' of about 3,000 gene families is perhaps more biologically relevant, especially considering that many of these genome sequences are draft quality. The E. coli pan-genome for this set of isolates contains 16,373 gene clusters. A core-gene tree, based on alignment, and a pan-genome tree, based on gene presence/absence, map the relatedness of the 186 sequenced E. coli genomes. The core-gene tree displays high confidence and divides the E. coli strains into the observed MLST type clades and also separates defined phylotypes.
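The distinction between the strict core, the soft core and the pan-genome can be illustrated with a minimal sketch. It assumes the gene clusters are already available as a mapping from a cluster identifier to the set of genomes that carry it; the function name `summarize_clusters` and the input layout are hypothetical and are not taken from the study's pipeline.

```python
from typing import Dict, Set


def summarize_clusters(
    cluster_to_genomes: Dict[str, Set[str]],
    n_genomes: int,
    soft_fraction: float = 0.95,
) -> Dict[str, int]:
    """Count pan, soft-core and strict-core clusters.

    cluster_to_genomes: hypothetical mapping, cluster id -> genomes carrying it.
    soft_fraction: minimum fraction of genomes for the soft core (95% here).
    """
    strict_core = 0
    soft_core = 0
    for genomes in cluster_to_genomes.values():
        fraction = len(genomes) / n_genomes
        if fraction >= soft_fraction:
            soft_core += 1  # present in at least 95% of the genomes
        if len(genomes) == n_genomes:
            strict_core += 1  # present in every genome
    return {
        "pan": len(cluster_to_genomes),  # every cluster seen at least once
        "soft_core": soft_core,
        "strict_core": strict_core,
    }
```

With 186 genomes and a 95% threshold, a cluster in this sketch would need to be present in at least 177 genomes to count towards the soft core.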

CONCLUSION:

The results of comparing a large and diverse E. coli dataset support the theory that reliable and good-resolution phylogenies can be inferred from the core-genome. The results further suggest that the resolution at the isolate level may subsequently be improved by targeting more variable genes. The use of whole genome sequencing will make it possible to eliminate, or at least reduce, the need for several typing steps used in traditional epidemiology.

Progress of Homolog Gene Cluster calculation as each genome is added. Two circles exist (red & blue) for each genome added from genome no. 9 up to and including genome no. 186. Red represents the number of core HGCs after the addition of a genome and blue represents the number of pan HGCs after the addition of a genome.
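The kind of progression plotted in this figure can be sketched as a running intersection (core) and running union (pan) over the genomes in the order they are added. This is an illustrative sketch only: the function `accumulate_core_pan` and the representation of each genome as a set of HGC identifiers are assumptions, and the sketch uses the strict definition of the core.

```python
from typing import Iterable, List, Optional, Set, Tuple


def accumulate_core_pan(genomes: Iterable[Set[str]]) -> List[Tuple[int, int]]:
    """Return (core size, pan size) after each genome is added, in order."""
    core: Optional[Set[str]] = None  # HGCs shared by every genome so far
    pan: Set[str] = set()            # HGCs seen in at least one genome so far
    progress: List[Tuple[int, int]] = []
    for clusters in genomes:
        pan |= clusters
        core = set(clusters) if core is None else core & clusters
        progress.append((len(core), len(pan)))
    return progress
```

The core count can only stay level or shrink as genomes are added, while the pan count can only stay level or grow, which is the general shape expected of the red and blue series.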

HGC Variation plot. A density plot was created from the calculation of nucleotide diversity within each HGC. The blue plot was created from all the HGCs. The red plot only includes the strict core HGCs. The green plot includes the soft core (95%) HGCs. Intersection between core plots is yellow.
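The text here does not state how nucleotide diversity was computed. A minimal sketch, assuming the common definition (the mean proportion of differing sites over all pairs of aligned sequences in an HGC) and assuming that gapped positions are skipped, could look as follows. The function name and the gap handling are illustrative choices, not the authors' documented method.

```python
from itertools import combinations
from typing import Sequence


def nucleotide_diversity(aligned: Sequence[str]) -> float:
    """Mean pairwise proportion of differing sites in an alignment.

    aligned: equal-length aligned sequences of one HGC ('-' marks a gap).
    Assumption: positions with a gap in either sequence are ignored.
    """
    if len(aligned) < 2:
        return 0.0
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same aligned length")

    total = 0.0
    pairs = 0
    for a, b in combinations(aligned, 2):
        compared = 0
        differences = 0
        for x, y in zip(a, b):
            if x == "-" or y == "-":
                continue  # skip gapped columns for this pair
            compared += 1
            if x != y:
                differences += 1
        if compared:
            total += differences / compared
            pairs += 1
    return total / pairs if pairs else 0.0
```

For the three toy sequences `ACGT`, `ACGA` and `ACGT`, two of the three pairs differ at one of four sites, so the sketch returns (0.25 + 0 + 0.25) / 3, or about 0.167.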

Box plot of MLST gene variation. A box plot presenting the distribution of nucleotide diversity within each of the three MLST schemes. The red line represents the median of percent identity for HGCs in the core (~0.018 substitutions per site).

Core-gene tree close-up on O157:H7 strains. The tree is a close-up of the O157:H7 clade from the core-gene tree presented in Figure . The names have been colored according to the three outbreaks described in []. Blue strains represent the spinach outbreak, red strains represent the Taco Bell outbreak and the green strains represent the Taco John outbreak. Branch lengths have been modified to create the best visual output and thus have no value.

General function of conserved and variable HGCs. The difference in functional annotations between conserved and variable HGCs. Conserved is here defined as the quarter of HGCs with the lowest nucleotide diversity (red bars) and variable as the quarter of HGCs with the highest nucleotide diversity (blue bars). Each HGC has a functional profile. A functional profile consists of one or more functional categories. The bars represent the percentage of HGC profiles that contain the functional category listed to the immediate left of the bars.
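The quartile split and the per-category percentages described in this caption can be sketched as below. The sketch assumes one nucleotide diversity value per HGC and one set of functional categories per HGC; the function names, the input layout and the handling of ties (broken by sort order) are hypothetical.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def split_quartiles(diversity: Dict[str, float]) -> Tuple[List[str], List[str]]:
    """Return (conserved, variable): lowest and highest quarter by diversity."""
    ranked = sorted(diversity, key=diversity.get)
    quarter = len(ranked) // 4
    if quarter == 0:
        return [], []
    return ranked[:quarter], ranked[-quarter:]


def category_percentages(
    hgcs: Iterable[str], profiles: Dict[str, Set[str]]
) -> Dict[str, float]:
    """Percentage of the given HGC profiles that contain each category."""
    hgc_list = list(hgcs)
    counts: Counter = Counter()
    for hgc in hgc_list:
        # A profile may hold several categories; each is counted once per HGC.
        counts.update(set(profiles.get(hgc, ())))
    if not hgc_list:
        return {}
    return {cat: 100.0 * n / len(hgc_list) for cat, n in counts.items()}
```

Because one profile can contain several categories, the percentages within a group need not sum to 100.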

Core-gene tree. The E. coli tree was created from the alignment of 1,278 core-genes from the 186 E. coli genomes. MLST types are annotated to the far right of each genome name. The Escherichia genus tree was created from 297 core-genes. The phylotypes, as determined by the in silico Clermont [] method, are marked with the colors blue (A), red (B1), purple (B2), green (D), and the Shigella genomes are marked with the color brown. At each node a black circle indicates a bootstrap value of 1, a grey circle a bootstrap value between 1 and 0.7 and a red number indicates an actual bootstrap value below 0.7. The dashed line in the figure represents a branch that has been manually shortened by the authors to fit the figure on a printed page. The original tree with all bootstrap values can be seen in Additional file 2. Both trees are unrooted, but the E. coli tree has been visually rooted on the node leading to Clade I.

Pan genome tree. The tree was created based on the presence or absence of 16,373 HGCs in the 186 E. coli genomes. MLST types are annotated to the far right of each genome name. The phylotypes are marked with the colors blue (A), red (B1), purple (B2), green (D), and the Shigella genomes are marked with the color brown. Bootstrap values are annotated at each node as a percentage between 0 and 100. At each node a black circle indicates a bootstrap value of 100, a grey circle indicates a bootstrap value between 100 and 70 and a red circle indicates a bootstrap value below 70. The original tree with all bootstrap values can be seen in Additional file .
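A tree built from presence/absence data needs a pairwise distance between genomes. The distance measure used for this figure is not given here, so the sketch below uses the Jaccard distance purely as an assumed stand-in; the function names and the input layout (genome name mapped to its set of HGC identifiers) are hypothetical. The resulting matrix could then be passed to a clustering method of the reader's choice.

```python
from typing import Dict, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |intersection| / |union| of two HGC sets (0.0 if both are empty)."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def distance_matrix(genomes: Dict[str, Set[str]]) -> Dict[str, Dict[str, float]]:
    """Pairwise Jaccard distances between genomes, keyed by genome name."""
    names = sorted(genomes)
    return {
        x: {y: jaccard_distance(genomes[x], genomes[y]) for y in names}
        for x in names
    }
```

Two genomes with identical HGC content have distance 0, and two genomes that share no HGCs have distance 1.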