The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact journals.permissions@oxfordjournals.org

Abstract

We sequenced and analyzed the genome of a commensal Escherichia coli (E. coli) strain SE11 (O152:H28) recently isolated from feces of a healthy adult and classified into E. coli phylogenetic group B1. SE11 harbored a 4.8 Mb chromosome encoding 4679 protein-coding genes and six plasmids encoding 323 protein-coding genes. None of the SE11 genes had sequence similarity to known genes encoding phage- and plasmid-borne virulence factors found in pathogenic E. coli strains. The comparative genome analysis with the laboratory strain K-12 MG1655 identified 62 poorly conserved genes between these two non-pathogenic strains and 1186 genes absent in MG1655. These genes in SE11 were mostly encoded in large insertion regions on the chromosome or in the plasmids, and were notably abundant in genes of fimbriae and autotransporters, which are cell surface appendages that largely contribute to the adherence ability of bacteria to host cells and bacterial conjugation. These data suggest that SE11 may have evolved to acquire and accumulate the functions advantageous for stable colonization of intestinal cells, and that the adhesion-associated functions are important for the commensality of E. coli in human gut habitat.

Key words: Escherichia coli, commensal, human gut, genome sequencing

1. Introduction

Microbial communities (microbiota) inhabiting sites of the human body have long been recognized to play critical roles in human health and disease. The collective genomes (microbiome) of the human microbiota have now become important targets of study in both microbiology and human biology.1 Among the human microbiota, the gut microbiota are the richest in microbial species, comprising ≥1000 species that form a very complex and dynamic microbial community with high interindividual variation.2 Large-scale bacterial 16S ribosomal RNA sequence and metagenomic analyses of the gut microbiome have greatly advanced our understanding of the ecological and biological nature of the human gut microbiota.3–7 However, genomic sequences of the individual members constituting the microbiota are also needed to interpret more precisely the enumerative data that will accumulate in future studies, including the International Human Microbiome Project.8

Escherichia coli (E. coli) is one of the common members of the human gut microbiota. Over the past decades, there have been many reports on the phylogenetic and genomic analyses of E. coli strains isolated from various sources including humans, animals and various environments.9–19 Among isolated E. coli strains, whole-genome sequencing has been performed extensively for pathogenic strains to explore their pathogenicity and identify virulence-associated genes.20–26 In contrast, whole-genome sequencing of non-pathogenic E. coli strains has been limited to several E. coli K-12 strains that have long been used in genetic studies and recombinant DNA technologies.27–30 Since the E. coli K-12 strain was originally isolated from the stool of a convalescent diphtheria patient in 1922, the sequenced K-12-derived strains, MG1655, W3110 and DH10B, may have undergone spontaneous genetic changes during preservation and successive passages in the laboratory, resulting in the accumulation of mutations in genes and the loss of many features representing wild-type commensal E. coli.31–33 Nevertheless, genomic sequencing of human commensal E. coli strains remains scarce. Only two human commensal strains, HS and Nissle 1917, have been completely or partially sequenced.26,34,35 This is surprising because a wild-type commensal strain provides a good reference genome for human gut microbiota research and is useful for exploring the genetic and functional features adapted to the human gut habitat, and comparison with the laboratory strain K-12 or pathogenic strains will provide new insights into the structural and evolutionary aspects of commensal E. coli strains.

In this study, we sequenced the genome of a commensal E. coli strain, designated SE11, isolated from the feces of a healthy adult and performed a comparative analysis with other sequenced E. coli genomes. This paper may be the first report of a complete genome sequence analysis of a wild-type commensal E. coli strain belonging to phylogenetic group B1, which is distant from strains K-12 and HS belonging to phylogenetic group A in the E. coli reference (ECOR) collection.36

2. Materials and methods

One gram of feces collected from a healthy adult human was suspended in 9.0 mL of phosphate-buffered saline (pH 7.0). Serially diluted solutions were inoculated on deoxycholate hydrogen sulfide lactose (DHL) agar (Eiken Chemical Co. Ltd.) and incubated at 37°C for 24 h. Eight red colonies on the DHL agar plates were picked and subjected to single colony isolation twice on Luria-Bertani (LB) agar plates. The eight isolates were identified as E. coli on the basis of the following characteristics: gram negative, rod shape, growth under aerobic and anaerobic conditions, spore formation negative, motile, production of gas/lactic acid from glucose/lactose. Each isolate was grown in LB broth at 37°C for 24 h and stored in LB medium containing 10% glycerol at −85°C until used for further analysis.

2.2. Random amplification of polymorphic DNA fingerprinting

Eight E. coli isolates were analyzed by the random amplification of polymorphic DNA (RAPD) fingerprinting method using three primers (1247, 5′-AAGAGCCCGT-3′; 1254, 5′-CCGCAGCCAA-3′; 1290, 5′-GTGGATGCGA-3′).37 A fresh colony grown on an LB agar plate was transferred to a 1.5 mL microtube. The cells were disrupted by microwave treatment (500 W for 1 min) and suspended in 5.0 µL of double-distilled water (ddH2O). After a brief spin-down, the supernatant was used as template DNA in the RAPD analysis. The 50.0 µL polymerase chain reaction (PCR) mixture contained 5.0 µL of template DNA, 8.0 µL of each primer (10 µM), 5.0 µL of 10× PCR buffer, 4.0 µL of dNTP mixture, 0.25 µL of Ex Taq polymerase (Takara Bio Inc.), and 27.75 µL of ddH2O. PCR amplification was performed in the iCycler Thermal Cycler (Bio-Rad) according to the following protocol: 1 cycle of 10 min at 94°C; 30 cycles of 1 min at 94°C, 1 min at 55°C, and 2 min at 72°C; and 1 cycle of 10 min at 72°C. Amplified DNA fragments were separated on 1.0% agarose gels (100 V for 30 min) and stained with ethidium bromide (0.2 µg/mL) for 30 min.

2.3. Genome sequencing

The genome sequence of SE11 was determined by a whole-genome shotgun strategy. We constructed small-insert [2 kilobases (kb)], large-insert (10 kb) and fosmid (40 kb) genomic libraries, and generated 55 296 sequences using ABI 3730xl sequencers (Applied Biosystems), giving eightfold coverage from both ends of the genomic clones. Sequence reads were assembled with the Phred–Phrap–Consed program38 and gaps were closed by direct sequencing of clones that spanned the gaps or of PCR products amplified with oligonucleotide primers designed to anneal to each end of neighboring contigs. The overall accuracy of the finished sequence was estimated to have an error rate of <1 per 10 000 bases (Phrap score of ≥40).
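The correspondence between the quoted Phrap score and the error rate follows the standard Phred-scale definition, Q = −10 log10(P), where P is the probability that a base call is wrong. A minimal Python sketch of that conversion is given below; the function names are illustrative only and are not part of the Phred–Phrap–Consed package.

```python
import math


def phred_to_error(q: float) -> float:
    """Return the per-base error probability for a Phred-scaled score q."""
    return 10 ** (-q / 10)


def error_to_phred(p: float) -> float:
    """Return the Phred-scaled score for a per-base error probability p."""
    if p <= 0:
        raise ValueError("error probability must be positive")
    return -10 * math.log10(p)


# A score of 40 corresponds to roughly one error per 10 000 bases.
expected_errors_per_10kb = phred_to_error(40) * 10_000
```

Under this definition, a score of ≥40 is equivalent to an error rate of at most 1 per 10 000 bases, which is the figure stated for the finished sequence.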

3. Results

3.1. Isolation and phylogenetic analysis of SE11

We isolated eight E. coli strains from feces of a healthy adult as described in Materials and methods, and examined them by the RAPD method. Seven strains exhibited the same RAPD patterns in respective experiments using three different primer sets, indicating that their genomes are structurally identical. We therefore selected one E. coli strain, designated SE11, for further analysis and sequencing. The 16S rRNA sequence of SE11 showed the highest similarity (98.8% identity) to that of E. coli ATCC 11775T (accession no. X80725). From the MLST analysis based on the nucleotide sequences of seven housekeeping genes,42 SE11 was found to belong to phylogenetic group B1, whose members predominate in the human gut microbiota,9 and to be phylogenetically distinct from the K-12 and HS strains in group A and even more distant from human pathogenic strains, which mostly belong to group B2 or E in the ECOR collection (Supplementary Fig. S1). Most commensal E. coli strains belonging to groups A and B1 were shown to be avirulent in mice.43 SE11 has an O152:H28 serotype, which is less frequently found in enteroinvasive E. coli.44,45

3.2. General features and gene content in SE11

The genome of E. coli SE11 consists of a circular chromosome of 4 887 515 bp and six plasmids (100.0, 91.2, 60.6, 6.9, 5.4 and 4.1 kb) (Figs 1 and 2). General features of the SE11 genome are shown in Table 1. The chromosome size of SE11 is larger than those of the laboratory K-12 strains, and smaller than those of pathogenic strains sequenced to date (Supplementary Table S1). The SE11 chromosome contained 4679 predicted protein-coding genes, 86 tRNA genes, and 22 rRNA genes, and the six plasmids contained a total of 323 predicted protein-coding genes. Of all protein-coding genes predicted in SE11, we could assign 2944 (59%) protein-coding genes to known functions, 1895 (38%) to genes of unknown function conserved in many bacterial genomes, and 163 (3%) to novel hypothetical genes. We identified 52 copies of insertion sequence (IS) elements in the SE11 genome (Supplementary Table S2). These IS elements are classified into 27 families, and the IS677 family (10 copies as intact forms) was most predominant in SE11.

Circular representation of the SE11 chromosome. From the outside in: circles 1 and 2 of the chromosome show the positions of protein-coding genes on the positive and negative strands, respectively. Circles 3 and 4 show the positions of protein-coding...

Comparison of all 5002 protein-coding genes in SE11 with those in the strain K-12 MG1655 identified 1186 genes absent in MG1655, 62 poorly conserved genes and 3754 highly conserved genes. The classification of the 5002 protein-coding genes in SE11 is summarized in Fig. 3. Of the 3754 highly conserved genes, 2802 were also conserved in all 14 sequenced E. coli genomes. Of the 1186 genes, 170 were unique to SE11 among the 14 E. coli genomes. The 1186 genes absent in MG1655 comprised 438 mobile elements-related, 356 conserved function-unknown, 108 hypothetical and 284 genes with assigned functions including metabolic genes of oligosaccharides such as sugar, cellobiose, mannose and N-acetylgalactosamine, a gene for bile salt hydrolase, tetracycline resistance genes and genes associated with fimbriae on the bacterial cell surface (discussed later). The 356 conserved hypothetical genes and 284 genes with assigned functions in SE11 are listed in Supplementary Table S3. On the other hand, the 317 genes that were present in MG1655 but absent in SE11 comprised 186 (59%) mobile elements-related and 131 unique genes including those involved in the restriction/modification system and acetoacetate metabolism (Supplementary Table S4). The 62 poorly conserved genes between SE11 and MG1655 may have a higher mutation rate than other conserved E. coli genes and included some genes of lipopolysaccharide (LPS) biosynthesis (Supplementary Table S5). The outer core oligosaccharide in LPS is highly variable in structure, and five distinct outer core types are known in E. coli.46 The outer core oligosaccharide of SE11 was found to be of the R3 type in this study. The comparative analysis with the phylogenetically closest strain, E24377A, an enterotoxigenic E. coli (ETEC) isolate, also revealed that the two strains shared the highest number of orthologs (4112), of which only 41 protein-coding genes were not found in other sequenced E. coli strains and may be regarded as group B1-specific genes (Supplementary Table S3). From these comparative analyses of E. coli genes, it was found that SE11 lacked genes homologous to known or suspected toxins and extracellular enzymes such as Shiga toxins, alpha-hemolysin and enterohemolysin that are involved in the virulence of pathogenic E. coli strains, as well as genes for the heat-labile and heat-stable enterotoxins encoded on the plasmids in ETEC E24377A.26
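The gene counts quoted in this section can be cross-checked with simple arithmetic. The short Python sketch below only re-adds the numbers stated in the text; the variable names and the helper `share` are illustrative and do not correspond to any pipeline used in the study.

```python
# Gene counts as quoted in the text (chromosome + six plasmids).
total_se11 = 4679 + 323

# Partition of all SE11 protein-coding genes relative to K-12 MG1655.
comparison = {
    "absent in MG1655": 1186,
    "poorly conserved": 62,
    "highly conserved": 3754,
}

# Breakdown of the genes absent in MG1655.
absent_breakdown = {
    "mobile elements-related": 438,
    "conserved, function unknown": 356,
    "hypothetical": 108,
    "assigned function": 284,
}


def share(part: int, whole: int) -> float:
    """Percentage of `whole` represented by `part`, to one decimal place."""
    return round(100 * part / whole, 1)


partition_is_complete = sum(comparison.values()) == total_se11
breakdown_is_complete = sum(absent_breakdown.values()) == comparison["absent in MG1655"]
mobile_share_mg1655 = share(186, 186 + 131)  # genes present only in MG1655
```

Both partitions add up to the stated totals (5002 and 1186), and the mobile element-related share of the 317 MG1655-only genes rounds to the 59% reported.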

Classification of all 5002 protein-coding genes in SE11 based on comparison with those in MG1655 and 12 other E. coli strains. The 5002 protein-coding genes annotated in SE11 were compared with those in 13 other sequenced E. coli strains and classified...

3.3. Prophages and integrative elements

In the SE11 chromosome, there are seven prophage regions (PP_SE11-1 to -7; 36–53 kb in length) and three integrative elements (IE_SE11-1 to -3; 7–33 kb in length) that contained an integrase gene but no genes for apparent phages, transposons or integrative conjugative elements. Many of these integrated regions were flanked by short sequence duplications that are hallmarks of lateral transfer events (Table 2). Comparative analysis with MG1655 showed that the SE11 chromosome contained 47 regions (>5 kb) that are absent in MG1655 (Fig. 4). Of these additional regions in the SE11 chromosome, we identified nine large segments (>30 kb), eight of which overlapped with all seven prophage regions (PP_SE11-1 to -7) and one integrative element (IE_SE11-1) described earlier (Table 2). Only one large segment (ECSE_0213–0239), near the aspV tRNA gene, contained no apparent integrase gene, phage-related gene, transposase gene or direct repeat, and mostly encoded proteins of unknown function. Integrative elements corresponding to this large segment were also retained at the same loci in the chromosomes of E24377A and the phylogenetically distant strains EHEC O157 and UPEC 536, suggesting that MG1655 might have lost this locus during evolution (Supplementary Fig. S2).

Locations and lengths of the strain-specific segments. Horizontal axis represents the MG1655 chromosome location and vertical axis shows lengths of the strain-specific segments (>5 kb) compared with the MG1655 chromosome. The positions of PP_SE11...

The genomes of pathogenic E. coli strains contain many prophages and other genetic elements that are the major sources for genes encoding virulence factors, such as toxins, type III secretion systems (TTSS), and effector proteins secreted by the TTSS.47,48 We compared the highly conserved prophages in SE11, MG1655 and O157 Sakai to analyze differences in structure and gene contents. PP_SE11-1 exhibits structural features similar to those of lambda-like prophage Sp8 in O157 Sakai and lambda-like prophage DLP12 in MG1655 at the same integration loci, but both PP_SE11-1 and DLP12 lacked the virulence-related catalase gene in Sp8 (Fig. 5A). PP_SE11-1 and DLP12 share 19 genes including the nmpC gene encoding a porin protein that allows small metabolites such as sugars, ions and amino acids to permeate. In addition, PP_SE11-1 and PP_SE11-5 shared a 23 kb almost identical segment that contains the additional nmpC homolog, suggesting that a very recent duplication of these regions may have occurred in SE11 (data not shown). The integrated locus of PP_SE11-2 is the same as those of lambda-like prophage Sp10 in O157 Sakai and lambda-like prophage Rac in MG1655 (Fig. 5B). These three prophages share many conserved genes but genes for three TTSS effectors and a Cu/Zn-superoxide dismutase encoded by Sp10 were missing in PP_SE11-2 and Rac. PP_SE11-3 was also found to integrate at the same locus as those of lambda-like prophages Sp11–Sp12 in O157 Sakai and lambda-like prophage Qin in MG1655, but the genes for TTSS effectors and a transcriptional regulator (PchB) encoded in Sp11–Sp12 were missing in PP_SE11-3 and Qin (Fig. 5C). Taken together with the results obtained from the analysis of genes in SE11, these data further indicate that SE11 and MG1655 do not possess prophage-borne virulence-associated genes found in O157 Sakai, despite the high conservation of these integrated elements among the three evolutionarily distant strains (Supplementary Fig. S1). 
At present, it is unknown whether the ancestral E. coli acquired virulence genes by prophage integration, which were then retained in O157 Sakai and lost in SE11, or whether it acquired non-virulent prophages, after which O157 Sakai independently acquired the virulence genes but SE11 did not.

Comparisons of the genomic location of three SE11 prophages with the corresponding location of the related prophages of K-12 and O157 strains. Genomic organizations of PP_SE11-1 (A); PP_SE11-2 (B) and PP_SE11-3 (C). Genes and their orientations are depicted...

3.4. Plasmids

The six plasmids in SE11 encoded a total of 323 protein-coding genes (Table 1 and Fig. 2). The copy number of each plasmid in SE11 was estimated from the number of sequence reads assembled into the respective plasmid to be one copy for pSE11-1, pSE11-2, pSE11-3, and pSE11-4, and ∼2 copies for pSE11-5 and pSE11-6. Three small plasmids (pSE11-4, pSE11-5 and pSE11-6) were found to be cryptic. Four plasmids (pSE11-1, pSE11-2, pSE11-3 and pSE11-6) had genes encoding a replication protein. The replication proteins of pSE11-1, pSE11-2 and pSE11-3 showed high sequence similarity to those of IncFII, ColV and F plasmids, respectively, and pSE11-6 had a replication protein 100% identical to that of the plasmid pSMS35_130 of E. coli SMS-3-5. Two plasmids (pSE11-4 and pSE11-5) have genes for a mobilization protein. Thus, the six plasmids found in SE11 are compatible in a cell. The pSE11-1 (100 021 bp) contained gene sets almost identical to those in the conjugative plasmid ColIb-P9 (93 399 bp, accession no. AB021078) except for several genes including the tetracycline resistance genes tetR (ECSE_P1-0010) and tetA (ECSE_P1-0011). Both pSE11-1 and ColIb-P9 encoded the same set of genes for conjugational transfer (tra and trb genes), biogenesis of type IV pili (pil genes), and colicin Ib production and immunity. It has been reported that type IV pili encoded by IncI1 group plasmids of enteric bacteria (e.g. ColIb-P9) are required both for plasmid conjugation and adherence to host epithelial cells.49 The pSE11-2 (91 158 bp) is a conjugative plasmid containing the fimbrial operon (ECSE_P2-0001-0005) homologous to that of the F1 (Caf1) pili biogenesis whose genes are encoded on the virulence plasmid pMT1 in Yersinia pestis.50 The gene products encoded by the fimbrial operon in pSE11-2 and the caf1 operon in pMT1 showed 31–70% amino acid sequence identities.
The pSE11-3 is a non-conjugative plasmid of 60 555 bp, and contained two chaperone-usher fimbrial operons. One is the fae operon encoding F4 (or K88) fimbriae (ECSE_P3-0031-0037), which was flanked by transposase genes. F4 fimbriae are the major colonization factors in some ETEC strains associated with porcine neonatal and postweaning diarrhea.51 The other fimbrial operon (ECSE_P3-0060-0066) showed no strong similarity to entries in public databases.
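The read-count-based copy-number estimate mentioned above for the plasmids can be illustrated with a simple depth ratio. The sketch below is one plausible way to compute it, not necessarily the authors' exact procedure: it assumes reads of similar length and no cloning bias, and the function name and the read counts in the example are hypothetical (only the chromosome length is taken from the text).

```python
def relative_copy_number(
    plasmid_reads: int,
    plasmid_length: int,
    chromosome_reads: int,
    chromosome_length: int,
) -> float:
    """Estimate plasmid copies per chromosome from shotgun read counts.

    Reads per base pair on the plasmid are divided by reads per base pair
    on the chromosome, so a value near 1 means single copy.
    """
    if min(plasmid_length, chromosome_length, chromosome_reads) <= 0:
        raise ValueError("lengths and chromosome read count must be positive")
    plasmid_depth = plasmid_reads / plasmid_length
    chromosome_depth = chromosome_reads / chromosome_length
    return plasmid_depth / chromosome_depth


# Hypothetical read counts, for illustration only.
estimate = relative_copy_number(
    plasmid_reads=110,
    plasmid_length=5_400,          # approximate size of a 5.4 kb plasmid
    chromosome_reads=50_000,
    chromosome_length=4_887_515,   # SE11 chromosome length from the text
)
```

With these made-up counts the ratio comes out close to two copies per chromosome. For very small plasmids the number of reads is low, so such an estimate is inherently coarse.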

3.5. Genes for fimbriae and autotransporter in SE11

Three loci of chaperone-usher pathways and one operon of type IV pilus encoded on the SE11 plasmids were almost completely missing in other sequenced E. coli strains. Certain E. coli strains were shown to be fimbriated and to gain the ability to adhere to host intestinal cells through the presence of a plasmid encoding fimbrial genes.52 SE11 also contains at least 13 loci for fimbrial biosynthesis on the chromosome, giving a total of 17 loci, many of which were missing or present as truncated forms in other sequenced E. coli genomes (Table 3). MG1655 lacked two of the 13 chromosomal loci encoding fimbrial biosynthesis in SE11. One of these two loci is the lpf operon (ECSE_4015–4018) for the synthesis of long polar fimbriae that are known to mediate bacterial cell adhesion to host epithelial cells.53 Of the sequenced E. coli strains, E24377A, SMS-3-5 and O157 Sakai contained the lpf operon at the position between glmS and pstS. The lpf operon in SE11 is almost identical to that in E24377A (99–100% amino acid sequence identity) and divergent from those of SMS-3-5 (83–98%) and O157 (34–63%). The other fimbrial operon locus (ECSE_3375–3378) is similar to that of CS1 fimbriae, which are a major colonization factor of some ETEC strains.54 The CS1-like fimbrial operon in SE11 is also conserved in E24377A, HS, ATCC 8739, UPEC strain 536 and SMS-3-5, with 71–100% amino acid sequence identities. Of the fimbrial operons conserved between SE11 and MG1655, only three genes (ECSE_2643–2645) in the yfc operon showed low similarity (52–59% amino acid sequence identities) between the two strains, while the three genes showed 98–99% amino acid sequence identities with those of Shigella flexneri.
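The amino acid sequence identities quoted above are percentages of identical residues between aligned proteins. The text does not state which alignment tool or denominator was used, so the following Python sketch shows just one common convention, assumed here for illustration: identical positions divided by the number of aligned positions without gaps. The function name and the example sequences are hypothetical.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity between two pre-aligned sequences of equal length.

    Columns where either sequence has a gap are excluded from the count.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    columns = [
        (x, y) for x, y in zip(aligned_a, aligned_b) if x != gap and y != gap
    ]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


# Toy aligned peptides (not real SE11 sequences): 7 of 8 residues match.
toy_identity = percent_identity("MKTAYIAK", "MKTAHIAK")
```

Other conventions divide by the full alignment length or by the length of the shorter protein, which gives lower values when the alignment contains gaps, so identities from different tools are not always directly comparable.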

Several autotransporters such as E. coli AIDA-I and Ag43 are also known to function as adhesins.55 Autotransporters are a large and diverse superfamily of proteins that are composed of an N-terminal variable passenger domain translocated across the membrane and a C-terminal beta domain. SE11 possesses at least eight genes encoding intact autotransporters (Table 4), of which five (ECSE_1215, ECSE_1251, ECSE_1600, ECSE_2459 and ECSE_2494) contained the pertactin motif (Pfam PF03212) and thus may function as adhesins like the pertactins of Bordetella.56 Other sequenced E. coli strains also possess homologs of these genes in various combinations, but many of them are present as pseudogenes or are completely absent (Table 4). For instance, MG1655 has orthologs of six of the eight autotransporter genes encoded in SE11, but three of the six genes were fragmented and seemed to be no longer functional. O157 Sakai and three UPEC strains (CFT073, UTI89 and 536) possess only four autotransporter genes homologous to four intact genes (ECSE_0327, ECSE_0393, ECSE_2494 and ECSE_3884) in SE11. Absence of orthologs of three genes (ECSE_1215, ECSE_1251 and ECSE_1600) is common to the three strains (CFT073, UTI89 and 536) belonging to phylogenetic group B2, suggesting that these orthologous genes may have been lost only in the lineage leading to the B2 group after divergence of the ancestor of the B2 group from the common ancestral E. coli (see Supplementary Fig. S1). A relative abundance of autotransporter homologs was observed in ETEC E24377A and the commensal strain HS, belonging to groups B1 and A, respectively, both of which possess six intact autotransporter genes. The orthologs of ECSE_3884 are widely conserved and distributed throughout E. coli and Shigella, and its passenger domain contains the short repeats (PF05658) and motifs (PF05662) found in hemagglutinins, suggesting that the autotransporter encoded by ECSE_3884 may also be involved in the mechanism of bacterial attachment to host cells.57 It is also noteworthy that SE11 contained no autotransporter that exports host-damaging proteins with serine protease activity such as Sat and Pic produced by UPEC strains.58,59

4. Discussion

From the detailed analysis of the genome sequence of the wild-type commensal strain SE11, we found that SE11 is notably rich in adhesion functions such as fimbriae and autotransporters, which were originally identified as virulence-associated functions in pathogenic E. coli strains.60,61 Although many of these adhesion-associated genes are also conserved in pathogenic E. coli strains, our data indicated that SE11 does not carry the other known virulence-associated genes found in the pathogenic strains. Furthermore, many of these adhesion-associated genes were encoded in the integrated regions on the chromosome and in the transmissible plasmids in SE11, indicating that they have been horizontally acquired by SE11. The lack of known virulence-associated genes in SE11 was also evident from the structural comparisons of several conserved prophages in SE11, MG1655 and O157 Sakai, which showed that virulence-associated genes present in the prophages of O157 were completely missing in those of SE11 and MG1655, while other genes were retained. The SE11 plasmids also encoded many genes associated with bacterial conjugation. This feature may be advantageous for the efficient distribution of plasmids through cell–cell contacts in the gut environment, where microbial density is high.5 These data suggest that the adhesion-associated genes are genetic elements transferable between E. coli strains and that they serve as a versatile function enhancing the ability of E. coli to colonize the gut rather than as virulence factors per se. This notion is consistent with the recent finding that commensal and pathogenic E. coli strains use a common pilus adherence factor for colonization.62

The comparison of SE11 with the laboratory-adapted strain MG1655 revealed that SE11 possessed more genes involved in the metabolism of carbohydrates as well as the genes for the adhesion than MG1655. These genes are associated with uptake of available nutrients, allowing E. coli to survive in the intestinal tract rich in oligo- and polysaccharides.63 The genomic features of SE11 shown here may indicate the consequence of adaptation of the commensal E. coli strain to human gut habitat.

Supplementary Data

Funding

This research was supported by Grants-in-Aid for Scientific Research on Priority Areas ‘Comprehensive Genomics’ (M.H.) and ‘Applied Genomics’ (T.H.) from the Ministry of Education, Culture, Sports, Science, and Technology of Japan.