Abstract

Streptococcus thermophilus ACA-DC 2 is a newly sequenced strain isolated from traditional Greek yogurt. Among the 14 fully sequenced strains of S. thermophilus currently deposited in the NCBI database, the ACA-DC 2 strain has the smallest chromosome, containing 1,731,838 bp. The annotation of its genome revealed the presence of 1,850 genes, including 1,556 protein-coding genes, 70 RNA genes and 224 potential pseudogenes. A large number of pseudogenes were identified. This was also accompanied by the absence of pathogenic features suggesting evolution of strain ACA-DC 2 through genome decay processes, most probably due to adaptation to the milk ecosystem. Analysis revealed the existence of one complete lactose-galactose operon, several proteolytic enzymes, one exopolysaccharide cluster, stress response genes and four putative antimicrobial peptides. Interestingly, one CRISPR-cas system and one orphan CRISPR, both carrying only one spacer, were predicted indicating low activity or inactivation of the cas proteins. Nevertheless, four putative restriction-modification systems were determined that may compensate any deficiencies of the CRISPR-cas system. Furthermore, whole genome phylogeny indicated three distinct clades within S. thermophilus. Comparative analysis among selected strains representative for each clade, including strain ACA-DC 2, revealed a high degree of conservation at the genomic scale, but also strain specific regions. Unique genes and genomic islands of strain ACA-DC 2 contained a number of genes potentially acquired through horizontal gene transfer events, that could be related to important technological properties for dairy starters. Our study suggests genomic traits in strain ACA-DC 2 compatible to the production of dairy fermented foods.

Keywords

Introduction

The use of microorganisms in food fermentations is the means for converting perishable and frequently inedible raw materials into safe, shelf-stable and nutritionally upgraded foods [1]. The economic importance of starter cultures for the food industry has led to the continuous search for the discovery of new microorganisms with important technological characteristics. In many cases it has been proven that traditionally fermented foods represent a natural reservoir of undiscovered microbial strains for possible diverse food applications [2, 3].

Streptococcus thermophilus is among the species commonly used in the dairy industry, mainly in the fermentation of yogurt and several cheese varieties, contributing to the desirable organoleptic characteristics of the final product [4, 5]. It is the sole species considered GRAS within the Streptococcus genus, which includes mostly pathogens and opportunistic pathogens [6]. Due to the industrial significance of the species, a plethora of studies has been conducted for a number of strains, revealing information about their diverse technological features [7, 8]. Furthermore, during the last 15 years, the advance of high-throughput sequencing techniques along with the development of novel bioinformatics tools facilitated the analysis of complete genome sequences, providing information for the overall genetic content of S. thermophilus [9–12]. These studies have demonstrated that S. thermophilus strains have been adapted to the milk environment through extensive reductive evolution as indicated by the large number of pseudogenes found in all strains. Adaptation to the milk environment is also supported by the loss of genes related to carbohydrate metabolism and virulence.

In this study, we present the analysis of the complete genome sequence of S. thermophilus ACA-DC 2. The genomic insights acquired could be proven useful for the exploitation of the specific strain in the production of fermented dairy products.

Organism information

Classification and features

Streptococcus thermophilus ACA-DC 2 is classified within the order Lactobacillales of the class Bacilli. It is a non-sporulating, Gram-positive bacterium with coccus-shaped cells (Fig. 1). The strain was isolated from traditional Greek yogurt manufactured through backslopping [13, 14]. Its optimum growth takes place in M17 medium at 42 oC under microaerophilic conditions within 24 h. Information about the classification and the features of S. thermophilus ACA-DC 2 is summarized in Table 1. The phylogenetic analysis was based on 16S rRNA gene sequences and places S. thermophilus ACA-DC 2 in the distinct cluster formed by the S. thermophilus strains and within the salivarius group, as shown in Fig. 2.

aEvidence codes - IDA inferred from direct assay, TAS traceable author statement (i.e., a direct report exists in the literature), NAS non-traceable author statement (i.e., not directly observed for the living, isolated sample, but based on a generally accepted property for the species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [54]

Fig. 2

Phylogenetic tree highlighting the position of S. thermophilus ACA-DC 2 relative to other Streptococcus species. The tree was constructed based on 16S rRNA gene sequences. GenBank accession numbers are presented in parentheses and type strains are indicated with a superscript T (type strains = T). Strains with complete genome sequence are marked with an asterisk. 16S rRNA gene sequences were aligned using MUSCLE [55]. The phylogenetic tree was built by the Maximum Likelihood method within MEGA7 software [56] using the Tamura-Nei substitution model [57]. Lactococcus lactis subsp. lactis NCDO 604T served as the outgroup. Bootstrap values derived after 1,000 replicates. The scale bar indicates an estimated 0.01 nucleotide change per nucleotide position

Genome sequencing information

Genome project history

S. thermophilus ACA-DC 2 is deposited in the ACA-DC culture collection of the Laboratory of Dairy Research, Agricultural University of Athens, Athens, Greece. The strain was selected for sequencing in order to obtain information about its technological and probiotic potential, having as basic aim its application as a starter culture in the production of dairy fermented foods. The project was carried out in 2015 and the genome was sequenced, fully assembled and annotated. The genome sequencing project was registered in the European Nucleotide Archive database under accession number LT604076. The summary of the project information is shown in Table 2.

Table 2

Project information

MIGS ID

Property

Term

MIGS 31

Finishing quality

Finished

MIGS-28

Libraries used

Illumina genomic Nextera XT library;

PacBio 10 kb genomic library

MIGS 29

Sequencing platforms

Illumina HiSeq2500; PacBio RSII

MIGS 31.2

Fold coverage

636x

MIGS 30

Assemblers

ABySS v1.5.1; BLASR; SSPACE v1.0; GapFiller v1.10

MIGS 32

Gene calling method

Prodigal; MeteGeneAnnotator; FGENESB

Locus Tag

STACADC2

Genbank ID

LT604076

GenBank Date of Release

29-Jul-2016

GOLD ID

NA

BIOPROJECT

PRJEB14916

MIGS 13

Source Material Identifier

ACA-DC 2

Project relevance

Dairy isolate

Growth conditions and genomic DNA preparation

S. thermophilus ACA-DC 2 was grown in M17 medium (Biokar Diagnostics, Beauvais, France). For the isolation of the genomic DNA, 2 ml from an overnight culture incubated at 42 °C were used and the extraction procedure was performed according to the protocol of Pitcher et al. [15]. The purity and the concentration of the extracted DNA were measured with a UV-Vis spectrophotometer (Q5000, Quawell, San Jose, USA) while its integrity was evaluated electrophoretically in a 0.8% agarose gel.

Genome sequencing and assembly

Whole-genome sequencing was performed using the Illumina HiSeq2500 and the PacBio RSII platforms at BaseClear service laboratory for DNA-research (Leiden, The Netherlands). Paired-end sequence reads were generated using the Illumina HiSeq2500 system. FASTQ sequence files were obtained using the Illumina Casava pipeline v1.8.3. Initial quality assessment was based on data passing the Illumina Chastity filtering. Subsequently, reads containing adapters and/or PhiX control signal were removed using an in-house filtering protocol. The second quality assessment was based on the remaining reads using the FASTQC quality control tool v0.10.0 resulting in 4,403,680 reads.

The data collected from the PacBio RSII instrument were processed and filtered using the SMRT Analysis software suite. The Continuous Long Read data were filtered by Read-length (>50), Subread-length (>50) and Read quality (>0.75) resulting in 117,020 reads.

The quality of the Illumina FASTQ sequences was enhanced by trimming off low-quality bases using the program bbduk, which is part of the BBMap suite v34.46. The quality-filtered sequence reads were puzzled into a number of contig sequences. The analysis was performed using ABySS v1.5.1. The contigs were linked and placed into super-scaffolds based on the alignment of the PacBio CLR reads with BLASR [16]. The alignment was further used to estimate the orientation, order and distance between the contigs by the SSPACE-LongRead scaffolder v1.0 [17]. The gapped regions within the super-scaffolds were closed in an automated manner using GapFiller v1.10 [18]. The method takes advantage of the insert size between the Illumina paired-end reads. The assembly resulted in one circular chromosome of 1,731,838 bp.

Genome annotation

Prediction of genes was carried out with the online programs Prodigal [19], MetaGeneAnnotator [20] and FGENESB [21], for comparison and verification of the obtained results. Genome annotation was performed using RAST v2.0 [22]. Annotation anomalies, including pseudogenes, were identified using GenePRIMP [23]. All data acquired were combined and subjected to manual curation. The WebMGA server [24] and the EggNog v4.5 [25] were used for COG annotation, the Phobius web server was used for the identification of genes with transmembrane helices and genes with signal peptides [26] and the Pfam database was used for the identification of genes with Pfam domains [27]. Potential pathogenic features were identified using the MP3 tool [28]. The CRISPRs﻿, the restriction-modification systems and the﻿ putative antimicrobial peptides were predicted using the CRISPRFinder web tool [29], the REBASE database [30] and BAGEL3 [31], respectively. The KODON software (Applied Maths NV, Sint-Martens-Latem, Belgium) was utilized for the visualization of synteny among the CRISPR regions of ACA-DC 2 and LMD-9 strains. The EDGAR server [32] was used for whole genome phylogeny and Venn diagram analysis. Circoletto [33] was employed for whole genome alignment among S. thermophilus strains. Finally, the genomic islands were identified through the IslandViewer 3 web-based resource [34].

Genome properties

The complete genome of S. thermophilus ACA-DC 2 consists of one circular chromosome containing 1,731,838 bp. The average GC content of the chromosome is 39.2%. A total of 1,850 genes were predicted after manual curation, including 1,556 protein-coding genes, 70 RNAs (56 tRNAs and 14 rRNAs) and 224 potential pseudogenes (Table 3). A circular map of the genome was generated using the CGView comparison tool [35] as shown in Fig. 3. Function was assigned to 1,182 genes (63.89%), while 1,318 genes (71.24%) had one or more conserved Pfam domains. The distribution of protein-coding genes into COG functional category is shown in Table 4. The analysis revealed that approximately 28.5% of the protein-coding genes do not have any described function.

The total is based on the total number of protein coding genes in the genome

Insights from the genome sequence

Main genome sequence characteristics

The genome of S. thermophilus ACA-DC 2 is the smallest one described to date among the fully sequenced strains of the species deposited in NCBI and it is approximately 200 kbp smaller than the larger described genome. The majority of potential pseudogenes encode hypothetical proteins, transposases and proteins involved in carbohydrate transport and metabolism. Analysis of the genome for virulence factors with the MP3 tool revealed a number of hits (data not shown). Detailed inspection of these hits with EDGAR demonstrated that several such genes are conserved among S. thermophilus strains indicating that it is rather unlikely to be related to pathogenicity, given the safe history of the species. The high percentage of pseudogenes along with the absence of typical virulence factors for streptococci suggest that the ACA-DC 2 strain evolved through genome decay during the adaptation to the rich in nutrients dairy environment [9, 11].

S. thermophilus ACA-DC 2 carries a complete lactose-galactose operon containing the galR, galK, galT, galE, galM, lacS and lacZ genes (STACADC2_1195-1189) and it is able to ferment lactose and galactose, the latter in a fairly slow rate (data not shown). It has been reported that fermentation of galactose is limited among the strains of S. thermophilus [11]. As mentioned above, several genes responsible for the transport and degradation of sugars, such as fructose, maltose and trehalose, have been identified as pseudogenes in the genome of ACA-DC 2, further supporting the specialization of the bacterium to catabolize lactose.

Similar to other dairy bacteria, S. thermophilus ACA-DC 2 is able to synthesize exopolysaccharides (EPS) that may confer improved viscosity and texture to yogurt [4]. The EPS cluster is flanked by a deoD gene encoding a purine nucleoside phosphorylase (STACADC2_0949) and a pseudogene originally encoding a beta-glucosidase. Four of these genes, namely epsA (STACADC2_0948), epsB (STACADC2_0947), epsC (STACADC2_0946) and epsD (STACADC2_0945) are implicated in the regulation, polymerization, chain length and export of the EPS and are conserved among several S. thermophilus strains [36].

Two candidate CRISPRs were detected in the chromosome of strain ACA-DC 2. Intriguingly, both CRISPRs contained only one spacer. One CRISPR was found surrounded by cas proteins (STACADC2_0849-0856) while the other was orphan. The CRISPR-cas system of strain ACA-DC 2 exhibited the same organization and high degree of identity to that described previously for strain LMD-9 (Fig. 4) [37]. The two CRISPR-cas systems differed mainly in the csm6 gene, which in the case of strain ACA-DC 2 is a potential pseudogene as well as in csm2 gene that seems to be distinct in the two strains. S. thermophilusLMD-9 carries three CRISPR-cas systems and the system that is similar to that of ACA-DC 2 carries the lowest number (three) of spacers. Combined these findings could indicate low activity or even inactivation of the entire CRISPR-cas system in strain ACA-DC 2. Another possibility that cannot be excluded concerns low exposure of strain ACA-DC 2 to foreign DNA. Of course, any deficiency in the activity of the CRISPR-cas system may be compensated by restriction-modification (RM) systems. Strain ACA-DC 2 carries four putative RM systems according to the REBASE database (data not shown) belonging to RM types I (STACADC2_0642, STACADC2_0645, STACADC2_0648), II (STACADC2_0597-0598), III (STACADC2_0788-0789) and IV (STACADC2_0626).

Fig. 4

Synteny plot of the CRISPR loci between S. thermophilus strains ACA-DC 2 and LMD-9. The synteny of the two regions was calculated by the KODON software. In both strains the cas genes are denoted in blue. Gene csm6 in strain ACA-DC 2 is a potential pseudogene and it is denoted in yellow. The pyrD and pyrF genes colored in beige define the upstream and downstream limits of the CRISPR loci. Percentages displayed in the ribbon areas correspond to the % identity among the nucleotide sequences

Resolution of phylogenetic trees based on 16S rRNA gene sequences is limited due to high sequence identity, especially for strains of the same species. For this reason, we also performed whole genome phylogeny as implemented in EDGAR, using all available complete genomes of S. thermophilus. The phylogenetic tree produced revealed that S. thermophilus strains could be clustered in two distinct branches, the second of which could be also split in two sub-branches (Fig. 5). Strain ACA-DC 2 formed one of the branches along with strains CNRZ1066, LMG 18311, S9 and CS8. We chose strains ACA-DC 2, JIM 8232 and KLDS 3.1003 as representatives of each branch for comparative genomic analysis (Fig. 5). Whole genome alignments revealed extensive regions of high identity (>98%) among the genomes. However, regions of lower identity (between 80 and 98%) as well as strain specific regions were also identified. Using Venn diagram analysis as implemented in EDGAR, we determined a core genome of 1,303 genes among the three genomes as well as 137, 185 and 236 unique genes for strains ACA-DC 2, KLDS 3.1003 and JIM 8232, respectively.

The 137 unique genes of strain ACA-DC 2 were found to be involved in diverse functions (Fig. 6). At least some of those genes may be the result of horizontal gene transfer (HGT). HGT acquired genes may play a role in the technological properties of S. thermophilus strains [11]. Another analysis that may also reveal regions of HGT in the bacterial chromosome is the identification of GIs [38]. Twelve integrated GIs were predicted in the genome of S. thermophilus ACA-DC 2 (Fig. 6), containing a total of 213 genes also involved in diverse functions (Fig. 6). Detailed analysis of genes either unique or in the GIs could relate some of them to important technological traits. For example, we determined genes coding for cold shock proteins CspA and CspG (STACADC2_0749-0750), acid resistance locus arl7 (STACADC2_0743), putative bacteriocin peptides (STACADC2_1453 and STACADC2_1458) and a type I RM system (STACADC2_0642, STACADC2_0645, STACADC2_0648). A putative agmatinase gene (STACADC2_0818) that may play a role to protocooperation of S. thermophilus and L. bulgaricus during polyamine metabolism, was also detected in ACA-DC 2 strain [10]. Furthermore, genes implicated in zinc and heavy metals transport (STACADC2_0165-0166, STACADC2_0752), in DNA repair and metabolism (STACADC2_1696, STACADC2_1716, STACADC2_1719, STACADC2_1754) as well as several ribosome binding proteins, were also identified (STACADC2_0137, STACADC2_1568-1569, STACADC2_1667, STACADC2_1669-1671, STACADC2_1675-1695, STACADC2_1717, STACADC2_1732-1733, STACADC2_1752, STACADC2_1755).

Fig. 6

Additional genomic features of S. thermophilus ACA-DC 2. a Circular map of the S. thermophilus ACA-DC 2 genome as generated by IslandViewer 3. Highlighted regions correspond to GIs. GIs are colored within the circular map according to the prediction method used: five GIs in orange and eight GIs in blue were predicted with SIGI-HMM and IslandPath-DIMOB, respectively. Twelve integrated GIs are presented on the periphery in red. The black line plot represents the GC content (%) of the genomic sequence. b Distribution of genes in GIs and unique genes of S. thermophilus ACA-DC 2 into COG categories

Conclusions

The genome of S. thermophilus ACA-DC 2 presents characteristics in accordance with its adaptation to the milk environment including a high percentage of pseudogenes and absence of pathogenic features. Our analysis revealed that the strain carries genes involved in lactose and galactose catabolism and protein degradation that may accommodate its growth during milk fermentation. Stress response related genes that may contribute to survival under technological hurdles were also detected. Whole genome phylogeny suggested that S. thermophilus strains may diversify in three phylogenetic clades. Comparative analysis of genomes representative of each clade, including strain ACA-DC 2, revealed a number of unique genes for the latter. Furthermore, certain unique genes or genes belonging to GIs could be related to technological properties important for starter cultures. Theoretically, such genes could have been acquired through HGT. These findings render S. thermophilus ACA-DC 2 an appropriate candidate for use in the production of fermented dairy products.

Abbreviations

ACA-DC:

Agricultural College of Athens - Dairy Collection

CLR:

Continuous long read

COG:

Clusters of Orthologous Groups

CRISPR:

Clustered regularly interspaced short palindromic repeats

ENA:

European nucleotide archive

EPS:

Exopolysaccharide

GenePRIMP:

Gene prediction improvement pipeline

GI:

Genomic Island

GRAS:

Generally regarded as safe

HGT:

Horizontal gene transfer

PTA:

Phosphotungstic acid

RAST:

Rapid annotation using subsystem technology

RM:

Restriction-modification

SMRT:

Single molecule real time

Declarations

Acknowledgements

We would like to thank Dr. Nikos Kyrpides at the Joint Genome Institute (United States Department of Energy) for the analysis of the ACA-DC 2 genome with the GenePrimp server. Furthermore we would like to thank Prof. Konstantinos Fasseas at the Faculty of Crop Science (Agricultural University of Athens) for the transmission electron microscopy experiment.

Funding

The present work was co-financed by the European Social Fund and the National resources EPEAEK and YPEPTH through the Thales project.

Authors’ contributions

VA performed genome analysis and participated in the writing of the manuscript. MK performed genome analysis and participated in the writing of the manuscript. JB performed genome analysis. BP performed genome analysis. ET characterized strain ACA-DC 2, conceived the project and participated in the writing of the manuscript. KP conceived the project, performed genome analysis and participated in the writing of the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.