Rapid high-throughput serotyping of Streptococcus pneumoniae from WGS data

Full Description

SeroBA

SeroBA is a k-mer based pipeline to identify the serotype from Illumina NGS reads for given references. You can use SeroBA to download references from PneumoCaT (https://github.com/phe-bioinformatics/PneumoCaT) to identify the capsular type of Streptococcus pneumoniae.

Tutorial

Usage

Since SeroBA v0.1.3 an updated variant of the CTV from PneumoCaT is provided in the SeroBA package. This includes the serotypes 6E, 6F, 11E, 10X, 39X and two NT references. It is no longer necessary to run seroba getPneumocat.

For SeroBA version 0.1.3 and greater, download the database provided within this git repository:

Output

In the folder 'prefix' you will find a pred.tsv including your predicted serotype as well as a file called detailed_serogroup_info.txt including information about SNPs, genes, and alleles that are found in your reads. After the use of "seroba summary" a TSV file called summary.tsv is created that consists of three columns (sample ID, serotype, comments). Serotypes that do not match any reference are marked as "untypable" (v0.1.3).

In the detailed information you can see the finally predicted serotype as well as the serotypes that had the closest reference in that specific serogroup according to ARIBA. Furthermore, you can see the sequence identity between the sequence assembly and the reference sequence.
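As a rough illustration of how the three-column summary.tsv described above could be post-processed, here is a minimal Python sketch. It assumes the file is tab-separated with no header row and that the comments column may be empty or missing; the sample names and serotypes in the example are made up, and the helper names read_summary and flag_for_review are hypothetical, not part of SeroBA.

```python
import csv
import io
from collections import Counter


def read_summary(handle):
    """Return a list of (sample_id, serotype, comment) tuples.

    Assumes tab-separated rows without a header line (an assumption,
    not something the SeroBA documentation guarantees).
    """
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if not fields or not fields[0].strip():
            continue  # skip blank lines
        # Pad short rows so a missing comment column becomes an empty string.
        sample_id, serotype, comment = (fields + ["", "", ""])[:3]
        rows.append((sample_id.strip(), serotype.strip(), comment.strip()))
    return rows


def flag_for_review(rows):
    """Pick samples that are 'untypable' or carry any comment."""
    return [r for r in rows if r[1].lower() == "untypable" or r[2]]


# Made-up example content standing in for a real summary.tsv.
example = io.StringIO(
    "sample_A\t19F\t\n"
    "sample_B\tuntypable\t\n"
    "sample_C\t06B\tcontamination\n"
)
rows = read_summary(example)
print(Counter(serotype for _, serotype, _ in rows))
for sample_id, serotype, comment in flag_for_review(rows):
    print(sample_id, serotype, comment or "-")
```

For a real run, the StringIO object would be replaced by an opened summary.tsv file handle.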

Troubleshooting

Case 1:

SeroBA predicts 'untypable'. An 'untypable' prediction can either be a real 'untypable' strain or can be caused by different problems. Possible problems are: bad quality of your input data, submission of a wrong species, or too low coverage of your sequenced reads. Please check your data again and run a quality control.
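One quick check for the low-coverage cause mentioned above is to estimate sequencing depth as total sequenced bases divided by genome size. The sketch below is not part of SeroBA. It assumes standard four-line FASTQ records (no wrapped sequences) and uses roughly 2.1 Mbp as an approximate S. pneumoniae genome size; both are assumptions, and the function names are hypothetical. It does not replace a proper quality-control tool.

```python
import gzip

# Assumption: approximate S. pneumoniae genome size in base pairs.
GENOME_SIZE_BP = 2_100_000


def count_bases(fastq_path):
    """Sum the lengths of all sequence lines in a (optionally gzipped) FASTQ file."""
    opener = gzip.open if str(fastq_path).endswith(".gz") else open
    total = 0
    with opener(fastq_path, "rt") as handle:
        for line_number, line in enumerate(handle):
            # In a four-line FASTQ record the sequence is the second line.
            if line_number % 4 == 1:
                total += len(line.strip())
    return total


def estimate_coverage(fastq_paths, genome_size=GENOME_SIZE_BP):
    """Estimated depth = total bases across all read files / genome size."""
    return sum(count_bases(path) for path in fastq_paths) / genome_size


# Usage with hypothetical file names:
# depth = estimate_coverage(["sample_1.fastq.gz", "sample_2.fastq.gz"])
# print(f"estimated coverage: {depth:.1f}x")
```

What counts as sufficient depth is not stated here, so the threshold is left to the reader.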

Case 2:

Low alignment identity in the 'detailed_serogroup_info' file. This can be a hint for a mosaic serotype.

The third column in the summary.tsv indicates "contamination". This means that at least one heterozygous SNP was detected in the read data with at least 10% of the mapped reads at the specific position supporting the SNP.

Possible solution: please check the quality of your data and look for contamination within your reads.
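To make the 10% rule behind the "contamination" comment concrete, here is a small arithmetic sketch. It is an illustration only and not SeroBA's or ARIBA's actual implementation: the function name is hypothetical, and treating a position where every read supports the SNP as not heterozygous is an assumption.

```python
def supports_heterozygous_snp(snp_reads, total_reads, min_fraction=0.10):
    """Return True if the SNP is backed by at least `min_fraction` of the
    reads mapped at this position, but not by all of them.

    Assumption: a position where 100% of reads carry the SNP is treated
    as a plain (non-heterozygous) variant.
    """
    if total_reads <= 0 or snp_reads < 0 or snp_reads > total_reads:
        return False
    fraction = snp_reads / total_reads
    return min_fraction <= fraction < 1.0


# 12 of 80 mapped reads carry the SNP: 15%, above the 10% threshold.
print(supports_heterozygous_snp(12, 80))
# 3 of 80 mapped reads carry the SNP: 3.75%, below the threshold.
print(supports_heterozygous_snp(3, 80))
```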

Database

You can use the CTV of PneumoCaT by using seroba getPneumocat. It is also possible to add new serotypes by adding the reference sequence to the "references.fasta" file in the database folder. From the information provided by this database, a TSV file is created when running seroba createDBs. You can easily add further genetic information for any of these serotypes in the given format.
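Adding a reference sequence to "references.fasta" can be done by hand in a text editor; the following Python sketch shows one cautious way to script it. The function name append_reference and the record name and sequence in the usage comment are hypothetical, and the checks (no duplicate names, only A/C/G/T/N characters) are assumptions about what a sensible entry looks like rather than documented SeroBA requirements.

```python
from pathlib import Path


def append_reference(fasta_path, name, sequence, width=60):
    """Append one FASTA record to `fasta_path`, refusing duplicate names."""
    fasta_path = Path(fasta_path)
    sequence = "".join(sequence.split()).upper()
    if not name or any(ch.isspace() for ch in name):
        raise ValueError("record name must be non-empty and contain no whitespace")
    if not sequence or set(sequence) - set("ACGTN"):
        raise ValueError("sequence must be non-empty and contain only A, C, G, T or N")

    existing = fasta_path.read_text() if fasta_path.exists() else ""
    names = {
        line[1:].split()[0]
        for line in existing.splitlines()
        if line.startswith(">") and line[1:].strip()
    }
    if name in names:
        raise ValueError(f"a record named {name!r} already exists")

    # Wrap the sequence into fixed-width lines, as is common in FASTA files.
    wrapped = "\n".join(sequence[i:i + width] for i in range(0, len(sequence), width))
    with fasta_path.open("a") as handle:
        if existing and not existing.endswith("\n"):
            handle.write("\n")  # make sure the new header starts on its own line
        handle.write(f">{name}\n{wrapped}\n")


# Usage with a hypothetical database path, record name and sequence:
# append_reference("database/references.fasta", "my_new_serotype", "ATGACC...")
```

After editing the file, seroba createDBs needs to be run again so that the derived database files reflect the new reference.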

Installation

CentOS 7

Ensure you have a development environment set up (you may have done this already):