Datasets and Algorithm

Datasets

The training dataset was constructed by manually curating sequences of bacterial pathogenic/virulence proteins retrieved from
MvirDB (Zhou, 2006), a database that integrates DNA and protein sequences from several virulence factor databases. In this version, the positive dataset contains 1,625 proteins and the negative dataset contains 5,715 proteins.
Two different metagenomic datasets were constructed for training from the main dataset.

SVM Implementation

The Support Vector Machine (SVM) technique was implemented using the SVM_Light package (http://svmlight.joachims.org/).
This software package allows the parameters and the kernel function (linear, polynomial, radial basis function or sigmoid) to be adjusted. An SVM separates a set of data points
into two classes with a boundary that maximises the distance between the boundary and the nearest data points of each class. The dipeptide frequencies of the protein sequences were used as the input for training the SVM.
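
As an illustration of the input features, the following is a minimal sketch of how dipeptide frequencies could be computed and written in the sparse `label index:value` text format that SVM_Light reads. The function names, the normalisation by the total number of dipeptides, and the decision to skip dipeptides containing non-standard residues are assumptions made for this example, not details taken from the MP3 implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20 x 20 = 400 ordered pairs of standard amino acids, in a fixed order.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_frequencies(sequence):
    """Return a 400-element list of dipeptide frequencies for one protein.

    Assumption: each count is divided by the number of valid overlapping
    dipeptides, so the vector sums to 1 for any non-trivial sequence.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # skip pairs containing non-standard residues
            counts[pair] += 1
            total += 1
    if total == 0:
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


def to_svmlight_line(label, features):
    """Format one example as '<label> <index>:<value> ...' (1-based indices).

    Zero-valued features are omitted, which the sparse format permits.
    """
    parts = [f"{i}:{v:.6f}" for i, v in enumerate(features, start=1) if v != 0.0]
    return " ".join([f"{label:+d}"] + parts)


if __name__ == "__main__":
    # Hypothetical toy sequence, labelled +1 (pathogenic) for illustration.
    print(to_svmlight_line(1, dipeptide_frequencies("MKVLAAGIVGL")))
```

Each protein becomes one line of the training file, with +1 for the positive (pathogenic) class and -1 for the negative class.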

The performance of the SVM was evaluated by five-fold cross-validation. In this method, the dataset was divided into five nearly equal parts, with
four parts combined for training and the fifth part used for testing. This process was repeated so that every part was tested once.
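
The splitting procedure described above can be sketched as follows. This is a generic illustration: the function name, the random shuffling and the fixed seed are assumptions, and the sketch does not stratify by class, since the text does not say whether the original split did.

```python
import random


def k_fold_splits(n_items, k=5, seed=0):
    """Yield (train_indices, test_indices) pairs for k-fold cross-validation.

    Indices are shuffled once, dealt into k nearly equal folds, and each
    fold serves as the test set exactly once.
    """
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test


if __name__ == "__main__":
    # 1,625 positive + 5,715 negative proteins = 7,340 examples in total.
    for fold_number, (train, test) in enumerate(k_fold_splits(7340), start=1):
        print(fold_number, len(train), len(test))
```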

HMM Implementation

The Hidden Markov Model (HMM) module was implemented using the HMMER3 software (http://hmmer.janelia.org/). The Pfam database contains multiple alignments and hidden Markov models of protein domains and families.
To construct a local database of pathogenic and non-pathogenic domains, all pathogenic and non-pathogenic protein sequences of the main dataset were searched against the Pfam database using hmmscan at an E-value of 1e-5.
The resulting domains were classified into three categories: (i) domains present only in pathogenic proteins,
(ii) domains present only in non-pathogenic proteins, and (iii) domains occurring in both pathogenic and non-pathogenic proteins ("shared domains").
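
The three-way classification amounts to simple set operations on the domains hit by each group of proteins. The sketch below assumes the hmmscan results have already been parsed into (domain, E-value) pairs; the function name, the input layout and the domain identifiers are placeholders for illustration, not real Pfam accessions or the MP3 code.

```python
E_VALUE_CUTOFF = 1e-5  # threshold stated in the text


def classify_domains(pathogenic_hits, non_pathogenic_hits, cutoff=E_VALUE_CUTOFF):
    """Split domains into pathogenic-only, non-pathogenic-only and shared.

    Each argument is an iterable of (domain_id, e_value) pairs collected
    from searching one group of proteins against the domain database.
    """
    pathogenic = {dom for dom, e_value in pathogenic_hits if e_value <= cutoff}
    non_pathogenic = {dom for dom, e_value in non_pathogenic_hits if e_value <= cutoff}
    return {
        "pathogenic_only": pathogenic - non_pathogenic,
        "non_pathogenic_only": non_pathogenic - pathogenic,
        "shared": pathogenic & non_pathogenic,
    }


if __name__ == "__main__":
    # Placeholder domain identifiers and E-values.
    pathogenic_hits = [("DOM_A", 1e-20), ("DOM_B", 1e-8), ("DOM_C", 0.5)]
    non_pathogenic_hits = [("DOM_B", 1e-10), ("DOM_D", 1e-6)]
    print(classify_domains(pathogenic_hits, non_pathogenic_hits))
```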

Hybrid Implementation

A hybrid approach combining SVM and HMM was used to develop the MP3 tool in order to achieve higher accuracy and sensitivity.
In this approach, all protein sequences, including those in the blind dataset, are screened using both the SVM and HMM modules.
Of the two methods, SVM classifies a protein as either pathogenic or non-pathogenic, whereas HMM classifies a protein as pathogenic, non-pathogenic or unclassified.
The criteria used to carry out the assignments are shown below.

Parameters used for Performance Evaluation

The performance of our SVM was assessed using the following threshold-dependent
parameters:

Sensitivity (Sn): Sensitivity measures the proportion of pathogenic (positive)
proteins that are correctly predicted as pathogenic.

Specificity (Sp): Specificity measures the proportion of non-pathogenic (negative)
proteins that are correctly predicted as non-pathogenic.

Accuracy (Acc): Accuracy measures the proportion of all predictions, positive and
negative, that agree with the actual or experimentally determined class.

Matthews Correlation Coefficient (MCC): MCC measures the quality of a binary
classification on a scale from -1 to +1, where +1 indicates perfect prediction, 0 indicates prediction no better than random and -1 indicates complete disagreement.
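
These four parameters follow the standard definitions in terms of true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN). The sketch below is a generic implementation of those definitions, not code from MP3; the function name and the zero-denominator handling are choices made for this example. The counts in the usage example are those implied by a 97% sensitivity and 97% specificity on a test set of 100 pathogenic and 100 non-pathogenic proteins.

```python
import math


def classification_metrics(tp, tn, fp, fn):
    """Compute Sn, Sp, Acc and MCC from confusion-matrix counts.

    Ratios with a zero denominator are returned as 0.0 (a convention
    chosen for this sketch).
    """
    def ratio(numerator, denominator):
        return numerator / denominator if denominator else 0.0

    mcc_denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": ratio(tp, tp + fn),           # TP / (TP + FN)
        "specificity": ratio(tn, tn + fp),           # TN / (TN + FP)
        "accuracy": ratio(tp + tn, tp + tn + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_denominator),
    }


if __name__ == "__main__":
    metrics = classification_metrics(tp=97, tn=97, fp=3, fn=3)
    for name, value in metrics.items():
        print(f"{name}: {value:.2f}")
```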

Blind Datasets
The performance of MP3 was tested on genomic and metagenomic blind datasets.
MP3 showed an accuracy of 98.21% on the genomic blind dataset and accuracies of 83.03% and 91.13% on the metagenomic blind datasets BlindA and BlindB, respectively.

Genomic Datasets
Fifteen species of known pathogenic and non-pathogenic bacteria with complete genome sequences available at NCBI were selected. MP3 was run on the complete set of proteins from these genomes. It is apparent from the following table that the percentage of pathogenic proteins
is higher in pathogenic genomes than in non-pathogenic genomes.

Real Metagenomic Datasets
The performance of MP3 was also examined using real metagenomic datasets.
The human gut microbiome datasets for a healthy European male individual (MH0050, age 49) and a diseased European male individual (O2.UC-18, age 48) were obtained from ftp://public.genomics.org.cn/BGI/gutmeta/High_quality_reads/.
A total of 8,026,105 and 6,952,195 ORFs (30-50 amino acids in length) were predicted in the healthy and diseased datasets, respectively, using MetaGeneMark [20]. MP3 was run on the predicted ORFs of the two datasets. The assignment took ~180 CPU hours on an Intel Xeon 2.4 GHz CPU, which is reasonable considering the size of the input data.
It predicted 16.51% and 19.37% proteins as pathogenic in healthy and diseased individuals, respectively.
These results validate the efficiency and capability of MP3 in predicting pathogenic proteins in metagenomic datasets.

Comparison of MP3 with other available programs (using independent dataset)

The performance of MP3 was compared with the publicly available VirulentPred web server, which predicts virulent proteins in genomic datasets. Three test sets were used for the comparison.
In the first test set, 200 proteins from a pathogenic Mycobacterium tuberculosis strain, Mycobacterium tuberculosis Beijing NITR203 uid197218 (known as the Beijing strain), were used, of which 100 are known and confirmed pathogenic proteins such as drug resistance proteins, MCE-family proteins and PE-PPE family proteins. The remaining 100 proteins were non-pathogenic and included polymerase proteins, ribosomal proteins and other proteins from essential genes that are not known to play a role in pathogenesis. The sensitivity (97%), specificity (97%), accuracy (97%) and MCC (0.94) achieved by MP3 on this set are much higher than the sensitivity (81%), specificity (34%), accuracy (57.5%) and MCC (0.16) obtained by VirulentPred.
The second set consisted of the blind dataset constructed in this study. MP3 showed a much higher sensitivity (100%), specificity (95.24%), accuracy (97.26%) and MCC (0.95) than VirulentPred (sensitivity 74.70%, specificity 49.49%, accuracy 60.99% and MCC 0.24).
The third set consisted of the independent dataset provided by VirulentPred. On this set, MP3 exhibited an accuracy of 90%, whereas VirulentPred showed an accuracy of 85%.
The higher accuracy shown by MP3 on an independent dataset used for the evaluation of VirulentPred attests to the accuracy of MP3 on any unknown dataset.
These results indicate that MP3 is perhaps the most sensitive, specific and accurate among the available methods.