Abstract

Motivation: With the rapid rise of bacterial resistance to antibiotics, novel infection therapeutics are urgently needed. In recent years, antimicrobial peptides (AMPs) have been explored as potential alternatives. AMPs are key components of the innate immune system and can protect the host from various pathogenic bacteria. The need to identify AMPs and their functional types has prompted many studies, and various machine-learning predictors have been developed. However, there is room for improvement; in particular, no existing predictor takes into account the imbalance among the functional types of AMPs.

Results: In this paper, a new synthetic minority over-sampling technique on imbalanced and multi-label datasets, referred to as ML-SMOTE, was designed for processing and identifying AMPs’ functional families. A novel multi-label classifier, MLAMP, was also developed using ML-SMOTE and grey pseudo amino acid composition. The classifier obtained 0.4846 subset accuracy and 0.16 hamming loss.

1 Introduction

With the rapid rise of bacterial resistance to antibiotics, novel infection therapeutics are urgently needed. Over the past decade, antimicrobial peptides (AMPs) have been explored as potential alternatives for fighting infectious diseases. AMPs are key components of the innate immune system and can protect the host from various pathogenic bacteria. In invertebrates and vertebrates, AMPs have dual roles: rapid microbial killing and subsequent immune modulation (Wang, 2014). These effects arise because AMPs damage bacteria in several ways: by disrupting bacterial membranes (Malmsten, 2014), by inhibiting protein, DNA and RNA synthesis, or by interacting with certain intracellular targets (Bahar and Ren, 2013). AMPs are therefore increasingly being developed as new drugs, and several therapeutic applications have been reported. Popovic et al. (2012) found that peptides with antimicrobial and anti-inflammatory activities had therapeutic potential for the treatment of acne vulgaris. Yancheva et al. (2012) synthesized a novel didepsipeptide with antimicrobial activity against four of five tested bacterial strains of Escherichia coli. Conlon et al. (2014) demonstrated that antimicrobial peptides from frog skin could stimulate insulin release and hence had potential as an incretin-based therapy for Type 2 diabetes mellitus. In addition, AMPs have been used as anticancer peptides in cancer therapy (Gaspar et al., 2013).

A surge in research on AMPs has promoted the development of various databases and prediction tools. APD2 (Wang et al., 2009) is a system dedicated to the glossary, nomenclature, classification, information search, prediction, design, and statistics of AMPs. It gathers 2544 AMPs from the literature. CAMP (Thomas et al., 2010; Waghu et al., 2013) holds 6756 antimicrobial sequences and 682 3D structures of AMPs, together with prediction and sequence analysis tools. Niarchou et al. (2013) tested all subsequences of 5 to 100 amino acids from the plant proteins in UniProtKB/Swiss-Prot and constructed an AMP database for plant species, named C-PAmP. Zhao et al. (2013) developed LAMP, a database intended to aid the discovery and design of AMPs as new antimicrobial agents. The database contains 3904 natural AMPs and 1643 synthetic peptides. DBAASP is a manually curated database built by Gogoladze et al. (2014) that collects peptides whose antimicrobial activities against particular targets have been evaluated experimentally.

Although these methods have their own advantages and have played an important role in AMP research, they share the following problems. First, most models only identify whether a new sequence is an AMP, not its functional type. Second, short peptides are hard to search for in a database because AMPs usually have only 5–50 amino acids, so methods based on BLAST search and gene ontology (Lin et al., 2013) are often ineffective. Last but not least, classifying AMP functions is a multi-label classification (MLC) problem, which is especially difficult when the numbers of AMPs with different activities are not evenly distributed. In APD2 (Wang et al., 2009), antibacterial peptides account for more than 90% of all AMPs, which makes this a highly unbalanced MLC problem. None of the aforementioned automatic models considered the imbalance among the various activities.

Although the aforementioned methods have had some success in addressing unbalanced datasets, they have not achieved satisfactory results on datasets that are both multi-labeled and imbalanced. Few works address the imbalance problem in MLC. He et al. (2012) took imbalance into account when predicting the subcellular localization of human proteins. Charte et al. (2015) built under-sampling and oversampling algorithms for multi-label datasets (MLDs). Those studies improved multi-label classification performance; however, they have drawbacks in how they assign labels to a new synthetic instance. In this paper, we tackle the imbalance problem with a novel oversampling model referred to as ML-SMOTE, a synthetic minority oversampling technique for MLDs. Based on ML-SMOTE, we developed a new two-level AMP predictor. For a peptide sequence, we first identify whether it is an AMP. If it is, we then predict which activities it may have. The first level is a binary predictor, and the second level is an unbalanced, multi-label, multi-class predictor. The results show that ML-SMOTE can adjust the label set distribution to improve the performance of the predictor.

2 Methods

2.1 Benchmark dataset

The benchmark dataset SBench used in this study was taken from Xiao et al. (2013). The dataset can be formulated as

$$S_{\mathrm{Bench}} = S^{+} \cup S^{-} \tag{1}$$

where S+ contains 879 AMPs, and S- contains 2405 non-AMPs. The 879 AMPs are formulated as

2.2 Sequence encoding scheme

To develop a powerful method for classifying AMPs and their functional families from sequence information, one of the keys is to formulate the peptides with an effective mathematical expression that truly reflects their intrinsic correlation with the target to be identified. Compared with other protein function prediction tasks, however, the particular challenge in AMP identification is that the peptides are much shorter. Consider a peptide sample P of L amino acids

$$P = R_1 R_2 R_3 \cdots R_L \tag{3}$$

where $R_i$ ($1 \le i \le L$) represents the $i$-th residue, and L is usually between 5 and 50.

In this study, we formulated an amino acid sequence by using Chou’s PseAAC (Chou, 2001, 2005) with the grey model (GM) (Deng, 1989). According to Chou’s general PseAAC formula (Chou, 2009, 2011), the peptide P in Eq. 3 can be represented as

$$P = \left[ p_1 \; p_2 \; \cdots \; p_k \; \cdots \; p_\Omega \right]^{T} \tag{4}$$

where T is a transpose operator, while the subscript Ω is an integer and its value as well as the components p1, p2, … depend on how to extract the desired information from the amino acid sequence of P.

In our study, we use the GM(1,1) model, which is an important and widely used grey model. GM(1,1) first converts a series without any obvious regularity into a strictly monotonically increasing series by using the accumulative generation operation (AGO). This process reduces the randomness of the series, enhances its smoothness and minimizes interference from random information. Let us assume that

$$X^{(0)} = \left( x^{(0)}(1), x^{(0)}(2), \ldots, x^{(0)}(n) \right) \tag{5}$$

is a non-negative original series of real numbers with an irregular distribution. Then

$$X^{(1)} = \left( x^{(1)}(1), x^{(1)}(2), \ldots, x^{(1)}(n) \right) \tag{6}$$

is viewed as the first-order accumulative generation operation (1-AGO) series for X(0), and the components in X(1) are given by

$$x^{(1)}(k) = \sum_{i=1}^{k} x^{(0)}(i), \quad k = 1, 2, \ldots, n \tag{7}$$

The GM(1,1) model can be expressed by the following grey differential equation with one variable:

$$\frac{dX^{(1)}}{dt} + aX^{(1)} = b \tag{8}$$

where a and b are the elements of the parameter vector $\hat{a}$, that is

$$\hat{a} = [a, b]^{T} \tag{9}$$

In Eq. 8, $-a$ is the developing coefficient and b the influence coefficient. They can be solved by using a least-squares estimator:

$$\hat{a} = [a, b]^{T} = \left( B^{T} B \right)^{-1} B^{T} Y \tag{10}$$

where

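$$B = \begin{bmatrix} -0.5\left( x^{(1)}(1) + x^{(1)}(2) \right) & 1 \\ -0.5\left( x^{(1)}(2) + x^{(1)}(3) \right) & 1 \\ \vdots & \vdots \\ -0.5\left( x^{(1)}(n-1) + x^{(1)}(n) \right) & 1 \end{bmatrix} \tag{11}$$

$$Y = \begin{bmatrix} x^{(0)}(2) \\ x^{(0)}(3) \\ \vdots \\ x^{(0)}(n) \end{bmatrix} \tag{12}$$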
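As a minimal sketch of Eqs. 7 and 10–12, the parameter estimate can be computed as below. The function name `gm11_fit` is hypothetical and chosen only for illustration. Because B has just two columns, each row of the system reads $-a\,z + b = y$, so the least-squares solution of Eq. 10 reduces to an ordinary straight-line fit; the code uses that closed form rather than an explicit matrix inverse.

```python
from itertools import accumulate


def gm11_fit(x0):
    """Estimate (a, b) of the GM(1,1) model from a numeric series x0."""
    n = len(x0)
    if n < 3:
        raise ValueError("need at least 3 points to estimate two parameters")
    # Eq. 7: 1-AGO series, x1(k) = x0(1) + ... + x0(k)
    x1 = list(accumulate(x0))
    # Magnitude of the first column of B (Eq. 11): z(k) = 0.5 * (x1(k-1) + x1(k))
    z = [0.5 * (x1[k - 1] + x1[k]) for k in range(1, n)]
    # Y (Eq. 12): x0(2), ..., x0(n)
    y = list(x0[1:])
    # Each row of B [a, b]^T = Y reads -a * z + b = y: a line with slope -a
    # and intercept b. The closed form below solves the normal equations of Eq. 10.
    z_mean = sum(z) / len(z)
    y_mean = sum(y) / len(y)
    szz = sum((zi - z_mean) ** 2 for zi in z)
    if szz == 0:
        raise ValueError("degenerate series: B^T B is singular")
    szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, y))
    slope = szy / szz
    return -slope, y_mean - slope * z_mean
```

For a constant series such as `[2, 2, 2, 2]` the fit gives a = 0 and b = 2, as expected from Eq. 8 with no growth.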

The coefficients $-a$ and b should carry some intrinsic information contained in the discrete data sequence $X^{(0)}$ sampled from the system investigated. In view of this, we incorporate these coefficients into the general form of PseAAC (Eq. 4) to reflect the correlation between the peptide sequence and the prediction labels. To translate an amino acid sequence expressed in letters (Eq. 3) into a non-negative real series (Eq. 5), we need numerical codes for the amino acids. In the same manner as in Xiao et al. (2013), we use the numerical values of the following five physicochemical properties for each of the 20 amino acids: (1) hydrophobicity; (2) pk1 (Cα-COOH); (3) pk2 (NH3); (4) PI (25 °C); and (5) molecular weight. Finally, we used a 30-D feature vector (20 amino acid frequencies plus two grey-model coefficients for each of the five properties) to represent a peptide; i.e. instead of Eq. 4, we now have

$$P = \left[ p_1, p_2, \ldots, p_{20}, p_{21}, \ldots, p_{30} \right]^{T} \tag{13}$$

where $p_i$ ($1 \le i \le 20$) are the frequencies of the 20 amino acids; $p_{21}$ and $p_{22}$ are the coefficients of Eq. 10 when the amino acids are coded by their hydrophobicity values; $p_{23}$ and $p_{24}$ are the coefficients of Eq. 10 when the amino acids are coded by their pk1 values; and so on.
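The construction of Eq. 13 can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the names `encode_peptide` and `gm11_fit` are hypothetical, the residue order in `AMINO_ACIDS` is arbitrary, the pair stored for each property is taken to be (a, b) in that order, and the five property tables must be supplied by the caller because their numerical values are not listed here.

```python
from itertools import accumulate

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard residues; this order is an assumption


def gm11_fit(x0):
    """Least-squares estimate of (a, b) in Eq. 10 for a numeric series x0."""
    x1 = list(accumulate(x0))                                   # Eq. 7, 1-AGO
    z = [0.5 * (x1[k - 1] + x1[k]) for k in range(1, len(x0))]  # from Eq. 11
    y = list(x0[1:])                                            # Eq. 12
    zm, ym = sum(z) / len(z), sum(y) / len(y)
    szz = sum((zi - zm) ** 2 for zi in z)
    slope = sum((zi - zm) * (yi - ym) for zi, yi in zip(z, y)) / szz
    return -slope, ym - slope * zm          # rows of B give -a * z + b = y


def encode_peptide(sequence, property_tables):
    """Return the 30-D feature vector of Eq. 13 for one peptide.

    property_tables: list of five dicts mapping residue letter -> numeric code,
    one per physicochemical property (values are supplied by the caller).
    """
    if len(property_tables) != 5:
        raise ValueError("expected five property tables")
    if len(sequence) < 3:
        raise ValueError("sequence too short for a GM(1,1) fit")
    length = len(sequence)
    # p1..p20: amino acid frequencies
    features = [sequence.count(aa) / length for aa in AMINO_ACIDS]
    # p21..p30: two GM(1,1) coefficients per property
    for table in property_tables:
        series = [table[residue] for residue in sequence]
        features.extend(gm11_fit(series))
    return features
```

A call such as `encode_peptide("ACDKLLA", tables)` returns a list of 30 floats whose first 20 entries sum to 1.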

2.3 ML-SMOTE algorithm

In Eq. 2, the AMP functional family dataset is an unbalanced MLD, in which antibacterial peptides are nine times as numerous as anti-HIV peptides. Handling MLC properly on such an unbalanced MLD is essential for improving prediction performance.

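Let $X \subset \mathbb{R}^{m}$ denote the space of m-dimensional real-valued instance vectors and let

$$Y = \{ l_1, l_2, \ldots, l_q \} \tag{14}$$

be a class label set. An MLD can be represented as

$$D = \{ (x, y) \mid x \in X,\ y \subseteq Y \} \tag{15}$$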

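We define the sample set with the j-th ($1 \le j \le q$) label as

$$D^{(j)} = \{ (x^{(j)}, y^{(j)}) \mid (x^{(j)}, y^{(j)}) \in D \ \text{and}\ l_j \in y^{(j)} \} \tag{16}$$

If $\| D^{(j_1)} \| \gg \| D^{(j_2)} \|$, the class $l_{j_1}$ is a majority class and the class $l_{j_2}$ is a minority class.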

Unlike SMOTE (Chawla et al., 2002) on a single-label dataset, a new synthetic instance in an MLD may have one or more labels. Charte et al. (2015) therefore compared random undersampling (RUS) and random oversampling (ROS) based on Label Power-set (LP) and Multi-Label (ML) approaches, respectively. However, their LP-RUS and LP-ROS methods only work well when the label density is low. Moreover, because their ML-ROS simply clones the minority class samples, it is ineffective when these samples also carry a majority class label, which happens often in MLDs. In this study, we propose a novel oversampling model named ML-SMOTE. In the following algorithm description, we express a multi-label dataset (see Eq. 15) with N samples as

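$$D = \{ t_i = (x_i, y_i) \mid x_i = (x_{i,1}, \ldots, x_{i,m}),\ y_i = (y_{i,1}, \ldots, y_{i,q}),\ 1 \le i \le N \} \tag{17}$$

where

$$y_{i,j} = \begin{cases} 1, & \text{if } x_i \text{ has label } l_j \\ 0, & \text{otherwise} \end{cases} \quad (1 \le j \le q)$$

and the subset $D^{(j)}$, in which each sample is labeled with class $l_j$, as

$$D^{(j)} = \{ t_i^{j} = (x_i^{j}, y_i^{j}) \mid x_i^{j} = (x_{i,1}^{j}, \ldots, x_{i,m}^{j}),\ y_i^{j} = (y_{i,1}^{j}, \ldots, y_{i,q}^{j}),\ \text{and}\ y_{i,j}^{j} = 1 \} \quad (1 \le j \le q) \tag{18}$$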
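The notation of Eqs. 17 and 18 maps directly onto simple data structures. The sketch below, with hypothetical helper names, builds the 0/1 indicator vectors, extracts each per-label subset and reports how much smaller each subset is than the largest one; the ratio is only an illustrative way to spot minority classes, not a quantity defined in this paper.

```python
def to_indicator(labels, label_names):
    """Eq. 17: y_{i,j} = 1 if the sample carries label l_j, else 0."""
    return tuple(1 if name in labels else 0 for name in label_names)


def label_subsets(dataset, q):
    """Eq. 18: D^(j) collects every sample whose j-th indicator equals 1.

    dataset: list of (x, y) pairs, where y is a 0/1 tuple of length q.
    """
    return [[(x, y) for x, y in dataset if y[j] == 1] for j in range(q)]


def imbalance_ratios(dataset, q):
    """Size of the largest D^(j) divided by the size of each D^(j)."""
    sizes = [len(subset) for subset in label_subsets(dataset, q)]
    largest = max(sizes)
    # An empty subset gets an infinite ratio (convention assumed here).
    return [largest / size if size else float("inf") for size in sizes]
```

A label whose ratio is far above 1 corresponds to a minority class in the sense of the condition on the subset sizes given earlier.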

Algorithm ML-SMOTE algorithm’s pseudo-code

Inputs: Dataset: D with m features and q labels (see Eq. 17); k (the number of nearest neighbors)
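As a rough illustration of the kind of oversampling step involved, the sketch below applies the generic SMOTE interpolation of Chawla et al. (2002) inside one label subset D(j), using the same two inputs (a dataset and k). It should not be read as the ML-SMOTE procedure itself: the function name `smote_like_oversample` is hypothetical, and the rule for labelling a synthetic sample (keeping only the labels shared by the seed and the chosen neighbour) is a placeholder assumption made for this example.

```python
import math
import random


def smote_like_oversample(subset, k, n_new, seed=0):
    """Create n_new synthetic (x, y) pairs from one label subset D^(j).

    subset: list of (x, y) with x a tuple of floats and y a 0/1 tuple.
    k: number of nearest neighbours considered for each seed sample.
    """
    if len(subset) < 2:
        raise ValueError("need at least two samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        idx = rng.randrange(len(subset))
        x_seed, y_seed = subset[idx]
        # k nearest neighbours of the seed within the subset (Euclidean distance)
        others = [s for i, s in enumerate(subset) if i != idx]
        others.sort(key=lambda s: math.dist(x_seed, s[0]))
        x_nb, y_nb = rng.choice(others[:k])
        # Standard SMOTE step: a random point on the segment seed -> neighbour
        gap = rng.random()
        x_new = tuple(a + gap * (b - a) for a, b in zip(x_seed, x_nb))
        # ASSUMPTION (placeholder rule): keep labels present in both parents
        y_new = tuple(u & v for u, v in zip(y_seed, y_nb))
        synthetic.append((x_new, y_new))
    return synthetic
```

Because both parents come from D(j), the synthetic sample always keeps label j under this placeholder rule, while labels carried by only one parent are dropped.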

3 Results

After the sequence feature extraction and ML-SMOTE preprocessing described above, a two-level AMP predictor named MLAMP was constructed, in which the Ensemble of Classifier Chains (ECC) algorithm (Waghu et al., 2014) was adopted as the prediction method (Fig. 1). We used the canonical implementation of ECC provided by the MULAN multi-label learning library (Tsoumakas et al., 2010; Tsoumakas et al., 2011), which is built on Weka (Hall et al., 2009). For ECC, the binary and multi-class learners were implemented on the Weka platform using the Random Forest (RF) algorithm (Breiman, 2001).

Fig. 1. Flowchart of the training process of MLAMP. T1 represents the data taken from the dataset SBench for training the 1st-level predictor; T2 represents the data taken from the dataset S+ for training the 2nd-level predictor.

MLAMP is a two-level prediction engine (see Fig. 1). The first level of MLAMP predicts whether a query peptide is an AMP or a non-AMP by using the RF algorithm. This is a case of single-label classification. The following four measures were used to examine the performance of the single-label predictor: (i) overall accuracy or Acc; (ii) Matthews correlation coefficient or MCC; (iii) sensitivity or Sn; and (iv) specificity or Sp.

Although the measures in Eq. 20 are often used in the literature to assess the prediction quality of a method, they lack intuitiveness, especially for biologists, and particularly in the case of MCC. According to Chou’s formulation, these four measures can be expressed as (Chen et al., 2016; Lin et al., 2014)

where $N^{+}$ stands for the total number of AMP samples investigated, and $N^{+}_{-}$ for the number of AMP samples incorrectly predicted to be non-AMPs; $N^{-}$ stands for the total number of non-AMP samples investigated, and $N^{-}_{+}$ for the number of non-AMP samples incorrectly predicted to be AMPs. With the formulation given in Eq. 21, the meanings and ranges of sensitivity, specificity, overall accuracy and the Matthews correlation coefficient become more intuitive and easier to understand, particularly for the Matthews correlation coefficient, as concurred by a series of recent studies (Jia et al., 2016b,c,d; Lin et al., 2014; Liu et al., 2016a,b,c; Qiu et al., 2016; Xiao et al., 2016).
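A minimal sketch of the four measures in terms of these counts is given below. The expressions follow the form in which Chou's formulation is usually written in the cited literature, which is an assumption here; the function name `binary_metrics` and its argument names are hypothetical, and returning 0 for MCC when its denominator vanishes is a convention chosen for this example.

```python
import math


def binary_metrics(n_pos, n_neg, pos_missed, neg_missed):
    """Sn, Sp, Acc and MCC from error counts.

    n_pos:      total number of AMP samples (N+)
    n_neg:      total number of non-AMP samples (N-)
    pos_missed: AMPs incorrectly predicted as non-AMPs
    neg_missed: non-AMPs incorrectly predicted as AMPs
    """
    sn = 1 - pos_missed / n_pos
    sp = 1 - neg_missed / n_neg
    acc = 1 - (pos_missed + neg_missed) / (n_pos + n_neg)
    numerator = 1 - (pos_missed / n_pos + neg_missed / n_neg)
    denominator = math.sqrt(
        (1 + (neg_missed - pos_missed) / n_pos)
        * (1 + (pos_missed - neg_missed) / n_neg)
    )
    # ASSUMPTION: report MCC = 0 when every sample is assigned to one class
    mcc = numerator / denominator if denominator else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}
```

Written this way, Sn = 1 when no AMP is missed and Sp = 1 when no non-AMP is misclassified, and MCC = 1 only when both error counts are zero. The MCC value agrees with the usual confusion-matrix expression, since the number of true positives is `n_pos - pos_missed` and the number of true negatives is `n_neg - neg_missed`.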

If a query peptide is predicted to be an AMP, the second level of MLAMP then classifies its functional families. This is a case of multi-label classification. Hamming loss, Subset Accuracy, Accuracy, Precision and Recall are the most commonly used metrics for evaluating the performance of a multi-label classifier (Lin et al., 2013; Tsoumakas and Katakis, 2007; Tsoumakas et al., 2010; Xiao et al., 2013). Suppose $L_k$ is the subset that contains all the true labels of the k-th sample $P_k$; $L_k^{*}$ is the subset that contains all the predicted labels of $P_k$; N is the total number of samples; and M is the total number of labels. In this study, N=879 and M=5. The five metrics are defined as follows (Chou, 2013):

where ‖‖ is the operator acting on the set therein to count the number of its elements, and

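$$\Delta\left( L_k, L_k^{*} \right) = \begin{cases} 1, & \text{if all the labels in } L_k \text{ are identical to those in } L_k^{*} \\ 0, & \text{otherwise} \end{cases} \tag{23}$$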
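The five metrics can be sketched with plain set operations, as below. The formulas used are the standard example-based definitions from the multi-label literature, assumed here to match the intended ones; the function name `multilabel_metrics` is hypothetical, and the values returned for empty label sets are conventions chosen for this example.

```python
def multilabel_metrics(true_sets, pred_sets, n_labels):
    """Example-based multi-label metrics averaged over N samples.

    true_sets, pred_sets: sequences of label sets (L_k and L_k*).
    n_labels: total number of labels M.
    """
    n = len(true_sets)
    totals = {"hamming_loss": 0.0, "accuracy": 0.0, "precision": 0.0,
              "recall": 0.0, "subset_accuracy": 0.0}
    for true, pred in zip(true_sets, pred_sets):
        true, pred = set(true), set(pred)
        inter = len(true & pred)   # size of the intersection of L_k and L_k*
        union = len(true | pred)   # size of the union of L_k and L_k*
        # Labels on which truth and prediction disagree, divided by M
        totals["hamming_loss"] += (union - inter) / n_labels
        # ASSUMPTION: two empty sets count as a perfect match
        totals["accuracy"] += inter / union if union else 1.0
        # ASSUMPTION: an empty prediction or truth set contributes 0
        totals["precision"] += inter / len(pred) if pred else 0.0
        totals["recall"] += inter / len(true) if true else 0.0
        # Delta of Eq. 23: 1 only when the two sets are identical
        totals["subset_accuracy"] += 1.0 if true == pred else 0.0
    return {name: value / n for name, value in totals.items()}
```

Subset accuracy is the strictest of the five, since a sample scores 1 only when every label is predicted correctly and no extra label is predicted.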

When assessing a predictor, the following three cross-validation methods are often used in the literature: independent dataset test, subsampling (K-fold cross-validation) test and jackknife test. However, as elaborated in (Chou and Zhang, 1995), among the three cross-validation methods, the jackknife test is deemed the least arbitrary and most objective because it can always yield a unique result for a given benchmark dataset. Hence, the jackknife test was adopted in this study to examine the anticipated success rates of the current predictor. The process of jackknife test can be explained as follows:

Input: multi-label dataset T={Pi | 1≤i≤N}.

Output: predicted label set of each Pi.

For i: 1→N

T is divided into testing dataset Ts={Pi},

and training dataset Tr=T-Ts.

Generate new training dataset Tr’ by using ML-SMOTE on Tr.

Train model on Tr’ by using ECC algorithm.

Predict the label set of Pi by the model trained above.

End For
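The loop above can be sketched generically as follows. The oversampler, trainer and predictor are passed in as callables, because the actual ML-SMOTE and ECC implementations are not reproduced in this sketch; the names `jackknife`, `oversample`, `train` and `predict` are hypothetical. The point illustrated is that oversampling is applied only to the training part, so the held-out sample never influences the synthetic data.

```python
def jackknife(dataset, oversample, train, predict):
    """Leave-one-out evaluation with oversampling inside each round.

    dataset:    list of (x, labels) pairs
    oversample: callable taking a training list and returning a new training list
    train:      callable taking a training list and returning a model
    predict:    callable taking (model, x) and returning a predicted label set
    """
    predictions = []
    for i in range(len(dataset)):
        held_out_x, _ = dataset[i]
        # Tr = T - Ts: every sample except the i-th
        training = dataset[:i] + dataset[i + 1:]
        # Tr' is generated from Tr only, never from the held-out sample
        model = train(oversample(training))
        predictions.append(predict(model, held_out_x))
    return predictions
```

The returned list is aligned with the input dataset, so it can be compared sample by sample with the true label sets to compute the multi-label metrics.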

Table 1 compares the performance of MLAMP with an existing method iAMP-2L in the first-level result on the benchmark SBench (Eq. 1), where overall accuracy Acc and MCC achieved by MLAMP are higher than those achieved by iAMP-2L.

Comparison of MLAMP with iAMP-2L and CAMP on the independent dataset SInd

Furthermore, in the second-level prediction, MLAMP also obtained better performance than iAMP-2L. The metrics used here differ from those of single-label classification: Hamming loss, Accuracy, Precision, Recall and Subset Accuracy (Tsoumakas et al., 2010) are the ones commonly applied in MLC. Table 3 gives the detailed jackknife test results on the AMP dataset S+ (Eq. 2). In particular, MLAMP achieved a success rate of 0.4846 under the strict subset accuracy criterion, 5% higher than that of iAMP-2L.

Performance metrics achieved at the 2nd level by MLAMP on the AMP dataset S+

Why are these metrics improved so markedly by MLAMP? There are two likely reasons. The first is the new peptide feature coding model (see Eq. 13). Table 4 ranks the 30 features in decreasing order of importance after analyzing the benchmark dataset SBench with the feature selection tool minimal-redundancy-maximal-relevance (mRMR) (Kolde et al., 2016). As shown in Table 4, the features generated by the grey model carry more information than the amino acid frequencies, especially information about biochemical properties. One can conclude that some physicochemical properties of amino acids, such as molecular weight, PI and pk2, may play an important role in AMP activity. The second reason is the new ML-SMOTE model. The AMP dataset S+ is an imbalanced MLD, a fact that previous studies did not take into account. After the training dataset S+ was processed by the ML-SMOTE model, the balance of the new synthetic training dataset was improved, which helps the machine learning algorithm obtain better performance.

4 Conclusion

Due to increasing antibiotic resistance, AMPs, which are key components of the innate immune system, are becoming more and more important in drug development. Efficiently and effectively identifying AMPs and their functional types has become an urgent research topic. The results reported in this study indicate that the novel predictor, MLAMP, provides an accurate and useful tool for researchers seeking new infection therapeutics.

MLAMP obtained better prediction performance than a previous method. The primary reason is the feature extraction scheme used to formulate the peptides. Since AMPs usually have 5–50 amino acids, our model (Eq. 13) is well suited to formulating short peptides: it captures the internal relationships along the amino acid sequence in terms of various physicochemical properties. The second reason is the ML-SMOTE model, which handles the imbalance problem in multi-label datasets well. Compared with other methods, the samples synthesized by ML-SMOTE retain the multi-label distribution: the model not only adds minority samples but also preserves the label density of the MLD. In the future, the ML-SMOTE model can be extended to imbalanced multi-label datasets in other problems.

For practical applications, a user-friendly web-server for MLAMP has been established at http://www.jci-bioinfo.cn/MLAMP, which allows users to easily obtain their desired results without the need to follow the complicated mathematical equations involved in developing the predictor. Users can submit a peptide sequence to the web-server, which returns the predicted result in real time. Alternatively, users can choose batch prediction by entering their e-mail address and uploading a batch input file of many peptide sequences. They will then receive an e-mail with the predicted results within seconds to hours, depending on the number of sequences.

Funding

This work has been supported by the National Natural Science Foundation of China (No. 61462047, 31560316) and by the US National Institutes of Health (GM100701).