ABSTRACT: Bioinformatics has to deal with exponentially growing sets of highly interrelated and rapidly evolving types of data that are heavily used by humans and computers. A bioinformatics language should therefore offer power and scalability at run time and should be based on a flexible and expressive model. The relatedness of proteins is of extreme importance to modern biologists, as proteins with similar structure often have similar function. The prediction of the secondary, tertiary, and quaternary structures of proteins remains a daunting problem which has been attacked by numerous methods. One method is to classify the protein into a family based on sequence, shape, or other features and assume that it has a similar function to the other members of that family. The general problem of machine learning is to search a usually very large space of potential hypotheses to determine the one that will best fit the data and any prior knowledge. The data may be labeled or unlabeled. If labels are given, the problem is one of supervised learning, in that the true answer is known for a given set of data. If labels are not given, the problem is one of unsupervised learning, and the aim is to characterize the structure of the data, e.g. by identifying groups of examples in the data that are collectively similar to each other and distinct from the other data. There are many computational tools to achieve this classification; among them are the supervised learning methods of k-nearest neighbors, neural nets, decision trees, and support vector machines. We shall compare the performance of these four algorithms in the task of classifying protein domain profiles into the appropriate family. Our preliminary results indicate that the empirically optimal versions of each algorithm return similar results.

1. INTRODUCTION

Bioinformatics is the art and science of electronically representing and integrating biomedical information in a way that makes it accessible and usable across the various fields of biological research. Worldwide research in molecular biology is producing large amounts of data that stand in need of computerized analysis. The data include not only sequence data of DNA and proteins, but also the three-dimensional structures of some proteins, simultaneous levels of activity of large numbers of genes, and data on the binding affinities of many antibodies to many substances. Protein structure, and hence function, is of extreme scientific, medical, and economic importance. With a few exceptions, proteins are the physical and chemical machines at the heart of living cells, and the prediction of these structures remains a challenging problem. Artificial intelligence techniques, especially supervised learning ones, are well suited for this problem, as long-range dependencies and extremely subtle nuances are involved in a protein forming its three-dimensional structure. Some AI techniques, such as support vector machines and neural nets, may be able to discern these patterns.

This project aims to analyze the performance of four different supervised learning algorithms. The algorithms were trained and tested using a set of protein sequences that were profiled by their amino acid composition. The protein sequences were obtained from the SCOP database. Three of the four algorithms, decision trees, neural networks, and k-nearest neighbors, were implemented using the XLMiner software package, while the SVMTorch package was used for the fourth algorithm, support vector machines. Evaluation was done by cross validation and by computing a ROC curve for the predicted classes given by the classifiers.

2. METHODS

The problem is to test how four different machine learning techniques classify protein domains based on a profile. The profile is a vector of attributes, where the attributes may be, but are certainly not limited to, amino acid composition, mass, volume, charge, hydrophobicity, etc. Cross validation of the learning was implemented in a 5-fold fashion, whereby 80% of the data was used for training and 20% for testing, five separate times on different partitions. A brief discussion of the four algorithms follows.
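The 5-fold scheme described above can be sketched as follows. This is an illustrative sketch only: the study performed cross validation inside XLMiner and SVMTorch, and `five_fold_splits` is a hypothetical helper, not part of either package.

```python
# Sketch of 5-fold cross validation: each fifth of the data serves once
# as the 20% test partition while the remaining 80% is used for training.

def five_fold_splits(n_examples, k=5):
    """Yield (train_indices, test_indices) once per fold."""
    fold_size = n_examples // k
    for fold in range(k):
        test = list(range(fold * fold_size, (fold + 1) * fold_size))
        test_set = set(test)
        train = [i for i in range(n_examples) if i not in test_set]
        yield train, test

# With the 1250 domains used here, each test partition holds 250 domains.
for train, test in five_fold_splits(1250):
    assert len(train) == 1000 and len(test) == 250
```

In a real run the indices would also be shuffled before partitioning so that each fold mixes all families.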

2.1. Decision Trees

Decision trees are generally preferred over other nonparametric techniques because of the readability of their learned hypotheses and the efficiency of training and evaluation. Altering parameters such as those for pruning and cutoff may help to build smaller, quicker trees that are just as robust as a full decision tree.
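The cutoff idea above can be sketched with a toy tree builder. This is a hypothetical, much simplified stand-in for XLMiner's algorithm (it splits on the first feature that separates the data rather than an information-gain criterion): growth stops once a node holds fewer than `min_node_size` examples, trading a little learned detail for a smaller, quicker tree.

```python
# Minimal decision-tree sketch with an "early cutoff" option:
# a node with fewer than min_node_size examples becomes a majority leaf.
from collections import Counter

def build_tree(rows, labels, min_node_size=1):
    # rows: list of feature dicts; labels: parallel list of class strings
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or len(rows) < min_node_size:
        return majority                      # pure node, or early cutoff
    for feat in rows[0]:                     # toy heuristic: first useful split
        values = {r[feat] for r in rows}
        if len(values) > 1:
            branches = {}
            for v in values:
                sub = [(r, l) for r, l in zip(rows, labels) if r[feat] == v]
                branches[v] = build_tree([r for r, _ in sub],
                                         [l for _, l in sub], min_node_size)
            return (feat, branches, majority)
    return majority                          # no feature splits the data

def classify(tree, row):
    while not isinstance(tree, str):         # internal nodes are tuples
        feat, branches, majority = tree
        tree = branches.get(row.get(feat), majority)
    return tree
```

A large `min_node_size` collapses the whole tree to a single majority-class leaf, which is the extreme case of the space/time trade-off discussed above.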

2.2. Neural Networks

Inspired by the densely interconnected, parallel structure of the mammalian brain, neural nets are built from the fundamental perceptron unit, first invented by Rosenblatt. There are a multitude of parameters that may be changed in neural nets. The number of input, hidden, and output nodes is variable. The number of hidden layers or the activation function can be changed. The step value associated with gradient descent can also be customized. There are also different architectures and learning algorithms, such as feed-forward networks, back propagation, and counter propagation. So, depending on the nature of the data set, suitable parameters may be selected to get good performance.
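Rosenblatt's perceptron unit mentioned above can be sketched in a few lines. This is illustrative only (the experiments used XLMiner's multi-layer networks); the classic perceptron learning rule shown here converges to separating weights whenever the data are linearly separable, which is the "perceptron convergence" behaviour discussed in the results.

```python
# Rosenblatt perceptron: a single threshold unit trained by the
# mistake-driven perceptron learning rule.

def train_perceptron(samples, targets, lr=0.1, epochs=100):
    """samples: list of feature lists; targets: 0/1 labels."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, t in zip(samples, targets):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + b
            y = 1 if activation > 0 else 0
            if y != t:                       # update weights only on mistakes
                errors += 1
                w = [wi + lr * (t - y) * xi for wi, xi in zip(w, x)]
                b += lr * (t - y)
        if errors == 0:                      # converged: all examples correct
            break
    return w, b

# Learns logical AND, a linearly separable toy problem.
X = [[0, 0], [0, 1], [1, 0], [1, 1]]
T = [0, 0, 0, 1]
w, b = train_perceptron(X, T)
```

Stacking layers of such units and training by back propagation gives the multi-layer networks used in this study.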

2.3. K-Nearest Neighbors

A very intuitive method, k-nearest neighbors simply memorizes the training set; then, given a test data point, it calculates the distance from that point to every point in the training set in the n-dimensional space. It then combines the classifications of the closest k neighbor(s), sometimes as a weighted average. There are different ways to classify when there is more than one neighbor.
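The procedure above amounts to a distance sort followed by a majority vote, as in this sketch (illustrative; the study used XLMiner's kNN, and unweighted voting is assumed here):

```python
# Minimal k-nearest neighbors: memorize the training set, rank it by
# Euclidean distance to the query, and vote among the k closest points.
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_classify(train_points, train_labels, query, k=3):
    neighbors = sorted(zip(train_points, train_labels),
                       key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]        # majority class among the k
```

A distance-weighted variant would replace the plain vote counts with weights such as 1/distance, one of the alternative voting schemes alluded to above.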

2.4. Support Vector Machines (SVM)

Support vector machines (SVMs) were first suggested by Vapnik in the 1960s for classification and have recently become an area of intense research owing to developments in the techniques and theory, coupled with extensions to regression and density estimation. The choice of a kernel function is the most common way to fit an SVM to a given problem, and this remains a hot research topic. SVMs separate the inputs into positive and negative examples by calculating the hypersurface in the space of possible inputs that divides the two regions and also has the largest distance from the hypersurface to the nearest of the positive and negative examples. Intuitively, this makes the classification correct for testing data that is near, but not identical to, the training data.
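The largest-distance criterion above is the geometric margin. As a toy sketch (a linear kernel in 2-D is assumed; the function and data are hypothetical), for any fixed separating hyperplane w·x + b = 0 one can measure the smallest distance from the training points to it, and an SVM is the choice of w and b that maximizes this quantity:

```python
# Geometric margin of a fixed linear separator w.x + b = 0 in 2-D.
from math import hypot

def geometric_margin(w, b, points, labels):
    """Smallest signed distance to the hyperplane; labels in {-1, +1}.
    A positive result means every point is on its correct side."""
    norm = hypot(*w)
    return min(y * (w[0] * x[0] + w[1] * x[1] + b) / norm
               for x, y in zip(points, labels))

# Two separable classes straddling the line x0 + x1 = 3.
pts = [(1, 1), (0, 1), (4, 4), (3, 4)]
ys = [-1, -1, +1, +1]
margin = geometric_margin((1, 1), -3, pts, ys)
```

Training an SVM solves the optimization over all (w, b), here shown only for one candidate separator.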

3. RECEIVER OPERATING CHARACTERISTIC (ROC) CURVE

An excellent method for evaluating a classifier is to use the ROC curve or score. The ROC curve of a classifier shows its performance as a trade-off between selectivity and sensitivity. Typically it is a plot of false positive rate versus true positive rate. The area under the ROC curve is a convenient way of comparing classifiers. A random classifier has an area of 0.5, while an ideal one has an area of 1.

3.1. An ROC curve demonstrates several things:

1. It shows the tradeoff between sensitivity and specificity (any increase in sensitivity will be accompanied by a decrease in specificity).

2. The closer the curve follows the left-hand border and then the top border of the ROC space, the more accurate the test.

3. The closer the curve comes to the 45-degree diagonal of the ROC space, the less accurate the test.

4. The slope of the tangent line at a cutpoint gives the likelihood ratio (LR) for that value of the test.
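The curve itself is traced by sweeping a decision threshold over the classifier's scores and recording the (false positive rate, true positive rate) pair at each step, as in this sketch (an illustrative helper, not part of any package used in the study):

```python
# Compute the points of an ROC curve by threshold sweeping.

def roc_points(scores, labels):
    """labels: 1 = positive, 0 = negative; higher score = more positive.
    Returns (fpr, tpr) pairs, one per distinct threshold."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 0)
        points.append((fp / neg, tp / pos))
    return points
```

A perfect ranker hugs the left and top borders (points with fpr = 0 until tpr reaches 1), matching item 2 of the list above.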

The graph above shows three ROCcurves representing excellent, good, andworthless tests plotted on the same graph.The accuracy of the test depends on howwell the test separates the group being testedinto those with and without the disease inquestion. Accuracy is measured by the areaunder the ROC curve. An area of 1represents a perfect test; an area of .5represents a worthless test. A rough guidefor classifying the accuracy of a diagnostictest is the traditional academic point system:



.90-1 = excellent (A)
.80-.90 = good (B)
.70-.80 = fair (C)
.60-.70 = poor (D)
.50-.60 = fail (F)


ROC curves can also be constructed from clinical prediction rules. The graph above comes from a study of how clinical findings predict strep throat (Wigton RS, Connor JL, Centor RM. Transportability of a decision rule for the diagnosis of streptococcal pharyngitis. Arch Intern Med. 1986;146:81-83.). In that study, the presence of tonsillar exudate, fever, and adenopathy, and the absence of cough, all predicted strep. The curves were constructed by computing the sensitivity and specificity of increasing numbers of clinical findings (from 0 to 4) in predicting strep. The study compared patients in Virginia and Nebraska and found that the rule performed more accurately in Virginia (area under the curve = .78) compared to Nebraska (area under the curve = .73). These differences turn out not to be statistically different, however. At this point, you may be wondering what this area number really means and how it is computed. The area measures discrimination, that is, the ability of the test to correctly classify those with and without the disease. Consider the situation in which patients are already correctly classified into two groups. You randomly pick one from the disease group and one from the no-disease group and do the test on both. The patient with the more abnormal test result should be the one from the disease group. The area under the curve is the percentage of randomly drawn pairs for which this is true (that is, the test correctly classifies the two patients in the random pair). Computing the area is more difficult to explain. Two methods are commonly used: a non-parametric method based on constructing trapezoids under the curve as an approximation of the area, and a parametric method using a maximum likelihood estimator to fit a smooth curve to the data points. Both methods are available as computer programs and give an estimate of area and standard error that can be used to compare different tests or the same test in different patient populations.
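The two views of the area described above, the trapezoidal approximation and the randomly-drawn-pairs probability, can be checked against each other with a short sketch (illustrative helper functions; ties are counted as half a correct pair, a common convention that is an assumption here):

```python
# Two equivalent ways to compute AUC.

def auc_trapezoid(points):
    """Trapezoidal area under (fpr, tpr) points, assumed sorted by fpr
    and spanning from (0, 0) to (1, 1)."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

def auc_pairs(scores, labels):
    """Fraction of (positive, negative) pairs the scores rank correctly."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

On the same scored data both routes give the same number, which is why the pair-ranking story is a legitimate reading of the geometric area.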

4. DATA COLLECTION AND DATA PREPROCESSING

Domain families and individual domain sequences were obtained from SCOP (http://scop.berkeley.edu/) and the corresponding ASTRAL compendium of the genetic domain sequences found at http://astral.stanford.edu/scopseq-1.61.html. It should be noted that SCOP is a database of protein families in which the families are constructed by human experts. There is some computer assistance in the initial phases of classification, such as that of class, fold, etc., but humans ultimately place each domain into a single family. The sequences obtained were those where the E-value was greater than or equal to 10^-25. This yielded 6024 protein domain sequences. However, many of these sequences belong to domain families that have only a few members. This can severely complicate machine learning, so an arbitrary cutoff value of 24 members was set. This gave a total of 31 families, which constitute the data set for the learning algorithms. Figure 1 lists the protein domain families that constituted the selected 1250 protein domains. (Note: For clarity, some of the family names have been shortened, and the fold and superfamily designations have been omitted.)

Each individual domain was then profiled according to amino acid percentage composition, which yielded a 20-dimensional vector for each domain. Then, using 5-fold cross validation, the supervised learning algorithms were trained on 80% of the data and tested on the remaining 20%. Apart from comparing the performance of the algorithms against each other, each algorithm was executed with different sets of conditions/parameters to determine which parameters contributed to its performance.


SCOP designation   Class                  Family
a.3.1.1            All alpha              monodomain cytochrome
a.35.1.5           All alpha              Bacterial repressors
a.39.1.5           All alpha              Calmodulin-like
a.4.1.1            All alpha              Homeodomain
a.43.1.1           All alpha              Phage repressors
a.53.1.1           All alpha              p53 tetramerization domain
b.1.1.1            All beta               V set domains (antibody variable domain-like)
b.1.1.2            All beta               C1 set domains (antibody constant domain-like)
b.1.1.4            All beta               I set domains
b.1.1.5            All beta               E set domains
b.1.2.1            All beta               Fibronectin type III
b.34.2.1           All beta               SH3-domain
b.34.3.1           All beta               Myosin S1 fragment, N-terminal domain
b.71.1.1           All beta               alpha-Amylases, C-terminal beta-sheet domain
c.2.1.2            Alpha and beta (a/b)   Tyrosine-dependent oxidoreductases
c.3.1.5            Alpha and beta (a/b)   FAD/NAD-linked reductases, N-terminal and central domains
d.9.1.1            Alpha and beta (a+b)   Interleukin 8-like chemokines
f.2.1.2            Membrane proteins      Photosynthetic reaction centre, L-, M- and H-chains
f.2.1.3            Membrane proteins      Cytochrome c oxidase-like
f.2.1.8            Membrane proteins      Cytochrome bc1 transmembrane subunits
g.1.1.1            Small proteins         Insulin-like
g.24.1.1           Small proteins         TNF receptor-like
g.3.1.1            Small proteins         Hevein-like agglutinin (lectin) domain
g.3.11.1           Small proteins         EGF-type module
g.3.7.2            Small proteins         Short-chain scorpion toxins
g.37.1.1           Small proteins         Classic zinc finger, C2H2

Figure 1. SCOP protein domain families used for profiling and classification.

The following is a brief description of the parameters selected and an analysis of how the algorithms performed:

5. RESULTS

5.1. K-Nearest Neighbors:

The algorithm was run for K values of 1, 5 & 10. Performance was good for both K=1 and K=5, but with a K value of 10 it decreased. High K values increase computation time, but may yield good results if the dataset is complex; hence the nature of the data set should be considered when selecting the K value.

Figure 2. Number of families vs. ROC plot for k-nearest neighbors with K values 1, 5 & 10.


5.2. Decision Trees:

The first set of experiments was to build the complete tree with the given training set and observe its performance; the second and third sets of experiments depict the performance of the algorithm with the pruning and early cutoff options. The performance was better when the tree was allowed to grow completely; this is because more of the information that is learned is encoded in the tree. Early cutoff, i.e. stopping training when the number of instances at any node reached a minimum number, seemed to be a better option than pruning if time and/or space complexity is an issue. However, in this case it wasn't.

Figure 3. Number of families vs. ROC plot for decision trees with standard, pruning, and cutoff options.

5.3. Neural Networks:

Five different sets of experiments were performed with neural nets. The choice of parameters was largely dependent on the nature of our data set. We found that the number of hidden layers and the number of nodes in them had a great effect on the performance of the neural nets. With 20 to 25 nodes the performance was almost the same, but with 19 nodes the performance was drastically reduced. More than one hidden layer didn't improve performance at all and in fact even decreased it; multiple hidden layers are required when there are higher-order relationships in the dataset. The selection of an activation function is equally important; we got the second-best performance when a symmetric function was selected. One of the features that drew our attention was perceptron convergence: the performance improved as the learning rule converged to weights that produced correct output.

Figure 4. Number of families vs. ROC plot for neural networks.

5.4. Support Vector Machines:

Kernel selection is probably one of the most important steps in getting good performance. The algorithm was run with three different kernel functions. An SVM with a radial basis kernel with a parameter value of 0.6 was found to give the best performance; a polynomial kernel of degree two also performed well.
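The two best-performing kernels above can be written out explicitly. Treating the "value 0.6" as the RBF width parameter sigma is an assumption on our part; SVMTorch's exact parameterization of the radial basis kernel may differ.

```python
# The kernel functions compared in the SVM experiments, sketched
# as plain similarity functions between attribute vectors.
from math import exp

def rbf_kernel(x, z, sigma=0.6):
    """Radial basis (Gaussian) kernel; sigma = 0.6 assumed from the text."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return exp(-sq_dist / (2 * sigma ** 2))

def polynomial_kernel(x, z, degree=2, c=1.0):
    """Polynomial kernel of the stated degree two (offset c is assumed)."""
    dot = sum(xi * zi for xi, zi in zip(x, z))
    return (dot + c) ** degree
```

Swapping the kernel changes the implicit feature space in which the SVM seeks its maximum-margin separator, which is why this single choice dominates performance.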


Figure 5. Number of families vs. ROC plot for SVM.

5.5. Comparison of Algorithms

Figure 6. Comparison ROC plot of SVM (yellow), NN (pink), KNN (blue) and decision trees (red).

6. BIOLOGICAL NOTE

The protein families which gave the worst results were 7, 8, 9, and 10. These were domains in the immunoglobulin superfamily, and immunoglobulins are known for their high sequence variability and hence variable amino acid composition. The immunoglobulin amino acid sequences may at times appear almost random, and this is because they are: combinatorial rearrangement is what enables the immune system to defend against an almost limitless number of chemical structures. Neural networks performed surprisingly well with one hidden layer of 20 nodes and perceptron convergence. Overall, the algorithms performed surprisingly similarly, and this may have been caused by the dataset. More domain families and/or more attributes should give greater robustness to the dataset.

7. CONCLUSIONS & FUTURE WORK

In the k-nearest neighbor technique, high K values increase computation time but may yield good results if the dataset is complex. K values of 1, 5, and 10 were considered in the ROC plot, of which K=10 showed a decreased sensitivity, since the data set is not complex. So the nature of the data set should be considered when selecting the K value.

In decision trees, the performance was better when the tree was allowed to grow completely; this is because more of the information that is learned is encoded in the tree. Early cutoff, i.e. stopping training when the number of instances at any node reached a minimum number, seemed to be a better option than pruning if time and/or space complexity is an issue. In neural nets, the number of hidden layers and the number of nodes in them had a great effect on performance. One of the features that drew our attention was perceptron convergence: the performance improved as the learning rule converged to weights that produced correct output. In SVMs, kernel selection is probably one of the most important steps in getting good performance. The algorithm was run with three different kernel functions; an SVM with a radial basis kernel with a parameter value of 0.6 was found to give the best performance. Too few protein domain families were subjected to classification; hence, adding features to the attribute vector such as length, mass, area,