A Taxonomy and Short Review of Ensemble Selection

Grigorios Tsoumakas, Ioannis Partalas and Ioannis Vlahavas
Dept. of Informatics, Aristotle University of Thessaloniki, Thessaloniki 54124, Greece, email: {greg,partalas,vlahavas}@csd.auth.gr

Abstract. Ensemble selection deals with the reduction of an ensemble of predictive models in order to improve its efficiency and predictive performance. Over the last 10 years a large number of very diverse ensemble selection methods have been proposed. In this paper we make a first approach to categorizing them into a taxonomy. We also present a short review of some of these methods. We particularly focus on a category of methods that are based on a greedy search of the space of all possible ensemble subsets. Such methods use different directions for searching this space and different measures for evaluating the available actions at each state. Some use the training set for subset evaluation, while others use a separate validation set. This paper abstracts the key points of these methods and offers a general framework of the greedy ensemble selection algorithm, discussing its important parameters and the different options for instantiating these parameters.

1 Introduction

Ensemble methods [5] have been a very popular research topic during the last decade. They have attracted scientists from several fields, including Statistics, Machine Learning, Pattern Recognition and Knowledge Discovery in Databases. Their popularity arises largely from the fact that they offer an appealing solution to several interesting learning problems of the past and the present.

First of all, ensembles lead to improved accuracy compared to a single classification or regression model. This was the main motivation that led to the development of the ensemble methods area. Ensembles achieve higher accuracy than individual models mainly through the correction of their uncorrelated errors. Secondly, ensembles solve the problem of scaling inductive algorithms to large databases. Most inductive algorithms are too computationally complex and suffer from memory problems when applied to very large databases. A solution to this problem is to horizontally partition the database into smaller parts, train a predictive model on each of the smaller manageable parts and combine the predictive models. Thirdly, ensembles can learn from multiple physically distributed data sets. Often such data cannot be collected at a single site due to privacy or size reasons. This problem can be overcome through the combination of multiple predictive models, each trained on a different distributed data set. Finally, ensembles are useful for learning from concept-drifting data streams. The main idea here is to maintain an ensemble of classifiers that are trained on different batches of the data stream. Combining these classifiers with a proper methodology can solve the problem of data expiration that occurs when the learning concept drifts.

Typically, ensemble methods comprise two phases: the production of multiple predictive models and their combination. Recent work [13, 9, 12, 7, 21, 4, 14, 1, 16, 24, 17] has considered an additional intermediate phase that deals with the reduction of the ensemble size prior to combination. This phase is commonly called ensemble pruning, selective ensemble, ensemble thinning or ensemble selection; we use the last term within this paper.

Ensemble selection is important for two reasons: efficiency and predictive performance. Having a very large number of models in an ensemble adds a lot of computational overhead. For example, decision tree models may have large memory requirements [13] and lazy learning methods have a considerable computational cost during execution. The minimization of run-time overhead is crucial in certain applications, such as stream mining. Equally important is the second reason, predictive performance. An ensemble may consist of both high and low predictive performance models. The latter may negatively affect the overall performance of the ensemble. Pruning these models while maintaining a high diversity among the remaining members of the ensemble is typically considered a proper recipe for an effective ensemble.

Over the last 10 years a large number of very diverse ensemble selection methods have been proposed. In this paper we make a first approach to categorizing them into a taxonomy. We hope that community feedback will help fine-tune this taxonomy and shape it into a proper starting place for researchers designing new methods. In addition, we delve a little deeper into a specific category in this taxonomy: greedy search-based methods.

A number of ensemble selection methods that are based on a greedy search of the space of all possible ensemble subsets have recently been proposed [13, 7, 4, 14, 1]. They use different directions for searching this space and different measures for evaluating the available actions at each state. Some use the training set for subset evaluation, while others use a separate validation set. In this paper we attempt to highlight the salient parameters of greedy ensemble selection algorithms, offer a critical discussion of the different options for instantiating these parameters and mention the particular choices of existing approaches. The paper steers clear of a mere enumeration of particular approaches in the related literature by generalizing their key aspects and providing comments, categorizations and complexity analysis wherever possible.

The remainder of this paper is structured as follows. Section 2 contains background material on ensemble production and combination. Section 3 presents the proposed taxonomy, including a short account of methods in each category. The category of clustering-based methods is discussed in greater detail, from a more critical point of view. Section 4 discusses extensively the category of greedy search-based ensemble selection algorithms. Finally, Section 5 concludes this work.
2 Background

This section provides background material on ensemble methods. More specifically, it presents the different ways of producing the models of an ensemble, as well as different methods for combining the decisions of the models.

2.1 Producing the Models

An ensemble can be composed of either homogeneous or heterogeneous models. Homogeneous models derive from different executions of the same learning algorithm. Such models can be produced by using different values for the parameters of the learning algorithm, by injecting randomness into the learning algorithm, or through the manipulation of the training instances, the input attributes and the model outputs [6]. Popular methods for producing homogeneous models are bagging [2] and boosting [18].

Heterogeneous models derive from running different learning algorithms on the same data set. Such models have different views about the data, as they make different assumptions about it. For example, a neural network is robust to noise, in contrast with a k-nearest neighbor classifier.

2.2 Combining the Models

Common methods for combining an ensemble of predictive models include voting, stacked generalization and mixture of experts.

In voting, each model outputs a class value (or ranking, or probability distribution) and the class with the most votes is the one proposed by the ensemble. When the class with the maximum number of votes is the winner, the rule is called plurality voting, and when the class with more than half of the votes is the winner, the rule is called majority voting. A variant of voting is weighted voting, where the models are not treated equally, as each of them is associated with a coefficient (weight), usually proportional to its classification accuracy.

Let x be an instance and m_i, i = 1..k, a set of models that output a probability distribution m_i(x, c_j) for each class c_j, j = 1..n. The output of the (weighted) voting method y(x) for instance x is given by the following mathematical expression:

y(x) = \arg\max_{c_j} \sum_{i=1}^{k} w_i m_i(x, c_j),

where w_i is the weight of model i. In the simple case of (unweighted) voting, the weights are all equal to one, that is, w_i = 1, i = 1..k.
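To make the combination rule concrete, here is a minimal sketch of weighted voting in Python. The array shapes, the example numbers and the weights are our own illustrative choices, not taken from the paper.

import numpy as np

def weighted_vote(probs, weights=None):
    """Combine probability distributions by (weighted) voting.

    probs: array of shape (k, n), one distribution over the n classes per model.
    weights: array of shape (k,); all ones recovers simple (unweighted) voting.
    Returns the winning class index, arg max_j sum_i w_i * m_i(x, c_j).
    """
    probs = np.asarray(probs, dtype=float)
    if weights is None:
        weights = np.ones(probs.shape[0])  # unweighted voting: w_i = 1
    scores = weights @ probs               # sum_i w_i * m_i(x, c_j) per class j
    return int(np.argmax(scores))

# Example: three models, two classes, with the first model weighted more heavily.
print(weighted_vote([[0.6, 0.4], [0.4, 0.6], [0.3, 0.7]], weights=[2.0, 1.0, 1.0]))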
Stacked generalization [23], also known as stacking, is a method that combines models by learning a meta-level (or level-1) model that predicts the correct class based on the decisions of the base-level (or level-0) models. This model is induced on a set of meta-level training data that are typically produced by applying a procedure similar to k-fold cross-validation on the training data. The outputs of the base-learners for each instance, along with the true class of that instance, form a meta-instance. A meta-classifier is then trained on the meta-instances. When a new instance appears for classification, the output of all the base-learners is first calculated and then propagated to the meta-classifier, which outputs the final result.
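A minimal sketch of this procedure using scikit-learn follows. The data set, the choice of base learners and the use of logistic regression as the meta-classifier are illustrative assumptions on our part, not choices made in the paper.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=1)
base_learners = [DecisionTreeClassifier(random_state=1), KNeighborsClassifier()]

# Level 0: build meta-instances from cross-validated predictions, so that the
# meta-classifier never sees predictions made on a model's own training data.
meta_X = np.column_stack([cross_val_predict(h, X, y, cv=5) for h in base_learners])

# Level 1: the meta-classifier learns to map base-level decisions to the true class.
meta_clf = LogisticRegression().fit(meta_X, y)

# At prediction time, the base learners (refit on all data) feed the meta-classifier.
fitted = [h.fit(X, y) for h in base_learners]
x_new = X[:1]
meta_features = np.column_stack([h.predict(x_new) for h in fitted])
print(meta_clf.predict(meta_features))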
The mixture of experts architecture [10] is similar to the weighted voting method, except that the weights are not constant over the input space. Instead, there is a gating network which takes an instance as input and outputs the weights that will be used in the weighted voting method for that specific instance. Each expert makes a decision and the output is averaged as in the method of voting.

3 A Taxonomy of Ensemble Selection Algorithms

We propose the organization of the various ensemble selection methods into the following categories: a) search-based, b) clustering-based, c) ranking-based and d) other.

3.1 Search-Based Methods

The most direct approach for pruning an ensemble of predictive models is to perform a heuristic search in the space of the possible different model subsets, guided by some metric for the evaluation of each candidate subset. We further divide this category into two subcategories, based on the search paradigm: a) greedy search and b) stochastic search. The former is among the most popular categories of ensemble pruning algorithms and is investigated in depth in Section 4. Stochastic search allows randomness in the selection of the next candidate subset and can thus avoid getting stuck in local optima.

3.1.1 Stochastic Search

Gasen-b [25] performs stochastic search in the space of model subsets using a standard genetic algorithm. The ensemble is represented as a bit string, using one bit for each model. Models are included in or excluded from the ensemble depending on the value of the corresponding bit. Gasen-b performs standard genetic operations, such as mutations and crossovers, and uses default values for the parameters of the genetic algorithm. The performance of the ensemble is used as a function for evaluating the fitness of individuals in the population.

Partalas et al. [16] search the space of model subsets using a reinforcement learning approach. We categorize this approach among the stochastic search algorithms, as the exploration of the state space includes a (progressively reducing) stochastic element. The problem of pruning an ensemble of n classifiers is transformed into the reinforcement learning task of letting an agent learn an optimal policy of taking n actions in order to include or exclude each classifier from the ensemble. The method uses the Q-learning [22] algorithm to approximate an optimal policy.
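To illustrate the bit-string representation that Gasen-b searches over, here is a small self-contained sketch of a genetic search over model subsets. The population size, mutation rate, selection scheme and the toy fitness function are illustrative assumptions, not the actual defaults of Gasen-b.

import random

random.seed(1)
T = 10                                                # one bit per model
acc = [random.uniform(0.6, 0.9) for _ in range(T)]    # toy per-model accuracies

def fitness(bits):
    # Stand-in for the subensemble's evaluated performance: reward accurate
    # members, lightly penalize size. A real implementation would measure the
    # combined ensemble's accuracy on evaluation data instead.
    if not any(bits):
        return 0.0
    return sum(a for a, b in zip(acc, bits) if b) / sum(bits) - 0.01 * sum(bits)

def crossover(a, b):
    cut = random.randrange(1, T)                      # one-point crossover
    return a[:cut] + b[cut:]

def mutate(bits, rate=0.1):
    return [b ^ (random.random() < rate) for b in bits]  # flip bits at random

population = [[random.randint(0, 1) for _ in range(T)] for _ in range(20)]
for _ in range(30):                                   # evolve for 30 generations
    parents = sorted(population, key=fitness, reverse=True)[:10]
    population = [mutate(crossover(random.choice(parents), random.choice(parents)))
                  for _ in range(20)]
best = max(population, key=fitness)
print("selected models:", [t for t, bit in enumerate(best) if bit])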
3.2 Clustering-Based Methods

The methods of this category comprise two stages. Firstly, they employ a clustering algorithm in order to discover groups of models that make similar predictions. Subsequently, each cluster is separately pruned in order to increase the overall diversity of the ensemble.

3.2.1 Giacinto et al., 2000

Giacinto et al. [9] employ Hierarchical Agglomerative Clustering (HAC) for classifier pruning. This type of clustering requires the definition of a distance metric between two data points (here, classifiers). The authors defined this metric as the probability that the classifiers do not make coincident errors, and estimated it from a validation set in order to avoid overfitting problems. The authors also defined the distance between two clusters as the maximum distance between two classifiers belonging to these clusters. This way, they implicitly used the complete-link method for inter-cluster distance computation. Pruning is accomplished by selecting a single representative classifier from each cluster. The representative classifier is the one exhibiting the maximum average distance from all other clusters.

HAC returns a hierarchy of different clustering results, starting from as many clusters as the data points and ending at a single cluster containing all data points. This raises the problem of how to choose the best clustering from this hierarchy. The authors solve this problem as follows: for each clustering result, they evaluate the performance of the pruned ensemble on a validation set using majority voting as the combination method. The final pruned ensemble is the one that achieves the highest classification accuracy.

They experimented on a single data set, using heterogeneous classifiers derived by running different learning algorithms with different configurations. They compared their approach with overproduce-and-choose strategies and found that their approach exhibits better classification accuracy.

This approach is generally guided by the notion of diversity. Diversity guides both the clustering process and the subsequent pruning process. However, the authors use the classification accuracy with a specific combination method (majority voting) to select among the different clustering results. This reduces the generality of the method, as the selection is optimized towards majority voting. Of course, this could be easily alleviated by using at that stage the method that will later be used for combining the ensemble.

In addition, the authors used a specific distance metric to guide the clustering process, while it would be interesting to evaluate the performance of other pairwise diversity metrics, like the ones proposed by Kuncheva [11]. Moreover, their experimental results on a single data set are too limited to establish the general utility of their method.
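The first stage of this method can be sketched as follows: build a pairwise distance matrix from the classifiers' validation-set correctness and feed it to a complete-link agglomerative clustering. The random 0/1 correctness matrix stands in for real classifier outputs, and the use of SciPy with a fixed cut into three clusters is our simplification; the actual method evaluates every level of the hierarchy.

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

rng = np.random.default_rng(0)
correct = rng.integers(0, 2, size=(8, 100))  # 8 classifiers; 1 = correct on instance i

# Distance between two classifiers: estimated probability that they do NOT
# make coincident errors, i.e. 1 - P(both wrong on the same instance).
both_wrong = ((correct[:, None, :] == 0) & (correct[None, :, :] == 0)).mean(axis=2)
distance = 1.0 - both_wrong
np.fill_diagonal(distance, 0.0)              # squareform expects a zero diagonal

# Complete-link HAC over the condensed distance matrix, cut into 3 clusters.
labels = fcluster(linkage(squareform(distance), method="complete"),
                  t=3, criterion="maxclust")
print(labels)                                # cluster membership of each classifier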
3.2.2 Lazarevic and Obradovic, 2001

Lazarevic and Obradovic [12] use the k-means algorithm to perform the clustering of classifiers. The k-means algorithm is applied to a table of data with as many rows as the classifiers and as many columns as the instances of the training set. The table contains the predictions of each classifier on each instance. Similarly to HAC, the k-means algorithm suffers from the problem of selecting the number of clusters (k). The authors solve this problem by iteratively considering a larger number of clusters until the diversity between them starts to decrease.

Subsequently, the authors prune the classifiers of each cluster using the following approach, until the accuracy of the ensemble decreases. They consider the classifiers in turn, from the least accurate to the most accurate. A classifier is kept in the ensemble if its disagreement with the most accurate classifier is more than a predefined threshold and it is sufficiently accurate. In addition to the simple elimination of classifiers, a method for distributing their voting weights is also implemented.

They experimented on four different data sets, using neural network ensembles produced with bagging and boosting. They compare the performance of their pruning method with that of unpruned ensembles and another ad-hoc method that they propose (see other methods) and find that their clustering-based approach offers the highest classification accuracy.

Their method suffers from the problem of parameter setting: how does one set the threshold for pruning models? In addition, the method is compared neither to any other pruning methods nor on a sufficient number of data sets, so its utility cannot be determined. It is very heuristic and ad-hoc.

3.2.3 Fu, Hu and Zhao, 2005

The work of [8] is largely based on the two previous methods. Similarly to [12], it uses the k-means algorithm for clustering the models of an ensemble. Similarly to [9], it prunes each cluster by selecting the single best performing model, and uses the accuracy of the pruned ensemble to select the number of clusters.

The difference of this work from the other two clustering-based methods is merely that the experiments are performed on regression data sets. However, both previous methods could be relatively easily extended to handle the pruning of regression models. The experiments of this work are performed on four data sets, using an ensemble of neural networks produced with bagging and boosting, similarly to [12].

3.3 Ranking-Based Methods

Ranking-based methods order the classifiers in the ensemble once, according to some evaluation metric, and select the classifiers in this order. They differ mainly in the criterion used for ordering the members of the ensemble.

A key concept in Orientation Ordering [15] is the signature vector. The signature vector of a classifier c is a |D|-dimensional vector with elements taking the value +1 if c(x_i) = y_i and -1 if c(x_i) ≠ y_i. The average signature vector of all classifiers in an ensemble is called the ensemble signature vector, and is indicative of the ability of the voting ensemble combination method to correctly classify each of the training examples. The reference vector is a vector perpendicular to the ensemble signature vector that corresponds to the projection of the first-quadrant diagonal onto the hyperplane defined by the ensemble signature vector.

In Orientation Ordering, the classifiers are ordered by increasing values of the angle between their signature vector and the reference vector. Only the classifiers whose angle is less than π/2 are included in the final ensemble. Essentially, this ordering gives preference to classifiers that correctly classify those examples that are incorrectly classified by the full ensemble.
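The following sketch computes signature vectors, a reference vector and the resulting ordering on toy data. Constructing the reference vector as the component of the all-ones diagonal that is orthogonal to the ensemble signature vector is our reading of the description above, and the random correctness matrix is purely illustrative.

import numpy as np

rng = np.random.default_rng(0)
correct = rng.random((15, 200)) < 0.65   # 15 classifiers, 200 examples; True = correct

sig = np.where(correct, 1.0, -1.0)       # signature vectors, one row per classifier
ens_sig = sig.mean(axis=0)               # ensemble signature vector

# Reference vector: component of the first-quadrant diagonal (all-ones vector)
# orthogonal to the ensemble signature vector.
d = np.ones_like(ens_sig)
ref = d - (d @ ens_sig) / (ens_sig @ ens_sig) * ens_sig

# Order classifiers by increasing angle to the reference vector; keep those with
# angle < pi/2, i.e. with a positive projection onto the reference vector.
cos = (sig @ ref) / (np.linalg.norm(sig, axis=1) * np.linalg.norm(ref))
order = np.argsort(np.arccos(np.clip(cos, -1.0, 1.0)))
selected = [int(t) for t in order if cos[t] > 0]
print("selected classifiers, in order:", selected)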
3.4 Other Methods

This category includes two approaches that do not belong to any of the previous categories. The first one is based on statistical procedures for directly selecting a subset of classifiers, while the second is based on semi-definite programming.

Tsoumakas et al. [21, 20] prune an ensemble of heterogeneous classifiers using statistical procedures that determine whether the differences in predictive performance among the classifiers of the ensemble are significant. Only the classifiers with significantly better performance than the rest are retained and subsequently combined with the methods of (weighted) voting. The obtained results are better than those of state-of-the-art ensemble methods.

Zhang et al. [24] formulate ensemble pruning as a mathematical problem and apply semi-definite programming (SDP) techniques. Specifically, the authors initially formulated the ensemble pruning problem as a quadratic integer programming problem that looks for a fixed-size subset of k classifiers with minimum misclassification and maximum divergence. They subsequently found that this quadratic integer programming problem is similar to the "max cut with size k" problem, which can be approximately solved using an algorithm based on SDP. Their algorithm requires the number of classifiers to retain as a parameter and runs in polynomial time.

4 Greedy Ensemble Selection

Greedy ensemble selection algorithms attempt to find the globally best subset of classifiers by taking local greedy decisions for changing the current subset. An example of the search space for an ensemble of four models is presented in Figure 1.

Figure 1. An example of the search space of greedy ensemble selection algorithms for an ensemble of four models.

In the following subsections we present and discuss what we consider to be the main aspects of greedy ensemble selection algorithms: the direction of search, the measure and dataset used for evaluating the different branches of the search, and the size of the final subensemble. The notation that will be used is the following:

- D = {(x_i, y_i), i = 1, 2, ..., N} is an evaluation set of labelled training examples, where each example consists of a feature vector x_i and a class label y_i.
- H = {h_t, t = 1, 2, ..., T} is the set of classifiers or hypotheses of an ensemble, where each classifier h_t maps an instance x to a class label y, h_t(x) = y.
- S ⊆ H is the current subensemble during the search in the space of subensembles.

4.1 Direction of Search

Based on the direction of search, there are two main categories of greedy ensemble selection algorithms: forward selection and backward elimination.

In forward selection, the current classifier subset S is initialized to the empty set. The algorithm continues by iteratively adding to S the classifier h_t ∈ H \ S that optimizes an evaluation function f_FS(S, h_t, D). This function evaluates the addition of classifier h_t to the current subset S based on the labelled data of D. For example, f_FS could return the accuracy of the ensemble S ∪ {h_t} on the data set D, combining the decisions of the classifiers with the method of voting. Algorithm 1 shows the pseudocode of the forward selection ensemble selection algorithm. In the past, this approach has been used in [7, 14, 4] and in the Reduce-Error Pruning with Backfitting (REPwB) method in [13].

Algorithm 1 The forward selection method in pseudocode
Require: Ensemble of classifiers H, evaluation function f_FS, evaluation set D
1: S = ∅
2: while S ≠ H do
3:   h_t = arg max_{h ∈ H\S} f_FS(S, h, D)
4:   S = S ∪ {h_t}
5: end while

In backward elimination, the current classifier subset S is initialized to the complete ensemble H, and the algorithm continues by iteratively removing from S the classifier h_t ∈ S that optimizes the evaluation function f_BE(S, h_t, D). This function evaluates the removal of classifier h_t from the current subset S based on the labelled data of D. For example, f_BE could return a measure of diversity for the ensemble S \ {h_t}, calculated on the data of D. Algorithm 2 shows the pseudocode of the backward elimination ensemble selection algorithm. In the past, this approach has been used in the AID thinning and concurrency thinning algorithms [1].

Algorithm 2 The backward elimination method in pseudocode
Require: Ensemble of classifiers H, evaluation function f_BE, evaluation set D
1: S = H
2: while S ≠ ∅ do
3:   h_t = arg max_{h ∈ S} f_BE(S, h, D)
4:   S = S \ {h_t}
5: end while

The time complexity of greedy ensemble selection algorithms for traversing the space of subensembles is O(T^2 g(T, N)). The term g(T, N) concerns the complexity of the evaluation function, which is linear with respect to N and ranges from constant to quadratic with respect to T, as we shall see in the following subsections.
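As a runnable counterpart to Algorithm 1, here is a short Python sketch of forward selection with a voting-accuracy evaluation function. The toy prediction matrix and the specific choice of accuracy under plurality voting as f_FS are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(2)
T, N = 8, 300
preds = rng.integers(0, 2, size=(T, N))  # preds[t, i]: label model t assigns to x_i
y = rng.integers(0, 2, size=N)           # true labels of the evaluation set D

def f_fs(subset, t):
    """Accuracy of the subensemble subset + [t] under plurality voting."""
    votes = preds[subset + [t]].mean(axis=0)      # fraction voting for class 1
    return np.mean((votes > 0.5).astype(int) == y)

S, trace = [], []
remaining = set(range(T))
while remaining:                          # Algorithm 1: grow S one model at a time
    best = max(remaining, key=lambda h: f_fs(S, h))
    remaining.remove(best)
    S.append(best)
    trace.append(f_fs(S[:-1], best))      # evaluated accuracy of the current S

# Keep the prefix of the ordering with the highest evaluated accuracy
# (one of the stopping rules discussed in Section 4.3).
k = int(np.argmax(trace)) + 1
print("selected subensemble:", S[:k])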
4.2 Evaluation Function

One of the main components of greedy ensemble selection algorithms is the function that evaluates the alternative branches during the search in the space of subensembles. Given a subensemble S and a model h_t, the evaluation function estimates the utility of inserting (deleting) h_t into (from) S using an appropriate evaluation measure, which is calculated on an evaluation dataset. Both the measure and the dataset used for evaluation are very important, as their choice affects the quality of the evaluation function and, as a result, the quality of the selected ensemble.

4.2.1 Evaluation Dataset

One approach is to use the training dataset for evaluation, as in [14]. This approach offers the benefit that plenty of data will be available for evaluation and training, but is susceptible to the danger of overfitting.

Another approach is to withhold a part of the training set for evaluation, as in [4, 1] and in the REPwB method in [13]. This approach is less prone to overfitting, but reduces the amount of data that is available for training and evaluation compared to the previous approach. It sacrifices both the predictive performance of the ensemble's members and the quantity of the evaluation data for the sake of using unseen data in the evaluation. This method should probably be preferred over the previous one when there is an abundance of training data.

An alternative approach that has been used in [3] is based on k-fold cross-validation. For each fold, an ensemble is created using the remaining folds as the training set. The same fold is used as the evaluation dataset for models and subensembles of this ensemble. Finally, the evaluations are averaged across all folds. This approach is less prone to overfitting, as the evaluation of models is based on data that were not used for their training, and at the same time the complete training dataset is used for evaluation.

During testing, the above approach works as follows: the k models that were trained using the same procedure (same algorithm, same subset, etc.) form a cross-validated model. When the cross-validated model makes a prediction for an instance, it averages the predictions of the individual models. An alternative testing strategy that we suggest for the above approach is to train an additional single model from the complete training set and use this single model during testing.
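A minimal sketch of the cross-validated-model idea follows, assuming scikit-learn decision trees as the common training procedure; the data set and the number of folds are illustrative.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=600, random_state=0)
folds = list(KFold(n_splits=5, shuffle=True, random_state=0).split(X))

# One model per fold, trained on the remaining folds; during selection, the
# held-out fold would be used to evaluate models and subensembles, and the
# evaluations would be averaged across folds.
cv_model = [DecisionTreeClassifier(random_state=0).fit(X[tr], y[tr]) for tr, _ in folds]
scores = [m.score(X[te], y[te]) for m, (_, te) in zip(cv_model, folds)]
print("evaluation averaged across folds:", np.mean(scores))

# At test time, the k sibling models form a cross-validated model whose
# prediction averages (here: majority-votes) the individual predictions.
x_new = X[:3]
votes = np.mean([m.predict(x_new) for m in cv_model], axis=0)
print("cross-validated model predicts:", (votes > 0.5).astype(int))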
4.2.2 Evaluation Measure

The evaluation measures can be grouped into two major categories: those that are based on performance and those that are based on diversity.

The goal of performance-based measures is to find the model that maximizes the performance of the ensemble produced by adding (removing) a model to (from) the current ensemble. Their calculation depends on the method used for ensemble combination, which usually is voting. Accuracy was used as an evaluation measure in [13, 7], while [4] experimented with several metrics, including accuracy, root mean squared error, mean cross-entropy, lift, precision/recall break-even point, precision/recall F-score, average precision and ROC area. Another measure is benefit, which is based on a cost model and has been used in [7].

The calculation of performance-based metrics requires the decision of the ensemble on all examples of the pruning dataset. Therefore, the complexity of these measures is O(|S|N). However, this complexity can be optimized to O(N) if the predictions of the current ensemble are updated incrementally each time a classifier is added to or removed from it.

It is generally accepted that an ensemble should contain diverse models in order to achieve high predictive performance. However, there is neither a clear definition of diversity nor a single measure to calculate it. In their interesting study [11], Kuncheva and Whitaker could not reach a solid conclusion on how to utilize diversity for the production of effective classifier ensembles. In a more recent theoretical and experimental study on diversity measures [19], the authors reached the conclusion that diversity cannot be explicitly used for guiding the process of greedy ensemble selection. Yet certain approaches have reported promising results [14, 1].

One issue that is worth mentioning here is how to calculate the diversity during the search in the space of ensemble subsets. For simplicity, we consider the case of forward selection only. Let S be the current ensemble and h_t ∈ H \ S a candidate classifier to add to the ensemble.

One could compare the diversities of the subensembles S' = S ∪ {h_t} for all candidates h_t ∈ H \ S and select the ensemble with the highest diversity. Any pairwise or non-pairwise diversity measure can be used for this purpose. The time complexity of most non-pairwise diversity measures is O(|S'|N), while that of pairwise diversity measures is O(|S'|^2 N). However, a straightforward optimization can be performed in the case of pairwise diversity measures. Instead of calculating the sum of the pairwise diversities for every pair of classifiers in each candidate ensemble S', one can simply calculate the sum of the pairwise diversities only for the pairs that include the candidate classifier h_t. The sum of the rest of the pairs is equal for all candidate ensembles. The same optimization can be achieved in backward elimination too. This reduces the time complexity of pairwise measures to O(|S|N).

Existing methods [14, 1, 19] use a different approach to calculate diversity during the search. They use pairwise measures to compare the candidate classifier h_t with the current ensemble S, which is viewed as a single classifier that combines the decisions of its members with voting. This way, they calculate the diversity between the current ensemble as a whole and the candidate classifier. Such an approach has time complexity O(|S|N), which can be optimized to O(N) if the predictions of the current ensemble are updated incrementally each time a classifier is added to or removed from it. However, these calculations do not take into account the decisions of the individual models.

In the past, the widely known diversity measures disagreement, double fault, Kohavi-Wolpert variance, inter-rater agreement, generalized diversity and difficulty were used for greedy ensemble selection in [19]. Concurrency [1], margin distance minimization [14], complementariness [14] and focused selection diversity [17] are four diversity measures designed specifically for greedy ensemble selection. We next present these measures using a common notation. We can distinguish four events concerning the decision of the current ensemble S and a candidate classifier h_t on an example (x_i, y_i):

e1: y_i = h_t(x_i) ∧ y_i ≠ S(x_i)
e2: y_i ≠ h_t(x_i) ∧ y_i = S(x_i)
e3: y_i = h_t(x_i) ∧ y_i = S(x_i)
e4: y_i ≠ h_t(x_i) ∧ y_i ≠ S(x_i)

The complementariness of a model h_t with respect to a subensemble S and a set of examples D = {(x_i, y_i), i = 1, 2, ..., N} is calculated as follows:

COM_D(h_t, S) = \sum_{i=1}^{N} I(e1),

where I(true) = 1, I(false) = 0, and S(x_i) is the classification of instance x_i by the subensemble S. This classification is derived from the application of an ensemble combination method to S, which usually is voting. The complementariness of a model with respect to a subensemble is actually the number of examples of D that are classified correctly by the model and incorrectly by the subensemble. A selection algorithm that uses the above measure tries, at each step, to add (remove) the model that helps the subensemble classify correctly the examples it gets wrong.

The concurrency of a model h_t with respect to a subensemble S and a set of examples D is calculated as follows:

CON_D(h_t, S) = \sum_{i=1}^{N} \big( 2 I(e1) + I(e3) - 2 I(e4) \big)

This measure is very similar to complementariness, with the difference that it takes two extra cases into account.

The focused ensemble selection method [17] uses all four events and also takes into account the strength of the current ensemble's decision. It is calculated as follows:

FES_D(h_t, S) = \sum_{i=1}^{N} \big( NT_i I(e1) - NF_i I(e2) + NF_i I(e3) - NT_i I(e4) \big),

where NT_i denotes the proportion of models in the current ensemble S that classify example (x_i, y_i) correctly, and NF_i = 1 - NT_i denotes the proportion of models in S that classify it incorrectly.

The margin distance minimization method [14] follows a different approach to calculating the diversity. For each classifier h_t, an N-dimensional vector c_t is defined, where each element c_t(i) is equal to 1 if the t-th classifier classifies instance i correctly, and -1 otherwise. The vector C_S of the subensemble S is the average of the individual vectors c_t, that is, C_S = \frac{1}{|S|} \sum_{t=1}^{|S|} c_t. When S classifies all the instances correctly, the corresponding vector lies in the first quadrant of the N-dimensional space. The objective is to reduce the distance d(o, C_S), where d is the Euclidean distance and o a predefined vector placed in the first quadrant. The margin MAR_D(h_t, S) of a classifier h_t with respect to a subensemble S and a set of examples D is calculated as follows:

MAR_D(h_t, S) = d\left( o, \frac{1}{|S| + 1} \big( c_t + C_S \big) \right)
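A compact sketch of the event-based measures follows, assuming 0/1 correctness arrays computed on the evaluation set and majority voting for the ensemble decision S(x_i); the random inputs are purely illustrative.

import numpy as np

def selection_measures(cand_correct, members_correct):
    """Complementariness, concurrency and focused ensemble selection scores
    for one candidate classifier h_t against the current subensemble S.

    cand_correct: (N,) boolean, candidate correct on each example.
    members_correct: (|S|, N) boolean, each member of S correct on each example.
    """
    NT = members_correct.mean(axis=0)   # proportion of S that is correct, per example
    NF = 1.0 - NT
    ens_correct = NT > 0.5              # S(x_i) correct under majority voting
    e1 = cand_correct & ~ens_correct
    e2 = ~cand_correct & ens_correct
    e3 = cand_correct & ens_correct
    e4 = ~cand_correct & ~ens_correct
    com = e1.sum()                                        # COM_D(h_t, S)
    con = (2 * e1 + e3 - 2 * e4).sum()                    # CON_D(h_t, S)
    fes = (NT * e1 - NF * e2 + NF * e3 - NT * e4).sum()   # FES_D(h_t, S)
    return com, con, fes

rng = np.random.default_rng(3)
members = rng.random((5, 100)) < 0.7
candidate = rng.random(100) < 0.7
print(selection_measures(candidate, members))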
4.3 Size of the Final Ensemble

Another issue that concerns greedy ensemble selection algorithms is when to stop the search process or, in other words, how many models the final ensemble should include.

One solution is to perform the search until all models have been added into (removed from) the ensemble, and then select the subensemble with the highest accuracy on the evaluation set. This approach has been used in [4]. Others prefer to select a predefined number of models, expressed as a percentage of the original ensemble [13, 7, 14, 1].
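Both stopping rules are easy to state in code. In this sketch, prefix_acc stands in for the evaluated accuracy of each prefix of the greedy ordering, and the 30% retention rate is an arbitrary illustrative choice.

import numpy as np

rng = np.random.default_rng(4)
T = 20                                   # models, in the order the greedy search chose them
prefix_acc = np.cumsum(rng.random(T)) / np.arange(1, T + 1)  # stand-in prefix accuracies

# Rule 1: search to the end and keep the prefix with the highest accuracy
# on the evaluation set, as in [4].
k_best = int(np.argmax(prefix_acc)) + 1

# Rule 2: keep a predefined percentage of the original ensemble [13, 7, 14, 1].
k_fixed = max(1, round(0.3 * T))

print("best-prefix size:", k_best, "| fixed-percentage size:", k_fixed)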
5 Conclusions

This work was a first attempt towards a taxonomy of ensemble selection methods. We believe that such a taxonomy is necessary for researchers working on new methods. It will help them identify the main categories of methods and their key points, and avoid duplication of work. Due to the large number of existing methods and the different parameters of an ensemble selection framework (heterogeneous/homogeneous ensemble, algorithms used, size of ensemble, etc.), it is possible to devise a new method which differs only in small, perhaps unimportant, details from existing methods. A generalized view of the methods, as offered by a taxonomy, will help avoid work towards such small differences, and perhaps may lead to more novel methods.

Of course, we do not argue that the proposed taxonomy is perfect. On the contrary, it is just a first and limited step in abstracting and categorizing the different methods. A much more elaborate study has to be made to properly account for the different aspects of existing methods. No doubt, some high-quality methods may have been left outside this study. We hope that through discussion and criticism of this work within the ensemble methods community, and especially among people working on ensemble selection, a much improved version of it will arise.

REFERENCES

[1] R.E. Banfield, L.O. Hall, K.W. Bowyer, and W.P. Kegelmeyer, 'Ensemble diversity measures and their application to thinning', Information Fusion, 6(1), 49-62, (2005).
[2] L. Breiman, 'Bagging predictors', Machine Learning, 24(2), 123-140, (1996).
[3] R. Caruana, A. Munson, and A. Niculescu-Mizil, 'Getting the most out of ensemble selection', in Sixth International Conference on Data Mining (ICDM '06), (2006).
[4] R. Caruana, A. Niculescu-Mizil, G. Crew, and A. Ksikes, 'Ensemble selection from libraries of models', in Proceedings of the 21st International Conference on Machine Learning, p. 18, (2004).
[5] T.G. Dietterich, 'Machine-learning research: Four current directions', AI Magazine, 18(4), 97-136, (1997).
[6] T.G. Dietterich, 'Ensemble methods in machine learning', in Proceedings of the 1st International Workshop on Multiple Classifier Systems, pp. 1-15, (2000).
[7] W. Fan, F. Chu, H. Wang, and P.S. Yu, 'Pruning and dynamic scheduling of cost-sensitive ensembles', in Eighteenth National Conference on Artificial Intelligence, pp. 146-151. American Association for Artificial Intelligence, (2002).
[8] Qiang Fu, Shang-Xu Hu, and Sheng-Ying Zhao, 'Clustering-based selective neural network ensemble', Journal of Zhejiang University SCIENCE, 6A(5), 387-392, (2005).
[9] Giorgio Giacinto, Fabio Roli, and Giorgio Fumera, 'Design of effective multiple classifier systems by clustering of classifiers', in 15th International Conference on Pattern Recognition, ICPR 2000, pp. 160-163, (3-8 September 2000).
[10] R.A. Jacobs, M.I. Jordan, S.J. Nowlan, and G.E. Hinton, 'Adaptive mixtures of local experts', Neural Computation, 3, 79-87, (1991).
[11] L.I. Kuncheva and C.J. Whitaker, 'Measures of diversity in classifier ensembles and their relationship with the ensemble accuracy', Machine Learning, 51, 181-207, (2003).
[12] Aleksandar Lazarevic and Zoran Obradovic, 'Effective pruning of neural network classifiers', in 2001 IEEE/INNS International Conference on Neural Networks, IJCNN 2001, pp. 796-801, (15-19 July 2001).
[13] D. Margineantu and T. Dietterich, 'Pruning adaptive boosting', in Proceedings of the 14th International Conference on Machine Learning, pp. 211-218, (1997).
[14] G. Martinez-Munoz and A. Suarez, 'Aggregation ordering in bagging', in International Conference on Artificial Intelligence and Applications (IASTED), pp. 258-263. Acta Press, (2004).
[15] G. Martinez-Munoz and A. Suarez, 'Pruning in ordered bagging ensembles', in 23rd International Conference on Machine Learning (ICML-2006), pp. 609-616. ACM Press, (2006).
[16] I. Partalas, G. Tsoumakas, I. Katakis, and I. Vlahavas, 'Ensemble pruning via reinforcement learning', in 4th Hellenic Conference on Artificial Intelligence (SETN 2006), pp. 301-310, (May 18-20 2006).
[17] I. Partalas, G. Tsoumakas, and I. Vlahavas, 'Focused ensemble selection: A diversity-based method for greedy ensemble selection', in 18th European Conference on Artificial Intelligence, (2008).
[18] Robert E. Schapire, 'The strength of weak learnability', Machine Learning, 5, 197-227, (1990).
[19] E.K. Tang, P.N. Suganthan, and X. Yao, 'An analysis of diversity measures', Machine Learning, 65(1), 247-271, (2006).
[20] G. Tsoumakas, L. Angelis, and I. Vlahavas, 'Selective fusion of heterogeneous classifiers', Intelligent Data Analysis, 9(6), 511-525, (2005).
[21] G. Tsoumakas, I. Katakis, and I. Vlahavas, 'Effective voting of heterogeneous classifiers', in Proceedings of the 15th European Conference on Machine Learning, ECML 2004, pp. 465-476, (2004).
[22] C.J. Watkins and P. Dayan, 'Q-learning', Machine Learning, 8, 279-292, (1992).
[23] D. Wolpert, 'Stacked generalization', Neural Networks, 5, 241-259, (1992).
[24] Yi Zhang, Samuel Burer, and W. Nick Street, 'Ensemble pruning via semi-definite programming', Journal of Machine Learning Research, 7, 1315-1338, (2006).
[25] Zhi-Hua Zhou and Wei Tang, 'Selective ensemble of decision trees', in 9th International Conference on Rough Sets, Fuzzy Sets, Data Mining, and Granular Computing, RSFDGrC 2003, pp. 476-483, Chongqing, China, (May 2003).