Fβ Support Vector Machines


Proceedings of International Joint Conference on Neural Networks, Montreal, Canada, July 31 - August 4, 2005

Jérôme Callut and Pierre Dupont
Department of Computing Science and Engineering, INGI
Université catholique de Louvain, Place Sainte-Barbe 2, B-1348 Louvain-la-Neuve, Belgium

Abstract

We introduce in this paper Fβ SVMs, a new parametrization of support vector machines. It allows an SVM to be optimized in terms of Fβ, a classical information retrieval criterion, instead of the usual classification rate. Experiments illustrate the advantages of this approach with respect to the traditional 2-norm soft-margin SVM when precision and recall are of unequal importance. An automatic model selection procedure based on the generalization Fβ score is introduced. It relies on the results of Chapelle, Vapnik et al. [4] about the use of gradient-based techniques in SVM model selection. The derivatives of an Fβ loss function with respect to the hyperparameter C and the width σ of a Gaussian kernel are formally defined. The model is then selected by performing a gradient descent of the Fβ loss function over the set of hyperparameters. Experiments on artificial and real-life data show the benefits of this method when the Fβ score is considered.

I. INTRODUCTION

Support Vector Machines (SVM), introduced by Vapnik [18], have been widely used in the field of pattern recognition for the last decade. The popularity of the method relies on its strong theoretical foundations as well as on its practical results. The performance of classifiers is usually assessed by means of the classification error rate or by Information Retrieval (IR) measures such as precision, recall, Fβ, breakeven-point and ROC curves. Unfortunately, there is no direct connection between these IR criteria and the SVM hyperparameters: the regularization constant C and the kernel parameters.
In this paper, we propose a novel method allowing the user to specify his requirements in terms of the Fβ criterion. First of all, the Fβ measure is reviewed as a user specification criterion in section II. A new SVM parametrization dealing with the β parameter is introduced in section III. Afterwards, a procedure for automatic model selection according to Fβ is proposed in section IV. This procedure is a gradient-based technique derived from the results of Chapelle, Vapnik et al. [4]. Finally, experiments with artificial and real-life data are presented in section V.

II. USER SPECIFICATIONS WITH THE Fβ CRITERION

Precision and recall are popular measures to assess classifier performance in an information retrieval context [16]. Therefore, it would be convenient to use these evaluation criteria when formulating the user specifications. For instance, let us consider the design of a classifier used to retrieve documents according to topic. Some users prefer to receive a limited list of relevant documents even if this means losing some interesting ones. Others would not want to miss any relevant document, at the cost of also receiving non-relevant ones. Those specifications correspond respectively to a high precision and a high recall.

The two previous measures can be combined in a single Fβ measure in which the parameter β specifies the relative importance of recall with respect to precision. Setting β to 0 would only consider precision, whereas taking β = ∞ would only take recall into account. Moreover, precision and recall are of equal importance when using the F1 measure. The contingency matrix and the estimates of precision, recall and Fβ are given hereafter.

                 Target: +1          Target: -1
Predicted: +1    True Pos. (#TP)     False Pos. (#FP)
Predicted: -1    False Neg. (#FN)    True Neg. (#TN)

Precision π, recall ρ and Fβ:

\[ \pi = \frac{\#TP}{\#TP + \#FP}, \qquad \rho = \frac{\#TP}{\#TP + \#FN}, \qquad F_\beta = \frac{(\beta^2+1)\,\pi\rho}{\beta^2\pi + \rho} \]

III. Fβ SUPPORT VECTOR MACHINES

In this section, we introduce a new parametrization of SVM that allows user specifications to be formulated in terms of the Fβ criterion. To do so, we establish a relation between the contingency matrix and the slack variables used in the soft-margin SVM setting. Based on this link, we devise a new optimization problem which maximizes an approximation of the Fβ criterion regularized by the size of the margin.

A. Link between the contingency matrix and the slacks

Let us consider a binary classification task with a training set $Tr = \{(x_1, y_1), \dots, (x_n, y_n)\}$ where $x_i$ is an instance in some input space $\mathcal{X}$ and $y_i \in \{-1, +1\}$ represents its category. Let $n_+$ and $n_-$ denote respectively the number of positive and negative examples. The soft-margin formulation of SVM allows examples to be misclassified or to lie inside the margin by the introduction of slack variables ξ in the problem constraints:

OP1:
\[ \text{Minimize } W(w, b, \xi) = \tfrac{1}{2}\|w\|^2 + C\,\Phi(\xi) \]
\[ \text{s.t. } \begin{cases} y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i & i = 1..n \\ \xi_i \ge 0 & i = 1..n \end{cases} \]

where w and b are the parameters of the hyperplane. The Φ(.) term introduced in the objective function is used to penalize solutions presenting many training errors. For any feasible solution (w, b, ξ), misclassified training examples have an associated slack value of at least 1. The situation is illustrated in figure 1. Hence, it seems natural to choose as penalization function Φ(.) a function counting the number of slacks greater than or equal to 1. Unfortunately, the optimization of such a function combined with the margin criterion turns out to be a mixed-integer problem known to be NP-hard [15]. In fact, two approximations of the counting function are commonly used: $\Phi(\xi) = \sum_{i=1}^{n} \xi_i$ (1-norm) and $\Phi(\xi) = \sum_{i=1}^{n} \xi_i^2$ (2-norm). These approximations present two peculiarities:

1) The sum of slacks related to examples inside the margin might be considered as errors.
2) Examples with a slack value greater than 1 might contribute as more than one error.

However, the use of these approximations is computationally attractive as the problem remains convex and quadratic, and is consequently solvable in polynomial time. In the sequel, we will focus on the 2-norm alternative.

Fig. 1. Soft-margin SVM and associated slacks

Computing the preceding approximations separately for each class label makes it possible to bound the elements of the contingency matrix.

Proposition 1: Let (w, b, ξ) be a solution satisfying the constraints of OP1. The following bounds hold for the elements of the contingency matrix computed on the training set:

\[ \begin{aligned} \#TP &\ge n_+ - \sum_{\{i \mid y_i = +1\}} \xi_i^2 & \qquad \#FP &\le \sum_{\{i \mid y_i = -1\}} \xi_i^2 \\ \#FN &\le \sum_{\{i \mid y_i = +1\}} \xi_i^2 & \qquad \#TN &\ge n_- - \sum_{\{i \mid y_i = -1\}} \xi_i^2 \end{aligned} \]

These bounds will be called the slack estimates of the contingency matrix. It should be noted that they could also have been formulated using the 1-norm approximation.

B. The Fβ parametrization

Let us introduce a parametrization of SVM in which a regularized Fβ criterion is optimized.
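The quantities of sections II and III-A can be checked numerically on a toy example. The Python sketch below is an illustration under stated assumptions, not the authors' code: the helper names (`contingency`, `f_beta`, `slack_estimates`) and the decision values are hypothetical, and each slack is taken as the smallest feasible value, ξ_i = max(0, 1 - y_i f(x_i)), for a fixed decision function f(x) = ⟨w, x⟩ + b. It computes the contingency matrix, the Fβ score written directly in terms of counts, and the slack estimates of Proposition 1.

```python
def contingency(scores, labels):
    """Return (#TP, #FP, #FN, #TN) for predictions sign(score), labels in {-1, +1}."""
    tp = fp = fn = tn = 0
    for s, y in zip(scores, labels):
        predicted_positive = s >= 0
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def f_beta(tp, fp, fn, beta):
    """F_beta from counts; algebraically equal to (b^2+1)*pi*rho / (b^2*pi + rho)."""
    b2 = beta * beta
    denom = (b2 + 1) * tp + b2 * fn + fp
    return (b2 + 1) * tp / denom if denom > 0 else 0.0


def slack_estimates(scores, labels):
    """Slack estimates of Proposition 1, assuming xi_i = max(0, 1 - y_i * f(x_i))."""
    sq_pos = sum(max(0.0, 1.0 - s) ** 2 for s, y in zip(scores, labels) if y == 1)
    sq_neg = sum(max(0.0, 1.0 + s) ** 2 for s, y in zip(scores, labels) if y == -1)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return {
        "tp_lower": n_pos - sq_pos,  # #TP >= n_+ - sum of squared positive slacks
        "fn_upper": sq_pos,          # #FN <= sum of squared positive slacks
        "fp_upper": sq_neg,          # #FP <= sum of squared negative slacks
        "tn_lower": n_neg - sq_neg,  # #TN >= n_- - sum of squared negative slacks
    }


# Hypothetical decision values f(x_i) and targets y_i.
scores = [2.0, 0.4, -0.5, -1.5, 0.3, -0.2, 0.1]
labels = [1, 1, 1, -1, -1, -1, -1]

tp, fp, fn, tn = contingency(scores, labels)
est = slack_estimates(scores, labels)
print("TP, FP, FN, TN:", tp, fp, fn, tn)
print("F2:", f_beta(tp, fp, fn, beta=2.0))
print("slack estimates:", est)
```

On this toy data the bounds are loose (for instance, one false negative against an upper bound above 2), which is consistent with the two peculiarities of the 2-norm approximation listed in section III-A.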
The Fβ function can be expanded using the definitions of precision and recall as:

\[ F_\beta = \frac{(\beta^2+1)\,\pi\rho}{\beta^2\pi + \rho} = \frac{(\beta^2+1)\,\#TP}{(\beta^2+1)\,\#TP + \beta^2\,\#FN + \#FP} \]

The optimal value for Fβ (≤ 1) is obtained by minimizing β² #FN + #FP. Replacing #FN and #FP by their slack estimates and integrating this into the objective function leads to the following optimization problem:

OP2:
\[ \text{Minimize } W(w, b, \xi) = \tfrac{1}{2}\|w\|^2 + C \Big[ \beta^2 \sum_{\{i \mid y_i = +1\}} \xi_i^2 + \sum_{\{i \mid y_i = -1\}} \xi_i^2 \Big] \]
\[ \text{s.t. } \begin{cases} y_i(\langle w, x_i \rangle + b) \ge 1 - \xi_i & i = 1..n \\ \xi_i \ge 0 & i = 1..n \end{cases} \]

The relative importance of the Fβ criterion with respect to the margin can be tuned using the regularization constant C. Since the slack estimates for #FP and #FN are upper bounds, OP2 is based on a pessimistic estimation of Fβ. OP2 can be seen as an instance of the SVM parametrization considering two kinds of slacks with the associated regularization constants $C_+$ and $C_-$ [21], [13]. In our case, the regularization constants derive from the β value, i.e. $C_+ = C\beta^2$ and $C_- = C$. It should be pointed out that when β = 1, OP2 is equivalent to the traditional 2-norm soft-margin SVM problem.

The optimization of the Fβ criterion is closely related to the problem of training an SVM with an imbalanced dataset. When the prior of one class is far larger than the prior of the other class, the classifier obtained by standard SVM training is likely to act as the trivial acceptor or rejector (i.e. a classifier always predicting +1, respectively -1). To avoid this inconvenience, some authors [21] have introduced different penalties for the different classes using $C_+$ and $C_-$. This method has been applied in order to control the sensitivity¹ of the model. However, no automatic procedure has been proposed to choose the regularization constants with respect to the user specifications. Recently, this technique has been improved by artificially oversampling the minority class [1]. Other authors [2] have proposed to select a unique regularization constant C through a bootstrap procedure.
This constant is then used as a starting point for tuning $C_+$ and $C_-$ on a validation set.

¹ The sensitivity is the rate of true positive examples and is equivalent to recall.

IV. MODEL SELECTION ACCORDING TO Fβ

In the preceding section, we proposed a parametrization of SVM enabling the user to formulate his specifications with the β parameter. In addition, the remaining hyperparameters, i.e. the regularization constant and the kernel parameters, must be selected. In the case of SVM, model selection can be made using the statistical properties of the optimal hyperplane, thus avoiding the need to perform cross-validation. Indeed, several bounds of the leave-one-out (loo) error rate can be directly derived from the parameters of the optimal hyperplane expressed in dual form [20], [14], [10]. A practical evaluation of several of these bounds has been recently proposed in [7]. Moreover, Chapelle, Vapnik et al. [4] have shown that the hyperplane dual parameters are differentiable with respect to the hyperparameters. This allows the use of gradient-based techniques for model selection [4], [5]. In this section, we propose a gradient-based algorithm that automatically selects C and the width σ of a Gaussian kernel² according to the generalization Fβ score.

A. The generalization Fβ loss function

It has been proved by Vapnik [19] that for an example $(x_i, y_i)$ producing a loo error, $4\alpha_i R^2 \ge 1$ holds, where R is the radius of the smallest sphere enclosing all the training examples and $\alpha_i$ is the i-th dual parameter of the optimal hyperplane. This inequality was originally formulated for the hard-margin case. However, it can be applied to the 2-norm soft-margin SVM, as the latter can be seen as a hard-margin problem with a transformed kernel [6], [13]. Using the preceding inequality, one can build an estimator of the generalization Fβ score of a given model. Alternatively, it is possible to formulate a loss function following the reasoning developed in section III-B, with the sums of $4\alpha_i R^2$ over the positive and the negative examples playing the role of the estimates of #FN and #FP:

\[ L_{F_\beta}(\alpha, R) = 4R^2 \Big[ \beta^2 \sum_{\{i \mid y_i = +1\}} \alpha_i + \sum_{\{i \mid y_i = -1\}} \alpha_i \Big] \]

In the algorithm proposed in section IV-B, the model parameters are selected by minimizing the $L_{F_\beta}(.,.)$ loss function.

B. The model selection algorithm

We introduce here an algorithm performing automatic model selection according to the Fβ criterion. It selects the model by performing a gradient descent of the Fβ loss function over the set of hyperparameters. For the sake of clarity, C and σ are gathered in a single vector θ. The model selection algorithm is sketched hereafter.
² $k(x_i, x_j) = \exp(-\|x_i - x_j\|^2 / 2\sigma^2)$

Algorithm FβMODELSELECTION
Input:  Training set Tr = (x_1, y_1), ..., (x_n, y_n)
        Initial values for the hyperparameters θ^0
        Precision parameter ε
Output: Optimal hyperparameters θ
        SVM optimal solution α using θ

    α^0 ← trainFβSVM(Tr, θ^0);
    (R, λ)^0 ← smallestsphereradius(Tr, θ^0);
    t ← 0;
    repeat
        θ^{t+1} ← updatehyperparameters(θ^t, α^t, R^t, λ^t);
        α^{t+1} ← trainFβSVM(Tr, θ^{t+1});
        (R, λ)^{t+1} ← smallestsphereradius(Tr, θ^{t+1});
        t ← t + 1;
    until |L_Fβ(α^t, R^t) - L_Fβ(α^{t-1}, R^{t-1})| < ε;
    return {θ^t, α^t}

The trainFβSVM function solves OP3, the dual problem of OP2, which has the same form as the dual hard-margin problem [15]:

OP3:
\[ \text{Maximize } W(\alpha) = -\frac{1}{2} \sum_{i,j=1}^{n} \alpha_i \alpha_j y_i y_j \tilde{k}(x_i, x_j) + \sum_{i=1}^{n} \alpha_i \]
\[ \text{s.t. } \begin{cases} \sum_{i=1}^{n} \alpha_i y_i = 0 \\ \alpha_i \ge 0 & i = 1..n \end{cases} \]

with a transformed kernel:

\[ \tilde{k}(x_i, x_j) = \begin{cases} k(x_i, x_j) + \delta_{ij}\,\dfrac{1}{C\beta^2} & \text{if } y_i = +1 \\[4pt] k(x_i, x_j) + \delta_{ij}\,\dfrac{1}{C} & \text{if } y_i = -1 \end{cases} \]

where $\delta_{ij}$ is the Kronecker delta and k(.,.) is the original kernel function.

The radius of the smallest sphere enclosing all the examples, computed by the smallestsphereradius function, is obtained by taking the square root of the optimal value of the objective function in the following optimization problem [15]:

OP4:
\[ \text{Maximize } W(\lambda) = \sum_{i=1}^{n} \lambda_i \tilde{k}(x_i, x_i) - \sum_{i,j=1}^{n} \lambda_i \lambda_j \tilde{k}(x_i, x_j) \]
\[ \text{s.t. } \begin{cases} \sum_{i=1}^{n} \lambda_i = 1 \\ \lambda_i \ge 0 & i = 1..n \end{cases} \]

The optimization problems OP3 and OP4 can be solved in polynomial time in n, e.g. using an interior point method [17]. Furthermore, the solution to OP3, respectively OP4, at a given iteration can be used as a good starting point for the next iteration.
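The transformed kernel, the squared radius of OP4 and the loss of section IV-A can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, OP4 is approximated with simple Frank-Wolfe steps on the simplex instead of the interior point method mentioned above, and the dual parameters α passed to the loss are made-up values rather than a solution of OP3.

```python
import math


def gaussian_kernel(x, z, sigma):
    """k(x, z) = exp(-||x - z||^2 / (2 sigma^2))."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-d2 / (2.0 * sigma ** 2))


def transformed_gram(X, y, C, beta, kernel):
    """Gram matrix of the transformed kernel: 1/(C beta^2) is added on the
    diagonal for positive examples and 1/C for negative examples."""
    n = len(X)
    K = [[kernel(X[i], X[j]) for j in range(n)] for i in range(n)]
    for i in range(n):
        K[i][i] += 1.0 / (C * beta ** 2) if y[i] == 1 else 1.0 / C
    return K


def squared_radius(K, iters=2000):
    """Approximate the optimum of OP4 (Frank-Wolfe with exact line search)."""
    n = len(K)
    lam = [1.0 / n] * n
    for _ in range(iters):
        Kl = [sum(K[i][j] * lam[j] for j in range(n)) for i in range(n)]
        lKl = sum(lam[i] * Kl[i] for i in range(n))
        grad = [K[i][i] - 2.0 * Kl[i] for i in range(n)]
        best = max(range(n), key=grad.__getitem__)
        # Move towards vertex e_best: direction d = e_best - lam.
        slope = grad[best] - sum(g * l for g, l in zip(grad, lam))
        curv = K[best][best] - 2.0 * Kl[best] + lKl  # d^T K d
        if slope <= 1e-12 or curv <= 0.0:
            break
        step = min(1.0, slope / (2.0 * curv))
        lam = [(1.0 - step) * l for l in lam]
        lam[best] += step
    Kl = [sum(K[i][j] * lam[j] for j in range(n)) for i in range(n)]
    value = sum(lam[i] * K[i][i] for i in range(n)) - sum(lam[i] * Kl[i] for i in range(n))
    return value, lam


def f_beta_loss(alpha, y, r2, beta):
    """L(alpha, R) = 4 R^2 [beta^2 * sum_{y=+1} alpha_i + sum_{y=-1} alpha_i]."""
    pos = sum(a for a, yi in zip(alpha, y) if yi == 1)
    neg = sum(a for a, yi in zip(alpha, y) if yi == -1)
    return 4.0 * r2 * (beta ** 2 * pos + neg)


# Toy data and hyperparameters (hypothetical values).
X = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
y = [1, -1, -1]
K = transformed_gram(X, y, C=2.0, beta=2.0,
                     kernel=lambda a, b: gaussian_kernel(a, b, sigma=1.0))
r2, lam = squared_radius(K)
alpha = [0.5, 0.25, 0.25]  # made-up dual parameters, not a solution of OP3
print("R^2 estimate:", r2)
print("loss:", f_beta_loss(alpha, y, r2, beta=2.0))
```

Frank-Wolfe is chosen here only because it needs no external solver; it converges slowly and returns a slight underestimate of R², so a quadratic programming solver would be preferable for anything beyond a toy problem.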

At each iteration, the hyperparameters can be updated by means of a gradient step:

\[ \theta^{t+1} = \theta^{t} - \eta\,\frac{\partial L_{F_\beta}}{\partial \theta} \]

where η > 0 is the updating rate. However, second order methods often provide a faster convergence, which is valuable since two optimization problems have to be solved at each iteration. For this reason, the updatehyperparameters function relies on the BFGS algorithm [8], a quasi-Newton optimization technique. The time complexity of the updatehyperparameters function is $O(n^3)$ since it is dominated by the inversion of a possibly $n \times n$ matrix (see section IV-C). The derivatives of the Fβ loss function with respect to the hyperparameters are detailed in the next section. The algorithm is iterated until the Fβ loss function no longer changes by more than ε.

C. Derivatives of the Fβ loss function

The derivatives of the transformed kernel function with respect to the hyperparameters are given by:

\[ \frac{\partial \tilde{k}(x_i, x_j)}{\partial C} = \begin{cases} -1/(C^2\beta^2) & \text{if } i = j \text{ and } y_i = +1 \\ -1/C^2 & \text{if } i = j \text{ and } y_i = -1 \\ 0 & \text{otherwise} \end{cases} \]

\[ \frac{\partial \tilde{k}(x_i, x_j)}{\partial \sigma^2} = k(x_i, x_j)\,\frac{\|x_i - x_j\|^2}{2\sigma^4} \]

The derivatives of the squared radius can then be obtained by applying lemma 2 of Chapelle, Vapnik et al. [4]:

\[ \frac{\partial R^2}{\partial \theta} = \sum_{i=1}^{n} \lambda_i \frac{\partial \tilde{k}(x_i, x_i)}{\partial \theta} - \sum_{i,j=1}^{n} \lambda_i \lambda_j \frac{\partial \tilde{k}(x_i, x_j)}{\partial \theta} \]

where $\theta \in \{C, \sigma^2\}$.

The derivation of the hyperplane dual parameters proposed in [4] follows:

\[ \frac{\partial(\alpha, b)}{\partial \theta} = -H^{-1}\,\frac{\partial H}{\partial \theta}\,(\alpha, b)^T, \qquad H = \begin{pmatrix} K^{y} & y \\ y^T & 0 \end{pmatrix} \]

where $K^{y}$ denotes the kernel matrix and y is the vector of example labels. The matrix H is differentiated using the preceding kernel function derivatives. It should be stressed that only examples corresponding to support vectors have to be considered in the above formula. Finally, the derivative of $L_{F_\beta}(.,.)$ with respect to a hyperparameter θ is given by:

\[ \frac{\partial L_{F_\beta}(\alpha, R)}{\partial \theta} = 4\,\frac{\partial R^2}{\partial \theta} \Big[ \beta^2 \sum_{\{i \mid y_i = +1\}} \alpha_i + \sum_{\{i \mid y_i = -1\}} \alpha_i \Big] + 4R^2 \Big[ \beta^2 \sum_{\{i \mid y_i = +1\}} \frac{\partial \alpha_i}{\partial \theta} + \sum_{\{i \mid y_i = -1\}} \frac{\partial \alpha_i}{\partial \theta} \Big] \]

V. EXPERIMENTS

We performed several experiments to assess the performance of the Fβ parametrization and the model selection algorithm. First, the Fβ parametrization was tested with positive and negative data in $\mathbb{R}^{10}$ drawn from two largely overlapping normal distributions. The priors for positive and negative classes were respectively 0.3 and 0.7. It is usually more difficult to obtain a good recall when data are unbalanced in this way. Experiments were carried out using training sets of 600 examples, a fixed test set of 1,000 examples and a linear kernel. A comparison between the Fβ parametrization and the 2-norm soft-margin SVM with C = 1 is displayed in figure 2. For each β considered, the training data were resampled 10 times in order to produce averaged results. In this setting, our parametrization obtained better Fβ scores than the standard soft-margin SVM, especially when a high recall was requested. The second part of figure 2 presents the evolution of precision, recall and the Fβ score for different β values.

Fig. 2. The Fβ parametrization tested with artificially generated data. Top: comparison between the standard 2-norm soft-margin SVM and the Fβ parametrization. Bottom: evolution of precision, recall and the Fβ score according to different β values.

Afterwards, our parametrization was tested using several class priors. The experimental setup was unchanged except for the class priors used while generating the training and test data. Figure 3 shows the evolution of the Fβ score obtained by our parametrization and by the 2-norm soft-margin SVM using several class priors. For the standard 2-norm soft-margin SVM, one notes that the effect of the priors is particularly important when positive examples are few and a high recall is requested. In this setting, our parametrization outperformed the standard 2-norm soft-margin SVM by more than 0.1.

The model selection algorithm was first tested with data

generated as in the previous paragraph. The hyperparameters C and σ were initialized to 1 and the precision parameter ε was set to […]. Our objective was to investigate the relation between the minimization of the Fβ loss function and the Fβ score obtained on unknown test data. Figure 4 shows the evolution of the Fβ loss function during the gradient descent, using β = 2. The associated precision, recall and Fβ scores on test data are displayed in the bottom part of figure 4. Even if the optima of the Fβ loss function and the Fβ score do not match exactly, one can observe that good Fβ scores were obtained when the Fβ loss function is low. After 35 iterations, the classifier obtained an Fβ score close to 0.9 with the hyperparameters C = 4.33 and σ = […].

Fig. 3. The Fβ parametrization tested using artificially generated data with several class priors. Top: Fβ scores obtained on the test set using the Fβ parametrization. Bottom: Fβ scores obtained on the test set using the standard 2-norm soft-margin SVM.

Fig. 4. The Fβ model selection algorithm tested with artificially generated data and with β = 2. Top: the evolution of the Fβ loss function during the gradient descent. Bottom: the related values of precision, recall and Fβ score on independent test data.

The model selection algorithm was then compared to the Radius-Margin (RM) based algorithm [4] using the Diabetes dataset [3]. This dataset contains 500 positive examples and 268 negative examples. It was randomly split into a training set and a test set, each one containing 384 examples. In this setting, it is usually more difficult to obtain a classifier with a high precision. The same initial conditions as before were used. The RM based algorithm selects the model parameters of the 2-norm soft-margin SVM according to the RM estimator of the generalization error rate. It should be pointed out that when β = 1, both methods are equivalent since the same function is optimized.
The comparison is illustrated in the first part of figure 5. As expected, our method provided better results when β moves far away from 1. The influence of the β parameter on precision, recall and the Fβ score can be observed in the second part of figure 5.

VI. CONCLUSION

We introduced in this paper Fβ SVMs, a new parametrization of support vector machines. It allows user specifications to be formulated in terms of Fβ, a classical IR measure. Experiments illustrate the benefits of this approach over a standard SVM when precision and recall are of unequal importance. Besides, we extended the results of Chapelle, Vapnik et al. [4] based on the Radius-Margin (RM) bound in order to automatically select the model hyperparameters according to the generalization Fβ score. We proposed an algorithm which performs a gradient descent of the Fβ loss function over the set of hyperparameters. To do so, the partial derivatives of the Fβ loss function with respect to these hyperparameters have been formally defined. Our experiments on real-life data show the advantages of this method compared to the RM based algorithm when the Fβ evaluation criterion is considered.

Our future work includes improvements to the model selection algorithm in order to deal with larger training sets. Indeed, it is possible to use a sequential optimization method [11] in the smallestsphereradius function and chunking […]
