This paper introduces least squares support vector machines as a direct kernel method, where the kernel is considered as a data pre-processing step. A heuristic formula for the regularization parameter is proposed based on preliminary scaling experiments.

I. INTRODUCTION

A. One-Layered Neural Networks for Regression

A standard (predictive) data mining problem is defined as a regression problem for predicting the response from descriptive features. In order to do so, we will first build a predictive model based on training data, evaluate the performance of this predictive model based on validation data, and finally use this predictive model to make actual predictions on test data for which we generally do not know (or pretend not to know) the response value.

It is customary to denote the data matrix as $X_{nm}$ and the response vector as $y_n$. In this case, there are $n$ data points and $m$ descriptive features in the dataset. We would like to infer $y_n$ from $X_{nm}$ by induction, denoted as $X_{nm} \Rightarrow y_n$, in such a way that our inference model works not only for the training data, but also does a good job on the out-of-sample data (i.e., validation data and test data). In other words, we aim to build a linear predictive model of the type:

$$\hat{y}_n = X_{nm} w_m \qquad (1)$$

The hat symbol indicates that we are making predictions that are not perfect (especially for the validation and test data). Equation (1) is the answer to the question "wouldn't it be nice if we could apply wisdom to the data, and pop comes out the answer?" The vector $w_m$ is that wisdom vector and is usually called the weight vector in machine learning.

There are many different ways to build such predictive regression models. Just to mention a few possibilities here, the regression model could be a linear statistical model, a Neural Network based model (NN), or a Support Vector Machine (SVM) [1-3] based model. Examples of linear statistical models are Principal Component Regression models (PCR) and Partial Least Squares models (PLS). Popular examples of neural network-based models include feedforward neural networks (trained with one of the many popular learning methods), Self-Organizing Maps (SOMs), and Radial Basis Function Networks (RBFN). Examples of Support Vector Machine algorithms include the perceptron-like support vector machines (SVMs), and Least-Squares Support Vector Machines (LS-SVM), also known as kernel ridge regression. A straightforward way to estimate the weights is outlined in Equation (2):

$$w_m = \left( X_{mn}^T X_{nm} \right)^{-1} X_{mn}^T y_n \qquad (2)$$

Predictions for the training set can now be made for $y$ by substituting (2) in (1):

$$\hat{y}_n = X_{nm} \left( X_{mn}^T X_{nm} \right)^{-1} X_{mn}^T y_n \qquad (3)$$
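As a minimal sketch (not from the original paper), Eqs. (2) and (3) can be evaluated in a few lines of NumPy; the toy arrays X and y below are hypothetical stand-ins for $X_{nm}$ and $y_n$:

```python
import numpy as np

# Toy data: n = 5 data points, m = 2 descriptive features.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([3.1, 2.9, 7.2, 6.8, 10.1])

# Eq. (2): w_m = (X^T X)^{-1} X^T y, solved as a linear system.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Eq. (3): predictions for the training set.
y_hat = X @ w
```

Solving the normal equations with np.linalg.solve, rather than forming the inverse of $X_{mn}^T X_{nm}$ explicitly, is the numerically preferable route.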

Before applying this formula for a general prediction, proper data preprocessing is required. A common procedure in data mining is to center all the descriptors and to bring them to unit variance. The same process is then applied to the response. This procedure of centering and variance normalization is known as Mahalanobis scaling. While Mahalanobis scaling is not the only way to pre-process the data, it is probably the most general and the most robust way to do pre-processing that applies well across the board. If we represent a feature vector as $z$, Mahalanobis scaling will result in a rescaled feature vector $z'$ and can be summarized as:

$$z' = \frac{z - \bar{z}}{\text{std}(z)} \qquad (4)$$

where $\bar{z}$ represents the average value and $\text{std}(z)$ represents the standard deviation for attribute $z$.
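A minimal sketch of Mahalanobis scaling, Eq. (4), assuming a NumPy data matrix with one row per data point; the helper name mahalanobis_scale is our own, not from the paper:

```python
import numpy as np

def mahalanobis_scale(X, mean=None, std=None):
    """Center each column and bring it to unit variance, Eq. (4).

    The training-set mean and std are returned so that validation and
    test data can later be rescaled with the *same* parameters.
    """
    if mean is None:
        mean = X.mean(axis=0)
        std = X.std(axis=0)
    return (X - mean) / std, mean, std
```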

Making a test model proceeds in a very similar way as for training: the "wisdom vector" or the weight vector will now be applied to the test data to make predictions according to:

$$\hat{y}_k^{test} = X_{km}^{test} w_m \qquad (5)$$

In the above expression it was assumed that there are $k$ test data, and the superscript "test" is used to explicitly indicate that the weight vector will be applied to a set of $k$ test data with $m$ attributes or descriptors. If one considers testing for one sample data point at a time, Eq. (5) can be represented as a simple neural network with an input layer and just a single neuron, as shown in Fig. 1. The neuron produces the weighted sum of the average input features. Note that the transfer function, commonly found in neural networks, is not present here. Note also that the number of weights for this one-layer neural network equals the number of input descriptors or attributes.
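Continuing the toy sketch above (mean, std, and w are assumed to come from the training stage), Eq. (5) is a single matrix-vector product; X_test below is a hypothetical $k \times m$ test matrix:

```python
import numpy as np

# Hypothetical k = 2 test points with the same m = 2 features.
X_test = np.array([[2.5, 2.5], [4.0, 4.5]])

# Scale with the *training* parameters, then apply Eq. (5).
X_test_scaled = (X_test - mean) / std
y_hat_test = X_test_scaled @ w
```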

B. The Machine Learning Dilemma

Equation (2) requires the inverse of the feature kernel, $X_{mn}^T X_{nm}$, an $m \times m$ symmetric matrix where each entry represents the similarity between features. Obviously, if there were two features that would be completely redundant, the feature matrix would contain two columns and two rows that are (exactly) identical, and the inverse does not exist. One can argue that all is still well, and that in order to make the simple regression method work one would just make sure that the same descriptor or attribute is not included twice. By the same argument, highly correlated descriptors (i.e., "cousin features" in data mining lingo) should be eliminated as well. While this argument sounds plausible, the truth of the matter is more subtle. Let us repeat Eq. (2) again and go just one step further as shown below:

$$w_m = X_{mn}^T \left( X_{nm} X_{mn}^T \right)^{-1} y_n \qquad (10)$$

Eq. (10) is an equivalent linear formulation to Eq. (2), based on the so-called right-hand pseudo-inverse or Penrose inverse, rather than the more common left-hand pseudo-inverse. It was not shown here how Eq. (10) follows from Eq. (2), but the proof is straightforward and left as an exercise to the reader. Note that now the inverse is needed for a different entity matrix, which has an $n \times n$ dimensionality, and is called the data kernel, $K_D$, as defined by:

$$K_D = X_{nm} X_{mn}^T \qquad (11)$$

The right-hand pseudo-inverse formulation is less frequently cited in the literature, because it can only be non-rank deficient when there are more descriptive attributes than data points, which is not the usual case for data mining problems (except for data strip mining [17] cases). The data kernel matrix is a symmetrical matrix that contains entries representing similarities between data points. The solution to this problem seems to be straightforward. We will first try to explain here what seems to be an obvious solution, and then actually show why this won't work. Looking at Eqs. (10) and (11) it can be concluded that, except for rare cases where there are as many data records as there are features, either the feature kernel is rank deficient (in case that $m > n$, i.e., there are more attributes than data), or the data kernel is rank deficient (in case that $n > m$, i.e., there are more data than attributes). It can now be argued that for the $n > m$ case one can proceed with the usual left-hand pseudo-inverse method of Eq. (2), and that for the $m > n$ case one should proceed with the right-hand pseudo-inverse, or Penrose inverse, following Eq. (10).
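The two kernels and their ranks are easy to inspect numerically; a sketch with hypothetical random data (with n = 100 > m = 5, so the data kernel is necessarily rank deficient):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # n = 100 data points, m = 5 features
y = rng.normal(size=100)

K_F = X.T @ X                      # feature kernel, m x m (used in Eq. (2))
K_D = X @ X.T                      # data kernel, n x n, Eq. (11)

print(np.linalg.matrix_rank(K_F))  # 5: full rank, invertible
print(np.linalg.matrix_rank(K_D))  # 5: rank deficient (5 << 100)

# The left-hand pseudo-inverse of Eq. (2) works here because n > m:
w = np.linalg.solve(K_F, X.T @ y)
```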

While the approach just proposed here seems to be reasonable, it will not work well in practice. Learning occurs by discovering patterns in data through redundancies present in the data. Data redundancies imply that there are data present that seem to be very similar to each other (and that have similar values for the response as well). An extreme example of data redundancy would be a dataset that contains the same data point twice. Obviously, in that case, the data matrix is ill-conditioned and the inverse does not exist. This type of redundancy, where data repeat themselves, will be called here a "hard redundancy." However, for any dataset that one can possibly learn from, there have to be many "soft redundancies" as well. While these soft redundancies will not necessarily make the data matrix ill-conditioned, in the sense that the inverse does not exist because the determinant of the data kernel is zero, in practice this determinant will be very small. In other words, regardless of whether one proceeds with a left-hand or a right-hand inverse, if data contain information that can be learnt from, there have to be soft or hard redundancies in the data. Unfortunately, Eqs. (2) and (10) can't be solved for the weight vector in that case, because the kernel will either be rank deficient (i.e., ill-conditioned), or poorly conditioned, i.e., calculating the inverse will be numerically unstable. We call this phenomenon "the machine learning dilemma:" (i) machine learning from data can only occur when data contain redundancies; (ii) but, in that case, the kernel inverse in Eq. (2) or Eq. (10) is either not defined or numerically unstable because of poor conditioning. Taking the inverse of a poorly conditioned matrix is possible, but the inverse is not "sharply defined" and most numerical methods, with the exception of methods based on singular value decomposition (SVD), will run into numerical instabilities. The data mining dilemma seems to have some similarity with the uncertainty principle in physics, but we will not try to draw that parallel too far.
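A small numerical illustration of the dilemma (our own sketch, not from the paper): a duplicated feature makes the feature kernel exactly singular, while a nearly duplicated "cousin feature" leaves it invertible in theory but poorly conditioned in practice:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))       # 50 data points, 3 features

# Hard redundancy: the same feature included twice.
X_hard = np.column_stack([X, X[:, 0]])

# Soft redundancy: a highly correlated "cousin feature".
X_soft = np.column_stack([X, X[:, 0] + 1e-6 * rng.normal(size=50)])

print(np.linalg.cond(X_hard.T @ X_hard))  # effectively infinite: singular
print(np.linalg.cond(X_soft.T @ X_soft))  # huge: the inverse is unstable
```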

Statisticians have been aware of the data mining dilemma for a long time, and have devised various methods around this paradox. In the next sections, we will propose several methods to deal with the data mining dilemma, and obtain efficient and robust prediction models in the process.

C. Regression Models Based on the Data Kernel

Reconsider the data kernel formulation of Eq. (10) for predictive modeling. There are several well-known methods for dealing with the data mining dilemma by using techniques that ensure that the kernel matrix will not be rank deficient anymore. Two well-known methods are principal component regression and ridge regression.[5] In order to keep the mathematical diversions to a bare minimum, only ridge regression will be discussed.

Ridge regression is a very straightforward way to ensure that the kernel matrix is positive definite (or well-conditioned), before inverting the data kernel. In ridge regression, a small positive value, $\lambda$, is added to each element on the main diagonal of the data matrix. Usually the same value for $\lambda$ is used for each entry. Obviously, we are not solving the same problem anymore. In order to not deviate too much from the original problem, the value for $\lambda$ will be kept as small as we reasonably can tolerate. A good choice for $\lambda$ is a small value that will make the newly defined data kernel matrix barely positive definite, so that the inverse exists and is mathematically stable. In data kernel space, the solution for the weight vector that will be used in the ridge regression prediction model now becomes:

$$w_m = X_{mn}^T \left( X_{nm} X_{mn}^T + \lambda I \right)^{-1} y_n \qquad (12)$$

and predictions for $y$ can now be made according to:

$$\hat{y}_n = X_{nm} X_{mn}^T \left( X_{nm} X_{mn}^T + \lambda I \right)^{-1} y_n = K_D \left( K_D + \lambda I \right)^{-1} y_n = K_D w_n \qquad (13)$$

where a very different weight vector was introduced: $w_n$. This weight vector is applied directly to the data kernel matrix (rather than the training data matrix) and has the same dimensionality as the number of training data. To make a prediction on the test set, one proceeds in a similar way, but applies the weight vector on the data kernel for the test data, which is generally a rectangular matrix, and projects the test data on the training data according to:

$$K_D^{test} = X_{km}^{test} \left( X_{nm}^{train} \right)^T \qquad (14)$$

where it is assumed that there are $k$ data points in the test set.

II. THE KERNEL TRANSFORMATION

The kernel transformation is an elegant way to make a regression model nonlinear. The kernel transformation goes back at least to the early 1900s, when Hilbert addressed kernels in the mathematical literature. A kernel is a matrix containing similarity measures for a dataset: either between the data of the dataset itself, or with other data (e.g., support vectors [1,3]). A classical use of a kernel is the correlation matrix used for determining the principal components in principal component analysis, where the feature kernel contains linear similarity measures between (centered) attributes. In support vector machines, the kernel entries are similarity measures between data rather than between features, and these similarity measures are usually nonlinear, unlike the dot product similarity measure that we used before to define a kernel. There are many possible nonlinear similarity measures, but in order to be mathematically tractable the kernel has to satisfy certain conditions, the so-called Mercer conditions.[1]

$$K_{nn} = \begin{bmatrix} k_{11} & k_{12} & \cdots & k_{1n} \\ k_{21} & k_{22} & \cdots & k_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ k_{n1} & k_{n2} & \cdots & k_{nn} \end{bmatrix} \qquad (15)$$

The expression above introduces the general structure for the data kernel matrix, $K_{nn}$, for $n$ data. The kernel matrix is a symmetrical matrix where each entry contains a (linear or nonlinear) similarity between two data vectors. There are many different possibilities for defining similarity metrics, such as the dot product, which is a linear similarity measure, and the Radial Basis Function kernel or RBF kernel, which is a nonlinear similarity measure. The RBF kernel is the most widely used nonlinear kernel and the kernel entries are defined by

$$k_{ij} = e^{-\frac{\left\| x_i - x_j \right\|_2^2}{2\sigma^2}} \qquad (16)$$

Note that in the kernel definition above, the kernel entry contains the square of the Euclidean distance (or two-norm) between data points, which is a dissimilarity measure (rather than a similarity), in a negative exponential. The negative exponential also contains a free parameter, $\sigma$, which is the Parzen window width for the RBF kernel. The proper choice for selecting the Parzen window is usually determined by an additional tuning, also called hyper-tuning, on an external validation set. The precise choice for $\sigma$ is not crucial; there usually is a relatively broad range for the choice of $\sigma$ for which the model quality should be stable.
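A vectorized sketch of the RBF kernel of Eq. (16); the function works for both the square training kernel (B = A) and the rectangular test kernel of Eq. (14):

```python
import numpy as np

def rbf_kernel(A, B, sigma):
    """Eq. (16): k_ij = exp(-||a_i - b_j||^2 / (2 sigma^2)).

    A is n x m, B is k x m; the result is the n x k kernel matrix.
    """
    # Squared distances via ||a - b||^2 = ||a||^2 - 2 a.b + ||b||^2.
    sq_dists = ((A**2).sum(axis=1)[:, None]
                - 2.0 * A @ B.T
                + (B**2).sum(axis=1)[None, :])
    return np.exp(-sq_dists / (2.0 * sigma**2))
```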

Different learning methods distinguish themselves in the way by which the weights are determined. Obviously, the model in Eqs. (12-14) to produce estimates or predictions for $y$ is linear. Such a linear model has a handicap in the sense that it cannot capture inherent nonlinearities in the data. This handicap can easily be overcome by applying the kernel transformation directly as a data transformation. We will therefore not operate directly on the data, but on a nonlinear transform of the data, in this case the nonlinear data kernel. This is very similar to what is done in principal component analysis, where the data are substituted by their principal components before building a model. A similar procedure will be applied here, but rather than substituting data by their principal components, the data will be substituted by their kernel transform (either linear or nonlinear) before building a predictive model.

The kernel transformation is applied here as a data transformation in a separate pre-processing stage. We actually replace the data by a nonlinear data kernel and apply a traditional linear predictive model. Methods where a traditional linear algorithm is used on a nonlinear kernel transform of the data are introduced here as "direct kernel methods." The elegance and advantage of such a direct kernel method is that the nonlinear aspects of the problem are captured entirely in the kernel and are transparent to the applied algorithm. If a linear algorithm was used before introducing the kernel transformation, the required mathematical operations remain linear. It is now clear how linear methods such as principal component regression, ridge regression, and partial least squares can be turned into nonlinear direct kernel methods, by using exactly the same algorithm and code: only the data are different, and we operate on the kernel transformation of the data rather than the data themselves.

In order to make out-of-sample predictions on true test data, a similar kernel transformation needs to be applied to the test data, as shown in Eq. (14). The idea of direct kernel methods is illustrated in Fig. 2, by showing how any regression model can be applied to kernel-transformed data. One could also represent the kernel transformation in a neural network type of flow diagram: the first hidden layer would now yield the kernel-transformed data, and the weights in the first layer would be just the descriptors of the training data. The second layer contains the weights that can be calculated with a hard computing method, such as kernel ridge regression. When a radial basis function kernel is used, this type of neural network would look very similar to a radial basis function neural network, except that the weights in the second layer are calculated differently.

Fig. 2. Direct kernels as a data pre-processing step

A. Dealing with Bias: Centering the Kernel

There is still one important detail that was overlooked so far, and that is necessary to make direct kernel methods work. Looking at the prediction equations in which the weight vector is applied to data, as in Eq. (1), there is no constant offset term or bias. It turns out that for data that are centered this offset term is always zero and does not have to be included explicitly. In machine learning lingo the proper name for this offset term is the bias, and rather than applying Eq. (1), a more general predictive model that includes this bias can be written as:

$$\hat{y}_n = X_{nm} w_m + b \qquad (17)$$

where $b$ is the bias term. Because we made it a practice in data mining to center the data first by Mahalanobis scaling, this bias term is zero and can be ignored.

When dealing with kernels, the situation is more complex, as they need some type of bias as well. We will give only a recipe here that works well in practice, and refer the reader to the literature for a more detailed explanation.[3, 6] Even when the data were Mahalanobis-scaled before applying a kernel transform, the kernel still needs some type of centering to be able to omit the bias term in the prediction model. A straightforward way for kernel centering is to subtract the average from each column of the training data kernel, and store this average for later recall, when centering the test kernel. A second step for centering the kernel is going through the newly obtained vertically centered kernel again, this time row by row, and subtracting the row average from each horizontal row.

The kernel of the test data needs to be centered in a consistent way, following a similar procedure. In this case, the stored column centers from the kernel of the training data will be used for the vertical centering of the kernel of the test data. This vertically centered test kernel is then centered horizontally, i.e., for each row, the average of the vertically centered test kernel is calculated, and each horizontal entry of the vertically centered test kernel is substituted by that entry minus the row average.
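The two-step centering recipe translates directly into code; a sketch (the helper names are ours) that also works for rectangular kernels:

```python
import numpy as np

def center_train_kernel(K):
    """Center the training kernel: first subtract the column averages
    (stored for later recall), then the row averages of the result."""
    col_avg = K.mean(axis=0)
    Kc = K - col_avg                       # vertical centering
    Kc = Kc - Kc.mean(axis=1)[:, None]     # horizontal centering
    return Kc, col_avg

def center_test_kernel(K_test, col_avg):
    """Center the test kernel consistently: vertical centering uses the
    stored training column averages, horizontal uses its own row averages."""
    Kc = K_test - col_avg
    return Kc - Kc.mean(axis=1)[:, None]
```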

Mathematical formulations for centering square kernels are explained in the literature.[3, 6] The advantage of the kernel-centering algorithm introduced (and described above in words) in this section is that it also applies to rectangular data kernels. The flow chart for pre-processing the data, applying a kernel transform on this data, and centering the kernel for the training data, validation data, and test data is shown in Fig. 3.

Fig. 3. Data pre-processing with kernel centering

B. Direct Kernel Ridge Regression

So far, the argument was made that by applying the kernel transformation in Eqs. (13) and (14), many traditional linear regression models can be transformed into a nonlinear direct kernel method. The kernel transformation and kernel centering proceed as data pre-processing steps (Fig. 2). In order to make the predictive model inherently nonlinear, the radial basis function kernel will be applied, rather than the (linear) dot product kernel used in Eqs. (2) and (10). There are actually several alternate choices for the kernel,[1-3] but the RBF kernel is the most widely applied kernel. In order to overcome the machine learning dilemma, a ridge can be applied to the main diagonal of the data kernel matrix. Since the kernel transformation is applied directly on the data, before applying ridge regression, this method is called direct-kernel ridge regression.

Kernel ridge regression and (direct) kernel ridge regression are not new. The roots of ridge regression can be traced back to the statistics literature,[5] and Least-Squares Support Vector Machines were introduced by Suykens et al.[9-10] In these works, kernel ridge regression is usually introduced as a regularization method that solves a convex optimization problem in a Lagrangian formulation for the dual problem that is very similar to traditional SVMs. The equivalency with ridge regression techniques then appears after a series of mathematical manipulations. By contrast, we introduced kernel ridge regression with few mathematical diversions in the context of the machine learning dilemma and direct kernel methods. For all practical purposes, kernel ridge regression is similar to support vector machines, works in the same feature space as support vector machines, and was therefore named least-squares support vector machines by Suykens et al.

Note that kernel ridge regression still requires the computation of an inverse for an $n \times n$ matrix, which can be quite large. This task is computationally demanding for large datasets, as is the case in a typical data mining problem. Since the kernel matrix now scales with the number of data squared, this method can also become prohibitive from a practical computer implementation point of view, because both memory and processing requirements can be very demanding. Krylov space-based methods[10] and conjugate gradient methods[1, 10] are relatively efficient ways to speed up the matrix inverse transformation of large matrices, where the computation time now scales as $n^2$, rather than $n^3$. The Analyze/Stripminer code[12] developed by the author applies Møller's scaled conjugate gradient method to calculate the matrix inverse.[13]

The issue of dealing with large datasets is even more profound. There are several potential solutions that will not be discussed in detail. One approach would be to use a rectangular kernel, where not all the data are used as bases to calculate the kernel, but a good subset of "support vectors" is estimated by chunking[1] or other techniques such as sensitivity analysis. More efficient ways for inverting large matrices are based on piece-wise inversion. Alternatively, the matrix inversion may be avoided altogether by adhering to the support vector machine formulation of kernel ridge regression, solving the dual Lagrangian optimization problem, and applying sequential minimal optimization or SMO.[16]
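Putting the pieces together, a direct kernel ridge regression model is just the earlier sketches chained in order (Mahalanobis scaling, RBF kernel, kernel centering, ridge solve); this reuses the hypothetical helpers mahalanobis_scale, rbf_kernel, center_train_kernel, and center_test_kernel defined above:

```python
import numpy as np

def fit_direct_kernel_ridge(X_train, y_train, sigma, lam):
    Xs, mean, std = mahalanobis_scale(X_train)
    K, col_avg = center_train_kernel(rbf_kernel(Xs, Xs, sigma))
    w_n = np.linalg.solve(K + lam * np.eye(len(y_train)), y_train)
    return {"w": w_n, "X": Xs, "mean": mean, "std": std,
            "col_avg": col_avg, "sigma": sigma}

def predict_direct_kernel_ridge(model, X_test):
    Xs = (X_test - model["mean"]) / model["std"]
    K_test = center_test_kernel(rbf_kernel(Xs, model["X"], model["sigma"]),
                                model["col_avg"])
    return K_test @ model["w"]
```

Note that the only nonlinear ingredient is the kernel itself; the fitting step is still the linear ridge solve of Eq. (13).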

III. HEURISTIC REGULARIZATION FOR λ

It has been shown that kernel ridge regression can be expressed as an optimization method,[10-15] where rather than minimizing the residual error on the training set, according to:

$$\sum_{i=1}^{n_{train}} \left\| y_i - \hat{y}_i \right\|_2^2 \qquad (18)$$

we now minimize:

$$\sum_{i=1}^{n_{train}} \left\| y_i - \hat{y}_i \right\|_2^2 + \frac{\lambda}{2} \left\| w \right\|_2^2 \qquad (19)$$

The above equation is a form of Tikhonov regularization[14] that has been explained in detail by Cherkassky and Mulier[4] in the context of empirical versus structural risk minimization. Minimizing the norm of the weight vector is in a sense similar to an error penalization for prediction models with a large number of free parameters. An obvious question in this context relates to the proper choice for the regularization parameter or ridge parameter $\lambda$.

In machine learning, it is common to tune the hyper-parameter $\lambda$ using a tuning/validation set. This tuning procedure can be quite time consuming for large datasets, especially in consideration that a simultaneous tuning for the RBF kernel width must proceed in a similar manner. We therefore propose a heuristic formula for the proper choice for the ridge parameter, which has proven to be close to optimal in numerous practical cases [36]. If the data were originally Mahalanobis scaled, it was found by scaling experiments that a near optimal choice for $\lambda$ is

$$\lambda = \min\left\{ 1,\; 0.05 \left( \frac{n}{200} \right)^{3/2} \right\} \qquad (20)$$

where $n$ is the number of data in the training set.
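A one-line implementation of the heuristic of Eq. (20); the function name is ours:

```python
def heuristic_ridge_parameter(n):
    """Eq. (20): lambda = min(1, 0.05 * (n / 200)^(3/2)).

    Assumes the data were Mahalanobis scaled first."""
    return min(1.0, 0.05 * (n / 200.0) ** 1.5)

print(heuristic_ridge_parameter(1000))   # ~0.56 for 1000 training data
```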

Note that in order to apply the above heuristic the data have to be Mahalanobis scaled first. Eq. (20) was validated on a variety of standard benchmark datasets from the UCI data repository, and provided results that are nearly identical to an optimally tuned $\lambda$ on a tuning/validation set. In any case, the heuristic formula for $\lambda$ should be an excellent starting choice for the tuning process for $\lambda$. The above formula also proved to be useful for the initial choice of the regularization parameter C of SVMs, where C is now taken as $1/\lambda$.

ACKNOWLEDGEMENT

The author acknowledges the National ScienceFoundation support of this work (IIS-9979860). Thediscussions with Robert Bress, Kristin Bennett,Karsten Sternickel, Boleslaw Szymanski and SeppoOvaska were extremely helpful to prepare this paper.

REFERENCES

[1] Nello Cristianini and John Shawe-Taylor [2000] Support Vector Machines and Other Kernel-Based Learning Methods, Cambridge University Press.