The Kernel Trick in Support Vector Machines: Seeing Similarity in More Intricate Dimensions

A “kernel” is the seed, the essence at the heart or core, and a kernel function weights points by their distance from that center. In the following example from Wikipedia, the kernel is centered at the origin, and the different curves illustrate alternative depictions of what happens as we move away from zero.

At what temperature do you prefer your first cup of coffee? If we center the scale at that temperature, how do we measure the effects of deviations from the ideal level? The uniform kernel function tells us that being closer to the optimum makes little difference as long as the temperature is within a certain range. You might feel differently; perhaps your disappointment grows at a constant rate as you move away from the best temperature in either direction (a triangular kernel function). However, for most of us, satisfaction falls off smoothly along a bell-shaped curve, with a Gaussian kernel describing our preferences as we deviate from the very best.
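To make the three shapes concrete, here is a minimal Python sketch of the uniform, triangular, and Gaussian kernels in their standard textbook forms. The ideal temperature (60 °C) and the tolerance, or bandwidth (10 °C), are made-up numbers chosen only for illustration.

```python
import math

def uniform_kernel(u):
    """Constant weight inside |u| <= 1, zero outside."""
    return 0.5 if abs(u) <= 1 else 0.0

def triangular_kernel(u):
    """Weight falls linearly from 1 at the center to 0 at |u| = 1."""
    return max(0.0, 1.0 - abs(u))

def gaussian_kernel(u):
    """Bell-shaped weight that decays smoothly and never reaches exactly zero."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

# Hypothetical values: ideal temperature and tolerance (bandwidth) in degrees C.
ideal, bandwidth = 60.0, 10.0

for temp in (60, 65, 70, 80):
    u = (temp - ideal) / bandwidth  # deviation from the ideal in bandwidth units
    print(temp, uniform_kernel(u), triangular_kernel(u), round(gaussian_kernel(u), 4))
```

The only thing that changes from one kernel to the next is how quickly the weight drops as `u` moves away from zero.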

Everitt and Hothorn show how it is done in R for density estimation. Of course, the technique works with any variable, not just preference or distance from the ideal. Moreover, the logic is the same: give greater weight to closer data. And how does one measure closeness? You have many alternatives, as shown above, varying from tolerant to strict. What counts as the same depends on your definition of sameness. With human vision the person retains their identity and our attention as they walk from the shade into the sunlight; my old camera has a different kernel function and fails to keep track or focus correctly. In addition, when the density being estimated is multivariate, you have the option of differential weighting of each variable so that some aspects will count a great deal and others can be ignored.
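Everitt and Hothorn work in R; the sketch below is not their code but a rough Python equivalent of the same idea, a one-dimensional kernel density estimate with a Gaussian kernel. The temperatures and the bandwidth are invented for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard Gaussian kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kde(x, data, bandwidth):
    """Density estimate at x: the average of kernels centered on each observation."""
    n = len(data)
    return sum(gaussian_kernel((x - xi) / bandwidth) for xi in data) / (n * bandwidth)

# Hypothetical preferred temperatures (degrees C) from six coffee drinkers.
temps = [55.0, 58.0, 60.0, 61.0, 63.0, 70.0]

for x in (50, 60, 70, 80):
    print(x, round(kde(x, temps, bandwidth=3.0), 4))
```

Observations close to `x` contribute most of the estimate, which is the "greater weight to closer data" logic in action; a wider bandwidth is the more tolerant definition of closeness.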

Now, with the preliminaries over, we can generalize the kernel concept to support vector machines (SVMs). First, we will expand our feature space because the optimal cup of coffee depends on more than its temperature (e.g., preparation method, coffee bean storage, variation in fineness and method of grind, ratio of coffee to water, and don’t forget the type of bean and its processing). You tell me the profile of two coffees using all those features that we just enumerated, and I will calculate their pairwise similarity. If their profiles are identical, the two coffees are the same and the distance between them is zero. But if they are not identical, how important are the differences? Finally, we ought to remember that differences are measured with respect to satisfaction, that is, two equally pleasing coffees may have different profiles but the differences are not relevant.
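One common way to turn two profiles into a single similarity score is a Gaussian (RBF) kernel on a weighted squared distance. The sketch below assumes each coffee has already been encoded as numbers on a 0 to 1 scale; the three features, their values, and the weights are all hypothetical.

```python
import math

def rbf_similarity(a, b, weights, gamma=1.0):
    """Gaussian (RBF) similarity: exp(-gamma * weighted squared distance).

    Identical profiles have distance 0 and therefore similarity 1.
    A weight of 0 means differences on that feature are ignored.
    """
    d2 = sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights))
    return math.exp(-gamma * d2)

# Hypothetical profiles: (temperature, strength, grind fineness), each scaled to 0..1.
coffee_a = (0.8, 0.6, 0.2)
coffee_b = (0.8, 0.6, 0.9)

equal_weights = (1.0, 1.0, 1.0)
ignore_grind = (1.0, 1.0, 0.0)

print(rbf_similarity(coffee_a, coffee_a, equal_weights))  # identical profiles
print(rbf_similarity(coffee_a, coffee_b, equal_weights))  # grind difference counts
print(rbf_similarity(coffee_a, coffee_b, ignore_grind))   # grind difference ignored
```

The weights carry the answer to "how important are the differences?": set a weight to zero and two coffees that differ only on that feature are treated as the same.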

As the Mad Hatter explained in the last post, SVMs live in the observation space, in this case, among all the different cups of coffee. We will need a data matrix with a bunch of coffees for taste testing in the rows and all those features as columns, plus an additional column with a satisfaction rating or at least a thumbs-up or thumbs-down. Keeping it simple, we will stay with a classification problem distinguishing good from bad coffees. Can I predict your coffee preference from those features? Unfortunately, individual tastes are complex: strong coffee may be great for some, but only when hot. What of those who don’t like strong coffee? It is as if we had multiple configurations of interacting nonlinear features with many more dimensions than can be represented in the original feature space.

Our training data from the taste tests might contain actual coffees near each of these configurations differentiating the good and the bad. These are the support vectors of SVMs, what Andrew Ng calls “landmarks” in his Coursera course and his more advanced class at Stanford. In this case, the support vectors are actual cups of coffee that you can taste and judge as good or bad. Chapter 9 of An Introduction to Statistical Learning will walk you through the steps, including how to run the R code, but you might leave without a good intuitive grasp of the process.

It would help to remember that a logistic regression equation and the coefficients from a discriminant analysis yield a single classification dimension when you have two groupings. What happens when there are multiple ways to succeed or fail? I can name several ways to prepare different types of coffee, and I am fond of them all. Similarly, I can recall many ways to ruin a cup of coffee. Think of each as a support vector from the training set and the classification function as a weighted similarity to instances from this set. If a new test coffee is similar to one called “good” from the training data, we might want to predict “good” for this one too. The same applies to coffees associated with the “bad” label.
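The "weighted similarity to instances" idea corresponds to the usual kernel SVM decision function, f(x) = sum of alpha_i * y_i * K(x_i, x) + b. In the sketch below the support vectors, their weights (`alphas`), the bias, and `gamma` are hand-picked for illustration rather than fitted by an SVM solver, and the two features (strength and temperature, scaled 0 to 1) are hypothetical.

```python
import math

def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def decision(x, support_vectors, labels, alphas, bias=0.0, gamma=1.0):
    """f(x) = sum_i alpha_i * y_i * K(sv_i, x) + bias."""
    return sum(a * y * rbf(sv, x, gamma)
               for sv, y, a in zip(support_vectors, labels, alphas)) + bias

def predict(x, support_vectors, labels, alphas, bias=0.0, gamma=1.0):
    score = decision(x, support_vectors, labels, alphas, bias, gamma)
    return "good" if score >= 0 else "bad"

# Hypothetical support vectors: (strength, temperature), each scaled to 0..1.
support_vectors = [(0.9, 0.9), (0.2, 0.3), (0.9, 0.2), (0.2, 0.9)]
labels = [+1, +1, -1, -1]       # +1 = good, -1 = bad
alphas = [1.0, 1.0, 1.0, 1.0]   # hand-picked, not estimated from data

print(predict((0.85, 0.80), support_vectors, labels, alphas, gamma=5.0))  # strong and hot
print(predict((0.85, 0.25), support_vectors, labels, alphas, gamma=5.0))  # strong and cold
```

There are two different ways to be "good" and two to be "bad" here, so no single weighted sum of strength and temperature separates them, yet similarity to the nearest remembered cup does.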

The key is understanding that the features from our data matrix are no longer the dimensions underlying this classification space. We have redefined the basis in terms of landmarks or support vectors. New coffees are placed along dimensions defined by previous training instances. As Pedro Domingos notes (at 33 minutes into the talk), the algorithm relies on analogy, not unlike case-based reasoning. Our new dimensions are more intricate, compressed representations of the original features. If this reminds you of archetypal analysis, then you may be on the right track or at least not entirely lost.
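The change of basis can be sketched in a few lines: a new coffee is re-described by its similarity to each landmark, so it gets one coordinate per landmark instead of one per original feature. The landmarks, the feature encoding (strength and temperature, scaled 0 to 1), and `gamma` below are assumptions made for illustration.

```python
import math

def rbf(a, b, gamma=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))

def landmark_features(x, landmarks, gamma=1.0):
    """Map x to one coordinate per landmark: its similarity to that landmark."""
    return [rbf(x, lm, gamma) for lm in landmarks]

# Hypothetical landmarks: three previously tasted coffees as (strength, temperature).
landmarks = [(0.9, 0.9), (0.2, 0.3), (0.9, 0.2)]

new_coffee = (0.85, 0.80)
print([round(v, 3) for v in landmark_features(new_coffee, landmarks, gamma=5.0)])
```

A linear classifier fitted on these similarity coordinates is nonlinear in the original features, which is the sense in which the new dimensions are more intricate than the old ones.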