Wednesday, November 30, 2011

Week 7 of Andrew Ng's Machine Learning class covers support vector machines, pragmatically from the perspective of calling a library rather than implementation. I've been wanting to learn more about SVMs for quite a while, so I was excited for this one.

A support vector machine is a supervised classification algorithm. Given labeled training data, typically high-dimensional vectors, SVM finds the maximum-margin hyperplane separating the positive and negative examples. The algorithm selects a decision boundary that does the best job of separating the classes with the extra stipulation that the boundary be as far as possible from the nearest samples on either side. This is where the large margin part comes from.

The cost function used with SVMs is a slightly modified version of that used with logistical regression:

With SVMs, we replace the sigmoid functions with linearized version called cost1 and cost2.

Some error is accepted, allowing for misclassification of some training examples in the interest of getting the majority correct. The parameter C acts as a form of regularization, specifying tolerance for training error.

But, what if the boundary between classes is non-linear, like the one shown here?

The SVM algorithm generalizes to non-linear cases with the aid of kernel functions. A straight line in n dimensions, a hyper-plane, can be viewed as a linear kernel. The other widely used class of kernel functions is the guassian kernel. It's my understanding that the kernel function maps a non-linear boundary in the problem space to a linear boundary in a higher dimensional space.

The SVM algorithm is sped up by a performance hack called the kernel trick, which I understand just in general outline: The kernel trick is a way of mapping observations into a higher dimensional space V, without ever having to compute the mapping explicitly. The trick is to use learning algorithms that only require dot products between the vectors in V, and choose the mapping such that these high-dimensional dot products can be computed within the original space, by means of a kernel function.

There is some equivalence between SVMs and neural networks that I don't quite grasp. The process of computing the kernel function on the input vectors is something like the hidden layer of the neural network, which transforms and weighs the input features. I'm not sure if the analogy between SVMs and ANNs goes deeper. Also by virtue of the kernels, SVMs are a member of a more general class of statistical algorithms called kernel methods.

The exercise was to build a spam filter based on a small subset of the SpamAssassin public corpus of 6047 messages, of which roughly a third are spam. I trained an SVM and tried in on email from my spam-magnet yahoo email address, and it worked!