Support Vector Machines (SVMs) – A Must Have ML Algorithm

I'm Piyush Malhotra, a Delhilite who loves to dig Deep in the woods of Artificial Intelligence. I like to find new ways to solve not so new but interesting problems. Fitting new models to data and articulating new ways to manipulate and personify things is what I think my field is all about. When not working or playing with data, you'll find me in the gym or writing new blog posts.

March 29, 2019

So, you want to add a new algorithm to your Machine Learning arsenal. If you are here, then you know we are about to do just that. Make sure you grab a cup of coffee, as this is going to be a long but useful read.

A Support Vector Machine (SVM) is a powerful and flexible Machine Learning algorithm. It works well on datasets of small or medium size (say, 100 to 10K datapoints). Many major ML tasks, including regression, classification and outlier detection, can be tackled using SVMs.

LARGE MARGIN CLASSIFICATION

For the sake of simplicity, let’s consider a linearly separable dataset (2 features of the Iris dataset, to predict whether a flower is Iris-Virginica or not).

Now, let’s have a look at 3 possible decision boundaries that will work for the given data:

On one hand, the boundary in (a) is too close to the class represented by triangles, while on the other, the boundary in (b) is too close to the class represented by squares. What if we introduce some more points (our validation data) into the graph? Let’s have a look at the 3 boundaries again:

We can clearly see that the boundary in (c) classifies the new points better than the other two. This gives us the essence of what SVMs do. An SVM classifier tries to fit the widest possible street between the classes. This is called large margin classification.

The boundary in an SVM is determined (or supported) by the instances that are located closest to it, at the edges of the street. These instances are referred to as support vectors. What does this mean? It means that adding more training instances that lie off the street will not affect the boundary at all.
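The claim about support vectors is easy to check on a toy example. The sketch below is only an illustration under assumptions: the six points and the two extra points are made up for this demo, a large C is used to approximate a hard margin, and scikit-learn's SVC with a linear kernel is used because it exposes the support vectors directly.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well separated clusters.
X = np.array([[1, 1], [2, 1], [1, 2], [4, 4], [5, 4], [4, 5]], dtype=float)
y = np.array([0, 0, 0, 1, 1, 1])

# A large C approximates a hard margin on separable data.
clf = SVC(kernel="linear", C=1000).fit(X, y)
print("support vectors:\n", clf.support_vectors_)

# Add two instances that lie far off the street, each on its own side.
X_more = np.vstack([X, [[0, 0], [6, 6]]])
y_more = np.append(y, [0, 1])
clf_more = SVC(kernel="linear", C=1000).fit(X_more, y_more)

# The weights and intercept should be (numerically) the same in both fits.
print("w, b before:", clf.coef_, clf.intercept_)
print("w, b after: ", clf_more.coef_, clf_more.intercept_)
```

Only the instances sitting on the edges of the street show up in support_vectors_; the extra points are far from the street and leave the boundary where it was.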

SOFT MARGIN CLASSIFICATION vs HARD MARGIN CLASSIFICATION

If we strictly impose the condition that no point may lie inside the street (or on the wrong side of it), we get hard margin classification. This kind of classification only works when the data is linearly separable, and it is sensitive to outliers, for example a square instance lying close to the triangle class!

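To get over this, we aim for a classification that strikes a good balance between keeping the street as wide as possible and limiting violations of the margin. We can control this balance using the C param in the scikit-learn LinearSVC class: a smaller C gives a wider street with more margin violations, while a larger C gives a narrower street with fewer violations.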
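To get a feel for what C does, here is a rough sketch that fits the same model with three values of C and reports the street width and the number of margin violations. The choice of features (petal length and width), the Virginica-vs-rest target, the C values and the helper name fit_and_measure are assumptions made for this illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris.data[:, 2:4]                 # petal length, petal width
y = (iris.target == 2).astype(int)    # 1 if Iris-Virginica, else 0

def fit_and_measure(C):
    """Hypothetical helper: fit a soft margin SVM and describe its street."""
    model = Pipeline([
        ("scaler", StandardScaler()),
        # Hinge loss needs the dual formulation in liblinear, hence dual=True.
        ("svc", LinearSVC(C=C, loss="hinge", dual=True,
                          max_iter=100_000, random_state=42)),
    ])
    model.fit(X, y)
    w = model.named_steps["svc"].coef_[0]
    street_width = 2 / np.linalg.norm(w)          # measured in scaled feature space
    signed_score = np.where(y == 1, 1, -1) * model.decision_function(X)
    violations = int(np.sum(signed_score < 1))    # inside the street or misclassified
    return street_width, violations

results = {C: fit_and_measure(C) for C in (0.01, 1, 100)}
for C, (width, violations) in results.items():
    print(f"C={C:<5} street width={width:.2f}  margin violations={violations}")
```

Reading the printout from top to bottom, the street should shrink and the violation count should drop as C grows.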

Enough of the theory, let’s do some action now!

The code below loads the iris dataset, uses 2 of its classes, fits a LinearSVC to the data and plots a decision boundary using the plot_predictions function.
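A minimal sketch of such a script could look like the following. It assumes petal length and petal width as the two features and Iris-Virginica vs. the rest as the target (following the setup described earlier), and the values C=1 and the sample flower [5.5, 1.7] are picked purely for illustration. The plot_predictions helper is not reproduced; instead, the sketch evaluates the decision function on a grid, which is the data a contour plot of the boundary would need.

```python
import numpy as np
from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

iris = datasets.load_iris()
X = iris.data[:, 2:4]                 # petal length, petal width (assumed features)
y = (iris.target == 2).astype(int)    # 1 if Iris-Virginica, else 0

svm_clf = Pipeline([
    ("scaler", StandardScaler()),     # SVMs are sensitive to feature scales
    ("linear_svc", LinearSVC(C=1, loss="hinge", dual=True,
                             max_iter=10_000, random_state=42)),
])
svm_clf.fit(X, y)

train_accuracy = svm_clf.score(X, y)
sample_prediction = svm_clf.predict([[5.5, 1.7]])   # one hypothetical flower
print("training accuracy:", train_accuracy)
print("prediction for [5.5, 1.7]:", sample_prediction)

# Decision scores on a grid covering the data. In a contour plot, level 0 is
# the decision boundary and levels -1 and +1 are the edges of the street.
x0 = np.linspace(X[:, 0].min(), X[:, 0].max(), 200)
x1 = np.linspace(X[:, 1].min(), X[:, 1].max(), 200)
xx, yy = np.meshgrid(x0, x1)
scores = svm_clf.decision_function(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
```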

NON-LINEAR CLASSIFICATION AND KERNEL TRICK

There are plenty of ways to make classification models perform well on non-linear datasets. For example, we can use polynomial features to help our model capture higher order relations between the existing features.

But creating these higher order features adds many more features to our dataset and in turn slows down our model. This is where we can use a technique called the kernel trick.

The kernel trick, at its core, carries the idea that non-linear data can become linearly separable when looked at in (or transformed to) a higher dimensional space. It lets us get the same result as if we had added polynomial features, even of a high degree, without actually adding them, so the model is not bombarded with an explosion in the number of features.

Let’s compare a linear SVM in a pipeline with polynomial features to an SVM that uses a poly kernel.
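One way to set up this comparison is sketched below. The moons dataset, the degree of 3 and the values of C and coef0 are assumptions chosen for illustration, not settings taken from this post.

```python
from sklearn.datasets import make_moons
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler
from sklearn.svm import SVC, LinearSVC

# A small non-linear toy dataset (assumed for this sketch).
X, y = make_moons(n_samples=100, noise=0.15, random_state=42)

# Option 1: add polynomial features explicitly, then fit a linear SVM.
poly_features_svm = Pipeline([
    ("poly", PolynomialFeatures(degree=3)),
    ("scaler", StandardScaler()),
    ("svc", LinearSVC(C=10, loss="hinge", dual=True,
                      max_iter=100_000, random_state=42)),
]).fit(X, y)

# Option 2: keep the 2 original features and let the poly kernel do the work.
poly_kernel_svm = Pipeline([
    ("scaler", StandardScaler()),
    ("svc", SVC(kernel="poly", degree=3, coef0=1, C=5)),
]).fit(X, y)

n_explicit = poly_features_svm.named_steps["poly"].n_output_features_
n_kernel = poly_kernel_svm.named_steps["svc"].n_features_in_
acc_explicit = poly_features_svm.score(X, y)
acc_kernel = poly_kernel_svm.score(X, y)

print("features seen by the linear SVM:", n_explicit)
print("features seen by the kernel SVM:", n_kernel)
print("training accuracy:", acc_explicit, acc_kernel)
```

With degree 3 the explicit route already turns 2 features into 10 columns, and that count grows quickly with the degree, while the kernel version keeps working on the original 2.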

Well, it all looks great. But what is the hinge loss we have been so fixated on in all of the code till now? Let’s have a look at it now!

HINGE LOSS

Hinge loss is defined by the function:

loss = max(0, 1 - t)

(Intuitively) This loss function is obtained by taking the cross entropy loss function (as we discussed in the Maximum Likelihood Estimation section) and flattening it, as shown below. This in turn gives us computational advantages in the optimization process.

The derivative of hinge loss is 0 when t > 1, -1 when t < 1 and undefined at t = 1. To make use of gradient descent, we can take any sub-derivative at t = 1, that is, any value between -1 and 0 (typically -1 or 0).
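In code, the loss and its sub-derivative are one-liners. The sketch below uses NumPy, and the function names as well as the choice of 0 as the sub-derivative at t = 1 are assumptions made for this example.

```python
import numpy as np

def hinge_loss(t):
    """max(0, 1 - t), applied element-wise."""
    return np.maximum(0.0, 1.0 - t)

def hinge_subgradient(t):
    """-1 where t < 1, 0 where t > 1; at the kink t == 1 we pick 0."""
    return np.where(t < 1.0, -1.0, 0.0)

t = np.array([-1.0, 0.0, 0.5, 1.0, 2.0])
losses = hinge_loss(t)
grads = hinge_subgradient(t)
print(losses)   # [2.  1.  0.5 0.  0. ]
print(grads)    # [-1. -1. -1.  0.  0.]
```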

How is it any better than cross-entropy? Say the true label is 1. Remember, ‘t’ here represents the weighted sum of the input features. In the case of cross-entropy, whenever t >= 0 the sigmoid (or softmax) of t is already at least 0.5 and leaning towards 1, so cross-entropy assigns a relatively low loss to the given feature-set. Hinge loss, on the other hand, drops to 0 only when t >= 1, which corresponds to a higher sigmoid (or softmax) value. Thus, a higher value of the weighted sum is needed to prove the significance of the result!

In other words, we can say that hinge loss is more stringent about how confident a prediction must be before it stops being penalized.
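A few numbers make the comparison concrete. The sketch below assumes a true label of 1 and evaluates both losses at a handful of hand-picked values of t; the function names are made up for this example.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy_for_label_1(t):
    # -log(sigmoid(t)), written in a numerically stable form.
    return np.logaddexp(0.0, -t)

def hinge(t):
    return np.maximum(0.0, 1.0 - t)

for t in (0.0, 0.5, 1.0, 3.0):
    print(f"t={t}: sigmoid={sigmoid(t):.3f}  "
          f"cross-entropy={cross_entropy_for_label_1(t):.3f}  "
          f"hinge={hinge(t):.3f}")
```

Below t = 1 hinge loss keeps penalizing the prediction linearly, and from t = 1 onwards it is exactly 0, whereas cross-entropy shrinks smoothly and never quite reaches 0.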

We’ve covered some major ground here. But for one of the most researched ML algorithms of the last decade of the 20th century, this still isn’t enough. Most of this research was around the use of the kernel trick. Many kernels were introduced and studied; one famous example is the Gaussian Radial Basis Function (RBF) kernel, here with γ = 0.3.

GAUSSIAN RBF KERNEL

Consider it to be another technique that can tackle nonlinear datasets. The kernel works by using features that quantify the similarity between each input instance and a set of landmarks.

In this case, every other point in the dataset acts as a landmark. At a glance this might look expensive. However, like the polynomial kernel, the RBF kernel behaves in such a manner that we get the effect of the new similarity features without actually adding them.
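To see what these similarity features look like, the sketch below computes them by hand using the usual Gaussian RBF form exp(-γ‖x − ℓ‖²), checks them against scikit-learn's rbf_kernel, and then fits an SVC that uses the kernel directly. The moons dataset, C=1 and the helper name rbf_similarity are assumptions made for this illustration; γ = 0.3 follows the value mentioned above.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.metrics.pairwise import rbf_kernel
from sklearn.svm import SVC

X, y = make_moons(n_samples=100, noise=0.15, random_state=42)
gamma = 0.3

def rbf_similarity(X, landmarks, gamma):
    """Hypothetical helper: exp(-gamma * squared distance) to every landmark."""
    sq_dists = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq_dists)

# Explicit route: every instance is a landmark, so 100 points give 100 features.
explicit_features = rbf_similarity(X, X, gamma)
matches_sklearn = np.allclose(explicit_features, rbf_kernel(X, X, gamma=gamma))
print("explicit feature matrix shape:", explicit_features.shape)
print("matches sklearn's rbf_kernel:", matches_sklearn)

# Kernel route: the SVC never materializes those features as extra columns.
rbf_svm = SVC(kernel="rbf", gamma=gamma, C=1).fit(X, y)
rbf_accuracy = rbf_svm.score(X, y)
print("training accuracy:", rbf_accuracy)
```

Each instance has similarity 1 to itself, and the similarity decays towards 0 as the landmark gets farther away; a larger γ makes that decay faster.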