3 Answers

Intuitively, like all regressors, SVR tries to fit a line to the data by minimising a cost function. The interesting part about SVR, however, is that you can deploy a non-linear kernel. In that case you end up performing non-linear regression, i.e. fitting a curve rather than a line.

This process is based on the kernel trick and on representing the solution/model in the dual rather than in the primal. That is, the model is represented as a combination of the training points rather than as a function of the features and some weights. At the same time, the basic algorithm remains the same: the only real change in the process of going non-linear is the kernel function, which changes from a simple inner product to some non-linear function.
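As a rough sketch of this (assuming scikit-learn; the kernel choice and parameter values below are only illustrative), switching from linear to non-linear SVR is literally a one-argument change:

```python
# Minimal sketch: the only change needed to go from fitting a line to
# fitting a curve with SVR is the kernel argument (illustrative parameters).
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)       # 1-D inputs
y = np.sin(X).ravel() + 0.1 * rng.randn(80)    # noisy non-linear target

linear_svr = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y)  # fits a line
rbf_svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)        # fits a curve

print(linear_svr.predict([[2.5]]), rbf_svr.predict([[2.5]]))
```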

Where do the support vectors and hyperplane come into play?
– A A Jan 13 '14 at 19:28

Support vectors are those data points that participate in the solution, i.e. define the curve that fits the data. I don't think there is a notion of a hyperplane in the same sense as in classification; you could say that it is the curve itself.
– iliasfl Jan 13 '14 at 23:10

My understanding is that SVM classification is a subset of SVM regression: basically, SVM regression is a linear function from the features (and the generated child features [sorry, not very familiar with the terminology]) to the output values. Classification is basically doing the same thing, except that the results of the linear function are binned into appropriate classes by setting thresholds. These thresholds are the hyperplanes.
– naught101 Aug 4 '15 at 0:15


Sorry, but -1. Kernels are but a detail (any linear model working with the inner product between training instances can use kernels). The important thing is the difference between the hinge loss (in SVC) and the epsilon-insensitive loss (in SVR). And there is still a notion of a margin.
– Firebug Feb 5 '17 at 22:18
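(For reference, the two losses mentioned in this comment are usually written as
$$ L_\text{hinge}\bigl(y, f(x)\bigr) = \max\bigl(0,\ 1 - y\,f(x)\bigr), \qquad L_\epsilon\bigl(y, f(x)\bigr) = \max\bigl(0,\ |y - f(x)| - \epsilon\bigr), $$
i.e. SVC penalises points that fall on the wrong side of the margin, while SVR penalises points that fall outside the $\epsilon$-tube.)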

Firebug is completely on point. This is not a mathematical explanation but a description of one specific way of running an SVM.
– bonobo Feb 13 at 10:45

In short: maximising the margin can more generally be seen as regularising the solution by minimising $w$ (which is essentially minimising model complexity). This is done both in classification and in regression, but in the case of classification the minimisation is done under the condition that all examples are classified correctly, and in the case of regression under the condition that the value $y$ of every example deviates less than the required accuracy $\epsilon$ from $f(x)$.

In order to understand how you go from classification to regression it helps to see how, in both cases, one applies the same SVM theory to formulate the problem as a convex optimisation problem. I'll try to put both side by side.

Classification

In this case the goal is to find a function $f(x) = wx + b$ such that $f(x) \geq 1$ for positive examples and $f(x) \leq -1$ for negative examples. Under these conditions we want to maximise the margin (the distance between the two red bars), which is nothing more than minimising the derivative of $f$, i.e. $f'(x) = w$.

The intuition behind maximising the margin is that this will give us a unique solution to the problem of finding $f(x)$ (i.e. we discard, for example, the blue line), and also that this solution is the most general one under these conditions, i.e. it acts as a regularisation. This can be seen as follows: around the decision boundary (where the red and black lines cross) the classification uncertainty is the biggest, and choosing the lowest value of $f(x)$ in this region yields the most general solution.

The data points at the two red bars are the support vectors in this case; they correspond to the non-zero Lagrange multipliers of the equality part of the inequality conditions $f(x) \geq 1$ and $f(x) \leq -1$.
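Written out explicitly (a sketch of the standard hard-margin form, with labels $y_i \in \{-1, +1\}$), the classification problem above reads:

$$ \min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i\,(w \cdot x_i + b) \geq 1 \ \text{ for all } i. $$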

Regression

In this case the goal is to find a function $f(x) = wx + b$ (red line) under the condition that $f(x)$ is within a required accuracy $\epsilon$ of the value $y(x)$ (black bars) of every data point, i.e. $|y(x) - f(x)| \leq \epsilon$, where $\epsilon$ is the distance between the red and the grey line. Under this condition we again want to minimise $f'(x) = w$, again for the sake of regularisation and in order to obtain a unique solution as the result of the convex optimisation problem. One can see how minimising $w$ results in a more general case, as the extreme value $w = 0$ would mean no functional relation at all, which is the most general result one can obtain from the data.

The data points at the two red bars are the support vectors in this case; they correspond to the non-zero Lagrange multipliers of the equality part of the inequality condition $|y - f(x)| \leq \epsilon$.
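Written out in the same style (again a sketch, using the hard, slack-free $\epsilon$-tube form), the regression problem reads:

$$ \min_{w,\,b}\ \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad |y_i - (w \cdot x_i + b)| \leq \epsilon \ \text{ for all } i. $$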

Conclusion

Both cases result in the following problem:

$$ \min \frac{1}{2}\|w\|^2 $$

Under the condition that:

All examples are classified correctly (Classification)

The value $y$ of all examples deviates less than $\epsilon$ from $f(x)$. (Regression)
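As a small illustration of this shared structure (a rough sketch assuming scikit-learn; the data and parameters are made up for the example), both the classifier and the regressor end up keeping only the points whose constraints are active, i.e. the support vectors:

```python
# Sketch: both SVC and SVR represent their solution through a subset of the
# training points (the support vectors).
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.RandomState(0)

# Classification: two separable blobs
Xc = np.vstack([rng.randn(50, 2) + 2, rng.randn(50, 2) - 2])
yc = np.array([1] * 50 + [-1] * 50)
clf = SVC(kernel="linear", C=1.0).fit(Xc, yc)

# Regression: noisy linear target
Xr = np.sort(5 * rng.rand(80, 1), axis=0)
yr = 0.7 * Xr.ravel() + 0.1 * rng.randn(80)
reg = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(Xr, yr)

print("classification support vectors:", clf.support_vectors_.shape[0])
print("regression support vectors:    ", reg.support_vectors_.shape[0])
```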

In SVM for a classification problem, we try to separate the classes as far as possible from the separating line (hyperplane) and, unlike logistic regression, we create a safety margin on both sides of the hyperplane (the difference between logistic regression and SVM classification lies in their loss functions).
Eventually, the different data points end up separated as far as possible from the hyperplane.

In SVM for a regression problem, we want to fit a model to predict a quantity in the future. Therefore, we want the data points (observations) to be as close as possible to the hyperplane, unlike in SVM for classification.
SVM regression is inherited from simple regression (such as ordinary least squares), with the difference that we define an epsilon range on both sides of the hyperplane to make the regression function insensitive to errors within it, unlike SVM for classification where we define a boundary in order to be safe when making future decisions (predictions).
Eventually, SVM for regression has a boundary just like SVM for classification, but the boundary in regression serves to make the regression function insensitive with respect to small errors, whereas the boundary in classification only serves to keep points far away from the hyperplane (decision boundary) so that the classes can be distinguished in the future (that is why we call it a safety margin).
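To make the role of that epsilon boundary concrete (a rough sketch assuming scikit-learn; the values are only illustrative), widening the tube makes the model insensitive to more points, so fewer of them end up as support vectors:

```python
# Sketch: points that fall inside the epsilon-tube do not contribute to the
# solution, so a wider tube leaves fewer support vectors.
import numpy as np
from sklearn.svm import SVR

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(100, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)

for eps in (0.01, 0.1, 0.5):
    n_sv = SVR(kernel="rbf", C=1.0, epsilon=eps).fit(X, y).support_.size
    print(f"epsilon={eps}: {n_sv} support vectors")
```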