ANN (Artificial Neural Networks) and SVM (Support Vector Machines) are two popular strategies for supervised machine learning and classification. It's not often clear which method is better for a particular project, and I'm certain the answer is always "it depends." Often, a combination of both along with Bayesian classification is used.

Several questions regarding ANN vs. SVM have already been asked on Stack Overflow.

In this question, I'd like to know specifically which aspects of an ANN (specifically, a multilayer perceptron) might make it desirable to use over an SVM. The reason I ask is that it's easy to answer the opposite question: Support Vector Machines are often superior to ANNs because they avoid two major weaknesses of ANNs:

(1) ANNs often converge on local minima rather than global minima, meaning that they are essentially "missing the big picture" sometimes (or missing the forest for the trees)

(2) ANNs often overfit if training goes on too long, meaning that for any given pattern, an ANN might start to consider the noise as part of the pattern.

SVMs don't suffer from either of these two problems. However, it's not readily apparent that SVMs are meant to be a total replacement for ANNs. So what specific advantage(s) does an ANN have over an SVM that might make it applicable for certain situations? I've listed specific advantages of an SVM over an ANN, now I'd like to see a list of ANN advantages (if any).

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
If this question can be reworded to fit the rules in the help center, please edit the question.


Unfortunately this will probably be closed or moved soon, but I absolutely love the question. I'd like nothing better than to see a range of thoughtful answers to this one.
–
duffymo Jul 24 '12 at 14:01


I imagine most of the answers to this question will be speculative or based on evidence, because there are very few theoretical guarantees on the power of these machines. For instance (if I recall correctly), it is unknown whether an n-layer feed-forward neural network is more powerful than a 2-layer network. So how can we say that one is better than the other in principle if we don't even understand the relationships between slight variations of the same model?
–
JeremyKun Nov 24 '12 at 20:30


It is closed for not being very constructive ... Lol!
–
Erogol May 1 '14 at 16:37


I love that StackOverflow tries to keep the quality of questions and answers high. I hate that StackOverflow enforces this with an ax instead of a scalpel. There's a difference between asking "how do I do HTML stuffz?" and a domain-specific question that would be hard to find an answer to elsewhere. There's a reason this has 140 upvotes -- yet it's considered "not constructive." Questions like this are the epitome of constructive. Certainly far more so than many of the ones I see every day that neatly fall into the Q&A format while being nevertheless useless to almost everyone but the asker.
–
Jules Ries Jun 16 at 8:47

This is obviously constructive. I can't understand why it would be closed. It is asking for specific situations where using one algorithm has advantages over using an alternative algorithm. Is that not a reasonable thing to ask?
–
Rab Jul 17 at 17:59

6 Answers

Judging from the examples you provide, I'm assuming that by ANNs, you mean multilayer feed-forward networks (FF nets for short), such as multilayer perceptrons, because those are in direct competition with SVMs.

One specific benefit that these models have over SVMs is that their size is fixed: they are parametric models, while SVMs are non-parametric. That is, in an ANN you have a bunch of hidden layers with sizes h1 through hn depending on the number of features, plus bias parameters, and those make up your model. By contrast, an SVM (at least a kernelized one) consists of a set of support vectors, selected from the training set, with a weight for each. In the worst case, the number of support vectors is exactly the number of training samples (though that mainly occurs with small training sets or in degenerate cases), and in general the model size scales linearly with the number of training samples. In natural language processing, SVM classifiers with tens of thousands of support vectors, each having hundreds of thousands of features, are not unheard of.
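To make the parametric/non-parametric contrast concrete, here is a back-of-the-envelope sketch (the layer sizes and feature counts are made-up illustrative numbers): the MLP's parameter count is fixed by its architecture, while the worst-case storage of a kernelized SVM grows with the training set.

```python
# Parameter count of a fixed-architecture MLP vs. worst-case
# model size of a kernelized SVM. All sizes are illustrative.

def mlp_param_count(layer_sizes):
    """Weights + biases for a fully-connected feed-forward net."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))

def svm_worst_case_floats(n_train, n_features):
    """Kernel SVM worst case: every training sample becomes a support
    vector, each stored with its features plus one dual weight."""
    return n_train * (n_features + 1)

d, h, k = 100, 50, 10                    # features, hidden units, classes
print(mlp_param_count([d, h, k]))        # fixed, independent of N
for n in (1_000, 100_000):
    print(svm_worst_case_floats(n, d))   # grows linearly with N
```

The MLP's 5,560 parameters stay 5,560 no matter how much data you train on; the kernel SVM's worst case is 100x larger for 100x the data.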

Also, online training of FF nets is very simple compared to online SVM fitting, and predicting can be quite a bit faster.

EDIT: all of the above pertains to the general case of kernelized SVMs. Linear SVMs are a special case in that they are parametric and allow online learning with simple algorithms such as stochastic gradient descent.
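A minimal sketch of that special case, assuming NumPy (the toy data and hyperparameters are arbitrary choices for illustration): stochastic gradient descent on the regularized hinge loss trains a linear SVM one sample at a time, exactly the kind of cheap online update a kernelized SVM does not admit.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs, labels +1 / -1.
X = np.vstack([rng.normal(+2, 1, (50, 2)), rng.normal(-2, 1, (50, 2))])
y = np.array([1] * 50 + [-1] * 50)

w, b = np.zeros(2), 0.0
lr, lam = 0.1, 0.01                      # step size, L2 regularization
for _ in range(20):                      # epochs
    for i in rng.permutation(len(X)):    # one sample at a time: online
        if y[i] * (X[i] @ w + b) < 1:    # inside the margin: hinge active
            w += lr * (y[i] * X[i] - lam * w)
            b += lr * y[i]
        else:
            w -= lr * lam * w            # only the regularizer pulls

acc = np.mean(np.sign(X @ w + b) == y)
print(acc)
```

Each update touches only the current sample and the weight vector, so the model can keep learning from a stream without revisiting old data.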

Another reason can be found in this paper: yann.lecun.com/exdb/publis/pdf/bengio-lecun-07.pdf. In short, the authors state that "deep architectures" can represent "intelligent" behaviour/functions etc. more efficiently than "shallow architectures" like SVMs.
–
alfa Jul 25 '12 at 17:23

As an aside, deep learning loses the "advantages" given here for MLPs (fixed size, simpler training) somewhat. I am not sure that these advantages are worth it, though.
–
Muhammad Alkarouri Nov 25 '12 at 8:11


@MuhammadAlkarouri: deep learning is a pretty broad set of techniques, but the ones that I'm familiar with retain the benefit of the models being parametric (fixed-size).
–
larsmans Nov 25 '12 at 13:33

Two comments: the online training point is true, but there is a variant of SVM-like classifiers specifically designed for online learning, called MIRA (a type of passive-aggressive classifier) for which updates are trivial. Secondly, it's worth pointing out that many neural nets can be formulated as SVMs through the kernel trick.
–
Ben Allison Nov 26 '12 at 12:08

@BenAllison: linear SVMs can be trained online trivially, that's true (SGD + hinge loss is even easier than MIRA). But for online training of kernel SVMs, you need specialized algorithms. I was assuming that's what the OP means, since the difference between linear SVMs and multilayer nets is even easier to tell: multilayer nets are universal approximators, while linear SVMs are just that: linear models.
–
larsmans Nov 26 '12 at 13:07

One obvious advantage of artificial neural networks over support vector machines is that artificial neural networks may have any number of outputs, while support vector machines have only one. The most direct way to create an n-ary classifier with support vector machines is to create n support vector machines and train each of them one by one. On the other hand, an n-ary classifier with neural networks can be trained in one go. Additionally, the neural network will make more sense because it is one whole, whereas the support vector machines are isolated systems. This is especially useful if the outputs are inter-related.

For example, if the goal was to classify hand-written digits, ten support vector machines would do. Each support vector machine would recognize exactly one digit and fail to recognize all others. Since each handwritten digit carries no more information than its class label, it makes no sense to try to solve this with an artificial neural network.
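The ten-machines scheme can be sketched in a few lines (pure NumPy, with a trivial hinge-loss SGD learner standing in for a real SVM solver, and made-up toy blobs instead of digit images): one-vs-rest trains one binary scorer per class and predicts whichever machine scores highest.

```python
import numpy as np

rng = np.random.default_rng(1)
# Three well-separated blobs: a toy 3-class problem.
centers = np.array([[0, 5], [5, -3], [-5, -3]])
X = np.vstack([rng.normal(c, 1.0, (40, 2)) for c in centers])
y = np.repeat([0, 1, 2], 40)

def train_binary(X, t, epochs=20, lr=0.1, lam=0.01):
    """SGD on the hinge loss: a stand-in for one binary SVM."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(epochs):
        for i in range(len(X)):
            if t[i] * (X[i] @ w + b) < 1:
                w += lr * (t[i] * X[i] - lam * w)
                b += lr * t[i]
    return w, b

# One-vs-rest: class k gets its own machine, trained independently.
machines = [train_binary(X, np.where(y == k, 1, -1)) for k in range(3)]
scores = np.stack([X @ w + b for w, b in machines], axis=1)
pred = scores.argmax(axis=1)             # highest-scoring machine wins
print(np.mean(pred == y))
```

Note how the three machines never see each other during training, which is exactly the "isolated systems" point above.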

However, suppose the goal was to model a person's hormone balance (for several hormones) as a function of easily measured physiological factors such as time since last meal, heart rate, etc ... Since these factors are all inter-related, artificial neural network regression makes more sense than support vector machine regression.

One thing to note is that the two are actually very related. Linear SVMs are equivalent to single-layer NNs (i.e., perceptrons), and multi-layer NNs can be expressed in terms of SVMs. See here for some details.

If you want to use a kernel SVM you have to guess the kernel. ANNs, however, are universal approximators, and the only guessing involved is the width (approximation accuracy) and the depth (approximation efficiency). If you design the optimization problem correctly you do not over-fit (please see the bibliography on over-fitting). It also depends on whether the training examples cover the search space correctly and uniformly. Width and depth discovery is the subject of integer programming.

Suppose you have a bounded function f(.) and a family of bounded universal approximators U(.,a) on I = [0,1], with range again I = [0,1], parametrized by a real sequence a of compact support, with the property that there exists a sequence of parameter sequences a(k) with

lim_k sup { |f(x) - U(x,a(k))| : x in I } = 0,

and suppose you draw training and test examples (x,y) from a distribution D on IxI.

For a prescribed support, what you do is find the best a such that

sum { ( y(l) - U(x(l),a) )^2 : 1 <= l <= N } is minimal.

Call this minimizer a = aa (note that aa is a random variable, since it depends on the sample!). The over-fitting is then the

average over D and D^N of ( y - U(x,aa) )^2.

Let me explain why: if you select aa such that the training error is minimized, then for a rare set of values you have a perfect fit. However, since those values are rare, the average is never 0. You want to minimize the second quantity, even though you only have a discrete approximation to D. And keep in mind that the support length is free.
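The gap between the empirical error and the average under D can be seen numerically. A small sketch, assuming NumPy (a polynomial family stands in for the approximator U, and the target function and noise level are arbitrary choices): minimizing the empirical squared error over a flexible family drives the training error to nearly zero while the error on a fresh draw from D stays bounded away from it.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(3 * x)              # the bounded target function

# N noisy training samples and a fresh test draw from the same D.
x_tr = np.linspace(-1, 1, 12)
y_tr = f(x_tr) + 0.3 * rng.normal(size=12)
x_te = rng.uniform(-1, 1, 500)
y_te = f(x_te) + 0.3 * rng.normal(size=500)

# Pick the parameters aa that minimize the empirical squared error.
aa = np.polyfit(x_tr, y_tr, deg=11)      # flexible enough to interpolate

train_mse = np.mean((np.polyval(aa, x_tr) - y_tr) ** 2)
test_mse = np.mean((np.polyval(aa, x_te) - y_te) ** 2)
print(train_mse, test_mse)               # near zero vs. clearly not
```

The fitted aa nails the rare observed points exactly, yet the average squared error under D cannot drop below the noise level, which is the over-fitting gap described above.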

We should also consider that the SVM system can be applied directly to non-metric spaces, such as the set of labeled graphs or strings. In fact, the internal kernel function can be generalized properly to virtually any kind of input, provided that the positive definiteness requirement of the kernel is satisfied. On the other hand, to be able to use an ANN on a set of labeled graphs, explicit embedding procedures must be considered.
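A sketch of a kernel method operating directly on strings (a kernel perceptron rather than a full SVM solver, to stay dependency-free, and a toy bigram-count kernel standing in for real string kernels): no vector embedding of the strings is ever constructed.

```python
from collections import Counter

def bigram_kernel(s, t):
    """Count shared bigrams: a similarity defined directly on
    strings, with no explicit feature vectors in sight."""
    a, b = Counter(zip(s, s[1:])), Counter(zip(t, t[1:]))
    return sum((a & b).values())         # multiset intersection size

# Toy labeled strings: 'a'-heavy is +1, 'b'-heavy is -1.
train = [("aaaa", 1), ("aaab", 1), ("baaa", 1),
         ("bbbb", -1), ("bbba", -1), ("abbb", -1)]

# Kernel perceptron: the model is dual weights over training strings.
alpha = [0] * len(train)
for _ in range(10):                      # epochs
    for i, (s, y) in enumerate(train):
        score = sum(a * yj * bigram_kernel(sj, s)
                    for a, (sj, yj) in zip(alpha, train))
        if y * score <= 0:
            alpha[i] += 1                # mistake-driven dual update

def predict(s):
    score = sum(a * yj * bigram_kernel(sj, s)
                for a, (sj, yj) in zip(alpha, train))
    return 1 if score > 0 else -1

print(predict("aaaaa"), predict("bbbbb"))
```

The learner only ever evaluates K(s, t) between pairs of strings, which is exactly the property that lets kernel methods handle graphs, strings, and other non-metric inputs.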

In my mind, constructing a sensible kernel and constructing a sensible metric embedding are equally problematic. So this is just a comment that there may be more varied kernels than metrics, but I don't really buy that. ohli.de/download/papers/Deza2009.pdf
–
JeremyKun Nov 24 '12 at 19:50

One answer I'm missing here: a multi-layer perceptron is able to find relations between features. For example, this is necessary in computer vision when a raw image is provided to the learning algorithm and no sophisticated features are computed by hand. Essentially, the intermediate layers can compute new, unknown features.
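A tiny NumPy illustration of that point (the architecture, step size, and iteration count are arbitrary choices for the sketch): XOR offers no useful raw feature for a linear model, but a hidden layer learns intermediate features that make the problem solvable.

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
y = np.array([0.0, 1.0, 1.0, 0.0])       # XOR: not linearly separable

# One hidden layer: its activations are new, learned features.
W1, b1 = rng.normal(0, 1, (2, 8)), np.zeros(8)
W2, b2 = rng.normal(0, 1, 8), 0.0
lr = 0.1

def forward(X):
    h = np.tanh(X @ W1 + b1)             # intermediate features
    return h, h @ W2 + b2

_, out0 = forward(X)
loss0 = np.mean((out0 - y) ** 2)         # loss at random init

for _ in range(5000):                    # plain batch gradient descent
    h, out = forward(X)
    g_out = 2 * (out - y) / len(X)       # d(MSE)/d(out)
    g_h = np.outer(g_out, W2) * (1 - h ** 2)
    W2 -= lr * h.T @ g_out
    b2 -= lr * g_out.sum()
    W1 -= lr * X.T @ g_h
    b1 -= lr * g_h.sum(axis=0)

_, out = forward(X)
loss = np.mean((out - y) ** 2)
print(loss0, loss)                       # training reduces the loss
```

No linear model over the raw inputs can fit XOR; the hidden activations the network discovers are exactly the "new unknown features" mentioned above.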