a NEUral STANchion

NNets

Many years ago, in a galaxy far, far away, I was summoned by my former team leader, who was clearly preoccupied by a difficult situation. The team had developed a cool (in every way) project for predicting alarms in refrigerator aisles. It was implemented in two flavors: one using a Neural Network, one using a Support Vector Machine.

– “So what?” I asked with my renowned prettiness.

– “We can’t understand which performs best.”

– “Okay… Why did you decide to implement it with a Neural Network AND with an SVM?”

– “Because we couldn’t understand which would perform best.”

Yes. That was the point. You had no idea which one to choose and, like a little kid who cannot decide between spaghetti and ice cream, you went for a dish of tomato spaghetti with Häagen-Dazs. The final result can be surprisingly good (I wouldn’t bet on it) or a terrible mistake (more likely).

Ice cream spaghetti were ACTUALLY invented in Germany. By Heinrich Himmler, I suppose.

The first step in understanding what differentiates NN and SVM is probably to understand what they have in common.

From an architectural perspective, I’d say nothing.

To understand what the two algorithms have in common, we need to go to the root of the class of problems they solve.

Mainly they help to solve multi-class classification problems.

Classification is usually a supervised task (its unsupervised counterpart is clustering), but in either case think of it as asking questions like “Who are you? A car, a tree or a cat?”. A typical example is the captcha mechanism used to rule out that you are a robot:

As you may figure out from the captcha example above (if you weren’t distracted by the cats), a multi-class classification problem is nothing but a multi-level binary classification problem.

Follow me: the previous “who are you?” problem can be represented as a cascade of binary classification problems; in pseudo-code:
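A minimal sketch of such a cascade in Python, assuming three hypothetical yes/no detectors (`is_car`, `is_tree`, `is_cat`) that stand in for trained binary classifiers; the names and the toy features are illustrative, not part of the original project:

```python
# Hypothetical binary detectors: each answers a single yes/no question.
# Real ones would be trained models; here they just inspect toy features.
def is_car(thing):
    return thing.get("wheels", 0) >= 4

def is_tree(thing):
    return thing.get("leaves", False)

def is_cat(thing):
    return thing.get("whiskers", False)

def who_are_you(thing):
    """Multi-class answer built as a cascade of binary questions."""
    if is_car(thing):
        return "car"
    if is_tree(thing):
        return "tree"
    if is_cat(thing):
        return "cat"
    return "unknown"
```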

or, with vectors, like this:
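One plausible reading of the vector form, sketched under the assumption that each class gets its own binary score and the label is encoded as a one-hot vector (the `CLASSES` list and the helper names are made up for illustration):

```python
CLASSES = ["car", "tree", "cat"]  # illustrative class order

def one_hot(label):
    """Encode a class label as a vector with a single 1."""
    return [1 if c == label else 0 for c in CLASSES]

def decode(scores):
    """Pick the class whose binary classifier is most confident."""
    best = max(range(len(CLASSES)), key=lambda i: scores[i])
    return CLASSES[best]
```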

In other words, the smallest classification problem you can address is a binary classification problem; the most common example is a spam/non-spam filter. In such a problem an item either belongs to a set (for instance, it is spam) or it does not (it is ham); there are no other options.

What is the simplest machine learning method used for binary classification problems?

The answer is: Logistic Regression.

Think about it: both SVM and NN cost functions include the concept of an activator; in the simplest case an activator is nothing but a sigmoid function, also called the logistic function: a function that squashes any continuous value into the interval between 0 and 1, which is then thresholded into a discrete answer (0=off, 1=on).
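A minimal sketch of that function and of the thresholding step; the 0.5 cut-off and the name `activate` are assumptions chosen for illustration:

```python
import math

def sigmoid(z):
    """Logistic function: maps any real z into the interval (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)  # this branch avoids overflow for very negative z
    return e / (1.0 + e)

def activate(z, threshold=0.5):
    """Turn the continuous sigmoid output into an off/on answer."""
    return 1 if sigmoid(z) >= threshold else 0
```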

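Now prepare yourself for the big Matrix revelation: the simplest versions of SVM (an SVM with a linear kernel) and of NN (a single-layer NN with a single output hypothesis unit/node) ARE, in essence, LOGISTIC REGRESSION ALGORITHMS. (Strictly speaking, the linear SVM swaps the logistic loss for a hinge loss, but the model is the same kind of linear decision function.)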

Let’s go further: non-trivial SVM and NN (I mean: with non-linear kernels or with more than one inner layer) can be thought of as A PIPELINE OF LOGISTIC REGRESSION COMPUTATIONS.
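For the NN half of that claim, here is a small sketch: a two-layer network in which every unit is a logistic regression fed by the outputs of the previous ones. The weights are hand-picked for illustration (an assumption, not trained values) so that the pipeline computes XOR, something a single logistic regression cannot do:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def logistic_unit(inputs, weights, bias):
    """One logistic regression: weighted sum followed by a sigmoid."""
    return sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)

def xor_net(x1, x2):
    # Hidden layer: two logistic regressions on the raw inputs.
    h_or = logistic_unit([x1, x2], [20, 20], -10)     # roughly x1 OR x2
    h_nand = logistic_unit([x1, x2], [-20, -20], 30)  # roughly NOT (x1 AND x2)
    # Output layer: one more logistic regression on the hidden outputs.
    return logistic_unit([h_or, h_nand], [20, 20], -30)  # roughly AND
```

Rounding `xor_net(x1, x2)` for binary inputs gives the XOR truth table.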

Still alive?

So we’ve found the connection between the two algorithms. This revelation also implies that when you have to build a very simple binary classifier in a two-dimensional space, the choice is not a two-player game between SVM and NN: you ought to include the third wheel as well, Logistic Regression.

Better: in such a problem, Logistic Regression SHOULD be your first choice, following the old advice to “keep it simpler (but not dumber)“.

Now let’s get back to the core of this article: when and why to select SVM rather than NN (or vice versa).

Basically SVM is an improved nearest neighbor classifier; this means that you should be able to reduce your abstract problem to one you can visualize, consisting of clumps of points belonging to the same class; something like the picture we used earlier:

Can you reduce your problem to something like this, with many classes deducible with your own eyes?

If so, this is a good starting point for selecting an SVM.

This is because SVM requires you to select a kernel, and to pick an appropriate kernel you must have an underlying knowledge about the distribution of the classes.

SVM is a maximum margin classifier, which means that it does not learn a fixed set of parameters the way a parametric model does; instead it is pretty good at getting rid of the training examples that do not help define the classification boundary, keeping only the ones that do: those are exactly the support vectors.
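A toy sketch of that idea in one dimension, assuming the two classes sit on opposite sides of a line (negatives to the left, positives to the right). Here the maximum margin boundary is simply the midpoint between the two closest opposite points, and those two points play the role of support vectors; the helper name is illustrative:

```python
def max_margin_1d(negatives, positives):
    """Return (boundary, support_vectors) for separable 1-D data.

    Assumes every negative lies to the left of every positive.
    """
    left = max(negatives)    # negative point closest to the other class
    right = min(positives)   # positive point closest to the other class
    boundary = (left + right) / 2
    return boundary, (left, right)

full = max_margin_1d([0, 1, 2], [8, 9, 10])
# Drop every example that is not a support vector: same boundary.
pruned = max_margin_1d([2], [8])
```

Both calls return the same boundary, which is the point: the examples far from the frontier contribute nothing.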

Let’s jump to the other side of the table.

An NN similarly aims to find a separating hyperplane in a data set to divide the observations of 2 classes, but unfortunately it has nothing to do with the concept of ‘being supportive’.

The purpose of all the inner layers of a NN is purely to find this separating surface; after achieving this goal, the NN literally stops learning. It doesn’t matter if this surface is 1 millimeter away from (most of) ‘class 1’ and 1 meter away from (most of) ‘class 2’: as long as the surface separates the classes, a single NN does not care about the margin around the ‘classification boundary’, i.e., more prosaically, about moving the surface to the middle between (most of) the 2 classes.

The reason is that Neural Networks are not maximum margin classifiers.
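To make the contrast concrete, here is a small sketch using the classic perceptron rule as a stand-in for the simplest possible NN (an assumption made for illustration; networks trained with other losses behave less abruptly). It updates its weights only while some point is misclassified, so it stops at the first separating boundary it finds:

```python
def train_perceptron(points, max_epochs=100):
    """Classic perceptron on 1-D data; labels are -1 or +1."""
    w, b = 0, 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in points:
            if y * (w * x + b) <= 0:  # misclassified (or on the boundary)
                w += y * x
                b += y
                mistakes += 1
        if mistakes == 0:  # everything separated: learning stops here
            break
    return w, b

data = [(0, -1), (1, -1), (2, -1), (8, 1), (9, 1), (10, 1)]
w, b = train_perceptron(data)
perceptron_boundary = -b / w        # where w * x + b changes sign
max_margin_boundary = (2 + 8) / 2   # midpoint between the closest opposite points
```

On this toy data the perceptron settles at x = 2.5, half a unit away from the negative class, while the maximum margin boundary sits at x = 5, right in the middle.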

Sometimes an image is better than 1000 words:

So, is a NN forever doomed to generalize poorly?

The answer for a single feedforward NN is yes; but consider that you can train a system of a multitude of NNets.

Returning to the initial example of the spaghetti-ice cream, if you chose a single feedforward NN, ask yourself why you did not choose a simpler architecture (like a Logistic Regression) for such an (apparently) trivial classification problem.

As I said before, sometimes an image is better than 1000 words:

So, is the difference in terms of accuracy so significant? Or are you just over-complicating things?

I am saying this because training multiple NNets surely reduces the danger of overfitting the data and helps your solution generalize, but it pays the penalty of being way more complicated and way more onerous in computation time (it’s easy to figure out that training several nets requires more time and resources than ‘training’ a single SVM).
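A rough sketch of what a system of several nets can look like, assuming the simplest recipe: train a few tiny models on differently shuffled copies of the data and let them vote. A perceptron stands in for a real network here, and `train_perceptron`, `train_ensemble` and `vote` are illustrative names:

```python
import random

def train_perceptron(points, max_epochs=1000):
    """Tiny stand-in for a network: 1-D perceptron, labels -1 or +1."""
    w, b = 0, 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in points:
            if y * (w * x + b) <= 0:
                w, b, mistakes = w + y * x, b + y, mistakes + 1
        if mistakes == 0:
            break
    return w, b

def train_ensemble(points, n_models=5, seed=0):
    """Train each member on its own shuffled copy of the data."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        shuffled = points[:]
        rng.shuffle(shuffled)
        models.append(train_perceptron(shuffled))
    return models

def vote(models, x):
    """Majority vote of the members; each one answers -1 or +1."""
    total = sum(1 if w * x + b > 0 else -1 for w, b in models)
    return 1 if total > 0 else -1
```

Note the cost the paragraph above warns about: five members mean five training runs.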

So, SVM may be faster for many data sets, but remember that this speed comes from its monolithic nature: NNets have a complex structure that can be scaled up or down, while an SVM does not.

This is very important for another factor: does your data set fit entirely into memory in a single representation?

If it does not, NNets are preferable, since it’s easier to adapt their structure to learn from data in chunks.
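A minimal sketch of learning in chunks, assuming a single sigmoid unit updated by gradient descent on one chunk at a time, so that only the current chunk has to be in memory. The `stream_chunks` generator is a hypothetical stand-in for reading pieces of a large file:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def stream_chunks():
    """Hypothetical data source: yields small chunks of (x, label) pairs."""
    yield [(-2.0, 0), (-1.5, 0), (-1.0, 0)]
    yield [(1.0, 1), (1.5, 1), (2.0, 1)]

def update(w, b, chunk, lr=0.5):
    """One gradient step on the log loss, using only this chunk."""
    grad_w = sum((sigmoid(w * x + b) - y) * x for x, y in chunk) / len(chunk)
    grad_b = sum((sigmoid(w * x + b) - y) for x, y in chunk) / len(chunk)
    return w - lr * grad_w, b - lr * grad_b

w, b = 0.0, 0.0
for _ in range(50):                 # several passes over the stream
    for chunk in stream_chunks():   # the full data set is never held at once
        w, b = update(w, b, chunk)
```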

Ok, maybe you have read this whole rigmarole and desperately want to ask:

“You bored me to death, very good. Just one question: which one is easier to implement: SVM or NNets?”

Well, from an architectural viewpoint, SVM tends to be easier (especially, as we said above, if you are somehow able to visualize the classes in your data set): basically you just need to select a type of kernel and to calibrate a few other parameters like C (cost) and Gamma. The list of kernels is not infinite (Gaussian, Linear, Polynomial, String Kernel, Chi-squared Kernel, Histogram Intersection Kernel), because they have to satisfy Mercer’s Theorem.
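For reference, three of the kernels named above are short enough to write out. This is a sketch using the common textbook definitions; the parameter names `gamma`, `degree` and `coef0` follow a widespread convention and are assumptions, not something the original project specifies:

```python
import math

def linear_kernel(u, v):
    """Plain dot product."""
    return sum(a * b for a, b in zip(u, v))

def polynomial_kernel(u, v, degree=3, coef0=1.0):
    """Dot product shifted by coef0 and raised to a power."""
    return (linear_kernel(u, v) + coef0) ** degree

def gaussian_kernel(u, v, gamma=0.5):
    """Also known as RBF: similarity decays with squared distance."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq_dist)
```

Gamma controls how quickly the Gaussian similarity falls off with distance, while C (not shown here) trades margin width against training errors.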

NNets need to be built up from scratch, and this includes all the structural choices like how many (hidden) layers, how many neurons in each layer, what type of neurons in each layer and, finally, the way you connect the neurons.

Continuing the dessert parallel, SVM is like cannoli: you buy the ready-made cylindrical wafer, buy the chocolate chips, buy the chopped pistachios, and all you have to do by yourself is the sweetened ricotta filling; then just assemble all these ready-made ingredients with grace, et voilà:

SVM for everyone!

NNets are more like the French Îles Flottantes dessert: at the beginning you just have whipped egg whites, milk, sugar and vanilla. Then it’s all up to you to know how to create all the different creams, until you reach the desired taste and texture and whatever.

This freedom gives more room to the adaptability to different tasks and scenarios, but requires a higher level of awareness about what you are doing at every step.

Another point to consider is that testing and refining the hidden (internal) layers of a NN is usually quite complicated, often ending in a typical impasse where something is wrong (usually an overfitted model) but you can’t make head or tail of it in order to fix it.

A picture of a typical Stochastic Neural Network

So, to summarize, there’s no absolute answer, as you might expect. Speaking of desserts, remember the ‘No Free Lunch Theorem’: there isn’t a ‘bestest’ learning algorithm; it depends a lot on the model, which is, after all, a representation of reality.

In the end, NNets and SVM can even collaborate on a common task: ever heard of Model Ensembles?

For instance, tasks like ImageNet classification don’t start with a good hand-designed representation, and need to learn one. A solution provided in the past used Deep Learning (so NNets models) to come up with a good representation, and then trained an SVM on top of the resulting features. NNets got rid of useless features, SVM got rid of useless training examples. Inspiring, don’t you think?