About this Part

It took me a while to write this article because the subject is vast and assume a lot of prior knowledge. What should I explain and what should I skip was kind of a hard line to trace. After a while, I ended up with a large Part 4 which was too long to read. So I decided to split it. Welcome, Part 4, Part 5 and Part 6!

During the last few months, I received a lot of comments and encouragements and several hundred people subscribed to be notified when this part is published. I wish to thank all of you, and I hope you will enjoy reading it.

Where we left.

In Part 3, we discovered that to maximize the margin we need to minimize the norm of .

It means we need to solve the following optimization problem:

Minimize in ()

subject to

(for any )

The first thing to notice about this optimization problem is that it has constraints. They are defined by the line which begins with "subject to". You may think that there is only one constraint, but there is, in fact, constraints. (this is because of the last line "for any"...)

"OK, How do I solve it? I have been waiting for this for one year !!!"

Before tackling such a complicated problem, let us start with a simpler one. We will first look at how to solve an unconstrained optimization problem, more specifically, we will study unconstrained minimization. That is the problem of finding which input makes a function return its minimum. (Note: in the SVM case, we wish to minimize the function computing the norm of , we could call it and write it ).

Unconstrained minimization

Let us consider a point (you should read it "x star", we just add the star so that you know we are talking about a specific variable, and not about any ).

How do we know if is a local minimum of a function ? Well, it is pretty simple, we just need to apply the following theorem:

The hard truth with such a theorem is that although being extremely concise, it is totally impossible to understand without some background information. What is ? What is ? What do we mean by positive definite?

Sometimes, we will be given more informations, and the previous theorem can also be rephrased like this :

Theorem (with more details):

If satisfies:

has a zero gradient at :

and

the Hessian of at is positive definite:

0, \forall \mathbf{z} \in \mathbb R^n" />

where

then is a local minimum.

What does this all mean?

Let us examine this definition step by step.

Step 1:

Let be a continuously twice differentiable function at .

First, we introduce a function which we call , this function takes its values from a set (omega) and returns a real number. There is a first difficulty here because we do not state what is, but we will be able to guess it in the next line. This function should be continuous and twice differentiable, or the rest of the definition will not be true.

Step 2:

is a local minimum of if and only if:

We want to find a value to give to for it to produce its minimum. We simply name this value .

From the notation we can tell two things:

is written in bold, so it is a vector. It means that is a multivariate function.

As a result, the set we saw earlier is the set from which we pick values to give to . It means that the set is a set of vectors and ("x stars belongs to Omega")

Step 3:

has a zero gradient at

This one is the first condition which must hold if we want to be a local minimum of . We must check that the gradient of the function at is equal to zero. What is the gradient? Just think of it as a derivative on steroids.

Definition: "the gradient is a generalization of the usual concept of derivative of a function in one dimension to a function in several dimensions" (Wikipedia)

This definition gives us more pieces of information. A gradient is, in fact, the same thing as a derivative, but for functions like which take vectors as input. That is why we wanted to be a differentiable function in the first place, if it is not the case we cannot compute the gradient, and we are stuck.

In calculus, when we want to study a function, we often study the sign of its derivative. It allows you to determine if the function is increasing or decreasing and to identify minimum and maximum. By setting the derivative to zero, we can find the "critical points" of the function at which it reaches a maximum or a minimum. (You can read this excellent explanation if you want to refresh your memory). When we work with functions having more variable, we need to set each partial derivative to zero.

It turns out, the gradient of a function is a vector containing each of its partial derivatives. By studying the sign of the gradient, we can gather important pieces of information about the function. In this case, checking if the gradient equals zero for allow us to determine if is a critical point (and that the function possibly has a minimum at this point). (Note: Checking if the gradient equals zero at a point means checking that each partial derivative equals zero for this point)

The gradient of a function is denoted by the symbol (nabla).
The line

is just a repetition of " has a zero gradient at " in mathematical notation.

For a vector , means:

Step 4:

the Hessian of at is positive definite

That is where most people get lost. This single sentence requires a lot of backgrounds. You need to know:

that the Hessian is a matrix of second-order partial derivatives

how we can tell if a matrix is positive definite

The Hessian matrix

The Hessian is a matrix, and we give it a name. We could call it but instead we call it which is more explicit. We keep the symbol used for the gradient, and add a to denote we the fact that this time we are talking about second-order partial derivative. Then we specify the name of the function () from which we will compute these derivates. By writing we know that takes a vector as input and that the Hessian is computed for a given .

To sum up, we need to compute a matrix called the Hessian matrix for .
So we take the function , we take the value of and we compute the value for each cell of the matrix using the following formula:

Eventually we get the Hessian matrix and it contains all the numbers we have computed.

Let us look at the definition to see if we understand it well:

Definition: In mathematics, the Hessian matrix or Hessian is a square matrix of second-order partial derivatives of a scalar-valued function. It describes the local curvature of a function of many variables. (Wikipedia)

(Note: A scalar valued function is a function that takes one or more values but returns a single value. In our case is a scalar valued function.)

Positive definite

Now that we have the Hessian matrix, we want to know if it is positive definite at .

Definition: A symmetric matrix is called positive definite if 0" />, for all . (Source)

This time, we note that once again we were given the definition in the first place. It was just a little bit harder to read because of our notational choice. If we replace by and by we get exactly the formula written in the part 2. of the detailed theorem:

0, \forall \mathbf{z} \in \mathbb R^n" />

The problem with this definition is that it is talking about a symmetric matrix. A symmetric matrix is a square matrix this is equal to its transpose.

The Hessian matrix is square, but is it symmetric?

Luckily for us yes!

"if the second derivatives of are all continuous in a neighborhood , then the Hessian of is a symmetric matrix throughout " (Wikipedia)

But even with the definition, we still don't know how to check that the Hessian is positive definite. That is because the formula is for all in .

We can't try this formula for all in !
That is why we will use the following theorem:

(Tip: if you want to check your calculation you can use Wolfram alpha)

Our goal is to find when they are both at zero, so we need to solve the following system of equations:

We distribute to get:

We multiply by to obtain:

We now add and to get:

which simplifies into:

We substitute in

It looks like we have found the point for which . But is this a minimum?

The Hessian matrix is :

Let us now compute the Hessian for

The matrix is symetric, we can check its leading principal minors:

Minors of rang 1:

If we remove the last line and last column the minor is .

Minor of rang 2:

This is the determinant of the Hessian:

All the leading principal minors of the Hessian are positives. It means that the Hessian is positive definite.

The two conditions we needed are met, and we can say that the point is a local minimum.

LOCAL minimum?

A point is called a local minimum when it is the smallest value within a range. More formally:

Given a function defined on a domain , a point is said to be a local minimum if there exists some 0" /> such that within distance of . This is illustrated in the figure below:

A global minimum, however, is true for the whole domain of the function:

Given a function defined on a domain , a point is said to be a global minimum if

So all our hard work was just to find a local minimum, but in real life, we often want to find the global minimum...

How can we find a global minimum?

There is one simple way to find the global minimum:

Find all the local minima

Take the smallest one; it is the global minimum.

Another approach is to study the function we are trying to minimize. If this function is convex, then we are sure its local minimum is a global minimum.

Conclusion

We discovered that finding the minimum of a function is not so simple, and it was not even a global minimum. However, some functions, called convex functions are easier to work with. What is a convex function? Read the Part 5 of this tutorial series to find out! Thanks for reading.

I am passionate about machine learning and Support Vector Machine. I like to explain things simply to share my knowledge with people from around the world. If you wish you can add me to linkedin, I like to connect with my readers.

Thank you for the attention you pay for explaining every details. It's really helpfull. By the way, I'm not sure but is it normal that f partial derivative relative to x is exactly equal to f(x,y) ?
(2-x)^2+100(y-x^2)^2

Sorry I do not understand why we compute M22 in the given example if "minor of order kk obtained by deleting the last n−kn−k rows and columns." It seemed it just needed to compute M11 when delete the last 1 raw and columb ...

Hi, thanks for the article. It's pretty clear and thorough. Just on confusion I thought in your example of Rosenbrock's banana function, should M11 be 4801 instead of 200 because your try to remove the last line and column. Actually I think this is the same question as lcmosasohi put. Thanks.

I study neural network in korea university student, recently, I'm happy to study, because I read your lecture to SVM. You are best teacher for me. I hope that you lecture for Principal Component Analisis. thank you so much. if you will write to lecture, i must read to your lecture.

thanks a lot for this tutorial.
I have one question regarding Face Recognition with Support Vector Machines. What is in this case the support vector (Xi): one pixel of the picture or the whole picture as a big matrix with all pixels? In the tutorials will often be used an example with 2D dots. Could you please explain how it works? Thanks!