What is this article about?

The main focus of this article is to show you the reasoning allowing us to select the optimal hyperplane.

Here is a quick summary of what we will see:

How can we find the optimal hyperplane ?

How do we calculate the distance between two hyperplanes ?

What is the SVM optimization problem ?

How to find the optimal hyperplane ?

At the end of Part 2 we computed the distance between a point and a hyperplane. We then computed the margin which was equal to .

However, even if it did quite a good job at separating the data it was not the optimal hyperplane.

Figure 1: The margin we calculated in Part 2 is shown as M1

As we saw in Part 1, the optimal hyperplane is the one which maximizes the margin of the training data.

In Figure 1, we can see that the margin , delimited by the two blue lines, is not the biggest margin separating perfectly the data. The biggest margin is the margin shown in Figure 2 below.

Figure 2: The optimal hyperplane is slightly on the left of the one we used in Part 2.

You can also see the optimal hyperplane on Figure 2. It is slightly on the left of our initial hyperplane. How did I find it ? I simply traced a line crossing in its middle.

Right now you should have the feeling that hyperplanes and margins are closely related. And you would be right!

If I have an hyperplane I can compute its margin with respect to some data point. If I have a margin delimited by two hyperplanes (the dark blue lines in Figure 2), I can find a third hyperplane passing right in the middle of the margin.

Finding the biggest margin, is the same thing as finding the optimal hyperplane.

How can we find the biggest margin ?

It is rather simple:

You have a dataset

select two hyperplanes which separate the data with no points between them

maximize their distance (the margin)

The region bounded by the two hyperplanes will be the biggest possible margin.

If it is so simple why does everybody have so much pain understanding SVM ?
It is because as always the simplicity requires some abstraction and mathematical terminology to be well understood.

So we will now go through this recipe step by step:

Step 1: You have a dataset and you want to classify it

Most of the time your data will be composed of vectors .

Each will also be associated with a value indicating if the element belongs to the class (+1) or not (-1).

First, we recognize another notation for the dot product, the article uses instead of .

You might wonder... Where does the comes from ? Is our previous definition incorrect ?

Not quite. Once again it is a question of notation. In our definition the vectors and have three dimensions, while in the Wikipedia definition they have two dimensions:

Given two 3-dimensional vectors and

Given two 2-dimensional vectors and

Now if we add on both side of the equation we got :

For the rest of this article we will use 2-dimensional vectors (as in equation (2)).

Given a hyperplane separating the dataset and satisfying:

We can select two others hyperplanes and which also separate the data and have the following equations :

and

so that is equidistant from and .

However, here the variable is not necessary. So we can set to simplify the problem.

and

Now we want to be sure that they have no points between them.

We won't select any hyperplane, we will only select those who meet the two following constraints:

For each vector either :

or

Understanding the constraints

On the following figures, all red points have the class and all blue points have the class .

So let's look at Figure 4 below and consider the point . It is red so it has the class and we need to verify it does not violate the constraint

When we see that the point is on the hyperplane so and the constraint is respected. The same applies for .

When we see that the point is above the hyperplane so 1\" /> and the constraint is respected. The same applies for , , and .

With an analogous reasoning you should find that the second constraint is respected for the class .

Figure 4: Two hyperplanes satisfying the constraints

On Figure 5, we see another couple of hyperplanes respecting the constraints:

Figure 5: Two hyperplanes also satisfying the constraints

And now we will examine cases where the constraints are not respected:

Figure 6: The right hyperplane does not satisfy the first constraint

Figure 7: The left hyperplane does not satisfy the second constraint

Figure 8: Both constraint are not satisfied

What does it means when a constraint is not respected ? It means that we cannot select these two hyperplanes. You can see that every time the constraints are not satisfied (Figure 6, 7 and 8) there are points between the two hyperplanes.

By defining these constraints, we found a way to reach our initial goal of selecting two hyperplanes without points between them. And it works not only in our examples but also in -dimensions !

Combining both constraints

In mathematics, people like things to be expressed concisely.

Equations (4) and (5) can be combined into a single constraint:

We start with equation (5)

And multiply both sides by (which is always -1 in this equation)

Which means equation (5) can also be written:

In equation (4), as it doesn't change the sign of the inequation.

We combine equations (6) and (7) :

We now have a unique constraint (equation 8) instead of two (equations 4 and 5) , but they are mathematically equivalent. So their effect is the same (there will be no points between the two hyperplanes).

Step 3: Maximize the distance between the two hyperplanes

This is probably be the hardest part of the problem. But don't worry, I will explain everything along the way.

a) What is the distance between our two hyperplanes ?

Before trying to maximize the distance between the two hyperplane, we will first ask ourselves: how do we compute it ?

Let:

be the hyperplane having the equation

be the hyperplane having the equation

be a point in the hyperplane .

We will call the perpendicular distance from to the hyperplane . By definition, is what we are used to call the margin.

As is in , is the distance between hyperplanes and .

We will now try to find the value of.

Figure 9: m is the distance between the two hyperplanes

You might be tempted to think that if we add to we will get another point, and this point will be on the other hyperplane !

But it does not work, because is a scalar, and is a vector and adding a scalar with a vector is not possible. However, we know that adding two vectors is possible, so if we transform into a vector we will be able to do an addition.

We can find the set of all points which are at a distance from . It can be represented as a circle :

Figure 10: All points on the circle are at the distance m from x0

Looking at the picture, the necessity of a vector become clear. With just the length we don't have one crucial information : the direction. (recall from Part 2 that a vector has a magnitude and a direction).

We can't add a scalar to a vector, but we know if we multiply a scalar with a vector we will get another vector.

From our initial statement, we want this vector:

to have a magnitude of

to be perpendicular to the hyperplane

Fortunately, we already know a vector perpendicular to , that is (because )

Figure 11: w is perpendicular to H1

Let's define the unit vector of . As it is a unit vector and it has the same direction as so it is also perpendicular to the hyperplane.

Figure 12: u is also is perpendicular to H1

If we multiply by we get the vector and :

is perpendicular to (because it has the same direction as )

From these properties we can see that is the vector we were looking for.

Figure 13: k is a vector of length m perpendicular to H1

We did it ! We transformed our scalar into a vector which we can use to perform an addition with the vector .

If we start from the point and add we find that the point is in the hyperplane as shown on Figure 14.

Figure 14: z0 is a point on H1

The fact that is in means that

We can replace by because that is how we constructed it.

We can now replace using equation

We now expand equation

The dot product of a vector with itself is the square of its norm so :

As is in then

This is it ! We found a way to compute .

b) How to maximize the distance between our two hyperplanes

We now have a formula to compute the margin:

The only variable we can change in this formula is the norm of .

Let's try to give it different values:

When then

When then

When then

One can easily see that the bigger the norm is, the smaller the margin become.

Maximizing the margin is the same thing as minimizing the norm of

Our goal is to maximize the margin. Among all possible hyperplanes meeting the constraints, we will choose the hyperplane with the smallest because it is the one which will have the biggest margin.

Solving this problem is like solving and equation. Once we have solved it, we will have found the couple () for which is the smallest possible and the constraints we fixed are met. Which means we will have the equation of the optimal hyperplane !

Conclusion

We discovered that finding the optimal hyperplane requires us to solve an optimization problem. Optimization problems are themselves somewhat tricky. And you need more background information to be able to solve them. So we will go step by step. Let us discover unconstrained minimization problems in Part 4! Thanks for reading.

I am passionate about machine learning and Support Vector Machine. I like to explain things simply to share my knowledge with people from around the world. If you wish you can add me to linkedin, I like to connect with my readers.

really nice report! Just one thing, if you have data points in a 2 dimensional space(x1,x2), like in your examples, you should visualize it in 3 dimensions since the hyperplane is spanned in a third dimension(y). If you don't want to mess around with 3 dimensions use a simple 1D example, so your hyperplane can be really visualized with a line.

That was a thing which confused me for quiet a bit. But once I had the right visualization everything else became kind of intuitive.

They are set to 1 and -1 because it is the only possible values the vector y can take. As suggested by snake2971chris in a previous comment, visualizing the picture in 3 dimensions can help to understand this.

This is by definition. The equation is called a normal equation for the hyperplane. I invite you to read the part about hyperplanes in this course (page 9). This is about the calculus of functions of several variables but I used it as a reference several times to write this serie of articles (you can read the full course here).

The hyperplanes are defined in the way that there is vector which is perpendicular to all the vectors in the hyperplane...imagine hyperplane consists of all the vectors which are perpendicular to w.There are good video lectures available which explains hyperplane.

Hey, really great tutorial, waiting for the next part. I had a doubt...Please help!!
Let us consider a two dimensional dataset as shown in your tutorial. My doubt is what is the physical significance of w.I know it is a vector perpendicular to the hyperplane but what does it actually signify

Hi Tejas. Thanks for your comment. I don't have a precise date because I have been very busy at work. As there is a lot of people asking me I will try to do my best to publish it soon (before 2016)
I added a small form at the end of the article so that you can put your email in it if you want me to send you a mail when it will be published.

Thanks for the great tutorial. The expressions wx-b=1 and wx-b=-1 are however a little confusing. You write the following in defence of these equations.

"And we choose +1 and -1 because they are the only values the class can take."

Question: My understanding is that the result of wx-b should not evaluate to a class label. What would evaluate to a class label is an expression such as f(wx-b) where f is the classification rule that assigns the product between a given test vector and the w vector to the classes 1 or -1. Can you clarify why you are having wx-b evaluating to the value of the class labels? Some literature suggests that the numerical values of the class labels are only arbitrary -- in fact -1 and 1 are chosen in SVM algebra because they enable an elegant way to represent the SVM (e.g., we are able to write y(wx-b)>0 because y can only be 1 or -1).

Could you have chosen 1 and -1 because you wanted to have the planes wx-b=-1,1 equidistant from wx-b=0. In that case does it mean that I could have used numbers such as 2 and -2?

Indeed the expression "And we choose +1 and -1 because they are the only values the class can take." was an error. I checked online and this value has no connection with the class label.
You can see equation (2) of this paper for more details.

I updated the article so that it is more explicit how this number is choosen.
I think we could use numbers such as 2 and -2 but I am not sure, I need to check the math to be 100% sure.

Actually, you can choose +2 and -2 instead of +1 and -1.
Then you follow the steps: 10 ~ 19, you will get m = 4 / w'.
means you find a w', which length is half of the original one(w), when you choose +1 and -1.
then the support hyper plane is: w' * x + b' = 2 => 2w * x + 2b = 2 , which is equivalent to the
original one: w * x + b = 1.

so you can choose other number either, eventually, you will get the same support hyper plane.

Really thanks a lot for this series
This is the first time I see such a simplicity of explaining the SVM
I am waiting part 4, I am a bit scared to learn how solve that optimization problem. Please try to explain it as simple as possible.

We could have kept in all the following equations. At the end we would have had in equation 8. Indeed, is the functional margin. You can read part 3 of this course to understand better why we can set it to 1 and -1. You will see in part 4 that he replaces the functionnal margin by 1. I hope this helps you.

what does \delta means? I still didn't understand this paragraph,
"We now note that we have over-parameterized the problem: if we scale w, b and \delta by a constant factor \alpha, the equations for x are still satisfied. To remove this ambiguity we will require that \delta = 1, this sets the scale of the problem, i.e. if we measure distance in millimeters or meters. " Can you explain this?

In step 2 we select two hyperplanes among all possible couple of hyperplanes satisfying the constraints (4) and (5). Given one hyperplane separating the data, we can find two other hyperplanes meeting the constraint. Given another hyperplane separating the data (with a different w than the first one), we can find two other different hyperplanes meeting the constraint. So minimizing w in step 3 allow us to find the two hyperplanes having the biggest margin and the optimal hyperplane.

This is the clearest explanation of a complex topic that I have ever seen. Why can't the professors in classes teach the same way? I have a gut feeling that they don't do it because they don't understand it themselves! I went through your entire tutorial in a couple of hours and undestood the entire conceptual framework of SVM - something I never got from my professors in the entire semester. I am sure it took you many days of hard work to put this together. Thanks for sharing your knowledge and understanding - I really appreciate your service to the community.

I have a question on th dot product of the 2 vectors w.x -b = 1 and -1.
Isnt it be >0 and <0 . Since we are taking the dot product of the normal vector w with another point x which is like dot product of the two vectors
|w||x|.cos(theta) and the values will vary between 1 and -1 and when they are on the hyperplane the values of the dot product will be 0.
Please correct me if I am wrong.

We don't want the point to be too close to the initial hyperplane, that is why we introduce delta. We choose two hyperplanes at a distance delta from the original hyperplane. We then arbitrarely set delta = 1. But that means that points on the other hyperplanes will have values +1 or -1

when they are on the hyperplane the values of the dot product will not be 0, but 1 on one side of plane and it will be >1 as the point goes away from that plane.
Same way with other plane having -1 for point that is exactly on the plane and as it goes out of the plane then it will be < -1.
I guess this can be your answer. If i understood some thing from the Alexandre KOWALCZYK's 3 valuable "Understanding the math" parts.

If there are two definitions of "margin" then there should be two different mathematical representations correspondingly. But as I see it does not matter if "margin" is defined as distance between two support vectors or a separating hyperplane and one support vector, it will have the same formula, namely 2/||w||. But how is it possible?

Hello. The two hyperplanes H0 and H1 are not support vectors, they are hyperplanes. As they separate the data, they could be called separating hyperplanes too. In the case were the central hyperplane is the optimal hyperplane, then vectors which lies on the two others hyperplanes will be called support vectors. In the case were the margin is the distance between a support vector and the hyperplane it will have a length of 1/||w||. It does not matter which one we choose for the margin because at the end we optimize the norm of w and the same argument works if we use 2/||w|| or 1/||w|| (minimizing the norm is like maximizing the margin). If you want to understand more about the margin you can take a look at this lecture of Andrew Ng which introduces the functional and geometric margin. I hope it helps you.

Hello optimus. This is a two step process: you find any hyperplane which separate data, and you are able to make predictions on your training set. However you have no guarantee that it will perform well in the test set (that it will "generalize" well). With SVMs you find the "optimal" hyperplane, one which separates the data, but which is also as far away as possible from it. As a result it generalize better and you can expect that the prediction you made on unseen data will be better.

I really liked your post and it gave me a keen interest in SVm, although i do have some open points which I do not understand after reading this 3rd part. I followed till combining of the constraints in one equation(till eq 8). But I have some open points regarding the initial selection of any hyperplane like how we start initially with first hyperplane which will classify the dataset(I know we can do it through single perceptron),but the thing is here in this post when we maximizing the distance in those steps I really gone confused.

thanks for this wonderful post.
It will be great if you can take one very simple 2 dimensional example to explain what we did in part 3 which explain all steps we covered here , it will be very helpful for visualization ,like what is going on.

Let me first thank you for an absolutely outstanding explanation of SVM! I have scoured the internet upside down, ranging from lectures from MIT and Caltech profs, none of the explanations I have found can match yours in terms of simplicity and effectiveness! Please upload part 4 ASAP! I am unable to have peace of mind until I see it working end-to-end!

Hello. The 4th part is not ready yet but contains already some content, I want to put enough content in it so that people are not frustrated and we the article advance the understanding of SVM. I have some time to work on it this week because I am on holidays s but most of the time I have very little time to write during the week, hence the big delay.

Hello Alexandre,
I have a question about a special case of SVM. I tried to find the support vectors for these training examples:
x1 = (0,1) y1 = +1
x2 = (1,0) y1 = +1
x3 = (-1,0) y1 = -1
x4 = (0,-1) y1 = -1
Clearly, all examples are support vectors because x1 and x2 are over the hyperplane H+ and x3 and x4 are over the hyperplane H-. However, when the LD is maximized subject to sum(yi.alphai = 0) and alphai >= 0, the obtained results are (I have used the Wolfram Alpha online optimizer):
alpha1 = alpha3
alpha2 = 1 - alpha3
alpha4 = 1 - alpha3
(notice that alphai is the Lagrage Multiplier of xi)
The solution that has sense to me is alpha1 = alpha3 = 1, but then alpha2 = alpha4 = 0, indicating that x2 and x4 are not support vectors, and this is illogical. Even more with alpha1 = alpha3 = 1 and alpha2 = alpha4 = 0, the w vector and b obtained return these results:
(w . x1) + b = +1 (x1 is over the H+, since x1 is a support vector)
(w . x2) + b = +1 (x1 is over the H+, but it is not a support vector because its Lagrange Multiplier is zero)
(w . x3) + b = -1 (x1 is over the H-, since x1 is a support vector)
(w . x4) + b = -1 (x1 is over the H-, but it is not a support vector because its Lagrange Multiplier is zero)
I have also used the svmtrain of Matlab and the support vectors obtained are the same (x1 and x3).
One last observation, if I add one more training example, [x5 = (2, 2) y5 = +1], only alpha5 is zero, this indicate that x1, x2, x3 and x4 are support vectors.
Do you have any suggestion to clarify this situation?
Thank you in advance.

I performed the same experiment as you with a quadratic programming library and I have found 4 support vectors. This is because I was solving the problem without regularization (the hard-margin). However most libraries give you an implementation of soft-margin SVM with regularization where the hyperparameter C allows you to tune how much you want to regularize. By default they use a value of C which perform some regularization. I made a reproducible example in Python. If you change the value of C to be greater than one, you will have 2 support vectors. However as soon as you use a value less than 1 you have 4 support vectors. I hope it helps you.

Hi, thanks for writing this up. Could you please double-check that w is indeed perpendicular to H1? If you take any point on H1, this point is represented by a vector drawn from the origin to that point. This vector is obviously not perpendicular to w. I'd appreciate it if you could explain why I'm wrong 🙂

The two things that matter about w are its direction and its magnitude. I can draw it where I want and I will be the same vector (even if it does not have the same starting coordinates). I explain this in Part 2 when talking about the difference between two vectors. I hope it helps you 🙂

Hi!
In equation 9 we have K = mu = m ( w / ||w|| );
I think in defining the vector "k" we need something like b in wx+b=0
ti indicate the location of vector in the space.
just the magnitude and the norm shows where is the starting point of k?????
Please help me

Hello. No, you do not need to indicate the location of the vector in the space. This is often confusing, I explain this in Part 2 when talking about the difference between two vectors. Reread this section, it might help you understand.

Dear Sir,
How we calculate w and b?. are they different for every feature and at the end we need to optimize them?..how we decide the the value of w and b for optimal hyperplane equation...if for each feature we are calculating w and b then what value of w and b we will choose?. and why we take unit value for w always even the data are not always at the same distance. I am not getting the base of w and b value. could you please explain it earliest. Thank you very much for your time and article.

Hello. There is only one vector w, and it will contain several values. If X is an (m,n) matrix, then a single example is a (1,n) vector and w is a (n,1) vector. So one vector w contains the values for all examples. There is a single value of b. They will be chosen by solving an optimization problem (you can read more on that on the later articles).

I'm learning a project about human detection which using SVM algorithm to classify non-human or human. Can you tell me the relationship between the variables w, b, xi with the values from the input images? I haven't still imagined yet? TKS so much!

How to compute the weight vector w and bias b in linear SVM. We have a hyperplane equation and the positive and negative feature. In equation Wx+b= 0, what does it mean by weight vector and how to compute it??
Can anybody explain it please

Hello. I agree with the many comments praising your post. Thank you so much for it!

There's one thing I'm still confused about: How do we find a hyperplane that seperates the set in the first place? In step 2 it says: "Given a hyperplane H0 separating the dataset and satisfying[...]"
So how do we get H0? Did I somehow miss that?

Hello, Matthias. Thank you for your comment. The hyperplane H0 is introduced in order to derive the optimization problem. I could start an argument by "Given a point x on the horizontal axis", and it does not matter how you pick it, or where it is, what matters is that it is a point, and it is on the horizontal axis. The same apply in this context. But to answer your question, if you have a linearly separable dataset, you can find a separating hyperplane by using an algorithm like the Perceptron Learning Algorithm for instance, or of course the SMO algorithm for the SVM problem.

It's all about dot product. Normal vector w is perpendicular to the hyperplane, and so the margin vector k is. Given arbitrary vector w and k, what we want is the result of this dot product must be 1 (or any positif number, we use δ in this post). Since these two vectors is in parallel, the dot product is only based on its length.
How do you choose w and m which the dot product is 1 while we want to maximize margin? Yap, minimize the length of the normal vector w!

Thanks a lot for the detailed explanation. That's the greatest article on SVM I have ever read before. Here I have one question. In Figure 4, 5, 6, 7, 8, there are hyperplanes written by wx - b = 1 and wx - b = -1, but we are talking about wx + b = 1 and wx + b = -1, right? So, I think that wx + b = 1 and wx + b = -1 should be also written in those figures. What do you think?

Yes indeed it would be more coherent to use wx+b=1 in the figure as well. Even if the only difference between an hyperplane with wx-b=1 and wx+b=1 is the sign of b. Thanks for pointing that out. It is in fact often confusing in the literature as well as some author use the wx+b convention and others use wx-b (Platt for example).

I am not sure to understand your question perfectly. But the vector w here is the vector normal to the hyperplane. In Part 2 we defined a vector w which was the direction of a vector u. But they are not the same vectors.

You wrote in a previous tutorial: "Given a particular hyperplane, we can compute the distance between the hyperplane and the closest data point. Once we have this value, if we double it we will get what is called the margin"
However with the constraints above(in this tutorial) there isn't guarantee that a data point (of the training set) was on the boundary of hyperplane H0 or H1. In this case the given definition of margin seems wrong. Am I wrong?

Another question is why δ=1? It should be correlation between the margin and δ. If you leave δ you get (if I'm not wrong):
margin = (2*δ)/||w||
And the optimization problem should ivolve the parameter δ too.
Why can simplification be done (δ=1)?

1) As the hyperplane will be at equal distance between the two nearest data point finding the distance from one point than multiplying it by 2 works.
However this is not very rigorous, that is why we see the "true" method to compute the margin in this article.
2) To better understand delta, you can read this paper. (Equation 2)

Great work. I am a Deep Learning researcher. I personally have never seen such simple and effective explanation for SVMs in any of the books (or across the web). This reminded me of the quote by Einstein...."If you cannot explain it simply, you probably have not understood it"....Cheers.

Hi thank's very much for your tutorial. Very informative. (As usual when I got into new intellectual topics I often miss understand basics). Regarding SVM I wounder how will I do to find the first 3 support vector. One guess I have is this (correct me if I am wrong).
1. Create a first version of test "nr1 hyperplane" by Random select an origo and a random angle for the (bolded) 'w' vector (hyperplane)
2. Go through All vectors and if some vector is constrained by this test "nr1 hyperplane" then throe this vector. Maybee se how bad this constrained point and use this data to produce a new test "nr2 hyperplane".
3. When test "nr? hyperplane" satisfy ALL vectors then Start to find the 3 support vectors.
4. Now again go through the margin calculation method you describe for ALL the vectors and when some vector have smallest margin save this until you have find for example 'A' 'B' and 'H' in your example above.

I guess there must be a way to find this 3 support vector as fast as possible.
Very best Regards Olle Welin
I feel so slow but I am curious about machine learning how the basics math "nuts and bolts" works. (play with write my own C code on Raspberry pi)

Great work Alexandre!Sorry m completely new to this so might be a novice question but am not able to visualise how z0 which is addition of x0 and k .... is in hyperplane H1 ?
"If we start from the point x0 and add k we find that the point z0=x0+k is in the hyperplane H1 ?"

Yes, with all the constraints, I will be able to know whether the hyperplanes are the ones that correctly separate the data. But how do I get the hyperplanes in the first place? Do I take a random guess? If after applying the constraints I found they are not the ones I am looking for, how do I improve myself?

That is the purpose of the algorithm SMO. Right now this article is building the foundation used to define the problem we want to solve, not explaining how to solve it. (I will explain all of that in my upcoming ebook)

First of all I'd like to thank you for this awesome tutorial. There is however something I do not understand. In the part "Understanding the constraints". Why is it that for Xi = A, the point on the hyperplane is =1 and for Xi = C the point on the hyperplane is >1 ?

I have been circling SVM for over a year now, and while i understood the underlying principle, i never truly got how it worked. Most resources provide either a general, high-level explanation, or heavy mathematics that can be difficult to understand on your own.

Your post, on the other hand, is clear, detailed and explicit. I think you have invested a great deal of attention and time into this project, and i want you to know it is truly helpful and of great quality.

I've read the first 3 parts to this tutorial (this page is part 3). In the example you're working through, you chose your first hyperplane to be x2 = -2 * x1 because this is a very basic example that you could visualize. How do you select a hyperplane that best separates multi-dimensional data that can't be visualized by humans?

Basically we can say that the first hyperplane is defined by a vector w starting at 0 or with random values, then an algorithm updates it by tiny increment until it find the optimal hyperplane (you can read the full detail of this in the chapter about the SMO algorithm in my ebook).

Hi,
We have two parameters w and b in w⋅x+b=−1, w⋅x+b=1 equations. w is corresponding for rotation, b is corresponding for offset. I didn't get how we can control distance between two hyperplanes if we have shared b parameter and constant values on the right hand side (-1, 1). We only can rotate and shift hyperplanes by varying w and b. But we cannot change distance between planes.