What I was thinking is that it's hard to use gradient descent when $b$ which is part of the constraints is unknown (it can actually be calculated as a function of $w$). But maybe I'm missing something.

2 Answers

You've identified the key problem. The primal can certainly be solved directly by, say, a quadratic programming solver, but typical QP solvers often don't scale well to large problem sizes. A projected gradient method can often scale to significantly larger problems---but only if the gradients and projections are cheap to compute. As I will show, the dual problem can be solved with inexpensive projected gradient iterations, while the primal cannot.

First, let's simplify notation a bit: collect the vectors $y_ix_i$ into the rows of a matrix $Z$, and the values $y_i$ into the elements of a vector $y$. Then we can write the problem as
$$\begin{array}{ll} \text{minimize}_{w,b} & f(w,b) \triangleq \tfrac{1}{2}w^T w \\ \text{subject to} & Z w + y b \succeq \vec{1} \end{array}$$

A projected gradient algorithm will alternate between gradient steps and projections. The gradient is simply $\nabla f(w,b)=(w,0)$, so this is a pretty trivial operation. Let's denote by $(w_+,b_+)$ the result of a single gradient step. We must then project $(w_+,b_+)$ back onto the feasible set: find the nearest point $(w',b')$ that satisfies $Zw'+b'y \succeq \vec{1}$. This means we must solve
$$\begin{array}{ll}
\text{minimize}_{w,b} & \|w-w_+\|_2^2+(b_+-b)^2 \\
\text{subject to} & Zw+by\succeq \vec{1} \\
\end{array}$$
This is virtually the same problem as the original. In other words, each step of projected gradients for the primal problem is as expensive as the original problem itself.

Now examine the dual problem. To get a handle on this we need to simplify the dual function
$$g(\lambda) = \min_{w,b} \tfrac{1}{2} w^T w - \lambda^T ( Z w + b y - \vec{1} )$$
With a little calculus you can determine that the optimal value is $w=Z^T\lambda$. As for $b$: if $y^T\lambda\neq 0$, the expression can be driven to $-\infty$ by letting $b\cdot(y^T\lambda)\rightarrow +\infty$. So
$$g(\lambda) = \begin{cases} \vec{1}^T \lambda - \tfrac{1}{2} \lambda^T ZZ^T \lambda & y^T \lambda = 0 \\ -\infty & y^T \lambda \neq 0 \end{cases}$$
So the effect of $b$ is to introduce an implicit constraint $y^T\lambda =0$, and the dual problem is equivalent to $$\begin{array}{ll} \text{maximize} & \bar{g}(\lambda) \triangleq \vec{1}^T \lambda - \tfrac{1}{2} \lambda^T ZZ^T \lambda \\ \text{subject to} & y^T \lambda = 0 \\ & \lambda \succeq 0 \end{array}$$
We would not have been able to apply projected gradients to the original dual function, because it wasn't differentiable; but removing the implicit constraint $y^T\lambda=0$ changes that. Now the gradient is $\nabla\bar{g}(\lambda) = \vec{1} - ZZ^T\lambda$---a bit more complex than the primal gradient, but entirely manageable. (Note that we're maximizing now, so gradient steps are taken in the positive direction).
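To make the step concrete, here is a minimal numpy sketch of one ascent step; the matrix $Z$, the multipliers, and the step size are all illustrative placeholders, not values from the text:

```python
import numpy as np

# Illustrative setup: Z holds the rows y_i * x_i (values here are arbitrary).
rng = np.random.default_rng(0)
Z = rng.standard_normal((5, 2))
lam = np.abs(rng.standard_normal(5))

# Gradient of the dual objective: grad = 1 - Z Z^T lam.
# Computing Z @ (Z.T @ lam) avoids ever forming the n-by-n matrix Z Z^T.
grad = np.ones(len(lam)) - Z @ (Z.T @ lam)

step = 1e-2                       # assumed step size, for illustration only
lam_plus = lam + step * grad      # ascent: step in the positive direction
```

The matrix-free evaluation is what keeps each iteration cheap: two matrix-vector products, $O(nd)$ work, rather than the $O(n^2)$ it would take to apply a stored $ZZ^T$.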

So what's left is to perform the projection. Given the point $\lambda_+$ that comes out of the gradient step, we need to solve
$$\begin{array}{ll} \text{minimize} & \tfrac{1}{2} \|\lambda-\lambda_+\|_2^2 \\ \text{subject to} & y^T \lambda = 0 \\ & \lambda \succeq 0 \end{array}$$
This looks much easier than the primal projection, doesn't it? I haven't looked at the literature to see how people solve it now, but this is what I came up with: the value of $\lambda$ is
$$\lambda_i = \max\{\lambda_{+,i}+s y_i,0\}, ~i=1,2,\dots, n$$
where the scalar $s$ is chosen so that $y^T\lambda = 0$. Since $y^T\lambda$ is a nondecreasing function of $s$, a simple one-dimensional search works: start with $s=0$, compute $\lambda$ and $y^T\lambda$, adjust $s$ (e.g., by bisection), and repeat; each trial costs $O(n)$. Don't trust me on the details---I suspect the extant SVM literature has something more refined here.
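One way to implement that idea is the following sketch (a sketch under my stated assumptions, not a vetted reference implementation): the map $s \mapsto y^T \max(\lambda_+ + s y, 0)$ is nondecreasing, so a root can be bracketed and then bisected.

```python
import numpy as np

def project_dual(lam_plus, y, iters=60):
    """Project lam_plus onto {lam : y @ lam == 0, lam >= 0}.

    Uses lam_i = max(lam_plus_i + s * y_i, 0), with s found by bisection:
    h(s) = y @ max(lam_plus + s*y, 0) is nondecreasing in s.
    Assumes both classes (+1 and -1) are present in y.
    """
    h = lambda s: y @ np.maximum(lam_plus + s * y, 0.0)
    lo, hi = -1.0, 1.0
    while h(lo) > 0:        # expand the bracket until h(lo) <= 0
        lo *= 2.0
    while h(hi) < 0:        # ...and until h(hi) >= 0
        hi *= 2.0
    for _ in range(iters):  # each O(n) evaluation halves the bracket
        mid = 0.5 * (lo + hi)
        if h(mid) < 0:
            lo = mid
        else:
            hi = mid
    return np.maximum(lam_plus + 0.5 * (lo + hi) * y, 0.0)

# Illustrative usage with made-up numbers:
lam_plus = np.array([0.5, -0.2, 0.9, 0.1, 0.3, -0.4])
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0, -1.0])
lam = project_dual(lam_plus, y)   # feasible: lam >= 0 and y @ lam ~ 0
```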

An important final step is to recover $(w,b)$ from the optimal solution $\lambda^*$. Our derivations for $g(\lambda)$ showed that $w=Z^T\lambda^*$. Once this has been recovered, $b$ is any value in the following interval:
$$b \in \left[\, 1 - \min_{i:y_i=1} w\cdot x_i,\ \ -1 - \max_{i:y_i=-1} w \cdot x_i \,\right]$$
which collects the tightest of the constraints $y_i(w\cdot x_i + b)\geq 1$ from each class.
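Putting the pieces together, here is a self-contained sketch of the whole scheme on a toy separable dataset. The dataset, step size, and iteration counts are all assumptions for illustration; a practical solver would choose these more carefully.

```python
import numpy as np

# Toy linearly separable data (illustrative only): two well-separated blobs.
rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((10, 2)) + 3.0,
               rng.standard_normal((10, 2)) - 3.0])
y = np.concatenate([np.ones(10), -np.ones(10)])
Z = y[:, None] * X                     # rows are y_i x_i

def project(lam_plus):
    # Nearest point with y @ lam == 0 and lam >= 0:
    # lam_i = max(lam_plus_i + s*y_i, 0), with s found by bisection on the
    # nondecreasing function h(s) = y @ max(lam_plus + s*y, 0).
    h = lambda s: y @ np.maximum(lam_plus + s * y, 0.0)
    lo, hi = -1.0, 1.0
    while h(lo) > 0:
        lo *= 2.0
    while h(hi) < 0:
        hi *= 2.0
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        lo, hi = (mid, hi) if h(mid) < 0 else (lo, mid)
    return np.maximum(lam_plus + 0.5 * (lo + hi) * y, 0.0)

lam = np.zeros(len(y))
step = 2e-3                            # assumed small enough for this data
for _ in range(3000):                  # projected gradient ascent on the dual
    lam = project(lam + step * (np.ones_like(lam) - Z @ (Z.T @ lam)))

w = Z.T @ lam                          # primal weights from the dual solution
b_lo = 1.0 - np.min(X[y == 1] @ w)     # tightest constraint from class +1
b_hi = -1.0 - np.max(X[y == -1] @ w)   # tightest constraint from class -1
b = 0.5 * (b_lo + b_hi)                # pick the midpoint of the interval
```

With the blobs this far apart, the recovered $(w,b)$ should separate the two classes even before the iteration has fully converged.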

I would guess that the main reason for solving the max-margin task in the dual is that the dual formulation permits the use of kernel functions, which makes it easy to implicitly map the input points to some other feature space where the two classes are separable. The hard-margin task you formulated above has a solution only if the two classes your points come from are linearly separable; if the convex hulls of the points of the two classes intersect, the task is infeasible. There are two ways to fix this situation: 1) switch to soft margins by introducing slack variables, or 2) map your input points to some other feature space where they are linearly separable and perform your classification there. The second approach can be done implicitly via appropriate kernels, e.g. the Gaussian kernel, because in the dual task the input points only appear in inner products.
This can be seen in a more convenient form of the dual:
$$\max_{\alpha}\ \sum_i \alpha_i - \frac{1}{2} \sum_{i,j} y_i y_j \alpha_i \alpha_j \langle x_i,x_j\rangle \quad \text{s.t.} \quad \sum_i \alpha_i y_i = 0, \quad \alpha_i \geq 0\ \ \forall i$$
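Since the data enter the dual only through the inner products $\langle x_i, x_j\rangle$, kernelizing amounts to swapping the Gram matrix. A small sketch, where the dataset, multipliers, and bandwidth $\sigma$ are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 3))                  # illustrative data
y = np.array([1.0, 1.0, 1.0, -1.0, -1.0, -1.0])
alpha = np.abs(rng.standard_normal(6))           # illustrative multipliers

def dual_objective(alpha, K):
    # sum_i alpha_i - 1/2 sum_{i,j} y_i y_j alpha_i alpha_j K_ij
    u = y * alpha
    return alpha.sum() - 0.5 * u @ K @ u

K_linear = X @ X.T                               # K_ij = <x_i, x_j>
sq = np.sum(X ** 2, axis=1)
D2 = sq[:, None] + sq[None, :] - 2.0 * K_linear  # pairwise squared distances
K_rbf = np.exp(-D2 / (2.0 * 1.0 ** 2))           # Gaussian kernel, sigma = 1

# Same objective code, different kernel: nothing else in the dual changes.
v_lin = dual_objective(alpha, K_linear)
v_rbf = dual_objective(alpha, K_rbf)
```

Note that the Gaussian Gram matrix itself is built from inner products (via the pairwise squared distances), which is exactly the property the answer points out.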

For a more comprehensive and detailed explanation I recommend the book "An Introduction to Support Vector Machines and Other Kernel-based Learning Methods" by Nello Cristianini and John Shawe-Taylor.