2 Answers

You can estimate the generalization performance of a given tuple of hyperparameters via cross-validation.
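To make this concrete, here is a minimal Python sketch of evaluating a single hyperparameter tuple by cross-validation; scikit-learn's SVC, cross_val_score, and the digits dataset are only stand-ins, not part of the original point:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Load a small stand-in dataset.
X, y = load_digits(return_X_y=True)

# One candidate hyperparameter tuple for an RBF-kernel SVM.
model = SVC(kernel="rbf", C=10.0, gamma=0.001)

# 5-fold cross-validation estimates the generalization accuracy of this tuple.
scores = cross_val_score(model, X, y, cv=5)
print("estimated accuracy: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```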

The traditional way to find suitable values for these parameters is grid search: test a predefined set of hyperparameter tuples and select the best one. Another common approach is to rely on expert knowledge and tune the hyperparameters manually, which sometimes works but is certainly not reproducible.
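A bare-bones grid search is then just a loop over such cross-validation evaluations; the grid values below are arbitrary illustrations:

```python
import itertools

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# A predefined set of hyperparameter tuples (the grid).
C_values = [0.1, 1, 10, 100]
gamma_values = [1e-4, 1e-3, 1e-2, 1e-1]

best_score, best_params = -1.0, None
for C, gamma in itertools.product(C_values, gamma_values):
    # Score each tuple by cross-validation and keep the best one.
    score = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, (C, gamma)

print("best (C, gamma):", best_params, "CV accuracy: %.3f" % best_score)
```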

Both of these standard approaches are poorly suited to the problem. Grid search and manual search become infeasible as the number of hyperparameters grows, so it is far better to use true optimization methods. Random search was recently proposed as a good baseline, but it does not focus its effort on promising regions. Another commonly used approach is the Nelder-Mead simplex, which I strongly advise against: it cannot cope with the stochastic nature of hyperparameter search and is therefore prone to getting stuck in local minima.
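For comparison, random search only changes how candidate tuples are generated; the search ranges and evaluation budget below are illustrative assumptions:

```python
import numpy as np

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
rng = np.random.default_rng(0)

best_score, best_params = -1.0, None
for _ in range(30):  # evaluation budget
    # Sample C and gamma log-uniformly instead of using a fixed grid.
    C = 10 ** rng.uniform(-2, 3)
    gamma = 10 ** rng.uniform(-5, 0)
    score = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y, cv=5).mean()
    if score > best_score:
        best_score, best_params = score, (C, gamma)

print("best (C, gamma):", best_params, "CV accuracy: %.3f" % best_score)
```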

I wrote a brief article describing the main challenges of hyperparameter optimization. The best existing methods are all forms of black-box optimization, an area that is currently heavily researched in machine learning; the current trend leans towards Bayesian optimization methods. A few good software libraries are available that can make tuning easy for you. I recommend Optunity, which I developed (paper), and Hyperopt, because these two are the easiest to use and can tackle most problems easily.
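To give a flavour of how little code such a library needs, here is a minimal sketch using Hyperopt's fmin/TPE interface; the search space, budget, dataset, and the use of 1 minus accuracy as the loss are only illustrative assumptions, not a prescription:

```python
import numpy as np
from hyperopt import fmin, hp, tpe

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

def objective(params):
    # Hyperopt minimizes, so return 1 minus the cross-validated accuracy.
    model = SVC(kernel="rbf", C=params["C"], gamma=params["gamma"])
    return 1.0 - cross_val_score(model, X, y, cv=5).mean()

# Log-uniform search ranges for C and gamma (illustrative choices).
space = {
    "C": hp.loguniform("C", np.log(1e-2), np.log(1e3)),
    "gamma": hp.loguniform("gamma", np.log(1e-5), np.log(1e0)),
}

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=50)
print("best hyperparameters found:", best)
```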

The figure below illustrates that the response surface in hyperparameter optimization has many local optima (and hence is a poor fit for Nelder-Mead). This figure shows cross-validation performance (higher is better) for an SVM with RBF kernel (with hyperparameters $C$ and $\gamma$) based on the trace of Optunity's particle swarm optimization.

Thank you very much for the answer; I will dig into this problem and then come back to tell you if it was a good idea.
– yonutix, Apr 15 '15 at 12:59

I've found Nelder-Mead works very well for the LS-SVM, which suggests the noisiness might be due to support vectors entering and leaving the kernel expansion. I've used Nelder-Mead and gradient descent with my old SVM toolbox, but not for a fair while (I like the LS-SVM rather more for most problems).
– Dikran Marsupial, Apr 16 '15 at 17:37

@DikranMarsupial you are probably right. I think the noisiness may indeed be due to changes in the set of support vectors. That's a pretty cool observation!
– Marc Claesen, Apr 16 '15 at 20:14

In scenarios with just one or two hyperparameters, it's conventional practice to perform CV over a grid of options and pick the one with the lowest value (or within one standard error of the lowest). Grid search is really easy to implement, but it's a process of trial and error: you have to know which "box" in the hyperparameter space you want to search, or else the minimum will land on the boundary of the box, implying that a better-performing model may lie outside it. It can also be really slow if you use a very fine grid or if building a model takes a lot of time, since you'll spend lots of time exploring dead ends without ever being able to speed that up.
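As a concrete sketch, this is what that recipe looks like with scikit-learn's GridSearchCV; the parameter grid and dataset are only illustrative:

```python
from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

# The "box" of hyperparameter values to search (illustrative).
param_grid = {
    "C": [0.1, 1, 10, 100],
    "gamma": [1e-4, 1e-3, 1e-2, 1e-1],
}

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("best CV accuracy: %.3f" % search.best_score_)
```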

Another option is to wrap the hyperparameter tuning process inside an unconstrained optimization routine whose objective function is out-of-sample performance. As long as the optimization takes less time than a grid search, you'll get results more quickly. Nelder-Mead works well here.
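A rough sketch of that idea, using scipy's Nelder-Mead on the cross-validated error; the starting point and dataset are only illustrative:

```python
import numpy as np
from scipy.optimize import minimize

from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)

def cv_error(log_params):
    # Optimize in log-space so C and gamma stay positive.
    C, gamma = np.exp(log_params)
    model = SVC(kernel="rbf", C=C, gamma=gamma)
    return 1.0 - cross_val_score(model, X, y, cv=5).mean()

# Start from C = 1, gamma = 0.01 and let Nelder-Mead explore.
result = minimize(cv_error, x0=np.log([1.0, 0.01]), method="Nelder-Mead")

print("best (C, gamma):", np.exp(result.x))
print("CV error at optimum: %.3f" % result.fun)
```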

+1, but Nelder-Mead is a poor choice for hyperparameter optimization because it has no way of dealing with the stochastic nature of the problem. As a result, Nelder-Mead gets stuck in local optima almost instantly. Bayesian and particle swarm optimization are far more appropriate.
– Marc Claesen, Apr 15 '15 at 12:34

@MarcClaesen I'm not sure what you mean by stochastic nature - for the same training and test data, I would expect the surface to be smooth and not at all stochastic. I'd be very interested in learning more about both of those methods, since I don't know very much about them at all, and certainly not enough to write an answer! I'd instantly upvote an answer with references!
– Sycorax, Apr 15 '15 at 12:37

The optimization surfaces are not smooth at all. You get a random component due to finite samples in combination with cross-validation. If you plot the optimization surface when tuning an SVM with RBF kernel, it is far from smooth, even for easy data sets. I will see if I have a figure somewhere close to illustrate this.
– Marc Claesen, Apr 15 '15 at 12:49

@DikranMarsupial that may be true for LS-SVM, I have little experience with those. For SVM, however, it is not smooth even when you reuse the same folds during the entire optimization process. The figure I've included in my own answer uses the same folds for all evaluations while optimizing SVM hyperparameters and you can see it's not smooth. Additionally, I reckon it's better not to reuse folds as an additional measure to prevent overfitting on the partitioning of folds (though I usually don't bother to regenerate folds every time).
– Marc Claesen, Apr 16 '15 at 20:13