I am new to machine learning and would like to know whether it makes sense to choose the number of estimators and the maximum depth of a random forest via cross-validation.

My intuition would be yes, since cross-validation is a standard way to determine a model's hyperparameters.
Nevertheless, I think I heard a professor say that it is illogical to estimate random forest hyperparameters with such procedures...
Maybe it is because the number of estimators would be chosen based on the training data, introducing a kind of overfitting: the training and test sets are not the same size, so, for instance, a larger set might allow deeper trees, or averaging over more trees might be needed to reduce the variance...

What other ways are there to determine which random forest hyperparameters to use?

More generally, what are the main model selection procedures in machine learning?

For RF, the default hyperparameters are very often quite a fine choice. A proper grid search would include two loops of cross-validation: an inner grid search and an outer validation loop. You may use the inner OOB-CV for the grid search and a 10-fold CV for validation.
– Soren Havelund Welling Jan 7 '16 at 14:17

2 Answers

Random forests have the reputation of being relatively easy to tune. This is because they only have a few hyperparameters, and aren't overly sensitive to the particular values they take. Tuning the hyperparameters can often increase generalization performance somewhat.

Tree size can be controlled in different ways depending on the implementation, including the maximum depth, maximum number of nodes, and minimum number of points per leaf node. Larger trees can fit more complex functions, but also increase the risk of overfitting. Some implementations don't impose any restrictions by default, and grow trees fully. Tuning tree size can improve performance by balancing between over- and underfitting.
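As a minimal sketch (using scikit-learn's parameter names and a synthetic dataset, purely for illustration), here is how tree size can be restricted, and how the resulting trees compare to fully grown ones:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Default: max_depth=None, so trees are grown fully.
full = RandomForestClassifier(n_estimators=50, random_state=0).fit(X, y)

# Restricting depth and leaf size limits model complexity.
shallow = RandomForestClassifier(
    n_estimators=50, max_depth=3, min_samples_leaf=5, random_state=0
).fit(X, y)

depths_full = [t.get_depth() for t in full.estimators_]
depths_shallow = [t.get_depth() for t in shallow.estimators_]
# No shallow tree can exceed depth 3, while full trees grow as deep as needed.
```

Other implementations expose equivalent knobs under different names (e.g. maximum number of leaf nodes instead of maximum depth).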

Number of features to consider per split. Each time a node is split, a random subset of features is considered, and the best is selected to perform the split. Considering more features increases the chance of finding a better split. But, it also increases the correlation between trees, increasing the variance of the overall model. Recommended default values are the square root of the total number of features for classification problems, and 1/3 the total number for regression problems. As with tree size, it may be possible to increase performance by tuning.

Number of trees. Increasing the number of trees in the forest decreases the variance of the overall model, and doesn't contribute to overfitting. From the standpoint of generalization performance, using more trees is therefore better. But, there are diminishing returns, and adding trees increases the computational burden. Therefore, it's best to fit some large number of trees while remaining within the computational budget. Several hundred is typically a good choice, but it may depend on the problem. Tuning isn't really needed. But, it's possible to monitor generalization performance while sequentially adding new trees to the model, then stop when performance plateaus.
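The "add trees and monitor" strategy can be sketched as follows (a scikit-learn illustration using `warm_start`, which reuses the already-grown trees instead of refitting from scratch; the dataset and tree counts are placeholders):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# warm_start=True keeps existing trees, so each fit() only grows new ones.
rf = RandomForestClassifier(
    n_estimators=10, warm_start=True, oob_score=True, random_state=0
)

oob_curve = []
for n in range(25, 151, 25):       # grow the forest in steps of 25 trees
    rf.set_params(n_estimators=n)
    rf.fit(X, y)
    oob_curve.append(rf.oob_score_)  # monitor out-of-bag accuracy

# One could stop growing once oob_curve plateaus.
```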

Choosing hyperparameters

Tuning random forest hyperparameters uses the same general procedure as other models: Explore possible hyperparameter values using some search algorithm. For each set of hyperparameter values, train the model and estimate its generalization performance. Choose the hyperparameters that optimize this estimate. Finally, estimate the generalization performance of the final, tuned model on an independent data set.

For many models, this procedure often involves splitting the data into training, validation, and test sets, using holdout or nested cross validation. However, random forests have a unique, convenient property: bootstrapping is used to fit the individual trees, which readily yields the out-of-bag (OOB) error. This is an unbiased estimate of the error on future data, and can therefore take the place of the validation or test set error. This leaves more data available for training, and is computationally cheaper than nested cross validation. See this post for more information.
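A sketch of using the OOB error in place of a validation set, assuming scikit-learn and a toy grid over tree depth (the candidate values are arbitrary):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, random_state=0)

# Compare hyperparameter settings by OOB score; no CV split is required,
# since each tree is evaluated on the points left out of its bootstrap sample.
best_score, best_depth = -1.0, None
for depth in (2, 5, None):
    rf = RandomForestClassifier(
        n_estimators=200, max_depth=depth, oob_score=True, random_state=0
    ).fit(X, y)
    if rf.oob_score_ > best_score:
        best_score, best_depth = rf.oob_score_, depth
```

A final, independent test set would still be needed if an unbiased estimate of the tuned model's performance is required.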

Grid search is probably the most popular search algorithm for hyperparameter optimization. Random search can be faster in some situations. I mention more about this (and some other hyperparameter optimization issues) here. Fancier algorithms (e.g. Bayesian optimization) are also possible.
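The two search strategies can be contrasted in a few lines with scikit-learn (the grid below is a small, arbitrary example; random search samples a fixed number of combinations rather than trying them all):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

X, y = make_classification(n_samples=300, random_state=0)

grid = {"max_depth": [3, 6, None], "max_features": ["sqrt", 0.5]}

# Grid search: evaluates all 6 combinations with 3-fold CV.
gs = GridSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0), grid, cv=3
).fit(X, y)

# Random search: evaluates only n_iter sampled combinations.
rs = RandomizedSearchCV(
    RandomForestClassifier(n_estimators=50, random_state=0),
    grid, n_iter=4, cv=3, random_state=0
).fit(X, y)
```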