Abstract

Previous articles (van der Laan and Dudoit (2003); van der Laan et al. (2006); Sinisi et al. (2007)) advocated and theoretically validated the use of cross-validation to select among many candidate estimators in order to compute a so-called super learner that outperforms each of the given candidate estimators. The theoretical basis for this super learner was provided by oracle results for the cross-validation selector (van der Laan and Dudoit (2003); van der Laan et al. (2006); Sinisi et al. (2007)). In addition, these papers contained a practical demonstration of the adaptivity of this super learner in the context of predicting the fitness of HIV as a function of its mutations. This article proposes a fast algorithm for constructing a super learner in prediction that uses V-fold cross-validation to select a functional form of an initial set of candidate predictors according to a parametric or semi-parametric model, or possibly data adaptively. The paper contains a proof that the resulting super learner performs asymptotically as well as the oracle selector among the continuum of estimators defined by the (semi-)parametric functional forms of the initial set of candidate estimators.
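The cross-validation selector underlying this construction can be illustrated with a minimal sketch: each candidate estimator's risk is estimated by V-fold cross-validation, and the candidate with the smallest cross-validated risk is selected. The sketch below assumes scikit-learn; the candidate learners and data-generating process are illustrative choices, not taken from the paper.

```python
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

# Illustrative simulated data: a linear signal plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(150, 2))
y = 2 * X[:, 0] + rng.normal(scale=0.3, size=150)

# Two candidate estimators (any library of learners could be used).
candidates = {"ols": LinearRegression(),
              "knn": KNeighborsRegressor(n_neighbors=5)}

# V-fold cross-validated risk (mean squared error) of each candidate.
V = 10
risks = {name: 0.0 for name in candidates}
for train, test in KFold(n_splits=V, shuffle=True, random_state=1).split(X):
    for name, learner in candidates.items():
        pred = learner.fit(X[train], y[train]).predict(X[test])
        risks[name] += np.mean((y[test] - pred) ** 2) / V

# The cross-validation selector picks the risk-minimizing candidate.
selected = min(risks, key=risks.get)
```

The oracle results cited above state that, under mild conditions, this selector performs asymptotically as well as the best candidate one could have chosen with knowledge of the true data-generating distribution.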

This approach also yields a new class of cross-validation methods for selecting among a family of candidate estimators: the minimization of the cross-validated risk over the family of candidate estimators is formulated as a new least squares regression problem, which can itself be carried out with any parametric or nonparametric regression methodology (e.g., using cross-validation itself), thereby preventing over-fitting of the cross-validated risk. Simulations and data analyses suggest that this new super learner is superior to competing methods. This approach to constructing a super learner generalizes to any parameter that can be defined as the minimizer of a loss function.
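The regression formulation above can be sketched as follows: the cross-validated predictions of the candidate estimators form the covariates of a new least squares problem, and the fitted coefficients define the super learner combination. This is only a minimal sketch, assuming scikit-learn and SciPy; the non-negativity constraint on the weights and the particular candidates are illustrative choices, not prescribed by the abstract.

```python
import numpy as np
from scipy.optimize import nnls
from sklearn.model_selection import KFold
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.tree import DecisionTreeRegressor

# Illustrative simulated data with a partly nonlinear signal.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] + np.sin(X[:, 1]) + rng.normal(scale=0.5, size=200)

learners = [LinearRegression(), Ridge(alpha=1.0),
            DecisionTreeRegressor(max_depth=3, random_state=0)]

# Matrix Z of V-fold cross-validated predictions: each column is one
# candidate's out-of-fold prediction for every observation.
V = 5
Z = np.zeros((len(y), len(learners)))
for train, test in KFold(n_splits=V, shuffle=True, random_state=0).split(X):
    for j, learner in enumerate(learners):
        Z[test, j] = learner.fit(X[train], y[train]).predict(X[test])

# Minimize the cross-validated risk by regressing y on Z; here with
# non-negative least squares, a common (illustrative) constraint.
w, _ = nnls(Z, y)
if w.sum() > 0:
    w = w / w.sum()  # normalize to a convex combination of candidates
```

The fitted weights `w` define the super learner: each candidate is refit on the full data and the final prediction is the corresponding weighted combination. Because the combining regression is itself fit on out-of-fold predictions, over-fitting of the cross-validated risk is avoided.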