“Stacked Generalization”

Suppose you had a data set $(x_i, y_i)$ where $i=1, \ldots, n$, $x_i \in A$, and $y_i \in R$. For cross validation, you might partition the data set into three subsets, train $k$ classifiers on two of the three subsets, and test the classifiers on the held out data. Then we might select one of the $k$ classifiers by choosing the classifier which did the best on the held out data.

Wolpert generalizes this idea by forming an $n$ by $k$ matrix of predictions from the $k$ classifiers. The $i$th row and the $j$th column would contain the prediction of the $j$th classifier on $x_i$ trained on partitions of the data set that do not include $x_i$. Then Wolpert would train a new classifier using the $i$th row of the matrix as input and trying to match $y_i$ as output. The new classifier $f$ would map $R^k$ into $R$. Let $G_j$ be the result of training the $j$th classifier on all of the data. Then the Wolpert generalized classifier would have the form

$$h(x) = f( G_1(x), G_2(x), \ldots, G_k(x) ).$$

Wolpert actually describes an even more general scheme which could have a large number of layers, much like deep belief networks and auto encoders. The idea of leaving part of the data out is similar to denoising or dropout.