Author: Korgun Dmitry, @tbb

Tutorial

"Something else about ensemble learning"

The goal behind ensemble methods is to combine different classifiers into a meta-classifier that has better generalization performance than each individual classifier alone. For example, if we collected predictions from 10 different Kaggle kernels, an ensemble method would let us combine them into a prediction that is more accurate and robust than the prediction of any single kernel. There are several ways to create an ensemble of classifiers, each serving its own purpose:

Bagging - decreases the variance

Boosting - decreases the bias

Stacking - improves the predictive force

You already know what "bagging" and "boosting" are from the lectures, but let me remind you of the main ideas.

Bagging - generates additional training data from the original dataset, using combinations with repetitions (sampling with replacement) to produce multisets of the same size as the original dataset. Increasing the size of the training set this way cannot improve the model's predictive force; it only decreases the variance, narrowly tuning the prediction to the expected outcome.
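As a minimal illustration (assuming X_train and y_train are already defined), scikit-learn's BaggingClassifier does exactly this bootstrap sampling; the base estimator defaults to a decision tree:

from sklearn.ensemble import BaggingClassifier

# 50 trees, each fit on a bootstrap sample drawn with replacement
bag_clf = BaggingClassifier(n_estimators=50, bootstrap=True, random_state=17)
bag_clf.fit(X_train, y_train)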

Boosting - a two-step approach, which first uses subsets of the original data to produce a series of averagely performing models and then "boosts" their performance by combining them using a particular cost function (e.g. majority vote). Unlike bagging, in classical boosting the subset creation is not random and depends on the performance of the previous models: every new subset contains the elements that were misclassified by the previous model.
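One classical boosting algorithm is AdaBoost; a minimal sketch (again assuming X_train and y_train exist, with illustrative hyperparameters):

from sklearn.ensemble import AdaBoostClassifier

# each new estimator concentrates on the samples the previous ones got wrong
ada_clf = AdaBoostClassifier(n_estimators=100, learning_rate=0.5,
                             random_state=17)
ada_clf.fit(X_train, y_train)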

Stacking (blending) - similar to boosting: you also apply several models to your original data. The difference here is that you don't have an empirical formula for your weight function; instead, you introduce a meta-level and use another model/approach that takes the input together with the outputs of every model to estimate the weights, in other words, to determine which models perform well and which perform badly given these input data.
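Recent scikit-learn versions ship this idea out of the box as StackingClassifier; a sketch with illustrative base models (assuming X_train and y_train exist):

from sklearn.ensemble import StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

stack_clf = StackingClassifier(
    estimators=[('tree', DecisionTreeClassifier(max_depth=3)),
                ('knn', KNeighborsClassifier(n_neighbors=5))],
    final_estimator=LogisticRegression())  # the meta-level model
stack_clf.fit(X_train, y_train)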

Before we start, I think we should look at a graph that demonstrates the relationship between the ensemble error and the error of an individual classifier. In other words, this graph visualizes Condorcet's jury theorem.
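Here is a minimal sketch of how such a graph can be produced, under the classical assumption of n independent base classifiers that all err with the same probability eps; the ensemble errs when the majority of them do, which gives a binomial sum:

import math

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import comb

def ensemble_error(n_classifier, eps):
    # the majority vote is wrong when at least ceil(n/2) members are wrong
    k_start = int(math.ceil(n_classifier / 2.))
    probs = [comb(n_classifier, k) * eps**k * (1 - eps)**(n_classifier - k)
             for k in range(k_start, n_classifier + 1)]
    return sum(probs)

eps_range = np.arange(0.0, 1.01, 0.01)
ens_errors = [ensemble_error(n_classifier=11, eps=eps) for eps in eps_range]

plt.plot(eps_range, ens_errors, label='Ensemble error')
plt.plot(eps_range, eps_range, linestyle='--', label='Base error')
plt.xlabel('Base error')
plt.ylabel('Base/Ensemble error')
plt.legend(loc='upper left')
plt.grid()
plt.show()

As long as the base error is below 0.5, the ensemble error stays below the base error, which is exactly the jury theorem's claim.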

# imports the helper needs (note: it also relies on a global y_train)
from itertools import product

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler

# and make a small helper function to plot classifiers' decision areas
def plot_clf_area(classifiers, labels, X, s_row=2, s_col=2,
                  scaling=True, colors=None, markers=None):
    if not colors:
        colors = ['green', 'red', 'blue']
    if not markers:
        markers = ['^', 'o', 'x']
    if scaling:
        sc = StandardScaler()
        X_std = sc.fit_transform(X)
    else:
        X_std = X  # fix: X_std was left undefined when scaling=False
    # find plot boundaries
    x_min, x_max = X_std[:, 0].min() - 1, X_std[:, 0].max() + 1
    y_min, y_max = X_std[:, 1].min() - 1, X_std[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, .1),
                         np.arange(y_min, y_max, .1))
    f, axarr = plt.subplots(nrows=s_row, ncols=s_col,
                            sharex='col', sharey='row', figsize=(12, 8))
    for idx, clf, tt in zip(product(range(s_row), range(s_col)),
                            classifiers, labels):
        clf.fit(X_std, y_train)
        Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
        Z = Z.reshape(xx.shape)
        axarr[idx[0], idx[1]].contourf(xx, yy, Z, alpha=.3)
        for label, color, marker in zip(np.unique(y_train), colors, markers):
            axarr[idx[0], idx[1]].scatter(X_std[y_train == label, 0],
                                          X_std[y_train == label, 1],
                                          c=color, marker=marker, s=50)
        axarr[idx[0], idx[1]].set_title(tt)
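A hypothetical usage sketch (the classifier names and hyperparameters are illustrative; I assume X_train/y_train are defined and that the tutorial's MajorityVoteClassifier accepts a classifiers= list):

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

clf1 = LogisticRegression(random_state=17)
clf2 = DecisionTreeClassifier(max_depth=3, random_state=17)
clf3 = KNeighborsClassifier(n_neighbors=5)
mv_clf = MajorityVoteClassifier(classifiers=[clf1, clf2, clf3])  # assumed signature

plot_clf_area([clf1, clf2, clf3, mv_clf],
              ['Logistic regression', 'Decision tree', 'KNN', 'Majority voting'],
              X_train)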

As we can see, the performance of the MajorityVoteClassifier has substantially improved over the individual classifiers in the 10-fold cross-validation evaluation. Note that the decision regions of the ensemble classifier seem to be a hybrid of the decision regions from the individual classifiers.

The majority vote approach is similar to stacking. However, in stacking the combining rule is itself a model: it predicts the final class label using the predictions of the individual classifiers in the ensemble as input.

The basic idea behind stacked generalization is to use a pool of base classifiers and then another classifier, called a meta-classifier, to combine their predictions, with the aim of reducing the generalization error.

Let’s say you want to do 2-fold stacking (a code sketch follows the steps below):

Split the train set in 2 parts: train_a and train_b

Fit a first-stage model on train_a and create predictions for train_b

Fit the same model on train_b and create predictions for train_a

Then fit the model on the entire train set and create predictions for the test set.

Now train a second-stage stacker model on the probabilities from the first-stage model(s).
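A minimal sketch of these steps, assuming numpy arrays X_train, y_train, X_test and a binary problem (the model choices are illustrative):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

first_stage = DecisionTreeClassifier(max_depth=3, random_state=17)
stacker = LogisticRegression(random_state=17)

# split the train set in two halves: train_a and train_b
half = len(X_train) // 2
idx_a, idx_b = np.arange(half), np.arange(half, len(X_train))

# out-of-fold predictions: fit on one half, predict the other
oof = np.zeros(len(X_train))
for fit_idx, pred_idx in [(idx_a, idx_b), (idx_b, idx_a)]:
    first_stage.fit(X_train[fit_idx], y_train[fit_idx])
    oof[pred_idx] = first_stage.predict_proba(X_train[pred_idx])[:, 1]

# refit on the entire train set to get the test-set meta feature
first_stage.fit(X_train, y_train)
test_meta = first_stage.predict_proba(X_test)[:, 1]

# train the second-stage stacker on the first-stage probabilities
stacker.fit(oof.reshape(-1, 1), y_train)
test_pred = stacker.predict(test_meta.reshape(-1, 1))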

We will use only meta features and a single validation block; you can easily add the missing functionality if you need it.
Let's implement stacking based on the MajorityVoteClassifier class.
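Since the full implementation is not shown here, below is a simplified sketch of the idea (named SimpleStackingClassifier to avoid clashing with scikit-learn's own class): the base classifiers produce meta features and a meta-classifier combines them. For proper out-of-fold meta features, see the 2-fold recipe above.

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone

class SimpleStackingClassifier(BaseEstimator, ClassifierMixin):
    def __init__(self, classifiers, meta_classifier):
        self.classifiers = classifiers
        self.meta_classifier = meta_classifier

    def fit(self, X, y):
        # fit every base classifier on the training data
        self.clfs_ = [clone(clf).fit(X, y) for clf in self.classifiers]
        # meta features: class-1 probabilities of each base classifier
        # (in-sample here for brevity; out-of-fold predictions are safer)
        meta = self._meta_features(X)
        self.meta_clf_ = clone(self.meta_classifier).fit(meta, y)
        return self

    def _meta_features(self, X):
        return np.column_stack([clf.predict_proba(X)[:, 1]
                                for clf in self.clfs_])

    def predict(self, X):
        return self.meta_clf_.predict(self._meta_features(X))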

Blending is a term introduced by the Netflix Prize winners. It is very close to stacked generalization, but a bit simpler and with less risk of an information leak. Some researchers use "stacked ensembling" and "blending" interchangeably.

With blending, instead of creating out-of-fold predictions for the train set, you create a small holdout set of, say, 10% of the train set. The stacker model then trains on this holdout set only.
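A minimal blending sketch under the same assumptions as before (X_train, y_train, X_test defined, binary problem, illustrative models): the base models fit on 90% of the train set, and the stacker fits on their predictions for the 10% holdout.

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# carve out a 10% holdout from the train set
X_base, X_hold, y_base, y_hold = train_test_split(
    X_train, y_train, test_size=0.1, random_state=17)

base_models = [DecisionTreeClassifier(max_depth=3, random_state=17),
               LogisticRegression(random_state=17)]

hold_meta, test_meta = [], []
for model in base_models:
    model.fit(X_base, y_base)                            # 90% only
    hold_meta.append(model.predict_proba(X_hold)[:, 1])  # stacker's train data
    test_meta.append(model.predict_proba(X_test)[:, 1])

stacker = LogisticRegression(random_state=17)
stacker.fit(np.column_stack(hold_meta), y_hold)
blend_pred = stacker.predict(np.column_stack(test_meta))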

Blending has a few benefits:

It is simpler than stacking.

It guards against information leakage: the generalizers and the stacker use different data.

Ensemble methods combine different classification models to cancel out their individual weaknesses, which often results in stable and well-performing models that are very attractive for machine learning competitions and sometimes for industrial applications too.