Data Mining Algorithms In R/Classification/adaboost

Boosting is one of the most important developments in classification methodology. Boosting works by sequentially applying a classification algorithm to reweighted versions of the training data and then taking a weighted majority vote of the sequence of classifiers thus produced. For many classification algorithms, this simple strategy results in dramatic improvements in performance. This seemingly mysterious phenomenon can be understood in terms of well-known statistical principles, namely additive modeling and maximum likelihood. For the two-class problem, boosting can be viewed as an approximation to additive modeling on the logistic scale using maximum Bernoulli likelihood as a criterion.

While boosting has evolved somewhat over the years, we describe the most commonly used version of the AdaBoost procedure (Freund and Schapire, 1996), which we call Discrete AdaBoost. This is essentially the same as AdaBoost.M1 for binary data in Freund and Schapire. Here is a concise description of AdaBoost in the two-class classification setting. We have training data (x_1, y_1), ..., (x_n, y_n), with x_i a vector-valued feature and y_i = -1 or 1. We define F(x) = sum_{m=1}^{M} c_m f_m(x), where each f_m(x) is a classifier producing values of plus or minus 1 and the c_m are constants; the corresponding prediction is sign(F(x)). AdaBoost trains the classifiers f_m(x) on weighted versions of the training sample, giving higher weight to cases that are currently misclassified. This is done for a sequence of weighted samples, and the final classifier is defined to be a linear combination of the classifiers from each stage.
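The reweighting scheme just described can be sketched directly in R. The following is a minimal illustration of Discrete AdaBoost with decision stumps as the weak learners; it is not the implementation used by the ‘ada’ package, and all helper names here are hypothetical.

```r
# Minimal sketch of Discrete AdaBoost (hypothetical helpers, not the 'ada' package).
# y must be coded -1/+1; the weak learner is an exhaustive decision stump.
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (j in seq_len(ncol(x))) {
    for (s in unique(x[, j])) {
      for (sgn in c(1, -1)) {
        pred <- ifelse(x[, j] <= s, sgn, -sgn)
        err  <- sum(w[pred != y])
        if (err < best$err) best <- list(j = j, s = s, sgn = sgn, err = err)
      }
    }
  }
  best
}

predict_stump <- function(st, x) ifelse(x[, st$j] <= st$s, st$sgn, -st$sgn)

discrete_adaboost <- function(x, y, M = 25) {
  n <- nrow(x)
  w <- rep(1 / n, n)                      # start with uniform weights
  stumps <- vector("list", M)
  cs <- numeric(M)
  for (m in seq_len(M)) {
    st  <- fit_stump(x, y, w)
    err <- max(st$err, 1e-10)             # guard against log(0)
    cs[m] <- log((1 - err) / err)         # stage weight c_m
    miss <- predict_stump(st, x) != y
    w[miss] <- w[miss] * exp(cs[m])       # up-weight misclassified cases
    w <- w / sum(w)
    stumps[[m]] <- st
  }
  list(stumps = stumps, cs = cs)
}

# Final classifier: sign of the weighted vote F(x) = sum_m c_m f_m(x)
predict_adaboost <- function(fit, x) {
  F <- rowSums(mapply(function(st, c) c * predict_stump(st, x),
                      fit$stumps, fit$cs))
  sign(F)
}
```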

The ‘ada’ function takes the following arguments:

x: matrix of descriptors.
y: vector of responses. ‘y’ may have only two unique values.
test.x: testing matrix of descriptors (optional).
test.y: vector of testing responses (optional).
loss: loss="exponential", "ada", "e", or any variation corresponds to the default boosting under exponential loss; loss="logistic", "l2", "l" provides boosting under logistic loss.
type: type of boosting algorithm to perform. "discrete" performs Discrete Boost (default), "real" performs Real Boost, and "gentle" performs Gentle Boost.
iter: number of boosting iterations to perform. Default = 50.
nu: shrinkage parameter for boosting. Default = 1.
bag.frac: sampling fraction for samples taken out of bag. This allows one to use random permutation, which improves performance.
model.coef: flag to use stage weights in boosting. If FALSE, then the procedure corresponds to epsilon-boosting.
bag.shift: flag to determine whether the stage weights should go to one as nu goes to zero. This only makes sense if bag.frac is small. The rationale behind this parameter is discussed in (Culp et al., 2006).
max.iter: number of iterations to perform in the Newton step to determine the coefficient.
delta: tolerance for convergence of the Newton step to determine the coefficient.
verbose: print the number of iterations necessary for convergence of a coefficient.
formula: a symbolic description of the model to be fit.
data: an optional data frame containing the variables in the model.
subset: an optional vector specifying a subset of observations to be used in the fitting process.
na.action: a function that indicates how to process ‘NA’ values. Default = na.rpart.
...: arguments passed to rpart.control. For stumps, use rpart.control(maxdepth=1, cp=-1, minsplit=0, xval=0). maxdepth controls the depth of trees, and cp controls the complexity of trees. The priors should also be fixed through the parms argument, as discussed in the second reference.
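Assuming a training matrix x, response y, and optional test data are already in the workspace, a call combining several of these arguments might look as follows (the data objects, and passing the rpart.control result through the ‘control’ argument, are assumptions for illustration):

```r
library(ada)
library(rpart)

# Hypothetical call: Discrete AdaBoost with stumps (see the rpart.control note above)
fit <- ada(x, y, test.x = test.x, test.y = test.y,
           loss = "exponential", type = "discrete",
           iter = 50, nu = 1, bag.frac = 0.5,
           control = rpart.control(maxdepth = 1, cp = -1, minsplit = 0, xval = 0))
```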

Type the following commands to show the results from this algorithm:

summary(AdaObject)
varplot(VariableImportanceObject)

When using the form 'ada(x,y)': x data can take the form of a data.frame or as.matrix; y data can take the form of a data.frame, as.factor, as.matrix, as.array, or as.table. Missing values must be removed from the data prior to execution.

When using the form 'ada(y~.)': data must be in a data frame. The response can have factor or numeric values. Missing values can be present in the descriptor data whenever na.action is set to any option other than na.pass.

After the model is fit, 'ada' prints a summary of the function call, the method used for boosting, the number of iterations, the final confusion matrix (observed classification vs. predicted classification; labels for classes are the same as in the response), the error for the training set, and the testing, training, and kappa estimates of the appropriate number of iterations.

A summary of this information can also be obtained with the command ‘print(x)’. Corresponding functions (use help with summary.ada, predict.ada, . . ., varplot for additional information on these commands):

summary : function to print a summary of the original function call, the method used for boosting, the number of iterations, the final confusion matrix, accuracy, and the kappa statistic (a measure of agreement between the observed classification and the predicted classification). ‘summary’ can be used for training, testing, or validation data.

predict : function to predict the response for any data set (train, test, or validation)

plot : function to plot performance of the algorithm across boosting iterations. Default plot is iteration number (x-axis) versus prediction error (y-axis) for the data set used to build the model. Function can also simultaneously produce an error plot for an external test set and a kappa plot for training and test sets.

pairs : function to produce pairwise plots of descriptors. Descriptors are arranged by decreasing frequency of selection by boosting (upper left = most frequently chosen). The color of the marker in the plot represents class membership; the size of the marker represents the predicted class probability. The larger the marker, the higher the probability of classification.
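Assuming a fitted object ‘fit’ and test descriptors ‘test.x’ (hypothetical names), these functions might be used as follows; the argument names follow the package documentation, but check help(plot.ada) and related pages before relying on them:

```r
summary(fit)                          # call, confusion matrix, accuracy, kappa
pred <- predict(fit, test.x)          # predicted classes for new data
plot(fit, kappa = TRUE, test = TRUE)  # error and kappa across iterations
varplot(fit)                          # variable importance plot
```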

A data set that contains information about compounds used in drug discovery. Specifically, this data set consists of 5631 compounds on which an in-house solubility screen (ability of a compound to dissolve in a water/solvent mixture) was performed. Based on this screen, compounds were categorized as either insoluble (n=3493) or soluble (n=2138). Then, for each compound, 72 continuous, noisy structural descriptors were computed. Of these descriptors, one contained missing values for approximately 14% (n=787) of the observations. The objective of the analysis is to model the relationship between the structural descriptors and the solubility class. The data will be called soldat.
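A sketch of how such an analysis might be set up is shown below. It assumes the ‘soldat’ data frame ships with the ada package with the response ‘y’ in column 73; the split fractions, random seed, and tree depth are illustrative assumptions, not the original analysis settings.

```r
library(ada)
library(rpart)

data(soldat)                       # 5631 compounds, 72 descriptors + response y
n <- nrow(soldat)
set.seed(100)                      # arbitrary seed for reproducibility
ind <- sample(n)

# Assumed 50% train / 30% test / 20% validation partition
n.train <- floor(n * 0.5)
n.test  <- floor(n * 0.3)
train <- soldat[ind[1:n.train], ]
test  <- soldat[ind[(n.train + 1):(n.train + n.test)], ]
valid <- soldat[ind[(n.train + n.test + 1):n], ]

# Gentle AdaBoost with moderately deep trees (depth chosen for illustration)
con <- rpart.control(cp = -1, maxdepth = 14, xval = 0)
fit <- ada(y ~ ., data = train, type = "gentle", iter = 70, control = con,
           test.x = test[, -73], test.y = test[, 73])
fit <- addtest(fit, valid[, -73], valid[, 73])  # append validation accuracy
summary(fit)
varplot(fit)                       # descriptor importance
```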

Testing accuracy rates are printed in the order they are entered, so the accuracy on the testing set is 0.765 and on the validation set is 0.781. For this type of early drug discovery data, the Gentle AdaBoost algorithm performs adequately, with a test set accuracy of 76.5% (kappa approximately 0.5). In order to enhance our understanding of the relationship between descriptors and the response, the varplot function was employed.