Adaboosting is proven to be one of the most effective class prediction algorithms. It mainly consists of an ensemble simpler models (known as “weak learners”) that, although not very effective individually, are very performant combined.

The process by which these weak learners are combined is though more complex than simply averaging results. Very briefly, Adaboosting training process could be described as follows:

For each weak learner:

1) Train weak learner so that the weighted error sum of squares is minimised

2) Update weights, so that correctly classified cases have their weight reduced and misclassified cases have their weights increased.

3) Determine weak learner’s weight, i.e., the total contribution of the weak learner’s result to the overall score. This is known as alpha and is calculated as 0.5 * ln((1- error.rate)/error.rate))

As the weight is updated on each iteration, each weak learner will tend to focus more on those cases that were misclassified on previous instances.

Decision stumps as weak learners

The most common weak learner used in Adaboosting is known as Decision Stump and consists basically on a decision tree of depth 1, i.e., a model that returns an output based on a single condition, which could be summarised as “If (condition) then A else B”.

ada package in R

Although the implementation provides very good results in terms of model performance, “ada” package has two main problems:

It creates too big objects: Even with not so big datasets (around 500k x 50) that the final model object can be extremely big (5 or 6 Gb) and consequently too expensive to keep in memory. Of course, this is needed to perform any kind of prediction with the model. This is because the ada object is an ensemble of rpart objects, which holds a bunch of other information which is actually not needed when you just want to train an adaBoost model with Stumps and then predict with it.

Its execution time is too high: As both ada and rpart objects have a lot of other uses, they are not optimal for Stumps in terms of timings

Introducing adaStump

Considering that each stump consists just of a condition, two probability estimates (one when the condition is TRUE and the other when it’s FALSE) and a multiplier (alpha), I thought it could be much better, both in terms of size and timings, even just using R.

For that reason I developed adaStump, which is a package that takes advantage of ada’s excellent training process implementation, but creates a smaller output object and generates faster predictions.

An Example

Before running the example, you need to download the package from my Github repo. In order to do that, you need to have devtools installed

After the package is installed, let’s start with the example! In this case, the letters dataset is loaded, its variable names assigned and the variable isA is created, which gets a value 1 if true and 0 if false. That will be the class we will predict.

Now we are going to create two identical models. As AdaBoosting is usually ran with bagging, the bag.frac parameter is set to 1 so that all the cases are included. As what we want to demonstrate is that we can get the same results with much less memory usage and in less time, we need to ensure that both models are identical.

Taking a look at these numbers we can definitely say that the first problem is solved. Besides, the size of ada objects depend on the bag fraction, the sample size and the amount of iterations while adaStump objects only depend on the amount of iterations. There is almost no scenario where the size of the adaStump object can turn into a resource problem even on old personal computers. Now, let’s see if the new model can also predict quicker:

Indeed 🙂 . Let us check if the results are the same. Contrary to ada, adaStump prediction method only returns the probability estimate of the class outcome (i.e., the probability that the class variable is 1). Its complement can be easily calculated by doing 1 – the vector obtained.

Although there is some difference in some cases, it tends to be minimal. We can say now that we can produce almost the same outcome using 1% of the memory and more than 2.5 times faster. However, we can do better

Usually, AdaBoosting generates stumps based on the same condition. In this case, for example, we can see by accessing to the model frame (fit_with.adaStump$model), that the condition x2ybr >= 2.5 is repeated constantly. Of course, all these rows can be collapsed so that the condition is evaluated only once. For that purpose you can use pruneTree()

Although the difference between both methods is much greater in this case (there is a pre-calculation step where some decimals are discarded) it is still acceptable for most use cases. So, choosing to do the reduction or not is a decision based on the level of precision needed.