Hello again! We're up to the last lesson in the fourth class, Lesson 4.6 on Ensemble Learning.
In real life, when we have important decisions to make, we often choose to make them using a committee. Having different experts sitting down together, with different perspectives on the problem, and letting them vote, is often a very effective and robust way of making good decisions. The same is true in machine learning. We can often improve predictive performance by having a bunch of different machine learning methods, all producing classifiers for the same problem, and then letting them vote when it comes to classifying an unknown test instance.
One of the disadvantages is that this produces output that is hard to analyze. There are actually approaches that try to produce a single comprehensible structure, but we're not going to be looking at any of those. So the output will be hard to analyze, but you often get very good performance. It's a fairly recent technique in machine learning.
We're going to look at four methods, called "bagging", "randomization", "boosting", and "stacking". They're all implemented in Weka, of course.
With bagging, we want to produce several different decision structures. Let's say we use J48 to produce decision trees, then we want to produce slightly different decision trees. We can do that by having several different training sets of the same size. We can get those by sampling the original training set. In fact, in bagging, you sample the set "with replacement", which means that sometimes you might get two of the same [instances] chosen in your sample.
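To make sampling "with replacement" concrete, here is a minimal Python sketch. It isn't Weka code; the function name bootstrap_sample and the toy training set of ten numbered instances are made up for illustration.

```python
import random

def bootstrap_sample(instances, rng):
    """Draw len(instances) items from instances, with replacement."""
    return [rng.choice(instances) for _ in instances]

rng = random.Random(1)            # fixed seed so the run is repeatable
training_set = list(range(10))    # ten toy instances, identified by number

# Three training sets of the same size as the original.
bags = [bootstrap_sample(training_set, rng) for _ in range(3)]
for bag in bags:
    # Repeats show up as a distinct count below 10.
    print(sorted(bag), "distinct:", len(set(bag)))
```

Each bag has the same size as the original set, but because an instance can be drawn more than once, some originals are typically repeated and others are left out.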
We produce several different training sets, and then we build a model for each one -- let's say a decision tree -- using the same machine learning scheme each time, whether that's J48 or some other scheme. Then we combine the predictions of the different models by voting, or if it's a regression situation you would average the numeric results rather than voting on them.
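The combining step is simple enough to sketch in a few lines of Python. This assumes each model has already produced its prediction for one test instance; the helper names vote and average are illustrative only.

```python
from collections import Counter
from statistics import mean

def vote(predictions):
    """Classification: return the most common predicted class label."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Regression: return the mean of the numeric predictions."""
    return mean(predictions)

class_predictions = ["yes", "no", "yes"]    # one label per model
numeric_predictions = [2.0, 3.0, 4.0]       # one number per model

print(vote(class_predictions))
print(average(numeric_predictions))
```

In this sketch a tie in the vote is broken in favour of whichever label was seen first; a real implementation might break ties differently.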
This is very suitable for learning schemes that are called "unstable". Unstable learning schemes are ones where a small change in the training data can make a big change in the model. Decision trees are a really good example of this. You can get a decision tree and just make a tiny little change in the training data and get a completely different kind of decision tree. Whereas with NaiveBayes, if you think about how NaiveBayes works, little changes in the training set aren't going to make much difference to the result of NaiveBayes, so that's a "stable" machine learning method.
In Weka we have a "Bagging" classifier in the meta set. I'm going to choose meta > Bagging: here it is. We can choose here the bag size -- this is saying a bag size of 100%, which is going to sample the training set to get another set the same size, but it's going to sample "with replacement". That means we're going to get different sets of the same size every time we sample, but each set might contain repeats of the original training [instances]. Here we choose which classifier we want to bag, and we can choose the number of bagging iterations here, and a random-number seed. That's the bagging method.
The next one I want to talk about is "random forests". Here, instead of randomizing the training data, we randomize the algorithm. How you randomize the algorithm depends on what the algorithm is. Random forests are what you get when you do this with decision tree algorithms. Remember when we talked about how J48 works? -- it selects the best attribute for splitting on each time. You can randomize this procedure by not necessarily selecting the very best, but choosing a few of the best options, and randomly picking amongst them. That gives you different trees every time. Generally, if you randomize decision trees and then bag the result, you get better performance.
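Here is a small Python sketch of the randomized choice just described: rank the candidate attributes by some split-quality score, then pick at random among the top few. The function name pick_split_attribute and the scores are illustrative assumptions. Note too that many random forest implementations randomize in a slightly different way, by drawing a random subset of attributes and taking the best one in that subset.

```python
import random

def pick_split_attribute(scores, k, rng):
    """Pick at random among the k highest-scoring attributes.

    scores maps attribute name -> split quality (higher is better).
    """
    ranked = sorted(scores, key=scores.get, reverse=True)
    return rng.choice(ranked[:k])

# Illustrative split-quality scores for four attributes.
scores = {"outlook": 0.247, "humidity": 0.152,
          "windy": 0.048, "temperature": 0.029}

rng = random.Random(42)
# With k=2, each call returns either "outlook" or "humidity".
choices = [pick_split_attribute(scores, k=2, rng=rng) for _ in range(5)]
print(choices)
```

With k=1 this reduces to the usual deterministic choice of the single best attribute; larger k gives more varied trees.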
In Weka, we can look under "trees" classifiers for RandomForest. Again, that's got a bunch of parameters. There's the maximum depth of the trees produced -- I think 0 would be unlimited depth. There's the number of features we're going to use: in Weka's RandomForest, this is the number of randomly chosen attributes considered at each node. If we select, say, 4 features, then every time we decide on the test to put in the tree, we choose the best from among 4 randomly picked candidates. Then there's the number of trees we're going to produce, and so on. That's random forests.
Here's another kind of algorithm: it's called "boosting". It's iterative: new models are influenced by the performance of previously built models. Basically, the idea is that you create a model, and then you look at the instances that are misclassified by that model. These are the hard instances to classify, the ones it gets wrong. You put extra weight on those instances to make a training set for producing the next model in the iteration. This encourages the new model to become an "expert" for instances that were misclassified by the earlier models. The intuitive justification for this is that in a real-life committee, committee members should complement each other's expertise by focusing on different aspects of the problem. In the end, to combine them we use voting, but we actually weight models according to their performance. There's a scheme called AdaBoostM1, which is in Weka and is a standard boosting implementation -- it often produces excellent results. There are a few parameters to this as well, particularly the number of iterations.
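The reweighting idea can be sketched as a single step in Python. This follows the usual textbook form of an AdaBoost.M1-style update: scale down the weights of correctly classified instances by error / (1 - error), renormalize, and give the model a voting weight of log((1 - error) / error). It is not Weka's code, and the function name reweight and the toy data are made up for illustration.

```python
import math

def reweight(weights, correct):
    """One AdaBoost.M1-style update.

    weights: current instance weights.
    correct: one boolean per instance, True if the model got it right.
    Returns (new_weights, model_weight).
    """
    total = sum(weights)
    error = sum(w for w, ok in zip(weights, correct) if not ok) / total
    if error == 0 or error >= 0.5:
        raise ValueError("boosting stops: error must be between 0 and 0.5")
    factor = error / (1 - error)
    # Shrink the weights of correctly classified instances ...
    new = [w * factor if ok else w for w, ok in zip(weights, correct)]
    # ... then rescale so the total weight is unchanged.
    scale = total / sum(new)
    new = [w * scale for w in new]
    return new, math.log(1 / factor)

weights = [1.0] * 5
correct = [True, True, True, True, False]   # the model misses the last one
new_weights, model_weight = reweight(weights, correct)
print(new_weights, model_weight)
```

After the update, the one misclassified instance carries about half of the total weight, so the next model is pushed hard towards getting it right.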
The final ensemble learning method is called "stacking". Here we're going to have base learners, just like the learners we talked about previously. We're going to combine them not with voting, but by using a meta-learner, another learning scheme that combines the output of the base learners. We're going to call the base learners level-0 models, and the meta-learner is a level-1 model. The predictions of the base learners are input to the meta-learner. Typically you use different machine learning schemes as the base learners to get different experts that are good at different things. You need to be a little bit careful in the way you generate data to train the level-1 model: this involves quite a lot of cross-validation, and I won't go into that.
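To give a flavour of where the cross-validation comes in, here is a rough Python sketch of building the training data for the level-1 model. Each instance is held out in turn, the level-0 models are trained on the rest, and their predictions for the held-out instance become one row of level-1 data. The two toy learners (train_majority and train_nearest), the fold scheme and the data are all assumptions made for illustration, not Weka's implementation.

```python
from collections import Counter

def train_majority(train):
    """Level-0 learner: always predict the most common training label."""
    label = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: label

def train_nearest(train):
    """Level-0 learner: predict the label of the nearest training value."""
    return lambda x: min(train, key=lambda pair: abs(pair[0] - x))[1]

def level1_data(data, learners, folds):
    """Build level-1 rows from held-out level-0 predictions."""
    rows = []
    for f in range(folds):
        held_out = data[f::folds]    # instances with index % folds == f
        rest = [pair for i, pair in enumerate(data) if i % folds != f]
        models = [learn(rest) for learn in learners]
        for x, y in held_out:
            # Level-1 attributes are the level-0 predictions;
            # the class is the true label.
            rows.append(([m(x) for m in models], y))
    return rows

data = [(1.0, "a"), (1.2, "a"), (0.9, "a"),
        (3.0, "b"), (3.3, "b"), (2.9, "b")]
rows = level1_data(data, [train_majority, train_nearest], folds=3)
for predictions, label in rows:
    print(predictions, "->", label)
```

The level-1 learner would then be trained on these rows. The point of holding instances out is that the level-1 model sees how the base learners behave on data they were not trained on.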
In Weka, there's a meta classifier called "Stacking", as well as "StackingC" -- which is a more efficient version of Stacking. Here is Stacking; you can choose the meta-classifier here, and the number of stacking folds. We can choose different level-0 classifiers, and a different meta-classifier. In order to create multiple level-0 models, you need to specify a meta-classifier as the level-0 model. It gets a little bit complicated; you need to fiddle around with Weka to get that working.
That's it then. We've been talking about combining multiple models into ensembles, and the analogy is with committees of humans. Diversity helps, especially when learners are unstable. And we can create diversity in different ways. In bagging, we create diversity by resampling the training set. In random forests, we create diversity by choosing alternative branches to put in our decision trees. In boosting, we create diversity by focusing on where the existing model makes errors; and in stacking, we combine results from a bunch of different kinds of learner using another learner, instead of just voting.
There's a chapter in the course text on ensemble learning -- it's quite a large topic, really.
There's an activity that you should go and do before we proceed to the next class, the last class in this course. We'll learn about putting it all together, taking a more global view of the machine learning process.
We'll see you then. Bye for now!