WEBVTT
Kind: captions
Language: en
00:00:11.000 --> 00:00:12.120
Hello again!
00:00:12.120 --> 00:00:18.000
In real life, when we have important decisions
to make, we often choose to make them using
00:00:18.000 --> 00:00:19.180
a committee.
00:00:19.180 --> 00:00:24.720
Having different experts sitting down together,
with different perspectives on the problem,
00:00:24.720 --> 00:00:31.020
and letting them vote, is often a very effective
and robust way of making good decisions.
00:00:31.020 --> 00:00:33.880
The same is true in machine learning.
00:00:33.880 --> 00:00:39.170
We can often improve predictive performance
by having a bunch of different machine learning
00:00:39.170 --> 00:00:44.170
methods, all producing classifiers for the
same problem, and then letting them vote when
00:00:44.170 --> 00:00:47.590
it comes to classifying an unknown test instance.
00:00:47.590 --> 00:00:51.239
One of the disadvantages is that this produces
output that is hard to analyze.
00:00:51.239 --> 00:00:56.699
There are actually approaches that try and
produce a single comprehensible structure,
00:00:56.700 --> 00:01:00.300
but we’re not going to be looking at any
of those.
00:01:00.300 --> 00:01:03.760
So the output will be hard to analyze, but
you often get very good performance.
00:01:03.769 --> 00:01:08.210
It’s a fairly recent technique in machine
learning.
00:01:08.210 --> 00:01:14.490
We’re going to look at four methods, called
“bagging”, “randomization”, “boosting”,
00:01:14.490 --> 00:01:15.720
and “stacking”.
00:01:15.720 --> 00:01:18.980
They’re all implemented in Weka, of course.
00:01:18.980 --> 00:01:24.700
The idea with bagging is that we want to produce
several different decision structures.
00:01:24.700 --> 00:01:30.590
Let’s say we use J48 to produce decision
trees, then we want to produce slightly different
00:01:30.590 --> 00:01:31.590
decision trees.
00:01:31.590 --> 00:01:36.240
We can do that by having several different
training sets of the same size.
00:01:36.240 --> 00:01:41.440
We can get those by sampling the original
training set.
00:01:41.440 --> 00:01:46.780
In fact, in bagging, you sample the set “with
replacement”, which means that sometimes
00:01:46.780 --> 00:01:53.400
you might get two of the same instances
chosen in your sample.
00:01:53.400 --> 00:01:59.280
We produce several different training sets,
and then we build a model for each one – let’s
00:01:59.280 --> 00:02:03.250
say a decision tree – using the same machine
learning scheme, or using some other machine
00:02:03.250 --> 00:02:04.710
learning scheme.
00:02:04.710 --> 00:02:09.479
Then we combine the predictions of the different
models by voting, or if it’s a regression
00:02:09.479 --> 00:02:14.979
situation you would average the numeric result
rather than voting on it.
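
As an aside, here is a minimal, hypothetical sketch in plain Java (not part of Weka) of those two ways of combining an ensemble’s predictions – majority vote for a classification problem and averaging for a regression problem; the class and method names are made up for illustration.

    public class CombinePredictions {
        // Classification: each model votes for a class index; the majority wins.
        static int majorityVote(int[] votes, int numClasses) {
            int[] counts = new int[numClasses];
            for (int v : votes) counts[v]++;              // tally one vote per model
            int best = 0;
            for (int c = 1; c < numClasses; c++)
                if (counts[c] > counts[best]) best = c;
            return best;
        }

        // Regression: average the numeric predictions instead of voting.
        static double average(double[] predictions) {
            double sum = 0;
            for (double p : predictions) sum += p;
            return sum / predictions.length;
        }

        public static void main(String[] args) {
            System.out.println(majorityVote(new int[] {2, 0, 2, 1, 2}, 3)); // prints 2
            System.out.println(average(new double[] {1.0, 2.0, 3.0}));      // prints 2.0
        }
    }
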
00:02:14.980 --> 00:02:20.360
This is very suitable for learning schemes
that are called “unstable”.
00:02:20.370 --> 00:02:25.450
Unstable learning schemes are ones where a
small change in the training data can make
00:02:25.450 --> 00:02:27.510
a big change in the model.
00:02:27.510 --> 00:02:29.439
Decision trees are a really good example of
this.
00:02:29.439 --> 00:02:32.999
You can get a decision tree and just make
a tiny little change in the training data
00:02:32.999 --> 00:02:36.760
and get a completely different kind of decision
tree.
00:02:36.760 --> 00:02:42.319
Whereas with Naïve Bayes, if you think about
how Naïve Bayes works, little changes in
00:02:42.319 --> 00:02:46.359
the training set aren’t going to make much
difference to the result of Naïve Bayes,
00:02:46.359 --> 00:02:49.760
so that’s a “stable” machine learning
method.
00:02:49.760 --> 00:02:54.270
In Weka we have a “Bagging” classifier
in the “meta” category.
00:02:54.270 --> 00:03:03.700
I’m going to choose meta > Bagging: here
it is.
00:03:03.700 --> 00:03:08.530
We can choose here the bag size – this is
saying a bag size of 100%, which is going
00:03:08.530 --> 00:03:13.480
to sample the training set to get another
set the same size, but it’s going to sample
00:03:13.480 --> 00:03:15.230
“with replacement”.
00:03:15.230 --> 00:03:20.890
That means we’re going to get different
sets of the same size each time we sample,
00:03:20.890 --> 00:03:25.650
but each set might contain repeats of the
original training instances.
00:03:25.650 --> 00:03:30.840
Here we choose which classifier we want to
bag, and we can choose the number of bagging
00:03:30.849 --> 00:03:33.430
iterations here, and a random-number seed.
00:03:33.430 --> 00:03:35.209
That’s the bagging method.
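
If you would rather drive this from code than from the Explorer, here is a minimal sketch using Weka’s Java API; the dataset name glass.arff is just a placeholder, and the parameter values mirror the options discussed above.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.Bagging;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BaggingDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("glass.arff");   // placeholder dataset
            data.setClassIndex(data.numAttributes() - 1);

            Bagging bagger = new Bagging();
            bagger.setClassifier(new J48());      // the classifier we want to bag
            bagger.setBagSizePercent(100);        // sample a set of the same size
            bagger.setNumIterations(10);          // number of bagging iterations
            bagger.setSeed(1);                    // random-number seed

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(bagger, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }
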
00:03:35.209 --> 00:03:38.219
The next one I want to talk about is “random
forests”.
00:03:38.219 --> 00:03:42.310
Here, instead of randomizing the training
data, we randomize the algorithm.
00:03:42.310 --> 00:03:46.130
How you randomize the algorithm depends on
what the algorithm is.
00:03:46.130 --> 00:03:49.970
Random forests are what you get when the algorithm
you’re randomizing is a decision tree algorithm.
00:03:49.970 --> 00:03:56.300
Remember when we talked about how J48 works?
– it selects the best attribute for splitting
00:03:56.310 --> 00:03:57.459
on each time.
00:03:57.459 --> 00:04:03.239
You can randomize this procedure by not necessarily
selecting the very best, but choosing a few
00:04:03.239 --> 00:04:06.260
of the best options, and randomly picking
amongst them.
00:04:06.260 --> 00:04:08.560
That gives you different trees every time.
00:04:08.560 --> 00:04:17.130
Generally, if you randomize decision trees
and bag the result, you get
00:04:17.130 --> 00:04:20.190
better performance.
00:04:20.190 --> 00:04:27.650
In Weka, we can look under the “trees” classifiers
for RandomForest.
00:04:31.400 --> 00:04:34.160
Again, that’s got a bunch of parameters.
00:04:34.160 --> 00:04:39.190
The maximum depth of the trees produced – I
think 0 would be unlimited depth.
00:04:39.190 --> 00:04:40.930
The number of features we’re going to use.
00:04:40.930 --> 00:04:48.880
We might select, say, 4 features – every time
00:04:48.889 --> 00:04:56.750
we decide what decision to put in the tree,
we select it from among the top 4 candidates.
00:04:56.750 --> 00:04:59.030
The number of trees we’re going to produce,
and so on.
00:04:59.030 --> 00:05:00.759
That’s random forests.
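
Here is a corresponding sketch for RandomForest from the Java API; glass.arff is again a placeholder, and the setter names assume a recent Weka release (older versions call the number-of-trees option setNumTrees, for example).

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.RandomForest;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class RandomForestDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("glass.arff");   // placeholder dataset
            data.setClassIndex(data.numAttributes() - 1);

            RandomForest forest = new RandomForest();
            forest.setMaxDepth(0);        // 0 = unlimited depth
            forest.setNumFeatures(4);     // pick each split from the top 4 candidates
            forest.setNumIterations(100); // number of trees to build

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(forest, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }
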
00:05:00.759 --> 00:05:05.080
Here’s another kind of algorithm: it’s
called “boosting”.
00:05:05.080 --> 00:05:11.530
It’s iterative: new models are influenced
by the performance of previously built models.
00:05:11.530 --> 00:05:16.720
Basically, the idea is that you create a model,
and then you look at the instances that are
00:05:16.720 --> 00:05:18.470
misclassified by that model.
00:05:18.470 --> 00:05:22.449
These are the hard instances to classify,
the ones it gets wrong.
00:05:22.449 --> 00:05:29.550
You put extra weight on those instances to
make a training set for producing the next
00:05:29.550 --> 00:05:31.629
model in the iteration.
00:05:31.629 --> 00:05:37.510
This encourages the new model to become an
“expert” for instances that were misclassified
00:05:37.510 --> 00:05:39.460
by all the earlier models.
00:05:39.460 --> 00:05:44.380
The intuitive justification for this is that
in a real life committee, committee members
00:05:44.380 --> 00:05:50.139
should complement each other’s expertise
by focusing on different aspects of the problem.
00:05:50.139 --> 00:05:55.560
In the end, to combine them we use voting,
but we actually weight models according to
00:05:55.560 --> 00:05:56.960
their performance.
00:05:56.960 --> 00:06:06.099
There’s a very good scheme called AdaBoostM1,
which is included in Weka; it’s a standard
00:06:06.099 --> 00:06:12.069
boosting implementation that often
produces excellent results.
00:06:12.069 --> 00:06:19.240
There are a few parameters to this as well,
in particular the number of iterations.
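
A minimal sketch of AdaBoostM1 from Weka’s Java API looks much the same; glass.arff is a placeholder, and DecisionStump is simply Weka’s default (weak) base classifier for boosting.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.meta.AdaBoostM1;
    import weka.classifiers.trees.DecisionStump;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BoostingDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("glass.arff");   // placeholder dataset
            data.setClassIndex(data.numAttributes() - 1);

            AdaBoostM1 booster = new AdaBoostM1();
            booster.setClassifier(new DecisionStump()); // weak base learner to boost
            booster.setNumIterations(10);               // the key parameter: iterations

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(booster, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }
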
00:06:19.240 --> 00:06:23.240
The final ensemble learning method is called
“stacking”.
00:06:23.240 --> 00:06:27.960
Here we’re going to have base learners,
just like the learners we talked about previously.
00:06:27.960 --> 00:06:33.360
We’re going to combine them not with voting,
but by using a meta-learner, another learner
00:06:33.360 --> 00:06:36.849
scheme that combines the output of the base
learners.
00:06:36.849 --> 00:06:42.869
We’re going to call the base learners level-0
models, and the meta-learner is a level-1 model.
00:06:42.869 --> 00:06:48.240
The predictions of the base learners are input
to the meta-learner.
00:06:48.240 --> 00:06:52.460
Typically you use different machine learning
schemes as the base learners to get different
00:06:52.460 --> 00:06:55.740
experts that are good at different things.
00:06:55.740 --> 00:07:00.879
You need to be a little bit careful in the
way you generate data to train the level-1
00:07:00.879 --> 00:07:05.490
model: this involves quite a lot of cross-validation,
but I won’t go into that here.
00:07:05.490 --> 00:07:14.810
In Weka, there’s a meta classifier called
“Stacking”, as well as “StackingC”
00:07:14.810 --> 00:07:18.510
– which is a more efficient version of Stacking.
00:07:18.510 --> 00:07:28.580
Here is Stacking; you can choose different
meta-classifiers here, and the number of stacking folds.
00:07:28.580 --> 00:07:35.140
We can choose different classifiers; different
level-0 classifiers, and a different meta-classifier.
00:07:35.140 --> 00:07:41.140
In order to create multiple level-0 models,
you need to specify a meta-classifier as the
00:07:41.150 --> 00:07:43.340
level-0 model.
00:07:43.340 --> 00:07:47.349
It gets a little bit complicated; you need
to fiddle around with Weka to get that working.
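
For completeness, here is a sketch of Stacking from Weka’s Java API with three level-0 classifiers and Logistic as the level-1 meta-classifier; glass.arff and the particular choice of classifiers are just placeholders for illustration.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.functions.Logistic;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.meta.Stacking;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class StackingDemo {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("glass.arff");   // placeholder dataset
            data.setClassIndex(data.numAttributes() - 1);

            Stacking stacker = new Stacking();
            // Level-0 models: different schemes that are good at different things.
            stacker.setClassifiers(new Classifier[] { new J48(), new NaiveBayes(), new IBk() });
            // Level-1 model: the meta-learner that combines their predictions.
            stacker.setMetaClassifier(new Logistic());
            stacker.setNumFolds(10);   // cross-validation folds for level-1 training data

            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(stacker, data, 10, new Random(1));
            System.out.println(eval.toSummaryString());
        }
    }
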
00:07:47.349 --> 00:07:48.669
That’s it then.
00:07:48.680 --> 00:07:54.020
We’ve been talking about combining multiple
models into ensembles
00:07:54.020 --> 00:07:57.539
for learning, and the analogy is with committees
of humans.
00:07:57.539 --> 00:08:01.539
Diversity helps, especially when learners
are unstable.
00:08:01.540 --> 00:08:03.620
And we can create diversity in different ways.
00:08:03.629 --> 00:08:08.069
In bagging, we create diversity by resampling
the training set.
00:08:08.069 --> 00:08:13.780
In random forests, we create diversity by
choosing alternative branches to put in our
00:08:13.780 --> 00:08:15.199
decision trees.
00:08:15.199 --> 00:08:20.839
In boosting, we create diversity by focusing
on where the existing model makes errors;
00:08:20.840 --> 00:08:25.780
and in stacking, we combine results from a
bunch of different kinds of learner using
00:08:25.780 --> 00:08:27.920
another learner, instead of just voting.