1
00:00:17,199 --> 00:00:24,199
Hi! Here in Lesson 3.4, we're continuing our
exploration of simple classifiers by looking
2
00:00:24,230 --> 00:00:29,219
at classifiers that produce decision trees.
3
00:00:29,219 --> 00:00:33,269
We're going to look at J48.
4
00:00:33,269 --> 00:00:35,920
We've used this classifier quite a bit so far.
5
00:00:35,920 --> 00:00:38,399
Let's have a look at how it works inside.
6
00:00:38,399 --> 00:00:45,399
J48 is based on a top-down strategy, a recursive
divide and conquer strategy.
7
00:00:46,479 --> 00:00:51,680
You select which attribute to split on at
the root node, and then you create a branch
8
00:00:51,680 --> 00:00:57,960
for each possible attribute value, and that
splits the instances into subsets, one for
9
00:00:57,960 --> 00:01:01,589
each branch that extends from the root node.
10
00:01:01,589 --> 00:01:06,210
Then you repeat the procedure recursively
for each branch, selecting an attribute at each node,
11
00:01:06,210 --> 00:01:12,180
and you use only instances that
reach that branch to make the selection.
12
00:01:12,180 --> 00:01:19,180
At the end you stop, perhaps when all the
instances at a node have the same class.
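In Java-like form, the recursion looks something like this. This is a minimal sketch, not Weka's actual code: Node, Attribute, and the helpers named here (allSameClass, majorityClass, bestAttribute, subset, without) are hypothetical.

```java
// Minimal sketch of recursive divide-and-conquer tree induction.
// Hypothetical types and helpers; not Weka's actual implementation.
Node buildTree(List<Instance> instances, List<Attribute> candidates) {
    // Stop when the node is pure (all one class) or no attributes remain.
    if (allSameClass(instances) || candidates.isEmpty()) {
        return Node.leaf(majorityClass(instances));
    }
    // Select which attribute to split on, e.g. by information gain.
    Attribute best = bestAttribute(candidates, instances);
    Node node = Node.internal(best);
    for (String value : best.values()) {
        // One branch per attribute value, built from only the
        // instances that reach that branch.
        List<Instance> reached = subset(instances, best, value);
        node.addBranch(value, buildTree(reached, without(candidates, best)));
    }
    return node;
}
```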
13
00:01:19,439 --> 00:01:26,439
The question is: how do you
select a good attribute for the root node?
14
00:01:29,490 --> 00:01:36,119
This is the weather data, and as you can see,
outlook has been selected for the root node.
15
00:01:37,140 --> 00:01:42,899
Here are the four possibilities: outlook,
windy, humidity, and temperature.
16
00:01:42,899 --> 00:01:46,819
These are the consequences of splitting on
each of these attributes.
17
00:01:48,000 --> 00:01:53,819
What we're really looking for is a pure split,
a split into pure nodes.
18
00:01:54,709 --> 00:02:00,709
We would be delighted if we found an attribute
that split exactly into one node where they
19
00:02:00,709 --> 00:02:04,749
are all yeses, another node where they
are all nos, and perhaps a third node where
20
00:02:04,749 --> 00:02:05,779
they are all yeses again.
21
00:02:05,779 --> 00:02:06,990
That would be the best thing.
22
00:02:06,990 --> 00:02:12,340
What we don't want is mixtures, because when
we get mixtures of yeses and nos at a node,
23
00:02:12,340 --> 00:02:14,740
then we've got to split again.
24
00:02:14,740 --> 00:02:17,040
You can see that splitting on outlook looks
pretty good.
25
00:02:17,040 --> 00:02:24,040
We get one branch with two yeses and three
nos, then we get a pure yes branch for overcast,
26
00:02:24,209 --> 00:02:29,950
and, when outlook is rainy, we get three yeses
and two nos.
27
00:02:29,950 --> 00:02:35,120
How are we going to quantify this to decide
which one of these attributes produces the
28
00:02:35,120 --> 00:02:42,120
purest nodes? We're on a quest here for purity.
29
00:02:42,269 --> 00:02:51,110
The aim is to get the smallest tree, and top-down
tree induction methods use some kind of heuristic.
30
00:02:51,110 --> 00:02:58,110
The most popular heuristic to produce pure
nodes is an information theory-based heuristic.
31
00:03:00,269 --> 00:03:04,030
I'm not going to explain information theory
to you, that would be another MOOC of its
32
00:03:04,030 --> 00:03:06,939
own -- quite an interesting one, actually.
33
00:03:06,939 --> 00:03:12,909
Information theory was founded by Claude Shannon,
an American mathematician and scientist who
34
00:03:12,909 --> 00:03:14,880
died in 2001.
35
00:03:14,880 --> 00:03:16,439
He was an amazing guy.
36
00:03:16,439 --> 00:03:17,680
He did some amazing things.
37
00:03:17,680 --> 00:03:23,000
One of the most amazing things, I think, is
that he could ride a unicycle and juggle clubs
38
00:03:23,000 --> 00:03:26,799
at the same time when he was in his 80s.
39
00:03:26,799 --> 00:03:29,829
That's pretty impressive.
40
00:03:29,829 --> 00:03:35,519
He came up with the whole idea of information theory
and quantifying entropy, which measures information
41
00:03:35,519 --> 00:03:37,019
in bits.
42
00:03:37,019 --> 00:03:44,019
This is the formula for entropy: minus the sum of
p log p for each of the possible outcomes.
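Written out with base-2 logarithms, so that the answer comes out in bits:

```latex
\mathrm{entropy}(p_1, p_2, \ldots, p_n)
  = -p_1 \log_2 p_1 - p_2 \log_2 p_2 - \cdots - p_n \log_2 p_n
```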
43
00:03:44,659 --> 00:03:47,299
I'm not really going to explain it to you.
44
00:03:47,299 --> 00:03:51,299
All of those minus signs are there because
logarithms are negative if numbers are less
45
00:03:51,299 --> 00:03:53,930
than 1, and probabilities are always less than 1.
46
00:03:53,930 --> 00:03:56,709
So, the entropy comes out to be a positive
number.
47
00:03:57,460 --> 00:03:59,939
What we do is we look at the information gain.
48
00:03:59,939 --> 00:04:07,170
How much information in bits do you gain by
knowing the value of an attribute? That is,
49
00:04:07,170 --> 00:04:12,180
the entropy of the distribution before the
split minus the entropy of the distribution
50
00:04:12,180 --> 00:04:14,359
after the split.
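As a quick check of the arithmetic, here is a small self-contained Java program that reproduces the gain figure for outlook quoted below, using the class counts mentioned earlier (2 yes/3 no for sunny, 4 yes for overcast, 3 yes/2 no for rainy, hence 9 yes/5 no overall):

```java
// Verifies gain(outlook) = entropy before the split minus the
// weighted average entropy of the three subsets after the split.
public class InfoGain {
    static double log2(double x) { return Math.log(x) / Math.log(2); }

    // Entropy in bits of a class distribution given as counts.
    static double entropy(double... counts) {
        double total = 0, e = 0;
        for (double c : counts) total += c;
        for (double c : counts)
            if (c > 0) e -= (c / total) * log2(c / total);
        return e;
    }

    public static void main(String[] args) {
        double before = entropy(9, 5);              // 0.940 bits: 9 yes, 5 no
        double after = (5.0 / 14) * entropy(2, 3)   // sunny: 2 yes, 3 no
                     + (4.0 / 14) * entropy(4, 0)   // overcast: pure yes
                     + (5.0 / 14) * entropy(3, 2);  // rainy: 3 yes, 2 no
        System.out.printf("gain(outlook) = %.3f bits%n", before - after); // 0.247
    }
}
```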
51
00:04:14,359 --> 00:04:17,480
Here's how it works out for the weather data.
52
00:04:18,440 --> 00:04:19,620
These are the number of bits.
53
00:04:19,620 --> 00:04:24,150
If you split on outlook, you gain 0.247 bits.
54
00:04:24,150 --> 00:04:28,750
I know you might be surprised to see fractional
numbers of bits; normally we think of 1 bit
55
00:04:28,750 --> 00:04:34,690
or 8 bits or 32 bits, but information theory
shows how you can regard bits as fractions.
56
00:04:34,690 --> 00:04:36,330
These gains come out as fractional numbers of bits.
57
00:04:36,330 --> 00:04:38,820
I don't want to go into the details.
58
00:04:38,820 --> 00:04:46,310
You can see, knowing the value for windy gives
you only 0.048 bits of information.
59
00:04:46,310 --> 00:04:53,310
Humidity is quite a bit better; temperature
is way down there at 0.029 bits.
60
00:04:53,310 --> 00:04:58,630
We're going to choose the attribute that gains
the most bits of information, and that, in
61
00:04:58,630 --> 00:05:00,460
this case, is outlook.
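Weka can report these same per-attribute gains through its attribute-selection API. A minimal sketch, assuming weather.nominal.arff is in the working directory:

```java
import weka.attributeSelection.InfoGainAttributeEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class RankAttributes {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);  // class is the last attribute (play)
        InfoGainAttributeEval eval = new InfoGainAttributeEval();
        eval.buildEvaluator(data);
        // Prints outlook 0.247, temperature 0.029, humidity 0.152, windy 0.048.
        for (int i = 0; i < data.numAttributes() - 1; i++)
            System.out.printf("%-12s %.3f bits%n",
                    data.attribute(i).name(), eval.evaluateAttribute(i));
    }
}
```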
62
00:05:00,460 --> 00:05:05,610
At the top level of this tree, the root node,
we're going to split on outlook.
63
00:05:05,610 --> 00:05:11,000
Having decided to split on outlook, we need
to look at each of the 3 branches that emanate
64
00:05:11,000 --> 00:05:16,250
from outlook corresponding to the 3 possible
values of outlook, and consider what to do
65
00:05:16,250 --> 00:05:17,600
at each of those branches.
66
00:05:17,600 --> 00:05:22,710
At the first branch, we might split on temperature,
windy or humidity.
67
00:05:22,710 --> 00:05:25,980
We're not going to split on outlook again
because we know that outlook is sunny.
68
00:05:25,980 --> 00:05:30,020
For all instances that reach this place, the
outlook is sunny.
69
00:05:30,020 --> 00:05:32,950
For each of those 3 attributes, we do exactly the
same thing.
70
00:05:32,950 --> 00:05:38,750
We evaluate the information gain for temperature
at that point, for windy and humidity, and
71
00:05:38,750 --> 00:05:39,640
we choose the best.
72
00:05:39,640 --> 00:05:44,230
In this case, it's humidity with a gain of
0.971 bits.
73
00:05:44,230 --> 00:05:50,710
You can see that, if we branch on humidity,
then we get pure nodes: 3 nos in one and 2
74
00:05:50,710 --> 00:05:52,010
yeses in the other.
75
00:05:52,010 --> 00:05:54,120
When we get that, we don't need to split anymore.
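The 0.971 figure can be confirmed with the entropy helper from the earlier sketch, applied to just the 5 sunny instances (a fragment, not a complete program):

```java
// Gain of humidity among the 5 sunny instances (2 yes, 3 no).
double gainHumidity = InfoGain.entropy(2, 3)        // 0.971 bits before the split
        - (3.0 / 5) * InfoGain.entropy(0, 3)        // humidity = high: 3 no, pure
        - (2.0 / 5) * InfoGain.entropy(2, 0);       // humidity = normal: 2 yes, pure
// Both subsets are pure, so the split removes all 0.971 bits of uncertainty.
```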
76
00:05:54,850 --> 00:06:00,800
We're on a quest for purity.
77
00:06:00,800 --> 00:06:01,520
That's how it works.
78
00:06:01,520 --> 00:06:05,990
It just carries on until it reaches the end,
until it has pure nodes.
79
00:06:05,990 --> 00:06:11,860
Let's open up Weka, and just do this with
the nominal weather data.
80
00:06:11,860 --> 00:06:15,520
Of course, we've done this before, but I'll
just do it again.
81
00:06:15,520 --> 00:06:17,650
It won't take long.
82
00:06:18,140 --> 00:06:22,640
J48 is the workhorse data mining algorithm.
83
00:06:22,640 --> 00:06:23,910
There's the data.
84
00:06:23,910 --> 00:06:26,430
We're going to choose J48.
85
00:06:26,430 --> 00:06:28,500
It's a tree classifier.
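For reference, the same run can be scripted against the Weka API instead of clicking through the Explorer. A minimal sketch, again assuming weather.nominal.arff is in the working directory:

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class BuildJ48 {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("weather.nominal.arff");
        data.setClassIndex(data.numAttributes() - 1);  // "play" is the class
        J48 tree = new J48();                          // default settings, as in the Explorer
        tree.buildClassifier(data);
        System.out.println(tree);                      // textual form of the tree
    }
}
```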
86
00:06:29,840 --> 00:06:38,110
We're going to run this, and we get a tree
-- the very tree I showed you before -- split
87
00:06:38,110 --> 00:06:42,150
first on outlook: sunny, overcast and rainy.
88
00:06:42,150 --> 00:06:47,360
Then, if it's sunny, split on humidity: 3 no
instances reach the high node.
89
00:06:47,360 --> 00:06:51,610
Then, for normal, 2 yes instances reach that node,
and so on.
90
00:06:51,610 --> 00:06:58,610
We can look at the tree using "Visualize tree"
in the right-click menu.
91
00:06:58,990 --> 00:07:03,060
Here it is.
92
00:07:03,060 --> 00:07:08,140
These are the number of yes instances that
reach this node and the number of no instances.
93
00:07:08,140 --> 00:07:12,900
In the case of this particular tree, remember
that we're using cross-validation here.
94
00:07:12,900 --> 00:07:16,250
It's done an 11th run on the whole dataset.
95
00:07:16,250 --> 00:07:19,750
That final run over the training set is what
has given us these numbers.
96
00:07:19,750 --> 00:07:24,520
In fact, this becomes a pure node here.
97
00:07:24,520 --> 00:07:29,950
Sometimes you get 2 numbers here -- 3/2 or 3/1.
98
00:07:29,950 --> 00:07:35,670
The first number indicates the total number of
instances that reach that node, so in this case
99
00:07:35,670 --> 00:07:36,980
the number of nos.
100
00:07:36,980 --> 00:07:41,090
If there were another number following the
3, it would indicate the number of yeses,
101
00:07:41,090 --> 00:07:43,460
that is, incorrect things that reach that
node.
102
00:07:43,460 --> 00:07:47,850
But that doesn't occur in this very simple
situation.
103
00:07:50,740 --> 00:07:55,560
There you have it, J48: top-down induction
of decision trees.
104
00:07:56,490 --> 00:07:59,360
It's soundly based in information theory.
105
00:07:59,360 --> 00:08:02,230
It's a pretty good data mining algorithm.
106
00:08:02,230 --> 00:08:08,590
10 years ago I might have said it's the best
data mining algorithm, but some even better
107
00:08:08,590 --> 00:08:11,280
ones, I think, have been produced since then.
108
00:08:11,280 --> 00:08:18,200
However, the real advantage of J48 is that
it's reliable and robust, and, most importantly,
109
00:08:18,200 --> 00:08:20,980
it produces a tree that people can understand.
110
00:08:20,980 --> 00:08:23,850
It's very easy to understand the output of J48.
111
00:08:23,850 --> 00:08:28,090
That's really important when you're applying
data mining.
112
00:08:28,090 --> 00:08:31,430
There are a lot of different criteria you
could use for attribute selection.
113
00:08:31,430 --> 00:08:33,459
Here we're using information gain.
114
00:08:33,459 --> 00:08:38,000
Actually, in practice, these don't normally
make a huge difference.
115
00:08:38,000 --> 00:08:42,269
There are some important modifications that
need to be made to this algorithm for it to be useful
116
00:08:42,269 --> 00:08:42,820
in practice.
117
00:08:42,820 --> 00:08:45,310
I've only really explained the basic principles.
118
00:08:45,310 --> 00:08:51,590
The actual J48 incorporates some more complex
stuff to make it work under different circumstances
119
00:08:51,590 --> 00:08:52,300
in practice.
120
00:08:52,300 --> 00:08:56,370
We'll talk about those in the next lesson.
121
00:08:56,370 --> 00:09:02,580
Section 4.3 of the text, "Divide-and-conquer:
Constructing decision trees", explains the simple
122
00:09:02,580 --> 00:09:05,360
version of J48 that I've described here.
123
00:09:06,100 --> 00:09:09,590
Now you should go and do the activity associated
with this lesson.
124
00:09:09,590 --> 00:09:12,220
Good luck! See you next time!