1
00:00:15,920 --> 00:00:22,260
Hello again! In this lesson we're going to
look at an important new concept called baseline
2
00:00:22,260 --> 00:00:30,610
accuracy. We're going to actually use a new
dataset, the diabetes dataset.
3
00:00:30,610 --> 00:00:37,260
I've got Weka here, and I'm going to open
diabetes.arff.
4
00:00:37,260 --> 00:00:38,600
There it is.
5
00:00:39,000 --> 00:00:40,700
Have a quick look at this dataset.
6
00:00:40,700 --> 00:00:47,700
The class is tested_negative or tested_positive
for diabetes.
7
00:00:48,990 --> 00:00:55,670
We've got attributes like preg, which I think
has to do with the number of times they've
8
00:00:55,670 --> 00:00:59,020
been pregnant; age, which is the age.
9
00:00:59,020 --> 00:01:06,020
Of course, we can learn more about this dataset
by looking at the ARFF file itself.
10
00:01:07,590 --> 00:01:10,990
Here is the diabetes dataset.
11
00:01:10,990 --> 00:01:14,020
You can see it's about diabetes in Pima Indians.
12
00:01:17,500 --> 00:01:19,510
There's a lot of information here.
13
00:01:19,510 --> 00:01:26,510
The attributes: number of times pregnant,
plasma glucose concentration, and so on.
14
00:01:26,510 --> 00:01:29,830
Diabetes pedigree function.
15
00:01:31,070 --> 00:01:35,070
I'm going to use percentage split.
16
00:01:36,140 --> 00:01:38,970
I'm going to try a few different classifiers.
17
00:01:38,970 --> 00:01:45,970
Let's look at J48 first, our old friend J48.
18
00:01:50,380 --> 00:01:55,010
We get 76% with J48.
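By the way, if you'd rather script this than click through the Explorer, the same experiment can be run with Weka's Java API. Here's a minimal sketch, assuming weka.jar is on the classpath and diabetes.arff is in the working directory; the 66% split and random seed 1 mirror the Explorer's percentage-split defaults.

    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class DiabetesJ48 {
        public static void main(String[] args) throws Exception {
            // Load the dataset; the class is the last attribute.
            Instances data = DataSource.read("diabetes.arff");
            data.setClassIndex(data.numAttributes() - 1);

            // Percentage split: shuffle, train on 66%, test on the rest.
            data.randomize(new Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

            J48 tree = new J48();
            tree.buildClassifier(train);

            Evaluation eval = new Evaluation(train);
            eval.evaluateModel(tree, test);
            System.out.printf("J48 accuracy: %.1f%%%n", eval.pctCorrect());
        }
    }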
19
00:01:55,600 --> 00:01:57,280
I'm going to look at some other classifiers.
20
00:01:57,280 --> 00:02:01,120
You'll learn about these classifiers later on
in this course, but right now we're just going
21
00:02:01,120 --> 00:02:02,600
to look at a few.
22
00:02:02,600 --> 00:02:09,300
Let's look at the NaiveBayes classifier in
the bayes category, and run that.
23
00:02:09,300 --> 00:02:15,640
Here we get 77%, a little bit better, but
probably not significant.
24
00:02:15,640 --> 00:02:20,640
Let's choose in the lazy category IBk.
25
00:02:20,640 --> 00:02:25,170
Again, we'll learn about this later on.
26
00:02:25,170 --> 00:02:29,220
Here we get 73%, quite a bit worse.
27
00:02:29,220 --> 00:02:36,220
We'll try one final one, PART, in the rules
category, which produces partial decision rules.
28
00:02:40,110 --> 00:02:43,060
Here we get 74%.
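In the API, trying alternative schemes just means swapping one object for another. Here's a sketch that reuses the train and test Instances from the earlier snippet and runs the three alternatives on the same split, so the numbers are comparable.

    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.rules.PART;

    // Evaluate each scheme on the same train/test split.
    Classifier[] schemes = { new NaiveBayes(), new IBk(), new PART() };
    for (Classifier c : schemes) {
        c.buildClassifier(train);
        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(c, test);
        System.out.printf("%s: %.1f%%%n",
            c.getClass().getSimpleName(), eval.pctCorrect());
    }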
29
00:02:43,060 --> 00:02:47,840
We'll learn about these classifiers later,
but they are just different classifiers, alternatives
30
00:02:47,840 --> 00:02:49,750
to J48.
31
00:02:49,750 --> 00:02:54,520
You can see that J48 and NaiveBayes are pretty
good, probably about the same.
32
00:02:54,520 --> 00:02:57,370
The 1% difference between them probably isn't
significant.
33
00:02:57,370 --> 00:03:00,110
IBk and PART are probably about the same performance.
34
00:03:00,110 --> 00:03:01,610
Again, 1% between them.
35
00:03:01,610 --> 00:03:06,720
There is a fair gap, I guess, between those
bottom two and the top two, which probably
36
00:03:06,720 --> 00:03:07,760
is significant.
37
00:03:08,890 --> 00:03:10,900
I'd like to think about these figures.
38
00:03:10,900 --> 00:03:15,590
76%: is it good to get 76% accuracy?
39
00:03:15,590 --> 00:03:21,670
If we go back and look at the class of this dataset,
40
00:03:21,670 --> 00:03:28,720
we see that there are 500 negative instances
and 268 positive instances.
41
00:03:28,720 --> 00:03:35,720
If you had to guess, you'd guess it would
be negative, and you'd be right 500/768
42
00:03:35,720 --> 00:03:38,800
(the sum of these two things, the total number
of instances).
43
00:03:39,000 --> 00:03:41,390
You'd be right that fraction of the time.
44
00:03:41,390 --> 00:03:48,390
500/768 if you always guess negative, and
that works out to 65%.
45
00:03:48,950 --> 00:04:00,670
Actually, there's a rules classifier called
ZeroR, which does exactly that.
46
00:04:00,670 --> 00:04:07,670
The ZeroR classifier just looks for the most
popular class and guesses that all the time.
47
00:04:08,420 --> 00:04:15,420
If I run this on the training set, that will
give us the exact same number, 500/768,
48
00:04:16,300 --> 00:04:17,330
which is 65%.
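That baseline is easy to check in code as well. Here's a sketch using the API: ZeroR evaluated on its own training data reproduces the 500/768 = 65.1% figure.

    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.ZeroR;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    Instances data = DataSource.read("diabetes.arff");
    data.setClassIndex(data.numAttributes() - 1);

    // ZeroR ignores all attributes and always predicts the majority class.
    ZeroR baseline = new ZeroR();
    baseline.buildClassifier(data);

    // Evaluating on the training set is acceptable here, since ZeroR
    // barely uses the training data: 500 correct out of 768 = 65.1%.
    Evaluation eval = new Evaluation(data);
    eval.evaluateModel(baseline, data);
    System.out.printf("Baseline accuracy: %.1f%%%n", eval.pctCorrect());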
49
00:04:19,470 --> 00:04:23,830
It's a very, very simple, kind of trivial
classifier that always just guesses the most
50
00:04:23,830 --> 00:04:25,650
popular class.
51
00:04:25,650 --> 00:04:29,680
It's OK to evaluate it on the training set,
because it's hardly using the training set
52
00:04:29,680 --> 00:04:32,120
at all to form the classifier.
53
00:04:32,120 --> 00:04:37,240
That's what we would call the baseline.
54
00:04:37,240 --> 00:04:43,540
The baseline gives 65% accuracy, and J48 gives
76% accuracy.
55
00:04:43,540 --> 00:04:47,830
It's significantly above the baseline, but
not all that far above it.
56
00:04:47,830 --> 00:04:52,990
When you're looking at these figures, it's always
good to consider what the very simplest kind of classifier,
57
00:04:52,990 --> 00:04:56,240
the baseline classifier, would get you.
58
00:04:56,240 --> 00:05:01,350
Sometimes the baseline might give you the
best results.
59
00:05:01,350 --> 00:05:03,110
I'm going to open a dataset here.
60
00:05:03,110 --> 00:05:05,050
We're not going to discuss this dataset.
61
00:05:05,050 --> 00:05:11,660
It's a bit of a strange dataset, not really
designed for this kind of classification.
62
00:05:11,660 --> 00:05:12,940
It's called supermarket.
63
00:05:12,940 --> 00:05:18,630
I'm going to open supermarket, and without
even looking at it, I'm just going to apply
64
00:05:18,630 --> 00:05:19,950
a few schemes here.
65
00:05:19,950 --> 00:05:26,930
I'm going to apply ZeroR, and I get 64%.
66
00:05:27,530 --> 00:05:32,130
I'm going to apply J48,
67
00:05:34,530 --> 00:05:38,790
and I think I'll use a percentage split for evaluation because
68
00:05:38,790 --> 00:05:41,020
it's not fair to use the training set here.
69
00:05:41,020 --> 00:05:43,720
Now I get 63%.
70
00:05:43,720 --> 00:05:46,580
That's worse than the baseline.
71
00:05:46,580 --> 00:05:48,180
Let me try NaiveBayes.
72
00:05:49,990 --> 00:05:53,520
These are the ones I tried before.
73
00:05:53,520 --> 00:05:57,070
Again I get 63%, worse than the baseline.
74
00:05:57,070 --> 00:06:04,070
If I choose IBk, this is going to take a little
while here; it's a rather slow scheme.
75
00:06:09,910 --> 00:06:11,670
Here we are; it's finished now.
76
00:06:11,670 --> 00:06:13,500
Only 38%.
77
00:06:13,500 --> 00:06:17,010
That is way, way worse than the baseline.
78
00:06:17,010 --> 00:06:24,010
We'll just try PART, partial decision rules.
79
00:06:26,200 --> 00:06:28,000
Here we get 63%.
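The whole supermarket comparison can also be run as one program. Here's a sketch, assuming supermarket.arff is available locally and using the same 66% split for every scheme, including the baseline.

    import java.util.Random;
    import weka.classifiers.Classifier;
    import weka.classifiers.Evaluation;
    import weka.classifiers.bayes.NaiveBayes;
    import weka.classifiers.lazy.IBk;
    import weka.classifiers.rules.PART;
    import weka.classifiers.rules.ZeroR;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SupermarketBaseline {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("supermarket.arff");
            data.setClassIndex(data.numAttributes() - 1);

            data.randomize(new Random(1));
            int trainSize = (int) Math.round(data.numInstances() * 0.66);
            Instances train = new Instances(data, 0, trainSize);
            Instances test = new Instances(data, trainSize, data.numInstances() - trainSize);

            // ZeroR first: if a scheme can't beat it, it isn't telling us much.
            Classifier[] schemes = { new ZeroR(), new J48(),
                new NaiveBayes(), new IBk(), new PART() };
            for (Classifier c : schemes) {
                c.buildClassifier(train);
                Evaluation eval = new Evaluation(train);
                eval.evaluateModel(c, test);
                System.out.printf("%-10s %5.1f%%%n",
                    c.getClass().getSimpleName(), eval.pctCorrect());
            }
        }
    }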
80
00:06:30,160 --> 00:06:36,350
The upshot is that the baseline actually gave
a better performance than any of these classifiers,
81
00:06:36,350 --> 00:06:41,580
and one of them was really atrocious compared
with the baseline.
82
00:06:41,580 --> 00:06:47,030
This is because, for this dataset, the attributes
are not really informative.
83
00:06:47,030 --> 00:06:52,310
The rule here is: don't just apply Weka to
a dataset blindly.
84
00:06:52,310 --> 00:06:54,900
You need to understand what's going on.
85
00:06:54,900 --> 00:07:02,970
When you do apply Weka to a dataset, always
make sure that you try the baseline classifier,
86
00:07:02,970 --> 00:07:06,360
ZeroR, before doing anything else.
87
00:07:06,360 --> 00:07:09,100
In general, simplicity is best.
88
00:07:09,100 --> 00:07:14,250
Always try simple classifiers before you try
more complicated ones.
89
00:07:14,250 --> 00:07:18,210
Also, when you get these small differences,
you should consider whether they
90
00:07:18,210 --> 00:07:19,830
are likely to be significant.
91
00:07:19,830 --> 00:07:24,820
We saw 1% differences earlier in this lesson
that were probably not at all significant.
92
00:07:24,820 --> 00:07:27,430
You should always try a simple baseline.
93
00:07:27,430 --> 00:07:29,180
You should look at the dataset.
94
00:07:29,180 --> 00:07:36,070
We shouldn't blindly apply Weka to a dataset;
we should try to understand what's going on.
95
00:07:36,070 --> 00:07:37,140
That's it for this lesson.
96
00:07:37,140 --> 00:07:44,140
Off you go and do the activity associated
with this lesson, and I'll see you soon!