1
00:00:16,539 --> 00:00:18,380
Hi! Good to see you again.
2
00:00:18,380 --> 00:00:23,949
One of the things I like to do in my spare
time is play music, and that little bit of Mozart
3
00:00:23,949 --> 00:00:26,249
you hear at the beginning of these videos,
4
00:00:26,249 --> 00:00:31,019
that's me and three friends playing a clarinet quartet.
5
00:00:31,019 --> 00:00:35,589
I play in an orchestra, and last night I was
playing some jazz with a little trio.
6
00:00:35,589 --> 00:00:39,680
If you want to hear us play, go to
Google and just find my home page.
7
00:00:39,680 --> 00:00:43,399
Type my name, Ian Witten.
8
00:00:43,399 --> 00:00:50,399
You'll get me here, and every time you visit
this page, I'll play you a tune.
9
00:00:58,519 --> 00:01:04,170
If you refresh the page, I'll play you another
tune.
10
00:01:08,770 --> 00:01:11,090
That's what I do.
11
00:01:11,090 --> 00:01:13,290
Anyway, that's not what we're here for.
12
00:01:13,290 --> 00:01:20,290
We're here to talk about Lesson 2.6, which
is about cross-validation results.
13
00:01:20,290 --> 00:01:24,610
We learned about cross-validation in the last
lesson.
14
00:01:24,610 --> 00:01:29,590
I said that cross-validation was a better
way of evaluating your machine learning algorithm,
15
00:01:29,590 --> 00:01:35,640
evaluating your classifier, than the
repeated holdout method.
16
00:01:35,640 --> 00:01:37,360
Cross-validation does things 10 times.
17
00:01:37,360 --> 00:01:39,670
You can use holdout to do things 10 times,
18
00:01:39,870 --> 00:01:43,970
but cross-validation is a better way of doing things.
19
00:01:43,970 --> 00:01:45,720
Let's just do a little experiment here.
20
00:01:45,720 --> 00:01:52,720
I'm going to start up Weka and open the diabetes
dataset.
21
00:01:56,000 --> 00:02:02,000
The baseline accuracy, which ZeroR gives me --
22
00:02:02,000 --> 00:02:05,510
that's the default classifier, by the way, rules/ZeroR --
23
00:02:05,510 --> 00:02:11,540
if I just run that, well, it will evaluate it
using cross-validation.
24
00:02:11,540 --> 00:02:15,600
Actually, for a true baseline, I should just
use the training set.
25
00:02:15,600 --> 00:02:22,570
That'll just look at the chances
of getting a correct result if we simply guess
26
00:02:22,570 --> 00:02:27,060
the most likely class, in this case 65.1%.
27
00:02:27,060 --> 00:02:28,790
That's the baseline accuracy.
28
00:02:28,790 --> 00:02:32,110
That's the first thing you should do with
any dataset.
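The same baseline check can also be scripted with Weka's Java API. Here is a minimal sketch, assuming the dataset is saved as diabetes.arff in the working directory (that path is an assumption; use your own copy of the file):

    // Minimal sketch: ZeroR baseline evaluated on the training set,
    // matching "Use training set" in the Weka Explorer.
    import weka.classifiers.Evaluation;
    import weka.classifiers.rules.ZeroR;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class BaselineCheck {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("diabetes.arff"); // assumed path
            data.setClassIndex(data.numAttributes() - 1);      // class = last attribute

            ZeroR baseline = new ZeroR();
            baseline.buildClassifier(data);

            Evaluation eval = new Evaluation(data);
            eval.evaluateModel(baseline, data); // test on the training data itself
            System.out.printf("Baseline accuracy: %.1f%%%n", eval.pctCorrect()); // about 65.1%
        }
    }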
29
00:02:32,110 --> 00:02:36,130
Then we're going to look at J48, which is
down here under trees.
30
00:02:38,550 --> 00:02:39,280
There it is.
31
00:02:39,280 --> 00:02:44,150
I'm going to evaluate it with 10-fold cross-validation.
32
00:02:44,150 --> 00:02:47,290
It takes just a second to do that.
33
00:02:47,290 --> 00:02:57,650
I get a result of 73.8%, and we can change
the random-number seed like we did before.
34
00:02:57,650 --> 00:03:00,790
The default is 1; let's put a random-number
seed of 2.
35
00:03:00,790 --> 00:03:02,370
Run it again.
36
00:03:02,370 --> 00:03:04,680
I get 75%.
37
00:03:04,680 --> 00:03:06,000
Do it again.
38
00:03:06,000 --> 00:03:09,210
Change it to, say, 3; I can choose anything
I want, of course.
39
00:03:09,210 --> 00:03:14,930
Run it again, and I get 75.5%.
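The whole seed experiment can also be run programmatically rather than by clicking through the Explorer. A minimal sketch with Weka's Java API, again assuming diabetes.arff as the file name:

    // Minimal sketch: 10-fold cross-validation of J48 for seeds 1 through 10.
    import java.util.Random;
    import weka.classifiers.Evaluation;
    import weka.classifiers.trees.J48;
    import weka.core.Instances;
    import weka.core.converters.ConverterUtils.DataSource;

    public class SeedExperiment {
        public static void main(String[] args) throws Exception {
            Instances data = DataSource.read("diabetes.arff"); // assumed path
            data.setClassIndex(data.numAttributes() - 1);

            for (int seed = 1; seed <= 10; seed++) {
                Evaluation eval = new Evaluation(data);
                // Builds and evaluates a fresh J48 tree on each of the 10 folds.
                eval.crossValidateModel(new J48(), data, 10, new Random(seed));
                System.out.printf("seed %2d: %.1f%%%n", seed, eval.pctCorrect());
            }
        }
    }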
40
00:03:14,930 --> 00:03:21,930
These are the numbers I get
with 10 different random-number seeds.
41
00:03:22,040 --> 00:03:28,720
Those are the same numbers on this slide in
the right-hand column, the 10 values I got,
42
00:03:28,720 --> 00:03:32,290
73.8%, 75.0%, 75.5%, and so on.
43
00:03:32,290 --> 00:03:39,290
I can calculate the mean, which for that right-hand
column is 74.5%,
44
00:03:39,290 --> 00:03:41,290
and the sample standard deviation,
45
00:03:41,290 --> 00:03:46,630
which is 0.9%, using just the same formulas
that we used before.
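For reference, those are the usual formulas. With x_1, ..., x_n the n = 10 accuracies, in LaTeX:

    \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i
    \qquad
    s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}

which give approximately 74.5% and 0.9% for the cross-validation column.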
46
00:03:46,630 --> 00:03:49,330
Earlier, we used these formulas for the
holdout method,
47
00:03:49,330 --> 00:03:52,830
when we repeated the holdout 10 times.
48
00:03:52,830 --> 00:03:59,340
These are the results you get on this dataset
if you repeat holdout, that is, using 90% for
49
00:03:59,340 --> 00:04:04,680
training and 10% for testing, which is, of
course, what we're doing with 10-fold cross-validation.
50
00:04:04,680 --> 00:04:11,090
I would get those results there, and if I
average those, I get a mean of 74.8%, which
51
00:04:11,090 --> 00:04:18,090
is satisfactorily close to 74.5%, but I get
a much larger
52
00:04:18,090 --> 00:04:25,090
standard deviation of 4.6%, as opposed to
0.9% with cross-validation.
53
00:04:26,000 --> 00:04:32,680
Now, you might be asking yourself: why use
10-fold cross-validation?
54
00:04:32,680 --> 00:04:38,950
With Weka we can use 20-fold cross-validation
or anything else; we just set the number of folds
55
00:04:38,950 --> 00:04:43,950
here beside the cross-validation box to whatever
we want.
56
00:04:43,950 --> 00:04:46,450
So we can use 20-fold cross-validation.
57
00:04:46,450 --> 00:04:50,330
What that would do is divide the
dataset into 20 equal parts
58
00:04:50,930 --> 00:04:52,570
and repeat 20 times.
59
00:04:52,570 --> 00:04:58,370
Each time, take one part out for testing, train on
the other 95% of the dataset, and then do it a 21st time
60
00:04:58,370 --> 00:04:59,750
on the whole dataset.
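In the Java API the number of folds is just a parameter, so the earlier sketch only needs its crossValidateModel call changed. A hypothetical helper, meant to drop into that same class:

    // Hypothetical helper: accuracy of J48 under k-fold cross-validation.
    static double cvAccuracy(Instances data, int folds, int seed) throws Exception {
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, folds, new Random(seed));
        return eval.pctCorrect();
    }

    // cvAccuracy(data, 20, 1): 20 folds, so each fold trains on 19/20 = 95% of the data.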
61
00:05:01,200 --> 00:05:03,030
So, why 10, why not 20?
62
00:05:03,030 --> 00:05:04,810
Well, that's a good question really,
63
00:05:04,810 --> 00:05:08,040
and there's not a very good answer.
64
00:05:08,040 --> 00:05:14,000
We want to use quite a lot of data for training,
because, in the final analysis, we're going
65
00:05:14,000 --> 00:05:19,960
to use the entire dataset for training.
66
00:05:19,960 --> 00:05:23,720
If we're using 10-fold cross-validation, then
we're using 90% of the dataset for training.
67
00:05:23,720 --> 00:05:28,100
Maybe it would be a little better to use 95%
of the dataset for training
68
00:05:28,100 --> 00:05:30,500
with 20-fold cross-validation.
69
00:05:31,400 --> 00:05:33,300
On the other hand, we want to make sure that
70
00:05:33,300 --> 00:05:37,250
what we evaluate on is a valid statistical sample.
71
00:05:37,250 --> 00:05:43,870
So, in general, it's not necessarily a good
idea to use a large number of folds with cross-validation.
72
00:05:43,870 --> 00:05:50,720
Also, of course, 20-fold cross-validation
will take twice as long as 10-fold cross-validation.
73
00:05:50,720 --> 00:05:54,210
The upshot is that there isn't a really good
answer to this question, but the standard
74
00:05:54,210 --> 00:06:00,620
thing to do is to use 10-fold cross-validation,
and that's why it's Weka's default.
75
00:06:00,620 --> 00:06:05,350
We've shown in this lesson that cross-validation
really is better than repeated holdout.
76
00:06:05,350 --> 00:06:11,180
Remember, on the last slide, we found that
we got about the same mean for repeated holdout
77
00:06:11,180 --> 00:06:18,180
as for cross-validation, but we got a much
smaller variance for cross-validation.
78
00:06:18,350 --> 00:06:25,350
We know that when we evaluate this machine
learning method, J48, on this dataset, diabetes,
79
00:06:26,440 --> 00:06:33,440
we get 74.5% accuracy, probably somewhere
between 73.5% and 75.5%.
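Roughly speaking, that interval is the mean plus or minus one standard deviation:

    \bar{x} \pm s = 74.5\% \pm 0.9\% \approx [73.6\%, 75.4\%]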
80
00:06:33,810 --> 00:06:38,580
That is actually substantially larger than
the baseline.
81
00:06:38,580 --> 00:06:43,360
So, J48 is doing something for us: it performs
better than the baseline.
82
00:06:43,900 --> 00:06:48,660
Cross-validation reduces the variance of the
estimate.
83
00:06:48,660 --> 00:06:50,240
That's the end of this class.
84
00:06:51,240 --> 00:06:54,300
Off you go and do the activity.
85
00:06:54,300 --> 00:06:56,850
I'll see you at the next class.
86
00:06:56,850 --> 00:06:58,450
Bye for now!