1
00:00:16,320 --> 00:00:21,939
Hello again! In the last lesson, we looked
at training and testing.
2
00:00:21,939 --> 00:00:29,820
We saw that we can evaluate a classifier on
an independent test set, or using a percentage split,
3
00:00:29,820 --> 00:00:35,730
with a certain percentage of the dataset
used to train and the rest used for testing,
4
00:00:35,730 --> 00:00:41,230
or -- and this is generally a very bad idea -- we
can evaluate it on the training set itself,
5
00:00:41,230 --> 00:00:45,550
which gives misleadingly optimistic performance
figures.
6
00:00:45,550 --> 00:00:51,820
In this lesson, we're going to look a little
bit more at training and testing.
7
00:00:51,820 --> 00:01:03,640
In fact, what we're going to do is repeatedly
train and test using percentage split.
8
00:01:03,640 --> 00:01:08,420
Now, in the last lesson, we saw that if you
simply repeat the training and testing, you
9
00:01:08,420 --> 00:01:13,610
get the same result each time because Weka
initializes the random number generator before
10
00:01:13,610 --> 00:01:18,500
it does each run to make sure that you know
what's going on when you do the same experiment
11
00:01:18,500 --> 00:01:19,220
again tomorrow.
12
00:01:19,220 --> 00:01:22,090
But, there is a way of overriding that.
13
00:01:22,090 --> 00:01:28,820
So, we will be using independent random numbers
on different occasions to produce a percentage
14
00:01:28,820 --> 00:01:34,210
split of the dataset into a training and test
set.
15
00:01:34,210 --> 00:01:37,780
I'm going to open the segment-challenge data
again.
16
00:01:37,780 --> 00:01:40,130
That's what we used before.
17
00:01:40,130 --> 00:01:44,700
Notice there are 1500 instances here;
18
00:01:44,700 --> 00:01:45,729
that's quite a lot.
19
00:01:45,729 --> 00:01:48,649
I'm going to go to Classify.
20
00:01:48,649 --> 00:01:54,549
I'm going to choose J48, our standard method,
I guess.
21
00:01:54,549 --> 00:02:00,710
I'm going to use a percentage split, and because
we've got 1500 instances, I'm going to choose
22
00:02:00,710 --> 00:02:05,329
90% for training and just 10% for testing.
23
00:02:05,329 --> 00:02:12,070
I reckon that 10% -- that's 150 instances -- for
testing is going to give us a reasonable estimate,
24
00:02:12,070 --> 00:02:16,720
and we might as well train on as many as we
can to get the most accurate classifier.
25
00:02:16,720 --> 00:02:25,520
I'm going to run this, and the accuracy figure
I get -- this is what I got in the last lesson --
26
00:02:25,520 --> 00:02:27,740
is 96.6667%.
27
00:02:29,340 --> 00:02:34,949
Now, this figure is misleadingly precise.
28
00:02:34,949 --> 00:02:41,000
I'm going to call that 96.7%, or 0.967.
29
00:02:41,000 --> 00:02:45,560
And then, I'm going to do it again and just
see how much variation we get in that figure
30
00:02:45,560 --> 00:02:49,500
by initializing the random number generator
differently each time.
31
00:02:50,460 --> 00:02:57,460
If I go to the More options menu, I get a
number of options here which are quite useful:
32
00:02:57,770 --> 00:03:00,150
outputting the model, we're doing that;
33
00:03:00,150 --> 00:03:01,680
outputting statistics;
34
00:03:01,680 --> 00:03:03,890
we can output different evaluation measures;
35
00:03:03,890 --> 00:03:05,770
we're doing the confusion matrix;
36
00:03:05,770 --> 00:03:08,060
we're storing the predictions for visualization;
37
00:03:08,060 --> 00:03:10,860
we can output the predictions if we want;
38
00:03:10,860 --> 00:03:14,370
we can do a cost-sensitive evaluation;
39
00:03:14,370 --> 00:03:20,980
and we can set the random seed for cross-validation
or percentage split.
40
00:03:20,980 --> 00:03:22,300
That's set by default to 1.
41
00:03:22,300 --> 00:03:26,170
I'm going to change that to 2, a different
random seed.
42
00:03:26,170 --> 00:03:31,490
We could also output the source code for the
classifier if we wanted, but I just want to
43
00:03:31,490 --> 00:03:32,950
change the random seed.
44
00:03:32,950 --> 00:03:35,450
Then I want to run it again.
45
00:03:35,450 --> 00:03:42,450
Before we got 0.967, and this time we get 0.94,
94%.
46
00:03:43,180 --> 00:03:45,310
Quite different, you see.
47
00:03:45,310 --> 00:03:52,090
I'll then change this again to, say,
3, and run it again.
48
00:03:52,090 --> 00:03:53,900
Again I get 94%.
49
00:03:53,900 --> 00:04:03,830
If I change it again to 4 and run it again,
I get 96.7%.
50
00:04:03,830 --> 00:04:05,200
Let's do one more.
51
00:04:05,200 --> 00:04:12,200
Change it to 5, run it again, and now I get
95.3%.
52
00:04:14,330 --> 00:04:15,710
Here's a table with these figures in it.
53
00:04:15,710 --> 00:04:21,480
If we run it 10 times, we get this set of
results.
54
00:04:21,480 --> 00:04:26,330
Given this set of experimental results, we
can calculate the mean and standard deviation.
55
00:04:26,330 --> 00:04:33,770
The sample mean is the sum of all of these
error figures -- or these success rates, I should say --
56
00:04:33,770 --> 00:04:37,200
divided by the number of runs, 10 of
them.
57
00:04:37,200 --> 00:04:41,760
That's 0.949, about 95%.
58
00:04:41,760 --> 00:04:43,290
That's really what we would expect to get.
59
00:04:43,290 --> 00:04:46,910
That's a better estimate than the 96.7% that
we started out with.
60
00:04:46,910 --> 00:04:49,460
A more reliable estimate.
61
00:04:49,460 --> 00:04:51,420
We can calculate the sample variance.
62
00:04:51,420 --> 00:04:57,200
We take the deviation from the mean, we subtract
the mean from each of these numbers, we square that,
63
00:04:57,200 --> 00:05:02,560
add them up, and we divide, not by n,
but by n - 1.
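In symbols, the sample mean and the n - 1 sample variance just described are:

```latex
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
\qquad
s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \bar{x}\right)^2
```

with n = 10 runs in this case, and the standard deviation s being the square root of s².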
64
00:05:02,560 --> 00:05:04,730
That might surprise you, perhaps.
65
00:05:04,730 --> 00:05:11,730
The reason it's n - 1 is that we've
actually calculated the mean from this sample.
66
00:05:12,650 --> 00:05:19,060
When the mean is calculated from the sample,
you need to divide by n - 1, leading to a slightly larger
67
00:05:19,060 --> 00:05:22,090
variance estimate than if you were to divide
by n.
68
00:05:22,090 --> 00:05:32,740
We take the square root of that, and in this
case, we get a standard deviation of 1.8%.
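As a check on the arithmetic, here is a minimal Python sketch of the same calculation, using just the five seed runs quoted earlier in this lesson (the lesson's own table averages ten runs, so its figures of 0.949 and 1.8% come out slightly different):

```python
import statistics

# Percentage-split accuracies observed for random seeds 1 through 5.
accuracies = [0.967, 0.940, 0.940, 0.967, 0.953]

# Sample mean: sum of the success rates divided by how many there are.
mean = statistics.mean(accuracies)

# Sample standard deviation: deviations from the mean, squared, summed,
# divided by n - 1 (not n, because the mean came from this same sample),
# then square-rooted. statistics.stdev uses the n - 1 divisor.
stdev = statistics.stdev(accuracies)

print(f"mean = {mean:.3f}, standard deviation = {stdev:.3f}")
```

Note that `statistics.stdev` applies the n - 1 divisor automatically; `statistics.pstdev` is the divide-by-n version you would use if the mean were known independently of the sample.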
69
00:05:32,740 --> 00:05:39,190
Now you can see that the real performance
of J48 on the segment-challenge dataset is
70
00:05:39,190 --> 00:05:44,460
approximately 95% accuracy, plus or minus
approximately 2%.
71
00:05:44,460 --> 00:05:50,550
Anywhere, let's say, between 93% and 97% accuracy.
72
00:05:50,550 --> 00:05:55,470
These figures that you get, that Weka puts
out for you, are misleading.
73
00:05:55,470 --> 00:06:04,720
You need to be careful how you interpret them,
because the result is certainly not 95.333%.
74
00:06:04,720 --> 00:06:08,550
There's a lot of variation in these
figures.
75
00:06:09,900 --> 00:06:13,870
Remember, the basic assumption is that the training
and test sets are sampled independently from
76
00:06:13,870 --> 00:06:18,940
an infinite population, and you should expect
a slight variation in results -- perhaps more
77
00:06:18,940 --> 00:06:21,660
than just a slight variation in results.
78
00:06:21,660 --> 00:06:27,680
You can estimate the variation in results
by setting the random-number seed and repeating
79
00:06:27,680 --> 00:06:29,520
the experiment.
80
00:06:29,520 --> 00:06:33,520
You can calculate the mean and the standard
deviation experimentally, which is what we
81
00:06:33,520 --> 00:06:34,240
just did.
82
00:06:35,270 --> 00:06:38,740
Off you go now, and do the activity associated
with this lesson.
83
00:06:39,140 --> 00:06:40,240
I'll see you in the next lesson.
84
00:06:40,540 --> 00:06:42,090
Bye!