1
00:00:18,000 --> 00:00:24,420
Hi! Well, Class 2 has gone flying by, and
here are some things I'd like to discuss.
2
00:00:24,420 --> 00:00:28,640
First of all, we made some mistakes in the
answers to the activities.
3
00:00:28,640 --> 00:00:29,640
Sorry about that.
4
00:00:29,640 --> 00:00:32,480
We've corrected them.
5
00:00:32,480 --> 00:00:37,530
Secondly -- a general point -- some people
have been asking questions, for example, about
6
00:00:37,530 --> 00:00:38,870
huge datasets.
7
00:00:38,870 --> 00:00:44,059
How big a dataset can Weka deal with? The
answer is pretty big, actually.
8
00:00:44,059 --> 00:00:47,710
But it depends on what you do, and it's a
fairly complicated question to discuss.
9
00:00:47,710 --> 00:00:52,570
If that's not big enough for your dataset,
there are ways of improving things.
10
00:00:52,570 --> 00:00:57,619
Anyway, issues like that should be discussed
on the Weka mailing list, or you should look
11
00:00:57,619 --> 00:01:03,920
in the Weka FAQ, where there's quite a lot
of discussion on this particular issue.
12
00:01:03,920 --> 00:01:07,939
The Weka API: the programming interface to
Weka.
13
00:01:07,939 --> 00:01:11,430
You can incorporate the Weka routines in your
program.
14
00:01:11,430 --> 00:01:14,220
It's wonderful stuff, but it's not covered
in this MOOC.
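For those who are curious, here's a minimal sketch of what using the API looks like -- loading an ARFF file and building a J48 tree. The file name is just a placeholder; the rest is the standard Weka API.

```java
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class WekaApiSketch {
    public static void main(String[] args) throws Exception {
        // Load a dataset (placeholder path -- use your own ARFF file)
        Instances data = DataSource.read("weather.nominal.arff");
        // By convention the class is the last attribute
        data.setClassIndex(data.numAttributes() - 1);

        // Build a J48 decision tree on the data and print it
        J48 tree = new J48();
        tree.buildClassifier(data);
        System.out.println(tree);
    }
}
```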
15
00:01:14,220 --> 00:01:19,430
So the right place to discuss those issues
is the Weka mailing list.
16
00:01:19,430 --> 00:01:22,240
Finally, personal emails to me.
17
00:01:22,240 --> 00:01:27,360
You know, there are 5,000 people on this MOOC,
and I can't cope with personal emails, so
18
00:01:27,360 --> 00:01:33,960
please send your questions to the mailing list
and not to me personally.
19
00:01:33,960 --> 00:01:38,460
I'd like to discuss the issues of numeric
precision in Weka.
20
00:01:38,460 --> 00:01:44,670
Weka prints percentages -- in fact, most
numbers -- to 4 decimal places.
21
00:01:44,670 --> 00:01:47,520
That's misleadingly high precision.
22
00:01:47,520 --> 00:01:49,350
Don't take these at face value.
23
00:01:49,350 --> 00:01:59,250
For example, here we've done an experiment
using a 40% percentage split, and we get 92.3333%
24
00:01:59,250 --> 00:02:00,659
accuracy printed out.
25
00:02:00,659 --> 00:02:07,240
Well, that's the exact right answer to the
wrong question.
26
00:02:07,240 --> 00:02:11,540
We're not interested in the performance on
this particular test set.
27
00:02:11,540 --> 00:02:18,530
What we're interested in is how Weka will
do in general on data from this source.
28
00:02:18,530 --> 00:02:25,110
We certainly can't infer that the true accuracy
is this percentage, correct to 4 decimal places.
29
00:02:25,110 --> 00:02:29,430
In Class 2, we're trying to sensitize you
to the fact that these figures aren't to be
30
00:02:29,430 --> 00:02:31,300
taken at face value.
31
00:02:31,300 --> 00:02:35,090
For example, there we are with a 40% split.
32
00:02:35,090 --> 00:02:42,090
If we do a 30% split we get 92.381%.
33
00:02:43,530 --> 00:02:46,590
The difference between these two numbers is
completely insignificant.
34
00:02:46,590 --> 00:02:50,260
You shouldn't be saying that one of these
numbers is better than the other.
35
00:02:50,260 --> 00:02:56,270
They are both the same, really, within the
amount of statistical fuzz that's involved
36
00:02:56,270 --> 00:02:57,660
in the experiment.
37
00:02:57,660 --> 00:03:06,760
We're trying to train you to write your answers
to the nearest percentage point, or perhaps
38
00:03:06,760 --> 00:03:07,900
1 decimal place.
39
00:03:07,900 --> 00:03:12,489
Those are the answers that are being accepted
as correct.
40
00:03:12,489 --> 00:03:17,090
The reason we're doing that is to try to train
you to think about these numbers and what
41
00:03:17,090 --> 00:03:22,090
they really represent, rather than just copy/pasting
whatever Weka prints out.
42
00:03:22,090 --> 00:03:25,520
These numbers need to be interpreted.
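To make that concrete, here's a rough sketch, using Weka's Java API, of running a percentage-split evaluation and reporting the accuracy to 1 decimal place rather than the 4 that Weka prints. The file name and the 40% split echo the example above but are placeholders; the Explorer shuffles the data its own way, so the exact figure may differ.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class PercentageSplitSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment-challenge.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);
        data.randomize(new Random(1));

        // 40% for training, the rest for testing
        int trainSize = (int) Math.round(data.numInstances() * 0.40);
        Instances train = new Instances(data, 0, trainSize);
        Instances test  = new Instances(data, trainSize, data.numInstances() - trainSize);

        J48 tree = new J48();
        tree.buildClassifier(train);

        Evaluation eval = new Evaluation(train);
        eval.evaluateModel(tree, test);

        // Report to 1 decimal place -- the extra digits aren't meaningful
        System.out.printf("Accuracy: %.1f%%%n", eval.pctCorrect());
    }
}
```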
43
00:03:25,520 --> 00:03:37,840
For example, in Activity 2.6 in question 2,
the 4-digit answer would be 0.7354, and 0.7
44
00:03:37,840 --> 00:03:41,520
and 0.74 are the only accepted answers.
45
00:03:41,520 --> 00:03:51,810
In question 5, the 4-decimal place accuracy
is 1.7256%, and we would accept 1.73%, 1.7% and 2%.
46
00:03:51,819 --> 00:03:55,790
We're a bit selective in what we'll accept
here.
47
00:03:58,740 --> 00:04:02,790
I want to move on to the user classifier now.
48
00:04:04,280 --> 00:04:10,030
Some people got some confusing results, because
they created splits that involved the class
49
00:04:10,030 --> 00:04:13,330
attribute.
50
00:04:13,330 --> 00:04:16,739
When you're dealing with the test set, you
don't know the class attribute -- that's what
51
00:04:16,739 --> 00:04:18,120
you're trying to find out.
52
00:04:18,120 --> 00:04:22,750
So it doesn't make sense to create splits
in the decision tree that involve testing
53
00:04:22,750 --> 00:04:24,889
the class attribute.
54
00:04:24,889 --> 00:04:31,819
If you do that, you're going to get 0 accuracy
on test data, because the class value cannot
55
00:04:31,819 --> 00:04:37,259
be evaluated on the test data.
56
00:04:37,259 --> 00:04:40,800
That was the cause of that confusion.
57
00:04:40,800 --> 00:04:44,080
Here's the league table for the user classifier.
58
00:04:44,080 --> 00:04:47,909
J48 gets 96.2%, just as a reference point.
59
00:04:47,909 --> 00:04:51,719
Magda did really well and got very close to
that, with 93.9%.
60
00:04:51,719 --> 00:04:58,719
It took her 6.5-7 minutes, according to the
script that she mailed in.
61
00:05:01,409 --> 00:05:04,909
Myles did pretty well -- 93.5%.
62
00:05:04,909 --> 00:05:09,369
In the class, I got 78% in just a few seconds.
63
00:05:09,369 --> 00:05:14,710
I think if you get over 90% you're doing pretty
well on this dataset for the user classifier.
64
00:05:14,710 --> 00:05:21,710
The point is not to get a good result; it's
to think about the process of classification.
65
00:05:23,680 --> 00:05:30,050
Let's move to Activity 2.2, partitioning the
datasets for training and testing.
66
00:05:30,050 --> 00:05:40,080
Question 1 asked you to evaluate J48 with
percentage split, using 10%, 20%, 40%, 60%,
67
00:05:40,080 --> 00:05:43,650
and 80% for the training set.
68
00:05:43,650 --> 00:05:50,650
What you observed is that the accuracy increases
as we go through that set of numbers.
69
00:05:51,960 --> 00:05:55,169
"Performance always increases" for those numbers.
70
00:05:55,169 --> 00:05:57,939
It doesn't always increase in general.
71
00:05:57,939 --> 00:06:03,979
In general, you would expect an increasing
trend -- the more training data the better
72
00:06:03,979 --> 00:06:08,559
the performance, asymptoting off at some point.
73
00:06:08,559 --> 00:06:12,569
You would expect some fluctuation, though,
so sometimes you would expect it to go down
74
00:06:12,569 --> 00:06:13,499
and up again.
75
00:06:13,499 --> 00:06:20,119
In this particular case, performance always
increases.
76
00:06:20,119 --> 00:06:28,500
You were asked to estimate J48's true accuracy
on the segment-challenge dataset in Question 4.
77
00:06:28,509 --> 00:06:34,240
Well, "true accuracy" -- what do we mean by
"true accuracy"? I guess maybe it's not very
78
00:06:34,240 --> 00:06:40,770
well defined, but what one thinks of is if
you have a large enough training set, the
79
00:06:40,770 --> 00:06:45,300
performance of J48 is going to increase up
to some kind of point, and what would that
80
00:06:45,300 --> 00:06:55,780
point be? Actually, if you do this -- in fact,
you've done it! -- you found that between
81
00:06:55,789 --> 00:07:05,370
60% and 97-98% training sets, the percentage
split option consistently
82
00:07:05,379 --> 00:07:10,619
yields correctly classified instances in the
range 94-97%.
83
00:07:10,619 --> 00:07:15,960
So 95% is probably the best fit from this
selection of possible numbers.
84
00:07:15,960 --> 00:07:22,339
It's true, by the way, that greater weight
is normally given to the training portion
85
00:07:22,339 --> 00:07:23,240
of this split.
86
00:07:23,240 --> 00:07:31,330
Usually when we use percentage split, we would
use 2/3, or maybe 3/4, or maybe 90% of the
87
00:07:31,339 --> 00:07:34,909
data for training, and the smaller amount
for testing.
88
00:07:36,600 --> 00:07:41,520
Questions 6 and 7 were confusing, and we've
changed those.
89
00:07:41,520 --> 00:07:48,890
The issue there was, first, how a classifier's
performance, and secondly how the reliability of the estimate
90
00:07:48,899 --> 00:07:53,490
of that performance, are expected
to change as the volume of training
91
00:07:53,490 --> 00:07:54,699
data increases.
92
00:07:56,020 --> 00:07:59,949
Or, how they change with the size of the dataset.
93
00:07:59,949 --> 00:08:05,249
The performance is expected to increase as
the volume of training data increases, and
94
00:08:05,249 --> 00:08:11,490
the reliability of the estimate is also expected
to increase as the volume of test data increases.
95
00:08:11,490 --> 00:08:14,689
With the percentage split option, there's
a trade-off between the amount of test data
96
00:08:14,689 --> 00:08:16,289
and the amount of training data.
97
00:08:16,289 --> 00:08:22,669
That's what that question is trying to get
at.
98
00:08:22,669 --> 00:08:31,030
Activity 2.3 Question 5: "How do the mean
and standard deviation estimates depend on
99
00:08:31,039 --> 00:08:40,900
the number of samples?" Well, the answer is
that roughly speaking both stay the same.
100
00:08:40,900 --> 00:08:45,460
Let me find Activity 2.3, Question 5.
101
00:08:46,340 --> 00:08:57,740
As you increase the number of samples, you
expect the estimated mean to converge to the true
102
00:08:57,740 --> 00:09:02,850
value of the mean, and the estimated standard
deviation to converge to the true standard
103
00:09:02,850 --> 00:09:04,150
deviation.
104
00:09:04,150 --> 00:09:09,050
So, they would both stay about the same.
105
00:09:09,050 --> 00:09:14,160
This is, in fact, now marked as correct.
106
00:09:14,160 --> 00:09:24,080
Actually, because of the "n - 1" in the denominator
of the formula for variance, it's true that
107
00:09:24,080 --> 00:09:29,820
the standard deviation decreases a tiny bit,
but it's a very small effect.
108
00:09:29,820 --> 00:09:34,770
So we've also accepted that answer as correct.
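For reference, the formula in question -- the usual unbiased estimate of the variance -- is

```latex
s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2
```

and as the number of samples n grows, dividing by n - 1 rather than n makes less and less difference, which is why the effect on the standard deviation is so small.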
109
00:09:34,770 --> 00:09:38,630
That's how the mean and standard deviation
estimates depend on the number of samples.
110
00:09:38,630 --> 00:09:45,630
Perhaps a more important question is how the
reliability of the mean would change.
111
00:09:46,340 --> 00:09:51,710
What decreases is the standard error of the
estimate of the mean, which is the standard
112
00:09:51,710 --> 00:09:57,740
deviation of the theoretical distribution
of the large population of such estimates.
113
00:09:57,740 --> 00:10:04,740
The estimate of the mean is a better, more
reliable estimate with a larger number of samples.
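In symbols: if s is the estimated standard deviation and n is the number of samples, the standard error of the mean is

```latex
\mathrm{SE}(\bar{x}) = \frac{s}{\sqrt{n}}
```

so, for example, quadrupling the number of samples halves the uncertainty in the estimated mean.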
114
00:10:10,160 --> 00:10:17,610
"The supermarket dataset is weird." Yes, it
is weird: it's intended to be weird.
115
00:10:17,610 --> 00:10:25,960
Actually, in the supermarket dataset, each
instance represents a supermarket trolley,
116
00:10:25,960 --> 00:10:30,450
and, instead of putting a 0 for every item
you don't buy -- of course, when we go to
117
00:10:30,450 --> 00:10:36,660
the supermarket, we don't buy most of the
items in the supermarket -- the ARFF file
118
00:10:36,660 --> 00:10:39,800
codes that as a question mark, which stands
for "missing value".
119
00:10:39,800 --> 00:10:43,380
We're going to discuss missing values in Class 5.
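Just to illustrate the encoding -- this is a made-up fragment in the same style, not the actual supermarket.arff, and the attribute names are invented -- a file like this marks items that weren't bought with '?':

```
@relation basket_example

@attribute bread {t}
@attribute milk  {t}
@attribute total {low, high}

@data
% 't' means the item was bought; '?' (missing) means it wasn't
t, ?, low
?, t, high
```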
120
00:10:44,320 --> 00:10:49,990
This dataset is suitable for association rule
learning, which we're not doing in this course.
121
00:10:49,990 --> 00:10:54,570
The message I'm trying to emphasize here is
that you need to understand what you're doing,
122
00:10:54,570 --> 00:10:57,220
not just process datasets blindly.
123
00:10:57,220 --> 00:10:59,250
Yes, it is weird.
124
00:11:00,520 --> 00:11:06,990
There's been some discussion on the mailing
list about cross-validation and the extra model.
125
00:11:06,990 --> 00:11:10,500
When you do cross-validation, you're trying
to do two things.
126
00:11:10,500 --> 00:11:19,540
You're trying to get an estimate of the expected
accuracy of a classifier, and you're trying
127
00:11:19,540 --> 00:11:21,930
to actually produce a really good classifier.
128
00:11:21,930 --> 00:11:27,090
To produce a really good classifier to use
in the future, you want to use the entire
129
00:11:27,090 --> 00:11:30,880
training set to train up the classifier.
130
00:11:30,880 --> 00:11:35,070
To get an estimate of its accuracy, however,
you can't do that unless you have an independent
131
00:11:35,070 --> 00:11:36,680
test set.
132
00:11:36,680 --> 00:11:44,190
So cross-validation takes 90% for training
and 10% for testing, repeats that 10 times,
133
00:11:44,190 --> 00:11:46,700
and averages the results to get an estimate.
134
00:11:46,700 --> 00:11:53,440
Once you've got the estimate, if you want
an actual classifier to use, the best classifier
135
00:11:53,440 --> 00:11:56,960
is one built on the full training set.
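Here's a sketch of that two-step process in Weka's Java API: cross-validate purely to get the accuracy estimate, then build the classifier you'd actually use on all of the data. The file name is a placeholder.

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CrossValidationSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("segment-challenge.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // Step 1: 10-fold cross-validation, just to estimate accuracy
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(new J48(), data, 10, new Random(1));
        System.out.printf("Estimated accuracy: %.1f%%%n", eval.pctCorrect());

        // Step 2: the classifier to keep is built on ALL the data
        J48 finalModel = new J48();
        finalModel.buildClassifier(data);
    }
}
```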
136
00:11:56,960 --> 00:12:00,760
The same is true with the percentage split option.
137
00:12:00,760 --> 00:12:05,190
Weka will evaluate the percentage split, but
then it will print the classifier that it
138
00:12:05,190 --> 00:12:10,600
produces from the entire training set to give
you a classifier to use on your problem in
139
00:12:10,600 --> 00:12:11,410
the future.
140
00:12:12,920 --> 00:12:16,310
There's been a little bit of discussion on
advanced stuff.
141
00:12:16,310 --> 00:12:19,570
I think maybe a follow-up course might be
a good idea here.
142
00:12:19,570 --> 00:12:24,430
Someone noticed that if you apply a filter
to the training set, you need to apply exactly
143
00:12:24,430 --> 00:12:28,690
the same filter to the test set, which is
sometimes a bit difficult to do, particularly
144
00:12:28,690 --> 00:12:33,220
if the training and test sets are produced
by cross-validation.
145
00:12:33,220 --> 00:12:40,010
There's an advanced classifier called the
"FilteredClassifier" which addresses that problem.
146
00:12:40,010 --> 00:12:45,160
In his response to a question on the supermarket
dataset, Peter mentioned "unbalanced" datasets,
147
00:12:45,160 --> 00:12:47,110
and the cost of different kinds of error.
148
00:12:47,110 --> 00:12:51,900
This is something that Weka can take into
account with cost-sensitive evaluation,
149
00:12:51,900 --> 00:12:58,090
and there is a classifier called the CostSensitiveClassifier
that allows you to do that.
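Here's a hedged sketch: a 2x2 cost matrix that makes one kind of error five times as costly as the other, wrapped around J48 with the CostSensitiveClassifier. The dataset and the cost values are placeholders.

```java
import java.util.Random;
import weka.classifiers.CostMatrix;
import weka.classifiers.Evaluation;
import weka.classifiers.meta.CostSensitiveClassifier;
import weka.classifiers.trees.J48;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class CostSensitiveSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("credit-g.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        // 2x2 cost matrix: rows = actual class, columns = predicted class
        CostMatrix costs = new CostMatrix(2);
        costs.setCell(0, 1, 1.0); // actual class 0 predicted as 1
        costs.setCell(1, 0, 5.0); // actual class 1 predicted as 0 -- five times worse

        CostSensitiveClassifier csc = new CostSensitiveClassifier();
        csc.setClassifier(new J48());
        csc.setCostMatrix(costs);

        Evaluation eval = new Evaluation(data, costs); // cost-sensitive evaluation
        eval.crossValidateModel(csc, data, 10, new Random(1));
        System.out.printf("Average cost: %.3f%n", eval.avgCost());
    }
}
```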
150
00:12:58,090 --> 00:13:03,490
Finally, someone just asked a question on
attribute selection: how do you select a good
151
00:13:03,490 --> 00:13:09,050
subset of attributes? Excellent question!
There's a whole attribute selection panel,
152
00:13:09,050 --> 00:13:11,490
which we're not able to talk about in this
MOOC.
153
00:13:11,490 --> 00:13:15,100
This is just an introductory MOOC on Weka.
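For anyone who wants a head start, here's a rough sketch of doing attribute selection programmatically, with CFS subset evaluation and best-first search -- the same kinds of components the panel exposes. The file name is a placeholder.

```java
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.BestFirst;
import weka.attributeSelection.CfsSubsetEval;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

public class AttributeSelectionSketch {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff"); // placeholder path
        data.setClassIndex(data.numAttributes() - 1);

        AttributeSelection selector = new AttributeSelection();
        selector.setEvaluator(new CfsSubsetEval()); // scores attribute subsets
        selector.setSearch(new BestFirst());        // searches the space of subsets
        selector.SelectAttributes(data);

        // Indices of the selected attributes (including the class index)
        int[] selected = selector.selectedAttributes();
        for (int idx : selected) {
            System.out.println(data.attribute(idx).name());
        }
    }
}
```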
154
00:13:15,100 --> 00:13:20,680
Maybe we'll come up with an advanced, follow-up
MOOC where we're able to discuss some of these
155
00:13:20,680 --> 00:13:22,030
more advanced issues.
156
00:13:23,340 --> 00:13:24,400
That's it.
157
00:13:24,400 --> 00:13:29,940
I just want to finish with a picture that
someone sent in of two wekas in an enclosure.
158
00:13:30,610 --> 00:13:36,350
It's rare to see wekas in the wild -- I've seen them a couple of times myself, but not very often.
159
00:13:36,350 --> 00:13:43,270
More likely, to see a weka you need to go
to a place where they keep captive wekas
160
00:13:43,270 --> 00:13:45,170
for you to look at.
161
00:13:45,170 --> 00:13:48,450
Here are two wekas that Leah from Vancouver
sent in.
162
00:13:50,000 --> 00:13:50,980
That's it.
163
00:13:50,980 --> 00:13:55,240
Class 3 is up now, so off you go
with Class 3.
164
00:13:55,240 --> 00:13:56,960
Good luck! We'll talk to you later.
165
00:13:56,960 --> 00:13:58,100
Bye for now!