1
00:00:16,460 --> 00:00:20,980
Hi! Welcome back for another few minutes in
New Zealand.
2
00:00:20,980 --> 00:00:27,410
In the last lesson, Lesson 5.1, we learned
that Weka only helps you with a small part
3
00:00:27,410 --> 00:00:33,370
of the overall data mining process, the technical
part, which is perhaps the easy part.
4
00:00:33,370 --> 00:00:38,690
In this lesson, we're going to learn that
there are many pitfalls and pratfalls even
5
00:00:38,690 --> 00:00:40,470
in that part.
6
00:00:41,860 --> 00:00:43,149
Let me just define these for you.
7
00:00:43,149 --> 00:00:48,840
A "pitfall" is a hidden or unsuspected danger
or difficulty, and there are plenty of those
8
00:00:48,840 --> 00:00:51,059
in the field of machine learning.
9
00:00:51,059 --> 00:00:57,690
A "pratfall" is a stupid and humiliating action,
which is very easy to do when you're working
10
00:00:57,690 --> 00:01:01,870
with data.
11
00:01:01,870 --> 00:01:04,710
The first lesson is that you should be skeptical.
12
00:01:04,710 --> 00:01:08,860
In data mining it's very easy to cheat.
13
00:01:08,860 --> 00:01:14,659
Whether you're cheating consciously or unconsciously,
it's easy to mislead yourself or mislead others
14
00:01:14,659 --> 00:01:18,440
about the significance of your results.
15
00:01:18,440 --> 00:01:25,440
For a reliable test, you should use a completely
fresh sample of data that has never been seen before.
16
00:01:25,440 --> 00:01:29,390
You should save something for the very end,
that you don't use until you've selected your
17
00:01:29,390 --> 00:01:33,579
algorithm, decided how you're going to apply
it, and the filters, and so on.
18
00:01:33,579 --> 00:01:39,659
At the very, very end, having done all that,
run it on some fresh data to get an estimate
19
00:01:39,659 --> 00:01:41,570
of how it will perform.
20
00:01:41,570 --> 00:01:47,500
Don't be tempted to then change it to improve
it so that you get better results on that data.
21
00:01:47,500 --> 00:01:51,659
Always do your final run on fresh data.
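The hold-out discipline described above can be sketched in a few lines of Python (a toy illustration, not part of the lesson; the dataset, fraction, and function name are hypothetical):

```python
import random

def dev_final_split(instances, final_fraction=0.2, seed=42):
    """Set aside a final holdout that is never touched during development.

    All choices (filters, algorithm, parameters) are made on the
    development portion; the final holdout is used exactly once,
    at the very end, to estimate performance on fresh data.
    """
    rng = random.Random(seed)
    shuffled = instances[:]
    rng.shuffle(shuffled)
    n_final = int(len(shuffled) * final_fraction)
    final_holdout = shuffled[:n_final]
    development = shuffled[n_final:]
    return development, final_holdout

# A hypothetical dataset of 100 instances.
data = list(range(100))
dev, final = dev_final_split(data)
print(len(dev), len(final))  # 80 20
```

The point is not the splitting code itself but the discipline: once you have peeked at the holdout, it is tainted, and any further tuning against it overstates your accuracy.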
22
00:01:51,659 --> 00:01:56,189
We've talked a lot about overfitting, and
this is basically the same kind of problem.
23
00:01:56,189 --> 00:02:00,820
Of course, you know not to test on the training
set.
24
00:02:00,820 --> 00:02:05,030
We've talked about that endlessly throughout
this course.
25
00:02:05,030 --> 00:02:09,370
Data that's been used for development in any
way is tainted.
26
00:02:09,370 --> 00:02:14,650
Any time you use some data to help you make
a choice of the filter, or the classifier,
27
00:02:14,650 --> 00:02:20,250
or how you're going to treat your problem,
then that data is tainted.
28
00:02:20,250 --> 00:02:24,470
You should be using completely fresh data
to get evaluation results.
29
00:02:24,470 --> 00:02:29,400
Leave some evaluation data aside for the very
end of the process.
30
00:02:29,400 --> 00:02:34,239
That's the first piece of advice.
31
00:02:34,239 --> 00:02:38,280
Another thing I haven't told you about in
this course so far is missing values.
32
00:02:38,280 --> 00:02:45,280
In real datasets, it's very common that some
of the data values are missing.
33
00:02:45,370 --> 00:02:46,579
They haven't been recorded.
34
00:02:48,220 --> 00:02:53,579
They might be unknown; we might have forgotten
to record them; they might be irrelevant.
35
00:02:55,810 --> 00:03:00,310
There are two basic strategies for dealing
with missing values in a dataset.
36
00:03:00,310 --> 00:03:05,970
You can omit instances where the attribute
value is missing, or somehow find a way of
37
00:03:05,970 --> 00:03:08,780
omitting that particular attribute in that
instance.
38
00:03:08,780 --> 00:03:13,260
Or you can treat missing as a separate possible
value.
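The two strategies can be sketched in Python (a toy illustration with a hypothetical miniature dataset, not Weka code):

```python
MISSING = "?"  # ARFF-style marker for a missing value

# Hypothetical (outlook, play) pairs; one outlook value is missing.
data = [("sunny", "no"), ("?", "no"), ("overcast", "yes"), ("rainy", "yes")]

# Strategy 1: omit instances where the attribute value is missing.
complete_only = [(x, y) for (x, y) in data if x != MISSING]

# Strategy 2: treat "missing" as just another possible value.
values_seen = {x for (x, _) in data}  # "?" becomes its own category

print(len(complete_only))   # 3 instances survive
print(sorted(values_seen))  # ['?', 'overcast', 'rainy', 'sunny']
```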
39
00:03:15,060 --> 00:03:20,790
You need to ask yourself, is there significance
in the fact that a value is missing? They
40
00:03:20,799 --> 00:03:24,419
say that if you've got something wrong with
you and go to the doctor, and he does some
41
00:03:24,419 --> 00:03:30,370
tests on you: if you just record the tests
that he does -- not the results of the tests,
42
00:03:30,370 --> 00:03:34,669
but just the ones he chooses to do -- there's
a very good chance that you can work out what's
43
00:03:34,669 --> 00:03:39,919
wrong with you just from the existence of
the tests, not from their results.
44
00:03:39,919 --> 00:03:43,180
That's because the doctor chooses tests intelligently.
45
00:03:43,180 --> 00:03:48,680
The fact that he doesn't choose a test doesn't
mean that that value is just accidentally
46
00:03:48,680 --> 00:03:49,660
not there.
47
00:03:49,660 --> 00:03:54,139
There's huge significance in the fact that
he's chosen not to do certain tests.
48
00:03:54,139 --> 00:03:59,019
This is a situation where "missing" should
be treated as a separate possible value.
49
00:03:59,019 --> 00:04:03,709
There's significance in the fact that a value
is missing.
50
00:04:03,709 --> 00:04:06,959
But in other situations, a value might be
missing simply because a piece of equipment
51
00:04:06,959 --> 00:04:11,180
malfunctioned, or for some other reason -- maybe
someone forgot something.
52
00:04:11,180 --> 00:04:16,799
Then there's no significance in the fact that
it's missing.
53
00:04:16,799 --> 00:04:20,850
Pretty well all machine learning algorithms
deal with missing values.
54
00:04:20,850 --> 00:04:25,889
In an ARFF file, if you put a question mark
as a data value, that's treated as a missing
55
00:04:25,889 --> 00:04:27,600
value.
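For illustration, here is roughly what that looks like, in the style of Weka's nominal weather data (abbreviated; the "?" in the second data line marks a missing outlook value):

```
@relation weather.symbolic
@attribute outlook {sunny, overcast, rainy}
@attribute temperature {hot, mild, cool}
@attribute humidity {high, normal}
@attribute windy {TRUE, FALSE}
@attribute play {yes, no}

@data
sunny,hot,high,FALSE,no
?,hot,high,TRUE,no
overcast,hot,high,FALSE,yes
```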
56
00:04:27,600 --> 00:04:30,530
All methods in Weka can deal with missing
values.
57
00:04:30,530 --> 00:04:33,759
But they make different assumptions about
them.
58
00:04:33,759 --> 00:04:39,460
If you don't appreciate this, it's easy to
get misled.
59
00:04:39,460 --> 00:04:45,550
Let me just take two simple and well known
(to us) examples -- OneR and J48.
60
00:04:45,550 --> 00:04:47,460
They deal with missing values in different
ways.
61
00:04:47,460 --> 00:05:00,740
I'm going to load the nominal weather data
and run OneR on it: I get 43%.
62
00:05:00,740 --> 00:05:10,600
Let me run J48 on it: I get 50%.
63
00:05:10,600 --> 00:05:11,750
I'm going to
64
00:05:11,750 --> 00:05:21,940
edit this dataset by changing the value of
"outlook" for the first four "no" instances
65
00:05:21,940 --> 00:05:24,040
to "missing".
66
00:05:24,040 --> 00:05:26,580
That's how we do it here in this editor.
67
00:05:26,580 --> 00:05:32,060
If we were to write this file out in ARFF
format, we'd find that these values are written
68
00:05:32,060 --> 00:05:36,600
into the file as question marks.
69
00:05:37,380 --> 00:05:42,870
Now, if we look at "outlook", you can see
that it says here there are 4 missing values.
70
00:05:42,870 --> 00:05:49,870
If you count up these labels -- 2, 4, and
4 -- that's 10 labels.
71
00:05:50,350 --> 00:05:54,370
Plus another 4 that are missing, to make the
14 instances.
72
00:05:54,370 --> 00:06:00,120
Let's go back to J48 and run it again.
73
00:06:00,120 --> 00:06:02,400
We still get 50%, the same result.
74
00:06:03,400 --> 00:06:09,620
Of course, this is a tiny dataset, but the
fact is that the results here are not affected
75
00:06:09,620 --> 00:06:12,530
by the fact that a few of the values are missing.
76
00:06:12,530 --> 00:06:22,280
However, if we run OneR, I get a much higher
accuracy: 93%.
77
00:06:26,370 --> 00:06:31,660
The rule that I've got is "branch on outlook",
which is what we had before I think.
78
00:06:31,660 --> 00:06:36,590
Here it says there are 4 possibilities: if
it's sunny, it's a yes; if it's overcast it's
79
00:06:36,590 --> 00:06:41,130
a yes; if it's rainy, it's a yes; and if it's
missing, it's a no.
80
00:06:41,130 --> 00:06:45,870
Here, OneR is using the fact that a value
is missing as significant, as something you
81
00:06:45,870 --> 00:06:46,970
can branch on.
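The spirit of OneR's treatment can be sketched as follows (a minimal Python illustration of a one-attribute rule, not Weka's actual implementation; the data is a hypothetical version of the edited weather set, where the blanked-out instances were all "no"):

```python
from collections import Counter, defaultdict

def one_attribute_rule(values, classes):
    """Build a OneR-style rule for a single attribute: map each
    observed value (including "?") to its majority class."""
    by_value = defaultdict(Counter)
    for v, c in zip(values, classes):
        by_value[v][c] += 1
    return {v: counts.most_common(1)[0][0] for v, counts in by_value.items()}

outlook = ["?", "?", "?", "?", "sunny", "overcast", "rainy", "sunny"]
play    = ["no", "no", "no", "no", "yes", "yes", "yes", "yes"]

rule = one_attribute_rule(outlook, play)
print(rule)  # {'?': 'no', 'sunny': 'yes', 'overcast': 'yes', 'rainy': 'yes'}
```

Because "?" is counted like any other value, it gets its own branch in the rule, which is exactly how the missing values end up predicting "no" in the edited dataset.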
82
00:06:46,970 --> 00:06:53,010
Whereas if you were to look at a J48 tree,
it would never have a branch that corresponded
83
00:06:53,010 --> 00:06:54,280
to a missing value.
84
00:06:54,280 --> 00:06:56,160
It treats them differently.
85
00:06:56,160 --> 00:07:00,910
This is very important to know and remember.
86
00:07:00,910 --> 00:07:07,910
The final thing I want to tell you about in
this lesson is the "no free lunch" theorem.
87
00:07:08,290 --> 00:07:11,930
There's no free lunch in data mining.
88
00:07:11,930 --> 00:07:13,440
Here's a way to illustrate it.
89
00:07:13,440 --> 00:07:17,430
Suppose you've got a 2-class problem with
100 binary attributes.
90
00:07:17,430 --> 00:07:22,260
Let's say you've got a huge training set with
a million instances and their classifications
91
00:07:22,260 --> 00:07:25,690
in the training set.
92
00:07:25,690 --> 00:07:31,910
The number of possible instances is 2 to the
100 (2^100), because there are 100 binary
93
00:07:31,910 --> 00:07:33,120
attributes.
94
00:07:33,120 --> 00:07:34,980
And you know 10^6 of them.
95
00:07:34,980 --> 00:07:40,160
So you don't know the classes of 2^100 - 10^6
examples.
96
00:07:40,160 --> 00:07:47,780
Let me tell you that 2^100 - 10^6 is 99.999...%
of 2^100.
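The arithmetic is easy to check in Python (just for illustration):

```python
total = 2 ** 100   # number of possible instances: 100 binary attributes
known = 10 ** 6    # labelled instances in the training set

print(total)                    # 1267650600228229401496703205376
print(known / total)            # roughly 8e-25: the fraction we actually know
print((total - known) / total)  # rounds to 1.0: almost everything is unknown
```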
97
00:07:47,780 --> 00:07:52,220
There's this huge number of examples that
you just don't know the classes of.
98
00:07:52,220 --> 00:07:56,780
How could you possibly figure them out? If
you apply a data mining scheme to this, it
99
00:07:56,780 --> 00:08:02,130
will figure them out, but how could you possibly
figure out all of those things just from the
100
00:08:02,130 --> 00:08:06,750
tiny amount of data that you've been given?
101
00:08:06,750 --> 00:08:11,220
In order to generalize, every learner must
embody some knowledge or assumptions beyond
102
00:08:11,220 --> 00:08:14,440
the data it's given.
103
00:08:14,440 --> 00:08:18,680
Each learning algorithm implicitly provides
a set of assumptions.
104
00:08:18,680 --> 00:08:23,400
The best way to think about those assumptions
is to think back to the Boundary Visualizer
105
00:08:23,400 --> 00:08:26,320
we looked at in Lesson 4.1.
106
00:08:26,320 --> 00:08:30,150
You saw that different machine learning schemes
are capable of drawing different kinds of
107
00:08:30,150 --> 00:08:33,230
boundaries in instance space.
108
00:08:33,230 --> 00:08:39,530
These boundaries correspond to a set of assumptions
about the sort of decisions we can make.
109
00:08:39,530 --> 00:08:44,350
There's no universal best algorithm; there's
no free lunch.
110
00:08:44,350 --> 00:08:46,900
There's no single best algorithm.
111
00:08:46,900 --> 00:08:52,080
Data mining is an experimental science, and
that's why we've been teaching you how to
112
00:08:52,080 --> 00:08:55,010
experiment with data mining yourself.
113
00:08:56,240 --> 00:08:57,920
This is just a summary.
114
00:08:57,920 --> 00:09:02,250
Be skeptical: when people tell you about data
mining results and they say that it gets this
115
00:09:02,250 --> 00:09:07,450
kind of accuracy, then to be sure about that
you want to have them test their classifier
116
00:09:07,450 --> 00:09:12,570
on your new, fresh data that they've never
seen before.
117
00:09:12,570 --> 00:09:15,480
Overfitting has many faces.
118
00:09:15,480 --> 00:09:19,640
Different learning schemes make different
assumptions about missing values, which can
119
00:09:19,640 --> 00:09:21,400
really change the results.
120
00:09:21,400 --> 00:09:26,950
There is no universal best learning algorithm.
121
00:09:26,950 --> 00:09:32,240
Data mining is an experimental science, and
it's very easy to be misled by people quoting
122
00:09:32,240 --> 00:09:37,160
the results of data mining experiments.
123
00:09:37,160 --> 00:09:37,890
That's it for now.
124
00:09:37,890 --> 00:09:40,540
Off you go and do the activity.
125
00:09:40,540 --> 00:09:42,080
We'll see you in the next lesson.
126
00:09:42,080 --> 00:09:43,670
Bye for now!