1
00:00:16,770 --> 00:00:19,360
Hi! I'm sitting here in New Zealand.
2
00:00:19,360 --> 00:00:21,440
It's on the globe behind me.
3
00:00:21,440 --> 00:00:24,700
That's New Zealand, at the top of the world,
surrounded by water.
4
00:00:24,700 --> 00:00:27,540
But that's not where I'm from originally.
5
00:00:27,540 --> 00:00:30,790
I moved here about 20 years ago.
6
00:00:30,790 --> 00:00:35,790
Here on this map, of course, this is New Zealand
-- Google puts things with the north at the
7
00:00:35,790 --> 00:00:37,480
top, which is probably what you're used to.
8
00:00:37,480 --> 00:00:42,430
I came here from the University of Calgary
in Canada, where I was for many years.
9
00:00:42,430 --> 00:00:45,540
I used to be head of computer science for
the University of Calgary.
10
00:00:45,540 --> 00:00:50,860
But, originally, I'm from Belfast, Northern
Ireland, which is here in the United Kingdom.
11
00:00:50,860 --> 00:00:55,410
So, my accent actually is Northern Irish,
not New Zealand.
12
00:00:55,410 --> 00:00:59,010
This is not a New Zealand accent.
13
00:00:59,010 --> 00:01:03,440
We're going to talk here in the last lesson
of Class 3 about another machine learning
14
00:01:03,440 --> 00:01:08,330
method called nearest neighbor, or instance-based,
learning.
15
00:01:08,330 --> 00:01:15,330
When people talk about rote learning, they
just mean remembering stuff without really
16
00:01:15,360 --> 00:01:17,550
thinking about it.
17
00:01:17,550 --> 00:01:20,110
It's the simplest kind of learning.
18
00:01:20,110 --> 00:01:23,030
Nearest neighbor implements rote learning.
19
00:01:23,030 --> 00:01:28,750
It just remembers the training instances,
and then, to classify a new instance, it searches
20
00:01:28,750 --> 00:01:33,520
the training set for one that is most like
the new instance.
21
00:01:33,520 --> 00:01:36,530
The representation of the knowledge here is
just the set of instances.
22
00:01:36,530 --> 00:01:38,680
It's a kind of lazy learning.
23
00:01:38,680 --> 00:01:42,610
The learner does nothing until it has to make
some predictions.
24
00:01:42,610 --> 00:01:46,970
Confusingly, it's also called instance-based
learning.
25
00:01:46,970 --> 00:01:52,100
Nearest neighbor learning and instance-based
learning are the same thing.
26
00:01:52,100 --> 00:01:56,390
Here is just a little picture of 2-dimensional
instance space.
27
00:01:56,390 --> 00:02:02,330
The blue points and the white points are two
different classes -- yes and no, for example.
28
00:02:02,330 --> 00:02:06,330
Then we've got an unknown instance, the red
one.
29
00:02:06,330 --> 00:02:07,800
We want to know which class it's in.
30
00:02:07,800 --> 00:02:12,820
So, we simply find the closest instance in
each of the classes and see which is closest.
31
00:02:12,820 --> 00:02:15,050
In this case, it's the blue class.
32
00:02:15,050 --> 00:02:19,280
So, we would classify that red point as though
it belonged to the blue class.
33
00:02:19,280 --> 00:02:25,450
If you think about this, that's implicitly
drawing a line between the two clouds of points.
34
00:02:25,450 --> 00:02:32,410
It's a straight line here, the perpendicular
bisector of the line that joins the two closest points.
35
00:02:32,410 --> 00:02:36,570
The nearest neighbor method produces a linear
decision boundary.
36
00:02:36,570 --> 00:02:39,470
Actually, it's a little bit more complicated
than that.
37
00:02:39,470 --> 00:02:46,700
It produces a piecewise linear decision boundary,
sometimes made up of many little linear
38
00:02:46,700 --> 00:02:48,690
segments.
39
00:02:50,840 --> 00:02:54,860
Of course, the trick is: what do we mean by
"most like"?
40
00:02:54,860 --> 00:03:00,480
We need a similarity function, and conventionally,
people use the regular distance function,
41
00:03:00,480 --> 00:03:07,920
the Euclidean distance, which is the sum of
the squares of the differences between the attribute values.
42
00:03:07,920 --> 00:03:11,480
Actually, it's the square root of the sum
of the squares, but since we're just comparing
43
00:03:11,480 --> 00:03:14,220
two instances, we don't need to take the square root.
44
00:03:15,130 --> 00:03:20,510
Or, you might use the Manhattan or city block
distance, which is the sum of the absolute
45
00:03:20,510 --> 00:03:22,290
differences between the attribute values.
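
As a rough sketch in Java (illustrative code, not part of the lecture or of Weka), the two distance functions just described might look like this:

```java
// Sketch of the two distance functions described above, for two
// instances represented as arrays of numeric attribute values.
public class Distances {

    // Squared Euclidean distance: sum of squared attribute differences.
    // The square root is omitted because it doesn't change which
    // instance is nearest.
    static double squaredEuclidean(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    // Manhattan (city block) distance: sum of absolute differences.
    static double manhattan(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }
}
```
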
46
00:03:23,410 --> 00:03:26,830
Of course, I've been talking about numeric
attributes here.
47
00:03:26,830 --> 00:03:33,290
If attributes are nominal, we need to define the
distance between different attribute values.
48
00:03:34,050 --> 00:03:38,210
Conventionally, people just say the distance
is 1 if the attribute values are different
49
00:03:38,210 --> 00:03:39,960
and 0 if they are the same.
50
00:03:39,960 --> 00:03:44,820
It might be a good idea with nearest neighbor
learning to normalize the attributes so that
51
00:03:44,820 --> 00:03:49,780
they all lie between 0 and 1, so the distance
isn't skewed by some attribute that happens
52
00:03:49,780 --> 00:03:54,270
to be on some gigantic scale.
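
Here is a minimal sketch of those two conventions -- 0/1 distance for nominal values and min-max normalization of numeric values to [0, 1] using the training data's minimum and maximum. It's illustrative only, not Weka's internal code:

```java
// Illustrative per-attribute distance: nominal values contribute 0 or 1,
// numeric values are normalized to [0,1] before taking the difference,
// so no attribute on a gigantic scale dominates the total distance.
static double attributeDistance(boolean nominal, double a, double b,
                                double min, double max) {
    if (nominal) {
        return a == b ? 0.0 : 1.0;   // 0 if the values are the same, 1 if different
    }
    double range = max - min;
    if (range == 0) return 0.0;      // a constant attribute contributes nothing
    double na = (a - min) / range;   // normalize using training-set min/max
    double nb = (b - min) / range;
    return Math.abs(na - nb);
}
```
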
53
00:03:54,270 --> 00:03:57,070
What about noisy instances?
54
00:03:57,070 --> 00:04:04,000
If we have a noisy dataset, then by accident
we might find an incorrectly classified training
55
00:04:04,000 --> 00:04:06,850
instance as the nearest one to our test instance.
56
00:04:06,850 --> 00:04:12,360
You can guard against that by using k nearest neighbors.
57
00:04:12,360 --> 00:04:17,280
k might be 3 or 5, and you look for the 3
or the 5 nearest neighbors and choose the
58
00:04:17,280 --> 00:04:22,030
majority class amongst those when classifying
an unknown point.
59
00:04:22,030 --> 00:04:24,540
That's the k-nearest-neighbor method.
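
A minimal sketch of that majority vote over the k nearest neighbors, reusing the squaredEuclidean helper sketched earlier (again illustrative, not Weka's IBk implementation):

```java
import java.util.Arrays;
import java.util.Comparator;

// Minimal k-nearest-neighbor classification by majority vote.
// train[i] holds the attribute values of training instance i,
// labels[i] its class index; squaredEuclidean is the helper above.
static int classify(double[][] train, int[] labels, double[] test,
                    int k, int numClasses) {
    Integer[] order = new Integer[train.length];
    for (int i = 0; i < order.length; i++) order[i] = i;

    // Sort training instances by distance to the test instance.
    Arrays.sort(order, Comparator.comparingDouble(
            i -> squaredEuclidean(train[i], test)));

    // Count the classes of the k nearest and return the majority class.
    int[] votes = new int[numClasses];
    for (int i = 0; i < k && i < order.length; i++) {
        votes[labels[order[i]]]++;
    }
    int best = 0;
    for (int c = 1; c < numClasses; c++) {
        if (votes[c] > votes[best]) best = c;
    }
    return best;
}
```
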
60
00:04:24,540 --> 00:04:31,650
In Weka, it's called IBk (instance-based learning
with parameter k), and it's in the lazy class.
61
00:04:31,650 --> 00:04:38,000
Let's open the glass dataset.
62
00:04:40,230 --> 00:04:46,490
Go to Classify and choose the lazy classifier
IBk.
63
00:04:48,430 --> 00:04:49,960
Let's just run it.
64
00:04:49,960 --> 00:04:56,960
We get an accuracy of 70.6%.
65
00:04:57,150 --> 00:04:59,650
The model is not really printed here, because
there is no model.
66
00:04:59,650 --> 00:05:02,520
It's just the set of training instances.
67
00:05:03,440 --> 00:05:05,960
We're using 10-fold cross-validation, of course.
68
00:05:05,960 --> 00:05:12,960
Let's change the value of k; this kNN parameter
is the k value.
69
00:05:15,070 --> 00:05:16,470
It's set by default to 1.
70
00:05:16,470 --> 00:05:23,470
(The number of neighbors to use.) We'll change
that to, say, 5 and run that.
71
00:05:24,840 --> 00:05:31,380
In this case, we get a slightly worse result,
67.8% with k set to 5.
72
00:05:32,120 --> 00:05:35,010
This is not such a noisy dataset, I guess.
73
00:05:35,010 --> 00:05:41,810
Let's change it to 20 and run it again.
74
00:05:41,810 --> 00:05:44,840
We get 65% accuracy, slightly worse again.
75
00:05:45,440 --> 00:05:52,280
If we had a noisy dataset, we might find that
the accuracy figures improved as k got a little
76
00:05:52,280 --> 00:05:53,490
bit larger.
77
00:05:53,490 --> 00:05:55,960
Eventually, though, it would always start to decrease again.
78
00:05:55,960 --> 00:06:00,940
If we set k to be an extreme value, close
to the size of the whole dataset, then we're
79
00:06:00,940 --> 00:06:06,550
effectively taking nearly all of the points
in the dataset as neighbors and averaging
80
00:06:06,550 --> 00:06:10,090
their classes, which will probably give us something
close to the baseline accuracy.
81
00:06:10,090 --> 00:06:15,440
Here, if I set k to be a ridiculous value
like 100,
82
00:06:15,440 --> 00:06:22,150
I'm going to take the 100 nearest instances
and average their classes.
83
00:06:22,150 --> 00:06:29,070
We get an accuracy of 35%, which, I think,
is pretty close to the baseline accuracy for
84
00:06:29,070 --> 00:06:30,090
this dataset.
85
00:06:30,090 --> 00:06:36,710
Let me just find that out with ZeroR: the
baseline accuracy is indeed 35%.
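
For anyone who prefers Weka's Java API to the Explorer, a sketch along these lines should reproduce the experiment; the path to glass.arff is an assumption, so adjust it for your installation:

```java
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.lazy.IBk;
import weka.classifiers.rules.ZeroR;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch: cross-validate IBk on the glass data for several values of k,
// then run ZeroR for the baseline. The dataset path is an assumption.
public class IBkGlass {
    public static void main(String[] args) throws Exception {
        Instances data = DataSource.read("glass.arff");
        data.setClassIndex(data.numAttributes() - 1);

        for (int k : new int[] {1, 5, 20, 100}) {
            IBk ibk = new IBk();
            ibk.setKNN(k);                          // the kNN parameter
            Evaluation eval = new Evaluation(data);
            eval.crossValidateModel(ibk, data, 10, new Random(1));
            System.out.printf("k=%d  accuracy=%.1f%%%n", k, eval.pctCorrect());
        }

        Evaluation baseline = new Evaluation(data);
        baseline.crossValidateModel(new ZeroR(), data, 10, new Random(1));
        System.out.printf("ZeroR baseline=%.1f%%%n", baseline.pctCorrect());
    }
}
```
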
86
00:06:40,280 --> 00:06:42,000
Nearest neighbor is a really good method.
87
00:06:42,000 --> 00:06:44,090
It's often very accurate.
88
00:06:44,090 --> 00:06:45,430
It can be slow.
89
00:06:45,430 --> 00:06:52,050
A simple implementation would involve scanning
the entire training dataset to make each prediction,
90
00:06:52,050 --> 00:06:56,690
because we've got to calculate the distance
of the unknown test instance from all of the
91
00:06:56,690 --> 00:06:59,320
training instances to see which is closest.
92
00:06:59,320 --> 00:07:03,190
There are more sophisticated data structures
that can make this faster, so you don't need
93
00:07:03,190 --> 00:07:06,300
to scan the whole dataset every time.
94
00:07:06,300 --> 00:07:09,900
It assumes all attributes are equally important.
95
00:07:09,900 --> 00:07:14,810
If that wasn't the case, you might want to
look at schemes for selecting or weighting
96
00:07:14,810 --> 00:07:18,240
attributes depending on their importance.
97
00:07:18,240 --> 00:07:23,220
If we've got noisy instances, then we can
use a majority vote over the k nearest neighbors,
98
00:07:23,220 --> 00:07:27,370
or we might weight instances according to
their prediction accuracy.
99
00:07:27,370 --> 00:07:32,850
Or, we might try to identify reliable prototypes,
one for each of the classes.
100
00:07:32,850 --> 00:07:34,270
This is a very old method.
101
00:07:34,270 --> 00:07:37,650
Statisticians have used k-nearest-neighbor
since the 1950s.
102
00:07:37,650 --> 00:07:41,210
There's an interesting theoretical result.
103
00:07:41,210 --> 00:07:49,090
If the number n of training instances approaches
infinity, and k also approaches infinity in
104
00:07:49,090 --> 00:07:58,550
such a way that k/n approaches 0, then the
error of the k-nearest-neighbor
105
00:07:58,550 --> 00:08:03,190
method approaches the theoretical minimum
error for that dataset.
106
00:08:03,190 --> 00:08:08,800
There is a theoretical guarantee that with
a huge dataset and large values of k, you're
107
00:08:08,800 --> 00:08:12,800
going to get good results from nearest neighbor
learning.
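
Stated a little more formally, as a restatement of the result above in symbols (the theoretical minimum is usually called the Bayes error):

```latex
% Asymptotic consistency of k-nearest-neighbor, as described above:
% if both n and k grow without bound while k/n shrinks to zero,
% the error rate approaches the minimum achievable (Bayes) error.
\[
  n \to \infty,\quad k \to \infty,\quad \frac{k}{n} \to 0
  \;\Longrightarrow\;
  \mathrm{error}(k\text{-NN}) \to \mathrm{error}_{\mathrm{Bayes}}
\]
```
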
108
00:08:12,800 --> 00:08:16,800
There's a section in the text, Section 4.7
on Instance-based learning.
109
00:08:16,800 --> 00:08:19,830
This is the last lesson of Class 3.
110
00:08:19,830 --> 00:08:23,300
Off you go and do the activity, and I'll see
you in Class 4.
111
00:08:23,300 --> 00:08:25,030
Bye for now!