1
00:00:17,480 --> 00:00:26,320
Hi! This is the last lesson in the course
Data mining with Weka, Lesson 5.4 - Summary.
2
00:00:26,320 --> 00:00:31,300
We'll just have a quick summary of what we've
learned here.
3
00:00:31,300 --> 00:00:36,839
One of the main points I've been trying to
convey is that there's no magic in data mining.
4
00:00:36,839 --> 00:00:42,710
There's a huge array of alternative techniques,
and they're all fairly straightforward algorithms.
5
00:00:42,710 --> 00:00:45,170
We've seen the principles of many of them.
6
00:00:45,170 --> 00:00:50,170
Perhaps we don't understand the details, but
we've got the basic idea of the main methods
7
00:00:50,170 --> 00:00:53,329
of machine learning used in data mining.
8
00:00:53,329 --> 00:00:57,620
And there is no single, universal best method.
9
00:00:57,620 --> 00:01:00,899
Data mining is an experimental science.
10
00:01:00,899 --> 00:01:06,070
You need to find out what works best on your
problem.
11
00:01:06,070 --> 00:01:07,780
Weka makes it easy for you.
12
00:01:07,780 --> 00:01:11,210
Using Weka you can try out different methods,
you can try out different filters, different
13
00:01:11,210 --> 00:01:12,960
learning methods.
14
00:01:12,960 --> 00:01:14,910
You can play around with different datasets.
15
00:01:14,910 --> 00:01:17,160
It's very easy to do experiments in Weka.
16
00:01:17,160 --> 00:01:22,030
Perhaps you might say it's too easy, because
it's important to understand what you're doing,
17
00:01:22,030 --> 00:01:25,729
not just blindly click around and look at
the results.
18
00:01:25,729 --> 00:01:30,569
That's what I've tried to emphasize in this
course -- understanding and evaluating what
19
00:01:30,569 --> 00:01:31,660
you're doing.
20
00:01:31,660 --> 00:01:36,759
There are many pitfalls you can fall into
if you don't really understand what's going
21
00:01:36,759 --> 00:01:38,030
on behind the scenes.
22
00:01:38,030 --> 00:01:43,649
It's not a matter of just blindly applying
the tools in the workbench.
23
00:01:43,649 --> 00:01:48,550
We've stressed in the course the focus on
evaluation, evaluating what you're doing,
24
00:01:48,550 --> 00:01:54,950
and the significance of the results of the
evaluation.
25
00:01:54,950 --> 00:01:57,679
Different algorithms differ in performance,
as we've seen.
26
00:01:57,679 --> 00:02:00,789
In many problems, it's not a big deal.
27
00:02:00,789 --> 00:02:06,060
The differences between the algorithms are
really not very important in many situations,
28
00:02:06,060 --> 00:02:12,080
and you should perhaps be spending more time
looking at the features and how the problem
29
00:02:12,080 --> 00:02:19,340
is described and the operational context that
you're working in, rather than stressing about
30
00:02:19,349 --> 00:02:21,680
getting the absolute best algorithm.
31
00:02:21,680 --> 00:02:25,280
It might not make all that much difference
in practice.
32
00:02:25,280 --> 00:02:29,080
Use your time wisely.
33
00:02:29,080 --> 00:02:31,709
There's a lot of stuff that we've missed out.
34
00:02:31,709 --> 00:02:35,569
I'm really sorry I haven't been able to cover
more of this stuff.
35
00:02:35,569 --> 00:02:41,299
There's a whole technology of filtered classifiers,
where you want to filter the training data,
36
00:02:41,299 --> 00:02:43,099
but not the test data.
37
00:02:43,099 --> 00:02:49,230
That's especially true when you've got a supervised
filter, where the results of the filter depend
38
00:02:49,230 --> 00:02:53,760
on the class values of the training instances.
39
00:02:53,760 --> 00:02:59,069
You want to filter the training data, but
not the test data, or maybe take a filter
40
00:02:59,069 --> 00:03:04,650
designed for the training data and apply the
same filter to the test data without re-optimizing
41
00:03:04,650 --> 00:03:06,590
it for the test data, which would be cheating.
42
00:03:08,800 --> 00:03:11,290
You often want to do this during cross-validation.
43
00:03:11,290 --> 00:03:15,639
The trouble in Weka is that you can't get
hold of those cross-validation folds; it's
44
00:03:15,639 --> 00:03:17,469
all done internally.
45
00:03:17,469 --> 00:03:21,819
Filtered classifiers are a simple way of dealing
with this problem.
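The core idea behind a filtered classifier can be sketched outside Weka. This is not Weka's API (in Weka it's the FilteredClassifier meta-classifier); it's just the underlying principle, shown here with a simple standardizing filter whose parameters are learned from the training fold only and then applied unchanged to the test fold.

```python
# Conceptual sketch (not Weka's API): a data-dependent filter is fitted
# on the training fold only, then applied unchanged to the test fold,
# so no information from the test data leaks into the preprocessing.

def fit_standardizer(train):
    """Learn mean and standard deviation from the training data only."""
    n = len(train)
    mean = sum(train) / n
    var = sum((x - mean) ** 2 for x in train) / n
    std = var ** 0.5 or 1.0  # guard against zero variance
    return mean, std

def apply_standardizer(params, data):
    """Apply a previously fitted filter without re-optimizing it."""
    mean, std = params
    return [(x - mean) / std for x in data]

train = [1.0, 2.0, 3.0, 4.0]
test = [2.0, 6.0]

params = fit_standardizer(train)               # fitted on training fold only
train_std = apply_standardizer(params, train)
test_std = apply_standardizer(params, test)    # same filter, not re-fitted
```

Re-fitting the filter on the test data would be the "cheating" mentioned above: the test fold would then have influenced the preprocessing.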
46
00:03:21,819 --> 00:03:25,680
We haven't talked about costs of different
decisions and different kinds of errors, but
47
00:03:25,680 --> 00:03:29,510
in real life different errors have different
costs.
48
00:03:29,510 --> 00:03:35,999
We've talked about optimizing the error rate,
or the classification accuracy, but really,
49
00:03:35,999 --> 00:03:40,310
in most situations, we should be talking about
costs, not raw accuracy figures, and these
50
00:03:40,310 --> 00:03:43,519
are different things.
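A small hypothetical example makes the distinction concrete: two classifiers with identical accuracy can have very different total costs once the two error types are priced differently. The counts and costs below are invented for illustration.

```python
# Hypothetical illustration: same accuracy, very different cost,
# once false negatives are priced higher than false positives.

def accuracy(tp, fp, fn, tn):
    return (tp + tn) / (tp + fp + fn + tn)

def total_cost(tp, fp, fn, tn, fp_cost, fn_cost):
    # correct decisions cost nothing; each error type has its own price
    return fp * fp_cost + fn * fn_cost

# Confusion-matrix counts for two classifiers on the same 100 instances
a = dict(tp=40, fp=10, fn=10, tn=40)   # errors split evenly
b = dict(tp=48, fp=18, fn=2,  tn=32)   # more false alarms, far fewer misses

# Suppose a missed positive (fn) costs 10 times a false alarm (fp)
print(accuracy(**a), total_cost(**a, fp_cost=1, fn_cost=10))  # 0.8, 110
print(accuracy(**b), total_cost(**b, fp_cost=1, fn_cost=10))  # 0.8, 38
```

Both classifiers are 80% accurate, but under this cost matrix the second is almost three times cheaper. Weka supports this style of evaluation through cost-sensitive classification.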
51
00:03:43,519 --> 00:03:48,290
There's a whole panel in the Weka Explorer
for attribute selection, which helps you select
52
00:03:48,290 --> 00:03:55,099
a subset of attributes to use when learning,
and in many situations it's really valuable,
53
00:03:55,099 --> 00:04:00,209
before you do any learning, to select an appropriate
small subset of attributes to use.
54
00:04:01,950 --> 00:04:04,170
There are a lot of clustering techniques in
Weka.
55
00:04:04,170 --> 00:04:07,529
Clustering is where you want to learn something
even when there is no class value: you want
56
00:04:07,529 --> 00:04:12,060
to cluster the instances according to their
attribute values.
57
00:04:12,060 --> 00:04:16,380
Association rules are another kind of learning
technique where we're looking for associations
58
00:04:16,380 --> 00:04:17,630
between attributes.
59
00:04:17,630 --> 00:04:22,770
There's no particular class, but we're looking
for any strong associations between any of
60
00:04:22,770 --> 00:04:23,960
the attributes.
61
00:04:23,960 --> 00:04:27,639
Again, that's another panel in the Explorer.
62
00:04:27,639 --> 00:04:29,000
Text classification.
63
00:04:29,000 --> 00:04:35,740
There are some fantastic text filters in Weka
which allow you to handle textual data as
64
00:04:35,740 --> 00:04:41,010
words, or as characters, or n-grams (sequences
of three, four, or five consecutive characters).
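The character n-gram representation mentioned above can be sketched in a few lines. In Weka this is handled by text filters such as StringToWordVector; the snippet below is just the underlying idea, not Weka's API.

```python
# Sketch of character n-gram extraction: every overlapping run of
# n consecutive characters becomes a feature of the text.

def char_ngrams(text, n):
    """All overlapping character sequences of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("weka", 3))  # ['wek', 'eka']
```

A document is then represented by which n-grams it contains (or how often), which works even for noisy text where word boundaries are unreliable.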
65
00:04:42,200 --> 00:04:45,060
You can do text mining using Weka.
66
00:04:45,060 --> 00:04:52,000
Finally, we've focused exclusively on the
Weka Explorer, but the Weka Experimenter is
67
00:04:52,000 --> 00:04:54,340
also worth getting to know.
68
00:04:54,340 --> 00:04:59,880
We've done a fair amount of rather boring,
tedious calculations of means and standard
69
00:04:59,880 --> 00:05:07,260
deviations manually by changing the random-number
seed and running things again.
70
00:05:07,260 --> 00:05:09,400
That's very tedious to do by hand.
71
00:05:09,400 --> 00:05:12,840
The Experimenter makes it very easy to do
this automatically.
72
00:05:12,840 --> 00:05:19,900
So, there's a lot more to learn, and I'm wondering
if you'd be interested in an Advanced Data
73
00:05:19,900 --> 00:05:21,200
Mining with Weka course.
74
00:05:21,200 --> 00:05:25,630
I'm toying with the idea of putting one on,
and I'd like you to let us know what you think
75
00:05:25,630 --> 00:05:29,350
about the idea, and what you'd like to see
included.
76
00:05:30,940 --> 00:05:33,690
Let me just finish off here with a final thought.
77
00:05:33,690 --> 00:05:37,550
We've been talking about data, data mining.
78
00:05:37,550 --> 00:05:43,380
Data is recorded facts, a change of state
in the world, perhaps.
79
00:05:43,380 --> 00:05:48,360
That's the input to our data mining process,
and the output is information, the patterns
80
00:05:48,360 --> 00:05:54,050
-- the expectations -- that underlie that
data: patterns that can be used for prediction
81
00:05:54,050 --> 00:05:58,320
in useful applications in the real world.
82
00:05:58,320 --> 00:06:02,310
We're going from data to information.
83
00:06:02,310 --> 00:06:07,500
Moving up in the world of people, not computers,
"knowledge" is the accumulation of your entire
84
00:06:07,500 --> 00:06:14,500
set of expectations, all the information that
you have and how it works together -- a large
85
00:06:14,680 --> 00:06:20,550
store of expectations and the different situations
where they apply.
86
00:06:20,550 --> 00:06:24,690
Finally, I like to define "wisdom" as the
value attached to knowledge.
87
00:06:24,690 --> 00:06:32,610
I'd like to encourage you to be wise when
using data mining technology.
88
00:06:32,610 --> 00:06:33,910
You've learned a lot in this course.
89
00:06:33,910 --> 00:06:39,280
You've got a lot of power now that you can
use to analyze your own datasets.
90
00:06:39,280 --> 00:06:44,420
Use this technology wisely for the good of
the world.
91
00:06:44,420 --> 00:06:46,390
That's my final thought for you.
92
00:06:47,470 --> 00:06:51,640
There is an activity associated with this
lesson, a little revision activity.
93
00:06:51,640 --> 00:07:00,460
Go and do that, and then do the final assessment,
and we will send you your certificate if you
94
00:07:00,460 --> 00:07:02,060
do well enough.
95
00:07:02,060 --> 00:07:07,590
Good luck! It's been good talking to you,
and maybe we'll see you in an advanced version
96
00:07:07,590 --> 00:07:09,680
of this course.
97
00:07:09,680 --> 00:07:11,100
Bye for now!