1
00:00:17,630 --> 00:00:24,750
Hello again! This is the last class of Data
Mining with Weka, and we're going to step
2
00:00:24,750 --> 00:00:29,070
back a little bit and take a look at some
more global issues with regard to the data
3
00:00:29,070 --> 00:00:29,859
mining process.
4
00:00:29,859 --> 00:00:38,400
It's a short class with just four lessons:
the data mining process, pitfalls and pratfalls,
5
00:00:38,400 --> 00:00:41,730
data mining and ethics, and finally, a quick
summary.
6
00:00:42,760 --> 00:00:45,760
Let's get on with Lesson 5.1.
7
00:00:45,760 --> 00:00:50,570
This might be your vision of the data mining
process.
8
00:00:50,570 --> 00:00:53,100
You've got some data or someone gives you
some data.
9
00:00:53,100 --> 00:00:54,860
You've got Weka.
10
00:00:54,860 --> 00:01:00,720
You apply Weka to the data, you get some kind
of cool result from that, and everyone's happy.
11
00:01:02,820 --> 00:01:05,509
If so, I've got bad news for you.
12
00:01:05,509 --> 00:01:08,500
It's not going to be like that at all.
13
00:01:08,500 --> 00:01:11,579
Really, this would be a better way to think
about it.
14
00:01:11,579 --> 00:01:15,650
You're going to have a circle; you're going
to go round and round the circle.
15
00:01:15,650 --> 00:01:19,770
It's true that Weka is important -- it's in
the very middle of the circle here.
16
00:01:19,770 --> 00:01:26,069
It's going to be crucial, but it's only a
small part of what you have to do.
17
00:01:26,069 --> 00:01:30,590
Perhaps the biggest problem is going to be
to ask the right kind of question.
18
00:01:30,590 --> 00:01:37,380
You need to be answering a question, not just
vaguely exploring a collection of data.
19
00:01:38,420 --> 00:01:44,680
Then, you need to get together the data that
you can get hold of that gives you a chance
20
00:01:44,689 --> 00:01:49,329
of answering this question using data mining
techniques.
21
00:01:49,329 --> 00:01:50,950
It's hard to collect the data.
22
00:01:50,950 --> 00:01:56,670
You're probably going to have an initial dataset,
but you might need to add some demographic
23
00:01:56,670 --> 00:02:00,319
data, or some weather data, or some data about
other stuff.
24
00:02:00,319 --> 00:02:05,079
You're going to have to go to the web and
find more information to augment your dataset.
25
00:02:05,079 --> 00:02:11,819
Then you'll merge all that together: do some
database hacking to get a dataset that contains
26
00:02:11,819 --> 00:02:17,410
all the attributes that you think you might
need -- or that you think Weka might need.
27
00:02:17,410 --> 00:02:19,069
Then you're going to have to clean the data.
28
00:02:19,069 --> 00:02:24,890
The bad news is that real-world data is always
very messy.
29
00:02:24,890 --> 00:02:29,610
That's a long and painstaking process of looking
around, looking at the data, trying to understand it,
30
00:02:29,610 --> 00:02:35,390
trying to figure out what the anomalies
are and whether it's good to delete them or not.
31
00:02:35,390 --> 00:02:37,260
That's going to take a while.
32
00:02:37,260 --> 00:02:40,550
Then you're going to need to define some new
features, probably.
33
00:02:40,550 --> 00:02:44,810
This is the feature engineering process, and
it's the key to successful data mining.
34
00:02:44,810 --> 00:02:49,030
Then, finally, you're going to use Weka, of
course.
35
00:02:49,030 --> 00:02:54,860
You might go around this circle a few times
to get a nice algorithm for classification,
36
00:02:54,860 --> 00:03:00,420
and then you're going to need to deploy the
algorithm in the real world.
37
00:03:00,420 --> 00:03:03,340
Each of these processes is difficult.
38
00:03:04,340 --> 00:03:08,340
You need to think about the question that
you want to answer.
39
00:03:08,440 --> 00:03:13,330
"Tell me something cool about this data" is
not a good enough question.
40
00:03:13,330 --> 00:03:17,890
You need to know what you want to know from
the data.
41
00:03:17,890 --> 00:03:19,660
Then you need to gather it.
42
00:03:19,660 --> 00:03:23,110
There's a lot of data around, like I said
at the very beginning, but the trouble is
43
00:03:23,110 --> 00:03:30,110
that we need classified data to use classification
techniques in data mining.
44
00:03:30,290 --> 00:03:36,080
We need expert judgements on the data, expert
classifications, and there's not so much data
45
00:03:36,080 --> 00:03:42,810
around that includes expert classifications,
or correct results.
46
00:03:42,810 --> 00:03:45,680
They say that more data beats a clever algorithm.
47
00:03:45,680 --> 00:03:49,910
So rather than spending time trying to optimize
the exact algorithm you're going to use in
48
00:03:49,910 --> 00:03:53,670
Weka, you might be better employed in
getting more and more data.
49
00:03:53,670 --> 00:04:00,570
Then you've got to clean it, and like I said
before, real data is very mucky.
50
00:04:00,570 --> 00:04:04,650
That's going to be a painstaking matter of
looking through it and looking for anomalies.
51
00:04:04,650 --> 00:04:08,000
Feature engineering, the next step, is the
key to data mining.
52
00:04:08,000 --> 00:04:12,930
We'll talk about how Weka can help you a little
bit in a minute.
53
00:04:12,930 --> 00:04:16,340
Then you've got to deploy the result.
54
00:04:16,340 --> 00:04:18,490
Implementing it -- well, that's the easy part.
55
00:04:18,490 --> 00:04:24,430
The difficult part is to convince your boss
to use this result from this data mining process
56
00:04:24,430 --> 00:04:29,620
that he probably finds very mysterious and
perhaps doesn't trust very much.
57
00:04:29,620 --> 00:04:36,620
Getting anything actually deployed in the
real world is a pretty tough call.
58
00:04:37,060 --> 00:04:43,370
The key technical part of all this is feature
engineering, and Weka has a lot of filters
59
00:04:43,370 --> 00:04:44,200
that will help with this.
60
00:04:44,200 --> 00:04:46,150
Here are just a few of them.
61
00:04:46,150 --> 00:04:53,150
It might be worthwhile defining a new feature,
a new attribute that's a mathematical expression
62
00:04:54,530 --> 00:04:56,120
involving existing attributes.
63
00:04:56,120 --> 00:04:59,890
Or you might want to modify an existing attribute.
64
00:04:59,890 --> 00:05:05,240
With AddExpression, you can use any kind of
mathematical formula to create a new attribute
65
00:05:05,240 --> 00:05:08,050
from existing ones.
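As a rough illustration of that idea outside Weka, here is a minimal Python sketch that derives a new attribute from a formula over existing ones. The helper `add_expression` and the attribute names are hypothetical, made up for this example; this is not Weka's AddExpression filter or its formula syntax.

```python
def add_expression(rows, name, formula):
    """Return copies of the rows with one extra attribute computed by formula(row)."""
    return [{**row, name: formula(row)} for row in rows]


# Hypothetical dataset: the attribute names are invented for illustration.
rows = [
    {"height_m": 2.0, "weight_kg": 80.0},
    {"height_m": 1.5, "weight_kg": 45.0},
]

# New attribute defined as a mathematical expression over existing attributes.
with_bmi = add_expression(
    rows, "bmi", lambda r: r["weight_kg"] / r["height_m"] ** 2
)
```

The original rows are left untouched, so you can try several candidate features side by side.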
66
00:05:08,050 --> 00:05:13,730
You might want to normalize or center your
data, or standardize it statistically.
67
00:05:13,730 --> 00:05:18,210
Transform a numeric attribute to have a zero
mean -- that's "center".
68
00:05:18,210 --> 00:05:21,830
Or transform it to a given numeric range -- that's
"normalize".
69
00:05:21,830 --> 00:05:28,830
Or give it a zero mean and unit variance -- that's
a statistical operation called "standardization".
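To make the three operations concrete, here is a small Python sketch, not Weka code. The function names are made up for the example, and it assumes the population standard deviation for standardization; a particular tool may use the sample version instead.

```python
from statistics import mean, pstdev


def center(values):
    """Shift the values so that their mean is zero."""
    m = mean(values)
    return [v - m for v in values]


def normalize(values, lo=0.0, hi=1.0):
    """Rescale the values linearly into the range [lo, hi]."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant attribute has no spread to rescale.
        return [lo for _ in values]
    scale = (hi - lo) / (vmax - vmin)
    return [lo + (v - vmin) * scale for v in values]


def standardize(values):
    """Give the values zero mean and unit variance (population std assumed)."""
    m, s = mean(values), pstdev(values)
    if s == 0:
        return [0.0 for _ in values]
    return [(v - m) / s for v in values]


values = [2.0, 4.0, 6.0, 8.0]
centered = center(values)          # mean becomes zero
normalized = normalize(values)     # smallest maps to 0, largest to 1
standardized = standardize(values) # zero mean, unit variance
```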
70
00:05:30,530 --> 00:05:37,500
You might want to take those numeric attributes
and discretize them into nominal values.
71
00:05:37,500 --> 00:05:43,440
Weka has both supervised and unsupervised
attribute discretization filters.
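As one simple example of the unsupervised kind, here is a Python sketch of equal-width binning, which ignores the class and just cuts the numeric range into equal intervals. The function name, the bin labels and the sample values are assumptions for illustration, and this is not necessarily how Weka's own filters choose their cut points.

```python
def equal_width_bins(values, bins=3):
    """Map each numeric value to a nominal label by equal-width binning."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins
    if width == 0:
        # All values are identical, so everything falls into one bin.
        return ["bin0" for _ in values]
    labels = []
    for v in values:
        # The maximum value would land one past the last bin, so clamp it.
        index = min(int((v - lo) / width), bins - 1)
        labels.append(f"bin{index}")
    return labels


# Hypothetical numeric attribute, e.g. temperature readings.
temps = [64, 65, 68, 70, 75, 80, 85]
nominal = equal_width_bins(temps, bins=3)
```

A supervised method would instead use the class values to decide where the cut points go.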
72
00:05:44,790 --> 00:05:46,000
There are a lot of other transformations.
73
00:05:46,000 --> 00:05:51,480
For example, the PrincipalComponents transformation
involves a matrix analysis of the data to
74
00:05:51,480 --> 00:05:54,150
select the principal components in a linear space.
75
00:05:54,150 --> 00:05:58,920
That's mathematical, and Weka contains a good
implementation.
76
00:05:58,920 --> 00:06:04,220
RemoveUseless will remove attributes that
don't vary at all, or vary too much.
77
00:06:04,220 --> 00:06:07,800
Actually, I think we encountered that in one
of our activities.
78
00:06:07,800 --> 00:06:14,800
Then, there are a couple of filters that help
you deal with time series, when your instances
79
00:06:14,830 --> 00:06:17,300
represent a series over time.
80
00:06:17,300 --> 00:06:21,080
You probably want to take the difference between
one instance and the next, or a difference
81
00:06:21,080 --> 00:06:27,680
with some kind of lag -- one instance and
the one 5 before it, or 10 before it.
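The differencing idea can be sketched in a few lines of Python. This is an illustration only, not one of Weka's time series filters; the function name is hypothetical, and filling the first `lag` positions with `None` is an assumption about how to handle instances that have no predecessor.

```python
def lagged_difference(series, lag=1):
    """Replace each value with its difference from the value `lag` steps earlier."""
    if lag < 1:
        raise ValueError("lag must be at least 1")
    # The first `lag` instances have nothing to be compared with.
    missing = [None] * min(lag, len(series))
    diffs = [series[i] - series[i - lag] for i in range(lag, len(series))]
    return missing + diffs


series = [10, 12, 15, 19, 24]
step_1 = lagged_difference(series)         # one instance and the next
step_2 = lagged_difference(series, lag=2)  # one instance and the one 2 before it
```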
82
00:06:27,680 --> 00:06:33,650
These are just a few of the filters that Weka
contains to help you with your feature engineering.
83
00:06:33,650 --> 00:06:39,250
The message of this lesson is that Weka is
only a small part of the entire data mining
84
00:06:39,250 --> 00:06:41,810
process, and it's the easiest part.
85
00:06:41,810 --> 00:06:46,310
In this course, we've chosen to tell you about
the easiest part of the process! I'm sorry
86
00:06:46,310 --> 00:06:46,780
about that.
87
00:06:46,780 --> 00:06:50,230
The other bits are, in practice, much more
difficult.
88
00:06:50,230 --> 00:06:56,270
There's an old programmer's blessing: "May
all your problems be technical ones".
89
00:06:56,270 --> 00:07:01,170
It's the other problems -- the political problems
in getting hold of the data, and deploying
90
00:07:01,170 --> 00:07:06,610
the result -- those are the ones that tend
to be much more onerous in the overall data
91
00:07:06,610 --> 00:07:07,330
mining process.
92
00:07:07,330 --> 00:07:09,920
So good luck!
93
00:07:09,920 --> 00:07:12,400
There's some stuff about this in the course
text.
94
00:07:12,400 --> 00:07:17,810
Section 1.3 contains information on Fielded
Applications, all of which have gone through
95
00:07:17,810 --> 00:07:24,480
this kind of process in order to get them
out there and used in the field.
96
00:07:24,480 --> 00:07:26,200
There's an activity associated with this lesson.
97
00:07:26,200 --> 00:07:29,180
Off you go and do it, and we'll see you in
the next lesson.
98
00:07:29,180 --> 00:07:36,180
Bye for now!