1
00:00:16,160 --> 00:00:19,950
Hi! Welcome to Lesson 5.3 of Data Mining with
Weka.
2
00:00:19,950 --> 00:00:23,369
Before we start, I thought I'd show you where
I live.
3
00:00:23,369 --> 00:00:28,669
I told you before that I moved to New Zealand
many years ago.
4
00:00:28,669 --> 00:00:29,939
I live in a place called Hamilton.
5
00:00:29,939 --> 00:00:35,220
Let me just zoom in and see if we can find
Hamilton in the North Island of New Zealand,
6
00:00:35,220 --> 00:00:37,670
around the center of the North Island.
7
00:00:37,670 --> 00:00:44,030
This is where the University of Waikato is.
8
00:00:44,030 --> 00:00:47,660
Here is the university; this is where I live.
9
00:00:47,660 --> 00:00:52,160
This is my journey to work: I cycle every
morning through the countryside.
10
00:00:52,160 --> 00:00:53,930
As you can see, it's really nice.
11
00:00:53,930 --> 00:00:55,390
I live out here in the country.
12
00:00:55,390 --> 00:01:02,390
I'm a sheep farmer! I've got four sheep, three
in the paddock and one in the freezer.
13
00:01:02,500 --> 00:01:05,780
I cycle in -- it takes about half an hour
-- and I get to the university.
14
00:01:05,780 --> 00:01:11,970
I have the distinction of being able to go
from one week to the next without ever seeing
15
00:01:11,970 --> 00:01:16,090
a traffic light, because I live out on the
same edge of town as the university.
16
00:01:16,090 --> 00:01:21,500
When I get to the campus of the University
of Waikato, it's a very beautiful campus.
17
00:01:21,500 --> 00:01:23,060
We've got three lakes.
18
00:01:23,060 --> 00:01:27,349
There are two of the lakes, and another lake
down here.
19
00:01:27,349 --> 00:01:32,330
It's a really nice place to work! So I'm very
happy here.
20
00:01:32,330 --> 00:01:39,330
Let's move on to talk about data mining and
ethics.
21
00:01:39,530 --> 00:01:46,530
In Europe, they have a lot of pretty stringent
laws about information privacy.
22
00:01:47,000 --> 00:01:51,450
For example, if you're going to collect any
personal information about anyone, a purpose
23
00:01:51,450 --> 00:01:52,860
must be stated.
24
00:01:52,860 --> 00:01:57,750
The information should not be disclosed to
others without consent.
25
00:01:57,750 --> 00:02:01,390
Records kept on individuals must be accurate
and up to date.
26
00:02:01,390 --> 00:02:03,920
People should be able to review data about
themselves.
27
00:02:03,920 --> 00:02:08,110
Data should be deleted when it's no longer
needed.
28
00:02:08,110 --> 00:02:12,690
Personal information must not be transmitted
to other locations.
29
00:02:12,690 --> 00:02:17,390
Some data is too sensitive to be collected,
except in extreme circumstances.
30
00:02:17,390 --> 00:02:20,489
This is true in some countries in Europe,
particularly Scandinavia.
31
00:02:20,489 --> 00:02:24,230
It's not true, of course, in the United States.
32
00:02:24,230 --> 00:02:29,750
Data mining is about collecting and utilizing
recorded information, and it's good to be
33
00:02:29,750 --> 00:02:32,600
aware of some of these ethical issues.
34
00:02:32,600 --> 00:02:39,000
People often try to anonymize data so that
it's safe to distribute for other people to
35
00:02:39,000 --> 00:02:42,790
work on, but anonymization is much harder
than you think.
36
00:02:42,790 --> 00:02:44,760
Here's a little story for you.
37
00:02:44,760 --> 00:02:49,500
When Massachusetts released medical records
summarizing every state employee's hospital
38
00:02:49,500 --> 00:02:54,780
record in the mid-1990s, the Governor gave
a public assurance that it had been anonymized
39
00:02:54,780 --> 00:02:59,950
by removing all identifying information -- name,
address, and social security number.
40
00:02:59,950 --> 00:03:06,040
He was surprised to receive his own health
records (which included a lot of private information)
41
00:03:06,040 --> 00:03:11,040
in the mail shortly afterwards! People could
be re-identified from the information that
42
00:03:11,040 --> 00:03:13,490
was left there.
43
00:03:13,490 --> 00:03:18,220
There's been quite a bit of research done
on re-identification techniques.
44
00:03:18,220 --> 00:03:24,370
For example, using publicly available records
on the internet, 50% of Americans can be identified
45
00:03:24,370 --> 00:03:28,010
from their city, birth date, and sex.
46
00:03:28,010 --> 00:03:34,470
85% can be identified if you include their
zip code as well.
47
00:03:34,470 --> 00:03:40,140
There was some interesting work done on a
movie database.
48
00:03:40,140 --> 00:03:47,140
Netflix released a database of 100 million
records of movie ratings.
49
00:03:47,290 --> 00:03:51,810
They got individuals to rate movies on a scale
of 1 to 5, and they had a whole bunch of
50
00:03:51,810 --> 00:03:56,100
people doing this -- a total of 100 million
records.
51
00:03:56,100 --> 00:04:02,060
It turned out that you could identify 99%
of people in the database if you knew their
52
00:04:02,060 --> 00:04:06,420
ratings for 6 movies and approximately when
they saw them.
53
00:04:06,420 --> 00:04:11,650
Even if you only know their ratings for 2
movies, you can identify 70% of people.
54
00:04:11,650 --> 00:04:16,349
This means you can use the database to find
out the other movies that these people watched.
55
00:04:16,349 --> 00:04:19,300
They might not want you to know that.
56
00:04:19,300 --> 00:04:25,500
Re-identification is remarkably powerful,
and it is incredibly hard to anonymize data
57
00:04:25,500 --> 00:04:30,660
effectively in a way that doesn't destroy
the value of the entire dataset for data mining
58
00:04:30,660 --> 00:04:33,310
purposes.
59
00:04:33,310 --> 00:04:37,540
Of course, the purpose of data mining is to
discriminate: that's what we're trying to do!
60
00:04:37,540 --> 00:04:42,070
We're trying to learn rules that discriminate
one class from another in the data -- who
61
00:04:42,070 --> 00:04:48,000
gets the loan? -- who gets a special offer?
But, of course, certain kinds of discrimination
62
00:04:48,000 --> 00:04:50,720
are unethical, not to mention illegal.
63
00:04:50,720 --> 00:04:56,570
For example, racial, sexual, and religious
discrimination is certainly unethical, and
64
00:04:56,570 --> 00:04:59,550
in most places illegal.
65
00:04:59,550 --> 00:05:01,910
But it depends on the context.
66
00:05:01,910 --> 00:05:06,500
Sexual discrimination is usually illegal ... except for doctors.
67
00:05:06,500 --> 00:05:11,350
Doctors are expected to take gender into account
when they make their diagnoses.
68
00:05:11,350 --> 00:05:16,400
They don't want to tell a man that he is pregnant,
for example.
69
00:05:16,400 --> 00:05:20,010
Also, information that appears innocuous may
not be.
70
00:05:20,010 --> 00:05:26,880
For example, area codes -- zip codes in the
US -- correlate strongly with race; membership
71
00:05:26,880 --> 00:05:29,100
of certain organizations correlates with gender.
72
00:05:29,100 --> 00:05:34,260
So although you might have removed the explicit
racial and gender information from your database,
73
00:05:34,260 --> 00:05:37,880
it might still be inferred from
other information that's there.
74
00:05:37,880 --> 00:05:48,550
It's very hard to deal with data: it has a way of revealing secrets about itself in unintended ways.
75
00:05:48,550 --> 00:05:55,550
Another ethical issue concerning data mining
is that correlation does not imply causation.
76
00:05:56,610 --> 00:06:02,169
Here's a classic example: as ice cream sales
increase, so does the rate of drownings.
77
00:06:02,169 --> 00:06:06,970
Therefore, ice cream consumption causes drowning?
Probably not.
78
00:06:06,970 --> 00:06:12,320
They're probably both caused by warmer temperatures
-- people going to beaches.
79
00:06:12,320 --> 00:06:17,800
What data mining reveals is simply correlations,
not causation.
80
00:06:17,800 --> 00:06:20,010
Really, we want causation.
81
00:06:20,010 --> 00:06:25,550
We want to be able to predict the effects
of our actions, but all we can look at using
82
00:06:25,550 --> 00:06:27,919
data mining techniques is correlation.
83
00:06:27,919 --> 00:06:34,919
To understand causation, you need a
deeper model of what's going on.
84
00:06:36,340 --> 00:06:40,150
I just wanted to alert you to some of the
issues, some of the ethical issues, in data
85
00:06:40,150 --> 00:06:46,790
mining, before you go away and use what you've
learned in this course on your own datasets:
86
00:06:46,790 --> 00:06:51,270
issues about the privacy of personal information;
the fact that anonymization is harder than
87
00:06:51,270 --> 00:06:57,650
you think; re-identification of individuals
from supposedly anonymized data is easier
88
00:06:57,650 --> 00:07:03,699
than you think; data mining and discrimination
-- it is, after all, about discrimination;
89
00:07:03,699 --> 00:07:08,250
and the fact that correlation does not imply
causation.
90
00:07:08,250 --> 00:07:13,729
There's a section in the textbook, Data mining
and ethics, which you can read for more background
91
00:07:13,729 --> 00:07:18,030
information, and there's a little activity
associated with this lesson, which you should
92
00:07:18,030 --> 00:07:20,190
go and do now.
93
00:07:20,190 --> 00:07:23,900
I'll see you in the next lesson, which is
the last lesson of the course.
94
00:07:23,900 --> 00:07:26,500
Bye for now!