Hi! Welcome to Lesson 5.3 of Data Mining with Weka. Before we start, I thought I'd show you where I live. I told you before that I moved to New Zealand many years ago. I live in a place called Hamilton. Let me just zoom in and see if we can find Hamilton in the North Island of New Zealand, around the center of the North Island.
This is where the University of Waikato is. Here is the university; this is where I live. This is my journey to work: I cycle every morning through the countryside. As you can see, it's really nice. I live out here in the country. I'm a sheep farmer! I've got four sheep, three in the paddock and one in the freezer. I cycle in -- it takes about half an hour -- and I get to the university. I have the distinction of being able to go from one week to the next without ever seeing a traffic light, because I live out on the same edge of town as the university. When I get to the campus of the University of Waikato, it's a very beautiful campus. We've got three lakes. There are two of the lakes, and another lake down here. It's a really nice place to work! So I'm very happy here.
Let's move on to talk about data mining and ethics. In Europe, they have a lot of pretty stringent laws about information privacy. For example, if you're going to collect any personal information about anyone, a purpose must be stated. The information should not be disclosed to others without consent. Records kept on individuals must be accurate and up to date. People should be able to review data about themselves. Data should be deleted when it's no longer needed. Personal information must not be transmitted to other locations. Some data is too sensitive to be collected, except in extreme circumstances. This is true in some countries in Europe, particularly Scandinavia. It's not true, of course, in the United States.
Data mining is about collecting and utilizing recorded information, and it's good to be aware of some of these ethical issues.
People often try to anonymize data so that it's safe to distribute for other people to work on, but anonymization is much harder than you think. Here's a little story for you. When Massachusetts released medical records summarizing every state employee's hospital record in the mid-1990s, the Governor gave a public assurance that it had been anonymized by removing all identifying information -- name, address, and social security number. He was surprised to receive his own health records (which included a lot of private information) in the mail shortly afterwards! People could be re-identified from the information that was left there.
There's been quite a bit of research done on re-identification techniques. For example, using publicly available records on the internet, 50% of Americans can be identified from their city, birth date, and sex. 85% can be identified if you include their zip code as well.
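To see why a few ordinary attributes can single people out, here is a minimal sketch. The records are made up for illustration, and the helper `fraction_unique` is a hypothetical name, not part of Weka or any library. It simply counts how many rows have a combination of values that nobody else shares.

```python
from collections import Counter

# Hypothetical toy records with names already removed:
# (city, birth_date, sex, zip_code)
records = [
    ("Springfield", "1970-01-01", "F", "02138"),
    ("Springfield", "1970-01-01", "F", "02139"),
    ("Springfield", "1982-05-17", "M", "02138"),
    ("Shelbyville", "1970-01-01", "F", "02140"),
    ("Shelbyville", "1991-11-30", "M", "02140"),
    ("Shelbyville", "1991-11-30", "M", "02140"),
]


def fraction_unique(rows, columns):
    """Fraction of rows whose values in `columns` occur exactly once."""
    keys = [tuple(row[c] for c in columns) for row in rows]
    counts = Counter(keys)
    unique = sum(1 for key in keys if counts[key] == 1)
    return unique / len(rows)


# City, birth date, and sex only.
without_zip = fraction_unique(records, [0, 1, 2])
# The same three attributes plus zip code.
with_zip = fraction_unique(records, [0, 1, 2, 3])

print("unique without zip code:", without_zip)
print("unique with zip code:", with_zip)
```

Anyone whose combination is unique can be matched against an outside source, such as a voter roll, that lists the same attributes next to a name. In this toy data, adding the zip code makes more rows unique, which is the same effect as in the figures above.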
There was some interesting work done on a movie database. Netflix released a database of 100 million records of movie ratings. They got individuals to rate movies on a scale of 1 to 5, and they had a whole bunch of people doing this -- a total of 100 million records. It turned out that you could identify 99% of people in the database if you knew their ratings for 6 movies and approximately when they saw them. Even if you only know their ratings for 2 movies, you can identify 70% of people. This means you can use the database to find out the other movies that these people watched. They might not want you to know that. Re-identification is remarkably powerful, and it is incredibly hard to anonymize data effectively in a way that doesn't destroy the value of the entire dataset for data mining purposes.
Of course, the purpose of data mining is to discriminate: that's what we're trying to do! We're trying to learn rules that discriminate one class from another in the data -- who gets the loan? -- who gets a special offer? But, of course, certain kinds of discrimination are unethical, not to mention illegal. For example, racial, sexual, and religious discrimination is certainly unethical, and in most places illegal.
But it depends on the context. Sexual discrimination is usually illegal … except for doctors. Doctors are expected to take gender into account when they make their diagnoses. They don't want to tell a man that he is pregnant, for example. Also, information that appears innocuous may not be. For example, area codes -- zip codes in the US -- correlate strongly with race; membership of certain organizations correlates with gender. So although you might have removed the explicit racial and gender information from your database, it might still be inferred from other information that's there. It's very hard to deal with data: it has a way of revealing secrets about itself in unintended ways.
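Here is a rough sketch of how a removed attribute can leak through a proxy. The data is invented, the zip codes and group labels are placeholders, and `majority_guess_accuracy` is a hypothetical helper. It guesses the removed attribute for each row as the most common value seen for that row's zip code, and compares that with guessing the overall most common value for everyone.

```python
from collections import Counter, defaultdict

# Hypothetical (zip_code, group) pairs. "group" stands for the sensitive
# attribute that was deleted from the released dataset.
rows = (
    [("10001", "A")] * 9 + [("10001", "B")] * 1
    + [("10002", "A")] * 2 + [("10002", "B")] * 8
)


def majority_guess_accuracy(pairs):
    """Accuracy of guessing the sensitive value from the proxy by majority vote."""
    by_proxy = defaultdict(Counter)
    for proxy, sensitive in pairs:
        by_proxy[proxy][sensitive] += 1
    # Within each proxy value, the majority guess is right for the largest group.
    correct = sum(counts.most_common(1)[0][1] for counts in by_proxy.values())
    return correct / len(pairs)


# Guessing with the zip code available.
proxy_accuracy = majority_guess_accuracy(rows)

# Guessing without it: always predict the overall most common group.
overall = Counter(sensitive for _, sensitive in rows)
baseline_accuracy = overall.most_common(1)[0][1] / len(rows)

print("accuracy using zip code:", proxy_accuracy)
print("accuracy without it:", baseline_accuracy)
```

The bigger the gap between the two numbers, the more the "innocuous" attribute reveals about the one you removed, and a learned rule that uses it may be discriminating on the removed attribute in all but name.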
Another ethical issue concerning data mining is that correlation does not imply causation. Here's a classic example: as ice cream sales increase, so does the rate of drownings. Therefore, ice cream consumption causes drowning? Probably not. They're probably both caused by warmer temperatures -- people going to beaches. What data mining reveals is simply correlations, not causation. Really, we want causation. We want to be able to predict the effects of our actions, but all we can look at using data mining techniques is correlation. To understand causation, you need a deeper model of what's going on.
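The ice cream example can be simulated in a few lines. This is only a sketch under assumptions I've made up: temperature drives both quantities, ice cream sales have no effect on drownings at all, and the coefficients and noise levels are arbitrary. The `pearson` function is a plain implementation of the usual correlation coefficient.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed model: temperature is the common cause of both variables.
temperature = [rng.uniform(5, 35) for _ in range(2000)]
ice_cream = [10 * t + rng.gauss(0, 30) for t in temperature]
drownings = [0.2 * t + rng.gauss(0, 1) for t in temperature]

# Across all days the two look strongly related.
overall = pearson(ice_cream, drownings)

# Hold temperature roughly fixed: keep only days between 20 and 22 degrees.
band = [
    (i, d)
    for t, i, d in zip(temperature, ice_cream, drownings)
    if 20 <= t < 22
]
within_band = pearson([i for i, _ in band], [d for _, d in band])

print("correlation over all days:", round(overall, 2))
print("correlation at similar temperatures:", round(within_band, 2))
```

Over all days the correlation is strong, but among days with about the same temperature it mostly disappears. The data alone shows the first number; you only think to compute the second if you already have a model that says temperature matters.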
I just wanted to alert you to some of the issues, some of the ethical issues, in data mining, before you go away and use what you've learned in this course on your own datasets: issues about the privacy of personal information; the fact that anonymization is harder than you think; re-identification of individuals from supposedly anonymized data is easier than you think; data mining and discrimination -- it is, after all, about discrimination; and the fact that correlation does not imply causation.
There's a section in the textbook, Data mining and ethics, which you can read for more background information, and there's a little activity associated with this lesson, which you should go and do now. I'll see you in the next lesson, which is the last lesson of the course. Bye for now!