Hi! Welcome to the course Data Mining with Weka. I'm Ian Witten from the University of Waikato in New Zealand, and I'm presenting the videos for this course, which is being prepared by the Department of Computer Science at the University of Waikato.
Data mining is a mature technology that a lot of people are beginning to take very seriously, and a lot of other people find very mysterious. The real aim of this course is to take the mystery out of data mining. This is a practical course on how to use the Weka workbench, which you will download as part of the course, for data mining. We explain the basic principles of several popular algorithms and how to use them in practical applications.
In the world today, we're overwhelmed with data. Every time you swipe your credit card, every item you checkout out at the supermarket, every time you send a text, make a phone call, send an email, or type a key on a computer even, every time you walk past a security camera -- it all generates a little bit of data in a database. Data mining is about going from the raw data to information, information that can be used to make predictions that are useful in the real world.
Let me give you an example. You're at the supermarket checkout. The till records every item you bought. At the end, you hand over your loyalty card; they give you a couple of percent off, and you give them your name and address, and, indirectly, access to all sorts of demographic information about you and people like you.
Everybody likes a bargain. It's been a good day today, because thanks to those coupons they sent you in the mail last week you've been able to stock up on some things you wouldn't normally have bought, but that you bought today because they are such a good deal. Next week, they'll send you some more coupons, and you'll go shopping again and buy some more stuff. They do some experiments on you -- you know, they try to figure out how much more you would buy if the price was just that little bit less.
These coupons are a kind of mechanism for personalized pricing. They have access to all sorts of data from you and people like you, in order to do these experiments and figure these things out. Everybody wins: you get your bargains, they sell more stuff. It sounds like a good deal to me.
Here's another application. Suppose you and your partner want a child, but you can't have one. It's fun trying, but it can get a little bit frustrating, and, ultimately, very frustrating -- perhaps even tragic. In artificial insemination, they take some eggs from the woman's ovaries; then they fertilize them with partner or donor sperm; and then they select from amongst the embryos produced some to implant back into the womb. You want to select the ones with the best chance of success of producing a live birth, but you don't want too many live births! The embryologist has access to all sorts of data on these embryos. I think there are 50-100 pieces of information that they record about individual embryos, and they have historical data on which ones produced a live birth -- a success.
So here's an ideal situation for data mining. We have lots of historical data; we have data on the present situation; and we want to select those embryos that have the best chance of success. Now, that's a good application for data mining -- bringing a child to a couple who want one.
I talk about data mining and machine learning. Data mining is the application, and machine learning is the algorithms we use. We're talking about using machine learning algorithms for the purposes of data mining.
This is "Data Mining with Weka," so the next question is, “What's Weka?” This is a weka here, this little bird. It's a flightless bird, like its better known cousin the kiwi, found only in the islands of New Zealand. This is what it sounds like, coming to you from New Zealand.
However, in our context Weka is a data mining workbench. It's an acronym for the Waikato Environment for Knowledge Analysis; we just call it Weka. It contains a large number of algorithms for classification, and a lot of algorithms for data preprocessing, feature selection, clustering, finding association rules, things like that. It's a very comprehensive workbench, and it's free, open source software that you will download as part of this course in the next lesson. It runs on any computer. It's written in Java, runs on Linux, Windows, Mac.
You'll download it, run it on your workstation, and use it during the course. You're going to learn how to load data into Weka and look at it. You're going to learn about preprocessing, cleaning up data using filters, exploring it using visualization, applying classification algorithms, interpreting the output, understanding evaluation methods (evaluation is very important in this area) understanding various representations for models and how popular machine learning algorithms work, and be aware of common pitfalls with data mining. The ultimate goal really is to empower you to use Weka on your own data, and -- most importantly -- to understand what it is you are doing.
This is the first class. In this class, you're going to get started with Weka. You're going to install it; explore the Weka Explorer interface and explore some data sets; build a classifier; interpret the output of the classifier; use filters; and visualize your data set. There's lots of things to do in this class. Here's the structure of the course.
There are five classes altogether. Each class consists of about six lessons. Class 1 is Getting Started with Weka. Then we're going to look at Evaluation in Class 2, Simple Classifiers in Class 3, More Classifiers in Class 4, and Putting it all Together in Class 5.
These are the six lessons in Class 1. Each lesson comprises a short video -- 5-10 minutes, like this one -- followed by an activity, an activity that involves you doing something yourself. You don't learn by me talking to you. You learn by actually doing things. So, we have lots of activities for you that involve using the Weka workbench.
In the middle of the class is a mid-class assessment, and at the end there is a post-class assessment. The marks for these are combined, and if you get more than 70%, you will receive a signed certificate from the University of Waikato certifying that you have completed this course. The activities are an important part of the course, but they are not part of the assessment. We really think you should do the activities, but you don't have to do them for assessment purposes. It's up to you.
As well as that, associated with the course is a textbook called Data Mining. It discusses
data mining and Weka in depth. It's a great book -- I know it's a great book because I wrote it myself, with a couple of friends. The publisher has kindly agreed to make available large chunks of this textbook to you online so that you can use it for background reading. It's only background reading. You don't have to read the textbook; it's there if you want to delve into some of the ideas and algorithms in more depth. What you need to do are the activities and the assessments, and watch the videos, of course. That's it.
I just thought I'd show you were I am. I'm in New Zealand, that's where Weka is from. That's where I'm sitting right now. This is the world as we see it in New Zealand. We're at the top, you're probably down at the bottom somewhere. We're at the top, in the center, and that arrow to the North Island of New Zealand is where the University of Waikato is.
That's it for now. There is an activity associated with this lesson, so go ahead and do it. Of course, you haven't learned very much in this lesson, so it's not a very important activity. Don't worry about it too much -- you're not expected to do a lot of reading to do this activity. Just have a go and see how you get on, and I'll see you again in the next lesson. I'm looking forward to that. Goodbye for now.