Hands-on Machine Learning

The goal of this presentation is to give a very applied, hands-on introduction to a range of Machine Learning techniques. There will be a quick discussion of what kinds of data you can have, what kinds of tasks machine learning techniques allow you to perform, and, finally, a survey of common techniques. By the end, if you know what kinds of data you have and what your goal is, you should be able to narrow things down to a short list of techniques that could be appropriate for your task.

Ideally, each technique will include a hands-on section for just that technique. This should cover any tools that implement the algorithm, how to get your data into those tools, and how to extract the model from those tools and incorporate it into your own code.

Input Data

There are basically just two sorts of input data: nominal and numeric.

Nominal values are things like "Red", "Orange", "Poltergeist", etc. They're a closed set of discrete values, typically strings instead of numbers. However, you can use numbers as nominal values, so long as they're from a closed set. Nominal values usually come from human judgment: there's not necessarily a good line between red and orange, but a human makes the call and records the value. Alternatively, a human decides where the arbitrary line is, and writes a bit of code that makes the decision based on that line.

Numeric values are exactly what they sound like: numbers! They can take many, many values, and are generally based on direct measurements. For example, the weight of a fruit a robot is holding, or the number of occurrences of a given word in a document.
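
To make the distinction concrete, here is a small sketch of one record that mixes both kinds of data. The color set, the field names, and the weight are made up for illustration, and one-hot encoding is just one common way to hand a nominal value to a tool that expects numbers.

```python
# Hypothetical closed set of colors; any value outside it is rejected.
COLORS = ("Red", "Orange", "Yellow")


def encode_nominal(value, closed_set):
    """One-hot encode a nominal value: one 0/1 slot per allowed value."""
    if value not in closed_set:
        raise ValueError(f"{value!r} is not in the closed set {closed_set}")
    return [1.0 if value == option else 0.0 for option in closed_set]


# A fruit described by one nominal feature (color) and one numeric
# feature (weight in grams). Both values are illustrative.
fruit = {"color": "Orange", "weight_g": 131.0}

# The numeric value passes through unchanged; the nominal value becomes
# three 0/1 slots, one per color in the closed set.
feature_vector = encode_nominal(fruit["color"], COLORS) + [fruit["weight_g"]]
print(feature_vector)  # [0.0, 1.0, 0.0, 131.0]
```

A value outside the closed set, such as "Poltergeist" here, raises an error instead of silently becoming a new category.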

Machine Learning Tasks

Generally speaking, you can ask a Machine Learning algorithm to do one of three things:

Numeric prediction

Description: "Given the input you've seen in the past, and this set of current values, what value should I expect?"

A trivial example: if you have a dataset of (yesterday's high temperature, today's high temperature) pairs, you could train a numeric predictor that would give you an estimate of today's high temperature given yesterday's. Or, given the highs for the last week, it could predict the highs for the next 3 days.
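
As a rough sketch of the temperature example, the simplest possible numeric predictor is a straight line fitted by least squares. The temperature pairs below are invented for illustration, and a straight line is an assumption made here for simplicity, not the only choice of predictor.

```python
def fit_line(pairs):
    """Least-squares fit of y = slope * x + intercept to (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up (yesterday's high, today's high) pairs, in degrees Celsius.
history = [(20.0, 21.0), (22.0, 22.0), (25.0, 24.0), (19.0, 20.0), (23.0, 23.0)]

# "Training" is fitting the line; "prediction" is evaluating it.
slope, intercept = fit_line(history)
yesterday = 24.0
print("estimated high for today:", slope * yesterday + intercept)
```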

Labelling/Classification

Description: "Given the input you've seen in the past, and this set of current values, what would you label this?"

The best-known example of this is the spam filter. After labelling a bunch of email as spam or not-spam, you train a classifier. Then, you can use that classifier on new email to decide whether the machine believes it to be spam or not. Of course, this generalizes, and you can use exactly the same technique to separate personal, work, and hobby email.

Clustering

Description: "Given the input you've seen in the past, and this set of current values, what previous inputs is it most like?"

This task is very similar to classification, with one big difference: you don't have labels. For example, if you have a bunch of measurements of flowers, you can use clustering to discover if there are underlying patterns you've missed out on, perhaps representing growing conditions or a difference in (sub)species.

Specific Techniques

Decision Trees

Description: A decision tree is something like a flow chart. It's a tree of decision boxes; you start at the root and, based on your data, follow decisions down to leaf nodes. At the leaf nodes, you'll typically have a label.
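
The flow-chart idea can be sketched with a hand-written tree. The tree below is invented for illustration (it was not learned from data), and the tuple-and-dict representation is just one convenient choice.

```python
# Internal nodes are (feature, {answer: subtree}) pairs; leaves are label strings.
# This tree is hand-written for illustration, not the output of a training tool.
TREE = ("color", {
    "yellow": ("shape", {"long": "banana", "round": "lemon"}),
    "red": "apple",
    "orange": "orange",
})


def classify(tree, example):
    """Start at the root and follow decisions down to a leaf label."""
    node = tree
    while not isinstance(node, str):
        feature, branches = node
        node = branches[example[feature]]
    return node


print(classify(TREE, {"color": "yellow", "shape": "long"}))  # banana
```

Applying a trained tree is this cheap: one dictionary lookup per decision box on the path from the root to a leaf.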

Training:

Evaluation:

Application:

Naive Bayes Classifier

Task: Labelling

Input data types: nominal

Description: Naive Bayes is a statistical technique for estimating the probability of each label given a set of inputs. For instance, let's assume we've trained a naive Bayes system on (color, kind of fruit) pairs. Then, we can ask it for the probability distribution of "kind of fruit" given the color "yellow." This will tell us that it's almost certainly a banana or lemon, but it could be an apple, and might occasionally be an orange, etc. That is, it returns a list of labels, each with an associated probability.
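
Here is a minimal sketch of the fruit example, assuming nominal features stored in dictionaries. The training counts are made up, and the add-one smoothing is an assumption added so that unseen (color, fruit) combinations get a small nonzero probability.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count labels and per-label feature values from (features, label) pairs."""
    label_counts = Counter()
    value_counts = defaultdict(Counter)  # (label, feature) -> counts of values
    feature_values = defaultdict(set)    # feature -> distinct values seen
    for features, label in examples:
        label_counts[label] += 1
        for feature, value in features.items():
            value_counts[(label, feature)][value] += 1
            feature_values[feature].add(value)
    return label_counts, value_counts, feature_values


def predict(model, features):
    """Return {label: probability} for the given nominal features."""
    label_counts, value_counts, feature_values = model
    total = sum(label_counts.values())
    scores = {}
    for label, count in label_counts.items():
        score = count / total  # prior P(label)
        for feature, value in features.items():
            # P(value | label) with add-one smoothing; the "naive" part is
            # multiplying these as if the features were independent.
            seen = value_counts[(label, feature)][value]
            score *= (seen + 1) / (count + len(feature_values[feature]))
        scores[label] = score
    norm = sum(scores.values())
    return {label: score / norm for label, score in scores.items()}


# Made-up training data: (features, kind of fruit).
examples = (
    [({"color": "yellow"}, "banana")] * 4
    + [({"color": "yellow"}, "lemon")] * 3
    + [({"color": "yellow"}, "apple")] * 1
    + [({"color": "red"}, "apple")] * 3
    + [({"color": "orange"}, "orange")] * 3
)

model = train(examples)
probabilities = predict(model, {"color": "yellow"})
for label, p in sorted(probabilities.items(), key=lambda item: -item[1]):
    print(f"{label}: {p:.2f}")
```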

Training:

Evaluation:

Application:

Support Vector Machines

Task: Labelling

Input data types: numeric

Description: Support Vector Machines work by finding a boundary (a line in two dimensions, a hyperplane in general) that separates data points with different labels. Their inputs are labelled points in a high-dimensional space.

Training: Tools that implement SVM training include libsvm and svmlight.
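
To get data into either tool, you write one example per line in a sparse text format: the label first, then index:value pairs with 1-based, ascending feature indices. A small sketch of a writer follows; the function name and the sample values are made up here, and the exact details are worth checking against each tool's own documentation.

```python
def to_svm_line(label, values):
    """Format one example as '<label> <index>:<value> ...' (1-based indices)."""
    parts = [str(label)]
    for index, value in enumerate(values, start=1):
        if value != 0:  # zero-valued features can be left out of the sparse format
            # ':g' drops trailing zeros but keeps only 6 significant digits.
            parts.append(f"{index}:{value:g}")
    return " ".join(parts)


# Two made-up labelled points in a 3-dimensional space.
print(to_svm_line(1, [0.5, 0.0, 2.0]))    # 1 1:0.5 3:2
print(to_svm_line(-1, [0.0, 0.0, 1.25]))  # -1 3:1.25
```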

Evaluation:

Application:

Polynomial Regression

Task: Numeric Prediction

Input data types: numeric

Description: Polynomial regression fits a polynomial curve to numeric data. This isn't technically machine learning; it's a classical curve-fitting technique, but it's often a good one to try as a baseline.
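
As a baseline sketch, a least-squares polynomial fit needs nothing beyond the standard library. This version solves the normal equations with Gaussian elimination, which is fine for low degrees and small datasets; a numerical library would be the safer choice for anything larger. The function names and sample points are made up for illustration.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares polynomial fit; returns coefficients, lowest power first."""
    size = degree + 1
    # Normal equations (X^T X) c = X^T y, where X[i][j] = xs[i] ** j.
    a = [[sum(x ** (row + col) for x in xs) for col in range(size)]
         for row in range(size)]
    b = [sum(y * x ** row for x, y in zip(xs, ys)) for row in range(size)]
    # Gaussian elimination with partial pivoting.
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        b[col], b[pivot] = b[pivot], b[col]
        for row in range(col + 1, size):
            factor = a[row][col] / a[col][col]
            for k in range(col, size):
                a[row][k] -= factor * a[col][k]
            b[row] -= factor * b[col]
    # Back substitution.
    coeffs = [0.0] * size
    for row in range(size - 1, -1, -1):
        tail = sum(a[row][k] * coeffs[k] for k in range(row + 1, size))
        coeffs[row] = (b[row] - tail) / a[row][row]
    return coeffs


def evaluate(coeffs, x):
    """Evaluate the fitted polynomial at x."""
    return sum(c * x ** power for power, c in enumerate(coeffs))


# Made-up points lying on y = 1 + 2x + 3x^2, so a degree-2 fit should
# recover roughly those three coefficients.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0 + 2.0 * x + 3.0 * x ** 2 for x in xs]
coeffs = fit_polynomial(xs, ys, degree=2)
print(coeffs)
print(evaluate(coeffs, 5.0))
```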

Training:

Evaluation:

Application:

Neural Networks

Task: Numeric Prediction

Input data types: numeric

Description: A neural network allows you to predict a number of continuous numeric values based on other continuous values. "A Neural Network is the second best way to solve any problem."
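
To show what "continuous values in, continuous values out" looks like, here is a sketch of applying an already-trained network with one hidden layer. The layer sizes, the tanh activation, and every weight below are made up for illustration; training, which would choose the weights, is not shown.

```python
import math


def forward(inputs, hidden_weights, hidden_biases, output_weights, output_biases):
    """Compute the outputs of a one-hidden-layer network for one input vector."""
    # Each hidden unit takes a weighted sum of the inputs, then applies tanh.
    hidden = [
        math.tanh(sum(w * x for w, x in zip(weights, inputs)) + bias)
        for weights, bias in zip(hidden_weights, hidden_biases)
    ]
    # Each output is a plain weighted sum of the hidden units (no activation),
    # so it can take any continuous value.
    return [
        sum(w * h for w, h in zip(weights, hidden)) + bias
        for weights, bias in zip(output_weights, output_biases)
    ]


# Hypothetical weights: 2 inputs -> 3 hidden units -> 1 output.
hidden_weights = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
hidden_biases = [0.0, 0.1, -0.1]
output_weights = [[1.0, -1.5, 0.7]]
output_biases = [0.2]

print(forward([1.0, 2.0], hidden_weights, hidden_biases, output_weights, output_biases))
```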

Training:

Evaluation:

Application:

k-Means Clustering

Task: Clustering

Input data types: numeric (nominal values need to be encoded as numbers first, since the algorithm averages feature values)

Description: k-Means clustering takes a set of feature vectors and splits them into k groups, assigning each vector to the group whose mean it is closest to. In a fruit-market universe, this will cluster all the "round, red, dense" things together, separate from the "orange, round, dense" things.
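
A minimal sketch of the usual k-means loop follows: assign each point to its nearest centroid, then move each centroid to the mean of its points, and repeat. The sample points are made up, and seeding the centroids with the first k points is a simplification chosen here to keep the run repeatable; real tools usually pick starting centroids more carefully.

```python
def k_means(points, k, iterations=20):
    """Cluster numeric feature vectors; returns (centroids, assignments)."""
    centroids = [list(p) for p in points[:k]]  # naive start: the first k points
    assignments = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid
        # (squared Euclidean distance).
        assignments = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return centroids, assignments


# Made-up two-feature measurements forming two obvious groups.
points = [(1.0, 1.0), (9.0, 9.0), (1.0, 2.0), (9.0, 8.0), (2.0, 1.0), (8.0, 9.0)]
centroids, assignments = k_means(points, k=2)
print(assignments)  # [0, 1, 0, 1, 0, 1]
print(centroids)
```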