Stanford Machine Learning - Lecture 5

Resources

This lecture will use more of the probability theory covered in the review notes here (Available from the handouts page of the SEE site).

The classifiers in this lecture are entirely based on probability and statistics.

Course Lecture Notes

This lecture moves on to the second set of lecture notes. (Available from the handouts page of the SEE site).

Again, I recommend going back and forth between the video and the lecture notes. After you watch a portion of the video, read the corresponding section in the lecture notes to ensure you understand it correctly.

Additional Lecture Notes

Unfortunately, Alex Holehouse’s notes do not cover this lecture as far as I can tell.

My Notes

Discriminative Algorithms and Generative Algorithms

The algorithms we’ve looked at so far are discriminative algorithms.

A discriminative algorithm learns p ( y

x) directly.

A generative algorithm models the probability density function of the features for a given class, p(x

y). Once you have this, you can compute the p (y

x) using Bayes’ Theorem.

Bayes’ Theorem allows you to compute the probability that an input x belongs to class y using the following equation.

Where:
p(y | x) is the probability that input ‘x’ belongs to class ‘y’.
p(x | y) is the probability that the value ‘x’ was generated by class ‘y’. Given by the pdf of each class.
p(y) is the frequency of class ‘y’.
p(x) is given by:

Gaussian Discriminant Analysis

**The basic idea here is that we will be assuming that the data points for each class have a gaussian distribution, and we’ll be finding the mean and variance of the gaussian for each class. Where the two gaussians intersect will become our decision boundary. This produces a slightly different decision boundary than gradient descent.

The graph below shows the pdf for a healthy person’s temperature in Celsius (blue graph) and a sick person’s temperature (red). These are p(x

y = 0) and p(x

y = 1).

You can use the point where they intersect as a threshold for determining whether a person is healthy or sick depending on their temperature.

Note that the Gaussian Discriminant doesn’t involve gradient descent or anything like that. You don’t have to “fit” a gaussian to the data, you can compute the parameters of the gaussian directly from your training set.

So how is p (y

x) different from p (x

y)? In the above example, p (y

x) is the probability that a person with a temperature of 105F is sick , while p(x

y) is the probability that a sick person will have a temperature of 105F. You can see how they’re different–it’s very likely this person is sick, but its uncommon for a sick person to have a temperature that high.

If you plot p(y

x) on the graph, what you’ll get is a sigmoid function that hits 0.5 right where the two gaussians intersect.

The above graph shows the pdf’s again, but also shows the graphs for p(y

x). For example, the blue line shows the computed probability that a person is sick given their temperature x.

This graph was made in Excel with discrete values, so that’s why the lines aren’t smoother.

Because these two distributions have such different variances, there is a second crossover point around 36 degrees Celsius. If both distributions had a standard deviation of 1, the probability functions p(y

x) would look more like the one professor Ng draws:

For classification, an input ‘x’ is assigned to the class with the highest probability p(y

x). The equation for expressing this is a little confusing at first glance:

“arg max over y” means find the value y which “maximizes” the value of the function p(y

x) for our input x.

Gaussian Discriminant and Logistic Regression

The difference between this and logistic regression is that the class boundary will be slightly different, as will the steepness of the sigmoid.

Pros and Cons of Generative Algorithms

**With the Gaussian discriminant classifier, it only works if the training data actually has a Gaussian distribution.

Logistic regression will do better if the data is not actually Gaussian.

Gaussian discriminant analysis can often perform well with less data. It sounds like this is because you’re providing the algorithm with more information by essentially specifying that the data has a Gaussian distribution.

With logistic regression, you’re not making a statement about the data distribution, so it can require more data, but can be more robust to your model selection.

Naive Bayes

To implement a Naive Bayes classifier for spam:

For every word in your dictionary:

1. Compute the fraction of spam e-mails which contain the word, and the fraction that do not contain it.
2. Compute the fraction of non-spam e-mails which contain the word, and the fraction that do not contain it.

The classifier uses all of these fractions to compute the probability that a given e-mail is spam.

****

The fraction of spam e-mails which contained word j.
__

Laplace Smoothing

In the spam filter example, what happens if no spam e-mails contain the word aardvark? Then you’d have 0 term in your equations.

Laplace smoothing for spam filter: just add 1 to the number of times you’ve seen a word, and add 2 to the number of examples you’ve seen. So you’ll get a very small, but non-0 probability of seeing that word.