If we have to distinguish between many similar classes (e.g., dog breeds), the problem is called fine-grained classification.

Support vector machines

They are learning algorithms used for binary classification of data. An SVM represents examples as points in space, separated by as wide a margin as possible. New examples are mapped into the same space and, depending on which side of the boundary they fall, are classified into one of the two categories.
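As a minimal sketch of this, assuming scikit-learn is available (the toy data is made up for illustration):

```python
import numpy as np
from sklearn.svm import SVC  # assumed available

# Toy binary classification data: two loose groups of points.
X = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y = np.array([0, 0, 1, 1])

clf = SVC(kernel="linear", C=1.0)  # fits a maximum-margin separator
clf.fit(X, y)

# New examples are classified by which side of the boundary they fall on.
print(clf.predict([[0.1, 0.0], [1.0, 0.9]]))  # -> [0 1]
```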

An SVM can use kernels, which place a number of landmarks in space and use the similarity between an example x and a landmark l as a feature in the hypothesis. There are multiple ways to compute this similarity:

Gaussian kernel (similarity): $f_1 = \text{similarity}(x, l^{(1)}) = \exp\left(-\frac{\lVert x - l^{(1)} \rVert^2}{2\sigma^2}\right)$

Figure: Gaussian distribution for different parameters.

Other kernels include the polynomial kernel, the string kernel, etc.

Usually, when training, we use each of the training examples as a landmark.
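A minimal sketch of computing kernel features from landmarks (NumPy assumed; the `gaussian_kernel` helper and toy data are illustrative, not from the course):

```python
import numpy as np

def gaussian_kernel(x, l, sigma=1.0):
    """Similarity between an example x and a landmark l (RBF kernel)."""
    return np.exp(-np.sum((x - l) ** 2) / (2 * sigma ** 2))

# Use every training example as a landmark, as described above.
X_train = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 3.0]])
landmarks = X_train

x_new = np.array([1.5, 1.5])
# Feature vector f for x_new: one similarity value per landmark.
f = np.array([gaussian_kernel(x_new, l) for l in landmarks])
print(f)  # values in (0, 1]; landmarks close to x_new give values near 1
```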

Natural Language Processing

Natural language processing is an area of computer science and artificial intelligence concerned with the interactions between computers and human languages, in particular how to program computers to process and analyze large amounts of natural language data.

Unsupervised learning

We give our algorithm a set of unlabeled data and ask it, for example, to find structure in the data.

Recommender systems - content based

n_u = number of users
n_m = number of movies
r(i,j) = 1 if user i has rated movie j
y(i,j) = rating given by user i to movie j (defined only if r(i,j) = 1)
θ^(j) = parameter vector for user j
x^(i) = feature vector for movie i
m^(j) = number of movies rated by user j (used to learn θ^(j))

For each user we learn a parameter vector θ^(j) of size n×1, and for each unrated movie i we predict that user j will rate it (θ^(j))^T x^(i) stars.

For a user with parameters θ and a movie with (learned) features x, predict the rating θ^T x.

Finding movies similar to movie i: find movies j where $\lVert x^{(i)} - x^{(j)} \rVert$ is small.
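A small sketch of prediction and similarity with NumPy (the feature and parameter values are made up):

```python
import numpy as np

# Toy data: 3 movies with n = 2 learned features, 2 users.
X = np.array([[0.90, 0.10],    # x^(1): mostly "romance"
              [0.20, 0.80],    # x^(2): mostly "action"
              [0.85, 0.15]])   # x^(3)
Theta = np.array([[5.0, 0.0],  # theta^(1): user 1 likes romance
                  [0.0, 5.0]]) # theta^(2): user 2 likes action

# Predicted rating of movie i by user j: (theta^(j))^T x^(i).
predictions = X @ Theta.T  # shape (n_movies, n_users)
print(predictions)

# Movies most similar to movie i: smallest ||x^(i) - x^(j)||.
i = 0
dists = np.linalg.norm(X - X[i], axis=1)
dists[i] = np.inf  # exclude the movie itself
print("most similar to movie", i, "is movie", int(np.argmin(dists)))
```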

Mean normalization

For each movie i we compute the mean rating μ_i over the users who rated it, subtract it from the ratings before learning, and add it back when predicting: (θ^(j))^T x^(i) + μ_i. This way, a user who has rated nothing is predicted each movie's mean rating instead of 0.
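A minimal sketch of mean normalization (NumPy assumed; the ratings matrix is illustrative):

```python
import numpy as np

# Y[i, j] = rating of movie i by user j; R[i, j] = 1 if rated.
Y = np.array([[5.0, 4.0, 0.0],
              [0.0, 3.0, 4.0]])
R = np.array([[1, 1, 0],
              [0, 1, 1]])

# Per-movie mean over the rated entries only.
mu = (Y * R).sum(axis=1) / R.sum(axis=1)
Y_norm = (Y - mu[:, None]) * R  # subtract the mean where rated

# After learning on Y_norm, predict (theta^(j))^T x^(i) + mu[i];
# a user with no ratings is thus predicted each movie's mean rating.
print(mu)
print(Y_norm)
```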

Anomaly Detection

Figure: Gaussian distributions fit to two features (red); anomalous examples shown in green.

Given a set of unlabeled examples, try to learn a way of telling whether new examples are anomalous. Examples:

given aircraft engine heat output and vibration as features, determine for new data points (new engines) the probability that they are defective

given a user's actions on a website as features, determine the probability that an action is fraudulent / suspicious

given some metrics for computers in a cluster, determine the probability that something is wrong with one of the computers

For this we can use the Gaussian distribution. We fit a Gaussian to each feature, learning its parameters μ and σ². Then, for a new example, we check how well its features fit those distributions. To find μ_j and σ_j² for feature j we use the following formulas:

$\mu_j = \frac{1}{m}\sum_{i=1}^m x_j^{(i)} \qquad \sigma_j^2 = \frac{1}{m}\sum_{i=1}^m \left(x_j^{(i)} - \mu_j\right)^2$
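A minimal sketch of fitting the per-feature Gaussians and flagging low-probability examples (NumPy assumed; the data and the threshold ε are illustrative):

```python
import numpy as np

def fit_gaussian(X):
    """Estimate per-feature mu and sigma^2 from an (m, n) matrix."""
    mu = X.mean(axis=0)
    sigma2 = ((X - mu) ** 2).mean(axis=0)
    return mu, sigma2

def p(x, mu, sigma2):
    """p(x) as the product of the per-feature Gaussian densities."""
    coef = 1.0 / np.sqrt(2 * np.pi * sigma2)
    return np.prod(coef * np.exp(-(x - mu) ** 2 / (2 * sigma2)))

X = np.random.randn(1000, 2) * [1.0, 2.0] + [5.0, 3.0]  # toy "good" data
mu, sigma2 = fit_gaussian(X)

epsilon = 1e-3  # flag examples whose density falls below this
x_new = np.array([9.0, 9.0])
print("anomalous" if p(x_new, mu, sigma2) < epsilon else "normal")
```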

Although anomaly detection is unsupervised, we can use a small amount of labeled data to evaluate it, as in supervised learning. Suppose we have 10,000 good examples and 20 anomalous ones. We can split them up as follows:

training set: 6000 good examples

cross validation set: 2000 good, 10 anomalous - we can use this to choose ε (see the sketch after this list)

test set: 2000 good, 10 anomalous
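One way to pick ε is to scan candidate values and keep the one with the best F1 score on the cross-validation set. A sketch, assuming scikit-learn's `f1_score` is available (`p_cv` would hold the densities p(x) computed for the CV examples, and y=1 marks an anomaly):

```python
import numpy as np
from sklearn.metrics import f1_score  # assumed available

def choose_epsilon(p_cv, y_cv):
    """Scan candidate thresholds; keep the one with the best F1 on the CV set."""
    best_eps, best_f1 = 0.0, 0.0
    for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
        preds = (p_cv < eps).astype(int)  # flag low-probability points
        f1 = f1_score(y_cv, preds, zero_division=0)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1
```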

When should we prefer anomaly detection over supervised learning?

| Anomaly detection | Supervised learning |
| --- | --- |
| Very small number of positive examples (y=1), large number of negative examples (y=0) | Large number of both positive and negative examples |
| Many different types of anomalies; hard to learn what an anomaly looks like | Enough positive examples for the algorithm to get a sense of what positive examples look like |
| Future anomalies may not resemble current ones | Future positive examples will resemble those in the training set |
| Example: fraud detection | Example: email spam classification |
| Example: monitoring machines in a data center | Example: weather prediction |

Non-Gaussian features

If a feature's distribution is non-Gaussian, we need to find a way to transform it into a more Gaussian one (e.g., replace x with log(x) or sqrt(x)).
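A small sketch of such a transformation (NumPy assumed; the skewed toy data is illustrative):

```python
import numpy as np

x = np.random.exponential(scale=2.0, size=10000)  # skewed, non-Gaussian
x_log = np.log(x + 1e-3)  # small offset guards against log(0)
x_sqrt = np.sqrt(x)
# In practice we inspect a histogram of each transform and keep
# whichever looks closest to a bell curve.
```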

Adding new features

We can combine features to generate new ones that quantify relationships between existing features. For example, when monitoring a server we might come up with the feature cpu_usage / network_traffic, as in the sketch below.
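A toy illustration of such an engineered feature (the numbers are made up):

```python
import numpy as np

cpu_usage = np.array([0.20, 0.30, 0.95])
network_traffic = np.array([100.0, 150.0, 5.0])  # requests/sec

# High CPU with little traffic is suspicious; the ratio captures that.
cpu_per_traffic = cpu_usage / network_traffic
print(cpu_per_traffic)  # the third machine stands out
```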

Multivariate Gaussian distribution

Figure: multivariate Gaussian distributions.

To capture relationships between features we can use a single multivariate Gaussian distribution instead of several independent univariate Gaussians.
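A sketch using SciPy's `multivariate_normal` (SciPy is an assumption; the correlated toy data is illustrative):

```python
import numpy as np
from scipy.stats import multivariate_normal  # assumed available

# Toy data with correlated features.
X = np.random.randn(500, 2) @ np.array([[1.0, 0.8], [0.0, 0.6]])

mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False)  # full covariance captures the correlations

dist = multivariate_normal(mean=mu, cov=Sigma)
x_new = np.array([2.0, -2.0])  # breaks the correlation pattern
print(dist.pdf(x_new))  # low density -> likely anomalous
```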

K-means

If a cluster centroid ends up without any points assigned to it, we can delete it (the common choice) or randomly reinitialize it.

Cost function (distortion): $J = \frac{1}{m}\sum_{i=1}^m \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2$, where $c^{(i)}$ is the index of the centroid to which example $x^{(i)}$ is assigned.

Random initialization

To initialize our centroids we pick K random points from the dataset and place the μ centroids at their positions. To reduce the risk of K-means getting stuck in a local optimum, especially for small K, we run the algorithm several times with different random initializations and keep the clustering with the lowest cost J.
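A sketch of K-means with random initialization and several restarts, keeping the run with the lowest cost J (NumPy assumed; `run_kmeans` is a hand-rolled illustration, not a library call):

```python
import numpy as np

def kmeans_cost(X, centroids, assignments):
    """Distortion J: mean squared distance to the assigned centroid."""
    return np.mean(np.sum((X - centroids[assignments]) ** 2, axis=1))

def run_kmeans(X, K, n_iters=100, rng=None):
    rng = rng or np.random.default_rng()
    # Random initialization: pick K distinct training points as centroids.
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    for _ in range(n_iters):
        # Assign each point to its nearest centroid.
        d = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        assign = d.argmin(axis=1)
        # Move each centroid to the mean of its assigned points.
        for k in range(K):
            if (assign == k).any():  # empty clusters could be re-seeded instead
                centroids[k] = X[assign == k].mean(axis=0)
    return centroids, assign

X = np.random.randn(300, 2)
# Several restarts; keep the run with the lowest cost J.
runs = [run_kmeans(X, K=3) for _ in range(10)]
best_centroids, best_assign = min(runs, key=lambda r: kmeans_cost(X, *r))
```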

Choosing K

We can use the elbow method: plot the cost J as a function of K and choose the K at which the curve stops decreasing sharply (the "elbow").

Alternatively, we can choose K depending on the downstream use of the clusters (for example, K = 3 if we are segmenting customers into small/medium/large T-shirt sizes).
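A sketch of the elbow method using scikit-learn's `KMeans` and matplotlib (both assumed available); `inertia_` is sklearn's name for the (unscaled) distortion:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans  # assumed available

X = np.random.randn(300, 2)  # toy data
Ks = range(1, 10)
costs = [KMeans(n_clusters=k, n_init=10).fit(X).inertia_ for k in Ks]

plt.plot(list(Ks), costs, marker="o")
plt.xlabel("K")
plt.ylabel("cost J")
plt.show()  # look for the "elbow" where the curve flattens
```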

Reinforcement learning

Here we have an agent and an environment that the agent can interact with. Based on its interactions and our goal, we give it a score; the agent's goal is to maximize that score.
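A toy illustration of the agent/environment loop (the `LineWorld` environment and the random policy are invented for this sketch):

```python
import random

# A toy environment: the agent starts at 0 and must reach position 10.
class LineWorld:
    def __init__(self):
        self.pos = 0

    def step(self, action):  # action is -1 (left) or +1 (right)
        self.pos += action
        reward = 1.0 if self.pos == 10 else -0.01  # the "score" signal
        return self.pos, reward, self.pos == 10

env = LineWorld()
total, done = 0.0, False
for _ in range(100_000):  # cap the episode length
    state, reward, done = env.step(random.choice([-1, 1]))  # random policy
    total += reward
    if done:
        break
print("score:", total)  # a learning agent would act to maximize this
```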