students = read.table("data/students.tsv",
                      header = F,    # file does not contain a header (`F` is short for `FALSE`), so we must manually specify column names
                      sep = "\t",    # file is tab-delimited
                      col.names = c("age", "score", "name"))  # column names

We can now access the different columns in the data frame with students$age, students$score, and students$name.

CSV file with header

Here we have the same data, but now the file is comma-delimited and contains a header. We can import this file with


students = read.table("data/students.tsv",
                      sep = ",",
                      header = T)  # first line contains column names, so we can immediately call `students$age`

(Note: there is also a read.csv function that uses sep = "," by default.)

help

There are many more options that read.table can take. For a list of these, just type help(read.table) (or ?read.table) at the prompt to access documentation.


# These work for other functions as well.
help(read.table)
?read.table

ggplot2

With these R basics in place, let’s dive into the ggplot2 package.

Installation

One of R’s greatest strengths is its excellent set of packages. To install a package, you can use the install.packages() function.


install.packages("ggplot2")

To load a package into your current R session, use library().


library(ggplot2)

Scatterplots with qplot()

Let’s look at how to create a scatterplot in ggplot2. We’ll use the iris data frame that’s automatically loaded into R.

What does the data frame contain? We can use the head function to look at the first few rows.


head(iris)          # by default, head displays the first 6 rows. see `?head`
head(iris, n = 10)  # we can also explicitly set the number of rows to display

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
           5.1         3.5          1.4         0.2  setosa
           4.9         3.0          1.4         0.2  setosa
           4.7         3.2          1.3         0.2  setosa
           4.6         3.1          1.5         0.2  setosa
           5.0         3.6          1.4         0.2  setosa
           5.4         3.9          1.7         0.4  setosa

(The data frame actually contains three types of species: setosa, versicolor, and virginica.)
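A minimal qplot call for the scatterplot, plotting sepal length against petal length and coloring points by species, looks something like this (the axis labels and title here are just one reasonable choice):

qplot(Sepal.Length, Petal.Length, data = iris, color = Species,
      xlab = "Sepal Length", ylab = "Petal Length",
      main = "Sepal vs. Petal Length in Fisher's iris data")

The first two arguments are the x and y variables, data names the data frame to pull them from, and color = Species gives each species its own color.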

# But we can also supply a different weight.
# Here the height of each bar is the total running time of the director's movies.
qplot(director, weight = minutes, data = movies, geom = "bar", ylab = "total length (min.)")

Imagine you have a sequence of snapshots from a day in Justin Bieber’s life, and you want to label each image with the activity it represents (eating, sleeping, driving, etc.). How can you do this?

One way is to ignore the sequential nature of the snapshots, and build a per-image classifier. For example, given a month’s worth of labeled snapshots, you might learn that dark images taken at 6am tend to be about sleeping, images with lots of bright colors tend to be about dancing, images of cars are about driving, and so on.

By ignoring this sequential aspect, however, you lose a lot of information. For example, what happens if you see a close-up picture of a mouth – is it about singing or eating? If you know that the previous image is a picture of Justin Bieber eating or cooking, then it’s more likely this picture is about eating; if, however, the previous image contains Justin Bieber singing or dancing, then this one probably shows him singing as well.

Thus, to increase the accuracy of our labeler, we should incorporate the labels of nearby photos, and this is precisely what a conditional random field does.

Part-of-Speech Tagging

Let’s go into some more detail, using the more common example of part-of-speech tagging.

In POS tagging, the goal is to label a sentence (a sequence of words or tokens) with tags like ADJECTIVE, NOUN, PREPOSITION, VERB, ADVERB, ARTICLE.

So let’s build a conditional random field to label sentences with their parts of speech. Just like any classifier, we’ll first need to decide on a set of feature functions \(f_i\).

Feature Functions in a CRF

In a CRF, each feature function is a function that takes in as input:

a sentence s

the position i of a word in the sentence

the label \(l_i\) of the current word

the label \(l_{i-1}\) of the previous word

and outputs a real-valued number (though the numbers are often just either 0 or 1).

(Note: by restricting our features to depend on only the current and previous labels, rather than arbitrary labels throughout the sentence, I’m actually building the special case of a linear-chain CRF. For simplicity, I’m going to ignore general CRFs in this post.)

For example, one possible feature function could measure how much we suspect that the current word should be labeled as an adjective given that the previous word is “very”.
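As a rough R sketch (the function and argument names are just for illustration), such a feature function might look like:

f_very_adjective = function(sentence, i, label_i, label_prev) {
  # 1 if the current word is tagged ADJECTIVE and the previous word is "very"; 0 otherwise
  if (i > 1 && label_i == "ADJECTIVE" && sentence[i - 1] == "very") 1 else 0
}

f_very_adjective(c("this", "is", "very", "shiny"), 4, "ADJECTIVE", "ADVERB")  # returns 1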

Features to Probabilities

Next, assign each feature function \(f_j\) a weight \(\lambda_j\) (I’ll talk below about how to learn these weights from the data). Given a sentence s, we can now score a labeling l of s by adding up the weighted features over all words in the sentence:
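\(score(l|s) = \sum_{j} \sum_{i} \lambda_j f_j(s, i, l_i, l_{i-1})\)

where the first sum runs over our feature functions \(f_j\) and the second sum runs over the positions \(i\) in the sentence. Finally, we can transform these scores into probabilities \(p(l|s)\) between 0 and 1 by exponentiating and normalizing:

\(p(l|s) = \frac{\exp[score(l|s)]}{\sum_{l'} \exp[score(l'|s)]}\)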

Example Feature Functions

So what do these feature functions look like? Examples of POS tagging features could include:

\(f_1(s, i, l_i, l_{i-1}) = 1\) if \(l_i =\) ADVERB and the ith word ends in “-ly”; 0 otherwise.
If the weight \(\lambda_1\) associated with this feature is large and positive, then this feature is essentially saying that we prefer labelings where words ending in -ly get labeled as ADVERB.

\(f_2(s, i, l_i, l_{i-1}) = 1\) if \(i = 1\), \(l_i =\) VERB, and the sentence ends in a question mark; 0 otherwise.
Again, if the weight \(\lambda_2\) associated with this feature is large and positive, then labelings that assign VERB to the first word in a question (e.g., “Is this a sentence beginning with a verb?”) are preferred.

\(f_3(s, i, l_i, l_{i-1}) = 1\) if \(l_{i-1} =\) ADJECTIVE and \(l_i =\) NOUN; 0 otherwise.
Again, a positive weight for this feature means that adjectives tend to be followed by nouns.

\(f_4(s, i, l_i, l_{i-1}) = 1\) if \(l_{i-1} =\) PREPOSITION and \(l_i =\) PREPOSITION; 0 otherwise.
A negative weight \(\lambda_4\) for this function would mean that prepositions don’t tend to follow prepositions, so we should avoid labelings where this happens.

And that’s it! To sum up: to build a conditional random field, you just define a bunch of feature functions (which can depend on the entire sentence, a current position, and nearby labels), assign them weights, and add them all together, transforming at the end to a probability if necessary.
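To make the score-and-normalize step concrete, here is a minimal R sketch. The feature functions, weights, and tag set are made up for illustration, and the normalization brute-forces over all labelings, which is only feasible for toy sentences:

# Each feature function takes (sentence, i, label_i, label_prev) and returns 0 or 1.
features = list(
  function(s, i, li, lp) as.numeric(li == "ADVERB" && grepl("ly$", s[i])),
  function(s, i, li, lp) as.numeric(!is.na(lp) && lp == "ADJECTIVE" && li == "NOUN"),
  function(s, i, li, lp) as.numeric(!is.na(lp) && lp == "PREPOSITION" && li == "PREPOSITION")
)
weights = c(1.5, 0.8, -1.2)  # one lambda per feature function (made-up values)

# score(l | s) = sum over features j and positions i of lambda_j * f_j(s, i, l_i, l_{i-1})
score_labeling = function(sentence, labels, features, weights) {
  total = 0
  for (i in seq_along(sentence)) {
    prev = if (i > 1) labels[i - 1] else NA
    for (j in seq_along(features)) {
      total = total + weights[j] * features[[j]](sentence, i, labels[i], prev)
    }
  }
  total
}

# Turn scores into probabilities by exponentiating and normalizing over all labelings.
# (Brute force; a real CRF uses dynamic programming instead.)
prob_labeling = function(sentence, labels, tagset, features, weights) {
  all_labelings = expand.grid(rep(list(tagset), length(sentence)), stringsAsFactors = FALSE)
  scores = apply(all_labelings, 1, function(l) score_labeling(sentence, l, features, weights))
  exp(score_labeling(sentence, labels, features, weights)) / sum(exp(scores))
}

sentence = c("the", "extremely", "happy", "dog")
tagset = c("ARTICLE", "ADVERB", "ADJECTIVE", "NOUN", "PREPOSITION")
prob_labeling(sentence, c("ARTICLE", "ADVERB", "ADJECTIVE", "NOUN"), tagset, features, weights)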

Now let’s step back and compare CRFs to some other common machine learning techniques.

If this looks familiar, that’s because CRFs are indeed basically the sequential version of logistic regression: whereas logistic regression is a log-linear model for classification, CRFs are a log-linear model for sequential labels.

Looks like HMMs…

Recall that Hidden Markov Models are another model for part-of-speech tagging (and sequential labeling in general). Whereas CRFs throw any bunch of functions together to get a label score, HMMs take a generative approach to labeling, defining

\(p(l,s) = p(l_1) \prod_i p(l_i | l_{i-1}) p(w_i | l_i)\)

where

\(p(l_i | l_{i-1})\) are transition probabilities (e.g., the probability that a preposition is followed by a noun);

\(p(w_i | l_i)\) are emission probabilities (e.g., the probability that a noun emits the word “dad”).

So how do HMMs compare to CRFs? CRFs are more powerful – they can model everything HMMs can and more. One way of seeing this is as follows.

Note that the log of the HMM probability is \(\log p(l,s) = \log p(l_1) + \sum_i \log p(l_i | l_{i-1}) + \sum_i \log p(w_i | l_i)\). This has exactly the log-linear form of a CRF if we consider these log-probabilities to be the weights associated to binary transition and emission indicator features.

Thus, the score \(p(l|s)\) computed by a CRF using these feature functions is precisely proportional to the score computed by the associated HMM, and so every HMM is equivalent to some CRF.

However, CRFs can model a much richer set of label distributions as well, for two main reasons:

CRFs can define a much larger set of features. Whereas HMMs are necessarily local in nature (because they’re constrained to binary transition and emission feature functions, which force each word to depend only on the current label and each label to depend only on the previous label), CRFs can use more global features. For example, one of the features in our POS tagger above increased the probability of labelings that tagged the first word of a sentence as a VERB if the end of the sentence contained a question mark.

CRFs can have arbitrary weight values. Whereas the probabilities of an HMM are constrained to lie between 0 and 1 and to sum to one, the weights of a CRF are unrestricted (e.g., \(\log p(w_i | l_i)\) can be anything it wants).

Learning Weights

Let’s go back to the question of how to learn the feature weights in a CRF. One way, unsurprisingly, is gradient ascent on the log-likelihood of the training data (equivalently, gradient descent on its negative).

Assume we have a bunch of training examples (sentences and associated part-of-speech labels). Randomly initialize the weights of our CRF model. To shift these randomly initialized weights to the correct ones, for each training example…
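Go through each feature function \(f_j\), and calculate the gradient of the log probability of the training example with respect to \(\lambda_j\):

\(\frac{\partial}{\partial \lambda_j} \log p(l|s) = \sum_i f_j(s, i, l_i, l_{i-1}) - \sum_{l'} p(l'|s) \sum_i f_j(s, i, l'_i, l'_{i-1})\)

Then move \(\lambda_j\) a small step in the direction of the gradient:

\(\lambda_j = \lambda_j + \alpha \left[ \sum_i f_j(s, i, l_i, l_{i-1}) - \sum_{l'} p(l'|s) \sum_i f_j(s, i, l'_i, l'_{i-1}) \right]\)

where \(\alpha\) is some learning rate.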

Note that the first term in the gradient is the contribution of feature \(f_j\) under the true label, and the second term in the gradient is the expected contribution of feature \(f_j\) under the current model. This is exactly the form you’d expect gradient ascent to take.

Repeat the previous steps until some stopping condition is reached (e.g., the updates fall below some threshold).

In other words, every step takes the difference between what we want the model to learn and the model’s current state, and moves \(\lambda_j\) in the direction of this difference.
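Here is a small, self-contained R sketch of this update loop on a toy example. The tag set, feature functions, and training sentence are made up, and the expectation over labelings is computed by brute force (fine for a three-word sentence, hopeless for real data):

tagset = c("ARTICLE", "ADJECTIVE", "NOUN")
features = list(
  function(s, i, li, lp) as.numeric(li == "ARTICLE" && s[i] %in% c("the", "a")),
  function(s, i, li, lp) as.numeric(!is.na(lp) && lp == "ADJECTIVE" && li == "NOUN")
)
weights = rep(0, length(features))  # start from zero (or random) weights
alpha = 0.1                         # learning rate

# Count how often each feature fires on (sentence, labels).
feature_counts = function(sentence, labels) {
  sapply(seq_along(features), function(j)
    sum(sapply(seq_along(sentence), function(i) {
      prev = if (i > 1) labels[i - 1] else NA
      features[[j]](sentence, i, labels[i], prev)
    })))
}

sentence = c("the", "shiny", "dog")
true_labels = c("ARTICLE", "ADJECTIVE", "NOUN")
all_labelings = expand.grid(rep(list(tagset), length(sentence)), stringsAsFactors = FALSE)
counts = t(apply(all_labelings, 1, function(l) feature_counts(sentence, l)))  # one row per labeling

for (step in 1:50) {
  probs = exp(counts %*% weights)
  probs = probs / sum(probs)                         # p(l' | s) under the current weights
  observed = feature_counts(sentence, true_labels)   # feature counts under the true labeling
  expected = colSums(counts * as.numeric(probs))     # expected counts under the model
  weights = weights + alpha * (observed - expected)  # gradient ascent step
}
weights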

Finding the Optimal Labeling

Suppose we’ve trained our CRF model, and now a new sentence comes in. How do we label it?

The naive way is to calculate \(p(l | s)\) for every possible labeling l, and then choose the labeling that maximizes this probability. However, since there are \(k^m\) possible labelings for a tag set of size k and a sentence of length m, this approach would have to check an exponential number of labelings.

A better way is to realize that (linear-chain) CRFs satisfy an optimal substructure property that allows us to use a (polynomial-time) dynamic programming algorithm to find the optimal labeling, similar to the Viterbi algorithm for HMMs.
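As a sketch of what that dynamic program looks like in R, assume we’ve collapsed the weighted feature sums into a single function local_score(i, current_tag, previous_tag); everything here, including the toy scoring function, is made up for illustration:

viterbi_decode = function(n_words, tagset, local_score) {
  k = length(tagset)
  # best[i, t]: best score of any labeling of words 1..i that ends in tag t
  best = matrix(-Inf, nrow = n_words, ncol = k)
  back = matrix(NA_integer_, nrow = n_words, ncol = k)
  for (t in 1:k) best[1, t] = local_score(1, tagset[t], NA)
  if (n_words > 1) {
    for (i in 2:n_words) {
      for (t in 1:k) {
        cand = best[i - 1, ] + sapply(1:k, function(p) local_score(i, tagset[t], tagset[p]))
        best[i, t] = max(cand)
        back[i, t] = which.max(cand)
      }
    }
  }
  # Follow the backpointers to recover the best labeling.
  labels = integer(n_words)
  labels[n_words] = which.max(best[n_words, ])
  if (n_words > 1) for (i in (n_words - 1):1) labels[i] = back[i + 1, labels[i + 1]]
  tagset[labels]
}

# Toy scoring function: prefer NOUN after ADJECTIVE, with a mild overall preference for NOUN.
toy_score = function(i, curr, prev) {
  1.0 * (!is.na(prev) && prev == "ADJECTIVE" && curr == "NOUN") + 0.1 * (curr == "NOUN")
}
viterbi_decode(3, c("ADJECTIVE", "NOUN", "VERB"), toy_score)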

A More Interesting Application

Okay, so part-of-speech tagging is kind of boring, and there are plenty of existing POS taggers out there. When might you use a CRF in real life?

Suppose you want to mine Twitter for the types of presents people received for Christmas:

To gather data for the graphs above, I simply looked for phrases of the form “I want XXX for Christmas” and “I got XXX for Christmas”. However, a more sophisticated CRF variant could use a GIFT part-of-speech-like tag (even adding other tags like GIFT-GIVER and GIFT-RECEIVER, to get even more information on who got what from whom) and treat this like a POS tagging problem. Features could be based around things like “this word is a GIFT if the previous word was a GIFT-RECEIVER and the word before that was ‘gave’” or “this word is a GIFT if the next two words are ‘for Christmas’”.

How was the Netflix Prize won? I went through a lot of the Netflix Prize papers a couple years ago, so I’ll try to give an overview of the techniques that went into the winning solution here.

Normalization of Global Effects

Suppose Alice rates Inception 4 stars. We can think of this rating as composed of several parts:

A baseline rating (e.g., maybe the mean over all user-movie ratings is 3.1 stars).

An Alice-specific effect (e.g., maybe Alice tends to rate movies lower than the average user, so her ratings are 0.5 stars lower than we normally expect).

An Inception-specific effect (e.g., Inception is a pretty awesome movie, so its ratings are 0.7 stars higher than we normally expect).

A less predictable effect based on the specific interaction between Alice and Inception that accounts for the remainder of the stars (e.g., Alice really liked Inception because of its particular combination of Leonardo DiCaprio and neuroscience, so this rating gets an additional 0.7 stars).

So instead of having our models predict the 4-star rating itself, we could first try to remove the effect of the baseline predictors (the first three components) and have them predict the specific 0.7 stars. (I guess you can also think of this as a simple kind of boosting.)
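In code, with the made-up numbers above, the leftover piece that the later models are asked to predict is just:

global_mean  = 3.1   # mean over all user-movie ratings
alice_effect = -0.5  # Alice rates 0.5 stars lower than the average user
movie_effect = 0.7   # Inception is rated 0.7 stars higher than the average movie

baseline = global_mean + alice_effect + movie_effect  # 3.3 stars
residual = 4 - baseline                               # 0.7 stars, the part left for the fancier models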

More generally, additional baseline predictors include:

A factor that allows Alice’s rating to (linearly) depend on the (square root of the) number of days since her first rating. (For example, have you ever noticed that you become a harsher critic over time?)

A factor that allows Alice’s rating to depend on the number of days since the movie’s first rating by anyone. (If you’re one of the first people to watch it, maybe it’s because you’re a huge fan and really excited to see it on DVD, so you’ll tend to rate it higher.)

A factor that allows Alice’s rating to depend on the number of people who have rated Inception. (Maybe Alice is a hipster who hates being part of the crowd.)

A factor that allows Alice’s rating to depend on the movie’s overall rating.

(Plus a bunch of others.)

And, in fact, modeling these biases turned out to be fairly important: in their paper describing their final solution to the Netflix Prize, Bell and Koren write that

Of the numerous new algorithmic contributions, I would like to highlight one – those humble baseline predictors (or biases), which capture main effects in the data. While the literature mostly concentrates on the more sophisticated algorithmic aspects, we have learned that an accurate treatment of main effects is probably at least as significant as coming up with modeling breakthroughs.

(For a perhaps more concrete example of why removing these biases is useful, suppose you know that Bob likes the same kinds of movies that Alice does. To predict Bob’s rating of Inception, instead of simply predicting the same 4 stars that Alice rated, if we know that Bob tends to rate movies 0.3 stars higher than average, then we could first remove Alice’s bias and then add in Bob’s: 4 + 0.5 + 0.3 = 4.8.)

Neighborhood Models

Let’s now look at some slightly more sophisticated models. As alluded to in the section above, one of the standard approaches to collaborative filtering is to use neighborhood models.

Briefly, a neighborhood model works as follows. To predict Alice’s rating of Titanic, you could do two things:

Item-item approach: find a set of items similar to Titanic that Alice has also rated, and take the (weighted) mean of Alice’s ratings on them.

User-user approach: find a set of users similar to Alice who rated Titanic, and again take the mean of their ratings of Titanic.

The main questions, then, are (let’s stick to the item-item approach for simplicity):

How do we find the set of similar items?

How do we weight these items when taking their mean?

The standard approach is to take some similarity metric (e.g., correlation or a Jaccard index) to define similarities between pairs of movies, take the K most similar movies under this metric (where K is perhaps chosen via cross-validation), and then use the same similarity metric when computing the weighted mean.

This has a couple problems:

Neighbors aren’t independent, so using a standard similarity metric to define a weighted mean overcounts information. For example, suppose you ask five friends where you should eat tonight. Three of them went to Mexico last week and are sick of burritos, so they strongly recommend against a taqueria. Thus, your friends’ recommendations have a stronger bias than what you’d get if you asked five friends who didn’t know each other at all. (Compare with the situation where all three Lord of the Rings movies are neighbors of Harry Potter.)

Different movies should perhaps be using different numbers of neighbors. Some movies may be predicted well by only one neighbor (e.g., Harry Potter 2 could be predicted well by Harry Potter 1 alone), some movies may require more, and some movies may have no good neighbors (so you should ignore your neighborhood algorithms entirely and let your other ratings models stand on their own).

So another approach is the following:

You can still use a similarity metric like correlation or cosine similarity to choose the set of similar items.

But instead of using the similarity metric to define the interpolation weights in the mean calculations, you essentially perform a (sparse) linear regression to find the weights that minimize the squared error between an item’s rating and a linear combination of the ratings of its neighbors (see the sketch below). Note that these weights are no longer constrained, so if all the neighbors are weak, their weights will be close to zero and the neighborhood model will have little effect.

(A slightly more complicated user-user approach, similar to this item-item neighborhood approach, is also useful.)
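Here is a rough R sketch of that regression idea on a dense, made-up ratings matrix; a real implementation would be regularized, sparse, and careful about missing ratings:

# Learn interpolation weights for one target movie from its neighbors by regressing
# the target movie's ratings on the neighbors' ratings (no intercept).
set.seed(1)
ratings = matrix(sample(1:5, 20 * 4, replace = TRUE), nrow = 20,
                 dimnames = list(NULL, c("Titanic", "HarryPotter1", "HarryPotter2", "Gladiator")))

neighbors = c("HarryPotter1", "HarryPotter2", "Gladiator")
fit = lm(ratings[, "Titanic"] ~ ratings[, neighbors] - 1)   # -1: no intercept
interp_weights = coef(fit)

# Predict a new user's Titanic rating as a weighted combination of their neighbor ratings.
new_user = c(HarryPotter1 = 4, HarryPotter2 = 5, Gladiator = 2)
sum(interp_weights * new_user)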

Implicit Data

Adding on to the neighborhood approach, we can also let implicit data influence our predictions. The mere fact that a user rated lots of science fiction movies but no westerns suggests that the user likes science fiction better than cowboys. So, using a similar framework as in the neighborhood ratings model, we can learn for Inception a set of offset weights associated to Inception’s movie neighbors.

Whenever we want to predict how Bob rates Inception, we look at whether Bob rated each of Inception’s neighbors. If he did, we add in the corresponding offset; if not, then we add nothing (and, thus, Bob’s rating is implicitly penalized by the missing weight).

Matrix Factorization

Complementing the neighborhood approach to collaborative filtering is the matrix factorization approach. Whereas the neighborhood approach takes a very local approach to ratings (if you liked Harry Potter 1, then you’ll like Harry Potter 2!), the factorization approach takes a more global view (we know that you like fantasy movies and that Harry Potter has a strong fantasy element, so we think that you’ll like Harry Potter) that decomposes users and movies into a set of latent factors (which we can think of as categories like “fantasy” or “violence”).

In fact, matrix factorization methods were probably the most important class of techniques for winning the Netflix Prize. In their 2008 Progress Prize paper, Bell and Koren write

It seems that models based on matrix-factorization were found to be most accurate (and thus popular), as evident by recent publications and discussions on the Netflix Prize forum. We definitely agree to that, and would like to add that those matrix-factorization models also offer the important flexibility needed for modeling temporal effects and the binary view. Nonetheless, neighborhood models, which have been dominating most of the collaborative filtering literature, are still expected to be popular due to their practical characteristics - being able to handle new users/ratings without re-training and offering direct explanations to the recommendations.

The typical way to perform matrix factorizations is to perform a singular value decomposition on the (sparse) ratings matrix (using stochastic gradient descent and regularizing the weights of the factors, possibly constraining the weights to be positive to get a type of non-negative matrix factorization). (Note that this “SVD” is a little different from the standard SVD learned in linear algebra, since not every user has rated every movie and so the ratings matrix contains many missing elements that we don’t want to simply treat as 0.)
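A minimal R sketch of that kind of SGD-trained factorization, on made-up data and with made-up constants, looks like this:

# Approximate each observed rating by the dot product of a user factor vector and a
# movie factor vector, with L2 regularization on the factors.
set.seed(1)
n_users = 30; n_movies = 20; n_factors = 5
ratings = data.frame(user  = sample(1:n_users, 200, replace = TRUE),
                     movie = sample(1:n_movies, 200, replace = TRUE),
                     r     = sample(1:5, 200, replace = TRUE))

P = matrix(rnorm(n_users * n_factors, sd = 0.1), n_users, n_factors)    # user factors
Q = matrix(rnorm(n_movies * n_factors, sd = 0.1), n_movies, n_factors)  # movie factors
lrate = 0.01; lambda = 0.05

for (epoch in 1:20) {
  for (k in sample(nrow(ratings))) {
    u = ratings$user[k]; m = ratings$movie[k]
    err = ratings$r[k] - sum(P[u, ] * Q[m, ])
    P[u, ] = P[u, ] + lrate * (err * Q[m, ] - lambda * P[u, ])
    Q[m, ] = Q[m, ] + lrate * (err * P[u, ] - lambda * Q[m, ])
  }
}

sum(P[1, ] * Q[2, ])  # predicted rating for user 1 on movie 2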

Some SVD-inspired methods used in the Netflix Prize include:

Standard SVD: Once you’ve represented users and movies as factor vectors, you can dot product Alice’s vector with Inception’s vector to get Alice’s predicted rating of Inception.

Asymmetric SVD: Instead of users having their own notion of factor vectors, we can represent users as a bag of items they have rated (or provided implicit feedback for). So Alice is now represented as a (possibly weighted) sum of the factor vectors of the items she has rated, and to get her predicted rating of Titanic, we can dot product this representation with the factor vector of Titanic. From a practical perspective, this model has an added benefit in that no user parameterizations are needed, so we can use this approach to generate recommendations as soon as a user provides some feedback (which could just be views or clicks on an item, and not necessarily ratings), without needing to retrain the model to factorize the user.

SVD++: Incorporate both the standard SVD and the asymmetric SVD model by representing users both by their own factor representation and as a bag of item vectors.

Regression

Some regression models were also used in the predictions. The models are fairly standard, I think, so I won’t spend too long here. Basically, just as with the neighborhood models, we can take a user-centric approach and a movie-centric approach to regression:

User-centric approach: We learn a regression model for each user, using all the movies that the user rated as the dataset. The response is the movie’s rating, and the predictor variables are attributes associated to that movie (which can be derived from, say, PCA, MDS, or an SVD).

Movie-centric approach: Similarly, we can learn a regression model for each movie, using all the users that rated the movie as the dataset.

Restricted Boltzmann Machines

Restricted Boltzmann Machines provide another kind of latent factor approach that can be used. See this paper for a description of how to apply them to the Netflix Prize. (In case the paper’s a little difficult to read, I wrote an introduction to RBMs a little while ago.)

Temporal Effects

Many of the models incorporate temporal effects. For example, when describing the baseline predictors above, we used a few temporal predictors that allowed a user’s rating to (linearly) depend on the time since the first rating he ever made and on the time since a movie’s first rating. We can also get more fine-grained temporal effects by, say, binning items into a couple months’ worth of ratings at a time, and allowing movie biases to change within each bin. (For example, maybe in May 2006, Time Magazine nominated Titanic as the best movie ever made, which caused a spurt in glowing ratings around that time.)

In the matrix factorization approach, user factors were also allowed to be time-dependent (e.g., maybe Bob comes to like comedy movies more and more over time). We can also give more weight to recent user actions.

Regularization

Regularization was also applied throughout pretty much all the models learned, to prevent overfitting on the dataset. Ridge regression was heavily used in the factorization models to penalize large weights, and lasso regression (though less effective) was useful as well. Many other parameters (e.g., the baseline predictors, similarity weights and interpolation weights in the neighborhood models) were also estimated using fairly standard shrinkage techniques.

Ensemble Methods

Finally, let’s talk about how all of these different algorithms were combined to provide a single rating that exploits the strengths of each model. (Note that, as mentioned above, many of these models were not trained on the raw ratings data directly, but rather on the residuals of other models.)

In the paper detailing their final solution, the winners describe using gradient boosted decision trees to combine over 500 models; previous solutions used instead a linear regression to combine the predictors.

Briefly, gradient boosted decision trees work by sequentially fitting a series of decision trees to the data; each tree is asked to predict the error made by the previous trees, and is often trained on slightly perturbed versions of the data. (For a longer description of a similar technique, see my introduction to random forests.)

Since GBDTs have a built-in ability to apply different methods to different slices of the data, we can add in some predictors that help the trees make useful clusterings:

Number of movies each user rated

Number of users that rated each movie

Factor vectors of users and movies

Hidden units of a restricted Boltzmann Machine

(For example, one thing that Bell and Koren found (when using an earlier ensemble method) was that RBMs are more useful when the movie or the user has a low number of ratings, and that matrix factorization methods are more useful when the movie or user has a high number of ratings.)

Here’s a graph of the effect of ensemble size from early on in the competition (in 2007), and the authors’ take on it:

However, we would like to stress that it is not necessary to have such a large number of models to do well. The plot below shows RMSE as a function of the number of methods used. One can achieve our winning score (RMSE=0.8712) with less than 50 methods, using the best 3 methods can yield RMSE < 0.8800, which would land in the top 10. Even just using our single best method puts us on the leaderboard with an RMSE of 0.8890. The lesson here is that having lots of models is useful for the incremental results needed to win competitions, but practically, excellent systems can be built with just a few well-selected models.

What types of students go to which schools? There are, of course, the classic stereotypes:

MIT has the hacker engineers.

Stanford has the laid-back, social folks.

Harvard has the prestigious leaders of the world.

Berkeley has the activist hippies.

Caltech has the hardcore science nerds.

But how well do these perceptions match reality? What are students at Stanford, Harvard, MIT, Caltech, and Berkeley really interested in? Following the path of my previous data-driven post on differences between Silicon Valley and NYC, I scraped the Quora profiles of a couple hundred followers of each school to find out.

Topics

So let’s look at what kinds of topics followers of each school are interested in*. (Skip past the lists for a discussion.)

MIT

Topics are followed by p(school = MIT|topic).

MIT Media Lab 0.893

Ksplice 0.69

Lisp (programming language) 0.677

Nokia 0.659

Public Speaking 0.65

Data Storage 0.65

Google Voice 0.609

Hacking 0.602

Startups in Europe 0.597

Startup Names 0.572

Mechanical Engineering 0.563

Engineering 0.563

Distributed Databases 0.544

StackOverflow 0.536

Boston 0.513

Learning 0.507

Open Source 0.498

Cambridge 0.496

Public Relations 0.493

Visualization 0.492

Semantic Web 0.486

Andreessen-Horowitz 0.483

Nature 0.475

Cryptography 0.474

Startups in Boston 0.452

Adobe Photoshop 0.451

Computer Security 0.447

Sachin Tendulkar 0.443

Hacker News 0.442

Games 0.429

Android Applications 0.428

Best Engineers and Programmers 0.427

College Admissions & Getting Into College 0.422

Co-Founders 0.419

Big Data 0.41

System Administration 0.4

Biotechnology 0.398

Higher Education 0.394

NoSQL 0.387

User Experience 0.386

Career Advice 0.377

Artificial Intelligence 0.375

Scalability 0.37

Taylor Swift 0.368

Google Search 0.368

Functional Programming 0.365

Bing 0.363

Bioinformatics 0.361

How I Met Your Mother (TV series) 0.361

Operating Systems 0.356

Compilers 0.355

Google Chrome 0.354

Management & Organizational Leadership 0.35

Literary Fiction 0.35

Intelligence 0.348

Fight Club (1999 movie) 0.344

Hip Hop Music 0.34

UX Design 0.337

Web Application Frameworks 0.336

Startups in New York City 0.333

Book Recommendations 0.33

Engineering Recruiting 0.33

Search Engines 0.329

Social Search 0.329

Data Science 0.328

History 0.328

Interaction Design 0.326

Classification (machine learning) 0.322

Startup Incubators and Seed Programs 0.321

Graphic Design 0.321

Product Design (software) 0.319

The College Experience 0.319

Writing 0.319

MapReduce 0.318

Database Systems 0.315

User Interfaces 0.314

Literature 0.314

C (programming language) 0.314

Television 0.314

Reading 0.313

Usability 0.312

Books 0.312

Computers 0.311

Stealth Startups 0.311

Daft Punk 0.31

Healthy Eating 0.309

Innovation 0.309

Skiing 0.305

JavaScript 0.304

Rock Music 0.304

Mozilla Firefox 0.304

Self-Improvement 0.303

McKinsey & Company 0.302

AngelList 0.301

Data Visualization 0.301

Cassandra (database) 0.301

Stanford

Topics are followed by p(school = Stanford|topic).

Stanford Computer Science 0.951

Stanford Graduate School of Business 0.939

Stanford 0.896

Stanford Football 0.896

Stanford Cardinal 0.896

Social Dance 0.847

Stanford University Courses 0.847

Romance 0.769

Instagram 0.745

College Football 0.665

Mobile Location Applications 0.634

Online Communities 0.621

Interpersonal Relationships 0.585

Food & Restaurants in Palo Alto 0.572

Your 20s 0.566

Men’s Fashion 0.548

Flipboard 0.537

Inception (2010 movie) 0.535

Tumblr 0.531

People Skills 0.522

Exercise 0.52

Joel Spolsky 0.516

Valuations 0.515

The Social Network (2010 movie) 0.513

LeBron James 0.506

Northern California 0.506

Evernote 0.5

Quora Community 0.5

Blogging 0.49

Downtown Palo Alto 0.487

The College Experience 0.485

Consumer Internet 0.477

Restaurants in San Francisco 0.477

Chad Hurley 0.47

Meditation 0.468

Yishan Wong 0.466

Arrested Development (TV series) 0.463

fbFund 0.457

Best Engineers at X Company 0.451

Language 0.45

Words 0.448

Happiness 0.447

Path (company) 0.446

Color Labs (startup) 0.446

Palo Alto 0.445

Woot.com 0.442

Beer 0.442

PayPal 0.441

Women in Startups 0.438

Techmeme 0.433

Women in Engineering 0.428

The Mission (San Francisco neighborhood) 0.427

iPhone Applications 0.416

Asana 0.413

Monetization 0.412

Repetitive Strain Injury (RSI) 0.4

IDEO 0.398

Spotify 0.397

San Francisco Giants 0.396

Fortune Magazine 0.389

Love 0.387

Human-Computer Interaction 0.382

Hip Hop Music 0.378

Self-Improvement 0.378

Food in San Francisco 0.375

Quora (company) 0.374

Quora Infrastructure 0.373

iPhone 0.371

Square (company) 0.369

Social Psychology 0.369

Network Effects 0.366

Chris Sacca 0.365

Walt Mossberg 0.364

Salesforce.com 0.362

Sex 0.361

Etiquette 0.361

David Pogue 0.361

Gowalla 0.36

iOS Development 0.354

Palantir Technologies 0.353

Mobile Computing 0.347

Sports 0.346

Video Games 0.345

Burning Man 0.345

Engineering Management 0.343

Cognitive Science 0.342

Dating & Relationships 0.341

Fred Wilson (venture investor) 0.337

Taiwan 0.333

Natural Language Processing 0.33

Eric Schmidt 0.329

Social Advice 0.329

Engineering Recruiting 0.328

Job Interviews 0.325

Mobile Phones 0.324

Twitter Inc. (company) 0.321

Engineering in Silicon Valley 0.321

San Francisco Bay Area 0.321

Google Analytics 0.32

Fashion 0.315

Interaction Design 0.314

Open Graph 0.313

Drugs & Pharmaceuticals 0.312

Electronic Music 0.312

Facebook Inc. (company) 0.309

Fitness 0.309

YouTube 0.308

TED Talks 0.308

Freakonomics (2005 Book) 0.307

Jack Dorsey 0.306

Nutrition 0.305

Puzzles 0.305

Silicon Valley Mergers & Acquisitions 0.304

Viral Growth & Analytics 0.304

Amazon Web Services 0.304

StumbleUpon 0.303

Exceptional Comment Threads 0.303

Harvard

Harvard Business School 0.968

Harvard Business Review 0.922

Harvard Square 0.912

Harvard Law School 0.912

Jimmy Fallon 0.899

Boston Red Sox 0.658

Klout 0.644

Oprah Winfrey 0.596

Ivanka Trump 0.587

Dalai Lama 0.569

Food in New York City 0.565

U2 0.562

TwitPic 0.534

37signals 0.522

David Lynch (director) 0.512

Al Gore 0.508

TechStars 0.49

Baseball 0.487

Private Equity 0.471

Classical Music 0.46

Startups in New York City 0.458

HootSuite 0.449

Kiva 0.442

Ultimate Frisbee 0.441

Huffington Post 0.436

New York City 0.433

Charlie Cheever 0.433

The New York Times 0.431

Technology Journalism 0.431

McKinsey & Company 0.427

TweetDeck 0.422

How Does X Work? 0.417

Ashton Kutcher 0.414

Coldplay 0.402

Conan O’Brien 0.397

Fast Company 0.397

WikiLeaks 0.394

Michael Jackson 0.389

Guy Kawasaki 0.389

Journalism 0.384

Wall Street Journal 0.384

Cambridge 0.371

Seattle 0.37

Cities & Metro Areas 0.357

Boston 0.353

Tim Ferriss (author) 0.35

The New Yorker 0.343

Law 0.34

Mashable 0.338

Politics 0.335

The Economist 0.334

Barack Obama 0.333

Skiing 0.329

McKinsey Quarterly 0.325

Wired (magazine) 0.316

Bill Gates 0.31

Mad Men (TV series) 0.308

India 0.306

TED Talks 0.306

Netflix 0.304

Wine 0.303

Angel Investors 0.302

Facebook Ads 0.301

UC Berkeley

Berkeley 0.978

California Golden Bears 0.91

Internships 0.717

Web Marketing 0.484

Google Social Strategy 0.453

Southwest Airlines 0.451

WordPress 0.429

Stock Market 0.429

BMW (automobile) 0.428

Web Applications 0.423

Flickr 0.422

Snowboarding 0.42

Electronic Music 0.404

MySQL 0.401

Internet Advertising 0.399

Search Engine Optimization (SEO) 0.398

Yelp 0.396

Groupon 0.393

In-N-Out Burger 0.391

The Matrix (1999 movie) 0.389

Trading (finance) 0.385

jQuery 0.381

Hedge Funds 0.378

Social Media Marketing 0.377

San Francisco 0.376

Stealth Startups 0.362

Yahoo! 0.36

Cascading Style Sheets 0.359

Angel Investors 0.355

UX Design 0.35

StarCraft 0.348

Los Angeles Lakers 0.347

Mountain View 0.345

How I Met Your Mother (TV series) 0.338

Google+ 0.337

Ruby on Rails 0.333

Reading 0.333

Social Media 0.326

China 0.322

Palantir Technologies 0.319

Facebook Platform 0.315

Basketball 0.315

Education 0.314

Business Development 0.312

Online & Mobile Payments 0.305

Restaurants in San Francisco 0.302

Technology Companies 0.302

Seth Godin 0.3

Caltech

Pasadena 0.969

Chess 0.748

Table Tennis 0.671

UCLA 0.67

MacBook Pro 0.618

Physics 0.618

Haskell 0.582

Los Angeles 0.58

Electrical Engineering 0.567

Star Trek (movie) 0.561

Disruptive Technology 0.545

Science 0.53

Biology 0.526

Quantum Mechanics 0.521

LaTeX 0.514

Mathematics 0.488

xkcd 0.488

Genetics & Heredity 0.487

Chemistry 0.47

Medicine & Healthcare 0.448

Poker 0.445

C++ (programming language) 0.442

Data Structures 0.434

Emacs 0.428

MongoDB 0.423

Neuroscience 0.404

Science Fiction 0.4

Mac OS X 0.394

Board Games 0.387

Computers 0.386

Research 0.385

Finance 0.385

The Future 0.379

Linux 0.378

The Colbert Report 0.376

The Beatles 0.374

The Onion 0.365

Ruby 0.363

Cars & Automobiles 0.361

Quantitative Finance 0.359

Academia 0.359

Law 0.355

Cooking 0.354

Psychology 0.349

Eminem 0.347

Football (Soccer) 0.346

Computer Programming 0.343

Algorithms 0.343

Evolutionary Biology 0.337

Behavioral Economics 0.335

California 0.329

Machine Learning 0.326

Futurama 0.324

Social Advice 0.324

StarCraft II 0.319

Job Interview Questions 0.318

Game Theory 0.316

This American Life 0.315

Economics 0.314

Vim 0.31

Graduate School 0.309

Git (revision control) 0.306

Computer Science 0.303

What do we see?

First, in a nice validation of this approach, we find that each school is interested in exactly the locations we’d expect: Caltech is interested in Pasadena and Los Angeles; MIT and Harvard are both interested in Boston and Cambridge (Harvard is interested in New York City as well); Stanford is interested in Palo Alto, Northern California, and San Francisco Bay Area; and Berkeley is interested in Berkeley, San Francisco, and Mountain View.

More interestingly, let’s look at where each school likes to eat. Stereotypically, we expect Harvard, Stanford, and Berkeley students to be more outgoing and social, and MIT and Caltech students to be more introverted. This is indeed what we find:

Harvard follows Food in New York City; Stanford follows Food & Restaurants in Palo Alto, Restaurants in San Francisco, and Food in San Francisco; and Berkeley follows Restaurants in San Francisco and In-N-Out Burger. In other words, Harvard, Stanford, and Berkeley love eating out.

Caltech, on the other hand, loves Cooking, and MIT loves Healthy Eating – both signs, perhaps, of a preference for eating in.

And what does each university use to quench their thirst? Harvard students like to drink wine (classy!), while Stanford students prefer beer (the social drink of choice).

What about sports teams? MIT and Caltech couldn’t care less, though Harvard follows the Boston Red Sox, Stanford follows the San Francisco Giants (as well as their own Stanford Football and Stanford Cardinal), and Berkeley follows the Los Angeles Lakers (and the California Golden Bears).

For sports themselves, MIT students like skiing; Stanford students like general exercise, fitness, and sports; Harvard students like baseball, ultimate frisbee, and skiing; and Berkeley students like snowboarding. Caltech, in a league of its own, enjoys table tennis and chess.

What does each school think of social? Caltech students look for Social Advice. Berkeley students are interested in Social Media and Social Media Marketing. MIT, on the more technical side, wants Social Search. Stanford students, predictably, love the whole spectrum of social offerings, from Social Dance and The Social Network, to Social Psychology and Social Advice. (Interestingly, Caltech and Stanford are both interested in Social Advice, though I wonder if it’s for slightly different reasons.)

What’s each school’s relationship with computers? Caltech students are interested in Computer Science, MIT hackers are interested in Computer Security, and Stanford students are interested in Human-Computer Interaction.

Digging into the MIT vs. Caltech divide a little, we see that Caltech students really are more interested in the pure sciences (Physics, Science, Biology, Quantum Mechanics, Mathematics, Chemistry, etc.), while MIT students are more on the applied and engineering sides (Mechanical Engineering, Engineering, Distributed Databases, Cryptography, Computer Security, Biotechnology, Operating Systems, Compilers, etc.).

What does each school like to read, both offline and online? Caltech loves science fiction, xkcd, and The Onion; MIT likes Hacker News; Harvard loves journals, newspapers, and magazines (Huffington Post, the New York Times, Fortune, Wall Street Journal, the New Yorker, the Economist, and so on); and Stanford likes TechMeme.

What movies and television shows does each school like to watch? Caltech likes Star Trek, the Colbert Report, and Futurama. MIT likes Fight Club (I don’t know what this has to do with MIT, though I will note that on my first day as a freshman in a new dorm, Fight Club was precisely the movie we all went to a lecture hall to see). Stanford likes The Social Network and Inception. Harvard, rather fittingly, likes Mad Men and Ted Talks.

Let’s look at the startups each school follows. MIT, of course, likes Ksplice. Berkeley likes Yelp and Groupon. Stanford likes just about every startup under the sun (Instagram, Flipboard, Tumblr, Path, Color Labs, etc.). And Harvard, that bastion of hard-won influence and prestige? To the surprise of precisely no one, Harvard enjoys Klout.

Let’s end with a summarized view of each school:

Caltech is very much into the sciences (Physics, Biology, Quantum Mechanics, Mathematics, etc.), as well as many pretty nerdy topics (Star Trek, Science Fiction, xkcd, Futurama, Starcraft II, etc.).

Berkeley, sadly, is perhaps too large and diverse for an overall characterization.

Harvard students are fascinated by famous figures (Jimmy Fallon, Oprah Winfrey, Ivanka Trump, Dalai Lama, David Lynch, Al Gore, Bill Gates, Barack Obama), and by prestigious newspapers, journals, and magazines (Fortune, the New York Times, the Wall Street Journal, the Economist, and so on). Other very fitting interests include Kiva, classical music, and Coldplay.

*I pulled about 400 followers from each school, and added a couple filters, to try to ensure that followers were actual attendees of the schools rather than general people simply interested in them. Topics are sorted using a naive Bayes score and filtered to have at least 5 counts. Also, a word of warning: my dataset was fairly small and users on Quora are almost certainly not representative of their schools as a whole (though I tried to be rigorous with what I had).
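For the curious, the kind of score used to sort these lists can be sketched in a few lines of R; the follower counts below are invented purely to show the computation:

# p(school | topic), assuming a uniform prior over schools (equal sample sizes per school).
# followers[school, topic] = number of sampled followers of the school who follow the topic.
followers = matrix(c(40,  5,  3,
                      6, 35,  4,
                      2,  3, 30),
                   nrow = 3, byrow = TRUE,
                   dimnames = list(c("MIT", "Stanford", "Caltech"),
                                   c("Hacking", "Social Dance", "Physics")))
n_sampled = 400                               # followers pulled per school

p_topic_given_school = followers / n_sampled  # p(topic | school)
p_school_given_topic = t(t(p_topic_given_school) / colSums(p_topic_given_school))
round(p_school_given_topic, 3)                # each topic's column sums to 1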

How does information spread through a network? Much of Quora’s appeal, after all, lies in its social graph – and when you’ve got a network of users, all broadcasting their activities to their neighbors, information can cascade in multiple ways. How do these social designs affect which users see what?

Think, for example, of what happens when your kid learns a new slang word at school. He doesn’t confine his use of the word to McKinley Elementary’s particular boundaries, between the times of 9-3pm – he introduces it to his friends from other schools at soccer practice as well. A couple months later, he even says it at home for the first time; you like the word so much, you then start using it at work. Eventually, Justin Bieber uses the word in a song, at which point the word’s popularity really starts to explode.

So how does information propagate through a social network? What types of people does an answer on Quora reach, and how does it reach them? (Do users discover new answers individually, or are hubs of connectors more key?) How does the activity of a post on Quora rise and fall? (Submissions on other sites have limited lifetimes, fading into obscurity soon after an initial spike; how does that change when users are connected and every upvote can revive a post for someone else’s eyes?)

(I looked at Quora since I had some data from there already available, but I hope the lessons should be fairly applicable in general, to other social networks like Facebook, Twitter, and LinkedIn as well.)

Also, here’s the network of the upvoters themselves (there’s an edge between users A and B if A follows B):

We can see three main clusters of users:

A large group in green centered around a lot of power users and Quora employees.

A machine learning group of folks in orange centered around people like Oliver Grisel, Christian Langreiter, and Joseph Turian.

A group of people following me, in purple.

Plus some smaller clusters in blue and yellow. (There were also a bunch of isolated users, connected to no one, that I filtered out of the picture.)

Digging into how these topic and user graphs are related:

The orange cluster of users is more heavily into machine learning: 79% of users in that cluster follow more green topics (machine learning and technical topics) than red and purple topics (startups and general intellectual matters).

The green cluster of users is reversed: 77% of users follow more of the red and purple clusters of topics (on startups and general intellectual matters) than machine learning and technical topics.

More interestingly, though, we can ask: how do the connections between upvoters relate to the way the post spread?

Social Voting Dynamics

So let’s take a look. Here’s a visualization I made of upvotes on my answer across time (click here for a larger view).

To represent the social dynamics of these upvotes, I drew an edge from user A to user B if user A transmitted the post to user B through an upvote. (Specifically, I drew an edge from Alice to Bob if Bob follows Alice and Bob’s upvote appeared within five days of Alice’s upvote; this is meant to simulate the idea that Alice was the key intermediary between my post and Bob.)
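In code, that rule might look like the following sketch; upvotes and follows are hypothetical data frames standing in for the real data, not the actual scripts used:

# Edge Alice -> Bob if Bob follows Alice and Bob's upvote came within five days of Alice's.
upvotes = data.frame(user = c("alice", "bob", "carol"),
                     time = as.Date(c("2011-02-14", "2011-02-16", "2011-03-30")))
follows = data.frame(follower = c("bob", "carol"), followee = c("alice", "bob"))

edges = merge(follows, upvotes, by.x = "followee", by.y = "user")   # followee's upvote time
edges = merge(edges, upvotes, by.x = "follower", by.y = "user",
              suffixes = c("_followee", "_follower"))               # follower's upvote time
lag_days = as.numeric(edges$time_follower - edges$time_followee)
edges[lag_days >= 0 & lag_days <= 5, c("followee", "follower")]     # each row: transmitter -> receiver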

Also,

Green nodes are users with at least one upvote edge.

Blue nodes are users who follow at least one of the topics the post is categorized under (i.e., users who probably discovered the answer by themselves).

Red nodes are users with no connections and who do not follow any of the post’s topics (i.e., users whose path to the post remains mysterious).

Users increase in size when they produce more connections.

Here’s a play-by-play of the video:

On Feb 14 (the day I wrote the answer), there’s a flurry of activity.

A couple of days later, Tracy Chou gives an upvote, leading to another spike in activity.

Then all’s quiet until… bam! Alex Kamil leads to a surge of upvotes, and his upvote finds Ludi Rehak, who starts a small surge of her own. They’re quickly followed by Christian Langreiter, who starts a small revolution among a bunch of machine learning folks a couple days later.

Then all is pretty calm again, until a couple months later when… bam! Aditya Sengupta brings in a smashing of his own followers, and his upvote makes its way to Marc Bodnick, who sets off a veritable storm of activity.

(Already we can see some relationships between the graph of user connections and the way the post propagated. Many of the users from the orange cluster, for example, come from Alex Kamil and Christian Langreiter’s upvotes, and many of the users from the green cluster come from Aditya Sengupta and Marc Bodnick’s upvotes. What’s interesting, though, is, why didn’t the cluster of green users appear all at once, like the orange cluster did? People like Kah Seng Tay, Tracy Chou, Venkatesh Rao, and Chad Little upvoted the answer pretty early on, but it wasn’t until Aditya Sengupta’s upvote a couple months later that people like Marc Bodnick, Edmond Lau, and many of the other green users (who do indeed follow that first set of folks) discovered the answer. Did the post simply get lost in users’ feeds the first time around? Was the post perhaps ignored until it received enough upvotes to be considered worth reading? Are some users’ upvotes just trusted more than others’?)

For another view of the upvote dynamics, here’s a static visualization, where we can again easily see the clusters of activity:

Fin

There are still many questions it would be interesting to look at; for example,

What differentiates users who sparked spikes of activity from users who didn’t? I don’t believe it’s simply number of followers, as many well-connected upvoters did not lead to cascades of shares. Does authority matter?

How far can a post reach? Clearly, the post reached people more than one degree of separation away from me (where one degree of separation is a follower); what does the distribution of degrees look like? Is there any relationship between degree of separation and time of upvote?

What can we say about the people who started following me after reading my answer? Are they fewer degrees of separation away? Are they more interested in machine learning? Have they upvoted any of my answers before? (Perhaps there’s a certain “threshold” of interestingness people need to overflow before they’re considered acceptable followees.)

But to summarize a bit what we’ve seen so far, here are some statistics on the role the social graph played in spreading the post:

There are 5 clusters of activity after the initial post, sparked both by power users and less-connected folks. In an interesting cascade of information, some of these sparks led to further spikes in activity as well (as when Aditya Sengupta’s upvote found its way to Marc Bodnick, who set off even more activity).

35% of users made their way to my answer because of someone else’s upvote.

Through these connections, the post reached a fair variety of users: 32% of upvoters don’t even follow any of the post’s topics.

77% of upvotes arrived more than two weeks after my answer appeared.

If we look only at the upvoters who follow at least one of the post’s topics, 33% didn’t see my answer until someone else showed it to them. In other words, a full one-third of people who presumably would have been interested in my post anyways only found it because of their social network.

So it looks like the social graph played quite a large part in the post’s propagation, and I’ll end with a big shoutout to Stormy Shippy, who provided an awesome set of scripts I used to collect a lot of this data.

What is latent Dirichlet allocation? It’s a way of automatically discovering the topics that a collection of sentences (or documents) contains. For example, given a handful of sentences about food and cute animals and asked for 2 topics, LDA might produce something like

Sentences 1 and 2: 100% Topic A

Sentences 3 and 4: 100% Topic B

Sentence 5: 60% Topic A, 40% Topic B

Topic A: 30% broccoli, 15% bananas, 10% breakfast, 10% munching, … (at which point, you could interpret topic A to be about food)

Topic B, similarly, would be dominated by animal-related words, so you could interpret it to be about cute animals.

LDA Model

In more detail, LDA represents documents as mixtures of topics that spit out words with certain probabilities. It assumes that documents are produced in the following fashion: when writing each document, you

Decide on the number of words N the document will have (say, according to a Poisson distribution).

Choose a topic mixture for the document (according to a Dirichlet distribution over a fixed set of K topics). For example, assuming that we have the two food and cute animal topics above, you might choose the document to consist of 1/3 food and 2/3 cute animals.

Generate each word \(w_i\) in the document by:

First picking a topic (according to the multinomial distribution that you sampled above; for example, you might pick the food topic with 1/3 probability and the cute animals topic with 2/3 probability).

Using the topic to generate the word itself (according to the topic’s multinomial distribution). For example, if we selected the food topic, we might generate the word “broccoli” with 30% probability, “bananas” with 15% probability, and so on.

Assuming this generative model for a collection of documents, LDA then tries to backtrack from the documents to find a set of topics that are likely to have generated the collection.

Example

Let’s make an example. According to the above process, when generating some particular document D, you might

Pick 5 to be the number of words in D.

Decide that D will be 1/2 about food and 1/2 about cute animals.

Pick the first word to come from the food topic, which then gives you the word “broccoli”.

Pick the second word to come from the cute animals topic, which gives you “panda”.

Pick the third word to come from the cute animals topic, giving you “adorable”.

Pick the fourth word to come from the food topic, giving you “cherries”.

Pick the fifth word to come from the food topic, giving you “eating”.

So the document generated under the LDA model will be “broccoli panda adorable cherries eating” (note that LDA is a bag-of-words model).
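As a sketch in R, the generative story for a single document (using the made-up food and cute-animals topics, with made-up word probabilities) looks like:

set.seed(0)
topics = list(
  food    = c(broccoli = 0.3, bananas = 0.15, cherries = 0.2, eating = 0.2, breakfast = 0.15),
  animals = c(panda = 0.4, adorable = 0.3, kitten = 0.2, cute = 0.1)
)

generate_document = function(topics, alpha = c(1, 1), mean_length = 5) {
  n_words = rpois(1, mean_length)           # 1. choose the document length
  g = rgamma(length(alpha), shape = alpha)  # 2. choose a topic mixture:
  topic_mix = g / sum(g)                    #    a Dirichlet sample via normalized gammas
  words = character(n_words)
  for (i in seq_len(n_words)) {
    z = sample(seq_along(topics), 1, prob = topic_mix)            # 3a. pick a topic for this word
    words[i] = sample(names(topics[[z]]), 1, prob = topics[[z]])  # 3b. pick a word from that topic
  }
  words
}

generate_document(topics)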

Learning

So now suppose you have a set of documents. You’ve chosen some fixed number of K topics to discover, and want to use LDA to learn the topic representation of each document and the words associated to each topic. How do you do this? One way (known as collapsed Gibbs sampling) is the following:

Go through each document, and randomly assign each word in the document to one of the K topics.

Notice that this random assignment already gives you both topic representations of all the documents and word distributions of all the topics (albeit not very good ones).

So to improve on them, for each document d…

Go through each word w in d…

And for each topic t, compute two things: 1) p(topic t | document d) = the proportion of words in document d that are currently assigned to topic t, and 2) p(word w | topic t) = the proportion of assignments to topic t, over all documents, that come from this word w.

Then reassign w a new topic, where we choose topic t with probability p(topic t | document d) * p(word w | topic t) (according to our generative model, this is essentially the probability that topic t generated word w, so it makes sense that we resample the current word’s topic with this probability). (Also, I’m glossing over a couple of things here, in particular the use of priors/pseudocounts in these probabilities.)

In other words, in this step, we’re assuming that all topic assignments except for the current word in question are correct, and then updating the assignment of the current word using our model of how documents are generated.

After repeating the previous step a large number of times, you’ll eventually reach a roughly steady state where your assignments are pretty good. So use these assignments to estimate the topic mixtures of each document (by counting the proportion of words assigned to each topic within that document) and the words associated to each topic (by counting the proportion of words assigned to each topic overall).
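And here is what that looks like as a compact (and very unoptimized) R sketch; documents are just vectors of word IDs, and alpha and beta are the pseudocounts glossed over above:

set.seed(0)
docs = list(c(1, 2, 1, 3), c(4, 5, 4, 1), c(2, 3, 5, 5))   # toy corpus over a vocabulary of 5 word IDs
K = 2; V = 5; alpha = 1; beta = 0.1; n_iter = 200

z = lapply(docs, function(d) sample(1:K, length(d), replace = TRUE))  # random initial topic per word

# Count tables: doc_topic[d, k] = words in doc d assigned to topic k;
# topic_word[k, v] = times word v is assigned to topic k across all docs.
doc_topic  = t(sapply(seq_along(docs), function(d) tabulate(z[[d]], K)))
topic_word = matrix(0, K, V)
for (d in seq_along(docs))
  for (i in seq_along(docs[[d]]))
    topic_word[z[[d]][i], docs[[d]][i]] = topic_word[z[[d]][i], docs[[d]][i]] + 1

for (iter in 1:n_iter) {
  for (d in seq_along(docs)) {
    for (i in seq_along(docs[[d]])) {
      w = docs[[d]][i]; k_old = z[[d]][i]
      # Remove the current assignment from the counts...
      doc_topic[d, k_old]  = doc_topic[d, k_old]  - 1
      topic_word[k_old, w] = topic_word[k_old, w] - 1
      # ...then resample its topic with probability proportional to
      # p(topic | doc) * p(word | topic).
      p = (doc_topic[d, ] + alpha) *
          (topic_word[, w] + beta) / (rowSums(topic_word) + V * beta)
      k_new = sample(1:K, 1, prob = p)
      z[[d]][i] = k_new
      doc_topic[d, k_new]  = doc_topic[d, k_new]  + 1
      topic_word[k_new, w] = topic_word[k_new, w] + 1
    }
  }
}

doc_topic / rowSums(doc_topic)     # estimated topic mixture of each document
topic_word / rowSums(topic_word)   # estimated word distribution of each topic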

Layman’s Explanation

In case the discussion above was a little eye-glazing, here’s another way to look at LDA in a different domain.

Suppose you’ve just moved to a new city. You’re a hipster and an anime fan, so you want to know where the other hipsters and anime geeks tend to hang out. Of course, as a hipster, you know you can’t just ask, so what do you do?

Here’s the scenario: you scope out a bunch of different establishments (documents) across town, making note of the people (words) hanging out in each of them (e.g., Alice hangs out at the mall and at the park, Bob hangs out at the movie theater and the park, and so on). Crucially, you don’t know the typical interest groups (topics) of each establishment, nor do you know the different interests of each person.

So you pick some number K of categories to learn (i.e., you want to learn the K most important kinds of categories people fall into), and start by making a guess as to why you see people where you do. For example, you initially guess that Alice is at the mall because people with interests in X like to hang out there; when you see her at the park, you guess it’s because her friends with interests in Y like to hang out there; when you see Bob at the movie theater, you randomly guess it’s because the Z people in this city really like to watch movies; and so on.

Of course, your random guesses are very likely to be incorrect (they’re random guesses, after all!), so you want to improve on them. One way of doing so is to:

Pick a place and a person (e.g., Alice at the mall).

Why is Alice likely to be at the mall? Probably because other people at the mall with the same interests sent her a message telling her to come.

In other words, the more people with interests in X there are at the mall and the stronger Alice is associated with interest X (at all the other places she goes to), the more likely it is that Alice is at the mall because of interest X.

So make a new guess as to why Alice is at the mall, choosing an interest with some probability according to how likely you think it is.

Go through each place and person over and over again. Your guesses keep getting better and better (after all, if you notice that lots of geeks hang out at the bookstore, and you suspect that Alice is pretty geeky herself, then it’s a good bet that Alice is at the bookstore because her geek friends told her to go there; and now that you have a better idea of why Alice is probably at the bookstore, you can use this knowledge in turn to improve your guesses as to why everyone else is where they are), and eventually you can stop updating. Then take a snapshot (or multiple snapshots) of your guesses, and use it to get all the information you want:

For each category, you can count the people assigned to that category to figure out what people have this particular interest. By looking at the people themselves, you can interpret the category as well (e.g., if category X contains lots of tall people wearing jerseys and carrying around basketballs, you might interpret X as the “basketball players” group).

For each place P and interest category C, you can compute the proportions of people at P because of C (under the current set of assignments), and these give you a representation of P. For example, you might learn that the people who hang out at Barnes & Noble consist of 10% hipsters, 50% anime fans, 10% jocks, and 30% college students.

Real-World Example

Finally, I applied LDA to a set of Sarah Palin’s emails a little while ago (see here for the blog post, or here for an app that allows you to browse through the emails by the LDA-learned categories), so let’s give a brief recap. Here are some of the topics that the algorithm learned:

Suppose you ask a bunch of users to rate a set of movies on a 0-100 scale. In classical factor analysis, you could then try to explain each movie and user in terms of a set of latent factors. For example, movies like Star Wars and Lord of the Rings might have strong associations with a latent science fiction and fantasy factor, and users who like Wall-E and Toy Story might have strong associations with a latent Pixar factor.

Restricted Boltzmann Machines essentially perform a binary version of factor analysis. (This is one way of thinking about RBMs; there are, of course, others, and lots of different ways to use RBMs, but I’ll adopt this approach for this post.) Instead of users rating a set of movies on a continuous scale, they simply tell you whether they like a movie or not, and the RBM will try to discover latent factors that can explain the activation of these movie choices.

Concretely, an RBM consists of:

One layer of visible units (users’ movie preferences whose states we know and set);

One layer of hidden units (the latent factors we try to learn); and

A bias unit (whose state is always on, and is a way of adjusting for the different inherent popularities of each movie).

Furthermore, each visible unit is connected to all the hidden units (this connection is undirected, so each hidden unit is also connected to all the visible units), and the bias unit is connected to all the visible units and all the hidden units. To make learning easier, we restrict the network so that no visible unit is connected to any other visible unit and no hidden unit is connected to any other hidden unit.

For example, suppose we have a set of six movies (Harry Potter, Avatar, LOTR 3, Gladiator, Titanic, and Glitter) and we ask users to tell us which ones they want to watch. If we want to learn two latent units underlying movie preferences – for example, two natural groups in our set of six movies appear to be SF/fantasy (containing Harry Potter, Avatar, and LOTR 3) and Oscar winners (containing LOTR 3, Gladiator, and Titanic), so we might hope that our latent units will correspond to these categories – then our RBM would look like the following:

(Note the resemblance to a factor analysis graphical model.)

State Activation

Restricted Boltzmann Machines, and neural networks in general, work by updating the states of some neurons given the states of others, so let’s talk about how the states of individual units change. Assuming we know the connection weights in our RBM (we’ll explain how to learn these below), to update the state of unit \(i\):

Compute the activation energy \(a_i = \sum_j w_{ij} x_j\) of unit \(i\), where the sum runs over all units \(j\) that unit \(i\) is connected to, \(w_{ij}\) is the weight of the connection between \(i\) and \(j\), and \(x_j\) is the 0 or 1 state of unit \(j\). In other words, all of unit \(i\)’s neighbors send it a message, and we compute the sum of all these messages.

Let \(p_i = \sigma(a_i)\), where \(\sigma(x) = 1/(1 + \exp(-x))\) is the logistic function. Note that \(p_i\) is close to 1 for large positive activation energies, and close to 0 for large negative activation energies.

We then turn unit \(i\) on with probability \(p_i\), and turn it off with probability \(1 - p_i\).

(In layman’s terms, units that are positively connected to each other try to get each other to share the same state (i.e., be both on or off), while units that are negatively connected to each other are enemies that prefer to be in different states.)
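Here is a minimal Python sketch of this update rule, assuming the unit states live in a 0/1 numpy vector and the connection weights in a symmetric matrix (with zeros for units that aren't connected); both of those representations are assumptions for illustration.

import numpy as np

def update_unit(i, states, weights):
    """Stochastically update unit i given the states of its neighbors.
    `states` is a 0/1 vector over all units; `weights[i][j]` is the weight
    of the connection between units i and j (0 if unconnected)."""
    activation_energy = np.dot(weights[i], states)      # a_i = sum_j w_ij * x_j
    p_i = 1.0 / (1.0 + np.exp(-activation_energy))      # logistic function
    return 1 if np.random.rand() < p_i else 0           # on with probability p_i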

For example, let’s suppose our two hidden units really do correspond to SF/fantasy and Oscar winners.

If Alice has told us her six binary preferences on our set of movies, we could then ask our RBM which of the hidden units her preferences activate (i.e., ask the RBM to explain her preferences in terms of latent factors). So the six movies send messages to the hidden units, telling them to update themselves. (Note that even if Alice has declared she wants to watch Harry Potter, Avatar, and LOTR 3, this doesn’t guarantee that the SF/fantasy hidden unit will turn on, but only that it will turn on with high probability. This makes a bit of sense: in the real world, Alice wanting to watch all three of those movies makes us highly suspect she likes SF/fantasy in general, but there’s a small chance she wants to watch them for other reasons. Thus, the RBM allows us to generate models of people in the messy, real world.)

Conversely, if we know that one person likes SF/fantasy (so that the SF/fantasy unit is on), we can then ask the RBM which of the movie units that hidden unit turns on (i.e., ask the RBM to generate a set of movie recommendations). So the hidden units send messages to the movie units, telling them to update their states. (Again, note that the SF/fantasy unit being on doesn’t guarantee that we’ll always recommend all three of Harry Potter, Avatar, and LOTR 3 because, hey, not everyone who likes science fiction liked Avatar.)

Learning Weights

So how do we learn the connection weights in our network? Suppose we have a bunch of training examples, where each training example is a binary vector with six elements corresponding to a user’s movie preferences. Then for each epoch, do the following:

Take a training example (a set of six movie preferences). Set the states of the visible units to these preferences.

Next, update the states of the hidden units using the logistic activation rule described above: for the \(j\)th hidden unit, compute its activation energy \(a_j = \sum_i w_{ij} x_i\), and set \(x_j\) to 1 with probability \(\sigma(a_j)\) and to 0 with probability \(1 - \sigma(a_j)\). Then for each edge \(e_{ij}\), compute \(Positive(e_{ij}) = x_i * x_j\) (i.e., for each pair of units, measure whether they’re both on).

Now reconstruct the visible units in a similar manner: for each visible unit, compute its activation energy \(a_i\), and update its state. (Note that this reconstruction may not match the original preferences.) Then update the hidden units again, and compute \(Negative(e_{ij}) = x_i * x_j\) for each edge.

Update the weight of each edge \(e_{ij}\) by setting \(w_{ij} = w_{ij} + L * (Positive(e_{ij}) - Negative(e_{ij}))\), where \(L\) is a learning rate.

Continue until the network converges (i.e., the error between the training examples and their reconstructions falls below some threshold) or we reach some maximum number of epochs.
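This is not the implementation linked later in the post; it's just a minimal numpy sketch of the procedure under some simplifying assumptions: no bias unit, binary states, and one training example per epoch.

import numpy as np

def logistic(x):
    return 1.0 / (1.0 + np.exp(-x))

def train_rbm(data, num_hidden, epochs=1000, learning_rate=0.1):
    """Minimal contrastive-divergence sketch. `data` is a (num_examples x
    num_visible) 0/1 matrix of movie preferences; bias units are omitted."""
    num_visible = data.shape[1]
    weights = 0.1 * np.random.randn(num_visible, num_hidden)
    for _ in range(epochs):
        visible = data[np.random.randint(len(data))]            # one training example
        # Positive phase: drive the hidden units from the data.
        hidden_probs = logistic(visible @ weights)
        hidden = (np.random.rand(num_hidden) < hidden_probs).astype(float)
        positive = np.outer(visible, hidden)
        # Negative ("daydream") phase: reconstruct the visible units,
        # then update the hidden units again.
        recon_probs = logistic(hidden @ weights.T)
        recon = (np.random.rand(num_visible) < recon_probs).astype(float)
        recon_hidden_probs = logistic(recon @ weights)
        recon_hidden = (np.random.rand(num_hidden) < recon_hidden_probs).astype(float)
        negative = np.outer(recon, recon_hidden)
        # Nudge the weights so the daydreams look more like the data.
        weights += learning_rate * (positive - negative)
    return weights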

Why does this update rule make sense? Note that

In the first phase, \(Positive(e_{ij})\) measures the association between the \(i\)th and \(j\)th unit that we want the network to learn from our training examples;

In the “reconstruction” phase, where the RBM generates the states of visible units based on its hypotheses about the hidden units alone, \(Negative(e_{ij})\) measures the association that the network itself generates (or “daydreams” about) when no units are fixed to training data.

So by adding \(Positive(e_{ij}) - Negative(e_{ij})\) to each edge weight, we’re helping the network’s daydreams better match the reality of our training examples.

(You may hear this update rule called contrastive divergence, which is basically a fancy term for “approximate gradient descent”.)

Examples

I wrote a simple RBM implementation in Python (the code is heavily commented, so take a look if you’re still a little fuzzy on how everything works), so let’s use it to walk through some examples.

After training the RBM on a set of users' movie preferences, the first hidden unit seems to correspond to the Oscar winners, and the second hidden unit seems to correspond to the SF/fantasy movies, just as we were hoping.

What happens if we give the RBM a new user, George, who has (Harry Potter = 0, Avatar = 0, LOTR 3 = 0, Gladiator = 1, Titanic = 1, Glitter = 0) as his preferences? It turns the Oscar winners unit on (but not the SF/fantasy unit), correctly guessing that George probably likes movies that are Oscar winners.

What happens if we activate only the SF/fantasy unit, and run the RBM a bunch of different times? In my trials, it turned on Harry Potter, Avatar, and LOTR 3 three times; it turned on Avatar and LOTR 3, but not Harry Potter, once; and it turned on Harry Potter and LOTR 3, but not Avatar, twice. Note that, based on our training examples, these generated preferences do indeed match what we might expect real SF/fantasy fans to want to watch.
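To see both directions of message passing in code, here is a rough continuation of the training sketch above, run on some made-up preferences. Which hidden unit ends up meaning SF/fantasy (and which Oscar winners) varies from run to run, so the unit index used below is an assumption.

import numpy as np

movies = ["Harry Potter", "Avatar", "LOTR 3", "Gladiator", "Titanic", "Glitter"]
# Made-up training data: three SF/fantasy fans followed by three Oscar-winner fans.
data = np.array([[1, 1, 1, 0, 0, 0],
                 [1, 1, 1, 0, 0, 0],
                 [1, 0, 1, 0, 0, 0],
                 [0, 0, 1, 1, 1, 0],
                 [0, 0, 1, 1, 1, 0],
                 [0, 0, 0, 1, 1, 0]])
weights = train_rbm(data, num_hidden=2)                  # sketch from the previous section

# Movies -> hidden units: which latent factors do George's preferences activate?
george = np.array([0, 0, 0, 1, 1, 0])                    # only Gladiator and Titanic
hidden_probs = 1.0 / (1.0 + np.exp(-(george @ weights)))

# Hidden unit -> movies: clamp one hidden unit on and regenerate the movie units.
sf_fantasy = np.array([0.0, 1.0])                        # assumes unit 2 learned SF/fantasy
movie_probs = 1.0 / (1.0 + np.exp(-(sf_fantasy @ weights.T)))
recommended = (np.random.rand(len(movies)) < movie_probs).astype(int)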

Modifications

I tried to keep the connection-learning algorithm I described above pretty simple, so here are some modifications that often appear in practice:

Above, \(Negative(e_{ij})\) was determined by taking the product of the \(i\)th and \(j\)th units after reconstructing the visible units once and then updating the hidden units again. We could also take the product after some larger number of reconstructions (i.e., repeat updating the visible units, then the hidden units, then the visible units again, and so on); this is slower, but describes the network’s daydreams more accurately.

Instead of using \(Positive(e_{ij})=x_i * x_j\), where \(x_i\) and \(x_j\) are binary 0 or 1 states, we could also let \(x_i\) and/or \(x_j\) be activation probabilities. Similarly for \(Negative(e_{ij})\).

We could penalize larger edge weights, in order to get a sparser or more regularized model.

When updating edge weights, we could use a momentum factor: we would add to each edge weight a weighted sum of the current step as described above (i.e., \(L * (Positive(e_{ij}) - Negative(e_{ij}))\)) and the step previously taken; a rough sketch of this, combined with mini-batching, appears after this list of modifications.

Instead of using only one training example in each epoch, we could use batches of examples in each epoch, and only update the network’s weights after passing through all the examples in the batch. This can speed up the learning by taking advantage of fast matrix-multiplication algorithms.
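Here is a rough sketch of how the weight penalty, momentum, and mini-batch ideas change the update step. The shapes and constants are arbitrary, and positive_batch / negative_batch stand in for the \(Positive(e_{ij})\) and \(Negative(e_{ij})\) statistics averaged over a batch of examples.

import numpy as np

weights = 0.1 * np.random.randn(6, 2)            # six movies, two hidden units
velocity = np.zeros_like(weights)
learning_rate, momentum, weight_cost = 0.1, 0.5, 0.0001

# Stand-ins for the batch-averaged Positive(e_ij) and Negative(e_ij) statistics.
positive_batch = np.random.rand(6, 2)
negative_batch = np.random.rand(6, 2)

step = positive_batch - negative_batch - weight_cost * weights   # penalize large weights
velocity = momentum * velocity + learning_rate * step            # mix in the previous step
weights += velocity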

LDA-based Email Browser

Earlier this month, several thousand emails from Sarah Palin’s time as governor of Alaska were released. The emails weren’t organized in any fashion, though, so to make them easier to browse, I’ve been working on some topic modeling (in particular, using latent Dirichlet allocation) to separate the documents into different groups.

What is Latent Dirichlet Allocation?

Briefly, given a set of documents, LDA tries to learn the latent topics underlying the set. It represents each document as a mixture of topics (generated from a Dirichlet distribution), each of which emits words with a certain probability.

For example, given the sentence “I listened to Justin Bieber and Lady Gaga on the radio while driving around in my car”, an LDA model might represent this sentence as 75% about music (a topic which, say, emits the word Bieber with 10% probability, Gaga with 5% probability, radio with 1% probability, and so on) and 25% about cars (which might emit driving with 15% probability and car with 10% probability).
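To make that generative picture concrete, here is a small numpy sketch that samples a short document from two made-up topics. The vocabulary and probabilities are invented for illustration and are not the actual model's.

import numpy as np

vocab = ["bieber", "gaga", "radio", "driving", "car", "the", "and"]

# Hypothetical topics: each topic emits words with some probability (rows sum to 1).
topics = {
    "music": np.array([0.10, 0.05, 0.01, 0.00, 0.00, 0.44, 0.40]),
    "cars":  np.array([0.00, 0.00, 0.00, 0.15, 0.10, 0.40, 0.35]),
}

# Generating a document: draw a topic mixture from a Dirichlet distribution, then
# for each word, pick a topic from the mixture and a word from that topic.
rng = np.random.default_rng(0)
topic_mixture = rng.dirichlet(alpha=[1.0, 1.0])   # e.g. might come out ~75% music, 25% cars
words = []
for _ in range(10):
    topic_name = rng.choice(list(topics), p=topic_mixture)
    words.append(rng.choice(vocab, p=topics[topic_name]))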

If you’re familiar with latent semantic analysis, you can think of LDA as a generative version. (For a more in-depth explanation, I wrote an introduction to LDA here.)

Sarah Palin Email Topics

The model learnt topics such as Wildlife, Presidential Campaign/Elections, and Trig/Family/Inspiration, each characterized by its top words. (The topic names, of course, are based on my own interpretation.)

I also thought one classification was really neat: the LDA model labeled a particular email as 10% in the Presidential Campaign/Elections topic and 90% in the Wildlife topic, and it was precisely a wildlife-based protest against Palin as a choice for VP.

Future Analysis

In a future post, I’ll perhaps see if we can glean any interesting patterns from the email topics. As a quick example, if we look at the percentage of emails in the Trig/Family/Inspiration topic over time, we see a spike in April 2008 – exactly (and unsurprisingly) the month in which Trig was born.

While working on a Twitter sentiment analysis project, I ran into the problem of needing to filter out all non-English tweets. (Asking the Twitter API for English-only tweets doesn’t seem to work, as it nonetheless returns tweets in Spanish, Portuguese, Dutch, Russian, and a couple other languages.)

Since I didn’t have any labeled data, I thought it would be fun to build an unsupervised language classifier. In particular, using an EM algorithm to build a naive Bayes model of English vs. non-English n-gram probabilities turned out to work quite well, so here’s a description.

EM Algorithm

Let’s recall the naive Bayes algorithm: given a tweet (a set of character n-grams), we estimate its language to be the language $L$ that maximizes

$P(language = L | tweet) \propto P(language = L) \prod_{g \in tweet} P(g | language = L)$

where the product runs over the n-grams $g$ in the tweet, and factorizing into individual n-gram probabilities is the “naive” independence assumption.

This would be easy if we knew the language of each tweet, since we could estimate

$P(xyz| language = English)$ as #(number of times “xyz” is a trigram in the English tweets) / #(total trigrams in the English tweets)

$P(language = English)$ as the proportion of English tweets.

Or, it would also be easy if we knew the n-gram probabilities for each language, since we could use Bayes’ theorem to compute the language probabilities for each tweet, and then re-estimate the quantities in the previous paragraph, weighting each tweet’s counts by those language probabilities.
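If we did know each tweet's language, a sketch of those estimates might look like the following. The helper names, the use of character trigrams, and the add-one smoothing are all assumptions made for illustration.

from collections import Counter
from math import log

def ngrams(text, n=3):
    return [text[i:i+n] for i in range(len(text) - n + 1)]

def train(tweets_by_language, n=3):
    # P(ngram | language): n-gram counts within each language's tweets.
    ngram_counts = {lang: Counter(g for tweet in tweets for g in ngrams(tweet, n))
                    for lang, tweets in tweets_by_language.items()}
    # P(language): the proportion of tweets in each language.
    total = sum(len(tweets) for tweets in tweets_by_language.values())
    priors = {lang: len(tweets) / total for lang, tweets in tweets_by_language.items()}
    return ngram_counts, priors

def log_score(tweet, lang, ngram_counts, priors, n=3):
    # log P(language) + sum of log P(ngram | language), with add-one smoothing
    # so that unseen n-grams don't zero out the product.
    counts = ngram_counts[lang]
    total = sum(counts.values())
    vocab = len(counts) + 1
    return log(priors[lang]) + sum(
        log((counts[g] + 1) / (total + vocab)) for g in ngrams(tweet, n))

# Usage: pick the language that maximizes the (log) naive Bayes score.
# predicted = max(priors, key=lambda lang: log_score(tweet, lang, ngram_counts, priors))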

The problem is that we know neither of these. So what the EM algorithm says is that we can simply guess:

Pretend we know the language of each tweet (by randomly assigning them at the beginning).

Using this guess, we can compute the n-gram probabilities for each language.

Using the n-gram probabilities for each language, we can recompute the language probabilities of each tweet.

Using these recomputed language probabilities, we can recompute the n-gram probabilities.

And so on, recomputing the language probabilities and n-gram probabilities over and over. While our guesses will be off in the beginning, the probabilities will eventually converge to a (local) maximum of the likelihood. (In my tests, my language detector would sometimes correctly converge to an English detector, and sometimes it would converge to an English-and-Dutch detector.)
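Here is a rough Python sketch of that loop. The helper names, the soft assignments, and the add-one smoothing are assumptions; a real implementation would also work in log space to avoid underflow on longer texts.

import random
from collections import defaultdict

def ngrams(text, n=3):
    return [text[i:i+n] for i in range(len(text) - n + 1)]

def em_language_probs(tweets, languages=("english", "other"), n=3, iterations=25):
    # 1. Start with random guesses for P(language | tweet).
    probs = []
    for _ in tweets:
        p = random.random()
        probs.append({languages[0]: p, languages[1]: 1.0 - p})

    for _ in range(iterations):
        # 2. Fix the language probabilities; recompute (weighted) n-gram counts and priors.
        counts = {lang: defaultdict(float) for lang in languages}
        totals = {lang: 0.0 for lang in languages}
        priors = {lang: 0.0 for lang in languages}
        for tweet, p in zip(tweets, probs):
            for lang in languages:
                priors[lang] += p[lang]
                for g in ngrams(tweet, n):
                    counts[lang][g] += p[lang]
                    totals[lang] += p[lang]

        # 3. Fix the n-gram probabilities; recompute P(language | tweet) via Bayes' theorem.
        vocab = len(set(g for lang in languages for g in counts[lang])) + 1
        new_probs = []
        for tweet in tweets:
            scores = {}
            for lang in languages:
                score = priors[lang] / len(tweets)
                for g in ngrams(tweet, n):
                    score *= (counts[lang][g] + 1) / (totals[lang] + vocab)  # add-one smoothing
                scores[lang] = score
            norm = sum(scores.values()) or 1.0
            new_probs.append({lang: scores[lang] / norm for lang in languages})
        probs = new_probs

    return probs  # final (soft) language assignments for each tweet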

EM Analogy for the Layman

Why does this work? Suppose you suddenly move to New York, and you want a way to differentiate between tourists and New Yorkers based on their activities. Initially, you don’t know who’s a tourist and who’s a New Yorker, and you don’t know which are touristy activities and which are not. So you randomly place people into two groups A and B. (You randomly assign a language to each tweet.)

Now, given all the people in group A, you notice that a large number of them visit the Statue of Liberty; similarly, you notice that a large number of people in group B walk really quickly. (You notice that one group of tweets often contains the n-gram “ing”, and that another group often contains the n-gram “ias”; that is, you fix the language probabilities for each tweet, and recompute the n-gram probabilities for each language.)

So you start to put people visiting the Statue of Liberty in group A, and you start to put fast walkers in group B. (You fix the n-gram probabilities for each language, and recompute the language probabilities for each tweet.)

With your new A and B groups, you notice more differentiating factors: group A people tend to carry along cameras, and group B people tend to be more finance-savvy.

So you start to put camera-carrying folks in group A, and finance-savvy folks in group B.

And so on. Eventually, you settle on two groups of people and differentiating activities: people who walk slowly and visit the Statue of Liberty, and busy-looking people who walk fast and don’t visit. Assuming there are more native New Yorkers than tourists, you can then guess that the natives are the larger group.

Results

I wrote some Ruby code to implement the above algorithm, and trained it on half a million tweets, using English and “not English” as my two languages. From just eyeballing the output, the results looked surprisingly good.

But in order to get some hard metrics and to tune parameters (e.g., the n-gram size), I needed a labeled dataset. So I pulled a set of English-language and Spanish-language documents from Project Gutenberg and split them into training and test sets: the training set consisted of 2000 lines of English and 1000 lines of Spanish, and the test set of 1000 lines of English and 1000 lines of Spanish.

Trained on bigrams, the detector resulted in:

991 true positives (English lines correctly classified as English)

9 false negatives (English lines incorrectly classified as Spanish)

11 false positives (Spanish lines incorrectly classified as English)

989 true negatives (Spanish lines correctly classified as Spanish)

for a precision of 0.989 and a recall of 0.991.

Trained on trigrams, the detector resulted in:

992 true positives

8 false negatives

10 false positives

990 true negatives

for a precision of 0.990 and a recall of 0.992.
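(As a sanity check, these figures follow directly from the counts above, taking English as the positive class:)

tp, fn, fp, tn = 992, 8, 10, 990   # trigram results above
precision = tp / (tp + fp)         # 992 / 1002 ≈ 0.990
recall = tp / (tp + fn)            # 992 / 1000 = 0.992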

Also, when I looked at the sentences the detector was making errors on, I saw that they almost always consisted of only one or two words (e.g., the incorrectly classified sentences were lines like “inmortal”, “autumn”, and “salir”). So the detector pretty much never made a mistake on a normal sentence!

Code/Demo

How do you know what machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you’re simply looking for a “good enough” algorithm for your problem, or a place to start, here are some general guidelines I’ve found to work well over the years.

How large is your training set?

If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren’t powerful enough to provide accurate models.

You can also think of this as a generative model vs. discriminative model distinction.

Advantages of some particular algorithms

Advantages of Naive Bayes: Super simple, you’re just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn’t hold, a NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well. Its main disadvantage is that it can’t learn interactions between features (e.g., it can’t learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they’re together).

Advantages of Logistic Regression: Lots of ways to regularize your model, and you don’t have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.

Advantages of Decision Trees: Easy to interpret and explain (for some people – I’m not sure I fall into this camp). They easily handle feature interactions and they’re non-parametric, so you don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don’t support online learning, so you have to rebuild your tree when new examples come in. Another disadvantage is that they easily overfit, but that’s where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they’re fast and scalable, and you don’t have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.

Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn’t linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive, hard to interpret, and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.

But…

Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, then whichever classification algorithm you use might not matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).

And to reiterate what I said above, if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize (and Middle Earth), just use an ensemble method to choose them all.
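If you do go the cross-validation route, a rough scikit-learn sketch might look like the following; it assumes scikit-learn is installed and uses the built-in iris data as a stand-in for your own problem.

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)   # stand-in for your own dataset

candidates = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=100),
    "svm": SVC(),
}

# Compare the candidates by cross-validation...
for name, clf in candidates.items():
    scores = cross_val_score(clf, X, y, cv=5)
    print(name, scores.mean())

# ...or just combine them all into a (majority-vote) ensemble.
ensemble = VotingClassifier(estimators=list(candidates.items()))
print("ensemble", cross_val_score(ensemble, X, y, cv=5).mean())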
