Kickstarter is an online crowdfunding platform for launching creative projects. When starting a new project, project owners specify a deadline and the minimum amount of money they need to raise. They receive the money (less a transaction fee) only if they reach or exceed that minimum; otherwise, no money changes hands.

What’s particularly fun about Kickstarter is that in contrast to that other microfinance site, Kickstarter projects don’t ask for loans; instead, patrons receive pre-specified rewards unique to each project. For example, someone donating money to help an artist record an album might receive a digital copy of the album if they donate 20 dollars, or a digital copy plus a signed physical cd if they donate 50 dollars.

There are a bunch of neatprojects, and I’m tempted to put one of my own on there soon, so I thought it would be fun to gather some data from the site and see what makes a project successful.

Categories

Ending Soon

The categories section really only provides a history of successful projects, though, so to get some data on unsuccessful projects as well, I took a look at the Ending Soon section of projects whose deadlines are about to pass.

It looks like about 50% of all Kickstarter projects get successfully funded by the deadline:

Interestingly, most of the final funding seems to happen in the final few days: with just 5 days left, only about 20% of all projects have been funded. (In other words, with just 5 days left, 60% of the projects that will eventually be successful are still unfunded.) So the approaching deadline seems to really spur people to donate. I wonder if it’s because of increased publicity in the final few days (the project owners begging everyone for help!) or if it’s simply procrastination in action (perhaps people want to wait to see if their donation is really necessary)?

Lesson: if you’re still not fully funded with only a couple days remaining, don’t despair.

Success vs. Failure

What factors lead a project to succeed? Are there any quantitative differences between projects that eventually get funded and those that don’t?

Two simple (if kind of obvious) things I noticed are that unsuccessful projects tend to require a larger amount of money:

and unsuccessful projects also tend to raise less money in absolute terms (i.e., it’s not just that they ask for too much money to reach their goal – they’re simply not receiving enough money as well):

Not terribly surprising, but it’s good to confirm (and I’m still working on finding other predictors).

Pledge Rewards

There’s a lot of interesting work in behavioral economics on pricing and choice – for example, the anchoring effect suggests that when building a menu, you should include an expensive item to make other menu items look reasonably priced in comparison, and the paradox of choice suggests that too many choices lead to a decision freeze – so one aspect of the Kickstarter data I was especially interested in was how pricing of rewards affects donations. For example, does pricing the lowest reward at 25 dollars lead to more money donated (people don’t lowball at 5 dollars instead) or less money donated (25 dollars is more money than most people are willing to give)? And what happens if a new reward at 5 dollars is added – again, does it lead to more money (now people can donate something they can afford) or less money (the people that would have paid 25 dollars switch to a 5 dollar donation)?

First, here’s a look at the total number of pledges at each price. (More accurately, it’s the number of claimed rewards at each price.) [Update: the original version of this graph was wrong, but I’ve since fixed it.]

Surprisingly, 5 dollar and 1 dollar donations are actually not the most common contribution.

To investigate pricing effects, I started by looking at all (successful) projects that had a reward priced at 1 dollar, and compared the number of donations at 1 dollar with the number of donations at the next lowest reward.

Up to about 15-20 dollars, there’s a steady increase in the proportion of people who choose the second reward over the first reward, but after that, the proportion decreases.

So this perhaps suggests that if you’re going to price your lowest reward at 1 dollar, your next reward should cost roughly 20 dollars (or slightly more, to maximize your total revenue). Pricing above 20 dollars is a little too expensive for the folks who want to support you, but aren’t rich enough to throw gads of money; maybe rewards below 20 dollars aren’t good enough to merit the higher donation.

Next, I’m planning on digging a little deeper into pricing effects and what makes a project successful, so I’ll hopefully have some more Kickstarter analysis in a future post. In the meantime, in case anyone else wants to take a look, I put the data onto my Github account.

Forward selection starts with no variables in the model, and at each step it adds to the model the variable with the most explanatory power, stopping if the explanatory power falls below some threshold. This is a fast and simple method, but it can also be too greedy: we fully add variables at each step, so correlated predictors don’t get much of a chance to be included in the model. (For example, suppose we want to build a model for the deliciousness of a PB&J sandwich, and two of our variables are the amount of peanut butter and the amount of jelly. We’d like both variables to appear in our model, but since amount of peanut butter is (let’s assume) strongly correlated with the amount of jelly, once we fully add peanut butter to our model, jelly doesn’t add much explanatory power anymore, and so it’s unlikely to be added.)

Forward stagewise regression tries to remedy the greediness of forward selection by only partially adding variables. Whereas forward selection finds the variable with the most explanatory power and goes all out in adding it to the model, forward stagewise finds the variable with the most explanatory power and updates its weight by only epsilon in the correct direction. (So we might first increase the weight of peanut butter a little bit, then increase the weight of peanut butter again, then increase the weight of jelly, then increase the weight of bread, and then increase the weight of peanut butter once more.) The problem now is that we have to make a ton of updates, so forward stagewise can be very inefficient.

LARS, then, is essentially forward stagewise made fast. Instead of making tiny hops in the direction of one variable at a time, LARS makes optimally-sized leaps in optimal directions. These directions are chosen to make equal angles (equal correlations) with each of the variables currently in our model. (We like peanut butter best, so we start eating it first; as we eat more, we get a little sick of it, so jelly starts looking equally appetizing, and we start eating peanut butter and jelly simultaneously; later, we add bread to the mix, etc.)

In more detail, LARS works as follows:

Assume for simplicity that we’ve standardized our explanatory variables to have zero mean and unit variance, and that our response variable also has zero mean.

Start with no variables in your model.

Find the variable $ x_1 $ most correlated with the residual. (Note that the variable most correlated with the residual is equivalently the one that makes the least angle with the residual, whence the name.)

Move in the direction of this variable until some other variable $ x_2 $ is just as correlated.

At this point, start moving in a direction such that the residual stays equally correlated with $ x_1 $ and $ x_2 $ (i.e., so that the residual makes equal angles with both variables), and keep moving until some variable $ x_3 $ becomes equally correlated with our residual.

And so on, stopping when we’ve decided our model is big enough.

For example, consider the following image (slightly simplified from the original LARS paper; $x_1, x_2$ are our variables, and $y$ is our response):

Our model starts at $ \hat{\mu_0} $.

The residual (the green line) makes a smaller angle with $ x_1 $ than with $ x_2 $, so we start moving in the direction of $ x_1 $.
At $ \hat{\mu_1} $, the residual now makes equal angles with $ x_1, x_2 $, and so we start moving in a new direction that preserves this equiangularity/equicorrelation.

If there were more variables, we’d change directions again once a new variable made equal angles with our residual, and so on.

So when should you use LARS, as opposed to some other regularization method like lasso? There’s not really a clear-cut answer, but LARS tends to give very similar results as both lasso and forward stagewise (in fact, slight modifications to LARS give you lasso and forward stagewise), so I tend to just use lasso when I do these kinds of things, since the justifications for lasso make a little more sense to me. In fact, I don’t usually even think of LARS as a model selection method in its own right, but rather as a way to efficiently implement lasso (especially if you want to compute the full regularization path).

Introduction

Suppose you see two drunks (i.e., two random walks) wandering around. The drunks don’t know each other (they’re independent), so there’s no meaningful relationship between their paths.

But suppose instead you have a drunk walking with her dog. This time there is a connection. What’s the nature of this connection? Notice that although each path individually is still an unpredictable random walk, given the location of one of the drunk or dog, we have a pretty good idea of where the other is; that is, the distance between the two is fairly predictable. (For example, if the dog wanders too far away from his owner, she’ll tend to move in his direction to avoid losing him, so the two stay close together despite a tendency to wander around on their own.) We describe this relationship by saying that the drunk and her dog form a cointegrating pair.

In more technical terms, if we have two non-stationary time series X and Y that become stationary when differenced (these are called integrated of order one series, or I(1) series; random walks are one example) such that some linear combination of X and Y is stationary (aka, I(0)), then we say that X and Y are cointegrated. In other words, while neither X nor Y alone hovers around a constant value, some combination of them does, so we can think of cointegration as describing a particular kind of long-run equilibrium relationship. (The definition of cointegration can be extended to multiple time series, with higher orders of integration.)

Other examples of cointegrated pairs:

Income and consumption: as income increases/decreases, so too does consumption.

Size of police force and amount of criminal activity

A book and its movie adaptation: while the book and the movie may differ in small details, the overall plot will remain the same.

Number of patients entering or leaving a hospital

An application

So why do we care about cointegration? In quantitative finance, cointegration forms the basis of the pairs trading strategy: suppose we have two cointegrated stocks X and Y, with the particular (for concreteness) cointegrating relationship X - 2Y = Z, where Z is a stationary series of zero mean. For example, X could be McDonald’s, Y could be Burger King, and the cointegration relationship would mean that X tends to be priced twice as high as Y, so that when X is more than twice the price of Y, we expect X to move down or Y to move up in the near future (and analogously, if X is less than twice the price of Y, we expect X to move up or Y to move down). This suggests the following trading strategy: if X - 2Y > d, for some positive threshold d, then we should sell X and buy Y (since we expect X to decrease in price and Y to increase), and similarly, if X - 2Y < -d, then we should buy X and sell Y.

Spurious regression

But why do we need the notion of cointegration at all? Why can’t we simply use, say, the R-squared between X or Y to see if X and Y have some kind of relationship? The reason is that standard regression analysis fails when dealing with non-stationary variables, leading to spurious regressions that suggest relationships even when there are none.

For example, suppose we regress two independent random walks against each other, and test for a linear relationship. A large percentage of the time, we’ll find high R-squared values and low p-values when using standard OLS statistics, even though there’s absolutely no relationship between the two random walks. As an illustration, here I simulated 1000 pairs of random walks of length 100, and found p-values less than 0.05 in 77% of the cases:

A Cointegration Test

So how do you detect cointegration? There are several different methods, but the simplest is the Engle-Granger test, which works roughly as follows:

Check that the cointegrating residuals $ e_t $ are stationary (say, by using a so-called unit root test, e.g., the Dickey-Fuller test).

Error-correction and Granger representation

Something else that should perhaps be mentioned is the relationship between cointegration and error-correction mechanisms: suppose we have two cointegrated series $ X_t, Y_t $, with autoregressive representations

where $ Y_{t-1} - \beta X_{t-1} \sim I(0) $ is the cointegrating relationship. Regarding $ Y_{t-1} - \beta X_{t-1} $ as the extent of disequilibrium from the long-run relationship, and the $ \alpha_i $ as the speed (and direction) at which the time series correct themselves from this disequilibrium, we can see that this formalizes the way cointegrated variables adjust to match their long-run equilbrium.

Summary

So, just to summarize a bit, cointegration is an equilibrium relationship between time series that individually aren’t in equilbrium (you can kind of contrast this with (Pearson) correlation, which describes a linear relationship), and it’s useful because it allows us to incorporate both short-term dynamics (deviations from equilibrium) and long-run expectations (corrections to equilibrium).

Gap Statistic

Run a k-means algorithm on the original dataset to find i clusters, and sum the distance of all points from their cluster mean. Call this sum the dispersion.

Generate a set of reference datasets (of the same size as the original). One simple way of generating a reference dataset is to sample uniformly from the original dataset’s bounding rectangle; a more sophisticated approach is take into account the original dataset’s shape by sampling, say, from a rectangle formed from the original dataset’s principal components.

Calculate the dispersion of each of these reference datasets, and take their mean.

Once we’ve calculated all the gaps (we can add confidence intervals as well; see the original paper for the formula), we can select the number of clusters to be the one that gives the maximum gap. (Sidenote: I view the gap statistic as a very statistical-minded algorithm, since it compares the original dataset against a set of reference “control” datasets.)

For example, here I’ve generated three Gaussian clusters:

And running the gap statistic algorithm, we see that it correctly detects the number of clusters to be three:

For a sample R implementation of the gap statistic, see the Github repository here.

Prediction Strength

Another cluster-counting algorithm is the prediction strength algorithm. In contrast to the gap statistic (which, as mentioned above, I find very statistically), I see prediction strength as taking a more machine learning viewpoint, since it’s formulated as a supervised learning problem validated against a test set.

To calculate prediction strength, for each i from 1 up to some maximum number of clusters:

Divide the dataset into two groups, a training set and a test set.

Run a k-means algorithm on each set to find i clusters.

For each test cluster, count the proportion of pairs of points in that cluster that would remain in the same cluster, if each were assigned to its closest training cluster mean.

The minimum over these proportions is the prediction strength for i clusters.

Once we’ve calculated the prediction strength for each number of clusters, we select the number of clusters to be the maximum i such that the prediction strength for i is greater than some threshold. (The paper suggests 0.8 - 0.9 as a good threshold, and I’ve seen 0.8 work well in practice.)

Here’s the prediction strength algorithm run on the same example above:

Again, check out a sample R implementation of the prediction strength here.

In practice, I tend to prefer using the gap statistic algorithm, since it’s a little easier to code and it doesn’t require selecting an arbitrary threshold like the prediction strength does. I’ve also found that it gives slightly better results (though the original prediction strength paper has the opposite finding).

Appendix

I ended up giving a brief description of two very common clustering algorithms, k-means and Gaussian mixture models in the comments, so I figured I might as well bring them up here.

k-means algorithm

Suppose we have a set of datapoints that we want to cluster. We want to learn two things:

A description of the clusters themselves (so that if new points come in, we can assign them to a cluster).

Which clusters our current points fall into.

We start by initializing k cluster centers (e.g., by randomly choosing k points among our datapoints). Then we repeatedly

Step A: Assign each datapoint to the nearest cluster center.

Step B: Update all the cluster centers: for each cluster i, take the mean over all points currently in the cluster, and update cluster center i to be this mean.

(Repeat steps A and B above until the cluster assignments stop changing.)

And that’s pretty much it for k-means.

k-means from an EM point of View

To ease the transition into Gaussian mixture models,
let’s also describe the k-means algorithm using EM language.

Note that if we knew for certain either 1) the exact cluster centers or 2) the cluster each point belonged to, we could trivially solve k-means, since

If we knew the exact cluster centers, all we’d have to do is assign each point to its nearest cluster center, and we’d be done.

If we knew which cluster each point belonged to, we could pick the cluster center by simply taking the mean over all points in that cluster.

The problem is that we know neither of these, and so we alternate between making educated guesses of each one:

In A step above, we pretend that we know the cluster centers, and based off this pretense, we guess which cluster each point belongs to. (This is also known as the E step in the EM algorithm.)

In the B step above, we do the reverse: we pretend that we know which cluster each point belongs to, and then try to guess the cluster centers. (This is also known as the M step in EM.)

Gaussian Mixture Models

k-means has a hard notion of clustering: point X either belongs to cluster C or it doesn’t. But sometimes we want a soft notion instead: point X belongs to cluster C with probability p (according to a Gaussian kernel). This is where Gaussian mixture modeling comes in.

To run a GMM, we start by initializing $k$ Gaussians (say, by randomly choosing $k$ points to be the centers of the Gaussians and by setting the variance of each Gaussians to be the overall variance), and then we repeatedly:

E Step: Pretend we know the parameters of each Gaussian cluster, and assign each datapoint to Gaussian cluster i with appropriate probability.

M Step: Pretend we know the probabilities that each point belongs to a given cluster. Using these probabilities, update the means and variances of each Gaussian cluster: the new mean for cluster i is the weighted mean over all points (where the weight of each point X is the probability that X belongs to cluster i), and similarly for the new variance.

This is exactly like k-means in the EM formulation, except we replace the binary clustering formula with Gaussian kernels.

Activity on the Site

My first question was how activity on the site has increased over time. I looked at number of posts, points on posts, comments on posts, and number of users.

Posts

This looks like a strong linear fit, with an increase of 292 posts every month.

Comments

For comments, I fit a quadratic regression:

Points

A quadratic regression was also a better fit for points by month:

Users

And again for the number of distinct users with a submission:

Points and Comments

My next question was how points and comments related. Intuitively, posts with more points should have more comments, but it’s nice to check (maybe really good posts are kind of boring, so don’t lead to much discussion).

First, I plotted the points and comments of each individual post:

As expected, there’s an overall positive correlation between points and comments. Interestingly, there are quite a few high-points posts with no comments.

The plot’s quite noisy, though, so let’s try cleaning it up a bit, by taking the median number of comments per points level (and removing posts at the higher end, where we have little data):

We see that posts with more points do tend to have more comments. Also, variance in number of comments is indicated by size and color, so (unsurprisingly) posts with more points have larger variance in their number of comments.

Quality of Posts

Another question was whether the quality of posts has degraded over time.

First, I computed a normalized “score” for each post, where a post’s score is defined as the number of points divided by the number of distinct users who made a submission in the same month. (The denominator is a rough proxy for the number of active users, and the goal of the score is to provide a way to compare posts across time.)

While the median score has declined over time (as perhaps should be expected, since only a fixed number of items can reach the front page):

the absolute number of quality posts, defined as posts with a score greater than the (admittedly arbitrarily chosen) threshold 0.01, has increased (until possibly a dip starting in 2010):

(Of course, without some further analysis, it’s not clear how well this score measures quality of posts, so take these numbers with a grain of salt.)

Company Trends

Also, I wanted to see how certain topics have trended over time, so I looked at how mentions of some of the big-name companies (Google, Facebook, Microsoft, Yahoo, Twitter, Apple) have changed. For each company, I plotted the percentage of posts with the company’s name in the title, and also made a smoothed plot comparing all six at the end. Note that Microsoft and Yahoo seem to be trending slightly downward, and Apple seems to be trending upward.

Measure theory studies ways of generalizing the notions of length/area/volume. Even in 2 dimensions, it might not be clear how to measure the area of the following fairly tame shape:

much less the “area” of even weirder shapes in higher dimensions or different spaces entirely.

For example, suppose you want to measure the length of a book (so that you can get a good sense of how long it takes to read). What’s a good measure? One possibility is to measure a book’s length in pages. Since books provide page counts, this is a fairly easy measure to get. However, different versions of the same book (e.g., hardcover and paperback versions) tend to have different page counts, so this page measure doesn’t satisfy the nice property of version invariance (which we would like to have, since hardcover and paperback versions of the same book take the same time to read). Also, not all books even have page counts (think Kindle books), so this measure doesn’t allow us to measure the length of all books we might want to read.

Another, possibly better measure is to measure a book’s length in terms of the number of words it contains. Now we do have version invariance (hardcover and paperback versions contain the same number of words) and we can measure the length of Kindle books as well. We can even do things like add two books together, and the measure/number of words of the concatenated books will pleasantly equal the sum of the measures/number of words of each book alone.

However, what happens when we try to measure a picture book’s length in words? We can’t – picture books are too pathological. Maybe we could say that a picture book has measure zero (since a picture book has no words), but then we get unhappy things like books of measure zero taking a really long time to read (imagine a really long picture book). So maybe a better option is to say that picture books are simply unmeasurable. Whenever someone asks for the length of a picture book, we ignore them, and this way our measure will continue to be a good approximation of reading time and we get to keep our other nice properties as well.

Similarly, measure theory asks questions like:

How do we define a measure on our space? (Jordan measure and Lebesgue measure are two different options in Euclidean space.)

Which objects are measurable/which objects can we say it’s okay not to measure in order to preserve nice properties of our measure? (The Banach-Tarski ball can be rigidly reassembled into two copies of the same shape and size as the original, so we don’t want it to be measurable, since then we would lose additivity properties.)

And once we’ve defined a “generalized area” (our measure), we can try to generalize other mathematical concepts as well. For example, recall that the (Riemann) integral that you learn in calculus measures the area under a curve. What happens if we replace the “area” in the Riemann integral with our new, generalized measure (e.g., to get the Lebesgue integral)? Measure theory also helps make certain probability statements mathematically precise (e.g., we can say exactly what it means that a fair coin flipped infinitely often will “almost never” land heads more than 50% of the time).

Suppose you’re very indecisive, so whenever you want to watch a movie, you ask your friend Willow if she thinks you’ll like it. In order to answer, Willow first needs to figure out what movies you like, so you give her a bunch of movies and tell her whether you liked each one or not (i.e., you give her a labeled training set). Then, when you ask her if she thinks you’ll like movie X or not, she plays a 20 questions-like game with IMDB, asking questions like “Is X a romantic movie?”, “Does Johnny Depp star in X?”, and so on. She asks more informative questions first (i.e., she maximizes the information gain of each question), and gives you a yes/no answer at the end.

Thus, Willow is a decision tree for your movie preferences.

But Willow is only human, so she doesn’t always generalize your preferences very well (i.e., she overfits). In order to get more accurate recommendations, you’d like to ask a bunch of your friends, and watch movie X if most of them say they think you’ll like it. That is, instead of asking only Willow, you want to ask Woody, Apple, and Cartman as well, and they vote on whether you’ll like a movie (i.e., you build an ensemble classifier, aka a forest in this case).

Now you don’t want each of your friends to do the same thing and give you the same answer, so you first give each of them slightly different data. After all, you’re not absolutely sure of your preferences yourself – you told Willow you loved Titanic, but maybe you were just happy that day because it was your birthday, so maybe some of your friends shouldn’t use the fact that you liked Titanic in making their recommendations. Or maybe you told her you loved Cinderella, but actually you really really loved it, so some of your friends should give Cinderella more weight. So instead of giving your friends the same data you gave Willow, you give them slightly perturbed versions. You don’t change your love/hate decisions, you just say you love/hate some movies a little more or less (formally, you give each of your friends a bootstrapped version of your original training data). For example, whereas you told Willow that you liked Black Swan and Harry Potter and disliked Avatar, you tell Woody that you liked Black Swan so much you watched it twice, you disliked Avatar, and don’t mention Harry Potter at all.

By using this ensemble, you hope that while each of your friends gives somewhat idiosyncratic recommendations (Willow thinks you like vampire movies more than you do, Woody thinks you like Pixar movies, and Cartman thinks you just hate everything), the errors get canceled out in the majority. Thus, your friends now form a bagged (bootstrap aggregated) forest of your movie preferences.

There’s still one problem with your data, however. While you loved both Titanic and Inception, it wasn’t because you like movies that star Leonardio DiCaprio. Maybe you liked both movies for other reasons. Thus, you don’t want your friends to all base their recommendations on whether Leo is in a movie or not. So when each friend asks IMDB a question, only a random subset of the possible questions is allowed (i.e., when you’re building a decision tree, at each node you use some randomness in selecting the attribute to split on, say by randomly selecting an attribute or by selecting an attribute from a random subset). This means your friends aren’t allowed to ask whether Leonardo DiCaprio is in the movie whenever they want. So whereas previously you injected randomness at the data level, by perturbing your movie preferences slightly, now you’re injecting randomness at the model level, by making your friends ask different questions at different times.

(Way back when, I went through all the Netflix prize papers. I’m now (very slowly) trying to clean up my notes and put them online. Eventually, I hope to have a more integrated tutorial, but here’s a rough draft for now.)

There are two approaches to collaborative filtering: neighborhood methods and latent factor models.

Neighborhood models are most effective at detecting very localized relationships (e.g., that people who like X-Men also like Spiderman), but poor at detecting a user’s overall signals.

Latent factor models are best at estimating overall structure (e.g., that a user likes horror movies), but are poor at detecting strong associations among small sets of closely related items.

Since the two approaches have complementary strengths and weaknesses, we should integrate the two; this integration is the focus of this paper.

Preliminaries

As mentioned in previous papers, we should normalize out common effects from movies. Throughout the rest of this paper, Koren uses a baseline estimate of overall rating mean + user deviation from average + movie deviation from average for the rating of user i on movie i; estimation of the latter two parameters are done by solving a regularized least squares problem.

Koren then describes using a binary matrix (1 for rated, 0 for not rated) as a source of implicit feedback. This is useful because the mere fact that a user rated many science fiction movies (say) suggests that the user likes science fiction movies.

A Neighborhood Model

Recall the previous paper, where we modeled each rating $r_{ui}$ as

$$r_{ui} = b_{ui}+ \sum_{N \in N(i; u)} (r_{uj} - b_{uj}) w_{ij},$$

where $N(i; u)$ is the k items most similar to i among the items user u rated, and the $w_{ij}$ are parameters to be learned by solving a regularized least squares problem.

This paper makes several enhancements to that model. First, we replace $N(i; u)$ with $R^k(i; u)$, the intersection of the k items most similar to i (among all items) intersected with the items user u rated. Also, we denote by $N^k(i; u)$ the intersection of the k items most similar to i with the items user u has provided implicit feedback for. This gives us

Notice that by taking the intersection of the k items most similar to i with the items user u rated (giving perhaps a set of size less than k), rather than taking the k items most similar to i among the items user u rated, we let our model be influenced not only by what a user rates, but also by what a user does not rate. For example, if a user does not rate LOTR 1 or LOTR 2, his predicted rating for LOTR 3 is penalized.

This implies that our current model encourages greater deviations from baseline estimates for users that provided many ratings or plenty of implicit feedback. In other words, for well-modeled users with a lot of input, we are willing to predict quirkier and less common recommendations; users we have less information about, on the other hand, receive safer, baseline estimates.

Nonetheless, this dichotomy between power users and newbie users is perhaps overemphasized by our current model, so we moderate the dichotomy by modifying our model to be

Asymmetric-SVD

where $R(u)$ is the set of items user u has rated, and $N(u)$ is the set of items user u has provided implicit feedback for. In other words, this model represents users through the items they prefer, rather than expressing users in a latent feature space. This model has several advantages:

Asymmetric-SVD does not parameterize users, so we do not need to wait to retrain the model when a user comes in. Instead, we can handle new users as soon as they provide feedback.

Predictions are a direct function of past feedback, so we can easily explain predictions. (When using a pure latent feature solution, however, explainability is difficult.)

As usual, parameters are learned via a regularized least-squares minimization.

SVD++

Another approach is to continue modeling users as latent features, while adding implicit feedback. Thus, we replace $p_u$ with $p_u + |N(u)|^{-0.5} \sum_{j \in N(u)} y_j$. While we lose the easily explainability and immediate feedback of the Asymmetric-SVD model, this approach is likely more accurate.

An Integrated Model

An integrated model incorporating baseline estimates, the neighborhood approach, and the latent factor approach is as follows:

Note that we have used $(\mu + b_u + b_i)$ as our baseline estimate. We also used the SVD++ model, but we could use the Asymmetric-SVD model instead.

This rule provides a 3-tier model for recommendations:

The first baseline group describes general properties of the item and user. For example, it may say that “The Sixth Sense” movie is known to be a good movie in general, and that Joe rates like the average user.

The next latent factor group may say that since “The Sixth Sense” and Joe rate high on the Psychological Thrillers Scale, Joe may like The Sixth Sense because he likes this genre of movies in general.

The final neighborhood tier makes fine-grained adjustments that are hard to file, such as the fact that Joe rated low the movie “Signs”, a similar psychological thriller by the same director.

As usual, model parameters are determined by minimizing the regularized squared error function through gradient descent.

(Way back when, I went through all the Netflix prize papers. I’m now (very slowly) trying to clean up my notes and put them online. Eventually, I hope to have a more integrated tutorial, but here’s a rough draft for now.)

tl;dr This paper’s main innovation is deriving neighborhood weights by solving a least squares problem, instead of using a standard similarity function to compute weights.

This paper improves upon the standard neighborhood approach to collaborative filtering in three areas: better data normalization, better neighbor weights (this is the key section), and better use of user data. I’ll first review the standard neighborhood approach, and follow with a description of these enhancements.

Background: Standard Neighborhood Approach to Collaborative Filtering

Recall that there are two types of neighborhood approaches:

User-based approaches: to predict user i’s rating of item j, take the users most similar to user i, and perform a weighted average of their ratings of item j.

Item-based approaches: to predict user i’s rating of item j, perform a weighted average of user i’s ratings of items similar to item j.

For example, to predict how you would rate the first Harry Potter movie, the user-based approach looks at how your friends rated the first Harry Potter movie, while the item-based approach looks at how you rated movies like Lord of the Rings and Twilight.

Better Data Normalization

Suppose I ask my friend Chris whether I should watch the latest Twilight movie. He tells me he would rate it 4.0/5 stars. Great, that’s a high rating, so that means I should watch it – or does it? It turns out that Chris is a super cheerful guy who’s never met a movie he didn’t like, and his average rating for a movie is actually 4.5/5 stars. So Twilight is actually less than average for him, and hence 4.0/5 stars from Chris isn’t actually that hearty a recommendation.

As another example, suppose you look at doctor ratings on Yelp. They’re abnormally high: the average is far from 3/5 stars. Why is this? Maybe it’s harder for people to change doctors than it is to go to a new restaurant, so people might not want to rate a doctor poorly when they know they’ll have to see the doctor again. Thus, an average rating of 5 stars on a McDonalds restaurant is much more impressive than an average of 5 stars on Dr. Joe.

The lesson is that when using existing ratings, we should normalize out these types of effects, so that ratings are as comparable as possible.

Another way of thinking about this is that we are simply building a regression model. That is, for each user u, we have a model
$r_{ui} = (\sum \theta_u x_{ui}) + SpecificRating$, where the $x_{ui}$ are common explanatory variables and we want to estimate $\theta_u$; and similarly for each item i. Once we’ve estimated the $\theta_u$, we can use the fancier neighborhood models on the specific ratings.

For example, suppose we want to predict Bob’s rating of Titanic. We’ve built a regression model with two explanatory variables, whether the movie was Oscar-nominated (1 if so, -1 if not) and whether the movie contains Kate Winslet (1 if so, -1 if not), and we’ve determined that Bob’s weights on these two variables are -2 (Bob tends to hate Oscar movies) and +1.5 (Bob likes Kate Winslet). Similarly, his friend John has weights +1 and -0.5 for these two variables (John likes Oscars, but dislikes Kate Winslet). So if we know that John rated Titanic a 4, then we have 4 = 1(1) + -0.5(1) + (John’s specific rating), so John’s specific rating of Titanic is 3.5. If we use John’s rating alone to estimate Bob’s, we might guess that Bob would rate Titanic -2(1) + 1.5(1) + (John’s specific rating) = 3.0.

To estimate the $\theta_u$, we actually perform this estimation in sequence: each explanatory variable is used to model the residual from the previous explanatory variable. Also, instead of using the maximum-likelihood unbiased estimator $\hat{\theta_u} = \frac{\sum r_{ui} x_{ui}}{x _ {ui} ^ 2}$, we shrink the weights to prevent overfitting. From a Bayesian point of view, the shrinkage arises from a hierarchical model where the true $\theta_u \sim N(\mu, \sigma^2)$, and $\hat{\theta_u} | \theta_u \sim N(\theta_u, \sigma_u^2)$, leading to $E(\theta_u | \hat{\theta_u}) = \frac{\sigma^2 \hat{\theta_u} + \sigma_u^2 \mu}{\sigma^2 + \sigma_u^2}$.

In practice, the explanatory variables Bell and Koren found to work well included the overall mean of all ratings, each movie’s specific mean, each user’s specific mean, time since movie release, time since user join, and number of ratings for each movie.

Better Neighbor Weights

Let’s consider some deficiencies of the neighborhood approach:

Suppose I want to use the first LOTR movie to predict ratings of the first Harry Potter movie. To do this, I need to say how much weight the first LOTR movie should have in this prediction. But how do I choose this weight? Standard neighborhood approaches essentially pick arbitrary similarity functions (e.g., Pearson correlation, cosine distance) as the weight, possibly testing several similarity functions to see which gives the best performance, but is there a more principled approach to choosing weights?

The standard neighborhood approach ignores the fact that neighbors aren’t independent. For example, suppose all three LOTR movies are neighbors of the first HP movie. Since the three LOTR movies are so similar to each other, the standard approach is overcounting their information. Here’s an analogy: suppose I ask five of my friends where I should eat tonight. Three of them live together (boyfriend, girlfriend, and roommate), and they all recently took a trip together to Japan and are sick of Japanese food, so they vehemently recommend against sushi. Thus, my friends’ recommendations have a stronger bias than would appear if I asked five friends who didn’t know each other at all.

We’ll see how using an optimization method to derive weights (as opposed to deriving weights via a similarity function) overcomes these two limitations.

Recall our problem: we want to predict $r_{ui}$, user u’s rating of item i, and what we have is a set $N(i; u)$ of K neighbors of item i that user u has also rated. (These K neighbors are selected via a similarity function, as is standard.) So what we want to do is find weights $w_{ij}$ such that $r_{ui} = \sum_{j \in N(i; u) w_{ij} r_{uj}}$. A natural approach, then, is simply to choose our weights to minimize $\min_w \sum_{v \neq u} \left( r_{vi} - \sum_{j \in N(i; u)} w_{ij} r_{vj}\right)^2$.

Notice how this optimization solves our two problems above: it’s not only a more principled approach (we choose our weights by minimizing squared error), but by deriving weights simultaneously, we overcome interaction effects.

However, not all users have rated every movie, so some of the ratings may be missing from the above formulas. So we should instead use an estimate of A and b, such as $\bar{A}_{jk} = \frac{\sum_{v \in U(j,k)} r_{vj} r_{vk}}{|U(j, k)|}$, where $U(j, k)$ is the set of users who rated both j and k, and similarly for b. To avoid overfitting, we should further modify by shrinking to a common mean: $\hat{A}_{jk} = \frac{|U(J,K)|\bar{A}_{jk} + \beta A_{\mu}}{|U(j,k)| + \beta}$, where $\beta$ is a shrinkage parameter and $A_{\mu}$ is the mean over all $\bar{A}$, and similarly for b.

Note that another benefit of our optimization-derived weights is that the weights of neighbors are no longer constrained to sum to 1. Thus, if an item simply has no strong neighbors, the neighbors’ prediction will have only a small effect.

Also, when engineering these methods in practice, we should precompute all item-item similarities and all entries in the matrix $A$.

Better Use of User Data

Neighborhood models typically follow the item-based approach for two reasons:

There are typically many more users than items, and new users come in much more frequently than new items, so it is easier to compute all pairs of item-item similarities.

Users have diverse tastes, so they aren’t as similar to each other. For example, Alice and Eve may both like horror movies, but disagree on comedies.

But there are various reasons we might want to use a user-based approach in addition to an item-based approach (say, a user hasn’t rated many items yet, but we can find similar users based on other types of data, such as browsing history; or, we want to predict user u’s rating on item i, but user u hasn’t rated any items similar to i), so let’s see if we can get around these limitations.

To get around the first limitation, we can project users into a lower-dimensional space (say, by using a singular value decomposition), where we can use a space-partitioning data structure (e.g., a kd-tree) or a nearest-neighbor algorithm (e.g., locality sensitive hashing) to find neighboring users.

To get around the second limitation – that a user u may be predictive of user v for some items, but less so for others – we incorporate item-item similarity into our weighting method. That is, when using the user-neighborhood model to predict user u’s rating on item i, we give higher weight to items similar to i, by choosing the weights to minimize $\min_w \sum_{j \neq i} s_{ij} \left( r_{uj} - \sum_{v \in N(u, i)} w_{uv} r_{vj} \right)^2,$ where the $s_{ij}$ are item-item similarities.

Appendix: Shrinkage

Parameter shrinkage is used a couple times in the paper, so let’s explain what it means.

Suppose that we want to estimate the probability of a coin. If we flip it once and see heads, then the maximum-likelihood estimate of heads is 1. But (as is typical for maximum-likelihood estimates), this is severe overfitting, and what we should do instead is shrink this maximum-likelihood estimate to a prior estimate of the probability of heads, say 1/2. (Note that shrinkage doesn’t necessarily mean decreasing the number, just moving it towards a prior estimate).

How should we perform this shrinkage? If our maximum-likelihood estimate of our parameter $\theta$ is $x$ and our prior mean is $\mu$, a natural estimation of $\theta$ is to use a weighted mean $\alpha x + (1 - \alpha)\mu$, where $\alpha$ is some measure of the degree of belief in our maximum likelihood estimate.

We can also view it as a Bayesian posterior: if we use a prior $\theta \sim N(\mu, \tau)$ (where $\tau$ is the precision of our Gaussian, not the variance) and a conditional distribution $x | \theta \sim N(\theta, \tau_x)$, then the posterior mean of $\theta$ is $\theta = \frac{\tau_x}{\tau_x + \tau}x + \frac{\tau}{\tau_x + \tau}\mu,$ which is equivalent to the form above.

Lots of people know that the Riemann Hypothesis has something to do with prime numbers, but most introductions fail to say what or why. I’ll try to give one angle of explanation.

Layman’s Terms

Suppose you have a bunch of friends, each with an instrument that plays at a frequency equal to the imaginary part of a zero of the Riemann zeta function. If the Riemann Hypothesis holds, you can create a song that sounds exactly at the prime-powered beats, by simply telling all your friends to play at the same volume.

Mathematical Terms

Let $ \pi(x) $ denote the number of primes less than or equal to x. Recall Gauss’s approximation: $ \pi(x) \approx \int\_2\^x \frac{1}{\log t} \,dt $ (aka, the “probability that a number n is prime” is approximately $ \frac{1}{\log n} $).

we can again interpret it as an error-correcting formula for counting the primes.

Now since $ \psi(x) $ is a step function that jumps at the prime powers, its derivative $ \psi’(x) $ has spikes at the prime powers and is zero everywhere else. So consider

$$ \psi’(x) = 1 - \frac{1}{x(x\^2 - 1)} - \sum\_z x\^{z-1} $$

It’s well-known that the zeroes of the Riemann zeta function are symmetric about the real axis, so the (non-trivial) zeroes come in conjugate pairs $ z, \bar{z} $. But $ x\^{z-1} + x\^{\bar{z} - 1} $ is just a wave whose amplitude depends on the real part of z and whose frequency depends on the imaginary part (i.e., if $ z = a + bi $, then $ x\^{z-1} + x\^{\bar{z}-1} = 2x\^{a-1} cos (b \log x) $), which means $ \psi’(x) $ can be decomposed into a sum of zeta-zero waves. Note that because of the $2x\^{a-1}$ term in front, the amplitude of these waves depends only on the real part $a$ of the conjugate zeroes.

For example, here are plots of $ \psi’(x) $ using 10, 50, and 200 pairs of zeroes:

So when the Riemann Hypothesis says that all the non-trivial zeroes have real part 1/2, it’s hypothesizing that the non-trivial zeta-zero waves have equal amplitude, i.e., they make equal contributions to counting the primes.

In Fourier-poetic terms, when Flying Spaghetti Monster composed the music of the primes, he built the notes out of the zeroes of the Riemann zeta function. If the Riemann Hypothesis holds, he made all the non-trivial notes equally loud.

Hybrid CEO. Our customers include companies like Airbnb, Dropbox, Expedia, Pandora, and Pinterest, who use us to power their personalization platforms and run millions of human/AI tasks every month. Hit me up if you're interested!