I recently got a chance to help out on Gadfly, a twitter client that @SteveHolstad, @leeiroth, and @eklimcz from Clarity are working on. If you haven’t had a chance to check it out, I definitely recommend it; I’ve been waiting a long while for a great web-based twitter client to show up!

Gadfly allows you to rate individual tweets from 1 to 5 stars. Using this ratings data, we started looking into building a recommendation engine that would suggest other tweeple that you might find interesting.

Should Matter:

Co-follower rate: when two people follow the same people, you can numerically represent how similar their tastes are. This is a crucial concept for most recommendation algorithms.

What your followers follow, and what they tweet about: if you index the text of individual tweets, you can do some very interesting stuff such as clustering similar tweeple into groups, e.g. people who tweet about a certain technology.

Shouldn’t Matter:

Follower counts

When you started tweeting

Frequency of tweets

Pretty straightforward… Your follower count, when you started tweeting, and how often you tweet don’t represent the quality of your tweets. These metrics have no place in a twitter recommendation algorithm.

Practical Challenges

There are some practical challenges, however, when trying to incorporate the “should matter” metrics into a twitter recommendation algorithm; I’ll explain those as they relate to Gadfly in particular.

Gadfly is a web-based application built in Silverlight; it uses isolated storage as a virtual file system to store application data on your machine. We’re somewhat limited in how much data we can store in isolated storage, e.g. 1MB for an in-browser Silverlight application.

So why not store followers and tweet content in a central database? If storage space weren’t an issue, this would certainly be feasible. However, we didn’t want to get into the business of constantly querying follower and following lists and storing them in our database, let alone storing the content (raw or indexed) of tweets. There’s also the matter of how quickly we’d eat into the twitter API rate limit with these operations.

Enter Ratings

The best way to explain this is to use the good old Netflix movie ratings example.

Netflix allows users to rate movies that they have watched. When new users join Netflix and begin rating movies, you can compare their preferences to existing users who have rated those same movies. Based on this data, you can identify a set of existing users who are most “similar” to the new user and use those users’ ratings for other movies to fill in the gaps in the new user’s profile. For example, a recommendation engine can predict that you would give The Bourne Supremacy a rating of 4 stars and place it on your recommended list.

This describes the KNN (K-Nearest Neighbor) algorithm in its simplest form. This is the algorithm that Netflix uses (albeit with a ton of tweaks and optimizations) to recommend movies to users.

How does this apply to a twitter recommendation engine? Instead of rating movies, you rate tweets. Since you’ll often rate tweets from the same twitter user multiple times, we average those ratings into a single figure that represents how a Gadfly user rates that twitter user. Over time, we’ll probably weight the average toward more recent tweets; otherwise new ratings will have little to no effect on the calculation.
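As a rough sketch of that averaging step (the function name and data shapes here are hypothetical, not Gadfly’s actual code):

```python
from collections import defaultdict

def average_ratings(tweet_ratings):
    """Collapse per-tweet star ratings into one rating per twitter user.

    tweet_ratings: list of (twitter_user_id, stars) pairs, one per rated tweet.
    Returns a dict mapping twitter_user_id -> average star rating.
    """
    totals = defaultdict(lambda: [0.0, 0])  # twitter_user_id -> [sum, count]
    for user_id, stars in tweet_ratings:
        totals[user_id][0] += stars
        totals[user_id][1] += 1
    return {uid: s / n for uid, (s, n) in totals.items()}

# Example: three rated tweets from twitter user 42, one from user 99.
ratings = [(42, 4), (42, 5), (42, 3), (99, 2)]
print(average_ratings(ratings))  # {42: 4.0, 99: 2.0}
```

A recency-weighted version would just replace the uniform sum with a weight that decays with tweet age.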

When other Gadfly users rate the same twitter user, we can compute how similar these users’ tastes are. We can then predict what rating users will give to twitter users that they haven’t rated. If the predicted rating is above an acceptable threshold that we set, we can recommend that twitter user as a Gadfly Pick.

Computing Similarity

The process that powers a KNN-based recommendation engine is always run in batch mode because of how computationally intensive it is.

The first step is to calculate the similarity (or correlation) between every pair of users using a coefficient such as Pearson. This is a number between -1 and 1: -1 means the users’ tastes are opposed, 0 means they are not correlated at all, and 1 means they are as similar as they can be.

The key here, though, is that you can’t measure correlation between two users unless they have some ratings in common – we define this parameter to the algorithm as the Minimum Number of Ratings in Common. Since our data will be very sparse at first, we set this to 1; we can increase it to make the engine more selective as more Gadfly users enter ratings.
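A minimal sketch of the pairwise similarity step, assuming ratings are stored as per-user dicts (the names are illustrative). One caveat worth noting: Pearson is mathematically undefined with fewer than two ratings in common, or when either user’s common ratings have zero variance, so this sketch returns None in those cases:

```python
import math

def pearson_similarity(ratings_a, ratings_b, min_common=1):
    """Pearson correlation between two users' ratings of the same twitter users.

    ratings_a, ratings_b: dicts of twitter_user_id -> average star rating.
    Returns None when the users have too few rated twitter users in common,
    or when either user's common ratings have zero variance.
    """
    common = set(ratings_a) & set(ratings_b)
    if len(common) < max(min_common, 2):
        return None  # Pearson is undefined on fewer than two points
    xs = [ratings_a[u] for u in common]
    ys = [ratings_b[u] for u in common]
    n = len(common)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # flat ratings carry no correlation signal
    return cov / math.sqrt(var_x * var_y)
```

In a batch job this would run over every pair of Gadfly users, keeping only pairs that clear the minimum-in-common bar.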

To the left is an example of some of the ratings data; this is all sample data, so the similarity numbers are artificially high.

Generating Picks

Now that the system has computed all possible combinations of similarity among users with a minimum number of ratings in common, the next step is to generate the Gadfly Picks for each user.

When computing the picks for a particular user, e.g. UserId 67 in the above sample data, the K in KNN refers to the number of most similar users that the algorithm will use to predict ratings for twitter users that User 67 hasn’t rated. If the predicted rating is greater than a specified threshold, we recommend that twitter user to User 67, and they appear in User 67’s Gadfly Picks timeline.

So if we’re using a value of K=3 and the above sample data to put together the Gadfly Picks for User 67, the top 3 similar users are 68, 71, and 74 with Pearson values of 1.0, 1.0, and 0.992583334. Let’s use those numbers to predict how User 67 would rate a particular twitter user, assuming the following information on how User 67′s top 3 most similar users have rated that twitter user:

The predicted rating should be pretty self-explanatory. If the three Gadfly users that User 67 is most similar to gave this twitter user ratings of 2, 3, and 2, a predicted rating of 2.33 is right in line.
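Using the sample numbers above, the prediction step can be sketched as a similarity-weighted average (one common KNN variant; with similarities this close to 1.0 it lands almost exactly on the plain average). The function name and threshold constant are illustrative:

```python
def predict_rating(neighbors):
    """Similarity-weighted average of the K nearest neighbors' ratings.

    neighbors: list of (similarity, rating) pairs for the K most similar
    Gadfly users who have rated the twitter user in question.
    """
    weight = sum(sim for sim, _ in neighbors)
    if weight == 0:
        return None
    return sum(sim * rating for sim, rating in neighbors) / weight

# User 67's K=3 nearest neighbors from the sample data, with their ratings
# (2, 3, and 2) of the twitter user being considered.
neighbors = [(1.0, 2), (1.0, 3), (0.992583334, 2)]
predicted = predict_rating(neighbors)
print(round(predicted, 2))  # 2.33

PREDICTED_RATING_THRESHOLD = 3.5
print(predicted >= PREDICTED_RATING_THRESHOLD)  # False: not a Gadfly Pick
```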

We currently have the value of Predicted Rating Threshold set to 3.5 – this user wouldn’t make the cut.

Right now we’re generating each user’s Gadfly Picks once a day; we can make this more frequent as usage increases. It will be very interesting to see how much we have to tweak the backend SQL when it has to deal with much larger datasets.

Limitations and Opportunities for Improvement

I’m excited to get this rolled out in the next couple of days, but I can already identify several areas which I want to work on improving after we iron out the initial kinks.

Twitter Users in a Search Timeline have a Twitter User Id = 0

As I understand it, the twitter search API hasn’t yet been integrated into twitter’s core API. The implication of this for us is that the Twitter User Id of a twitter user in a search timeline is always 0. We’re thus not able to relate that user to an actual twitter user and can’t use that piece of ratings data. Hopefully this will be a non-issue as the APIs continue to come together.

The Engine Recommends Me to Me

I thought it was funny when I saw myself as a Gadfly Pick; we need to tweak the algorithm so that I’m not recommended to myself!
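The simplest fix is probably to filter a user’s own twitter id out of the candidate picks, and the unresolvable id 0 from search timelines can be dropped in the same pass. A hypothetical sketch:

```python
def filter_picks(own_twitter_id, candidate_picks):
    """Drop the user's own twitter account, and the placeholder id 0 that
    the search API currently returns, from a list of candidate picks.

    candidate_picks: list of (twitter_user_id, predicted_rating) pairs.
    """
    return [
        (uid, rating)
        for uid, rating in candidate_picks
        if uid != own_twitter_id and uid != 0
    ]

picks = [(67, 4.2), (0, 4.8), (88, 3.9)]
print(filter_picks(67, picks))  # [(88, 3.9)]
```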

Ability to Rate the Recommendations

The only way to truly measure the effectiveness of a recommendation algorithm is to compare the predicted rating to the actual rating provided by the user. The functionality to rate the recommended picks is already there; we haven’t proved this out yet, but we suspect that the algorithm will automagically take care of things.
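One standard way to score predictions against the ratings users eventually supply is root-mean-square error. This sketch assumes we log hypothetical (predicted, actual) pairs as users rate their picks:

```python
import math

def rmse(predicted_vs_actual):
    """Root-mean-square error between predicted and user-supplied ratings.

    predicted_vs_actual: list of (predicted_rating, actual_rating) pairs
    collected as users rate their Gadfly Picks. Lower is better.
    """
    if not predicted_vs_actual:
        return None
    squared_error = sum((p - a) ** 2 for p, a in predicted_vs_actual)
    return math.sqrt(squared_error / len(predicted_vs_actual))

# Example: the engine predicted 4.0 and 2.33; the users later rated 5 and 2.
print(round(rmse([(4.0, 5), (2.33, 2)]), 3))
```

Tracking this number over time would tell us whether tweaks to K, the similarity measure, or the thresholds are actually helping.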

We’ll be rolling this out in the next day or two; I’m really interested in taking a look at the quality of the ratings as we get more realistic data.