Thursday, November 17, 2011

SpotLight: Tianqi Chen - SVDFeature collaborative filtering project

Here is an interview With Tianqi Chen, the winner in the 3rd place in ACM KDD CUP track 1 this year. Since he got two places ahead of us (Lebusishu team), I definitely appreciate his achievement! Tianqi is one of the main collaborators of the SVDFeature project. In a nutshell, SVDFeature is an efficient c++ implementation of stochastic gradient descent that allows for additional feature tuning by the user. It can scale to large models since it keeps only part of the problem in memory. However, it is not fully parallel yet.

1) First I wanted to congratulate you for your great achievement in the KDD CUP this year. In retrospective, can you explain what are the crucial features of SVDFeature project that made it so successful ?

One of the key factor of winning is ensemble predictor of course, our teammates in HKUST did a great job in that. On the other hand, SVDFeature creates the chance to build a powerful single predictor that takes many information into account, we report the best single method(RMSE=22.12 without blending) in track1 using SVDFeature. The most interesting thing of SVDFeature is user can build variants of matrix factorization model by feature defining. In our final predictor of track1, we incorporate user temporal information, implicit feedback, item neighborhood information, taxonomy information etc. Many famous models such as time-SVD++, SVD with neighborhood can also be implemented by defining user temporal related feature, and neighborhood feature. Another important thing of SVDFeature is we don't try to load all the training data into the memory, which suits the large-scale CF data such as KDDCup(of course we still need the memory for storing the model). For example, we can use less than 2GB memory to get comparable performance in KDDCup track1 and track2.

2) What is the project focus?

We really want to make the project as an generic tool for developing Informative Collaborative Filtering and Ranking Algorithms. By informative, we mean to build matrix factorization algorithm that takes available information into account( taxonomy information, user temporal, neighborhood, social network, users' previous click history or browsing history etc.). Similar context exists long in traditional classification tools such as LIBSVM, SVMLight, SVMRank. These tools takes in feature vectors and give predict using the information given in the feature-vector. We want SVDFeature to be the counterpart of the feature-based predictors in Collaborative Filtering and Ranking. Users can define features to describe user(e.g user's click history and browsing history), item(e.g taxonomy information), and global effect(e.g neighborhood effect) to build actually new variants of matrix factorization. In that sense, SVDFeature is suitable for those who want to develop new recommendation algorithms for feature define. It's also suitable for building context-aware recommendation algorithms.

3) Who are the contributors?

There are four authors in the project page, I am the major author of the project. We do welcome comments and feedback on the project.

4) What is the open source model?

The project is released under Apache License, Version 2.0. The toolkit doesn't depends on extra libraries and compiles under both linux(g++) and windows(MSVC).

(2) Build new CF algorithms( for ranking or for rate prediction) that makes use of extra information by defining new features.

(3) Working with a team building feature components independently and merge up together to see the result. Since the input only involves different kinds of feature, we can have one guy working on user temporal feature using python, and another one generation item neighborhood feature using C++. The generated features can be used to build independent models or merge up to build a hybrid predictor.

(4) Large-scale data handling, the toolkit works for existing large-scale datasets such as kddcup, netflix with frugal usage of memory and good speed, if we build features for basic matrix factorization in KDDCup track1, here are some statistics for one round over the dataset: Intel(R) Core(TM)2 Quad CPU Q8400 @ 2.66GHz. 274 sec(64 factors), 357sec(128 factors) , the memory cost is within 2GB.

6) Are there some algorithms which will be hard to express using your framework?

Yes, to be honest, currently SVDFeature supports feature-based matrix factorization model, that involves user, item and global feature. Which do covers large-range of existing algorithms. It currently doesn't cover the models that involves multiple independent factorization parts. For example, tensor time bias, that has time vs user/item in addition to user vs item factor. It's not hard to add additional tweaks to the code to add some specific features. We do welcome requests of specific features. On the other hand, we want to think about a more abstract model that involves configuration of multiple factorization parts, that's a future direction of the toolkit.

7) Can you point to some tutorials for people who want to try it out? Conceptually, what is needed from the users to be defined?

We have a detailed user reference manual and technical report about the model that comes with the project archive. The user need to define a feature-file similar to the common SVM format and provide a configure file that specifies the parameters such as number of latent factors, learning_rate. etc. We have a demo folder in the toolkit archive that have simple example scripts for movielen data. We also have a non-trivial example that shows how to do experiment on KDDCup track2 dataset. We also welcome questions from users.

8) Is the implementation parallel?

The training algorithm is not parallel, but we do create a independent thread to fetch the input data to ensure the speed, we also use SSE2 to optimize some vector operations.

9) Can you give some performance numbers for kdd cup data?

For the KDDCup track1, SVDFeature can build a model with RMSE 22.1x, we also did an afterwards experiments on KDDCup track2 ( see all the scripts and codes) that error=3.1x. The performance of SVDFeature really depends on the user's intelligence of feature defining, the kddcup track2 is a good example to show performance of different feature compositions.

10) What is the difference of SVDFeature to libfm?

(1) Yes, it's very related to FM, we do respect the work of libFM. We pointed it out in our related work in the tech-report.
(2) The difference in the model: feature-based matrix factorization distinguish different types of feature, it has factorization interaction between user feature and item feature, and linear only effect with the input of global feature. FM is more like a tensor-factorization style model, which enforce factorization interaction between every features. The distinguish of features allows linear only feature such as neighbourhood information, and ensures user temporal feature and user implicit feedback will only have factorization interaction with item features.
(3) Difference in toolkit: SVDFeature supports both rate prediction and feature-based collaborative ranking, it also has a fast algorithm for training with implicit feedback, large-scale data handling(it's very important if you define many features), and other minor differences: e.g detailed options for parameter tuning and various loss functions( hinge loss, logistic loss, square loss ) .
(4) Some examples that may tell these differences: try to implement SVD++ model, SVD with neighbourhood information, or see our non-trival experiment on KDDCup track2(http://apex.sjtu.edu.cn/apex_wiki/kddtrack2)
We call our model Feature-based Matrix Factorization because the idea descends more naturally from matrix factorization using features. We do not try to claim we proposed a freshly new feature-based model, the credits shall goes to FM. But we do try to make a model and a toolkit that's useful and can help doing informative collaborative filtering and ranking, and SVDFeature at least works better in some scenarios:)

Thanks for clarification! So your model is not a special case of FM, but if x_i x_j in eq. (1) in paper [1] was changed to new feature x_ij, then we would have more general model which subsumes both SVDFeature and FM.

Btw do you plan to publicize experiment scripts for track1 model with RMSE 22.1x? Would be great.

That's a very good idea. This is the support for multiple factorization we mentioned in the future work. If we separate linear term and factorization term, and implicit specify all the weight of factorization each terms, that's a more general model. On the other hand, specify user/item feature as group also have advantages in computation, saving the effort for extra dot product. More importantly, specifying user implicit feature as group enables the fast computation algorithm of SVD++(you can reference our tech-report). So in a practical view, we do want to be able to specify features as group as possible(especially for implicit feedback) for algorithms that fits into SVDFeature. But you are right, a more general setting can be desirable when we has a algorithm that doesn't fit into SVDFeature. What we want to work on is to still specify features in group and support more general models. We will also support multiple solvers if one solver isn't perfect for everything. If you have needs in detail, please send them by email;-) About the experiment scripts for track1 model. Yes we have them but they are kind of messy, so we won't publish them directly. We will choose the most effective ones(which may not get 22.1x but should be good, we shall see) and make a ready to run one like the track2 scripts. We are doing it now, I think it should be ready in a week.

About Me

6 years ago, along with my collaborators at Carnegie Mellon University, I have started the GraphLab large scale open source project, which is a framework for implementing machine learning algorithms in parallel and distributed settings. When the project became popular, we have decided to raise money to expand the project and provide an industry grade solution.
Specifically I wrote the award wining collaborative filtering toolkit to GraphLab which is widely deployed today, and helped us win top places at ACM KDD CUP 2011, ACM KDD CUP 2012 among other competitions.
Checkout our website: http://dato.com