Thursday, September 29, 2011

A while ago I met Eric Bieschke and Tao Ye at a GeekSessions event in SF.
I was really impressed by Eric's talk presenting Pandora Internet Radio, and I am sure everyone will agree with me that it is one of the coolest companies, with great large-scale machine learning
applications. Here is a quick interview I held with Tao:

Q: Can you give a short description of Pandora, to those few who don't know about this company?
A: Pandora is the leader in internet radio in the United States, offering a personalized experience for each of our listeners. We have pioneered a new form of radio that uses intrinsic qualities of music to initially create stations and then adapts playlists in real-time based on the individual feedback of each listener.

The Music Genome Project and our playlist generating algorithms form the technology foundation that enables us to deliver personalized radio to our listeners. These proprietary technologies power our ability to predict listener music preferences and play music content suited to the tastes of
each individual listener. The extensive musicological database of the Music Genome Project has been meticulously built by a team of professional musicians and musicologists analyzing up to 480 attributes, or genes, for every song in our vast collection, to capture the fundamental musical
properties of each recording. When a listener enters a single song, artist, composer or genre to start a station (a process we call seeding), our complex mathematical algorithms combine the genes cataloged by the Music Genome Project with individual and collective feedback to suggest songs and build personalized playlists.

Q: What is the magnitude of datasets you are working on?
A: As of July 2011, we had over 100 million registered users, and more
than 37 million monthly active users. Since the launch of Pandora in 2005, our listeners
have created 1.9 billion stations and have given more than 11 billion thumbs.
Containing over 900,000 songs from over 90,000 artists, we believe the
Music Genome Project is the most comprehensive analysis of music in the
world.

Q: Are there unique properties of your data relative to other datasets,
like the Yahoo! KDD Cup dataset?
A: Compared to KDD Cup 2011, our feedback dataset contains binary data only
(thumbs up or thumbs down) instead of numeric ratings. In addition, all feedback is
in context -- tied to a music/comedy seed. Since users can start stations
from a song, an artist or a genre, there are close to 1 million possible
"contexts" for recommendations to live in. This has both computational
implications (scale makes running complex algorithms harder) and
recommendation implications (in some cases it makes the problem easier).

Our genome data has not only track/album/artist/genre metadata, but also
'gene' analysis of each track done by human music and comedy analysts. There are up to
450 gene values per track, capturing a track's musical (or comedic)
attributes, from melody, harmony and instrumentation to rhythm, vocals and
lyrics.

Q: How does your current recommendation engine work? (In general terms --
you probably do not want to reveal all the secret recipes here.)
A: We combine crowd feedback data and genome analysis data to provide
recommendations within the context of the station seed for each user. Our algorithm
recommends songs based on metrics such as thumbs ratio, genome nearest
neighbor and song novelty. It also customizes stations in real
time per user, based on instant user feedback.
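Pandora's actual playlist algorithms are proprietary, so purely as an illustration of the "genome nearest neighbor" idea mentioned above, here is a minimal sketch: each track is represented as a vector of gene values, and candidate tracks are ranked by their distance to the seed. The Euclidean distance and all names here are my assumptions, not Pandora's.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// A track is just a vector of gene values (up to a few hundred per track).
typedef std::vector<double> Genome;

// Euclidean distance between two gene vectors (assumed metric for illustration;
// Pandora's actual similarity measure is proprietary).
double genome_distance(const Genome& a, const Genome& b) {
  double sum = 0.0;
  for (std::size_t i = 0; i < a.size(); ++i) {
    double d = a[i] - b[i];
    sum += d * d;
  }
  return std::sqrt(sum);
}

// Index of the candidate track whose genome is closest to the seed's.
std::size_t nearest_neighbor(const Genome& seed,
                             const std::vector<Genome>& candidates) {
  std::size_t best = 0;
  double best_dist = std::numeric_limits<double>::max();
  for (std::size_t i = 0; i < candidates.size(); ++i) {
    double d = genome_distance(seed, candidates[i]);
    if (d < best_dist) { best_dist = d; best = i; }
  }
  return best;
}
```

In a real system this lookup would be combined with the other signals mentioned above (thumbs ratio, novelty) rather than used alone.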

Q: What are some future challenges you would like to solve? Specifically,
are you looking at online/real-time recommendations?
A: We're constantly improving the playlist algorithm. Many challenges lie
ahead.
* Pandora already provides online/real-time personalized playlists. We
compute the building blocks that assist in making those choices offline, but
every song on Pandora was chosen specifically for that listener at that
moment. It is still a challenge to build a more refined set of real-time
metrics and infer listener preference, especially with limited user input
(many listeners don't thumb at all!).
* Past competitions emphasize optimizing prediction accuracy; at Pandora,
however, we value music variety greatly, so understanding the
tradeoff between prediction accuracy and music variety/diversity and
striking the right balance is very important.
* We work on context-relevant recommendations, from creating the best 4th
of July stations to ensuring that new artist/song stations are good. These are
our cold-start problems.
* Better combination of different recommendation algorithms, including
content-based, expert-based and various crowd-based recommendations.

About Tao and Eric:
Tao Ye is a member of the Pandora playlist engineering team, currently
working on Pandora's playlist measurement and genome optimization. Most
recently, she spent 5 years as a research scientist in Sprint's IP and
wireless networking group, working on network monitoring and measurement
of a large-scale IP backbone. Prior to joining Sprint, she held lead
engineer and engineer roles working on Java systems at Consilient and Sun
Microsystems. She received a Master's degree in Computer Science from UC Berkeley
and dual Bachelor's degrees from the State University of
New York at Stony Brook in Computer Science and Engineering Chemistry. She is expecting her Ph.D.
degree from the University of Melbourne in December 2011.

Eric Bieschke runs playlist engineering for Pandora. As Pandora's second
employee he built the initial prototypes for Pandora's playlist algorithms and
with his team has grown them to serve more than 100M users who've
thumbed 10 billion songs while listening to billions of hours of music. He
is currently working on optimally combining content based recommendations,
collective intelligence, and human machine cooperation in order to provide
the best experience for listeners.

Tuesday, September 27, 2011

A few days ago I met with Prof. Scott Kirkpatrick, who asked me for some help implementing the K-cores decomposition algorithm in Graphlab.

K-core decomposition is a very simple algorithm, originating in statistical mechanics, for investigating graph properties. I found a nice paper (k-core decomposition: a tool for the visualization of large scale networks, by
José Ignacio Alvarez-Hamelin, Luca Dall'Asta, Alain Barrat, Alessandro Vespignani, NIPS 18, 2006) describing the algorithm. The algorithm proceeds as follows:
In the first round, all nodes with one edge or fewer are removed from the graph, and all their edges are deleted. In the second round, all nodes with two edges or fewer are removed from the graph and their edges deleted. Similarly, at the i-th iteration all nodes with i or fewer neighbors are removed from the graph. Note that removal is done recursively within an iteration: if deleting a node (with its links) exposes a new node that now falls below the current threshold, that node is removed in the same iteration as well.
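A minimal sketch of this peeling procedure (my own illustration in plain C++, not the GraphLab implementation): repeatedly remove the node with the smallest remaining degree, and record for each node the highest threshold seen when it was peeled, which is its core number.

```cpp
#include <cstddef>
#include <vector>

// Core number of every node, computed by repeatedly peeling the node with the
// smallest remaining degree. O(V^2) for clarity; graph is an adjacency list.
std::vector<int> k_cores(const std::vector<std::vector<int> >& graph) {
  std::size_t n = graph.size();
  std::vector<int> degree(n), core(n, 0);
  std::vector<bool> removed(n, false);
  for (std::size_t v = 0; v < n; ++v) degree[v] = (int)graph[v].size();
  int current_core = 0;
  for (std::size_t iter = 0; iter < n; ++iter) {
    // Pick the unremoved node with the smallest remaining degree.
    int v = -1;
    for (std::size_t u = 0; u < n; ++u)
      if (!removed[u] && (v < 0 || degree[u] < degree[v])) v = (int)u;
    if (degree[v] > current_core) current_core = degree[v];
    core[v] = current_core;     // node v is peeled at threshold current_core
    removed[v] = true;
    for (std::size_t j = 0; j < graph[v].size(); ++j)  // delete v's edges
      if (!removed[graph[v][j]]) --degree[graph[v][j]];
  }
  return core;
}
```

For example, on a triangle with one pendant node attached, the pendant is peeled first with core number 1, and the three triangle nodes all end up with core number 2.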

The image above, taken from the NIPS paper, shows 3 iterations of the K-core algorithm. In the first iteration the blue nodes
are removed. In the second iteration the yellow nodes are removed. Finally we are left with the red nodes, which are the core of this sample network.

The output of the K-cores algorithm is a single number for each node: its core assignment. The algorithm has been used by Prof. Kirkpatrick for investigating Internet topology, as reported here.

Below you can find an image from the above NIPS paper which depicts application of the algorithm to the French Internet domain:

Anyway, the K-core algorithm is now supported as part of the GraphLab clustering library. You are welcome to try it out on your own network!

Wednesday, September 21, 2011

I am one of the readers of your weblog. I have a question about one of your posts about the comparison of two linear algebra libraries, "it++ vs eigen"; I guess you are the expert person who can answer my question.
I have an algorithm that involves matrix-matrix and matrix-vector multiplications iteratively and involves all kinds of dense and sparse matrices and vectors. I have implemented my algorithm using gmm with the atlas flag active, but it seems that it is still slower than MATLAB. More specifically, it seems that gmm uses one thread compared to MATLAB, which uses multiple threads when it is compiled with MCC.
I was wondering if any of the libraries you introduced in your post (it++, eigen) are capable of multi-threading, and how they compare with the MATLAB linear algebra engine.

Regards,
Kayhan

It is always nice to get feedback from my readers! Especially the ones who call
me an expert (although next time without the "I guess", please!! :-)
There is definitely room for improving blas/lapack performance. I would need to dig into the details of the library you are using.

I'm guessing that the current configuration produces too many threads,
or puts those threads in the wrong places. See for example the section
'Choosing the number of threads with MKL' on
http://www.psc.edu/general/software/packages/mkl/ . It might also be
worth linking against the non-threaded version of MKL, which I think
would involve doing:

From my experience, there is a huge difference in performance between different lapack configurations on the same machine. For example, on the BlackLight supercomputer
I got the following timing results for alternating least squares on Netflix data.
Here is a graph comparing different implementations. I used 16 BlackLight cores. Alternating least squares was run for 10 iterations to factorize a matrix of 100,000,000 nnz. The width of the factor matrices was set to D=30.
As you can see, the wrong configuration resulted in 24x longer running time! (In this graph, lower is better!) Overall, if you are using an Intel platform I highly recommend using MKL.

Why don't you try out GraphLab? It is designed for iterative algorithms on sparse data, and it makes it much easier to efficiently deploy multiple cores.

The Giraph machine learning project is a relatively new large-scale machine learning project at incubation stage under Apache. It is the only open source implementation of Google's Pregel (BSP = Bulk Synchronous Parallel) framework that I am aware of.

I got the following instructions, from my colleague and friend Aapo Kyrola:

Tuesday, September 20, 2011

I thought I would write about some of the exciting applied machine learning projects that are currently taking place. I start with Nickolas (Nick) Vasiloglou, PhD, a former graduate student of Alex Gray at Georgia Tech. Nick is currently an independent large scale machine learning consultant.

Here are some of the projects Nick is involved in, in his own words:

LexisNexis-ML is a machine learning toolbox combining the HPCC-LexisNexis high-performance computing cluster and the PaperBoat/GraphLab library. HPCC is by far a superior alternative to Hadoop. The system uses ECL, a declarative language that allows easier expression of data problems (see http://hpccsystems.com/ ). Inlining of C++ code can make it even more powerful when blending sequential numerical algorithms with data manipulation is necessary. HPCC's heart is a C++ code generator, which has the advantage of generating highly optimized binaries that outperform Java Hadoop binaries.

PaperBoat is a single-threaded machine learning library built on top of C++ Boost MPL (template metaprogramming). The library is built with several templated abstractions so that it can be integrated easily with other platforms. The integration can be either light or very deep. The library makes extensive use of multidimensional trees for improving scalability and speed. Here is the current list of implemented algorithms. All of them support both sparse and dense data:

Mouragio is an asynchronous version of PaperBoat in which single-threaded machine learning algorithms can exchange data asynchronously. Mouragio implements very efficiently a publish-subscribe model that is ideal for asynchronous bootstrapping (bagging) as well as for the racing algorithm (Moore & Maron 1997). Asynchronous iteration is an old idea from the MIT optimization lab (Bertsekas and Tsitsiklis; see link).
Mouragio tries to utilize algorithms from the graph literature to automatically partition data and tasks so that the user doesn't have to deal with it. The Mouragio daemon tries to schedule tasks to the node where most of the required data and computational power are available. Mouragio is partially supported by LogicBlox.

DataLog-LogicBlox scientific engine. LogicBlox has developed a database platform based on logic. The language used is an enhanced version of Datalog. Datalog is by far the most expressive and declarative language for manipulating data. At this point Datalog translates logic into run-time database engine transactions. The goal of this project is to translate Datalog to other scientific platforms such as GRAPHLAB and MOURAGIO. Datalog is very good at expressing graphs, so it translates to GRAPHLAB very easily. Also, since the algorithms are described as sequence-independent rules, automatic parallelization is easier to do (although not always 100%).

Sunday, September 18, 2011

it++ and Eigen are both popular and powerful matrix linear algebra packages for C++.

We got a lot of complaints from our users about the relative difficulty of installing it++, as well as its limited GPL license. We have decided to try and switch to the Eigen linear algebra library instead. Eigen requires no installation, since the code is composed of header files. It is licensed under the LGPL3+ license.

Today I have created a pluggable interface that allows swapping it++ and Eigen underneath our GraphLab code. I have run some tests to verify speed and accuracy of Eigen vs. it++.

And here are the results:

Framework and Algorithm | Running time (sec) | Training RMSE | Validation RMSE
it++ ls_solve_chol      | 16.8               | 0.7000        | 0.9704
it++ ls_solve           | 17.8               | 0.7000        | 0.9704
Eigen ldlt              | 18.3               | 0.6745        | 0.9495
Eigen llt               | 18.7               | 0.6745        | 0.9495
Eigen JacobiSVD         | 63.0               | 0.6745        | 0.9495

Experiment details: I have used GraphLab's alternating least squares, with a subset of the Netflix data. The dataset is described here. I let the algorithm run for 10 iterations, in release mode, on our AMD Opteron 8-core machine.

Experiment conclusions: It seems that Eigen is more accurate than it++. It runs slightly slower than it++, but accuracy in terms of both training and validation RMSE is better.

For those of you who are familiar with it++ and would like to try out Eigen, I made a short
list of compatible function calls for both systems.

The problem arises when you have an array, and you want multiple cores to add values to the same array position concurrently. This, of course, may result in undefined behavior if the needed precautions are not taken.

A nice way to solve this problem is the following. Define the following
scary assembler procedure:

And here is some more detailed explanation from Guy Blelloch: The CAS instruction is one of the machine instructions on x86 processors (the first function is just calling the instruction, whose name is cmpxchgq). Probably the best book that describes its various applications is Herlihy and Shavit's book titled "The Art of Multiprocessor Programming". The general use is for implementing atomic read-modify-write operations. The idea is to read the value, make some modification to it (e.g. increment it) and then write it back if the value has not changed in the meantime. The CAS(ptr, a, b)
function conditionally writes a value b into ptr if the current value equals a.
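To make the read-modify-write idea concrete, here is a sketch of my own (assuming a GCC-compatible compiler, whose __sync_bool_compare_and_swap builtin compiles down to the cmpxchg instruction on x86) of an atomic add to a shared array cell built from a CAS retry loop:

```cpp
// Atomically add delta to *cell using a compare-and-swap retry loop.
// Relies on the GCC/Clang __sync builtin described above.
inline void atomic_add(volatile long* cell, long delta) {
  long old_val;
  do {
    old_val = *cell;  // read the current value
    // write old_val + delta back only if *cell is still old_val,
    // i.e. no other thread modified it in the meantime; otherwise retry
  } while (!__sync_bool_compare_and_swap(cell, old_val, old_val + delta));
}
```

Under concurrency, each caller either succeeds or observes that another thread changed the cell first and retries, so no update is lost, unlike a plain `cell += delta`.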

Friday, September 9, 2011

Recently I have been working on implementing a clustering library on top of GraphLab.
Currently we have K-means, Fuzzy K-means and LDA (Latent Dirichlet Allocation) implemented. I took some time for comparing performance of GraphLab vs. Mahout on an Amazon EC2 machine.

Here is a graph which compares performance:

Some explanation about the experiment. I took a subset of the Netflix data with 3,298,163 movie ratings, 95,526 users, and 3,561 movies. The goal is to cluster users with similar movie preferences together. Both GraphLab and Mahout run on an Amazon m2.xlarge instance.
This machine has 2 cores. I have used the following settings: 250 clusters, 50 clusters and 20 clusters. The algorithm runs a single iteration and then dumps the output into a text file.
For Mahout, I used Mahout's K-means implementation. GraphLab was run using a single node, while Mahout was run using either one or two nodes. Mahout used 7 mappers and GraphLab 7 threads.

Overall, GraphLab runs between 15x and 40x faster on this dataset.

A second experiment I did is to compare Mahout's LDA performance to GraphLab's LDA.
Here is the Graph:
For this experiment, I used an m1.xlarge instance. I tested GraphLab on 4 cores, Mahout on 4 cores and Mahout on 8 cores (2 nodes). I used the same Netflix data subset, this time with 10 clusters. The graph depicts the running time of a single iteration.

Finally, here are performance results of GraphLab LDA with 1, 2, 3 and 4 cores (on m1.xlarge EC2 instance):
Running time in this case is for 5 iterations.

Thursday, September 8, 2011

One of the greatest things about writing a blog is getting interesting feedback from the readers. (I mean, if no one reads it - why bother??) Here is an email I got this morning from Steve Lianoglou, a graduate student in the Computational Systems Biology department, Memorial Sloan-Kettering Cancer Center, Weill Medical College of Cornell University.

In time, I hope to get it "easily installable" and push it out to
CRAN, but in the meantime I thought it would be of interest to the
people reading this list in its current form, and to the shotgun
authors (who I'm hoping are also reading this list), even if they
don't use R :-)

Thanks a lot Steve! We really appreciate your efforts. The shotgun code has been significantly improved over the last two weeks. We are looking for more users to beta test it on real data. Write me if you are trying our code!

I am very pleased to announce that the GraphLab large scale machine learning project is now supported by Amazon Elastic Compute Cloud (EC2), which allocated us computing time on their cloud. This will allow us to extend compatibility with EC2 and further scale to larger models.

I want to take this opportunity to thank James Hamilton, VP and Distinguished Engineer at Amazon, who pulled some strings and introduced us to Kurt Messersmith, Senior Manager in Amazon Web Services, who was kind enough to approve our grant request.

Showers continued throughout the week in the Bahia cocoa zone, alleviating the drought since early January and improving prospects for the coming temporao, although normal humidity levels have not been restored, Comissaria Smith said in its weekly review. The dry period means the temporao will be late this year. Arrivals for the week ended February 22 were 155,221 bags of 60 kilos making a cumulative total for the season of 5.93 mln against 5.81 at the same stage last year. Again it seems th....

The goal of the method is to cluster similar news items together. This is done by first counting word occurrences using the TF-IDF scheme. Each news item becomes a sparse row in a matrix. Next, the rows are clustered together using the k-means algorithm.
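As a sketch of the TF-IDF step (the standard tf * log(N/df) weighting, written by me for illustration; not Mahout's exact code, and all names are mine):

```cpp
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// TF-IDF weighting of a small corpus: each document becomes a sparse map
// from word to weight tf * log(N / df). Sparse rows like these are what
// the k-means step then clusters.
typedef std::map<std::string, double> SparseRow;

std::vector<SparseRow> tf_idf(const std::vector<std::vector<std::string> >& docs) {
  std::size_t n = docs.size();
  std::map<std::string, int> df;  // document frequency of each word
  for (std::size_t d = 0; d < n; ++d) {
    std::map<std::string, int> seen;
    for (std::size_t i = 0; i < docs[d].size(); ++i) seen[docs[d][i]] = 1;
    for (std::map<std::string, int>::iterator it = seen.begin();
         it != seen.end(); ++it)
      df[it->first] += 1;
  }
  std::vector<SparseRow> rows(n);
  for (std::size_t d = 0; d < n; ++d) {
    std::map<std::string, int> tf;  // term counts within this document
    for (std::size_t i = 0; i < docs[d].size(); ++i) tf[docs[d][i]] += 1;
    for (std::map<std::string, int>::iterator it = tf.begin();
         it != tf.end(); ++it)
      rows[d][it->first] = it->second * std::log((double)n / df[it->first]);
  }
  return rows;
}
```

Note that a word appearing in every document (like boilerplate phrases in news wires) gets idf = log(1) = 0, so it drops out of the clustering entirely.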

Answer: The cluster path and input path point to the same folder. When the run starts, all files in the cluster path are deleted, so the input file is deleted as well. Change the paths to point to different folders!

Problem: The program clusterdump runs, but produces an empty txt file as output.
Solution: You probably gave the intermediate cluster path of k-means instead of the output path dir. In this case, the program runs and terminates without an error.

About Me

6 years ago, along with my collaborators at Carnegie Mellon University, I started the GraphLab large scale open source project, which is a framework for implementing machine learning algorithms in parallel and distributed settings. When the project became popular, we decided to raise money to expand the project and provide an industry grade solution.
Specifically, I wrote the award-winning collaborative filtering toolkit for GraphLab, which is widely deployed today and helped us win top places at ACM KDD CUP 2011 and ACM KDD CUP 2012, among other competitions.
Check out our website: http://dato.com