Thursday, December 29, 2011

In the first part of this post, I described how to program a multicore parser, where the task is to translate string IDs into consecutive integers that will be used for formatting the input of many machine learning algorithms. The output of part 1 is a map between strings and unsigned ints. The map is built using a single pass over all the dataset.

Now an additional task remains, namely translating the records (in my case phone call records) into a Graph to be used in Graphlab. This is an embarrassingly parallel task - since the map is read only - multiple threads can read it in parallel and translate the record names into graph edges. For example the following records:

etc. etc.
The code is part of Graphlab v2 and can be downloaded from our download page.

In the current post, I will quickly explain how to continue setting up the parser.
The task we have now is to merge multiple phone calls into a single edge. It is also useful to sort the edges by their node id. I have selected to program this part in perl, since as you are going to see in a minute it is going to be a very easy task.

INPUT: A gzipped files with phone call records, where each row has two columns: the caller and the receiver; each of them as an unsigned integer. It is possible the same row will repeat multiple times in the file (in case multiple phone calls between the same pair of people where logged in different times).
OUTPUT: A gzipped output file with sorted unique phone call records. Each unique caller receiver pair will appear only once.

Explanation
1) ForkManager($THREADS_NUM) sets the number of parallel threads - in this case 8.
2) For each file, the file is unzipped using "gunzip -c", and sorted uniquely (-u command line flag). The -T flag is an optional argument in case your temp drive does not have enough space. -k 1,1 sets the sorting key to be the first column and the second -k 2,2 sets column 2 as the key in case of a match in the first column. -S flag sets the buffer size such that the full input file will fit into memory. 4G is 4 Gygabytes of memory.
3) The system() command runs any shell command line, so you can change the parallel loop execution to perform your own task easily.

Overall, we got in a few lines of code a parallel execution environment which would be much harder to setup otherwise.

Sunday, December 25, 2011

When dealing with machine learning, one usually ignores the (usually boring!) task of preparing the data to be used in any of the machine learning algorithms. Most of the algorithms have either linear algebra or statistical foundation and thus the data has to be converted to a numeric form.

In the last couple of weeks I am working on the efficient design of multicore parser, that allows converting raw string data into a format usable by many machine learning algorithms. Specifically, I am using CDR (call data records) from a large European country. However, the dataset has several typical properties, so I believe my experience is useful for other domains.

The first column is an anonymized caller ID, the second column is an anonymized receiver ID, the third column is the date, the fourth is the time, and the last column is the duration of the call.

Now to the data magnitude. If your dataset is small, no need for any fancy parsing, you can write a python/perl/matlab script to convert it to numeric form and avoid reading further... However, this dataset is rather big: every day there are about 300M unique phone calls. So depending on how many days you aggregate together, you can get to quite a large magnitude. For a month there are about 9 billion phone calls logged.

To make the CDR useful, we need to convert the hashed string ID into a number, hopefully a consecutive increasing number. That way we can express the phone call information as a matrix.
Then we can use any of the fundamental machine learning algorithms like: SVM, Lasso, Sparse logistic regression, matrix factorization, etc. etc.

One possible approach for converting strings to integer, is taken in Vowpal Wabbit, where strings are hashed into numeric IDs during the run. However, there is a risk that two different string IDs will be mapped into the same integer. So depending on the application this may be acceptable. I have chosen to take a different approach - which is to simply assign a growing consecutive ID to each string.

I have implemented the code in GraphLab, where GraphLab is not the intuitive tool to be used for this task (although it was convenient to use). In a multicore machine, several GraphLab threads are running in parallel and parsing different portions of the input files concurrently. We have to be careful that node IDs will remain consecutive across the different files. Since stl/boost data structures are typically not thread safe, I had to use a mutex for defending against concurrent insertions to the map. (Concurrent reads from a stl/boost map are perfectly fine).

One should be careful here, since as I verified using gprof profiler, about 95% of the running time is wasted on this critical section of assigning strings to ints.

Initially I used std::map<string,int> but I found it to be rather slow. It seems that std::map is implemented using an underlying tree and so insertions are costly: log(N). I switched to boost::unordered_map which is actually a hash table implementation with O(1) insertions. This gave x2 speedup in runtime.

Second, since each day of input file amount to about 5GB of gzipped file, I used boost gzipped stream for avoiding the intermediate extraction of the input files. Here is an example:

PETSc, pronounced PET-see (the S is silent), is a suite of data structures and routines for the scalable (parallel) solution of scientific applications modeled by partial differential equations. It supports MPI, shared memory pthreads, and NVIDIA GPUs, as well as hybrid MPI-shared memory pthreads or MPI-GPU parallelism.

SLEPc is a software library for the solution of large scale sparse eigenvalue problems on parallel computers. It is an extension of PETSc and can be used for either standard or generalized eigenproblems, with real or complex arithmetic. It can also be used for computing a partial SVD of a large, sparse, rectangular matrix, and to solve quadratic eigenvalue problems.

The large matrix is first partitioned into several small matrices by slicing the original matrix to column groups.

Each smaller matrix is solved in parallel using a cluster of computers, using any desired matrix factorization algorithm.

Now it remains how to aggregate the result back for getting the original factorized solutions of the larger matrix. This can be done by column projection, namely computing the least square solution of the smaller matrix onto the larger one. (See details in the paper!) and averaging the different solutions together.

It seems this construction has a pretty descent speedup. I am considering implementing this method in GraphLab as well. Ping me if you are interested in using this algo.

The observation at the basis of this construction is that one of the heavy computational burdens in K-means is computing distances from each data point to all K-cluster heads. A very simple trick is proposed: a random vector is selected and the data point as well of all cluster locations are projected into it. Now we got a list of scalars for each cluster and it remains to find out which is closest to the projected data point. This significantly saves work since instead of computing D-dimensional distances we compute only 1-dimensional distances. It seems that in practice accuracy of the method is not significantly worse.

Monday, December 12, 2011

Here is an interesting blog post from John Langford (Yahoo! Research NY) about some of the pros and cons of MPI vs. Hadoop. Different approaches to ML are discussed and summarized by John as follows:

The general area of parallel learning has grown significantly, as indicated by the Big Learning workshop at NIPS, and there are a number of very different approaches people are taking. From what I understand of all other approaches, this approach is a significant step up within it’s scope of applicability. Let’s define that scope as learning (= tuning large numbers of parameters to be simultaneously optimal on test data) from a large dataset on a cluster or datacenter. At the borders:

For counting based learning algorithms such as the NLP folks sometimes use, a MapReduce approach appears superior as MapReduce is straightforwardly excellent for counting.

For broadly distributed datasets (not all in one cluster), asynchronous approaches become unavoidably necessary. That’s scary in practice, because you lose the ability to debug. The model needs to fit into memory. If that’s not the case, then other approaches are required.

Anyone who reads this blog and is attending the NIPS big ML workshop is encouraged to contact me since I plan to be there! Note that the asynchronous approach reference (bullet 3) actually refers to our GraphLab project. I liked the phrasing: it's scary in practice.. But we do support also BSP execution so it is always possible to debug that way.

Wednesday, December 7, 2011

GraphLab v2 new features are going to be presented in Intel's ISTC-CC retreat. The new features are the result of the hard work for Joey Gonzalez, who spent the summer as part of our collaboration with Yahoo! Research, in Alex Smola's lab.

The main challenge identified by Joey with GraphLab v1, is the handling of natural (power-law) graphs.

The problem with iterative graph algorithms of such graphs, is that when computing the update of a high degree node one has to touch large portion of the graph, and thus is is hard to distribute computation. Furthermore, computation of a high degree node is done sequentially and thus we loose some of the parallelism. Last, whenever an algorithm use locking of neighbors state (to avoid concurrent writes to their data structures) large number of locks has to be acquired which significantly degrades performance.

The first solution is factorized update functions (shown above). The work on a high degree nodes is divided into smaller pieces of work which can be done in parallel.

The second solution is associate/commutative update functions, a.k.a delta updates (shown below). Whenever sums over a large number of nodes have to be computed, when the state of most of them is constant, there is no much point in summing up the result again and again. In this case, only the change in the result is tracked.

Friday, December 2, 2011

As part of our new collaboration with LensKit, where Graphlab collaborative filtering library will be used as one of LensKit engines, here is an overview of this interesting project from Michael Ekstrand, a PhD student at GroupLens research, Univ. of Minnesota.

- What is the goal of the LensKit project?

The goal of LensKit is to provide a flexible, extensible platform for researching, studying, and deploying recommender systems. We are primarily concerned with providing a framework for building and integrating recommenders, high-quality implementations of canonical algorithms which perform well on research-scale data sets, and tools for running reproducible offline evaluations of algorithms.

- What other relevant projects exist in the GroupLens lab?

We have been building recommenders internally for quite some time, starting with the GroupLens recommender system, followed by MovieLens and later recommendation efforts (such as various research paper recommenders which went under the name TechLens, and our new book recommender service BookLens). Several years ago, the MultiLens project made some of our recommender code available for use outside the lab.
LensKit is a brand-new, from-scratch implementation of core recommender algorithms that we will be using internally going forward. BookLens is currently built on top of it, and we plan to move MovieLens from its current internal recommender code, related to the MultiLens code, to LensKit sometime in the coming months. A number of of LensKit's design decisions have been driven by the needs of the BookLens project, as we have been making it suitable for both offline runs and integration into web applications. Future web-based recommender systems, both within GroupLens and externally, will be able to pick up LensKit and integrate it very easily with the complex needs of web server environments.

- Who is working on the project and for how long?

I started the project about 2 years ago, working on it off-and-on. Development picked up substantially in late 2011-early 2011, and we had our first public release early this year. Michael Ludwig has been involved with the project for most of that time, helping particularly with design and requirements work as he integrates it with the BookLens code, and also contributing code directly as well. Jack Kolb is an undergraduate who has been working with me since late spring, and we have had other students helping from time to time as well.

- What is the status of development?

Right now we are in late beta, with stable APIs for common recommender tasks, a robust infrastructure, and good implementations of several classic algorithms. We are currently working on documentation and a refactoring of our recommender configuration infrastructure; once that is completed and tested, we should be ready to declare a stable 1.0 release to build on going forward. It is pretty safe to build against LensKit at this point, though; the main interfaces are stable, and the APIs for configuration shouldn't change too much. Some code might need to be updated for 1.0, but it should be limited to the code to configure the recommender.

- Which open source license are you suing?

LensKit is licensed under the GNU GPL version 2 or later with a link exception to allow linking between it and modules under other licenses (whether proprietary or GPL-incompatible open source). LensKit can be used and deployed internally without restriction; projects distributing it externally must make its source code available to their users. The link exception is the same as that used by the GNU Classpath project.

Many of the libraries we depend on are licensed under the Apache license (APLv2).

- What is the level of parallelism you allow (is there support for parallel execution?)

Currently, none of the algorithms are parallelized. The evaluator is capable of training and evaluating multiple algorithms in parallel, even sharing some of the major data structures like the rating snapshot between runs, to achieve good throughput on comparative evaluations. Parallelizing some of them is on our radar, though.

- Do you have some performance numbers on Netflix/KDD data or similar datasets?

We provide accuracy data from the MovieLens data sets and Yahoo! Music data sets in our RecSys 2011 paper (http://grouplens.org/node/479). Efficiency numbers are a bit rough, since we've been running parallel builds, but we can train an item-item model on the MovieLens 10M data set in about 15 minutes, and on one shard of the Y! Music data set (from WebScope, each shard has ~80M ratings) in about 20-26 hours. FunkSVD takes 30-50 minutes, depending on the model size and parameters, on ML10M and 14 hours on Y!M.

- Regarding the potential collaboration with GraphLab. What may be the benefits of this collaboration?

We are looking to integrate GraphLab to allow LensKit users to have easy access to high-quality implementations of a wider variety of matrix factorization methods in particular, and to leverage your existing work rather than re-building everything ourselves. It will also make it easy to integrate GraphLab-based algorithms with interesting recommender environments, such as web applications or comparative evaluation setups.

Overall, here at the GraphLab project we are very excited about this collaboration. We believe that LensKit has a few properties that are currently missing in GraphLab: namely output processing for finding the best recommendation as well as improved user interaction and better UI.

In my NIPS 2010 paper, I (and my advisor Carlos Guestrin) where the first to show how to compute inference in linear models involving heavy tailed stable distributions. This was the first closed form solution to this problem. The trick was to use duality and compute inference in the Fourier (characteristic function) domain. My work is heavily based on Mao's work, since he was the first to formulate this duality in a general linear model. Here are his related papers:
Y. Mao and F. R. Kschischang. On factor graphs and the Fourier transform. In IEEE Trans. Inform. Theory, volume 51, pages 1635–1649, August 2005.
Y. Mao, F. R. Kschischang, and B. J. Frey. Convolutional factor graphs as probabilistic models. In UAI ’04: Proceedings of the 20th conference on Uncertainty in artificial intelligence, pages 374–381, Arlington, Virginia, United States, 2004. AUAI Press.

It seems that Mao was quite frustrated that his work seemed merely theoretical at that time, while it was hard for him to find an application to his construction. So he was delighted to see that I have actually used his construction towards a very useful application. That is why he titles my work "a missed opportunity" because he is a bit upset that he did not think about it at the time... But anyway this is science - a small progress at each step..

I attended today a very interesting talk give by Prof. William Cohen from CMU (Given at HUJI). The talk describes a very simple spectral clustering method called power iteration clustering. A paper at the same name by Frank Lio and William Cohen appeared in last year's ICML 2010.

What I like about the method, is its simplicity and application to really large datasets. While some of the HUJI audience regarded the presentation quite skeptically since no theoretical results where presented to justify the great empirical performance, it seems that empirically the power iteration clustering method works just like spectral clustering. I guess that future theoretical results will soon to follow. And let me tell you a secret: some of the most successful methods deployed today in practice are simple methods like linear and (sparse) logistic regression, matrix factorization, K-means clustering etc.

In traditional spectral clustering, a square matrix describing the problem is constructed. (Typically the matrix is normalized s.t. the sum of each row equals to one). Then the K top eigenvectors are computed. For example, if this is a social network then this matrix is the adjacency matrix. When the data is big, we can view this process of eigen decomposition as a dimensionality reduction step from sparse data in high dimension to a lower representation which is typically dense. The second step in spectral clustering is application of some clustering method like K-means to the eigenvectors. It was shown that typically data points which are close to each other has similar values in their eigenvectors.
That is why it makes sense to cluster those points together.

Note: I am cheating a bit by not providing all the details. A great tutorial for spectral clustering is found here: http://arxiv.org/abs/0711.0189.

In contrast, the proposed power iteration clustering (PIC) computes power iteration using the data matrix. Since the sum of each row is equal to one, it can be thought of a weighted average calculation, where each node computes the average of its neighbors. Of course this process will lead to the same global average in all graph nodes. The main observation in the PIC algorithm, is that the process should be stopped early before every node in the graph has an equivalent sum. When stopped at the right moment, each node store a cluster distinctive value. Intuitively, this is because nodes which are highly connected in a cluster push their intermediate results to be close to each other. Then the second step of K-means clustering is computed. This step is easier (relative to traditional spectral clustering), since we cluster just scalars in a single vector and not K dimensional vectors.

In terms of work, instead of computing K largest eigenvetors in the traditional spectral clsutering methods, PIC computes only a single eigenvector (which is actually a linear combination of several eigenvectors). So it is K times faster then traditional spectral clustering.

Here is the algorithm:

The image above shows a demonstration of the power iteration method (taken from their ICML 2010 paper). (a) Shows a problem that K-means clustering has a difficulty operating on. (b) - (d) evolution of the power iteration method. Y-axis is the output scalar value of the output vector. The x-axis represent the different points. The colors are the output of the K-means clustering of scalars. While (b)-(c) are still not accurate, eventually the values of the vector converge very nicely to three clusters.
In principle, if we would have continued to step (e) where all nodes have converged, we will get the same value in all nodes that would not have been useful.

Anyway, since the conjugate gradient, Lanczos and SVD methods implemented
in GraphLab (As well as K-meams) it is going to be very easy to implement this algorithm in GraphLab. I am planning to do it soon.

About Me

6 years ago, along with my collaborators at Carnegie Mellon University, I have started the GraphLab large scale open source project, which is a framework for implementing machine learning algorithms in parallel and distributed settings. When the project became popular, we have decided to raise money to expand the project and provide an industry grade solution.
Specifically I wrote the award wining collaborative filtering toolkit to GraphLab which is widely deployed today, and helped us win top places at ACM KDD CUP 2011, ACM KDD CUP 2012 among other competitions.
Checkout our website: http://dato.com