Saturday, October 22, 2011

It seems I struck a nerve, since I am getting a lot of comments about this blog post.

First, here is an email I got from Bryan, a Research Programmer at CMU: "The machine in question, named "thrust", features two Xeon X5690 CPUs. Each one has 12 hyperthreaded cores, a clock speed of 3.47GHz, 64 KB of L1 cache per core (32 KB L1 Data + 32 KB L1 Instruction), 256 KB of L2 cache per core, and 12MB of L3 cache. It's got 192GB of RAM. We're running Ubuntu 10.04 LTS with the Ubuntu-supplied underlying libraries (boost, ITPP, etc.), some of which are a bit older than the ones supplied with the newer version of Ubuntu mentioned on your blog.

No problem on getting you access to this machine. The only difficulty would be in coordinating who gets to use it when. We have other somewhat less well-endowed machines of identical hardware platform and software setup that see more idle time where it'd be easier to fit in runs that take only a few tens of GB of RAM and, say, half a dozen cores.

To elaborate on what I observed during Abhay's run, I present the following strange wall of numbers, excerpted from a home-grown load monitoring script:

Each row represents some statistic averaged over the last five seconds. The first two columns are time stamps. The interesting bit is the next grouping of five columns of poorly-spaced numbers. The first number is the average number of cores doing something in the kernel. We see here that the process has spurts where it will spend ~10 cores in the kernel for ~five seconds, and then that number will back off for 15-30 seconds. The next column counts idle time, and we see a spike in idle time after each spike of kernel time, and that this spike is of similar or perhaps greater magnitude. The 3rd column is all zeros. The 4th column is the # of cores spent in userland, and we see again a pulsation between near-full load and spurts of 5-10 seconds where it shifts first into the kernel and then goes idle. The 5-second interval apparently is too coarse to catch all of the kernel/idle spikes. The 5th column is the # of cores blocked on I/O, so we can see that there is no measurable waiting for disk.

The other interesting part is the rightmost column. This is average MB/s read and written to disk. What we see here is that the spikes in kernel time correlate with bursts of writes to disk that amount to 14MB/s within that 5-second interval (i.e. a 70MB write -- sounds like one of these swap files).
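The monitoring script itself was not included in the email, but a minimal sketch of this kind of per-state core counter on Linux might look like the following (a hypothetical reconstruction, assuming the standard /proc/stat format and a 100 Hz jiffy clock -- none of these names come from the actual script):

```python
# Hypothetical sketch of a load monitor like the one described above:
# sample the aggregate "cpu" line of /proc/stat twice and convert the
# jiffy deltas into "average number of cores" spent in user, kernel,
# idle and iowait states over the interval.
import time

def read_cpu():
    with open("/proc/stat") as f:
        parts = f.readline().split()  # aggregate "cpu" line
    user, nice, system, idle, iowait = (int(x) for x in parts[1:6])
    return user + nice, system, idle, iowait

def sample(interval=5.0, hz=100):
    u0, s0, i0, w0 = read_cpu()
    time.sleep(interval)
    u1, s1, i1, w1 = read_cpu()
    scale = 1.0 / (interval * hz)  # jiffies -> average cores (assumes HZ=100)
    return {"user":   (u1 - u0) * scale,
            "kernel": (s1 - s0) * scale,
            "idle":   (i1 - i0) * scale,
            "iowait": (w1 - w0) * scale}

print(sample(interval=1.0))
```

A real monitor would loop on `sample()` and append a row of these numbers every five seconds, which matches the column layout Bryan describes.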

Now, I should mention that the disk here is a Linux software RAID6 array, and that the RAID calculations show up as kernel time. An obvious culprit, but seeing even one whole core worth of CPU time spent on RAID calculations is a bit rare, and I don't think I've ever seen more than two. We could settle this by running top and looking at the relevant system processes. But I would have to guess that something about the way in which this data is being written out, or perhaps some memory-related operations, is causing the kernel so much fuss. It may be that your code is not really "at fault" per se, but, as a programmer, I look at all that idle time and wonder if there isn't a way to put it to use."

And here is my reply: "Very interesting question and observations! Currently, my code is implemented as follows. On each iteration, the matrix multiplications by A' and A are done in parallel. Then all threads are collected, and the multiplication by A and A' is done in parallel. Then some accounting is done for computing the eigenvector of this iteration, followed by writing it to disk.

I guess that what you see in your monitoring are the idle periods, where a single thread is active doing the accounting and writing the intermediate eigenvectors to disk.

I am sure that further optimizations to the code are possible to utilize the cores better and potentially distribute some of the accounting. Having access to your machine will help me optimize further. I will look at this some more on my machines, but it will be most effective to look at it on yours. We can start with smaller problems in order not to overload your machine."

Bryan and Abhay were kind enough to grant me access to their thrust machine. So now I am going to try it out myself!

I also got the following email: "I hope you are doing well. I wanted to know about the GraphLab SVD solver that you have mentioned in your blog. I would like to try it out on a matrix of about 5 million rows and 400 columns, and would like to compare the performance gains over Mahout. It would be great if you could guide me to useful resources and the input formatting code you have mentioned on your blog. I also want to read the best documentation available for the Java API to GraphLab and would really appreciate it if you could direct me to it.

Best, Ehtsham Elahi

This is very cool! Before the ink has dried on my blog post (or whatever..), users actually want to try out my code. If you want to try the code, these are the steps you need to follow:

1) Download and install GraphLab on your system, using a checkout from our mercurial repository as instructed here: http://graphlab.org/download.html . If you have access to an Ubuntu machine or an EC2 machine, that would be the easiest way to start.
2) Prepare your matrix in MatrixMarket sparse matrix format as explained here: http://graphlab.org/pmf.html
Alternatively, if you already have a Mahout sequence file, you can translate it to the GraphLab collaborative filtering format using the instructions here: http://bickson.blogspot.com/2011/08/graphlab-and-mahout-compatability.html
3) Run our new SVD solver using the command: ./demoapps/clustering/glcluster 7 2 0 --matrixmarket=true --scope=null --ncpus=XX
4) When you run Mahout - note that their current Lanczos solver solves only one side of the problem, so you will need to run their solver twice: once for A and once for A'. Our solver runs both iterations in a single pass.

Anyway, I would love to get more comments about this setup from anyone out there! Email me..

Friday, October 21, 2011

About two weeks ago, I got a note from Tom Mitchell, head of the Machine Learning Dept. at Carnegie Mellon University, saying that he is looking for a large scale solution for computing the SVD (singular value decomposition) of a matrix with 1.2 billion non-zeros. Here is his note:

We have a large, sparse, 10^7 by 10^7 array on which we'd like to run
SVD/PCA. It is an array of corpus statistics, where each row
represents a noun phrase such as "Pittsburgh", each column represents
a text fragment such as "mayor of __", and the i,j entry in the array
gives the count of co-occurrences of this noun phrase with this text
fragment in a half billion web pages. Most elements are zero.

To date, Abhay (cc'd) has thinned out the rows and columns to the most
frequent 10^4 noun phrases and text fragments, and we have run
Matlab's SVD on the resulting 10^4 by 10^4 array. We'd like to scale
up to something bigger.

My question: do you think your GraphLab SVD is relevant? In case
yes, we'd enjoy giving it a try.

cheers,
Tom

Aside from the fact that Tom is my beloved dept. head, the challenge presented is quite exciting, and I immediately started to look into it.

One factor that complicates this task is that he further required running no less than 1000 iterations. Because the matrix has 1,200,000,000 non-zeros, and each iteration involves multiplying by A and then A', and equivalently by A' and A, a total of 4 passes over the matrix, we need to compute 4,800,000,000,000 multiplications!
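The arithmetic above can be checked in one line:

```python
# Operation count quoted above: 1.2 billion non-zeros, 4 passes over
# the matrix per iteration, 1000 iterations.
nnz = 1_200_000_000
passes_per_iteration = 4
iterations = 1000
total_multiplications = nnz * passes_per_iteration * iterations
print(total_multiplications)  # 4800000000000, i.e. 4.8 trillion
```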

Furthermore, since there are 8,000,000 rows and 7,000,000 columns, to store the eigenvectors we need to store 17,000,000,000 numbers.

As explained in my previous post, SVD can be computed using the Lanczos iteration, using the trick that Abhay taught me. So first, I changed the code to compute both AA' and A'A on each iteration (instead of solving a full Lanczos of AA' and then of A'A, which would require much more time). I let Abhay try this solution, and the program crashed after consuming 170GB of RAM on a 96GB machine.

Next, I changed all usage of doubles to floats, reducing memory consumption by half. I further saved all rows and columns using it++ sparse vectors (instead of GraphLab edge structures) to save memory. Now on my 8 core machine, it took about 2150 seconds to load the problem into memory, and each SVD iteration took 160 seconds. Again I handed this solution to Abhay, who again had trouble running the program. This is because I had tested it using 10 iterations, and was not aware at that point that he needed to run 1000 iterations. Even with floats, allocating the data structure to hold the eigenvectors takes 72GB of RAM!

Finally, I made another change, which is dumping each eigenvector to disk. On each iteration, the eigenvector is saved to a swap file of size 72MB. That way the 72GB for storing those partial results is kept on disk and not held in memory. By doing a few more optimizations of the loading process, I reduced the loading time to 1600 seconds.
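The out-of-core trick can be sketched as follows. This is a hypothetical illustration, not the actual GraphLab code; the function names and file layout are made up:

```python
# Hypothetical sketch of the out-of-core trick described above: each
# eigenvector is dumped to its own swap file as soon as it is computed,
# so only the current working vectors stay in RAM.
import os
import tempfile
import numpy as np

swap_dir = tempfile.mkdtemp()

def dump_eigenvector(v, i):
    # store as float32, matching the double -> float optimization above
    v.astype(np.float32).tofile(os.path.join(swap_dir, "eigvec_%d.swap" % i))

def load_eigenvector(i, n):
    return np.fromfile(os.path.join(swap_dir, "eigvec_%d.swap" % i),
                       dtype=np.float32, count=n)

# round-trip a small vector to show the mechanism
v = np.random.rand(1000)
dump_eigenvector(v, 0)
w = load_eigenvector(0, 1000)
assert np.allclose(v.astype(np.float32), w)
```

At 4 bytes per float, a vector of ~18 million entries is about 72MB, which is consistent with the swap file size mentioned above.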

I also compiled the GraphLab solution on the BlackLight supercomputer, and using 16 cores it takes 80 seconds per iteration. Overall, computing 1000 eigenvectors of this mega matrix can now be done in a reasonable time of 24 hours using a single machine with 16 cores!

An update: this is a note I got from Abhay: "using ncpus=24, 1000 iterations finished in 10 hours, which is pretty neat. Also, this time around I didn't see a lot of memory usage. "

Overall, now we have one more happy GraphLab user! Contact me if you have any large problem you think can be solved using GraphLab and I will be happy to help.

Monday, October 10, 2011

A while ago, I examined the mailing list of Mahout's parallel machine learning toolbox, and found that a majority of the questions were about their Lanczos SVD solver. This is a bit surprising, since Lanczos is one of the simplest algorithms out there. So I thought I would write a post that explains Lanczos once and for all.

The Lanczos method is a simple method for finding the eigenvalues and eigenvectors of a square matrix. Here is Matlab code that implements Lanczos:
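The original code block did not survive in this copy of the post; the following is a reconstructed sketch of a standard Lanczos iteration for a symmetric matrix A and m iterations, with variable names chosen to match the snippet in the next section:

```matlab
% Lanczos iteration sketch for a symmetric n x n matrix A.
% Returns the Lanczos vectors V and the tridiagonal coefficients
% alpha (diagonal) and beta (off-diagonal); the eigenvalues of the
% tridiagonal matrix T approximate the eigenvalues of A.
function [V, alpha, beta] = lanczos(A, m)
  n = size(A, 1);
  V = zeros(n, m + 1);
  V(:,1) = rand(n, 1);
  V(:,1) = V(:,1) / norm(V(:,1));
  alpha = zeros(m, 1);
  beta = zeros(m + 1, 1);   % beta(1) is unused
  for j = 1:m
    if j == 1
      w = A * V(:,j);
    else
      w = A * V(:,j) - beta(j) * V(:,j-1);
    end
    alpha(j) = w' * V(:,j);
    w = w - alpha(j) * V(:,j);
    beta(j+1) = norm(w);
    V(:,j+1) = w / beta(j+1);
  end
  T = diag(alpha) + diag(beta(2:m), 1) + diag(beta(2:m), -1);
  disp(eig(T));   % approximate eigenvalues of A
end
```

(In practice one also reorthogonalizes w against the previous Lanczos vectors to fight numerical drift, but the bare iteration above is the core of the method.)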

Handling non-square matrices
Now, what happens when the matrix A is not square? This is very easy. Instead of decomposing A, we will decompose A'A. In terms of efficiency, there is no need to compute A'A explicitly. We can compute the product iteratively by changing:
w = A*V(:,j) - beta(j)*V(:,j-1);
to
w = A*V(:,j); w = A'*w - beta(j)*V(:,j-1);

GraphLab Implementation
SVD is now implemented in GraphLab as part of the collaborative filtering library.
You are welcome to first run the unit test:
./pmf --unittest=131
which performs SVD on a 5x12 matrix.

The input to the algorithm is a sparse matrix in GraphLab (or sparse matrix market) format, as explained here. The output is three matrices: U, V, and T.
U and V are as above, and T is a matrix of size (number of iterations + 1) x 2, where the eigenvalues of A'A are in the first column and the eigenvalues of AA' are in the second column. (Due to numerical errors, and the fact that we run a limited number of iterations, there may be variations between the two lists of eigenvalues, although in theory they should be the same.)

Implementation details
We implement the Lanczos algorithm as described above, but on each iteration we compute both A'A and AA' in parallel. The matrices are sliced into rows/columns, and each row and column is computed in parallel using GraphLab threads.

Notes:
1) The number of computed eigenvalues is always the number of iterations + 1.
2) Currently we do not support square matrices. (When inputting a square matrix we still compute A'A.)
3) Contrary to some common belief, the algorithm works well for both small examples and large matrices. So it is also advisable to try it out using small examples first.
4) It is recommended to run using it++. The Eigen version is still experimental at this point.
5) The output matrix T contains the squares of the singular values (and not the singular values themselves).

Copy the command string "git clone..." and execute it in your Ubuntu shell.

Note: after you have cloned the repository, if you would like to get up to date
you can issue the command:
git pull

6) configure and compile - for GraphLab version 2.1

cd graphlab

./configure
cd release/toolkits/collaborative_filtering/ # for example, if you want to compile the cf toolkit
make -j4

7) Optional: install Octave.
If you don't have access to Matlab, Octave is an open source replacement. Octave is useful for preparing input formats to GraphLab's collaborative filtering library and reading the output.
On Ubuntu, for example, you can install Octave using the command:

sudo apt-get install octave

Tuesday, October 4, 2011

Download MacPorts. Images are available for Lion and Snow Leopard.

Then you should update your MacPorts using the commands:

sudo port selfupdate
sudo port upgrade outdated

NOTE: FOR GRAPHLAB, PLEASE USE gcc45 and not gcc46 as in the below
example.

Then you can install gcc using

sudo port install gcc46 #change this to gcc45 for graphlab

You can view other versions of gcc using the command

port search gcc

After installation you can view the installed files using the command

port contents gcc46 #change this to gcc45 for graphlab

Now if you have several gcc versions installed you may want to choose the right version to use.
This is done using the command:

sudo port select gcc mp-gcc46

Now I got an error:
Selecting 'mp-gcc46' for 'gcc' failed: could not create a new link for '/opt/local/bin/gcj'. It seems this is a known issue, discussed here.
I applied a quick and dirty fix (since I don't need gcc's Java compiler):

sudo touch /opt/local/bin/gcj-mp-4.6

And then I get:
Selecting 'mp-gcc46' for 'gcc' succeeded. 'mp-gcc46' is now active.

Monday, October 3, 2011

Following the first part of this post, where I compared some properties of it++ vs. Eigen, two popular linear algebra packages (it++ is an interface to BLAS/LAPACK), I took some time to look more closely at their performance.

Each algorithm was tested once on top of it++ and a second time on Eigen, using a subset of Netflix data described here.

Following is a summary of the results:
* Installation: Eigen requires no installation, since it is composed of header files. it++ has a problematic installation on many systems, since installation packages are available for only some platforms, and BLAS and LAPACK may need to be preinstalled.
* Licensing: Eigen has an LGPL3+ license, while it++ has a more restrictive GPL license.
* Compilation: Eigen compilation tends to be slower because of excessive template usage.
* Performance: Typically Eigen is slightly faster as shown in the graphs below.
* Accuracy: Eigen is more accurate when using its ldlt() method (relative to it++'s solve() method), while it++ performs better on some other problems when using its backslash() least-squares solution.

About Me

Six years ago, along with my collaborators at Carnegie Mellon University, I started the GraphLab large scale open source project, a framework for implementing machine learning algorithms in parallel and distributed settings. When the project became popular, we decided to raise money to expand the project and provide an industry grade solution.
Specifically, I wrote the award-winning collaborative filtering toolkit for GraphLab, which is widely deployed today and helped us win top places at ACM KDD CUP 2011 and ACM KDD CUP 2012, among other competitions.
Check out our website: http://dato.com