Basically he argues that although social network graphs have local structure (your friend network on Facebook, for example, may be nicely clustered), they lack any global structure. This makes current methods of graph partitioning ineffective on such graphs.

Monday, July 25, 2011

I had a very interesting meeting today with Matei Zaharia, a graduate student from Berkeley with a very impressive publication record. He is one of the authors of Spark, a parallel cloud architecture targeted at iterative and interactive algorithms, where Hadoop does not perform well.

I was impressed with the demo of Spark, using 20 Amazon EC2 machines. Spark is implemented in Scala, which provides a convenient Java-like programming interface.

Next, I tested multicore GraphLab using the same data and the same factorized matrix width (D=60). One iteration of alternating least squares in GraphLab takes 106 seconds on 8 cores of an AMD Opteron multicore machine. This is very close
to Spark's results with 12 cores.

Overall conclusion (by Matei): I think the main takeaway from this is that you should absolutely use a lot of cores on one machine if you have them, since communication is much faster. When you add nodes, the communication cost will lower the overall performance per CPU, but you will also get lower response times.

Thursday, July 21, 2011

There is no doubt that Amazon EC2 is one of the most successful and useful cloud services. However, a few days ago I had the nerve-wracking experience of trying to register an Amazon AMI image with ec2-register. This task is needed when you want to save your work so you can easily load it the next time you run. The Amazon AMI tools are among the worst-designed and worst-implemented tools I have ever encountered; you need a lot of patience when dealing with them. I wrote down some of the errors I encountered.
(I thought that having a PhD and 15 years of Linux experience would make me immune to these kinds of errors, but I was absolutely wrong...) The reader should be warned
that I did not collect these errors from the web; I simply ran into all of the errors below myself, until eventually I got so tired that I stopped documenting everything.

Basically, what you want to do is run 3 commands. Usually this should take no more than a few minutes; however, if you manage to run those commands in less than a few hours, you are absolutely lucky.
These are the commands you want to run:
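The exact commands are not preserved in this copy; from the standard S3-backed AMI workflow, the three steps are roughly the following sketch (key files, account ID, and bucket/image names below are all placeholders):

```shell
# 1. Bundle the instance's volume into image parts under /mnt
ec2-bundle-vol -d /mnt -k pk-XXXX.pem -c cert-XXXX.pem -u <account-id>
# 2. Upload the bundle to an S3 bucket (lowercase name, no underscores!)
ec2-upload-bundle -b mybucket -m /mnt/image.manifest.xml \
    -a <access-key> -s <secret-key>
# 3. Register the uploaded manifest as an AMI
ec2-register mybucket/image.manifest.xml -n myimagename
```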

Problem:

The specified bucket is not S3 v2 safe (see S3 documentation for details):

Solution: This looks like an EC2 bug: underscores and capital letters are allowed in bucket names but trigger this warning. If you ignore the warning at this point, you will get much worse errors later, so try to avoid it.

Solution:
You have tried so many times that you ran out of disk space. You need to clean up files, or restart the image and try again.

Problem:

Neither a 'manifest' or 'block-device-mapping' have been specified; at least one is required. (-h for usage)

Solution:
You should both use the -n flag to specify a name and pass the manifest path, bucketname/imagename.manifest.xml. By the way, the bucket name is flexible; it does not have to match the image name.
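For reference, an invocation that satisfies both requirements might look like this (bucket and image names are placeholders):

```shell
# Both the manifest path and the -n name are required here
ec2-register mybucket/graphlab.manifest.xml -n graphlab-image-v1
```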

Solution: No clue what I did; I started to lose focus at this point. I probably just started all over again... :-(

Problem: Client.InvalidAMIName.Duplicate: AMI name graphlaborgreleasev1234 is already in use by AMI ami-98946ef1

Solution: This happens when you try to register a new AMI with a name you already gave to an older AMI; you need to pick a different name.

Hopefully, after all this mess, you managed to ec2-register and got a printout of
the form:
IMAGE AMI-12120930
HALLELUJAH.
And I ask: why not simply add a UI option to the AWS console for registering an image???

Final comment: I had a quick email exchange with James Hamilton, a VP at Amazon, and sent him this link. I got back the following note: "Sorry you had a bad experience with EC2."

I would like to take this opportunity to clarify that my overall experience with EC2 is very good; still, some of the interfaces could be improved.

Monday, July 18, 2011

Now that the excitement following our 5th place in KDD CUP 2011 has died down a little, I have started looking at other interesting problems. The Hearst machine learning challenge has some interesting data: about 1M emails are given, each with 273 sparse features. The task is to classify a set of validation emails, deciding whether the user opened the email and whether he clicked the link within it. The problem is not so easy, since the data is highly skewed: most users ignore ad emails
as spam, so the number of positive examples is rather low.

One of the classic ways of solving the classification problem is using an SVM (support vector machine). SVMLight is a popular implementation of an SVM solver.

Here is a short script I wrote for converting the Hearst machine learning challenge data into SVMLight format (and also Pegasos format).
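The original script is not preserved in this copy; a minimal awk sketch of the conversion is below, assuming a comma-separated input whose first column is the 0/1 target and whose remaining columns are feature values (the sample rows and file names are made up for illustration; SVMLight expects "label index:value ..." lines with 1-based indices and zero features omitted):

```shell
# Hypothetical reconstruction of the conversion: turn "target,f1,f2,..."
# rows into SVMLight's "label index:value ..." lines, dropping zero features.
# The real challenge file name is unknown; tiny sample rows inlined for demo.
printf '1,0,3.5,2\n0,1,0,0\n' > sample.csv   # stand-in for the challenge data
awk -F',' '{
  printf "%s", ($1 == 1 ? "+1" : "-1")        # SVMLight labels: +1 / -1
  for (i = 2; i <= NF; i++)
    if ($i != 0) printf " %d:%s", i - 1, $i   # 1-based feature indices
  printf "\n"
}' sample.csv > svm1.txt
```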

The resulting files are svm1.txt through svm5.txt (using the first target, email opened), 2svm1.txt through 2svm5.txt (using the second target, link clicked), and the validation.txt file.
Next, you can merge the files using the command:
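The exact merge command did not survive here; a plain concatenation of the five per-fold files would do (a sketch; the tiny stand-in files are generated only so the command is self-contained):

```shell
# Sketch: merge the five per-fold SVMLight files into one training file.
# Generating tiny stand-in files so the example runs on its own.
for i in 1 2 3 4 5; do printf '+1 %d:1\n' "$i" > "svm$i.txt"; done
cat svm1.txt svm2.txt svm3.txt svm4.txt svm5.txt > svm_all.txt
```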

A few days ago we released a detailed technical report on the performance of distributed GraphLab on Amazon EC2 with up to 64 nodes (512 cores in total): http://arxiv.org/abs/1107.0922

We compared GraphLab using three applications: matrix factorization, CoEM (a named entity recognition algorithm that is a variant of personalized PageRank), and video co-segmentation.

As a reference, we compared three platforms: Hadoop, MPI (Message Passing Interface), and GraphLab. In a nutshell, GraphLab runs about 20x to 100x faster than Hadoop, depending on the data and the application. The main reasons are that we perform all computation in memory and do not provide any fault tolerance. Compared to MPI, GraphLab has similar performance; the drawback of MPI is that the code has to be rewritten for each application, while GraphLab provides building blocks for iterative computation.

The following graph shows the speedup of the 3 applications using 64 Amazon HPC machines:

The baseline for the speedup calculation is 4 machines. For matrix factorization (the line denoted Netflix) we get a speedup of 16x on 64 machines, and for video co-segmentation we get a speedup of 40x on 64 EC2 nodes.
When we increase the factorized matrix width, the problem becomes computation-heavy and we get
an even better speedup of 40x on 64 nodes.

About Me

Six years ago, along with my collaborators at Carnegie Mellon University, I started the GraphLab large-scale open source project, a framework for implementing machine learning algorithms in parallel and distributed settings. When the project became popular, we decided to raise money to expand it and provide an industry-grade solution.
Specifically, I wrote the award-winning collaborative filtering toolkit for GraphLab, which is widely deployed today and helped us win top places at ACM KDD CUP 2011 and ACM KDD CUP 2012, among other competitions.
Check out our website: http://dato.com