GraphChi (IO Efficient Graph Mining)

GraphChi (Kyrola, 2012) is an IO-efficient graph mining system that is also designed
to accept topology updates, based on a Parallel Sliding Window (PSW) algorithm.
Each iteration over the graph requires P^2 sequential reads and P^2 sequential
writes. Because all IO is sequential, GraphChi may be used with traditional disk or
SSD. The system is not designed to answer ad-hoc queries and is not a database in
any traditional sense – the isolation semantics of GraphChi are entirely related to
the Asynchronous Parallel (ASP) versus Bulk Synchronous Parallel (BSP) processing
paradigms. GraphChi does not support either vertex attributes or link attributes.
The basic data layout for GraphChi is a storage model that is key-range partitioned
by the link target (the destination vertex), with the links within each partition stored
sorted by their source vertex. This design was chosen to permit IO-efficient vertex
programs where the graph is larger than main memory.

While GraphChi does not support cluster-based processing, the approach
could be extended to a compute cluster. Because of its IO-efficient design, the
approach is also of interest for out-of-core processing in hybrid CPU/GPU architectures.
GraphChi operates by applying a utility program to split a data set into P partitions,
where the user chooses the value of P with the intention that a single partition will
fit entirely into main memory. The edges are assigned to partitions so as to
create partitions with an equal number of edges – this provides load balancing and
compensates for data skew in the graph (high-cardinality vertices).
GraphChi reads one of the P partitions (called the memory partition) into core. This
provides the in-edges for all vertices in that partition. Because the partitions are
divided into target-vertex key-ranges, and because partitions are internally ordered
by the source vertex, the out-edges for those vertices are guaranteed to lie in a
contiguous range of each of the remaining P-1 partitions. Those ranges may vary in
size since the number of out-edges per vertex is not constant. Thus, for the current
memory partition, GraphChi performs one full partition scan plus P-1 partial partition
scans.
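The scan pattern above can be sketched with a small in-memory toy. This is illustrative only: real GraphChi reads each window from disk via stored offsets rather than filtering lists, and `make_partitions`/`psw_iteration` are made-up names, not GraphChi's API.

```python
def make_partitions(edges, P, num_vertices):
    """Toy stand-in for GraphChi's preprocessing: key-range partition
    edges by target vertex, then sort each partition by source vertex."""
    size = (num_vertices + P - 1) // P          # width of each key-range
    parts = [[] for _ in range(P)]
    for (src, dst) in edges:
        parts[dst // size].append((src, dst))   # partition by target
    for part in parts:
        part.sort()                             # internally ordered by source
    return parts, size

def psw_iteration(parts, size, update):
    """One PSW iteration: for each memory partition, one full scan plus
    P-1 partial scans (one contiguous out-edge window per other partition)."""
    P = len(parts)
    for p in range(P):
        lo, hi = p * size, (p + 1) * size
        in_edges = parts[p]                     # full scan: all in-edges
        # out-edges whose source lies in this key-range; in each other
        # partition (sorted by source) they form one contiguous window
        out_edges = [e for e in parts[p] if lo <= e[0] < hi]
        for q in range(P):
            if q != p:
                out_edges += [e for e in parts[q] if lo <= e[0] < hi]
        for v in range(lo, hi):
            update(v,
                   [e for e in in_edges if e[1] == v],
                   [e for e in out_edges if e[0] == v])
```

For example, an `update` function that just records in/out degrees sees every edge of every vertex exactly once per iteration, even though each partition was scanned sequentially.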

In addition to the edges (network structure), GraphChi maintains the transient
graph computation state for each edge and vertex. Each edge and each vertex has a
user-assignable label consisting of some (user-defined) fixed-length data type. The
vertices also have a flag indicating whether or not they are scheduled in a given
iteration. The edge state is presumably read with the out-edges, though perhaps
from a separate file (the paper does not specify this). The vertex state is presumably
read from a separate file (again, this is not specified). After the update() function
has been applied to each vertex in the current memory partition, the transient graph
state is written back out to disk. This provides one more dimension of graph
computation state that is persisted on disk, presumably in a column paired with the
vertex state.

If you would like to read the rest of the overview, along with some proposed extensions, you should read the full paper. And of course, you can read about the collaborative filtering toolkit I am writing on top of GraphChi here.

Co-EM is a very simple algorithm, extensively utilized by Rosie Jones in her PhD thesis, and originally due to Nigam and Ghani (2000). The algorithm is used for clustering text entities into categories. Here is an example dataset (NPIC500) which explains the input format. The algorithm constructs a bipartite graph:

where the left nodes are noun phrases, the right nodes are sentence contexts, and the edge weight is the number of times a certain noun phrase appeared within the context. The algorithm is very simple (described on page 43 of Jones's PhD thesis):

As seen above, the noun labels simply compute a weighted sum of the edge values. The context nodes compute the same weighted sum (if they are not seed nodes). Seed nodes are the initial graph nodes for which we have ground-truth labels.

The output is the probability, for each noun phrase, of belonging to each of the categories.
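A minimal sketch of this update rule for a single category (probability of the positive class), assuming the bipartite weighted-edge layout described above; the function and node names here are made up for illustration:

```python
from collections import defaultdict

def coem(edges, seeds, nodes, n_iter=10):
    """edges: (noun, context, weight) triples; seeds: node -> fixed
    ground-truth probability of the positive category."""
    labels = {v: 0.5 for v in nodes}   # unlabeled nodes start undecided
    labels.update(seeds)
    for _ in range(n_iter):
        for side in (1, 0):            # alternate: contexts, then nouns
            num = defaultdict(float)
            den = defaultdict(float)
            for (n, c, w) in edges:
                tgt, src = (c, n) if side else (n, c)
                num[tgt] += w * labels[src]   # weighted sum of neighbors
                den[tgt] += w
            for v in num:
                if v not in seeds:     # seed labels stay clamped
                    labels[v] = num[v] / den[v]
    return labels
```

With a seed noun "paris" labeled 1.0 and an unlabeled noun sharing its context, the shared context pulls the unlabeled noun's probability toward 1.0 over the iterations.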

Here is a more concrete example of the input file:

Additionally, ground truth is given for the negative and positive seeds. For example, assume we have two categories (city / not city). The seed lists classify certain nouns into their matching categories.

And here is how to try it out in GraphChi
0) Install GraphChi as explained here, and compile using "make ta"
1) Download the file http://graphlab.org/downloads/datasets/coem.zip and unzip it in your root graphchi folder
2) In the root graphchi folder run:

Sunday, February 17, 2013

Recently, Professor Pedro Domingos, one of the top machine learning researchers in the world, wrote a great article in the Communications of the ACM entitled “A Few Useful Things to Know about Machine Learning”. In it, he not only summarizes the general ideas in machine learning in fairly accessible terms, but he also manages to impart most of the things we’ve come to regard as common sense or folk wisdom in the field.

One thing you should worry about when applying machine learning to high-dimensional data is random (spurious) correlation. I got a great example from my friend and collaborator Erik Aurell:
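The danger is easy to demonstrate with a few lines of self-contained Python: when you scan many more random features than you have samples, some feature will correlate strongly with a purely random target by chance alone (the sample sizes here are my own arbitrary choices).

```python
import random

random.seed(0)
n_samples, n_features = 20, 2000

# a purely random "target" variable with 20 observations
target = [random.gauss(0, 1) for _ in range(n_samples)]

def corr(x, y):
    """Plain Pearson correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x) ** 0.5
    vy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (vx * vy)

# scan 2000 random features and keep the strongest "signal"
best = max(
    abs(corr(target, [random.gauss(0, 1) for _ in range(n_samples)]))
    for _ in range(n_features)
)
print(best)  # a strong correlation, obtained from pure noise
```

A model selected on such a feature would look great on this data and be worthless on fresh data, which is exactly the high-dimensional trap.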

Monday, February 11, 2013

A few days ago I got a request from Jidong, from the Chinese Renren company, to implement label propagation in GraphChi. The algorithm is very simple and is described here: Zhu, Xiaojin, and Zoubin Ghahramani. Learning from labeled and unlabeled data with label propagation. Technical Report CMU-CALD-02-107, Carnegie Mellon University, 2002.
The basic idea is that we start with a group of users for whom we have some information about the categories they are interested in. Following the weights in the social network, we propagate the label probabilities from the seed nodes (the ones we have label information about) into the general social network population. After several iterations, the algorithm converges and the output is labels for the unknown nodes.

Here is a pseudo code for the algorithm:

where Y is the matrix with probabilities for each label, and T is the original adjacency graph (with the weights first normalized to sum to one). Clamping means that for the labeled data the probabilities are fixed to the input probabilities.
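A minimal numpy sketch of that pseudo-code, with Y, T, and clamping as described above (the iteration count and the uniform initialization of unlabeled rows are my own choices):

```python
import numpy as np

def label_propagation(adj, seeds, n_iter=50):
    """adj: (N, N) nonnegative weight matrix; seeds: node -> fixed
    label-probability row (length D). Returns an (N, D) matrix Y."""
    T = adj / adj.sum(axis=1, keepdims=True)   # normalize weights to sum to one
    D = len(next(iter(seeds.values())))
    Y = np.full((adj.shape[0], D), 1.0 / D)    # unlabeled nodes start uniform
    for v, p in seeds.items():
        Y[v] = p
    for _ in range(n_iter):
        Y = T @ Y                              # 1. propagate: Y <- T Y
        Y /= Y.sum(axis=1, keepdims=True)      # 2. row-normalize Y
        for v, p in seeds.items():             # 3. clamp the labeled data
            Y[v] = p
    return Y
```

On a 3-node chain with node 0 seeded, the seed row stays fixed while the other nodes converge to probabilities inherited through their connections, matching the behavior described for the renren example below.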

Label propagation is now a part of the GraphChi graph analytics toolkit, which includes the following algorithms:
kcores - k-cores algorithm
subgraph - cut a subgraph following X hops around seed nodes / output node degrees
inmemconcomp - in-memory connected components

The way to run label propagation is to prepare two input files.
1) The --training=filename file is a file in sparse matrix market format which describes the social graph (an adjacency matrix of size N x N).
2) Additionally, we need to provide a seed file (with the filename filename.seeds) which holds a sparse matrix of size N x D, where D is the number of possible class categories. For example, if node 1 is a seed node with 3 categories with probabilities p_1 = 0.3, p_2 = 0.4, p_3 = 0.3, we need to add the following inputs:
1 1 0.3
1 2 0.4
1 3 0.3
Here is an example training file for a network of size 5x5 (file name in this example is renren)

It is easy to verify that the seed node (node 1) probabilities were not changed, while the other nodes now have probabilities which originate in their connections with node 1.

An update: here is a note I got from Soundar Kumara, Prof. at Penn State Univ:

Sunday, February 10, 2013

As you may know, our GraphChi collaborative filtering toolkit in C is becoming more and more popular. Recently, Aapo Kyrola made a great effort to port GraphChi C to Java and to implement more methods on top of it.

In this blog post I explain how to setup GraphChi Java development environment in Eclipse and run alternating least squares algorithm (ALS) on a small subset of Netflix data.
Based on the level of user feedback I am going to receive for this blog post, we will consider porting more methods to Java. So email me if you are interested in trying it out.
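For readers who want the gist of ALS itself before diving into the Java setup, here is a hedged numpy sketch of the algorithm: alternately fixing one factor matrix and solving a regularized least-squares problem for the other (dimensions, regularization value, and names are arbitrary illustrative choices, not GraphChi's implementation).

```python
import numpy as np

def als(R, mask, d=5, n_iter=20, lam=0.1):
    """Factor R (users x items) ~= U @ V.T using only the observed
    entries (mask == 1), alternating exact least-squares solves."""
    rng = np.random.default_rng(0)
    n_users, n_items = R.shape
    U = rng.standard_normal((n_users, d))
    V = rng.standard_normal((n_items, d))
    reg = lam * np.eye(d)
    for _ in range(n_iter):
        for u in range(n_users):          # fix V, solve for each user factor
            idx = mask[u] == 1
            U[u] = np.linalg.solve(V[idx].T @ V[idx] + reg,
                                   V[idx].T @ R[u, idx])
        for i in range(n_items):          # fix U, solve for each item factor
            idx = mask[:, i] == 1
            V[i] = np.linalg.solve(U[idx].T @ U[idx] + reg,
                                   U[idx].T @ R[idx, i])
    return U, V
```

Each inner solve is a small d x d linear system, which is why ALS parallelizes so naturally over users and items.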

Error:

Exception in thread "main" java.io.FileNotFoundException: ~/Downloads/smallnetflix_mm.shovel.0 (No such file or directory)
    at java.io.FileOutputStream.open(Native Method)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:194)
    at java.io.FileOutputStream.<init>(FileOutputStream.java:84)
    at edu.cmu.graphchi.preprocessing.FastSharder.<init>(FastSharder.java:113)
    at edu.cmu.graphchi.apps.ALSMatrixFactorization.createSharder(ALSMatrixFactorization.java:176)
    at edu.cmu.graphchi.apps.ALSMatrixFactorization.main(ALSMatrixFactorization.java:198)

Solution: Give a full absolute path pointing to the location of your file, namely /home/bickson/Downloads/smallnetflix_mm etc.

Error:

Exception in thread "main" java.lang.IllegalArgumentException: Java Virtual Machine has only 32489472 bytes maximum memory. Please run the JVM with at least 256 megabytes of memory using -Xmx256m. For better performance, use higher value
    at edu.cmu.graphchi.engine.GraphChiEngine.<init>(GraphChiEngine.java:120)
    at edu.cmu.graphchi.apps.ALSMatrixFactorization.main(ALSMatrixFactorization.java:215)

Instructions for computing item to item similarities:

2) Run createTrain.sh to download the million songs dataset and prepare a GraphChi-compatible format:
$ sh createTrain.sh
Note: this operation may take an hour or so to prepare the data.

3) Run GraphChi item based collaborative filtering, to find out the top 500 similar items for each item:

$ ./toolkits/collaborative_filtering/itemcf --training=train --K=500 --asym_cosine_alpha=0.15 --distance=3 --min_allowed_intersection=5

Explanation:
--training points to the training file.
--K=500 means we compute the top 500 similar items.
--distance=3 is Aillio's metric.
--min_allowed_intersection=5 means we take into account only items that were rated together by at least 5 users.

Note: this operation requires around 20GB of memory and may take a few hours...
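To make the flags concrete, here is an illustrative Python sketch of the item-based similarity step: plain cosine similarity over co-rating users, with a minimum intersection cut-off and a top-K neighbor list per item, mirroring --K and --min_allowed_intersection. This is standard cosine, not necessarily the exact --distance=3 metric the toolkit uses.

```python
from collections import defaultdict
from itertools import combinations
import math

def item_similarities(ratings, K=500, min_intersection=5):
    """ratings: (user, item, value) triples; returns, for each item,
    the top-K (similarity, other_item) pairs by cosine similarity."""
    by_item = defaultdict(dict)            # item -> {user: rating}
    for u, i, v in ratings:
        by_item[i][u] = v
    sims = defaultdict(list)
    for a, b in combinations(by_item, 2):
        common = by_item[a].keys() & by_item[b].keys()
        if len(common) < min_intersection:
            continue                       # too few users rated both items
        dot = sum(by_item[a][u] * by_item[b][u] for u in common)
        na = math.sqrt(sum(v * v for v in by_item[a].values()))
        nb = math.sqrt(sum(v * v for v in by_item[b].values()))
        s = dot / (na * nb)
        sims[a].append((s, b))
        sims[b].append((s, a))
    return {i: sorted(pairs, reverse=True)[:K] for i, pairs in sims.items()}
```

The intersection cut-off is what keeps rarely co-rated item pairs from producing noisy similarities, at the cost of dropping them from the output entirely.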

Create user recommendations based on item similarities:

1) Run itemsim2rating to compute recommendations based on item similarities:

$ rm -fR train.* train-topk.*
$ ./toolkits/collaborative_filtering/itemsim2rating --training=train --similarity=train-topk --K=500 --membudget_mb=50000 --nshards=1 --max_iter=2 --Q=3 --clean_cache=1
Note: this operation may require 20GB of RAM and may take a couple of hours based on your computer configuration.
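Conceptually, this step turns the item similarity lists into per-user scores. A simplified sketch of that aggregation (a similarity-weighted sum over the items the user has already rated; not GraphChi's exact computation, and the function names are my own):

```python
from collections import defaultdict

def recommend(user_ratings, sims, topn=10):
    """user_ratings: item -> rating for one user; sims: item ->
    list of (similarity, other_item). Returns top-N (item, score)."""
    scores = defaultdict(float)
    for item, rating in user_ratings.items():
        for s, other in sims.get(item, []):
            if other not in user_ratings:      # only recommend unseen items
                scores[other] += s * rating    # similarity-weighted sum
    return sorted(scores.items(), key=lambda kv: -kv[1])[:topn]
```

So a user who rated item "a" highly is recommended the items most similar to "a", ranked by similarity times the user's own rating.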

Friday, February 1, 2013

I had an interesting talk with José P. González-Brenes, a 6th-year grad student from CMU's LTI department. During the talk, I learned that José participated in Kaggle's RTA challenge and actually won 1st place out of more than 300 groups. The challenge was to predict RTA highway travel times; the data was the recorded times different cars traveled over different road segments. The winning solution (of José and Guido Matías Cortés) was composed of a very simple method - a random forest. Unfortunately, no paper was published about it, but here is a blog post summarizing the solution method, and here is a link to their presentation. What is further interesting about the solution is that it was composed of 90 lines of Matlab code!

The reason we actually talked is that José was recently trying out my GraphChi collaborative filtering code for his research, so I gave him some advice on which methods to use. Once he has some interesting results I hope he will update us!

About Me

6 years ago, along with my collaborators at Carnegie Mellon University, I started the GraphLab large-scale open source project, a framework for implementing machine learning algorithms in parallel and distributed settings. When the project became popular, we decided to raise money to expand the project and provide an industry-grade solution.
Specifically, I wrote the award-winning collaborative filtering toolkit for GraphLab, which is widely deployed today and helped us win top places at ACM KDD CUP 2011 and ACM KDD CUP 2012, among other competitions.
Check out our website: http://dato.com