GPS: A Graph Processing System

Overview

GPS is an open-source system for scalable, fault-tolerant,
and easy-to-program execution of algorithms on extremely large graphs.
GPS is similar to Google's proprietary Pregel
system and to Apache Giraph.
GPS is a distributed system designed to run on a cluster of machines, such as Amazon's EC2.

In systems such as GPS and Pregel, the input graph (directed, possibly with values on edges)
is distributed across
machines and vertices send each other messages to perform a computation. Computation is
divided into iterations called supersteps. Analogous to the map()
and reduce() functions of the MapReduce framework, in each superstep
a user-defined function called vertex.compute() is applied to each vertex in parallel.
The user expresses the logic of the computation by implementing vertex.compute().
This design is based on Valiant's Bulk Synchronous Parallel model of computation. A detailed
description can be found in the original Pregel paper.
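To make the superstep model above concrete, here is a minimal single-process simulation in Python. This is a sketch of the Pregel/GPS computation model, not the actual GPS API; all class and function names are our own illustrative choices. The example logic makes every vertex learn the maximum vertex id in the graph.

```python
# A minimal simulation of the vertex-centric superstep model described above.
# This is a sketch of the computation model, NOT the real GPS API; all names
# here are illustrative. Example logic: propagate the maximum vertex id.

class MaxValueVertex:
    def __init__(self, vid, out_edges):
        self.id = vid
        self.value = vid          # each vertex starts with its own id
        self.out_edges = out_edges
        self.active = True        # halted vertices are woken by messages

    def compute(self, messages, send, superstep):
        # Runs once per superstep; messages were sent in the previous one.
        if superstep == 0:
            changed = True        # everyone announces its value initially
        else:
            new_value = max([self.value] + messages)
            changed = new_value > self.value
            self.value = new_value
        if changed:
            for dst in self.out_edges:
                send(dst, self.value)
        self.active = False       # vote to halt; a message reactivates us

def run_supersteps(vertices):
    """Synchronous BSP loop: messages sent in superstep s are read in s+1.
    Terminates when every vertex has halted and no messages remain."""
    inbox = {vid: [] for vid in vertices}
    superstep = 0
    while True:
        outbox = {vid: [] for vid in vertices}
        send = lambda dst, msg: outbox[dst].append(msg)
        for v in vertices.values():
            if v.active or inbox[v.id]:
                v.compute(inbox[v.id], send, superstep)
        inbox = outbox
        superstep += 1
        if not any(inbox.values()) and not any(v.active for v in vertices.values()):
            return superstep
```

For example, on the three-vertex directed cycle {1→2, 2→3, 3→1}, every vertex converges to the value 3 after a handful of supersteps.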

There are five main differences between Pregel and GPS:

GPS is open-source.

GPS extends Pregel's API with a master.compute() function, which
enables easy and efficient implementation of algorithms that are composed of
multiple vertex-centric computations, combined with global computations.

GPS has an optional dynamic repartitioning scheme, which
reassigns vertices to different machines during the graph
computation, based on observed communication patterns, to improve performance.

GPS has an optimization called LALP that reduces network I/O when running
certain algorithms on real-world graphs that have skewed degree distributions.

GPS programs can be implemented in a higher-level domain-specific language called Green-Marl,
and automatically compiled into native GPS code. Green-Marl is a traditional imperative language with several
graph-specific language constructs that enable intuitive and simple expression of complicated algorithms.
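To illustrate why the master.compute() extension is useful, here is a hedged single-process sketch (the names and signatures are ours, not GPS's actual API) of an algorithm that needs a global step between two vertex-centric phases: vertices first report their out-degrees into a global max-aggregator, then the master switches the phase, and vertices mark themselves if they hold the maximum degree.

```python
# Illustrative sketch of combining vertex-centric phases with a global master
# computation (names are ours, not the real GPS API). Phase 1: vertices report
# out-degrees into a global max-aggregator. Between supersteps, master_compute()
# inspects the aggregator and advances the phase. Phase 2: vertices compare
# their degree to the global maximum.

class DegreeVertex:
    def __init__(self, vid, out_edges):
        self.id = vid
        self.out_edges = out_edges
        self.has_max_degree = False

    def compute(self, phase, aggregate, read_agg):
        if phase == "report-degree":
            aggregate("max_degree", len(self.out_edges))
        elif phase == "mark-max":
            self.has_max_degree = (len(self.out_edges) == read_agg("max_degree"))

def master_compute(phase, read_agg):
    # Global logic run once between supersteps: after degrees have been
    # aggregated, switch every vertex to the second phase.
    if phase == "report-degree" and read_agg("max_degree") is not None:
        return "mark-max"
    return phase

def run(vertices):
    aggs = {}
    def aggregate(name, val):             # a simple global max-aggregator
        aggs[name] = max(aggs.get(name, val), val)
    read_agg = aggs.get
    phase = "report-degree"
    for _ in range(2):                    # two supersteps
        for v in vertices.values():
            v.compute(phase, aggregate, read_agg)
        phase = master_compute(phase, read_agg)
    return sorted(v.id for v in vertices.values() if v.has_max_degree)
```

Without a master function, sequencing the two phases would require encoding the phase switch inside every vertex, which is exactly the awkwardness master.compute() removes.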

We have completed an initial version of GPS, which is available to download.
We have run GPS on up to 100 Amazon EC2 large instances and on graphs of up to 250 million
vertices and 10 billion edges.

Using GPS, we have studied the effects on system performance of different
ways of partitioning the graph, both before the graph computation
starts and during the computation. A detailed description of GPS and our
work on partitioning can be found in our
paper published in SSDBM in July 2013.

Using our Green-Marl compiler, we were able to implement several algorithms, such as Approximate Betweenness Centrality and
Conductance, whose native GPS implementations are very challenging. A detailed description of our work on the Green-Marl
compiler can be found in our paper from CGO 2014.
Details about Green-Marl can be found in the original Green-Marl paper.

We have been exploring the problem of implementing graph algorithms efficiently on Pregel-like systems. We have done extensive
implementations and evaluations of several fundamental graph algorithms, such as
computing strongly connected components (SCC), weakly connected components, minimum spanning trees, graph colorings, and approximate maximum matchings of large-scale graphs.

We have observed that standard implementations of graph algorithms can incur unnecessary inefficiencies in Pregel-like systems, such as slow convergence or high communication or computation cost, typically due to structural properties of the input graphs such as large diameters or skew in component sizes. We have developed several optimization techniques to address these inefficiencies. Our most general technique is based on the idea of performing some serial computation on a tiny fraction of the input graph, complementing Pregel's vertex-centric parallelism. A detailed description
of our work on optimizations and algorithms can be found in our paper that appeared in VLDB 2014.
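The idea of complementing vertex-centric parallelism with a small serial computation can be illustrated with a hedged sketch (our own simplification, not the paper's exact algorithm): run a few vertex-centric rounds of min-label propagation for weakly connected components, then finish serially with union-find over the much smaller set of surviving labels, instead of waiting roughly diameter-many supersteps for propagation alone to converge.

```python
# A hedged illustration (our simplification, not the exact algorithm from the
# paper) of complementing vertex-centric parallelism with a small serial step:
# weakly connected components via a few rounds of min-label propagation,
# finished serially with union-find over the remaining distinct labels.
# `adj` maps each vertex to its neighbors (undirected view: list both ways).

def label_propagation(adj, rounds):
    """The vertex-centric part: each round, every vertex takes the minimum
    label in its closed neighborhood. Alone, this needs ~diameter rounds."""
    label = {v: v for v in adj}
    for _ in range(rounds):
        label = {v: min([label[v]] + [label[u] for u in adj[v]]) for v in adj}
    return label

def finish_serially(adj, label):
    """The serial part: union-find over distinct labels, driven by edges
    whose endpoints still carry different labels."""
    parent = {l: l for l in set(label.values())}
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x
    for u in adj:
        for v in adj[u]:
            ru, rv = find(label[u]), find(label[v])
            if ru != rv:
                parent[max(ru, rv)] = min(ru, rv)
    return {v: find(label[v]) for v in adj}
```

On a 100-vertex path graph, for instance, two propagation rounds plus the serial finish recover the single component that pure propagation would need on the order of 100 supersteps to find.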

In another work related to GPS, though not directly built on top of GPS, we identify a set of high-level primitives for distributed processing of large-scale graphs. The motivation for our work is the observation that the APIs of current distributed graph systems, which consist of a set of functions such as the vertex.compute() of Pregel-like systems or the gather(), apply(), and scatter() of PowerGraph, are too low-level and difficult to program for certain computations. Programming against these APIs yields long and complex programs for certain computations (for examples, see the implementations of the minimum spanning tree or strongly connected components algorithms in the GPS codebase). Similar issues with the MapReduce framework have led to widely used languages such as Pig Latin and Hive, which offer high-level primitives for large-scale data processing. We took a similar approach for graph processing and introduced HelP, a set of high-level primitives that abstract some of the commonly used operations in distributed graph processing. HelP is implemented as a library on top of the open-source GraphX distributed system. We currently do not have an implementation of the HelP primitives that compiles to GPS. Details of the HelP primitives can be found in our paper that appeared at the GRADES 2014 workshop.
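To convey the flavor of such a primitive, here is a hypothetical neighborhood-update operator in Python (the name and signature are our own illustration, not the actual HelP or GraphX API): given per-vertex values and two small functions, it hides the superstep loop entirely and recomputes each value from the neighbors' values until a fixpoint, so connected components becomes a two-line program.

```python
# Hypothetical high-level primitive in the spirit of HelP's neighborhood
# operations (name and signature are ours, not the actual HelP/GraphX API).
# The caller only supplies how to combine neighbor values and how to apply
# the combined value; the iteration-to-fixpoint loop is hidden.

def update_vertices_using_neighbors(values, adj, combine, apply_update):
    """Repeatedly set each vertex's value from its neighbors' values
    until no value changes (a fixpoint)."""
    while True:
        new = {}
        for v, old in values.items():
            incoming = [values[u] for u in adj.get(v, [])]
            new[v] = apply_update(old, combine(incoming)) if incoming else old
        if new == values:
            return new
        values = new

# With such a primitive, connected components is a two-line program:
def connected_components(adj):
    return update_vertices_using_neighbors(
        {v: v for v in adj}, adj, combine=min, apply_update=min)
```

Compare this with the hundreds of lines a low-level vertex.compute() implementation of the same algorithm can require, which is the gap HelP is designed to close.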

In our latest work, we address the problem of debugging programs written for Pregel-like systems, which currently can be a long, tedious manual process. After interviewing Giraph and GPS users, we developed Graft. Graft supports the debugging cycle that users typically go through: (1) Users describe programmatically the set of vertices they are interested in inspecting. During execution, Graft captures the context information of these vertices across supersteps. (2) Using Graft's GUI, users visualize how the values and messages of the captured vertices change from superstep to superstep, narrowing in on suspicious vertices and supersteps. (3) Users replay the exact lines of the vertex.compute() function that executed for the suspicious vertices and supersteps, by copying code that Graft generates into their development environment's line-by-line debugger. Graft also has features for constructing end-to-end tests for Giraph programs. Graft is open-source and fully integrated into Apache Giraph's main code base. The details of the system can be found in our technical report, and the code and instructions on how to use Graft can be found on GitHub.

The GPS project is supported by the National Science
Foundation (grant IIS-0904497),
KAUST, and an Amazon Web Services Research Grant.

GPS Source Code and Documentation

The GPS source code is available as open-source code under the BSD license.
The source code can be found
here.
Here is the GPS documentation site containing information about how to download, set-up and run GPS as well as the GPS extensions to Pregel and the GPS API.
Please join the user group stanfordgpsusers @ googlegroups.com with any questions or problems you might have in
setting up GPS (the group is public; you can join it from the Google Groups website).