Getting Started with Giraph

Apache Hadoop’s core analytical tools (e.g. MapReduce, Hive, Pig) are great for performing batch analytics over large, unstructured data sets. However, many data sets have an inherently graph-like structure.

Examples of such data sets include:

a map with cities connected by roads,

a social network with people connected by relationships,

airports connected by flight paths, and

computers connected via a network (an obvious one to those of us in IT!)

Graph processing is usually an iterative process that relies heavily on communication between vertices. On top of that, any exploratory analysis of a graph data set requires multiple iterations, each building on the results of the last. The implication for those of us developing applications on Hadoop: MapReduce isn’t the right tool for the job when working with graph data sets. Enter Apache Giraph – a distributed, fault-tolerant graph processing framework built on top of Hadoop.

Giraph is based on Pregel, Google’s in-house graph processing system. Engineers at Yahoo used Google’s research paper on the subject to implement their own version for internal use. The code was eventually contributed to the Apache Software Foundation and is now a top-level project with many high-profile contributors. Facebook, a key contributor to Giraph, made numerous substantial improvements to the system in 2012 that allowed them to perform analytics on a graph with a trillion edges.

Giraph makes use of the Bulk Synchronous Parallel (BSP) computational model, developed in the 1980s at Harvard University. BSP divides computation into a sequence of discrete, globally synchronized iterations called supersteps. Each superstep is made up of three phases: computation, message passing, and synchronization. The BSP model was originally developed as a solution for general parallel computing, but it lends itself well to distributed graph processing.
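The superstep cycle can be sketched in a few lines of plain Java. This is an illustration of the BSP model itself, not Giraph API code, and the names (BspSketch, runBsp) are made up for the example: every vertex computes against the messages delivered in the previous superstep, outgoing messages are buffered, and the synchronization barrier makes them visible only in the next round.

```java
import java.util.*;
import java.util.function.BiFunction;

public class BspSketch {
    // Runs a fixed number of supersteps. Each superstep: every vertex's
    // compute function sees only the messages delivered in the PREVIOUS
    // superstep (the computation phase), its outgoing messages are buffered
    // (the message-passing phase), and the buffer swap at the end is the
    // synchronization barrier. Returns the messages still in flight.
    static Map<Integer, List<Integer>> runBsp(
            Set<Integer> vertices,
            Map<Integer, List<Integer>> initialMessages,
            int supersteps,
            BiFunction<Integer, List<Integer>, Map<Integer, Integer>> compute) {
        Map<Integer, List<Integer>> inbox = initialMessages;
        for (int step = 0; step < supersteps; step++) {
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            for (int v : vertices) {                          // computation
                List<Integer> msgs = inbox.getOrDefault(v, List.of());
                Map<Integer, Integer> sent = compute.apply(v, msgs);
                for (var e : sent.entrySet()) {               // message passing
                    outbox.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                          .add(e.getValue());
                }
            }
            inbox = outbox;                                   // barrier
        }
        return inbox;
    }
}
```

The important property is that no vertex ever sees a message sent during its own superstep, which is what makes the model deterministic and easy to distribute.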

Graphs in Giraph are made up of a set of vertices and their respective edges. Giraph’s API employs a “vertex-centric” programming model: you write a compute method that runs once per vertex, per superstep. This premise allows the graph to be partitioned and processed across a cluster of machines; the only network communication required is the messages passed between vertices. Within the compute method, a vertex typically receives the messages sent to it during the previous superstep, performs some calculation, and sends a new set of messages. After every vertex’s compute method has completed, synchronization occurs and Giraph prepares for the next superstep. During execution, a vertex can elect to go inactive by voting to halt, and it is woken up again only if it receives a new message. Execution completes once every vertex has gone inactive and no messages are in flight.

Giraph, like most applications under the Hadoop umbrella, is written in Java. Writing a new Giraph application is as simple as extending a few classes in the API. Development begins with the graph input. Graphs are typically represented as plain-text flat files stored on HDFS (although Hive, HBase, and a few other options are available for I/O). The API requires you to subclass an abstract input-format class so that your file is parsed correctly. The next step is the vertex compute method. One simple use case for Giraph is finding the maximum vertex value in a graph. The compute method below illustrates how to accomplish that goal.
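Here is a simplified, self-contained sketch of that compute method. In real Giraph code this would extend org.apache.giraph.graph.BasicComputation and use Hadoop Writable types; the stand-in Vertex class and driver loop below exist only so the logic can run on its own.

```java
import java.util.*;

public class MaxValueSketch {
    // Simplified stand-in for a Giraph vertex: an id, a mutable value,
    // and the ids of its neighbors.
    static class Vertex {
        final int id;
        int value;
        final List<Integer> neighbors;
        Vertex(int id, int value, List<Integer> neighbors) {
            this.id = id; this.value = value; this.neighbors = neighbors;
        }
    }

    // The per-vertex compute method: adopt the largest value seen in the
    // incoming messages and, if the value changed (or this is superstep 0),
    // broadcast the new value to every neighbor. Returns the outgoing
    // (neighbor -> value) messages. In Giraph proper the method would end
    // by calling vertex.voteToHalt().
    static Map<Integer, Integer> compute(Vertex v, List<Integer> messages, long superstep) {
        int max = v.value;
        for (int m : messages) max = Math.max(max, m);
        Map<Integer, Integer> out = new HashMap<>();
        if (superstep == 0 || max > v.value) {
            v.value = max;
            for (int n : v.neighbors) out.put(n, max);
        }
        return out;
    }

    // Driver loop standing in for the framework: a halted vertex is only
    // reactivated by an incoming message, so execution ends once no
    // messages are in flight.
    static void run(Collection<Vertex> graph) {
        Map<Integer, List<Integer>> inbox = new HashMap<>();
        long superstep = 0;
        do {
            Map<Integer, List<Integer>> outbox = new HashMap<>();
            for (Vertex v : graph) {
                List<Integer> msgs = inbox.getOrDefault(v.id, List.of());
                if (superstep == 0 || !msgs.isEmpty()) {      // active vertices only
                    for (var e : compute(v, msgs, superstep).entrySet()) {
                        outbox.computeIfAbsent(e.getKey(), k -> new ArrayList<>())
                              .add(e.getValue());
                    }
                }
            }
            inbox = outbox; // synchronization barrier
            superstep++;
        } while (!inbox.isEmpty());
    }
}
```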

The compute method processes all of a vertex’s incoming messages and replaces the vertex’s value with the largest one it encounters. It then sends this new value to all of the vertex’s neighbors. In this way the maximum value propagates outward to every vertex in the graph. Once no vertex has any messages left to process, execution halts automatically. The final vertex values are then written out, again in the manner of your choosing.

This is only a rudimentary example, but it gives you an idea of how versatile the programming model is. Algorithms like Google’s PageRank, designed to calculate the popularity of pages on the web, can easily be run on enormous graph data sets. Beyond its rich feature set, Giraph fits into a current Hadoop stack without any reconfiguration. Giraph applications are deployed through YARN, so all of the other analytics in your cluster can keep running alongside them. While implementing Giraph in an existing Hadoop environment may involve some initial development overhead, that overhead pays off once the business need calls for analyzing data sets with a graph-like structure.
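To give a sense of how an algorithm like PageRank maps onto the vertex-centric model, here is another self-contained, single-machine sketch (again, not Giraph API code). It assumes a fixed superstep count and no dangling vertices: on superstep 0 each vertex seeds its rank, and on later supersteps it combines incoming messages and shares its rank evenly among its out-links.

```java
import java.util.*;

public class PageRankSketch {
    // Pregel-style PageRank over an adjacency map of id -> outgoing
    // neighbor ids. Runs a fixed number of supersteps and returns
    // id -> rank, using the common 0.85 damping factor.
    static Map<Integer, Double> pageRank(Map<Integer, List<Integer>> adj, int supersteps) {
        int n = adj.size();
        Map<Integer, Double> rank = new HashMap<>();
        Map<Integer, Double> inbox = new HashMap<>();
        for (int step = 0; step <= supersteps; step++) {
            Map<Integer, Double> outbox = new HashMap<>();
            for (int v : adj.keySet()) {
                // Superstep 0 seeds the rank; later supersteps combine the
                // contributions received from in-links.
                double r = (step == 0) ? 1.0 / n
                        : 0.15 / n + 0.85 * inbox.getOrDefault(v, 0.0);
                rank.put(v, r);
                for (int u : adj.get(v)) {       // split rank among out-links
                    outbox.merge(u, r / adj.get(v).size(), Double::sum);
                }
            }
            inbox = outbox;                       // synchronization barrier
        }
        return rank;
    }
}
```

The per-vertex body is only a few lines; the framework handles partitioning, messaging, and the barrier, which is exactly why this style of algorithm scales to graphs with billions of edges.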

Have any questions on how to utilize Giraph? Feel free to ask below in the comments!

About Avalon Consulting, LLC

Avalon Consulting, LLC transforms data investments into actionable business results through the visioning and implementation of Big Data, Web Presence, Content Publishing, and Enterprise Search solutions. We are the trusted partner to over one hundred clients, primarily Global 2000 companies, public agencies, and institutions of higher learning.