Analyzing a Web graph with 129 billion edges using FlashGraph
This is a guest blog post by Da Zheng
Da Zheng is the architect and main developer of the FlashGraph project. He is a PhD student of computer science at Johns Hopkins University, focusing on developing frameworks for large-scale data analysis, particularly for massive graph analysis and data mining.
FlashGraph is a SSD-based graph processing framework for analyzing massive graphs. We have demonstrated that FlashGraph is able to analyze the page-level Web graph constructed from the Common Crawl corpora by the Web Data Commons project. This Web graph has 3.5 billion vertices and 129 billion edges and is the largest graph publicly available in the world. Thanks to the hard work of the Common Crawl and the Web Data Commons project, we are able to demonstrate the scalability and performance of FlashGraph as well as the graph algorithms designed for billion-node graphs.
You may ask why we need another graph processing framework while we already have quite a few, such as Pregel/Giraph, GraphLab/PowerGraph and GraphX. As pointed out by Frank McSherry in his blog 1 & 2, the current distributed graph processing frameworks have substantial overhead in order to scale out; we should seek performance and capacity (the size of a graph that can be processed). On top of the runtime overheads Frank McSherry mentions, these frameworks also have very large memory overhead. For example, as shown in the performance evaluation of the GraphX paper, Giraph cannot even process a graph with 106 million vertices and 3.8 billion edges in a cluster with aggregate memory of 1088 GB. The similar problem exist in others, as shown here. The large memory overhead prevents them from scaling to a larger graph or unnecessarily wastes resources.
FlashGraph seeks performance, capacity, flexibility and ease of programming at the moment when it was created. We hope FlashGraph can have performance comparable to the state-of-art in-memory graph engines while scaling to graphs with hundreds of billions of edges or even trillions of edges. We also hope that FlashGraph can express varieties of algorithms in FlashGraph and hide the complexity of accessing data on SSDs and parallelizing graph algorithms.
To scale graph analysis and achieve in-memory performance, FlashGraph uses the semi-external memory model, which stores algorithmic vertex state in memory and edge lists on SSDs. This model enables in-memory vertex communication while scaling to graphs that exceed memory capacity. Because vertex communication is the main source of computation overhead in many graph algorithms, it is essential to achieve in-memory performance in vertex communication. To optimize data access on SSDs, FlashGraph deploys two I/O optimizations: access edge lists only required by the applications; conservatively merge I/O requests to achieve higher I/O throughput and reduce CPU overhead caused by I/O.
The graph format used by FlashGraph is designed for both efficiency and flexibility. All graph algorithms in FlashGraph use the same graph format, so each graph only needs to be converted into the format once and to be loaded to SSDs once. FlashGraph stores both in-edges and out-edges in a graph image. In the Web graph, an out-edge is a hyperlink from a Web page to another page, and an in-edge is the reverse of a hyperlink. It is necessary to keep an edge twice for a directed graph because some graph algorithms require in-edges, some require out-edges and some require both. For efficiency, in-edges and out-edges of a vertex are stored separately. This reduces data access from SSDs if an algorithm requires only one type of edges.
FlashGraph provides a very flexible vertex-centric programming interface and supports varieties of graph algorithms. The vertex-centric programming interface allow programmers to “think like a vertex”: each vertex maintains some algorithmic state and performs user-defined computation independently. In FlashGraph, a vertex can communicate with any vertices through message passing and read edge lists of any vertices from SSDs. We have implemented a set of graph algorithms such as breadth-first search, PageRank, connected components and triangle counting. All of these graph algorithms implemented in FlashGraph can run on the page-level Web graph in a single commodity machine and complete at an unprecedented speed, as shown in the table below. The performance result also shows that FlashGraph has a very short initialization time even on this massive graph.
|Algorithm||Runtime (sec)||Init time (sec)||Memory (GB)|
|Weakly connected components||461||32||47|
|PageRank (30 iterations)||2041||33||46|
The more detailed design of FlashGraph is documented by the paper published at FAST’15.
We further explore community detection with FlashGraph on billion-node graphs. Here we detect communities with only active vertices. The activity level of a vertex is measured by a locality statistic (the number of edges in the neighborhood of a vertex). Again, we use the large Web graph to demonstrate the scalability and accuracy of our procedure. The key here is to quickly identify the most active vertices in a graph. Having these vertices, we further cluster them into active communities. In our experiment on the paper, we identify 2000 most active vertices in the Web graph and discover five communities. The sizes of community 1 to 5 are n1 = 35, n2 = 1603, n3 = 199, n4 = 42 and n5 = 121 respectively. Community 1 is a collection of websites that are all developed, sold or to be sold by an Internet media company networkmedia. Community 2 are all hyperlinks extracted from a single Pay-level-domain adult website. In the community 3, most links are social media websites and often used in our daily life such as WordPress.org and Google. Community 4 consists of websites related to online shopping such as the shopping giant Amazon and the bookseller AbeBooks. Community 5 is another collection of 121 adult web pages where each web page comes from a different Pay-level-domain in this cluster. In summary, top 5 active communities in the Web Graph are grouped with high topical similarities.
Active community detection is one application that demonstrates the power of FlashGraph. We are looking forward to seeing more cases that people use FlashGraph for mining massive graphs. We are happy to help others develop algorithms to explore the Web graph as well as other graphs of the similar size or even a larger size.