Frank McSherryThis is a guest blog post by Frank McSherry
Frank McSherry is a computer science researcher active in the area of large scale data analysis. While at Microsoft Research he co-invented differential privacy, and lead the Naiad streaming dataflow project. His current interests involve understanding and improving performance in scalable data processing systems.

The computer science systems and database research communities are abuzz with graph processing systems, large computer systems specialized to process and analyze very large graphs such as global social networks, internet topologies, and the hyperlink structure of the world-wide web. Academic researchers working in these areas have made significant strides in understanding how to process graph-oriented data, but their ability to validate their techniques at scale have been limited by their access to very large graph datasets. The data made available by Common Crawl and Web Data Commons provide an excellent first opportunity for these researchers to understand the performance of graph processing systems at scales that justify their complexity.

Background on graph processing

Graphs are an interesting source of data in which each record (an “edge”) references two distinct entities (called “nodes”). In the case of the web graph, for example, each hyperlink can be viewed as an edge with the source and destination web pages as the nodes it references. Many important social graph analyses are functions of the graph structure: the existence of edges between nodes imposes a constraint on the answer, and many computations can be viewed as maintaining or updating per-node state as a function of the incident edges (and the states of adjacent nodes).

Researchers have since built computational engines based around algorithms in which information flows only along graph edges. These systems distribute the large set of edges across multiple threads, processors, and even computers, and for each edge ensure that the information at each node is shared with the other node. To define a computation, a data analyst then supplies the code for what should happen with this information each time it is presented, for example updating the information maintained by each node to reflect what they have learned from others. The separation of computational logic from system specifics allows each system to effect scalable implementations of many popular graph algorithms, without worrying the programmer about issues of data distribution, network communication, or recovery in the presence of machine failures.

Evaluation at scale

As popular as graph processing systems have become, their evaluation has largely either been on small to medium size data sets, or behind the closed doors of corporate data centers. Evaluations on moderately sized data is not without its use, but it does fail to stress the systems in ways that could inform users, programmers, or administrators when faced with truly large data. Worse, from a research perspective, we shouldn’t expect to make progress in improving graph processing systems without relevant evaluation of when they perform well and when they perform badly.

For example, Gonzalez et al., OSDI 2014 evaluated four of the most popular recent systems used for graph processing —Spark, Giraph,GraphLab, and GraphX— on two graph datasets edges each containing more than one billion edges. Their performance numbers on a cluster of 16 machines, comprising 128 processing cores, reveal that their proposed system improved on prior work with the same features, without losing the performance of specialized systems.

Although one billion edges is quite a lot, the datasets still fit comfortably on a modern laptop. To demonstrate the limitations of evaluations on this size of data, we compared the performance numbers reported by Gonzalez et al. with the performance of the same algorithms executed on my laptop, using about one hundred lines of code. Eliding the specific details of the comparison, no graph processing system strictly out-performed the laptop, and some of the systems were consistently slower than the laptop. With a few improvements to the code, the laptop was able to out-perform all systems on all computations on all datasets evaluated by Gonzalez et al.

These conclusions surprised many people, who had assumed that the scalable systems would obviously out-perform a single core on a laptop. Indeed, the main point of these systems is that one can accomplish more by using multiple computers than one can accomplish using just a single computer. If the systems are not yet accomplishing this goal, doing something you couldn’t otherwise do, it is less clear when and why you would use them, especially given the additional resources (and complexity) they entail. Perhaps one simply should not use them, and instead do graph analysis on my laptop instead.

Scaling up evaluation

One common (and fair) response to our initial evaluation was that the datasets we used (those used by the GraphX evaluation) were not sufficiently large to distinguish between good scalable systems and bad scalable systems (including my laptop, presumably). We completely agree (and this was part of the point of using only existing measurements), but researchers are challenged to access to realistic data at meaningful scales, and the graph datasets used by the GraphX evaluation were among the largest publicly available datasets at the time. Given this difficulty, it is perhaps unfair to blame the researchers for the limits of their evaluation.

This situation has now changed with the data provided by Common Crawl and Web Data Commons (who have processed the former’s crawl data into a compact graph dataset. The graph data they provide is substantially larger, by almost two orders of magnitude, than the graph datasets used by Gonzalez et al. As the dataset is made freely available, researchers can use it to evaluate the improvement their systems provide over simpler single-computer alternatives. If the systems do not improve as much as expected, the researchers can now pin-point what is slow, and can either improve their system’s implementation or improve the ideas underlying its design. In either case, the state of the art can now advance in a more principled manner.

To get the ball rolling on evaluation, we took the graph data supplied by Web Data Commons, 128 billion edges relating 4 billion nodes, approximately one terabyte uncompressed, and evaluated my laptop’s performance on this dataset. This collection is substantially larger than other datasets, and is near the limit of what can be processed by my laptop. The data required some further transformation and compression both to fit on my laptop’s drive, and to avoid exhausting memory when the computations were executed. But, the computations do run, and they take time that is only slightly longer than what one might expect based only on the increase in the number of edges.

The laptop’s performance is good news for people who want to run graph computations at scale, and a healthy challenge (and opportunity) for implementors of graph processing systems. These measurements provide an excellent baseline and proof of concept for the scalable systems to target. A laptop can perform the large-scale computation, admittedly under duress, and so the scalable systems should be expected both to perform the computation and to improve on my laptop’s performance. If a system’s performance is lacking, a comparative evaluation should reveal which components are limiting the system, and they can be improved.

Advancing graph processing systems

It should be mentioned that there are already several graph processing systems that do improve on laptop-scale performance. Ligra is a shared-memory (single machine) system whose performance appears to scale quite well from a single-core baseline when graphs fit in to the computer’s random access memory. For larger graphs, FlashGraph uses an array of solid-state drives to provide performance without requiring nearly as much memory. Finally, Naiad distributes computation across multiple computers, but manages to avoid introducing much overhead when it does, scaling performance up from the appropriate single-machine baseline.

In my opinion, each of these systems represent progress in the state of the art, and we should understand what each have done well (and what each do badly). Given our access to these systems and their ideas, as well as sufficient data to evaluate and distinguish good ideas from bad, there is no obvious reason not to expect the state of the art in graph processing to advance significantly and swiftly. In fact, to the extent that we are serious about performance in graph processing systems, we should demand it.

Follow us @CommonCrawl on Twitter for the latest in Big Open Data. If you value Open Data, please make a donation to the Common Crawl Foundation.