Tuesday, July 19, 2011

Hadoop is great. It seems clear that it will serve as the basis of the vast majority of analytical data management within five years. Already today it is extremely popular for unstructured and polystructured data analysis and processing, since it is hard to find other options that are superior from a price/performance perspective. The reader should not take the following as me blasting Hadoop. I believe that Hadoop (with its ecosystem) is going to take over the world.

The problem with Hadoop is that its strength is also its weakness. Hadoop gives the user tremendous flexibility and power to scale all kinds of different data management problems. This is obviously great. But it is this same flexibility that allows the user to do incredibly inefficient things and not care, because (a) they can simply add more machines and use Hadoop's scalability to hide inefficiency in user code, and (b) they can convince themselves that, since everyone talks about Hadoop as being designed for "batch data processing" anyway, they can let their jobs run in the background and not worry about how long they will take to return.

Although not the subject of this post, an example of this inefficiency can be found in a SIGMOD paper that a bunch of us from Yale and the University of Wisconsin published 5 weeks ago. The paper shows that using Hadoop on structured (relational) data is at least a factor of 50 less efficient than it needs to be (an incredibly large number given how hard data center administrators work to squeeze out even a factor of two improvement in efficiency). As many readers of this blog already know, this factor of 50 improvement is the reason why Hadapt was founded. But this post is not about Hadapt or relational data. In this post, the focus is on graph data, and how, if one is not careful, using Hadoop can be well over a factor of 1000 less efficient than it needs to be.

Before we get into how to improve Hadoop's efficiency on graph data by a factor of 1000, let's pause for a second to comprehend how dangerous it is to let inefficiencies in Hadoop become widespread. Imagine a world where the vast majority of data processing runs on Hadoop (a not entirely implausible scenario). If people allow these factors of 50 or 1000 to exist in their Hadoop utilization, these inefficiency factors translate directly into factors of 50 or 1000 more power utilization, more carbon emissions, more data center space, and more silicon waste. The environmental consequences of a world where everyone standardizes on incredibly inefficient technology would be downright terrifying. And this is ignoring the impact on businesses in terms of server and energy costs, and lower performance. It seems clear that developing a series of "best practices" around using Hadoop efficiently is going to be extremely important moving forward.

Let's delve into the subject of graph data in more detail. Recently there was a paper by Rohloff et al. that showed how to store graph data (represented in vertex-edge-vertex "triple" format) in Hadoop, and perform sub-graph pattern matching in a scalable fashion over this graph of data. The particular focus of the paper is on Semantic Web graphs (where the data is stored in RDF and the queries are performed in SPARQL), but the techniques presented in the paper are generalizable to other types of graphs. This paper and the resulting system (called SHARD) have received significant publicity, including a presentation at HadoopWorld 2010, a presentation at DIDC 2011, and a feature on Cloudera's website. In fact, it is a very nice technique. It leverages Hadoop to scale sub-graph pattern matching (something that has historically been difficult to do); and by aggregating all outgoing edges for a given vertex into the same key-value pair in Hadoop, it even scales queries in a way that is 2-3 times more efficient than the naive way to use Hadoop for the same task.
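To make that aggregation idea concrete, here is a toy Python sketch (not SHARD's actual code; the triples and names are invented for illustration) of grouping all outgoing edges of each vertex under a single key, MapReduce-style:

```python
from collections import defaultdict

# Toy triples: (subject, predicate, object). In SHARD-style storage,
# all outgoing edges of a vertex are grouped under that vertex's key,
# so one key-value pair holds the vertex's whole adjacency list.
triples = [
    ("alice", "knows", "bob"),
    ("alice", "worksAt", "yale"),
    ("bob", "knows", "carol"),
]

def map_phase(triples):
    # Emit (subject, (predicate, object)) pairs, as a mapper would.
    for s, p, o in triples:
        yield s, (p, o)

def reduce_phase(pairs):
    # Group all outgoing edges for each vertex into a single record.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)

adjacency = reduce_phase(map_phase(triples))
# adjacency["alice"] == [("knows", "bob"), ("worksAt", "yale")]
```

With this layout, a query step that explores all edges leaving a vertex touches one record instead of scanning for triples scattered across many key-value pairs, which is where the 2-3x gain over the naive encoding comes from.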

The only problem is that, as shown by an upcoming VLDB paper that we're releasing today, this technique is an astonishing 1340 times less efficient than an alternative technique for processing sub-graph pattern matching queries within a Hadoop-based system that we introduce in our paper. Our paper, led by my student Jiewen Huang, achieves these enormous speedups in the following ways:

Hadoop, by default, hash partitions data across nodes. In practice (e.g., in the SHARD paper) this results in data for each vertex in the graph being randomly distributed across the cluster (dependent on the result of a hash function applied to the vertex identifier). Therefore, data that is close to each other in the graph can end up very far away from each other in the cluster, spread out across many different physical machines. For graph operations such as sub-graph pattern matching, this is wildly suboptimal. For these types of operations, the graph is traversed by passing through neighbors of vertexes; it is hugely beneficial if these neighbors are stored physically near each other (ideally on the same physical machine). When using hash partitioning, since there is no connection between graph locality and physical locality, a large amount of network traffic is required for each hop in the query pattern being matched (on the order of one MapReduce job per graph hop), which results in severe inefficiency. Using a clustering algorithm to graph partition data across nodes in the Hadoop cluster (instead of using hash partitioning) is a big win.
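A toy sketch of why this matters (illustrative only; the graph, machine count, and partition assignments here are invented, and the real system uses a proper graph partitioning algorithm rather than a hand-picked assignment): count the cross-partition "cut" edges, each of which implies network traffic during traversal.

```python
# Compare the number of cross-partition edges under hash partitioning
# vs. a locality-aware (graph) partitioning of the same toy graph.
edges = [(0, 1), (1, 2), (0, 2),   # a tight cluster: vertices 0-2
         (3, 4), (4, 5), (3, 5),   # another cluster: vertices 3-5
         (2, 3)]                   # one edge bridging the clusters

def cut_size(assignment, edges):
    # Count edges whose endpoints land on different machines.
    return sum(1 for u, v in edges if assignment[u] != assignment[v])

# Hash partitioning: vertex id modulo the number of machines (2 here).
hash_assign = {v: v % 2 for v in range(6)}

# Graph partitioning: keep each cluster on one machine, so only the
# bridging edge is cut.
graph_assign = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}

print(cut_size(hash_assign, edges))   # prints 5
print(cut_size(graph_assign, edges))  # prints 1
```

Every cut edge is a potential network round trip (and, in the SHARD-style approach, part of a MapReduce job) during sub-graph matching, so shrinking the cut directly shrinks the dominant cost.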

Hadoop, by default, has a very simple replication algorithm, where all data is generally replicated a fixed number of times (e.g. 3 times) across the cluster. Treating all data equally when it comes to replication is quite inefficient. If data is graph partitioned across a cluster, the data that is on the border of any particular partition is far more important to replicate than the data that is internal to a partition and already has all of its neighbors stored locally. This is because vertexes that are on the border of a partition might have several of their neighbors stored on different physical machines. For the same reasons why it is a good idea to graph partition data to keep graph neighbors local, it is a good idea to replicate data on the edges of partitions so that vertexes are stored on the same physical machine as their neighbors. Hence, allowing different data to be replicated at different factors can further improve system efficiency.
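A minimal sketch of selective boundary replication (again invented for illustration, not the paper's actual algorithm): after partitioning, replicate each boundary vertex onto every partition that holds one of its neighbors, so that a one-hop traversal never has to leave a machine.

```python
# A path graph 0-1-2-3-4 split across two partitions.
edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
assignment = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1}  # partition per vertex

# Start with each partition holding only its own vertices.
stored = {0: {0, 1, 2}, 1: {3, 4}}

for u, v in edges:
    pu, pv = assignment[u], assignment[v]
    if pu != pv:
        # u and v sit on the partition boundary: replicate each
        # across the cut so both machines see the full edge.
        stored[pv].add(u)
        stored[pu].add(v)

# Interior vertices (0, 1, 4) are stored once; only the boundary
# vertices (2 and 3) are replicated, not the whole data set.
```

The point is that replication cost is paid only where it buys locality: interior data keeps a low replication factor while boundary data gets a higher one.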

Hadoop, by default, stores data on a distributed file system (HDFS) or in a sparse NoSQL store (HBase). Neither of these data stores is optimized for graph data. HDFS is optimized for unstructured data, and HBase for semi-structured data. But there has been significant research in the database community on creating optimized data stores for graph-structured data. Using a suboptimal store for graph data is another source of tremendous inefficiency. By replacing the physical storage system with graph-optimized storage, but keeping the rest of the system intact (similar to the theme of the HadoopDB project), it is possible to greatly increase the efficiency of the system.
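As a rough illustration of what "graph-optimized storage" buys (a toy sketch; the triples are invented, and real RDF stores use far more sophisticated on-disk index structures): instead of scanning a flat triple file, as storage on HDFS effectively forces, index triples by (subject, predicate) so each traversal step becomes a direct lookup.

```python
from collections import defaultdict

# Toy triple set.
triples = [
    ("alice", "advisor", "dan"),
    ("bob", "advisor", "dan"),
    ("alice", "takesCourse", "db"),
]

# Build a (subject, predicate) -> [objects] index, in the spirit of
# the multi-way indexes that native graph/RDF stores maintain.
index = defaultdict(list)
for s, p, o in triples:
    index[(s, p)].append(o)

# A traversal step is now a direct lookup rather than a full scan:
print(index[("alice", "advisor")])  # prints ['dan']
```

The same query against an unindexed file would read every triple; against the index it reads only the matching posting list, which is the kind of constant-factor gap that compounds across every hop of a query.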

To a first approximation, each of the above three improvements yields roughly an order of magnitude (a factor of 10) speedup. By combining them, we therefore saw the factor of 1340 improvement in performance on the identical benchmark that was run in the SHARD paper. (For more details on the system architecture, partitioning and data placement algorithms, query processing, and experimental results, please see our paper.)

It is important to note that since we wanted to run the same benchmark as the SHARD paper, we used the well-known Lehigh University Benchmark (LUBM) for Semantic Web graph data and queries. Semantic Web sub-graph pattern matching queries tend to contain quite a lot of constants (especially on edge labels) relative to other types of graph queries. The next step for this project is to extend and benchmark the system on other graph applications (the types of graphs for which people tend to use systems based on Google's Pregel project today).
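To illustrate what "lots of constants on edge labels" means (the pattern and data below are invented, not actual LUBM queries): a typical Semantic Web pattern fixes the predicates and some objects, leaving only a few variables to bind, which a matcher can exploit.

```python
# Toy triple set, stored as a Python set for membership tests.
triples = {
    ("alice", "memberOf", "yale"),
    ("alice", "takesCourse", "db"),
    ("bob", "memberOf", "yale"),
}

# SPARQL-style pattern with constant edge labels and objects:
#   ?x memberOf yale . ?x takesCourse db
# Only ?x is a variable; the constants prune the search sharply.
matches = [
    s for s, p, o in triples
    if p == "memberOf" and o == "yale"
    and (s, "takesCourse", "db") in triples
]
# matches == ["alice"]  (bob does not take the "db" course)
```

Patterns in other graph workloads (e.g., social network analysis) often have far fewer constants, so it remains to be benchmarked how well the techniques carry over to those applications.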

In conclusion, it is perfectly acceptable to give up a little bit of efficiency for improved scalability when using Hadoop. However, once this decrease in efficiency starts to reach a factor of two, it is likely a good idea to think about what is causing it, and to find ways to avoid it (while keeping the same scalability properties). Certainly once the inefficiency extends well beyond a factor of two (such as the enormous factor of 1340 we discovered in our VLDB paper), the sheer waste in power and hardware cannot be ignored. This does not mean that Hadoop should be thrown away; however, it will become necessary to package Hadoop with "best practice" solutions to avoid such unnecessarily high levels of waste.

14 comments:

Hi Daniel, you mentioned in the post that neither of HDFS and HBase data stores are optimized for graph data. I'm looking at the problem of using HBase to graph mining. Is there any paper about measuring HBase performance for graph data? Thanks.

Mapper/reducer is meant for key/value-like processing, not for graph and network data processing. If you are using Hadoop for network data processing, you are barking up the wrong tree. For graph processing, try Neo4j, AllegroGraph, or Google's Pregel.

Hi Daniel, Very nice work! and good point about the characteristics of the benchmark data and queries. Not sure if you have seen this SIGMOD'11 paper on RDF benchmarks, they have some interesting observations there: http://dx.doi.org/10.1145/1989323.1989340

Hi Daniel, we came to the same conclusion in our work on parallel abstractions for machine learning. As a consequence we developed an alternative computational abstraction, called GraphLab, to represent algorithms on graph structured data. By changing both the implementation and computational model we were able to both improve the running time and even the theoretical performance of our learning algorithms. You can find out more about GraphLab at http://graphlab.org.

Thanks for the post! Educational. I certainly agree that one needs to pay close attention if your primary workload is running very inefficiently. Of course, if a workload is using a small fraction of your resources, runs infrequently, or is experimental, etc., then worrying about a factor of 2 or 10 might be premature optimization. I've found that the key is exposing your users to the economics of the resources they are using in multi-tenant environments.

It will be interesting to see what can be done using the next version of Hadoop, 0.23. That will support using custom execution frameworks within Hadoop clusters. If your graph fits in RAM, this should allow highly optimized solutions. If not, it seems like it could still take advantage of the sorts of tricks you describe, and those that HBase region servers use, to do very well.

1. The Partitioner is pluggable in Hadoop. You can create your own.
2. Replica placement is pluggable as well (not sure it's in 0.21(2), but it's definitely in the trunk).
3. HDFS provides a low-level generic storage abstraction, and you can build any graph-optimized solution on top of the HDFS API.

I have been using Hadoop for a season and recently I have had to manage graph datasets. I agree with previous comments that Hadoop is amazing for its purpose, but it is not the solution for every problem. It reminds me of a book (http://www.umiacs.umd.edu/~jimmylin/MapReduce-book-final.pdf) by Jimmy Lin explaining how to adapt MapReduce to graph problems. Why Hadoop instead of purpose-built solutions such as Neo4j?

However, your improvement seems really impressive, and thank you again for writing this post ;)

I am really interested in your latest publication as referenced in this blog. However, the URL http://cs-www.cs.yale.edu/homes/dna/papers/sw-graph-scale.pdf seems to be broken. Could you please fix that?

Daniel Abadi

About Me

Daniel Abadi is an Associate Professor at Yale University, doing research primarily in database system architecture and implementation. He received a Ph.D. from MIT and an M.Phil. from Cambridge. He is best known for his research in column-store database systems (the C-Store project, which was commercialized by Vertica), high performance transactional systems (the H-Store project, which was commercialized by VoltDB), and Hadoop (the HadoopDB project). Abadi has been a recipient of a Churchill Scholarship, an NSF CAREER Award, a Sloan Research Fellowship, the 2008 SIGMOD Jim Gray Doctoral Dissertation Award, and the 2007 VLDB best paper award. His research on HadoopDB is currently being commercialized by Hadapt, where Abadi also serves as chief scientist. He blogs at http://dbmsmusings.blogspot.com and tweets at http://twitter.com/#!/daniel_abadi.