Sharing thoughts on building or using distributed systems

Category Archives: compute

The last few blog posts explored the topic of counting unique items efficiently using two specific sketching techniques – HyperLogLogs (HLLs) and KMinHash. The underlying motivation of these techniques was to use probabilistic data structures for counting high cardinality data sets, with a focus on being efficient in both time and space, trading off some accuracy in the counts. For high cardinality data sets, this is a reasonable tradeoff in some domains. We saw how HLLs in Redis provided unique counts with an error of about 0.18% with a bounded 12KB memory size per key. We also saw how well additional operations like unions and intersections fared with HLLs, and how KMinHash provided a more accurate measure over HLLs for intersection operations.

One of the strong advantages of sketching techniques is their efficiency vis-a-vis time and space system measures. Therefore, while we concluded that KMinHash provided more accurate results over HLLs for intersections, it would be good to set it in context alongside a performance comparison so that tradeoffs can be made between accuracy and performance. The purpose of this blog post is to cover the system performance measures of the two methods, using Redis as a store for the counts.

Context for the performance tests

All tests involved two sets: Set A: 175,000 elements, Set B: 10,000 elements, and their intersection Set A n B: 7,500 elements. The elements were added to Redis data structures using Python client code.

The test code operated in two phases. The first added the elements to keys representing the HLL and KMinHash sets. Once all additions were completed, the second phase computed the intersection cardinality using HLL Inclusion/Exclusion principle and the KMinHash method, respectively. The details of the HLL based implementation have been covered in the first and second posts. The KMinHash algorithm has been covered in the third post. Readers can review those posts to familiarise themselves with the details.

The tests were performed on a MacBook Pro 1.6 GHz Intel Core i5 processor, 4 GB 1600 MHz DDR3 RAM. Redis version was 3.0.3 compiled from source, and started with default configuration (at least, as far as the performance related configuration goes). The test code used Python Redis client 2.10.3.

Counting with KMinHash

For KMinHash, recall that the algorithm was implemented using Redis sorted sets storing the IDs as items in the set sorted according to their hashes (which acted as scores). The add phase added/updated elements in the Redis sorted sets. The compute phase computed the Jaccard coefficient estimate using the algorithm described in post 3, and from there computed A n B cardinality.

At the time of computing the Jaccard coefficient, we should only consider the ‘k’ minimum values of the MinHash sets. However, the sets may have more than ‘k’ elements during the add phase. This gives us a knob to tradeoff between time and memory. For example, we could either keep exactly ‘k’ elements at all times thereby making sure that memory is bounded. This does require more operations in the add phase to ensure the cardinality is maintained (via a ZREM operation, for instance). The motivation to keep the memory bounded might come if we need to maintain a lot of such MinHash sets (say, for different dimensions being measured) and cumulatively, the amount of memory might shoot up very high. On the other hand, we could allow the memory to be slightly unbounded, but make the add operation very fast. This could be a valid strategy if the number of additions is going to happen very fast and saving on time is crucial.

Based on the above choices, I tried three different approaches for implementing KMinHash.

Optimise for time (time-optimised): Add multiple elements as a batch using the Redis pipeline mechanism, and truncate the batch to the size of ‘k’ once we have added the batch. Note that in this approach, the MinHash set’s size could grow beyond ‘k’ (depending on the size of the batch).

Here’s how the add phase looks using batch addition. Note the cardinality adjustment at the end of the batch.

Optimise for memory (mem-optimised): Bound the cardinality of the KMinHash sets to ‘k’ at add time itself. We do this by truncating the MinHash set to ‘k’ elements after any addition that potentially increases the set’s size. We maintain some state on the client – the current cardinality of the sorted set and the current max MinHash value. This state acts as a cache to help avoid some calls to the Redis server.

This is how the mem-optimised version looks. Note the cardinality adjustment after every add post ‘k’ elements. The local state is maintained in variables like elements_added and max_min_hash

Server side scripting (scripting): Redis has a mechanism to execute something akin to stored procedures of a database, by writing them using the LUA scripting language. In this method, define a LUA script that updates the KMinHash set keeping the cardinality bounded to ‘k’. Call this script in pipeline mode during the add phase using the Redis command EVAL or EVALSHA.

Here’s the Lua script that is loaded and executed in the Redis server process. Note how the cardinality is adjusted after every addition post ‘k’ elements. The difference with the mem-optimised approach is that all state is maintained in Redis itself.

The native support for HLL in Redis acts as an advantage and it should be intuitively clear that HLLs score better than KMinHash overall. So, this is not really to show whether KMinHash is better than HLL (which it is not), but to illustrate the comparative system measures for similar cardinality sets in both approaches, as also among the various KMinHash implementation strategies.

Time comparison

Things to measure here included the time indicators between HLL and KMinHash implementations, and also across the various KMinHash implementations. To compare various KMinHash implementations, the high level ‘time’ command was used. The tests were run multiple times to see stability of the time measures across different data sets. The results of the same are as below:

KMinHash – time-optimised: 8.5 seconds (average real time)

KMinHash – mem-optimised: 13.85 seconds (average real time)

KMinHash – scripting: 11.2 seconds (average real time)

Note that this time includes the add phase and compute phase; however, since the HLL addition and cardinality computation is fixed, the time difference is only accounted for by the various KMinHash strategies used.

To compare times between HLL and KMinHash specifically, the Python profiler cProfile was used and the cumulative time measured across individual calls. The results are as below:

HLL addition: 6.1 seconds

KMinHash addition – time-optimised: 7.2 seconds

KMinHash addition – mem-optimised: 12.9 seconds

KMinHash addition – scripting: 10.2 seconds

HLL intersection: 0

KMinHash intersection – time-optimised: 0.03 seconds

KMinHash intersection – mem-optimised: 0.04 seconds

KMinHash intersection – scripting: 0.02 seconds

Note that the times in the profiled runs don’t add up exactly to the measurements using the ‘time’ command. I suspect this could be due to the profiler overhead.

From the above, we can draw the following conclusions:

As expected, HLL performance is the best among all approaches in terms of time measures.

The best performance among KMinHash approaches is from the time-optimised approach, followed by the Lua scripting approach and finally by the mem-optimised approach. This is as expected.

The time-optimised approach is slower than the HLL approach by about 18%. In comparison, the slowest KMinHash approach (mem-optimised) is almost 100% slower.

The time difference for intersection computation is not significant to consider and hence additions is what should be considered for selecting an approach.

Memory comparison

In terms of memory, HLL is a very efficient data structure compared to sorted sets. There are probably parameters that can be tuned for optimising set memory as well, but these will likely cause some increased load on processing time. I did not consider this in my tests.

One thing to note is that the memory is bounded in both cases after the addition of all elements: 12KB for HLL, memory for max ‘k’ elements in KMinHash. The length of the objects used can be determined using the redis command DEBUG OBJECT <key-name>. The results after adding elements from a representative dataset to both HLL and KMinHash keys are as follows:

HLL key 175000 size: Serialized length 10491 bytes

HLL Key 10000 size: Serialized length 8526 bytes

KMinHash key 1 size: Serialized length 187530 bytes

KMinHash key 2 size: Serialised length 179048 bytes

One other thing that is relevant is how the memory grows as elements are added to the sets, as these spikes can cause pressure on Redis server memory when there are a lot of such sets. Referring to the various approaches for KMinHash, we see that the time-optimised approach can add more than ‘k’ elements as they are added in batches. To measure this, I ran the redis command INFO periodically and monitored the used_memory_peak_human value as the add phase was in progress. The results are as follows:

KMinHash – time-optimised: 67.2 MB

KMinHash – mem-optimised: 62.2 MB

KMinHash – scripting: 62.1 MB

With the Lua scripting technique, there is also an increase in the Lua memory used by Redis (used_memory_lua) to about 50176 bytes compared to the default of 36864 bytes.

My first implementation of the time-optimised technique adjusted the cardinality of the KMinHash set to ‘k’ only when the intersection cardinality was computed (sort of a lazy approach), instead of adjusting it after every batch addition. With this approach, the used_memory_peak_human value rose as high as 125.86 MB.

From the above, we can draw the following conclusions:

Memory used by KMinHash is an order of magnitude more than that used by HLL.

The mem-optimised approach is only marginally better in used_memory_peak compared to the time-optimised approach.

A lazy time-optimised approach that clears memory only at the end of the add phase does significantly increase memory consumption – almost 100% more than the optimised cases.

Conclusion

HLL is superior to KMinHash based implementations by a reasonable margin from a performance perspective, which is expected given that it is a highly optimised implementation in the Redis server. However, for the accuracy gains of KMinHash, the penalty doesn’t seem too high. Given that the time-optimised approach gives the best time performance, with only marginally weaker memory performance, it is possibly the best implementation overall for KMinHash. So, it could well be something that is implemented along side HLLs to provide an efficient and accurate unique value counting solution in a BigData analytics system.

While I have tried to optimise the code as much as I could, I might not have got everything completely right, as my Redis knowledge isn’t too high. If anyone has suggestions to improve this implementation, or alternate ideas, I request readers to please post those in comments for the benefit of all.

This post originally appeared in my older blog. I am carrying over a few of those posts which seem to be popular.

On the Hadoop users mailing list, I was recently working with a user on a recurring problem faced by users of Hadoop. The user’s tasks were running out of memory and he wanted to know why. The typical approach to debug this problem is to use a diagnostic tool like jmap, or hook up a profiler to the running task. However, for most Hadoop users, this is not feasible. In real applications, the tasks run on a cluster to which users do not have login access and they cannot debug the task as it runs. I was aware of an option that is provided by the JVM – -XX:+HeapDumpOnOutOfMemoryError that can be passed to the task’s Java command line. This option makes the JVM to take a dump when the task runs out of memory. However, the dump is saved by default to the current working directory of the task, which in Hadoop’s case is a temporary scratch space managed by the framework and it would become inaccessible once the task completes.

At this point, when we were pretty much giving up on options, Koji Noguchi, an ex-colleague of mine and a brilliant systems engineer and Hadoop debugger, responded with a way out. You can read about it here. The few lines mentioned feel almost like magic, so I thought I would write an explanation about how it is working, for people who are interested.

The requirement is to save the dump to a location which is accessible by the user running the task. Such a location, in Hadoop’s case, is … HDFS. So, what we want is to be able to save the dump to HDFS when the scenario happens.

To get there, the first observation is that the JVM offers hooks that can be passed to the task’s command line, for us to act on an OutOfMemory scenario. The options Koji used are:

These options instruct the JVM to take a dump on OutOfMemory, save it with a name of our choice, and importantly, run a custom script . It should be obvious now that the custom script can copy the generated dump to DFS, using regular DFS commands.

There are a couple of gaps to plug though – as the devil is in the details. How does the script (copy-dump.sh, in the example above) get to the cluster nodes ? For that, we use Hadoop’s feature – Distributed Cache. This feature allows arbitrary files to be packaged and distributed to cluster nodes where tasks require them – using HDFS as an intermediate store. The cached files are usually available to tasks via a Java API. However, since we need this outside the task’s context, we use a powerful option of the Distributed Cache – creating symlinks. This option not only makes the files available to the tasks via the API, but also creates a symbolic link of that file into the current working directory of the task. Hence, when we refer to the script in the task’s command line, we can refer to it relative to the current working directory of the task, i.e. ‘.’.

The last detail in the whole solution is about how to save the dump on HDFS with a name that is unique. Because, realize that multiple tasks are running together at the same time, and more than one of them could run out of memory. Koji’s scripting brilliance for solving this problem was to use the following script:

Make unique path in HDFS for every attempt

1

hadoop fs-put heapdump.hprof/tmp/heapdump_hemanty/${PWD//\//_}.hprof

The expression ${PWD//\//_} takes the current working directory of the task (from the environment), and replaces every occurrence of ‘/’ with an ‘_’. Nice !

So, using these options, features and diagnostics of Hadoop and the JVM, users can now get memory dumps of their tasks to locations that they can easily access and analyse. Thanks a lot to Koji for sharing this technique, and happy debugging !!!

This post is carried over from my earlier blog site here. I am migrating posts that seem to have gathered most hits there.

Hortonworks has been blogging about a framework called Tez, a general purpose data processing framework. Reading through the posts, I was reminded of a similar framework that had come from Microsoft Research a while back called Dryad. This blog post is an attempt at comparing them.

In order to structure the comparison, I am trying to express the points under the following topics: historical perspective, features, concepts, and architecture.

Historical Perspective

Both Tez and Dryad define distributed, data parallel computing frameworks that lay an emphasis on modelling data flow. A data processing ‘job’ in either is defined as a graph. The vertices of the graph represent computational processes, with the edges connecting them describing input they receive and output they send out from / to other computational vertices or data sources / sinks. Both systems attempt to provide an efficient execution environment for running these jobs, abstracting users away from needing to handle common distributed computing requirements such as communication, fault tolerance, etc.

At the time of its introduction, Dryad was possibly Microsoft’s view on how to build such a framework from ground up. In contrast to Hadoop, Dryad attempted even then to provide a framework that wasn’t restricted to just one model (MapReduce) of computation. Dryad was inspired by a variety of data processing systems including MPP databases, data parallel programs on GPUs, and MapReduce as well. It attempted to build a system that could express all these kinds of computation.

Tez was introduced as a generalisation of the MapReduce paradigm that had dominated Hadoop computation for several years. However, it seems to be inspired more by data flow frameworks like Dryad. It was enabled immensely by the separation of concerns brought to the Hadoop MapReduce layer in the form of Apache YARN, that separated cluster resource management from distributed job management, enabling more models than just MapReduce. A direct motivation for Tez was the Stinger initiative, launched to build a faster version of Apache Hive. Specifically, the idea was to enable expressing a HQL query as a single Tez job, rather than multiple MapReduce jobs, thereby avoiding the overhead of launching multiple jobs and also incurring the I/O overhead of having to store data between jobs on HDFS.

Features

Tez and Dryad share several features, such as:

The DAG model being the specification choice for a job

A flexible / pluggable system where the framework tries to give the user control of the computation, nature of input and output, etc.

Supporting multiple inputs and outputs for a vertex (that enable SQL like joins to be expressed, and various forms of data partitioning like the shuffle sort phase of Hadoop MapReduce)

An ability to modify the DAG at runtime based on feedback from executing part of the graph. The runtime modification is primarily used for improving the efficiency of execution in both systems. For e.g. in Dryad, this was used to introduce intermediate aggregator nodes (akin to the combiner concept in Hadoop MapReduce), while in Tez, this is being used as a way to optimise the number of reducers or when they would get launched.

Dryad was built from ground up without a supporting resource management or scheduling framework, and some of its ‘features’ are present in or shared by other layers of the Hadoop stack like YARN. In addition to those, Dryad allowed one specific optimisation through which processing nodes can execute concurrently, co-located and connected via shared memory or pipes.

Tez on its hand, expands on learnings from the Hadoop MapReduce framework. For example, it expands on a feature available with MapReduce called JVM reuse, whereby ‘containers’ launched to run the vertex programs of Tez can be reused for multiple Tez tasks. It even allows sharing data between these tasks via an ‘Object Registry‘ without needing to have them run concurrently.

Concepts

Naturally, the core concepts of a Graph are common between the systems.

In Tez:

A vertex is defined by the input, output and processor associated with it.

The logical and physical manifestations of a graph are explicitly separated. Specifically, the inputs and outputs are of two types – a physical type and a logical type. The logical type describes the connection between a vertex pair as per the DAG definition, while the physical type will represent the connection between a vertex pair at runtime. The Tez framework automatically determines the number of physical instances of a vertex in a logical graph.

Edges are augmented with properties that relate to data movement (for e.g. multicast output between connected vertices), scheduling (co-schedule, or in sequence) and data source (persistence guarantees on the vertex’s output). Tez expects that by using a combination of these properties, one can replicate existing patterns of computation like MapReduce.

In addition to the graph concepts, there is also the concept of an ‘event’. Events are a means for the vertices and the framework to communicate amongst themselves. Events can be used to handle failures, learn about the runtime characteristics of the data or processing, or indicate the availability of data.

In Dryad:

Inputs and outputs are considered vertices just like processing vertices.

Dryad represents the logical representation of the DAG as a set of ‘stages’. However, this does not seem to be a first class concept to specify the DAG at definition time. Specifically, Dryad expects the specific number of instances of a vertex at runtime to be defined at definition time.

A lot of operators are defined which help to build a graph. For instance:

Cloning: is an operation by which a given Vertex is replicated. Such a cloning operation is used to define a physical manifestation of a graph.

Composition: is used to define types of data movement patterns (akin to the edge property in Tez) like round robin data transfer, scatter-gather etc.

Merge: is used for defining operations like fork/join etc.

Encapsulation: is a way of collapsing a graph into a single vertex, which makes it execute on a single node – used to express concurrent, co-located execution.

It appears the idea behind the operators is again to try and define patterns of computation like MapReduce.

A ‘channel’ is an abstraction of how data is transferred along an edge. There is support for different types of channels like File, Shared Memory, Pipes etc. This is similar to the physical Input/Output types in Tez.

Architecture

Tez is a YARN application. A Tez job is coordinated by the Tez Application Master (AM). It is comprised of Tez tasks. Each task encapsulates a processor (vertex) of the DAG and all inputs and outputs connected to it. A Tez task is launched within a YARN container. However, in the interest of providing good performance, a single YARN container could be reused for multiple Tez tasks. This is managed by a ‘TezTask’ host. The host also manages a store of objects that can be shared between Tez tasks that run within the host.

The Tez Application Master has a Vertex Manager plugin (that can be customised by the developer) for every type of Vertex. In addition, the AM also maintains a Vertex State Machine. As the state of the DAG changes, the Vertex Manager is invoked by the Application Master, who can then act on the State machine to customise the graph execution.

Another point to note is that Tez relies on YARN’s resource manager and scheduler for initial assignment of containers, etc. However, it has the ability to make the scheduling a two level activity. For example, Tez does come with scheduling capabilities, which it uses for features like container reuse.

Dryad’s architecture includes components that do resource management as well as the job management. A Dryad job is coordinated by a component called the Job Manager. Tasks of a job are executed on cluster machines by a Daemon process. Communication with the tasks from the job manager happens through the Daemon, which acts like a proxy. In Dryad, the scheduling decisions are local to an instance of the Dryad Job Manager – i.e. it is decentralised.

The logical plan for a Dryad DAG results in each vertex being placed in a ‘Stage’. The stages are managed by a ‘Stage manager’ component that is part of the job manager, similar to the Vertex Manager in Tez. The Stage manager is used to detect state transitions and implement optimisations like Hadoop’s speculative execution.

Conclusion

Dryad was discontinued by Microsoft in late 2011. Microsoft has since been contributing to Hadoop. Given the similarities between the two systems, a question is about how Tez’s prospects are going to be different from Dryad. A few points that seem to favour Tez, IMO:

Tez rides on years of learning from Hadoop MapReduce and other systems including Dryad. Microsoft recently posted that it contributes to Tez. The expectation then would be that the insights and learnings from systems (including what did not work) will help build a better system.

The separation of concerns brought about by YARN potentially helps Tez to focus on problems specific to the graph processing model, while delegating resource management and scheduling decisions to another layer – at least partially.

The API for Graph construction in Tez appears a lot simpler and intuitive to understand than the corresponding one in Dryad. Hence, it seems easier to adopt the model from a programmer perspective.

Given Tez was launched with a specific initiative of making Hive faster, there is a goal it is working towards, and there seems to already be evidence that Tez is enabling improvements in Hive as shown here.

Personally, I feel it would be good to have Tez succeed and several people who have invested in Hive will be able to see huge improvements in performance from their existing applications.

Acknowledgements

Most of the information for this post has come from the publicly available knowledge in blog posts and published paper. If there is any omission or mis-representation, please do let me know !

An initial draft of this post was reviewed by a few committers at Hortonworks: Siddharth Seth, Bikas Saha, Hitesh Shah and Vinod Kumar Vavilapalli. I am thankful to them for their feedback. While I have incorporated some of it, I felt some others are best explained from their end, possibly as comments. I will notify them once the blog is published.

Specifically calling out two points:

Both Sid and Hitesh have called out that there are going to be additional changes to the architecture and features in Tez that will soon be published. As this blog was being written, a new post came out from Hortonworks mentioning a new concept called Tez Sessions. So, be sure to watch out for Hortonworks blogs on Tez for more information.

Bikas provided feedback about Tez’s motivation being closer not just to systems like Dryad, but also other data flow systems like Hyracks and Nephele. It may be a good academic exercise to see these other systems as well from a perspective of learning.