October 20, 2015

It has been a little over a year since the Voldemort 1.9 release, and the team of contributors at LinkedIn has not been standing still. After 22 patch releases, 100 pull requests, 244 commits, 504 files changed, and some 20 thousand lines rewritten, it is time for us to unleash the next major milestone in Voldemort’s history, release 1.10!

Although it would be impossible to mention every change that went into the Voldemort 1.10 release, the two major areas of focus are performance improvements and new operations-focused features for the Hadoop-to-Voldemort pipeline.

1. Performance Improvements

The Voldemort code had some inefficiencies as well as a few vulnerable corner cases. Over the past year, a lot of code was rewritten, and then received some further fixes after running in production for a while. Below are details on the specific improvements included in this work, namely around resource utilization, garbage collection and asynchronous connection acquisition.

1.1 Lighter Resource Footprint

Previously, the Voldemort client/server communication code maintained two buffers for each connection – one for receiving and one for sending. This was inefficient, as only one of the two buffers was in use at any given time for a given connection. The code was rewritten so that a single buffer is shared between the two directions. With this change, the memory footprint of the client-side and server-side connection buffers was cut in half.
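A minimal sketch of the idea (illustrative Python, not Voldemort's actual Java NIO code): since a request/response connection only ever reads or writes at one moment, a single buffer can serve both roles.

```python
class Connection:
    """Toy connection sharing one buffer for both directions.

    In the old design, each connection held a separate receive buffer
    and send buffer; since only one direction is active at a time,
    a single shared buffer halves the memory footprint.
    """
    def __init__(self, buffer_size=8192):
        self.buffer = bytearray(buffer_size)  # shared for rx and tx

    def fill_for_send(self, payload: bytes) -> int:
        n = len(payload)
        self.buffer[:n] = payload      # the buffer is used for sending...
        return n

    def read_into(self, data: bytes) -> int:
        n = len(data)
        self.buffer[:n] = data         # ...and then reused for receiving
        return n
```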

On the server-side, a few changes also went in to use fewer file descriptors. Read-Only stores now take up half of the file descriptors they previously consumed. There is also a new heartbeating mechanism to detect and release dead connections, instead of letting them pile up.
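The heartbeating mechanism can be sketched as an idle-connection reaper (illustrative Python; the real implementation is part of Voldemort's Java server code): each connection records its last-activity time, and a periodic sweep closes connections that have gone silent.

```python
import time

class ConnectionReaper:
    """Toy idle-connection reaper.

    Connections that have been silent longer than the timeout are
    released, instead of letting dead connections pile up and hold
    file descriptors indefinitely.
    """
    def __init__(self, idle_timeout_s=30.0):
        self.idle_timeout_s = idle_timeout_s
        self.last_active = {}  # connection id -> last-activity timestamp

    def touch(self, conn_id, now=None):
        """Record activity (e.g. a heartbeat or a real request)."""
        self.last_active[conn_id] = time.monotonic() if now is None else now

    def sweep(self, now=None):
        """Release every connection idle longer than the timeout."""
        now = time.monotonic() if now is None else now
        dead = [c for c, t in self.last_active.items()
                if now - t > self.idle_timeout_s]
        for c in dead:
            del self.last_active[c]  # a real server would close the socket here
        return dead
```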

1.2 Reduced Garbage Collection

There were additional opportunities to optimize how buffers were being resized. The previous implementation was fairly naive and simply doubled the buffer each time it wasn’t big enough. During each doubling, the content of the previous buffer would need to be copied into the new one, which resulted in lots of garbage to collect. The new code instead determines the correct buffer size before sending data across the network. Another change was made to reuse buffers when dealing with Avro serialization. These changes resulted in an 80% reduction of memory allocation, which makes the new Voldemort client much more GC-friendly.
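The difference between the two strategies can be sketched as follows (illustrative Python; the actual work happens in Voldemort's Java serialization path): the naive version doubles and copies, generating garbage, while the new version computes the total size up front and allocates exactly once.

```python
def naive_grow_write(chunks):
    """Old approach (sketch): double the buffer whenever it's too
    small, copying old contents each time and generating garbage."""
    buf, size, copies = bytearray(16), 0, 0
    for chunk in chunks:
        while size + len(chunk) > len(buf):
            new = bytearray(len(buf) * 2)
            new[:size] = buf[:size]    # copy => garbage for the collector
            buf = new
            copies += 1
        buf[size:size + len(chunk)] = chunk
        size += len(chunk)
    return bytes(buf[:size]), copies

def presized_write(chunks):
    """New approach (sketch): determine the correct buffer size before
    writing, so nothing is ever copied or discarded."""
    total = sum(len(c) for c in chunks)
    buf = bytearray(total)
    pos = 0
    for chunk in chunks:
        buf[pos:pos + len(chunk)] = chunk
        pos += len(chunk)
    return bytes(buf), 0
```

Both produce identical output, but the pre-sized version performs zero intermediate copies.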

On the server-side, a similar inefficiency was discovered where buffers were not being reused properly inside the Hadoop client library. Upgrading to a more modern Hadoop version fixed the issue, which significantly sped up our Read-Only data fetches from HDFS – by about five times! Fetches were so fast, in fact, that we worried about saturating the network and ended up enabling our throttling feature (more on that below).

1.3 Asynchronous Connection Acquisition

The Voldemort client uses a limited number of NIO Selectors which operate over many channels. When done properly, this type of architecture minimizes CPU context switching, thereby enabling the system to achieve lower latency and higher throughput. Unfortunately, there were a few problems in the original Voldemort implementation.

Connection acquisition was handled in a blocking fashion, rather than asynchronously, which resulted in unnecessary delays in the processing of other in-flight requests. The extra delay would cause unrelated requests to time out, which would cause their connections to be torn down, resulting in a vicious cycle of expensive connection reestablishment. Even worse, if too many requests to a single server node timed out, the client-side Failure Detector would conclude that the node was completely down, causing all connections to it to be torn down and compounding the problem even further when those connections were reestablished later. Clearly, there was a lot of room for improvement, so most of that code was rewritten, and connection acquisition is now handled asynchronously.

The new code enables high-throughput applications to proceed smoothly. It also isolates failures, so that legitimate timeouts on one connection do not affect other connections.

This was mainly a problem for zoned clusters, since remote data centers tend to have longer connection establishment times, which compounded the problem described above.
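The asynchronous scheme can be sketched like this (illustrative Python; Voldemort's real implementation lives in its Java NIO client): a request that finds no idle connection registers a callback and parks, rather than blocking the selector thread, and the callback fires once the non-blocking connect completes.

```python
from collections import deque

class AsyncConnectionPool:
    """Sketch of asynchronous connection acquisition.

    Instead of blocking while a connection is established, a request
    registers a callback and the selector thread keeps servicing other
    in-flight traffic; the callback fires when the connect completes.
    """
    def __init__(self):
        self.idle = deque()     # established, currently unused connections
        self.waiters = deque()  # callbacks waiting for a connection

    def acquire(self, callback):
        if self.idle:
            callback(self.idle.popleft())   # connection ready: use it now
        else:
            self.waiters.append(callback)   # park the request; don't block

    def on_connect_complete(self, conn):
        """Invoked by the selector when a non-blocking connect finishes."""
        if self.waiters:
            self.waiters.popleft()(conn)    # hand it to the oldest waiter
        else:
            self.idle.append(conn)
```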

1.4 Performance Comparison

Now, the crunchy part you’ve all been waiting for. How has performance improved in the 1.10 release, given all of the above memory footprint, garbage collection, and asynchronous connection work? We are pleased to report that it improved a lot, especially in the 95th and 99th percentiles of latency!

The graphs below show the 95th and 99th percentiles of latency, as measured from the client’s perspective. In order to properly interpret and fully appreciate this benchmark, you have to consider the following facts:

The metrics shown below come from Voldemort clients deployed on 20 different servers, and the metrics shown here are their maximum (worst) p95/p99 latencies. So this is not a lucky server having a good day; quite the opposite: it is the single slowest server for any given minute.

The Voldemort clients are interacting with one of our Voldemort Read-Only clusters, which is composed of 40 nodes. This is not a cluster dedicated to this use case, it is heavily multi-tenant, hosting upwards of 90 stores, with an average total throughput of 20K QPS and a peak of 56K total QPS.

The new code is shown in green, overlaid with the old code in pink. The overlay is offset by exactly two weeks, so the same days of the week are being compared. The throughput is comparable across the two time periods.

The x axis spans three full weekdays.

The unit of the y axis is milliseconds. Numbers suffixed with “m” mean microseconds.

1.4.1 95th Percentile Latency Comparison

In this graph, we see that the average 95th percentile latency is comparable between the old and new code, sitting at a little over 800 microseconds. The worst-case 95th percentile latency over the course of the three days observed, however, is spikier in the old code, going up to about 2.5 milliseconds.

With the new code, we can comfortably claim to provide sub-millisecond 95th percentile latency, which we couldn’t do before.

This kind of latency is typically sufficient for even the most stringent use cases, and does not require putting an in-memory cache in front of the Voldemort cluster! Voldemort itself is serving a dataset which does not fit in memory, and still manages to achieve these impressive numbers. In LinkedIn’s Voldemort deployments, the index fits in memory, but the actual data is looked up from solid state drives.

1.4.2 99th Percentile Latency Comparison

In this graph, we see that the average 99th percentile latency is significantly better, going down from 9 milliseconds to 1.5 milliseconds. The worst-case 99th percentile is also significantly better, going down from 101 milliseconds to 28.5 milliseconds, during the three days observed.

We believe that the main contributor to this is the improved memory allocation, which significantly cut down garbage collection time in the server.

2. Hadoop to Voldemort Read-Only Pipeline Improvements

Voldemort comes in two main flavors: Read-Write, which is backed by a mutable data store (typically BDB-JE), and Read-Only, which bulk loads immutable data sets from Hadoop through a process known as Build and Push.

Voldemort Read-Only is a solid choice for taking data which was computed offline in Hadoop and serving it to online applications with very low latency requirements. At LinkedIn, there are hundreds of stores spread across a handful of Voldemort Read-Only clusters. In the open-source community, a significant portion of the engagement is for the Read-Only functionality, probably because of the increasing popularity of the Lambda Architecture.

Over the past year, a large number of bug fixes and new features went into the Hadoop-to-Voldemort pipeline. The main areas of improvement are around bandwidth, reliability, and usability.

2.1 Bandwidth Optimization and Control

Although bandwidth is often considered to be a fairly cheap commodity, there are cases where this assumption does not hold. One example arises when transferring large data sets to data centers half a world away through expensive trans-oceanic pipes. In order to cope with the evolving needs of the business, we have introduced several improvements to minimize and level off bandwidth requirements, including better compression, throttling, and parallelism.

2.1.1 Block-level Compression

In order to reduce bandwidth requirements, Build and Push now supports block-level compression. When enabled, the Build and Push job instructs the reducer tasks at the end of the build phase to output compressed files to HDFS. As the Voldemort servers fetch those files over the network, they decompress them on the fly, trading network IO for CPU cycles, before persisting them uncompressed to local storage. The resulting bandwidth reduction varies quite a bit, since some stores compress better than others. At LinkedIn, we observed an overall reduction of 18% across all stores, weighted by store size.
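The fetch-side decompression can be sketched as a streaming pipeline (illustrative Python standing in for the Java fetcher): compressed bytes come off the wire, are inflated block by block, and land uncompressed on local storage.

```python
import gzip
import io

def fetch_and_store(compressed_stream, out):
    """Sketch of a fetch that trades network IO for CPU cycles: bytes
    arrive gzip-compressed and are decompressed on the fly before
    being persisted uncompressed to local storage."""
    with gzip.GzipFile(fileobj=compressed_stream, mode="rb") as gz:
        while True:
            block = gz.read(64 * 1024)   # decompress one block at a time
            if not block:
                break
            out.write(block)             # stored uncompressed

# Simulating the pipeline end to end:
original = b"voldemort " * 1000
wire = io.BytesIO(gzip.compress(original))  # what travels over the network
disk = io.BytesIO()                         # stands in for local storage
fetch_and_store(wire, disk)
```

Repetitive data like this compresses extremely well; real store payloads vary, hence the 18% average reduction cited above.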

This feature is off by default, and can be enabled with the server-side configuration setting: readonly.compression.codec=GZIP

2.1.2 Fetch Throttling

One of the nightmares of Net Ops folks is trying to plan around sporadic, extremely spiky data transfers, such as those caused by Voldemort’s Build and Push jobs. Although Voldemort already had some throttling code, it was fairly primitive, allowing spikes to sneak in under certain circumstances. It was fully rewritten to leverage Tehuti, the metrics library extracted from Kafka. The rewrite also allowed us to ensure that throttling worked properly in combination with block-level compression.

The feature is off by default, and can be enabled with the server-side configuration setting: fetcher.max.bytes.per.sec=1337
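A simple sketch of byte-rate throttling in the spirit of fetcher.max.bytes.per.sec (illustrative Python; the real implementation uses windowed rate metrics from the Tehuti library): track bytes transferred in the current window and tell the caller how long to back off when the rate is exceeded.

```python
class Throttler:
    """Toy byte-rate throttler using one-second windows."""
    def __init__(self, max_bytes_per_sec):
        self.max_bytes_per_sec = max_bytes_per_sec
        self.window_start = 0.0
        self.bytes_in_window = 0

    def throttle(self, n_bytes, now):
        """Record n_bytes transferred at time `now` (seconds); return
        how long the caller should sleep to stay under the rate."""
        if now - self.window_start >= 1.0:
            self.window_start, self.bytes_in_window = now, 0  # new window
        self.bytes_in_window += n_bytes
        excess = self.bytes_in_window - self.max_bytes_per_sec
        if excess > 0:
            return excess / self.max_bytes_per_sec  # seconds to back off
        return 0.0
```

In a real fetcher the returned value would be passed to a sleep, smoothing spikes into a steady transfer rate.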

2.1.3 Parallel Pushes

Pushes to multiple clusters now happen in parallel rather than sequentially. At LinkedIn, most of our Read-Only use cases get pushed to all our data centers, so this has significantly improved the end-to-end run time of Build and Push jobs. Moreover, when combined with fetch throttling, this allows us to have a steady stream of data going into each of our data centers, rather than one big sequential spike per destination, thus fulfilling the data ingestion in the same total time as before, but with reduced bandwidth usage which is spread out evenly over time.

Parallel pushes are the new default and cannot be turned off.
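The fan-out can be sketched as follows (illustrative Python; the real Build and Push job is Java, and push_fn stands in for the per-cluster fetch-and-swap):

```python
from concurrent.futures import ThreadPoolExecutor

def push_to_all(clusters, push_fn):
    """Sketch of parallel pushes: fan the same data set out to every
    destination cluster at once instead of one after the other."""
    with ThreadPoolExecutor(max_workers=len(clusters)) as pool:
        futures = {c: pool.submit(push_fn, c) for c in clusters}
        # Waiting on all results: total time is the slowest push,
        # not the sum of all pushes.
        return {c: f.result() for c, f in futures.items()}
```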

2.2 Reliability Improvements

A large number of teams at LinkedIn rely upon Voldemort Read-Only. Even though the data is computed offline and cannot be mutated in real-time, some use cases nonetheless have stringent data freshness requirements, which means Build and Push functionality cannot go down for any extended period of time. Besides squashing a number of minor bugs, significant effort also went into monitoring, storage quotas, and high availability.

2.2.1 Monitoring Hooks

In order to accommodate integration into third-party systems, Build and Push now supports hooks which can invoke arbitrary code at various points during the lifetime of the job (starting, building, pushing, etc.). At LinkedIn, there are three different internal services getting fed data points and notifications through this abstraction.

This feature can be leveraged by implementing the BuildAndPushHook interface or, for simple use cases, by extending the HttpHook abstract class.
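The shape of the abstraction can be sketched like this (illustrative Python; the real BuildAndPushHook interface is Java, and the method and stage names here are hypothetical): the job invokes every registered hook at each lifecycle stage, and hooks observe without driving the job.

```python
class BuildAndPushHook:
    """Pythonic sketch of the hook abstraction."""
    def invoke(self, stage):
        raise NotImplementedError

class LoggingHook(BuildAndPushHook):
    """Example hook that records lifecycle events; a real hook might
    instead POST data points to a monitoring service."""
    def __init__(self):
        self.events = []

    def invoke(self, stage):
        self.events.append(stage)

def run_job(hooks):
    """Sketch of the job's lifecycle notifying each hook in turn."""
    for stage in ("starting", "building", "pushing", "finished"):
        for h in hooks:
            h.invoke(stage)  # hooks observe, but do not drive, the job
```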

2.2.2 Storage Space Quotas

Sometimes, the challenges of scale are not just technical, but also organizational. The business needs for manageability and cost-efficiency drive us towards having fewer, larger, multi-tenant deployments. This in turn brings a host of other issues, like tenants tipping over a cluster with a bigger dataset than originally expected, or accidentally pushing to a cluster other than the one they were assigned. In order to cope with LinkedIn's culture of fast-paced development, we chose to address these stability concerns with the introduction of storage space quotas.

When enabled, the Build and Push job gathers measurements about the final (uncompressed) footprint that a push will take on each destination node. The server accesses this metadata before beginning to fetch, and decides whether to allow the push or not.
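The admission decision can be sketched as a simple check (illustrative Python; the real check is in the Java server, and the function and parameter names here are hypothetical): compare the push's final uncompressed footprint against the store's quota, falling back to a default as with default.storage.space.quota.in.kb.

```python
def admit_push(store, push_size_kb, quotas, default_quota_kb):
    """Sketch of the server-side admission check: allow the push only
    if its final uncompressed footprint fits within the store's quota
    (or the cluster-wide default when no per-store quota is set)."""
    quota = quotas.get(store, default_quota_kb)
    return push_size_kb <= quota
```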

The storage space quota limit can be controlled on a store-by-store basis via the vadmin.sh script.

A default quota can also be defined for all new stores being created with the following server-side configuration setting: default.storage.space.quota.in.kb=1337

2.2.3 Build and Push High Availability

Voldemort’s real-time client/server interactions are already highly available, capable of withstanding the failure of one or more nodes (depending on store replication settings). In contrast, Build and Push jobs fail when targeting a cluster with a failed node, thus preventing data from getting refreshed in a timely manner. Over the past few months, changes have been introduced across many releases in order to fix the various single points of failure in the Build and Push code.

The details of this new feature, which is also off by default, are beyond the scope of this article, but more info on how it works and how to enable it can be found here.

2.3 Usability Improvements

As mentioned above, Voldemort Read-Only is resurfacing as a popular choice for serving online data that was originally computed offline on Hadoop. With increased public interest, however, came increased scrutiny of how difficult it actually was to get up and running. To tackle this problem, we have been working with the community to make it easier to work with the Read-Only pipeline. For example, a lot of unhelpful logs and error messages have been removed or clarified.

On a funnier note, it seems that Voldemort has been operated at LinkedIn with Kerberos authentication enabled for so long that no one even realized it was broken without it! This was fixed, so people can now get up and running without going through the long and tedious process of setting up Kerberos.

Besides that, more significant changes revolve around the processes of building and running the Build and Push code.

2.3.1 Build Improvements

The build script has been improved in order to provide an uber JAR including all necessary dependencies shaded inside. This is especially useful with Hadoop 2, since Voldemort is still relying on older versions of certain dependencies which have since been upgraded in Hadoop, so shading them makes sure we have what we need, without any clash.

The uber JAR can be generated with the following Gradle task:

$ ./gradlew bnpJar

2.3.2 Run Script

The Build and Push job is somewhat tightly coupled with Azkaban, although this is an unnecessary coupling. In order to make it easier to get up and running, we are now providing a new main class which runs standalone. Build and Push, one of Voldemort’s old allies, is no longer a Prisoner of Azkaban.

An example script making use of the new decoupled main class can be found, relative to the project root, at:

$ bin/run-bnp.sh

Conclusion

Voldemort 1.10, with its performance boost and various operational improvements, packs quite a punch. Perhaps as impressive as the aforementioned tangible improvements are the much needed refactoring and code clean-up which also went into this release. All in all, the code base did not grow in size significantly, since a lot of the rewriting involved deleting redundant or unused code. In this regard, we are as proud of what made it into Voldemort as what was taken out.