
Scaling Adventures: Redesigning Our Data Delivery Pipeline

We just finished re-architecting one of our core systems responsible for pushing massive amounts of data (over 250 TB / day) to our partners. The re-architecture aimed to improve throughput and long-term scalability, and it gave us an opportunity to reflect on some of the core challenges of building scalable systems. This post is an attempt to capture some of that reflection.

Problem

The LiveRamp data processing pipeline ends with packages of customer data that need to be sent to other marketing platforms. This project focused on ‘streaming deliveries’ – those that leverage our partners’ web APIs to send these packages via HTTP requests. The technical problem boils down to pulling batch data from the Hadoop Distributed File System (HDFS) and executing HTTP requests containing this data. The challenge comes from having to perform these operations at a scale of hundreds of TBs of data through close to a billion requests per day.

Original Solution

We originally solved this problem by relying on ActiveMQ as a buffer between the batch inputs and streaming outputs. A single process added to the queue in bulk while worker machines continuously pulled from the queues and executed the relevant requests. While this system served us well for over two years, we started to encounter scaling difficulties as our data delivery volume expanded.
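In-process, that pattern looks roughly like the sketch below, with a BlockingQueue standing in for the ActiveMQ broker and a thread pool standing in for the worker machines. All names here are hypothetical illustrations, not our actual code:

```java
import java.util.List;
import java.util.concurrent.*;

public class QueueBufferSketch {
    public static void main(String[] args) throws Exception {
        // The broker buffers between the batch producer and streaming consumers.
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        List<String> sent = new CopyOnWriteArrayList<>();

        // Workers continuously pull from the queue and "deliver" each datum.
        ExecutorService workers = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            workers.submit(() -> {
                try {
                    while (true) {
                        String datum = queue.take();
                        if (datum.equals("POISON")) return;   // shutdown signal
                        sent.add("delivered:" + datum);       // stands in for an HTTP request
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        // A single process enqueues a batch in bulk.
        for (int i = 0; i < 100; i++) queue.put("datum-" + i);
        for (int i = 0; i < 4; i++) queue.put("POISON");

        workers.shutdown();
        workers.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println(sent.size()); // prints 100
    }
}
```

The queue decouples the bursty batch producer from the steady streaming consumers, which is exactly the property that made a broker attractive in the first place.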

Increasing volumes of data caused gradually increasing latency in the system. Last year we witnessed a more than 5x increase in the amount of data we streamed to our partners – going from 115 million requests per day in January to 630 million a day in December. However, each delivery took 13 times longer to make it through the delivery process. We made numerous changes to make up for this: adding more queues, incorporating a prioritization mechanism to get important deliveries out, etc. However, these patches barely kept up with current delivery volume and were clearly going to prove inadequate as that volume grew. They were also contributing a non-trivial amount of code debt.

Re-architecture

It became clear that we needed to re-architect the streaming delivery system, and do so with a strong focus on performance. Our two goals for throughput and scalability were quite simple:

1. Max out each destination’s throughput. Our partners’ ability to receive requests, not our own pipeline, should be the limiting factor.

2. Scale linearly with the number of nodes performing deliveries. We wanted to be able to handle order-of-magnitude increases in the amount of data we deliver and in the number of destinations to which we deliver it.

In order to max out our partners’ ability to receive requests we needed to find bottlenecks in our own flow and fix them. This meant instrumenting our code and chasing down each source of inefficiency. We relied heavily on Dropwizard’s Metrics package to monitor the application. We used timers to measure the throughput and latency of specific blocks, counters to track the size of every in-memory queue, and a host of other metrics to isolate the performance of individual components. Their easy-to-use JMX module even allowed us to capture trends in memory and CPU usage. We were quickly able to tweak parameters, catch memory leaks, achieve optimal parallelization, and carve out a lean delivery pipeline.
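Metrics provides these timers and counters out of the box; the underlying pattern is simple enough to sketch with the standard library alone. This is an illustration of the idea, not the Metrics API itself, and all names are hypothetical:

```java
import java.util.concurrent.atomic.AtomicLong;

public class InstrumentationSketch {
    // Counter: tracks the size of an in-memory queue.
    static final AtomicLong queueSize = new AtomicLong();
    // Timer accumulators: total elapsed time and call count for one code block.
    static final AtomicLong totalNanos = new AtomicLong();
    static final AtomicLong calls = new AtomicLong();

    // Wrap the block under measurement; latency = totalNanos / calls.
    static void timedDelivery(Runnable block) {
        long start = System.nanoTime();
        try {
            block.run();
        } finally {
            totalNanos.addAndGet(System.nanoTime() - start);
            calls.incrementAndGet();
        }
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) {
            queueSize.incrementAndGet();                      // datum enqueued
            timedDelivery(() -> queueSize.decrementAndGet()); // stands in for one delivery
        }
        System.out.println("calls=" + calls.get() + " queue=" + queueSize.get());
    }
}
```

In practice the library also handles rate windows, histograms, and JMX export, which is why we reached for it instead of rolling our own.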

As important as it was to increase current throughput, it was just as important to do so in a way that would easily scale to handle many times the current load. Our definition of ‘easy to scale’ was to be able to handle an increase in data volume with an equivalent increase in the number of machines performing the deliveries; twice the machines should be able to handle twice the data. This meant making the data flow parallel from start to finish. The diagram below details our final architecture. The key take-away is that most of the work is done by the ‘streaming workers’ and the overall throughput of the system can be increased by just adding more of those.
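Under that definition, capacity planning becomes simple arithmetic: if each worker sustains a fixed request rate, the number of workers needed grows linearly with volume. The per-worker rate below is purely illustrative:

```java
public class LinearScalingSketch {
    // Workers needed if each sustains perWorkerPerDay requests (illustrative rate).
    static long workersNeeded(long requestsPerDay, long perWorkerPerDay) {
        return (requestsPerDay + perWorkerPerDay - 1) / perWorkerPerDay; // ceiling division
    }

    public static void main(String[] args) {
        long perWorker = 21_000_000L; // hypothetical per-worker daily throughput
        System.out.println(workersNeeded(630_000_000L, perWorker));   // prints 30
        System.out.println(workersNeeded(6_300_000_000L, perWorker)); // 10x the data -> prints 300
    }
}
```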

Each Streaming Worker (1) asks the job coordinator for a package to process, (2) pulls its assigned package directly from HDFS, one datum at a time, and then (3) sends an HTTP request for each datum.
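The loop each worker runs can be sketched as follows. The coordinator, the HDFS read, and the HTTP send are all stubbed with in-memory stand-ins, and every name here is hypothetical:

```java
import java.util.*;
import java.util.function.*;

public class StreamingWorkerSketch {
    public static void main(String[] args) {
        // (1) A coordinator hands out package IDs until none remain.
        Deque<String> packages = new ArrayDeque<>(List.of("pkg-a", "pkg-b"));
        Supplier<String> coordinator = packages::poll;

        // (2) Reading a package from HDFS one datum at a time is stubbed
        // with an in-memory map of package -> data.
        Map<String, List<String>> hdfs = Map.of(
            "pkg-a", List.of("d1", "d2"),
            "pkg-b", List.of("d3"));

        // (3) Sending an HTTP request per datum is stubbed with a list.
        List<String> delivered = new ArrayList<>();
        Consumer<String> send = delivered::add;

        String pkg;
        while ((pkg = coordinator.get()) != null) {   // (1) ask for a package
            for (String datum : hdfs.get(pkg)) {      // (2) stream its data
                send.accept(datum);                   // (3) one request per datum
            }
        }
        System.out.println(delivered); // prints [d1, d2, d3]
    }
}
```

Because workers share no state beyond the coordinator hand-off, adding a worker adds its full throughput to the system, which is what makes the scaling linear.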

Result

Overall, the project was a resounding success. We are comfortably maxing out our fastest destination, and we are confident we can maintain this throughput as both volume and the number of destinations continue to grow.