Running Presto over Apache Twill

April 3, 2014

Alvin Wang is a software engineer at Cask where he is building software fueling the next generation of Big Data applications. Prior to Cask, Alvin developed real-time processing systems at Electronic Arts and completed engineering internships at Facebook, AppliedMicro, and Cisco Systems.

Please note: Continuuity is now known as Cask, and Continuuity Reactor is now known as the Cask Data Application Platform (CDAP).

We open-sourced Apache Twillwith the goal of enabling developers to easily harness the power of YARN using a simple programming framework and reusable components for building distributed applications. Twill hides the complexity of YARN with a programming model that is similar to running threads. Instead of writing boilerplate code again and again for every application you write, Twill provides a simple and intuitive API for building an application over YARN.

Twill makes it super simple to integrate new technologies to run within YARN. A great example of this is Presto, an ad-hoc query framework, and in this blog, I’ll explain what it is and how we were able to make Presto run within YARN using Twill in a short period of time.

Why did we want to run Presto over Twill?

We wanted to add ad-hoc query capabilities to our flagship product, Continuuity Reactor. We looked at different frameworks and got started on experimentation with Presto because it is written in Java and is emerging as an important big data tool. The next question was how to integrate it? We opted to run Presto within YARN because it gives developers the flexibility to manage and monitor resources efficiently within a Hadoop cluster, and the capability to run multiple Presto instances.

We use Twill extensively in Reactor for running all services within YARN. So, in order for us to run Presto within Reactor, we had to integrate it with Twill.

What is Presto?

Prestois a distributed SQL query engine for running interactive analytic queries against data sources of all sizes ranging from gigabytes to petabytes. Last fall, Facebook open-sourced Presto, giving the world a viable, faster alternative to Hive, the data warehousing framework for Hadoop. Presto was designed for interactive analytics and approaches the speed of commercial data warehouses while scaling to the size of large organizations.

How does Presto work?

When executing a query, the query is sent to the coordinator through the command-line interface. The coordinator distributes the workload across workers. Each worker reads and processes data for their portion of the input. The results are then sent from the workers back to the coordinator, which would then aggregate the results to form a full response to the query. Presto works much faster than Hive because it doesn’t need to run a new MapReduce job for every query, as the workers can be left running even when there aren’t any active queries.

How did we integrate Presto with Twill?

First, we needed to run Presto services (discovery, coordinator, and worker) embedded in TwillRunnables, which posed a couple of challenges:

The public Presto distribution provides Bash scripts and Python scripts for running Presto services, but has no documented way to run in embedded mode.

Next, we needed to get Presto services to use an existing Hive metastore with the Hive connector so that Presto CLI can run queries against Hive tables. While Presto includes basic documentation for file-based configuration of the Hive connector, there isn’t any documentation on how to do it programmatically. To tackle this, we inspected the code that loads the Hive connectors. We found that ConnectorManager.createConnection() was setting up the connectors, but the ConnectorManager instance was a private field in CatalogManager, so we had to use reflection. While not ideal, it worked. In the future, we may contribute our source code to Presto to make it easier to embed in Java. The code we used to register the Hive connector is shown below:

Once we were able to run embedded Presto without Twill, we packaged Presto with all the dependency jars into a bundle jar file to avoid dependency conflicts. Then we simply configured a Twill application to run various instances of BundledJarRunnablethat were running the Presto services contained within the jar file. Below is a full example of a Twill application that runs Presto’s discovery service that is packaged within a jar file using BundledJarRunnable:

As you can see, once you have your application running from Java code, Twill makes it straightforward to write a Twill application that runs your code inside a YARN container.

Adding new features to Twill

During the process of getting Presto to run over Twill, we contributed a couple of new features to Twill to make it easier for anyone to implement applications that have similar needs: We’ve added support for running Twill runnables within a clean classloader and we’re currently working on allowing users to deploy Twill runnables on unique hosts. In the future, we plan to open-source our Presto work so that anyone can spin up their own Presto services in YARN, and we are also considering support for Presto in Reactor to speed up ad-hoc queries.

Apache Twill is undergoing incubation at the Apache Software Foundation. Help us make it better by becoming a contributor.