Last year, Qubole introduced RubiX, a fast caching framework for big data engines. RubiX enables engines such as Presto, Hive, and Spark to cache data from a cloud object store on local disk for faster access.

Currently, 90% of Qubole's Presto customers have RubiX enabled by default in their clusters, and they are seeing significant performance gains on read-intensive workloads. Today, we are announcing support for RubiX on open-source Presto. In this post, we walk through how to configure an EMR Presto cluster to enable RubiX and show the performance gains we measured.

RubiX Architecture

RubiX works on the following basic principles.

Split Ownership Assignment

The entire file is divided into chunks, and with the help of consistent hashing, a node is assigned to be the owner of a particular chunk. RubiX uses this ownership information to provide hints to the respective task schedulers to assign tasks on those nodes.
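As a rough illustration of this idea (a sketch, not RubiX's actual code), a consistent-hash ring that assigns chunk keys to owner nodes can be written in Python:

```python
import hashlib
from bisect import bisect

class ConsistentHashRing:
    """Assign each file chunk a stable owner node, RubiX-style (illustrative sketch)."""

    def __init__(self, nodes, replicas=100):
        # Place several virtual points per node on the ring to even out ownership.
        self.ring = sorted(
            (self._hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(replicas)
        )
        self.keys = [h for h, _ in self.ring]

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def owner(self, chunk_key):
        # The owner is the first virtual point clockwise from the chunk's hash.
        idx = bisect(self.keys, self._hash(chunk_key)) % len(self.keys)
        return self.ring[idx][1]

ring = ConsistentHashRing(["node1", "node2", "node3"])
# The scheduler can hint that tasks reading this chunk run on its owner node.
owner = ring.owner("s3://bucket/file.orc#chunk-0")
```

Because ownership depends only on the hash ring, every node computes the same owner for a chunk without coordination, and adding or removing a node remaps only the chunks nearest its points.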

Block-Level Caching

The whole file chunk, as mentioned above, is divided into logical blocks of a configured size (default value of which is 1MB). The metadata of these blocks (whether cached or not) is kept in the BookKeeper daemon, and also checkpointed to local disk. When a read request comes in, the request is converted into logical blocks, and on the basis of metadata of those blocks, the system decides whether to read the data from the remote location (object store in this case) or from the cache of the local or non-local nodes.
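A minimal sketch of this read path, using the 1MB default block size and a made-up set of cached-block metadata (the names are illustrative, not RubiX's actual API):

```python
BLOCK_SIZE = 1024 * 1024  # RubiX's default logical block size: 1MB

def plan_read(offset, length, cached_blocks):
    """Convert a byte-range read into logical blocks and decide, per block,
    whether it is served from the local cache or fetched from the object store."""
    first = offset // BLOCK_SIZE
    last = (offset + length - 1) // BLOCK_SIZE
    return [
        (block, "cache" if block in cached_blocks else "remote")
        for block in range(first, last + 1)
    ]

# A 3MB read starting at the 1.5MB offset touches blocks 1-4;
# here blocks 1 and 3 are already cached on this node.
plan = plan_read(offset=1_572_864, length=3_145_728, cached_blocks={1, 3})
# plan -> [(1, 'cache'), (2, 'remote'), (3, 'cache'), (4, 'remote')]
```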

This architecture gives RubiX the following advantages:

Works well with columnar file format: RubiX reads and caches only the data that is required.

Engine-independent scheduling logic: The task assignment works on the locality of the data, which is determined by using consistent hashing of the file chunk.

Shared cache across JVMs: Because the metadata of the cache is stored in a single JVM outside of any engine, any big data framework can use this information to get the benefits of the cache.

Performance Analysis

To analyze the performance impact of RubiX, we compared how long a job takes when the data is read from S3 with how long the same job takes when the data is served from local disk through the RubiX cache.

We selected TPC-DS data at scale factor 500 in ORC format as our base dataset, and chose 20 queries with a good mix of interactive and reporting queries to eliminate any kind of bias. The queries can be found here.

For both S3 read and local RubiX cache read, the queries were run three times; the best run is represented here. For the local RubiX cache read, we have pre-populated the cache via earlier runs of the same queries.

The following graph shows the performance improvement provided by RubiX.

As you can see from the above chart, RubiX provides better performance for every query; the improvement ranges from 6 percent to 32 percent. To summarize:

Max improvement: 32% (Query 42)

Min improvement: 6% (Query 59)

Average improvement: 20%

The percentage improvement is calculated as:

(Time without RubiX - Time with RubiX) / Time without RubiX
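As a small sanity check of the formula, with hypothetical timings (not our benchmark numbers):

```python
def improvement(time_without_rubix, time_with_rubix):
    """Percentage improvement per the formula above."""
    return 100.0 * (time_without_rubix - time_with_rubix) / time_without_rubix

# A query that drops from 100s against S3 to 68s against the cache
# improves by 32% (hypothetical timings for illustration).
assert improvement(100.0, 68.0) == 32.0
```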

Because the RubiX cache is optimized for reading the data from local disk, there are greater performance gains for read-intensive jobs. For CPU-intensive jobs, the gains made while reading the data are offset by the time spent on CPU cycles.

Setting Up RubiX

Cluster configuration:

Presto version: 0.184

Number of slave nodes: 2

Node type: r3.2xlarge

Once the cluster is up, log in to the master node as the hadoop user and set up passwordless SSH among the cluster nodes:

ssh -i <key> hadoop@master-public-ip
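One way to set up the passwordless SSH (the worker hostnames below are placeholders for your cluster's nodes):

```shell
# On the master: generate a key pair and push the public key to each worker
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
for host in worker1 worker2; do
  ssh-copy-id -i ~/.ssh/id_rsa.pub hadoop@"$host"
done
```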

Install RubiX Admin on the master node. This is the admin tool for installing RubiX on the cluster and starting the necessary daemons.
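Based on the open-source RubiX admin tooling, the steps look roughly like the following; treat the exact commands and flags as assumptions that may differ across versions:

```shell
# On the master node: install the admin tool, then list worker IPs in ~/.radminrc
pip install rubix_admin
rubix_admin -h   # generates a default ~/.radminrc on first run

# Push RubiX onto every node and start the BookKeeper daemons
rubix_admin installer install --cluster-type presto
rubix_admin daemon start
```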

Running a Simple Example

We will run a very simple Presto query that reads data from a remote location in S3 and does some basic aggregation. The first run of the query warms up the cache by downloading the required data, and subsequent runs use the local cache data to do the necessary aggregation.

We start Hive pointing to its embedded metastore server. Because the RubiX-related JARs are pushed into the Hadoop lib directory during RubiX installation, the Thrift metastore server needs to be restarted so that it becomes aware of the custom rubix scheme. We also need to set the AWS credentials for the rubix scheme via the fs.rubix.awsAccessKeyId and fs.rubix.awsSecretAccessKey properties.
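Putting that together, a session might look like the following; the bucket path and table schema here are illustrative assumptions, not the exact setup we used:

```sql
-- In the Hive CLI, after restarting the Thrift metastore:
SET fs.rubix.awsAccessKeyId=<your-access-key>;
SET fs.rubix.awsSecretAccessKey=<your-secret-key>;

-- Pointing the table at a rubix:// location routes reads through the cache
CREATE EXTERNAL TABLE default.wikistats_orc_rubix (
  language STRING,
  page_title STRING,
  hits BIGINT,
  retrieved_size BIGINT
)
STORED AS ORC
LOCATION 'rubix://<your-bucket>/wikistats/orc/';
```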

SELECT language, page_title, AVG(hits) AS avg_hits
FROM default.wikistats_orc_rubix
WHERE language = 'en'
AND page_title NOT IN ('Main_Page', '404_error/')
AND page_title NOT LIKE '%index%'
AND page_title NOT LIKE '%Search%'
GROUP BY language, page_title
ORDER BY avg_hits DESC
LIMIT 10;

The cache statistics are pushed to an MBean named rubix:name=stats. We can run the following query to see the stats:
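Assuming the Presto jmx catalog is enabled, a query along these lines pulls the per-node stats:

```sql
SELECT * FROM jmx.current."rubix:name=stats";
```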

After the first run (cache warmup) of the Presto job, the numbers look like the following:

| Node  | Cached Reads | Extra Read From Remote | Hit Rate | Miss Rate | NonLocal Data Read | NonLocal Reads | Read From Cache | Read From Remote | Remote Reads |
|-------|--------------|------------------------|----------|-----------|--------------------|----------------|-----------------|------------------|--------------|
| node1 | 17           | 16.59                  | 0.13     | 0.87      | 0.17               | 496            | 6.53            | 500.14           | 109          |
| node2 | 23           | 35.07                  | 0.1      | 0.9       | 0.14               | 135            | 9.5             | 306.47           | 207          |

Conclusion

RubiX was started to eliminate I/O latency from object stores in public clouds for fast ad-hoc data analysis. The architecture of RubiX is a result of running shared storage, auto-scaling Presto, Hive, and Spark clusters that process petabytes of data stored in columnar formats. We have made RubiX available in open source to work with the big data community to solve similar issues across big data engines, Hadoop services, and distributions on-premise or in public clouds. If you are interested in using RubiX or developing on it, please join the community through the links below.