R and Impala: it’s better to KISS than using Java

One of the things I like best about working at CARD.com is that I am not only crunching R code 24/7, but I also have the chance to interact with and improve the related data infrastructure with some interesting technologies.

After joining the company in January, I soon realized that while Impala is a very powerful database for handling data that do not comfortably fit in MySQL, it's still not as fast as one might expect when querying large amounts of data from R. Sometimes I had to wait several minutes for a query to run! So I used this spare time to think about how to improve the workflow.

Interacting with Impala from R is pretty straightforward: just install and load the RImpala package, which uses the JDBC driver to communicate with Impala. It does the job very well for fetching aggregated data from the database, but gets extremely slow when loading more than a thousand or so rows, which is something you cannot resolve by throwing more hardware at the problem.

So when loading larger amounts of data, the related R process runs at 100% CPU usage on one core, while running the very same query from bash via impala-shell returns the results pretty fast. Why not export the data to a CSV file via impala-shell, then?
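A one-off export from the command line could look like the snippet below; the table name is illustrative, while the flags are impala-shell's standard options for delimited output:

```shell
# dump a query result to a CSV file with impala-shell:
#   -B switches to delimited output, -o writes it to a file
impala-shell -B --output_delimiter=',' --print_header \
    -q 'SELECT * FROM books LIMIT 10000' -o /tmp/books.csv
```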

Benchmark

To compare the performance of the two approaches in a reproducible way, I started an Elastic MapReduce cluster on AWS with a single m3.xlarge instance running AMI version 3.8.0 with Impala 1.2.4 and R already pre-installed. Then I downloaded the dbgen utility to generate some data for the benchmarks, as described in the Amazon EMR docs:
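The setup followed the EMR Impala tutorial roughly as below; the jar name and arguments are reproduced from memory, so treat them as assumptions rather than copy-paste material:

```shell
# generate 1 GB of sample data for three tables (books, customers,
# transactions) with the dbgen utility from the EMR Impala tutorial
hadoop jar dbgen-1.0-jar-with-dependencies.jar \
    -p /data/dbgen -b 1 books customers transactions
```

After creating the matching Impala tables, timing a 10K-row fetch via RImpala goes along these lines (an illustrative call sequence; the path passed to rimpala.init is an assumption):

```r
library(RImpala)
## point RImpala at the downloaded Impala JDBC jars and connect
rimpala.init(libpath = '/home/hadoop/impala-jars')
rimpala.connect('localhost', '21050')
system.time(res <- rimpala.query('SELECT * FROM books LIMIT 10000'))
```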

Almost 30 seconds to fetch 10K rows! Things are getting very slow, right? So let's create a minimal working R implementation of the proposed method to export the results from Impala to a CSV file and load them via data.table::fread (chosen for its read performance, and because I use data.table in most of my R scripts anyway):
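A minimal sketch of such a function might look like the following; the function name and the exact impala-shell flags are my assumptions, not the original code from the post:

```r
library(data.table)

## run a query via impala-shell on localhost, dump the results to a
## temporary CSV file and read them back with fread
query_impala_csv <- function(query) {
    csv <- tempfile(fileext = '.csv')
    on.exit(unlink(csv))
    system(paste0(
        "impala-shell -B --output_delimiter=',' --print_header -q \"",
        query, "\" -o ", csv))
    fread(csv)
}
```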

Well, this function is extremely basic and can work only on localhost. For a more general approach with SSH access to remote databases, logging features and a bit of error handling, please see the updated query_impala function referenced at the end of this post.

But this simple function is good enough for some quick benchmarks on how JDBC and the CSV export/import hack perform with different numbers of rows fetched from Impala. Let's run a loop to load 1, 10, 100, 1K, 10K and 100K values from the database via the two methods, each repeated 10 times for later comparison:
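The loop might be sketched as below, assuming a hypothetical query_impala_csv(query) helper that implements the CSV export/import and the microbenchmark package for the timings (the column name is illustrative as well):

```r
library(microbenchmark)
library(RImpala)

## time both methods on 1, 10, ..., 100K values, 10 runs each;
## query_impala_csv() stands for the CSV export helper
timings <- lapply(10^(0:5), function(n) {
    query <- paste('SELECT price FROM books LIMIT', n)
    microbenchmark(
        jdbc = rimpala.query(query),
        csv  = query_impala_csv(query),
        times = 10)
})
```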

Unfortunately, I did not have the patience to run this benchmark on more rows or columns, but the results are already rather impressive in my (personal) opinion. In short, if you are querying more than 100 rows from Impala and you have (SSH) console access to the server, you'd better use the CSV export instead of waiting for the JDBC driver to deliver the data for you.

Quick comparison of the CSV export and the RImpala approach

Please find this quick comparison of the discussed methods for fetching data from Impala to R:

RImpala

Advantages: can connect to a remote database without SSH access; available on CRAN.

Disadvantages: slow when querying many rows; Java dependency on the client side; 20 megabytes of jar files for the driver.

CSV export

Advantages: scales nicely; no Java or jar dependencies, it's damn simple.

Disadvantages: needs SSH access for remote connections; not on CRAN (yet).

Second thoughts

So why is this happening? I was not 100% sure, but suspected it must be something with the jar files or how those are used in R. The query takes the very same amount of time inside Impala, as it does not matter whether you export the data into a CSV file or pass it via the JDBC driver; but parsing and loading the results takes extremely long with the latter.

When I mentioned these interesting results to Szilard at lunch, he suggested giving a try to querying Impala directly with the RJDBC package. It sounded pretty insane to use a different R wrapper around the very same jar files of the RImpala package, but I decided to do this extra test after all to make sure it's a Java and not an R implementation issue, in line with my proposal of keeping things simple (KISS) over using Java.

So I unzipped all the jar files used by the RImpala package above and created a new archive containing the merged content in a file named impala-jdbc-driver.jar. Then I loaded the RJDBC package and initialized a connection:
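The setup was along the lines below; the driver class name and connection string follow the Hive JDBC conventions of that era (21050 being Impala's default JDBC port), so double-check them against your own driver version:

```r
library(RJDBC)

## the merged jar built from the RImpala package's driver files;
## the driver class and the connection URL are assumptions here
drv <- JDBC(driverClass = 'org.apache.hive.jdbc.HiveDriver',
            classPath   = 'impala-jdbc-driver.jar')
con <- dbConnect(drv, 'jdbc:hive2://localhost:21050/;auth=noSasl')
```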

Then we can use the very convenient dbGetQuery method from the DBI package to fetch rows from Impala, with the following impressive results:
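The measurement itself can be as simple as the snippet below, assuming a con connection object opened through RJDBC; the table name is again illustrative:

```r
library(microbenchmark)

## time the DBI interface on a 10K-row fetch, 10 runs
microbenchmark(
    rjdbc = dbGetQuery(con, 'SELECT * FROM books LIMIT 10000'),
    times = 10)
```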

So my hypothesis turned out to be wrong! The JDBC driver performs pretty well, even better than the CSV hack. I was even tempted to revert our production R scripts to use JDBC instead of the below function using temporary files to read data from Impala, but decided to keep the non-Java approach after all, for multiple reasons:

No memory issues when loading large amounts of data. By default, the rJava package starts the JVM with 512 MB of memory, which might not be enough for some queries, so you have to update this default value via e.g. options(java.parameters = '-Xmx1024m') before loading the rJava package.

I prefer using SSH to access the data even if SSL encryption is available with the JDBC driver as well. This might sound silly, but managing users and authentication methods can be a lot easier via traditional Linux users/groups than in Impala, especially with older CDH releases. Not to speak of in-database permissions here, of course.

Although JDBC can be perfect for reading data from Impala, writing to the database might be a nightmare. I am not aware of any bulk import feature in the JDBC driver, and separate INSERT statements are extremely slow. So instead of preparing SQL statements, I prefer creating an intermediary dump file to be imported by Impala on the command line, via a helper R function that does all these steps automatically. I did not prepare any benchmarks on this, but believe me, it's a LOT faster. The same also stands for e.g. Redshift, where loading data from S3 or remote hosts via SSH and using COPY FROM instead of INSERT statements can result in multiple orders of magnitude of speedup. This hack seems to be used by the Spark folks as well.
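The write path can be sketched roughly as follows; this is an illustration rather than the production helper, and the HDFS staging path and the use of LOAD DATA (which requires a matching, CSV-backed target table) are my assumptions:

```r
## dump a data.frame to CSV, push it to HDFS and bulk-load it into an
## existing Impala table instead of issuing row-by-row INSERTs
write_impala <- function(df, table, hdfs_dir = '/tmp') {
    csv <- tempfile(fileext = '.csv')
    on.exit(unlink(csv))
    write.table(df, csv, sep = ',',
                row.names = FALSE, col.names = FALSE)
    hdfs_path <- file.path(hdfs_dir, basename(csv))
    system(paste('hadoop fs -put', csv, hdfs_path))
    system(paste0(
        "impala-shell -q \"LOAD DATA INPATH '", hdfs_path,
        "' INTO TABLE ", table, "\""))
}
```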

Proof-of-concept demo function to use intermediary CSV files to export data from Impala

If you find this useful, please see the below function automating the required steps of using an intermediary file instead of JDBC to load data from Impala:
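A hedged reconstruction of the idea might look like this; the SSH handling and error checking are simplified compared to the referenced query_impala implementation:

```r
library(data.table)

## run a query via impala-shell, either locally or over SSH on a
## remote host, dump the results to CSV and read them with fread
query_impala <- function(query, host = NULL) {
    csv <- tempfile(fileext = '.csv')
    on.exit(unlink(csv), add = TRUE)
    cmd <- paste0(
        "impala-shell -B --output_delimiter=',' --print_header -q \"",
        query, "\" -o ", csv)
    if (is.null(host)) {
        status <- system(cmd)
    } else {
        ## run remotely, then copy the dump back over scp
        status <- system(paste('ssh', host, shQuote(cmd)))
        if (status == 0)
            status <- system(paste0('scp ', host, ':', csv, ' ', csv))
    }
    if (status != 0)
        stop('impala-shell or the file transfer failed')
    fread(csv)
}
```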


9 Comments

Hi, you should find SQLite useful in terms of speed. The problem here is likely to be JDBC, which is very slow. Cloudera should provide better drivers like most of the DB vendors. Maybe they already did, but nobody has wrapped them into a package?

Thanks for your comment, Jan! While I agree that SQLite is a very useful tool, I think it has very different purposes compared to Impala, which is meant to run on a cluster of machines to store large amounts of data, with the ability to serve multiple remote clients. If using a local data storage fitting in the physical memory, I'd go with data.table inside R 🙂 Matt has a very cool video on this at https://www.youtube.com/watch?v=rvT8XThGA8o

Thanks for getting in touch with me. This is a brilliant blog post that you have got here. I guess the flaw in my approach is that I am converting the RJDBC result to matching R datatypes and that takes time. In my use case we didn’t require to retrieve large datasets back to R as we intended the bulk of the operations to be done on the cluster in Impala. I can probably add an option to the rimpala function call which will allow fetching RJDBC result set as such and let the user convert as required to R datatypes. I probably shouldn’t do drastic changes to maintain backwards compatibility.

Austin, thank you very much for your kind and prompt reply, and also for working on RImpala — your hard work is much appreciated. I paste here the related GH ticket so that anyone can easily find and follow the future updates: https://github.com/Mu-Sigma/RImpala/issues/8

Seems cool! I’ve only just managed to connect R to Impala with RImpala, which was already quite a hazard due to some stupid issues I encountered (yep, bit of a loser here…). Now that it is finally connected it seems to work fine, but it’s very slow when I try to load a little more data. (running R on my local computer might make it worse?)

So I’d like to try both your alternatives, but… would you maybe be so kind to share the new .jar file you made out of the other jar files for the RJDBC connection? ( impala-jdbc-driver.jar)
(“So I unzipped all the jar files used by the RImpala package above and created a new archive containing the merged content in a file named to impala-jdbc-driver.jar.”)

(Like I said, bit of a loser, I wouldn’t get much further than unzipping as I have never made a jar file… )