A Ruby Client for Impala

Thanks to Stripe’s Colin Marc (@colinmarc) for the guest post below, and for his work on the world’s first Ruby client for Cloudera Impala!

At Stripe, as at most other companies, it has become increasingly hard to answer the big, interesting questions as our datasets grow. The problem is insidious: the set of potentially interesting questions also grows as you acquire more data. Answering questions like, “Which regions have the most developers per capita?” or “How do different countries compare in how they spend online?” might involve hours of scripting and waiting, and a lot of lost developer time.

Up to now, the answer has often been Apache Hive, which at least made it easy to express many of these queries. Unfortunately, Hive queries are typically very slow. Cloudera Impala provides a similar front-end while being orders of magnitude faster, and we’ve found it immensely useful in many different situations at Stripe. With the near real-time results, the notion of performing programmatic (and not just ad-hoc) queries has now become more attractive.

Programmatic Access with Ruby

We have a pretty hefty set of administrative and analytical tools and dashboards. Because most of Stripe is written in Ruby, and Impala had no Ruby client, we’ve had no way to integrate Impala into those tools, or even to write basic scripts that make use of our Impala cluster.

To address that, I’ve spent some time over the last few weeks developing a Ruby client for Impala. It’s now available as an open-source gem, with documentation on rubydoc. It’s also available on GitHub.

Using the gem

To install the gem, run 'gem install impala'. Here’s what it looks like to run a query (all the examples use the sample data that Cloudera provides with its Impala demo VM):

require 'rubygems'
require 'impala'

Impala.connect('hostname', 21000) do |conn|
  conn.query('SELECT zip, income FROM zipcode_incomes ORDER BY income DESC LIMIT 5')
end

which returns an array of hashes:

[
  {:zip=>"10514", :income=>189570},
  {:zip=>"98243", :income=>188363},
  {:zip=>"10577", :income=>187019},
  {:zip=>"11568", :income=>184298},
  {:zip=>"94028", :income=>181337}
]
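Since each row is a plain Ruby hash keyed by column name, the result can be handled with ordinary Enumerable methods. Here is a minimal sketch, assuming rows shaped like the output above. The SAMPLE_ROWS constant stands in for the return value of conn.query, and the helper names (format_row, richest_zip) are hypothetical, introduced only for illustration:

```ruby
# Stand-in for the array of hashes returned by conn.query above,
# so this sketch runs without an Impala cluster.
SAMPLE_ROWS = [
  {:zip => "10514", :income => 189570},
  {:zip => "98243", :income => 188363},
  {:zip => "10577", :income => 187019},
  {:zip => "11568", :income => 184298},
  {:zip => "94028", :income => 181337}
]

# Hypothetical helper: render one row as a line of a text report.
def format_row(row)
  format('%-6s $%d', row[:zip], row[:income])
end

# Hypothetical helper: the zip code of the row with the highest income.
def richest_zip(rows)
  top = rows.max_by { |row| row[:income] }
  top && top[:zip]
end

puts SAMPLE_ROWS.map { |row| format_row(row) }
puts "Richest zip: #{richest_zip(SAMPLE_ROWS)}"
```

Nothing here depends on the gem itself, which is part of the appeal: once the query returns, you are working with ordinary Ruby data.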

If your result is too large to fit into memory, you can also use cursors:

conn = Impala.connect('hostname', 21000)
cursor = conn.execute('SELECT zip, income FROM zipcode_incomes')

cursor.each do |row|
  # do something with the row here
end
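The point of a cursor is that rows are handled one at a time, so a running aggregate never needs the whole result in memory. Here is a minimal sketch of that pattern, assuming the cursor yields the same symbol-keyed hashes that conn.query returns. The income_stats helper is a hypothetical name, and a plain array stands in for the cursor so the example runs without a cluster:

```ruby
# Hypothetical helper: compute count, mean and maximum income in one pass.
# `rows` can be anything that responds to #each and yields hashes
# with an :income key (assumed here to include the gem's cursor).
def income_stats(rows)
  count = 0
  total = 0
  max = nil

  rows.each do |row|
    income = row[:income]
    count += 1
    total += income
    max = income if max.nil? || income > max
  end

  mean = count.zero? ? nil : total.to_f / count
  {:count => count, :mean => mean, :max => max}
end

# Stand-in for the cursor returned by conn.execute.
FAKE_CURSOR = [
  {:zip => "10514", :income => 189570},
  {:zip => "98243", :income => 188363}
]

stats = income_stats(FAKE_CURSOR)
puts "rows=#{stats[:count]} mean=#{stats[:mean]} max=#{stats[:max]}"
```

Only three numbers are kept between iterations, so memory use stays flat however many rows the query produces.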

The idea is to make Impala just as simple to use as any other SQL-based datastore like MySQL or PostgreSQL.

What’s Next?

I look forward to seeing more tools and frameworks built on top of Impala. By way of example, I whipped up a sample project that provides a web interface for running queries and saving results for later inspection. It took about two hours to build with Sinatra and Bootstrap. The source, along with instructions for getting it running, is on GitHub.

If you build something cool with impala-ruby, or if you have any feedback or questions, I’d love to hear from you!