Impala is a distributed SQL database that runs
on a Hadoop cluster. While Impala enables you to
run highly scalable queries against massive data sets, Impala provides much the
same interface as a conventional SQL database.

If you’re using Impala, it’s likely that your Hadoop cluster also has Hue installed. Hue is a graphical interface for exploring
data and tables in Impala, among other things. It’s a good place to start to
find out what data is available in your cluster and to start developing
queries.

Get the web address and login credentials from your Hadoop cluster
administrator, and open the Impala Query Editor to start exploring tables and
testing queries:

Once you want to start bringing data out of the cluster and into SherlockML to
do some deeper analysis, follow the instructions below to pull the results of
queries into a notebook with Python.

the name of the database that you want to connect to. If this is a new
Impala instance, the only database on it is likely to be named default.

To connect to Impala from Python, we recommend using the Python module impyla. Install it on a SherlockML server
with:

$ pip install impyla # note the 'y'

You can then connect to Impala with:

importimpala.dbapiconnection=impala.dbapi.connect(host='customers.mydomain.com',# host on which the database is runningdatabase='database_name'# name of the database to connect to)cursor=connection.cursor()cursor.execute('SELECT * FROM customers')customers=list(cursor.fetchall())print('We have {} customers'.format(len(customers)))# This is data science!connection.close()

Note

We close the connection to allow the database server to reclaim
resources. This can be critical in a Jupyter notebook, since the
kernel remains alive for a long time.

The Impala connection object can also be used in a with statement to ensure
it gets closed automatically: