How Impala Fits Into the Hadoop Ecosystem

Impala makes use of many familiar components within the Hadoop ecosystem. Impala can interchange data with
other Hadoop components, as both a consumer and a producer, so it can fit in flexible ways into your ETL and
ELT pipelines.

How Impala Works with Hive

A major Impala goal is to make SQL-on-Hadoop operations fast and efficient enough to appeal to new
categories of users and open up Hadoop to new types of use cases. Where practical, it makes use of existing
Apache Hive infrastructure that many Hadoop users already have in place to perform long-running,
batch-oriented SQL queries.

In particular, Impala keeps its table definitions in a traditional MySQL or PostgreSQL database known as
the metastore, the same database where Hive keeps this type of data. Thus, Impala can access tables
defined or loaded by Hive, as long as all columns use Impala-supported data types and the tables use
Impala-supported file formats and compression codecs.

The initial focus on query features and performance means that Impala can read more types of data with the
SELECT statement than it can write with the INSERT statement. To query data using the Avro, RCFile, or
SequenceFile file formats, you load the data using Hive.

The Impala query optimizer can also make use of table statistics and column statistics. Originally, you
gathered this information with the ANALYZE TABLE statement in Hive; in Impala 1.2.2 and higher, use the
Impala COMPUTE STATS statement instead. COMPUTE STATS requires less setup, is more reliable, and does not
require switching back and forth between impala-shell and the Hive shell.
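
As a minimal sketch, gathering statistics from impala-shell might look like the following. The table name
sales_data is hypothetical. SHOW TABLE STATS and SHOW COLUMN STATS are the Impala statements for inspecting
what was gathered.

```sql
-- Hypothetical table name; substitute one of your own tables.
COMPUTE STATS sales_data;

-- Inspect the table-level and column-level statistics that were gathered.
SHOW TABLE STATS sales_data;
SHOW COLUMN STATS sales_data;
```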

Overview of Impala Metadata and the Metastore

As discussed in How Impala Works with Hive, Impala maintains information about table
definitions in a central database known as the metastore. Impala also tracks other metadata for the
low-level characteristics of data files:

The physical locations of blocks within HDFS.

For tables with a large volume of data and/or many partitions, retrieving all the metadata for a table can
be time-consuming, taking minutes in some cases. Thus, each Impala node caches all of this metadata to
reuse for future queries against the same table.

If the table definition or the data in the table is updated, all other Impala daemons in the cluster must
receive the latest metadata, replacing the obsolete cached metadata, before issuing a query against that
table. In Impala 1.2 and higher, the metadata update is automatic, coordinated through the
catalogd daemon, for all DDL and DML statements issued through Impala. See
The Impala Catalog Service for details.

For DDL and DML issued through Hive, or changes made manually to files in HDFS, you still use the
REFRESH statement (when new data files are added to existing tables) or the INVALIDATE METADATA
statement (for entirely new tables, or after dropping a table, performing an HDFS rebalance operation,
or deleting data files). Issuing INVALIDATE METADATA by itself, with no table name, retrieves metadata
for all the tables tracked by the metastore. If you know that only specific tables have been changed
outside of Impala, you can issue REFRESH table_name for each affected table to retrieve only the latest
metadata for those tables.
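
The two cases above can be sketched as follows. The table name web_logs is hypothetical, and which
statement applies depends on what changed outside of Impala.

```sql
-- New data files were added (through Hive or directly in HDFS)
-- to an existing table. web_logs is a hypothetical table name.
REFRESH web_logs;

-- Tables were created or dropped through Hive, data files were deleted,
-- or an HDFS rebalance was performed. With no table name, this applies
-- to every table tracked by the metastore.
INVALIDATE METADATA;
```

Because INVALIDATE METADATA with no table name covers every table, REFRESH on the specific affected
tables is the narrower choice when it applies.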

How Impala Uses HDFS

Impala uses the distributed filesystem HDFS as its primary data storage medium. Impala relies on the
redundancy provided by HDFS to guard against hardware or network outages on individual nodes. Impala table
data is physically represented as data files in HDFS, using familiar HDFS file formats and compression
codecs. When data files are present in the directory for a new table, Impala reads them all, regardless of
file name. New data is added in files with names controlled by Impala.
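
As an illustration, the sketch below points a new table at an HDFS directory that already contains data
files, queries it, and then adds a row through Impala. The path, table name, column layout, and
comma-delimited text format are all assumptions made for this example.

```sql
-- Hypothetical schema and path: comma-delimited text files are assumed
-- to already exist in this HDFS directory.
CREATE EXTERNAL TABLE page_views (
  view_time TIMESTAMP,
  user_id   BIGINT,
  url       STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/user/etl/page_views';

-- The existing files in the directory are read as the table's data.
SELECT COUNT(*) FROM page_views;

-- A row added through Impala is written to a new data file
-- whose name Impala chooses.
INSERT INTO page_views VALUES (now(), 42, '/index.html');
```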

How Impala Uses HBase

HBase is an alternative to HDFS as a storage medium for Impala data. It is a database storage system built
on top of HDFS, without built-in SQL support. Many Hadoop users already have it configured and store large
(often sparse) data sets in it. By defining tables in Impala and mapping them to equivalent tables in
HBase, you can query the contents of the HBase tables through Impala, and even perform join queries
including both Impala and HBase tables. See Using Impala to Query HBase Tables for details.
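
Once such a mapping exists, a join across the two storage layers is written like any other join. In this
sketch, hbase_users is a hypothetical table mapped to an HBase table and orders is a hypothetical
HDFS-backed Impala table; the column names are also assumptions.

```sql
-- hbase_users: hypothetical table mapped to an HBase table.
-- orders: hypothetical HDFS-backed Impala table.
SELECT u.user_name, COUNT(*) AS order_count
FROM orders o
JOIN hbase_users u ON o.user_id = u.user_id
GROUP BY u.user_name;
```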